cs.RO / 1 / 2602.23404
Cybersecurity of Teleoperated Quadruped Robots: A Systematic Survey of Vulnerabilities, Threats, and Open Defense Gaps
Abstract
Teleoperated quadruped robots are increasingly deployed in safety-critical missions -- industrial inspection, military reconnaissance, and emergency response -- yet the security of their communication and control infrastructure remains insufficiently characterized. Quadrupeds present distinct security challenges arising from dynamic stability constraints, gait-dependent vulnerability windows, substantial kinetic energy, and elevated operator cognitive load. This survey synthesizes peer-reviewed literature and vulnerability disclosures (2019--2025) to provide a comprehensive analysis of cybersecurity threats, consequences, and countermeasures for teleoperated quadruped systems. We contribute: (i) a six-layer attack taxonomy spanning perception manipulation, VR/AR operator targeting, communication disruption, control signal attacks, localization spoofing, and network intrusion; (ii) systematic attack-to-consequence mapping with timing characterization; (iii) Technology Readiness Level classification exposing critical maturity gaps between field-deployed communication protections (TRL 7--9) and experimental perception/operator-layer defenses (TRL 3--5); (iv) comparative security analysis of six commercial platforms; (v) pragmatic deployment guidance stratified by implementation timeline; and (vi) eight prioritized research gaps with implementation roadmaps. Limitations: Platform assessments rely on publicly available information. Attack success rates derive from cited studies under controlled conditions and require domain-specific validation.
cs.RO / 2 / 2602.23408
Demystifying Action Space Design for Robotic Manipulation Policies
Abstract
The specification of the action space plays a pivotal role in imitation-based robotic manipulation policy learning, fundamentally shaping the optimization landscape of policy learning. While recent advances have focused heavily on scaling training data and model capacity, the choice of action space remains guided by ad-hoc heuristics or legacy designs, leading to an ambiguous understanding of robotic policy design philosophies. To address this ambiguity, we conducted a large-scale and systematic empirical study, confirming that the action space does have significant and complex impacts on robotic policy learning. We dissect the action design space along temporal and spatial axes, facilitating a structured analysis of how these choices govern both policy learnability and control stability. Based on 13,000+ real-world rollouts on a bimanual robot and evaluation on 500+ trained models over four scenarios, we examine the trade-offs between absolute vs. delta representations, and joint-space vs. task-space parameterizations. Our large-scale results suggest that properly designing the policy to predict delta actions consistently improves performance, while joint-space and task-space representations offer complementary strengths, favoring control stability and generalization, respectively.
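The absolute-vs-delta trade-off studied above can be made concrete with a small sketch. The helper names below are hypothetical, not from the paper; the point is that a delta parameterization predicts steps between successive targets rather than the targets themselves, which changes how prediction errors propagate:

```python
def to_delta_actions(abs_actions, current_state):
    """Convert an absolute action chunk to delta actions: each delta is the
    step from the previous target (the current state for the first step)."""
    deltas = []
    prev = list(current_state)
    for target in abs_actions:
        deltas.append([t - p for t, p in zip(target, prev)])
        prev = target
    return deltas


def apply_delta_actions(deltas, current_state):
    """Replay delta actions open loop; the inverse of to_delta_actions."""
    state = list(current_state)
    trajectory = []
    for d in deltas:
        state = [s + di for s, di in zip(state, d)]
        trajectory.append(state)
    return trajectory
```

The same conversion applies whether the underlying coordinates are joint angles or task-space poses; only the meaning of each vector component changes.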
cs.RO / 3 / 2602.23457
Printed helicoids with embedded air channels make sensorized segments for soft continuum robots
Abstract
Soft robots enable safe, adaptive interaction with complex environments but remain difficult to sense and control due to their highly deformable structures. Architected soft materials such as helicoid lattices offer tunable stiffness and strength but are challenging to instrument because of their sparse geometry. We introduce a fabrication method for embedding air channels into helicoid-based soft continuum robots. Multi-material segments fabricated via vision-controlled jetting in a single print interface with PCBs housing miniature pressure sensors and IMUs for distributed deformation sensing. We characterize the mechanical properties of four helicoid designs and validate the sensor response to fundamental deformation modes. To demonstrate the platform's scalability, we construct and mechanically evaluate a meter-scale, 14-DoF cable-driven soft arm capable of open-loop trajectory tracking and object grasping, with tactile-based stiffness detection demonstrated using the gripper sensors. This approach establishes a scalable fabrication strategy for sensorized architected materials in large-scale soft robotic systems.
cs.RO / 4 / 2602.23478
Refining Almost-Safe Value Functions on the Fly
Abstract
Control Barrier Functions (CBFs) are a powerful tool for ensuring robotic safety, but designing or learning valid CBFs for complex systems is a significant challenge. While Hamilton-Jacobi Reachability provides a formal method for synthesizing safe value functions, it scales poorly and is typically performed offline, limiting its applicability in dynamic environments. This paper bridges the gap between offline synthesis and online adaptation. We introduce refineCBF for refining an approximate CBF -- whether analytically derived, learned, or even unsafe -- via warm-started HJ reachability. We then present its computationally efficient successor, HJ-Patch, which accelerates this process through localized updates. Both methods guarantee the recovery of a safe value function and can ensure monotonic safety improvements during adaptation. Our experiments validate our framework's primary contribution: in-the-loop, real-time adaptation, in simulation (with detailed value function analysis) and on physical hardware. Our experiments on ground vehicles and quadcopters show that our framework can successfully adapt to sudden environmental changes, such as new obstacles and unmodeled wind disturbances, providing a practical path toward deploying formally guaranteed safety in real-world settings.
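For a single control input, the CBF safety-filter quadratic program reduces to a closed-form clip, which a short sketch can illustrate. This is a toy illustration of the general CBF filtering idea, not the paper's refineCBF or HJ-Patch implementation:

```python
def cbf_safety_filter(u_nom, h, Lf_h, Lg_h, alpha=1.0):
    """Minimally modify a nominal control so the CBF condition
        Lf_h + Lg_h * u >= -alpha * h
    holds. For a scalar input, the QP argmin |u - u_nom|^2 subject to this
    half-line constraint is solved by clipping u_nom at the boundary."""
    if Lg_h == 0.0:
        return u_nom  # constraint does not depend on u
    bound = (-alpha * h - Lf_h) / Lg_h
    if Lg_h > 0.0:
        return max(u_nom, bound)  # constraint is u >= bound
    return min(u_nom, bound)      # constraint is u <= bound
```

For example, with dynamics x' = u and h(x) = x (safe set x >= 0), at x = 0.5 a nominal command of -2 is clipped to -0.5, slowing the approach to the boundary.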
cs.RO / 5 / 2602.23499
TaCarla: A comprehensive benchmarking dataset for end-to-end autonomous driving
Abstract
Collecting a high-quality dataset is a critical task that demands meticulous attention to detail, as overlooking certain aspects can render the entire dataset unusable. Autonomous driving challenges remain a prominent area of research, requiring further exploration to enhance the perception and planning performance of vehicles. However, existing datasets are often incomplete. For instance, datasets that include perception information generally lack planning data, while planning datasets typically consist of extensive driving sequences in which the ego vehicle predominantly drives forward, offering limited behavioral diversity. In addition, many real-world datasets make model evaluation difficult, especially for planning tasks, since they lack a proper closed-loop evaluation setup. The CARLA Leaderboard 2.0 challenge, which provides a diverse set of scenarios to address the long-tail problem in autonomous driving, has emerged as a valuable alternative platform for developing perception and planning models in both open-loop and closed-loop evaluation setups. Nevertheless, existing datasets collected on this platform have limitations: some are tailored to particular, limited sensor configurations. To support end-to-end autonomous driving research, we have collected a new dataset comprising over 2.85 million frames using the CARLA simulation environment for the diverse Leaderboard 2.0 challenge scenarios. Our dataset is designed not only for planning tasks but also supports dynamic object detection, lane divider detection, centerline detection, traffic light recognition, prediction tasks, and vision-language-action models. Furthermore, we demonstrate its versatility by training various models using our dataset. We also provide numerical rarity scores to quantify how rarely the current state occurs in the dataset.
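One simple way to realize a numerical rarity score of the kind mentioned above is negative log empirical frequency over discretized states. The paper does not specify its scoring method here, so this sketch is an illustrative assumption:

```python
import math
from collections import Counter

def rarity_scores(states, bin_size=1.0):
    """Assign each state a rarity score: the negative log of the empirical
    frequency of its discretized bin. Rare states score higher."""
    bins = [tuple(math.floor(v / bin_size) for v in s) for s in states]
    counts = Counter(bins)
    n = len(states)
    return [-math.log(counts[b] / n) for b in bins]
```

With such a score, a training curriculum or evaluation split can up-weight states from sparsely populated regions of the dataset.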
cs.RO / 6 / 2602.23524
V-MORALS: Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space
Abstract
Reachability analysis has become increasingly important in robotics to distinguish safe from unsafe states. Unfortunately, existing reachability and safety analysis methods often fall short, as they typically require known system dynamics or large datasets to estimate accurate system models, are computationally expensive, and assume full state information. A recent method, called MORALS, aims to address these shortcomings by using topological tools to estimate Regions of Attraction (ROA) in a low-dimensional latent space. However, MORALS still relies on full state knowledge and has not been studied when only sensor measurements are available. This paper presents Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space (V-MORALS). V-MORALS takes in a dataset of image-based trajectories of a system under a given controller, and learns a latent space for reachability analysis. Using this learned latent space, our method is able to generate well-defined Morse Graphs, from which we can compute ROAs for various systems and controllers. V-MORALS provides capabilities similar to the original MORALS architecture without relying on state knowledge, using only high-level sensor data. Our project website is at: https://v-morals.onrender.com.
cs.RO / 7 / 2602.23576
Tilt-X: Enabling Compliant Aerial Manipulation through a Tiltable-Extensible Continuum Manipulator
Abstract
Aerial manipulators extend the reach and manipulation capabilities of uncrewed multirotor aerial vehicles for inspection, agriculture, sampling, and delivery. Continuum arm aerial manipulation systems offer lightweight, dexterous, and compliant interaction opportunities. Existing designs allow manipulation only below the UAV, which restricts their deployability in multiple directions and through clutter, and they are also sensitive to propeller downwash. Addressing these limitations, we present Tilt-X, a continuum arm aerial manipulator that integrates a tilting mechanism, a telescopic stage, and a cable-driven continuum section. We present its design and kinematic model and validate it through flight demonstrations. Tilt-X enables a volumetric workspace with up to 75 mm extension and planar orientations between 0$^\circ$ and 90$^\circ$. Experiments comparing end effector pose with and without downwash quantitatively measure its accuracy, providing critical evidence to guide the design and control of reliable aerial manipulators. Results show stabilisation of end effector pose as the manipulator extends out of the propeller influence zone.
cs.RO / 8 / 2602.23583
VCA: Vision-Click-Action Framework for Precise Manipulation of Segmented Objects in Target Ambiguous Environments
Abstract
The reliance on language in Vision-Language-Action (VLA) models introduces ambiguity, cognitive overhead, and difficulties in precise object identification and sequential task execution, particularly in environments with multiple visually similar objects. To address these limitations, we propose Vision-Click-Action (VCA), a framework that replaces verbose textual commands with direct, click-based visual interaction using pretrained segmentation models. By allowing operators to specify target objects clearly through visual selection in the robot's 2D camera view, VCA reduces interpretation errors, lowers cognitive load, and provides a practical and scalable alternative to language-driven interfaces for real-world robotic manipulation. Experimental results validate that the proposed VCA framework achieves effective instance-level manipulation of specified target objects. Experiment videos are available at https://robrosinc.github.io/vca/.
cs.RO / 9 / 2602.23592
KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning
Abstract
Memory-augmented Large Language Models (LLMs) have demonstrated remarkable capability for complex and long-horizon embodied planning. By keeping track of past experiences and environmental states, memory enables LLMs to maintain a global view, thereby avoiding repetitive exploration. However, existing approaches often store the memory as raw text, leading to excessively long prompts and high prefill latency. While it is possible to store and reuse the KV caches, the efficiency benefits are greatly undermined by frequent KV cache updates. In this paper, we propose KEEP, a KV-cache-centric memory management system for efficient embodied planning. KEEP features 3 key innovations: (1) a Static-Dynamic Memory Construction algorithm that reduces KV cache recomputation via mixed-granularity memory groups; (2) a Multi-hop Memory Re-computation algorithm that dynamically identifies important cross-attention among different memory groups and reconstructs memory interactions iteratively; (3) a Layer-balanced Memory Loading scheme that eliminates unbalanced KV cache loading and cross-attention computation across different layers. Extensive experiments demonstrate that KEEP achieves a 2.68x speedup with negligible accuracy loss compared with text-based memory methods on the ALFRED dataset. Compared with the KV re-computation method CacheBlend (EuroSys'25), KEEP shows a 4.13% success rate improvement and a 1.90x time-to-first-token (TTFT) reduction. Our code is available at https://github.com/PKU-SEC-Lab/KEEP_Embodied_Memory.
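The static/dynamic cache-reuse idea can be sketched with a toy cache keyed by memory-group content. The class and its API below are hypothetical stand-ins, not the KEEP implementation; the encode function stands in for an expensive prefill pass:

```python
class KVMemoryCache:
    """Toy KV-cache reuse across memory groups: a group's "KV" is encoded
    once and reused until its text changes, so static groups (e.g. a map)
    never trigger recomputation while dynamic groups (e.g. agent state) do."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn   # stands in for the prefill pass
        self.cache = {}              # group text -> cached "KV" object
        self.recomputations = 0

    def get_kv(self, group_text):
        if group_text not in self.cache:
            self.cache[group_text] = self.encode_fn(group_text)
            self.recomputations += 1
        return self.cache[group_text]

    def build_prompt_kv(self, groups):
        """Concatenate cached KV segments for an ordered list of groups."""
        return [self.get_kv(g) for g in groups]
```

Across two planning steps that share a static "map" group but differ in their state group, only the changed group is re-encoded.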
cs.RO / 10 / 2602.23607
MicroPush: A Simulator and Benchmark for Contact-Rich Cell Pushing and Assembly with a Magnetic Rolling Microrobot
Abstract
Magnetic rolling microrobots enable gentle manipulation in confined microfluidic environments, yet autonomy for contact-rich behaviors such as cell pushing and multi-target assembly remains difficult to develop and evaluate reproducibly. We present MicroPush, an open-source simulator and benchmark suite for magnetic rolling microrobots in cluttered 2D scenes. MicroPush combines an overdamped interaction model with contact-aware stick--slip effects, lightweight near-field damping, optional Poiseuille background flow, and a calibrated mapping from actuation frequency to free-space rolling speed. On top of the simulator core, we provide a modular planning--control stack with a two-phase strategy for contact establishment and goal-directed pushing, together with a deterministic benchmark protocol with fixed tasks, staged execution, and unified CSV logging for single-object transport and hexagonal assembly. We report success, time, and tracking metrics, and an actuation-variation measure $E_{\Delta\omega}$. Results show that controller stability dominates performance under flow disturbances, while planner choice can influence command smoothness over long-horizon sequences via waypoint progression. MicroPush enables reproducible comparison and ablation of planning, control, and learning methods for microscale contact-rich micromanipulation.
cs.RO / 11 / 2602.23648
FAVLA: A Force-Adaptive Fast-Slow VLA model for Contact-Rich Robotic Manipulation
Abstract
Force/torque feedback can substantially improve Vision-Language-Action (VLA) models on contact-rich manipulation, but most existing approaches fuse all modalities at a single operating frequency. This design ignores the mismatched sampling rates of real robot sensors, forcing downsampling of the high-frequency contact cues needed for reactive correction. Combined with common VLM-action-expert (AE) pipelines that execute action chunks largely open loop between expensive VLM updates, unified-frequency fusion often yields delayed responses to impacts, stick-slip, and force spikes. We propose FAVLA, a force-adaptive fast-slow VLA that decouples slow perception planning from fast contact-aware control. FAVLA runs a slow VLM at a fixed low frequency to encode modalities into latent representations and to predict near-future force variation. A fast AE then executes at a variable high frequency, conditioning on the latest force sequence data to generate reactive actions. We further introduce a force adapter that injects high-frequency force features into multiple AE layers, and adaptively schedules the AE's execution frequency based on the VLM's predicted force variation. Extensive experiments on contact-rich tasks demonstrate that FAVLA significantly outperforms baselines, achieving superior reactivity and success rates while exerting smaller contact forces during manipulation.
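The adaptive execution-frequency scheduling can be illustrated with a minimal sketch. The linear mapping and all constants below are illustrative assumptions, not the paper's schedule; the idea is only that anticipated force activity raises the fast controller's rate:

```python
def schedule_ae_rate(pred_force_variation, base_hz=30.0, max_hz=120.0, gain=200.0):
    """Map the slow model's predicted near-future force variation to a fast
    action-expert execution rate: quiet contact runs at the base rate, while
    anticipated force spikes raise the control rate up to a hard ceiling."""
    rate = base_hz + gain * abs(pred_force_variation)
    return min(max(rate, base_hz), max_hz)
```

Clamping to a ceiling keeps the fast loop within the robot's actuation and sensing bandwidth even when the predicted variation is large.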
cs.RO / 12 / 2602.23654
SpikingTac: A Miniaturized Neuromorphic Visuotactile Sensor for High-Precision Dynamic Tactile Imprint Tracking
Abstract
High-speed event-driven tactile sensors are essential for achieving human-like dynamic manipulation, yet their integration is often limited by the bulkiness of standard event cameras. This paper presents SpikingTac, a miniaturized, highly integrated neuromorphic tactile sensor featuring a custom standalone event camera module, achieved with a total material cost of less than \$150. We construct a global dynamic state map coupled with an unsupervised denoising network to enable precise tracking at a 1000~Hz perception rate and 350~Hz tracking frequency. Addressing the viscoelastic hysteresis of silicone elastomers, we propose a hysteresis-aware incremental update law with a spatial gain damping mechanism. Experimental results demonstrate exceptional zero-point stability, achieving a 100\% return-to-origin success rate with a minimal mean bias of 0.8039 pixels, even under extreme torsional deformations. In dynamic tasks, SpikingTac limits the obstacle-avoidance overshoot to 6.2~mm, representing a 5-fold performance improvement over conventional frame-based sensors. Furthermore, the sensor achieves sub-millimeter geometric accuracy, with Root Mean Square Error (RMSE) of 0.0952~mm in localization and 0.0452~mm in radius measurement.
cs.RO / 13 / 2602.23670
Physics-Embedded Neural ODEs for Learning Antagonistic Pneumatic Artificial Muscle Dynamics
Abstract
Pneumatic artificial muscles (PAMs) enable compliant actuation for soft wearable, assistive, and interactive robots. When arranged antagonistically, PAMs can provide variable impedance through co-contraction but exhibit coupled, nonlinear, and hysteretic dynamics that challenge modeling and control. This paper presents a hybrid neural ordinary differential equation (Neural ODE) framework that embeds physical structure into a learned model of antagonistic PAM dynamics. The formulation combines parametric joint mechanics and pneumatic state dynamics with a neural network force component that captures antagonistic coupling and rate-dependent hysteresis. The forward model predicts joint motion and chamber pressures with a mean R$^2$ of 0.88 across 225 co-contraction conditions. An inverse formulation, derived from the learned dynamics, computes pressure commands offline for desired motion and stiffness profiles, tracked in closed loop during execution. Experimental validation demonstrates reliable stiffness control across 126-176 N/mm and consistent impedance behavior across operating velocities, in contrast to a static model, which shows degraded stiffness consistency at higher velocities.
cs.RO / 14 / 2602.23694
Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation via Log-Likelihood Ratio Fusion
Abstract
Human operators are still frequently exposed to hazardous environments such as disaster zones and industrial facilities, where intuitive and reliable teleoperation of mobile robots and Unmanned Aerial Vehicles (UAVs) is essential. In this context, hands-free teleoperation enhances operator mobility and situational awareness, thereby improving safety in hazardous environments. While vision-based gesture recognition has been explored as one method for hands-free teleoperation, its performance often deteriorates under occlusions, lighting variations, and cluttered backgrounds, limiting its applicability in real-world operations. To overcome these limitations, we propose a multimodal gesture recognition framework that integrates inertial data (accelerometer, gyroscope, and orientation) from Apple Watches on both wrists with capacitive sensing signals from custom gloves. We design a late fusion strategy based on the log-likelihood ratio (LLR), which not only enhances recognition performance but also provides interpretability by quantifying modality-specific contributions. To support this research, we introduce a new dataset of 20 distinct gestures inspired by aircraft marshalling signals, comprising synchronized RGB video, IMU, and capacitive sensor data. Experimental results demonstrate that our framework achieves performance comparable to a state-of-the-art vision-based baseline while significantly reducing computational cost, model size, and training time, making it well suited for real-time robot control. We therefore underscore the potential of sensor-based multimodal fusion as a robust and interpretable solution for gesture-driven mobile robot and drone teleoperation.
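Late fusion by log-likelihood ratio is straightforward to sketch for a two-class case with Gaussian per-modality models. This is an illustrative simplification, not the paper's models or gesture classes; returning each modality's LLR term makes the modality-specific contributions inspectable, which is the interpretability argument above:

```python
import math

def gaussian_loglik(x, mean, std):
    """Log-likelihood of x under a 1D Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def llr_fusion(observations, models):
    """Late fusion by summed log-likelihood ratios for a two-class problem.
    models[m] holds per-modality (mean, std) pairs for class 1 and class 0.
    Returns the fused decision, the total LLR, and per-modality LLR terms."""
    contributions = {}
    for m, x in observations.items():
        (m1, s1), (m0, s0) = models[m]
        contributions[m] = gaussian_loglik(x, m1, s1) - gaussian_loglik(x, m0, s0)
    total = sum(contributions.values())
    return (1 if total > 0 else 0), total, contributions
```

Because the modality LLRs simply add, a noisy or occluded modality contributes a near-zero term rather than vetoing the decision.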
cs.RO / 15 / 2602.23706
A Reliable Indoor Navigation System for Humans Using AR-based Technique
Abstract
Reliable navigation systems are often unavailable indoors, such as on campuses and in other small sites, so users must depend on confusing, time-consuming static signage or floor maps. In this paper, an AR-based technique is applied to campus and small-site navigation, with Vuforia Area Target used for environment modeling. The NavMesh component of the AI navigation system is used for navigation, and the A* algorithm is used within this component for shortest-path calculation. Compared to Dijkstra's algorithm, A* can reach a solution about two to three times faster for smaller search spaces. Dijkstra's algorithm often struggles in high-complexity environments, where its memory usage grows and processing times increase. Compared to older approaches such as GPS, real-time processing and AR overlays can be combined to provide intuitive directions for users while dynamically updating the path in response to environmental changes. Experimental results indicate significantly improved navigation accuracy, better user experience, and greater efficiency compared to traditional methods. These results show that AR technology integrated with existing pathfinding algorithms is feasible and scalable, making it a user-friendly solution for indoor navigation. Although highly effective in limited and well-defined indoor spaces, further optimization of NavMesh is required for large or highly dynamic environments.
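The A* pathfinding step can be illustrated on a simple occupancy grid, a stand-in for the NavMesh polygon graph; the 4-connectivity, unit costs, and Manhattan heuristic here are illustrative choices:

```python
import heapq

def astar(grid, start, goal):
    """A* on a 4-connected occupancy grid (1 = blocked) with a Manhattan
    heuristic, which is admissible for unit-cost 4-connected moves."""
    def h(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    open_set = [(h(start), 0, start, [start])]  # (f, g, node, path)
    best_g = {start: 0}
    while open_set:
        f, g, node, path = heapq.heappop(open_set)
        if node == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = node[0] + dr, node[1] + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    heapq.heappush(
                        open_set, (ng + h((nr, nc)), ng, (nr, nc), path + [(nr, nc)])
                    )
    return None  # goal unreachable
```

The heuristic is what gives A* its speed advantage over Dijkstra's algorithm on small search spaces: it steers expansion toward the goal instead of growing a uniform frontier.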
cs.RO / 16 / 2602.23719
SAGE-LLM: Towards Safe and Generalizable LLM Controller with Fuzzy-CBF Verification and Graph-Structured Knowledge Retrieval for UAV Decision
Abstract
In UAV dynamic decision-making, complex and variable hazardous factors pose severe challenges to the generalization capability of algorithms. While Large Language Models (LLMs) offer semantic understanding and scene generalization, they lack domain-specific UAV control knowledge and formal safety assurances, restricting their direct applicability. To bridge this gap, this paper proposes a training-free two-layer decision architecture based on LLMs, integrating high-level safety planning with low-level precise control. The framework introduces three key contributions: 1) A fuzzy Control Barrier Function verification mechanism for semantically augmented actions, providing provable safety certification for LLM outputs. 2) A star-hierarchical graph-based retrieval-augmented generation system, enabling efficient, elastic, and interpretable scene adaptation. 3) Systematic experimental validation in pursuit-evasion scenarios with unknown obstacles and emergent threats, demonstrating that our SAGE-LLM maintains performance while significantly enhancing safety and generalization without online training. The proposed framework demonstrates strong extensibility, suggesting its potential for generalization to broader embodied intelligence systems and safety-critical control domains.
cs.RO / 17 / 2602.23721
StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation
Abstract
Vision-language-action (VLA) models integrate visual observations and language instructions to predict robot actions, demonstrating promising generalization in manipulation tasks. However, most existing approaches primarily rely on direct mappings from 2D visual inputs to action sequences, without explicitly modeling the underlying 3D spatial structure or temporal world dynamics. Such representations may limit spatial reasoning and long-horizon decision-making in dynamic environments. To address this limitation, we propose StemVLA, a novel framework that explicitly incorporates both future-oriented 3D spatial knowledge and historical 4D spatiotemporal representations into action prediction. First, instead of relying solely on observed images, StemVLA forecasts structured 3D future spatial-geometric world knowledge, enabling the model to anticipate upcoming scene geometry and object configurations. Second, to capture temporal consistency and motion dynamics, we feed historical image frames into a pretrained video-geometry transformer backbone to extract implicit 3D world representations, and further aggregate them across time using a temporal attention module, termed VideoFormer [20], forming a unified 4D historical spatiotemporal representation. By jointly modeling 2D observations, predicted 3D future structure, and aggregated 4D temporal dynamics, StemVLA enables more comprehensive world understanding for robot manipulation. Extensive experiments in simulation demonstrate that StemVLA significantly improves long-horizon task success and achieves state-of-the-art performance on the CALVIN ABC-D benchmark [46], achieving an average sequence length of XXX.
Chinese Translation
视觉-语言-动作(VLA)模型将视觉观察与语言指令结合起来,以预测机器人动作,在操作任务中展现出良好的泛化能力。然而,大多数现有方法主要依赖于将二维视觉输入直接映射到动作序列,而没有明确建模潜在的三维空间结构或时间动态。这种表示可能限制了在动态环境中的空间推理和长时间决策能力。为了解决这一局限性,我们提出了StemVLA,一个新颖的框架,明确结合了未来导向的三维空间知识和历史四维时空表示以进行动作预测。首先,StemVLA不仅依赖于观察到的图像,而是预测结构化的三维未来空间几何世界知识,使模型能够预见即将到来的场景几何和物体配置。其次,为了捕捉时间一致性和运动动态,我们将历史图像帧输入到一个预训练的视频几何变换器骨干网络中,以提取隐含的三维世界表示,并使用一个称为VideoFormer [20] 的时间注意模块在时间上进一步聚合这些表示,形成统一的四维历史时空表示。通过联合建模二维观察、预测的三维未来结构和聚合的四维时间动态,StemVLA使机器人操作的世界理解更加全面。在模拟中的大量实验表明,StemVLA显著提高了长时间任务的成功率,并在CALVIN ABC-D基准测试 [46] 上达到了最先进的性能,平均序列长度为XXX。
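The temporal aggregation step in StemVLA's 4D history branch amounts to attention pooling over per-frame features. Below is a minimal, self-contained sketch of that idea using plain softmax attention with a fixed query vector; the actual VideoFormer module [20] is a learned, higher-dimensional component, so everything here (dimensions, query, values) is illustrative only.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention_pool(frames, query):
    """Fuse per-frame feature vectors into one history representation.

    frames: list of T feature vectors (oldest first); query: a vector of
    the same dimension standing in for a learned attention query. Scores
    are scaled dot products, as in standard attention.
    """
    d = len(query)
    scores = [sum(q * f for q, f in zip(query, frame)) / math.sqrt(d)
              for frame in frames]
    weights = softmax(scores)
    # Convex combination over time -> a single fused vector.
    return [sum(w * frame[i] for w, frame in zip(weights, frames))
            for i in range(d)]

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = temporal_attention_pool(frames, query=[1.0, 0.0])
```

Because the weights form a convex combination, the fused vector always lies inside the hull of the frame features, which keeps the aggregated history representation in the same feature space as the inputs.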
cs.RO / 18 / 2602.23821
Acceleration-Based Control of Fixed-Wing UAVs for Guidance Applications
基于加速度的固定翼无人机控制用于引导应用
Abstract
Acceleration-commanded guidance laws (e.g., proportional navigation) are attractive for high-level decision making, but their direct deployment on fixed-wing UAVs is challenging because accelerations are not directly actuated and must be realized through attitude and thrust under flight-envelope constraints. This paper presents an acceleration-level outer-loop control framework that converts commanded tangential and normal accelerations into executable body-rate and normalized thrust commands compatible with mainstream autopilots (e.g., PX4/APM). For the normal channel, we derive an engineering mapping from the desired normal acceleration to roll- and pitch-rate commands that regulate the direction and magnitude of the lift vector under small-angle assumptions. For the tangential channel, we introduce an energy-based formulation inspired by total energy control and identify an empirical thrust-energy acceleration relationship directly from flight data, avoiding explicit propulsion modeling or thrust bench calibration. We further discuss priority handling between normal and tangential accelerations under saturation and non-level maneuvers. Extensive real-flight experiments on a VTOL fixed-wing platform demonstrate accurate acceleration tracking and enable practical implementation of proportional navigation using only body-rate and normalized thrust interfaces.
Chinese Translation
加速度指令引导法(例如,比例导航)在高层决策中具有吸引力,但其在固定翼无人机上的直接应用面临挑战,因为加速度并不是直接驱动的,而必须通过姿态和推力在飞行包络约束下实现。本文提出了一种加速度级外环控制框架,将指令的切向和法向加速度转换为与主流自动驾驶仪(如 PX4/APM)兼容的可执行机体速率和归一化推力指令。对于法向通道,我们推导出从期望法向加速度到滚转率和俯仰率指令的工程映射,以在小角度假设下调节升力矢量的方向和大小。对于切向通道,我们引入了一种基于能量的公式,灵感来源于总能量控制,并直接从飞行数据中识别出经验推力-能量加速度关系,避免了显式的推进建模或推力台标定。我们进一步讨论了在饱和和非水平机动下法向和切向加速度之间的优先级处理。在一款垂直起降固定翼平台上进行的大量实飞实验展示了准确的加速度跟踪,并使得仅使用机体速率和归一化推力接口实现比例导航的实际应用成为可能。
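The normal-channel idea can be illustrated with a toy mapping: bank the lift vector to realize the lateral acceleration component (the coordinated-turn relation), then trim the remaining magnitude error with pitch rate. The gains, the proportional rate laws, and this exact decomposition are assumptions of the sketch, not the paper's derived engineering mapping.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def normal_accel_to_rate_cmds(a_lat_des, a_vert_des, phi, a_n_meas,
                              airspeed, k_phi=2.0, k_q=0.5):
    """Toy mapping from desired normal acceleration to rate commands.

    Lateral component: bank the lift vector via the coordinated-turn
    relation phi_des = atan(a_lat / g); a proportional roll-rate command
    drives the bank angle there. The magnitude error of the normal
    acceleration is trimmed with a proportional pitch-rate command.
    k_phi and k_q are made-up gains for illustration.
    """
    phi_des = math.atan2(a_lat_des, G)               # bank to tilt lift
    p_cmd = k_phi * (phi_des - phi)                  # roll-rate command
    a_n_des = math.hypot(a_lat_des, G + a_vert_des)  # required magnitude
    q_cmd = k_q * (a_n_des - a_n_meas) / airspeed    # pitch-rate command
    return p_cmd, q_cmd

# Level flight, commanding ~0.5 g of lateral acceleration at 20 m/s.
p_cmd, q_cmd = normal_accel_to_rate_cmds(a_lat_des=4.9, a_vert_des=0.0,
                                         phi=0.0, a_n_meas=G, airspeed=20.0)
```

Body-rate plus normalized-thrust commands of this form are exactly the interface that mainstream autopilots such as PX4/APM expose in offboard mode, which is why the paper targets it.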
cs.RO / 19 / 2602.23832
OmniTrack: General Motion Tracking via Physics-Consistent Reference
OmniTrack:通过物理一致的参考实现通用运动跟踪
Abstract
Learning motion tracking from rich human motion data is a foundational task for achieving general control in humanoid robots, enabling them to perform diverse behaviors. However, discrepancies in morphology and dynamics between humans and robots, combined with data noise, introduce physically infeasible artifacts in reference motions, such as floating and penetration. During both training and execution, these artifacts create a conflict between following inaccurate reference motions and maintaining the robot's stability, hindering the development of a generalizable motion tracking policy. To address these challenges, we introduce OmniTrack, a general tracking framework that explicitly decouples physical feasibility from general motion tracking. In the first stage, a privileged generalist policy generates physically plausible motions that strictly adhere to the robot's dynamics via trajectory rollout in simulation. In the second stage, the general control policy is trained to track these physically feasible motions, ensuring stable and coherent control transfer to the real robot. Experiments show that OmniTrack improves tracking accuracy and demonstrates strong generalization to unseen motions. In real-world tests, OmniTrack achieves hour-long, consistent, and stable tracking, including complex acrobatic motions such as flips and cartwheels. Additionally, we show that OmniTrack supports human-style stable and dynamic online teleoperation, highlighting its robustness and adaptability to varying user inputs.
Chinese Translation
从丰富的人类运动数据中学习运动跟踪是实现类人机器人通用控制的基础任务,使其能够执行多样化的行为。然而,人类与机器人在形态和动力学上的差异,加上数据噪声,导致参考运动中出现物理上不可行的伪影,例如漂浮和穿透。在训练和执行过程中,这些伪影在跟随不准确的参考运动与保持机器人稳定性之间产生了冲突,阻碍了可推广运动跟踪策略的发展。为了解决这些挑战,我们提出了OmniTrack,一个明确将物理可行性与通用运动跟踪解耦的通用跟踪框架。在第一阶段,特权通用策略通过在模拟中进行轨迹展开,生成严格遵循机器人动力学的物理上合理的运动。在第二阶段,训练通用控制策略以跟踪这些物理可行的运动,确保稳定且连贯的控制转移到真实机器人。实验表明,OmniTrack提高了跟踪精度,并展示了对未见运动的强泛化能力。在现实世界测试中,OmniTrack实现了长达一小时的一致且稳定的跟踪,包括翻转和侧手翻等复杂的特技动作。此外,我们还展示了OmniTrack支持人类风格的稳定和动态在线遥操作,突显其对不同用户输入的鲁棒性和适应性。
cs.RO / 20 / 2602.23843
OmniXtreme: Breaking the Generality Barrier in High-Dynamic Humanoid Control
OmniXtreme:打破高动态人形控制中的通用性障碍
Abstract
High-fidelity motion tracking serves as the ultimate litmus test for generalizable, human-level motor skills. However, current policies often hit a "generality barrier": as motion libraries scale in diversity, tracking fidelity inevitably collapses - especially for real-world deployment of high-dynamic motions. We identify this failure as the result of two compounding factors: the learning bottleneck in scaling multi-motion optimization and the physical executability constraints that arise in real-world actuation. To overcome these challenges, we introduce OmniXtreme, a scalable framework that decouples general motor skill learning from sim-to-real physical skill refinement. Our approach uses a flow-matching policy with high-capacity architectures to scale representation capacity without interference-intensive multi-motion RL optimization, followed by an actuation-aware refinement phase that ensures robust performance on physical hardware. Extensive experiments demonstrate that OmniXtreme maintains high-fidelity tracking across diverse, high-difficulty datasets. On real robots, the unified policy successfully executes multiple extreme motions, effectively breaking the long-standing fidelity-scalability trade-off in high-dynamic humanoid control.
Chinese Translation
高保真运动跟踪是可推广的人类水平运动技能的终极试金石。然而,当前的策略往往会遇到“通用性障碍”:随着运动库多样性的增加,跟踪的保真度不可避免地下降——尤其是在高动态运动的现实世界应用中。我们将这种失败归因于两个相互影响的因素:在多运动优化中扩展的学习瓶颈以及在现实世界驱动中出现的物理可执行性约束。为了克服这些挑战,我们提出了OmniXtreme,一个可扩展的框架,它将通用运动技能学习与从模拟到现实的物理技能精炼解耦。我们的方法使用高容量架构的流匹配策略,在不需要干扰密集的多运动强化学习优化的情况下扩展表示能力,随后进行一个考虑驱动的精炼阶段,以确保在物理硬件上的稳健性能。大量实验表明,OmniXtreme在多样化的高难度数据集上保持高保真跟踪。在真实机器人上,统一策略成功执行多种极端运动,有效打破了高动态人形控制中长期存在的保真度与可扩展性之间的权衡。
cs.RO / 21 / 2602.23870
Hybrid Offline-Online Reinforcement Learning for Sensorless, High-Precision Force Regulation in Surgical Robotic Grasping
用于无传感器高精度力调节的混合离线-在线强化学习在外科机器人抓取中的应用
Abstract
Precise grasp force regulation in tendon-driven surgical instruments is fundamentally limited by nonlinear coupling between motor dynamics, transmission compliance, friction, and distal mechanics. Existing solutions typically rely on distal force sensing or analytical compensation, increasing hardware complexity or degrading performance under dynamic motion. We present a sensorless control framework that combines physics-consistent modeling and hybrid reinforcement learning to achieve high-precision distal force regulation in a proximally actuated surgical end-effector. We develop a first-principles digital twin of the da Vinci Xi grasping mechanism that captures coupled electrical, transmission, and jaw dynamics within a unified differential-algebraic formulation. To safely learn control policies in this stiff and highly nonlinear system, we introduce a three-stage pipeline: (i) a receding-horizon CMA-ES oracle that generates dynamically feasible expert trajectories, (ii) fully offline policy learning via Implicit Q-Learning to ensure stable initialization without unsafe exploration, and (iii) online refinement using TD3 for adaptation to on-policy dynamics. The resulting policy directly maps proximal measurements to motor voltages and requires no distal sensing. In simulation, the controller maintains grasp force within 1% of the desired reference during multi-harmonic jaw motion. Hardware experiments demonstrate average force errors below 4% across diverse trajectories, validating sim-to-real transfer. The learned policy contains approximately 71k parameters and executes at kHz rates, enabling real-time deployment. These results demonstrate that high-fidelity modeling combined with structured offline-online RL can recover precise distal force behavior without additional sensing, offering a scalable and mechanically compatible solution for surgical robotic manipulation.
Chinese Translation
在由腱驱动的外科器械中,精确的抓取力调节在根本上受到电机动态、传输柔性、摩擦和远端力学之间非线性耦合的限制。现有的解决方案通常依赖于远端力传感或解析补偿,这增加了硬件复杂性或在动态运动下降低了性能。我们提出了一种无传感器控制框架,结合物理一致的建模和混合强化学习,以实现近端驱动的外科末端执行器中高精度的远端力调节。我们开发了达芬奇Xi抓取机制的第一性原理数字双胞胎,捕捉了耦合的电气、传输和夹爪动态,并在统一的微分-代数形式中进行描述。为了在这个刚性且高度非线性的系统中安全地学习控制策略,我们引入了一个三阶段的流程:(i)一个递归视野的 CMA-ES 神谕,生成动态可行的专家轨迹,(ii)通过隐式 Q 学习进行完全离线的策略学习,以确保稳定的初始化而不进行不安全的探索,以及 (iii)使用 TD3 进行在线优化,以适应策略动态。最终的策略将近端测量直接映射到电机电压,并且不需要远端传感。在仿真中,控制器在多谐夹爪运动期间将抓取力保持在期望参考值的 1% 以内。硬件实验表明,在多种轨迹下平均力误差低于 4%,验证了仿真到现实的转移。学习到的策略包含约 71k 参数,并以 kH 的速率执行,支持实时部署。这些结果表明,高保真建模与结构化的离线-在线强化学习相结合,可以在没有额外传感的情况下恢复精确的远端力行为,为外科机器人操作提供了一种可扩展且机械兼容的解决方案。
cs.RO / 22 / 2602.23896
TSC: Topology-Conditioned Stackelberg Coordination for Multi-Agent Reinforcement Learning in Interactive Driving
TSC:面向多智能体强化学习的拓扑条件斯塔克尔博格协调在互动驾驶中的应用
Abstract
Safe and efficient autonomous driving in dense traffic is fundamentally a decentralized multi-agent coordination problem, where interactions at conflict points such as merging and weaving must be resolved reliably under partial observability. With only local and incomplete cues, interaction patterns can change rapidly, often causing unstable behaviors such as oscillatory yielding or unsafe commitments. Existing multi-agent reinforcement learning (MARL) approaches either adopt synchronous decision-making, which exacerbates non-stationarity, or depend on centralized sequencing mechanisms that scale poorly as traffic density increases. To address these limitations, we propose Topology-conditioned Stackelberg Coordination (TSC), a learning framework for decentralized interactive driving under communication-free execution, which extracts a time-varying directed priority graph from braid-inspired weaving relations between trajectories, thereby defining local leader-follower dependencies without constructing a global order of play. Conditioned on this graph, TSC endogenously factorizes dense interactions into graph-local Stackelberg subgames and, under centralized training and decentralized execution (CTDE), learns a sequential coordination policy that anticipates leaders via action prediction and trains followers through action-conditioned value learning to approximate local best responses, improving training stability and safety in dense traffic. Experiments across four dense traffic scenarios show that TSC achieves superior performance over representative MARL baselines across key metrics, most notably reducing collisions while maintaining competitive traffic efficiency and control smoothness.
Chinese Translation
在密集交通中安全高效的自主驾驶根本上是一个去中心化的多智能体协调问题,其中在合流和交织等冲突点的交互必须在部分可观测性下可靠地解决。由于仅有局部和不完整的线索,交互模式可能迅速变化,常常导致不稳定的行为,如振荡让行或不安全的承诺。现有的多智能体强化学习(MARL)方法要么采用同步决策,这加剧了非平稳性,要么依赖于集中式排序机制,随着交通密度的增加而扩展性差。为了解决这些局限性,我们提出了拓扑条件斯塔克尔博格协调(TSC),这是一个在无通信执行下的去中心化互动驾驶学习框架,它从轨迹之间的编织关系中提取出一个时变的有向优先图,从而定义局部的领导者-跟随者依赖关系,而无需构建全局的游戏顺序。在此图的条件下,TSC 内生性地将密集交互分解为图局部的斯塔克尔博格子游戏,并在集中训练和去中心化执行(CTDE)下,学习一种顺序协调策略,通过动作预测预见领导者,并通过动作条件值学习训练跟随者,以逼近局部最佳响应,从而提高密集交通中的训练稳定性和安全性。在四个密集交通场景中的实验表明,TSC 在关键指标上优于代表性的 MARL 基线,最显著的是在保持竞争性交通效率和控制平滑性的同时减少碰撞。
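The core data structure in TSC is a time-varying directed priority graph over interacting agents, from which leader-follower subgames are derived. A minimal sketch, substituting a simple earliest-arrival-at-conflict rule for the paper's braid-inspired weaving relations:

```python
def priority_graph(times_to_conflict):
    """Build a directed leader -> follower priority graph.

    times_to_conflict: dict mapping (i, j) pairs of interacting agents to
    (t_i, t_j), each agent's predicted arrival time at their shared
    conflict point. The agent arriving earlier leads. This earliest-arrival
    rule is a simplified stand-in for the paper's braid-inspired weaving
    relations; it only illustrates how local dependencies are formed
    without any global order of play.
    """
    graph = {}
    for (i, j), (t_i, t_j) in times_to_conflict.items():
        leader, follower = (i, j) if t_i <= t_j else (j, i)
        graph.setdefault(leader, set()).add(follower)
    return graph

# Agent 0 reaches its conflict with agent 1 first; agent 2 beats agent 1.
g = priority_graph({(0, 1): (1.2, 2.5), (1, 2): (2.5, 1.8)})
```

Each edge then scopes one local Stackelberg subgame: the follower conditions its value estimate on the leader's predicted action, so no agent ever needs the full global ordering.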
cs.RO / 23 / 2602.23901
ABPolicy: Asynchronous B-Spline Flow Policy for Real-Time and Smooth Robotic Manipulation
ABPolicy:用于实时和平滑机器人操作的异步 B-样条流策略
Abstract
Robotic manipulation requires policies that are smooth and responsive to evolving observations. However, synchronous inference in the raw action space introduces several challenges, including intra-chunk jitter, inter-chunk discontinuities, and stop-and-go execution. These issues undermine a policy's smoothness and its responsiveness to environmental changes. We propose ABPolicy, an asynchronous flow-matching policy that operates in a B-spline control-point action space. First, the B-spline representation ensures intra-chunk smoothness. Second, we introduce bidirectional action prediction coupled with refitting optimization to enforce inter-chunk continuity. Finally, by leveraging asynchronous inference, ABPolicy delivers real-time, continuous updates. We evaluate ABPolicy across seven tasks encompassing both static settings and dynamic settings with moving objects. Empirical results indicate that ABPolicy reduces trajectory jerk, leading to smoother motion and improved performance. Project website: https://teee000.github.io/ABPolicy/.
Chinese Translation
机器人操作需要平滑且对不断变化的观察具有响应性的策略。然而,在原始动作空间中的同步推理引入了若干挑战,包括块内抖动、块间不连续性以及停顿与启动的执行。这些问题削弱了策略的平滑性以及对环境变化的响应能力。我们提出了 ABPolicy,一种在 B-样条控制点动作空间中运行的异步流匹配策略。首先,B-样条表示确保了块内的平滑性。其次,我们引入了双向动作预测,并结合重新拟合优化以强制实现块间的连续性。最后,通过利用异步推理,ABPolicy 提供了实时的连续更新。我们在七个任务中评估了 ABPolicy,这些任务涵盖了静态设置和动态设置(包括移动物体)。实证结果表明,ABPolicy 减少了轨迹抖动,从而实现了更平滑的运动和更好的性能。项目网站:https://teee000.github.io/ABPolicy/
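The intra-chunk smoothness claim rests on the representation itself: a policy that outputs B-spline control points inherits C2 continuity within each chunk for free. A minimal sketch of uniform cubic B-spline evaluation in the standard basis-matrix form, independent of ABPolicy's architecture:

```python
def cubic_bspline_point(p0, p1, p2, p3, u):
    """Evaluate one segment of a uniform cubic B-spline at u in [0, 1].

    Control points p0..p3 are scalars (one action dimension). A policy
    predicting such control points executes their spline, which is
    C2-smooth across segments; this is just the textbook basis-matrix
    form of the uniform cubic B-spline.
    """
    b0 = (1 - u) ** 3 / 6.0
    b1 = (3 * u ** 3 - 6 * u ** 2 + 4) / 6.0
    b2 = (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6.0
    b3 = u ** 3 / 6.0
    return b0 * p0 + b1 * p1 + b2 * p2 + b3 * p3

# Dense samples of one segment; collinear control points reproduce a line.
traj = [cubic_bspline_point(0.0, 1.0, 2.0, 3.0, i / 10.0) for i in range(11)]
```

Note the spline does not interpolate its control points (the segment runs from 1.0 to 2.0, not 0.0 to 3.0); it is this averaging that yields low jerk, and why ABPolicy needs the refitting step to keep chunk boundaries continuous.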
cs.RO / 24 / 2602.23923
Teleoperated Omni-directional Dual Arm Mobile Manipulation Robotic System with Shared Control for Retail Store
用于零售店的远程操作全向双臂移动操控机器人系统与共享控制
Abstract
The swiftly expanding retail sector is increasingly adopting autonomous mobile robots empowered by artificial intelligence and machine learning algorithms to gain an edge in the competitive market. However, these autonomous robots encounter challenges in adapting to the dynamic nature of retail products, often struggling to operate autonomously in novel situations. In this study, we introduce an omni-directional dual-arm mobile robot specifically tailored for use in retail environments. Additionally, we propose a tele-operation method that enables shared control between the robot and a human operator. This approach utilizes a Virtual Reality (VR) motion capture system to capture the operator's commands, which are then transmitted to the robot located remotely in a retail setting. Furthermore, the robot is equipped with heterogeneous grippers on both manipulators, facilitating the handling of a wide range of items. We validate the efficacy of the proposed system through testing in a mockup of a retail environment, demonstrating its ability to manipulate various commonly encountered retail items using both single and dual-arm coordinated manipulation techniques.
Chinese Translation
迅速扩张的零售行业正日益采用由人工智能和机器学习算法驱动的自主移动机器人,以在竞争激烈的市场中获得优势。然而,这些自主机器人在适应零售产品的动态特性方面面临挑战,往往在新环境中难以自主操作。在本研究中,我们介绍了一种专门为零售环境设计的全向双臂移动机器人。此外,我们提出了一种远程操作方法,使机器人与人类操作员之间实现共享控制。该方法利用虚拟现实(Virtual Reality, VR)运动捕捉系统来捕捉操作员的指令,并将其传输到位于零售环境中的远程机器人。此外,机器人在两个操控器上配备了异构抓手,便于处理各种物品。我们通过在模拟零售环境中的测试验证了所提系统的有效性,展示了其使用单臂和双臂协调操控技术操作各种常见零售物品的能力。
cs.RO / 25 / 2602.23934
Learning to Build: Autonomous Robotic Assembly of Stable Structures Without Predefined Plans
学习构建:无预定义计划的自主机器人稳定结构组装
Abstract
This paper presents a novel autonomous robotic assembly framework for constructing stable structures without relying on predefined architectural blueprints. Instead of following fixed plans, construction tasks are defined through targets and obstacles, allowing the system to adapt more flexibly to environmental uncertainty and variations during the building process. A reinforcement learning (RL) policy, trained using deep Q-learning with successor features, serves as the decision-making component. As a proof of concept, we evaluate the approach on a benchmark of 15 2D robotic assembly tasks of discrete block construction. Experiments using a real-world closed-loop robotic setup demonstrate the feasibility of the method and its ability to handle construction noise. The results suggest that our framework offers a promising direction for more adaptable and robust robotic construction in real-world environments.
Chinese Translation
本文提出了一种新颖的自主机器人组装框架,用于在不依赖预定义建筑蓝图的情况下构建稳定结构。与遵循固定计划不同,施工任务通过目标和障碍物进行定义,使系统能够更灵活地适应建筑过程中环境的不确定性和变化。采用深度 Q 学习与后继特征训练的强化学习(RL)策略作为决策组件。作为概念验证,我们在 15 个离散块构建的 2D 机器人组装任务基准上评估了该方法。使用真实世界闭环机器人设置的实验展示了该方法的可行性及其处理施工噪声的能力。结果表明,我们的框架为在真实环境中实现更具适应性和鲁棒性的机器人施工提供了一个有前景的方向。
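The successor-feature construction used by the decision-making policy factors the Q-function as Q(s, a) = psi(s, a) · w, so one learned dynamics summary psi can be re-weighted for new target/obstacle specifications. A minimal sketch with made-up feature values (the actual features and weights in the paper are learned):

```python
def q_from_successor_features(psi, w):
    """Q-values from successor features: Q(s, a) = psi(s, a) . w.

    psi maps each action to the expected discounted sum of state features
    under the current policy; w are task weights so that reward = phi . w.
    Swapping w retargets the same psi to a new task specification without
    relearning the dynamics. All values here are illustrative.
    """
    return {a: sum(p * wi for p, wi in zip(feats, w))
            for a, feats in psi.items()}

# Hypothetical features per action: [build progress, collapse risk].
psi = {"place": [1.0, 0.2], "wait": [0.1, 0.9]}
q_build = q_from_successor_features(psi, w=[1.0, 0.0])   # reward progress
q_avoid = q_from_successor_features(psi, w=[0.0, -1.0])  # penalize risk
```

This decoupling of "what the environment does" from "what the task rewards" is what lets construction tasks be re-specified through targets and obstacles rather than fixed blueprints.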
cs.RO / 26 / 2602.23937
Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos
通过真实室内旅游视频中的多模态事件知识增强视觉-语言导航
Abstract
Vision-Language Navigation (VLN) agents often struggle with long-horizon reasoning in unseen environments, particularly when facing ambiguous, coarse-grained instructions. While recent advances use knowledge graphs to enhance reasoning, the potential of multimodal event knowledge inspired by human episodic memory remains underexplored. In this work, we propose an event-centric knowledge enhancement strategy for automated process knowledge mining and feature fusion to solve coarse-grained instruction and long-horizon reasoning in the VLN task. First, we construct YE-KG, the first large-scale multimodal spatiotemporal knowledge graph, with over 86k nodes and 83k edges, derived from real-world indoor videos. By leveraging multimodal large language models (e.g., LLaVa, GPT4), we extract unstructured video streams into structured semantic-action-effect events to serve as explicit episodic memory. Second, we introduce STE-VLN, which integrates the above graph into VLN models via a Coarse-to-Fine Hierarchical Retrieval mechanism. This allows agents to retrieve causal event sequences and dynamically fuse them with egocentric visual observations. Experiments on REVERIE, R2R, and R2R-CE benchmarks demonstrate the effectiveness of our event-centric strategy, outperforming state-of-the-art approaches across diverse action spaces. Our data and code are available on the project website https://sites.google.com/view/y-event-kg/.
Chinese Translation
视觉-语言导航(VLN)代理在未知环境中往往面临长时间推理的挑战,尤其是在处理模糊和粗略指令时。尽管最近的研究利用知识图谱来增强推理能力,但受人类情节记忆启发的多模态事件知识的潜力仍未得到充分探索。在本研究中,我们提出了一种以事件为中心的知识增强策略,用于自动化过程知识挖掘和特征融合,以解决VLN任务中的粗略指令和长时间推理问题。首先,我们构建了YE-KG,这是第一个大规模多模态时空知识图谱,包含超过86,000个节点和83,000条边,源自真实的室内视频。通过利用多模态大型语言模型(如LLaVa、GPT4),我们将非结构化的视频流提取为结构化的语义-动作-效果事件,以作为显式的情节记忆。其次,我们引入了STE-VLN,通过粗到细的层次检索机制将上述图谱整合到VLN模型中。这使得代理能够检索因果事件序列,并将其与自我中心的视觉观测动态融合。在REVERIE、R2R和R2R-CE基准上的实验表明,我们的以事件为中心的策略的有效性,在多样的动作空间中超越了最先进的方法。我们的数据和代码可在项目网站 https://sites.google.com/view/y-event-kg/ 上获取。
cs.RO / 27 / 2602.23972
Learning Robust Control Policies for Inverted Pose on Miniature Blimp Robots
为迷你飞艇机器人学习稳健的倒立姿态控制策略
Abstract
The ability to achieve and maintain inverted poses is essential for unlocking the full agility of miniature blimp robots (MBRs). However, developing reliable control methods for MBRs remains challenging due to their complex and underactuated dynamics. To address this challenge, we propose a novel framework that enables robust control policy learning for inverted pose on MBRs. The proposed framework operates through three core stages: First, a high-fidelity three-dimensional (3D) simulation environment was constructed, which was calibrated against real-world MBR motion data to ensure accurate replication of inverted-state dynamics. Second, a robust policy for MBR inverted control was trained within the simulation environment via a domain randomization strategy and a modified Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. Third, a mapping layer was designed to bridge the sim-to-real gap for the learned policy deployment. Comprehensive evaluations in the simulation environment demonstrate that the learned policy achieves a higher success rate compared to the energy-shaping controller. Furthermore, experimental results confirm that the learned policy with a mapping layer enables an MBR to achieve and maintain a fully upside-down pose in real-world settings.
Chinese Translation
实现和维持倒立姿态的能力对于释放迷你飞艇机器人(MBRs)的全部灵活性至关重要。然而,由于其复杂且欠驱动的动态特性,为MBRs开发可靠的控制方法仍然具有挑战性。为了解决这一挑战,我们提出了一种新颖的框架,使得MBRs的倒立姿态能够进行稳健的控制策略学习。该框架通过三个核心阶段进行操作:首先,构建了一个高保真度的三维(3D)仿真环境,并根据真实世界MBR运动数据进行了校准,以确保倒立状态动态的准确复制。其次,在仿真环境中,通过领域随机化策略和修改后的双延迟深度确定性策略梯度(TD3)算法训练了MBR倒立控制的稳健策略。第三,设计了一个映射层,以弥合学习策略部署的仿真与现实之间的差距。在仿真环境中的全面评估表明,学习到的策略相比于能量塑形控制器具有更高的成功率。此外,实验结果确认,带有映射层的学习策略使得MBR能够在现实环境中实现并维持完全倒立的姿态。
cs.RO / 28 / 2602.24011
Autonomous Inspection of Power Line Insulators with UAV on an Unmapped Transmission Tower
无人机在未映射输电塔上对绝缘子进行自主检查
Abstract
This paper introduces an online inspection algorithm that enables an autonomous UAV to fly around a transmission tower and obtain detailed inspection images without a prior map of the tower. Our algorithm relies on camera-LiDAR sensor fusion for online detection and localization of insulators. In particular, the algorithm is based on insulator detection using a convolutional neural network, projection of LiDAR points onto the image, and filtering them using the bounding boxes. The detection pipeline is coupled with several proposed insulator localization methods based on DBSCAN, RANSAC, and PCA algorithms. The performance of the proposed online inspection algorithm and camera-LiDAR sensor fusion pipeline is demonstrated through simulation and real-world flights. In simulation, we showed that our single-flight inspection strategy can save up to 24% of total inspection time, compared to the two-flight strategy of scanning the tower and afterwards visiting the inspection waypoints in the optimal way. In a real-world experiment, the best performing proposed method achieves a mean horizontal and vertical localization error for the insulator of 0.16 ± 0.08 m and 0.16 ± 0.11 m, respectively. Compared to the most relevant approach, the proposed method achieves more than an order of magnitude lower variance in horizontal insulator localization error.
Chinese Translation
本文介绍了一种在线检查算法,使得自主无人机能够在没有事先映射输电塔的情况下,围绕输电塔飞行并获取详细的检查图像。我们的算法依赖于相机-激光雷达(LiDAR)传感器融合,以实现绝缘子的在线检测和定位。具体而言,该算法基于使用卷积神经网络进行绝缘子检测,将激光雷达点投影到图像上,并使用边界框对其进行过滤。检测流程结合了几种基于DBSCAN、RANSAC和PCA算法的绝缘子定位方法。通过仿真和实际飞行,展示了所提出的在线检查算法和相机-激光雷达传感器融合流程的性能。在仿真中,我们显示出我们的单次飞行检查策略相比于扫描塔体后以最优方式访问检查航点的双次飞行策略,可以节省多达24%的总检查时间。在实际实验中,表现最佳的所提方法在绝缘子的水平和垂直定位误差分别达到了0.16 ± 0.08米和0.16 ± 0.11米。与最相关的方法相比,所提方法在水平绝缘子定位误差上实现了超过一个数量级的方差降低。
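The camera-LiDAR fusion step (projecting LiDAR points onto the image and filtering them by detector bounding boxes) can be sketched with a pinhole camera model. The intrinsics and point values below are made up for illustration, and extrinsic calibration (transforming LiDAR points into the camera frame) is assumed done beforehand:

```python
def project_and_filter(points, fx, fy, cx, cy, bbox):
    """Project 3D points (camera frame) and keep those inside a bbox.

    points: (x, y, z) tuples with z the forward depth; fx, fy, cx, cy are
    pinhole intrinsics; bbox = (u_min, v_min, u_max, v_max) comes from the
    insulator detector. Returns the 3D points whose image projections fall
    inside the box, i.e. the candidate insulator cluster handed on to the
    DBSCAN/RANSAC/PCA localization stage.
    """
    u_min, v_min, u_max, v_max = bbox
    kept = []
    for x, y, z in points:
        if z <= 0:
            continue  # behind the camera plane
        u = fx * x / z + cx
        v = fy * y / z + cy
        if u_min <= u <= u_max and v_min <= v <= v_max:
            kept.append((x, y, z))
    return kept

pts = [(0.0, 0.0, 5.0), (2.0, 0.0, 5.0), (0.0, 0.0, -1.0)]
inside = project_and_filter(pts, fx=500, fy=500, cx=320, cy=240,
                            bbox=(300, 220, 340, 260))
```

Only the point projecting into the detector's box survives; the off-axis point lands outside the box and the point behind the camera is discarded outright.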
cs.RO / 29 / 2602.24030
Curriculum Reinforcement Learning for Quadrotor Racing with Random Obstacles
针对随机障碍物的四旋翼赛车课程强化学习
Abstract
Autonomous drone racing has attracted increasing interest as a research topic for exploring the limits of agile flight. However, existing studies primarily focus on obstacle-free racetracks, while the perception and dynamic challenges introduced by obstacles remain underexplored, often resulting in low success rates and limited robustness in real-world flight. To this end, we propose a novel vision-based curriculum reinforcement learning framework for training a robust controller capable of addressing unseen obstacles in drone racing. We combine multi-stage curriculum learning, domain randomization, and a multi-scene updating strategy to address the conflicting challenges of obstacle avoidance and gate traversal. Our end-to-end control policy is implemented as a single network, allowing high-speed flight of quadrotors in environments with variable obstacles. Both hardware-in-the-loop and real-world experiments demonstrate that our method achieves faster lap times and higher success rates than existing approaches, effectively advancing drone racing in obstacle-rich environments. The video and code are available at: https://github.com/SJTU-ViSYS-team/CRL-Drone-Racing.
Chinese Translation
自主无人机赛车作为探索灵活飞行极限的研究课题,越来越受到关注。然而,现有研究主要集中在无障碍赛道上,而障碍物带来的感知和动态挑战仍然未得到充分探索,通常导致实际飞行中的成功率低和鲁棒性有限。为此,我们提出了一种新颖的基于视觉的课程强化学习框架,用于训练能够应对无人机赛车中未见障碍物的鲁棒控制器。我们结合了多阶段课程学习、领域随机化和多场景更新策略,以应对障碍物规避和门口穿越之间的矛盾挑战。我们的端到端控制策略实现为单一网络,允许四旋翼在具有可变障碍物的环境中进行高速飞行。硬件在环和实际实验表明,我们的方法在障碍物丰富的环境中实现了比现有方法更快的圈速和更高的成功率,有效推动了无人机赛车的发展。视频和代码可在以下链接获取:https://github.com/SJTU-ViSYS-team/CRL-Drone-Racing。
cs.RO / 30 / 2602.24104
Geometry-based pneumatic actuators for soft robotics
基于几何的软体机器人气动驱动器
Abstract
Soft pneumatic actuators enable safe human-machine interaction with lightweight and powerful applied parts. On the other hand, they suffer from design limitations regarding complex actuation patterns, including minimum bending radii, multi-state capabilities, and structural stability. We present geometry-based pneumatic actuators (GPAs), a design and implementation approach that introduces constraint layers with configurable CNC heat-sealed chambers. The approach achieves predictable deformation, near-zero bending radii, and multi-state actuation, and enables customizable and repeatable complex actuated geometries. Mathematical modeling reveals predictable linear angle transformations and validates nonlinear torque-angle relationships across diverse configurations. We demonstrate the versatility of the GPAs approach through three applications: a 49 g wrist exoskeleton reducing muscle activity by up to 51%, a 30.8 g haptic interface delivering 8 N force feedback with fast response, and a 208 g bipedal robot achieving multi-gait locomotion. GPAs establish a configurable platform for next-generation wearable robotics, haptic systems, and soft locomotion devices.
Chinese Translation
软气动驱动器使轻量且强大的应用部件能够安全地与人机交互。然而,它们在复杂驱动模式方面存在设计限制,包括最小弯曲半径、多状态能力和结构稳定性。我们提出了一种基于几何的气动驱动器(Geometry-based Pneumatic Actuators, GPAs),这种设计和实施方法引入了具有可配置CNC热封腔的约束层。该方法实现了可预测的变形、近乎零的弯曲半径、多状态驱动,并使复杂驱动几何形状的定制和重复成为可能。数学建模揭示了可预测的线性角度变换,并验证了在不同配置下的非线性扭矩-角度关系。我们通过三个应用展示了GPAs方法的多样性:一个49克的腕部外骨骼将肌肉活动减少了多达51%,一个30.8克的触觉接口提供了8牛顿的力反馈且响应迅速,以及一个208克的双足机器人实现了多种步态的运动。GPAs为下一代可穿戴机器人、触觉系统和软体运动设备建立了一个可配置的平台。
cs.RO / 31 / 2602.24121
Planning from Observation and Interaction
基于观察和互动的规划
Abstract
Observational learning requires an agent to learn to perform a task by referencing only observations of the performed task. This work investigates the equivalent setting in real-world robot learning where access to hand-designed rewards and demonstrator actions is not assumed. To address this data-constrained setting, this work presents a planning-based Inverse Reinforcement Learning (IRL) algorithm for world modeling from observation and interaction alone. Experiments conducted entirely in the real-world demonstrate that this paradigm is effective for learning image-based manipulation tasks from scratch in under an hour, without assuming prior knowledge, pre-training, or data of any kind beyond task observations. Moreover, this work demonstrates that the learned world model representation is capable of online transfer learning in the real-world from scratch. In comparison to existing approaches, including IRL, RL, and Behavior Cloning (BC), which have more restrictive assumptions, the proposed approach demonstrates significantly greater sample efficiency and success rates, enabling a practical path forward for online world modeling and planning from observation and interaction. Videos and more at: https://uwrobotlearning.github.io/mpail2/.
Chinese Translation
观察学习要求智能体仅通过参考已执行任务的观察来学习执行任务。本研究探讨了现实世界机器人学习中的等效设置,其中不假设可以访问手工设计的奖励和示范者的动作。为了解决这一数据受限的情境,本文提出了一种基于规划的逆强化学习(Inverse Reinforcement Learning, IRL)算法,仅通过观察和互动进行世界建模。完全在现实世界中进行的实验表明,这一范式在不到一小时的时间内有效地从头学习基于图像的操作任务,而不假设任何先验知识、预训练或超出任务观察的数据。此外,本研究表明,学习到的世界模型表示能够在现实世界中进行在线迁移学习。与现有方法(包括IRL、强化学习(Reinforcement Learning, RL)和行为克隆(Behavior Cloning, BC))相比,这些方法具有更严格的假设,所提出的方法显示出显著更高的样本效率和成功率,为基于观察和互动的在线世界建模与规划提供了切实可行的前进路径。视频及更多信息请访问:https://uwrobotlearning.github.io/mpail2/.
cs.RO / 32 / 2602.24143
Robust Skills, Brittle Grounding: Diagnosing Restricted Generalization in Vision-Language Action Policies via Multi-Object Picking
稳健的技能,脆弱的基础:通过多物体拾取诊断视觉-语言行动策略中的有限泛化
Abstract
Vision-language action (VLA) policies often report strong manipulation benchmark performance with relatively few demonstrations, but it remains unclear whether this reflects robust language-to-object grounding or reliance on object--location correlations that do not transfer beyond the training distribution. We present a controlled multi-object picking study that progressively increases object placement variability up to full workspace randomization and evaluates held-out object--location pairings that break familiar associations without increasing spatial difficulty. Across these stress tests and data scaling, we find that for representative VLA policies, including SmolVLA and $\pi_{0.5}$, execution of the manipulation primitive remains substantially more reliable than instruction-conditioned task success in harder regimes, suggesting that manipulation skill acquisition is decoupled from instruction following. We recommend augmenting manipulation benchmarks with task ladders and decomposed metrics that separately measure primitive execution and instruction-conditioned success to better diagnose instruction-grounded generalization.
Chinese Translation
视觉-语言行动(VLA)策略通常在相对较少的示范下报告出色的操控基准性能,但尚不清楚这是否反映了稳健的语言与物体的基础,还是依赖于在训练分布之外无法转移的物体-位置关联。我们提出了一项受控的多物体拾取研究,逐步增加物体放置的变异性,直到完全随机化工作空间,并评估打破熟悉关联的保留物体-位置配对,而不增加空间难度。在这些压力测试和数据扩展中,我们发现,对于代表性的VLA策略,包括SmolVLA和$ ext{π}_{0.5}$,操控原语的执行在更困难的环境中仍然显著比基于指令的任务成功更可靠,这表明操控技能的获取与指令遵循是解耦的。我们建议通过任务阶梯和分解指标来增强操控基准,这些指标分别测量原语执行和基于指令的成功,以更好地诊断基于指令的泛化。
cs.RO / 33 / 2602.24156
Humanoid Robots as First Assistants in Endoscopic Surgery
类人机器人作为内窥镜手术中的首要助手
Abstract
Humanoid robots have become a focal point of technological ambition, with claims of surgical capability within years in mainstream discourse. These projections are aspirational yet lack empirical grounding. To date, no humanoid has assisted a surgeon through an actual procedure, let alone performed one. The work described here breaks new ground. Here we report a proof of concept in which a teleoperated Unitree G1 provided endoscopic visualization while an attending otolaryngologist performed a cadaveric sphenoidectomy. The procedure was completed successfully, with stable visualization maintained throughout. Teleoperation allowed assessment of whether the humanoid form factor could meet the physical demands of surgical assistance in terms of endurance and precision; the cognitive demands were satisfied -- for now -- by the operator. Post-procedure analysis identified engineering targets for clinical translation, alongside near-term opportunities such as autonomous diagnostic scoping. This work establishes form-factor feasibility for humanoid surgical assistance while identifying challenges for continued development.
Chinese Translation
类人机器人已成为技术雄心的焦点,主流话语中声称其在数年内具备外科手术能力。这些预测虽然充满理想,但缺乏实证基础。迄今为止,没有任何类人机器人在实际手术中协助外科医生,更不用说独立完成手术。本文所述的工作开辟了这一新领域。我们报告了一项概念验证,其中一台远程操作的Unitree G1在一位耳鼻喉科医生进行尸体蝶骨切除术时提供了内窥镜可视化。该手术成功完成,整个过程中保持了稳定的可视化。远程操作评估了类人机器人形态是否能够满足外科助手在维持和精确度方面的物理要求;认知需求则由操作员暂时满足。手术后的分析确定了临床转化的工程目标,以及诸如自主诊断内窥镜检查等近期机会。这项工作确立了类人手术助手的形态可行性,同时识别了持续发展的挑战。
cs.RO / 34 / 2602.24192
How IMU Drift Influences Multi-Radar Inertial Odometry for Ground Robots in Subterranean Terrains
IMU漂移如何影响地下地形中地面机器人的多雷达惯性里程计
Abstract
Reliable radar inertial odometry (RIO) requires mitigating IMU bias drift, a challenge that intensifies in subterranean environments due to extreme temperatures and gravity-induced accelerations. Cost-effective IMUs such as the Pixhawk, when paired with FMCW TI IWR6843AOP EVM radars, suffer from drift-induced degradation compounded by sparse, noisy, and flickering radar returns, making fusion less stable than LiDAR-based odometry. Yet, LiDAR fails under smoke, dust, and aerosols, whereas FMCW radars remain compact, lightweight, cost-effective, and robust in these situations. To address these challenges, we propose a two-stage MRIO framework that incorporates an IMU bias estimator for resilient localization and mapping in GPS-denied subterranean environments affected by smoke. Radar-based ego-velocity estimation is formulated through a least-squares approach and incorporated into an EKF for online IMU bias correction; the corrected IMU accelerations are fused with heterogeneous measurements from multiple radars and an IMU to refine odometry. The proposed framework further supports radar-only mapping by exploiting the robot's estimated translational and rotational displacements. In subterranean field trials, MRIO delivers robust localization and mapping, outperforming EKF-RIO. It maintains accuracy across cost-efficient FMCW radar setups and different IMUs, showing resilience with Pixhawk and higher-grade units such as VectorNav. The implementation will be provided as an open-source resource to the community (code available at https://github.com/LTU-RAI/MRIO).
Chinese Translation
可靠的雷达惯性里程计(RIO)需要减轻IMU偏差漂移,这一挑战在地下环境中因极端温度和重力引起的加速度而加剧。像Pixhawk这样的经济型IMU在与FMCW TI IWR6843AOP EVM雷达配对时,受到漂移引起的退化影响,且稀疏、噪声和闪烁的雷达回波使得融合的稳定性低于基于LiDAR的里程计。然而,LiDAR在烟雾、灰尘和气溶胶环境下失效,而FMCW雷达在这些情况下仍然保持紧凑、轻便、经济且稳健。为了解决这些挑战,我们提出了一种两阶段的多雷达惯性里程计(MRIO)框架,该框架以IMU偏差估计器为核心,以实现受烟雾影响、无GPS的地下环境中的鲁棒定位和地图构建。基于雷达的自我速度估计通过最小二乘法进行公式化,并纳入扩展卡尔曼滤波器(EKF)进行在线IMU偏差校正;校正后的IMU加速度与来自多个雷达和IMU的异构测量数据融合,以精细化里程计。所提出的框架进一步通过利用机器人估计的平移和旋转位移支持仅基于雷达的地图构建。在地下实地试验中,MRIO提供了鲁棒的定位和地图构建,优于EKF-RIO。它在经济高效的FMCW雷达设置和不同IMU之间保持准确性,在使用Pixhawk以及VectorNav等更高等级单元时均表现出鲁棒性。该实现将作为开源资源提供给社区(代码可在https://github.com/LTU-RAI/MRIO获取)。
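The least-squares ego-velocity step the abstract describes can be sketched minimally: for a static scene, each radar return with unit direction d and measured Doppler (radial) velocity satisfies doppler = -d · v_ego, giving an overdetermined linear system. The function name and synthetic data below are illustrative stand-ins, not the authors' implementation.

```python
import numpy as np

def ego_velocity_lsq(directions, doppler):
    """Least-squares ego-velocity from one radar scan.

    directions : (N, 3) unit vectors from the sensor to each target.
    doppler    : (N,) measured radial velocities; a static scene implies
                 doppler_i = -directions_i . v_ego.
    """
    D = np.asarray(directions, dtype=float)
    r = -np.asarray(doppler, dtype=float)
    v_ego, *_ = np.linalg.lstsq(D, r, rcond=None)
    return v_ego

# Synthetic check: fabricate radar returns from a known velocity.
rng = np.random.default_rng(0)
v_true = np.array([1.0, -0.5, 0.2])
dirs = rng.normal(size=(50, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
dopp = -dirs @ v_true + rng.normal(scale=1e-3, size=50)
v_est = ego_velocity_lsq(dirs, dopp)
```

In the paper's pipeline this estimate would feed an EKF as a velocity observation for online bias correction; here it is only the standalone least-squares fit.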
cs.RO / 35 / 2602.24202
Evaluating Accuracy of Vine Robot Shape Sensing with Distributed Inertial Measurement Units
评估分布式惯性测量单元的藤蔓机器人形状感知精度
Abstract
Soft, tip-extending vine robots are well suited for navigating tight, debris-filled environments, making them ideal for urban search and rescue. Sensing the full shape of a vine robot's body is helpful both for localizing information from other sensors placed along the robot body and for determining the robot's configuration within the space being explored. Prior approaches have localized vine robot tips using a single inertial measurement unit (IMU) combined with force sensing or length estimation, while one method demonstrated full-body shape sensing using distributed IMUs on a passively steered robot in controlled maze environments. However, the accuracy of distributed IMU-based shape sensing under active steering, varying robot lengths, and different sensor spacings has not been systematically quantified. In this work, we experimentally evaluate the accuracy of vine robot shape sensing using distributed IMUs along the robot body. We quantify IMU drift, measuring an average orientation drift rate of 1.33 degrees/min across 15 sensors. For passive steering, mean tip position error was 11% of robot length. For active steering, mean tip position error increased to 16%. During growth experiments across lengths from 30-175 cm, mean tip error was 8%, with a positive trend with increasing length. We also analyze the influence of sensor spacing and observe that intermediate spacings can minimize error for single-curvature shapes. These results demonstrate the feasibility of distributed IMU-based shape sensing for vine robots while highlighting key limitations and opportunities for improved modeling and algorithmic integration for field deployment.
Chinese Translation
柔性、尖端延伸的藤蔓机器人非常适合在狭窄、充满碎片的环境中导航,使其成为城市搜索和救援的理想选择。感知藤蔓机器人身体的完整形状有助于从放置在机器人身体上的其他传感器获取位置信息,并确定机器人在被探索空间中的配置。之前的方法使用单个惯性测量单元(IMU)结合力传感或长度估计来定位藤蔓机器人尖端,而一种方法则在受控迷宫环境中展示了使用分布式IMU的全身形状感知。然而,在主动转向、不同机器人长度和不同传感器间距下,基于分布式IMU的形状感知精度尚未系统量化。在本研究中,我们通过实验评估了沿机器人身体使用分布式IMU进行藤蔓机器人形状感知的精度。我们量化了IMU漂移,测得15个传感器的平均方向漂移率为1.33度/分钟。对于被动转向,尖端位置的平均误差为机器人长度的11%。对于主动转向,尖端位置的平均误差增加到16%。在长度范围为30-175厘米的生长实验中,尖端的平均误差为8%,并随着长度增加呈正趋势。我们还分析了传感器间距的影响,观察到中间间距可以最小化单曲率形状的误差。这些结果证明了基于分布式IMU的藤蔓机器人形状感知的可行性,同时突出了关键的局限性和改善建模及算法集成以便于现场部署的机会。
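The distributed-IMU shape sensing idea reduces, in the simplest planar case, to dead reckoning: chain fixed-length segments whose directions come from per-sensor orientation estimates. A minimal sketch under that planar simplification (the function name and quarter-circle test case are illustrative assumptions, not the paper's method):

```python
import numpy as np

def shape_from_headings(headings, spacing):
    """Dead-reckon a chain of segment endpoints from per-IMU headings.

    headings : (N,) tangent angles (rad), one per sensor segment;
               a planar reduction of full 3D orientations.
    spacing  : distance between consecutive IMUs along the body.
    Returns the (N+1, 2) polyline of endpoints, base at the origin.
    """
    steps = spacing * np.stack([np.cos(headings), np.sin(headings)], axis=1)
    return np.vstack([np.zeros((1, 2)), np.cumsum(steps, axis=0)])

# Quarter-circle body of radius 1 sampled by 15 sensors.
n = 15
arc = np.pi / 2                                   # total body length
theta = (np.arange(n) + 0.5) / n * (np.pi / 2)    # mid-segment tangents
pts = shape_from_headings(theta, arc / n)
tip_true = np.array([1.0, 1.0])                   # analytic arc endpoint
tip_err_pct = 100 * np.linalg.norm(pts[-1] - tip_true) / arc
```

With noise-free headings the tip error is a fraction of a percent of body length; the paper's 8-16% figures reflect real IMU drift and model mismatch on top of this geometric chaining.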
cs.RO / 36 / 2602.24235
SafeGen-LLM: Enhancing Safety Generalization in Task Planning for Robotic Systems
SafeGen-LLM:提升机器人系统任务规划中的安全性泛化
Abstract
Safety-critical task planning in robotic systems remains challenging: classical planners suffer from poor scalability, Reinforcement Learning (RL)-based methods generalize poorly, and base Large Language Models (LLMs) cannot guarantee safety. To address this gap, we propose safety-generalizable large language models, named SafeGen-LLM. SafeGen-LLM can not only enhance the safety satisfaction of task plans but also generalize well to novel safety properties in various domains. We first construct a multi-domain Planning Domain Definition Language 3 (PDDL3) benchmark with explicit safety constraints. Then, we introduce a two-stage post-training framework: Supervised Fine-Tuning (SFT) on a constraint-compliant planning dataset to learn planning syntax and semantics, and Group Relative Policy Optimization (GRPO) guided by fine-grained reward machines derived from formal verification to enforce safety alignment and by curriculum learning to better handle complex tasks. Extensive experiments show that SafeGen-LLM achieves strong safety generalization and outperforms frontier proprietary baselines across multi-domain planning tasks and multiple input formats (e.g., PDDLs and natural language).
Chinese Translation
在机器人系统中,安全关键的任务规划仍然面临挑战:经典规划器在可扩展性方面表现不佳,基于强化学习(Reinforcement Learning, RL)的方法泛化能力差,而基础的大型语言模型(Large Language Models, LLMs)无法保证安全性。为了解决这一问题,我们提出了一种安全可泛化的大型语言模型,命名为SafeGen-LLM。SafeGen-LLM不仅能够增强任务规划对安全约束的满足程度,还能在各种领域中对新颖的安全属性进行良好的泛化。我们首先构建了一个具有明确安全约束的多领域规划领域定义语言3(Planning Domain Definition Language 3, PDDL3)基准。然后,我们引入了一个两阶段的后训练框架:在符合约束的规划数据集上进行监督微调(Supervised Fine-Tuning, SFT),以学习规划的语法和语义;以及通过从形式验证中推导出的细粒度奖励机(reward machines)指导的群体相对策略优化(Group Relative Policy Optimization, GRPO),以强化安全对齐,并通过课程学习更好地处理复杂任务。大量实验表明,SafeGen-LLM在安全泛化方面表现出色,并在多领域规划任务和多种输入格式(例如PDDL和自然语言)上超越了前沿的专有基线。
cs.CV / 1 / 2602.23438
DesignSense: A Human Preference Dataset and Reward Modeling Framework for Graphic Layout Generation
DesignSense:用于图形布局生成的人类偏好数据集和奖励建模框架
Abstract
Graphic layouts serve as an important and engaging medium for visual communication across different channels. While recent layout generation models have demonstrated impressive capabilities, they frequently fail to align with nuanced human aesthetic judgment. Existing preference datasets and reward models trained on text-to-image generation do not generalize to layout evaluation, where the spatial arrangement of identical elements determines quality. To address this critical gap, we introduce DesignSense-10k, a large-scale dataset of 10,235 human-annotated preference pairs for graphic layout evaluation. We propose a five-stage curation pipeline that generates visually coherent layout transformations across diverse aspect ratios, using semantic grouping, layout prediction, filtering, clustering, and VLM-based refinement to produce high-quality comparison pairs. Human preferences are annotated using a 4-class scheme (left, right, both good, both bad) to capture subjective ambiguity. Leveraging this dataset, we train DesignSense, a vision-language model-based classifier that substantially outperforms existing open-source and proprietary models across comprehensive evaluation metrics (54.6% improvement in Macro F1 over the strongest proprietary baseline). Our analysis shows that frontier VLMs remain unreliable overall and fail catastrophically on the full four-class task, underscoring the need for specialized, preference-aware models. Beyond the dataset, our reward model DesignSense yields tangible downstream gains in layout generation. Using our judge during RL based training improves generator win rate by about 3%, while inference-time scaling, which involves generating multiple candidates and selecting the best one, provides a 3.6% improvement. These results highlight the practical impact of specialized, layout-aware preference modeling on real-world layout generation quality.
Chinese Translation
图形布局作为一种重要且引人入胜的视觉传播媒介,广泛应用于不同的渠道。尽管近期的布局生成模型展现了令人印象深刻的能力,但它们常常无法与细致的人类审美判断相一致。现有的偏好数据集和基于文本到图像生成的奖励模型在布局评估中并不通用,因为相同元素的空间排列决定了质量。为了解决这一关键问题,我们引入了DesignSense-10k,这是一个包含10,235对人类标注的图形布局评估偏好的大规模数据集。我们提出了一种五阶段的策划流程,通过语义分组、布局预测、过滤、聚类和基于视觉语言模型(VLM)的精炼,生成不同纵横比下视觉一致的布局变换,从而产生高质量的比较对。人类偏好使用四类方案(左、右、都好、都不好)进行标注,以捕捉主观模糊性。利用该数据集,我们训练了DesignSense,一个基于视觉语言模型的分类器,在全面评估指标上显著超越现有的开源和专有模型(在最强专有基线基础上,Macro F1提升54.6%)。我们的分析表明,前沿的VLM整体上仍然不可靠,并且在完整的四类任务上表现惨败,强调了对专门的、关注偏好的模型的需求。除了数据集之外,我们的奖励模型DesignSense在布局生成中带来了实质性的下游收益。在基于强化学习的训练中使用我们的评判器使生成器的胜率提高了约3%,而推理时的扩展,即生成多个候选并选择最佳者,提供了3.6%的提升。这些结果突显了专门的、关注布局的偏好建模对现实世界布局生成质量的实际影响。
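The inference-time scaling result (generate multiple candidates, let the reward model pick the best) is the standard best-of-N pattern. A minimal sketch with toy stand-ins for the generator and the DesignSense judge (both hypothetical here):

```python
import random

def best_of_n(generate, score, n=8, seed=0):
    """Inference-time scaling: sample n candidates and keep the one the
    reward model prefers. `generate` and `score` are stand-ins for a
    layout generator and a preference-trained judge.
    """
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a "layout" is a number, the judge prefers values near 0.5.
gen = lambda rng: rng.random()
judge = lambda x: -abs(x - 0.5)
best = best_of_n(gen, judge, n=16)
```

Larger n trades compute for quality, which is the mechanism behind the reported 3.6% improvement from candidate selection.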
cs.CV / 2 / 2602.23514
Modelling and Simulation of Neuromorphic Datasets for Anomaly Detection in Computer Vision
用于计算机视觉异常检测的神经形态数据集建模与仿真
Abstract
Limitations on the availability of Dynamic Vision Sensors (DVS) present a fundamental challenge to researchers of neuromorphic computer vision applications. In response, datasets have been created by the research community, but often contain a limited number of samples or scenarios. To address the lack of a comprehensive simulator of neuromorphic vision datasets, we introduce the Anomalous Neuromorphic Tool for Shapes (ANTShapes), a novel dataset simulation framework. Built in the Unity engine, ANTShapes simulates abstract, configurable 3D scenes populated by objects displaying randomly-generated behaviours describing attributes such as motion and rotation. The sampling of object behaviours, and the labelling of anomalously-acting objects, is a statistical process following central limit theorem principles. Datasets containing an arbitrary number of samples can be created and exported from ANTShapes, along with accompanying label and frame data, through the adjustment of a limited number of parameters within the software. ANTShapes addresses the limitations of data availability to researchers of event-based computer vision by allowing for the simulation of bespoke datasets to suit purposes including object recognition and localisation alongside anomaly detection.
Chinese Translation
动态视觉传感器(DVS)可用性的限制对神经形态计算机视觉应用的研究者构成了根本挑战。作为回应,研究社区创建了数据集,但通常包含的样本或场景数量有限。为了解决缺乏全面的神经形态视觉数据集仿真器的问题,我们引入了形状异常神经形态工具(ANTShapes),这是一种新颖的数据集仿真框架。ANTShapes基于Unity引擎构建,模拟了由展示随机生成行为的对象构成的抽象、可配置的3D场景,这些行为描述了诸如运动和旋转等属性。对象行为的采样以及异常行为对象的标记是遵循中心极限定理原则的统计过程。通过调整软件中的有限参数,可以从ANTShapes创建并导出包含任意数量样本的数据集,以及相应的标签和帧数据。ANTShapes通过允许模拟定制数据集以满足包括对象识别、定位和异常检测在内的目的,解决了事件驱动计算机视觉研究者在数据可用性方面的限制。
cs.CV / 3 / 2602.23523
All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with a Robust Landmark-Identity Watermark
一体化:通过稳健的地标身份水印统一深伪检测、篡改定位和源追踪
Abstract
With the rapid advancement of deepfake technology, malicious face manipulations pose a significant threat to personal privacy and social security. However, existing proactive forensics methods typically treat deepfake detection, tampering localization, and source tracing as independent tasks, lacking a unified framework to address them jointly. To bridge this gap, we propose a unified proactive forensics framework that jointly addresses these three core tasks. Our core framework adopts an innovative 152-dimensional landmark-identity watermark termed LIDMark, which structurally interweaves facial landmarks with a unique source identifier. To robustly extract the LIDMark, we design a novel Factorized-Head Decoder (FHD). Its architecture factorizes the shared backbone features into two specialized heads (i.e., regression and classification), robustly reconstructing the embedded landmarks and identifier, respectively, even when subjected to severe distortion or tampering. This design realizes an "all-in-one" trifunctional forensic solution: the regression head underlies an "intrinsic-extrinsic" consistency check for detection and localization, while the classification head robustly decodes the source identifier for tracing. Extensive experiments show that the proposed LIDMark framework provides a unified, robust, and imperceptible solution for the detection, localization, and tracing of deepfake content. The code is available at https://github.com/vpsg-research/LIDMark.
Chinese Translation
随着深伪技术的快速发展,恶意的人脸操控对个人隐私和社会安全构成了重大威胁。然而,现有的主动取证方法通常将深伪检测、篡改定位和源追踪视为独立任务,缺乏一个统一的框架来共同解决这些问题。为此,我们提出了一个统一的主动取证框架,联合解决这三项核心任务。我们的核心框架采用了一种创新的152维地标身份水印,称为LIDMark,它将面部地标与独特的源标识符结构性地交织在一起。为了稳健地提取LIDMark,我们设计了一种新颖的分解头解码器(Factorized-Head Decoder,FHD)。其架构将共享的主干特征分解为两个专用头(即回归和分类),在遭受严重失真或篡改时,稳健地重建嵌入的地标和标识符。该设计实现了一种“全能”三功能取证解决方案:回归头用于进行“内在-外在”一致性检查以进行检测和定位,而分类头则稳健地解码源标识符以进行追踪。大量实验表明,所提出的LIDMark框架为深伪内容的检测、定位和追踪提供了统一、稳健且不可察觉的解决方案。代码可在 https://github.com/vpsg-research/LIDMark 获取。
cs.CV / 4 / 2602.23543
Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos
合成视觉基因组2:从视频中提取大规模时空场景图
Abstract
We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset. SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets. To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference. Building on this resource, we train TRaSER, a video scene graph generation model. TRaSER augments VLMs with a trajectory-aligned token arrangement mechanism and new modules: an object-trajectory resampler and a temporal-window resampler to convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass. The temporal-window resampler binds visual tokens to short trajectory segments to preserve local motion and temporal semantics, while the object-trajectory resampler aggregates entire trajectories to maintain global context for objects. On the PVSG, VIPSeg, VidOR and SVG2 test datasets, TRaSER improves relation detection by +15 to 20%, object prediction by +30 to 40% over the strongest open-source baselines and by +13% over GPT-5, and attribute prediction by +15%. When TRaSER's generated scene graphs are sent to a VLM for video question answering, it delivers a +1.5 to 4.6% absolute accuracy gain over using video only or video augmented with Qwen2.5-VL's generated scene graphs, demonstrating the utility of explicit spatio-temporal scene graphs as an intermediate representation.
Chinese Translation
我们介绍了合成视觉基因组2(Synthetic Visual Genome 2,SVG2),这是一个大规模的全景视频场景图数据集。SVG2包含超过636K个视频,6.6M个对象,52.0M个属性和6.7M个关系,相较于之前的时空场景图数据集,其规模和多样性有了数量级的提升。为了创建SVG2,我们设计了一个完全自动化的流程,结合了多尺度全景分割、在线-离线轨迹跟踪与自动新对象发现、每条轨迹的语义解析,以及基于GPT-5的时空关系推断。在此基础上,我们训练了TRaSER,一个视频场景图生成模型。TRaSER通过轨迹对齐的标记排列机制和新的模块增强了视觉语言模型(VLM),包括对象轨迹重采样器和时间窗口重采样器,以便在一次前向传递中将原始视频和全景轨迹转换为紧凑的时空场景图。时间窗口重采样器将视觉标记绑定到短轨迹段,以保持局部运动和时间语义,而对象轨迹重采样器则聚合整个轨迹,以维护对象的全局上下文。在PVSG、VIPSeg、VidOR和SVG2测试数据集上,TRaSER在关系检测上提高了15%到20%,在对象预测上提高了30%到40%(相较于最强的开源基线)以及提高了13%(相较于GPT-5),在属性预测上提高了15%。当TRaSER生成的场景图被发送到VLM进行视频问答时,相较于仅使用视频或使用Qwen2.5-VL生成的场景图增强的视频,其绝对准确率提高了1.5%到4.6%,展示了显式时空场景图作为中间表示的实用性。
cs.CV / 5 / 2602.23553
LE-NeuS: Latency-Efficient Neuro-Symbolic Video Understanding via Adaptive Temporal Verification
LE-NeuS:通过自适应时间验证实现低延迟神经符号视频理解
Abstract
Neuro-symbolic approaches to long-form video question answering (LVQA) have demonstrated significant accuracy improvements by grounding temporal reasoning in formal verification. However, existing methods incur prohibitive latency overheads, up to 90x slower than base VLM prompting, rendering them impractical for latency-sensitive edge deployments. We present LE-NeuS, a latency-efficient neuro-symbolic framework that preserves the accuracy benefits of temporal logic-guided video understanding while drastically reducing inference latency. Our key insight is that the dominant computational bottleneck arises from sequential and dense proposition detection across video frames during automaton construction. We address this through two principled optimizations: (1) CLIP guided two-stage adaptive sampling that exploits visual redundancy to skip semantically similar frames while preserving temporal boundaries, and (2) batched proposition detection that parallelizes VLM inference across temporal windows. Theoretically, we derive latency bounds as a function of video length, proposition complexity, and sampling density, establishing conditions under which latency efficiency is achievable. Empirically, on LongVideoBench and Video-MME benchmarks deployed on NVIDIA H100 GPUs, LE-NeuS reduces the latency gap from 90x to approximately 10x while maintaining >10% accuracy gains on temporally complex queries.
Chinese Translation
神经符号方法在长视频问答(LVQA)中通过将时间推理与形式验证相结合,展示了显著的准确性提升。然而,现有方法带来的延迟开销过高,比基础的VLM提示慢多达90倍,导致其在对延迟敏感的边缘部署中不切实际。我们提出了LE-NeuS,一个低延迟的神经符号框架,保留了时间逻辑引导的视频理解的准确性优势,同时大幅降低推理延迟。我们的关键见解是,主要的计算瓶颈源于在自动机构建过程中对视频帧进行顺序且密集的命题检测。我们通过两项原则性优化来解决这一问题:(1)CLIP引导的两阶段自适应采样,利用视觉冗余跳过语义相似的帧,同时保留时间边界;(2)批量命题检测,将VLM在时间窗口上的推理并行化。从理论上讲,我们推导了作为视频长度、命题复杂性和采样密度函数的延迟界限,建立了实现延迟效率的条件。从经验上看,在部署于NVIDIA H100 GPU的LongVideoBench和Video-MME基准测试中,LE-NeuS将延迟差距从90倍减少到约10倍,同时在时间复杂查询上保持超过10%的准确性提升。
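The similarity-gated sampling idea, skipping frames that are semantically near-duplicates of the last kept frame, can be sketched as a simple cosine-similarity gate. The arrays below are toy stand-ins for CLIP embeddings, and the threshold rule is an illustrative simplification of the paper's two-stage scheme:

```python
import numpy as np

def adaptive_sample(embeddings, sim_threshold=0.95):
    """Keep a frame only if it differs enough from the last kept frame.

    embeddings : (T, d) per-frame feature vectors; cosine similarity
                 to the last kept frame drives the skip rule.
    Returns the indices of kept frames (frame 0 is always kept).
    """
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = [0]
    for t in range(1, len(E)):
        if E[t] @ E[kept[-1]] < sim_threshold:
            kept.append(t)
    return kept

# Three visually distinct "shots", each repeated over several frames.
shots = np.eye(3)
frames = np.repeat(shots, [5, 4, 6], axis=0)   # 15 frames total
idx = adaptive_sample(frames, sim_threshold=0.9)
```

Only the first frame of each shot survives, so downstream proposition detection runs on 3 frames instead of 15, which is the source of the latency savings.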
cs.CV / 6 / 2602.23559
No Calibration, No Depth, No Problem: Cross-Sensor View Synthesis with 3D Consistency
无需标定,无需深度,没有问题:具有3D一致性的跨传感器视图合成
Abstract
We present the first study of cross-sensor view synthesis across different modalities. We examine a practical, fundamental, yet widely overlooked problem: getting aligned RGB-X data. Most prior RGB-X work assumes such pairs exist and focuses on modality fusion, yet obtaining them empirically requires huge engineering effort in calibration. We propose a match-densify-consolidate method. First, we perform RGB-X image matching followed by guided point densification. Using the proposed confidence-aware densification and self-matching filtering, we attain better view synthesis and later consolidate them in 3D Gaussian Splatting (3DGS). Our method uses no 3D priors for the X sensor and only assumes nearly no-cost COLMAP for RGB. We aim to remove the cumbersome calibration for various RGB-X sensors and advance the popularity of cross-sensor learning by a scalable solution that breaks through the bottleneck in large-scale real-world RGB-X data collection.
Chinese Translation
我们首次研究了不同模态之间的跨传感器视图合成。我们考察了一个实用的、基础的但被广泛忽视的问题:获取对齐的RGB-X数据。大多数RGB-X的先前工作假设此类配对存在并专注于模态融合,但在实践中获取这些配对需要大量的标定工程工作。我们提出了一种匹配-稠密化-整合的方法。首先,我们执行RGB-X图像匹配,然后进行引导点稠密化。通过所提出的基于置信度的稠密化和自匹配过滤,我们实现了更好的视图合成,并随后在3D高斯泼溅(3D Gaussian Splatting, 3DGS)中整合它们。我们的方法不使用X传感器的3D先验,仅对RGB假设使用几乎零成本的COLMAP。我们的目标是消除各种RGB-X传感器的繁琐标定,并通过一种可扩展的解决方案推动跨传感器学习的普及,从而突破大规模真实世界RGB-X数据收集的瓶颈。
cs.CV / 7 / 2602.23574
Evidential Neural Radiance Fields
证据神经辐射场
Abstract
Understanding sources of uncertainty is fundamental to trustworthy three-dimensional scene modeling. While recent advances in neural radiance fields (NeRFs) achieve impressive accuracy in scene reconstruction and novel view synthesis, the lack of uncertainty estimation significantly limits their deployment in safety-critical settings. Existing uncertainty quantification methods for NeRFs fail to capture both aleatoric and epistemic uncertainty. Among those that do quantify one or the other, many of them either compromise rendering quality or incur significant computational overhead to obtain uncertainty estimates. To address these issues, we introduce Evidential Neural Radiance Fields, a probabilistic approach that seamlessly integrates with the NeRF rendering process and enables direct quantification of both aleatoric and epistemic uncertainty from a single forward pass. We compare multiple uncertainty quantification methods on three standardized benchmarks, where our approach demonstrates state-of-the-art scene reconstruction fidelity and uncertainty estimation quality.
Chinese Translation
理解不确定性的来源对可信的三维场景建模至关重要。尽管近期在神经辐射场(NeRFs)方面的进展在场景重建和新视角合成中取得了令人印象深刻的准确性,但缺乏不确定性估计显著限制了其在安全关键环境中的应用。现有的NeRF不确定性量化方法未能同时捕捉到偶然性(aleatoric)和认识性(epistemic)不确定性。在那些能够量化其中一种或另一种不确定性的方法中,许多要么妥协了渲染质量,要么在获取不确定性估计时带来了显著的计算开销。为了解决这些问题,我们提出了证据神经辐射场,这是一种概率方法,能够与NeRF渲染过程无缝集成,并从单次前向传递中直接量化偶然性和认识性不确定性。我们在三个标准基准上比较了多种不确定性量化方法,我们的方法展示了最先进的场景重建保真度和不确定性估计质量。
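The single-pass aleatoric/epistemic split is characteristic of deep evidential regression, where the network outputs Normal-Inverse-Gamma evidence parameters. The decomposition below follows the standard formulation; the paper's exact parameterization for radiance fields may differ:

```python
def evidential_uncertainty(gamma, nu, alpha, beta):
    """Split uncertainty from Normal-Inverse-Gamma evidence parameters.

    Standard deep evidential regression decomposition:
      prediction         E[mu]      = gamma
      aleatoric variance E[sigma^2] = beta / (alpha - 1)        (alpha > 1)
      epistemic variance Var[mu]    = beta / (nu * (alpha - 1))
    """
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (nu * (alpha - 1.0))
    return gamma, aleatoric, epistemic

pred, alea, epis = evidential_uncertainty(gamma=0.7, nu=2.0, alpha=3.0, beta=1.0)
```

Note that epistemic uncertainty shrinks as the evidence count nu grows, while aleatoric uncertainty does not, which is what lets a single forward pass separate the two sources.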
cs.CV / 8 / 2602.23575
CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird's-Eye-View Semantic Segmentation
CycleBEV:通过视图循环一致性正则化视图转换网络以实现鸟瞰视图语义分割
Abstract
Transforming image features from perspective view (PV) space to bird's-eye-view (BEV) space remains challenging in autonomous driving due to depth ambiguity and occlusion. Although several view transformation (VT) paradigms have been proposed, the challenge still remains. In this paper, we propose a new regularization framework, dubbed CycleBEV, that enhances existing VT models for BEV semantic segmentation. Inspired by cycle consistency, widely used in image distribution modeling, we devise an inverse view transformation (IVT) network that maps BEV segmentation maps back to PV segmentation maps and use it to regularize VT networks during training through cycle consistency losses, enabling them to capture richer semantic and geometric information from input PV images. To further exploit the capacity of the IVT network, we introduce two novel ideas that extend cycle consistency into geometric and representation spaces. We evaluate CycleBEV on four representative VT models covering three major paradigms using the large-scale nuScenes dataset. Experimental results show consistent improvements -- with gains of up to 0.74, 4.86, and 3.74 mIoU for drivable area, vehicle, and pedestrian classes, respectively -- without increasing inference complexity, since the IVT network is used only during training. The implementation code is available at https://github.com/JeongbinHong/CycleBEV.
Chinese Translation
在自动驾驶中,由于深度模糊和遮挡,从透视视图(PV)空间转换图像特征到鸟瞰视图(BEV)空间仍然具有挑战性。尽管已经提出了几种视图转换(VT)范式,但这一挑战依然存在。在本文中,我们提出了一种新的正则化框架,称为CycleBEV,旨在增强现有的VT模型以实现BEV语义分割。受广泛应用于图像分布建模的循环一致性启发,我们设计了一个逆视图转换(IVT)网络,该网络将BEV分割图映射回PV分割图,并利用它在训练过程中通过循环一致性损失对VT网络进行正则化,使其能够从输入的PV图像中捕获更丰富的语义和几何信息。为了进一步发挥IVT网络的能力,我们引入了两个新颖的思想,将循环一致性扩展到几何和表示空间。我们在使用大规模nuScenes数据集的四个代表性VT模型上评估了CycleBEV,涵盖了三种主要范式。实验结果表明,CycleBEV在可驾驶区域、车辆和行人类别上分别获得了高达0.74、4.86和3.74的mIoU的持续提升,而不增加推理复杂性,因为IVT网络仅在训练期间使用。实现代码可在https://github.com/JeongbinHong/CycleBEV获取。
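The cycle-consistency regularization has the familiar shape: a task loss on the predicted BEV map plus a penalty for the IVT network failing to map that prediction back to the PV labels. A toy numpy sketch with stand-in "networks" (the real VT/IVT are learned models, and the paper uses segmentation losses rather than MSE):

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def cycle_regularized_loss(vt, ivt, pv_img, bev_gt, pv_gt, lam=0.5):
    """Task loss on the predicted BEV map plus a cycle-consistency term
    penalizing the inverse view transform for failing to recover the
    PV labels from that prediction.
    """
    bev_pred = vt(pv_img)
    task = mse(bev_pred, bev_gt)
    cycle = mse(ivt(bev_pred), pv_gt)
    return task + lam * cycle

# Toy "networks": VT reverses the array, IVT reverses it back.
vt = lambda x: x[::-1]
ivt = lambda x: x[::-1]
pv = np.arange(4.0)
loss = cycle_regularized_loss(vt, ivt, pv, bev_gt=pv[::-1], pv_gt=pv)
```

Because the IVT term only appears in the training objective, inference cost is unchanged, matching the abstract's claim.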
cs.CV / 9 / 2602.23588
Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning
冻结语言和图像模型的超维跨模态对齐用于高效图像描述生成
Abstract
Large unimodal foundation models for vision and language encode rich semantic structures, yet aligning them typically requires computationally intensive multimodal fine-tuning. Such approaches depend on large-scale parameter updates, are resource intensive, and can perturb pretrained representations. Emerging evidence suggests, however, that independently trained foundation models may already exhibit latent semantic compatibility, reflecting shared structures in the data they model. This raises a fundamental question: can cross-modal alignment be achieved without modifying the models themselves? Here we introduce HDFLIM (HyperDimensional computing with Frozen Language and Image Models), a framework that establishes cross-modal mappings while keeping pretrained vision and language models fully frozen. HDFLIM projects unimodal embeddings into a shared hyperdimensional space and leverages lightweight symbolic operations -- binding, bundling, and similarity-based retrieval to construct associative cross-modal representations in a single pass over the data. Caption generation emerges from high-dimensional memory retrieval rather than iterative gradient-based optimization. We show that HDFLIM achieves performance comparable to end-to-end vision-language training methods and produces captions that are more semantically grounded than zero-shot baselines. By decoupling alignment from parameter tuning, our results suggest that semantic mapping across foundation models can be realized through symbolic operations on hyperdimensional encodings of the respective embeddings. More broadly, this work points toward an alternative paradigm for foundation model alignment in which frozen models are integrated through structured representational mappings rather than through large-scale retraining. The codebase for our implementation can be found at https://github.com/Abhishek-Dalvi410/HDFLIM.
Chinese Translation
大型单模态基础模型在视觉和语言方面编码了丰富的语义结构,但对齐它们通常需要计算密集型的多模态微调。这类方法依赖于大规模的参数更新,资源消耗高,并可能扰动预训练的表示。然而,新的证据表明,独立训练的基础模型可能已经表现出潜在的语义兼容性,反映了它们所建模数据中的共享结构。这引发了一个基本问题:是否可以在不修改模型本身的情况下实现跨模态对齐?在此,我们介绍了HDFLIM(HyperDimensional computing with Frozen Language and Image Models),一个在保持预训练的视觉和语言模型完全冻结的情况下建立跨模态映射的框架。HDFLIM将单模态嵌入投影到共享的超维空间,并利用轻量级的符号操作——绑定、捆绑和基于相似性的检索,在对数据进行单次处理时构建关联的跨模态表示。描述生成源于高维记忆检索,而不是迭代的基于梯度的优化。我们展示了HDFLIM的性能可与端到端的视觉-语言训练方法相媲美,并且其生成的描述在语义上比零样本基线更为扎实。通过将对齐与参数调优解耦,我们的结果表明,基础模型之间的语义映射可以通过对各自嵌入的超维编码进行符号操作来实现。更广泛地说,这项工作指向了一种基础模型对齐的替代范式,其中冻结模型通过结构化的表示映射进行集成,而不是通过大规模的再训练。我们实现的代码库可以在 https://github.com/Abhishek-Dalvi410/HDFLIM 找到。
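The binding, bundling, and similarity-based retrieval operations named in the abstract are the classical hyperdimensional-computing primitives on bipolar vectors. A minimal sketch of associative cross-modal retrieval with random toy keys (not the paper's actual embeddings or pipeline):

```python
import numpy as np

rng = np.random.default_rng(42)
D = 10_000                                       # hyperdimension

def hv():
    """Random bipolar hypervector in {-1, +1}^D."""
    return rng.choice([-1, 1], size=D)

bind = lambda a, b: a * b                        # elementwise multiply; self-inverse
bundle = lambda vs: np.sign(np.sum(vs, axis=0))  # componentwise majority vote
sim = lambda a, b: float(a @ b) / D              # normalized dot product

# Associate image keys with caption values, stored in one memory vector.
img = {k: hv() for k in ("cat", "dog", "car")}
cap = {k: hv() for k in ("cat", "dog", "car")}
memory = bundle([bind(img[k], cap[k]) for k in img])

# Query: unbind the "cat" image key, then retrieve the closest caption.
query = bind(memory, img["cat"])
best = max(cap, key=lambda k: sim(query, cap[k]))
```

Because binding with a bipolar vector is its own inverse, unbinding the image key leaves the paired caption vector plus near-orthogonal noise from the other pairs, so similarity retrieval recovers the correct association in a single pass with no gradient training.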
cs.CV / 10 / 2602.23589
Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models
伪对比学习在多模态模型中的图表理解
Abstract
Recent multimodal models such as Contrastive Language-Image Pre-training (CLIP) have shown remarkable ability to align visual and linguistic representations. However, domains where small visual differences carry large semantic significance, such as diagram understanding, remain challenging due to the models' limited sensitivity to fine-grained structural variations. We propose a new training paradigm designed to enhance diagram comprehension in vision-language models. Our approach introduces pseudo contrastive samples generated by a diagram renderer that creates synthetic diagrams using randomly picked text elements. These samples highlight structural differences in diagrammatic imagery without requiring any modification or editing of the original data. By incorporating these pseudo contrastive samples into the training objective, the model learns to capture more precise and semantically consistent diagram structures. Empirical evaluations on a benchmark dataset of flowcharts demonstrate substantial improvements over standard CLIP and hard-negative CLIP training in both image-text matching and visual question answering tasks. The results underscore the value of domain-specific training strategies and contribute to advancing diagrammatic understanding within the broader context of vision-language learning.
Chinese Translation
近期的多模态模型,如对比语言-图像预训练(Contrastive Language-Image Pre-training, CLIP),展现了出色的视觉与语言表示对齐能力。然而,在小的视觉差异蕴含巨大语义意义的领域,如图表理解,由于模型对细粒度结构变化的敏感性有限,仍然面临挑战。我们提出了一种新的训练范式,旨在增强视觉-语言模型的图表理解能力。我们的方法引入了由图表渲染器生成的伪对比样本,该渲染器使用随机选择的文本元素创建合成图表。这些样本突出了图表图像中的结构差异,而无需对原始数据进行任何修改或编辑。通过将这些伪对比样本纳入训练目标,模型学习捕捉更精确且语义一致的图表结构。在一个流程图基准数据集上的实证评估表明,在图像-文本匹配和视觉问答任务中,相较于标准CLIP和困难负样本CLIP训练,取得了显著的提升。结果强调了领域特定训练策略的价值,并有助于在更广泛的视觉-语言学习背景下推进图表理解。
cs.CV / 11 / 2602.23595
Incremental dimension reduction for efficient and accurate visual anomaly detection
用于高效且准确的视觉异常检测的增量降维
Abstract
While modern visual anomaly detection algorithms use deep neural networks to extract salient features from images, the high dimensionality of the extracted features makes it difficult to apply those algorithms to large datasets with thousands of images. To address this issue, we present an incremental dimension reduction algorithm to reduce the extracted features. While our algorithm essentially computes a truncated singular value decomposition of these features, rather than processing all vectors at once, it groups the vectors into batches. At each batch, our algorithm updates the truncated singular values and vectors that represent all visited vectors, and reduces each batch by its own singular values and vectors so they can be stored in memory with low overhead. After processing all batches, we re-transform these batch-wise singular vectors into the space spanned by the singular vectors of all features. We show that our algorithm can accelerate the training of a state-of-the-art anomaly detection algorithm with comparable accuracy.
Chinese Translation
尽管目前的视觉异常检测算法使用深度神经网络从图像中提取显著特征,但提取特征的高维性使得这些算法难以应用于包含数千张图像的大型数据集。为了解决这个问题,我们提出了一种增量降维算法,以约简提取的特征。我们的算法本质上计算这些特征的截断奇异值分解,但不是一次性处理所有向量,而是将向量分组为多个批次。在每个批次中,我们的算法更新表示所有已访问向量的截断奇异值和奇异向量,并通过各自的奇异值和奇异向量压缩每个批次,以便它们可以以低开销存储在内存中。在处理完所有批次后,我们将这些批次的奇异向量重新变换到由所有特征的奇异向量所张成的空间。我们展示了我们的算法能够加速最先进的异常检测算法的训练,同时保持相近的准确性。
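The batch-wise update the abstract describes can be sketched as follows: carry only the current truncated factors, stack them (scaled by their singular values) with each new batch, and re-truncate. This is a simplified sketch of the general pattern, not the authors' exact algorithm:

```python
import numpy as np

def incremental_tsvd(batches, k):
    """Incrementally maintain a rank-k truncated SVD of stacked batches.

    After each batch B (rows = feature vectors), the running factors
    (S, Vt) summarize all rows seen so far: instead of storing every
    row, we keep only the k leading singular values and right-singular
    vectors and fold each new batch into them.
    """
    S, Vt = None, None
    for B in batches:
        stacked = B if Vt is None else np.vstack([S[:, None] * Vt, B])
        _, s, vt = np.linalg.svd(stacked, full_matrices=False)
        S, Vt = s[:k], vt[:k]
    return S, Vt

# Check against the one-shot truncated SVD of the full matrix: for a
# matrix of rank <= k the incremental result is (numerically) exact.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8)) @ rng.normal(size=(8, 20))   # rank <= 8
batches = np.array_split(X, 4)
S_inc, _ = incremental_tsvd(batches, k=8)
S_full = np.linalg.svd(X, compute_uv=False)[:8]
```

When the data rank exceeds k the update is an approximation, but only the k-by-d factors ever reside in memory, which is the low-overhead property the abstract claims.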
cs.CV / 12 / 2602.23615
Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning
通过强化学习实现高分辨率大型多模态模型的无注释视觉推理
Abstract
Current Large Multimodal Models (LMMs) struggle with high-resolution visual inputs during the reasoning process, as the number of image tokens increases quadratically with resolution, introducing substantial redundancy and irrelevant information. A common practice is to identify key image regions and refer to their high-resolution counterparts during reasoning, typically trained with external visual supervision. However, such visual supervision cues require costly grounding labels from human annotators. Meanwhile, it remains an open question how to enhance a model's grounding abilities to support reasoning without relying on additional annotations. In this paper, we propose High-resolution Annotation-free Reasoning Technique (HART), a closed-loop framework that enables LMMs to focus on and self-verify key regions of high-resolution visual inputs. HART incorporates a post-training paradigm in which we design Advantage Preference Group Relative Policy Optimization (AP-GRPO) to encourage accurate localization of key regions. Notably, HART provides explainable reasoning pathways and enables efficient optimization of localization. Extensive experiments demonstrate that HART improves performance across a wide range of high-resolution visual tasks, consistently outperforming strong baselines. When applied to post-train Qwen2.5-VL-7B, HART even surpasses larger-scale models such as Qwen2.5-VL-72B and LLaVA-OneVision-72B on high-resolution, vision-centric benchmarks.
Chinese Translation
当前的大型多模态模型(LMMs)在推理过程中面临高分辨率视觉输入的挑战,因为图像标记的数量随着分辨率的增加而呈平方增长,引入了大量冗余和无关信息。常见的做法是识别关键图像区域,并在推理过程中参考其高分辨率对应物,通常需要外部视觉监督进行训练。然而,这种视觉监督线索需要人工标注者提供昂贵的定位(grounding)标签。同时,如何在不依赖额外标注的情况下增强模型的视觉定位能力以支持推理,仍然是一个未解的问题。本文提出了高分辨率无注释推理技术(High-resolution Annotation-free Reasoning Technique, HART),这是一个闭环框架,使LMMs能够专注于并自我验证高分辨率视觉输入的关键区域。HART结合了一种后训练范式,在该范式中,我们设计了优势偏好组相对策略优化(Advantage Preference Group Relative Policy Optimization, AP-GRPO),以鼓励关键区域的准确定位。值得注意的是,HART提供了可解释的推理路径,并实现了定位的高效优化。大量实验表明,HART在广泛的高分辨率视觉任务中提高了性能,始终优于强基线。当应用于对Qwen2.5-VL-7B进行后训练时,HART甚至在高分辨率、视觉中心基准上超越了Qwen2.5-VL-72B和LLaVA-OneVision-72B等更大规模的模型。
cs.CV / 13 / 2602.23618
Egocentric Visibility-Aware Human Pose Estimation
以自我为中心的可见性意识人类姿态估计
Abstract
Egocentric human pose estimation (HPE) using a head-mounted device is crucial for various VR and AR applications, but it faces significant challenges due to keypoint invisibility. Nevertheless, none of the existing egocentric HPE datasets provide keypoint visibility annotations, and the existing methods often overlook the invisibility problem, treating visible and invisible keypoints indiscriminately during estimation. As a result, their capacity to accurately predict visible keypoints is compromised. In this paper, we first present Eva-3M, a large-scale egocentric visibility-aware HPE dataset comprising over 3.0M frames, with 435K of them annotated with keypoint visibility labels. Additionally, we augment the existing EMHI dataset with keypoint visibility annotations to further facilitate the research in this direction. Furthermore, we propose EvaPose, a novel egocentric visibility-aware HPE method that explicitly incorporates visibility information to enhance pose estimation accuracy. Extensive experiments validate the significant value of ground-truth visibility labels in egocentric HPE settings, and demonstrate that our EvaPose achieves state-of-the-art performance in both Eva-3M and EMHI datasets.
Chinese Translation
使用头戴设备进行以自我为中心的人类姿态估计(HPE)对各种虚拟现实(VR)和增强现实(AR)应用至关重要,但由于关键点不可见性面临重大挑战。然而,现有的以自我为中心的 HPE 数据集均未提供关键点可见性注释,现有方法往往忽视不可见性问题,在估计过程中对可见和不可见的关键点不加区分。因此,它们准确预测可见关键点的能力受到影响。本文首先提出了 Eva-3M,这是一个大规模的以自我为中心的可见性意识 HPE 数据集,包含超过 300 万帧,其中 43.5 万帧带有关键点可见性标签。此外,我们还对现有的 EMHI 数据集进行了关键点可见性注释,以进一步促进该方向的研究。此外,我们提出了 EvaPose,这是一种新颖的以自我为中心的可见性意识 HPE 方法,明确结合可见性信息以提高姿态估计的准确性。大量实验验证了真实可见性标签在以自我为中心的 HPE 设置中的重要价值,并证明我们的 EvaPose 在 Eva-3M 和 EMHI 数据集上均实现了最先进的性能。
cs.CV / 14 / 2602.23622
DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model
DLEBench:评估基于指令的图像编辑模型在小规模物体编辑能力上的表现
Abstract
Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) addressing the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.
Chinese Translation
在基于指令的图像编辑模型(IIEMs)领域取得了显著进展。然而,尽管这些模型在当前基准测试中表现出对指令的合理遵循和强大的推理能力,但它们在编辑小物体方面的能力仍然未得到充分探索,尽管这对于精确的局部编辑和细节的精细化在真实和生成图像中都至关重要。本文介绍了DeepLookEditBench(DLEBench),这是第一个专门用于评估IIEMs在小规模物体编辑能力的基准测试。具体而言,我们构建了一个具有挑战性的测试平台,包含1889个样本,涵盖七种指令类型。在这些样本中,目标物体仅占图像面积的1%-10%,涉及部分遮挡和多物体编辑等复杂场景。为了确保在该基准上的稳健评估,我们提出了一种评估协议,采用精细的评分标准,以最小化在两个标准(指令遵循和视觉一致性)上的主观性和模糊性。该协议还引入了一种双模式评估框架(工具驱动模式和Oracle引导模式),解决了DLEBench上LMM作为评判者与人类判断之间的不一致。对10个IIEM的实证结果揭示了在小规模物体编辑方面的显著性能差距,突显了推进这一能力所需的专门基准测试的必要性。
cs.CV / 15 / 2602.23645
BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds
BuildAnyPoint:来自多样化点云的3D建筑结构抽象
Abstract
We introduce BuildAnyPoint, a novel generative framework for structured 3D building reconstruction from point clouds with diverse distributions, such as those captured by airborne LiDAR and Structure-from-Motion. To recover artist-created building abstraction in this highly underconstrained setting, we capitalize on the role of explicit 3D generative priors in autoregressive mesh generation. Specifically, we design a Loosely Cascaded Diffusion Transformer (Loca-DiT) that initially recovers the underlying distribution from noisy or sparse points, followed by autoregressively encapsulating them into compact meshes. We first formulate distribution recovery as a conditional generation task by training latent diffusion models conditioned on input point clouds, and then tailor a decoder-only transformer for conditional autoregressive mesh generation based on the recovered point clouds. Our method delivers substantial qualitative and quantitative improvements over prior building abstraction methods. Furthermore, the effectiveness of our approach is evidenced by the strong performance of its recovered point clouds on building point cloud completion benchmarks, which exhibit improved surface accuracy and distribution uniformity.
Chinese Translation
我们介绍了BuildAnyPoint,这是一种新颖的生成框架,用于从具有多样分布的点云(例如由空中激光雷达和运动结构法捕获的点云)中进行结构化3D建筑重建。为了在这种高度欠约束的环境中恢复艺术家创作的建筑抽象,我们利用了显式3D生成先验在自回归网格生成中的作用。具体而言,我们设计了一种松散级联扩散变换器(Loosely Cascaded Diffusion Transformer, Loca-DiT),该变换器最初从嘈杂或稀疏的点中恢复基础分布,随后自回归地将其封装成紧凑的网格。我们首先将分布恢复公式化为条件生成任务,通过训练条件于输入点云的潜在扩散模型,然后为基于恢复的点云的条件自回归网格生成量身定制一个仅解码器的变换器。我们的方法在定性和定量上显著优于先前的建筑抽象方法。此外,我们的方法在建筑点云补全基准测试中所恢复的点云表现出色,证明了其有效性,展现了更好的表面精度和分布均匀性。
cs.CV / 16 / 2602.23652
3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection
用于MRI多脏器异常检测的3D模态感知预训练视觉语言模型
Abstract
Vision-language models (VLMs) show strong potential for complex diagnostic tasks in medical imaging. However, applying VLMs to multi-organ medical imaging introduces two principal challenges: (1) modality-specific vision-language alignment and (2) cross-modal feature fusion. In this work, we propose MedMAP, a Medical Modality-Aware Pretraining framework that enhances vision-language representation learning in 3D MRI. MedMAP comprises a modality-aware vision-language alignment stage and a fine-tuning stage for multi-organ abnormality detection. During the pre-training stage, the modality-aware encoders implicitly capture the joint modality distribution and improve alignment between visual and textual representations. We then fine-tune the pre-trained vision encoders (while keeping the text encoder frozen) for downstream tasks. To this end, we curated MedMoM-MRI3D, comprising 7,392 3D MRI volume-report pairs spanning twelve MRI modalities and nine abnormalities tailored for various 3D medical analysis tasks. Extensive experiments on MedMoM-MRI3D demonstrate that MedMAP significantly outperforms existing VLMs in 3D MRI-based multi-organ abnormality detection. Our code is available at https://github.com/RomantiDr/MedMAP.
Chinese Translation
视觉语言模型(VLMs)在医学影像复杂诊断任务中展现出强大的潜力。然而,将VLM应用于多脏器医学影像面临两个主要挑战:(1)模态特定的视觉语言对齐和(2)跨模态特征融合。在本研究中,我们提出了MedMAP,一个医学模态感知预训练框架,旨在增强3D MRI中的视觉语言表征学习。MedMAP包括一个模态感知的视觉语言对齐阶段和一个用于多脏器异常检测的微调阶段。在预训练阶段,模态感知编码器隐式捕捉联合模态分布,并改善视觉与文本表征之间的对齐。随后,我们对预训练的视觉编码器进行微调(同时保持文本编码器不变),以适应下游任务。为此,我们整理了MedMoM-MRI3D数据集,包含7,392对3D MRI体积报告,涵盖十二种MRI模态和九种异常,旨在满足各种3D医学分析任务的需求。在MedMoM-MRI3D上的广泛实验表明,MedMAP在基于3D MRI的多脏器异常检测中显著优于现有的VLM。我们的代码可在 https://github.com/RomantiDr/MedMAP 获取。
cs.CV / 17 / 2602.23653
ProtoDCS: Towards Robust and Efficient Open-Set Test-Time Adaptation for Vision-Language Models
ProtoDCS:面向视觉-语言模型的稳健高效开放集测试时适应
Abstract
Large-scale Vision-Language Models (VLMs) exhibit strong zero-shot recognition, yet their real-world deployment is challenged by distribution shifts. While Test-Time Adaptation (TTA) can mitigate this, existing VLM-based TTA methods operate under a closed-set assumption, failing in open-set scenarios where test streams contain both covariate-shifted in-distribution (csID) and out-of-distribution (csOOD) data. This leads to a critical difficulty: the model must discriminate unknown csOOD samples to avoid interference while simultaneously adapting to known csID classes for accuracy. Current open-set TTA (OSTTA) methods rely on hard thresholds for separation and entropy minimization for adaptation. These strategies are brittle, often misclassifying ambiguous csOOD samples and inducing overconfident predictions, and their parameter-update mechanism is computationally prohibitive for VLMs. To address these limitations, we propose Prototype-based Double-Check Separation (ProtoDCS), a robust framework for OSTTA that effectively separates csID and csOOD samples, enabling safe and efficient adaptation of VLMs to csID data. Our main contributions are: (1) a novel double-check separation mechanism employing probabilistic Gaussian Mixture Model (GMM) verification to replace brittle thresholding; and (2) an evidence-driven adaptation strategy utilizing uncertainty-aware loss and efficient prototype-level updates, mitigating overconfidence and reducing computational overhead. Extensive experiments on CIFAR-10/100-C and Tiny-ImageNet-C demonstrate that ProtoDCS achieves state-of-the-art performance, significantly boosting both known-class accuracy and OOD detection metrics. Code will be available at https://github.com/O-YangF/ProtoDCS.
Chinese Translation
大规模视觉-语言模型(VLMs)展现出强大的零样本识别能力,但其在实际应用中的部署受到分布变化的挑战。尽管测试时适应(TTA)可以缓解这一问题,但现有基于VLM的TTA方法在封闭集假设下运行,无法应对测试流中同时包含协变量变化的分布内(csID)和分布外(csOOD)数据的开放集场景。这导致了一个关键难题:模型必须区分未知的csOOD样本,以避免干扰,同时又要适应已知的csID类别以保持准确性。目前的开放集TTA(OSTTA)方法依赖于硬阈值进行分离,并通过熵最小化进行适应。这些策略较为脆弱,常常错误分类模糊的csOOD样本,并导致过于自信的预测,其参数更新机制对VLMs来说计算开销巨大。为了解决这些局限性,我们提出了基于原型的双重检查分离(ProtoDCS),这是一个稳健的OSTTA框架,能够有效分离csID和csOOD样本,从而安全高效地使VLMs适应csID数据。我们的主要贡献包括:(1)一种新颖的双重检查分离机制,采用概率高斯混合模型(GMM)验证来替代脆弱的阈值;(2)一种基于证据的适应策略,利用不确定性感知损失和高效的原型级更新,减轻过于自信的现象并降低计算开销。在CIFAR-10/100-C和Tiny-ImageNet-C上的大量实验表明,ProtoDCS实现了最先进的性能,显著提升了已知类别的准确性和OOD检测指标。代码将发布在https://github.com/O-YangF/ProtoDCS。
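The double-check idea of replacing a brittle hard threshold with a probabilistic GMM verification can be sketched compactly. The plain-EM fit below is a minimal 1-D stand-in: the choice of per-sample score (e.g., prediction entropy), the two-component fit, and the 0.5 posterior cut-off are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def fit_gmm_1d(x, iters=50):
    """Plain EM for a two-component 1-D Gaussian mixture."""
    mu = np.percentile(x, [25, 75]).astype(float)   # spread the initial means apart
    var = np.array([x.var(), x.var()]) + 1e-6
    w = np.array([0.5, 0.5])                        # mixture weights
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        dens = np.stack(
            [w[k] * np.exp(-(x - mu[k]) ** 2 / (2 * var[k])) / np.sqrt(2 * np.pi * var[k])
             for k in range(2)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        nk = r.sum(axis=0) + 1e-12
        w, mu = nk / len(x), (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var

def flag_csood(scores):
    """Double check: posterior of the high-score component, instead of a hard cut."""
    w, mu, var = fit_gmm_1d(scores)
    dens = np.stack(
        [w[k] * np.exp(-(scores - mu[k]) ** 2 / (2 * var[k])) / np.sqrt(2 * np.pi * var[k])
         for k in range(2)], axis=1)
    post = dens / dens.sum(axis=1, keepdims=True)
    return post[:, np.argmax(mu)] > 0.5             # True = treat as csOOD
```

Because the decision is a posterior rather than a fixed cut-off, ambiguous samples near the boundary are resolved by the fitted score distribution instead of a tuned constant.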
cs.CV / 18 / 2602.23676
Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering
通过语义解耦潜在引导抑制放射学报告生成中的先前比较幻觉
Abstract
Automated radiology report generation using vision-language models (VLMs) is limited by the risk of prior-comparison hallucination, where the model generates historical findings unsupported by the current study. We address this challenge with a training-free, inference-time control framework termed Semantically Decoupled Latent Steering (SDLS). Unlike generic activation steering, which often suffers from semantic entanglement, our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition followed by $QR$-based orthogonalization. This orthogonalization step is critical. It leverages geometric constraints to filter out the clinical semantics often entangled in standard principal component analysis (PCA) directions, ensuring that the steering vector targets only the ``historical comparison" axis. We validate our method on the BiomedGPT foundation model, demonstrating that it overcomes the trade-off between hallucination suppression and clinical accuracy. Extensive experiments on MIMIC-CXR, and zero-shot transfer evaluation on CheXpert Plus and IU-Xray, demonstrate the robustness of our approach. Quantitative evaluations on MIMIC-CXR show that our approach significantly reduces the probability of historical hallucinations (FilBERT score decreases from 0.2373 to 0.1889) and improves clinical label fidelity (CheXpert macro-F1 increases from 0.2242 to 0.3208). Supplementary evaluations confirm that the structural integrity of the clinical narrative is maintained.
Chinese Translation
使用视觉-语言模型(VLMs)进行自动化放射学报告生成受到先前比较幻觉风险的限制,即模型生成与当前研究不符的历史发现。我们通过一种称为语义解耦潜在引导(SDLS)的无训练推理控制框架来解决这一挑战。与通常受语义纠缠影响的通用激活引导不同,我们的方法通过大型语言模型(LLM)驱动的语义分解构建一个无语义干预向量,随后进行基于 $QR$ 的正交化。这个正交化步骤至关重要。它利用几何约束过滤掉标准主成分分析(PCA)方向中常常纠缠的临床语义,确保引导向量仅针对“历史比较”轴。我们在BiomedGPT基础模型上验证了我们的方法,证明它克服了幻觉抑制与临床准确性之间的权衡。在MIMIC-CXR上的大量实验,以及在CheXpert Plus和IU-Xray上的零样本迁移评估,展示了我们方法的鲁棒性。在MIMIC-CXR上的定量评估显示,我们的方法显著降低了历史幻觉的概率(FilBERT分数从0.2373降至0.1889),并提高了临床标签的保真度(CheXpert宏F1从0.2242提高至0.3208)。补充评估确认临床叙述的结构完整性得以保持。
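The $QR$-based orthogonalization step admits a short sketch: given directions that carry clinical semantics (obtained, per the abstract, via LLM-driven decomposition), project a raw candidate steering direction (e.g., a PCA direction) onto their orthogonal complement. The shapes and names below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def orthogonalize_steering(raw_dir, semantic_dirs):
    """Filter semantic components out of a raw steering direction.

    raw_dir:       (d,) candidate steering vector (e.g., a PCA direction).
    semantic_dirs: (k, d) directions carrying semantics to be removed.
    Returns a unit vector orthogonal to every semantic direction.
    """
    # QR on the stacked semantic directions yields an orthonormal basis Q of their span.
    Q, _ = np.linalg.qr(np.asarray(semantic_dirs, dtype=float).T)  # (d, k)
    # Project raw_dir onto the orthogonal complement of that span.
    steered = raw_dir - Q @ (Q.T @ raw_dir)
    return steered / np.linalg.norm(steered)
```

The geometric constraint is exactly what the abstract describes: whatever semantics live in the span of `semantic_dirs` cannot survive in the returned vector, so steering along it targets only the residual ("historical comparison") axis.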
cs.CV / 19 / 2602.23677
Vision-Language Semantic Grounding for Multi-Domain Crop-Weed Segmentation
多领域作物-杂草分割的视觉-语言语义基础
Abstract
Fine-grained crop-weed segmentation is essential for enabling targeted herbicide application in precision agriculture. However, existing deep learning models struggle to generalize across heterogeneous agricultural environments due to reliance on dataset-specific visual features. We propose Vision-Language Weed Segmentation (VL-WS), a novel framework that addresses this limitation by grounding pixel-level segmentation in semantically aligned, domain-invariant representations. Our architecture employs a dual-encoder design, where frozen Contrastive Language-Image Pretraining (CLIP) embeddings and task-specific spatial features are fused and modulated via Feature-wise Linear Modulation (FiLM) layers conditioned on natural language captions. This design enables image-level textual descriptions to guide channel-wise feature refinement while preserving fine-grained spatial localization. Unlike prior works restricted to training and evaluation on single-source datasets, VL-WS is trained on a unified corpus that includes close-range ground imagery (robotic platforms) and high-altitude UAV imagery, covering diverse crop types, weed species, growth stages, and sensing conditions. Experimental results across four benchmark datasets demonstrate the effectiveness of our framework, with VL-WS achieving a mean Dice score of 91.64% and outperforming the CNN baseline by 4.98%. The largest gains occur on the most challenging weed class, where VL-WS attains 80.45% Dice score compared to 65.03% for the best baseline, representing a 15.42% improvement. VL-WS further maintains stable weed segmentation performance under limited target-domain supervision, indicating improved generalization and data efficiency. These findings highlight the potential of vision-language alignment to enable scalable, label-efficient segmentation models deployable across diverse real-world agricultural domains.
Chinese Translation
细粒度的作物-杂草分割对于精准农业中实现针对性除草剂施用至关重要。然而,现有的深度学习模型由于依赖于特定数据集的视觉特征,难以在异质农业环境中进行泛化。我们提出了视觉-语言杂草分割(Vision-Language Weed Segmentation, VL-WS),这是一个新颖的框架,旨在通过将像素级分割与语义对齐的领域不变表示相结合来解决这一局限。我们的架构采用双编码器设计,其中冻结的对比语言-图像预训练(Contrastive Language-Image Pretraining, CLIP)嵌入与任务特定的空间特征通过基于自然语言描述的特征线性调制(Feature-wise Linear Modulation, FiLM)层进行融合和调制。这种设计使得图像级文本描述能够引导通道级特征的细化,同时保持细粒度的空间定位。与以往仅限于单一来源数据集进行训练和评估的工作不同,VL-WS在一个统一的语料库上进行训练,该语料库包括近距离地面图像(机器人平台)和高空无人机图像,涵盖了多种作物类型、杂草种类、生长阶段和传感条件。在四个基准数据集上的实验结果表明,我们的框架有效,VL-WS的平均Dice得分为91.64%,比CNN基线提高了4.98%。在最具挑战性的杂草类别上,VL-WS的Dice得分达到80.45%,而最佳基线为65.03%,提升幅度为15.42%。VL-WS在有限的目标领域监督下进一步保持了稳定的杂草分割性能,表明其泛化能力和数据效率有所提升。这些发现突显了视觉-语言对齐在实现可扩展、标签高效的分割模型方面的潜力,使其能够在多样化的现实农业领域中部署。
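FiLM itself is a standard operation: a per-channel affine transform whose scale and shift are predicted from a conditioning vector. A minimal numpy sketch of the channel-wise modulation described above, with hypothetical linear projection weights standing in for the learned conditioning network:

```python
import numpy as np

def film_modulate(feat, text_emb, Wg, bg, Wb, bb):
    """FiLM: channel-wise affine modulation conditioned on a caption embedding.

    feat:     (C, H, W) spatial feature map.
    text_emb: (d,) caption embedding (e.g., a frozen CLIP text feature).
    Wg, Wb:   (C, d) learned projection weights; bg, bb: (C,) biases.
    """
    gamma = Wg @ text_emb + bg                      # per-channel scale
    beta = Wb @ text_emb + bb                       # per-channel shift
    return gamma[:, None, None] * feat + beta[:, None, None]
```

Because gamma and beta are scalars per channel, the caption steers which channels are amplified or suppressed without touching spatial resolution, which is why FiLM preserves fine-grained localization.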
cs.CV / 20 / 2602.23678
Any Model, Any Place, Any Time: Get Remote Sensing Foundation Model Embeddings On Demand
任何模型,任何地点,任何时间:按需获取遥感基础模型嵌入
Abstract
The remote sensing community is witnessing a rapid growth of foundation models, which provide powerful embeddings for a wide range of downstream tasks. However, practical adoption and fair comparison remain challenging due to substantial heterogeneity in model release formats, platforms and interfaces, and input data specifications. These inconsistencies significantly increase the cost of obtaining, using, and benchmarking embeddings across models. To address this issue, we propose rs-embed, a Python library that offers a unified, region of interest (ROI) centric interface: with a single line of code, users can retrieve embeddings from any supported model for any location and any time range. The library also provides efficient batch processing to enable large-scale embedding generation and evaluation. The code is available at: https://github.com/cybergis/rs-embed
Chinese Translation
遥感领域正在快速发展基础模型,这些模型为广泛的下游任务提供强大的嵌入。然而,由于模型发布格式、平台和接口以及输入数据规范的显著异质性,实际应用和公平比较仍然面临挑战。这些不一致性显著增加了在不同模型之间获取、使用和基准测试嵌入的成本。为了解决这一问题,我们提出了 rs-embed,一个提供统一的、以感兴趣区域(ROI)为中心的接口的 Python 库:用户只需一行代码即可从任何支持的模型中检索任何位置和任何时间范围的嵌入。该库还提供高效的批处理功能,以支持大规模嵌入生成和评估。代码可在以下网址获取:https://github.com/cybergis/rs-embed
cs.CV / 21 / 2602.23697
Towards Source-Aware Object Swapping with Initial Noise Perturbation
面向源感知的物体交换与初始噪声扰动
Abstract
Object swapping aims to replace a source object in a scene with a reference object while preserving object fidelity, scene fidelity, and object-scene harmony. Existing methods either require per-object finetuning and slow inference or rely on extra paired data that mostly depict the same object across contexts, forcing models to rely on background cues rather than learning cross-object alignment. We propose SourceSwap, a self-supervised and source-aware framework that learns cross-object alignment. Our key insight is to synthesize high-quality pseudo pairs from any image via a frequency-separated perturbation in the initial-noise space, which alters appearance while preserving pose, coarse shape, and scene layout, requiring no videos, multi-view data, or additional images. We then train a dual U-Net with full-source conditioning and a noise-free reference encoder, enabling direct inter-object alignment, zero-shot inference without per-object finetuning, and lightweight iterative refinement. We further introduce SourceBench, a high-quality benchmark with higher resolution, more categories, and richer interactions. Experiments demonstrate that SourceSwap achieves superior fidelity, stronger scene preservation, and more natural harmony, and it transfers well to edits such as subject-driven refinement and face swapping.
Chinese Translation
物体交换旨在在场景中用参考物体替换源物体,同时保持物体的保真度、场景的保真度和物体与场景的和谐性。现有方法要么需要对每个物体进行微调并且推理速度较慢,要么依赖于额外的配对数据,这些数据大多描绘了在不同上下文中相同的物体,迫使模型依赖背景线索而不是学习跨物体对齐。我们提出了SourceSwap,一个自监督的源感知框架,旨在学习跨物体对齐。我们的关键见解是通过在初始噪声空间中进行频率分离扰动,从任何图像合成高质量的伪配对,这种方法在改变外观的同时保持姿态、粗略形状和场景布局,不需要视频、多视角数据或额外图像。然后,我们训练一个具有完全源条件的双U-Net和一个无噪声的参考编码器,从而实现直接的物体间对齐、无需对每个物体微调的零样本推理,以及轻量级的迭代优化。我们进一步引入了SourceBench,一个高质量的基准,具有更高的分辨率、更多的类别和更丰富的交互。实验表明,SourceSwap在保真度、场景保持和自然和谐性方面表现优越,并且在主题驱动的优化和人脸交换等编辑任务中具有良好的迁移能力。
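A frequency-separated perturbation of an initial noise map can be realized minimally as follows: keep the low-frequency band (which governs coarse layout in diffusion initializations) and resample only the high-frequency band. The cutoff value and the hard band split are illustrative assumptions; the paper's actual perturbation may differ.

```python
import numpy as np

def perturb_high_freq(noise, cutoff=0.25, seed=0):
    """Resample only the high-frequency content of an initial noise map.

    Frequencies within `cutoff` of the Nyquist band are kept (preserving
    pose, coarse shape, and layout); the rest are replaced with fresh
    noise, altering appearance details.
    """
    rng = np.random.default_rng(seed)
    F = np.fft.fft2(noise)
    Fn = np.fft.fft2(rng.standard_normal(noise.shape))
    fy = np.fft.fftfreq(noise.shape[0])[:, None]
    fx = np.fft.fftfreq(noise.shape[1])[None, :]
    high = np.sqrt(fx ** 2 + fy ** 2) > cutoff * 0.5   # radius in cycles/pixel; Nyquist = 0.5
    return np.real(np.fft.ifft2(np.where(high, Fn, F)))
```

Running both the original and the perturbed noise through the same generator then yields a pseudo pair: same layout, different appearance.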
cs.CV / 22 / 2602.23699
HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit
HiDrop:通过晚期注入、凹形金字塔剪枝和提前退出在多模态大语言模型中的层次视觉令牌减少
Abstract
The quadratic computational cost of processing vision tokens in Multimodal Large Language Models (MLLMs) hinders their widespread adoption. While progressive vision token pruning offers a promising solution, current methods misinterpret shallow-layer functions and use rigid schedules, which fail to unlock the full efficiency potential. To address these issues, we propose HiDrop, a framework that aligns token pruning with the true hierarchical function of MLLM layers. HiDrop features two key innovations: (1) Late Injection, which bypasses passive shallow layers to introduce visual tokens exactly where active fusion begins; and (2) Concave Pyramid Pruning with an Early Exit mechanism to dynamically adjust pruning rates across middle and deep layers. This process is optimized via an inter-layer similarity measure and a differentiable top-k operator. To ensure practical efficiency, HiDrop further incorporates persistent positional encoding, FlashAttention-compatible token selection, and parallel decoupling of vision computation to eliminate hidden overhead associated with dynamic token reduction. Extensive experiments show that HiDrop compresses about 90% of visual tokens while matching the original performance and accelerating training by 1.72 times. Our work not only sets a new state-of-the-art for efficient MLLM training and inference but also provides valuable insights into the hierarchical nature of multimodal fusion. The code is released at https://github.com/EIT-NLP/HiDrop.
Chinese Translation
在多模态大语言模型(MLLMs)中处理视觉令牌的二次计算成本阻碍了其广泛应用。尽管渐进式视觉令牌剪枝提供了一个有前景的解决方案,但当前的方法误解了浅层功能,并使用僵化的调度,未能释放出全部的效率潜力。为了解决这些问题,我们提出了HiDrop,一个将令牌剪枝与MLLM层的真实层次功能对齐的框架。HiDrop具有两个关键创新:(1)晚期注入(Late Injection),它绕过被动的浅层,以在主动融合开始的确切位置引入视觉令牌;(2)凹形金字塔剪枝(Concave Pyramid Pruning)结合提前退出机制(Early Exit),以动态调整中层和深层的剪枝率。该过程通过层间相似性度量和可微分的top-k操作进行优化。为了确保实际效率,HiDrop进一步结合了持久位置编码、兼容FlashAttention的令牌选择和视觉计算的并行解耦,以消除与动态令牌减少相关的隐藏开销。大量实验表明,HiDrop在保持原始性能的同时压缩了约90%的视觉令牌,并将训练速度提高了1.72倍。我们的工作不仅为高效的MLLM训练和推理设定了新的最先进水平,还为多模态融合的层次特性提供了宝贵的见解。代码已发布在 https://github.com/EIT-NLP/HiDrop。
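The core pruning decision, keeping the vision tokens that receive the most attention mass, can be sketched with a hard top-k at inference time. The scoring rule and keep-ratio interface below are illustrative assumptions; HiDrop's training-time variant additionally uses a differentiable top-k relaxation, which this sketch omits.

```python
import numpy as np

def prune_vision_tokens(vision_tokens, attn_to_vision, keep_ratio):
    """Keep the vision tokens receiving the most attention mass.

    vision_tokens:  (Nv, d) token embeddings.
    attn_to_vision: (Nq, Nv) attention weights from query (e.g., text) tokens.
    keep_ratio:     fraction of vision tokens to retain.
    Returns the kept tokens (original order preserved) and their indices.
    """
    k = max(1, int(round(keep_ratio * vision_tokens.shape[0])))
    score = attn_to_vision.sum(axis=0)          # total attention per vision token
    keep = np.sort(np.argsort(score)[-k:])      # top-k indices, order-preserving
    return vision_tokens[keep], keep
```

Preserving the original token order matters in practice: it lets positional encodings persist across pruning stages, one of the overheads the abstract says HiDrop is designed to avoid.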
cs.CV / 23 / 2602.23709
EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding
EgoGraph:用于自我中心视频理解的时间知识图谱
Abstract
Ultra-long egocentric videos spanning multiple days present significant challenges for video understanding. Existing approaches still rely on fragmented local processing and limited temporal modeling, restricting their ability to reason over such extended sequences. To address these limitations, we introduce EgoGraph, a training-free and dynamic knowledge-graph construction framework that explicitly encodes long-term, cross-entity dependencies in egocentric video streams. EgoGraph employs a novel egocentric schema that unifies the extraction and abstraction of core entities, such as people, objects, locations, and events, and structurally reasons about their attributes and interactions, yielding a significantly richer and more coherent semantic representation than traditional clip-based video models. Crucially, we develop a temporal relational modeling strategy that captures temporal dependencies across entities and accumulates stable long-term memory over multiple days, enabling complex temporal reasoning. Extensive experiments on the EgoLifeQA and EgoR1-bench benchmarks demonstrate that EgoGraph achieves state-of-the-art performance on long-term video question answering, validating its effectiveness as a new paradigm for ultra-long egocentric video understanding.
Chinese Translation
超长的自我中心视频跨越多个天数,为视频理解带来了重大挑战。现有的方法仍然依赖于碎片化的局部处理和有限的时间建模,限制了它们对如此扩展序列进行推理的能力。为了解决这些局限性,我们提出了EgoGraph,一个无训练且动态的知识图谱构建框架,明确编码自我中心视频流中的长期跨实体依赖关系。EgoGraph采用了一种新颖的自我中心模式,统一了核心实体(如人、物体、地点和事件)的提取和抽象,并在结构上推理它们的属性和交互,从而产生比传统基于片段的视频模型显著更丰富和更连贯的语义表示。关键是,我们开发了一种时间关系建模策略,捕捉跨实体的时间依赖关系,并在多个天数中累积稳定的长期记忆,从而实现复杂的时间推理。在EgoLifeQA和EgoR1-bench基准上的大量实验表明,EgoGraph在长期视频问答中达到了最先进的性能,验证了其作为超长自我中心视频理解新范式的有效性。
cs.CV / 24 / 2602.23711
Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities?
统一生成与理解模型能否在不同输出模态间保持语义等价性?
Abstract
Unified Multimodal Large Language Models (U-MLLMs) integrate understanding and generation within a single architecture. However, existing evaluations typically assess these capabilities separately, overlooking semantic equivalence, i.e., the ability to manifest consistent reasoning results regardless of the output modality. In this work, we investigate whether current U-MLLMs satisfy this premise. We observe that while models demonstrate robust textual reasoning, they fail to maintain semantic equivalence when required to render the same results in the image modality. To rigorously diagnose this discrepancy, we introduce VGUBench, a framework to decouple reasoning logic from generation fidelity. VGUBench comprises three diagnostic tasks: (1)Textual Generative Understanding, establishing a baseline for reasoning accuracy in textual response; (2)Visual Generative Understanding, evaluating the ability to generate visual responses that represent the correct answer; and (3)a Visual Rendering control task, which assesses the ability to directly render explicit visual descriptions into images without complex reasoning. Our evaluation reveals a significant disparity: despite strong performance in textual understanding and visual rendering, U-MLLMs exhibit a marked performance collapse when required to generate visual answers to questions. Furthermore, we find a negligible correlation between visual answering performance and basic rendering quality. These results suggest that the failure stems not from insufficient generation fidelity, but from a breakdown in cross-modal semantic alignment. We provide diagnostic insights to address this challenge in future Unified Generation and Understanding Models.
Chinese Translation
统一多模态大型语言模型(U-MLLMs)在单一架构中整合了理解与生成。然而,现有评估通常将这些能力分开评估,忽视了语义等价性,即无论输出模态如何,能够展现一致推理结果的能力。在本研究中,我们探讨当前的U-MLLMs是否满足这一前提。我们观察到,尽管模型在文本推理方面表现出色,但在需要以图像模态呈现相同结果时,它们未能保持语义等价性。为了严格诊断这一差异,我们引入了VGUBench,一个将推理逻辑与生成保真度解耦的框架。VGUBench包括三个诊断任务:(1)文本生成理解,建立文本响应中推理准确性的基线;(2)视觉生成理解,评估生成代表正确答案的视觉响应的能力;(3)视觉呈现控制任务,评估直接将明确的视觉描述渲染为图像而无需复杂推理的能力。我们的评估揭示了显著的差异:尽管在文本理解和视觉呈现方面表现强劲,U-MLLMs在需要生成视觉答案时却表现出明显的性能崩溃。此外,我们发现视觉回答性能与基本渲染质量之间的相关性微乎其微。这些结果表明,失败的根源并不在于生成保真度不足,而是在于跨模态语义对齐的破裂。我们提供了诊断见解,以应对未来统一生成与理解模型中的这一挑战。
cs.CV / 25 / 2602.23732
A Difference-in-Difference Approach to Detecting AI-Generated Images
一种差异中的差异方法用于检测AI生成的图像
Abstract
Diffusion models are able to produce AI-generated images that are almost indistinguishable from real ones. This raises concerns about their potential misuse and poses substantial challenges for detecting them. Many existing detectors rely on reconstruction error -- the difference between the input image and its reconstructed version -- as the basis for distinguishing real from fake images. However, these detectors become less effective as modern AI-generated images become increasingly similar to real ones. To address this challenge, we propose a novel difference-in-difference method. Instead of directly using the reconstruction error (a first-order difference), we compute the difference in reconstruction error -- a second-order difference -- for variance reduction and improving detection accuracy. Extensive experiments demonstrate that our method achieves strong generalization performance, enabling reliable detection of AI-generated images in the era of generative AI.
Chinese Translation
扩散模型能够生成几乎无法与真实图像区分的AI生成图像。这引发了对其潜在误用的担忧,并对检测这些图像提出了重大挑战。许多现有的检测器依赖于重建误差——输入图像与其重建版本之间的差异——作为区分真实与虚假图像的基础。然而,随着现代AI生成的图像与真实图像越来越相似,这些检测器的有效性降低。为了解决这一挑战,我们提出了一种新颖的差异中的差异方法。我们不是直接使用重建误差(一级差异),而是计算重建误差的差异——二级差异——以减少方差并提高检测准确性。大量实验表明,我们的方法在生成性AI时代实现了强大的泛化性能,能够可靠地检测AI生成的图像。
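One schematic reading of the difference-in-difference score: compare the reconstruction error of an image against the reconstruction error of its own reconstruction, so content-dependent error cancels and variance shrinks. The sketch below uses a toy low-pass filter as a stand-in for the diffusion inversion-and-reconstruction operator, and this particular pairing of first- and second-pass errors is an illustrative assumption, not the paper's exact estimator.

```python
import numpy as np

def recon_toy(x):
    """Toy stand-in for a diffusion reconstruction operator: a 3x3 local mean."""
    p = np.pad(x, 1, mode="edge")
    return (p[:-2, :-2] + p[:-2, 1:-1] + p[:-2, 2:] +
            p[1:-1, :-2] + p[1:-1, 1:-1] + p[1:-1, 2:] +
            p[2:, :-2] + p[2:, 1:-1] + p[2:, 2:]) / 9.0

def did_score(x, recon=recon_toy):
    """Difference-in-difference detection score.

    First-order difference:  e1 = |x - R(x)|      (classic reconstruction error)
    Second-order difference: e1 - e2, where e2 = |R(x) - R(R(x))|.
    Subtracting e2 cancels the error the operator incurs on already
    "reconstructed-like" content, reducing variance of the statistic.
    """
    r1 = recon(x)
    r2 = recon(r1)
    e1 = np.abs(x - r1).mean()
    e2 = np.abs(r1 - r2).mean()
    return e1 - e2
```

Under this toy operator, an image that already looks like a reconstruction scores near zero, while content the operator cannot reproduce scores high, mirroring how reconstruction-error detectors separate real from generated images.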
cs.CV / 26 / 2602.23734
UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking
UTPTrack:面向简单统一的视觉跟踪令牌剪枝
Abstract
One-stream Transformer-based trackers achieve advanced performance in visual object tracking but suffer from significant computational overhead that hinders real-time deployment. While token pruning offers a path to efficiency, existing methods are fragmented. They typically prune the search region, dynamic template, and static template in isolation, overlooking critical inter-component dependencies, which yields suboptimal pruning and degraded accuracy. To address this, we introduce UTPTrack, a simple and Unified Token Pruning framework that, for the first time, jointly compresses all three components. UTPTrack employs an attention-guided, token type-aware strategy to holistically model redundancy, a design that seamlessly supports unified tracking across multimodal and language-guided tasks within a single model. Extensive evaluations on 10 benchmarks demonstrate that UTPTrack achieves a new state-of-the-art in the accuracy-efficiency trade-off for pruning-based trackers, pruning 65.4% of vision tokens in RGB-based tracking and 67.5% in unified tracking while preserving 99.7% and 100.5% of baseline performance, respectively. This strong performance across both RGB and multimodal scenarios underlines its potential as a robust foundation for future research in efficient visual tracking. Code will be released at https://github.com/EIT-NLP/UTPTrack.
Chinese Translation
基于单流Transformer的跟踪器在视觉目标跟踪中实现了先进的性能,但面临显著的计算开销,阻碍了实时部署。尽管令牌剪枝提供了提高效率的途径,但现有方法相对分散。它们通常孤立地剪枝搜索区域、动态模板和静态模板,忽视了关键的组件间依赖关系,从而导致次优剪枝和精度下降。为了解决这一问题,我们提出了UTPTrack,这是一种简单的统一令牌剪枝框架,首次联合压缩所有三个组件。UTPTrack采用了一种基于注意力的、令牌类型感知的策略,全面建模冗余,这一设计无缝支持在单一模型中跨多模态和语言引导任务的统一跟踪。在10个基准测试上的广泛评估表明,UTPTrack在基于剪枝的跟踪器的精度-效率权衡中达到了新的最先进水平,在基于RGB的跟踪中剪枝了65.4%的视觉令牌,而在统一跟踪中剪枝了67.5%,同时分别保留了99.7%和100.5%的基线性能。这种在RGB和多模态场景下的强大表现强调了其作为未来高效视觉跟踪研究坚实基础的潜力。代码将发布在https://github.com/EIT-NLP/UTPTrack。
cs.CV / 27 / 2602.23739
U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
U-Mind:一种统一的实时多模态交互框架与视听生成
Abstract
Full-stack multimodal interaction in real-time is a central goal in building intelligent embodied agents capable of natural, dynamic communication. However, existing systems are either limited to unimodal generation or suffer from degraded reasoning and poor cross-modal alignment, preventing coherent and perceptually grounded interactions. In this work, we introduce U-Mind, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop. At its core, U-Mind implements a Unified Alignment and Reasoning Framework that addresses two key challenges: enhancing cross-modal synchronization via a segment-wise alignment strategy, and preserving reasoning abilities through Rehearsal-Driven Learning. During inference, U-Mind adopts a text-first decoding pipeline that performs internal chain-of-thought planning followed by temporally synchronized generation across modalities. To close the loop, we implement a real-time video rendering framework conditioned on pose and speech, enabling expressive and synchronized visual feedback. Extensive experiments demonstrate that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks, including question answering, instruction following, and motion generation, paving the way toward intelligent, immersive conversational agents.
Chinese Translation
实时的全栈多模态交互是构建能够自然、动态沟通的智能具身代理的核心目标。然而,现有系统要么仅限于单模态生成,要么在推理能力和跨模态对齐方面表现不佳,导致无法实现连贯且感知基础的交互。在本研究中,我们提出了U-Mind,这是第一个支持实时生成的高智能多模态对话统一系统,能够在单一交互循环中联合建模语言、语音、动作和视频合成。U-Mind的核心实现了一个统一对齐与推理框架,解决了两个关键挑战:通过分段对齐策略增强跨模态同步,以及通过复习驱动学习保持推理能力。在推理过程中,U-Mind采用文本优先的解码管道,首先进行内部思维链规划,然后在各模态之间进行时间同步生成。为了闭合循环,我们实现了一个基于姿态和语音的实时视频渲染框架,能够提供富有表现力和同步的视觉反馈。大量实验表明,U-Mind在一系列多模态交互任务中(包括问答、指令跟随和动作生成)达到了最先进的性能,为智能沉浸式对话代理的实现铺平了道路。
cs.CV / 28 / 2602.23759
Learning Accurate Segmentation Purely from Self-Supervision
仅通过自我监督学习准确分割
Abstract
Accurately segmenting objects without any manual annotations remains one of the core challenges in computer vision. In this work, we introduce Selfment, a fully self-supervised framework that segments foreground objects directly from raw images without human labels, pretrained segmentation models, or any post-processing. Selfment first constructs patch-level affinity graphs from self-supervised features and applies NCut to obtain an initial coarse foreground--background separation. We then introduce Iterative Patch Optimization (IPO), a feature-space refinement procedure that progressively enforces spatial coherence and semantic consistency through iterative patch clustering. The refined masks are subsequently used as supervisory signals to train a lightweight segmentation head with contrastive and region-consistency objectives, allowing the model to learn stable and transferable object representations. Despite its simplicity and complete absence of manual supervision, Selfment sets new state-of-the-art (SoTA) results across multiple benchmarks. It achieves substantial improvements on $F_{\max}$ over previous unsupervised saliency detection methods on ECSSD ($+4.0\%$), HKUIS ($+4.6\%$), and PASCAL-S ($+5.7\%$). Moreover, without any additional fine-tuning, Selfment demonstrates remarkable zero-shot generalization to camouflaged object detection tasks (e.g., $0.910$ $S_m$ on CHAMELEON and $0.792$ $F_{\beta}^{\omega}$ on CAMO), outperforming all existing unsupervised approaches and even rivaling the SoTA fully supervised methods.
Chinese Translation
在没有任何人工标注的情况下准确分割物体仍然是计算机视觉中的核心挑战之一。在本研究中,我们介绍了Selfment,一个完全自我监督的框架,它能够直接从原始图像中分割前景物体,而无需人工标签、预训练的分割模型或任何后处理。Selfment首先从自我监督特征构建补丁级别的亲和图,并应用NCut获得初步的粗略前景-背景分离。然后,我们引入了迭代补丁优化(Iterative Patch Optimization, IPO),这是一种特征空间精炼过程,通过迭代补丁聚类逐步增强空间一致性和语义一致性。精炼后的掩膜随后被用作监督信号,以训练一个轻量级的分割头,结合对比和区域一致性目标,使模型能够学习稳定且可迁移的物体表示。尽管方法简单且完全不依赖人工监督,Selfment在多个基准测试中设定了新的最先进(SoTA)结果。它在$F_{\max}$上相较于之前的无监督显著性检测方法取得了显著提升:ECSSD($+4.0\%$)、HKUIS($+4.6\%$)和PASCAL-S($+5.7\%$)。此外,在没有任何额外微调的情况下,Selfment在伪装物体检测任务上展示了显著的零样本泛化能力(例如,在CHAMELEON上达到$0.910$ $S_m$,在CAMO上达到$0.792$ $F_{\beta}^{\omega}$),超越了所有现有的无监督方法,甚至与最先进的完全监督方法相抗衡。
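The NCut step on a patch-level affinity graph is a standard spectral bipartition: threshold the second-smallest (Fiedler) eigenvector of the normalized graph Laplacian. A minimal dense-matrix sketch, with the mean-value threshold an illustrative choice:

```python
import numpy as np

def ncut_bipartition(W):
    """Two-way normalized cut over a patch affinity matrix.

    W: (n, n) symmetric, non-negative affinities between patches.
    Returns a boolean mask splitting patches into two groups
    (coarse foreground vs. background).
    """
    d = W.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(d + 1e-12)
    # Symmetric normalized Laplacian  L = I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(W)) - (d_isqrt[:, None] * W) * d_isqrt[None, :]
    _, vecs = np.linalg.eigh(L)                 # eigenvalues ascending
    # Recover the generalized (Fiedler) eigenvector and threshold it.
    fiedler = vecs[:, 1] * d_isqrt
    return fiedler > fiedler.mean()
```

Which side of the cut is "foreground" is ambiguous at this stage; unsupervised pipelines typically resolve it with a heuristic such as the group touching fewer image borders.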
cs.CV / 29 / 2602.23783
Diffusion Probe: Generated Image Result Prediction Using CNN Probes
扩散探针:基于CNN探针的生成图像结果预测
Abstract
Text-to-image (T2I) diffusion models lack an efficient mechanism for early quality assessment, leading to costly trial-and-error in multi-generation scenarios such as prompt iteration, agent-based generation, and flow-grpo. We reveal a strong correlation between early diffusion cross-attention distributions and final image quality. Based on this finding, we introduce Diffusion Probe, a framework that leverages internal cross-attention maps as predictive signals. We design a lightweight predictor that maps statistical properties of early-stage cross-attention extracted from initial denoising steps to the final image's overall quality. This enables accurate forecasting of image quality across diverse evaluation metrics long before full synthesis is complete. We validate Diffusion Probe across a wide range of settings. On multiple T2I models, across early denoising windows, resolutions, and quality metrics, it achieves strong correlation (PCC > 0.7) and high classification performance (AUC-ROC > 0.9). Its reliability translates into practical gains. By enabling early quality-aware decisions in workflows such as prompt optimization, seed selection, and accelerated RL training, the probe supports more targeted sampling and avoids computation on low-potential generations. This reduces computational overhead while improving final output quality. Diffusion Probe is model-agnostic, efficient, and broadly applicable, offering a practical solution for improving T2I generation efficiency through early quality prediction.
Chinese Translation
文本到图像(T2I)扩散模型缺乏有效的早期质量评估机制,这导致在多代场景中(如提示迭代、基于代理的生成和流图)出现高昂的试错成本。我们揭示了早期扩散交叉注意力分布与最终图像质量之间的强相关性。基于这一发现,我们引入了扩散探针(Diffusion Probe),一个利用内部交叉注意力图作为预测信号的框架。我们设计了一个轻量级预测器,将从初始去噪步骤中提取的早期交叉注意力的统计特性映射到最终图像的整体质量。这使得在完整合成完成之前,就能够准确预测图像质量,适用于多种评估指标。我们在广泛的设置中验证了扩散探针。在多个T2I模型中,跨越早期去噪窗口、分辨率和质量指标,它实现了强相关性(PCC > 0.7)和高分类性能(AUC-ROC > 0.9)。其可靠性转化为实际收益。通过在提示优化、种子选择和加速强化学习训练等工作流程中实现早期质量意识决策,该探针支持更有针对性的采样,避免在低潜力生成上进行计算。这减少了计算开销,同时提高了最终输出质量。扩散探针是模型无关的、高效的且广泛适用,提供了一种通过早期质量预测提高T2I生成效率的实用解决方案。
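The correlation finding can be illustrated end-to-end in a few lines: summarize an early-step cross-attention map with simple statistics and measure their Pearson correlation with final quality. The specific statistics (mean entropy, mean peak concentration) are assumptions for illustration; the paper's lightweight predictor maps such features to quality scores.

```python
import numpy as np

def attention_stats(attn):
    """Summary statistics of an early-step cross-attention map (Nq, Nk)."""
    p = attn / attn.sum(axis=-1, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1).mean()   # how diffuse attention is
    peak = p.max(axis=-1).mean()                              # how concentrated it is
    return np.array([entropy, peak])

def probe_correlation(attn_maps, quality):
    """Pearson correlation between an attention statistic and final image quality,
    i.e., the predictive signal a probe can be trained on."""
    feats = np.array([attention_stats(a) for a in attn_maps])
    return np.corrcoef(feats[:, 1], quality)[0, 1]            # peak concentration vs quality
```

A high correlation on held-out generations is what licenses early-exit decisions: low-scoring initializations can be abandoned before full synthesis.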
cs.CV / 30 / 2602.23790
Fourier Angle Alignment for Oriented Object Detection in Remote Sensing
遥感中定向物体检测的傅里叶角对齐
Abstract
In remote sensing rotated object detection, mainstream methods suffer from two bottlenecks: directional incoherence at the detector neck and task conflict at the detection head. Utilising Fourier rotation equivariance, we introduce Fourier Angle Alignment, which analyses angle information through the frequency spectrum and aligns the main direction to a certain orientation. We then propose two plug-and-play modules: FAAFusion and FAA Head. FAAFusion works at the detector neck, aligning the main direction of higher-level features to the lower-level features and then fusing them. FAA Head serves as a new detection head, which pre-aligns RoI features to a canonical angle and adds them to the original features before classification and regression. Experiments on DOTA-v1.0, DOTA-v1.5 and HRSC2016 show that our method greatly improves upon previous work. In particular, our method achieves new state-of-the-art results of 78.72% mAP on DOTA-v1.0 and 72.28% mAP on DOTA-v1.5 with single-scale training and testing, validating the efficacy of our approach in remote sensing object detection. The code is made publicly available at https://github.com/gcy0423/Fourier-Angle-Alignment .
Chinese Translation
在遥感旋转物体检测中,主流方法面临两个瓶颈:检测器颈部的方向不一致性和检测头的任务冲突。利用傅里叶旋转等变性,我们提出了傅里叶角对齐(Fourier Angle Alignment),该方法通过频谱分析角度信息,并将主要方向对齐到特定方向。然后,我们提出了两个即插即用的模块:FAA融合(FAA Fusion)和FAA头(FAA Head)。FAA融合模块在检测器颈部工作,将高层特征的主要方向对齐到低层特征,然后进行融合。FAA头作为一个新的检测头,预先将感兴趣区域(RoI)特征对齐到标准角度,并在分类和回归之前将其添加到原始特征中。在DOTA-v1.0、DOTA-v1.5和HRSC2016数据集上的实验表明,我们的方法可以显著改善之前的工作。特别是,我们的方法在DOTA-v1.0和DOTA-v1.5数据集上分别达到了78.72%和72.28%的新最优结果(mAP),并且在单尺度训练和测试下验证了我们方法在遥感物体检测中的有效性。代码已公开发布在 https://github.com/gcy0423/Fourier-Angle-Alignment 。
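The frequency-domain view of orientation that FAA builds on can be illustrated with a toy brute-force 2-D DFT: the peak of the magnitude spectrum (excluding DC) points along the pattern's dominant direction, and rotating the input rotates that peak by the same angle. This is only a conceptual sketch, not the paper's module:

```python
import cmath
import math

def dominant_angle(img):
    """Estimate the dominant orientation (degrees in [0, 180)) of a
    small 2-D pattern from the peak of its brute-force DFT magnitude
    spectrum, excluding the DC term."""
    h, w = len(img), len(img[0])
    best_mag, best_uv = -1.0, (0, 0)
    for u in range(h):
        for v in range(w):
            if u == 0 and v == 0:
                continue  # skip DC component
            acc = 0j
            for y in range(h):
                for x in range(w):
                    acc += img[y][x] * cmath.exp(
                        -2j * math.pi * (u * y / h + v * x / w))
            if abs(acc) > best_mag:
                best_mag, best_uv = abs(acc), (u, v)
    u, v = best_uv
    fu = u if u <= h // 2 else u - h  # unwrap bins past Nyquist
    fv = v if v <= w // 2 else v - w
    return math.degrees(math.atan2(fu, fv)) % 180.0
```

For a pattern that varies only along x (vertical stripes) the peak lies on the horizontal frequency axis, giving 0 degrees; varying only along y gives 90 degrees.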
cs.CV / 31 / 2602.23806
See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent
观察、行动、适应:通过个性化VLM引导代理实现无监督跨域视觉适应的主动感知
Abstract
Pre-trained perception models excel in generic image domains but degrade significantly in novel environments like indoor scenes. The conventional remedy is fine-tuning on downstream data, which incurs catastrophic forgetting of prior knowledge and demands costly, scene-specific annotations. We propose a paradigm shift through Sea$^2$ (See, Act, Adapt): rather than adapting the perception modules themselves, we adapt how they are deployed through an intelligent pose-control agent. Sea$^2$ keeps all perception modules frozen, requiring no downstream labels during training, and uses only scalar perceptual feedback to navigate the agent toward informative viewpoints. Specifically, we transform a vision-language model (VLM) into a low-level pose controller through a two-stage training pipeline: first fine-tuning it on rule-based exploration trajectories that systematically probe indoor scenes, and then refining the policy via unsupervised reinforcement learning that constructs rewards from the perception module's outputs and confidence. Unlike prior active perception methods that couple exploration with specific models or collect data for retraining them, Sea$^2$ directly leverages off-the-shelf perception models for various tasks without the need for retraining. We conducted experiments on three visual perception tasks, including visual grounding, segmentation and 3D box estimation, with performance improvements of 13.54%, 15.92% and 27.68% respectively on dataset ReplicaCAD.
Chinese Translation
预训练的感知模型在通用图像领域表现出色,但在室内场景等新环境中显著下降。传统的解决方案是对下游数据进行微调,这会导致先前知识的灾难性遗忘,并需要昂贵的场景特定标注。我们提出了一种范式转变,通过Sea$^2$(观察、行动、适应):我们不是调整感知模块本身,而是通过智能姿态控制代理调整它们的部署方式。Sea$^2$保持所有感知模块不变,在训练过程中无需下游标签,仅使用标量感知反馈引导代理朝向信息丰富的视角。具体而言,我们通过两阶段的训练流程将视觉语言模型(VLM)转变为低级姿态控制器:首先在基于规则的探索轨迹上进行微调,这些轨迹系统性地探测室内场景,然后通过无监督强化学习来优化策略,该策略根据感知模块的输出和置信度构建奖励。与之前将探索与特定模型结合或收集数据以重新训练它们的主动感知方法不同,Sea$^2$直接利用现成的感知模型处理各种任务,而无需重新训练。我们在三个视觉感知任务上进行了实验,包括视觉定位、分割和3D盒估计,在数据集ReplicaCAD上分别实现了13.54%、15.92%和27.68%的性能提升。
cs.CV / 32 / 2602.23814
Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation
基于三维几何先验的双手操作动作几何预测
Abstract
Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. However, existing methods typically rely on 2D features with limited spatial awareness, or require explicit point clouds that are difficult to obtain reliably in real-world settings. At the same time, recent 3D geometric foundation models show that accurate and diverse 3D structure can be reconstructed directly from RGB images in a fast and robust manner. We leverage this opportunity and propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense pointmap. By explicitly predicting how the 3D scene will evolve together with the action sequence, the policy gains strong spatial understanding and predictive capability using only RGB observations. We evaluate our method both in simulation on the RoboTwin benchmark and in real-world robot executions. Our approach consistently outperforms 2D-based and point-cloud-based baselines, achieving state-of-the-art performance in manipulation success, inter-arm coordination, and 3D spatial prediction accuracy. Code is available at https://github.com/Chongyang-99/GAP.git.
Chinese Translation
双手操作需要能够推理三维几何、预测其在动作下如何演变并生成平滑、协调运动的策略。然而,现有方法通常依赖于具有有限空间意识的二维特征,或需要在现实环境中难以可靠获取的显式点云。同时,最近的三维几何基础模型表明,可以快速且稳健地直接从RGB图像重建准确且多样的三维结构。我们利用这一机会,提出了一个框架,直接在预训练的三维几何基础模型上构建双手操作。我们的策略将几何感知潜变量、二维语义特征和本体感知融合为统一的状态表示,并使用扩散模型共同预测未来的动作片段和解码为密集点图的未来三维潜变量。通过明确预测三维场景如何与动作序列共同演变,该策略仅使用RGB观测获得强大的空间理解和预测能力。我们在RoboTwin基准测试的仿真和实际机器人执行中评估了我们的方法。我们的方法在操作成功率、臂间协调和三维空间预测准确性方面始终优于基于二维和点云的基线,达到了最先进的性能。代码可在 https://github.com/Chongyang-99/GAP.git 获取。
cs.CV / 33 / 2602.23817
Footprint-Guided Exemplar-Free Continual Histopathology Report Generation
基于足迹引导的无例子持续病理报告生成
Abstract
Rapid progress in vision-language modeling has enabled pathology report generation from gigapixel whole-slide images (WSIs), but most approaches assume static training with simultaneous access to all data. In clinical deployment, however, new organs, institutions, and reporting conventions emerge over time, and sequential fine-tuning can cause catastrophic forgetting. We introduce an exemplar-free continual learning framework for WSI-to-report generation that avoids storing raw slides or patch exemplars. The core idea is a compact domain footprint built in a frozen patch-embedding space: a small codebook of representative morphology tokens together with slide-level co-occurrence summaries and lightweight patch-count priors. These footprints support generative replay by synthesizing pseudo-WSI representations that reflect domain-specific morphological mixtures, while a teacher snapshot provides pseudo-reports to supervise the updated model without retaining past data. To address shifting reporting conventions, we distill domain-specific linguistic characteristics into a compact style descriptor and use it to steer generation. At inference, the model identifies the most compatible descriptor directly from the slide signal, enabling domain-agnostic setup without requiring explicit domain identifiers. Evaluated across multiple public continual learning benchmarks, our approach outperforms exemplar-free and limited-buffer rehearsal baselines, highlighting footprint-based generative replay as a practical solution for deployment in evolving clinical settings.
Chinese Translation
视觉-语言建模的快速进展使得从千兆像素全切片图像生成病理报告成为可能,但大多数方法假设在训练过程中可以同时访问所有数据。在临床应用中,新的器官、机构和报告规范会随着时间的推移而出现,而顺序微调可能导致灾难性遗忘。我们提出了一种无例子的持续学习框架,用于从全切片图像(WSI)生成报告,避免存储原始切片或补丁示例。核心思想是在冻结的补丁嵌入空间中构建一个紧凑的领域足迹:一个包含代表性形态标记的小型代码本,以及切片级共现摘要和轻量级补丁计数先验。这些足迹通过合成反映特定领域形态混合的伪全切片图像表示,支持生成重放,而教师快照提供伪报告以监督更新模型,而无需保留过去的数据。为了应对变化的报告规范,我们将领域特定的语言特征提炼成一个紧凑的风格描述符,并利用它来引导生成。在推理阶段,模型直接从切片信号中识别出最兼容的描述符,实现无领域依赖的设置,而无需显式的领域标识符。在多个公共持续学习基准测试中评估,我们的方法优于无例子和有限缓冲重放基线,突显了基于足迹的生成重放作为在不断发展的临床环境中部署的实用解决方案。
cs.CV / 34 / 2602.23820
Denoising-Enhanced YOLO for Robust SAR Ship Detection
去噪增强的YOLO用于稳健的SAR船舶检测
Abstract
With the rapid advancement of deep learning, synthetic aperture radar (SAR) imagery has become a key modality for ship detection. However, robust performance remains challenging in complex scenes, where clutter and speckle noise can induce false alarms and small targets are easily missed. To address these issues, we propose CPN-YOLO, a high-precision ship detection framework built upon YOLOv8 with three targeted improvements. First, we introduce a learnable large-kernel denoising module for input pre-processing, producing cleaner representations and more discriminative features across diverse ship types. Second, we design a feature extraction enhancement strategy based on the PPA attention mechanism to strengthen multi-scale modeling and improve sensitivity to small ships. Third, we incorporate a Gaussian similarity loss derived from the normalized Wasserstein distance (NWD) to better measure similarity under complex bounding-box distributions and improve generalization. Extensive experiments on HRSID and SSDD demonstrate the effectiveness of our method. On SSDD, CPN-YOLO surpasses the YOLOv8 baseline, achieving 97.0% precision, 95.1% recall, and 98.9% mAP, and consistently outperforms other representative deep-learning detectors in overall performance.
Chinese Translation
随着深度学习的快速发展,合成孔径雷达(SAR)图像已成为船舶检测的重要模式。然而,在复杂场景中,稳健的性能仍然具有挑战性,杂波和斑点噪声可能引发误报,小目标也容易被遗漏。为了解决这些问题,我们提出了CPN-YOLO,这是一个基于YOLOv8构建的高精度船舶检测框架,具有三项针对性的改进。首先,我们引入了一个可学习的大核去噪模块用于输入预处理,生成更清晰的表示和更具辨别性的特征,适用于多种船舶类型。其次,我们设计了一种基于PPA注意力机制的特征提取增强策略,以加强多尺度建模并提高对小船舶的敏感性。第三,我们结合了源自归一化Wasserstein距离(NWD)的高斯相似性损失,以更好地衡量复杂边界框分布下的相似性并提高泛化能力。在HRSID和SSDD上的大量实验表明我们方法的有效性。在SSDD上,CPN-YOLO超越了YOLOv8基线,达到了97.0%的精度、95.1%的召回率和98.9%的mAP,并在整体性能上始终优于其他代表性的深度学习检测器。
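The Gaussian similarity loss mentioned above follows the published normalized Wasserstein distance formulation, in which an axis-aligned box (cx, cy, w, h) is modeled as a 2-D Gaussian with covariance diag(w²/4, h²/4). A minimal sketch follows; the constant C is a dataset-dependent hyperparameter, and the paper's exact variant may differ:

```python
import math

def nwd(box1, box2, C=12.8):
    """Normalized Wasserstein distance similarity between two boxes
    given as (cx, cy, w, h). For diagonal Gaussians the squared 2-
    Wasserstein distance reduces to a Euclidean distance over the
    vectors (cx, cy, w/2, h/2)."""
    cx1, cy1, w1, h1 = box1
    cx2, cy2, w2, h2 = box2
    w2_dist = ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
               + ((w1 - w2) / 2.0) ** 2 + ((h1 - h2) / 2.0) ** 2)
    return math.exp(-math.sqrt(w2_dist) / C)

def nwd_loss(box1, box2, C=12.8):
    """Similarity turned into a loss: 0 for identical boxes."""
    return 1.0 - nwd(box1, box2, C)
```

Unlike IoU, this similarity stays smooth and informative even when small boxes do not overlap at all, which is why it suits tiny-ship regression.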
cs.CV / 35 / 2602.23823
APPO: Attention-guided Perception Policy Optimization for Video Reasoning
APPO:基于注意力引导的视频推理感知策略优化
Abstract
Complex video reasoning, in practice, relies more heavily on fine-grained perception than on expert-level (e.g., Ph.D.-level science) reasoning. Through extensive empirical observation, we have recognized the critical impact of perception. In particular, when perception ability is held nearly fixed, upgrading the reasoning model from Qwen3-8B to OpenAI-o3 yields only a 0.7% performance improvement. Conversely, even a modest change in perception model scale (from 7B to 32B) boosts performance by 1.4%, indicating that enhancing perception, rather than reasoning, is the more critical lever for improving performance. It is therefore worthwhile to explore how to enhance perception ability through reasoning without expensive fine-grained annotations. To this end, we propose APPO, an Attention-guided Perception Policy Optimization algorithm that leverages token-level dense rewards to improve the model's fine-grained perception. The core idea behind APPO is to optimize those tokens from different responses that primarily attend to the same crucial video frame (called intra-group perception tokens). Experimental results on diverse video benchmarks and models at different scales (3B/7B) demonstrate that APPO consistently outperforms GRPO and DAPO (by 0.5%~4%). We hope our work provides a promising, low-cost approach to effectively enhancing a model's perception abilities through reasoning, serving diverse scenarios and demands.
Chinese Translation
复杂的视频推理实际上过于依赖细粒度的感知,而非专家(如博士、科学)级别的推理。通过广泛的实证观察,我们认识到感知的关键影响。特别是,当感知能力几乎固定时,将推理从 Qwen3-8B 提升到 OpenAI-o3 仅带来了 0.7% 的性能提升。相反,即使是感知模型规模的最小变化(从 7B 到 32B)也能提升 1.4% 的性能,这表明提升感知而非推理对提高性能更为关键。因此,探索如何通过推理提升感知能力,而无需昂贵的细粒度标注信息,是值得的。为实现这一目标,我们特别提出了 APPO,即基于注意力引导的感知策略优化算法,该算法利用令牌级的密集奖励来改善模型的细粒度感知。APPO 的核心思想是优化来自不同响应的那些主要关注同一关键视频帧的令牌(称为组内感知令牌)。在不同规模(3/7B)的多样化视频基准和模型上的实验结果表明,APPO 始终优于 GRPO 和 DAPO(0.5%~4%)。我们希望我们的工作提供了一种有前景的方法,以低成本有效提升模型的感知能力,通过推理服务于多样化的场景和需求。
cs.CV / 36 / 2602.23863
NAU-QMUL: Utilizing BERT and CLIP for Multi-modal AI-Generated Image Detection
NAU-QMUL:利用BERT和CLIP进行多模态AI生成图像检测
Abstract
With the aim of detecting AI-generated images and identifying the specific models responsible for their generation, we propose a multi-modal multi-task model. The model leverages pre-trained BERT and CLIP Vision encoders for text and image feature extraction, respectively, and employs cross-modal feature fusion with a tailored multi-task loss function. Additionally, a pseudo-labeling-based data augmentation strategy was utilized to expand the training dataset with high-confidence samples. The model achieved fifth place in both Tasks A and B of the "CT2: AI-Generated Image Detection" competition, with F1 scores of 83.16% and 48.88%, respectively. These findings highlight the effectiveness of the proposed architecture and its potential for advancing AI-generated content detection in real-world scenarios. The source code for our method is published on https://github.com/xxxxxxxxy/AIGeneratedImageDetection.
Chinese Translation
为了检测AI生成的图像并识别其生成所用的具体模型,我们提出了一种多模态多任务模型。该模型分别利用预训练的BERT和CLIP视觉编码器进行文本和图像特征提取,并采用定制的多任务损失函数进行跨模态特征融合。此外,我们还采用了一种基于伪标签的数据增强策略,以高置信度样本扩展训练数据集。该模型在“CT2: AI-Generated Image Detection”竞赛的任务A和任务B中分别获得了第五名,F1分数为83.16%和48.88%。这些结果突显了所提架构的有效性及其在现实场景中推进AI生成内容检测的潜力。我们的方法源代码已发布在https://github.com/xxxxxxxxy/AIGeneratedImageDetection。
cs.CV / 37 / 2602.23869
Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition
通过分层注意力掩码和模型组合实现遥感中的开放词汇语义分割
Abstract
In this paper, we propose ReSeg-CLIP, a new training-free Open-Vocabulary Semantic Segmentation method for remote sensing data. To compensate for the weaknesses of vision-language models such as CLIP in semantic segmentation, caused by inappropriate interactions within the self-attention layers, we introduce a hierarchical scheme that utilizes masks generated by SAM to constrain these interactions at multiple scales. We also present a model composition approach that averages the parameters of multiple RS-specific CLIP variants, taking advantage of a new weighting scheme that evaluates representational quality using varying text prompts. Our method achieves state-of-the-art results across three RS benchmarks without additional training.
Chinese Translation
在本文中,我们提出了一种新的无训练开放词汇语义分割方法ReSeg-CLIP,专门用于遥感数据。为了弥补视觉语言模型(如CLIP)在语义分割中由于自注意力层内不当交互而导致的问题,我们引入了一种分层方案,利用由SAM生成的掩码来限制多尺度下的交互。我们还提出了一种模型组合方法,通过对多个遥感特定的CLIP变体的参数进行平均,利用一种新的加权方案来评估使用不同文本提示的表征质量。我们的方法在三个遥感基准测试中实现了最先进的结果,无需额外训练。
cs.CV / 38 / 2602.23871
Bandwidth-adaptive Cloud-Assisted 360-Degree 3D Perception for Autonomous Vehicles
带宽自适应的云辅助360度3D感知用于自动驾驶车辆
Abstract
A key challenge for autonomous driving lies in maintaining real-time situational awareness regarding surrounding obstacles under strict latency constraints. The high processing requirements coupled with limited onboard computational resources can cause delay issues, particularly in complex urban settings. To address this, we propose leveraging Vehicle-to-Everything (V2X) communication to partially offload processing to the cloud, where compute resources are abundant, thus reducing overall latency. Our approach utilizes transformer-based models to fuse multi-camera sensor data into a comprehensive Bird's-Eye View (BEV) representation, enabling accurate 360-degree 3D object detection. The computation is dynamically split between the vehicle and the cloud based on the number of layers processed locally and the quantization level of the features. To further reduce network load, we apply feature vector clipping and compression prior to transmission. In a real-world experimental evaluation, our hybrid strategy achieved a 72% reduction in end-to-end latency compared to a traditional onboard solution. To adapt to fluctuating network conditions, we introduce a dynamic optimization algorithm that selects the split point and quantization level to maximize detection accuracy while satisfying real-time latency constraints. Trace-based evaluation under realistic bandwidth variability shows that this adaptive approach improves accuracy by up to 20% over static parameterization with the same latency performance.
Chinese Translation
自动驾驶面临的一项关键挑战是,在严格的延迟限制下,保持对周围障碍物的实时情况感知。高处理需求与有限的车载计算资源相结合,可能导致延迟问题,特别是在复杂的城市环境中。为了解决这一问题,我们提出利用车对一切(Vehicle-to-Everything, V2X)通信将部分处理任务卸载到云端,云端计算资源丰富,从而降低整体延迟。我们的方法利用基于变换器(transformer)的模型,将多摄像头传感器数据融合为全面的鸟瞰图(Bird's-Eye View, BEV)表示,从而实现准确的360度3D物体检测。计算任务根据本地处理的层数和特征的量化水平在车辆与云端之间动态分配。为了进一步减少网络负载,我们在传输前应用特征向量剪裁和压缩。在实际的实验评估中,我们的混合策略相比传统的车载解决方案实现了72%的端到端延迟减少。为了适应波动的网络条件,我们引入了一种动态优化算法,该算法选择分割点和量化水平,以最大化检测准确性,同时满足实时延迟约束。在现实带宽变化下的基于轨迹的评估表明,这种自适应方法在相同延迟性能下提高了多达20%的准确性,相较于静态参数化。
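The adaptive split-point/quantization selection can be pictured as a small constrained search over candidate configurations. The sketch below is illustrative only; the cost numbers and field names are invented, not taken from the paper:

```python
def best_config(configs, bandwidth_mbps, latency_budget_ms):
    """Pick the (split layer, quantization) configuration maximizing
    detection accuracy subject to an end-to-end latency budget.
    Each config carries illustrative per-stage costs:
      local_ms      - onboard compute time up to the split point
      feature_mbits - size of the (quantized) features to transmit
      cloud_ms      - remaining compute time in the cloud
      accuracy      - expected detection accuracy of this setting
    """
    best = None
    for cfg in configs:
        tx_ms = cfg["feature_mbits"] / bandwidth_mbps * 1000.0
        total = cfg["local_ms"] + tx_ms + cfg["cloud_ms"]
        if total <= latency_budget_ms:
            if best is None or cfg["accuracy"] > best["accuracy"]:
                best = cfg
    return best  # None if no configuration meets the budget
```

With ample bandwidth the search keeps an early split with fine quantization (highest accuracy); as bandwidth drops, it falls back to later splits or coarser features that transmit less.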
cs.CV / 39 / 2602.23872
Altitude-Aware Visual Place Recognition in Top-Down View
基于高度感知的自上而下视觉地点识别
Abstract
To address the challenge of aerial visual place recognition (VPR) problem under significant altitude variations, this study proposes an altitude-adaptive VPR approach that integrates ground feature density analysis with image classification techniques. The proposed method estimates airborne platforms' relative altitude by analyzing the density of ground features in images, then applies relative altitude-based cropping to generate canonical query images, which are subsequently used in a classification-based VPR strategy for localization. Extensive experiments across diverse terrains and altitude conditions demonstrate that the proposed approach achieves high accuracy and robustness in both altitude estimation and VPR under significant altitude changes. Compared to conventional methods relying on barometric altimeters or Time-of-Flight (ToF) sensors, this solution requires no additional hardware and offers a plug-and-play solution for downstream applications, making it suitable for small- and medium-sized airborne platforms operating in diverse environments, including rural and urban areas. Under significant altitude variations, incorporating our relative altitude estimation module into the VPR retrieval pipeline boosts average R@1 and R@5 by 29.85% and 60.20%, respectively, compared with applying VPR retrieval alone. Furthermore, compared to traditional Monocular Metric Depth Estimation (MMDE) methods, the proposed method reduces the mean error by 202.1 m, yielding average additional improvements of 31.4% in R@1 and 44% in R@5. These results demonstrate that our method establishes a robust, vision-only framework for three-dimensional visual place recognition, offering a practical and scalable solution for accurate airborne platforms localization under large altitude variations and limited sensor availability.
Chinese Translation
为了解决在显著高度变化下的空中视觉地点识别(VPR)问题,本研究提出了一种高度自适应的VPR方法,该方法将地面特征密度分析与图像分类技术相结合。所提方法通过分析图像中地面特征的密度来估计空中平台的相对高度,然后应用基于相对高度的裁剪生成规范查询图像,这些图像随后用于基于分类的VPR策略进行定位。在不同地形和高度条件下进行的大量实验表明,所提方法在高度估计和VPR方面都能在显著高度变化下实现高准确性和鲁棒性。与依赖气压高度计或飞行时间(ToF)传感器的传统方法相比,该解决方案不需要额外的硬件,并为下游应用提供了即插即用的解决方案,使其适用于在包括农村和城市地区的多种环境中运行的小型和中型空中平台。在显著高度变化下,将我们的相对高度估计模块纳入VPR检索管道,使得平均R@1和R@5分别提高了29.85%和60.20%,与单独应用VPR检索相比。此外,与传统的单目度量深度估计(MMDE)方法相比,所提方法将平均误差降低了202.1米,R@1和R@5的平均额外提升分别为31.4%和44%。这些结果表明,我们的方法建立了一个强大的仅基于视觉的三维视觉地点识别框架,为在大高度变化和有限传感器可用性下的空中平台定位提供了一个实用且可扩展的解决方案。
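The altitude-based cropping step can be sketched under a deliberately simplified toy model: assume the number of detected ground features grows with the ground footprint, which scales with altitude squared. The real paper's density analysis is more involved, and all names and constants here are invented:

```python
def estimate_relative_altitude(n_features, n_ref, alt_ref):
    """Toy relative-altitude estimate from feature counts, assuming
    features ~ footprint ~ altitude^2, so alt ~ alt_ref * sqrt(n/n_ref)."""
    return alt_ref * (n_features / n_ref) ** 0.5

def canonical_crop(width, height, altitude, alt_canonical):
    """Center-crop box (left, top, right, bottom) that shrinks a
    top-down view so its ground footprint matches what would be seen
    at the canonical altitude."""
    if altitude <= alt_canonical:
        return (0, 0, width, height)  # at or below canonical: keep all
    scale = alt_canonical / altitude
    cw, ch = int(width * scale), int(height * scale)
    left, top = (width - cw) // 2, (height - ch) // 2
    return (left, top, left + cw, top + ch)
```

The crop normalizes apparent scale before classification-based retrieval, which is the role the relative-altitude module plays in the pipeline above.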
cs.CV / 40 / 2602.23890
DACESR: Degradation-Aware Conditional Embedding for Real-World Image Super-Resolution
DACESR:针对真实世界图像超分辨率的降解感知条件嵌入
Abstract
Multimodal large models have shown excellent ability in addressing image super-resolution in real-world scenarios by leveraging language class as condition information, yet their abilities in degraded images remain limited. In this paper, we first revisit the capabilities of the Recognize Anything Model (RAM) for degraded images by calculating text similarity. We find that directly using contrastive learning to fine-tune RAM in the degraded space is difficult to achieve acceptable results. To address this issue, we employ a degradation selection strategy to propose a Real Embedding Extractor (REE), which achieves significant recognition performance gain on degraded image content through contrastive learning. Furthermore, we use a Conditional Feature Modulator (CFM) to incorporate the high-level information of REE for a powerful Mamba-based network, which can leverage effective pixel information to restore image textures and produce visually pleasing results. Extensive experiments demonstrate that the REE can effectively help image super-resolution networks balance fidelity and perceptual quality, highlighting the great potential of Mamba in real-world applications. The source code of this work will be made publicly available at: https://github.com/nathan66666/DACESR.git
Chinese Translation
多模态大型模型在利用语言类别作为条件信息的真实场景图像超分辨率任务中表现出色,但在处理降级图像方面的能力仍然有限。本文首先通过计算文本相似度重新审视了识别任何事物模型(Recognize Anything Model, RAM)在降级图像上的能力。我们发现,直接使用对比学习在降级空间微调RAM难以获得令人满意的结果。为了解决这个问题,我们采用了一种降级选择策略,提出了真实嵌入提取器(Real Embedding Extractor, REE),该方法通过对比学习在降级图像内容上实现了显著的识别性能提升。此外,我们使用条件特征调制器(Conditional Feature Modulator, CFM)将REE的高级信息融入强大的基于Mamba的网络中,从而利用有效的像素信息恢复图像纹理,并产生视觉上令人愉悦的结果。大量实验表明,REE能够有效帮助图像超分辨率网络在保真度和感知质量之间取得平衡,突显了Mamba在真实世界应用中的巨大潜力。本研究的源代码将公开发布于:https://github.com/nathan66666/DACESR.git
cs.CV / 41 / 2602.23893
AoE: Always-on Egocentric Human Video Collection for Embodied AI
AoE:始终在线的自我中心人类视频收集系统用于具身人工智能
Abstract
Embodied foundation models require large-scale, high-quality real-world interaction data for pre-training and scaling. However, existing data collection methods suffer from high infrastructure costs, complex hardware dependencies, and limited interaction scope, making scalable expansion challenging. In fact, humans themselves are ideal physically embodied agents. Therefore, obtaining egocentric real-world interaction data from globally distributed "human agents" offers advantages of low cost and sustainability. To this end, we propose the Always-on Egocentric (AoE) data collection system, which aims to simplify hardware dependencies by leveraging humans themselves and their smartphones, enabling low-cost, highly efficient, and scene-agnostic real-world interaction data collection to address the challenge of data scarcity. Specifically, we first employ an ergonomic neck-mounted smartphone holder to enable low-barrier, large-scale egocentric data collection through a cloud-edge collaborative architecture. Second, we develop a cross-platform mobile APP that leverages on-device compute for real-time processing, while the cloud hosts automated labeling and filtering pipelines that transform raw videos into high-quality training data. Finally, the AoE system supports distributed Ego video data collection by anyone, anytime, and anywhere. We evaluate AoE on data preprocessing quality and downstream tasks, demonstrating that high-quality egocentric data significantly boosts real-world generalization.
Chinese Translation
具身基础模型需要大规模、高质量的真实世界交互数据进行预训练和扩展。然而,现有的数据收集方法面临着高基础设施成本、复杂的硬件依赖性和有限的交互范围,使得可扩展性扩展变得具有挑战性。实际上,人类本身就是理想的具身代理。因此,从全球分布的“人类代理”获取自我中心的真实世界交互数据具有低成本和可持续性的优势。为此,我们提出了始终在线自我中心(AoE)数据收集系统,旨在通过利用人类及其智能手机简化硬件依赖性,从而实现低成本、高效率和场景无关的真实世界交互数据收集,以应对数据稀缺的挑战。具体而言,我们首先采用符合人体工程学的颈部挂载智能手机支架,通过云边协作架构实现低门槛、大规模的自我中心数据收集。其次,我们开发了一款跨平台的移动应用程序,利用设备内计算进行实时处理,而云端则托管自动标注和过滤管道,将原始视频转换为高质量的训练数据。最后,AoE系统支持任何人、在任何时间和地点进行分布式自我视频数据收集。我们在数据预处理质量和下游任务上评估了AoE,结果表明高质量的自我中心数据显著提升了真实世界的泛化能力。
cs.CV / 42 / 2602.23894
SelfOccFlow: Towards end-to-end self-supervised 3D Occupancy Flow prediction
SelfOccFlow:面向端到端自监督的3D占用流预测
Abstract
Estimating 3D occupancy and motion at the vehicle's surroundings is essential for autonomous driving, enabling situational awareness in dynamic environments. Existing approaches jointly learn geometry and motion but rely on expensive 3D occupancy and flow annotations, velocity labels from bounding boxes, or pretrained optical flow models. We propose a self-supervised method for 3D occupancy flow estimation that eliminates the need for human-produced annotations or external flow supervision. Our method disentangles the scene into separate static and dynamic signed distance fields and learns motion implicitly through temporal aggregation. Additionally, we introduce a strong self-supervised flow cue derived from features' cosine similarities. We demonstrate the efficacy of our 3D occupancy flow method on SemanticKITTI, KITTI-MOT, and nuScenes.
Chinese Translation
估计车辆周围的3D占用和运动对于自动驾驶至关重要,它能够在动态环境中实现情境感知。现有的方法共同学习几何和运动,但依赖于昂贵的3D占用和流动注释、来自边界框的速度标签或预训练的光流模型。我们提出了一种自监督的3D占用流估计方法,消除了对人工生成注释或外部流动监督的需求。我们的方法将场景解耦为单独的静态和动态有符号距离场,并通过时间聚合隐式学习运动。此外,我们引入了一种强大的自监督流动线索,该线索源自特征的余弦相似性。我们在SemanticKITTI、KITTI-MOT和nuScenes上展示了我们3D占用流方法的有效性。
cs.CV / 43 / 2602.23898
Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
Ref-Adv:探索多模态大语言模型在指称表达任务中的视觉推理
Abstract
Referring Expression Comprehension (REC) links language to region level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word order perturbations and descriptor deletion sufficiency) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.
Chinese Translation
指称表达理解(REC)将语言与区域级视觉感知联系起来。标准基准(RefCOCO、RefCOCO+、RefCOCOg)在多模态大语言模型的推动下迅速发展,但仍然是视觉推理和基础测试的薄弱环节:(i)许多表达非常简短,几乎没有推理需求;(ii)图像中通常包含很少的干扰项,使目标易于找到;(iii)冗余描述符使得可以通过捷径解决方案绕过真正的文本理解和视觉推理。我们引入了Ref-Adv,一个现代的REC基准,通过将语言上非平凡的表达与唯一识别目标所需的信息配对,抑制了捷径。该数据集包含在真实图像上的指称表达,经过精心策划,配有困难的干扰项,并注释了包括否定在内的推理方面。我们进行了全面的消融实验(词序扰动和描述符删除充分性),以表明解决Ref-Adv需要超越简单线索的推理,并对一系列当代多模态大语言模型在Ref-Adv上的表现进行了评估。尽管在RefCOCO、RefCOCO+和RefCOCOg上取得了良好的结果,但模型在Ref-Adv上的表现显著下降,揭示了对捷径的依赖以及视觉推理和基础方面的缺陷。我们提供了深入的失败分析,并希望Ref-Adv能够指导未来在多模态大语言模型中的视觉推理和基础研究。
cs.CV / 44 / 2602.23899
Experience-Guided Self-Adaptive Cascaded Agents for Breast Cancer Screening and Diagnosis with Reduced Biopsy Referrals
经验指导的自适应级联智能体用于乳腺癌筛查和诊断,减少活检转诊
Abstract
We propose an experience-guided cascaded multi-agent framework for Breast Ultrasound Screening and Diagnosis, called BUSD-Agent, that aims to reduce diagnostic escalation and unnecessary biopsy referrals. Our framework models screening and diagnosis as a two-stage, selective decision-making process. A lightweight `screening clinic' agent, restricted to classification models as tools, selectively filters out benign and normal cases from further diagnostic escalation when malignancy risk and uncertainty are estimated as low. Cases that have higher risks are escalated to the `diagnostic clinic' agent, which integrates richer perception and radiological description tools to make a secondary decision on biopsy referral. To improve agent performance, past records of pathology-confirmed outcomes along with image embeddings, model predictions, and historical agent actions are stored in a memory bank as structured decision trajectories. For each new case, BUSD-Agent retrieves similar past cases based on image, model response and confidence similarity to condition the agent's current decision policy. This enables retrieval-conditioned in-context adaptation that dynamically adjusts model trust and escalation thresholds from prior experiences without parameter updates. Evaluation across 10 breast ultrasound datasets shows that the proposed experience-guided workflow reduces diagnostic escalation in BUSD-Agent from 84.95% to 58.72% and overall biopsy referrals from 59.50% to 37.08%, compared to the same architecture without trajectory conditioning, while improving average screening specificity by 68.48% and diagnostic specificity by 6.33%.
Chinese Translation
我们提出了一种经验指导的级联多智能体框架,用于乳腺超声筛查和诊断,称为 BUSD-Agent,旨在减少诊断升级和不必要的活检转诊。我们的框架将筛查和诊断建模为一个两阶段的选择性决策过程。一个轻量级的“筛查诊所”智能体,仅限于分类模型作为工具,当恶性风险和不确定性被估计为低时,选择性地过滤掉良性和正常病例以避免进一步的诊断升级。风险较高的病例则被升级到“诊断诊所”智能体,该智能体整合了更丰富的感知和放射学描述工具,以对活检转诊做出二次决策。为了提高智能体的性能,过去经过病理确认的结果记录、图像嵌入、模型预测和历史智能体行动被存储在一个记忆库中,作为结构化决策轨迹。对于每一个新病例,BUSD-Agent 根据图像、模型响应和置信度的相似性检索类似的过去病例,以调整智能体当前的决策策略。这使得检索条件下的上下文适应能够动态调整模型的信任度和升级阈值,而无需更新参数。在10个乳腺超声数据集上的评估显示,与没有轨迹条件的相同架构相比,所提出的经验指导工作流程使 BUSD-Agent 的诊断升级率从 84.95% 降低到 58.72%,整体活检转诊率从 59.50% 降低到 37.08%,同时平均筛查特异性提高了 68.48%,诊断特异性提高了 6.33%。
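The memory-bank retrieval that conditions BUSD-Agent's decisions can be sketched as plain cosine-similarity nearest neighbors over stored trajectories. The record fields below are invented placeholders, not the paper's schema:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb + 1e-12)

def retrieve(memory, query_emb, k=3):
    """Return the k stored pathology-confirmed trajectories whose
    image embeddings are most similar to the new case; these condition
    the agent's current decision policy in-context."""
    return sorted(memory,
                  key=lambda rec: cosine(rec["embedding"], query_emb),
                  reverse=True)[:k]
```

A fuller version would combine image, model-response, and confidence similarity as the abstract describes; this sketch shows only the embedding term.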
cs.CV / 45 / 2602.23903
SegMate: Asymmetric Attention-Based Lightweight Architecture for Efficient Multi-Organ Segmentation
SegMate:基于非对称注意力的轻量级架构用于高效多脏器分割
Abstract
State-of-the-art models for medical image segmentation achieve excellent accuracy but require substantial computational resources, limiting deployment in resource-constrained clinical settings. We present SegMate, an efficient 2.5D framework that achieves state-of-the-art accuracy, while considerably reducing computational requirements. Our efficient design is the result of meticulously integrating asymmetric architectures, attention mechanisms, multi-scale feature fusion, slice-based positional conditioning, and multi-task optimization. We demonstrate the efficiency-accuracy trade-off of our framework across three modern backbones (EfficientNetV2-M, MambaOut-Tiny, FastViT-T12). We perform experiments on three datasets: TotalSegmentator, SegTHOR and AMOS22. Compared with the vanilla models, SegMate reduces computation (GFLOPs) by up to 2.5x and memory footprint (VRAM) by up to 2.1x, while generally registering performance gains of around 1%. On TotalSegmentator, we achieve a Dice score of 93.51% with only 295MB peak GPU memory. Zero-shot cross-dataset evaluations on SegTHOR and AMOS22 demonstrate strong generalization, with Dice scores of up to 86.85% and 89.35%, respectively. We release our open-source code at https://github.com/andreibunea99/SegMate.
Chinese Translation
最先进的医学图像分割模型在准确性方面表现优异,但需要大量计算资源,这限制了其在资源受限的临床环境中的应用。我们提出了SegMate,这是一种高效的2.5D框架,能够实现最先进的准确性,同时显著降低计算需求。我们的高效设计是通过精心整合非对称架构、注意力机制、多尺度特征融合、基于切片的位置条件和多任务优化而实现的。我们展示了该框架在三个现代骨干网络(EfficientNetV2-M、MambaOut-Tiny、FastViT-T12)上的效率与准确性之间的权衡。我们在三个数据集(TotalSegmentator、SegTHOR和AMOS22)上进行了实验。与原始模型相比,SegMate的计算量(GFLOPs)减少了最多2.5倍,内存占用(VRAM)减少了最多2.1倍,同时通常实现了约1%的性能提升。在TotalSegmentator上,我们的Dice得分达到了93.51%,仅需295MB的峰值GPU内存。在SegTHOR和AMOS22上的零样本跨数据集评估显示出强大的泛化能力,Dice得分分别高达86.85%和89.35%。我们在https://github.com/andreibunea99/SegMate上发布了我们的开源代码。
cs.CV / 46 / 2602.23906
Half-Truths Break Similarity-Based Retrieval
半真相破坏基于相似性的检索
Abstract
When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy to 69.3% and improves average performance on established compositional benchmarks by 5.7 points, suggesting that reducing half-truth errors aligns with broader gains in compositional understanding. Code is publicly available at: https://github.com/kargibora/CS-CLIP
Chinese Translation
当文本描述增加了额外细节时,如果该细节是错误的,图像-文本相似度应该下降。我们展示了 CLIP 风格的双编码器常常违反这一直觉:将一个看似合理但不正确的对象或关系附加到一个原本正确的描述上,可能会提高相似度分数。我们将这种情况称为半真相。在 COCO 数据集上,CLIP 仅在 40.6% 的情况下偏好正确的较短描述,当添加的细节是关系时,性能下降至 32.9%。我们将这种脆弱性追溯到对标题部分的弱监督:对比训练对齐完整句子,但并未明确强制要求个体实体和关系是有依据的。我们提出了 CS-CLIP(组件监督 CLIP),该方法将标题分解为实体和关系单元,为每个单元构建一个最小编辑的对照,并微调模型以使正确单元的得分高于其对照,同时保持标准的双编码器推理。CS-CLIP 将半真相准确率提高至 69.3%,并在既定的组合基准上平均提升了 5.7 分,表明减少半真相错误与组合理解的更广泛提升相一致。代码已公开发布于:https://github.com/kargibora/CS-CLIP
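The component-level supervision in CS-CLIP amounts to scoring each caption unit above its minimally edited foil. A minimal hinge-style sketch follows; the margin value and function names are assumptions, not the paper's exact objective:

```python
def component_margin_loss(sim_correct, sim_foil, margin=0.2):
    """Hinge loss: push the image-text similarity of the correct
    caption unit above its minimally edited foil by at least `margin`."""
    return max(0.0, margin - (sim_correct - sim_foil))

def is_half_truth(sim_base, sim_extended, detail_correct):
    """Flag the failure mode studied above: an *incorrect* appended
    detail should lower similarity, so a rise signals a half-truth."""
    return (not detail_correct) and sim_extended > sim_base
```

The loss is zero once the correct unit clears the foil by the margin, so fine-tuning leaves already-grounded units alone while penalizing half-truth-prone ones.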
cs.CV / 47 / 2602.23916
The Geometry of Transfer: Unlocking Medical Vision Manifolds for Training-Free Model Ranking
转移的几何学:解锁医学视觉流形以实现无训练模型排名
Abstract
The advent of large-scale self-supervised learning (SSL) has produced a vast zoo of medical foundation models. However, selecting optimal medical foundation models for specific segmentation tasks remains a computational bottleneck. Existing Transferability Estimation (TE) metrics, primarily designed for classification, rely on global statistical assumptions and fail to capture the topological complexity essential for dense prediction. We propose a novel Topology-Driven Transferability Estimation framework that evaluates manifold tractability rather than statistical overlap. Our approach introduces three components: (1) Global Representation Topology Divergence (GRTD), utilizing Minimum Spanning Trees to quantify feature-label structural isomorphism; (2) Local Boundary-Aware Topological Consistency (LBTC), which assesses manifold separability specifically at critical anatomical boundaries; and (3) Task-Adaptive Fusion, which dynamically integrates global and local metrics based on the semantic cardinality of the target task. Validated on the large-scale OpenMind benchmark across diverse anatomical targets and SSL foundation models, our approach significantly outperforms state-of-the-art baselines by around 31% relative improvement in the weighted Kendall correlation, providing a robust, training-free proxy for efficient model selection without the cost of fine-tuning. The code will be made publicly available upon acceptance.
Chinese Translation
大规模自监督学习(SSL)的出现产生了大量医学基础模型。然而,为特定分割任务选择最佳医学基础模型仍然是一个计算瓶颈。现有的可转移性估计(TE)指标主要针对分类任务设计,依赖于全局统计假设,未能捕捉到密集预测所需的拓扑复杂性。我们提出了一种新颖的基于拓扑的可转移性估计框架,该框架评估流形的可处理性,而非统计重叠。我们的方法引入了三个组成部分:(1)全局表示拓扑差异(GRTD),利用最小生成树量化特征-标签结构同构;(2)局部边界感知拓扑一致性(LBTC),专门在关键解剖边界评估流形的可分离性;(3)任务自适应融合,根据目标任务的语义基数动态整合全局和局部指标。在大规模OpenMind基准测试中,我们的方法在不同解剖目标和SSL基础模型上进行了验证,显著超越了最先进的基线,权重肯德尔指标相对提高约31%,提供了一种稳健的、无训练的有效模型选择代理,而无需微调。代码将在接受后公开发布。
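The GRTD component above summarizes geometry with Minimum Spanning Trees rather than global statistics. A dependency-free sketch of the underlying idea, comparing MST weights of the feature-space and label-space distance structures (the specific divergence and toy data are illustrative assumptions, not the paper's metric):

```python
import numpy as np

def mst_weight(dist):
    """Total weight of a minimum spanning tree (Prim's algorithm) over a
    dense distance matrix -- a topological summary of a point cloud."""
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()          # cheapest edge into the tree per node
    total = 0.0
    for _ in range(n - 1):
        best[in_tree] = np.inf     # never re-add tree members
        j = int(np.argmin(best))
        total += best[j]
        in_tree[j] = True
        best = np.minimum(best, dist[j])
    return total

# Hypothetical divergence: compare the MST weights of feature-space and
# label-space geometries (a crude stand-in for GRTD).
rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 4))
dist_f = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
labels = rng.integers(0, 2, size=10).astype(float)
dist_y = np.abs(labels[:, None] - labels[None, :])
divergence = abs(mst_weight(dist_f) - mst_weight(dist_y))
```

A low divergence would indicate that feature and label geometries share structure, which is the kind of signal a training-free ranking can exploit.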
cs.CV / 48 / 2602.23926
Leveraging Geometric Prior Uncertainty and Complementary Constraints for High-Fidelity Neural Indoor Surface Reconstruction
利用几何先验不确定性和互补约束进行高保真神经室内表面重建
Abstract
Neural implicit surface reconstruction with signed distance function has made significant progress, but recovering fine details such as thin structures and complex geometries remains challenging due to unreliable or noisy geometric priors. Existing approaches rely on implicit uncertainty that arises during optimization to filter these priors, which is indirect and inefficient, and masking supervision in high-uncertainty regions further leads to under-constrained optimization. To address these issues, we propose GPU-SDF, a neural implicit framework for indoor surface reconstruction that leverages geometric prior uncertainty and complementary constraints. We introduce a self-supervised module that explicitly estimates prior uncertainty without auxiliary networks. Based on this estimation, we design an uncertainty-guided loss that modulates prior influence rather than discarding it, thereby retaining weak but informative cues. To address regions with high prior uncertainty, GPU-SDF further incorporates two complementary constraints: an edge distance field that strengthens boundary supervision and a multi-view consistency regularization that enforces geometric coherence. Extensive experiments confirm that GPU-SDF improves the reconstruction of fine details and serves as a plug-and-play enhancement for existing frameworks. Source code will be available at https://github.com/IRMVLab/GPU-SDF
Chinese Translation
基于带符号距离函数的神经隐式表面重建取得了显著进展,但由于几何先验的不可靠或噪声,恢复细节如薄结构和复杂几何形状仍然具有挑战性。现有方法依赖于在优化过程中产生的隐式不确定性来过滤这些先验,这种方法间接且效率低下,而在高不确定性区域的掩蔽监督进一步导致了欠约束优化。为了解决这些问题,我们提出了GPU-SDF,一个用于室内表面重建的神经隐式框架,利用几何先验不确定性和互补约束。我们引入了一个自监督模块,明确估计先验不确定性而无需辅助网络。基于这一估计,我们设计了一种不确定性引导损失,调节先验影响而不是简单丢弃,从而保留弱但有信息量的线索。为了解决高先验不确定性区域,GPU-SDF进一步结合了两个互补约束:一个边缘距离场以增强边界监督,以及一个多视图一致性正则化以强制几何一致性。大量实验确认GPU-SDF改善了细节重建,并作为现有框架的即插即用增强。源代码将发布在 https://github.com/IRMVLab/GPU-SDF
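The uncertainty-guided loss above modulates the influence of a geometric prior instead of masking it out. A minimal sketch of one such weighting scheme (the exponential down-weighting and the regularizer coefficient are illustrative assumptions, not GPU-SDF's exact formulation):

```python
import numpy as np

def uncertainty_guided_loss(pred, prior, u, lam=0.1):
    """Down-weight (rather than discard) unreliable geometric priors:
    residuals are scaled by exp(-u), so weak-but-informative cues are
    retained; the lam*u term discourages inflating u to ignore all
    priors (illustrative sketch)."""
    w = np.exp(-u)
    return float(np.mean(w * np.abs(pred - prior) + lam * u))

pred = np.array([0.1, 0.5, -0.2])
prior = np.array([0.1, 0.9, -0.2])   # the middle prior is noisy
u = np.array([0.0, 2.0, 0.0])        # high estimated uncertainty there
loss = uncertainty_guided_loss(pred, prior, u)
```

With zero uncertainty everywhere the loss reduces to a plain L1 prior term; raising `u` on the noisy entry shrinks its contribution without zeroing it.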
cs.CV / 49 / 2602.23945
PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning
PointCoT:显式三维几何推理的多模态基准
Abstract
While Multimodal Large Language Models (MLLMs) demonstrate proficiency in 2D scenes, extending their perceptual intelligence to 3D point cloud understanding remains a significant challenge. Current approaches focus primarily on aligning 3D features with pre-trained models. However, they typically treat geometric reasoning as an implicit mapping process. These methods bypass intermediate logical steps and consequently suffer from geometric hallucinations. They confidently generate plausible responses that are not grounded in precise structural details. To bridge this gap, we present PointCoT, a novel framework that empowers MLLMs with explicit Chain-of-Thought (CoT) reasoning for 3D data. We advocate for a \textit{Look, Think, then Answer} paradigm. In this approach, the model is supervised to generate geometry-grounded rationales before predicting final answers. To facilitate this, we construct Point-Reason-Instruct, a large-scale benchmark comprising $\sim$86k instruction-tuning samples with hierarchical CoT annotations. By leveraging a dual-stream multi-modal architecture, our method synergizes semantic appearance with geometric truth. Extensive experiments demonstrate that PointCoT achieves state-of-the-art performance on complex reasoning tasks.

Chinese Translation
尽管多模态大型语言模型(MLLMs)在二维场景中表现出色,但将其感知智能扩展到三维点云理解仍然是一个重大挑战。目前的方法主要集中在将三维特征与预训练模型对齐。然而,它们通常将几何推理视为一种隐式映射过程。这些方法绕过了中间逻辑步骤,因此容易出现几何幻觉。它们自信地生成看似合理的响应,但未能准确反映结构细节。为了解决这一问题,我们提出了PointCoT,这是一种新颖的框架,赋予MLLMs在三维数据上进行显式链式思维(Chain-of-Thought, CoT)推理的能力。我们倡导一种“观察、思考,然后回答”的范式。在这种方法中,模型在预测最终答案之前被监督生成基于几何的推理。为此,我们构建了Point-Reason-Instruct,这是一个包含约86,000个指令调优样本及其层次化CoT注释的大规模基准。通过利用双流多模态架构,我们的方法将语义外观与几何真相相结合。大量实验表明,PointCoT在复杂推理任务上达到了最先进的性能。
cs.CV / 50 / 2602.23950
Micro-expression Recognition Based on Dual-branch Feature Extraction and Fusion
基于双分支特征提取与融合的微表情识别
Abstract
Micro-expressions, characterized by transience and subtlety, pose challenges to existing optical flow-based recognition methods. To address this, this paper proposes a dual-branch micro-expression feature extraction network integrated with parallel attention. Key contributions include: 1) a residual network designed to alleviate gradient vanishing and network degradation; 2) an Inception network constructed to enhance model representation and suppress interference from irrelevant regions; 3) an adaptive feature fusion module developed to integrate dual-branch features. Experiments on the CASME II dataset demonstrate that the proposed method achieves 74.67% accuracy, outperforming LBP-TOP (by 11.26%), MSMMT (by 3.36%), and other comparative methods.
Chinese Translation
微表情具有瞬时性和细微性的特点,对现有的基于光流的识别方法提出了挑战。为此,本文提出了一种集成并行注意力的双分支微表情特征提取网络。主要贡献包括:1)设计了一种残差网络,以缓解梯度消失和网络退化;2)构建了一种Inception网络,以增强模型表示能力并抑制来自无关区域的干扰;3)开发了一种自适应特征融合模块,以整合双分支特征。在CASME II数据集上的实验表明,所提方法的准确率达到74.67%,优于LBP-TOP(提高了11.26%)、MSMMT(提高了3.36%)及其他对比方法。
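The adaptive fusion module above integrates residual-branch and Inception-branch features. A minimal sketch of one common gating design (the scalar gate and its weight vector are illustrative assumptions; in the paper the fusion parameters would be learned end-to-end):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fusion(f_res, f_inc, w):
    """Adaptive fusion of residual- and Inception-branch features: a
    learned scalar gate decides, per sample, how much each branch
    contributes (minimal sketch; `w` would be trained)."""
    gate = sigmoid(np.concatenate([f_res, f_inc]) @ w)   # in (0, 1)
    return gate * f_res + (1.0 - gate) * f_inc

rng = np.random.default_rng(0)
f_res, f_inc = rng.normal(size=4), rng.normal(size=4)
w = rng.normal(size=8)
fused = adaptive_fusion(f_res, f_inc, w)
```

Because the gate lies in (0, 1), the fused vector is a convex combination of the two branches, so neither branch can be silently discarded.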
cs.CV / 51 / 2602.23951
AHAP: Reconstructing Arbitrary Humans from Arbitrary Perspectives with Geometric Priors
AHAP:利用几何先验从任意视角重建任意人类
Abstract
Reconstructing 3D humans from images captured at multiple perspectives typically requires pre-calibration, like using checkerboards or MVS algorithms, which limits scalability and applicability in diverse real-world scenarios. In this work, we present \textbf{AHAP} (Reconstructing \textbf{A}rbitrary \textbf{H}umans from \textbf{A}rbitrary \textbf{P}erspectives), a feed-forward framework for reconstructing arbitrary humans from arbitrary camera perspectives without requiring camera calibration. Our core idea lies in the effective fusion of multi-view geometry to assist human association, reconstruction, and localization. Specifically, we introduce a Cross-View Identity Association module that resolves cross-view human identity association through learnable person queries and soft assignment, supervised by contrastive learning. A Human Head module fuses cross-view features and scene context for SMPL prediction, guided by cross-view reprojection losses to enforce body pose consistency. Additionally, multi-view geometry eliminates the depth ambiguity inherent in monocular methods, providing more precise 3D human localization through multi-view triangulation. Experiments on EgoHumans and EgoExo4D demonstrate that AHAP achieves competitive performance on both world-space human reconstruction and camera pose estimation, while being 180$\times$ faster than optimization-based approaches.
Chinese Translation
从多个视角捕获的图像中重建三维人类通常需要预先校准,例如使用棋盘格或多视图立体(MVS)算法,这限制了其在多样化现实场景中的可扩展性和适用性。在本研究中,我们提出了AHAP(Reconstructing Arbitrary Humans from Arbitrary Perspectives,从任意视角重建任意人类),这是一个前馈框架,用于在不需要相机校准的情况下从任意相机视角重建任意人类。我们的核心在于有效融合多视图几何,以辅助人类的关联、重建和定位。具体而言,我们使用一个跨视图身份关联模块,通过可学习的人物查询和软分配,借助对比学习监督来解决跨视图人类身份关联问题。一个人头模块融合跨视图特征和场景上下文以进行SMPL预测,并通过跨视图重投影损失来指导,以确保身体姿态的一致性。此外,多视图几何消除了单目方法固有的深度模糊性,通过多视图三角测量提供更精确的三维人类定位。在EgoHumans和EgoExo4D上的实验表明,AHAP在世界空间人类重建和相机姿态估计方面实现了竞争性性能,同时比基于优化的方法快180倍。
cs.CV / 52 / 2602.23952
CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering
CC-VQA:一种关注冲突和相关性的方法,用于缓解知识基础视觉问答中的知识冲突
Abstract
Knowledge-based visual question answering (KB-VQA) demonstrates significant potential for handling knowledge-intensive tasks. However, conflicts arise between the static parametric knowledge that vision language models (VLMs) acquire during pre-training and dynamically retrieved information. The outputs either ignore retrieved contexts or exhibit inconsistent integration with parametric knowledge, posing substantial challenges for KB-VQA. Current knowledge conflict mitigation methods are primarily adapted from language-based approaches, focusing on context-level conflicts through engineered prompting strategies or context-aware decoding mechanisms. However, these methods neglect the critical role of visual information in conflicts and suffer from redundant retrieved contexts, which impair accurate conflict identification and effective mitigation. To address these limitations, we propose \textbf{CC-VQA}: a novel training-free, conflict- and correlation-aware method for KB-VQA. Our method comprises two core components: (1) Vision-Centric Contextual Conflict Reasoning, which performs visual-semantic conflict analysis across internal and external knowledge contexts; and (2) Correlation-Guided Encoding and Decoding, featuring positional encoding compression for low-correlation statements and adaptive decoding using correlation-weighted conflict scoring. Extensive evaluations on E-VQA, InfoSeek, and OK-VQA benchmarks demonstrate that CC-VQA achieves state-of-the-art performance, yielding absolute accuracy improvements of 3.3\% to 6.4\% compared to existing methods. Code is available at https://github.com/cqu-student/CC-VQA.
Chinese Translation
基于知识的视觉问答(KB-VQA)在处理知识密集型任务方面展现出显著潜力。然而,由于视觉语言模型(VLMs)中的静态参数知识与动态检索信息之间的冲突,导致了问题的产生。这些输出要么忽视检索到的上下文,要么与参数知识的整合不一致,从而对KB-VQA构成了重大挑战。目前的知识冲突缓解方法主要源自语言基础的方法,侧重于通过工程化提示策略或上下文感知解码机制来解决上下文级别的冲突。然而,这些方法忽视了视觉信息在冲突中的关键作用,并且存在冗余检索上下文的问题,这妨碍了准确的冲突识别和有效的缓解。为了解决这些局限性,我们提出了CC-VQA:一种新颖的无训练、关注冲突和相关性的方法,用于KB-VQA。我们的方法包含两个核心组件:(1)以视觉为中心的上下文冲突推理,执行内部和外部知识上下文之间的视觉-语义冲突分析;(2)基于相关性的编码和解码,采用位置编码压缩低相关性语句,并使用相关性加权的冲突评分进行自适应解码。在E-VQA、InfoSeek和OK-VQA基准上的广泛评估表明,CC-VQA实现了最先进的性能,与现有方法相比,绝对准确率提高了3.3%至6.4%。代码可在 https://github.com/cqu-student/CC-VQA 获取。
cs.CV / 53 / 2602.23953
GDA-YOLO11: Amodal Instance Segmentation for Occlusion-Robust Robotic Fruit Harvesting
GDA-YOLO11:用于抗遮挡的机器人水果采摘的非模态实例分割
Abstract
Occlusion remains a critical challenge in robotic fruit harvesting, as undetected or inaccurately localised fruits often result in substantial crop losses. To mitigate this issue, we propose a harvesting framework using a new amodal segmentation model, GDA-YOLO11, which incorporates architectural improvements and an updated asymmetric mask loss. The proposed model is trained on a modified version of a public citrus dataset and evaluated on both the base dataset and occlusion-sensitive subsets with varying occlusion levels. Within the framework, full fruit masks, including invisible regions, are inferred by GDA-YOLO11, and picking points are subsequently estimated using the Euclidean distance transform. These points are then projected into 3D coordinates for robotic harvesting execution. Experiments were conducted using real citrus fruits in a controlled environment simulating occlusion scenarios. Notably, to the best of our knowledge, this study provides the first practical demonstration of amodal instance segmentation in robotic fruit harvesting. GDA-YOLO11 achieves a precision of 0.844, recall of 0.846, mAP@50 of 0.914, and mAP@50:95 of 0.636, outperforming YOLO11n by 5.1%, 1.3%, and 1.0% in precision, mAP@50, and mAP@50:95, respectively. The framework attains harvesting success rates of 92.59%, 85.18%, 48.14%, and 22.22% at zero to high occlusion levels, improving success by 3.5% under medium and high occlusion. These findings demonstrate that GDA-YOLO11 enhances occlusion robust segmentation and streamlines perception-to-action integration, paving the way for more reliable autonomous systems in agriculture.
Chinese Translation
遮挡仍然是机器人水果采摘中的一个关键挑战,因为未检测到或定位不准确的水果往往会导致显著的作物损失。为了解决这一问题,我们提出了一种新的采摘框架,采用了新的非模态分割模型GDA-YOLO11,该模型结合了架构改进和更新的非对称掩膜损失。所提模型在修改后的公共柑橘数据集上进行训练,并在基础数据集及具有不同遮挡水平的敏感子集上进行评估。在该框架内,GDA-YOLO11推断出完整的水果掩膜,包括不可见区域,随后利用欧几里得距离变换估计采摘点。这些点随后被投影到三维坐标中,以执行机器人采摘。实验在受控环境中使用真实的柑橘水果进行,模拟遮挡场景。值得注意的是,尽我们所知,本研究首次提供了机器人水果采摘中非模态实例分割的实际演示。GDA-YOLO11的精度为0.844,召回率为0.846,mAP@50为0.914,mAP@50:95为0.636,分别比YOLO11n在精度、mAP@50和mAP@50:95上提高了5.1%、1.3%和1.0%。该框架在从零到高遮挡水平下的采摘成功率分别为92.59%、85.18%、48.14%和22.22%,在中等和高遮挡下成功率提高了3.5%。这些发现表明,GDA-YOLO11增强了抗遮挡分割能力,并简化了感知到行动的整合,为农业中更可靠的自主系统铺平了道路。
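The picking-point step above applies a Euclidean distance transform to the amodal mask and selects its deepest interior point. A dependency-free sketch of that step (in practice `scipy.ndimage.distance_transform_edt` would replace the brute-force distance computation; the toy mask is an illustrative assumption):

```python
import numpy as np

def picking_point(mask):
    """Deepest interior point of a binary fruit mask via a brute-force
    Euclidean distance transform: for each foreground pixel, compute the
    distance to the nearest background pixel, then take the argmax."""
    fg = np.argwhere(mask)
    bg = np.argwhere(~mask)
    d = np.linalg.norm(fg[:, None, :] - bg[None, :, :], axis=-1).min(axis=1)
    return tuple(fg[int(np.argmax(d))])

mask = np.zeros((7, 7), dtype=bool)
mask[1:6, 1:6] = True             # 5x5 "fruit" blob, borders are background
point = picking_point(mask)       # deepest point of the blob
```

The resulting 2D point would then be back-projected to 3D camera coordinates for the grasp, as the framework describes.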
cs.CV / 54 / 2602.23956
SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
SwitchCraft:无训练的多事件视频生成与注意力控制
Abstract
Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts without explicit temporal grounding, such models often produce blended or collapsed scenes that break the intended narrative. To address this limitation, we present SwitchCraft, a training-free framework for multi-event video generation. Our key insight is that uniform prompt injection across time ignores the correspondence between events and frames. To this end, we introduce Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with relevant event prompts. Furthermore, we propose Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity. Extensive experiments demonstrate that SwitchCraft substantially improves prompt alignment, event clarity, and scene consistency compared with existing baselines, offering a simple yet effective solution for multi-event video generation.
Chinese Translation
最近,文本到视频扩散模型的进展使得高保真和时间一致性的视频合成成为可能。然而,目前的模型主要针对单事件生成进行优化。在处理多事件提示时,缺乏明确的时间基础,这些模型往往会产生混合或崩溃的场景,破坏预期的叙事。为了解决这一局限性,我们提出了SwitchCraft,一个无训练的多事件视频生成框架。我们的关键见解是,跨时间的均匀提示注入忽视了事件与帧之间的对应关系。为此,我们引入了事件对齐查询引导(Event-Aligned Query Steering, EAQS),它引导帧级注意力与相关事件提示对齐。此外,我们提出了自适应平衡强度求解器(Auto-Balance Strength Solver, ABSS),它自适应地平衡引导强度,以保持时间一致性和视觉保真度。大量实验表明,与现有基线相比,SwitchCraft在提示对齐、事件清晰度和场景一致性方面显著提升,提供了一种简单而有效的多事件视频生成解决方案。
cs.CV / 55 / 2602.23959
Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought
以连续动作进行图像思考:数值视觉链思维
Abstract
Recent multimodal large language models (MLLMs) increasingly rely on visual chain-of-thought to perform region-grounded reasoning over images. However, existing approaches ground regions either via textified coordinates, which cause modality mismatch and semantic fragmentation, or via fixed-granularity patches, which limit precise region selection and often require non-trivial architectural changes. In this paper, we propose Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space, allowing models to directly generate bounding-box coordinates as actions with only minimal architectural modification. The framework supports both supervised fine-tuning and reinforcement learning. In particular, we replace categorical token policies with a Gaussian (or Laplace) policy over coordinates and introduce stochasticity via reparameterized sampling, making NV-CoT fully compatible with GRPO-style policy optimization. Extensive experiments on three benchmarks against eight representative visual reasoning baselines demonstrate that NV-CoT significantly improves localization precision and final answer accuracy, while also accelerating training convergence, validating the effectiveness of continuous-action visual reasoning in MLLMs. The code is available at https://github.com/kesenzhao/NV-CoT.
Chinese Translation
近期,多模态大型语言模型(MLLMs)越来越依赖视觉链思维在图像上进行区域基础推理。然而,现有的方法通过文本化坐标来确定区域,这导致了模态不匹配和语义碎片化,或者使用固定粒度的补丁,这两者都限制了精确的区域选择,并且通常需要非平凡的架构更改。本文提出了数值视觉链思维(Numerical Visual Chain-of-Thought, NV-CoT),这是一个使MLLMs能够使用连续数值坐标对图像进行推理的框架。NV-CoT将MLLM的动作空间从离散词汇标记扩展到连续的欧几里得空间,使模型能够直接生成边界框坐标作为动作,仅需最小的架构修改。该框架支持监督微调和强化学习。特别地,我们用高斯(或拉普拉斯)策略替代了分类标记策略,并通过重新参数化采样引入随机性,使NV-CoT与GRPO风格的策略优化完全兼容。在三个基准测试上对八个代表性的视觉推理基线进行的广泛实验表明,NV-CoT显著提高了定位精度和最终答案的准确性,同时加速了训练收敛,验证了在MLLMs中连续动作视觉推理的有效性。代码可在 https://github.com/kesenzhao/NV-CoT 获取。
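The Gaussian coordinate policy with reparameterized sampling described above can be sketched in a few lines: the action is a 4-D box drawn as `mu + exp(log_sigma) * eps`, and its log-density is what a GRPO-style objective would weight by advantages. The normalized box layout and parameter values are illustrative assumptions:

```python
import numpy as np

def sample_box(mu, log_sigma, rng):
    """Reparameterized draw of a 4-D box action from a Gaussian policy:
    the noise eps is sampled independently of the parameters, so the
    sample is differentiable w.r.t. mu and log_sigma (sketch)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(log_sigma) * eps

def gaussian_log_prob(x, mu, log_sigma):
    """Log-density of the action, summed over the 4 coordinates."""
    var = np.exp(2.0 * log_sigma)
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * var)
                        - (x - mu) ** 2 / (2.0 * var)))

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.3, 0.7, 0.7])   # assumed normalized x1, y1, x2, y2
log_sigma = np.full(4, -2.0)
box = sample_box(mu, log_sigma, rng)
lp = gaussian_log_prob(box, mu, log_sigma)
```

Because the density peaks at the mean, the log-probability of any sampled box never exceeds that of `mu` itself, which the policy-gradient update exploits.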
cs.CV / 56 / 2602.23963
SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking
SpikeTrack:一种基于脉冲驱动的高效视觉跟踪框架
Abstract
Spiking Neural Networks (SNNs) promise energy-efficient vision, but applying them to RGB visual tracking remains difficult: Existing SNN tracking frameworks either do not fully align with spike-driven computation or do not fully leverage neurons' spatiotemporal dynamics, leading to a trade-off between efficiency and accuracy. To address this, we introduce SpikeTrack, a spike-driven framework for energy-efficient RGB object tracking. SpikeTrack employs a novel asymmetric design that uses asymmetric timestep expansion and unidirectional information flow, harnessing spatiotemporal dynamics while cutting computation. To ensure effective unidirectional information transfer between branches, we design a memory-retrieval module inspired by neural inference mechanisms. This module recurrently queries a compact memory initialized by the template to retrieve target cues and sharpen target perception over time. Extensive experiments demonstrate that SpikeTrack achieves the state-of-the-art among SNN-based trackers and remains competitive with advanced ANN trackers. Notably, it surpasses TransT on LaSOT dataset while consuming only 1/26 of its energy. To our knowledge, SpikeTrack is the first spike-driven framework to make RGB tracking both accurate and energy efficient. The code and models are available at https://github.com/faicaiwawa/SpikeTrack.
Chinese Translation
脉冲神经网络(SNNs)承诺实现能效视觉,但将其应用于RGB视觉跟踪仍然困难:现有的SNN跟踪框架要么与脉冲驱动计算不完全对齐,要么未能充分利用神经元的时空动态,导致效率和准确性之间的权衡。为了解决这个问题,我们提出了SpikeTrack,一种用于能效RGB目标跟踪的脉冲驱动框架。SpikeTrack采用了一种新颖的非对称设计,利用非对称时间步扩展和单向信息流,充分利用时空动态的同时减少计算量。为了确保分支之间有效的单向信息传递,我们设计了一个受神经推理机制启发的记忆检索模块。该模块通过反复查询由模板初始化的紧凑记忆来检索目标线索,并随着时间的推移增强目标感知。大量实验表明,SpikeTrack在基于SNN的跟踪器中达到了最先进的水平,并且与先进的ANN跟踪器相比仍具竞争力。值得注意的是,它在LaSOT数据集上超越了TransT,同时仅消耗其1/26的能量。据我们所知,SpikeTrack是第一个使RGB跟踪同时兼具准确性和能效的脉冲驱动框架。代码和模型可在https://github.com/faicaiwawa/SpikeTrack获取。
cs.CV / 57 / 2602.23980
Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping
Venus:基准测试与增强多模态大型语言模型在美学指导与裁剪中的应用
Abstract
The widespread use of smartphones has made photography ubiquitous, yet a clear gap remains between ordinary users and professional photographers, who can identify aesthetic issues and provide actionable shooting guidance during capture. We define this capability as aesthetic guidance (AG) -- an essential but largely underexplored domain in computational aesthetics. Existing multimodal large language models (MLLMs) primarily offer overly positive feedback, failing to identify issues or provide actionable guidance. Without AG capability, they cannot effectively identify distracting regions or optimize compositional balance, thus also struggling in aesthetic cropping, which aims to refine photo composition through reframing after capture. To address this, we introduce AesGuide, the first large-scale AG dataset and benchmark with 10,748 photos annotated with aesthetic scores, analyses, and guidance. Building upon it, we propose Venus, a two-stage framework that first empowers MLLMs with AG capability through progressively complex aesthetic questions and then activates their aesthetic cropping power via CoT-based rationales. Extensive experiments show that Venus substantially improves AG capability and achieves state-of-the-art (SOTA) performance in aesthetic cropping, enabling interpretable and interactive aesthetic refinement across both stages of photo creation. Code is available at https://github.com/PKU-ICST-MIPL/Venus_CVPR2026.
Chinese Translation
智能手机的广泛使用使得摄影变得无处不在,但普通用户与专业摄影师之间仍然存在明显差距,后者能够识别美学问题并在拍摄过程中提供可操作的指导。我们将这种能力定义为美学指导(Aesthetic Guidance, AG)——这是计算美学中一个重要但尚未得到充分探索的领域。现有的多模态大型语言模型(Multimodal Large Language Models, MLLMs)主要提供过于积极的反馈,未能识别问题或提供可操作的指导。缺乏AG能力,它们无法有效识别干扰区域或优化构图平衡,因此在美学裁剪中也面临困难,而美学裁剪旨在通过重新构图来优化照片构图。为了解决这个问题,我们引入了AesGuide,这是第一个大规模的AG数据集和基准,包含10,748张带有美学评分、分析和指导的照片。在此基础上,我们提出了Venus,一个两阶段框架,首先通过逐步复杂的美学问题赋能MLLMs以获得AG能力,然后通过基于链式推理(Chain of Thought, CoT)的推理激活其美学裁剪能力。大量实验表明,Venus显著提高了AG能力,并在美学裁剪中达到了最新的(State-of-the-Art, SOTA)性能,实现了在照片创作的两个阶段中可解释和互动的美学优化。代码可在 https://github.com/PKU-ICST-MIPL/Venus_CVPR2026 获取。
cs.CV / 58 / 2602.23996
Accelerating Masked Image Generation by Learning Latent Controlled Dynamics
通过学习潜在控制动态加速掩蔽图像生成
Abstract
Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is hampered by the multiple steps of bi-directional attention. In fact, there exists notable redundancy in their computation: when sampling discrete tokens, the rich semantics contained in the continuous features are lost. Some existing works attempt to cache the features to approximate future features. However, they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and the failure to account for sampling information. To fill this gap, we propose to learn a lightweight model that incorporates both previous features and sampled tokens, and regresses the average velocity field of feature evolution. The model has moderate complexity that suffices to capture the subtle dynamics while keeping lightweight compared to the original base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4x acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation. The code and model weights are available at https://github.com/Kaiwen-Zhu/MIGM-Shortcut.
Chinese Translation
掩蔽图像生成模型(MIGMs)已取得显著成功,但其效率受到双向注意力多个步骤的制约。实际上,它们的计算存在显著冗余:在采样离散标记时,连续特征中丰富的语义信息会丢失。一些现有的研究尝试缓存特征以近似未来特征。然而,在激进的加速率下,它们表现出相当大的近似误差。我们将此归因于它们的表达能力有限以及未能考虑采样信息。为填补这一空白,我们提出了一种轻量级模型,该模型结合了先前的特征和采样标记,并回归特征演变的平均速度场。该模型具有适中的复杂性,足以捕捉微妙的动态,同时与原始基础模型相比保持轻量化。我们将我们的方法MIGM-Shortcut应用于两个代表性的MIGM架构和任务。特别是在最先进的Lumina-DiMOO上,它实现了文本到图像生成的4倍以上加速,同时保持质量,显著推动了掩蔽图像生成的Pareto前沿。代码和模型权重可在https://github.com/Kaiwen-Zhu/MIGM-Shortcut获取。
cs.CV / 59 / 2602.24013
Ordinal Diffusion Models for Color Fundus Images
用于彩色眼底图像的序数扩散模型
Abstract
It has been suggested that generative image models such as diffusion models can improve performance on clinically relevant tasks by offering deep learning models supplementary training data. However, most conditional diffusion models treat disease stages as independent classes, ignoring the continuous nature of disease progression. This mismatch is problematic in medical imaging because continuous pathological processes are typically only observed through coarse, discrete but ordered labels as in ophthalmology for diabetic retinopathy (DR). We propose an ordinal latent diffusion model for generating color fundus images that explicitly incorporates the ordered structure of DR severity into the generation process. Instead of categorical conditioning, we used a scalar disease representation, enabling a smooth transition between adjacent stages. We evaluated our approach using visual realism metrics and classification-based clinical consistency analysis on the EyePACS dataset. Compared to a standard conditional diffusion model, our model reduced the Fr\'echet inception distance for four of the five DR stages and increased the quadratic weighted $\kappa$ from 0.79 to 0.87. Furthermore, interpolation experiments showed that the model captured a continuous spectrum of disease progression learned from ordered, coarse class labels.
Chinese Translation
有研究表明,生成图像模型如扩散模型可以通过提供补充训练数据来提高深度学习模型在临床相关任务上的表现。然而,大多数条件扩散模型将疾病阶段视为独立类别,忽视了疾病进展的连续性。这种不匹配在医学成像中是一个问题,因为连续的病理过程通常只能通过粗略、离散但有序的标签来观察,例如在眼科中对糖尿病视网膜病变(DR)的分类。我们提出了一种序数潜在扩散模型,用于生成彩色眼底图像,该模型明确将DR严重程度的有序结构纳入生成过程。我们使用标量疾病表示代替分类条件,使得相邻阶段之间能够平滑过渡。我们在EyePACS数据集上使用视觉真实感指标和基于分类的临床一致性分析评估了我们的方法。与标准条件扩散模型相比,我们的模型在五个DR阶段中的四个阶段减少了Fréchet起始距离,并将二次加权κ从0.79提高到0.87。此外,插值实验表明,该模型捕捉到了从有序粗类标签学习到的疾病进展的连续光谱。
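The scalar disease conditioning described above replaces categorical class labels with a continuous severity value, so adjacent stages can be interpolated. A minimal sketch of such an encoding (the linear mapping to [0, 1] is an illustrative assumption, not the paper's exact parameterization):

```python
import numpy as np

def ordinal_condition(stage, n_stages=5):
    """Map a discrete DR stage (0..4) to a scalar in [0, 1]; fractional
    inputs interpolate smoothly between adjacent stages (sketch of the
    scalar-conditioning idea, not the paper's exact encoding)."""
    return float(np.clip(stage / (n_stages - 1), 0.0, 1.0))

# Adjacent stages are equidistant, and a midpoint lies between them,
# enabling the interpolation experiments described in the abstract.
c2, c3 = ordinal_condition(2), ordinal_condition(3)
c_mid = ordinal_condition(2.5)
```

Feeding `c_mid` to the conditioned denoiser would correspond to sampling an image between stage 2 and stage 3 severity.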
cs.CV / 60 / 2602.24014
Interpretable Debiasing of Vision-Language Models for Social Fairness
可解释的视觉-语言模型去偏见方法以促进社会公平
Abstract
The rapid advancement of Vision-Language models (VLMs) has raised growing concerns that their black-box reasoning processes could lead to unintended forms of social bias. Current debiasing approaches focus on mitigating surface-level bias signals through post-hoc learning or test-time algorithms, while leaving the internal dynamics of the model largely unexplored. In this work, we introduce an interpretable, model-agnostic bias mitigation framework, DeBiasLens, that localizes social attribute neurons in VLMs through sparse autoencoders (SAEs) applied to multimodal encoders. Building upon the disentanglement ability of SAEs, we train them on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics, including those that are underrepresented. By selectively deactivating the social neurons most strongly tied to bias for each group, we effectively mitigate socially biased behaviors of VLMs without degrading their semantic knowledge. Our research lays the groundwork for future auditing tools, prioritizing social fairness in emerging real-world AI systems.
Chinese Translation
视觉-语言模型(VLMs)的快速发展引发了人们对其黑箱推理过程可能导致意想不到的社会偏见形式的日益关注。目前的去偏见方法主要集中在通过事后学习或测试时算法来减轻表层偏见信号,而对模型的内部动态则几乎没有探索。在本研究中,我们提出了一种可解释的、与模型无关的偏见缓解框架DeBiasLens,该框架通过应用稀疏自编码器(SAEs)于多模态编码器来定位VLM中的社会属性神经元。基于SAEs的解耦能力,我们在没有对应社会属性标签的面部图像或标题数据集上训练它们,以揭示对特定人口统计特征高度敏感的神经元,包括那些代表性不足的群体。通过选择性地停用与每个群体偏见最紧密相关的社会神经元,我们有效地减轻了VLM的社会偏见行为,而不会降低其语义知识。我们的研究为未来的审计工具奠定了基础,优先考虑新兴现实世界人工智能系统中的社会公平。
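The debiasing step above localizes social-attribute neurons in a sparse autoencoder's latent space and selectively deactivates them before decoding. A minimal sketch of that ablation (the weights here are random stand-ins; a real SAE is trained on facial images or captions, and the neuron indices would come from the localization analysis):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sae_debias(h, W_enc, W_dec, b_enc, deactivate):
    """Encode an embedding with a sparse autoencoder, zero out the
    latent units most tied to a social attribute, and decode back
    (illustrative sketch with untrained weights)."""
    z = relu(h @ W_enc + b_enc)    # sparse latent code
    z[list(deactivate)] = 0.0      # ablate attribute neurons
    return z @ W_dec               # reconstructed, debiased embedding

rng = np.random.default_rng(0)
d, k = 8, 16
W_enc = rng.normal(size=(d, k))
W_dec = rng.normal(size=(k, d))
b_enc = np.zeros(k)
h = rng.normal(size=d)
h_debiased = sae_debias(h, W_enc, W_dec, b_enc, deactivate={3, 7})
```

Because only the targeted latent units are zeroed, the rest of the code, and hence most of the semantic content, passes through unchanged.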
cs.CV / 61 / 2602.24020
SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting
SR3R:重新思考基于前馈高斯点云的超分辨率三维重建
Abstract
3D super-resolution (3DSR) aims to reconstruct high-resolution (HR) 3D scenes from low-resolution (LR) multi-view images. Existing methods rely on dense LR inputs and per-scene optimization, which restricts the high-frequency priors for constructing HR 3D Gaussian Splatting (3DGS) to those inherited from pretrained 2D super-resolution (2DSR) models. This severely limits reconstruction fidelity, cross-scene generalization, and real-time usability. We propose to reformulate 3DSR as a direct feed-forward mapping from sparse LR views to HR 3DGS representations, enabling the model to autonomously learn 3D-specific high-frequency geometry and appearance from large-scale, multi-scene data. This fundamentally changes how 3DSR acquires high-frequency knowledge and enables robust generalization to unseen scenes. Specifically, we introduce SR3R, a feed-forward framework that directly predicts HR 3DGS representations from sparse LR views via the learned mapping network. To further enhance reconstruction fidelity, we introduce Gaussian offset learning and feature refinement, which stabilize reconstruction and sharpen high-frequency details. SR3R is plug-and-play and can be paired with any feed-forward 3DGS reconstruction backbone: the backbone provides an LR 3DGS scaffold, and SR3R upscales it to an HR 3DGS. Extensive experiments across three 3D benchmarks demonstrate that SR3R surpasses state-of-the-art (SOTA) 3DSR methods and achieves strong zero-shot generalization, even outperforming SOTA per-scene optimization methods on unseen scenes.
Chinese Translation
三维超分辨率(3DSR)旨在从低分辨率(LR)多视角图像中重建高分辨率(HR)三维场景。现有方法依赖于密集的LR输入和逐场景优化,这限制了构建HR三维高斯点云(3DGS)所需的高频先验仅能继承自预训练的二维超分辨率(2DSR)模型。这严重限制了重建的保真度、跨场景的泛化能力和实时可用性。我们提出将3DSR重新表述为从稀疏LR视图到HR 3DGS表示的直接前馈映射,使模型能够自主学习来自大规模多场景数据的三维特定高频几何和外观。这从根本上改变了3DSR获取高频知识的方式,并使其能够对未见场景进行稳健的泛化。具体而言,我们引入了SR3R,一个前馈框架,通过学习的映射网络直接从稀疏LR视图预测HR 3DGS表示。为了进一步提高重建的保真度,我们引入了高斯偏移学习和特征精炼,这些方法稳定了重建并增强了高频细节的清晰度。SR3R具有即插即用的特性,可以与任何前馈3DGS重建主干网络配对:主干提供LR 3DGS框架,而SR3R将其放大至HR 3DGS。在三个3D基准测试中的大量实验表明,SR3R超越了最先进的(SOTA)3DSR方法,并实现了强大的零样本泛化,甚至在未见场景上超越了SOTA逐场景优化方法。
cs.CV / 62 / 2602.24021
Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection
在冻结的多模态大语言模型中引导和修正潜在表示流形以进行视频异常检测
Abstract
Video anomaly detection (VAD) aims to identify abnormal events in videos. Traditional VAD methods generally suffer from the high costs of labeled data and full training, thus some recent works have explored leveraging frozen multi-modal large language models (MLLMs) in a tuning-free manner to perform VAD. However, their performance is limited as they directly inherit pre-training biases and cannot adapt internal representations to specific video contexts, leading to difficulties in handling subtle or ambiguous anomalies. To address these limitations, we propose a novel intervention framework, termed SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations. Our approach first leverages the gradient-free representational separability analysis (RSA) to identify top attention heads as latent anomaly experts (LAEs) which are most discriminative for VAD. Then a hierarchical meta-controller (HMC) generates dynamic rectification signals by jointly conditioning on global context and these LAE outputs. The signals execute targeted, anisotropic scaling directly upon the LAE representation manifolds, amplifying anomaly-relevant dimensions while suppressing inherent biases. Extensive experiments on mainstream benchmarks demonstrate our method achieves state-of-the-art performance among tuning-free approaches requiring only 1% of training data, establishing it as a powerful new direction for video anomaly detection. The code will be released upon the publication.
Chinese Translation
视频异常检测(VAD)旨在识别视频中的异常事件。传统的VAD方法通常面临标注数据和全面训练的高成本,因此一些近期的研究探索了以无调优的方式利用冻结的多模态大语言模型(MLLMs)来执行VAD。然而,由于这些方法直接继承了预训练偏差,无法将内部表示适应特定视频上下文,其性能受到限制,导致在处理细微或模糊异常时遇到困难。为了解决这些局限性,我们提出了一种新颖的干预框架,称为SteerVAD,该框架通过从被动读取转变为主动引导和修正内部表示,推动基于MLLM的VAD发展。我们的方法首先利用无梯度的表示可分离性分析(RSA)来识别作为潜在异常专家(LAEs)的顶级注意力头,这些头对于VAD具有最强的判别能力。然后,一个层次化的元控制器(HMC)通过共同条件化全局上下文和这些LAE输出生成动态修正信号。这些信号直接对LAE表示流形执行有针对性的各向异性缩放,放大与异常相关的维度,同时抑制固有偏差。在主流基准上的广泛实验表明,我们的方法在仅需1%训练数据的无调优方法中实现了最先进的性能,确立了其作为视频异常检测的新强大方向。代码将在发表后发布。
cs.CV / 63 / 2602.24027
GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models
GuardAlign:多模态大语言模型中的测试时安全对齐
Abstract
Large vision-language models (LVLMs) have achieved remarkable progress in vision-language reasoning tasks, yet ensuring their safety remains a critical challenge. Recent input-side defenses detect unsafe images with CLIP and prepend safety prefixes to prompts, but they still suffer from inaccurate detection in complex scenes and unstable safety signals during decoding. To address these issues, we propose GuardAlign, a training-free defense framework that integrates two strategies. First, OT-enhanced safety detection leverages optimal transport to measure distribution distances between image patches and unsafe semantics, enabling accurate identification of malicious regions without additional computational cost. Second, cross-modal attentive calibration strengthens the influence of safety prefixes by adaptively reallocating attention across layers, ensuring that safety signals remain consistently activated throughout generation. Extensive evaluations on six representative MLLMs demonstrate that GuardAlign reduces unsafe response rates by up to 39% on SPA-VL, while preserving utility, achieving an improvement on VQAv2 from 78.51% to 79.21%.
Chinese Translation
大型视觉-语言模型(LVLMs)在视觉-语言推理任务中取得了显著进展,但确保其安全性仍然是一个关键挑战。近期的输入侧防御方法利用 CLIP 检测不安全的图像,并在提示前添加安全前缀,但在复杂场景中仍然存在检测不准确和解码过程中安全信号不稳定的问题。为了解决这些问题,我们提出了 GuardAlign,这是一种无训练的防御框架,集成了两种策略。首先,增强的最优传输(OT)安全检测利用最优传输来测量图像块与不安全语义之间的分布距离,从而能够准确识别恶意区域,而无需额外的计算成本。其次,跨模态注意力校准通过自适应地重新分配跨层的注意力,增强安全前缀的影响,确保安全信号在生成过程中始终保持激活。在六个代表性的多模态大语言模型(MLLMs)上的广泛评估表明,GuardAlign 在 SPA-VL 上将不安全响应率降低了多达 39%,同时保持了实用性,在 VQAv2 上的表现从 78.51% 提升至 79.21%。
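The OT-enhanced detection step above measures a distribution distance between image patches and unsafe semantics via optimal transport. A minimal entropy-regularized sketch using Sinkhorn iterations (the toy cost matrix and regularization strength are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def sinkhorn_distance(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport between patch mass `a` and
    unsafe-concept mass `b`; the induced transport cost serves as an
    alignment score (illustrative Sinkhorn sketch)."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]   # approximate transport plan
    return float(np.sum(plan * cost))

# Toy cost: patch 0 is close to the unsafe concept, patch 1 is far.
cost = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
d = sinkhorn_distance(cost, a, b)
```

A small transport cost flags patches whose mass aligns cheaply with unsafe semantics; a uniformly high cost matrix yields a distance near the full mass-weighted cost.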
cs.CV / 64 / 2602.24041
Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation
仔细观察:多模态大语言模型中的自适应视觉增强以减轻幻觉
Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning, yet they remain vulnerable to hallucination, where generated content deviates from visual evidence. Existing mitigation strategies either require costly supervision during training or introduce additional latency at inference time. Recent vision enhancement methods attempt to address this issue by reinforcing visual tokens during decoding, but they typically inject all tokens indiscriminately, which causes interference from background regions and distracts the model from critical cues. To overcome this challenge, we propose Adaptive Visual Reinforcement (AIR), a training-free framework for MLLMs. AIR consists of two components. Prototype-based token reduction condenses the large pool of visual tokens into a compact subset to suppress redundancy. OT-guided patch reinforcement quantifies the alignment between hidden states and patch embeddings to selectively integrate the most consistent patches into feed-forward layers. As a result, AIR enhances the model's reliance on salient visual information and effectively mitigates hallucination. Extensive experiments across representative MLLMs demonstrate that AIR substantially reduces hallucination while preserving general capabilities, establishing it as an effective solution for building reliable MLLMs.
Chinese Translation
多模态大语言模型(MLLMs)在视觉-语言推理方面取得了显著进展,但仍然容易受到幻觉的影响,即生成内容偏离视觉证据。现有的减轻策略要么在训练期间需要昂贵的监督,要么在推理时引入额外的延迟。最近的视觉增强方法试图通过在解码过程中增强视觉标记来解决这个问题,但它们通常无差别地注入所有标记,这导致背景区域的干扰,并使模型分心于关键线索。为了解决这一挑战,我们提出了自适应视觉增强(Adaptive Visual Reinforcement,AIR),这是一个无训练的MLLM框架。AIR由两个组件组成。基于原型的标记减少将大量视觉标记浓缩为一个紧凑的子集,以抑制冗余。基于最优传输(OT)的补丁增强量化隐藏状态与补丁嵌入之间的对齐程度,以选择性地将最一致的补丁整合到前馈层中。因此,AIR增强了模型对显著视觉信息的依赖,有效减轻了幻觉。对代表性MLLMs的广泛实验表明,AIR显著减少了幻觉,同时保持了整体能力,确立了其作为构建可靠MLLMs的有效解决方案。
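Prototype-based token reduction condenses many redundant visual tokens into a compact subset; a rough k-means-style analogue is sketched below, where the cluster count, iteration budget, and nearest-real-token selection rule are our assumptions rather than AIR's exact procedure.

```python
import numpy as np

def prototype_reduce(tokens, k=4, n_iter=20, seed=0):
    """Condense N tokens to at most k by fitting k prototypes (k-means),
    then keeping the real token nearest to each prototype."""
    rng = np.random.default_rng(seed)
    protos = tokens[rng.choice(len(tokens), k, replace=False)].copy()
    for _ in range(n_iter):
        d = np.linalg.norm(tokens[:, None] - protos[None], axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                protos[j] = tokens[assign == j].mean(axis=0)
    d = np.linalg.norm(tokens[:, None] - protos[None], axis=-1)
    return np.unique(d.argmin(axis=0))   # indices of retained real tokens

rng = np.random.default_rng(1)
centers = 5.0 * rng.normal(size=(4, 16))                       # 4 underlying visual concepts
tokens = np.repeat(centers, 16, axis=0) + 0.1 * rng.normal(size=(64, 16))
kept = prototype_reduce(tokens, k=4)
print(len(kept), "of", len(tokens), "tokens kept")
```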
cs.CV / 65 / 2602.24043
Spatio-Temporal Garment Reconstruction Using Diffusion Mapping via Pattern Coordinates
基于模式坐标的扩散映射时空服装重建
Abstract
Reconstructing 3D clothed humans from monocular images and videos is a fundamental problem with applications in virtual try-on, avatar creation, and mixed reality. Despite significant progress in human body recovery, accurately reconstructing garment geometry, particularly for loose-fitting clothing, remains an open challenge. We propose a unified framework for high-fidelity 3D garment reconstruction from both single images and video sequences. Our approach combines Implicit Sewing Patterns (ISP) with a generative diffusion model to learn expressive garment shape priors in 2D UV space. Leveraging these priors, we introduce a mapping model that establishes correspondences between image pixels, UV pattern coordinates, and 3D geometry, enabling accurate and detailed garment reconstruction from single images. We further extend this formulation to dynamic reconstruction by introducing a spatio-temporal diffusion scheme with test-time guidance to enforce long-range temporal consistency. We also develop analytic projection-based constraints that preserve image-aligned geometry in visible regions while enforcing coherent completion in occluded areas over time. Although trained exclusively on synthetically simulated cloth data, our method generalizes well to real-world imagery and consistently outperforms existing approaches on both tight- and loose-fitting garments. The reconstructed garments preserve fine geometric detail while exhibiting realistic dynamic motion, supporting downstream applications such as texture editing, garment retargeting, and animation.
Chinese Translation
从单目图像和视频重建三维穿衣人类是一个基础性问题,具有虚拟试穿、头像创建和混合现实等应用。尽管在人体恢复方面取得了显著进展,但准确重建服装几何形状,特别是对于宽松服装,仍然是一个未解决的挑战。我们提出了一个统一框架,用于从单幅图像和视频序列中进行高保真三维服装重建。我们的方法结合了隐式缝合模式(Implicit Sewing Patterns, ISP)和生成扩散模型,以在二维UV空间中学习表现力丰富的服装形状先验。利用这些先验,我们引入了一种映射模型,建立图像像素、UV模式坐标和三维几何之间的对应关系,从而实现从单幅图像中准确且详细的服装重建。我们进一步通过引入具有测试时指导的时空扩散方案,将该公式扩展到动态重建,以强制执行长时间范围内的时间一致性。我们还开发了基于解析投影的约束,保持可见区域中的图像对齐几何,同时在时间上强制执行被遮挡区域的连贯补全。尽管我们的模型仅在合成模拟布料数据上训练,但它在真实世界图像中表现良好,并在紧身和宽松服装上始终优于现有方法。重建的服装保留了细致的几何细节,同时展现出逼真的动态运动,支持纹理编辑、服装重定向和动画等下游应用。
cs.CV / 66 / 2602.24059
Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization
量化专家:基于混合专家的面向Token的自适应误差重构用于大型视觉-语言模型的量化
Abstract
Post-Training Quantization (PTQ) has emerged as an effective technique for alleviating the substantial computational and memory overheads of Vision-Language Models (VLMs) by compressing both weights and activations without retraining the full model. Existing PTQ methods primarily rely on static identification and global compensation of sensitive or outlier channels, yet they often overlook the distributional differences of these important channels across inputs, leading to unsatisfactory quantization. In this work, we observe that the distributions and occurrence frequencies of important channels vary significantly both across modalities and among tokens, even within the same modality. Accordingly, we propose \textbf{Quant Experts (QE)}, a token-aware adaptive error compensation method with mixture-of-experts for VLM quantization. QE divides the important channels into token-independent and token-dependent groups. For the former, a shared expert serves most tokens, compensating for global quantization error with a low-rank adapter. For the latter, routed experts comprising multiple routed low-rank adapters compensate for local quantization error tied to specific tokens. Extensive experiments demonstrate that QE consistently enhances task accuracy across various quantization settings and model scales, ranging from 2B to 70B parameters, while maintaining performance comparable to full-precision models.
Chinese Translation
后训练量化(PTQ)已成为一种有效的技术,通过压缩权重和激活而无需重新训练完整模型,从而减轻视觉-语言模型(VLMs)在计算和内存方面的巨大开销。现有的PTQ方法主要依赖于对敏感或异常通道的静态识别和全局补偿,但往往忽视了这些重要通道在不同输入间的分布差异,导致量化效果不理想。在本研究中,我们观察到重要通道的分布和出现频率在不同模态之间以及同一模态内的Token之间存在显著差异。因此,我们提出了量化专家(Quant Experts, QE),一种面向Token的自适应误差补偿方法,结合了混合专家用于VLMs的量化。QE将重要通道分为与Token无关和与Token相关两组。对于前者,设计了一个共享专家,用于大多数Token,通过低秩适配器补偿全局量化误差。对于后者,详细描述了包括多个路由低秩适配器的路由专家,以补偿与特定Token相关的局部量化误差。大量实验表明,QE在各种量化设置和模型规模(从2B到70B参数)中始终提高了任务准确性,同时保持与全精度模型相当的性能。
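The shared-expert idea, compensating global quantization error with a low-rank adapter, can be sketched directly: quantize a weight matrix, then fit a rank-r correction to the residual. Fitting by truncated SVD is a simplification we assume for illustration; QE's actual experts are trained, token-routed modules.

```python
import numpy as np

def quantize(W, bits=4):
    """Uniform symmetric per-tensor quantization."""
    q = 2 ** (bits - 1) - 1
    s = np.abs(W).max() / q
    return np.round(W / s).clip(-q, q) * s

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
Wq = quantize(W)

# Shared "expert": a rank-r adapter fitted (here by truncated SVD) to the
# global quantization error, a rough analogue of QE's shared compensation.
E = W - Wq
U, S, Vt = np.linalg.svd(E, full_matrices=False)
r = 8
A, B = U[:, :r] * S[:r], Vt[:r]          # low-rank factors
W_comp = Wq + A @ B

err_q = np.linalg.norm(W - Wq)
err_c = np.linalg.norm(W - W_comp)
print(err_c < err_q)  # the adapter recovers part of the quantization error
```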
cs.CV / 67 / 2602.24065
EvalMVX: A Unified Benchmarking for Neural 3D Reconstruction under Diverse Multiview Setups
EvalMVX:针对多视角设置下神经3D重建的统一基准测试
Abstract
Recent advancements in neural surface reconstruction have significantly enhanced 3D reconstruction. However, current real-world datasets mainly focus on benchmarking multiview stereo (MVS) based on RGB inputs. Multiview photometric stereo (MVPS) and multiview shape from polarization (MVSfP), though indispensable for high-fidelity surface reconstruction and sparse inputs, have not been quantitatively assessed together with MVS. To determine the working range of different MVX (MVS, MVSfP, and MVPS) techniques, we propose EvalMVX, a real-world dataset containing $25$ objects, each captured with a polarized camera under $20$ varying views and $17$ light conditions including OLAT and natural illumination, leading to $8,500$ images. Each object includes an aligned ground-truth 3D mesh, facilitating simultaneous quantitative benchmarking of MVX methods. Based on EvalMVX, we evaluate $13$ MVX methods published in recent years, record the best-performing methods, and identify open problems across diverse geometric details and reflectance types. We hope EvalMVX and the benchmarking results can inspire future research on multiview 3D reconstruction.
Chinese Translation
近年来,神经表面重建的进展显著提升了3D重建的效果。然而,目前的真实世界数据集主要集中在基于RGB输入的多视角立体视觉(MVS)基准测试上。尽管多视角光度立体(MVPS)和多视角偏振形状重建(MVSfP)在高保真表面重建和稀疏输入中不可或缺,但它们尚未与MVS一起进行定量评估。为了确定不同MVX(MVS、MVSfP和MVPS)技术的工作范围,我们提出了EvalMVX,这是一个包含25个物体的真实世界数据集,每个物体在20个不同视角和17种光照条件下(包括OLAT和自然光照)使用偏振相机捕获,共生成8500张图像。每个物体都包含对齐的真实3D网格,便于同时对MVX方法进行定量基准测试。基于我们的EvalMVX,我们评估了近年来发布的13种MVX方法,记录了表现最佳的方法,并识别出在不同几何细节和反射类型下的开放问题。我们希望EvalMVX及其基准测试结果能够激发未来在多视角3D重建方面的研究。
cs.CV / 68 / 2602.24084
FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting
FoV-Net:通过视场光线投射实现旋转不变的CAD边界表示学习
Abstract
Learning directly from boundary representations (B-reps) has significantly advanced 3D CAD analysis. However, state-of-the-art B-rep learning methods rely on absolute coordinates and normals to encode global context, making them highly sensitive to rotations. Our experiments reveal that models achieving over 95% accuracy on aligned benchmarks can collapse to as low as 10% under arbitrary $\mathbf{SO}(3)$ rotations. To address this, we introduce FoV-Net, the first B-rep learning framework that captures both local surface geometry and global structural context in a rotation-invariant manner. Each face is represented by a Local Reference Frame (LRF) UV-grid that encodes its local surface geometry, and by Field-of-View (FoV) grids that capture the surrounding 3D context by casting rays and recording intersections with neighboring faces. Lightweight CNNs extract per-face features, which are propagated over the B-rep graph using a graph attention network. FoV-Net achieves state-of-the-art performance on B-rep classification and segmentation benchmarks, demonstrating robustness to arbitrary rotations while also requiring less training data to achieve strong results.
Chinese Translation
直接从边界表示(B-reps)中学习显著推动了3D CAD分析的发展。然而,最先进的B-rep学习方法依赖于绝对坐标和法线来编码全局上下文,使其对旋转高度敏感。我们的实验表明,在对齐基准上实现超过95%准确率的模型在任意$\mathbf{SO}(3)$旋转下可能降至仅10%。为了解决这一问题,我们提出了FoV-Net,这是第一个以旋转不变的方式捕捉局部表面几何和全局结构上下文的B-rep学习框架。每个面由局部参考框架(Local Reference Frame, LRF)UV网格表示,该网格编码其局部表面几何,并通过视场(Field-of-View, FoV)网格捕捉周围的3D上下文,方法是投射光线并记录与邻近面的交点。轻量级卷积神经网络(CNN)提取每个面的特征,这些特征通过图注意网络在B-rep图上进行传播。FoV-Net在B-rep分类和分割基准上实现了最先进的性能,展示了对任意旋转的鲁棒性,同时在实现强大结果时所需的训练数据更少。
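The rotation-invariance argument behind LRF features can be checked numerically: expressing a point set in a frame built from its own covariance eigenvectors (with axis signs fixed by the skewness of the projections) yields coordinates unchanged by any global rotation. This is a simplified sketch, not FoV-Net's exact LRF construction.

```python
import numpy as np

def lrf_coords(pts):
    """Coordinates of pts in a Local Reference Frame derived from the data
    itself, hence invariant to any global rotation of pts."""
    X = pts - pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = X @ Vt.T
    sign = np.sign((P ** 3).sum(axis=0))   # fix eigenvector sign ambiguity
    sign[sign == 0] = 1.0
    return P * sign

def random_rotation(seed=0):
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return Q

rng = np.random.default_rng(42)
pts = rng.normal(size=(50, 3))
R = random_rotation()
coords_a = lrf_coords(pts)
coords_b = lrf_coords(pts @ R.T)   # same shape, arbitrarily rotated
print(np.allclose(coords_a, coords_b, atol=1e-6))
```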
cs.CV / 69 / 2602.24096
DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer
DiffusionHarmonizer:通过在线扩散增强器连接神经重建与逼真模拟
Abstract
Simulation is essential to the development and evaluation of autonomous robots such as self-driving vehicles. Neural reconstruction is emerging as a promising solution as it enables simulating a wide variety of scenarios from real-world data alone in an automated and scalable way. However, while methods such as NeRF and 3D Gaussian Splatting can produce visually compelling results, they often exhibit artifacts particularly when rendering novel views, and fail to realistically integrate inserted dynamic objects, especially when they were captured from different scenes. To overcome these limitations, we introduce DiffusionHarmonizer, an online generative enhancement framework that transforms renderings from such imperfect scenes into temporally consistent outputs while improving their realism. At its core is a single-step temporally-conditioned enhancer that is converted from a pretrained multi-step image diffusion model, capable of running in online simulators on a single GPU. The key to training it effectively is a custom data curation pipeline that constructs synthetic-real pairs emphasizing appearance harmonization, artifact correction, and lighting realism. The result is a scalable system that significantly elevates simulation fidelity in both research and production environments.
Chinese Translation
模拟对于自主机器人(如自动驾驶车辆)的开发和评估至关重要。神经重建作为一种新兴的解决方案,能够仅通过真实世界数据自动且可扩展地模拟各种场景。然而,尽管像NeRF和3D Gaussian Splatting等方法能够产生视觉上引人注目的结果,但在渲染新视角时,它们往往会出现伪影,并且在真实地整合插入的动态物体时表现不佳,尤其是当这些物体来自不同场景时。为了解决这些局限性,我们提出了DiffusionHarmonizer,这是一种在线生成增强框架,能够将来自这些不完美场景的渲染转换为时间一致的输出,同时提高其逼真度。其核心是一个单步时间条件增强器,该增强器由预训练的多步图像扩散模型转换而来,能够在单个GPU的在线模拟器中运行。有效训练的关键在于一个定制的数据策划管道,该管道构建了强调外观协调、伪影修正和光照真实感的合成-真实配对。最终结果是一个可扩展的系统,显著提升了研究和生产环境中的模拟保真度。
cs.CV / 70 / 2602.24111
Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification
通过形式验证为视觉语言模型中的临床推理提供保障
Abstract
Vision-language models (VLMs) show promise in drafting radiology reports, yet they frequently suffer from logical inconsistencies, generating diagnostic impressions unsupported by their own perceptual findings or missing logically entailed conclusions. Standard lexical metrics heavily penalize clinical paraphrasing and fail to capture these deductive failures in reference-free settings. Toward guarantees for clinical reasoning, we introduce a neurosymbolic verification framework that deterministically audits the internal consistency of VLM-generated reports. Our pipeline autoformalizes free-text radiographic findings into structured propositional evidence, utilizing an SMT solver (Z3) and a clinical knowledge base to verify whether each diagnostic claim is mathematically entailed, hallucinated, or omitted. Evaluating seven VLMs across five chest X-ray benchmarks, our verifier exposes distinct reasoning failure modes, such as conservative observation and stochastic hallucination, that remain invisible to traditional metrics. On labeled datasets, enforcing solver-backed entailment acts as a rigorous post-hoc guarantee, systematically eliminating unsupported hallucinations to significantly increase diagnostic soundness and precision in generative clinical assistants.
Chinese Translation
视觉语言模型(VLMs)在撰写放射学报告方面展现出潜力,但它们常常存在逻辑不一致的问题,生成的诊断印象与自身的感知发现不符,或缺失逻辑上应有的结论。标准的词汇度量方法对临床释义给予了严厉的惩罚,未能在无参考的环境中捕捉到这些推理失败。为了为临床推理提供保障,我们引入了一种神经符号验证框架,该框架确定性地审计VLM生成报告的内部一致性。我们的流程将自由文本的放射学发现自动形式化为结构化的命题证据,利用SMT求解器(Z3)和临床知识库来验证每个诊断声明是否在数学上被蕴含、虚构或遗漏。通过在五个胸部X光基准上评估七个VLM,我们的验证器揭示了不同的推理失败模式,如保守观察和随机虚构,这些模式在传统度量中是不可见的。在标注数据集上,强制执行求解器支持的蕴含作为严格的事后保障,系统性地消除不支持的虚构,从而显著提高生成临床助手的诊断合理性和准确性。
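The solver-backed check reduces to classical entailment: a diagnostic claim is supported iff knowledge base plus findings plus the claim's negation is unsatisfiable. The paper uses Z3 over an autoformalized clinical KB; below is a tiny stdlib stand-in that decides the same question by truth table, with a purely illustrative rule (cardiomegaly and edema entailing CHF is our hypothetical example, not a KB entry from the paper).

```python
from itertools import product

def entails(kb, findings, claim, varnames):
    """KB + findings |= claim iff no assignment satisfies KB and all findings
    while falsifying the claim (the unsat test an SMT solver would run)."""
    for vals in product([False, True], repeat=len(varnames)):
        env = dict(zip(varnames, vals))
        if kb(env) and all(f(env) for f in findings) and not claim(env):
            return False
    return True

# Hypothetical rule: cardiomegaly AND edema entail CHF (illustrative only).
kb = lambda e: (not (e['cardio'] and e['edema'])) or e['chf']
names = ['cardio', 'edema', 'chf']
ok  = entails(kb, [lambda e: e['cardio'], lambda e: e['edema']],
              lambda e: e['chf'], names)   # supported impression
bad = entails(kb, [lambda e: e['cardio']],
              lambda e: e['chf'], names)   # hallucinated without edema
print(ok, bad)  # True False
```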
cs.CV / 71 / 2602.24133
FocusTrack: One-Stage Focus-and-Suppress Framework for 3D Point Cloud Object Tracking
FocusTrack:一种用于3D点云目标跟踪的一阶段聚焦与抑制框架
Abstract
In 3D point cloud object tracking, motion-centric methods have emerged as a promising avenue due to their superior performance in modeling inter-frame motion. However, existing two-stage motion-based approaches suffer from fundamental limitations: (1) error accumulation due to decoupled optimization, caused by explicit foreground segmentation prior to motion estimation, and (2) computational bottlenecks from sequential processing. To address these challenges, we propose FocusTrack, a novel one-stage tracking framework that unifies motion-semantics co-modeling through two core innovations: Inter-frame Motion Modeling (IMM) and Focus-and-Suppress Attention. The IMM module employs a temporal-difference siamese encoder to capture global motion patterns between adjacent frames. The Focus-and-Suppress attention enhances foreground semantics via motion-salient feature gating and suppresses background noise using the temporal-aware motion context from IMM, without explicit segmentation. Based on these two designs, FocusTrack enables end-to-end training in a compact one-stage pipeline. Extensive experiments on prominent 3D tracking benchmarks, such as KITTI, nuScenes, and Waymo, demonstrate that FocusTrack achieves new state-of-the-art performance while running at a high speed of 105 FPS.
Chinese Translation
在3D点云目标跟踪中,以运动为中心的方法因其在建模帧间运动方面的优越性能而成为一种有前景的途径。然而,现有的两阶段基于运动的方法存在基本局限性:(1)由于在运动估计之前进行显式前景分割而导致的解耦优化引起的误差累积,以及(2)来自顺序处理的计算瓶颈。为了解决这些挑战,我们提出了FocusTrack,一种新颖的一阶段跟踪框架,通过两个核心创新统一了运动-语义共同建模:帧间运动建模(Inter-frame Motion Modeling, IMM)和聚焦-抑制注意力(Focus-and-Suppress Attention)。IMM模块采用时间差双胞胎编码器捕捉相邻帧之间的全局运动模式。聚焦-抑制注意力通过运动显著特征门控增强前景语义,并基于IMM提供的时间感知运动上下文抑制背景噪声,而无需显式分割。基于上述两个设计,FocusTrack实现了紧凑的一阶段管道的端到端训练。在KITTI、nuScenes和Waymo等著名3D跟踪基准上的广泛实验表明,FocusTrack在以105 FPS的高速度运行的同时,实现了新的SOTA性能。
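Motion-salient feature gating can be illustrated with a toy example: per-point features whose inter-frame change is large are kept, static background features are damped. The gating function, its sharpness parameter, and the scene setup below are our assumptions, not FocusTrack's learned module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def motion_gated_features(feat_t, feat_prev, alpha=4.0):
    """Gate features by inter-frame change: motion-salient (foreground)
    points are emphasized, static background is suppressed."""
    motion = np.linalg.norm(feat_t - feat_prev, axis=-1, keepdims=True)
    gate = sigmoid(alpha * (motion - motion.mean()))
    return gate * feat_t, gate

rng = np.random.default_rng(0)
background = rng.normal(size=(80, 16))
target = rng.normal(size=(20, 16))
feat_prev = np.concatenate([background, target])
feat_t = feat_prev.copy()
feat_t[80:] += 1.0                 # only the target moves between frames
gated, gate = motion_gated_features(feat_t, feat_prev)
print(gate[80:].mean() > gate[:80].mean())  # foreground gated higher
```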
cs.CV / 72 / 2602.24134
AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation
AgenticOCR:仅解析所需内容以实现高效的检索增强生成
Abstract
The expansion of retrieval-augmented generation (RAG) into multimodal domains has intensified the challenge for processing complex visual documents, such as financial reports. While page-level chunking and retrieval is a natural starting point, it creates a critical bottleneck: delivering entire pages to the generator introduces excessive extraneous context. This not only overloads the generator's attention mechanism but also dilutes the most salient evidence. Moreover, compressing these information-rich pages into a limited visual token budget further increases the risk of hallucinations. To address this, we introduce AgenticOCR, a dynamic parsing paradigm that transforms optical character recognition (OCR) from a static, full-text process into a query-driven, on-demand extraction system. By autonomously analyzing document layout in a "thinking with images" manner, AgenticOCR identifies and selectively recognizes regions of interest. This approach performs on-demand decompression of visual tokens precisely where needed, effectively decoupling retrieval granularity from rigid page-level chunking. AgenticOCR has the potential to serve as the "third building block" of the visual document RAG stack, operating alongside and enhancing standard Embedding and Reranking modules. Experimental results demonstrate that AgenticOCR improves both the efficiency and accuracy of visual RAG systems, achieving expert-level performance in long document understanding. Code and models are available at https://github.com/OpenDataLab/AgenticOCR.
Chinese Translation
检索增强生成(RAG)在多模态领域的扩展加剧了处理复杂视觉文档(如财务报告)的挑战。尽管基于页面的分块和检索是一个自然的起点,但它创造了一个关键瓶颈:将整个页面传递给生成器引入了过多的冗余上下文。这不仅使生成器的注意力机制过载,还稀释了最显著的证据。此外,将这些信息丰富的页面压缩到有限的视觉令牌预算中进一步增加了幻觉的风险。为了解决这个问题,我们引入了AgenticOCR,一种动态解析范式,它将光学字符识别(OCR)从静态的全文本处理转变为基于查询的按需提取系统。通过以“图像思维”的方式自主分析文档布局,AgenticOCR识别并选择性地识别感兴趣的区域。这种方法在需要的地方按需解压视觉令牌,有效地将检索粒度与严格的页面级分块解耦。AgenticOCR有潜力作为视觉文档RAG堆栈的“第三个构建块”,与标准的嵌入和重排序模块协同工作并增强其功能。实验结果表明,AgenticOCR提高了视觉RAG系统的效率和准确性,在长文档理解中达到了专家级的表现。代码和模型可在https://github.com/OpenDataLab/AgenticOCR获取。
cs.CV / 73 / 2602.24136
Prune Wisely, Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives
明智修剪,锐利重建:通过自适应修剪和高斯差分原语实现紧凑的3D高斯点云
Abstract
Recent significant advances in 3D scene representation have been driven by 3D Gaussian Splatting (3DGS), which has enabled real-time rendering with photorealistic quality. However, 3DGS often requires a large number of primitives to achieve high fidelity, leading to redundant representations and high resource consumption, thereby limiting its scalability for complex or large-scale scenes. Consequently, effective pruning strategies and more expressive primitives that reduce redundancy while preserving visual quality are crucial for practical deployment. We propose an efficient, integrated reconstruction-aware pruning strategy that adaptively determines pruning timing and refinement intervals based on reconstruction quality, reducing model size while enhancing rendering quality. Moreover, we introduce a 3D Difference-of-Gaussians primitive that jointly models both positive and negative densities in a single primitive, improving the expressiveness of Gaussians under compact configurations. Our method significantly improves model compactness, achieving up to 90% reduction in Gaussian count while delivering visual quality similar to, or in some cases better than, that of state-of-the-art methods. Code will be made publicly available.
Chinese Translation
近期在3D场景表示方面的重大进展得益于3D高斯点云(3DGS),该技术实现了具有照片级真实感的实时渲染。3DGS通常需要大量原语以达到高保真度,这导致冗余表示和高资源消耗,从而限制了其在复杂或大规模场景中的可扩展性。因此,开发有效的修剪策略和更具表现力的原语,以减少冗余并保持视觉质量,对于实际应用至关重要。我们提出了一种高效的、集成的重建感知修剪策略,该策略根据重建质量自适应地确定修剪时机和精炼间隔,从而在减小模型尺寸的同时提升渲染质量。此外,我们引入了一种3D高斯差分原语,该原语在单个原语中联合建模正负密度,提高了高斯在紧凑配置下的表现力。我们的方法显著提高了模型的紧凑性,实现了高达90%的高斯数量减少,同时提供的视觉质量与最先进的方法相似,甚至在某些情况下更优。代码将公开发布。
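The Difference-of-Gaussians primitive can be illustrated in 1D: a positive and a negative lobe share a centre, so a single primitive can carve a density dip in its surround, which one positive Gaussian cannot represent. The weights and widths below are arbitrary choices for illustration.

```python
import numpy as np

def gauss(x, mu, sigma, w):
    return w * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

def dog(x, mu, sigma_pos, sigma_neg, w_pos, w_neg):
    """1D Difference-of-Gaussians: a positive and a negative lobe sharing
    one centre, jointly modelled by a single primitive."""
    return gauss(x, mu, sigma_pos, w_pos) - gauss(x, mu, sigma_neg, w_neg)

x = np.linspace(-3.0, 3.0, 601)
single = gauss(x, 0.0, 1.0, 1.0)
sharp = dog(x, 0.0, 0.8, 1.6, 1.5, 0.5)
# A single positive Gaussian is positive everywhere; the DoG primitive
# dips below zero in the surround, adding expressiveness per primitive.
print(single[0] > 0, sharp[0] < 0)
```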
cs.CV / 74 / 2602.24138
Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics
用于外科机器人无监督时间分割的多模态最优传输
Abstract
Recognizing surgical phases and steps from video is a fundamental problem in computer-assisted interventions. Recent approaches increasingly rely on large-scale pre-training on thousands of labeled surgical videos, followed by zero-shot transfer to specific procedures. While effective, this strategy incurs substantial computational and data collection costs. In this work, we question whether such heavy pre-training is truly necessary. We propose Text-Augmented Action Segmentation Optimal Transport (TASOT), an unsupervised method for surgical phase and step recognition that extends Action Segmentation Optimal Transport (ASOT) by incorporating textual information generated directly from the videos. TASOT formulates temporal action segmentation as a multimodal optimal transport problem, where the matching cost is defined as a weighted combination of visual and text-based costs. The visual term captures frame-level appearance similarity, while the text term provides complementary semantic cues, and both are jointly regularized through a temporally consistent unbalanced Gromov-Wasserstein formulation. This design enables effective alignment between video frames and surgical actions without surgical-specific pretraining or external web-scale supervision. We evaluate TASOT on multiple benchmark surgical datasets and observe consistent and substantial improvements over existing zero-shot methods, including StrasBypass70 (+23.7), BernBypass70 (+4.5), Cholec80 (+16.5), and AutoLaparo (+19.6). These results demonstrate that fine-grained surgical understanding can be achieved by exploiting information already present in standard visual and textual representations, without resorting to increasingly complex pre-training pipelines. The code will be available at https://github.com/omar8ahmed9/TASOT.
Chinese Translation
从视频中识别外科阶段和步骤是计算机辅助干预中的一个基本问题。近期的方法越来越依赖于对数千个标记外科视频的大规模预训练,随后进行零样本迁移到特定程序。尽管这种策略有效,但它带来了可观的计算和数据收集成本。在本研究中,我们质疑这种重型预训练是否真的必要。我们提出了文本增强的动作分割最优传输(Text-Augmented Action Segmentation Optimal Transport, TASOT),这是一种无监督的外科阶段和步骤识别方法,通过直接从视频中生成的文本信息扩展了动作分割最优传输(Action Segmentation Optimal Transport, ASOT)。TASOT将时间动作分割公式化为一个多模态最优传输问题,其中匹配成本被定义为视觉和基于文本的成本的加权组合。视觉项捕捉帧级外观相似性,而文本项提供互补的语义线索,二者通过时间一致的非平衡Gromov-Wasserstein公式共同正则化。这一设计使得在没有外科特定预训练或外部网络规模监督的情况下,有效对齐视频帧和外科动作。我们在多个基准外科数据集上评估了TASOT,并观察到相较于现有的零样本方法有一致且显著的改善,包括StrasBypass70(+23.7)、BernBypass70(+4.5)、Cholec80(+16.5)和AutoLaparo(+19.6)。这些结果表明,通过利用标准视觉和文本表示中已存在的信息,可以实现细粒度的外科理解,而无需依赖日益复杂的预训练流程。代码将可在https://github.com/omar8ahmed9/TASOT获取。
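The weighted multimodal cost at the heart of TASOT can be shown on a toy frame-to-action alignment: visual cost alone leaves two frames ambiguous, and adding a text-based cost resolves them. The matrices, the weight, and the greedy argmin stand-in for the full OT assignment are all illustrative assumptions.

```python
import numpy as np

# 6 frames to align with 3 actions; entries are mock matching costs.
C_vis = np.array([
    [0.1, 0.9, 0.9],
    [0.2, 0.8, 0.9],
    [0.5, 0.5, 0.9],   # visually ambiguous between actions 0 and 1
    [0.5, 0.5, 0.9],
    [0.9, 0.2, 0.8],
    [0.9, 0.9, 0.1],
])
C_txt = np.array([
    [0.1, 0.8, 0.9],
    [0.2, 0.8, 0.9],
    [0.7, 0.2, 0.9],   # mock captions place these frames in action 1
    [0.7, 0.2, 0.9],
    [0.8, 0.1, 0.9],
    [0.9, 0.8, 0.1],
])
w = 0.5
C = w * C_vis + (1 - w) * C_txt   # weighted combination of modal costs
labels = C.argmin(axis=1)          # greedy stand-in for the OT matching
print(labels.tolist())             # [0, 0, 1, 1, 1, 2]
```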
cs.CV / 75 / 2602.24144
Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation
固定锚点不足:用于数据集蒸馏的动态检索与持久同调
Abstract
Decoupled dataset distillation (DD) compresses large corpora into a few synthetic images by matching a frozen teacher's statistics. However, current residual-matching pipelines rely on static real patches, creating a fit-complexity gap and a pull-to-anchor effect that reduce intra-class diversity and hurt generalization. To address these issues, we introduce RETA -- a Retrieval and Topology Alignment framework for decoupled DD. First, Dynamic Retrieval Connection (DRC) selects a real patch from a prebuilt pool by minimizing a fit-complexity score in teacher feature space; the chosen patch is injected via a residual connection to tighten feature fit while controlling injected complexity. Second, Persistent Topology Alignment (PTA) regularizes synthesis with persistent homology: we build a mutual k-NN feature graph, compute persistence images of components and loops, and penalize topology discrepancies between real and synthetic sets, mitigating pull-to-anchor effect. Across CIFAR-100, Tiny-ImageNet, ImageNet-1K, and multiple ImageNet subsets, RETA consistently outperforms various baselines under comparable time and memory, especially reaching 64.3% top-1 accuracy on ImageNet-1K with ResNet-18 at 50 images per class, +3.1% over the best prior.
Chinese Translation
解耦数据集蒸馏(Decoupled Dataset Distillation, DD)通过匹配冻结教师的统计信息,将大型语料库压缩为少量合成图像。然而,目前的残差匹配流程依赖于静态真实补丁,造成了拟合复杂度差距和拉锚效应,降低了类内多样性并影响了泛化能力。为了解决这些问题,我们提出了RETA——一种用于解耦DD的检索与拓扑对齐框架。首先,动态检索连接(Dynamic Retrieval Connection, DRC)通过最小化教师特征空间中的拟合复杂度评分,从预构建池中选择一个真实补丁;所选补丁通过残差连接注入,以在控制注入复杂度的同时增强特征拟合。其次,持久拓扑对齐(Persistent Topology Alignment, PTA)通过持久同调对合成进行正则化:我们构建了一个互相k近邻特征图,计算组件和环的持久性图像,并惩罚真实和合成集之间的拓扑差异,从而减轻拉锚效应。在CIFAR-100、Tiny-ImageNet、ImageNet-1K及多个ImageNet子集上,RETA在可比时间和内存条件下始终优于各种基线,特别是在ImageNet-1K上以ResNet-18在每类50张图像的情况下达到了64.3%的Top-1准确率,比之前最佳结果提高了3.1%。
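The Dynamic Retrieval Connection step, selecting a real patch by minimizing a fit-complexity score in teacher feature space, can be sketched with a simple score: feature distance to the synthetic image (fit) plus a norm-based complexity penalty. The score's exact form and the weight `lam` are our assumptions, not RETA's definition.

```python
import numpy as np

def drc_select(synth_feat, pool_feats, lam=0.5):
    """Pick the pool patch minimizing fit + lam * complexity, where fit is
    feature distance to the synthetic image and the feature norm is a crude
    complexity proxy (illustrative assumption)."""
    fit = np.linalg.norm(pool_feats - synth_feat, axis=1)
    complexity = np.linalg.norm(pool_feats, axis=1)
    score = fit + lam * complexity
    return int(score.argmin()), score

rng = np.random.default_rng(0)
synth = rng.normal(size=16)
pool = rng.normal(size=(32, 16))
pool[7] = synth + 0.01 * rng.normal(size=16)  # one patch that fits well
idx, score = drc_select(synth, pool)
print(idx)  # 7: the well-fitting patch wins the retrieval
```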
cs.CV / 76 / 2602.24148
HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation
HumanOrbit:作为360°轨道生成的3D人类重建
Abstract
We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffusion models for multi-view synthesis, but yield results that are inconsistent across views and with the original identity. In contrast, recent video diffusion models have demonstrated their ability to generate photorealistic results that align well with the given prompts. Inspired by these results, we propose HumanOrbit, a video diffusion model for multi-view human image generation. Our approach enables the model to synthesize continuous camera rotations around the subject, producing geometrically consistent novel views while preserving the appearance and identity of the person. Using the generated multi-view frames, we further propose a reconstruction pipeline that recovers a textured mesh of the subject. Experimental results validate the effectiveness of HumanOrbit for multi-view image generation and show that the reconstructed 3D models exhibit superior completeness and fidelity compared to those from state-of-the-art baselines.
Chinese Translation
我们提出了一种从单一输入图像生成围绕一个人的完整360°轨道视频的方法。现有的方法通常将基于图像的扩散模型应用于多视图合成,但在不同视图之间以及与原始身份之间产生不一致的结果。相比之下,最近的视频扩散模型展示了其在生成与给定提示高度一致的照片级真实感结果方面的能力。受到这些结果的启发,我们提出了HumanOrbit,一种用于多视图人类图像生成的视频扩散模型。我们的方法使模型能够合成围绕主体的连续相机旋转,生成几何一致的新视图,同时保持人物的外观和身份。利用生成的多视图帧,我们进一步提出了一个重建管道,以恢复主体的纹理网格。实验结果验证了HumanOrbit在多视图图像生成方面的有效性,并且重建的3D模型在完整性和保真度上优于最先进基线的结果。
cs.CV / 77 / 2602.24159
RAViT: Resolution-Adaptive Vision Transformer
RAViT:分辨率自适应视觉变换器
Abstract
Vision transformers have recently made a breakthrough in computer vision, showing excellent precision across numerous applications. However, their computational cost is very high compared to alternative approaches such as convolutional neural networks. To address this problem, we propose RAViT, a novel image-classification framework based on a multi-branch network that operates on several copies of the same image at different resolutions, reducing computational cost while preserving overall accuracy. Furthermore, our framework includes an early-exit mechanism that makes the model adaptive and allows the appropriate trade-off between accuracy and computational cost to be chosen at run time. For example, in a two-branch architecture, the original image is first resized to a lower resolution and a prediction is made on it by a first transformer; that prediction is then reused, together with the original-size image, to perform a final prediction on a second transformer with less computation than a classical vision transformer architecture. The early-exit process allows the model to make a final prediction at intermediate branches, saving even more computation. We evaluated our approach on CIFAR-10, Tiny ImageNet, and ImageNet, obtaining accuracy equivalent to the classical vision transformer model with only around 70% of the FLOPs.
Chinese Translation
视觉变换器最近在计算机视觉领域取得了突破,在多个应用中展现出卓越的精度表现。然而,与卷积神经网络等替代方法相比,其计算成本非常高。为了解决这个问题,我们提出了一种基于多分支网络的新颖图像分类框架RAViT,该框架在多个不同分辨率的同一图像副本上进行操作,以降低计算成本,同时保持整体准确性。此外,我们的框架还包括一个提前退出机制,使我们的模型具有自适应性,并能够在运行时选择准确性与计算成本之间的适当权衡。例如,在一个双分支架构中,原始图像首先被调整大小以降低其分辨率,然后使用第一个变换器对其进行预测,得到的预测结果与原始大小的图像一起被重用,以在第二个变换器上进行最终预测,这样的计算量比经典的视觉变换器架构要少。提前退出过程使模型能够在中间分支处做出最终预测,从而节省更多计算。我们在CIFAR-10、Tiny ImageNet和ImageNet上评估了我们的方法。我们获得了与经典视觉变换器模型相当的准确性,但仅使用了大约70%的FLOPs。
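The early-exit control flow can be sketched concretely: accept the low-resolution branch when its softmax confidence clears a threshold, otherwise pay for the full-resolution branch. The threshold value and toy logits are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adaptive_predict(logits_lowres, logits_fullres_fn, threshold=0.8):
    """Early-exit: accept the cheap low-resolution branch if confident,
    otherwise invoke the expensive full-resolution branch.
    Returns (predicted class, used_early_exit)."""
    p = softmax(logits_lowres)
    if p.max() >= threshold:
        return int(p.argmax()), True
    return int(np.argmax(logits_fullres_fn())), False

confident = np.array([4.0, 0.1, 0.2])       # easy image: low-res suffices
ambiguous = np.array([1.0, 0.9, 0.8])       # hard image: escalate
full = lambda: np.array([0.2, 3.0, 0.1])    # stand-in for the big branch

print(adaptive_predict(confident, full))    # (0, True)
print(adaptive_predict(ambiguous, full))    # (1, False)
```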
cs.CV / 78 / 2602.24160
Manifold-Preserving Superpixel Hierarchies and Embeddings for the Exploration of High-Dimensional Images
保留流形的超像素层次结构与嵌入用于高维图像的探索
Abstract
High-dimensional images, or images with a high-dimensional attribute vector per pixel, are commonly explored with coordinated views of a low-dimensional embedding of the attribute space and a conventional image representation. Nowadays, such images can easily contain several million pixels. For such large datasets, hierarchical embedding techniques are better suited to represent the high-dimensional attribute space than flat dimensionality reduction methods. However, available hierarchical dimensionality reduction methods construct the hierarchy purely based on the attribute information and ignore the spatial layout of pixels in the images. This impedes the exploration of regions of interest in the image space, since there is no congruence between a region of interest in image space and the associated attribute abstractions in the hierarchy. In this paper, we present a superpixel hierarchy for high-dimensional images that takes the high-dimensional attribute manifold into account during construction. Through this, our method enables consistent exploration of high-dimensional images in both image and attribute space. We show the effectiveness of this new image-guided hierarchy in the context of embedding exploration by comparing it with classical hierarchical embedding-based image exploration in two use cases.
Chinese Translation
高维图像,即每个像素具有高维属性向量的图像,通常通过属性空间的低维嵌入与传统图像表示的协调视图进行探索。如今,这类图像可以轻松包含数百万个像素。对于如此大规模的数据集,层次嵌入技术比平面降维方法更适合表示高维属性空间。然而,现有的层次降维方法纯粹基于属性信息构建层次结构,忽略了图像中像素的空间布局。这妨碍了对图像空间中感兴趣区域的探索,因为图像空间中的感兴趣区域与层次结构中相关的属性抽象之间没有一致性。在本文中,我们提出了一种针对高维图像的超像素层次结构,该结构在构建过程中考虑了高维属性流形。通过这种方式,我们的方法能够在图像空间和属性空间中一致地探索高维图像。我们通过在两个使用案例中将这种新的图像引导层次结构与经典的基于层次嵌入的图像探索进行比较,展示了其在嵌入探索中的有效性。
cs.CV / 79 / 2602.24161
GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction
GeoDiff4D:基于几何感知的4D头部头像重建扩散方法
Abstract
Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diffusion models have enabled remarkable progress in image and video generation for avatar reconstruction, existing methods primarily rely on 2D priors and struggle to achieve consistent 3D geometry. We propose a novel framework that leverages geometry-aware diffusion to learn strong geometry priors for high-fidelity head avatar reconstruction. Our approach jointly synthesizes portrait images and corresponding surface normals, while a pose-free expression encoder captures implicit expression representations. Both synthesized images and expression latents are incorporated into 3D Gaussian-based avatars, enabling photorealistic rendering with accurate geometry. Extensive experiments demonstrate that our method substantially outperforms state-of-the-art approaches in visual quality, expression fidelity, and cross-identity generalization, while supporting real-time rendering.
Chinese Translation
从单张肖像图像重建逼真且可动画的4D头部头像仍然是计算机视觉中的一项基本挑战。尽管扩散模型在头像重建的图像和视频生成方面取得了显著进展,但现有方法主要依赖于2D先验,难以实现一致的3D几何形状。我们提出了一种新颖的框架,利用几何感知扩散来学习高保真头部头像重建的强几何先验。我们的方法共同合成肖像图像及其对应的表面法线,同时无姿态的表情编码器捕捉隐式表情表示。合成的图像和表情潜变量都被纳入基于3D高斯的头像中,从而实现具有准确几何形状的逼真渲染。大量实验表明,我们的方法在视觉质量、表情保真度和跨身份泛化方面显著优于现有最先进的方法,同时支持实时渲染。
cs.CV / 80 / 2602.24181
A Mixed Diet Makes DINO An Omnivorous Vision Encoder
混合饮食使DINO成为一种杂食性视觉编码器
Abstract
Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embedding for an RGB image and its corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, to maximize the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous" by producing a consistent, powerful embedding for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
Chinese Translation
预训练的视觉编码器如DINOv2在单模态任务中表现出色。然而,我们观察到它们在不同模态之间的特征表示对齐效果较差。例如,RGB图像及其对应场景深度图的特征嵌入,其余弦相似度几乎与两个随机、不相关图像的相似度相同。为了解决这个问题,我们提出了杂食性视觉编码器(Omnivorous Vision Encoder),这是一个学习模态无关特征空间的新框架。我们以双重目标训练该编码器:首先,最大化同一场景不同模态之间的特征对齐;其次,采用蒸馏目标,将学习到的表示锚定到完全冻结的教师模型(如DINOv2)的输出上。最终得到的学生编码器通过为给定场景生成一致且强大的嵌入,成为“杂食性”的,无论输入模态(RGB、深度、分割等)如何。这种方法在保持原始基础模型的判别语义的同时,能够实现稳健的跨模态理解。
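The dual objective, cross-modal alignment plus distillation to a frozen teacher, can be written down in a few lines. The specific loss forms (cosine alignment, MSE distillation) and the balance weight `lam` are our assumptions; the paper only specifies the two-term structure.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dual_objective(z_rgb, z_depth, z_teacher, lam=1.0):
    """(i) pull embeddings of different modalities of the same scene together;
    (ii) anchor the student to the frozen teacher's output."""
    align = 1.0 - cosine(z_rgb, z_depth)            # cross-modal alignment term
    distill = np.mean((z_rgb - z_teacher) ** 2)     # distillation term
    return align + lam * distill

rng = np.random.default_rng(0)
teacher = rng.normal(size=32)
aligned = dual_objective(teacher + 0.01, teacher + 0.01, teacher)
misaligned = dual_objective(teacher, rng.normal(size=32), teacher)
print(aligned < misaligned)  # well-aligned modalities achieve a lower loss
```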
cs.CV / 81 / 2602.24183
A multimodal slice discovery framework for systematic failure detection and explanation in medical image classification
一种多模态切片发现框架用于医学图像分类中的系统性故障检测与解释
Abstract
Despite advances in machine learning-based medical image classifiers, the safety and reliability of these systems remain major concerns in practical settings. Existing auditing approaches mainly rely on unimodal features or metadata-based subgroup analyses, which are limited in interpretability and often fail to capture hidden systematic failures. To address these limitations, we introduce the first automated auditing framework that extends slice discovery methods to multimodal representations specifically for medical applications. Comprehensive experiments were conducted under common failure scenarios using the MIMIC-CXR-JPG dataset, demonstrating the framework's strong capability in both failure discovery and explanation generation. Our results also show that multimodal information generally allows more comprehensive and effective auditing of classifiers, while unimodal variants beyond image-only inputs exhibit strong potential in scenarios where resources are constrained.
Chinese Translation
尽管基于机器学习的医学图像分类器取得了进展,但这些系统在实际应用中的安全性和可靠性仍然是主要关注点。现有的审计方法主要依赖单一模态特征或基于元数据的子组分析,这在可解释性上存在局限,且往往无法捕捉隐藏的系统性故障。为了解决这些局限性,我们提出了首个自动化审计框架,该框架将切片发现方法扩展到多模态表示,专门用于医学应用。我们在常见故障场景下使用 MIMIC-CXR-JPG 数据集进行了全面实验,证明了该框架在故障发现和解释生成方面的强大能力。我们的结果还表明,多模态信息通常允许对分类器进行更全面和有效的审计,而超越仅图像输入的单模态变体在资源受限的场景中展现出强大的潜力。
cs.CV / 82 / 2602.24208
SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
SenCache:通过敏感度感知缓存加速扩散模型推理
Abstract
Diffusion models achieve state-of-the-art video generation quality, but their inference remains expensive due to the large number of sequential denoising steps. This has motivated a growing line of research on accelerating diffusion inference. Among training-free acceleration methods, caching reduces computation by reusing previously computed model outputs across timesteps. Existing caching methods rely on heuristic criteria to choose cache/reuse timesteps and require extensive tuning. We address this limitation with a principled sensitivity-aware caching framework. Specifically, we formalize the caching error through an analysis of the model output sensitivity to perturbations in the denoising inputs, i.e., the noisy latent and the timestep, and show that this sensitivity is a key predictor of caching error. Based on this analysis, we propose Sensitivity-Aware Caching (SenCache), a dynamic caching policy that adaptively selects caching timesteps on a per-sample basis. Our framework provides a theoretical basis for adaptive caching, explains why prior empirical heuristics can be partially effective, and extends them to a dynamic, sample-specific approach. Experiments on Wan 2.1, CogVideoX, and LTX-Video show that SenCache achieves better visual quality than existing caching methods under similar computational budgets.
Chinese Translation
扩散模型在视频生成质量上达到了最先进的水平,但由于需要大量的顺序去噪步骤,其推理仍然昂贵。这促使了加速扩散推理的研究不断增长。在无训练加速方法中,缓存通过在时间步之间重用先前计算的模型输出来减少计算量。现有的缓存方法依赖于启发式标准来选择缓存/重用时间步,并且需要大量的调优。我们通过一个有原则的敏感度感知缓存框架来解决这一限制。具体而言,我们通过分析模型输出对去噪输入(即噪声潜变量和时间步)的扰动的敏感度来形式化缓存误差,并表明这种敏感度是缓存误差的关键预测因子。基于这一分析,我们提出了敏感度感知缓存(Sensitivity-Aware Caching,SenCache),这是一种动态缓存策略,能够根据每个样本自适应地选择缓存时间步。我们的框架为自适应缓存提供了理论基础,解释了为何先前的经验启发式方法在某种程度上有效,并将其扩展为动态的、样本特定的方法。在Wan 2.1、CogVideoX和LTX-Video上的实验表明,SenCache在相似计算预算下实现了比现有缓存方法更好的视觉质量。
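The sensitivity-driven cache/recompute decision described above can be sketched as follows. All components are toy stand-ins: the finite-difference probe, the Euler-style update, and the threshold `tau` are illustrative assumptions, not SenCache's actual estimator:

```python
import numpy as np

def sensitivity(model, x, t, eps=1e-3):
    # Finite-difference probe of how strongly the model output reacts
    # to a small perturbation of the noisy latent x at timestep t.
    return float(np.linalg.norm(model(x + eps, t) - model(x, t)) / eps)

def denoise_with_cache(model, x, timesteps, tau=0.5):
    # Reuse the cached output when the estimated sensitivity is below tau;
    # recompute otherwise. Returns the final latent and the recompute count.
    cache, recomputes = None, 0
    for t in timesteps:
        if cache is None or sensitivity(model, x, t) >= tau:
            cache = model(x, t)
            recomputes += 1
        x = x - 0.1 * cache  # toy Euler-style update with the (possibly cached) output
    return x, recomputes

# Toy "model": highly sensitive early (t > 5), nearly flat afterwards.
toy = lambda x, t: x * (1.0 if t > 5 else 0.01)
_, n = denoise_with_cache(toy, np.array([1.0]), [10, 9, 8, 7, 6, 5, 4])
print(n)  # → 5  (only the sensitive timesteps trigger recomputation)
```

The adaptive policy skips recomputation exactly on the low-sensitivity steps, which is where caching error stays small.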
cs.CV / 83 / 2602.24222
MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy
MuViT:用于显微镜多尺度学习的多分辨率视觉变换器
Abstract
Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader tissue organization. Many analysis tasks require combining these scales, yet most vision models operate at a single resolution or derive multi-scale features from one view, limiting their ability to exploit the inherently multi-resolution nature of microscopy data. We introduce MuViT, a transformer architecture built to fuse true multi-resolution observations from the same underlying image. MuViT embeds all patches into a shared world-coordinate system and extends rotary positional embeddings to these coordinates, enabling attention to integrate wide-field context with high-resolution detail within a single encoder. Across synthetic benchmarks, kidney histopathology, and high-resolution mouse-brain microscopy, MuViT delivers consistent improvements over strong ViT and CNN baselines. Multi-resolution MAE pretraining further produces scale-consistent representations that enhance downstream tasks. These results demonstrate that explicit world-coordinate modelling provides a simple yet powerful mechanism for leveraging multi-resolution information in large-scale microscopy analysis.
Chinese Translation
现代显微镜常常产生包含多种空间尺度结构的千兆像素图像,从细胞形态到更广泛的组织结构。许多分析任务需要结合这些尺度,但大多数视觉模型仅在单一分辨率下操作或从单一视图中提取多尺度特征,这限制了它们利用显微镜数据固有的多分辨率特性的能力。我们提出了MuViT,一种旨在融合来自同一基础图像的真实多分辨率观测的变换器架构。MuViT将所有图像块嵌入到一个共享的世界坐标系统中,并将旋转位置嵌入扩展到这些坐标,从而使注意力机制能够在单个编码器中整合广域上下文与高分辨率细节。在合成基准、肾脏组织病理学和高分辨率小鼠大脑显微镜图像中,MuViT在强大的ViT和CNN基线之上提供了一致的改进。多分辨率的MAE预训练进一步产生了尺度一致的表示,增强了下游任务的表现。这些结果表明,显式的世界坐标建模提供了一种简单而强大的机制,用于在大规模显微镜分析中利用多分辨率信息。
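The shared world-coordinate idea behind MuViT can be illustrated with a small sketch. Here `patch_size`, `scale`, and the grid indices are hypothetical, and the real model would apply rotary positional embeddings on top of these coordinates:

```python
def world_coords(patch_ij, patch_size, scale, origin=(0.0, 0.0)):
    # Map a patch's grid index at one resolution level to shared world
    # coordinates, so patches from different zoom levels become directly
    # comparable inside a single attention block (illustrative geometry).
    i, j = patch_ij
    return (origin[0] + i * patch_size * scale,
            origin[1] + j * patch_size * scale)

# A coarse patch at 4x downsampling covers the same world location as the
# corresponding fine patch four grid steps in:
print(world_coords((1, 1), 16, 4.0) == world_coords((4, 4), 16, 1.0))  # → True
```

Because positions live in one coordinate frame, attention can mix wide-field and high-resolution patches without any per-scale bookkeeping.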
cs.CV / 84 / 2602.24233
Enhancing Spatial Understanding in Image Generation via Reward Modeling
通过奖励建模增强图像生成中的空间理解
Abstract
Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity-particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we build SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for the complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.
Chinese Translation
最近在文本到图像生成方面的进展极大地提升了视觉真实感和创造力,但也对提示的复杂性提出了更高的要求,特别是在编码复杂的空间关系方面。在这种情况下,获得令人满意的结果通常需要多次采样尝试。为了解决这一挑战,我们提出了一种新方法,旨在增强当前图像生成模型的空间理解能力。我们首先构建了包含超过8万对偏好的SpatialReward-Dataset。在此数据集的基础上,我们构建了SpatialScore,这是一个旨在评估文本到图像生成中空间关系准确性的奖励模型,其性能甚至超过了领先的专有模型在空间评估上的表现。我们进一步证明,该奖励模型有效地支持了复杂空间生成的在线强化学习。在多个基准测试中的广泛实验表明,我们的专用奖励模型在图像生成的空间理解方面带来了显著且一致的提升。
cs.CV / 85 / 2602.24240
Joint Geometric and Trajectory Consistency Learning for One-Step Real-World Super-Resolution
联合几何与轨迹一致性学习用于一步真实世界超分辨率
Abstract
Diffusion-based Real-World Image Super-Resolution (Real-ISR) achieves impressive perceptual quality but suffers from high computational costs due to iterative sampling. While recent distillation approaches leveraging large-scale Text-to-Image (T2I) priors have enabled one-step generation, they are typically hindered by prohibitive parameter counts and the inherent capability bounds imposed by teacher models. As a lightweight alternative, Consistency Models offer efficient inference but struggle with two critical limitations: the accumulation of consistency drift inherent to transitive training, and a phenomenon we term "Geometric Decoupling" - where the generative trajectory achieves pixel-wise alignment yet fails to preserve structural coherence. To address these challenges, we propose GTASR (Geometric Trajectory Alignment Super-Resolution), a simple yet effective consistency training paradigm for Real-ISR. Specifically, we introduce a Trajectory Alignment (TA) strategy to rectify the tangent vector field via full-path projection, and a Dual-Reference Structural Rectification (DRSR) mechanism to enforce strict structural constraints. Extensive experiments verify that GTASR delivers superior performance over representative baselines while maintaining minimal latency. The code and model will be released at https://github.com/Blazedengcy/GTASR.
Chinese Translation
基于扩散的真实世界图像超分辨率(Real-ISR)在感知质量上取得了令人印象深刻的成果,但由于迭代采样导致的高计算成本使其受到限制。尽管最近利用大规模文本到图像(Text-to-Image, T2I)先验的蒸馏方法实现了一步生成,但通常受到过高的参数数量和教师模型固有能力限制的阻碍。作为一种轻量级替代方案,一致性模型提供了高效的推理,但面临两个关键限制:传递训练固有的一致性漂移的积累,以及我们称之为“几何解耦”的现象——生成轨迹在像素级对齐的同时未能保持结构一致性。为了解决这些挑战,我们提出了GTASR(几何轨迹对齐超分辨率),这是一种简单而有效的真实世界超分辨率一致性训练范式。具体而言,我们引入了一种轨迹对齐(Trajectory Alignment, TA)策略,通过全路径投影来修正切向量场,以及一种双重参考结构修正(Dual-Reference Structural Rectification, DRSR)机制,以强制执行严格的结构约束。大量实验验证了GTASR在保持最小延迟的同时,优于代表性基线的性能。代码和模型将发布在 https://github.com/Blazedengcy/GTASR。
cs.CV / 86 / 2602.24264
Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models
组合泛化需要视觉嵌入模型中的线性正交表示
Abstract
Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Although modern models are trained on massive datasets, they still cover only a tiny fraction of the combinatorial space of possible inputs, raising the question of what structure representations must have to support generalization to unseen combinations. We formalize three desiderata for compositional generalization under standard training (divisibility, transferability, stability) and show they impose necessary geometric constraints: representations must decompose linearly into per-concept components, and these components must be orthogonal across concepts. This provides theoretical grounding for the Linear Representation Hypothesis: the linear structure widely observed in neural representations is a necessary consequence of compositional generalization. We further derive dimension bounds linking the number of composable concepts to the embedding geometry. Empirically, we evaluate these predictions across modern vision models (CLIP, SigLIP, DINO) and find that representations exhibit partial linear factorization with low-rank, near-orthogonal per-concept factors, and that the degree of this structure correlates with compositional generalization on unseen combinations. As models continue to scale, these conditions predict the representational geometry they may converge to. Code is available at https://github.com/oshapio/necessary-compositionality.
Chinese Translation
组合泛化,即在新情境中识别熟悉部分的能力,是智能系统的一个定义特征。尽管现代模型在海量数据集上进行训练,但它们仍然仅覆盖可能输入的组合空间的一小部分,这引发了一个问题:表示必须具备何种结构才能支持对未见组合的泛化。我们在标准训练下形式化了组合泛化的三个期望(可分性、可转移性、稳定性),并表明这些期望施加了必要的几何约束:表示必须线性分解为每个概念的组件,并且这些组件在概念之间必须是正交的。这为线性表示假设提供了理论基础:在神经表示中广泛观察到的线性结构是组合泛化的必要结果。我们进一步推导了将可组合概念的数量与嵌入几何联系起来的维度界限。从经验上,我们在现代视觉模型(CLIP、SigLIP、DINO)中评估了这些预测,发现表示展示了部分线性因子分解,具有低秩、近正交的每个概念因子,并且这种结构的程度与未见组合的组合泛化相关。随着模型的不断扩展,这些条件预测了它们可能收敛的表示几何。代码可在 https://github.com/oshapio/necessary-compositionality 获取。
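A minimal numerical illustration of the linearity and orthogonality desiderata above, using invented per-concept components rather than the paper's learned representations:

```python
import numpy as np

# Hypothetical per-concept components living in orthogonal subspaces
# (colors in dims 0-1, shapes in dims 2-3), per the paper's desiderata.
colors = {"red":  np.array([1.0, 0.5, 0.0, 0.0]),
          "blue": np.array([0.2, 1.0, 0.0, 0.0])}
shapes = {"cube": np.array([0.0, 0.0, 1.0, 0.3]),
          "ball": np.array([0.0, 0.0, 0.4, 1.0])}

def embed(color, shape):
    # Linear (additive) composition of per-concept components.
    return colors[color] + shapes[shape]

# With linear, orthogonal structure, an unseen combination is recoverable
# from seen ones by vector arithmetic:
predicted = embed("blue", "cube") + embed("red", "ball") - embed("red", "cube")
print(np.allclose(predicted, embed("blue", "ball")))  # → True

# Cross-concept components are orthogonal (zero dot products):
print(all(float(c @ s) == 0.0 for c in colors.values() for s in shapes.values()))  # → True
```

This is the geometric mechanism the paper argues is necessary: if the factorization were non-linear or the subspaces overlapped, the vector-arithmetic prediction for the unseen combination would fail.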
cs.CV / 87 / 2602.24275
Hierarchical Action Learning for Weakly-Supervised Action Segmentation
用于弱监督动作分割的层次化动作学习
Abstract
Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features, tend to over-segment. This highlights the difficulty of enabling hierarchical reasoning in video understanding. Interestingly, we observe that lower-level visual and high-level action latent variables evolve at different rates, with low-level visual variables changing rapidly, while high-level action variables evolve more slowly, making them easier to identify. Building on this insight, we propose the Hierarchical Action Learning (\textbf{HAL}) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process, where high-level latent action governs the dynamics of low-level visual features. To model these varying timescales effectively, we introduce deterministic processes to align these latent variables over time. The \textbf{HAL} model employs a hierarchical pyramid transformer to capture both visual features and latent variables, and a sparse transition constraint is applied to enforce the slower dynamics of high-level action variables. This mechanism enhances the identification of these latent variables over time. Under mild assumptions, we prove that these latent action variables are strictly identifiable. Experimental results on several benchmarks show that the \textbf{HAL} model significantly outperforms existing methods for weakly-supervised action segmentation, confirming its practical effectiveness in real-world applications.
Chinese Translation
人类通过关键转变感知动作,这些转变在多个抽象层次上构建动作,而机器依赖于视觉特征,往往会过度分割。这突显了在视频理解中实现层次推理的困难。有趣的是,我们观察到低层次视觉变量和高层次动作潜变量以不同的速率演变,低层次视觉变量变化迅速,而高层次动作变量演变较慢,使其更易于识别。基于这一见解,我们提出了层次化动作学习(Hierarchical Action Learning, \textbf{HAL})模型,用于弱监督动作分割。我们的方法引入了一种层次化因果数据生成过程,其中高层次潜在动作主导低层次视觉特征的动态。为了有效建模这些不同的时间尺度,我们引入了确定性过程以在时间上对齐这些潜变量。\textbf{HAL} 模型采用层次化金字塔变换器来捕捉视觉特征和潜变量,并施加稀疏转变约束,以强化高层次动作变量的较慢动态。这一机制增强了这些潜变量随时间的识别能力。在温和的假设下,我们证明这些潜在动作变量是严格可识别的。在多个基准测试上的实验结果表明,\textbf{HAL} 模型显著优于现有的弱监督动作分割方法,确认了其在实际应用中的有效性。
cs.CV / 88 / 2602.24289
Mode Seeking meets Mean Seeking for Fast Long Video Generation
模式寻求与均值寻求相结合的快速长视频生成
Abstract
Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos that learns long-range coherence and motions from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to a frozen short-video teacher, resulting in a few-step fast long video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion and long-range consistency. Project website: https://primecai.github.io/mmm/.
Chinese Translation
将视频生成从秒级扩展到分钟级面临一个关键瓶颈:短视频数据虽然丰富且高保真,但连贯的长视频数据却稀缺且局限于狭窄的领域。为了解决这一问题,我们提出了一种训练范式,将模式寻求与均值寻求相结合,通过解耦局部保真度与长期连贯性,基于解耦扩散变换器(Decoupled Diffusion Transformer)实现统一表示。我们的方法利用一个通过监督学习在长视频上训练的全局流匹配头(Flow Matching head)来捕捉叙事结构,同时采用一个局部分布匹配头(Distribution Matching head),通过模式寻求的反KL散度将滑动窗口与冻结的短视频教师对齐。这一策略使得我们能够合成分钟级视频,从有限的长视频中学习长程连贯性和运动,同时通过将学生的每个滑动窗口段与冻结的短视频教师对齐,继承局部现实感,从而实现一个几步快速长视频生成器。评估结果表明,我们的方法通过共同改善局部清晰度、运动和长程一致性,有效缩小了保真度与视野之间的差距。项目网站:https://primecai.github.io/mmm/.
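The mode-seeking vs. mean-seeking distinction in the abstract above rests on the direction of the KL divergence. A small discrete sketch, with all distributions invented for illustration:

```python
import math

def kl(p, q):
    # Discrete KL divergence D(p || q); terms with p_i = 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A bimodal "data" distribution over 3 bins and two unimodal candidate fits.
p = [0.495, 0.01, 0.495]        # two modes, little mass in between
q_mode = [0.98, 0.01, 0.01]     # locks onto a single mode
q_mean = [0.34, 0.33, 0.33]     # smears mass across the support

# Reverse KL (candidate first) is mode-seeking: it favors the sharp fit.
print(kl(q_mode, p) < kl(q_mean, p))  # → True
# Forward KL (data first) is mean-seeking: it favors covering all of p's mass.
print(kl(p, q_mean) < kl(p, q_mode))  # → True
```

This is why the paper pairs a mean-seeking supervised flow-matching head (coverage of long-range structure) with a mode-seeking reverse-KL distribution-matching head (sharp local realism from the short-video teacher).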
cs.CV / 89 / 2602.24290
UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images
UFO-4D:基于两幅图像的无姿态前馈4D重建
Abstract
Dense 4D reconstruction from unposed images remains a critical challenge, with current methods relying on slow test-time optimization or fragmented, task-specific feedforward models. We introduce UFO-4D, a unified feedforward framework to reconstruct a dense, explicit 4D representation from just a pair of unposed images. UFO-4D directly estimates dynamic 3D Gaussian Splats, enabling the joint and consistent estimation of 3D geometry, 3D motion, and camera pose in a feedforward manner. Our core insight is that differentiably rendering multiple signals from a single Dynamic 3D Gaussian representation offers major training advantages. This approach enables a self-supervised image synthesis loss while tightly coupling appearance, depth, and motion. Since all modalities share the same geometric primitives, supervising one inherently regularizes and improves the others. This synergy overcomes data scarcity, allowing UFO-4D to outperform prior work by up to 3 times in joint geometry, motion, and camera pose estimation. Our representation also enables high-fidelity 4D interpolation across novel views and time. Please visit our project page for visual results: https://ufo-4d.github.io/
Chinese Translation
从无姿态图像进行密集4D重建仍然是一个关键挑战,目前的方法依赖于缓慢的测试时间优化或碎片化的任务特定前馈模型。我们提出了UFO-4D,一个统一的前馈框架,仅通过一对无姿态图像重建密集的显式4D表示。UFO-4D直接估计动态3D高斯点云,从而以前馈方式实现3D几何、3D运动和相机姿态的联合一致估计。我们的核心见解是,从单一动态3D高斯表示中可微分地渲染多个信号提供了显著的训练优势。这种方法能够实现自监督的图像合成损失,同时紧密耦合外观、深度和运动。由于所有模态共享相同的几何原语,对一个模态的监督本质上会正则化并改善其他模态。这种协同作用克服了数据稀缺,使得UFO-4D在联合几何、运动和相机姿态估计方面的表现比之前的工作提高了多达3倍。我们的表示还支持在新视角和时间上的高保真4D插值。请访问我们的项目页面以查看可视化结果:https://ufo-4d.github.io/
cs.AI / 1 / 2602.23367
HumanMCP: A Human-Like Query Dataset for Evaluating MCP Tool Retrieval Performance
HumanMCP:用于评估MCP工具检索性能的类人查询数据集
Abstract
Model Context Protocol (MCP) servers contain a collection of thousands of open-source standardized tools, linking LLMs to external systems; however, existing datasets and benchmarks lack realistic, human-like user queries, leaving a critical gap in evaluating the tool usage and ecosystems of MCP servers. Existing datasets often do contain tool descriptions but fail to represent how different users phrase their requests, leading to poor generalization and inflated reliability of certain benchmarks. This paper introduces the first large-scale MCP dataset featuring diverse, high-quality user queries generated specifically to match 2800 tools across 308 MCP servers, building on the MCP Zero dataset. Each tool is paired with multiple unique generated user personas to capture varying levels of user intent, ranging from precise task requests to ambiguous, exploratory commands, reflecting the complexity of real-world interaction patterns.
Chinese Translation
模型上下文协议(MCP)服务器包含数千个开源标准化工具的集合,将大语言模型(LLMs)与外部系统连接起来;然而,现有的数据集和基准缺乏真实的类人用户查询,这在评估MCP服务器的工具使用和生态系统中仍然是一个关键缺口。现有数据集虽然包含工具描述,但未能代表不同用户如何表达他们的请求,导致泛化能力差以及某些基准的可靠性被夸大。本文介绍了第一个大规模MCP数据集,具有多样化的高质量用户查询,专门生成以匹配308个MCP服务器上的2800个工具,基于MCP Zero数据集进行开发。每个工具都与多个独特的用户角色配对,以捕捉从精确任务请求到模糊探索性命令的不同用户意图水平,反映出现实世界交互模式的复杂性。
cs.AI / 2 / 2602.23373
An Agentic LLM Framework for Adverse Media Screening in AML Compliance
一种用于反洗钱合规中不良媒体筛查的代理型大语言模型框架
Abstract
Adverse media screening is a critical component of anti-money laundering (AML) and know-your-customer (KYC) compliance processes in financial institutions. Traditional approaches rely on keyword-based searches that generate high false-positive rates or require extensive manual review. We present an agentic system that leverages Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to automate adverse media screening. Our system implements a multi-step approach where an LLM agent searches the web, retrieves and processes relevant documents, and computes an Adverse Media Index (AMI) score for each subject. We evaluate our approach using multiple LLM backends on a dataset comprising Politically Exposed Persons (PEPs), persons from regulatory watchlists, and sanctioned persons from OpenSanctions and clean names from academic sources, demonstrating the system's ability to distinguish between high-risk and low-risk individuals.
Chinese Translation
不良媒体筛查是金融机构反洗钱(AML)和了解你的客户(KYC)合规流程中的关键组成部分。传统方法依赖于基于关键词的搜索,这会产生较高的误报率或需要大量的人工审查。我们提出了一种代理系统,利用大语言模型(LLMs)与检索增强生成(RAG)技术来自动化不良媒体筛查。我们的系统实施了一种多步骤的方法,其中LLM代理在网络上搜索,检索和处理相关文档,并为每个主体计算不良媒体指数(AMI)分数。我们使用多个LLM后端在一个包含政治曝光人士(PEPs)、来自监管观察名单的人员、以及来自OpenSanctions的制裁人员和来自学术来源的清白姓名的数据集上评估了我们的方法,展示了该系统区分高风险和低风险个体的能力。
cs.AI / 3 / 2602.23541
Causal Identification from Counterfactual Data: Completeness and Bounding Results
基于反事实数据的因果识别:完整性与界限结果
Abstract
Previous work establishing completeness results for $\textit{counterfactual identification}$ has been circumscribed to the setting where the input data belongs to observational or interventional distributions (Layers 1 and 2 of Pearl's Causal Hierarchy), since it was generally presumed impossible to obtain data from counterfactual distributions, which belong to Layer 3. However, recent work (Raghavan & Bareinboim, 2025) has formally characterized a family of counterfactual distributions which can be directly estimated via experimental methods - a notion they call $\textit{counterfactual realizability}$. This leaves open the question of what $\textit{additional}$ counterfactual quantities now become identifiable, given this new access to (some) Layer 3 data. To answer this question, we develop the CTFIDU+ algorithm for identifying counterfactual queries from an arbitrary set of Layer 3 distributions, and prove that it is complete for this task. Building on this, we establish the theoretical limit of which counterfactuals can be identified from physically realizable distributions, thus implying the $\textit{fundamental limit to exact causal inference in the non-parametric setting}$. Finally, given the impossibility of identifying certain critical types of counterfactuals, we derive novel analytic bounds for such quantities using realizable counterfactual data, and corroborate using simulations that counterfactual data helps tighten the bounds for non-identifiable quantities in practice.
Chinese Translation
以往关于反事实识别的完整性结果的研究仅限于输入数据属于观察性或干预性分布的情境(Pearl因果层次结构的第1层和第2层),因为通常认为从反事实分布(属于第3层)获取数据是不可能的。然而,最近的研究(Raghavan & Bareinboim, 2025)正式描述了一类可以通过实验方法直接估计的反事实分布——他们称之为反事实可实现性(counterfactual realizability)。这引出了一个问题,即在获得(某些)第3层数据的新情况下,哪些额外的反事实量现在变得可识别。为了解答这个问题,我们开发了CTFIDU+算法,用于从任意一组第3层分布中识别反事实查询,并证明该算法在这一任务上是完整的。在此基础上,我们确立了从物理可实现分布中可以识别的反事实的理论极限,从而暗示了在非参数设置下因果推断的基本极限。最后,鉴于某些关键类型的反事实无法识别,我们利用可实现的反事实数据推导出这些量的新分析界限,并通过模拟验证反事实数据在实践中有助于收紧不可识别量的界限。
cs.AI / 4 / 2602.23545
Planning under Distribution Shifts with Causal POMDPs
在分布转移下的规划与因果部分可观测马尔可夫决策过程
Abstract
In the real world, planning is often challenged by distribution shifts. As such, a model of the environment obtained under one set of conditions may no longer remain valid as the distribution of states or the environment dynamics change, which in turn causes previously learned strategies to fail. In this work, we propose a theoretical framework for planning under partial observability using Partially Observable Markov Decision Processes (POMDPs) formulated using causal knowledge. By representing shifts in the environment as interventions on this causal POMDP, the framework enables evaluating plans under hypothesized changes and actively identifying which components of the environment have been altered. We show how to maintain and update a belief over both the latent state and the underlying domain, and we prove that the value function remains piecewise linear and convex (PWLC) in this augmented belief space. Preservation of PWLC under distribution shifts has the advantage of maintaining the tractability of planning via $\alpha$-vector-based POMDP methods.
Chinese Translation
在现实世界中,规划常常面临分布转移的挑战。因此,在一组条件下获得的环境模型可能在状态分布或环境动态发生变化时不再有效,这反过来导致先前学习的策略失效。在本研究中,我们提出了一种基于因果知识的部分可观测马尔可夫决策过程(POMDP)理论框架,用于在部分可观测性下进行规划。通过将环境的变化表示为对该因果POMDP的干预,该框架能够在假设的变化下评估计划,并主动识别环境中哪些组件已被改变。我们展示了如何在潜在状态和基础领域上维持和更新信念,并证明在这个增强的信念空间中,价值函数保持分段线性和凸性(PWLC)。在分布转移下保持PWLC的优势在于通过基于$\alpha$-向量的POMDP方法保持规划的可处理性。
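The PWLC property means the value function is a maximum over linear functions of the belief, which is exactly what keeps $\alpha$-vector planning tractable. A minimal sketch with invented $\alpha$-vectors over a two-point augmented belief space:

```python
import numpy as np

def pomdp_value(belief, alpha_vectors):
    # PWLC value function: V(b) = max over alpha-vectors of <alpha, b>,
    # where each alpha-vector scores one conditional plan per state.
    return max(float(np.dot(a, belief)) for a in alpha_vectors)

# Here the belief ranges over an augmented space of (latent state, domain)
# pairs, collapsed to 2 points for illustration; the alphas are made up.
alphas = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.6, 0.6])]
print(pomdp_value(np.array([0.5, 0.5]), alphas))  # → 0.6
print(pomdp_value(np.array([1.0, 0.0]), alphas))  # → 1.0
```

The paper's result is that this max-of-linear structure survives when the belief is extended to cover both the latent state and the (possibly shifted) domain, so standard $\alpha$-vector machinery still applies.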
cs.AI / 5 / 2602.23579
Construct, Merge, Solve & Adapt with Reinforcement Learning for the min-max Multiple Traveling Salesman Problem
基于强化学习的最小-最大多旅行推销员问题的构建、合并、求解与适应
Abstract
The Multiple Traveling Salesman Problem (mTSP) extends the Traveling Salesman Problem to m tours that start and end at a common depot and jointly visit all customers exactly once. In the min-max variant, the objective is to minimize the longest tour, reflecting workload balance. We propose a hybrid approach, Construct, Merge, Solve & Adapt with Reinforcement Learning (RL-CMSA), for the symmetric single-depot min-max mTSP. The method iteratively constructs diverse solutions using probabilistic clustering guided by learned pairwise q-values, merges routes into a compact pool, solves a restricted set-covering MILP, and refines solutions via inter-route remove, shift, and swap moves. The q-values are updated by reinforcing city-pair co-occurrences in high-quality solutions, while the pool is adapted through ageing and pruning. This combination of exact optimization and reinforcement-guided construction balances exploration and exploitation. Computational results on random and TSPLIB instances show that RL-CMSA consistently finds (near-)best solutions and outperforms a state-of-the-art hybrid genetic algorithm under comparable time limits, especially as instance size and the number of salesmen increase.
Chinese Translation
多旅行推销员问题(mTSP)将旅行推销员问题扩展到 m 条从共同仓库出发并结束于同一地点的路线,且每个客户恰好被访问一次。在最小-最大变体中,目标是最小化最长的路线,以反映工作负载的平衡。我们提出了一种混合方法,即基于强化学习的构建、合并、求解与适应(RL-CMSA),用于对称单仓库的最小-最大 mTSP。该方法通过学习的成对 q 值指导的概率聚类,迭代构建多样化的解决方案,将路线合并为紧凑的池,求解限制集覆盖的混合整数线性规划(MILP),并通过跨路线的移除、移动和交换操作来优化解决方案。q 值通过强化高质量解决方案中城市对的共现进行更新,而池则通过老化和修剪进行调整。这种精确优化与强化指导构建的结合平衡了探索与开发。在随机和 TSPLIB 实例上的计算结果表明,RL-CMSA 一直能够找到(近)最优解,并在可比时间限制下优于最先进的混合遗传算法,尤其是在实例规模和推销员数量增加时。
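The q-value reinforcement step in RL-CMSA can be sketched as follows; the learning rate and update rule here are illustrative stand-ins for the paper's scheme:

```python
from collections import defaultdict

def reinforce(q, tours, lr=0.1):
    # Reinforce pairwise q-values from one high-quality mTSP solution:
    # city pairs that share a tour move toward 1, biasing the probabilistic
    # clustering of later construction rounds toward co-assigning them.
    for tour in tours:
        for i in tour:
            for j in tour:
                if i < j:
                    q[(i, j)] += lr * (1.0 - q[(i, j)])
    return q

q = defaultdict(float)
reinforce(q, [[1, 2], [3, 4]])   # one elite solution with two tours
print(q[(1, 2)])  # → 0.1  (co-occurred in a tour, reinforced)
print(q[(1, 3)])  # → 0.0  (never shared a tour)
```

Repeated updates from elite solutions concentrate probability mass on city pairs that consistently belong together, which is what guides the construction phase.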
cs.AI / 6 / 2602.23605
SleepLM: Natural-Language Intelligence for Human Sleep
SleepLM:用于人类睡眠的自然语言智能
Abstract
We present SleepLM, a family of sleep-language foundation models that enable human sleep alignment, interpretation, and interaction with natural language. Despite the critical role of sleep, learning-based sleep analysis systems operate in closed label spaces (e.g., predefined stages or events) and fail to describe, query, or generalize to novel sleep phenomena. SleepLM bridges natural language and multimodal polysomnography, enabling language-grounded representations of sleep physiology. To support this alignment, we introduce a multilevel sleep caption generation pipeline that enables the curation of the first large-scale sleep-text dataset, comprising over 100K hours of data from more than 10,000 individuals. Furthermore, we present a unified pretraining objective that combines contrastive alignment, caption generation, and signal reconstruction to better capture physiological fidelity and cross-modal interactions. Extensive experiments on real-world sleep understanding tasks verify that SleepLM outperforms state-of-the-art in zero-shot and few-shot learning, cross-modal retrieval, and sleep captioning. Importantly, SleepLM also exhibits intriguing capabilities including language-guided event localization, targeted insight generation, and zero-shot generalization to unseen tasks. All code and data will be open-sourced.
Chinese Translation
我们提出了SleepLM,一个睡眠语言基础模型系列,旨在实现人类睡眠的对齐、解释和与自然语言的交互。尽管睡眠在生活中扮演着至关重要的角色,基于学习的睡眠分析系统通常在封闭标签空间中运行(例如,预定义的阶段或事件),无法描述、查询或推广到新的睡眠现象。SleepLM架起了自然语言与多模态多导睡眠监测(polysomnography)之间的桥梁,使得睡眠生理学的语言基础表示成为可能。为了支持这种对齐,我们引入了一个多层次的睡眠标题生成管道,能够策划出首个大规模睡眠文本数据集,包含来自超过10,000个个体的10万小时以上的数据。此外,我们提出了一个统一的预训练目标,结合了对比对齐、标题生成和信号重构,以更好地捕捉生理保真性和跨模态交互。在真实世界的睡眠理解任务上的广泛实验验证了SleepLM在零样本和少样本学习、跨模态检索和睡眠标题生成方面优于当前最先进的技术。重要的是,SleepLM还展现了引人注目的能力,包括基于语言的事件定位、针对性洞察生成以及对未见任务的零样本泛化。所有代码和数据将会开源。
cs.AI / 7 / 2602.23632
MMKG-RDS: Reasoning Data Synthesis via Deep Mining of Multimodal Knowledge Graphs
MMKG-RDS:通过深度挖掘多模态知识图谱进行推理数据合成
Abstract
Synthesizing high-quality training data is crucial for enhancing domain models' reasoning abilities. Existing methods face limitations in long-tail knowledge coverage, effectiveness verification, and interpretability. Knowledge-graph-based approaches still fall short in functionality, granularity, customizability, and evaluation. To address these issues, we propose MMKG-RDS, a flexible framework for reasoning data synthesis that leverages multimodal knowledge graphs. It supports fine-grained knowledge extraction, customizable path sampling, and multidimensional data quality scoring. We validate MMKG-RDS with the MMKG-RDS-Bench dataset, covering five domains, 17 task types, and 14,950 samples. Experimental results show fine-tuning Qwen3 models (0.6B/8B/32B) on a small number of synthesized samples improves reasoning accuracy by 9.2%. The framework also generates distinctive data that challenges existing models on tasks involving tables and formulas, making it useful for constructing complex benchmarks. The dataset and code are available at https://github.com/360AILAB-NLP/MMKG-RDS
Chinese Translation
合成高质量的训练数据对于增强领域模型的推理能力至关重要。现有方法在长尾知识覆盖、有效性验证和可解释性方面存在局限。基于知识图谱的方法在功能性、粒度、可定制性和评估方面仍显不足。为了解决这些问题,我们提出了MMKG-RDS,一个灵活的推理数据合成框架,利用多模态知识图谱。该框架支持细粒度知识提取、可定制路径采样和多维数据质量评分。我们使用涵盖五个领域、17种任务类型和14,950个样本的MMKG-RDS-Bench数据集验证了MMKG-RDS。实验结果表明,在少量合成样本上微调Qwen3模型(0.6B/8B/32B)可将推理准确率提高9.2%。该框架还生成独特的数据,挑战现有模型在涉及表格和公式的任务中,适用于复杂基准的构建。数据集和代码可在https://github.com/360AILAB-NLP/MMKG-RDS获取。
cs.AI / 8 / 2602.23643
AI Must Embrace Specialization via Superhuman Adaptable Intelligence
人工智能必须通过超人类适应性智能拥抱专业化
Abstract
Everyone from AI executives and researchers to doomsayers, politicians, and activists is talking about Artificial General Intelligence (AGI). Yet, they often don't seem to agree on its exact definition. One common definition of AGI is an AI that can do everything a human can do, but are humans truly general? In this paper, we address what's wrong with our conception of AGI, and why, even in its most coherent formulation, it is a flawed concept to describe the future of AI. We explore whether the most widely accepted definitions are plausible, useful, and truly general. We argue that AI must embrace specialization, rather than strive for generality, and in its specialization strive for superhuman performance, and introduce Superhuman Adaptable Intelligence (SAI). SAI is defined as intelligence that can learn to exceed humans at anything important that we can do, and that can fill in the skill gaps where humans are incapable. We then lay out how SAI can help hone a discussion around AI that was blurred by an overloaded definition of AGI, and extrapolate the implications of using it as a guide for the future.
Chinese Translation
从人工智能高管和研究人员到悲观主义者、政治家和活动家,大家都在讨论人工通用智能(AGI)。然而,他们似乎并未就其确切定义达成一致。AGI的一个常见定义是能够做所有人类能做的事情的人工智能,但人类真的算是通用的吗?在本文中,我们探讨了我们对AGI的概念存在的问题,以及即使在其最连贯的表述中,为什么它仍然是一个描述人工智能未来的缺陷概念。我们考察了最广泛接受的定义是否合理、有用,并且真正通用。我们认为,人工智能必须拥抱专业化,而不是追求通用性,并在其专业化中追求超人类的表现,提出了超人类适应性智能(SAI)的概念。SAI被定义为能够学习超越人类在任何重要领域的智能,并能够填补人类无法胜任的技能空白。接着,我们阐述了SAI如何帮助澄清围绕人工智能的讨论,这一讨论因AGI的过载定义而变得模糊,并推测使用SAI作为未来指导的影响。
cs.AI / 9 / 2602.23668
PseudoAct: Leveraging Pseudocode Synthesis for Flexible Planning and Action Control in Large Language Model Agents
PseudoAct:利用伪代码合成实现大语言模型代理的灵活规划与行动控制
Abstract
Large language model (LLM) agents typically rely on reactive decision-making paradigms such as ReAct, selecting actions conditioned on growing execution histories. While effective for short tasks, these approaches often lead to redundant tool usage, unstable reasoning, and high token consumption in complex long-horizon tasks involving branching, iteration, or multi-tool coordination. To address these limitations, this paper introduces PseudoAct, a novel framework for flexible planning and action control in LLM agents through pseudocode synthesis. Leveraging the ability of LLMs to express task-solving strategies as code, PseudoAct synthesizes a structured pseudocode plan that decomposes a task into subtasks and explicitly encodes control flow, including sequencing, conditionals, loops, parallel composition, and combinations of these logic primitives. Actions are then executed by following this global plan, making the decision logic explicit and temporally coherent. This design reduces redundant actions, prevents infinite loops, and avoids uninformative alternative exploration, enabling consistent and efficient long-horizon decision-making. Experiments on benchmark datasets show that our method significantly outperforms existing reactive agent approaches, achieving a 20.93% absolute gain in success rate on FEVER and setting a new state-of-the-art on HotpotQA.
Chinese Translation
大型语言模型(LLM)代理通常依赖于反应式决策范式,如 ReAct,根据不断增长的执行历史选择行动。虽然这种方法在短期任务中有效,但在涉及分支、迭代或多工具协调的复杂长期任务中,这些方法往往导致工具使用冗余、推理不稳定以及高令牌消耗。为了解决这些局限性,本文提出了 PseudoAct,一个通过伪代码合成实现 LLM 代理灵活规划与行动控制的新框架。利用 LLM 表达任务解决策略为代码的能力,PseudoAct 合成一个结构化的伪代码计划,将任务分解为子任务,并明确编码控制流,包括顺序、条件、循环、并行组合及这些逻辑原语的组合。然后,通过遵循这一全局计划来执行行动,使决策逻辑明确且时间上连贯。这一设计减少了冗余行动,防止了无限循环,并避免了无信息的替代探索,从而实现了一致且高效的长期决策。基于基准数据集的实验表明,我们的方法显著优于现有的反应式代理方法,在 FEVER 上实现了 20.93% 的绝对成功率提升,并在 HotpotQA 上设定了新的最先进水平。
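A tiny interpreter conveys the idea of executing a structured pseudocode plan with explicit control flow, as described above. The plan encoding (`seq`/`if`/`while`/`act` tuples) and the toy tools are invented for illustration, not PseudoAct's actual plan format:

```python
def run_plan(plan, state, tools):
    # Interpret a small structured plan: ('seq', ...), ('if', cond, then, else),
    # ('while', cond, body), or ('act', tool_name, arg). The global plan makes
    # the decision logic explicit instead of re-deciding at every step.
    kind = plan[0]
    if kind == "seq":
        for sub in plan[1:]:
            state = run_plan(sub, state, tools)
    elif kind == "if":
        _, cond, then, other = plan
        state = run_plan(then if cond(state) else other, state, tools)
    elif kind == "while":
        _, cond, body = plan
        while cond(state):
            state = run_plan(body, state, tools)
    elif kind == "act":
        _, name, arg = plan
        state = tools[name](state, arg)
    return state

# Toy tools and plan: keep adding until the total reaches the target, then label it.
tools = {"add": lambda s, k: {**s, "total": s["total"] + k},
         "tag": lambda s, label: {**s, "tag": label}}
plan = ("seq",
        ("while", lambda s: s["total"] < 5, ("act", "add", 2)),
        ("act", "tag", "done"))
print(run_plan(plan, {"total": 0}, tools))  # → {'total': 6, 'tag': 'done'}
```

Because the loop condition is explicit, the executor terminates deterministically and never re-invokes a tool out of confusion about history, which is the failure mode of purely reactive agents the paper targets.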
cs.AI / 10 / 2602.23681
ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference
ODAR:基于主动推理的LLM推理的原则性自适应路由
Abstract
The paradigm of large language model (LLM) reasoning is shifting from parameter scaling to test-time compute scaling, yet many existing approaches still rely on uniform brute-force sampling (for example, fixed best-of-N or self-consistency) that is costly, hard to attribute, and can trigger overthinking with diminishing returns. We propose ODAR-Expert, an adaptive routing framework that optimizes the accuracy-efficiency trade-off via principled resource allocation. ODAR uses a difficulty estimator grounded in amortized active inference to dynamically route queries between a heuristic Fast Agent and a deliberative Slow Agent. We further introduce a free-energy-principled, risk-sensitive fusion mechanism that selects answers by minimizing a variational free energy objective, balancing log-likelihood with epistemic uncertainty (varentropy) as a principled alternative to ad hoc voting over heterogeneous candidates. Extensive evaluation across 23 benchmarks shows strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity's Last Exam (HLE), while improving the compute-accuracy frontier under compute-matched settings. We also validate reproducibility on a fully open-source stack (Llama 4 + DeepSeek), where ODAR surpasses homogeneous sampling strategies while reducing computational costs by 82%. Overall, our results suggest that thinking-optimal scaling requires adaptive resource allocation with free-energy-based decision-making rather than simply increasing test-time compute.
Chinese Translation
大型语言模型(LLM)推理的范式正从参数扩展转向测试时计算扩展,然而许多现有方法仍依赖于均匀的暴力采样(例如,固定的最佳N或自一致性),这种方法成本高、难以归因,并且可能导致过度思考,收益递减。我们提出了ODAR-Expert,一个自适应路由框架,通过原则性的资源分配优化准确性与效率的权衡。ODAR使用基于摊销主动推理的难度估计器,动态地在启发式快速代理(Fast Agent)和深思熟虑的慢速代理(Slow Agent)之间路由查询。我们进一步引入了一种基于自由能原则的风险敏感融合机制,通过最小化变分自由能目标来选择答案,平衡对数似然与认知不确定性(varentropy),作为对异构候选者进行临时投票的原则性替代。对23个基准的广泛评估显示出强劲且一致的提升,包括在MATH上达到98.2%的准确率和在人类最后考试(HLE)上达到54.8%的准确率,同时在计算匹配设置下改善了计算-准确性边界。我们还在一个完全开源的堆栈(Llama 4 + DeepSeek)上验证了可重复性,其中ODAR超越了同质采样策略,同时将计算成本降低了82%。总体而言,我们的结果表明,思维最优扩展需要基于自由能的决策制定的自适应资源分配,而不仅仅是简单地增加测试时计算。
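The free-energy-based fusion described in the abstract can be sketched in a few lines. The exact objective is not given, so the additive trade-off between negative log-likelihood and varentropy, and the `beta` weight, are assumptions for illustration only:

```python
def varentropy(token_logprobs):
    """Variance of token-level surprisal (-log p): a spread measure of
    the model's uncertainty over a candidate's tokens."""
    surprisals = [-lp for lp in token_logprobs]
    mean = sum(surprisals) / len(surprisals)
    return sum((s - mean) ** 2 for s in surprisals) / len(surprisals)

def free_energy_select(candidates, beta=1.0):
    """Pick the candidate minimizing a free-energy-style score:
    negative log-likelihood penalized by varentropy (epistemic spread).
    `candidates` maps answer -> list of per-token log-probabilities.
    The additive form and `beta` are assumptions, not the paper's formula."""
    def score(logprobs):
        nll = -sum(logprobs)
        return nll + beta * varentropy(logprobs)
    return min(candidates, key=lambda a: score(candidates[a]))
```

Under this sketch, a candidate with moderately high likelihood but low token-level uncertainty can beat one whose likelihood is concentrated on a few confident tokens, which is the intended contrast with plain majority voting.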
cs.AI / 11 / 2602.23701
From Flat Logs to Causal Graphs: Hierarchical Failure Attribution for LLM-based Multi-Agent Systems
从平面日志到因果图:基于大语言模型的多智能体系统的层次化故障归因
Abstract
LLM-powered Multi-Agent Systems (MAS) have demonstrated remarkable capabilities in complex domains but suffer from inherent fragility and opaque failure mechanisms. Existing failure attribution methods, whether relying on direct prompting, costly replays, or supervised fine-tuning, typically treat execution logs as flat sequences. This linear perspective fails to disentangle the intricate causal links inherent to MAS, leading to weak observability and ambiguous responsibility boundaries. To address these challenges, we propose CHIEF, a novel framework that transforms chaotic trajectories into a structured hierarchical causal graph. It then employs hierarchical oracle-guided backtracking to efficiently prune the search space via synthesized virtual oracles. Finally, it implements counterfactual attribution via a progressive causal screening strategy to rigorously distinguish true root causes from propagated symptoms. Experiments on the Who&When benchmark show that CHIEF outperforms eight strong, state-of-the-art baselines on both agent- and step-level accuracy. Ablation studies further confirm the critical role of each proposed module.
Chinese Translation
基于大语言模型(LLM)的多智能体系统(MAS)在复杂领域展现出了卓越的能力,但也存在固有的脆弱性和不透明的故障机制。现有的故障归因方法,无论是依赖直接提示、昂贵的重放,还是监督微调,通常将执行日志视为平面序列。这种线性视角未能解开MAS固有的复杂因果关系,导致观察能力弱和责任边界模糊。为了解决这些挑战,我们提出了CHIEF,一个将混乱轨迹转化为结构化层次因果图的新框架。该框架随后采用层次化的oracle引导回溯,通过合成的虚拟oracle有效地修剪搜索空间。最后,它通过渐进的因果筛选策略实现反事实归因,以严格区分真实根本原因和传播的症状。在Who&When基准上的实验表明,CHIEF在智能体级和步骤级准确性上均优于八个强大且最先进的基线。消融研究进一步确认了每个提出模块的关键作用。
cs.AI / 12 / 2602.23716
ProductResearch: Training E-Commerce Deep Research Agents via Multi-Agent Synthetic Trajectory Distillation
ProductResearch:通过多智能体合成轨迹蒸馏训练电子商务深度研究代理
Abstract
Large Language Model (LLM)-based agents show promise for e-commerce conversational shopping, yet existing implementations lack the interaction depth and contextual breadth required for complex product research. Meanwhile, the Deep Research paradigm, despite advancing information synthesis in web search, suffers from domain gaps when transferred to e-commerce. We propose ProductResearch, a multi-agent framework that synthesizes high-fidelity, long-horizon tool-use trajectories for training robust e-commerce shopping agents. The framework employs a User Agent to infer nuanced shopping intents from behavioral histories, and a Supervisor Agent that orchestrates iterative collaboration with a Research Agent to generate synthetic trajectories culminating in comprehensive, insightful product research reports. These trajectories are rigorously filtered and distilled through a reflective internalization process that consolidates multi-agent supervisory interactions into coherent single-role training examples, enabling effective fine-tuning of LLM agents for complex shopping inquiries. Extensive experiments show that a compact MoE model fine-tuned on our synthetic data achieves substantial improvements over its base model in response comprehensiveness, research depth, and user-perceived utility, approaching the performance of frontier proprietary deep research systems and establishing multi-agent synthetic trajectory training as an effective and scalable paradigm for enhancing LLM-based shopping assistance.
Chinese Translation
基于大型语言模型(LLM)的代理在电子商务对话购物中展现出良好的前景,但现有实现缺乏进行复杂产品研究所需的交互深度和上下文广度。同时,尽管深度研究范式在网络搜索中的信息合成方面有所进展,但在转移到电子商务时却存在领域差距。我们提出了ProductResearch,一个多智能体框架,用于合成高保真、长时间跨度的工具使用轨迹,以训练强大的电子商务购物代理。该框架采用用户代理(User Agent)从行为历史中推断细致的购物意图,并通过监督代理(Supervisor Agent)与研究代理(Research Agent)进行迭代协作,生成合成轨迹,最终形成全面且富有洞察力的产品研究报告。这些轨迹经过严格过滤和蒸馏,通过反思内化过程将多智能体监督交互整合为连贯的单角色训练示例,从而有效地对LLM代理进行微调,以应对复杂的购物查询。大量实验表明,在我们的合成数据上微调的紧凑型MoE模型在响应全面性、研究深度和用户感知效用方面相较于其基础模型取得了显著提升,接近前沿专有深度研究系统的性能,并确立了多智能体合成轨迹训练作为增强基于LLM的购物助手的有效且可扩展的范式。
cs.AI / 13 / 2602.23720
The Auton Agentic AI Framework
自主代理人工智能框架
Abstract
The field of Artificial Intelligence is undergoing a transition from Generative AI -- probabilistic generation of text and images -- to Agentic AI, in which autonomous systems execute actions within external environments on behalf of users. This transition exposes a fundamental architectural mismatch: Large Language Models (LLMs) produce stochastic, unstructured outputs, whereas the backend infrastructure they must control -- databases, APIs, cloud services -- requires deterministic, schema-conformant inputs. The present paper describes the Auton Agentic AI Framework, a principled architecture for standardizing the creation, execution, and governance of autonomous agent systems. The framework is organized around a strict separation between the Cognitive Blueprint, a declarative, language-agnostic specification of agent identity and capabilities, and the Runtime Engine, the platform-specific execution substrate that instantiates and runs the agent. This separation enables cross-language portability, formal auditability, and modular tool integration via the Model Context Protocol (MCP). The paper formalizes the agent execution model as an augmented Partially Observable Markov Decision Process (POMDP) with a latent reasoning space, introduces a hierarchical memory consolidation architecture inspired by biological episodic memory systems, defines a constraint manifold formalism for safety enforcement via policy projection rather than post-hoc filtering, presents a three-level self-evolution framework spanning in-context adaptation through reinforcement learning, and describes runtime optimizations -- including parallel graph execution, speculative inference, and dynamic context pruning -- that reduce end-to-end latency for multi-step agent workflows.
Chinese Translation
人工智能领域正经历从生成式人工智能(Generative AI)——即文本和图像的概率生成——向代理式人工智能(Agentic AI)的转变,在这一过程中,自主系统代表用户在外部环境中执行操作。这一转变暴露出一个根本的架构不匹配:大型语言模型(Large Language Models, LLMs)产生随机的、非结构化的输出,而它们必须控制的后端基础设施——数据库、API、云服务——则需要确定性、符合模式的输入。本文描述了自主代理人工智能框架(Auton Agentic AI Framework),这是一个用于标准化自主代理系统的创建、执行和治理的原则性架构。该框架严格区分了认知蓝图(Cognitive Blueprint)和运行时引擎(Runtime Engine),前者是对代理身份和能力的声明性、语言无关的规范,后者是实例化和运行代理的平台特定执行基础。这样的分离使得跨语言可移植性、正式审计性和模块化工具集成成为可能,后者通过模型上下文协议(Model Context Protocol, MCP)实现。本文将代理执行模型形式化为一个增强的部分可观察马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP),并引入了一个受生物情节记忆系统启发的层次化记忆整合架构,定义了通过策略投影而非事后过滤进行安全强制的约束流形形式,提出了一个涵盖通过强化学习进行上下文适应的三层自我演化框架,并描述了运行时优化——包括并行图执行、推测推理和动态上下文修剪——以减少多步骤代理工作流的端到端延迟。
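The augmented POMDP named in the abstract could be written as follows; the tuple layout and the variational form of the latent reasoning space are assumed notation for illustration, not the paper's exact formalization:

```latex
% Assumed notation: a standard POMDP tuple augmented with a latent
% reasoning space Z that mediates between the belief state and the action.
\mathcal{M} = \langle S, A, O, Z, T, \Omega, R, \gamma \rangle,
\qquad T(s' \mid s, a), \quad \Omega(o \mid s', a),
\qquad z \sim q_\theta(z \mid b_t), \quad a \sim \pi(a \mid b_t, z)
```

Here $b_t$ is the belief over $S$ after observation history up to $t$; the agent first samples a latent reasoning variable $z$ and then conditions its policy on both belief and $z$.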
cs.AI / 14 / 2602.23730
Unlocking Cognitive Capabilities and Analyzing the Perception-Logic Trade-off
解锁认知能力与分析感知-逻辑权衡
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) pursue omni-perception capabilities, yet integrating robust sensory grounding with complex reasoning remains a challenge, particularly for underrepresented regions. In this report, we introduce the research preview of MERaLiON2-Omni (Alpha), a 10B-parameter multilingual omni-perception model tailored for Southeast Asia (SEA). We present a progressive training pipeline that explicitly decouples and then integrates "System 1" (Perception) and "System 2" (Reasoning) capabilities. First, we establish a robust Perception Backbone by aligning region-specific audio-visual cues (e.g., Singlish code-switching, local cultural landmarks) with a multilingual LLM through orthogonal modality adaptation. Second, to inject cognitive capabilities without large-scale supervision, we propose a cost-effective Generate-Judge-Refine pipeline. By utilizing a Super-LLM to filter hallucinations and resolve conflicts via a consensus mechanism, we synthesize high-quality silver data that transfers textual Chain-of-Thought reasoning to multimodal scenarios. Comprehensive evaluation on our newly introduced SEA-Omni Benchmark Suite reveals an Efficiency-Stability Paradox: while reasoning acts as a non-linear amplifier for abstract tasks (boosting mathematical and instruction-following performance significantly), it introduces instability in low-level sensory processing. Specifically, we identify Temporal Drift in long-context audio, where extended reasoning desynchronizes the model from acoustic timestamps, and Visual Over-interpretation, where logic overrides pixel-level reality. This report details the architecture, the data-efficient training recipe, and a diagnostic analysis of the trade-offs between robust perception and structured reasoning.
Chinese Translation
近年来,多模态大型语言模型(MLLMs)的进展追求全感知能力,但将强大的感官基础与复杂推理相结合仍然是一个挑战,特别是在代表性不足的地区。在本报告中,我们介绍了MERaLiON2-Omni(Alpha)的研究预览,这是一个针对东南亚(SEA)量身定制的100亿参数多语言全感知模型。我们提出了一种渐进式训练流程,明确地解耦并整合“系统1”(感知)和“系统2”(推理)能力。首先,我们通过正交模态适应建立了一个强大的感知基础,利用区域特定的视听线索(例如,新加坡英语代码切换、当地文化地标)与多语言LLM对齐。其次,为了在没有大规模监督的情况下注入认知能力,我们提出了一种具有成本效益的生成-评估-精炼流程。通过利用超级大型语言模型(Super-LLM)过滤幻觉并通过共识机制解决冲突,我们合成了高质量的银数据,将文本链式思维推理转移到多模态场景中。对我们新引入的SEA-Omni基准套件的全面评估揭示了效率-稳定性悖论:尽管推理在抽象任务中充当非线性放大器(显著提升数学和遵循指令的表现),但它在低级感官处理上引入了不稳定性。具体而言,我们识别出在长上下文音频中的时间漂移,其中扩展的推理使模型与声学时间戳不同步,以及视觉过度解读,其中逻辑覆盖了像素级现实。本报告详细介绍了架构、数据高效的训练方案,以及对强大感知与结构化推理之间权衡的诊断分析。
cs.AI / 15 / 2602.23777
Reasoning-Driven Multimodal LLM for Domain Generalization
基于推理驱动的多模态大语言模型用于领域泛化
Abstract
This paper addresses the domain generalization (DG) problem in deep learning. While most DG methods focus on enforcing visual feature invariance, we leverage the reasoning capability of multimodal large language models (MLLMs) and explore the potential of constructing reasoning chains that derive image categories, to achieve more robust predictions under domain shift. To this end, we systematically study the role of reasoning in DG using DomainBed-Reasoning, a newly constructed extension of the DomainBed dataset in which each sample is paired with class-relevant reasoning chains. Our analysis reveals two key challenges: (i) fine-tuning MLLMs with reasoning chains for classification is more challenging than direct label supervision, since the model must optimize complex reasoning sequences before label prediction; and (ii) mismatches in reasoning patterns between supervision signals and fine-tuned MLLMs lead to a trade-off between semantic richness (informative but harder to optimize) and optimization efficiency (easier to optimize but less informative). To address these issues, we propose RD-MLDG (Reasoning-Driven Multimodal LLM for Domain Generalization), a framework with two components: (i) MTCT (Multi-Task Cross-Training), which introduces an additional direct classification pathway to guide reasoning supervision; and (ii) SARR (Self-Aligned Reasoning Regularization), which preserves the semantic richness of reasoning chains while mitigating reasoning-pattern mismatches via iterative self-labeling. Experiments on standard DomainBed datasets (PACS, VLCS, OfficeHome, TerraInc) demonstrate that RD-MLDG achieves state-of-the-art performance, highlighting reasoning as a promising complementary signal for robust out-of-domain generalization.
Chinese Translation
本文探讨了深度学习中的领域泛化(DG)问题。尽管大多数DG方法专注于强制视觉特征的不变性,我们利用多模态大语言模型(MLLMs)的推理能力,探索构建推理链以推导图像类别的潜力,从而在领域转移下实现更稳健的预测。为此,我们使用DomainBed-Reasoning这一新构建的DomainBed数据集扩展,系统研究推理在DG中的作用,其中每个样本都与类相关的推理链配对。我们的分析揭示了两个关键挑战:(i)使用推理链对MLLMs进行分类微调比直接标签监督更具挑战性,因为模型必须在标签预测之前优化复杂的推理序列;(ii)监督信号与微调后的MLLMs之间推理模式的不匹配导致语义丰富性(信息量大但优化更困难)与优化效率(优化更容易但信息量少)之间的权衡。为了解决这些问题,我们提出了RD-MLDG(基于推理驱动的多模态大语言模型用于领域泛化),该框架包含两个组成部分:(i)MTCT(多任务交叉训练),引入额外的直接分类路径以指导推理监督;(ii)SARR(自对齐推理正则化),在通过迭代自标记减轻推理模式不匹配的同时,保留推理链的语义丰富性。在标准DomainBed数据集(PACS、VLCS、OfficeHome、TerraInc)上的实验表明,RD-MLDG实现了最先进的性能,突出了推理作为稳健的域外泛化的有前景的补充信号。
cs.AI / 16 / 2602.23802
EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models
EMO-R3:用于多模态大型语言模型的情感推理的反思强化学习
Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual reasoning and understanding tasks but still struggle to capture the complexity and subjectivity of human emotions. Existing approaches based on supervised fine-tuning often suffer from limited generalization and poor interpretability, while reinforcement learning methods such as Group Relative Policy Optimization fail to align with the intrinsic characteristics of emotional cognition. To address these challenges, we propose Reflective Reinforcement Learning for Emotional Reasoning (EMO-R3), a framework designed to enhance the emotional reasoning ability of MLLMs. Specifically, we introduce Structured Emotional Thinking to guide the model to perform step-by-step emotional reasoning in a structured and interpretable manner, and design a Reflective Emotional Reward that enables the model to re-evaluate its reasoning based on visual-text consistency and emotional coherence. Extensive experiments demonstrate that EMO-R3 significantly improves both the interpretability and emotional intelligence of MLLMs, achieving superior performance across multiple visual emotional understanding benchmarks.
Chinese Translation
多模态大型语言模型(MLLMs)在视觉推理和理解任务中取得了显著进展,但仍然难以捕捉人类情感的复杂性和主观性。现有基于监督微调的方法往往面临有限的泛化能力和较差的可解释性,而诸如群体相对策略优化(Group Relative Policy Optimization)等强化学习方法未能与情感认知的内在特征相一致。为了解决这些挑战,我们提出了情感推理的反思强化学习(EMO-R3),这是一个旨在增强MLLMs情感推理能力的框架。具体而言,我们引入了结构化情感思维,以指导模型以结构化和可解释的方式逐步进行情感推理,并设计了反思情感奖励,使模型能够基于视觉-文本一致性和情感连贯性重新评估其推理。大量实验表明,EMO-R3显著提高了MLLMs的可解释性和情感智能,在多个视觉情感理解基准测试中取得了优越的表现。
cs.AI / 17 / 2602.23864
RUMAD: Reinforcement-Unifying Multi-Agent Debate
RUMAD:强化统一多智能体辩论
Abstract
Multi-agent debate (MAD) systems leverage collective intelligence to enhance reasoning capabilities, yet existing approaches struggle to simultaneously optimize accuracy, consensus formation, and computational efficiency. Static topology methods lack adaptability to variations in task complexity, while external LLM-based coordination risks introducing privileged knowledge that compromises debate neutrality. This work presents RUMAD (Reinforcement-Unifying Multi-Agent Debate), a novel framework that formulates dynamic communication topology control in MAD as a reinforcement learning (RL) problem. RUMAD employs a content-agnostic observation scheme that captures high-level debate dynamics while avoiding access to raw agent reasoning content. RUMAD uses a multi-objective reward to model solution quality, cohesion, and efficiency. A PPO-trained controller dynamically adjusts edge weights in the communication graph, while a dual-threshold mechanism enables fine-grained control over both agent activation and information visibility. Experimental evaluation across the MMLU, GSM8K, and GPQA benchmarks demonstrates that RUMAD achieves substantial efficiency gains, reducing token costs by over 80%, while still improving reasoning accuracy compared to a single LLM and multiple MAD baselines. Notably, RUMAD trained exclusively on MMLU exhibits robust zero-shot generalization to out-of-domain (OOD) tasks, indicating that the learned communication strategies capture task-independent principles of effective multi-agent coordination. These results establish RUMAD as an efficient and robust approach for deploying multi-agent reasoning applications under practical resource constraints.
Chinese Translation
多智能体辩论(MAD)系统利用集体智能来增强推理能力,但现有方法在同时优化准确性、共识形成和计算效率方面存在困难。静态拓扑方法缺乏对任务复杂性变化的适应性,而基于外部大型语言模型(LLM)的协调则存在引入特权知识的风险,从而影响辩论的中立性。本文提出了RUMAD(强化统一多智能体辩论),这是一个新颖的框架,将MAD中的动态通信拓扑控制形式化为强化学习(RL)问题。RUMAD采用了一种与内容无关的观察方案,捕捉高层次的辩论动态,避免访问原始智能体推理内容。RUMAD使用多目标奖励来建模解决方案的质量、凝聚力和效率。经过PPO训练的控制器动态调整通信图中的边权重,而双阈值机制则实现了对智能体激活和信息可见性的细粒度控制。在MMLU、GSM8K和GPQA基准上的实验评估表明,RUMAD实现了显著的效率提升,将令牌成本降低超过80%,同时相比于单一LLM模型和多个MAD基线,推理准确性也有所提高。值得注意的是,RUMAD在仅使用MMLU训练的情况下,展现出对域外(OOD)任务的强健零样本泛化能力,表明所学习的通信策略捕捉了有效多智能体协调的任务独立原则。这些结果确立了RUMAD作为在实际资源限制下部署多智能体推理应用的高效且稳健的方法。
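The dual-threshold mechanism can be sketched as follows. The edge-weight representation, the two threshold values, and the rule that an agent activates when any incident edge clears the lower threshold are assumptions for illustration, not the paper's exact formulation:

```python
def apply_dual_threshold(edge_weights, theta_act=0.3, theta_vis=0.6):
    """Hypothetical dual-threshold gating over a debate communication graph.
    An agent is active if any incident edge weight clears theta_act; a
    message on edge (src, dst) is visible only if its weight also clears
    theta_vis and both endpoints are active.
    `edge_weights` maps (src, dst) -> controller-assigned weight in [0, 1]."""
    active = {n for (i, j), w in edge_weights.items()
              if w >= theta_act for n in (i, j)}
    visible = {e for e, w in edge_weights.items()
               if w >= theta_vis and e[0] in active and e[1] in active}
    return active, visible
```

Separating the two thresholds lets the controller keep an agent "awake" (still producing answers) while hiding low-weight messages from it, which is the fine-grained control the abstract describes.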
cs.AI / 18 / 2602.23876
RF-Agent: Automated Reward Function Design via Language Agent Tree Search
RF-Agent:通过语言代理树搜索实现自动化奖励函数设计
Abstract
Designing efficient reward functions for low-level control tasks is a challenging problem. Recent research aims to reduce reliance on expert experience by using Large Language Models (LLMs) with task information to generate dense reward functions. These methods typically rely on training results as feedback, iteratively generating new reward functions with greedy or evolutionary algorithms. However, they suffer from poor utilization of historical feedback and inefficient search, resulting in limited improvements in complex control tasks. To address this challenge, we propose RF-Agent, a framework that treats LLMs as language agents and frames reward function design as a sequential decision-making process, enhancing optimization through better contextual reasoning. RF-Agent integrates Monte Carlo Tree Search (MCTS) to manage the reward design and optimization process, leveraging the multi-stage contextual reasoning ability of LLMs. This approach better utilizes historical information and improves search efficiency to identify promising reward functions. Outstanding experimental results in 17 diverse low-level control tasks demonstrate the effectiveness of our method. The source code is available at https://github.com/deng-ai-lab/RF-Agent.
Chinese Translation
为低级控制任务设计高效的奖励函数是一项具有挑战性的任务。近期研究旨在通过使用大型语言模型(Large Language Models, LLMs)结合任务信息来生成密集的奖励函数,从而减少对专家经验的依赖。这些方法通常依赖于训练结果作为反馈,采用贪婪或进化算法迭代生成新的奖励函数。然而,它们在历史反馈的利用和搜索效率方面存在不足,导致在复杂控制任务中的改进有限。为了解决这一挑战,我们提出了RF-Agent,一个将LLMs视为语言代理并将奖励函数设计框定为顺序决策过程的框架,通过更好的上下文推理增强优化。RF-Agent集成了蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)来管理奖励设计和优化过程,利用LLMs的多阶段上下文推理能力。该方法更好地利用历史信息,提高搜索效率,以识别有前景的奖励函数。在17个不同的低级控制任务中,出色的实验结果证明了我们方法的有效性。源代码可在 https://github.com/deng-ai-lab/RF-Agent 获取。
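MCTS over candidate reward functions relies on a standard selection rule such as UCT; a minimal sketch follows, where the per-node statistics layout and the exploration constant `c` are assumptions for illustration:

```python
import math

def uct_select(children, c=1.4):
    """UCT selection over candidate reward functions in the search tree.
    `children` maps candidate id -> (visits, total_value); pick the argmax
    of mean value plus an exploration bonus, expanding unvisited
    candidates first."""
    total_visits = sum(v for v, _ in children.values())
    def uct(node):
        visits, value = children[node]
        if visits == 0:
            return float("inf")  # always try an unvisited candidate first
        return value / visits + c * (math.log(total_visits) / visits) ** 0.5
    return max(children, key=uct)
```

In this setting each tree node would hold an LLM-generated reward function, with training outcomes backed up as node values, so the bonus term trades off revisiting strong candidates against exploring fresh ones.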
cs.AI / 19 / 2602.23974
Pessimistic Auxiliary Policy for Offline Reinforcement Learning
离线强化学习的悲观辅助策略
Abstract
Offline reinforcement learning aims to learn an agent from pre-collected datasets, avoiding unsafe and inefficient real-time interaction. However, inevitable access to out-of-distribution actions during the learning process introduces approximation errors, causing error accumulation and considerable overestimation. In this paper, we construct a new pessimistic auxiliary policy for sampling reliable actions. Specifically, we develop the pessimistic auxiliary strategy by maximizing the lower confidence bound of the Q-function. The pessimistic auxiliary strategy exhibits relatively high value and low uncertainty in the vicinity of the learned policy, preventing the learned policy from sampling high-value actions with potentially high errors during the learning process. The smaller approximation error introduced by actions sampled from the pessimistic auxiliary strategy alleviates error accumulation. Extensive experiments on offline reinforcement learning benchmarks reveal that utilizing the pessimistic auxiliary strategy can effectively improve the efficacy of other offline RL approaches.
Chinese Translation
离线强化学习旨在从预先收集的数据集中学习智能体,避免不安全和低效的实时交互。然而,在学习过程中不可避免地接触到分布外的动作,这会引入近似误差,导致误差积累和显著的高估。在本文中,我们构建了一种新的悲观辅助策略,以采样可靠的动作。具体而言,我们通过最大化 Q 函数的下置信界来开发悲观辅助策略。该悲观辅助策略在学习策略附近表现出相对较高的值和较低的不确定性,避免了学习策略在学习过程中采样具有潜在高误差的高值动作。通过悲观辅助策略采样的动作引入的较少近似误差有助于减轻误差积累。在离线强化学习基准上的大量实验表明,利用悲观辅助策略可以有效提高其他离线强化学习方法的效率。
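A common way to realize a lower confidence bound on the Q-function is an ensemble: score each action by the ensemble mean minus a multiple of the ensemble standard deviation. The ensemble-based LCB and the `kappa` coefficient are assumptions for illustration; the paper may construct its bound differently:

```python
def lcb_action(q_ensemble, state, candidate_actions, kappa=1.0):
    """Pessimistic action selection sketch: score each candidate action by a
    lower confidence bound over a Q-ensemble (mean - kappa * std) and pick
    the maximizer. `q_ensemble` is a list of Q(s, a) callables."""
    def lcb(action):
        vals = [q(state, action) for q in q_ensemble]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        return mean - kappa * var ** 0.5
    return max(candidate_actions, key=lcb)
```

Under this rule an action the ensemble disagrees on is penalized even if its mean value is high, which is exactly the behavior the abstract attributes to the auxiliary policy near the learned policy.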
cs.AI / 20 / 2602.24037
Portfolio Reinforcement Learning with Scenario-Context Rollout
情景上下文回放的投资组合强化学习
Abstract
Market regime shifts induce distribution shifts that can degrade the performance of portfolio rebalancing policies. We propose macro-conditioned scenario-context rollout (SCR) that generates plausible next-day multivariate return scenarios under stress events. However, doing so faces new challenges, as historical data never reveal what would have happened under different actions. As a result, incorporating scenario-based rewards from rollouts introduces a reward--transition mismatch in temporal-difference learning, destabilizing RL critic training. We analyze this inconsistency and show it leads to a mixed evaluation target. Guided by this analysis, we construct a counterfactual next state using the rollout-implied continuations and augment the critic agent's bootstrap target. Doing so stabilizes the learning and provides a viable bias-variance tradeoff. In out-of-sample evaluations across 31 distinct universes of U.S. equity and ETF portfolios, our method improves Sharpe ratio by up to 76% and reduces maximum drawdown by up to 53% compared with classic and RL-based portfolio rebalancing baselines.
Chinese Translation
市场状态的转变会引发分布变化,从而降低投资组合再平衡策略的表现。我们提出了一种宏观条件下的情景上下文回放(Scenario-Context Rollout, SCR)方法,该方法在压力事件下生成合理的次日多变量收益情景。然而,这样做面临新的挑战,因为历史永远无法告诉我们会发生什么不同。因此,从回放中引入基于情景的奖励会导致时间差分学习中的奖励-转移不匹配,从而使强化学习(Reinforcement Learning, RL)评估者的训练不稳定。我们分析了这种不一致性,并表明它导致了混合评估目标。在这一分析的指导下,我们利用回放所暗示的延续构建了一个反事实的下一个状态,并增强了评估者的自举目标。这样做稳定了学习,并提供了可行的偏差-方差权衡。在对31个不同的美国股票和ETF投资组合的样本外评估中,我们的方法使夏普比率提高了最多76%,并将最大回撤减少了最多53%,相较于经典和基于强化学习的投资组合再平衡基准。
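The augmented bootstrap target can be sketched as a one-step TD target in which both the reward and the next state come from the same simulated continuation. The function signature and the deterministic-policy form are assumptions for illustration:

```python
def scr_td_target(r_scenario, gamma, q_target, counterfactual_next_state, policy):
    """Sketch of the augmented critic target: pair the scenario-based reward
    with the counterfactual next state implied by the rollout, so reward and
    transition in the TD target come from the same simulated continuation
    rather than mixing a simulated reward with the logged next state."""
    a_next = policy(counterfactual_next_state)
    return r_scenario + gamma * q_target(counterfactual_next_state, a_next)
```

The point of the construction is consistency: using the logged next state with a scenario reward would produce the mixed evaluation target the abstract warns about.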
cs.AI / 21 / 2602.24055
CIRCLE: A Framework for Evaluating AI from a Real-World Lens
CIRCLE:从现实世界视角评估人工智能的框架
Abstract
This paper proposes CIRCLE, a six-stage, lifecycle-based framework to bridge the reality gap between model-centric performance metrics and AI's materialized outcomes in deployment. While existing frameworks like MLOps focus on system stability and benchmarks measure abstract capabilities, decision-makers outside the AI stack lack systematic evidence about the behavior of AI technologies under real-world user variability and constraints. CIRCLE operationalizes the Validation phase of TEVV (Test, Evaluation, Verification, and Validation) by formalizing the translation of stakeholder concerns outside the stack into measurable signals. Unlike participatory design, which often remains localized, or algorithmic audits, which are often retrospective, CIRCLE provides a structured, prospective protocol for linking context-sensitive qualitative insights to scalable quantitative metrics. By integrating methods such as field testing, red teaming, and longitudinal studies into a coordinated pipeline, CIRCLE produces systematic knowledge: evidence that is comparable across sites yet sensitive to local context. This can enable governance based on materialized downstream effects rather than theoretical capabilities.
Chinese Translation
本文提出了CIRCLE,一个基于生命周期的六阶段框架,旨在弥合模型中心性能指标与人工智能在实际应用中所产生结果之间的现实差距。现有的框架如MLOps关注系统稳定性,而基准测试则衡量抽象能力,然而,AI技术外部的决策者缺乏关于AI技术在现实用户变异性和约束下行为的系统性证据。CIRCLE通过将利益相关者在技术栈外的关注点转化为可测量信号,来实现TEVV(测试、评估、验证和确认)中的验证阶段。与通常局限于局部的参与式设计或往往是回顾性的算法审计不同,CIRCLE提供了一种结构化的前瞻性协议,将与上下文相关的定性洞察与可扩展的定量指标相连接。通过将现场测试、红队测试和纵向研究等方法整合到一个协调的流程中,CIRCLE产生了系统性的知识:这种证据在不同地点之间具有可比性,同时又对当地背景敏感。这可以使治理基于实际的下游效果,而非理论能力。
cs.AI / 22 / 2602.24080
Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction
人类还是机器?语音对语音交互的初步图灵测试
Abstract
The pursuit of human-like conversational agents has long been guided by the Turing test. For modern speech-to-speech (S2S) systems, a critical yet unanswered question is whether they can converse like humans. To tackle this, we conduct the first Turing test for S2S systems, collecting 2,968 human judgments on dialogues between 9 state-of-the-art S2S systems and 28 human participants. Our results deliver a clear finding: no existing evaluated S2S system passes the test, revealing a significant gap in human-likeness. To diagnose this failure, we develop a fine-grained taxonomy of 18 human-likeness dimensions and crowd-annotate our collected dialogues accordingly. Our analysis shows that the bottleneck is not semantic understanding but stems from paralinguistic features, emotional expressivity, and conversational persona. Furthermore, we find that off-the-shelf AI models perform unreliably as Turing test judges. In response, we propose an interpretable model that leverages the fine-grained human-likeness ratings and delivers accurate and transparent human-vs-machine discrimination, offering a powerful tool for automatic human-likeness evaluation. Our work establishes the first human-likeness evaluation for S2S systems and moves beyond binary outcomes to enable detailed diagnostic insights, paving the way for human-like improvements in conversational AI systems.
Chinese Translation
追求类人对话代理的努力长期以来受到图灵测试的指导。对于现代语音对语音(S2S)系统,一个关键但未解答的问题是它们是否能够像人类一样进行对话。为了解决这个问题,我们首次对S2S系统进行了图灵测试,收集了2968个关于9个最先进S2S系统与28名人类参与者之间对话的人类判断。我们的结果提供了一个明确的发现:没有现有的评估S2S系统通过测试,揭示了类人特征的显著差距。为了诊断这一失败,我们开发了一个包含18个类人特征维度的细粒度分类法,并相应地对收集的对话进行了众包标注。我们的分析表明,瓶颈并不在于语义理解,而是源于副语言特征、情感表现力和对话个性。此外,我们发现现成的人工智能模型在担任图灵测试评判时表现不可靠。对此,我们提出了一种可解释模型,利用细粒度的类人评分,提供准确且透明的人类与机器的区分,成为自动化类人评估的有力工具。我们的工作建立了S2S系统的首个类人评估,超越了二元结果,能够提供详细的诊断见解,为对话人工智能系统的类人改进铺平了道路。
cs.AI / 23 / 2602.24097
Bi-level RL-Heuristic Optimization for Real-world Winter Road Maintenance
双层强化学习启发式优化用于现实世界冬季道路维护
Abstract
Winter road maintenance is critical for ensuring public safety and reducing environmental impacts, yet existing methods struggle to manage large-scale routing problems effectively and mostly rely on human decisions. This study presents a novel, scalable bi-level optimization framework, validated on real operational data on UK strategic road networks (M25, M6, A1), including interconnected local road networks in surrounding areas for vehicle traversing, as part of the highway operator's efforts to solve existing planning challenges. At the upper level, a reinforcement learning (RL) agent strategically partitions the road network into manageable clusters and optimally allocates resources from multiple depots. At the lower level, a multi-objective vehicle routing problem (VRP) is solved within each cluster, minimizing the maximum vehicle travel time and total carbon emissions. Unlike existing approaches, our method handles large-scale, real-world networks efficiently, explicitly incorporating vehicle-specific constraints, depot capacities, and road segment requirements. Results demonstrate significant improvements, including balanced workloads, reduced maximum travel times below the targeted two-hour threshold, lower emissions, and substantial cost savings. This study illustrates how advanced AI-driven bi-level optimization can directly enhance operational decision-making in real-world transportation and logistics.
Chinese Translation
冬季道路维护对于确保公共安全和减少环境影响至关重要,但现有方法在有效管理大规模路线问题方面面临挑战,且大多依赖于人工决策。本研究提出了一种新颖的可扩展双层优化框架,基于英国战略公路网络(M25、M6、A1)上的实际运营数据进行验证,并考虑周边地区互联的地方道路网络,以支持车辆通行,作为公路运营商解决现有规划挑战的努力的一部分。在上层,强化学习(RL)代理战略性地将道路网络划分为可管理的集群,并从多个仓库中优化分配资源。在下层,针对每个集群解决多目标车辆路径问题(VRP),以最小化最大车辆行驶时间和总碳排放量。与现有方法不同,我们的方法高效处理大规模现实世界网络,明确纳入了特定车辆约束、仓库容量和道路段要求。结果显示出显著改善,包括工作负载平衡、最大行驶时间降低至目标两小时阈值以下、排放减少以及显著的成本节约。本研究展示了先进的基于人工智能的双层优化如何直接提升现实世界交通和物流中的运营决策。
cs.AI / 24 / 2602.24100
Artificial Agency Program: Curiosity, compression, and communication in agents
人工智能代理程序:代理中的好奇心、压缩与沟通
Abstract
This paper presents the Artificial Agency Program (AAP), a position and research agenda for building AI systems as reality-embedded, resource-bounded agents whose development is driven by curiosity-as-learning-progress under physical and computational constraints. The central thesis is that AI is most useful when treated as part of an extended human--tool system that increases sensing, understanding, and actuation capability while reducing friction at the interface between people, tools, and environments. The agenda unifies predictive compression, intrinsic motivation, empowerment and control, interface quality (unification), and language/self-communication as selective information bottlenecks. We formulate these ideas as a falsifiable program with explicit costs, staged experiments, and a concrete multimodal tokenized testbed in which an agent allocates limited budget among observation, action, and deliberation. The aim is to provide a conceptual and experimental framework that connects intrinsic motivation, information theory, thermodynamics, bounded rationality, and modern reasoning systems.
Chinese Translation
本文提出了人工智能代理程序(Artificial Agency Program, AAP),这是一个构建AI系统的立场和研究议程,旨在将AI系统视为嵌入现实、资源受限的代理,其发展受到在物理和计算约束下的好奇心驱动的学习进展的推动。中心论点是,当AI被视为一个扩展的人类-工具系统的一部分时,它最为有用,该系统能够提高感知、理解和执行能力,同时减少人、工具与环境之间的摩擦。该议程统一了预测压缩、内在动机、赋权与控制、接口质量(统一性)以及语言/自我沟通作为选择性信息瓶颈。我们将这些思想构建为一个可证伪的程序,具有明确的成本、分阶段的实验以及一个具体的多模态标记测试平台,其中代理在观察、行动和深思熟虑之间分配有限的预算。目标是提供一个概念性和实验性的框架,连接内在动机、信息理论、热力学、有界理性和现代推理系统。
cs.AI / 25 / 2602.24110
Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance
回收失败:通过细粒度的离策略指导挽救RLVR中的探索
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large Reasoning Models. However, standard outcome-based supervision suffers from a critical limitation: it penalizes trajectories that are largely correct but fail due to a few missteps as heavily as completely erroneous ones. This coarse feedback signal causes the model to discard valuable, largely correct rollouts, degrading rollout diversity and prematurely narrowing the exploration space. Although Process Reward Models have demonstrated efficacy in providing reliable step-wise verification for test-time scaling, naively integrating these signals into RLVR as dense rewards proves ineffective. Prior methods attempt to introduce off-policy guided whole-trajectory replacement that often falls outside the policy model's distribution, and they still fail to utilize the largely correct rollouts generated by the model itself, thus not effectively mitigating the narrowing of the exploration space. To address these issues, we propose SCOPE (Step-wise Correction for On-Policy Exploration), a novel framework that utilizes Process Reward Models to pinpoint the first erroneous step in suboptimal rollouts and applies fine-grained, step-wise off-policy rectification. By applying precise refinement to partially correct rollouts, our method effectively salvages partially correct trajectories and increases the diversity score by 13.5%, thereby sustaining a broad exploration space. Extensive experiments demonstrate that our approach establishes new state-of-the-art results, achieving an average accuracy of 46.6% on math reasoning and exhibiting robust generalization with 53.4% accuracy on out-of-distribution reasoning tasks.
Chinese Translation
可验证奖励的强化学习(RLVR)已成为增强大型推理模型复杂推理能力的强大范式。然而,基于结果的标准监督存在一个关键限制,即对那些大部分正确但因若干失误而失败的轨迹的惩罚,与对完全错误轨迹的惩罚一样严重。这种粗糙的反馈信号导致模型丢弃有价值的、在很大程度上正确的采样轨迹(rollout),从而导致轨迹多样性下降,过早地缩小了探索空间。过程奖励模型在为测试时扩展提供可靠的逐步验证方面已显示出有效性,但将这些信号简单地作为密集奖励整合到RLVR中却被证明是无效的。先前的方法尝试引入离策略指导的全轨迹替换,这些替换通常超出了策略模型的分布,且仍未能有效利用模型自身生成的在很大程度上正确的轨迹,因此未能有效缓解探索空间的缩小。为了解决这些问题,我们提出了SCOPE(在策略探索的逐步纠正),这是一个新颖的框架,利用过程奖励模型定位次优轨迹中的第一个错误步骤,并应用细粒度的逐步离策略修正。通过对部分正确的轨迹进行精确细化,我们的方法有效挽救了部分正确的轨迹,并将多样性评分提高了13.5%,从而维持了广泛的探索空间。大量实验表明,我们的方法建立了新的最先进结果,在数学推理上实现了46.6%的平均准确率,并在分布外推理任务上展现出53.4%的鲁棒泛化能力。
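The core splice operation of step-wise rectification can be sketched as follows. The score threshold, the list-of-steps representation, and the `off_policy_fix` helper (which stands in for the off-policy continuation generator) are assumptions for illustration:

```python
def scope_rectify(steps, prm_scores, off_policy_fix, threshold=0.5):
    """Sketch of step-wise correction: keep the prefix up to the first step
    whose process-reward score drops below `threshold`, then splice in an
    off-policy continuation from that point. `off_policy_fix` is an assumed
    helper mapping a correct prefix to replacement steps."""
    for i, score in enumerate(prm_scores):
        if score < threshold:
            return steps[:i] + off_policy_fix(steps[:i])
    return steps  # no erroneous step found; keep the rollout as-is
```

Because only the suffix after the first detected error is replaced, the salvaged trajectory stays close to the policy's own distribution, in contrast to whole-trajectory replacement.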
cs.AI / 26 / 2602.24173
LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics
LemmaBench:一个实时的研究级基准,用于评估大型语言模型在数学中的能力
Abstract
We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark evaluating models directly on the latest research results in mathematics. This consists of an automatic pipeline that extracts lemmas from arXiv and rewrites them into self-contained statements by making all assumptions and required definitions explicit. It results in a benchmark that can be updated regularly with new problems taken directly from human mathematical research, while previous instances can be used for training without compromising future evaluations. We benchmark current state-of-the-art LLMs, which obtain around 10-15$\%$ accuracy in theorem proving (pass@1) depending on the model, showing that there is currently a large margin of progression for LLMs to reach human-level proving capabilities in a research context.
Chinese Translation
我们提出了一种新的方法,用于基准测试大型语言模型(LLM)在研究级数学中的能力。现有的基准主要依赖于静态的、人工策划的竞赛或教科书风格的问题集,作为数学研究的代理。相反,我们建立了一个可更新的基准,直接评估模型在数学最新研究成果上的表现。这包括一个自动化流程,从 arXiv 中提取引理,并通过明确所有假设和所需定义,将其重写为自包含的陈述。最终形成的基准可以定期更新,直接采用来自人类数学研究的新问题,而之前的实例可以用于训练,而不影响未来的评估。我们对当前最先进的 LLM 进行了基准测试,结果显示,根据模型的不同,其在定理证明(pass@1)中的准确率约为 10-15%,这表明 LLM 在研究背景下达到人类水平的证明能力仍有很大的进步空间。
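The pass@1 figures above are conventionally obtained with the standard unbiased pass@k estimator; the sketch below shows that formula generically and is not necessarily LemmaBench's exact evaluation harness. With n samples per problem, c of them correct, pass@k = 1 - C(n-c, k) / C(n, k):

```python
# Generic sketch of the unbiased pass@k estimator (an assumption about the
# harness, not taken from the paper): the probability that at least one of
# k drawn samples is correct, given c correct out of n.
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0  # fewer incorrect samples than draws: some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to the fraction of correct samples, e.g. 1/8.
p = pass_at_k(n=8, c=1, k=1)
```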
cs.AI / 27 / 2602.24180
Learning Flexible Job Shop Scheduling under Limited Buffers and Material Kitting Constraints
在有限缓冲区和物料配套约束下的灵活作业车间调度学习
Abstract
The Flexible Job Shop Scheduling Problem (FJSP) originates from real production lines, yet some practical constraints are often ignored or idealized in current FJSP studies, among which the limited buffer problem has a particular impact on production efficiency. To this end, we study an extended problem that is closer to practical scenarios--the Flexible Job Shop Scheduling Problem with Limited Buffers and Material Kitting. In recent years, deep reinforcement learning (DRL) has demonstrated considerable potential in scheduling tasks. However, its capacity for state modeling remains limited when handling complex dependencies and long-term constraints. To address this, we leverage a heterogeneous graph network within the DRL framework to model the global state. By constructing efficient message passing among machines, operations, and buffers, the network focuses on avoiding decisions that may cause frequent pallet changes during long-sequence scheduling, thereby helping improve buffer utilization and overall decision quality. Experimental results on both synthetic and real production line datasets show that the proposed method outperforms traditional heuristics and advanced DRL methods in terms of makespan and pallet changes, and also achieves a good balance between solution quality and computational cost. Furthermore, a supplementary video is provided to showcase a simulation system that effectively visualizes the progression of the production line.
Chinese Translation
灵活作业车间调度问题(FJSP)源于真实的生产线,而当前FJSP研究中常常忽略或理想化一些实际约束,其中有限缓冲区问题对生产效率有着特别的影响。为此,我们研究了一个更接近实际场景的扩展问题——有限缓冲区和物料配套的灵活作业车间调度问题。近年来,深度强化学习(DRL)在调度任务中展现了相当大的潜力。然而,在处理复杂依赖关系和长期约束时,其状态建模能力仍然有限。为了解决这个问题,我们在DRL框架内利用异构图网络来建模全局状态。通过在机器、操作和缓冲区之间构建高效的信息传递,网络专注于避免在长序列调度中可能导致频繁托盘更换的决策,从而帮助提高缓冲区利用率和整体决策质量。在合成和真实生产线数据集上的实验结果表明,所提出的方法在完工时间和托盘更换方面优于传统启发式方法和先进的DRL方法,并且在解决方案质量和计算成本之间达成了良好的平衡。此外,提供了一段补充视频,以展示一个有效可视化生产线进展的仿真系统。
cs.AI / 28 / 2602.24195
Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume
具有不一致性调整语义体积的多模态大型语言模型的不确定性量化
Abstract
Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurate uncertainty metrics could enable escalation of unreliable queries to human experts or larger models for improved performance. However, existing uncertainty metrics have practical constraints, such as being designed only for specific modalities, reliant on external tools, or computationally expensive. We introduce UMPIRE, a training-free uncertainty quantification framework for MLLMs that works efficiently across various input and output modalities without external tools, relying only on the models' own internal modality features. UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses for a given task instance, effectively capturing both the global semantic diversity of samples and the local incoherence of responses based on internal model confidence. We propose uncertainty desiderata for MLLMs and provide theoretical analysis motivating UMPIRE's design. Extensive experiments show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks, including adversarial and out-of-distribution settings. We also demonstrate UMPIRE's generalization to non-text output tasks, including image and audio generation.
Chinese Translation
尽管多模态大型语言模型(MLLMs)具备强大的能力,但它们可能会产生看似合理但实际上错误的输出,从而阻碍其可靠部署。准确的不确定性度量可以使不可靠的查询升级到人类专家或更大型模型,以提高性能。然而,现有的不确定性度量存在实际限制,例如仅针对特定模态设计、依赖外部工具或计算成本高昂。我们提出了UMPIRE,这是一个无训练的不确定性量化框架,适用于多模态大型语言模型,能够高效地处理各种输入和输出模态,而无需外部工具,仅依赖模型自身的内部模态特征。UMPIRE计算给定任务实例的采样MLLM响应的不一致性调整语义体积,有效捕捉样本的全局语义多样性和基于内部模型置信度的响应局部不一致性。我们为MLLMs提出了不确定性需求,并提供了理论分析以支持UMPIRE的设计。大量实验表明,UMPIRE在图像、音频和视频文本基准测试中,在错误检测和不确定性校准方面始终优于基线度量,包括对抗性和分布外设置。我们还展示了UMPIRE在非文本输出任务(包括图像和音频生成)中的泛化能力。
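A "semantic volume" style dispersion measure can be illustrated with a toy computation. This is only a sketch of the general idea; UMPIRE's exact formulation, including the incoherence adjustment, follows the paper, and the embeddings below are invented. The volume spanned by response embeddings can be read off the Gram determinant: near-duplicate responses give a volume near zero, while semantically diverse responses give a larger volume, i.e. higher uncertainty:

```python
# Illustrative sketch only (not UMPIRE's formulation): Gram determinant of
# response embeddings as a volume-style proxy for semantic diversity.

def gram_det(vectors):
    """det(G) where G[i][j] = <v_i, v_j>, for exactly two vectors here."""
    g = [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]
    return g[0][0] * g[1][1] - g[0][1] * g[1][0]

similar = [(1.0, 0.0), (0.99, 0.01)]  # nearly identical answers -> volume ~ 0
diverse = [(1.0, 0.0), (0.0, 1.0)]    # semantically distinct answers -> volume 1
```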
cs.AI / 29 / 2602.24273
A Minimal Agent for Automated Theorem Proving
用于自动定理证明的最小代理
Abstract
We propose a minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures. This design implements the core features shared among state-of-the-art systems: iterative proof refinement, library search and context management. We evaluate our baseline on qualitatively different benchmarks, compare various popular models and design choices, and demonstrate performance competitive with state-of-the-art approaches, while using a significantly simpler architecture. Our results demonstrate consistent advantages of an iterative approach over multiple single-shot generations, especially in terms of sample efficiency and cost effectiveness. The implementation is released open-source as a candidate reference for future research and as an accessible prover for the community.
Chinese Translation
我们提出了一种最小代理基线,能够在不同的基于人工智能的定理证明器架构之间进行系统比较。该设计实现了当前最先进系统共享的核心特性:迭代证明精炼、库搜索和上下文管理。我们使用性质各异的基准对我们的基线进行评估,并比较各种流行模型和设计选择,结果表明其在性能上与当前最先进的方法具有竞争力,同时使用了显著更简单的架构。我们的结果展示了迭代方法相较于多次单次生成的一致优势,尤其是在样本效率和成本效益方面。该实现已作为开源项目发布,作为未来研究的参考候选和社区可访问的证明器。
cs.AI / 30 / 2602.24288
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
DARE-bench:评估大型语言模型在数据科学中的建模和指令遵循的准确性
Abstract
The fast-growing demand for using Large Language Models (LLMs) to tackle complex multi-step data science tasks creates an urgent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human- or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tools, DARE-bench consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show that even highly capable models such as gpt-o4-mini struggle to achieve good performance, especially in machine learning modeling tasks. Using DARE-bench training tasks for fine-tuning can substantially improve model performance. For example, supervised fine-tuning boosts Qwen3-32B's accuracy by 1.83x and reinforcement learning boosts Qwen3-4B's accuracy by more than 8x. These significant improvements verify the importance of DARE-bench both as an accurate evaluation benchmark and critical training data.
Chinese Translation
对大型语言模型(LLMs)在复杂多步骤数据科学任务中的快速增长需求,催生了对准确基准测试的迫切需求。现有基准测试存在两个主要缺口:(i)缺乏标准化的、过程感知的评估,无法有效捕捉指令遵循和过程保真度;(ii)准确标注的训练数据稀缺。为了解决这些问题,我们提出了DARE-bench,这是一个专为机器学习建模和数据科学指令遵循设计的基准测试。与许多依赖于人工或模型评判的现有基准不同,DARE-bench中的所有任务都具有可验证的真实答案,确保了客观和可重复的评估。为了覆盖广泛的任务并支持自主工具,DARE-bench包含6300个源自Kaggle的任务,并提供大规模的训练数据和评估集。广泛的评估显示,即使是像gpt-o4-mini这样能力强大的模型,在机器学习建模任务中也难以取得良好的表现。使用DARE-bench训练任务进行微调可以显著提高模型性能。例如,监督微调使Qwen3-32B的准确率提高了1.83倍,而强化学习使Qwen3-4B的准确率提高了超过8倍。这些显著的改进验证了DARE-bench作为准确评估基准和关键训练数据的重要性。
cs.CL / 1 / 2602.23370
Toward General Semantic Chunking: A Discriminative Framework for Ultra-Long Documents
朝向通用语义块划分:一种用于超长文档的判别框架
Abstract
Long-document topic segmentation plays an important role in information retrieval and document understanding, yet existing methods still show clear shortcomings in ultra-long text settings. Traditional discriminative models are constrained by fixed windows and cannot model document-level semantics; generative large language models can output paragraph boundaries, but inference is expensive and long inputs are difficult to support. To address these issues, we propose a discriminative segmentation model based on Qwen3-0.6B. On top of the backbone network, we add a cross-window context fusion layer and a boundary classification head, and combine them with an overlapping sliding-window strategy. Our model supports single-pass inputs of up to 13k tokens and can be extended to ultra-long documents for paragraph boundary detection. To further enhance downstream retrieval efficiency, we derive a vector fusion method with scalar correction, which compresses the representation of ultra-long segments into a single vector without semantic loss. Experiments on the Wikipedia long-document topic segmentation dataset WIKI-727K show that, compared with three generative models based on Qwen2-0.5B released by Jina, our method achieves a better macro-averaged F1 and delivers two orders of magnitude faster inference, substantially improving the practicality and scalability of long-document processing.
Chinese Translation
长文档主题分割在信息检索和文档理解中扮演着重要角色,但现有方法在超长文本环境下仍然存在明显不足。传统的判别模型受限于固定窗口,无法建模文档级语义;生成式大型语言模型可以输出段落边界,但推理成本高且难以支持长输入。为了解决这些问题,我们提出了一种基于 Qwen3-0.6B 的判别分割模型。在主干网络的基础上,我们添加了一个跨窗口上下文融合层和一个边界分类头,并将其与重叠滑动窗口策略相结合。我们的模型支持单次输入最多 13k 个标记,并可以扩展到超长文档以进行段落边界检测。为了进一步提高下游检索效率,我们推导了一种带标量修正的向量融合方法,该方法在不损失语义的情况下将超长片段的表示压缩为单个向量。在维基百科长文档主题分割数据集 WIKI-727K 上的实验表明,与 Jina 发布的基于 Qwen2-0.5B 的三种生成模型相比,我们的方法实现了更好的宏平均 F1,并提供了两个数量级更快的推理速度,显著提高了长文档处理的实用性和可扩展性。
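The overlapping sliding-window strategy mentioned above can be sketched concretely. The window size and overlap below are illustrative toy values, not the paper's 13k-token configuration: each window shares a margin with its neighbor, so boundary predictions near window edges can be reconciled across passes:

```python
# Minimal sketch of overlapping sliding windows over a token sequence
# (window/overlap values are assumptions chosen for illustration).

def sliding_windows(tokens, window=8, overlap=2):
    """Return (start, chunk) pairs covering `tokens` with a fixed overlap."""
    stride = window - overlap
    out = []
    for start in range(0, max(len(tokens) - overlap, 1), stride):
        out.append((start, tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # last window already reaches the end of the sequence
    return out

# 20 tokens, window 8, overlap 2 -> windows starting at 0, 6, 12.
windows = sliding_windows(list(range(20)), window=8, overlap=2)
```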
cs.CL / 2 / 2602.23388
Task-Lens: Cross-Task Utility Based Speech Dataset Profiling for Low-Resource Indian Languages
任务透镜:基于跨任务效用的低资源印度语言语音数据集分析
Abstract
The rising demand for inclusive speech technologies amplifies the need for multilingual datasets for Natural Language Processing (NLP) research. However, limited awareness of existing task-specific resources in low-resource languages hinders research. This challenge is especially acute in linguistically diverse countries, such as India. Cross-task profiling of existing Indian speech datasets can alleviate the data scarcity challenge. This involves investigating the utility of datasets across multiple downstream tasks rather than focusing on a single task. Prior surveys typically catalogue datasets for a single task, leaving comprehensive cross-task profiling as an open opportunity. Therefore, we propose Task-Lens, a cross-task survey that assesses the readiness of 50 Indian speech datasets spanning 26 languages for nine downstream speech tasks. First, we analyze which datasets contain metadata and properties suitable for specific tasks. Next, we propose task-aligned enhancements to unlock datasets to their full downstream potential. Finally, we identify tasks and Indian languages that are critically underserved by current resources. Our findings reveal that many Indian speech datasets contain untapped metadata that can support multiple downstream tasks. By uncovering cross-task linkages and gaps, Task-Lens enables researchers to explore the broader applicability of existing datasets and to prioritize dataset creation for underserved tasks and languages.
Chinese Translation
对包容性语音技术日益增长的需求加大了对多语言数据集的需求,以支持自然语言处理(NLP)研究。然而,对低资源语言中现有特定任务资源的有限认知阻碍了研究的进展。这一挑战在语言多样性丰富的国家,尤其是印度,显得尤为突出。对现有印度语音数据集进行跨任务分析可以缓解数据稀缺的问题。这涉及到研究数据集在多个下游任务中的效用,而不是仅仅关注单一任务。以往的调查通常仅对单一任务的数据集进行分类,留下了全面的跨任务分析作为一个未被充分利用的机会。因此,我们提出了任务透镜(Task-Lens),这是一个跨任务调查,评估涵盖26种语言的50个印度语音数据集在九个下游语音任务中的准备情况。首先,我们分析哪些数据集包含适合特定任务的元数据和属性。接下来,我们提出与任务对齐的增强措施,以充分挖掘数据集的下游潜力。最后,我们识别出当前资源严重不足的任务和印度语言。我们的研究结果表明,许多印度语音数据集包含未被充分利用的元数据,这些元数据可以支持多个下游任务。通过揭示跨任务的联系和差距,任务透镜使研究人员能够探索现有数据集的更广泛适用性,并优先考虑为服务不足的任务和语言创建数据集。
cs.CL / 3 / 2602.23440
Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning
带有过程奖励的截断步级采样用于检索增强推理
Abstract
Training large language models to reason with search engines via reinforcement learning is hindered by a fundamental credit assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods like StepSearch alleviate this by introducing step-level supervision, but rely on heuristic rewards such as TF-IDF overlap with gold documents, and still sample k complete trajectories per example, retaining high gradient variance. We propose SLATE, a framework built on two complementary ideas: (1) truncated step-level sampling, which generates k trajectories that share a common prefix and differ only at the next step, and (2) dense LLM-as-judge rewards, which replace heuristic scoring with a capable LLM evaluator that assesses the quality of each reasoning step, search query, and answer, providing richer and more reliable supervision. We theoretically prove that under the same dense reward structure, truncated sampling reduces the variance of advantage estimates by up to a factor of T compared to full-trajectory sampling for T-step trajectories, yielding lower-variance, better-targeted policy gradients. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines, with the largest gains on harder multi-hop tasks and smaller models.
Chinese Translation
通过强化学习训练大型语言模型与搜索引擎进行推理面临一个基本的信用分配问题:现有方法如 Search-R1 仅在整个多步轨迹结束后提供稀疏的结果奖励,这使得将成功或失败归因于个别推理和检索决策变得不可行。过程奖励方法如 StepSearch 通过引入步级监督来缓解这一问题,但依赖于诸如与黄金文档的 TF-IDF 重叠等启发式奖励,并且仍然对每个示例采样 k 条完整轨迹,保持高梯度方差。我们提出了 SLATE,一个基于两个互补思想的框架:(1)截断步级采样,生成 k 条共享公共前缀且仅在下一步不同的轨迹;(2)密集的 LLM 作为评估者的奖励,用一个能够评估每个推理步骤、搜索查询和答案质量的 LLM 替代启发式评分,从而提供更丰富和更可靠的监督。我们理论上证明,在相同的密集奖励结构下,截断采样相比于完整轨迹采样可以将优势估计的方差降低最多一个 T 的因子,适用于 T 步轨迹,从而产生低方差和更具针对性的策略梯度。在七个问答基准上的实验确认,SLATE 始终优于稀疏奖励和过程奖励基线,尤其在更难的多跳任务和较小模型上获得了最大的提升。
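The factor-of-T variance claim above has a simple intuition that a toy Monte Carlo can illustrate. This is not SLATE's training code; it only models each step's reward as i.i.d. unit-variance noise (an assumption for illustration). When k rollouts share a common (T-1)-step prefix and branch only at the final step, their cross-rollout spread comes from one step's noise instead of T steps' noise:

```python
# Toy Monte Carlo consistent with the factor-T variance claim (illustrative
# noise model, not the paper's setup): compare cross-rollout variance under
# full-trajectory sampling vs. truncated step-level sampling.
import random
import statistics

random.seed(0)
T, k, trials = 10, 4, 2000

def noise():
    return random.gauss(0, 1)

full_vars, trunc_vars = [], []
for _ in range(trials):
    # Full-trajectory sampling: every step of every rollout is resampled.
    full = [sum(noise() for _ in range(T)) for _ in range(k)]
    # Truncated sampling: rollouts share a (T-1)-step prefix and differ
    # only at the final step.
    prefix = sum(noise() for _ in range(T - 1))
    trunc = [prefix + noise() for _ in range(k)]
    full_vars.append(statistics.pvariance(full))
    trunc_vars.append(statistics.pvariance(trunc))

# Ratio of cross-rollout variances; roughly T under this noise model.
ratio = statistics.mean(full_vars) / statistics.mean(trunc_vars)
```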
cs.CL / 4 / 2602.23452
CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era
CiteAudit:你引用了它,但你读过它吗?在大语言模型时代验证科学引用的基准
Abstract
Scientific research relies on accurate citation for attribution and integrity, yet large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. Such hallucinated citations have already been observed in submissions and accepted papers at major machine learning venues, exposing vulnerabilities in peer review. Meanwhile, rapidly growing reference lists make manual verification impractical, and existing automated tools remain fragile to noisy and heterogeneous citation formats and lack standardized evaluation. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our multi-agent verification pipeline decomposes citation checking into claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment to assess whether a cited source truly supports its claim. We construct a large-scale human-validated dataset across domains and define unified metrics for citation faithfulness and evidence alignment. Experiments with state-of-the-art LLMs reveal substantial citation errors and show that our framework significantly outperforms prior methods in both accuracy and interpretability. This work provides the first scalable infrastructure for auditing citations in the LLM era and practical tools to improve the trustworthiness of scientific references.
Chinese Translation
科学研究依赖于准确的引用以确保归属和完整性,但大型语言模型(LLMs)引入了一种新的风险:虚构的引用看似合理但并不存在于真实出版物中。这种幻觉引用已经在主要机器学习会议的提交和接受论文中被观察到,暴露了同行评审的脆弱性。同时,快速增长的参考文献列表使得手动验证变得不切实际,而现有的自动化工具在嘈杂和异构的引用格式下仍然脆弱,并且缺乏标准化评估。我们提出了第一个全面的基准和检测框架,用于识别科学写作中的幻觉引用。我们的多代理验证流程将引用检查分解为声明提取、证据检索、段落匹配、推理和校准判断,以评估被引用的来源是否真正支持其声明。我们构建了一个跨领域的大规模人工验证数据集,并定义了统一的引用真实性和证据对齐的评估指标。与最先进的LLMs的实验揭示了显著的引用错误,并表明我们的框架在准确性和可解释性方面显著优于先前的方法。这项工作提供了在LLM时代审计引用的第一个可扩展基础设施和实用工具,以提高科学引用的可信度。
cs.CL / 5 / 2602.23479
FHIRPath-QA: Executable Question Answering over FHIR Electronic Health Records
FHIRPath-QA:基于FHIR电子健康记录的可执行问答系统
Abstract
Though patients are increasingly granted digital access to their electronic health records (EHRs), existing interfaces may not support precise, trustworthy answers to patient-specific questions. Large language models (LLMs) show promise in clinical question answering (QA), but retrieval-based approaches are computationally inefficient, prone to hallucination, and difficult to deploy over real-life EHRs. In this work, we introduce FHIRPath-QA, the first open dataset and benchmark for patient-specific QA that includes open-standard FHIRPath queries over real-world clinical data. We propose a text-to-FHIRPath QA paradigm that shifts reasoning from free-text generation to FHIRPath query synthesis, significantly reducing LLM usage. Built on MIMIC-IV on FHIR Demo, the dataset pairs over 14k natural language questions in patient and clinician phrasing with validated FHIRPath queries and answers. Further, we demonstrate that state-of-the-art LLMs struggle to deal with ambiguity in patient language and perform poorly in FHIRPath query synthesis. However, they benefit strongly from supervised fine-tuning. Our results highlight that text-to-FHIRPath synthesis has the potential to serve as a practical foundation for safe, efficient, and interoperable consumer health applications, and our dataset and benchmark serve as a starting point for future research on the topic. The full dataset and generation code is available at: https://github.com/mooshifrew/fhirpath-qa.
Chinese Translation
尽管患者越来越多地获得对其电子健康记录(EHR)的数字访问,但现有接口可能无法为患者特定问题提供精确、可靠的答案。大型语言模型(LLM)在临床问答(QA)中显示出潜力,但基于检索的方法计算效率低下,容易产生幻觉,并且在现实世界的EHR中难以部署。在本研究中,我们介绍了FHIRPath-QA,这是首个针对患者特定问答的开放数据集和基准,包含针对真实临床数据的开放标准FHIRPath查询。我们提出了一种文本到FHIRPath的问答范式,将推理从自由文本生成转移到FHIRPath查询合成,显著减少了LLM的使用。该数据集基于MIMIC-IV on FHIR Demo,配对了超过14,000个以患者和临床医生措辞的自然语言问题与经过验证的FHIRPath查询和答案。此外,我们展示了最先进的LLM在处理患者语言的模糊性方面存在困难,并且在FHIRPath查询合成中表现不佳。然而,它们在监督微调中受益显著。我们的结果强调,文本到FHIRPath的合成有潜力作为安全、高效和可互操作的消费者健康应用的实用基础,而我们的数据集和基准为未来相关研究提供了起点。完整数据集和生成代码可在以下链接获取:https://github.com/mooshifrew/fhirpath-qa。
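The text-to-FHIRPath paradigm can be made concrete with a toy example. The question/query pair and the evaluator below are hypothetical illustrations, not taken from the dataset: the model emits a path expression instead of free text, and a deterministic engine evaluates it against the record. Real FHIRPath (`where()`, `first()`, etc.) is far richer than the dotted-path subset handled here:

```python
# Illustrative sketch of text-to-FHIRPath QA (hypothetical example; the toy
# evaluator supports only simple dotted paths, unlike real FHIRPath).

def eval_path(resource, path):
    """Evaluate a dotted path like 'Patient.birthDate' against a dict."""
    node = resource
    for segment in path.split(".")[1:]:  # first segment names the resource type
        node = node[segment]
    return node

record = {"resourceType": "Patient", "birthDate": "1954-03-01"}
question = "When was I born?"
query = "Patient.birthDate"  # what a text-to-FHIRPath model would synthesize
answer = eval_path(record, query)
```

Because the answer is produced by executing the query rather than by free-text generation, it is grounded in the record by construction, which is the hallucination-reduction argument made above.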
cs.CL / 6 / 2602.23481
IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation
IDP 加速器:从提取到合规验证的自主文档智能
Abstract
Understanding and extracting structured insights from unstructured documents remains a foundational challenge in industrial NLP. While Large Language Models (LLMs) enable zero-shot extraction, traditional pipelines often fail to handle multi-document packets, complex reasoning, and strict compliance requirements. We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging to segment complex document packets; (2) configurable Extraction Module leveraging multimodal LLMs to transform unstructured content into structured data; (3) Agentic Analytics Module, compliant with the Model Context Protocol (MCP) providing data access through secure, sandboxed code execution; and (4) Rule Validation Module replacing deterministic engines with LLM-driven logic for complex compliance checks. The interactive demonstration enables users to upload document packets, visualize classification results, and explore extracted data through an intuitive web interface. We demonstrate effectiveness across industries, highlighting a production deployment at a leading healthcare provider achieving 98% classification accuracy, 80% reduced processing latency, and 77% lower operational costs over legacy baselines. IDP Accelerator is open-sourced with a live demonstration available to the community.
Chinese Translation
从非结构化文档中理解和提取结构化见解仍然是工业自然语言处理中的一项基础挑战。虽然大型语言模型(LLMs)能够实现零样本提取,但传统管道往往无法处理多文档包、复杂推理和严格的合规要求。我们提出了 IDP(智能文档处理)加速器,这是一个使自主人工智能能够实现端到端文档智能的框架,包含四个关键组件:(1)DocSplit,一个新颖的基准数据集和多模态分类器,使用 BIO 标记法对复杂文档包进行分段;(2)可配置的提取模块,利用多模态 LLM 将非结构化内容转换为结构化数据;(3)自主分析模块,符合模型上下文协议(MCP),通过安全的沙箱代码执行提供数据访问;(4)规则验证模块,用 LLM 驱动的逻辑替代确定性引擎,以进行复杂的合规检查。互动演示使用户能够上传文档包,直观地可视化分类结果,并通过直观的网络界面探索提取的数据。我们展示了在各行业中的有效性,强调了在一家领先医疗服务提供商的生产部署,达到了 98% 的分类准确率,80% 的处理延迟减少,以及相比传统基线降低了 77% 的运营成本。IDP 加速器是开源的,并向社区提供实时演示。
cs.CL / 7 / 2602.23546
Humans and LLMs Diverge on Probabilistic Inferences
人类与大型语言模型在概率推理上的差异
Abstract
Human reasoning often involves working over limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25--30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions. Finally, analyzing LLM reasoning chains, we find evidence of a common reasoning pattern used to evaluate such inferences. Our findings reveal persistent differences between humans and LLMs, and underscore the need to evaluate reasoning beyond deterministic settings.
Chinese Translation
人类推理通常涉及在有限信息的基础上得出概率结论。在其最简单的形式中,这涉及到做出一个不严格由前提推导出的推论,而是仅在前提下可能成立。尽管推理型大型语言模型(LLMs)在逻辑和数学任务上表现出色,但它们在此类开放式、非确定性推理上的行为在很大程度上仍未被探索。我们引入了ProbCOPA,这是一个包含210个手工制作的英语概率推理的数据集,每个推理都由25至30名人类参与者标注了推理的可能性。我们发现人类的反应是分级和多样的,揭示了我们数据集中推理的概率判断。将这些判断与八个最先进的推理型LLMs的反应进行比较,我们发现模型始终未能产生类似人类的分布。最后,通过分析LLM的推理链,我们发现了用于评估此类推理的共同推理模式的证据。我们的研究结果揭示了人类与LLMs之间持续存在的差异,并强调了在非确定性环境中评估推理的必要性。
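Comparing a graded human judgment distribution with a model's response distribution can be done with a standard divergence; the distributions below are invented for illustration and are not ProbCOPA data. Total variation distance is one simple choice: a spread-out human distribution versus a model that collapses onto a single answer yields a large distance, which is the kind of divergence the study reports:

```python
# Illustrative sketch (invented distributions, not dataset values): total
# variation distance between human and model likelihood distributions.

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

human = [0.1, 0.3, 0.4, 0.2]  # graded, spread across likelihood bins
model = [0.0, 0.0, 1.0, 0.0]  # collapses onto a single answer

d = total_variation(human, model)
```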
cs.CL / 8 / 2602.23547
France or Spain or Germany or France: A Neural Account of Non-Redundant Redundant Disjunctions
法国或西班牙或德国或法国:非冗余冗余析取的神经机制
Abstract
Sentences like "She will go to France or Spain, or perhaps to Germany or France." appear formally redundant, yet become acceptable in contexts such as "Mary will go to a philosophy program in France or Spain, or a mathematics program in Germany or France." While this phenomenon has typically been analyzed using symbolic formal representations, we aim to provide a complementary account grounded in artificial neural mechanisms. We first present new behavioral evidence from humans and large language models demonstrating the robustness of this apparent non-redundancy across contexts. We then show that, in language models, redundancy avoidance arises from two interacting mechanisms: models learn to bind contextually relevant information to repeated lexical items, and Transformer induction heads selectively attend to these context-licensed representations. We argue that this neural explanation sheds light on the mechanisms underlying context-sensitive semantic interpretation, and that it complements existing symbolic analyses.
Chinese Translation
像“她将去法国或西班牙,或者也许去德国或法国。”这样的句子在形式上看似冗余,但在诸如“玛丽将去法国或西班牙的哲学项目,或者去德国或法国的数学项目。”这样的语境中却变得可接受。尽管这一现象通常通过符号形式表示进行分析,我们旨在提供一个基于人工神经机制的补充解释。我们首先展示了来自人类和大型语言模型的新行为证据,证明这种表面上的非冗余性在不同语境中的稳健性。然后,我们展示在语言模型中,冗余避免是由两个相互作用的机制产生的:模型学习将上下文相关信息绑定到重复的词汇项上,而Transformer的归纳头则选择性地关注这些经过上下文许可的表征。我们认为,这一神经解释阐明了上下文敏感语义解释的机制,并且补充了现有的符号分析。
cs.CL / 9 / 2602.23577
Multi-Agent Causal Reasoning for Suicide Ideation Detection Through Online Conversations
基于多智能体因果推理的在线对话自杀意念检测
Abstract
Suicide remains a pressing global public health concern. While social media platforms offer opportunities for early risk detection through online conversation trees, existing approaches face two major limitations: (1) They rely on predefined rules (e.g., quotes or replies) to log conversations that capture only a narrow spectrum of user interactions, and (2) They overlook hidden influences such as user conformity and suicide copycat behavior, which can significantly affect suicidal expression and propagation in online communities. To address these limitations, we propose a Multi-Agent Causal Reasoning (MACR) framework that collaboratively employs a Reasoning Agent to scale user interactions and a Bias-aware Decision-Making Agent to mitigate harmful biases arising from hidden influences. The Reasoning Agent integrates cognitive appraisal theory to generate counterfactual user reactions to posts, thereby scaling user interactions. It analyses these reactions through structured dimensions, i.e., cognitive, emotional, and behavioral patterns, with a dedicated sub-agent responsible for each dimension. The Bias-aware Decision-Making Agent mitigates hidden biases through a front-door adjustment strategy, leveraging the counterfactual user reactions produced by the Reasoning Agent. Through the collaboration of reasoning and bias-aware decision making, the proposed MACR framework not only alleviates hidden biases, but also enriches contextual information of user interactions with counterfactual knowledge. Extensive experiments on real-world conversational datasets demonstrate the effectiveness and robustness of MACR in identifying suicide risk.
Chinese Translation
自杀仍然是一个紧迫的全球公共卫生问题。尽管社交媒体平台通过在线对话树提供了早期风险检测的机会,但现有方法面临两个主要限制:(1)它们依赖于预定义规则(例如,引用或回复)来记录对话,这仅捕捉了用户互动的狭窄范围;(2)它们忽视了隐藏影响因素,如用户从众行为和自杀模仿行为,这些因素可能显著影响在线社区中的自杀表达和传播。为了解决这些限制,我们提出了一种多智能体因果推理(Multi-Agent Causal Reasoning, MACR)框架,该框架通过协作使用推理智能体来扩展用户互动,并使用偏见感知决策智能体来减轻由隐藏影响引起的有害偏见。推理智能体整合了认知评估理论,以生成对帖子反事实的用户反应,从而扩展用户互动。它通过结构化维度分析这些反应,即认知、情感和行为模式,每个维度都有一个专门的子智能体负责。偏见感知决策智能体通过前门调整策略减轻隐藏偏见,利用推理智能体生成的反事实用户反应。通过推理与偏见感知决策的协作,所提出的MACR框架不仅缓解了隐藏偏见,还丰富了用户互动的上下文信息,提供了反事实知识。在真实世界对话数据集上的广泛实验表明,MACR在识别自杀风险方面的有效性和稳健性。
cs.CL / 10 / 2602.23580
BRIDGE the Gap: Mitigating Bias Amplification in Automated Scoring of English Language Learners via Inter-group Data Augmentation
弥合差距:通过群体间数据增强减轻英语学习者自动评分中的偏见放大
Abstract
In the field of educational assessment, automated scoring systems increasingly rely on deep learning and large language models (LLMs). However, these systems face significant risks of bias amplification, where model prediction gaps between student groups become larger than those observed in training data. This issue is especially severe for underrepresented groups such as English Language Learners (ELLs), as models may inherit and further magnify existing disparities in the data. We identify that this issue is closely tied to representation bias: the scarcity of minority (high-scoring ELL) samples makes models trained with empirical risk minimization favor majority (non-ELL) linguistic patterns. Consequently, models tend to under-predict ELL students who even demonstrate comparable domain knowledge but use different linguistic patterns, thereby undermining the fairness of automated scoring outcomes. To mitigate this, we propose BRIDGE, a Bias-Reducing Inter-group Data GEneration framework designed for low-resource assessment settings. Instead of relying on the limited minority samples, BRIDGE synthesizes high-scoring ELL samples by "pasting" construct-relevant (i.e., rubric-aligned knowledge and evidence) content from abundant high-scoring non-ELL samples into authentic ELL linguistic patterns. We further introduce a discriminator model to ensure the quality of synthetic samples. Experiments on California Science Test (CAST) datasets demonstrate that BRIDGE effectively reduces prediction bias for high-scoring ELL students while maintaining overall scoring performance. Notably, our method achieves fairness gains comparable to using additional real human data, offering a cost-effective solution for ensuring equitable scoring in large-scale assessments.
Chinese Translation
在教育评估领域,自动评分系统越来越依赖深度学习和大型语言模型(LLMs)。然而,这些系统面临着显著的偏见放大风险,即学生群体之间的模型预测差距变得比训练数据中观察到的差距更大。对于英语学习者(ELLs)等代表性不足的群体,这一问题尤为严重,因为模型可能会继承并进一步放大数据中现有的差异。我们发现,这一问题与代表性偏见密切相关:少数群体(高分ELL)样本的稀缺使得采用经验风险最小化训练的模型偏向于多数群体(非ELL)的语言模式。因此,模型往往低估那些即使展现出相当领域知识但使用不同语言模式的ELL学生,从而削弱了自动评分结果的公平性。为此,我们提出了BRIDGE,一个旨在低资源评估环境中设计的减少偏见的群体间数据生成框架。BRIDGE并不依赖于有限的少数样本,而是通过将来自丰富的高分非ELL样本中的构念相关(即与评分标准一致的知识和证据)内容“粘贴”到真实的ELL语言模式中,合成高分ELL样本。我们进一步引入了一个判别模型,以确保合成样本的质量。在加利福尼亚科学测试(CAST)数据集上的实验表明,BRIDGE有效减少了高分ELL学生的预测偏见,同时保持了整体评分性能。值得注意的是,我们的方法在公平性提升方面可与使用额外真实人类数据相媲美,为确保大规模评估中的公平评分提供了一种具有成本效益的解决方案。
cs.CL / 11 / 2602.23603
LFQA-HP-1M: A Large-Scale Human Preference Dataset for Long-Form Question Answering
LFQA-HP-1M:一个用于长篇问答的大规模人类偏好数据集
Abstract
Long-form question answering (LFQA) demands nuanced evaluation of multi-sentence explanatory responses, yet existing metrics often fail to reflect human judgment. We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA. We propose nine rubrics for answer quality evaluation, and show that simple linear models based on these features perform comparably to state-of-the-art LLM evaluators. We further examine transitivity consistency, positional bias, and verbosity biases in LLM evaluators and demonstrate their vulnerability to adversarial perturbations. Overall, this work provides one of the largest public LFQA preference datasets and a rubric-driven framework for transparent and reliable evaluation.
Chinese Translation
长篇问答(LFQA)要求对多句解释性回答进行细致评估,但现有的评估指标往往无法反映人类的判断。我们提出了LFQA-HP-1M,这是一个包含130万条人类成对偏好注释的大规模数据集,专门用于LFQA。我们提出了九个用于回答质量评估的标准,并展示了基于这些特征的简单线性模型在性能上与最先进的LLM(大语言模型)评估器相当。我们进一步考察了LLM评估器中的传递一致性、位置偏差和冗长偏差,并证明了它们对对抗性扰动的脆弱性。总体而言,这项工作提供了一个最大的公共LFQA偏好数据集之一,以及一个基于标准的透明可靠评估框架。
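The claim that simple linear models over rubric features can score pairwise preferences can be sketched as follows. The rubric names, weights, and feature values below are invented for illustration (the paper defines nine rubrics; this sketch uses three): each answer receives a weighted rubric score, and a logistic link on the score difference gives a Bradley-Terry style preference probability:

```python
# Hedged sketch of a rubric-feature linear preference model (rubric names
# and weights are assumptions, not the paper's learned parameters).
import math

WEIGHTS = {"factuality": 2.0, "completeness": 1.0, "clarity": 0.5}  # assumed

def rubric_score(features):
    return sum(WEIGHTS[k] * v for k, v in features.items())

def preference_prob(feat_a, feat_b):
    """P(answer A preferred over B) via a logistic link on the score gap."""
    return 1 / (1 + math.exp(-(rubric_score(feat_a) - rubric_score(feat_b))))

p = preference_prob(
    {"factuality": 0.9, "completeness": 0.7, "clarity": 0.8},
    {"factuality": 0.4, "completeness": 0.9, "clarity": 0.9},
)
```

In practice such weights would be fit on the pairwise annotations; the point of the sketch is that the scoring function itself is a transparent linear combination.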
cs.CL / 12 / 2602.23610
LLM-Driven Multi-Turn Task-Oriented Dialogue Synthesis for Realistic Reasoning
基于大语言模型的多轮任务导向对话合成用于现实推理
Abstract
The reasoning capability of large language models (LLMs), defined as their ability to analyze, infer, and make decisions based on input information, is essential for building intelligent task-oriented dialogue systems. However, existing benchmarks do not sufficiently reflect the complexity of real-world scenarios, which limits their effectiveness in evaluating and enhancing LLM reasoning in practical contexts. Many current reasoning datasets are overly simplistic and abstract, often disconnected from realistic task flows, domain constraints, and operational rules, making it difficult to effectively evaluate LLMs' logical reasoning ability. In addition, data contamination from pretraining corpora undermines the reliability of evaluation results, and traditional crowdsourcing methods for dataset construction are labor-intensive and difficult to scale. To address these challenges, we propose a LLM-driven framework for synthesizing multi-turn, task-oriented dialogues grounded in realistic reasoning scenarios, leveraging trilevel optimization to enhance dialogue quality. Our method generates dialogues grounded in authentic task scenarios, enriched with real-world information, and exhibiting strong contextual coherence. Corresponding reasoning tasks are carefully designed around these dialogues and iteratively refined to continuously improve the tasks' quality and challenge. The resulting dataset serves as a valuable benchmark for assessing and advancing the realistic logical reasoning capabilities of LLMs. Experimental results show that our synthetic data-based reasoning tasks introduce non-trivial reasoning challenges and provide meaningful support for improving the reasoning capabilities of LLMs.
Chinese Translation
大语言模型(LLMs)的推理能力,即根据输入信息进行分析、推断和决策的能力,对于构建智能的任务导向对话系统至关重要。然而,现有的基准测试并未充分反映现实场景的复杂性,这限制了它们在实际环境中评估和提升LLM推理能力的有效性。许多当前的推理数据集过于简单和抽象,往往与现实任务流程、领域约束和操作规则脱节,使得有效评估LLM的逻辑推理能力变得困难。此外,来自预训练语料库的数据污染削弱了评估结果的可靠性,而传统的数据集构建众包方法则劳动密集且难以扩展。为了解决这些挑战,我们提出了一种基于LLM的框架,用于合成基于现实推理场景的多轮任务导向对话,利用三级优化来提升对话质量。我们的方法生成基于真实任务场景的对话,丰富了现实世界的信息,并展现出强烈的上下文连贯性。相应的推理任务围绕这些对话精心设计,并经过迭代优化,以持续提高任务的质量和挑战性。最终生成的数据集为评估和提升LLM的现实逻辑推理能力提供了宝贵的基准。实验结果表明,我们基于合成数据的推理任务引入了非平凡的推理挑战,并为提升LLM的推理能力提供了有意义的支持。
cs.CL / 13 / 2602.23656
TRIZ-RAGNER: A Retrieval-Augmented Large Language Model for TRIZ-Aware Named Entity Recognition in Patent-Based Contradiction Mining
TRIZ-RAGNER:一种用于基于专利的矛盾挖掘的TRIZ感知命名实体识别的检索增强大型语言模型
Abstract
TRIZ-based contradiction mining is a fundamental task in patent analysis and systematic innovation, as it enables the identification of improving and worsening technical parameters that drive inventive problem solving. However, existing approaches largely rely on rule-based systems or traditional machine learning models, which struggle with semantic ambiguity, domain dependency, and limited generalization when processing complex patent language. Recently, large language models (LLMs) have shown strong semantic understanding capabilities, yet their direct application to TRIZ parameter extraction remains challenging due to hallucination and insufficient grounding in structured TRIZ knowledge. To address these limitations, this paper proposes TRIZ-RAGNER, a retrieval-augmented large language model framework for TRIZ-aware named entity recognition in patent-based contradiction mining. TRIZ-RAGNER reformulates contradiction mining as a semantic-level NER task and integrates dense retrieval over a TRIZ knowledge base, cross-encoder reranking for context refinement, and structured LLM prompting to extract improving and worsening parameters from patent sentences. By injecting domain-specific TRIZ knowledge into the LLM reasoning process, the proposed framework effectively reduces semantic noise and improves extraction consistency. Experiments on the PaTRIZ dataset demonstrate that TRIZ-RAGNER consistently outperforms traditional sequence labeling models and LLM-based baselines. The proposed framework achieves a precision of 85.6%, a recall of 82.9%, and an F1-score of 84.2% in TRIZ contradiction pair identification. Compared with the strongest baseline using prompt-enhanced GPT, TRIZ-RAGNER yields an absolute F1-score improvement of 7.3 percentage points, confirming the effectiveness of retrieval-augmented TRIZ knowledge grounding for robust and accurate patent-based contradiction mining.
Chinese Translation
基于TRIZ的矛盾挖掘是专利分析和系统创新中的一项基础任务,因为它能够识别推动创造性问题解决的改善和恶化的技术参数。然而,现有的方法主要依赖于基于规则的系统或传统机器学习模型,这些方法在处理复杂专利语言时面临语义模糊、领域依赖和有限的泛化能力等挑战。最近,大型语言模型(LLMs)展示了强大的语义理解能力,但由于幻觉现象和对结构化TRIZ知识的不足基础,其在TRIZ参数提取中的直接应用仍然具有挑战性。为了解决这些局限性,本文提出了TRIZ-RAGNER,一个用于基于专利的矛盾挖掘的TRIZ感知命名实体识别的检索增强大型语言模型框架。TRIZ-RAGNER将矛盾挖掘重新表述为一个语义层面的命名实体识别任务,并集成了对TRIZ知识库的密集检索、上下文细化的交叉编码器重排序,以及结构化LLM提示,以从专利句子中提取改善和恶化的参数。通过将特定领域的TRIZ知识注入LLM推理过程,所提出的框架有效减少了语义噪声,提高了提取的一致性。在PaTRIZ数据集上的实验表明,TRIZ-RAGNER始终优于传统的序列标注模型和基于LLM的基线。所提出的框架在TRIZ矛盾对识别中实现了85.6%的精确率、82.9%的召回率和84.2%的F1-score。与使用增强提示的GPT的最强基线相比,TRIZ-RAGNER的F1-score绝对提升了7.3个百分点,确认了检索增强的TRIZ知识基础在稳健和准确的基于专利的矛盾挖掘中的有效性。
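The retrieve-rerank-prompt pipeline described in this abstract can be sketched in a few lines. Everything below is an illustrative stand-in, not the paper's implementation: the tiny `TRIZ_KB`, the bag-of-letters "embedding", the word-overlap "reranker", and the prompt template are all invented; a real system would use dense neural embeddings and a cross-encoder.

```python
from math import sqrt

# Illustrative stand-in for a TRIZ knowledge base (parameter id -> description).
TRIZ_KB = {
    1: "weight of moving object",
    9: "speed",
    14: "strength",
    27: "reliability",
}

def embed(text):
    # Toy stand-in for a dense embedding: bag-of-letters frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(sentence, k=3):
    # Stage 1: dense retrieval over the knowledge base.
    q = embed(sentence)
    ranked = sorted(TRIZ_KB.items(), key=lambda kv: cosine(q, embed(kv[1])), reverse=True)
    return ranked[:k]

def rerank(sentence, candidates):
    # Stage 2: stand-in for cross-encoder reranking -- here, plain word overlap.
    words = set(sentence.lower().split())
    return sorted(candidates, key=lambda kv: len(words & set(kv[1].split())), reverse=True)

def build_prompt(sentence, candidates):
    # Stage 3: structured prompt asking for improving/worsening parameters.
    ctx = "\n".join(f"- parameter {pid}: {desc}" for pid, desc in candidates)
    return ("Identify the improving and worsening TRIZ parameters.\n"
            f"Candidates:\n{ctx}\nSentence: {sentence}")

sentence = "increasing speed reduces reliability"
prompt = build_prompt(sentence, rerank(sentence, retrieve(sentence)))
```

The grounding idea is that the LLM only chooses among retrieved, reranked TRIZ parameters instead of free-generating them, which is what reduces hallucination.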
cs.CL / 14 / 2602.23729
From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning
从静态基准到动态协议:以代理为中心的文本异常检测用于评估大型语言模型的推理能力
Abstract
The evaluation of large language models (LLMs) has predominantly relied on static datasets, which offer limited scalability and fail to capture the evolving reasoning capabilities of recent models. To overcome these limitations, we propose an agent-centric benchmarking paradigm that moves beyond static datasets by introducing a dynamic protocol in which autonomous agents iteratively generate, validate, and solve problems. Within this protocol, a teacher agent generates candidate problems, an orchestrator agent rigorously verifies their validity and guards against adversarial attacks, and a student agent attempts to solve the validated problems. An invalid problem is revised by the teacher agent until it passes validation. If the student correctly solves the problem, the orchestrator prompts the teacher to generate more challenging variants. Consequently, the benchmark scales in difficulty automatically as more capable agents are substituted into any role, enabling progressive evaluation of large language models without manually curated datasets. Adopting text anomaly detection as our primary evaluation format, which demands cross-sentence logical inference and resists pattern-matching shortcuts, we demonstrate that this protocol systematically exposes corner-case reasoning errors that conventional benchmarks fail to reveal. We further advocate evaluating systems along several complementary axes including cross-model pairwise performance and progress between the initial and orchestrator-finalized problems. By shifting the focus from fixed datasets to dynamic protocols, our approach offers a sustainable direction for evaluating ever-evolving language models and introduces a research agenda centered on the co-evolution of agent-centric benchmarks.
Chinese Translation
大型语言模型(LLMs)的评估主要依赖于静态数据集,这些数据集的可扩展性有限,无法捕捉到近期模型不断发展的推理能力。为了克服这些局限性,我们提出了一种以代理为中心的基准测试范式,通过引入动态协议,超越静态数据集,在该协议中,自主代理迭代生成、验证和解决问题。在这个协议中,教师代理生成候选问题,协调者代理严格验证其有效性并防范对抗性攻击,而学生代理则尝试解决经过验证的问题。无效的问题由教师代理修订,直到通过验证。如果学生正确解决了问题,协调者将提示教师生成更具挑战性的变体。因此,基准的难度会随着更强大的代理在任何角色中的替换而自动提升,从而实现对大型语言模型的渐进评估,而无需手动策划的数据集。我们采用文本异常检测作为主要评估格式,这要求跨句子逻辑推理并抵制模式匹配的捷径,证明该协议系统性地揭示了传统基准无法揭示的边缘案例推理错误。我们进一步倡导从多个互补维度评估系统,包括跨模型的成对性能和初始问题与协调者最终确定的问题之间的进展。通过将重点从固定数据集转向动态协议,我们的方法为评估不断演变的语言模型提供了一种可持续的方向,并引入了一个以代理中心基准的共同演变为核心的研究议程。
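The teacher/orchestrator/student loop can be sketched with stub agents. The arithmetic "problems", the perfect solver, and the difficulty-scaling rule below are placeholder assumptions standing in for LLM agents and the anomaly-detection format; only the control flow mirrors the protocol.

```python
def teacher(difficulty):
    # Stub teacher agent: emits an addition problem whose size scales with difficulty.
    a, b = 3 * difficulty, 4 * difficulty
    return {"question": f"{a}+{b}", "answer": a + b, "difficulty": difficulty}

def orchestrator_validate(problem):
    # Stub orchestrator: rejects malformed problems (answer must match question).
    x, y = map(int, problem["question"].split("+"))
    return x + y == problem["answer"]

def student(problem):
    # Stub student agent standing in for the evaluated model; here it is perfect.
    x, y = map(int, problem["question"].split("+"))
    return x + y

def run_protocol(rounds=3):
    difficulty, log = 1, []
    for _ in range(rounds):
        problem = teacher(difficulty)
        while not orchestrator_validate(problem):
            problem = teacher(difficulty)  # teacher revises until validation passes
        solved = student(problem) == problem["answer"]
        log.append((difficulty, solved))
        if solved:
            difficulty += 1  # orchestrator requests harder variants
    return log

log = run_protocol(3)
```

Because difficulty only increases after a validated solve, swapping in a stronger student automatically pushes the benchmark harder, which is the scaling property the abstract emphasizes.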
cs.CL / 15 / 2602.23753
Structured Prompt Optimization for Few-Shot Text Classification via Semantic Alignment in Latent Space
通过潜在空间中的语义对齐进行少样本文本分类的结构化提示优化
Abstract
This study addresses the issues of semantic entanglement, unclear label structure, and insufficient feature representation in few-shot text classification, and proposes an optimization framework based on structured prompts to enhance semantic understanding and task adaptation under low-resource conditions. The framework first uses a pretrained language model to encode the input text and obtain basic semantic representations. It then introduces structured prompts composed of multi-dimensional semantic factors and integrates them with text features through a learnable combination mechanism, which forms task-related representations with clear boundaries in the latent space. To further strengthen the consistency between text representations and label semantics, the method constructs a structured label embedding matrix and employs a cross-space alignment mechanism to ensure stable matching between textual features and label attributes. In addition, the model applies prompt orthogonality constraints and a joint optimization objective to maintain independence across different semantic factors in the prompts, allowing the structured prompts to provide transparent and controllable guidance for classification decisions. Three types of sensitivity experiments, including learning rate sensitivity, prompt length sensitivity, and data scale sensitivity, are designed to evaluate the stability and robustness of the framework under different conditions. Experimental results show that the proposed structured prompt optimization framework effectively alleviates semantic conflicts and label ambiguity in few-shot text classification. It significantly improves performance on accuracy, precision, recall, and AUC, and demonstrates strong cross-task applicability.
Chinese Translation
本研究针对少样本文本分类中的语义纠缠、标签结构不清晰和特征表示不足等问题,提出了一种基于结构化提示的优化框架,以增强低资源条件下的语义理解和任务适应性。该框架首先利用预训练语言模型对输入文本进行编码,获取基本的语义表示。然后,引入由多维语义因素组成的结构化提示,并通过可学习的组合机制将其与文本特征进行整合,从而在潜在空间中形成具有明确边界的任务相关表示。为了进一步增强文本表示与标签语义之间的一致性,该方法构建了一个结构化标签嵌入矩阵,并采用跨空间对齐机制,确保文本特征与标签属性之间的稳定匹配。此外,模型应用了提示正交性约束和联合优化目标,以保持提示中不同语义因素之间的独立性,使结构化提示能够为分类决策提供透明且可控的指导。设计了包括学习率敏感性、提示长度敏感性和数据规模敏感性在内的三种敏感性实验,以评估该框架在不同条件下的稳定性和鲁棒性。实验结果表明,所提出的结构化提示优化框架有效缓解了少样本文本分类中的语义冲突和标签模糊性,显著提高了准确率、精确率、召回率和AUC,并展示了强大的跨任务适用性。
cs.CL / 16 / 2602.23792
Divide and Conquer: Accelerating Diffusion-Based Large Language Models via Adaptive Parallel Decoding
分而治之:通过自适应并行解码加速基于扩散的大型语言模型
Abstract
Diffusion-based large language models (dLLMs) have shown promising performance across various reasoning tasks, establishing themselves as an alternative to autoregressive large language models (LLMs). Unlike autoregressive LLMs that generate one token per step based on all previous tokens, dLLMs theoretically enable parallel generation of multiple tokens at each decoding step. However, recent dLLMs still favor one-token-per-step generation in practice, as directly decoding multiple masked tokens often leads to degraded generation quality and stability. This reveals a substantial gap between the theoretical parallelism and practical performance of dLLMs. To bridge this gap, we introduce an adaptive parallel decoding approach, namely DiCo, which features a three-phase divide-and-conquer paradigm to unleash the inherent parallelism of dLLMs. During the Divide phase, DiCo first explores the input masked sequence and identifies masked tokens as seed tokens, which are then expanded to construct a set of local clusters. During the Conquer phase, DiCo performs parallel decoding across different local clusters constructed in the Divide phase. The divide-and-conquer process repeatedly alternates between the Divide and Conquer phases until convergence. During the Finalize phase, DiCo decodes the remaining few masked tokens using an effective fine-grained compound decoding scheme to finalize the generation. Extensive experiments demonstrate that DiCo can achieve significant inference speedups while maintaining competitive generation quality.
Chinese Translation
基于扩散的大型语言模型(dLLMs)在各种推理任务中表现出色,成为自回归大型语言模型(LLMs)的替代方案。与自回归 LLMs 每一步基于所有先前的标记生成一个标记不同,dLLMs 理论上允许在每个解码步骤中并行生成多个标记。然而,最近的 dLLMs 在实践中仍然倾向于逐步生成一个标记,因为直接解码多个被掩盖的标记往往会导致生成质量和稳定性的下降。这揭示了 dLLMs 理论并行性与实际性能之间的显著差距。为了弥合这一差距,我们提出了一种自适应并行解码方法,即 DiCo,该方法采用三阶段的分而治之范式,以释放 dLLMs 的内在并行性。在"分"(Divide)阶段,DiCo 首先探索输入的被掩盖序列,并将被掩盖的标记识别为种子标记,然后扩展以构建一组局部集群。在"治"(Conquer)阶段,DiCo 在"分"阶段构建的不同局部集群之间进行并行解码。分而治之的过程在"分"阶段和"治"阶段之间反复交替,直到收敛。在最终(Finalize)阶段,DiCo 使用有效的细粒度复合解码方案解码剩余的少量被掩盖标记,以完成生成。大量实验表明,DiCo 能够在保持竞争性生成质量的同时实现显著的推理加速。
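The Divide phase (seed selection plus local expansion over masked positions) can be illustrated as below. The greedy spacing rule and the radius parameter are guessed stand-ins for the paper's actual criterion; the Conquer phase would then decode each cluster in parallel, and Finalize handles the leftover positions.

```python
def divide(masked_positions, radius=1):
    # Illustrative Divide phase: greedily pick seed positions spaced apart,
    # expand each seed into a local cluster of masked neighbours, and leave
    # the rest for the Finalize phase. The heuristic is invented for clarity.
    seeds, clusters, taken = [], [], set()
    for pos in sorted(masked_positions):
        if all(abs(pos - s) > 2 * radius for s in seeds):
            seeds.append(pos)
            cluster = [p for p in masked_positions
                       if abs(p - pos) <= radius and p not in taken]
            taken.update(cluster)
            clusters.append(cluster)
    remainder = [p for p in masked_positions if p not in taken]
    return clusters, remainder

# Conquer would decode each cluster in parallel; Finalize decodes the remainder.
clusters, remainder = divide([0, 1, 2, 5, 6, 9])
```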
cs.CL / 17 / 2602.23826
GLUScope: A Tool for Analyzing GLU Neurons in Transformer Language Models
GLUScope:一种用于分析变换器语言模型中GLU神经元的工具
Abstract
We present GLUScope, an open-source tool for analyzing neurons in Transformer-based language models, intended for interpretability researchers. We focus on more recent models than previous tools do; specifically we consider gated activation functions such as SwiGLU. This introduces a new challenge: understanding positive activations is not enough. Instead, both the gate and the in activation of a neuron can be positive or negative, leading to four different possible sign combinations that in some cases have quite different functionalities. Accordingly, for any neuron, our tool shows text examples for each of the four sign combinations, and indicates how often each combination occurs. We describe examples of how our tool can lead to novel insights. A demo is available at https://sjgerstner.github.io/gluscope.
Chinese Translation
我们提出了GLUScope,这是一款开源工具,用于分析基于变换器的语言模型中的神经元,旨在为可解释性研究人员提供支持。我们关注比之前工具更新的模型;具体而言,我们考虑了诸如SwiGLU的门控激活函数。这引入了一个新的挑战:理解正激活值是不够的。相反,神经元的门控和输入激活都可以是正值或负值,从而导致四种不同的符号组合,在某些情况下,这些组合具有截然不同的功能。因此,对于任何神经元,我们的工具展示了每种符号组合的文本示例,并指示每种组合出现的频率。我们描述了我们的工具如何带来新颖见解的示例。演示版本可在 https://sjgerstner.github.io/gluscope 获取。
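The four sign combinations of a GLU neuron can be tallied in a few lines. The activation values below are made up; the silu-based comment reflects the SwiGLU case the abstract mentions.

```python
from collections import Counter

def sign_combo(gate, inp):
    # One of four sign combinations of a GLU neuron's gate and "in"
    # pre-activations (for SwiGLU, the neuron output is silu(gate) * inp).
    # Zero is grouped with the negative side for simplicity.
    return ("g+" if gate > 0 else "g-") + ("i+" if inp > 0 else "i-")

def combo_histogram(gate_acts, in_acts):
    # How often each combination occurs over a stream of token activations.
    return Counter(sign_combo(g, i) for g, i in zip(gate_acts, in_acts))

# Made-up activation values covering all four quadrants.
hist = combo_histogram([0.5, -1.2, 2.0, -0.3], [1.0, 0.7, -0.4, -2.0])
```

A tool like GLUScope would additionally attach the text tokens that produced each combination, so that the quadrants can be inspected qualitatively.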
cs.CL / 18 / 2602.23845
CLFEC: A New Task for Unified Linguistic and Factual Error Correction in paragraph-level Chinese Professional Writing
CLFEC:统一语言和事实错误纠正的新任务——针对段落级中文专业写作
Abstract
Chinese text correction has traditionally focused on spelling and grammar, while factual error correction is usually treated separately. However, in paragraph-level Chinese professional writing, linguistic (word/grammar/punctuation) and factual errors frequently co-occur and interact, making unified correction both necessary and challenging. This paper introduces CLFEC (Chinese Linguistic & Factual Error Correction), a new task for joint linguistic and factual correction. We construct a mixed, multi-domain Chinese professional writing dataset spanning current affairs, finance, law, and medicine. We then conduct a systematic study of LLM-based correction paradigms, from prompting to retrieval-augmented generation (RAG) and agentic workflows. The analysis reveals practical challenges, including limited generalization of specialized correction models, the need for evidence grounding for factual repair, the difficulty of mixed-error paragraphs, and over-correction on clean inputs. Results further show that handling linguistic and factual errors within the same context outperforms decoupled processing, and that agentic workflows can be effective with suitable backbone models. Overall, our dataset and empirical findings provide guidance for building reliable, fully automatic proofreading systems in industrial settings.
Chinese Translation
中文文本纠正传统上关注拼写和语法,而事实错误纠正通常被单独处理。然而,在段落级中文专业写作中,语言(词汇/语法/标点)和事实错误常常同时出现并相互影响,这使得统一纠正既必要又具有挑战性。本文介绍了CLFEC(中文语言与事实错误纠正),这是一个针对语言和事实纠正的联合任务。我们构建了一个混合的多领域中文专业写作数据集,涵盖时事、金融、法律和医学等领域。随后,我们对基于大型语言模型(LLM)的纠正范式进行了系统研究,从提示到检索增强生成(RAG)及代理工作流。分析揭示了实际挑战,包括专业纠正模型的有限泛化能力、事实修复所需的证据基础、混合错误段落的处理难度以及对干净输入的过度纠正。结果进一步表明,在同一上下文中处理语言和事实错误的效果优于解耦处理,并且代理工作流在合适的基础模型下可以有效。总体而言,我们的数据集和实证发现为在工业环境中构建可靠的全自动校对系统提供了指导。
cs.CL / 19 / 2602.23928
The Astonishing Ability of Large Language Models to Parse Jabberwockified Language
大型语言模型解析胡言乱语语言的惊人能力
Abstract
We show that large language models (LLMs) have an astonishing ability to recover meaning from severely degraded English texts. Texts in which content words have been randomly substituted by nonsense strings, e.g., "At the ghybe of the swuint, we are haiveed to Wourge Phrear-gwurr, who sproles into an ghitch flount with his crurp", can be translated to conventional English that is, in many cases, close to the original text, e.g., "At the start of the story, we meet a man, Chow, who moves into an apartment building with his wife." These results show that structural cues (e.g., morphosyntax, closed-class words) constrain lexical meaning to a much larger degree than imagined. Although the abilities of LLMs to make sense of "Jabberwockified" English are clearly superhuman, they are highly relevant to understanding linguistic structure and suggest that efficient language processing either in biological or artificial systems likely benefits from very tight integration between syntax, lexical semantics, and general world knowledge.
Chinese Translation
我们展示了大型语言模型(LLMs)在从严重退化的英语文本中恢复意义方面具有惊人的能力。在这些文本中,内容词被随机替换为无意义的字符串,例如,“在 swuint 的 ghybe,我们遇到了 Wourge Phrear-gwurr,他走进一个带有 crurp 的 ghitch flount”,可以翻译为常规英语,在许多情况下接近原始文本,例如,“在故事的开头,我们遇到一个人,Chow,他与妻子一起搬进了一栋公寓大楼。”这些结果表明,结构线索(例如,形态句法、封闭类词)在约束词汇意义方面的程度远超想象。尽管 LLMs 理解“胡言乱语”英语的能力显然超出人类,但它们与理解语言结构高度相关,并且表明无论是在生物系统还是人工系统中,高效的语言处理可能受益于句法、词汇语义和一般世界知识之间的紧密整合。
cs.CL / 20 / 2602.23940
Benchmarking BERT-based Models for Sentence-level Topic Classification in Nepali Language
基于BERT模型的尼泊尔语句子级主题分类基准研究
Abstract
Transformer-based models such as BERT have significantly advanced Natural Language Processing (NLP) across many languages. However, Nepali, a low-resource language written in Devanagari script, remains relatively underexplored. This study benchmarks multilingual, Indic, Hindi, and Nepali BERT variants to evaluate their effectiveness in Nepali topic classification. Ten pre-trained models, including mBERT, XLM-R, MuRIL, DevBERT, HindiBERT, IndicBERT, and NepBERTa, were fine-tuned and tested on the balanced Nepali dataset containing 25,006 sentences across five conceptual domains and the performance was evaluated using accuracy, weighted precision, recall, F1-score, and AUROC metrics. The results reveal that Indic models, particularly MuRIL-large, achieved the highest F1-score of 90.60%, outperforming multilingual and monolingual models. NepBERTa also performed competitively with an F1-score of 88.26%. Overall, these findings establish a robust baseline for future document-level classification and broader Nepali NLP applications.
Chinese Translation
基于Transformer的模型,如BERT,已在多种语言的自然语言处理(NLP)领域取得了显著进展。然而,尼泊尔语作为一种使用天城文书写的低资源语言,仍然相对未被充分探索。本研究对多语言、印度语系、印地语和尼泊尔语的BERT变体进行了基准测试,以评估它们在尼泊尔语主题分类中的有效性。我们对包括mBERT、XLM-R、MuRIL、DevBERT、HindiBERT、IndicBERT和NepBERTa在内的十个预训练模型进行了微调,并在涵盖五个概念领域、包含25,006个句子的平衡尼泊尔语数据集上进行了测试,评估指标包括准确率、加权精确率、召回率、F1分数和AUROC。结果显示,印度语系模型,特别是MuRIL-large,达到了最高的F1分数90.60%,优于多语言和单语言模型。NepBERTa的表现也相当出色,F1分数为88.26%。总体而言,这些发现为未来的文档级分类和更广泛的尼泊尔语NLP应用建立了一个稳健的基准。
cs.CL / 21 / 2602.23941
EDDA-Coordinata: An Annotated Dataset of Historical Geographic Coordinates
EDDA-Coordinata:历史地理坐标的注释数据集
Abstract
This paper introduces a dataset of enriched geographic coordinates retrieved from Diderot and d'Alembert's eighteenth-century Encyclopedie. Automatically recovering geographic coordinates from historical texts is a complex task, as they are expressed in a variety of ways and with varying levels of precision. To improve retrieval of coordinates from similar digitized early modern texts, we have created a gold standard dataset, trained models, published the resulting inferred and normalized coordinate data, and experimented with applying these models to new texts. From 74,000 total articles in each of the digitized versions of the Encyclopedie from ARTFL and ENCCRE, we examined 15,278 geographical entries, manually identifying 4,798 containing coordinates, and 10,480 with descriptive but non-numerical references. Leveraging our gold standard annotations, we trained transformer-based models to retrieve and normalize coordinates. The pipeline presented here combines a classifier to identify coordinate-bearing entries and a second model for retrieval, tested across encoder-decoder and decoder architectures. Cross-validation yielded an 86% EM score. On an out-of-domain eighteenth-century Trevoux dictionary (also in French), our fine-tuned model had a 61% EM score, while for the nineteenth-century, 7th edition of the Encyclopaedia Britannica in English, the EM was 77%. These findings highlight the gold standard dataset's usefulness as training data, and our two-step method's cross-lingual, cross-domain generalizability.
Chinese Translation
本文介绍了从迪德罗和达朗贝尔的十八世纪《百科全书》中提取的丰富地理坐标数据集。从历史文本中自动恢复地理坐标是一项复杂的任务,因为这些坐标以多种方式表达,并且精确度各不相同。为了提高从类似数字化早期现代文本中检索坐标的能力,我们创建了一个金标准数据集,训练了模型,发布了推断和标准化的坐标数据,并尝试将这些模型应用于新文本。在ARTFL和ENCCRE的《百科全书》数字化版本中,共有74,000篇文章,我们检查了15,278个地理条目,手动识别出4,798个包含坐标的条目,以及10,480个具有描述性但非数值引用的条目。利用我们的金标准注释,我们训练了基于变换器的模型以检索和标准化坐标。这里提出的流程结合了一个分类器来识别包含坐标的条目,以及一个用于检索的第二模型,分别在编码器-解码器和解码器架构上进行了测试。交叉验证的结果显示,EM得分为86%。在一个领域外的十八世纪《特雷武词典》(同样为法语)上,我们微调的模型的EM得分为61%;而在十九世纪第七版的《大英百科全书》(英语)上,EM得分为77%。这些发现突显了金标准数据集作为训练数据的实用性,以及我们两步法在跨语言和跨领域的可推广性。
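The normalization half of the task can be illustrated with one toy pattern. The `"48 deg. 51 min. N"` format and the regex below are invented stand-ins: real Encyclopedie entries vary far more in wording and precision, which is why the paper trains transformer models rather than relying on rules.

```python
import re

def normalize_coordinate(text):
    # Parse one illustrative early-modern pattern like "48 deg. 51 min. N"
    # into signed decimal degrees. This regex is a stand-in for the paper's
    # trained retrieval/normalization models, not their method.
    m = re.search(r"(\d+)\s*deg\.\s*(\d+)\s*min\.\s*([NSEW])", text)
    if m is None:
        return None  # a classifier would route such entries away first
    deg, minutes, hemi = int(m.group(1)), int(m.group(2)), m.group(3)
    value = deg + minutes / 60.0
    return -value if hemi in "SW" else value
```

The two-step structure of the paper's pipeline is mirrored here: entries where nothing parses (`None`) correspond to the classifier's "no coordinates" class, and the arithmetic corresponds to the normalization step.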
cs.CL / 22 / 2602.23944
MemEmo: Evaluating Emotion in Memory Systems of Agents
MemEmo:评估智能体记忆系统中的情感
Abstract
Memory systems address the challenge of context loss in Large Language Models during prolonged interactions. However, compared to human cognition, the efficacy of these systems in processing emotion-related information remains inconclusive. To address this gap, we propose an emotion-enhanced memory evaluation benchmark to assess the performance of mainstream and state-of-the-art memory systems in handling affective information. We developed the Human-Like Memory Emotion (HLME) dataset, which evaluates memory systems across three dimensions: emotional information extraction, emotional memory updating, and emotional memory question answering. Experimental results indicate that none of the evaluated systems achieve robust performance across all three tasks. Our findings provide an objective perspective on the current deficiencies of memory systems in processing emotional memories and suggest a new trajectory for future research and system optimization.
Chinese Translation
记忆系统解决了在长时间交互中大语言模型面临的上下文丧失挑战。然而,与人类认知相比,这些系统在处理与情感相关的信息方面的有效性仍然不确定。为了解决这一问题,我们提出了一种情感增强的记忆评估基准,以评估主流和最先进的记忆系统在处理情感信息方面的表现。我们开发了Human-Like Memory Emotion(HLME)数据集,该数据集从三个维度评估记忆系统:情感信息提取、情感记忆更新和情感记忆问答。实验结果表明,评估的系统在所有三个任务中都未能实现稳健的表现。我们的研究结果为当前记忆系统在处理情感记忆方面的不足提供了客观视角,并为未来的研究和系统优化提出了新的方向。
cs.CL / 23 / 2602.23993
The GRADIEND Python Package: An End-to-End System for Gradient-Based Feature Learning
GRADIEND Python 包:一种基于梯度的特征学习端到端系统
Abstract
We present gradiend, an open-source Python package that operationalizes the GRADIEND method for learning feature directions from factual-counterfactual MLM and CLM gradients in language models. The package provides a unified workflow for feature-related data creation, training, evaluation, visualization, persistent model rewriting via controlled weight updates, and multi-feature comparison. We demonstrate GRADIEND on an English pronoun paradigm and on a large-scale feature comparison that reproduces prior use cases.
Chinese Translation
我们介绍了 gradiend,一个开源的 Python 包,它实现了 GRADIEND 方法,用于从语言模型中的事实-反事实 MLM 和 CLM 梯度学习特征方向。该包提供了一个统一的工作流程,用于特征相关数据的创建、训练、评估、可视化、通过受控权重更新进行持久模型重写,以及多特征比较。我们在一个英语代词范例和一个大规模特征比较上展示了 GRADIEND,重现了先前的使用案例。
cs.CL / 24 / 2602.24002
Dialect and Gender Bias in YouTube's Spanish Captioning System
YouTube西班牙语字幕系统中的方言与性别偏见
Abstract
Spanish is the official language of twenty-one countries and is spoken by over 441 million people. Naturally, there are many variations in how Spanish is spoken across these countries. Media platforms such as YouTube rely on automatic speech recognition systems to make their content accessible to different groups of users. However, YouTube offers only one option for automatically generating captions in Spanish. This raises the question: could this captioning system be biased against certain Spanish dialects? This study examines the potential biases in YouTube's automatic captioning system by analyzing its performance across various Spanish dialects. By comparing the quality of captions for female and male speakers from different regions, we identify systematic disparities which can be attributed to specific dialects. Our study provides further evidence that algorithmic technologies deployed on digital platforms need to be calibrated to the diverse needs and experiences of their user populations.
Chinese Translation
西班牙语是二十一个国家的官方语言,使用人数超过4.41亿。自然,不同国家使用西班牙语的方式存在许多差异。媒体平台如YouTube依赖自动语音识别系统,使其内容能够被不同用户群体访问。然而,YouTube仅提供一种自动生成西班牙语字幕的选项。这引发了一个问题:该字幕系统是否对某些西班牙语方言存在偏见?本研究通过分析YouTube自动字幕系统在不同西班牙语方言中的表现,探讨其潜在偏见。通过比较来自不同地区的女性和男性发言者的字幕质量,我们识别出可以归因于特定方言的系统性差异。我们的研究进一步证明,数字平台上部署的算法技术需要根据用户群体的多样化需求和经验进行校准。
cs.CL / 25 / 2602.24060
Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis
任务复杂性的重要性:对大语言模型在情感分析中推理能力的实证研究
Abstract
Large language models (LLMs) with reasoning capabilities have fueled a compelling narrative that reasoning universally improves performance across language tasks. We test this claim through a comprehensive evaluation of 504 configurations across seven model families--including adaptive, conditional, and reinforcement learning-based reasoning architectures--on sentiment analysis datasets of varying granularity (binary, five-class, and 27-class emotion). Our findings reveal that reasoning effectiveness is strongly task-dependent, challenging prevailing assumptions: (1) Reasoning shows task-complexity dependence--binary classification degrades up to -19.9 F1 percentage points (pp), while 27-class emotion recognition gains up to +16.0pp; (2) Distilled reasoning variants underperform base models by 3-18 pp on simpler tasks, though few-shot prompting enables partial recovery; (3) Few-shot learning improves over zero-shot in most cases regardless of model type, with gains varying by architecture and task complexity; (4) Pareto frontier analysis shows base models dominate efficiency-performance trade-offs, with reasoning justified only for complex emotion recognition despite 2.1x-54x computational overhead. We complement these quantitative findings with qualitative error analysis revealing that reasoning degrades simpler tasks through systematic over-deliberation, offering mechanistic insight beyond the high-level overthinking hypothesis.
Chinese Translation
具备推理能力的大语言模型(LLMs)推动了一个引人注目的论述,即推理普遍提高了语言任务的表现。我们通过对七个模型家族(包括自适应、条件和基于强化学习的推理架构)在不同粒度(二分类、五类和27类情感)的情感分析数据集上进行的504种配置的全面评估来检验这一说法。我们的研究结果表明,推理的有效性与任务密切相关,挑战了现有的假设:(1)推理表现出任务复杂性的依赖性——二分类的F1得分下降最多可达-19.9个百分点,而27类情感识别则提升最多可达+16.0个百分点;(2)在简单任务中,蒸馏的推理变体的表现低于基础模型3-18个百分点,尽管少样本提示可以部分恢复;(3)在大多数情况下,无论模型类型如何,少样本学习优于零样本学习,增益因架构和任务复杂性而异;(4)帕累托前沿分析显示基础模型在效率与性能的权衡中占据主导地位,推理仅在复杂情感识别中才有其合理性,尽管计算开销高达2.1倍至54倍。我们通过定性错误分析补充了这些定量发现,揭示了推理通过系统性的过度思考降低了简单任务的表现,提供了超越高层次过度思考假设的机制性见解。
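The Pareto-frontier analysis the abstract reports is a standard dominance computation over (compute cost, quality) pairs; a minimal sketch, with invented (relative cost, F1) points:

```python
def pareto_frontier(points):
    # points are (compute_cost, f1) pairs; keep configurations not dominated
    # by any other (cheaper-or-equal AND better-or-equal, strictly better in one).
    frontier = []
    for c, f in points:
        dominated = any(
            c2 <= c and f2 >= f and (c2 < c or f2 > f)
            for c2, f2 in points
        )
        if not dominated:
            frontier.append((c, f))
    return sorted(frontier)

# Invented (relative cost, F1) points for illustration only.
front = pareto_frontier([(1.0, 80.0), (2.1, 82.0), (54.0, 81.0), (2.1, 79.0)])
```

Here the 54x-cost configuration is dominated (a cheaper point scores higher), illustrating the paper's finding that heavy reasoning is only justified when it buys accuracy no cheaper configuration reaches.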
cs.CL / 26 / 2602.24082
Preference Packing: Efficient Preference Optimization for Large Language Models
偏好打包:大型语言模型的高效偏好优化
Abstract
Resource-efficient training optimization techniques are becoming increasingly important as the size of large language models (LLMs) continues to grow. In particular, batch packing is commonly used in pre-training and supervised fine-tuning to achieve resource-efficient training. We propose preference packing, a method to enhance resource efficiency in training techniques that use data with different responses for the same input prompt, such as reward models or Direct Preference Optimization (DPO). Preference packing improves resource efficiency by reducing the attention operations for duplicate input prompts and decreasing KV cache memory usage. We conducted experiments on text-only datasets and image-included datasets and achieved at least 37% reduction in training time. Notably, this method can be applied alongside existing optimization techniques such as batch sorting, resulting in a 3.22x speedup.
Chinese Translation
随着大型语言模型(LLMs)规模的不断扩大,资源高效的训练优化技术变得越来越重要。特别是在预训练和监督微调中,批量打包被广泛用于实现资源高效的训练。我们提出了偏好打包(preference packing),这是一种增强训练技术资源效率的方法,适用于对同一输入提示使用不同响应的数据,如奖励模型或直接偏好优化(Direct Preference Optimization, DPO)。偏好打包通过减少重复输入提示的注意力操作和降低KV缓存内存使用来提高资源效率。我们在仅包含文本的数据集和包含图像的数据集上进行了实验,训练时间至少减少了37%。值得注意的是,该方法可以与现有的优化技术(如批量排序)结合使用,从而实现3.22倍的加速。
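The core saving behind preference packing can be illustrated with an attention mask over a packed `[prompt | chosen | rejected]` sequence: the shared prompt is stored once, both responses attend to it causally, and the rejected response never attends to the chosen one. The mask construction below is a sketch of that idea, not the paper's implementation.

```python
def packed_attention_mask(prompt_len, chosen_len, rejected_len):
    # Packed layout: [prompt | chosen response | rejected response].
    # The prompt appears once (saving KV cache and attention FLOPs);
    # cross-response attention is blocked so losses stay independent.
    total = prompt_len + chosen_len + rejected_len
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):  # causal within the packed sequence
            k_in_chosen = prompt_len <= k < prompt_len + chosen_len
            q_in_rejected = q >= prompt_len + chosen_len
            if q_in_rejected and k_in_chosen:
                continue  # rejected tokens must not see chosen tokens
            mask[q][k] = True
    return mask

mask = packed_attention_mask(2, 2, 2)
```

Without packing, the prompt's keys and values would be computed and cached twice, once per response; the mask makes the single shared copy behave identically.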
cs.CL / 27 / 2602.24109
ARGUS: Seeing the Influence of Narrative Features on Persuasion in Argumentative Texts
ARGUS:观察叙事特征对论证文本说服力的影响
Abstract
Can narratives make arguments more persuasive? And to this end, which narrative features matter most? Although stories are often seen as powerful tools for persuasion, their specific role in online, unstructured argumentation remains underexplored. To address this gap, we present ARGUS, a framework for studying the impact of narration on persuasion in argumentative discourse. ARGUS introduces a new ChangeMyView corpus annotated for story presence and six key narrative features, integrating insights from two established theoretical frameworks that capture both textual narrative features and their effects on recipients. Leveraging both encoder-based classifiers and zero-shot large language models (LLMs), ARGUS identifies stories and narrative features and applies them at scale to examine how different narrative dimensions influence persuasion success in online argumentation.
Chinese Translation
叙事能否使论证更具说服力?为此,哪些叙事特征最为重要?尽管故事常被视为强有力的说服工具,但它们在在线非结构化论证中的具体作用仍然未被充分探索。为了解决这一空白,我们提出了ARGUS,一个用于研究叙事对论证话语中说服力影响的框架。ARGUS引入了一个新的ChangeMyView语料库,该语料库对故事存在性和六个关键叙事特征进行了注释,整合了两个已建立理论框架的见解,这些框架捕捉了文本叙事特征及其对接收者的影响。通过利用基于编码器的分类器和零样本大语言模型(LLMs),ARGUS识别故事和叙事特征,并在大规模上应用它们,以考察不同叙事维度如何影响在线论证中的说服成功。
cs.CL / 28 / 2602.24119
Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek
术语稀有性预测低资源古代语言的大型语言模型翻译中的灾难性失败:来自古希腊的证据
Abstract
This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose. We evaluate translations by three commercial LLMs (Claude, Gemini, ChatGPT) of twenty paragraph-length passages from two works by the Greek physician Galen of Pergamum (ca. 129-216 CE): On Mixtures, which has two published English translations, and On the Composition of Drugs according to Kinds, which has never been fully translated into English. We assess translation quality using both standard automated evaluation metrics (BLEU, chrF++, METEOR, ROUGE-L, BERTScore, COMET, BLEURT) and expert human evaluation via a modified Multidimensional Quality Metrics (MQM) framework applied to all 60 translations by a team of domain specialists. On the previously translated expository text, LLMs achieved high translation quality (mean MQM score 95.2/100), with performance approaching expert level. On the untranslated pharmacological text, aggregate quality was lower (79.9/100) but with high variance driven by two passages presenting extreme terminological density; excluding these, scores converged to within 4 points of the translated text. Terminology rarity, operationalized via corpus frequency in the literary Diorisis Ancient Greek Corpus, emerged as a strong predictor of translation failure (r = -.97 for passage-level quality on the untranslated text). Automated metrics showed moderate correlation with human judgment overall on the text with a wide quality spread (Composition), but no metric discriminated among high-quality translations. We discuss implications for the use of LLMs in Classical scholarship and for the design of automated evaluation pipelines for low-resource ancient languages.
Chinese Translation
本研究首次系统性地进行了无参考的人工评估,评估大型语言模型(LLM)在古希腊(AG)技术散文翻译中的表现。我们对三种商业LLM(Claude、Gemini、ChatGPT)翻译的二十段来自希腊医生盖伦(Galen of Pergamum,约公元129-216年)两部作品的段落进行了评估:其一是《混合论》(On Mixtures),已有两种已出版的英文翻译;其二是《按种类配制药物的组成》(On the Composition of Drugs according to Kinds),该作品从未被完整翻译成英文。我们使用标准自动评估指标(BLEU、chrF++、METEOR、ROUGE-L、BERTScore、COMET、BLEURT)和经过修改的多维质量指标(MQM)框架对所有60个翻译进行专家人工评估。对于已翻译的说明性文本,LLM的翻译质量较高(平均MQM得分为95.2/100),接近专家水平。而对于未翻译的药理学文本,整体质量较低(79.9/100),但由于两个段落的术语密度极高,导致质量差异较大;如果排除这两个段落,得分则收敛至与已翻译文本相差4分以内。术语稀有性通过在文学《迪奥里西斯古希腊语语料库》(Diorisis Ancient Greek Corpus)中的语料频率进行操作,成为翻译失败的强预测因子(未翻译文本段落级质量的相关系数r = -.97)。自动评估指标在整体上与人类判断在质量差异较大的文本(Composition)中表现出中等相关性,但没有指标能区分高质量翻译。我们讨论了LLM在古典学研究中的应用及其对低资源古代语言自动评估管道设计的影响。
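The passage-level relationship the abstract reports (r = -.97) is a plain Pearson correlation between terminology rarity and quality scores. A minimal implementation, with invented toy data in place of the paper's measurements:

```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented toy data: higher terminology rarity, lower translation quality.
rarity = [0.1, 0.4, 0.9, 1.5]
quality = [95.0, 88.0, 70.0, 40.0]
r = pearson_r(rarity, quality)
```

In the paper, rarity is operationalized as corpus frequency in the Diorisis corpus; the coefficient itself is computed exactly as above.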
cs.CL / 29 / 2602.24142
CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning
CoME:赋能移动专家通道的混合能力推理
Abstract
Mobile Agents can autonomously execute user instructions, which requires hybrid-capabilities reasoning, including screen summary, subtask planning, action decision and action function. However, existing agents struggle to achieve both decoupled enhancement and balanced integration of these capabilities. To address these challenges, we propose Channel-of-Mobile-Experts (CoME), a novel agent architecture consisting of four distinct experts, each aligned with a specific reasoning stage. CoME activates the corresponding expert to generate output tokens in each reasoning stage via output-oriented activation. To empower CoME with hybrid-capabilities reasoning, we introduce a progressive training strategy: Expert-FT enables decoupling and enhancement of different experts' capabilities; Router-FT aligns expert activation with the different reasoning stages; CoT-FT facilitates seamless collaboration and balanced optimization across multiple capabilities. To mitigate error propagation in hybrid-capabilities reasoning, we propose InfoGain-Driven DPO (Info-DPO), which uses information gain to evaluate the contribution of each intermediate step, thereby guiding CoME toward more informative reasoning. Comprehensive experiments show that CoME outperforms dense mobile agents and MoE methods on both AITZ and AMEX datasets.
Chinese Translation
移动代理能够自主执行用户指令,这需要混合能力推理,包括屏幕摘要、子任务规划、行动决策和行动功能。然而,现有代理在实现这些能力的解耦增强和均衡整合方面面临困难。为了解决这些挑战,我们提出了移动专家通道(Channel-of-Mobile-Experts, CoME),这是一种新颖的代理架构,由四个不同的专家组成,每个专家与特定的推理阶段相对应。CoME通过面向输出的激活,在每个推理阶段激活相应的专家以生成输出标记。为了赋能CoME进行混合能力推理,我们引入了一种渐进式训练策略:Expert-FT使不同专家的能力得以解耦和增强;Router-FT将专家激活与不同的推理阶段对齐;CoT-FT促进多种能力之间的无缝协作和均衡优化。为减少混合能力推理中的错误传播,我们提出了信息增益驱动的直接偏好优化(InfoGain-Driven DPO, Info-DPO),该方法利用信息增益评估每个中间步骤的贡献,从而引导CoME朝向更具信息量的推理。全面的实验表明,CoME在AITZ和AMEX数据集上优于密集移动代理和MoE方法。
cs.CL / 30 / 2602.24172
ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models
ArgLLM-App:一个基于大语言模型的互动性论证推理系统
Abstract
Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by humans. Here we propose a web-based system implementing ArgLLM-empowered agents for binary tasks. ArgLLM-App supports visualisation of the produced explanations and interaction with human users, allowing them to identify and contest any mistakes in the system's reasoning. It is highly modular and enables drawing information from trusted external sources. ArgLLM-App is publicly available at https://argllm.app, with a video demonstration at https://youtu.be/vzwlGOr0sPM.
Chinese Translation
论证性大语言模型(ArgLLMs)是一种利用大语言模型(LLMs)和计算论证进行决策的现有方法,旨在使所产生的决策能够被人类忠实地解释和质疑。在此,我们提出了一个基于网络的系统,实施了由ArgLLM赋能的代理,用于二元任务。ArgLLM-App支持对生成的解释进行可视化,并与人类用户进行互动,使他们能够识别和质疑系统推理中的任何错误。该系统高度模块化,并能够从可信的外部来源提取信息。ArgLLM-App已在https://argllm.app上公开发布,视频演示可在https://youtu.be/vzwlGOr0sPM观看。
cs.CL / 31 / 2602.24174
Task-Centric Acceleration of Small-Language Models
以任务为中心的小语言模型加速
Abstract
Small language models (SLMs) have emerged as efficient alternatives to large language models for task-specific applications. However, they are often employed in high-volume, low-latency settings, where efficiency is crucial. We propose TASC, Task-Adaptive Sequence Compression, a framework for SLM acceleration comprising two use-cases: When performing SLM fine-tuning, we propose TASC-ft, which iteratively enriches the tokenizer vocabulary with high-frequency output n-grams and then fine-tunes the model to utilize the expanded vocabulary. Next, we propose an inference-time method, termed TASC-spec. TASC-spec is a lightweight, training-free speculative decoding method that constructs an n-gram draft model from the task's output corpus, mixing task and context n-gram information. TASC-spec avoids any additional training, while bypassing draft-target vocabulary alignment constraints. We demonstrate the effectiveness of both methods across multiple low output-variability generation tasks. Our methods show consistent improvements in inference efficiency while maintaining task performance.
Chinese Translation
小语言模型(SLMs)已成为任务特定应用中大型语言模型的高效替代方案。然而,它们通常在高流量、低延迟的环境中使用,在这些环境中,效率至关重要。我们提出了TASC(任务自适应序列压缩),这是一个用于SLM加速的框架,包含两个使用案例:在进行SLM微调时,我们提出了TASC-ft,它通过高频输出n-gram迭代丰富分词器词汇,然后微调模型以利用扩展的词汇。接下来,我们提出了一种推理时的方法,称为TASC-spec。TASC-spec是一种轻量级、无训练的推测解码方法,它从任务的输出语料库构建n-gram草稿模型,混合任务和上下文的n-gram信息。TASC-spec避免了任何额外的训练,同时绕过了草稿-目标词汇对齐的约束。我们在多个低输出变异性生成任务中展示了这两种方法的有效性。我们的方法在保持任务性能的同时,显示出推理效率的一致提升。
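The n-gram draft model behind a TASC-spec-style method can be sketched as follows. The bigram table, greedy proposal rule, and toy corpus are illustrative assumptions; in a real speculative decoder the proposed tokens would then be verified in one forward pass of the target SLM.

```python
from collections import defaultdict

def build_ngram_draft(corpus_token_lists, n=2):
    # Count next-token continuations of each (n-1)-gram in the task's outputs.
    table = defaultdict(lambda: defaultdict(int))
    for tokens in corpus_token_lists:
        for i in range(len(tokens) - n + 1):
            ctx = tuple(tokens[i:i + n - 1])
            table[ctx][tokens[i + n - 1]] += 1
    return table

def propose_draft(table, context, max_len=3, n=2):
    # Greedily extend the context with the most frequent continuation; the
    # target model would accept a prefix of this draft, recovering correctness.
    draft, ctx = [], tuple(context[-(n - 1):])
    for _ in range(max_len):
        if ctx not in table:
            break
        nxt = max(table[ctx], key=table[ctx].get)
        draft.append(nxt)
        ctx = tuple((list(ctx) + [nxt])[-(n - 1):])
    return draft

# Toy low output-variability task corpus.
table = build_ngram_draft([["the", "answer", "is", "yes"],
                           ["the", "answer", "is", "no"],
                           ["the", "answer", "is", "yes"]])
draft = propose_draft(table, ["the"])
```

Because the draft is a frequency table over the target model's own vocabulary, no draft-target vocabulary alignment is needed, which matches the property the abstract highlights.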
cs.CL / 32 / 2602.24188
MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games
MT-PingEval:评估多回合协作中的私密信息游戏
Abstract
We present a scalable methodology for evaluating language models in multi-turn interactions, using a suite of collaborative games that require effective communication about private information. This enables an interactive scaling analysis, in which a fixed token budget is divided over a variable number of turns. We find that in many cases, language models are unable to use interactive collaboration to improve over the non-interactive baseline scenario in which one agent attempts to summarize its information and the other agent immediately acts -- despite substantial headroom. This suggests that state-of-the-art models still suffer from significant weaknesses in planning and executing multi-turn collaborative conversations. We analyze the linguistic features of these dialogues, assessing the roles of sycophancy, information density, and discourse coherence. While there is no single linguistic explanation for the collaborative weaknesses of contemporary language models, we note that humans achieve comparable task success at superior token efficiency by producing dialogues that are more coherent than those produced by most language models. The proactive management of private information is a defining feature of real-world communication, and we hope that MT-PingEval will drive further work towards improving this capability.
Chinese Translation
我们提出了一种可扩展的方法论,用于评估语言模型在多回合交互中的表现,采用一系列需要有效沟通私密信息的协作游戏。这使得我们能够进行交互式的规模分析,其中固定的令牌预算在可变的回合数中进行分配。我们发现,在许多情况下,语言模型无法利用交互式协作在非交互基线场景中取得改进,在该场景中,一个代理试图总结其信息,而另一个代理立即采取行动——尽管存在显著的提升空间。这表明,最先进的模型在规划和执行多回合协作对话方面仍然存在显著的弱点。我们分析了这些对话的语言特征,评估了谄媚、信息密度和话语连贯性的角色。尽管没有单一的语言解释可以解释当代语言模型的协作弱点,但我们注意到,人类通过产生比大多数语言模型更连贯的对话,在更高的令牌效率下实现了可比的任务成功。私密信息的主动管理是现实世界沟通的一个定义特征,我们希望MT-PingEval能够推动进一步的工作,以改善这一能力。
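The interactive scaling analysis above holds the total token budget fixed while varying the number of turns it is spread over. A minimal sketch of that budget split, with illustrative numbers (the benchmark's actual budgets and turn counts are not stated in the abstract):

```python
# Sketch of the budget split behind the interactive scaling analysis
# (illustrative settings; not the benchmark's actual configuration).

def turn_budgets(total_tokens, num_turns):
    """Divide a fixed token budget as evenly as possible over turns."""
    base, rem = divmod(total_tokens, num_turns)
    return [base + (1 if i < rem else 0) for i in range(num_turns)]

# One total budget compared across interaction granularities: a single
# turn mirrors the "summarize, then act" non-interactive baseline, while
# more turns allow back-and-forth about each agent's private information.
for turns in (1, 2, 4, 8):
    budgets = turn_budgets(1024, turns)
    assert sum(budgets) == 1024  # same total spend at every granularity
```

Holding the total constant is what makes the comparison fair: any gain from extra turns must come from better collaboration, not from extra tokens.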
cs.CL / 33 / 2602.24210
Controllable Reasoning Models Are Private Thinkers
可控推理模型是私密思考者
Abstract
AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result in the unintended leakage of private information to external parties. We propose training models to follow instructions not only in the final answer, but also in reasoning traces, potentially under different constraints. We hypothesize that improving their instruction following abilities in the reasoning traces can improve their privacy-preservation skills. To demonstrate this, we fine-tune models on a new instruction-following dataset with explicit restrictions on reasoning traces. We further introduce a generation strategy that decouples reasoning and answer generation using separate LoRA adapters. We evaluate our approach on six models from two model families, ranging from 1.7B to 14B parameters, across two instruction-following benchmarks and two privacy benchmarks. Our method yields substantial improvements, achieving gains of up to 20.9 points in instruction-following performance and up to 51.9 percentage points on privacy benchmarks. These improvements, however, can come at the cost of task utility, due to the trade-off between reasoning performance and instruction-following abilities. Overall, our results show that improving instruction-following behavior in reasoning models can significantly enhance privacy, suggesting a promising direction for the development of future privacy-aware agents. Our code and data are available at https://github.com/UKPLab/arxiv2026-controllable-reasoning-models.
Chinese Translation
由推理模型驱动的人工智能代理需要访问敏感用户数据。然而,它们的推理痕迹难以控制,这可能导致私密信息意外泄露给外部方。我们提出训练模型不仅在最终答案中遵循指令,还在推理痕迹中遵循指令,可能在不同的约束条件下进行。我们假设提高它们在推理痕迹中遵循指令的能力可以改善它们的隐私保护技能。为了证明这一点,我们在一个新的遵循指令的数据集上对模型进行微调,该数据集对推理痕迹有明确的限制。我们进一步引入了一种生成策略,使用独立的 LoRA 适配器将推理与答案生成解耦。我们在来自两个模型家族的六个模型上评估我们的方法,参数范围从 1.7B 到 14B,涵盖两个遵循指令的基准和两个隐私基准。我们的方法带来了显著的改进,在遵循指令的表现上提高了最多 20.9 分,在隐私基准上提高了最多 51.9 个百分点。然而,这些改进可能会以任务效用为代价,因为推理性能与遵循指令能力之间存在权衡。总体而言,我们的结果表明,提高推理模型中的遵循指令行为可以显著增强隐私,暗示了未来隐私意识代理发展的一个有前景的方向。我们的代码和数据可在 https://github.com/UKPLab/arxiv2026-controllable-reasoning-models 获取。
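The decoupled generation strategy can be sketched as two passes over a shared base model, each with a different LoRA adapter active. The stub below shows only the control flow; in a real implementation `generate_with_adapter` would activate the named adapter (e.g. via PEFT's `model.set_adapter(name)`) and decode, and the adapter names and outputs here are purely illustrative:

```python
# Control-flow sketch of decoupled reasoning/answer generation with two
# LoRA adapters (stubbed model calls; adapter names are assumptions).

def generate_with_adapter(adapter, prompt):
    # Stub standing in for: activate the named LoRA adapter on a shared
    # base model, then run generation. Outputs here are canned examples.
    if adapter == "reasoning":
        return "<think>omit the user's address per the privacy instruction</think>"
    return "Final answer: I can confirm the request without sharing the address."

def decoupled_generate(prompt):
    """Reason with one adapter, then answer with another, so the trace
    can follow its own constraints (e.g. privacy restrictions)."""
    trace = generate_with_adapter("reasoning", prompt)
    answer = generate_with_adapter("answer", prompt + "\n" + trace)
    return trace, answer

trace, answer = decoupled_generate("Can you forward Alice's details?")
# Only `answer` is exposed to external parties; `trace` stays private.
```

Separating the two passes lets each adapter specialize: the reasoning adapter learns to honor trace-level restrictions without degrading the answer adapter's task behavior, which is where the utility trade-off noted above arises.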
cs.CL / 34 / 2602.24287
Do LLMs Benefit From Their Own Words?
大型语言模型是否从自身的回答中获益?
Abstract
Multi-turn interactions with large language models typically retain the assistant's own past responses in the conversation history. In this work, we revisit this design choice by asking whether large language models benefit from conditioning on their own prior responses. Using in-the-wild, multi-turn conversations, we compare standard (full-context) prompting with a user-turn-only prompting approach that omits all previous assistant responses, across three open reasoning models and one state-of-the-art model. To our surprise, we find that removing prior assistant responses does not affect response quality on a large fraction of turns. Omitting assistant-side history can reduce cumulative context lengths by up to 10x. To explain this result, we find that multi-turn conversations consist of a substantial proportion (36.4%) of self-contained prompts, and that many follow-up prompts provide sufficient instruction to be answered using only the current user turn and prior user turns. When analyzing cases where user-turn-only prompting substantially outperforms full context, we identify instances of context pollution, in which models over-condition on their previous responses, introducing errors, hallucinations, or stylistic artifacts that propagate across turns. Motivated by these findings, we design a context-filtering approach that selectively omits assistant-side context. Our findings suggest that selectively omitting assistant history can improve response quality while reducing memory consumption.
Chinese Translation
与大型语言模型的多轮交互通常会在对话历史中保留助手的过去回答。在本研究中,我们重新审视这一设计选择,探讨大型语言模型是否从对自身先前回答的条件化中获益。通过使用真实场景中的多轮对话,我们比较了标准(全上下文)提示与仅用户轮次提示的方法,后者省略了所有之前的助手回答,涵盖了三个开放推理模型和一个最先进的模型。令人惊讶的是,我们发现去除先前助手回答并未对大量轮次的回答质量产生影响。省略助手侧历史可以将累积上下文长度减少多达10倍。为了解释这一结果,我们发现多轮对话中有相当一部分(36.4%)是自包含的提示,许多后续提示提供了足够的指示,仅使用当前用户轮次和之前的用户轮次即可回答。当分析仅用户轮次提示显著优于全上下文的情况时,我们识别出上下文污染的实例,其中模型过度依赖其先前的回答,导致错误、幻觉或风格性伪影在轮次间传播。基于这些发现,我们设计了一种上下文过滤方法,选择性地省略助手侧上下文。我们的研究结果表明,选择性省略助手历史可以提高回答质量,同时减少内存消耗。
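User-turn-only prompting amounts to pruning all prior assistant messages from the history before each model call. A minimal sketch, assuming the common OpenAI-style role/content message schema (an assumption; the paper's exact message format is not stated):

```python
# Sketch of user-turn-only prompting: drop all prior assistant responses
# from the conversation history before the next model call.

def user_turn_only(history):
    """Keep system and user turns; omit assistant-side context."""
    return [m for m in history if m["role"] != "assistant"]

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the report."},
    {"role": "assistant", "content": "The report covers Q3 revenue..."},
    {"role": "user", "content": "Now list its three key risks."},
]

pruned = user_turn_only(history)
# The pruned history keeps both user turns, so follow-ups that restate
# enough instruction remain answerable; cumulative context shrinks
# because assistant responses (often the longest turns) never accumulate.
```

A context-filtering variant, as proposed in the abstract, would drop assistant turns selectively (e.g. only stale or low-value ones) instead of unconditionally.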