cs.RO / 1 / 2603.08810
Age-Related Differences in the Perception of Eye-Gaze from a Social Robot
Abstract
There is increasing interest in social robots that assist older adults during daily life tasks. In this context, non-verbal cues such as deictic gaze are important for natural communication in human-robot interaction. However, sensitivity to deictic gaze declines naturally with age, resulting in reduced social perception. Therefore, this work explores the benefits of deictic gaze from social robots assisting older adults during daily life tasks, and how age-related differences may influence their social perception in contrast to younger populations. This may help inform the design of age-adaptive non-verbal cues in the Human-Robot Interaction context.
cs.RO / 2 / 2603.08814
Scale-Plan: Scalable Language-Enabled Task Planning for Heterogeneous Multi-Robot Teams
Abstract
Long-horizon task planning for heterogeneous multi-robot systems is essential for deploying collaborative teams in real-world environments; yet, it remains challenging due to the large volume of perceptual information, much of which is irrelevant to task objectives and burdens planning. Traditional symbolic planners rely on manually constructed problem specifications, limiting scalability and adaptability, while recent large language model (LLM)-based approaches often suffer from hallucinations and weak grounding, i.e., poor alignment between generated plans and actual environmental objects and constraints, in object-rich settings. We present Scale-Plan, a scalable LLM-assisted framework that generates compact, task-relevant problem representations from natural language instructions. Given a PDDL domain specification, Scale-Plan constructs an action graph capturing domain structure and uses shallow LLM reasoning to guide a structured graph search that identifies a minimal subset of relevant actions and objects. By filtering irrelevant information prior to planning, Scale-Plan enables efficient decomposition, allocation, and long-horizon plan generation. We evaluate our approach on complex multi-agent tasks and introduce MAT2-THOR, a cleaned benchmark built on AI2-THOR for reliable evaluation of multi-robot planning systems. Scale-Plan outperforms pure LLM and hybrid LLM-PDDL baselines across all metrics, improving scalability and reliability.
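The action-and-object filtering step can be sketched as backward chaining over a PDDL-style action graph: starting from the goal predicates, keep only actions whose effects can contribute, recursing into their preconditions. The toy domain and predicate names below are illustrative assumptions, not Scale-Plan's implementation:

```python
# Toy PDDL-style domain: action name -> (preconditions, effects) over predicates.
ACTIONS = {
    "pick":  ({"reachable"}, {"holding"}),
    "place": ({"holding"},   {"placed"}),
    "open":  ({"closed"},    {"opened"}),   # irrelevant to the goal below
    "toast": ({"placed"},    {"toasted"}),
}

def relevant_actions(goal_preds):
    """Backward-chain from goal predicates; return the minimal action subset."""
    needed, kept, frontier = set(goal_preds), set(), set(goal_preds)
    while frontier:
        pred = frontier.pop()
        for name, (pre, eff) in ACTIONS.items():
            if pred in eff and name not in kept:
                kept.add(name)
                new = pre - needed          # recurse into unmet preconditions
                needed |= new
                frontier |= new
    return kept

kept = relevant_actions({"toasted"})        # "open" is filtered out
```

A symbolic planner then only sees the pruned action set, shrinking the grounded problem before search.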
cs.RO / 3 / 2603.08817
HMR-1: Hierarchical Massage Robot with Vision-Language-Model for Embodied Healthcare
Abstract
The rapid advancement of Embodied Intelligence has opened transformative opportunities in healthcare, particularly in physical therapy and rehabilitation. However, critical challenges remain in developing robust embodied healthcare solutions, such as the lack of standardized evaluation benchmarks and the scarcity of open-source multimodal acupoint massage datasets. To address these gaps, we construct MedMassage-12K, a multimodal dataset containing 12,190 images with 174,177 QA pairs, covering diverse lighting conditions and backgrounds. Furthermore, we propose a hierarchical embodied massage framework, which includes a high-level acupoint grounding module and a low-level control module. The high-level acupoint grounding module uses multimodal large language models to understand human language and identify acupoint locations, while the low-level control module provides the planned trajectory. Based on this, we evaluate existing MLLMs and establish a benchmark for embodied massage tasks. Additionally, we fine-tune the Qwen-VL model, demonstrating the framework's effectiveness. Physical experiments further confirm the practical applicability of the framework. Our dataset and code are publicly available at https://github.com/Xiaofeng-Han-Res/HMR-1.
cs.RO / 4 / 2603.08821
Impact of Different Failures on a Robot's Perceived Reliability
Abstract
Robots fail, potentially leading to a loss in the robot's perceived reliability (PR), a measure correlated with trustworthiness. In this study we examine how various kinds of failures affect the PR of the robot differently, and how this measure recovers without explicit social repair actions by the robot. In a preregistered and controlled online video study, participants were asked to predict a robot's success in a pick-and-place task. We examined manipulation failures (slips), freezing (lapses), and three types of incorrect picked objects or place goals (mistakes). Participants were shown one of 11 videos -- one of five types of failure, one of five types of failure followed by a successful execution in the same video, or a successful execution video. This was followed by two additional successful execution videos. Participants bet money either on the robot or on a coin toss after each video. People's betting patterns along with a qualitative analysis of their survey responses highlight that mistakes are less damaging to PR than slips or lapses, and some mistakes are even perceived as successes. We also see that successes immediately following a failure have the same effect on PR as successes without a preceding failure. Finally, we show that successful executions recover PR after a failure. Our findings highlight which robot failures are in higher need of repair in a human-robot interaction, and how trust could be recovered by robot successes.
cs.RO / 5 / 2603.08831
Predictive Control with Indirect Adaptive Laws for Payload Transportation by Quadrupedal Robots
Abstract
This paper formally develops a novel hierarchical planning and control framework for robust payload transportation by quadrupedal robots, integrating a model predictive control (MPC) algorithm with a gradient-descent-based adaptive updating law. At the framework's high level, an indirect adaptive law estimates the unknown parameters of the reduced-order (template) locomotion model under varying payloads. These estimated parameters feed into an MPC algorithm for real-time trajectory planning, incorporating a convex stability criterion within the MPC constraints to ensure the stability of the template model's estimation error. The optimal reduced-order trajectories generated by the high-level adaptive MPC (AMPC) are then passed to a low-level nonlinear whole-body controller (WBC) for tracking. Extensive numerical investigations validate the framework's capabilities, showcasing the robot's proficiency in transporting unmodeled, unknown static payloads of up to 109% of its mass in experiments on flat terrain and 91% on rough terrain. The robot also successfully manages dynamic payloads of 73% of its mass on rough terrain. Performance comparisons with a normal MPC and an L1 MPC indicate a significant improvement. Furthermore, comprehensive hardware experiments conducted in indoor and outdoor environments confirm the method's efficacy on rough terrains despite uncertainties such as payload variations, push disturbances, and obstacles.
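As a rough illustration of the high-level idea, a gradient-descent indirect adaptive law for a linear-in-parameters template model can be sketched as follows. The regressor, gain, and toy payload parameter below are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def adaptive_update(theta_hat, phi, y, gain=0.5):
    """One gradient step on the prediction error e = phi @ theta_hat - y
    (indirect adaptive law for a linear-in-parameters model y = phi @ theta)."""
    e = phi @ theta_hat - y
    return theta_hat - gain * phi.T @ e

theta_true = np.array([2.0])   # e.g. effective mass with payload (assumed)
theta_hat = np.array([1.0])    # initial guess: nominal robot mass
rng = np.random.default_rng(1)
for _ in range(200):
    phi = rng.uniform(0.5, 1.5, size=(1, 1))   # regressor sample
    y = phi @ theta_true                        # measured model output
    theta_hat = adaptive_update(theta_hat, phi, y)
```

In the paper's architecture, the converged estimate would be handed to the MPC layer so the template model used for trajectory planning tracks the true loaded dynamics.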
cs.RO / 6 / 2603.08860
SEP-NMPC: Safety Enhanced Passivity-Based Nonlinear Model Predictive Control for a UAV Slung Payload System
Abstract
Model Predictive Control (MPC) is widely adopted for agile multirotor vehicles, yet achieving both stability and obstacle-free flight is particularly challenging when a payload is suspended beneath the airframe. This paper introduces a Safety Enhanced Passivity-Based Nonlinear MPC (SEP-NMPC) that provides formal guarantees of stability and safety for a quadrotor transporting a slung payload through cluttered environments. Stability is enforced by embedding a strict passivity inequality, which is derived from a shaped energy storage function with adaptive damping, directly into the NMPC. This formulation dissipates excess energy and ensures asymptotic convergence despite payload swings. Safety is guaranteed through high-order control barrier functions (HOCBFs) that render user-defined clearance sets forward-invariant, obliging both the quadrotor and the swinging payload to maintain separation while interacting with static and dynamic obstacles. The optimization remains quadratic-program compatible and is solved online at each sampling time without gain scheduling or heuristic switching. Extensive simulations and real-world experiments confirm stable payload transport, collision-free trajectories, and real-time feasibility across all tested scenarios. The SEP-NMPC framework therefore unifies passivity-based closed-loop stability with HOCBF-based safety guarantees for UAV slung-payload transportation.
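For intuition, a relative-degree-2 (high-order) CBF condition admits a closed-form safety filter in the scalar double-integrator case. The gains, dynamics, and clearance set below are illustrative assumptions, not the SEP-NMPC formulation:

```python
# Scalar sketch of a high-order CBF filter: double integrator p'' = u with
# safety set h(p) = p - p_min >= 0. The HOCBF condition
#   u >= -(k1 + k2) * v - k1 * k2 * h
# gives a one-constraint "QP" solvable in closed form: minimally raise the
# desired acceleration to the bound when it would violate safety.

def hocbf_filter(p, v, u_des, p_min=0.0, k1=2.0, k2=2.0):
    h = p - p_min
    u_lb = -(k1 + k2) * v - k1 * k2 * h   # lower bound from the HOCBF
    return max(u_des, u_lb)

# drive toward the boundary with u_des = -1 and check h stays nonnegative
p, v, dt, hmin = 2.0, 0.0, 0.01, 2.0
for _ in range(2000):
    u = hocbf_filter(p, v, u_des=-1.0)
    v += dt * u
    p += dt * v
    hmin = min(hmin, p)
```

The full multi-constraint version stays quadratic-program compatible, which is what lets the paper solve it online at each sampling time.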
cs.RO / 7 / 2603.08862
APPLV: Adaptive Planner Parameter Learning from Vision-Language-Action Model
Abstract
Autonomous navigation in highly constrained environments remains challenging for mobile robots. Classical navigation approaches offer safety assurances but require environment-specific parameter tuning; end-to-end learning bypasses parameter tuning but struggles with precise control in constrained spaces. To this end, recent robot learning approaches automate parameter tuning while retaining classical systems' safety, yet still face challenges in generalizing to unseen environments. Recently, Vision-Language-Action (VLA) models have shown promise by leveraging foundation models' scene understanding capabilities, but still struggle with precise control and inference latency in navigation tasks. In this paper, we propose Adaptive Planner Parameter Learning from Vision-Language-Action Model (APPLV). Unlike traditional VLA models that directly output actions, APPLV leverages pre-trained vision-language models with a regression head to predict planner parameters that configure classical planners. We develop two training strategies: supervised learning fine-tuning from collected navigation trajectories and reinforcement learning fine-tuning to further optimize navigation performance. We evaluate APPLV across multiple motion planners on the simulated Benchmark Autonomous Robot Navigation (BARN) dataset and in physical robot experiments. Results demonstrate that APPLV outperforms existing methods in both navigation performance and generalization to unseen environments.
cs.RO / 8 / 2603.08863
Adaptive SINDy: Residual Force System Identification Based UAV Disturbance Rejection
Abstract
The stability and control of Unmanned Aerial Vehicles (UAVs) in a turbulent environment is a matter of great concern. Devising a robust control algorithm to reject disturbances is challenging due to the highly nonlinear nature of wind dynamics, and modeling the dynamics using analytical techniques is not straightforward. While traditional techniques using disturbance observers and classical adaptive control have shown some progress, they are mostly limited to relatively non-complex environments. On the other hand, learning based approaches are increasingly being used for modeling of residual forces and disturbance rejection; however, their generalization and interpretability are a concern. To this end, we propose a novel integration of data-driven system identification using Sparse Identification of Non-Linear Dynamics (SINDy) with a Recursive Least Square (RLS) adaptive control to adapt and reject wind disturbances in a turbulent environment. We tested and validated our approach on the Gazebo Harmonic environment and on real flights with wind speeds of up to 2 m/s from four directions, creating a highly dynamic and turbulent environment. Adaptive SINDy outperformed the baseline PID and INDI controllers on several trajectory tracking error metrics without crashing. A root mean square error (RMSE) of up to 12.2 cm and 17.6 cm, and a mean absolute error (MAE) of 13.7 cm and 10.5 cm were achieved on circular and lemniscate trajectories, respectively. The validation was performed on a very lightweight Crazyflie drone under a highly dynamic environment for complex trajectory tracking.
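The core of SINDy is sequentially thresholded least squares over a candidate function library. The sketch below recovers a toy 1D system; the library, threshold, and synthetic data are illustrative, not the paper's residual-force model:

```python
import numpy as np

def sindy_stlsq(Theta, dXdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares (STLSQ), the sparse-regression
    solver at the heart of SINDy: fit, zero small coefficients, refit."""
    Xi, *_ = np.linalg.lstsq(Theta, dXdt, rcond=None)
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold        # candidate terms to prune
        Xi[small] = 0.0
        for k in range(dXdt.shape[1]):        # refit each state dimension
            big = ~small[:, k]
            if big.any():
                Xi[big, k], *_ = np.linalg.lstsq(Theta[:, big], dXdt[:, k],
                                                 rcond=None)
    return Xi

# toy example: recover dx/dt = -2x from lightly noisy samples
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
dxdt = -2.0 * x + 0.001 * rng.standard_normal((200, 1))
Theta = np.hstack([np.ones_like(x), x, x**2, x**3])   # library [1, x, x^2, x^3]
Xi = sindy_stlsq(Theta, dxdt, threshold=0.5)
```

The sparse coefficient vector keeps only the active library terms, which is what makes the identified residual-force model interpretable.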
cs.RO / 9 / 2603.08905
Proprioceptive Safe Active Navigation and Exploration for Planetary Environments
Abstract
Deformable granular terrains introduce significant locomotion and immobilization risks in planetary exploration and are difficult to detect via remote sensing (e.g., vision). Legged robots can sense terrain properties through leg-terrain interactions during locomotion, offering a direct means to assess traversability in deformable environments. How to systematically exploit this interaction-derived information for navigation planning, however, remains underexplored. We address this gap by presenting PSANE, a Proprioceptive Safe Active Navigation and Exploration framework that leverages leg-terrain interaction measurements for safe navigation and exploration in unknown deformable environments. PSANE learns a traversability model via Gaussian Process regression to estimate and certify safe regions and identify exploration frontiers online, and integrates these estimates with a reactive controller for real-time navigation. Frontier selection is formulated as a multi-objective optimization that balances safe-set expansion probability and goal-directed cost, with subgoals selected via scalarization over the Pareto-optimal frontier set. PSANE safely explores unknown granular terrain and reaches specified goals using only proprioceptively estimated traversability, while achieving performance improvements over baseline methods.
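The frontier-selection step can be sketched as Pareto filtering over (expansion probability, goal cost) pairs followed by weighted scalarization. The candidate numbers and weight below are illustrative assumptions, not PSANE's estimates:

```python
import numpy as np

def pareto_front(cands):
    """cands: list of (expansion_prob, goal_cost). Keep non-dominated pairs
    (maximize probability, minimize cost)."""
    front = []
    for i, (p_i, c_i) in enumerate(cands):
        dominated = any(p_j >= p_i and c_j <= c_i and (p_j, c_j) != (p_i, c_i)
                        for j, (p_j, c_j) in enumerate(cands) if j != i)
        if not dominated:
            front.append((p_i, c_i))
    return front

def select_subgoal(front, w):
    """Scalarize: maximize w * prob - (1 - w) * normalized cost."""
    costs = np.array([c for _, c in front])
    c_norm = (costs - costs.min()) / (np.ptp(costs) + 1e-9)
    scores = [w * p - (1 - w) * cn for (p, _), cn in zip(front, c_norm)]
    return front[int(np.argmax(scores))]

cands = [(0.9, 8.0), (0.6, 3.0), (0.2, 1.0), (0.5, 5.0), (0.8, 9.0)]
front = pareto_front(cands)          # drops the two dominated candidates
best = select_subgoal(front, w=0.8)  # weight favors safe-set expansion
```

Shifting the weight w trades exploration (expanding the certified-safe set) against goal progress, matching the multi-objective formulation in the abstract.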
cs.RO / 10 / 2603.08926
Fly, Track, Land: Infrastructure-less Magnetic Localization for Heterogeneous UAV-UGV Teaming
Abstract
We present a complete infrastructure-less magneto-inductive (MI) localization system enabling a lightweight UAV to autonomously hover, track, and land with centimeter precision on a mobile quadruped robot acting as a dynamic docking pad. This work advances the vision of heterogeneous robot collaboration, where ultra-lightweight flying robots serve as mobile perception agents for ground-based Unmanned Ground Vehicles (UGVs). By extending the sensing horizon and providing complementary viewpoints, the UAVs enhance exploration efficiency and improve the quality of data collection in large-scale, unknown environments. The proposed system aims to complement traditional localization modalities with a compact, embedded, and infrastructure-less magnetic sensing approach, providing accurate short-range relative positioning to bridge the gap between coarse navigation and precise UAV docking. A single lightweight receive coil and a fully embedded estimation pipeline on the UAV deliver 20 Hz relative pose estimates in the UGV's frame, achieving a 3D position root-mean-square error (RMSE) of 5 cm. The system uses real-time estimation and a warm-started solver to estimate the 3D position, which is then fused with inertial and optical-flow measurements in the onboard extended Kalman filter. Real-world experiments validate the effectiveness of the framework, demonstrating significant improvements in UAV-UGV teaming in infrastructure-less scenarios compared to state-of-the-art methods, requiring no external anchors or global positioning. In dynamic scenarios, the UAV tracks and docks with a moving UGV while maintaining a 7.2 cm RMSE and achieving successful autonomous landings.
cs.RO / 11 / 2603.08958
Formation-Aware Adaptive Conformalized Perception for Safe Leader-Follower Multi-Robot Systems
Abstract
This paper considers the perception safety problem in distributed vision-based leader-follower formations, where each robot uses onboard perception to estimate relative states, track desired setpoints, and keep the leader within its camera field of view (FOV). Safety is challenging due to heteroscedastic perception errors and the coupling between formation maneuvers and visibility constraints. We propose a distributed, formation-aware adaptive conformal prediction method based on Risk-Aware Mondrian CP to produce formation-conditioned uncertainty quantiles. The resulting bounds tighten in high-risk configurations (near FOV limits) and relax in safer regions. We integrate these bounds into a Formation-Aware Conformal CBF-QP with a smooth margin to enforce visibility while maintaining feasibility and tracking performance. Gazebo simulations show improved formation success rates and tracking accuracy over non-adaptive (global) CP baselines that ignore formation-dependent visibility risk, while preserving finite-sample probabilistic safety guarantees. The experimental videos are available on the project website: https://nail-uh.github.io/iros2026.github.io/.
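The Mondrian (group-conditional) conformal step can be sketched as a per-bin finite-sample-valid quantile over calibration residuals. The bin labels and synthetic error distributions below are assumptions, not the paper's risk partition:

```python
import numpy as np

def mondrian_quantiles(residuals, groups, alpha=0.1):
    """Split-conformal quantile per group: rank ceil((n+1)(1-alpha)) within
    each bin gives a finite-sample-valid (1 - alpha) bound for that bin."""
    qs = {}
    for g in set(groups):
        r = np.sort(residuals[groups == g])
        n = len(r)
        k = int(np.ceil((n + 1) * (1 - alpha)))
        qs[g] = r[min(k, n) - 1]
    return qs

# synthetic calibration data: larger perception errors near the FOV edge
rng = np.random.default_rng(2)
near = np.abs(rng.normal(0.0, 2.0, 500))
center = np.abs(rng.normal(0.0, 0.5, 500))
residuals = np.concatenate([near, center])
groups = np.array(["near_fov"] * 500 + ["center"] * 500)
q = mondrian_quantiles(residuals, groups, alpha=0.1)
```

Binning by formation-dependent risk is what lets the bound tighten near FOV limits and relax elsewhere, instead of one global quantile covering both regimes.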
cs.RO / 12 / 2603.08961
FAME: Force-Adaptive RL for Expanding the Manipulation Envelope of a Full-Scale Humanoid
Abstract
Maintaining balance under external hand forces is critical for humanoid bimanual manipulation, where interaction forces propagate through the kinematic chain and constrain the feasible manipulation envelope. We propose FAME, a force-adaptive reinforcement learning framework that conditions a standing policy on a learned latent context encoding upper-body joint configuration and bimanual interaction forces. During training, we apply diverse, spherically sampled 3D forces on each hand to inject disturbances in simulation together with an upper-body pose curriculum, exposing the policy to manipulation-induced perturbations across continuously varying arm configurations. At deployment, interaction forces are estimated from the robot dynamics and fed to the same encoder, enabling online adaptation without wrist force/torque sensors. In simulation across five fixed arm configurations with randomized hand forces and commanded base heights, FAME improves mean standing success to 73.84%, compared to 51.40% for the curriculum-only baseline and 29.44% for the base policy. We further deploy the learned policy on a full-scale Unitree H12 humanoid and evaluate robustness in representative load-interaction scenarios, including asymmetric single-arm load and symmetric bimanual load. Code and videos are available at https://fame10.github.io/Fame/
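The spherically sampled disturbance forces can be drawn with the standard Gaussian-normalization trick: a normalized 3D Gaussian gives a uniform direction on the unit sphere, scaled by a random magnitude. The magnitude range below is an assumed placeholder, not the paper's curriculum values:

```python
import numpy as np

def sample_hand_force(rng, f_max=50.0):
    """Draw one disturbance force: uniform direction on S^2 (normalized
    Gaussian) times a uniform magnitude in [0, f_max] newtons (assumed)."""
    v = rng.standard_normal(3)
    v /= np.linalg.norm(v)               # uniform direction on the sphere
    return rng.uniform(0.0, f_max) * v

rng = np.random.default_rng(3)
forces = np.array([sample_hand_force(rng) for _ in range(1000)])
```

Uniform directions matter here: axis-aligned or per-component uniform sampling would bias the curriculum toward the corners of the force cube rather than covering all push directions evenly.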
cs.RO / 13 / 2603.08983
SurgCalib: Gaussian Splatting-Based Hand-Eye Calibration for Robot-Assisted Minimally Invasive Surgery
Abstract
We present a Gaussian Splatting-based framework for hand-eye calibration of the da Vinci surgical robot. In a vision-guided robotic system, accurate estimation of the rigid transformation between the robot base and the camera frame is essential for reliable closed-loop control. For cable-driven surgical robots, this task faces unique challenges. The encoders of surgical instruments often produce inaccurate proprioceptive measurements due to cable stretch and backlash. Conventional hand-eye calibration approaches typically rely on known fiducial patterns and solve the AX = XB formulation. While effective, introducing additional markers into the operating room (OR) environment can violate sterility protocols and disrupt surgical workflows. In this study, we propose SurgCalib, an automatic, markerless framework that has the potential to be used in the OR. SurgCalib first initializes the pose of the surgical instrument using raw kinematic measurements and subsequently refines this pose through a two-phase optimization procedure under the RCM constraint within a Gaussian Splatting-based differentiable rendering pipeline. We evaluate the proposed method on the public dVRK benchmark, SurgPose. The results demonstrate average 2D tool-tip reprojection errors of 12.24 px (2.06 mm) and 11.33 px (1.9 mm), and 3D tool-tip Euclidean distance errors of 5.98 mm and 4.75 mm, for the left and right instruments, respectively.
cs.RO / 14 / 2603.08988
Characterization, Analytical Planning, and Hybrid Force Control for the Inspire RH56DFX Hand
Abstract
Commercially accessible dexterous robot hands are increasingly prevalent, but many remain difficult to use as scientific instruments. For example, the Inspire RH56DFX hand exposes only uncalibrated proprioceptive information and shows unreliable contact behavior at high speed (up to 1618% force limit overshoot). Furthermore, its underactuated, coupled finger linkages make antipodal grasps non-trivial. We contribute three improvements to the Inspire RH56DFX to transform it from a black-box device to a research tool: (1) hardware characterization (force calibration, latency, and overshoot), (2) a sim2real validated MuJoCo model for analytical width-to-grasp planning, and (3) a hybrid, closed-loop speed-force grasp controller. We validate these components on peg-in-hole insertion, achieving 65% success and outperforming a wrist-force-only baseline of 10% and on 300 grasps across 15 physically diverse objects, achieving 87% success and outperforming plan-free grasps and learned grasps. Our approach is modular, designed for compatibility with external object detectors and vision-language models for width & force estimation and high-level planning, and provides an interpretable and immediately deployable interface for dexterous manipulation with the Inspire RH56DFX hand, open-sourced at this website https://correlllab.github.io/rh56dfx.html.
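A hybrid speed-force grasp loop of the kind described can be sketched as a guarded close followed by proportional force regulation. The thresholds, gains, and toy contact model below are assumptions for illustration, not the released controller:

```python
def hybrid_grasp_step(state, force, f_contact=0.5, f_target=5.0,
                      v_close=0.2, k_f=0.05):
    """Hybrid grasp controller: close at constant speed until the calibrated
    force crosses a contact threshold, then P-control toward a target force.
    Returns (new_state, velocity_command); negative velocity closes."""
    if state == "approach" and force >= f_contact:
        state = "force"                        # contact detected: switch modes
    if state == "approach":
        return state, -v_close                 # guarded close at fixed speed
    return state, -k_f * (f_target - force)    # proportional force regulation

# toy plant: grip force ramps linearly with closure past the contact point
pos, force, state = 1.0, 0.0, "approach"
for _ in range(300):
    state, v = hybrid_grasp_step(state, force)
    pos += 0.05 * v
    force = max(0.0, 20.0 * (0.8 - pos))       # assumed stiffness of 20 N/unit
```

Switching to closed-loop force control at contact is what avoids the open-loop force overshoot the characterization documents at high closing speeds.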
cs.RO / 15 / 2603.09011
Improving through Interaction: Searching Behavioral Representation Spaces with CMA-ES-IG
Abstract
Robots that interact with humans must adapt to individual users' preferences to operate effectively in human-centered environments. An intuitive and effective technique to learn non-expert users' preferences is through rankings of robot behaviors, e.g., trajectories, gestures, or voices. Existing techniques primarily focus on generating queries that optimize preference learning outcomes, such as sample efficiency or final preference estimation accuracy. However, the focus on outcome overlooks key user expectations in the process of providing these rankings, which can negatively impact users' adoption of robotic systems. This work proposes the Covariance Matrix Adaptation Evolution Strategies with Information Gain (CMA-ES-IG) algorithm. CMA-ES-IG explicitly incorporates user experience considerations into the preference learning process by suggesting perceptually distinct and informative trajectories for users to rank. We demonstrate these benefits through both simulated studies and real-robot experiments. CMA-ES-IG, compared to state-of-the-art alternatives, (1) scales more effectively to higher-dimensional preference spaces, (2) maintains computational tractability for high-dimensional problems, (3) is robust to noisy or inconsistent user feedback, and (4) is preferred by non-expert users in identifying their preferred robot behaviors. This project's code is available at github.com/interaction-lab/CMA-ES-IG
cs.RO / 16 / 2603.09030
PlayWorld: Learning Robot World Models from Autonomous Play
Abstract
Action-conditioned video models offer a promising path to building general-purpose robot simulators that can improve directly from data. Yet, despite training on large-scale robot datasets, current state-of-the-art video models still struggle to predict physically consistent robot-object interactions that are crucial in robotic manipulation. To close this gap, we present PlayWorld, a simple, scalable, and fully autonomous pipeline for training high-fidelity video world simulators from interaction experience. In contrast to prior approaches that rely on success-biased human demonstrations, PlayWorld is the first system capable of learning entirely from unsupervised robot self-play, enabling naturally scalable data collection while capturing complex, long-tailed physical interactions essential for modeling realistic object dynamics. Experiments across diverse manipulation tasks show that PlayWorld generates high-quality, physically consistent predictions for contact-rich interactions that are not captured by world models trained on human-collected data. We further demonstrate the versatility of PlayWorld in enabling fine-grained failure prediction and policy evaluation, with up to 40% improvements over human-collected data. Finally, we demonstrate how PlayWorld enables reinforcement learning in the world model, improving policy performance by 65% in success rates when deployed in the real world.
cs.RO / 17 / 2603.09031
ImpedanceDiffusion: Diffusion-Based Global Path Planning for UAV Swarm Navigation with Generative Impedance Control
阻抗扩散:基于扩散的无人机群导航全局路径规划与生成阻抗控制
Abstract
Safe swarm navigation in cluttered indoor environments requires long-horizon planning, reactive obstacle avoidance, and adaptive compliance. We propose ImpedanceDiffusion, a hierarchical framework that leverages image-conditioned diffusion-based global path planning with Artificial Potential Field (APF) tracking and semantic-aware variable impedance control for aerial drone swarms. The diffusion model generates geometric global trajectories directly from RGB images without explicit map construction. These trajectories are tracked by an APF-based reactive layer, while a VLM-RAG module performs semantic obstacle classification with 90% retrieval accuracy to adapt impedance parameters for mixed obstacle environments during execution. Two diffusion planners are evaluated: (i) a top-view long-horizon planner using single-pass inference and (ii) a first-person-view (FPV) short-horizon planner deployed via a two-stage inference pipeline. Both planners achieve a 100% trajectory generation rate across twenty static and dynamic experimental configurations and are validated via zero-shot sim-to-real deployment on Crazyflie 2.1 drones through the hierarchical APF-impedance control stack. The top-view planner produces smoother trajectories that yield conservative tracking speeds of 1.0-1.2 m/s near hard obstacles and 0.6-1.0 m/s near soft obstacles. In contrast, the FPV planner generates trajectories with greater local clearance and typically higher speeds, reaching 1.4-2.0 m/s near hard obstacles and up to 1.6 m/s near soft obstacles. Across 20 experimental configurations (100 total runs), the framework achieved a 92% success rate while maintaining stable impedance-based formation control with bounded oscillations and no in-flight collisions, demonstrating reliable and adaptive swarm navigation in cluttered indoor environments.
Chinese Translation
在杂乱的室内环境中实现安全的无人机群导航需要长时域规划、反应式避障和自适应顺应性。我们提出了阻抗扩散(ImpedanceDiffusion),这是一个分层框架,利用图像条件的基于扩散的全局路径规划,结合人工势场(Artificial Potential Field, APF)跟踪和语义感知的可变阻抗控制,专为空中无人机群设计。该扩散模型直接从RGB图像生成几何全局轨迹,而无需显式的地图构建。这些轨迹由基于APF的反应层进行跟踪,同时VLM-RAG模块以90%的检索准确率执行语义障碍分类,以便在执行过程中为混合障碍环境调整阻抗参数。我们评估了两个扩散规划器:(i) 使用单次推理的顶视长时域规划器和(ii) 通过两阶段推理管道部署的第一人称视角(FPV)短时域规划器。两个规划器在二十个静态和动态实验配置中均实现了100%的轨迹生成率,并通过在Crazyflie 2.1无人机上的零样本模拟到现实部署进行了验证。顶视规划器生成的轨迹更平滑,在硬障碍物附近的保守跟踪速度为1.0-1.2米/秒,而在软障碍物附近为0.6-1.0米/秒。相比之下,FPV规划器生成的轨迹具有更大的局部间隙,通常速度更高,在硬障碍物附近可达1.4-2.0米/秒,在软障碍物附近可达1.6米/秒。在20个实验配置(共100次运行)中,该框架实现了92%的成功率,同时保持稳定的基于阻抗的队形控制,振荡受限且无飞行碰撞,展示了在杂乱室内环境中可靠和自适应的无人机群导航能力。
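The APF-based reactive tracking layer described above can be sketched in a few lines. A quadratic attractive potential pulls toward the current waypoint and an inverse-distance repulsive potential pushes away from nearby obstacles; the gains and potential shapes below are illustrative textbook choices, not the paper's tuned parameters.

```python
import numpy as np

def apf_step(pos, goal, obstacles, k_att=1.0, k_rep=0.5, d0=1.0, step=0.1):
    """One reactive tracking step with an Artificial Potential Field.

    A quadratic attractive potential pulls toward the current waypoint;
    obstacles closer than the influence radius d0 add a repulsive force
    that grows rapidly as the distance shrinks, dominating near contact.
    """
    pos, goal = np.asarray(pos, float), np.asarray(goal, float)
    force = k_att * (goal - pos)                      # attractive term
    for obs in obstacles:
        diff = pos - np.asarray(obs, float)
        d = np.linalg.norm(diff)
        if 1e-9 < d < d0:                             # inside influence radius
            force += k_rep * (1.0 / d - 1.0 / d0) / d**2 * (diff / d)
    return pos + step * force                         # gradient-descent update

# Free space: the drone moves straight toward the waypoint.
p = apf_step([0.0, 0.0], [1.0, 0.0], obstacles=[])    # -> [0.1, 0.0]
```

In the full system this update would run at high rate on each drone, tracking the waypoints produced by the diffusion planner while the impedance layer shapes the response near classified obstacles.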
cs.RO / 18 / 2603.09047
Beyond Amplitude: Channel State Information Phase-Aware Deep Fusion for Robotic Activity Recognition
超越幅度:基于信道状态信息相位感知的深度融合用于机器人活动识别
Abstract
Wi-Fi Channel State Information (CSI) has emerged as a promising non-line-of-sight sensing modality for human and robotic activity recognition. However, prior work has predominantly relied on CSI amplitude while underutilizing phase information, particularly in robotic arm activity recognition. In this paper, we present GateFusion-Bidirectional Long Short-Term Memory network (GF-BiLSTM) for WiFi sensing in robotic activity recognition. GF-BiLSTM is a two-stream gated fusion network that encodes amplitude and phase separately and adaptively integrates per-time-step features through a learned gating mechanism. We systematically evaluate state-of-the-art deep learning models under a Leave-One-Velocity-Out (LOVO) protocol across four input configurations: amplitude only, phase only, amplitude + unwrapped phase, and amplitude + sanitized phase. Experimental results demonstrate that incorporating phase alongside amplitude consistently improves recognition accuracy and cross-speed robustness, with GF-BiLSTM achieving the best performance. To the best of our knowledge, this work provides the first systematic exploration of CSI phase for robotic activity recognition, establishing its critical role in Wi-Fi-based sensing.
Chinese Translation
Wi-Fi信道状态信息(CSI)已成为一种有前景的非视距感知方式,用于人类和机器人活动识别。然而,之前的研究主要依赖于CSI幅度,而未充分利用相位信息,特别是在机器人臂活动识别中。本文提出了用于Wi-Fi感知的机器人活动识别的门控融合双向长短期记忆网络(GateFusion-Bidirectional Long Short-Term Memory network,GF-BiLSTM)。GF-BiLSTM是一个双流门控融合网络,分别编码幅度和相位,并通过学习的门控机制自适应地整合每个时间特征。我们在四种输入配置下系统地评估了最先进的深度学习模型,采用了留一速度法(Leave-One-Velocity-Out,LOVO)协议:仅幅度、仅相位、幅度 + 解包相位,以及幅度 + 清洗相位。实验结果表明,结合相位和幅度能够持续提高识别准确性和跨速度的鲁棒性,其中GF-BiLSTM表现最佳。根据我们的了解,这项工作首次系统性地探索了CSI相位在机器人活动识别中的应用,确立了其在基于Wi-Fi的感知中的关键作用。
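The gated fusion at the heart of GF-BiLSTM, encoding amplitude and phase separately and blending them per time step with a learned sigmoid gate, can be illustrated with plain NumPy. The feature shapes, the single weight matrix, and the `gated_fusion` helper are assumptions for illustration; the paper's encoders are BiLSTMs and the gate is trained end to end.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(amp_feat, phase_feat, W, b):
    """Blend amplitude and phase feature streams with a learned gate.

    amp_feat, phase_feat : (T, D) per-time-step encoded features.
    The gate g in (0, 1) is computed from both streams jointly, so the
    network can lean on phase exactly when it is informative.
    """
    z = np.concatenate([amp_feat, phase_feat], axis=-1)   # (T, 2D)
    g = sigmoid(z @ W + b)                                # (T, D) gate
    return g * amp_feat + (1.0 - g) * phase_feat          # convex blend

rng = np.random.default_rng(0)
T, D = 5, 8
amp = rng.normal(size=(T, D))
phase = rng.normal(size=(T, D))
fused = gated_fusion(amp, phase, W=rng.normal(size=(2 * D, D)), b=np.zeros(D))
```

Because the gate output is a per-element convex combination, every fused value lies between the corresponding amplitude and phase features, which keeps the fused representation on the same scale as its inputs.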
cs.RO / 19 / 2603.09051
Cutting the Cord: System Architecture for Low-Cost, GPU-Accelerated Bimanual Mobile Manipulation
割断束缚:低成本GPU加速双手移动操控系统架构
Abstract
We present a bimanual mobile manipulator built on the open-source XLeRobot with integrated onboard compute for less than $1300. Key contributions include: (1) optimized mechanical design maximizing stiffness-to-weight ratio, (2) a Tri-Bus power topology isolating compute from motor-induced voltage transients, and (3) embedded autonomy using NVIDIA Jetson Orin Nano for untethered operation. The platform enables teleoperation, autonomous SLAM navigation, and vision-based manipulation without external dependencies, providing a low-cost alternative for research and education in robotics and robot learning.
Chinese Translation
我们提出了一种基于开源XLeRobot构建的双手移动操控器,集成了低于1300美元的机载计算。主要贡献包括:(1) 优化的机械设计,最大化刚度与重量比;(2) Tri-Bus电源拓扑,隔离计算与电机引起的电压瞬变;(3) 使用NVIDIA Jetson Orin Nano实现的嵌入式自主性,支持无缆操作。该平台支持远程操作、自动SLAM导航和基于视觉的操控,无需外部依赖,为机器人及机器人学习领域的研究和教育提供了一种低成本的替代方案。
cs.RO / 20 / 2603.09056
Quality over Quantity: Demonstration Curation via Influence Functions for Data-Centric Robot Learning
质量优于数量:通过影响函数进行数据中心化机器人学习的演示数据策展
Abstract
Learning from demonstrations has emerged as a promising paradigm for end-to-end robot control, particularly when scaled to diverse and large datasets. However, the quality of demonstration data, often collected through human teleoperation, remains a critical bottleneck for effective data-driven robot learning. Human errors, operational constraints, and teleoperator variability introduce noise and suboptimal behaviors, making data curation essential yet largely manual and heuristic-driven. In this work, we propose Quality over Quantity (QoQ), a grounded and systematic approach to identifying high-quality data by defining data quality as the contribution of each training sample to reducing loss on validation demonstrations. To efficiently estimate this contribution, we leverage influence functions, which quantify the impact of individual training samples on model performance. We further introduce two key techniques to adapt influence functions for robot demonstrations: (i) using maximum influence across validation samples to capture the most relevant state-action pairs, and (ii) aggregating influence scores of state-action pairs within the same trajectory to reduce noise and improve data coverage. Experiments in both simulated and real-world settings show that QoQ consistently improves policy performances over prior data selection methods.
Chinese Translation
从演示中学习已成为端到端机器人控制的一个有前景的范式,特别是在扩展到多样化和大规模数据集时。然而,演示数据的质量,通常通过人类遥操作收集,仍然是有效数据驱动机器人学习的一个关键瓶颈。人类错误、操作限制和遥操作员的变异性引入了噪声和次优行为,使得数据策展变得至关重要,但在很大程度上仍然是手动和启发式驱动的。在这项工作中,我们提出了"质量优于数量"(Quality over Quantity, QoQ),这是一种扎实且系统的方法,通过将数据质量定义为每个训练样本对减少验证演示损失的贡献,从而识别高质量数据。为了有效估计这一贡献,我们利用影响函数(influence functions),该函数量化了个别训练样本对模型性能的影响。我们进一步引入了两项关键技术,以适应机器人演示的影响函数:(i)使用验证样本中的最大影响来捕捉最相关的状态-动作对,以及(ii)聚合同一轨迹内状态-动作对的影响评分,以减少噪声并改善数据覆盖。在模拟和真实世界环境中的实验表明,QoQ在策略性能上始终优于先前的数据选择方法。
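The two adaptations of influence functions described above, (i) the max over validation samples and (ii) aggregation within a trajectory, reduce to simple array operations once an influence matrix has been estimated. The mean as the trajectory-level aggregator and the `qoq_scores` helper are assumptions for illustration; the abstract does not specify the aggregation operator.

```python
import numpy as np

def qoq_scores(influence, traj_ids):
    """Trajectory-level quality scores from a sample-influence matrix.

    influence[i, j] estimates how much training sample i reduces loss on
    validation sample j.  Per the abstract: (i) take the MAX over the
    validation axis to find each sample's most relevant validation pair,
    then (ii) aggregate scores within each trajectory (mean assumed here).
    """
    per_sample = influence.max(axis=1)        # (i) max over validation set
    traj_ids = np.asarray(traj_ids)
    return {int(t): float(per_sample[traj_ids == t].mean())  # (ii) per trajectory
            for t in np.unique(traj_ids)}

# Toy example: 4 state-action pairs from 2 demonstration trajectories.
inf_matrix = np.array([[0.9, 0.1],
                       [0.2, 0.3],
                       [-0.5, 0.0],
                       [0.1, -0.2]])
scores = qoq_scores(inf_matrix, traj_ids=[0, 0, 1, 1])
# trajectory 0 (mean 0.6) outranks trajectory 1 (mean 0.05)
```

Curation would then keep the top-scoring trajectories for training, which is where the trajectory-level aggregation helps: a single noisy state-action pair cannot promote or sink a whole demonstration on its own.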
cs.RO / 21 / 2603.09070
3D UAV Trajectory Estimation and Classification from Internet Videos via Language Model
通过语言模型从互联网视频中估计和分类3D无人机轨迹
Abstract
Reliable 3D trajectory estimation of unmanned aerial vehicles (UAVs) is a fundamental requirement for anti-UAV systems, yet the acquisition of large-scale and accurately annotated trajectory data remains prohibitively expensive. In this work, we present a novel framework that derives UAV 3D trajectories and category information directly from Internet-scale UAV videos, without relying on manual annotations. First, language-driven data acquisition is employed to autonomously discover and collect UAV-related videos, while vision-language reasoning progressively filters task-relevant segments. Second, a training-free cross-modal label generation module is introduced to infer 3D trajectory hypotheses and UAV type cues. Third, a physics-informed refinement process is designed to impose temporal smoothness and kinematic consistency on the estimated trajectories. The resulting video clips and trajectory annotations can be readily utilized for downstream anti-UAV tasks. To assess effectiveness and generalization, we conduct zero-shot transfer experiments on a public, well-annotated 3D UAV benchmark. Results reveal a clear data scaling behavior: as the amount of online video data increases, zero-shot transfer performance on the target dataset improves consistently, without any target-domain training. The proposed method closely approaches the current state-of-the-art, highlighting its robustness and applicability to real-world anti-UAV scenarios. Code and datasets will be released upon acceptance.
Chinese Translation
无人机(UAV)的可靠3D轨迹估计是反无人机系统的基本要求,但获取大规模且准确标注的轨迹数据仍然代价高昂。在本研究中,我们提出了一种新颖的框架,直接从互联网规模的无人机视频中推导无人机的3D轨迹和类别信息,而无需依赖手动标注。首先,采用基于语言的数据获取方法,自动发现并收集与无人机相关的视频,同时视觉-语言推理逐步过滤与任务相关的片段。其次,引入了一种无训练的跨模态标签生成模块,以推断3D轨迹假设和无人机类型线索。第三,设计了一种基于物理的精炼过程,以对估计的轨迹施加时间平滑性和运动一致性。生成的视频片段和轨迹标注可以直接用于下游反无人机任务。为了评估方法的有效性和泛化能力,我们在一个公共的、标注良好的3D无人机基准上进行了零样本迁移实验。结果显示出明显的数据扩展行为:随着在线视频数据量的增加,目标数据集上的零样本迁移性能持续改善,而无需任何目标领域的训练。所提出的方法接近当前的最先进水平,突显了其在现实世界反无人机场景中的鲁棒性和适用性。代码和数据集将在接受后发布。
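The physics-informed refinement step above imposes temporal smoothness and kinematic consistency on noisy trajectory hypotheses. A minimal stand-in is ridge-style smoothing that penalizes the second finite difference (discrete acceleration); the operator form and the weight `lam` are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def smooth_trajectory(raw, lam=10.0):
    """Refine a noisy (T, 3) trajectory by penalizing acceleration.

    Solves  min_x ||x - raw||^2 + lam * ||D2 @ x||^2  per axis, where D2
    is the second-order finite-difference operator, so the refined path
    trades fidelity to the raw estimate against temporal smoothness.
    """
    raw = np.asarray(raw, float)
    T = raw.shape[0]
    D2 = np.zeros((T - 2, T))
    for i in range(T - 2):
        D2[i, i:i + 3] = [1.0, -2.0, 1.0]          # discrete acceleration
    A = np.eye(T) + lam * D2.T @ D2                # normal equations
    return np.linalg.solve(A, raw)

t = np.linspace(0.0, 1.0, 50)
truth = np.stack([t, t ** 2, np.zeros_like(t)], axis=1)
noisy = truth + np.random.default_rng(1).normal(scale=0.05, size=truth.shape)
refined = smooth_trajectory(noisy)
```

By construction the solution never has more second-difference energy than the raw input, so the refined trajectory is guaranteed at least as smooth as the noisy estimate it started from.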
cs.RO / 22 / 2603.09073
High-Slip-Ratio Control for Peak Tire-Road Friction Estimation Using Automated Vehicles
基于高滑移比控制的自动驾驶车辆峰值轮胎-路面摩擦系数估计
Abstract
Accurate estimation of the tire-road friction coefficient (TRFC) is critical for ensuring safe vehicle control, especially under adverse road conditions. However, most existing methods rely on naturalistic driving data from regular vehicles, which typically operate under mild acceleration and braking. As a result, the data provide insufficient slip excitation and offer limited observability of the peak TRFC. This paper presents a high-slip-ratio control framework that enables automated vehicles (AVs) to actively excite the peak friction region during empty-haul operations while maintaining operational safety. A simplified Magic Formula tire model is adopted to represent nonlinear slip-force dynamics and is locally fitted using repeated high-slip measurements. To support safe execution in car-following scenarios, we formulate a constrained optimal control strategy that balances slip excitation, trajectory tracking, and collision avoidance. In parallel, a binning-based statistical projection method is introduced to robustly estimate peak TRFC under noise and local sparsity. The framework is validated through both closed-loop simulations and real-vehicle experiments, demonstrating its accuracy, safety, and feasibility for scalable, cost-effective roadway friction screening.
Chinese Translation
准确估计轮胎-路面摩擦系数(TRFC)对于确保车辆在不利路况下的安全控制至关重要。然而,大多数现有方法依赖于常规车辆的自然驾驶数据,这些车辆通常在温和的加速和制动下运行。因此,这些数据提供的滑移激励不足,且对峰值TRFC的可观测性有限。本文提出了一种高滑移比控制框架,使自动驾驶车辆(AVs)能够在空载操作期间主动激发峰值摩擦区域,同时保持操作安全。采用简化的魔术公式(Magic Formula)轮胎模型来表示非线性滑移-力动态,并通过重复的高滑移测量进行局部拟合。为了支持在跟车场景中的安全执行,我们制定了一种约束最优控制策略,以平衡滑移激励、轨迹跟踪和碰撞避免。同时,提出了一种基于分箱的统计投影方法,以在噪声和局部稀疏情况下稳健地估计峰值TRFC。通过闭环仿真和真实车辆实验验证了该框架,展示了其在可扩展、经济高效的道路摩擦筛查中的准确性、安全性和可行性。
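The simplified Magic Formula mentioned above maps slip ratio to normalized tire force. In the common simplified form (curvature factor E omitted), the peak factor D equals the peak friction coefficient that high-slip excitation tries to make observable; the coefficient values below are illustrative, not fitted.

```python
import math

def magic_formula(slip, B=10.0, C=1.9, D=1.0):
    """Simplified Pacejka Magic Formula (curvature factor E omitted).

    slip : longitudinal slip ratio
    B, C : stiffness and shape factors (illustrative values)
    D    : peak factor, i.e. the peak tire-road friction coefficient
           (TRFC) that high-slip excitation tries to make observable.
    """
    return D * math.sin(C * math.atan(B * slip))

# Normalized force rises with slip, peaks near mu_max = D, then falls off,
# which is why mild naturalistic braking never reveals the peak.
forces = [magic_formula(s / 100.0) for s in range(101)]
```

The shape explains the paper's premise: regular vehicles stay on the rising, low-slip part of this curve, so only deliberate high-slip excitation traverses the peak where D becomes identifiable.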
cs.RO / 23 / 2603.09083
Provably Safe Trajectory Generation for Manipulators Under Motion and Environmental Uncertainties
在运动和环境不确定性下的可证明安全轨迹生成方法
Abstract
Robot manipulators operating in uncertain and non-convex environments present significant challenges for safe and optimal motion planning. Existing methods often struggle to provide efficient and formally certified collision risk guarantees, particularly when dealing with complex geometries and non-Gaussian uncertainties. This article proposes a novel risk-bounded motion planning framework to address this unmet need. Our approach integrates a rigid manipulator deep stochastic Koopman operator (RM-DeSKO) model to robustly predict the robot's state distribution under motion uncertainty. We then introduce an efficient, hierarchical verification method that combines parallelizable physics simulations with sum-of-squares (SOS) programming as a filter for fine-grained, formal certification of collision risk. This method is embedded within a Model Predictive Path Integral (MPPI) controller that uniquely utilizes binary collision information from SOS decomposition to improve its policy. The effectiveness of the proposed framework is validated on two typical robot manipulators through extensive simulations and real-world experiments, including a challenging human-robot collaboration scenario, demonstrating sim-to-real transfer of the learned model and its ability to generate safe and efficient trajectories in complex, uncertain settings.
Chinese Translation
在不确定和非凸环境中操作的机器人手臂在安全和最优运动规划方面面临重大挑战。现有方法往往难以提供高效且经过正式认证的碰撞风险保证,尤其是在处理复杂几何形状和非高斯不确定性时。本文提出了一种新颖的风险界限运动规划框架,以满足这一未被满足的需求。我们的方法集成了一种刚性操纵器深度随机Koopman算子(RM-DeSKO)模型,以稳健地预测机器人在运动不确定性下的状态分布。随后,我们引入了一种高效的分层验证方法,该方法将可并行化的物理仿真与平方和(SOS)编程结合,作为细粒度正式认证碰撞风险的过滤器。该方法嵌入在一种模型预测路径积分(MPPI)控制器中,该控制器独特地利用来自SOS分解的二元碰撞信息来改善其策略。通过广泛的仿真和现实世界实验,包括一个具有挑战性的人机协作场景,我们验证了所提框架在两个典型机器人手臂上的有效性,展示了学习模型的仿真到现实转移及其在复杂不确定环境中生成安全高效轨迹的能力。
cs.RO / 24 / 2603.09086
Latent World Models for Automated Driving: A Unified Taxonomy, Evaluation Framework, and Open Challenges
自动驾驶的潜在世界模型:统一分类法、评估框架与开放挑战
Abstract
Emerging generative world models and vision-language-action (VLA) systems are rapidly reshaping automated driving by enabling scalable simulation, long-horizon forecasting, and capability-rich decision making. Across these directions, latent representations serve as the central computational substrate: they compress high-dimensional multi-sensor observations, enable temporally coherent rollouts, and provide interfaces for planning, reasoning, and controllable generation. This paper proposes a unifying latent-space framework that synthesizes recent progress in world models for automated driving. The framework organizes the design space by the target and form of latent representations (latent worlds, latent actions, latent generators; continuous states, discrete tokens, and hybrids) and by structural priors for geometry, topology, and semantics. Building on this taxonomy, the paper articulates five cross-cutting internal mechanics (i.e., structural isomorphism, long-horizon temporal stability, semantic and reasoning alignment, value-aligned objectives and post-training, as well as adaptive computation and deliberation) and connects these design choices to robustness, generalization, and deployability. The work also proposes concrete evaluation prescriptions, including a closed-loop metric suite and a resource-aware deliberation cost, designed to reduce the open-loop / closed-loop mismatch. Finally, the paper identifies actionable research directions toward advancing latent world models for decision-ready, verifiable, and resource-efficient automated driving.
Chinese Translation
新兴的生成世界模型和视觉-语言-行动(VLA)系统正在通过实现可扩展的仿真、长期预测和丰富的决策能力,迅速重塑自动驾驶。在这些方向上,潜在表示作为核心计算基础:它们压缩高维多传感器观测,支持时间一致的展开,并为规划、推理和可控生成提供接口。本文提出了一个统一的潜在空间框架,综合了自动驾驶世界模型的最新进展。该框架通过潜在表示的目标和形式(潜在世界、潜在行动、潜在生成器;连续状态、离散标记和混合体)以及几何、拓扑和语义的结构先验来组织设计空间。在此分类法的基础上,本文阐明了五个交叉的内部机制(即结构同构、长期时间稳定性、语义和推理对齐、价值对齐目标及后训练,以及自适应计算和深思熟虑),并将这些设计选择与鲁棒性、泛化能力和可部署性联系起来。该研究还提出了具体的评估建议,包括一个闭环度量套件和一个资源感知的深思熟虑成本,旨在减少开放环路与闭环之间的不匹配。最后,本文确定了可行的研究方向,以推动潜在世界模型的发展,实现决策准备、可验证和资源高效的自动驾驶。
cs.RO / 25 / 2603.09113
PM-Nav: Priori-Map Guided Embodied Navigation in Functional Buildings
PM-Nav:基于先验地图的功能性建筑具身导航
Abstract
Existing language-driven embodied navigation paradigms face challenges in functional buildings (FBs) with highly similar features, as they lack the ability to effectively utilize priori spatial knowledge. To tackle this issue, we propose Priori-Map Guided Embodied Navigation (PM-Nav), wherein environmental maps are transformed into navigation-friendly semantic priori-maps, a hierarchical chain-of-thought prompt template with an annotation priori-map is designed to enable precise path planning, and a multi-model collaborative action output mechanism is built to accomplish positioning decisions and execution control for navigation planning. Comprehensive tests using a home-made FB dataset show that PM-Nav obtains average improvements of 511% and 1175%, and 650% and 400% over SG-Nav and InstructNav in simulation and the real world, respectively. These tremendous boosts elucidate the great potential of using PM-Nav as a backbone navigation framework for FBs.
Chinese Translation
现有的语言驱动的具身导航范式在功能性建筑(FBs)中面临挑战,因为这些建筑具有高度相似的特征,缺乏有效利用先验空间知识的能力。为了解决这一问题,我们提出了一种基于先验地图的具身导航方法(PM-Nav),该方法将环境地图转化为适合导航的语义先验地图,设计了一种带注释的先验地图的分层思维链提示模板,以实现精确的路径规划,并构建了一种多模型协作行动输出机制,以完成导航规划中的定位决策和执行控制。使用自制的功能性建筑数据集进行的全面测试表明,PM-Nav在模拟和现实世界中分别比SG-Nav和InstructNav平均提高了511%和1175%,以及650%和400%。这些显著的提升阐明了将PM-Nav作为功能性建筑的主干导航框架的巨大潜力。
cs.RO / 26 / 2603.09121
DexHiL: A Human-in-the-Loop Framework for Vision-Language-Action Model Post-Training in Dexterous Manipulation
DexHiL:一种用于灵巧操作的视觉-语言-行动模型后训练的人机协同框架
Abstract
While Vision-Language-Action (VLA) models have demonstrated promising generalization capabilities in robotic manipulation, deploying them on specific and complex downstream tasks still demands effective post-training. In parallel, Human-in-the-Loop (HiL) learning has proven to be a powerful mechanism for refining robot policies. However, extending this paradigm to dexterous manipulation remains challenging: multi-finger control is high-dimensional, contact-intensive, and exhibits execution distributions that differ markedly from standard arm motions, leaving existing dexterous VLA systems limited in reliability and adaptability. We present DexHiL, the first integrated arm-hand human-in-the-loop framework for dexterous VLA models, enabling coordinated interventions over the arm and the dexterous hand within a single system. DexHiL introduces an intervention-aware data sampling strategy that prioritizes corrective segments for post-training, alongside a lightweight teleoperation interface that supports instantaneous human corrections during execution. Real-robot experiments demonstrate that DexHiL serves as an effective post-training framework, yielding a substantial performance leap, outperforming standard offline-only fine-tuning baselines by an average of 25% in success rates across distinct tasks. Project page: https://chenzhongxi-sjtu.github.io/dexhil/
Chinese Translation
尽管视觉-语言-行动(VLA)模型在机器人操作中展示了良好的泛化能力,但在特定复杂的下游任务中部署它们仍然需要有效的后训练。同时,人机协同(HiL)学习已被证明是优化机器人策略的有效机制。然而,将这一范式扩展到灵巧操作仍然面临挑战:多指控制是高维的、接触密集的,并且其执行分布与标准手臂运动显著不同,这使得现有的灵巧VLA系统在可靠性和适应性方面受到限制。我们提出了DexHiL,这是第一个集成的臂-手人机协同框架,旨在支持灵巧VLA模型,使得在单一系统中能够对臂和灵巧手进行协调干预。DexHiL引入了一种干预感知的数据采样策略,优先考虑后训练中的纠正片段,并配备了一个轻量级的远程操作界面,支持在执行过程中即时的人为纠正。真实机器人实验表明,DexHiL作为一个有效的后训练框架,带来了显著的性能提升,在不同任务中成功率平均提高了25%,超越了标准的仅离线微调基线。项目页面:https://chenzhongxi-sjtu.github.io/dexhil/
cs.RO / 27 / 2603.09147
Walking on Rough Terrain with Any Number of Legs
以任意数量的腿在粗糙地形上行走
Abstract
Robotics would gain by replicating the remarkable agility of arthropods in navigating complex environments. Here we consider the control of multi-legged systems which have 6 or more legs. Current multi-legged control strategies in robots include large black-box machine learning models, Central Pattern Generator (CPG) networks, and open-loop feed-forward control with stability arising from mechanics. We present a multi-legged control architecture for rough terrain using a segmental robot with 3 actuators for every 2 legs, which we validated in simulation for robots with 6 to 16 legs. Segments have identical state machines, and each segment also receives input from the segment in front of it. Our design bridges the gap between WalkNet-like event cascade controllers and CPG-based controllers: it tightly couples to the ground when contact is present, but produces fictive locomotion when ground contact is missing. The approach may be useful as an adaptive and computationally lightweight controller for multi-legged robots, and as a baseline capability for scaffolding the learning of machine learning controllers.
Chinese Translation
机器人技术可以通过复制节肢动物在复杂环境中导航的卓越灵活性而受益。本文考虑了具有6条或更多腿的多足系统的控制。目前,机器人中的多足控制策略包括大型黑箱机器学习模型、中央模式发生器(Central Pattern Generator, CPG)网络以及基于力学稳定性的开环前馈控制。我们提出了一种用于粗糙地形的多足控制架构,该架构使用每2条腿配备3个驱动器的分段机器人,并在6到16条腿的机器人模拟中进行了验证。各个段具有相同的状态机,并且每个段还接收来自前一个段的输入。我们的设计弥合了类似WalkNet的事件级联控制器与基于CPG的控制器之间的差距:当接触地面时,它与地面紧密耦合,但在缺乏地面接触时则产生虚拟的运动。该方法可能作为多足机器人的一种自适应且计算轻量的控制器,以及作为机器学习控制器学习的基线能力。
cs.RO / 28 / 2603.09163
SPAN-Nav: Generalized Spatial Awareness for Versatile Vision-Language Navigation
SPAN-Nav:用于多功能视觉-语言导航的广义空间意识
Abstract
Recent embodied navigation approaches leveraging Vision-Language Models (VLMs) demonstrate strong generalization in versatile Vision-Language Navigation (VLN). However, reliable path planning in complex environments remains challenging due to insufficient spatial awareness. In this work, we introduce SPAN-Nav, an end-to-end foundation model designed to infuse embodied navigation with universal 3D spatial awareness using RGB video streams. SPAN-Nav extracts spatial priors across diverse scenes through an occupancy prediction task on extensive indoor and outdoor environments. To mitigate the computational burden, we introduce a compact representation for spatial priors, finding that a single token is sufficient to encapsulate the coarse-grained cues essential for navigation tasks. Furthermore, inspired by the Chain-of-Thought (CoT) mechanism, SPAN-Nav utilizes this single spatial token to explicitly inject spatial cues into action reasoning through an end-to-end framework. Leveraging multi-task co-training, SPAN-Nav captures task-adaptive cues from generalized spatial priors, enabling robust spatial awareness to generalize even to tasks lacking explicit spatial supervision. To support comprehensive spatial learning, we present a massive dataset of 4.2 million occupancy annotations that covers both indoor and outdoor scenes across multi-type navigation tasks. SPAN-Nav achieves state-of-the-art performance across three benchmarks spanning diverse scenarios and varied navigation tasks. Finally, real-world experiments validate the robust generalization and practical reliability of our approach across complex physical scenarios.
Chinese Translation
最近,利用视觉-语言模型(VLM)的具身导航方法在多功能视觉-语言导航(VLN)中表现出强大的泛化能力。然而,由于空间意识不足,在复杂环境中进行可靠的路径规划仍然具有挑战性。在本研究中,我们提出了SPAN-Nav,这是一种端到端的基础模型,旨在通过RGB视频流为具身导航注入通用的3D空间意识。SPAN-Nav通过在广泛的室内和室外环境中进行占用预测任务,提取多样场景中的空间先验。为了减轻计算负担,我们引入了一种紧凑的空间先验表示,发现单个标记足以封装导航任务所需的粗粒度线索。此外,受到思维链(Chain-of-Thought, CoT)机制的启发,SPAN-Nav利用这一单一空间标记,通过端到端框架明确地将空间线索注入到行动推理中。通过多任务共同训练,SPAN-Nav从广义空间先验中捕捉任务自适应线索,使得强大的空间意识能够泛化到缺乏明确空间监督的任务中。为了支持全面的空间学习,我们提供了一个包含420万占用标注的大型数据集,覆盖了多种导航任务中的室内和室外场景。SPAN-Nav在涵盖多样场景和不同导航任务的三个基准测试中实现了最先进的性能。最后,现实世界的实验验证了我们的方法在复杂物理场景中的强泛化能力和实际可靠性。
cs.RO / 29 / 2603.09170
ZeroWBC: Learning Natural Visuomotor Humanoid Control Directly from Human Egocentric Video
ZeroWBC:直接从人类自我中心视频学习自然的视觉运动类人控制
Abstract
Achieving versatile and naturalistic whole-body control for humanoid robot scene-interaction remains a significant challenge. While some recent works have demonstrated autonomous humanoid interactive control, they are constrained to rigid locomotion patterns and expensive teleoperation data collection, lacking the versatility to execute more human-like natural behaviors such as sitting or kicking. Furthermore, acquiring the necessary real robot teleoperation data is prohibitively expensive and time-consuming. To address these limitations, we introduce ZeroWBC, a novel framework that learns a natural humanoid visuomotor control policy directly from human egocentric videos, eliminating the need for large-scale robot teleoperation data and enabling natural humanoid robot scene-interaction control. Specifically, our approach first fine-tunes a Vision-Language Model (VLM) to predict future whole-body human motions based on text instructions and egocentric visual context, then these generated motions are retargeted to real robot joints and executed via our robust general motion tracking policy for humanoid whole-body control. Extensive experiments on the Unitree G1 humanoid robot demonstrate that our method outperforms baseline approaches in motion naturalness and versatility, successfully establishing a pipeline that eliminates teleoperation data collection overhead for whole-body humanoid control, offering a scalable and efficient paradigm for general humanoid whole-body control.
Chinese Translation
实现类人机器人场景交互的多功能和自然全身控制仍然是一项重大挑战。尽管一些近期的研究展示了自主类人交互控制,但它们受到僵硬运动模式和昂贵的遥控数据收集的限制,缺乏执行更类人自然行为(如坐下或踢)的多样性。此外,获取所需的真实机器人遥控数据既昂贵又耗时。为了解决这些限制,我们提出了ZeroWBC,一个新颖的框架,直接从人类自我中心视频中学习自然的类人视觉运动控制策略,消除了对大规模机器人遥控数据的需求,使类人机器人场景交互控制成为可能。具体而言,我们的方法首先微调一个视觉语言模型(Vision-Language Model, VLM),根据文本指令和自我中心视觉上下文预测未来的全身人类动作,然后将这些生成的动作重新定向到真实机器人的关节,并通过我们稳健的通用运动跟踪策略执行类人全身控制。在Unitree G1类人机器人上的大量实验表明,我们的方法在运动自然性和多样性方面优于基线方法,成功建立了一个消除全身类人控制遥控数据收集负担的管道,为通用类人全身控制提供了一个可扩展和高效的范式。
cs.RO / 30 / 2603.09175
STONE Dataset: A Scalable Multi-Modal Surround-View 3D Traversability Dataset for Off-Road Robot Navigation
STONE 数据集:一个可扩展的多模态全景视图 3D 可通行性数据集,用于越野机器人导航
Abstract
Reliable off-road navigation requires accurate estimation of traversable regions and robust perception under diverse terrain and sensing conditions. However, existing datasets lack both scalability and multi-modality, which limits progress in 3D traversability prediction. In this work, we introduce STONE, a large-scale multi-modal dataset for off-road navigation. STONE provides (1) trajectory-guided 3D traversability maps generated by a fully automated, annotation-free pipeline, and (2) comprehensive surround-view sensing with synchronized 128-channel LiDAR, six RGB cameras, and three 4D imaging radars. The dataset covers a wide range of environments and conditions, including day and night, grasslands, farmlands, construction sites, and lakes. Our auto-labeling pipeline reconstructs dense terrain surfaces from LiDAR scans, extracts geometric attributes such as slope, elevation, and roughness, and assigns traversability labels beyond the robot's trajectory using a Mahalanobis-distance-based criterion. This design enables scalable, geometry-aware ground-truth construction without manual annotation. Finally, we establish a benchmark for voxel-level 3D traversability prediction and provide strong baselines under both single-modal and multi-modal settings. STONE is available at: https://konyul.github.io/STONE-dataset/
Chinese Translation
可靠的越野导航需要准确估计可通行区域,并在多样的地形和感知条件下具备强大的感知能力。然而,现有数据集在可扩展性和多模态性方面均存在不足,限制了 3D 可通行性预测的进展。在本研究中,我们介绍了 STONE,一个大规模的多模态越野导航数据集。STONE 提供了 (1) 由完全自动化、无注释管道生成的轨迹引导 3D 可通行性地图,以及 (2) 同步的 128 通道 LiDAR、六个 RGB 摄像头和三个 4D 成像雷达的全面全景感知。该数据集涵盖了广泛的环境和条件,包括白天和夜晚、草地、农田、施工现场和湖泊。我们的自动标注管道从 LiDAR 扫描中重建密集的地形表面,提取坡度、高度和粗糙度等几何属性,并使用基于马哈拉诺比斯距离的标准为机器人轨迹之外的区域分配可通行性标签。该设计使得在没有人工标注的情况下,能够可扩展地构建几何感知的真实标签。最后,我们建立了一个体素级 3D 可通行性预测的基准,并在单模态和多模态设置下提供了强有力的基线。STONE 数据集可在以下网址获取:https://konyul.github.io/STONE-dataset/
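The Mahalanobis-distance-based criterion above treats cells the robot actually drove over as samples of traversable geometry and scores cells beyond the trajectory by their distance to that distribution. A minimal sketch follows; the two-feature example, the threshold, and the covariance regularization are illustrative assumptions.

```python
import numpy as np

def traversability_labels(cells, driven, threshold=3.0):
    """Label terrain cells beyond the robot's own trajectory.

    cells  : (N, F) geometric attributes (e.g. slope, roughness) of
             candidate cells to label
    driven : (M, F) attributes of cells the robot actually traversed,
             treated as samples of traversable terrain
    A cell is traversable if its Mahalanobis distance to the driven
    distribution falls below the threshold.
    """
    mu = driven.mean(axis=0)
    cov = np.cov(driven, rowvar=False)
    cov_inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))  # regularized
    diff = cells - mu
    d2 = np.einsum('nf,fg,ng->n', diff, cov_inv, diff)
    return np.sqrt(d2) < threshold

rng = np.random.default_rng(2)
driven = rng.normal([0.1, 0.05], 0.02, size=(200, 2))  # gentle slope/roughness
cells = np.array([[0.1, 0.05],                         # resembles driven terrain
                  [0.9, 0.80]])                        # steep and rough
labels = traversability_labels(cells, driven)          # -> [True, False]
```

Using the Mahalanobis distance rather than a Euclidean one lets the criterion adapt to how variable each geometric attribute was along the trajectory, which is what makes the labeling scalable without manual thresholds per attribute.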
cs.RO / 31 / 2603.09188
Robust Spatiotemporal Motion Planning for Multi-Agent Autonomous Racing via Topological Gap Identification and Accelerated MPC
基于拓扑间隙识别和加速模型预测控制的多智能体自主赛车的鲁棒时空运动规划
Abstract
High-speed multi-agent autonomous racing demands robust spatiotemporal planning and precise control under strict computational limits. Current methods often oversimplify interactions or abandon strict kinematic constraints. We resolve this by proposing a Topological Gap Identification and Accelerated MPC framework. By predicting opponent behaviors via SGPs, our method constructs dynamic occupancy corridors to robustly select optimal overtaking gaps. We ensure strict kinematic feasibility using a Linear Time-Varying MPC powered by a customized Pseudo-Transient Continuation (PTC) solver for high-frequency execution. Experimental results on the F1TENTH platform show that our method significantly outperforms state-of-the-art baselines: it reduces total maneuver time by 51.6% in sequential scenarios, consistently maintains an overtaking success rate exceeding 81% in dense bottlenecks, and lowers average computational latency by 20.3%, pushing the boundaries of safe and high-speed autonomous racing.
Chinese Translation
高速多智能体自主赛车要求在严格的计算限制下进行鲁棒的时空规划和精确控制。目前的方法往往过于简化交互或放弃严格的运动学约束。我们通过提出一种拓扑间隙识别和加速模型预测控制(MPC)框架来解决这一问题。通过利用SGPs预测对手行为,我们的方法构建动态占用走廊,以鲁棒地选择最佳超车间隙。我们使用线性时变MPC,结合定制的伪瞬态延续(PTC)求解器,确保严格的运动学可行性,以实现高频执行。在F1TENTH平台上的实验结果表明,我们的方法显著优于现有的最先进基线:在顺序场景中总机动时间减少了51.6%,在密集瓶颈中超车成功率始终超过81%,平均计算延迟降低了20.3%,推动了安全和高速自主赛车的边界。
cs.RO / 32 / 2603.09194
WESPR: Wind-adaptive Energy-Efficient Safe Perception & Planning for Robust Flight with Quadrotors
WESPR:适应风的能效安全感知与规划框架,提升四旋翼飞行的鲁棒性
Abstract
Local wind conditions strongly influence drone performance: headwinds increase flight time, crosswinds and wind shear hinder agility in cluttered spaces, while tailwinds reduce travel time. Although adaptive controllers can mitigate turbulence, they remain unaware of the surrounding geometry that generates it, preventing proactive avoidance. Existing methods that model how wind interacts with the environment typically rely on computationally expensive fluid dynamics simulations, limiting real-time adaptation to new environments and conditions. To bridge this gap, we present WESPR, a fast framework that predicts how environmental geometry affects local wind conditions, enabling proactive path planning and control adaptation. Our lightweight pipeline integrates geometric perception and local weather data to estimate wind fields, compute cost-efficient paths, and adjust control strategies, all within 10 seconds. We validate WESPR on a Crazyflie drone navigating turbulent obstacle courses. Our results show a 12.5-58.7% reduction in maximum trajectory deviation and a 24.6% improvement in stability compared to a wind-agnostic adaptive controller.
Chinese Translation
局部风况对无人机性能有着显著影响:逆风增加飞行时间,侧风和风切变在拥挤空间中妨碍灵活性,而顺风则减少旅行时间。尽管自适应控制器可以缓解湍流,但它们对产生湍流的周围几何形状仍然缺乏感知,从而无法进行主动规避。现有的风与环境相互作用的建模方法通常依赖于计算成本高昂的流体动力学模拟,这限制了对新环境和条件的实时适应。为了解决这一问题,我们提出了WESPR,一个快速框架,预测环境几何如何影响局部风况,从而实现主动路径规划和控制适应。我们的轻量级流程整合了几何感知和局部气象数据,以估算风场、计算成本效益路径并调整控制策略——所有这些都在10秒内完成。我们在Crazyflie无人机上验证了WESPR,该无人机在存在湍流的障碍赛道中导航。我们的结果显示,与不考虑风的自适应控制器相比,最大轨迹偏差减少了12.5%-58.7%,稳定性提高了24.6%。
cs.RO / 33 / 2603.09218
Embodied Human Simulation for Quantitative Design and Analysis of Interactive Robotics
用于交互机器人定量设计与分析的具身人类仿真
Abstract
Physical interactive robotics, ranging from wearable devices to collaborative humanoid robots, require close coordination between mechanical design and control. However, evaluating interactive dynamics is challenging due to complex human biomechanics and motor responses. Traditional experiments rely on indirect metrics without measuring human internal states, such as muscle forces or joint loads. To address this issue, we develop a scalable simulation-based framework for the quantitative analysis of physical human-robot interaction. At its core is a full-body musculoskeletal model serving as a predictive surrogate for the human dynamical system. Driven by a reinforcement learning controller, it generates adaptive, physiologically grounded motor behaviors. We employ a sequential training pipeline where the pre-trained human motion control policy acts as a consistent evaluator, making large-scale design space exploration computationally tractable. By simulating the coupled human-robot system, the framework provides access to internal biomechanical metrics, offering a systematic way to concurrently co-optimize a robot's structural parameters and control policy. We demonstrate its capability in optimizing human-exoskeleton interactions, showing improved joint alignment and reduced contact forces. This work establishes embodied human simulation as a scalable paradigm for interactive robotics design.
Chinese Translation
物理交互机器人,从可穿戴设备到协作类人机器人,要求机械设计与控制之间的紧密协调。然而,由于复杂的人体生物力学和运动反应,评估交互动态具有挑战性。传统实验依赖间接指标,而未能测量人类的内部状态,例如肌肉力量或关节负荷。为了解决这一问题,我们开发了一个可扩展的基于仿真的框架,用于物理人机交互的定量分析。其核心是一个全身肌肉骨骼模型,作为人类动力系统的预测替代。该模型由强化学习控制器驱动,生成适应性强、符合生理基础的运动行为。我们采用一个顺序训练流程,其中预训练的人体运动控制策略作为一致的评估者,使大规模设计空间探索在计算上变得可行。通过模拟耦合的人机系统,该框架提供了对内部生物力学指标的访问,提供了一种系统的方法来同时共同优化机器人的结构参数和控制策略。我们展示了其在优化人类与外骨骼交互方面的能力,显示出关节对齐的改善和接触力的减少。这项工作确立了具身人类仿真作为交互机器人设计的可扩展范式。
cs.RO / 34 / 2603.09226
TRIP-Bag: A Portable Teleoperation System for Plug-and-Play Robotic Arms and Leaders
TRIP-Bag:一种用于即插即用机器人手臂与主导臂的便携式遥操作系统
Abstract
Large scale, diverse demonstration data for manipulation tasks remains a major challenge in learning-based robot policies. Existing in-the-wild data collection approaches often rely on vision-based pose estimation of hand-held grippers or gloves, which introduces an embodiment gap between the collection platform and the target robot. Teleoperation systems eliminate the embodiment gap, but are typically impractical to deploy outside the laboratory environment. We propose TRIP-Bag (Teleoperation, Recording, Intelligence in a Portable Bag), a portable, puppeteer-style teleoperation system fully contained within a commercial suitcase, as a practical solution for collecting high-fidelity manipulation data across varied settings. With a setup time of under five minutes and direct joint-to-joint teleoperation, TRIP-Bag enables rapid and reliable data collection in any environment. We validated TRIP-Bag's usability through experiments with non-expert users, showing that the system is intuitive and easy to operate. Furthermore, we confirmed the quality of the collected data by training benchmark manipulation policies, demonstrating its value as a practical resource for robot learning.
Chinese Translation
大规模、多样化的操作任务演示数据仍然是基于学习的机器人策略面临的主要挑战。现有的野外数据收集方法通常依赖于手持夹具或手套的基于视觉的姿态估计,这在收集平台与目标机器人之间引入了具身差距。遥操作系统消除了这种具身差距,但通常在实验室环境之外的部署不够实用。我们提出了TRIP-Bag(便携式包中的遥操作、记录、智能),这是一种完全装入商用手提箱的便携式木偶式遥操作系统,作为在不同环境中收集高保真操作数据的实用解决方案。TRIP-Bag的设置时间不到五分钟,并且支持直接的关节到关节遥操作,使其能够在任何环境中快速可靠地收集数据。我们通过与非专业用户的实验验证了TRIP-Bag的可用性,显示该系统直观且易于操作。此外,我们通过训练基准操作策略确认了收集数据的质量,证明其作为机器人学习的实用资源的价值。
cs.RO / 35 / 2603.09237
MO-Playground: Massively Parallelized Multi-Objective Reinforcement Learning for Robotics
MO-Playground:用于机器人技术的大规模并行多目标强化学习
Abstract
Multi-objective reinforcement learning (MORL) is a powerful tool to learn Pareto-optimal policy families across conflicting objectives. However, unlike traditional RL algorithms, existing MORL algorithms do not effectively leverage large-scale parallelization to concurrently simulate thousands of environments, resulting in vastly increased computation time. Ultimately, this has limited MORL's application towards complex multi-objective robotics problems. To address these challenges, we present 1) MORLAX, a new GPU-native, fast MORL algorithm, and 2) MO-Playground, a pip-installable playground of GPU-accelerated multi-objective environments. Together, MORLAX and MO-Playground approximate Pareto sets within minutes, offering 25-270x speed-ups compared to legacy CPU-based approaches whilst achieving superior Pareto front hypervolumes. We demonstrate the versatility of our approach by implementing a custom BRUCE humanoid robot environment using MO-Playground and learning Pareto-optimal locomotion policies across 6 realistic objectives for BRUCE, such as smoothness, efficiency and arm swinging.
Chinese Translation
多目标强化学习(MORL)是一种强大的工具,用于在相互冲突的目标之间学习帕累托最优策略家族。然而,与传统的强化学习(RL)算法不同,现有的MORL算法未能有效利用大规模并行化来同时模拟数千个环境,导致计算时间大幅增加。最终,这限制了MORL在复杂多目标机器人问题中的应用。为了解决这些挑战,我们提出了1)MORLAX,一种新的GPU原生快速MORL算法,以及2)MO-Playground,一个可通过pip安装的GPU加速多目标环境集。MORLAX和MO-Playground共同在几分钟内近似帕累托集,相比于传统的基于CPU的方法,提供了25-270倍的加速,同时实现了更优的帕累托前沿超体积。我们通过使用MO-Playground实现一个定制的BRUCE人形机器人环境,并在BRUCE的6个现实目标(如平滑性、效率和手臂摆动)上学习帕累托最优的运动策略,展示了我们方法的通用性。
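The Pareto-front hypervolume that the abstract uses as a quality metric is easy to picture in two dimensions. A minimal sketch of ours (not the MORLAX implementation), assuming both objectives are maximized against a fixed reference point:

```python
# Minimal 2D hypervolume sketch (ours, not the MORLAX implementation).
# Both objectives are maximized against reference point `ref`.

def pareto_front(points):
    """Keep the points not weakly dominated by any other point (maximization)."""
    return sorted(p for p in points
                  if not any(q != p and q[0] >= p[0] and q[1] >= p[1]
                             for q in points))

def hypervolume_2d(front, ref=(0.0, 0.0)):
    """Area dominated by `front` and bounded below by `ref`.
    Sorted by objective 1, a 2D front has decreasing objective 2, so the
    dominated region decomposes into vertical strips."""
    hv, prev_x = 0.0, ref[0]
    for x, y in sorted(front):
        hv += (x - prev_x) * (y - ref[1])
        prev_x = x
    return hv

front = pareto_front([(1, 3), (2, 2), (3, 1), (1, 1)])   # (1, 1) is dominated
hv = hypervolume_2d(front)
```

Higher-dimensional hypervolume needs dedicated algorithms; the strip decomposition above only works in 2D.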
cs.RO / 36 / 2603.09292
See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation
观察、规划、回溯:用于稳健机器人操作的进度感知视觉-语言-动作模型
Abstract
Measurement of task progress through explicit, actionable milestones is critical for robust robotic manipulation. This progress awareness enables a model to ground its current task status, anticipate verifiable intermediate states, and detect and recover from failures when progress stalls. To embody this capability, we introduce See, Plan, Rewind (SPR), a progress-aware vision-language-action framework that dynamically grounds language instructions into a sequence of spatial subgoals. SPR operates through a continuous core cycle, Seeing the current state and upcoming milestone, Planning a trajectory towards the next 2D waypoint, and Rewinding to a recoverable state upon failure by monitoring progress against the expected sequence. This closed-loop approach enables robust error correction without requiring additional training data or auxiliary models. Extensive experiments demonstrate the framework's effectiveness, generalization and robustness: SPR outperforms the MolmoAct baseline by 5% on the LIBERO benchmark. On the challenging LIBERO-Plus benchmark with unseen instructions and initial states, SPR achieves state-of-the-art robustness with the smallest performance drop, surpassing OpenVLA-OFT and UniVLA, demonstrating superior out-of-distribution robustness.
Chinese Translation
通过明确的可操作里程碑测量任务进度对于稳健的机器人操作至关重要。这种进度感知使模型能够确定当前任务状态,预测可验证的中间状态,并在进度停滞时检测故障并从中恢复。为体现这一能力,我们提出了观察、规划、回溯(See, Plan, Rewind,SPR),一个进度感知的视觉-语言-动作框架,动态地将语言指令转化为一系列空间子目标。SPR通过一个连续的核心循环运行:观察当前状态和即将到来的里程碑,规划朝向下一个二维航点的轨迹,并通过对照预期序列监测进度,在失败时回溯到可恢复状态。这种闭环方法实现了稳健的错误纠正,而无需额外的训练数据或辅助模型。大量实验表明该框架的有效性、泛化能力和稳健性:在LIBERO基准测试中,SPR比MolmoAct基线提高了5%。在具有未见指令和初始状态的挑战性LIBERO-Plus基准测试中,SPR以最小的性能下降实现了最先进的稳健性,超越了OpenVLA-OFT和UniVLA,展现出优越的分布外稳健性。
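At its core, the See-Plan-Rewind cycle is a milestone-verified executor with rollback. A toy sketch under hypothetical names of ours; integer states stand in for the visual subgoal verification the paper performs:

```python
# Toy sketch of the See-Plan-Rewind cycle: a milestone-verified executor with
# rollback. Names and structure are ours, not the paper's API; each milestone
# is a predicate on the (here: integer) state.

def run_spr(milestones, step_fn, state, max_steps=50, patience=3):
    reached, stalled = 0, 0
    checkpoints = [state]                     # last verified, recoverable states
    for _ in range(max_steps):
        if reached == len(milestones):
            return True, state                # all milestones verified
        state = step_fn(state, reached)       # Plan + act toward next milestone
        if milestones[reached](state):        # See: verify expected progress
            reached += 1
            stalled = 0
            checkpoints.append(state)
        else:
            stalled += 1
            if stalled > patience:            # Rewind when progress stalls
                state = checkpoints[-1]
                stalled = 0
    return reached == len(milestones), state

ok, final_state = run_spr(
    milestones=[lambda s: s >= 2, lambda s: s >= 4],
    step_fn=lambda s, k: s + 1,               # always advances in this toy
    state=0,
)
```

The rollback branch is what distinguishes this loop from a plain subgoal executor: failure is detected purely from stalled progress, without auxiliary models.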
cs.RO / 37 / 2603.09298
CORAL: Scalable Multi-Task Robot Learning via LoRA Experts
CORAL:通过 LoRA 专家实现可扩展的多任务机器人学习
Abstract
Deploying Vision-Language-Action (VLA) models in real-world robotics exposes a core multi-task learning challenge: reconciling task interference in multi-task robotic learning. When multiple tasks are jointly fine-tuned in a single stage, gradients from different tasks can conflict, causing negative transfer and reducing per-task performance. Yet maintaining a separate full checkpoint per task is often storage- and deployment-prohibitive. To address this dilemma, we present CORAL, a backbone- and embodiment-agnostic framework designed primarily to mitigate multi-task interference while remaining naturally extensible to a continuous stream of new tasks. CORAL freezes a single pre-trained VLA backbone and attaches one lightweight Low-Rank Adaptation (LoRA) expert per task; at runtime, a dynamic inference engine (the CORAL Manager) routes language instructions to the appropriate expert and swaps experts on the fly with zero inference overhead. This strict parameter isolation avoids complex gating networks and prevents parameter-level cross-task interference by construction; as an added capability, it also enables sequentially introducing new tasks without parameter overwriting caused by catastrophic forgetting. We validate CORAL on a real-world Galaxea R1 dual-arm mobile manipulator and three simulation benchmarks (LIBERO, WidowX, Google Robot), where CORAL overcomes fine-grained instructional ambiguity and substantially outperforms joint training, yielding a practical and scalable system for lifelong multi-task robot learning. Website: https://frontierrobo.github.io/CORAL
Chinese Translation
在现实世界的机器人应用中部署视觉-语言-动作(VLA)模型暴露了一个核心的多任务学习挑战:调和多任务机器人学习中的任务干扰。当多个任务在单一阶段共同微调时,不同任务的梯度可能会发生冲突,导致负迁移并降低每个任务的性能。然而,为每个任务维护一个独立的完整检查点通常在存储和部署上是不可行的。为了解决这一困境,我们提出了 CORAL,这是一个与骨干网络和具身形式无关的框架,主要旨在减轻多任务干扰,同时可自然地扩展到持续流入的新任务。CORAL 冻结一个单一的预训练 VLA 骨干,并为每个任务附加一个轻量级的低秩适应(LoRA)专家;在运行时,动态推理引擎(CORAL 管理器)将语言指令路由到适当的专家,并以零推理开销动态切换专家。这种严格的参数隔离避免了复杂的门控网络,并通过构造防止了参数级别的跨任务干扰;作为额外的能力,它还能够顺序引入新任务,而不会因灾难性遗忘而导致参数覆盖。我们在现实世界的 Galaxea R1 双臂移动操纵器和三个仿真基准(LIBERO、WidowX、Google Robot)上验证了 CORAL,结果表明 CORAL 克服了细粒度指令模糊性,并显著优于联合训练,提供了一个实用且可扩展的终身多任务机器人学习系统。网站:https://frontierrobo.github.io/CORAL
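The expert-per-task mechanism is easy to sketch for a single frozen linear layer. Keyword matching stands in for the CORAL Manager's routing, and all shapes are toy; the real system attaches learned adapters throughout a VLA backbone:

```python
import numpy as np

# Sketch of per-task LoRA experts over one frozen linear layer, with keyword
# routing standing in for the CORAL Manager (our simplification: the real
# system routes full language instructions to learned adapters).

rng = np.random.default_rng(0)
d, rank = 4, 2
W = rng.normal(size=(d, d))                  # frozen backbone weight

class LoRAExpert:
    def __init__(self):
        self.A = rng.normal(size=(rank, d))
        self.B = np.zeros((d, rank))         # zero-init: delta starts at zero

    def delta(self, x):
        return self.B @ (self.A @ x)         # low-rank update B @ A @ x

experts = {"wipe": LoRAExpert(), "pour": LoRAExpert()}

def forward(instruction, x):
    """Route to the expert whose task keyword appears in the instruction;
    unknown tasks fall back to the frozen backbone alone."""
    task = next((k for k in experts if k in instruction), None)
    y = W @ x
    return y if task is None else y + experts[task].delta(x)

x = np.ones(d)
experts["wipe"].B = 0.1 * np.ones((d, rank))     # pretend this expert is trained
y_wipe = forward("wipe the table", x)
y_pour = forward("pour water", x)                # untrained expert: B is zero
y_base = forward("open the drawer", x)           # no matching expert
```

Because each expert is only a low-rank delta on a shared frozen weight, swapping experts is a pointer change, which is the "zero inference overhead" claim in the abstract.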
cs.RO / 38 / 2603.09319
NLiPsCalib: An Efficient Calibration Framework for High-Fidelity 3D Reconstruction of Curved Visuotactile Sensors
NLiPsCalib:用于曲面视觉触觉传感器高保真三维重建的高效校准框架
Abstract
Recent advances in visuotactile sensors increasingly employ biomimetic curved surfaces to enhance sensorimotor capabilities. Although such curved visuotactile sensors enable more conformal object contact, their perceptual quality is often degraded by non-uniform illumination, which reduces reconstruction accuracy and typically necessitates calibration. Existing calibration methods commonly rely on customized indenters and specialized devices to collect large-scale photometric data, but these processes are expensive and labor-intensive. To overcome these calibration challenges, we present NLiPsCalib, a physics-consistent and efficient calibration framework for curved visuotactile sensors. NLiPsCalib integrates controllable near-field light sources and leverages Near-Light Photometric Stereo (NLiPs) to estimate contact geometry, simplifying calibration to just a few simple contacts with everyday objects. We further introduce NLiPsTac, a controllable-light-source tactile sensor developed to validate our framework. Experimental results demonstrate that our approach enables high-fidelity 3D reconstruction across diverse curved form factors with a simple calibration procedure. We emphasize that our approach lowers the barrier to developing customized visuotactile sensors of diverse geometries, thereby making visuotactile sensing more accessible to the broader community.
Chinese Translation
近年来,视觉触觉传感器的进步越来越多地采用仿生曲面,以增强感知运动能力。尽管这种曲面视觉触觉传感器能够实现更贴合的物体接触,但其感知质量往往受到不均匀照明的影响,从而降低重建精度,并通常需要进行校准。现有的校准方法通常依赖于定制的压头和专用设备来收集大规模的光度数据,但这些过程成本高昂且劳动密集。为了解决这些校准挑战,我们提出了NLiPsCalib,一个物理一致且高效的曲面视觉触觉传感器校准框架。NLiPsCalib集成了可控的近场光源,并利用近光光度立体视觉(Near-Light Photometric Stereo, NLiPs)来估计接触几何形状,将校准简化为仅需与日常物体进行少量简单接触。我们进一步介绍了NLiPsTac,一个为验证本框架而开发的可控光源触觉传感器。实验结果表明,我们的方法能够在多种曲面形态下实现高保真的三维重建,并且校准过程简单。我们强调,我们的方法降低了开发多种几何形状定制视觉触觉传感器的门槛,从而使视觉触觉感知对更广泛的社区更加可及。
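As background for the calibration idea, the least-squares core of classic distant-light Lambertian photometric stereo is a few lines; NLiPsCalib addresses the harder near-light case, which this sketch does not model:

```python
import numpy as np

# Background sketch of the least-squares core of photometric stereo under the
# classic *distant*-light Lambertian model I = albedo * (L @ n). NLiPsCalib
# targets the near-light case; this only illustrates how per-light intensities
# pin down a surface normal.

def solve_normal(light_dirs, intensities):
    """light_dirs: (m, 3) unit rows; intensities: (m,). Returns (n, albedo)."""
    g, *_ = np.linalg.lstsq(light_dirs, intensities, rcond=None)
    albedo = np.linalg.norm(g)
    return g / albedo, albedo

# Synthetic check: render a known normal with three lights, then recover it.
n_true = np.array([0.0, 0.6, 0.8])
L = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
L /= np.linalg.norm(L, axis=1, keepdims=True)
I = 0.5 * (L @ n_true)                       # albedo 0.5, no shadows or noise
n_est, albedo = solve_normal(L, I)
```

With near-field sources the light direction varies per surface point, so the constant `L` above no longer holds; that position dependence is exactly what the NLiPs formulation has to account for.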
cs.RO / 39 / 2603.09399
Vision-Augmented On-Track System Identification for Autonomous Racing via Attention-Based Priors and Iterative Neural Correction
基于注意力先验与迭代神经校正的自主赛车视觉增强赛道内系统辨识
Abstract
Operating autonomous vehicles at the absolute limits of handling requires precise, real-time identification of highly non-linear tire dynamics. However, traditional online optimization methods suffer from "cold-start" initialization failures and struggle to model high-frequency transient dynamics. To address these bottlenecks, this paper proposes a novel vision-augmented, iterative system identification framework. First, a lightweight CNN (MobileNetV3) translates visual road textures into a continuous heuristic friction prior, providing a robust "warm-start" for parameter optimization. Next, an S4 model captures complex temporal dynamic residuals, circumventing the memory and latency limitations of traditional MLPs and RNNs. Finally, a derivative-free Nelder-Mead algorithm iteratively extracts physically interpretable Pacejka tire parameters via a hybrid virtual simulation. Co-simulation in CarSim demonstrates that the lightweight vision backbone reduces friction estimation error by 76.1% using 85% fewer FLOPs, accelerating cold-start convergence by 71.4%. Furthermore, the S4-augmented framework improves parameter extraction accuracy and decreases lateral force RMSE by over 60% by effectively capturing complex vehicle dynamics, demonstrating superior performance compared to conventional neural architectures.
Chinese Translation
在操控极限条件下运行自主车辆需要对高度非线性的轮胎动态进行精确的实时识别。然而,传统的在线优化方法在“冷启动”初始化时常常失败,并且难以建模高频瞬态动态。为了解决这些瓶颈,本文提出了一种新颖的视觉增强迭代系统辨识框架。首先,轻量级卷积神经网络(CNN,MobileNetV3)将视觉道路纹理转化为连续的启发式摩擦先验,为参数优化提供了稳健的“热启动”。接下来,S4模型捕捉复杂的时间动态残差,规避了传统多层感知器(MLP)和递归神经网络(RNN)的内存和延迟限制。最后,无导数的Nelder-Mead算法通过混合虚拟仿真迭代提取物理上可解释的Pacejka轮胎参数。在CarSim中的联合仿真表明,轻量级视觉主干在FLOPs减少85%的情况下将摩擦估计误差降低了76.1%,并将冷启动收敛速度提升了71.4%。此外,S4增强框架提高了参数提取的准确性,并通过有效捕捉复杂的车辆动态将侧向力均方根误差(RMSE)降低了60%以上,展示了相较于传统神经架构的优越性能。
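The parameter-extraction step can be pictured with a simplified magic formula and a derivative-free search. The paper uses Nelder-Mead; a plain coordinate search stands in below to keep the sketch dependency-free, and every value is illustrative:

```python
import numpy as np

# Illustrative sketch: fit a simplified Pacejka "magic formula"
# F = D * sin(C * arctan(B * s)) to slip/force samples with a derivative-free
# search. The paper uses Nelder-Mead; a plain coordinate (compass) search
# stands in here. All values are toy.

def pacejka(s, B, C, D):
    return D * np.sin(C * np.arctan(B * s))

def fit_pacejka(slip, force, x0=(6.0, 1.2, 1.0), step=0.5, iters=200):
    loss = lambda p: float(np.mean((pacejka(slip, *p) - force) ** 2))
    x = np.array(x0, dtype=float)
    best = loss(x)
    for _ in range(iters):
        improved = False
        for i in range(3):                    # probe each parameter axis
            for d in (step, -step):
                cand = x.copy()
                cand[i] += d
                c = loss(cand)
                if c < best:
                    x, best, improved = cand, c, True
        if not improved:
            step *= 0.5                       # shrink, like a simplex contraction
    return x, best

slip = np.linspace(-0.3, 0.3, 61)
force = pacejka(slip, B=8.0, C=1.4, D=0.9)    # noise-free synthetic truth
params, mse = fit_pacejka(slip, force)
```

The warm-start idea in the abstract maps onto `x0`: a friction prior from the vision backbone shrinks the region this search has to cover, which is why it accelerates cold-start convergence.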
cs.RO / 40 / 2603.09415
From Flow to One Step: Real-Time Multi-Modal Trajectory Policies via Implicit Maximum Likelihood Estimation-based Distribution Distillation
从流到单步:基于隐式最大似然估计的分布蒸馏实现实时多模态轨迹策略
Abstract
Generative policies based on diffusion and flow matching achieve strong performance in robotic manipulation by modeling multi-modal human demonstrations. However, their reliance on iterative Ordinary Differential Equation (ODE) integration introduces substantial latency, limiting high-frequency closed-loop control. Recent single-step acceleration methods alleviate this overhead but often exhibit distributional collapse, producing averaged trajectories that fail to execute coherent manipulation strategies. We propose a framework that distills a Conditional Flow Matching (CFM) expert into a fast single-step student via Implicit Maximum Likelihood Estimation (IMLE). A bi-directional Chamfer distance provides a set-level objective that promotes both mode coverage and fidelity, enabling preservation of the teacher multi-modal action distribution in a single forward pass. A unified perception encoder further integrates multi-view RGB, depth, point clouds, and proprioception into a geometry-aware representation. The resulting high-frequency control supports real-time receding-horizon re-planning and improved robustness under dynamic disturbances.
Chinese Translation
基于扩散和流匹配的生成策略通过建模多模态人类示范在机器人操作中取得了良好的表现。然而,它们对迭代常微分方程(ODE)积分的依赖引入了显著的延迟,限制了高频闭环控制。最近的单步加速方法缓解了这一开销,但往往表现出分布崩溃,产生的平均轨迹无法执行连贯的操作策略。我们提出了一种框架,通过隐式最大似然估计(IMLE)将条件流匹配(CFM)专家蒸馏为快速的单步学生模型。双向Chamfer距离提供了一种集合级目标,同时促进模式覆盖和保真度,使得在单次前向传递中保留教师的多模态动作分布。统一的感知编码器进一步将多视角RGB、深度、点云和本体感觉整合为几何感知表示。最终生成的高频控制支持实时的滚动时域重规划,并在动态干扰下提高了鲁棒性。
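The bidirectional Chamfer distance named above is the piece that makes the distillation objective set-level. A tiny numpy sketch with illustrative data:

```python
import numpy as np

# Sketch of the bidirectional Chamfer objective: each student sample is matched
# to its nearest teacher sample and vice versa, so covering every teacher mode
# is rewarded and collapse onto the mean is penalized. Trajectories are
# flattened to vectors here; the data is purely illustrative.

def chamfer(set_a, set_b):
    """set_a: (n, d), set_b: (m, d) arrays of flattened trajectories."""
    d2 = ((set_a[:, None, :] - set_b[None, :, :]) ** 2).sum(-1)   # (n, m)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

teacher = np.array([[0.0, 1.0], [0.0, -1.0]])      # two distinct modes
collapsed = np.array([[0.0, 0.0], [0.0, 0.0]])     # averaged, mode-collapsed
covering = np.array([[0.0, 0.9], [0.0, -0.9]])     # one sample near each mode
```

`chamfer(covering, teacher)` is small while `chamfer(collapsed, teacher)` is large, which is exactly the collapse signal an averaged single-step student would trigger.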
cs.RO / 41 / 2603.09458
Stein Variational Ergodic Surface Coverage with SE(3) Constraints
带有 SE(3) 约束的 Stein 变分遍历表面覆盖
Abstract
Surface manipulation tasks require robots to generate trajectories that comprehensively cover complex 3D surfaces while maintaining precise end-effector poses. Existing ergodic trajectory optimization (TO) methods demonstrate success in coverage tasks, while struggling with point-cloud targets due to the nonconvex optimization landscapes and the inadequate handling of SE(3) constraints in sampling-as-optimization (SAO) techniques. In this work, we introduce a preconditioned SE(3) Stein Variational Gradient Descent (SVGD) approach for SAO ergodic trajectory generation. Our proposed approach comprises multiple innovations. First, we reformulate point-cloud ergodic coverage as a manifold-aware sampling problem. Second, we derive SE(3)-specific SVGD particle updates, and, third, we develop a preconditioner to accelerate TO convergence. Our sampling-based framework consistently identifies superior local optima compared to strong optimization-based and SAO baselines while preserving the SE(3) geometric structure. Experiments on a 3D point-cloud surface coverage benchmark and robotic surface drawing tasks demonstrate that our method achieves superior coverage quality with tractable computation in our setting relative to existing TO and SAO approaches, and is validated in real-world robot experiments.
Chinese Translation
表面操作任务要求机器人生成能够全面覆盖复杂三维表面的轨迹,同时保持精确的末端执行器姿态。现有的遍历轨迹优化(TO)方法在覆盖任务中取得了一定成功,但在处理点云目标时面临挑战,主要原因是非凸的优化地形以及采样即优化(sampling-as-optimization, SAO)技术对 SE(3) 约束的处理不足。在本研究中,我们提出了一种预条件的 SE(3) Stein 变分梯度下降(SVGD)方法用于 SAO 遍历轨迹生成。我们提出的方法包含多个创新。首先,我们将点云遍历覆盖重新表述为一个流形感知的采样问题。其次,我们推导出 SE(3) 特定的 SVGD 粒子更新;第三,我们开发了一种预条件器以加速 TO 收敛。我们的基于采样的框架在识别更优局部最优解方面始终优于强大的基于优化的基线和 SAO 基线,同时保持 SE(3) 的几何结构。在三维点云表面覆盖基准和机器人表面绘制任务上的实验表明,相较于现有的 TO 和 SAO 方法,我们的方法以可处理的计算开销实现了更优的覆盖质量,并在真实机器人实验中得到了验证。
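The SVGD particle update that the paper lifts to SE(3) is compact in the Euclidean case. A vanilla sketch with an RBF kernel on a toy 1-D Gaussian target; the preconditioning and manifold geometry of the paper are omitted:

```python
import numpy as np

# Vanilla (Euclidean) SVGD with an RBF kernel on a toy 1-D Gaussian target.
# The paper's contribution is the preconditioned SE(3) analogue; this shows
# only the base particle update being lifted to the manifold:
#   phi(x_i) = mean_j [ k(x_j, x_i) * score(x_j) + grad_{x_j} k(x_j, x_i) ]

def svgd_step(x, score, eps=0.1, h=1.0):
    """x: (n, d) particles; score: (n, d) -> (n, d) grad of log target density."""
    diff = x[None, :, :] - x[:, None, :]            # diff[i, j] = x_j - x_i
    k = np.exp(-(diff ** 2).sum(-1) / h)            # RBF kernel k[i, j]
    drive = k @ score(x)                            # kernel-weighted ascent
    repulse = -(2.0 / h) * (k[..., None] * diff).sum(axis=1)
    return x + eps * (drive + repulse) / len(x)

rng = np.random.default_rng(1)
x = rng.normal(size=(30, 1)) + 3.0                  # biased initialization
for _ in range(300):
    x = svgd_step(x, score=lambda p: -p)            # target N(0, 1): score = -x
```

The repulsion term is what keeps the particle set spread over the target rather than collapsing to its mode, which is the property the paper relies on to escape poor local optima.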
cs.RO / 42 / 2603.09460
SEA-Nav: Efficient Policy Learning for Safe and Agile Quadruped Navigation in Cluttered Environments
SEA-Nav:面向杂乱环境中安全敏捷四足导航的高效策略学习
Abstract
Efficiently training quadruped robot navigation in densely cluttered environments remains a significant challenge. Existing methods are either limited by a lack of safety and agility in simple obstacle distributions or suffer from slow locomotion in complex environments, often requiring excessively long training phases. To this end, we propose SEA-Nav (Safe, Efficient, and Agile Navigation), a reinforcement learning framework for quadruped navigation. Within diverse and dense obstacle environments, a differentiable control barrier function (CBF)-based shield constrains the navigation policy to output safe velocity commands. An adaptive collision replay mechanism and hazardous exploration rewards are introduced to increase the probability of learning from critical experiences, guiding efficient exploration and exploitation. Finally, kinematic action constraints are incorporated to ensure safe velocity commands, facilitating successful physical deployment. To the best of our knowledge, this is the first approach that achieves highly challenging quadruped navigation in the real world with minute-level training time.
Chinese Translation
在密集杂乱的环境中高效训练四足机器人导航仍然是一个重大挑战。现有方法要么因在简单障碍分布中缺乏安全性和敏捷性而受到限制,要么在复杂环境中运动缓慢,通常需要过长的训练阶段。为此,我们提出了SEA-Nav(安全、高效、敏捷的导航),这是一个用于四足机器人导航的强化学习框架。在多样且密集的障碍环境中,基于可微分控制屏障函数(CBF)的保护机制约束导航策略输出安全的速度指令。引入了一种自适应碰撞重放机制和危险探索奖励,以提高从关键经验中学习的概率,指导高效的探索和利用。最后,结合运动学动作约束以确保安全的速度指令,从而促进成功的物理部署。据我们所知,这是首个在现实世界中以分钟级训练时间实现高度挑战性四足导航的方法。
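The CBF shield idea reduces, in one dimension, to a closed-form projection of the commanded velocity onto a safe set. A toy sketch under our own simplified barrier; SEA-Nav's actual shield is differentiable and embedded in RL training:

```python
# Minimal 1-D sketch of a CBF-based velocity shield (our toy, not SEA-Nav's
# formulation): project the desired forward velocity onto the safe set
# {u : h_dot(p, u) >= -alpha * h(p)}. With h(p) = d_obs - p - margin we get
# h_dot = -u, so the usual CBF quadratic program reduces to a closed-form clip.

def cbf_shield(u_des, p, d_obs, margin=0.2, alpha=2.0):
    h = d_obs - p - margin          # barrier value: distance into the safe set
    u_max = alpha * h               # constraint -u >= -alpha*h  =>  u <= alpha*h
    return min(u_des, u_max)

far = cbf_shield(u_des=1.0, p=0.0, d_obs=5.0)    # h large: command untouched
near = cbf_shield(u_des=1.0, p=4.7, d_obs=5.0)   # h = 0.1: clipped to 0.2
```

The key property is that the shield is minimally invasive: far from the obstacle the learned policy's command passes through unchanged, and only near the boundary is it attenuated, which lets the policy keep exploring aggressively during training.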
cs.RO / 43 / 2603.09473
Receptogenesis in a Vascularized Robotic Embodiment
血管化机器人实体中的感受发生
Abstract
Equipping robotic systems with the capacity to generate $\textit{ex novo}$ hardware during operation extends control of physical adaptability. Unlike modular systems that rely on discrete component integration pre- or post-deployment, we envision the possibility that physical adaptation and development emerge from dynamic material restructuring to shape the body's intrinsic functions. Drawing inspiration from circulatory systems that redistribute mass and function in biological organisms, we utilize fluidics to restructure the material interface, a capability currently unpaired in robotics. Here, we realize this synthetic growth capability through a vascularized robotic composite designed for programmable material synthesis, demonstrated via receptogenesis - the on-demand construction of sensors from internal fluid reserves based on environmental cues. By coordinating the fluidic transport of precursors with external localized UV irradiation, we drive an $\textit{in situ}$ photopolymerization that chemically reconstructs the vasculature from the inside out. This reaction converts precursors with photolatent initiator into a solid dispersion of UV-sensitive polypyrrole, establishing a sensing modality validated by a characteristic decrease in electrical impedance. The newly synthesized sensor closed a control loop to regulate wing flapping in a moth-inspired robotic demonstrator. This physical update increased the robot's capability in real time. This work establishes a materials-based framework for constitutive evolution, enabling robots to physically grow the hardware needed to support emerging behaviors in a complex environment; for example, suggesting a pathway toward autonomous systems capable of generating specialized features, such as neurovascular systems in situated robotics.
Chinese Translation
为机器人系统赋予在运行过程中从无到有(ex novo)生成硬件的能力,扩展了对物理适应性的控制。与依赖于部署前后离散组件集成的模块化系统不同,我们设想物理适应和发展可以通过动态材料重构来实现,以塑造身体的内在功能。受到生物有机体中循环系统重新分配质量和功能的启发,我们利用流体技术重构材料界面,这一能力在机器人领域尚未实现。在此,我们通过一种为可编程材料合成而设计的血管化机器人复合材料实现了这种合成生长能力,并以感受发生为例加以展示——即根据环境线索从内部流体储备按需构建传感器。通过协调前体的流体运输与外部局部紫外线照射,我们驱动了一种原位(in situ)光聚合反应,从内部化学重构血管。这一反应将含有潜光引发剂的前体转化为紫外线敏感的聚吡咯固体分散体,建立了一种通过特征性电阻抗降低来验证的传感模式。新合成的传感器闭合了控制回路,以调节一种受蛾类启发的机器人演示器的翅膀拍打。这一物理更新实时增强了机器人的能力。本研究建立了一种基于材料的构成演化框架,使机器人能够物理性地生长所需的硬件,以支持在复杂环境中出现的行为;例如,指明了一条通向能够生成特定功能(如情境机器人中的神经血管系统)的自主系统的路径。
cs.RO / 44 / 2603.09482
StyleVLA: Driving Style-Aware Vision Language Action Model for Autonomous Driving
StyleVLA:用于自动驾驶的驾驶风格感知视觉语言动作模型
Abstract
Vision Language Models (VLMs) bridge visual perception and linguistic reasoning. In Autonomous Driving (AD), this synergy has enabled Vision Language Action (VLA) models, which translate high-level multimodal understanding into driving behaviors, typically represented as future trajectories. However, existing VLA models mainly generate generic collision-free trajectories. Beyond collision avoidance, adapting to diverse driving styles (e.g., sporty, comfortable) is essential for personalized driving. Moreover, many methods treat trajectory generation as naive token prediction, which can produce kinematically infeasible actions. To address these limitations, we present StyleVLA, a physics-informed VLA framework for generating diverse and physically plausible driving behaviors. We introduce a hybrid loss that combines a kinematic consistency constraint with a continuous regression head to improve trajectory feasibility. To train StyleVLA, built on Qwen3-VL-4B, we construct a large-scale instruction dataset with over 1.2k scenarios, 76k Bird's Eye View (BEV) samples, and 42k First Person View (FPV) samples, with ground-truth trajectories for five driving styles and natural-language instructions. Experiments show that our 4B-parameter StyleVLA significantly outperforms proprietary models (e.g., Gemini-3-Pro) and state-of-the-art VLA models. Using a composite driving score measuring success rate, physical feasibility, and style adherence, StyleVLA achieves 0.55 on BEV and 0.51 on FPV, versus 0.32 and 0.35 for Gemini-3-Pro. These results show that a specialized, physics-informed, lightweight model can surpass closed-source models on domain-specific tasks.
Chinese Translation
视觉语言模型(VLMs)连接了视觉感知与语言推理。在自动驾驶(AD)中,这种协同作用使得视觉语言动作(VLA)模型得以实现,这些模型将高层次的多模态理解转化为驾驶行为,通常表现为未来轨迹。然而,现有的VLA模型主要生成通用的无碰撞轨迹。除了避免碰撞,适应多样化的驾驶风格(例如,运动型、舒适型)对于个性化驾驶至关重要。此外,许多方法将轨迹生成视为简单的标记预测,这可能产生运动学上不可行的动作。为了解决这些局限性,我们提出了StyleVLA,一个基于物理知识的VLA框架,用于生成多样化且物理上合理的驾驶行为。我们引入了一种混合损失,结合了运动学一致性约束和连续回归头,以提高轨迹的可行性。为了训练StyleVLA,我们基于Qwen3-VL-4B构建了一个大规模的指令数据集,包含超过1200个场景、76000个鸟瞰图(BEV)样本和42000个第一人称视角(FPV)样本,涵盖五种驾驶风格的真实轨迹和自然语言指令。实验表明,我们的4B参数StyleVLA显著优于专有模型(例如,Gemini-3-Pro)和最先进的VLA模型。使用综合驾驶评分来衡量成功率、物理可行性和风格遵循,StyleVLA在BEV上达到了0.55,在FPV上达到了0.51,而Gemini-3-Pro分别为0.32和0.35。这些结果表明,一个专门的、基于物理知识的轻量级模型可以在特定领域任务上超越闭源模型。
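The hybrid loss above combines a continuous regression head with a kinematic consistency term. A sketch using a finite-difference acceleration penalty on predicted waypoints; the constraint form, time step, acceleration bound, and weight are our illustrative assumptions, not StyleVLA's actual objective:

```python
import numpy as np

# Sketch of a physics-informed hybrid trajectory loss in the spirit described
# above: a continuous regression term plus a kinematic-feasibility penalty on
# finite-difference acceleration. The constraint form, dt, a_max, and lam are
# our illustrative assumptions, not StyleVLA's actual objective.

def hybrid_loss(pred, target, dt=0.1, a_max=3.0, lam=1.0):
    """pred, target: (T, 2) waypoint sequences sampled every dt seconds."""
    reg = np.mean((pred - target) ** 2)                  # regression head term
    acc = np.diff(pred, n=2, axis=0) / dt**2             # finite-diff accel
    infeas = np.maximum(np.abs(acc) - a_max, 0.0)        # violations only
    return reg + lam * np.mean(infeas ** 2)

t = np.linspace(0.0, 1.0, 11)[:, None]
smooth = np.hstack([t, t])            # constant-velocity path: zero penalty
jerky = smooth.copy()
jerky[5] += 0.5                       # one kink: huge finite-diff acceleration
loss_smooth = hybrid_loss(smooth, smooth)
loss_jerky = hybrid_loss(jerky, smooth)
```

Treating waypoints as continuous regression targets (rather than discrete tokens) is what lets a penalty like this backpropagate; a token-prediction head has no comparable notion of a small kinematic violation.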
cs.RO / 45 / 2603.09513
Beyond Short-Horizon: VQ-Memory for Robust Long-Horizon Manipulation in Non-Markovian Simulation Benchmarks
超越短期视野:用于非马尔可夫仿真基准中稳健长远操作的VQ-记忆
Abstract
The high cost of collecting real-robot data has made robotic simulation a scalable platform for both evaluation and data generation. Yet most existing benchmarks concentrate on simple manipulation tasks such as pick-and-place, failing to capture the non-Markovian characteristics of real-world tasks and the complexity of articulated object interactions. To address this limitation, we present RuleSafe, a new articulated manipulation benchmark built upon a scalable LLM-aided simulation framework. RuleSafe features safes with diverse unlocking mechanisms, such as key locks, password locks, and logic locks, which require different multi-stage reasoning and manipulation strategies. These LLM-generated rules produce non-Markovian and long-horizon tasks that require temporal modeling and memory-based reasoning. We further propose VQ-Memory, a compact and structured temporal representation that uses vector-quantized variational autoencoders (VQ-VAEs) to encode past proprioceptive states into discrete latent tokens. This representation filters low-level noise while preserving high-level task-phase context, providing lightweight yet robust temporal cues that are compatible with existing Vision-Language-Action models (VLA). Extensive experiments on state-of-the-art VLA models and diffusion policies show that VQ-Memory consistently improves long-horizon planning, enhances generalization to unseen configurations, and enables more efficient manipulation with reduced computational cost. Project page: vqmemory.github.io
Chinese Translation
收集真实机器人数据的高成本使得机器人仿真成为一个可扩展的平台,用于评估和数据生成。然而,现有的大多数基准集中于简单的操作任务,如拾取和放置,未能捕捉到现实世界任务的非马尔可夫特性和关节物体交互的复杂性。为了解决这一局限性,我们提出了RuleSafe,一个基于可扩展的LLM辅助仿真框架的新型关节操作基准。RuleSafe的特点是具有多种解锁机制的保险箱,如钥匙锁、密码锁和逻辑锁,这些机制需要不同的多阶段推理和操作策略。这些LLM生成的规则产生了需要时间建模和基于记忆推理的非马尔可夫长远任务。我们进一步提出了VQ-记忆,这是一种紧凑且结构化的时间表示,使用向量量化变分自编码器(VQ-VAE)将过去的本体感觉状态编码为离散潜在标记。这种表示在保留高级任务阶段上下文的同时过滤低级噪声,提供轻量且稳健的时间线索,并与现有的视觉-语言-动作(VLA)模型兼容。在最先进的VLA模型和扩散策略上的广泛实验表明,VQ-记忆持续改善长远规划,增强对未见配置的泛化能力,并在降低计算成本的同时实现更高效的操作。项目页面:vqmemory.github.io
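The discretization step behind VQ-Memory reduces to nearest-neighbor codebook lookup. A sketch with a hand-set codebook; in a VQ-VAE the codebook is learned jointly with the encoder:

```python
import numpy as np

# Sketch of the vector-quantization step behind VQ-Memory: map each continuous
# proprioceptive feature to the index of its nearest codebook vector, giving a
# discrete token stream for downstream policies. The codebook here is hand-set;
# in a VQ-VAE it is learned jointly with the encoder.

def quantize(z, codebook):
    """z: (t, d) features; codebook: (k, d). Returns (tokens, reconstruction)."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (t, k)
    tokens = d2.argmin(axis=1)                                   # nearest code
    return tokens, codebook[tokens]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.9], [1.1, 0.8]])
tokens, z_q = quantize(z, codebook)
```

Note how small perturbations of `z` map to the same token: that snapping is the noise filtering the abstract credits to the representation, while the token identity preserves which task phase the state belongs to.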
cs.RO / 46 / 2603.09542
NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models
NS-VLA:迈向神经符号视觉-语言-动作模型
Abstract
Vision-Language-Action (VLA) models are formulated to ground instructions in visual context and generate action sequences for robotic manipulation. Despite recent progress, VLA models still face challenges in learning related and reusable primitives, reducing reliance on large-scale data and complex architectures, and enabling exploration beyond demonstrations. To address these challenges, we propose a novel Neuro-Symbolic Vision-Language-Action (NS-VLA) framework via online reinforcement learning (RL). It introduces a symbolic encoder to embed vision and language features and extract structured primitives, utilizes a symbolic solver for data-efficient action sequencing, and leverages online RL to optimize generation via expansive exploration. Experiments on robotic manipulation benchmarks demonstrate that NS-VLA outperforms previous methods in both one-shot training and data-perturbed settings, while simultaneously exhibiting superior zero-shot generalizability, high data efficiency and expanded exploration space. Our code is available.
Chinese Translation
视觉-语言-动作(VLA)模型旨在将指令与视觉上下文相结合,并生成用于机器人操控的动作序列。尽管近期取得了一些进展,VLA模型在学习相关且可重用的基元、减少对大规模数据和复杂架构的依赖,以及实现超越演示的探索方面仍面临挑战。为了解决这些问题,我们提出了一种新颖的神经符号视觉-语言-动作(NS-VLA)框架,通过在线强化学习(RL)实现。该框架引入了符号编码器,以嵌入视觉和语言特征并提取结构化的基元,利用符号求解器实现数据高效的动作序列生成,并利用在线RL通过广泛探索优化生成过程。在机器人操控基准测试中的实验表明,NS-VLA在单样本训练和数据扰动设置中均优于先前的方法,同时展现出卓越的零样本泛化能力、高数据效率和扩展的探索空间。我们的代码已公开。
cs.RO / 47 / 2603.09552
On the Cost of Evolving Task Specialization in Multi-Robot Systems
多机器人系统中任务专业化演化的成本
Abstract
Task specialization can lead to simpler robot behaviors and higher efficiency in multi-robot systems. Previous works have shown the emergence of task specialization during evolutionary optimization, focusing on feasibility rather than costs. In this study, we take first steps toward a cost-benefit analysis of task specialization in robot swarms using a foraging scenario. We evolve artificial neural networks as generalist behaviors for the entire task and as task-specialist behaviors for subtasks within a limited evaluation budget. We show that generalist behaviors can be successfully optimized while the evolved task-specialist controllers fail to cooperate efficiently, resulting in worse performance than the generalists. Consequently, task specialization does not necessarily improve efficiency when optimization budget is limited.
Chinese Translation
任务专业化可以在多机器人系统中导致更简单的机器人行为和更高的效率。先前的研究已经展示了在进化优化过程中任务专业化的出现,重点关注可行性而非成本。在本研究中,我们首次对机器人群体中任务专业化的成本效益进行了分析,采用觅食场景进行实验。我们在有限的评估预算内,进化人工神经网络作为整个任务的通才行为和作为子任务的任务专家行为。结果表明,通才行为可以成功优化,而进化出的任务专家控制器在合作效率上表现不佳,导致其性能低于通才。因此,当优化预算有限时,任务专业化并不一定能提高效率。
cs.RO / 48 / 2603.09557
Trajectory Optimization for Self-Wrap-Aware Cable-Towed Planar Object Manipulation under Implicit Tension Constraints
隐式张力约束下的自缠绕感知缆索拖曳平面物体操作轨迹优化
Abstract
Cable/rope elements are pervasive in deformable-object manipulation, often serving as a deformable force-transmission medium whose routing and contact determine how wrenches are delivered. In cable-towed manipulation, transmission is unilateral and hybrid: the tether can pull only when taut and becomes force-free when slack; in practice, the tether may also contact the object boundary and self-wrap around edges, which is not merely collision avoidance but a change of the wrench transmission channel by shifting the effective application point and moment arm, thereby coupling routing geometry with rigid-body motion and tensioning. We formulate self-wrap towing as a routing-aware, tensioning-implicit trajectory optimization (TITO) problem that couples (i) a tensioning-implicit taut/slack constraint and (ii) routing-conditioned transmission maps for effective length and wrench, and we build a relaxation hierarchy from a strict mode-conditioned reference to three tractable relaxations: Full-Mode Relaxation (FMR), Binary-Mode Relaxation (BMR), and Implicit-Mode Relaxation (IMR). Across planar towing tasks, we find that making routing an explicit decision often yields conservative solutions that stay near switching boundaries, whereas IMR induces self-wrap through state evolution and exploits the redirected torque channel whenever turning requires it.
Chinese Translation
缆索/绳索元素在可变形物体操作中广泛存在,通常作为一种可变形的力传输介质,其布置和接触决定了力旋量的传递方式。在缆索拖曳操作中,传输是单边且混合的:缆索只有在拉紧时才能施加拉力,松弛时则不产生力;在实际操作中,缆索可能还会接触物体边界并自缠绕在边缘上,这不仅仅是避免碰撞,而是通过改变有效施力点和力臂来改变力旋量传输通道,从而将布置几何与刚体运动和张紧耦合在一起。我们将自缠绕拖曳形式化为一个布置感知、张力隐式的轨迹优化(TITO)问题,该问题耦合了(i)隐式张力的拉紧/松弛约束和(ii)针对有效长度和力旋量的布置条件传输映射,并从严格的模式条件参考出发建立了一个松弛层次结构,形成三种可处理的松弛方式:全模式松弛(FMR)、二元模式松弛(BMR)和隐式模式松弛(IMR)。在平面拖曳任务中,我们发现将布置作为显式决策通常会产生保守的解,这些解往往停留在切换边界附近,而IMR通过状态演化引入自缠绕,并在转向需要时利用重定向的扭矩通道。
cs.RO / 49 / 2603.09565
ReTac-ACT: A State-Gated Vision-Tactile Fusion Transformer for Precision Assembly
ReTac-ACT:一种状态门控的视觉-触觉融合变换器用于精密装配
Abstract
Precision assembly requires sub-millimeter corrections in contact-rich "last-millimeter" regions where visual feedback fails due to occlusion from the end-effector and workpiece. We present ReTac-ACT (Reconstruction-enhanced Tactile ACT), a vision-tactile imitation learning policy that addresses this challenge through three synergistic mechanisms: (i) bidirectional cross-attention enabling reciprocal visuo-tactile feature enhancement before fusion, (ii) a proprioception-conditioned gating network that dynamically elevates tactile reliance when visual occlusion occurs, and (iii) a tactile reconstruction objective enforcing learning of manipulation-relevant contact information rather than generic visual textures. Evaluated on the standardized NIST Assembly Task Board M1 benchmark, ReTac-ACT achieves 90% peg-in-hole success, substantially outperforming vision-only and generalist baseline methods, and maintains 80% success at industrial-grade 0.1mm clearance. Ablation studies validate that each architectural component is indispensable. The ReTac-ACT codebase and a vision-tactile demonstration dataset covering various clearance levels with both visual and tactile features will be released to support reproducible research.
Chinese Translation
精密装配需要在接触丰富的“最后一毫米”区域进行亚毫米级的修正,而在这些区域,由于末端执行器和工件的遮挡,视觉反馈失效。我们提出了ReTac-ACT(重构增强触觉ACT),这是一种通过三种协同机制应对这一挑战的视觉-触觉模仿学习策略:(i) 双向交叉注意力机制,在融合之前实现互惠的视觉-触觉特征增强;(ii) 一种基于本体感觉的门控网络,当发生视觉遮挡时动态提升触觉依赖性;(iii) 一种触觉重构目标,强制学习与操作相关的接触信息,而非通用的视觉纹理。在标准化的NIST装配任务板M1基准测试中,ReTac-ACT实现了90%的轴孔装配(peg-in-hole)成功率,显著优于仅依赖视觉的基线方法和通用基线方法,并在工业级0.1mm间隙下保持80%的成功率。消融研究验证了每个架构组件的不可或缺性。ReTac-ACT代码库和涵盖不同间隙水平、包含视觉与触觉特征的演示数据集将被发布,以支持可重复研究。
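The proprioception-conditioned gating in mechanism (ii) can be sketched as a scalar gate over the two modalities. A hand-set logistic stands in for the learned network, purely to show how an occlusion signal shifts reliance toward touch:

```python
import numpy as np

# Toy sketch of proprioception-conditioned gating: a scalar occlusion score
# (derived from proprioception in the paper) sets the vision/tactile mixing
# weight. The real gate is a learned network; a hand-set logistic stands in
# here only to show the mechanism.

def fuse(vision_feat, tactile_feat, occlusion_score, sharpness=10.0):
    """occlusion_score in [0, 1]; higher means rely more on touch."""
    w_tactile = 1.0 / (1.0 + np.exp(-sharpness * (occlusion_score - 0.5)))
    return (1.0 - w_tactile) * vision_feat + w_tactile * tactile_feat

vision = np.array([1.0, 0.0])                 # distinguishable toy features
tactile = np.array([0.0, 1.0])
open_view = fuse(vision, tactile, occlusion_score=0.1)   # mostly vision
occluded = fuse(vision, tactile, occlusion_score=0.9)    # mostly tactile
```

Conditioning the gate on proprioception rather than on the image itself is the design point: near contact, the arm configuration is a reliable predictor of occlusion even when the camera view is already degraded.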
cs.RO / 50 / 2603.09574
SCDP: Learning Humanoid Locomotion from Partial Observations via Mixed-Observation Distillation
SCDP:通过混合观测蒸馏从部分观测中学习类人行走
Abstract
Distilling humanoid locomotion control from offline datasets into deployable policies remains a challenge, as existing methods rely on privileged full-body states that require complex and often unreliable state estimation. We present Sensor-Conditioned Diffusion Policies (SCDP), which enable humanoid locomotion using only onboard sensors, eliminating the need for explicit state estimation. SCDP decouples sensing from supervision through mixed-observation training: the diffusion model conditions on sensor histories while being supervised to predict privileged future state-action trajectories, forcing the model to infer the motion dynamics under partial observability. We further develop restricted denoising, context distribution alignment, and context-aware attention masking to encourage implicit state estimation within the model and to prevent train-deploy mismatch. We validate SCDP on velocity-commanded locomotion and motion reference tracking tasks. In simulation, SCDP achieves near-perfect success on velocity control (99-100%) and 93% tracking success on the AMASS test set, performing comparably to privileged baselines while using only onboard sensors. Finally, we deploy the trained policy on a real G1 humanoid at 50 Hz, demonstrating robust real-robot locomotion without external sensing or state estimation.
Chinese Translation
从离线数据集中提炼类人行走控制到可部署策略仍然是一个挑战,因为现有方法依赖于特权全身状态,而这需要复杂且通常不可靠的状态估计。我们提出了传感器条件扩散策略(Sensor-Conditioned Diffusion Policies, SCDP),仅使用机载传感器即可实现类人行走,消除了对显式状态估计的需求。SCDP通过混合观测训练将感知与监督解耦:扩散模型以传感器历史为条件,同时被监督以预测特权的未来状态-动作轨迹,从而迫使模型在部分可观测性下推断运动动态。我们进一步开发了限制去噪、上下文分布对齐和上下文感知注意力掩蔽,以鼓励模型内的隐式状态估计并防止训练与部署之间的不匹配。我们在速度指令行走和运动参考跟踪任务上验证了SCDP。在仿真中,SCDP在速度控制任务上实现了近乎完美的成功率(99-100%),在AMASS测试集上实现了93%的跟踪成功率,在仅使用机载传感器的情况下表现与特权基线相当。最后,我们在真实的G1类人机器人上以50 Hz部署了训练好的策略,展示了无需外部传感或状态估计的稳健真实机器人行走能力。
cs.RO / 51 / 2603.09585
Towards Terrain-Aware Safe Locomotion for Quadrupedal Robots Using Proprioceptive Sensing
基于本体感知的四足机器人地形感知安全运动研究
Abstract
Achieving safe quadrupedal locomotion in real-world environments has attracted much attention in recent years. When walking over uneven terrain, achieving reliable estimation and realising safety-critical control based on the obtained information is still an open question. To address this challenge, especially for low-cost robots equipped solely with proprioceptive sensors (e.g., IMUs, joint encoders, and contact force sensors), this work first presents an estimation framework that generates a 2.5-D terrain map and extracts support plane parameters, which are then integrated into contact and state estimation. Then, we integrate this estimation framework into a safety-critical control pipeline by formulating control barrier functions that provide rigorous safety guarantees. Experiments demonstrate that the proposed terrain estimation method provides smooth terrain representations. Moreover, the coupled estimation framework of terrain, state, and contact reduces the mean absolute error of base position estimation by 64.8%, decreases the estimation variance by 47.2%, and improves the robustness of contact estimation compared to a decoupled framework. The terrain-informed CBFs integrate historical terrain information and current proprioceptive measurements to ensure global safety by keeping the robot out of hazardous areas and local safety by preventing body-terrain collision, relying solely on proprioceptive sensing.
Chinese Translation
近年来,实现四足机器人在真实环境中的安全运动引起了广泛关注。当在不平坦地形上行走时,如何可靠地进行估计并基于获得的信息实现安全关键控制仍然是一个未解的问题。为了解决这一挑战,尤其是对于仅配备本体感知传感器(如惯性测量单元、关节编码器和接触力传感器)的低成本机器人,本文首先提出了一种估计框架,该框架生成2.5维地形图并提取支撑平面参数,随后将其整合到接触和状态估计中。接着,我们将该估计框架整合到安全关键控制流程中,通过制定控制屏障函数(Control Barrier Functions, CBFs)来提供严格的安全保障。实验表明,所提出的地形估计方法提供了平滑的地形表示。此外,与解耦框架相比,地形、状态和接触的耦合估计框架将基座位置估计的平均绝对误差降低了64.8%,估计方差降低了47.2%,并提高了接触估计的鲁棒性。基于地形信息的CBFs整合了历史地形信息和当前的本体感知测量,仅依赖本体感知,即可通过使机器人远离危险区域来确保全局安全,并通过防止机身与地形碰撞来确保局部安全。
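The control-barrier-function condition underpinning this abstract can be illustrated with a toy one-dimensional safety filter. This is a hedged sketch, not the paper's controller: the hazard position, the gain `alpha`, and the single-integrator dynamics are all illustrative assumptions.

```python
# A CBF h(x) >= 0 defines the safe set, e.g. distance to a hazardous
# area. The filter clips a nominal velocity command u so that
# dh/dt >= -alpha * h(x), which keeps the state from leaving the
# safe set: the robot can approach the boundary but never cross it.

def cbf_filter(x, u_nominal, hazard_pos=5.0, alpha=1.0):
    """Return the closest-to-nominal velocity satisfying the CBF condition."""
    h = hazard_pos - x      # barrier value: stay left of the hazard
    # With xdot = u and h = hazard_pos - x, dh/dt = -u, so the CBF
    # condition -u >= -alpha * h reduces to u <= alpha * h.
    return min(u_nominal, alpha * h)

# A greedy nominal controller that always pushes toward the hazard:
x, dt = 0.0, 0.1
for _ in range(200):
    x += cbf_filter(x, u_nominal=2.0) * dt

print(round(x, 3))  # the state settles at the safety boundary
```

In the paper's setting the same condition would be enforced over terrain-informed barrier functions in higher dimensions, typically by solving a small quadratic program instead of the scalar clipping shown here.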
cs.RO / 52 / 2603.09596
A Generalized Voronoi Graph based Coverage Control Approach for Non-Convex Environment
基于广义沃罗诺伊图的非凸环境覆盖控制方法
Abstract
To address the challenge of efficient coverage by multi-robot systems in non-convex regions with multiple obstacles, this paper proposes a coverage control method based on the Generalized Voronoi Graph (GVG), which has two phases: a Load-Balancing Algorithm phase and a Collaborative Coverage phase. In the Load-Balancing Algorithm phase, the non-convex region is partitioned into multiple sub-regions based on the GVG. In addition, a weighted load-balancing algorithm is developed that accounts for the quality differences among sub-regions. By iteratively optimizing the robot allocation ratio, the number of robots in each sub-region is matched with the sub-region quality to achieve load balance. In the Collaborative Coverage phase, each robot is controlled by a new controller to effectively cover the region. The convergence of the method is proved and its performance is evaluated through simulations.
Chinese Translation
为了解决多机器人系统在具有多个障碍物的非凸区域中高效覆盖的挑战,本文提出了一种基于广义沃罗诺伊图(Generalized Voronoi Graph, GVG)的覆盖控制方法,该方法分为两个阶段:负载均衡算法阶段和协同覆盖阶段。在负载均衡算法阶段,非凸区域根据 GVG 被划分为多个子区域。此外,开发了一种加权负载均衡算法,该算法考虑了子区域之间的质量差异。通过迭代优化机器人分配比例,使每个子区域内的机器人数量与子区域质量相匹配,从而实现负载均衡。在协同覆盖阶段,每个机器人由一个新的控制器进行控制,以有效覆盖该区域。本文证明了该方法的收敛性,并通过仿真评估了其性能。
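The load-balancing idea described above — matching robot counts to sub-region quality — can be sketched in miniature. This is an assumption-laden illustration, not the paper's weighted algorithm: the quality weights, the proportional target, and the greedy rounding repair are all hypothetical simplifications.

```python
# Allocate n_robots across sub-regions so that each sub-region's robot
# count is proportional to its quality weight; leftover robots from
# integer truncation go to the most under-served sub-region.

def allocate_robots(qualities, n_robots):
    total = sum(qualities)
    counts = [int(n_robots * q / total) for q in qualities]
    while sum(counts) < n_robots:
        # Deficit = ideal fractional share minus current integer count.
        deficits = [n_robots * q / total - c for q, c in zip(qualities, counts)]
        counts[deficits.index(max(deficits))] += 1
    return counts

counts = allocate_robots([3.0, 1.0, 2.0], 12)
print(counts)  # allocation follows the 3:1:2 quality ratio
```

The paper's iterative optimization of the allocation ratio would additionally account for the GVG partition geometry, which this sketch omits.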
cs.RO / 53 / 2603.09695
DRIFT: Dual-Representation Inter-Fusion Transformer for Automated Driving Perception with 4D Radar Point Clouds
DRIFT:基于4D雷达点云的自动驾驶感知双重表示交融变换器(Dual-Representation Inter-Fusion Transformer)
Abstract
4D radars, which provide 3D point cloud data along with Doppler velocity, are attractive components of modern automated driving systems due to their low cost and robustness under adverse weather conditions. However, they provide a significantly lower point cloud density than LiDAR sensors. This makes it important to exploit not only local but also global contextual scene information. This paper proposes DRIFT, a model that effectively captures and fuses both local and global contexts through a dual-path architecture. The model incorporates a point path to aggregate fine-grained local features and a pillar path to encode coarse-grained global features. These two parallel paths are intertwined via novel feature-sharing layers at multiple stages, enabling full utilization of both representations. DRIFT is evaluated on the widely used View-of-Delft (VoD) dataset and a proprietary internal dataset. It outperforms the baselines on the tasks of object detection and/or free road estimation. For example, DRIFT achieves a mean average precision (mAP) of 52.6% (compared to 45.4% for CenterPoint) on the VoD dataset.
Chinese Translation
4D雷达提供了带有多普勒速度的3D点云数据,因其低成本和在恶劣天气条件下的鲁棒性,成为现代自动驾驶系统中颇具吸引力的组成部分。然而,与激光雷达(LiDAR)传感器相比,4D雷达提供的点云密度显著较低。这使得充分利用局部和全局上下文场景信息变得尤为重要。本文提出了DRIFT,一个通过双路径架构有效捕捉和融合局部与全局上下文的模型。该模型包含一个点路径用于聚合细粒度的局部特征,以及一个柱路径用于编码粗粒度的全局特征。这两条并行路径通过多个阶段的新型特征共享层交织在一起,从而实现对两种表示的充分利用。DRIFT在广泛使用的View-of-Delft (VoD)数据集和一个内部专有数据集上进行了评估。在目标检测和/或自由道路估计任务中,DRIFT的表现优于基线。例如,在VoD数据集上,DRIFT实现了52.6%的平均精度均值(mean average precision, mAP),而CenterPoint的mAP为45.4%。
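The mAP numbers quoted above (52.6% vs. 45.4%) are means over object classes of average precision, i.e. the area under each class's precision-recall curve. A minimal AP computation for a single class, with an illustrative toy detection list (not data from the paper):

```python
# AP for one class: sort detections by confidence, sweep the list, and
# accumulate precision weighted by each increment in recall.

def average_precision(scored, n_positives):
    """scored: (confidence, is_true_positive) pairs for one class."""
    scored = sorted(scored, key=lambda s: -s[0])
    tp = fp = 0
    ap, last_recall = 0.0, 0.0
    for _, is_tp in scored:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recall = tp / n_positives
        precision = tp / (tp + fp)
        ap += (recall - last_recall) * precision
        last_recall = recall
    return ap

# Two ground-truth objects; the middle detection is a false positive.
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], 2)
```

Benchmark implementations typically add interpolation of the precision envelope and per-class IoU thresholds, which this sketch omits.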
cs.RO / 54 / 2603.09712
Robotic Scene Cloning: Advancing Zero-Shot Robotic Scene Adaptation in Manipulation via Visual Prompt Editing
机器人场景克隆:通过视觉提示编辑推进操作任务中的零样本机器人场景适应
Abstract
Modern robots can perform a wide range of simple tasks and adapt to diverse scenarios in the well-trained environment. However, deploying pre-trained robot models in real-world user scenarios remains challenging due to their limited zero-shot capabilities, often necessitating extensive on-site data collection. To address this issue, we propose Robotic Scene Cloning (RSC), a novel method designed for scene-specific adaptation by editing existing robot operation trajectories. RSC achieves accurate and scene-consistent sample generation by leveraging a visual prompting mechanism and a carefully tuned condition injection module. Not only transferring textures but also performing moderate shape adaptations in response to the visual prompts, RSC demonstrates reliable task performance across a variety of object types. Experiments across various simulated and real-world environments demonstrate that RSC significantly enhances policy generalization in target environments.
Chinese Translation
现代机器人能够执行广泛的简单任务,并在经过良好训练的环境中适应多样化的场景。然而,由于其有限的零样本能力,将预训练的机器人模型部署到真实用户场景中仍然面临挑战,通常需要大量的现场数据收集。为了解决这一问题,我们提出了机器人场景克隆(Robotic Scene Cloning, RSC),这是一种通过编辑现有机器人操作轨迹实现场景特定适应的新方法。RSC 通过利用视觉提示机制和精心调校的条件注入模块,实现了准确且场景一致的样本生成。RSC 不仅能够转移纹理,还能根据视觉提示进行适度的形状适应,展示了在多种物体类型下可靠的任务表现。在各种模拟和真实环境中的实验表明,RSC 显著增强了目标环境中的策略泛化能力。
cs.RO / 55 / 2603.09740
Let's Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments
逐步奖励:面向连续环境视觉-语言导航的步骤感知对比对齐
Abstract
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to learn complex reasoning from long-horizon human interactions. While Multi-modal Large Language Models (MLLMs) have driven recent progress, current training paradigms struggle to balance generalization capability, error recovery, and training stability. Specifically, (i) policies derived from SFT suffer from compounding errors, struggling to recover from out-of-distribution states, and (ii) Reinforcement Fine-Tuning (RFT) methods, e.g., GRPO, are bottlenecked by sparse outcome rewards. Their binary feedback fails to assign credit to individual steps, leading to gradient signal collapse in failure-dominant batches. To address these challenges, we introduce Step-Aware Contrastive Alignment (SACA), a framework designed to extract dense supervision from imperfect trajectories. At its core, a Perception-Grounded Step-Aware auditor evaluates progress step-by-step, disentangling failed trajectories into valid prefixes and exact divergence points. Leveraging these signals, a Scenario-Conditioned Group Construction mechanism dynamically routes batches to specialized resampling and optimization strategies. Extensive experiments on VLN-CE benchmarks demonstrate that SACA achieves state-of-the-art performance.
Chinese Translation
连续环境中的视觉-语言导航(VLN-CE)要求智能体从长期的人类交互中学习复杂的推理。尽管多模态大型语言模型(MLLMs)推动了近期的进展,但当前的训练范式在平衡泛化能力、错误恢复和训练稳定性方面面临挑战。具体而言,(i) 从监督微调(SFT)派生的策略受到累积错误的影响,难以从分布外状态中恢复,(ii) 强化微调(RFT)方法,如GRPO,受到稀疏结果奖励的瓶颈限制。它们的二元反馈未能为单个步骤分配信用,导致在失败主导的批次中梯度信号崩溃。为了解决这些挑战,我们提出了步骤感知对比对齐(SACA),这是一个旨在从不完美轨迹中提取密集监督的框架。其核心是基于感知的步骤感知审计器逐步评估进展,将失败的轨迹解构为有效的前缀和确切的偏差点。利用这些信号,场景条件下的组构建机制动态地将批次路由到专门的重采样和优化策略。在VLN-CE基准上的大量实验表明,SACA达到了最先进的性能。
cs.RO / 56 / 2603.09745
Caterpillar-Inspired Spring-Based Compressive Continuum Robot for Bristle-based Exploration
受毛虫启发的基于弹簧的压缩式连续机器人用于毛刷式探索
Abstract
Exploration of confined spaces, such as pipelines and ducts, remains challenging for conventional rigid robots due to limited space, irregular geometry, and restricted access. Inspired by caterpillar locomotion and sensing, this paper presents a compact spring-based tendon-driven continuum robot that integrates with commercial robotic arms for confined-space inspection. The system combines a mechanically compliant continuum body with a tendon actuation module, enabling coupled bending and axial length change, and uses a constant-curvature kinematic model for positional control. Experiments show a mean position error of 4.32 mm under the proposed model and control pipeline. To extend the system from motion to inspection, we integrate an artificial bristle contact sensor and demonstrate surface perception and confined-space exploration through contact interactions. This compact and compliant design offers a cost-effective upgrade for commercial robots and promises effective exploration in challenging environments.
Chinese Translation
由于空间有限、几何形状不规则和通道受限,传统的刚性机器人在狭小空间(如管道和风道)的探索中仍然面临挑战。受毛虫运动和感知的启发,本文提出了一种紧凑的基于弹簧的腱驱动连续机器人,该机器人与商业机器人手臂集成,用于狭小空间的检查。该系统结合了机械柔性连续体和腱驱动模块,实现了弯曲和轴向长度变化的耦合,并采用恒曲率运动学模型进行位置控制。实验表明,在所提出的模型和控制管道下,平均位置误差为4.32毫米。为了将系统从运动扩展到检查,我们集成了一种人工毛刷接触传感器,并通过接触交互展示了表面感知和狭小空间探索。这种紧凑且柔性的设计为商业机器人提供了一种具有成本效益的升级方案,并有望在具有挑战性的环境中实现有效探索。
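The constant-curvature kinematic model mentioned in the abstract has a standard closed form for a single planar segment: an arc of curvature kappa and length s places the tip at x = (1 − cos(kappa·s))/kappa, y = sin(kappa·s)/kappa. The sketch below uses this textbook form with illustrative parameter names; the paper's full model and control pipeline are not reproduced.

```python
import math

def cc_tip_position(kappa, s):
    """Planar constant-curvature tip position for one continuum segment."""
    if abs(kappa) < 1e-9:       # straight-segment limit (zero curvature)
        return (0.0, s)
    x = (1.0 - math.cos(kappa * s)) / kappa
    y = math.sin(kappa * s) / kappa
    return (x, y)

# A quarter-circle bend of radius 1/kappa = 0.1 m: the tip ends up at
# (0.1, 0.1) relative to the segment base.
x, y = cc_tip_position(kappa=10.0, s=math.pi / 20)
```

Tendon-driven designs like the one described would additionally map tendon lengths to (kappa, s) before applying this forward model, and the spring-based body adds axial length change as a second arc parameter.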
cs.RO / 57 / 2603.09761
MuxGel: Simultaneous Dual-Modal Visuo-Tactile Sensing via Spatially Multiplexing and Deep Reconstruction
MuxGel:通过空间复用和深度重建实现双模态视觉-触觉传感的同时感知
Abstract
High-fidelity visuo-tactile sensing is important for precise robotic manipulation. However, most vision-based tactile sensors face a fundamental trade-off: opaque coatings enable tactile sensing but block pre-contact vision. To address this, we propose MuxGel, a spatially multiplexed sensor that captures both external visual information and contact-induced tactile signals through a single camera. By using a checkerboard coating pattern, MuxGel interleaves tactile-sensitive regions with transparent windows for external vision. This design maintains standard form factors, allowing for plug-and-play integration into GelSight-style sensors by simply replacing the gel pad. To recover full-resolution vision and tactile signals from the multiplexed inputs, we develop a U-Net-based reconstruction framework. Leveraging a sim-to-real pipeline, our model effectively decouples and restores high-fidelity tactile and visual fields simultaneously. Experiments on unseen objects demonstrate the framework's generalization and accuracy. Furthermore, we demonstrate MuxGel's utility in grasping tasks, where dual-modality feedback facilitates both pre-contact alignment and post-contact interaction. Results show that MuxGel enhances the perceptual capabilities of existing vision-based tactile sensors while maintaining compatibility with their hardware stacks. Project webpage: https://zhixianhu.github.io/muxgel/.
Chinese Translation
高保真视觉-触觉传感对于精确的机器人操作至关重要。然而,大多数基于视觉的触觉传感器面临一个基本的权衡:不透明涂层能够实现触觉感知,但阻挡了接触前的视觉。为了解决这一问题,我们提出了MuxGel,一种空间复用传感器,通过单个摄像头捕获外部视觉信息和接触引起的触觉信号。通过使用棋盘格涂层模式,MuxGel将触觉敏感区域与透明窗口交错,以便于外部视觉。这一设计保持了标准的形状因子,允许通过简单替换凝胶垫实现与GelSight风格传感器的即插即用集成。为了从复用输入中恢复全分辨率的视觉和触觉信号,我们开发了基于U-Net的重建框架。利用仿真到现实的管道,我们的模型有效地解耦并同时恢复高保真的触觉和视觉场。对未见物体的实验表明了框架的泛化能力和准确性。此外,我们展示了MuxGel在抓取任务中的实用性,其中双模态反馈促进了接触前的对齐和接触后的交互。结果表明,MuxGel增强了现有基于视觉的触觉传感器的感知能力,同时保持与其硬件堆栈的兼容性。项目网页:https://zhixianhu.github.io/muxgel/
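MuxGel's checkerboard multiplexing idea can be shown in miniature: a binary mask splits one camera image into interleaved tactile and visual samples. This is a hedged toy illustration — the cell size, image shape, and `None` placeholders are assumptions, and the paper's U-Net reconstruction of the two full-resolution fields is not reproduced.

```python
# Checkerboard mask: True cells are coated (tactile-sensitive), False
# cells are transparent windows for external vision.

def checkerboard_mask(h, w, cell=1):
    return [[(r // cell + c // cell) % 2 == 0 for c in range(w)]
            for r in range(h)]

def demultiplex(image, mask):
    """Split one multiplexed image into sparse tactile and visual fields."""
    h, w = len(image), len(image[0])
    tactile = [[image[r][c] if mask[r][c] else None for c in range(w)]
               for r in range(h)]
    visual = [[None if mask[r][c] else image[r][c] for c in range(w)]
              for r in range(h)]
    return tactile, visual

img = [[r * 4 + c for c in range(4)] for r in range(4)]
tac, vis = demultiplex(img, checkerboard_mask(4, 4))
```

Each pixel lands in exactly one field; the learned reconstruction then fills in the `None` gaps of both fields at full resolution.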
cs.RO / 58 / 2603.09782
TIMID: Time-Dependent Mistake Detection in Videos of Robot Executions
TIMID:机器人执行视频中的时间依赖性错误检测
Abstract
As robotic systems execute increasingly difficult task sequences, the number of ways in which they can fail grows as well. Video Anomaly Detection (VAD) frameworks typically focus on singular, low-level kinematic or action failures, struggling to identify more complex temporal or spatial task violations, because these do not necessarily manifest as low-level execution errors. To address this problem, the main contribution of this paper is a new VAD-inspired architecture, TIMID, which is able to detect time-dependent mistakes made by robots executing high-level tasks. Our architecture receives as inputs a video and prompts describing the task and the potential mistake, and returns a frame-level prediction of whether the mistake is present in the video. By adopting a VAD formulation, the model can be trained with weak supervision, requiring only a single label per video. Additionally, to alleviate the problem of data scarcity of incorrect executions, we introduce a multi-robot simulation dataset with controlled temporal errors and real executions for zero-shot sim-to-real evaluation. Our experiments demonstrate that out-of-the-box VLMs lack the explicit temporal reasoning required for this task, whereas our framework successfully detects different types of temporal errors. Project: https://ropertunizar.github.io/TIMID/
Chinese Translation
随着机器人系统执行越来越复杂的任务序列,它们失败的方式也日益增多。视频异常检测(VAD)框架通常专注于单一的、低级的运动或动作失败,难以识别更复杂的时间或空间任务违规,因为这些违规不一定表现为低级执行错误。为了解决这个问题,本文的主要贡献是一个新的受VAD启发的架构TIMID,它能够在执行高级任务时检测机器人时间依赖性错误。我们的架构以视频和任务及潜在错误的提示作为输入,并返回视频中每一帧是否存在错误的预测。通过采用VAD的形式,模型可以在弱监督下进行训练,仅需每个视频一个标签。此外,为了缓解错误执行数据稀缺的问题,我们引入了一个具有可控时间错误和真实执行的多机器人仿真数据集,以便进行零样本的仿真到真实评估。我们的实验表明,现成的视觉语言模型(VLM)缺乏执行此任务所需的明确时间推理,而我们的框架成功地检测了不同类型的时间错误。项目链接:https://ropertunizar.github.io/TIMID/
cs.RO / 59 / 2603.09783
Lightweight 3D LiDAR-Based UAV Tracking: An Adaptive Extended Kalman Filtering Approach
基于3D LiDAR的轻量级无人机跟踪:一种自适应扩展卡尔曼滤波方法
Abstract
Accurate relative positioning is crucial for swarm aerial robotics, enabling coordinated flight and collision avoidance. Although vision-based tracking has been extensively studied, 3D LiDAR-based methods remain underutilized despite their robustness under varying lighting conditions. Existing systems often rely on bulky, power-intensive sensors, making them impractical for small UAVs with strict payload and energy constraints. This paper presents a lightweight LiDAR-based UAV tracking system incorporating an Adaptive Extended Kalman Filter (AEKF) framework. Our approach effectively addresses the challenges posed by sparse, noisy, and nonuniform point cloud data generated by non-repetitive scanning 3D LiDARs, ensuring reliable tracking while remaining suitable for small drones with strict payload constraints. Unlike conventional filtering techniques, the proposed method dynamically adjusts the noise covariance matrices using innovation and residual statistics, thereby enhancing tracking accuracy under real-world conditions. Additionally, a recovery mechanism ensures continuity of tracking during temporary detection failures caused by scattered LiDAR returns or occlusions. Experimental validation was performed using a Livox Mid-360 LiDAR mounted on a DJI F550 UAV in real-world flight scenarios. The proposed method demonstrated robust UAV tracking performance under sparse LiDAR returns and intermittent detections, consistently outperforming both standard Kalman filtering and particle filtering approaches during aggressive maneuvers. These results confirm that the framework enables reliable relative positioning in GPS-denied environments without the need for multi-sensor arrays or external infrastructure.
Chinese Translation
准确的相对定位对于群体空中机器人至关重要,能够实现协调飞行和避免碰撞。尽管基于视觉的跟踪已被广泛研究,但基于3D LiDAR的方法尽管在不同光照条件下具有鲁棒性,却仍未得到充分利用。现有系统通常依赖于体积庞大、耗电量大的传感器,使其不适用于载荷和能量受到严格限制的小型无人机。本文提出了一种轻量级的基于LiDAR的无人机跟踪系统,采用自适应扩展卡尔曼滤波器(Adaptive Extended Kalman Filter, AEKF)框架。我们的方法有效解决了由非重复扫描3D LiDAR生成的稀疏、噪声和非均匀点云数据所带来的挑战,确保了可靠的跟踪,同时适用于具有严格载荷限制的小型无人机。与传统的滤波技术不同,所提方法利用新息和残差统计动态调整噪声协方差矩阵,从而提高了在现实条件下的跟踪精度。此外,恢复机制确保在由于散乱的LiDAR回波或遮挡导致的暂时检测失败期间跟踪的连续性。实验验证是在真实飞行场景中使用安装在DJI F550无人机上的Livox Mid-360 LiDAR进行的。所提方法在稀疏LiDAR回波和间歇性检测下展示了强大的无人机跟踪性能,在激烈机动期间始终优于标准卡尔曼滤波和粒子滤波方法。这些结果证实该框架能够在无GPS环境中实现可靠的相对定位,而无需多传感器阵列或外部基础设施。
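The innovation-based covariance adaptation described above can be illustrated with a scalar Kalman filter. This is a hedged sketch of a common adaptive scheme (estimating R from a moving window of innovations), not the paper's AEKF: the motion model, window size, and flooring constant are all illustrative assumptions.

```python
def adaptive_kf(measurements, q=0.01, r0=1.0, window=10):
    """Scalar Kalman filter that adapts its measurement noise R online."""
    x, p, r = 0.0, 1.0, r0
    innovations, estimates = [], []
    for z in measurements:
        p += q                              # predict (random-walk model)
        nu = z - x                          # innovation
        innovations.append(nu)
        if len(innovations) >= window:
            # Sample innovation covariance over the last `window` steps;
            # R ~ E[nu^2] - H P H^T (floored to stay positive).
            c = sum(v * v for v in innovations[-window:]) / window
            r = max(c - p, 1e-6)
        k = p / (p + r)                     # Kalman gain
        x += k * nu                         # correct
        p *= (1.0 - k)
        estimates.append(x)
    return estimates, r

est, r = adaptive_kf([1.0] * 50)            # constant target position
```

When the innovations shrink, the estimated R drops and the filter trusts measurements more; when returns become noisy or intermittent, R grows and the filter leans on its prediction — the qualitative behavior the abstract attributes to the AEKF.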
cs.RO / 60 / 2603.09882
Emerging Extrinsic Dexterity in Cluttered Scenes via Dynamics-aware Policy Learning
通过动态感知策略学习在杂乱场景中实现外部灵巧性
Abstract
Extrinsic dexterity leverages environmental contact to overcome the limitations of prehensile manipulation. However, achieving such dexterity in cluttered scenes remains challenging and underexplored, as it requires selectively exploiting contact among multiple interacting objects with inherently coupled dynamics. Existing approaches lack explicit modeling of such complex dynamics and therefore fall short in non-prehensile manipulation in cluttered environments, which in turn limits their practical applicability in real-world environments. In this paper, we introduce a Dynamics-Aware Policy Learning (DAPL) framework that can facilitate policy learning with a learned representation of contact-induced object dynamics in cluttered environments. This representation is learned through explicit world modeling and used to condition reinforcement learning, enabling extrinsic dexterity to emerge without hand-crafted contact heuristics or complex reward shaping. We evaluate our approach in both simulation and the real world. Our method outperforms prehensile manipulation, human teleoperation, and prior representation-based policies by over 25% in success rate on unseen simulated cluttered scenes with varying densities. The real-world success rate reaches around 50% across 10 cluttered scenes, while a practical grocery deployment further demonstrates robust sim-to-real transfer and applicability.
Chinese Translation
外部灵巧性利用环境接触来克服抓取操作的局限性。然而,在杂乱场景中实现这种灵巧性仍然具有挑战性且未得到充分探索,因为它需要在多个相互作用的物体之间选择性地利用接触,这些物体具有固有的耦合动态。现有方法缺乏对这种复杂动态的明确建模,因此在杂乱环境中的非抓取操作中表现不佳,从而限制了它们在现实环境中的实际应用。在本文中,我们引入了一种动态感知策略学习(Dynamics-Aware Policy Learning, DAPL)框架,该框架能够利用在杂乱环境中学习到的接触引起的物体动态表示来促进策略学习。这种表示通过明确的世界建模学习而来,并用于条件化强化学习,使外部灵巧性得以在没有手工设计的接触启发式或复杂奖励塑造的情况下涌现。我们在模拟和现实世界中评估了我们的方法。在具有不同密度的未见过的模拟杂乱场景中,我们的方法的成功率比抓取操作、人类遥操作和先前基于表示的策略高出25%以上。在10个杂乱场景中,现实世界的成功率达到约50%,而实际的杂货店部署进一步展示了稳健的模拟到现实的迁移和适用性。
cs.RO / 61 / 2603.09886
Robust Cooperative Localization in Featureless Environments: A Comparative Study of DCL, StCL, CCL, CI, and Standard-CL
无特征环境下的鲁棒协作定位:DCL、StCL、CCL、CI 和标准协作定位的比较研究
Abstract
Cooperative localization (CL) enables accurate position estimation in multi-robot systems operating in GPS-denied environments. This paper presents a comparative study of five CL approaches: Centralized Cooperative Localization (CCL), Decentralized Cooperative Localization (DCL), Sequential Cooperative Localization (StCL), Covariance Intersection (CI), and Standard Cooperative Localization (Standard-CL). All methods are implemented in ROS and evaluated through Monte Carlo simulations under two conditions: weak data association and robust detection. Our analysis reveals fundamental trade-offs among the methods. StCL and Standard-CL achieve the lowest position errors but exhibit severe filter inconsistency, making them unsuitable for safety-critical applications. DCL demonstrates remarkable stability under challenging conditions due to its measurement stride mechanism, which provides implicit regularization against outliers. CI emerges as the most balanced approach, achieving near-optimal consistency while maintaining competitive accuracy. CCL provides theoretically optimal estimation but shows sensitivity to measurement outliers. These findings offer practical guidance for selecting CL algorithms based on application requirements.
Chinese Translation
协作定位(CL)使多机器人系统能够在无GPS环境下实现准确的位置估计。本文对五种协作定位方法进行了比较研究:集中式协作定位(CCL)、分散式协作定位(DCL)、顺序协作定位(StCL)、协方差交集(CI)和标准协作定位(Standard-CL)。所有方法均在ROS中实现,并通过蒙特卡洛仿真在两种条件下进行评估:弱数据关联和鲁棒检测。我们的分析揭示了这些方法之间的基本权衡。StCL和Standard-CL实现了最低的位置误差,但表现出严重的滤波器不一致性,使其不适合安全关键应用。DCL在挑战性条件下表现出显著的稳定性,这得益于其测量步幅机制,该机制针对异常值提供了隐式正则化。CI则成为最平衡的方法,在保持竞争性准确度的同时实现了近乎最优的一致性。CCL提供了理论上最优的估计,但对测量异常值表现出敏感性。这些发现为根据应用需求选择协作定位算法提供了实用指导。
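The covariance-intersection rule compared in this study fuses two estimates whose cross-correlation is unknown: P⁻¹ = ω P₁⁻¹ + (1 − ω) P₂⁻¹, with the fused mean weighted accordingly. The sketch below is a hedged scalar illustration — the coarse scan over ω and the example numbers are assumptions, not the paper's implementation.

```python
def ci_fuse(x1, p1, x2, p2, steps=100):
    """Covariance intersection of two scalar estimates (x, variance p)."""
    best = None
    for i in range(steps + 1):
        w = i / steps
        inv_p = w / p1 + (1.0 - w) / p2     # P^-1 = w/P1 + (1-w)/P2
        if inv_p <= 0.0:
            continue
        p = 1.0 / inv_p
        x = p * (w * x1 / p1 + (1.0 - w) * x2 / p2)
        if best is None or p < best[1]:     # pick w minimizing variance
            best = (x, p)
    return best

x, p = ci_fuse(1.0, 2.0, 3.0, 0.5)
```

Note a known quirk visible here: in the scalar case the variance-minimizing ω is always 0 or 1, so CI simply keeps the lower-variance estimate; with matrix covariances (the multi-robot case), intermediate ω values are typical and CI's consistency guarantee becomes meaningful.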
cs.RO / 62 / 2603.09908
NanoBench: A Multi-Task Benchmark Dataset for Nano-Quadrotor System Identification, Control, and State Estimation
NanoBench:用于纳米四旋翼系统识别、控制和状态估计的多任务基准数据集
Abstract
Existing aerial-robotics benchmarks target vehicles from hundreds of grams to several kilograms and typically expose only high-level state data. They omit the actuator-level signals required to study nano-scale quadrotors, where low-Reynolds number aerodynamics, coreless DC motor nonlinearities, and severe computational constraints invalidate models and controllers developed for larger vehicles. We introduce NanoBench, an open-source multi-task benchmark collected on the commercially available Crazyflie 2.1 nano-quadrotor (takeoff weight 27 g) in a Vicon motion capture arena. The dataset contains over 170 flight trajectories spanning hover, multi-frequency excitation, standard tracking, and aggressive maneuvers across multiple speed regimes. Each trajectory provides synchronized Vicon ground truth, raw IMU data, onboard extended Kalman filter estimates, PID controller internals, and motor PWM commands at 100 Hz, alongside battery telemetry at 10 Hz, aligned with sub-0.5 ms consistency. NanoBench defines standardized evaluation protocols, train/test splits, and open-source baselines for three tasks: nonlinear system identification, closed-loop controller benchmarking, and onboard state estimation assessment. To our knowledge, it is the first public dataset to jointly provide actuator commands, controller internals, and estimator outputs with millimeter-accurate ground truth on a commercially available nano-scale aerial platform.
Chinese Translation
现有的空中机器人基准测试主要针对重量从几百克到几千克的飞行器,通常仅提供高层次的状态数据。它们忽略了研究纳米级四旋翼所需的执行器级信号;在纳米尺度下,低雷诺数空气动力学、空心杯直流电机的非线性以及严苛的计算约束使得为更大飞行器开发的模型和控制器失效。我们推出了NanoBench,这是一个在商业可用的Crazyflie 2.1纳米四旋翼(起飞重量27克)上收集的开源多任务基准数据集,数据采集在Vicon运动捕捉场地进行。该数据集包含超过170条飞行轨迹,涵盖悬停、多频激励、标准跟踪和多种速度下的激进机动。每条轨迹提供同步的Vicon真实值、原始IMU数据、机载扩展卡尔曼滤波器估计、PID控制器内部状态以及100 Hz下的电机PWM命令,同时在10 Hz下提供电池遥测,且对齐一致性优于0.5毫秒。NanoBench定义了标准化的评估协议、训练/测试划分以及三个任务的开源基线:非线性系统识别、闭环控制器基准测试和机载状态估计评估。据我们所知,这是第一个公开数据集,能够在商业可用的纳米级空中平台上共同提供执行器命令、控制器内部状态和估计器输出,并且具有毫米级准确的真实值。
cs.RO / 63 / 2603.09956
Kinodynamic Motion Retargeting for Humanoid Locomotion via Multi-Contact Whole-Body Trajectory Optimization
基于多接触全身轨迹优化的人形机器人运动动力学重定向
Abstract
We present the KinoDynamic Motion Retargeting (KDMR) framework, a novel approach for humanoid locomotion that models the retargeting process as a multi-contact, whole-body trajectory optimization problem. Conventional kinematics-based retargeting methods rely solely on spatial motion capture (MoCap) data, inevitably introducing physically inconsistent artifacts, such as foot sliding and ground penetration, that severely degrade the performance of downstream imitation learning policies. To bridge this gap, KDMR extends beyond pure kinematics by explicitly enforcing rigid-body dynamics and contact complementarity constraints. Further, by integrating ground reaction force (GRF) measurements alongside MoCap data, our method automatically detects heel-toe contact events to accurately replicate complex human-like contact patterns. We evaluate KDMR against the state-of-the-art baseline, GMR, across three key dimensions: 1) the dynamic feasibility and smoothness of the retargeted motions, 2) the accuracy of GRF tracking compared to raw source data, and 3) the training efficiency and final performance of downstream control policies trained via the BeyondMimic framework. Experimental results demonstrate that KDMR significantly outperforms purely kinematic methods, yielding dynamically viable reference trajectories that accelerate policy convergence and enhance overall locomotion stability. Our end-to-end pipeline will be open-sourced upon publication.
Chinese Translation
我们提出了运动动力学重定向框架(KinoDynamic Motion Retargeting, KDMR),这是一种新颖的人形机器人运动方法,将重定向过程建模为多接触的全身轨迹优化问题。传统的基于运动学的重定向方法仅依赖空间运动捕捉(MoCap)数据,必然引入物理不一致的伪影,如脚滑动和地面穿透,这严重降低了下游模仿学习策略的性能。为了弥补这一差距,KDMR 超越了纯运动学,通过明确施加刚体动力学和接触互补约束。进一步地,通过将地面反作用力(Ground Reaction Force, GRF)测量与 MoCap 数据结合,我们的方法自动检测脚跟-脚尖接触事件,以准确复制复杂的人类接触模式。我们在三个关键维度上评估 KDMR 与最先进的基线方法 GMR 的表现:1)重定向运动的动态可行性和流畅性,2)与原始源数据相比 GRF 跟踪的准确性,以及 3)通过 BeyondMimic 框架训练的下游控制策略的训练效率和最终性能。实验结果表明,KDMR 显著优于纯运动学方法,产生动态可行的参考轨迹,加速策略收敛并增强整体运动稳定性。我们的端到端管道将在发表后开源。
cs.RO / 64 / 2603.09961
BEACON: Language-Conditioned Navigation Affordance Prediction under Occlusion
BEACON:遮挡下的语言条件导航可达性预测
Abstract
Language-conditioned local navigation requires a robot to infer a nearby traversable target location from its current observation and an open-vocabulary, relational instruction. Existing vision-language spatial grounding methods usually rely on vision-language models (VLMs) to reason in image space, producing 2D predictions tied to visible pixels. As a result, they struggle to infer target locations in occluded regions, typically caused by furniture or moving humans. To address this issue, we propose BEACON, which predicts an ego-centric Bird's-Eye View (BEV) affordance heatmap over a bounded local region including occluded areas. Given an instruction and surround-view RGB-D observations from four directions around the robot, BEACON predicts the BEV heatmap by injecting spatial cues into a VLM and fusing the VLM's output with depth-derived BEV features. Using an occlusion-aware dataset built in the Habitat simulator, we conduct detailed experimental analysis to validate both our BEV space formulation and the design choices of each module. Our method improves the accuracy averaged across geodesic thresholds by 22.74 percentage points over the state-of-the-art image-space baseline on the validation subset with occluded target locations. Our project page is: https://xin-yu-gao.github.io/beacon.
Chinese Translation
语言条件的局部导航要求机器人根据当前观察和开放词汇的关系指令推断出附近可通行的目标位置。现有的视觉-语言空间定位方法通常依赖于视觉-语言模型(VLM)在图像空间中进行推理,生成与可见像素相关的二维预测。因此,它们在推断被家具或移动人类遮挡区域的目标位置时面临挑战。为了解决这个问题,我们提出了BEACON,它在包括遮挡区域的有限局部区域上预测以自我为中心的鸟瞰图(BEV)可达性热图。在给定指令和来自机器人周围四个方向的全景RGB-D观测的情况下,BEACON通过将空间线索注入VLM并将VLM的输出与深度派生的BEV特征融合,预测BEV热图。使用在Habitat模拟器中构建的遮挡感知数据集,我们进行了详细的实验分析,以验证我们的BEV空间公式和各模块的设计选择。我们的方法在验证子集上,针对遮挡目标位置,相较于最先进的图像空间基线,平均提高了22.74个百分点的精度。我们的项目页面是:https://xin-yu-gao.github.io/beacon。
cs.RO / 65 / 2603.09971
TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation
TiPToP:一个用于机器人操作的模块化开放词汇规划系统
Abstract
We present TiPToP, an extensible modular system that combines pretrained vision foundation models with an existing Task and Motion Planner (TAMP) to solve multi-step manipulation tasks directly from input RGB images and natural-language instructions. Our system aims to be simple and easy-to-use: it can be installed and run on a standard DROID setup in under one hour and adapted to new embodiments with minimal effort. We evaluate TiPToP -- which requires zero robot data -- over 28 tabletop manipulation tasks in simulation and the real world and find it matches or outperforms $\pi_{0.5}\text{-DROID}$, a vision-language-action (VLA) model fine-tuned on 350 hours of embodiment-specific demonstrations. TiPToP's modular architecture enables us to analyze the system's failure modes at the component level. We analyze results from an evaluation of 173 trials and identify directions for improvement. We release TiPToP open-source to further research on modular manipulation systems and tighter integration between learning and planning. Project website and code: https://tiptop-robot.github.io
Chinese Translation
我们提出了TiPToP,这是一个可扩展的模块化系统,将预训练的视觉基础模型与现有的任务与运动规划器(Task and Motion Planner, TAMP)相结合,以直接从输入的RGB图像和自然语言指令中解决多步骤操作任务。我们的系统旨在简单易用:它可以在标准的DROID设置上于一小时内完成安装并运行,并且可以以最小的努力适配新的机器人本体。我们在模拟和现实世界中对TiPToP(无需任何机器人数据)进行了评估,涉及28个桌面操作任务,发现其性能与在350小时特定本体演示上微调的视觉-语言-动作(Vision-Language-Action, VLA)模型$\pi_{0.5}\text{-DROID}$相当或更优。TiPToP的模块化架构使我们能够在组件级别分析系统的失败模式。我们分析了173次试验的结果,并确定了改进方向。我们将TiPToP开源,以促进对模块化操作系统的进一步研究以及学习与规划之间的更紧密集成。项目网站和代码: https://tiptop-robot.github.io
cs.CV / 1 / 2603.08800
Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
Granulon:以自适应多粒度语义为 MLLM 唤醒像素级视觉编码器
Abstract
Recent advances in multimodal large language models largely rely on CLIP-based visual encoders, which emphasize global semantic alignment but struggle with fine-grained visual understanding. In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. Granulon introduces a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level according to the semantic scope of the textual input, and an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens. This design enables unified "pixel-to-fine-to-coarse" reasoning within a single forward pass. Extensive and interpretable experiments demonstrate that Granulon improves accuracy by ~30% and reduces hallucination by ~20%, outperforming all visual encoders under identical settings.
Chinese Translation
最近在多模态大型语言模型方面的进展在很大程度上依赖于基于 CLIP 的视觉编码器,这些编码器强调全局语义对齐,但在细粒度视觉理解方面存在困难。相比之下,DINOv3 提供了强大的像素级感知,但缺乏粗粒度的语义抽象,导致多粒度推理能力有限。为了解决这一问题,我们提出了 Granulon,一种基于 DINOv3 的新型 MLLM,具有自适应粒度增强功能。Granulon 引入了一个文本条件的粒度控制器,该控制器根据文本输入的语义范围动态调整视觉抽象水平,以及一个自适应令牌聚合模块,该模块执行粒度引导的池化和关系感知聚类,以生成紧凑且语义丰富的视觉令牌。这一设计使得在单次前向传播中实现统一的“像素到细粒度再到粗粒度”推理。大量可解释的实验表明,Granulon 的准确性提高了约 30%,幻觉现象减少了约 20%,在相同设置下优于所有视觉编码器。
cs.CV / 2 / 2603.08809
Where, What, Why: Toward Explainable 3D-GS Watermarking
哪里、什么、为什么:迈向可解释的3D-GS水印
Abstract
As 3D Gaussian Splatting becomes the de facto representation for interactive 3D assets, robust yet imperceptible watermarking is critical. We present a representation-native framework that separates where to write from how to preserve quality. A Trio-Experts module operates directly on Gaussian primitives to derive priors for carrier selection, while a Safety and Budget Aware Gate (SBAG) allocates Gaussians to watermark carriers, optimized for bit resilience under perturbation and bitrate budgets, and to visual compensators that are insulated from watermark loss. To maintain fidelity, we introduce a channel-wise group mask that controls gradient propagation for carriers and compensators, thereby limiting Gaussian parameter updates, repairing local artifacts, and preserving high-frequency details without increasing runtime. Our design yields view-consistent watermark persistence and strong robustness against common image distortions such as compression and noise, while achieving a favorable robustness-quality trade-off compared with prior methods. In addition, decoupled finetuning provides per-Gaussian attributions that reveal where the message is carried and why those carriers are selected, enabling auditable explainability. Compared with state-of-the-art methods, our approach achieves a PSNR improvement of +0.83 dB and a bit-accuracy gain of +1.24%.
Chinese Translation
随着3D高斯散点(3D Gaussian Splatting)成为互动3D资产的事实标准,稳健且不可察觉的水印技术显得尤为重要。我们提出了一种基于表示的框架,将水印的写入位置与质量保持方式分开。一个三专家模块直接作用于高斯原语,以推导载体选择的先验知识,而一个安全与预算感知门(Safety and Budget Aware Gate, SBAG)则将高斯分配给水印载体,优化以应对扰动和比特率预算下的比特韧性,并为免受水印损失影响的视觉补偿器分配资源。为了保持保真度,我们引入了一种通道级组掩码,控制载体和补偿器的梯度传播,从而限制高斯参数的更新,修复局部伪影,并在不增加运行时间的情况下保留高频细节。我们的设计实现了视图一致的水印持久性,并对常见图像失真(如压缩和噪声)表现出强大的鲁棒性,同时与先前的方法相比,达成了良好的鲁棒性与质量的权衡。此外,解耦微调提供了每个高斯的归因,揭示了信息的承载位置及选择这些载体的原因,从而实现可审计的可解释性。与最先进的方法相比,我们的方法在峰值信噪比(PSNR)上提高了+0.83 dB,位准确率提升了+1.24%。
cs.CV / 3 / 2603.08812
VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model
VisionCreator-R1:一种增强反思的原生视觉生成智能模型
Abstract
Visual content generation has advanced from single-image to multi-image workflows, yet existing agents remain largely plan-driven and lack systematic reflection mechanisms to correct mid-trajectory visual errors. To address this limitation, we propose VisionCreator-R1, a native visual generation agent with explicit reflection, together with a Reflection-Plan Co-Optimization (RPCO) training methodology. Through extensive experiments and trajectory-level analysis, we uncover reflection-plan optimization asymmetry in reinforcement learning (RL): planning can be reliably optimized via plan rewards, while reflection learning is hindered by noisy credit assignment. Guided by this insight, our RPCO first trains on the self-constructed VCR-SFT dataset with reflection-strong single-image trajectories and planning-strong multi-image trajectories, then co-optimization on VCR-RL dataset via RL. This yields our unified VisionCreator-R1 agent, which consistently outperforms Gemini2.5Pro on existing benchmarks and our VCR-bench covering single-image and multi-image tasks.
Chinese Translation
视觉内容生成已从单图像工作流发展到多图像工作流,但现有的智能体仍然主要依赖计划驱动,缺乏系统的反思机制来纠正中途的视觉错误。为了解决这一局限性,我们提出了VisionCreator-R1,这是一种具有明确反思的原生视觉生成智能体,以及一种反思-计划协同优化(Reflection-Plan Co-Optimization, RPCO)训练方法。通过广泛的实验和轨迹级分析,我们揭示了强化学习(Reinforcement Learning, RL)中的反思-计划优化不对称性:计划可以通过计划奖励可靠地优化,而反思学习则受到噪声信用分配的阻碍。在这一洞察的指导下,我们的RPCO首先在自构建的VCR-SFT数据集上进行训练,该数据集包含反思强的单图像轨迹和规划强的多图像轨迹,然后通过RL在VCR-RL数据集上进行协同优化。这使我们得到了统一的VisionCreator-R1智能体,该智能体在现有基准测试和涵盖单图像及多图像任务的VCR-bench上始终优于Gemini2.5Pro。
cs.CV / 4 / 2603.08827
Computer Vision-Based Vehicle Allotment System using Perspective Mapping
使用透视映射的基于计算机视觉的车辆分配系统
Abstract
Smart city research envisions a future in which data-driven solutions and sustainable infrastructure work together to define urban living at the crossroads of urbanization and technology. Within this framework, smart parking systems play an important role in reducing urban congestion and supporting sustainable transportation. Automated parking solutions have considerable benefits, such as increased efficiency and less reliance on human involvement, but obstacles such as sensor limitations and integration complications remain. To overcome these obstacles, a more sophisticated vehicle allotment system is required, particularly in heavily populated urban areas. Computer vision, with its higher accuracy and adaptability, outperforms traditional sensor-based systems for recognizing vehicles and vacant parking spaces. Unlike fixed sensor technologies, computer vision can dynamically assess a wide range of visual inputs while adjusting to changing parking layouts. This research presents a cost-effective, easy-to-implement smart parking system utilizing computer vision and object detection models like YOLOv8. Using inverse perspective mapping (IPM) to merge images from four camera views, we extract data on vacant spaces. The system simulates a 3D parking environment, representing available spots with a 3D Cartesian plot to guide users.
Chinese Translation
智慧城市研究设想了一个未来,在这个未来中,数据驱动的解决方案与可持续基础设施共同作用,定义了城市化与技术交汇处的城市生活。在这一框架内,智能停车系统在减少城市拥堵和支持可持续交通方面发挥着重要作用。自动化停车解决方案具有显著的优势,例如提高效率和减少对人力的依赖,但仍然存在传感器限制和集成复杂性等障碍。为了解决这些问题,特别是在高度人口密集的城市地区,需要一个更复杂的车辆分配系统。计算机视觉凭借其更高的准确性和适应性,在识别车辆和空闲停车位方面优于传统的基于传感器的系统。与固定传感器技术不同,计算机视觉可以动态评估广泛的视觉输入,同时适应不断变化的停车布局。本研究提出了一种经济高效、易于实施的智能停车系统,利用计算机视觉和物体检测模型,如YOLOv8。通过逆透视映射(IPM)合并来自四个摄像头视角的图像,我们提取空闲停车位的数据。该系统模拟了一个三维停车环境,使用三维笛卡尔图表示可用停车位,以指导用户。
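As a rough illustration of the IPM step, a ground-plane homography can be estimated from four point correspondences with the standard DLT and then used to warp detections into a common top-down frame; this generic sketch is not the paper's exact pipeline:

```python
import numpy as np

def homography_from_points(src, dst):
    # Direct Linear Transform: solve A h = 0 for the 3x3 homography H
    # that maps each src point (x, y) to its dst point (u, v).
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.array(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def warp_point(H, pt):
    # Apply H in homogeneous coordinates, then dehomogenize.
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]
```

In production code, OpenCV's `cv2.getPerspectiveTransform` and `cv2.warpPerspective` perform the same 4-point estimation and full-image warp.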
cs.CV / 5 / 2603.08844
A Lightweight Multi-Cancer Tumor Localization Framework for Deployable Digital Pathology
可部署数字病理学的轻量级多癌症肿瘤定位框架
Abstract
Accurate localization of tumor regions from hematoxylin and eosin-stained whole-slide images is fundamental for translational research including spatial analysis, molecular profiling, and tissue architecture investigation. However, deep learning-based tumor detection trained within specific cancers may exhibit reduced robustness when applied across different tumor types. We investigated whether balanced training across cancers at modest scale can achieve high performance and generalize to unseen tumor types. A multi-cancer tumor localization model (MuCTaL) was trained on 79,984 non-overlapping tiles from four cancers (melanoma, hepatocellular carcinoma, colorectal cancer, and non-small cell lung cancer) using transfer learning with DenseNet169. The model achieved a tile-level ROC-AUC of 0.97 in validation data from the four training cancers, and 0.71 on an independent pancreatic ductal adenocarcinoma cohort. A scalable inference workflow was built to generate spatial tumor probability heatmaps compatible with existing digital pathology tools. Code and models are publicly available at https://github.com/AivaraX-AI/MuCTaL.
Chinese Translation
从苏木精-伊红染色的全切片图像中准确定位肿瘤区域对于包括空间分析、分子特征分析和组织结构研究在内的转化研究至关重要。然而,基于深度学习的肿瘤检测在特定癌症中训练后,应用于不同肿瘤类型时可能表现出较低的鲁棒性。我们研究了在适度规模下跨癌症进行平衡训练是否能够实现高性能并对未见过的肿瘤类型进行泛化。我们训练了一个多癌症肿瘤定位模型(MuCTaL),该模型使用DenseNet169进行迁移学习,基于来自四种癌症(黑色素瘤、肝细胞癌、结直肠癌和非小细胞肺癌)的79,984个非重叠图块。该模型在四种训练癌症的验证数据中达到了0.97的图块级ROC-AUC,而在一个独立的胰腺导管腺癌队列中达到了0.71。我们构建了一个可扩展的推理工作流,以生成与现有数字病理学工具兼容的空间肿瘤概率热图。代码和模型可在 https://github.com/AivaraX-AI/MuCTaL 上公开获取。
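The tile-level ROC-AUC reported above can be computed from per-tile scores via the Mann-Whitney rank statistic; a small self-contained sketch (the labels and scores below are made up):

```python
import numpy as np

def roc_auc(labels, scores):
    # Mann-Whitney formulation: the probability that a random positive tile
    # scores above a random negative tile (ties count half).
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))
```

This pairwise form is O(n_pos * n_neg); for millions of tiles one would use the rank-based equivalent instead.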
cs.CV / 6 / 2603.08850
HECTOR: Hybrid Editable Compositional Object References for Video Generation
HECTOR:用于视频生成的混合可编辑组合对象引用
Abstract
Real-world videos naturally portray complex interactions among distinct physical objects, effectively forming dynamic compositions of visual elements. However, most current video generation models synthesize scenes holistically and therefore lack mechanisms for explicit compositional manipulation. To address this limitation, we propose HECTOR, a generative pipeline that enables fine-grained compositional control. In contrast to prior methods, HECTOR supports hybrid reference conditioning, allowing generation to be simultaneously guided by static images and/or dynamic videos. Moreover, users can explicitly specify the trajectory of each referenced element, precisely controlling its location, scale, and speed (see Figure 1). This design allows the model to synthesize coherent videos that satisfy complex spatiotemporal constraints while preserving high-fidelity adherence to references. Extensive experiments demonstrate that HECTOR achieves superior visual quality, stronger reference preservation, and improved motion controllability compared with existing approaches.
Chinese Translation
现实世界的视频自然地描绘了不同物理对象之间的复杂交互,有效地形成了动态视觉元素的组合。然而,目前大多数视频生成模型以整体方式合成场景,因此缺乏显式组合操作的机制。为了解决这一限制,我们提出了HECTOR,一种能够实现细粒度组合控制的生成管道。与之前的方法相比,HECTOR支持混合引用条件,使得生成可以同时受到静态图像和/或动态视频的引导。此外,用户可以明确指定每个引用元素的轨迹,精确控制其位置、尺度和速度(见图1)。这一设计使得模型能够合成符合复杂时空约束的连贯视频,同时保持对引用的高保真遵循。大量实验表明,HECTOR在视觉质量、引用保持能力和运动可控性方面优于现有方法。
cs.CV / 7 / 2603.08897
Comparative Analysis of Patch Attack on VLM-Based Autonomous Driving Architectures
基于视觉语言模型的自主驾驶架构中的补丁攻击比较分析
Abstract
Vision-language models are emerging for autonomous driving, yet their robustness to physical adversarial attacks remains unexplored. This paper presents a systematic framework for comparative adversarial evaluation across three VLM architectures: Dolphins, OmniDrive (Omni-L), and LeapVAD. Using black-box optimization with semantic homogenization for fair comparison, we evaluate physically realizable patch attacks in CARLA simulation. Results reveal severe vulnerabilities across all architectures, sustained multi-frame failures, and critical object detection degradation. Our analysis exposes distinct architectural vulnerability patterns, demonstrating that current VLM designs inadequately address adversarial threats in safety-critical autonomous driving applications.
Chinese Translation
视觉语言模型在自主驾驶领域逐渐崭露头角,但其对物理对抗攻击的鲁棒性仍未得到探讨。本文提出了一个系统框架,用于对三种视觉语言模型架构(Dolphins、OmniDrive (Omni-L) 和 LeapVAD)进行比较对抗评估。我们采用黑箱优化和语义同质化的方法进行公平比较,在CARLA仿真环境中评估可物理实现的补丁攻击。结果显示,所有架构均存在严重的脆弱性,持续的多帧失败,以及关键物体检测性能的下降。我们的分析揭示了不同架构的脆弱性模式,表明当前的视觉语言模型设计未能充分应对安全关键的自主驾驶应用中的对抗威胁。
cs.CV / 8 / 2603.08898
Towards Visual Query Segmentation in the Wild
面向野外的视觉查询分割
Abstract
In this paper, we introduce visual query segmentation (VQS), a new paradigm of visual query localization (VQL) that aims to segment all pixel-level occurrences of an object of interest in an untrimmed video, given an external visual query. Compared to existing VQL locating only the last appearance of a target using bounding boxes, VQS enables more comprehensive (i.e., all object occurrences) and precise (i.e., pixel-level masks) localization, making it more practical for real-world scenarios. To foster research on this task, we present VQS-4K, a large-scale benchmark dedicated to VQS. Specifically, VQS-4K contains 4,111 videos with more than 1.3 million frames and covers a diverse set of 222 object categories. Each video is paired with a visual query defined by a frame outside the search video and its target mask, and annotated with spatial-temporal masklets corresponding to the queried target. To ensure high quality, all videos in VQS-4K are manually labeled with meticulous inspection and iterative refinement. To the best of our knowledge, VQS-4K is the first benchmark specifically designed for VQS. Furthermore, to stimulate future research, we present a simple yet effective method, named VQ-SAM, which extends SAM 2 by leveraging target-specific and background distractor cues from the video to progressively evolve the memory through a novel multi-stage framework with an adaptive memory generation (AMG) module for VQS, significantly improving the performance. In our extensive experiments on VQS-4K, VQ-SAM achieves promising results and surpasses all existing approaches, demonstrating its effectiveness. With the proposed VQS-4K and VQ-SAM, we expect to go beyond the current VQL paradigm and inspire more future research and practical applications on VQS. Our benchmark, code, and results will be made publicly available.
Chinese Translation
在本文中,我们引入了视觉查询分割(Visual Query Segmentation, VQS),这是一种新的视觉查询定位(Visual Query Localization, VQL)范式,旨在根据外部视觉查询在未裁剪的视频中分割出感兴趣对象的所有像素级出现。与现有的仅使用边界框定位目标最后一次出现的 VQL 相比,VQS 实现了更全面(即所有对象出现)和更精确(即像素级掩膜)的定位,使其在现实场景中更具实用性。为了促进这一任务的研究,我们提出了 VQS-4K,这是一个专门用于 VQS 的大规模基准数据集。具体而言,VQS-4K 包含 4,111 个视频,超过 130 万帧,涵盖 222 个多样化的对象类别。每个视频都与一个由搜索视频外部的帧及其目标掩膜定义的视觉查询配对,并附有与查询目标对应的时空掩膜。为了确保高质量,VQS-4K 中的所有视频都经过细致检查和迭代修正的人工标注。根据我们所知,VQS-4K 是第一个专门为 VQS 设计的基准。此外,为了激励未来的研究,我们提出了一种简单而有效的方法,称为 VQ-SAM,该方法通过利用视频中的目标特定和背景干扰信息,采用一种新颖的多阶段框架与自适应记忆生成(Adaptive Memory Generation, AMG)模块,逐步演化记忆,从而显著提高 VQS 的性能。在我们对 VQS-4K 的广泛实验中,VQ-SAM 取得了令人满意的结果,超越了所有现有方法,证明了其有效性。通过提出的 VQS-4K 和 VQ-SAM,我们期待超越当前的 VQL 范式,并激励更多关于 VQS 的未来研究和实际应用。我们的基准、代码和结果将公开发布。
cs.CV / 9 / 2603.08906
Multi-Kernel Gated Decoder Adapters for Robust Multi-Task Thyroid Ultrasound under Cross-Center Shift
跨中心偏移下的多核门控解码器适配器用于稳健的多任务甲状腺超声
Abstract
Thyroid ultrasound (US) automation couples two competing requirements: global, geometry-driven reasoning for nodule delineation and local, texture-driven reasoning for malignancy risk assessment. Under cross-center domain shift, these cues degrade asymmetrically, yet most multi-task pipelines rely on a single shared backbone, often inducing negative transfer. In this paper, we characterize this interference across CNN (ResNet34) and medical ViT (MedSAM) backbones, and observe a consistent trend: ViTs transfer geometric priors that benefit segmentation, whereas CNNs more reliably preserve texture cues for malignancy discrimination under strong shift and artifacts. Motivated by this failure mode, we propose a lightweight family of decoder-side adapters, the Multi-Kernel Gated Adapter (MKGA) and a residual variant (ResMKGA), which refine multi-scale skip features using complementary receptive fields and apply semantic, context-conditioned gating to suppress artifact-prone content before fusion. Across two US benchmarks, the proposed adapters improve cross-center robustness: they strengthen out-of-domain segmentation and, in the CNN setting, yield clear gains in clinical TI-RADS diagnostic accuracy compared to standard multi-task baselines. Code and models will be released.
Chinese Translation
甲状腺超声(US)自动化结合了两个相互竞争的需求:基于全局几何的推理用于结节轮廓描绘,以及基于局部纹理的推理用于恶性风险评估。在跨中心领域偏移下,这些线索不对称地退化,但大多数多任务管道依赖于单一共享的主干,往往导致负迁移。在本文中,我们对CNN(ResNet34)和医学ViT(MedSAM)主干之间的干扰进行了表征,并观察到一个一致的趋势:ViT传递有利于分割的几何先验,而CNN在强偏移和伪影下更可靠地保留用于恶性鉴别的纹理线索。基于这一失败模式,我们提出了一种轻量级的解码器侧适配器家族,即多核门控适配器(Multi-Kernel Gated Adapter, MKGA)及其残差变体(ResMKGA),该适配器利用互补的感受野精炼多尺度跳跃特征,并应用语义上下文条件门控在融合之前抑制易受伪影影响的内容。在两个超声基准测试中,所提出的适配器提高了跨中心的稳健性:它们增强了域外分割,并且在CNN设置下,相较于标准多任务基线,明显提高了临床TI-RADS诊断准确性。代码和模型将会发布。
cs.CV / 10 / 2603.08921
Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning
视觉-语言模型编码临床指南以进行基于概念的医学推理
Abstract
Concept Bottleneck Models (CBMs) are a prominent framework for interpretable AI that map learned visual features to a set of meaningful concepts for task-specific downstream predictions. Their sequential structure enhances transparency by connecting model predictions to the underlying concepts that support them. In medical imaging, where transparency is essential, CBMs offer an appealing foundation for explainable model design. However, discrete concept representations often overlook broader clinical context such as diagnostic guidelines and expert heuristics, reducing reliability in complex cases. We propose MedCBR, a concept-based reasoning framework that integrates clinical guidelines with vision-language and reasoning models. Labeled clinical descriptors are transformed into guideline-conformant text, and a concept-based model is trained with a multitask objective combining multimodal contrastive alignment, concept supervision, and diagnostic classification to jointly ground image features, concepts, and pathology. A reasoning model then converts these predictions into structured clinical narratives that explain the diagnosis, emulating expert reasoning based on established guidelines. MedCBR achieves superior diagnostic and concept-level performance, with AUROCs of 94.2% on ultrasound and 84.0% on mammography. Further experiments on non-medical datasets achieve 86.1% accuracy. Our framework enhances interpretability and forms an end-to-end bridge from medical image analysis to decision-making.
Chinese Translation
概念瓶颈模型(Concept Bottleneck Models, CBMs)是一个突出的可解释人工智能框架,它将学习到的视觉特征映射到一组有意义的概念,以进行特定任务的下游预测。其顺序结构通过将模型预测与支持它们的基础概念连接起来,从而增强了透明度。在医学影像学中,透明度至关重要,CBMs为可解释模型设计提供了一个吸引人的基础。然而,离散的概念表示往往忽视了更广泛的临床背景,如诊断指南和专家启发式,降低了在复杂案例中的可靠性。我们提出了MedCBR,这是一个基于概念的推理框架,旨在将临床指南与视觉-语言和推理模型相结合。标记的临床描述符被转换为符合指南的文本,并且一个基于概念的模型通过结合多模态对比对齐、概念监督和诊断分类的多任务目标进行训练,以共同定位图像特征、概念和病理。随后,推理模型将这些预测转换为结构化的临床叙述,以解释诊断,模拟基于既定指南的专家推理。MedCBR在诊断和概念层面上实现了卓越的性能,在超声图像上的AUROC为94.2%,在乳腺摄影上的AUROC为84.0%。在非医学数据集上的进一步实验实现了86.1%的准确率。我们的框架增强了可解释性,并形成了从医学图像分析到决策制定的端到端桥梁。
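The multitask objective above combines a contrastive alignment term, concept supervision, and diagnostic classification. A toy sketch of such a weighted sum, assuming binary-cross-entropy concept supervision and hypothetical equal weights (the paper's exact loss terms and weights may differ):

```python
import numpy as np

def bce(p, y, eps=1e-7):
    # Binary cross-entropy over predicted concept probabilities p vs labels y.
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def multitask_loss(l_align, concept_probs, concept_labels,
                   class_probs, class_label, w=(1.0, 1.0, 1.0)):
    # Weighted sum of alignment, concept-supervision, and diagnosis terms.
    l_concept = bce(concept_probs, concept_labels)
    l_cls = -np.log(class_probs[class_label])  # cross-entropy for the true class
    return w[0] * l_align + w[1] * l_concept + w[2] * l_cls
```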
cs.CV / 11 / 2603.08927
MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering
MEGC2026:视觉问答中的微表情大挑战
Abstract
Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. The emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. The ME grand challenge (MEGC) 2026 introduces two tasks that reflect these evolving research directions: (1) ME video question answering (ME-VQA), which explores ME understanding through visual question answering on relatively short video sequences, leveraging MLLMs or LVLMs to address diverse question types related to MEs; and (2) ME long-video question answering (ME-LVQA), which extends VQA to long-duration video sequences in realistic settings, requiring models to handle temporal reasoning and subtle micro-expression detection across extended time periods. All participating algorithms are required to submit their results on a public leaderboard. More details are available at https://megc2026.github.io.
Chinese Translation
面部微表情(MEs)是人们在经历情绪但试图抑制或压制面部表情时自发产生的非自愿面部运动,通常出现在高风险环境中。近年来,在微表情识别、检测和生成领域取得了显著进展。多模态大型语言模型(MLLMs)和大型视觉-语言模型(LVLMs)的出现,为通过其强大的多模态推理能力增强微表情分析提供了有希望的新途径。微表情大挑战(MEGC)2026引入了两个反映这些不断发展的研究方向的任务:(1)微表情视频问答(ME-VQA),通过对相对较短的视频序列进行视觉问答,探索微表情理解,利用MLLMs或LVLMs解决与微表情相关的多种问题类型;(2)微表情长视频问答(ME-LVQA),将视频问答扩展到现实环境中的长时间视频序列,要求模型在较长时间内处理时间推理和微妙的微表情检测。所有参与算法均需在公共排行榜上提交其结果。更多详细信息请访问 https://megc2026.github.io。
cs.CV / 12 / 2603.08928
TIDE: Text-Informed Dynamic Extrapolation with Step-Aware Temperature Control for Diffusion Transformers
TIDE:面向扩散变换器的文本引导动态外推与步骤感知温度控制
Abstract
Diffusion Transformer (DiT) faces challenges when generating images at resolutions higher than its training resolution, with attention dilution causing structural degradation in particular. Previous approaches attempt to mitigate this by sharpening attention distributions, but fail to preserve fine-grained semantic details and introduce obvious artifacts. In this work, we analyze the characteristics of DiTs and propose TIDE, a training-free text-to-image (T2I) extrapolation method that enables generation with arbitrary resolution and aspect ratio without additional sampling overhead. We identify the core factor for prompt information loss, and introduce a text anchoring mechanism to correct the imbalance between text and image tokens. To further eliminate artifacts, we design a dynamic temperature control mechanism that leverages the pattern of spectral progression in the diffusion process. Extensive evaluations demonstrate that TIDE delivers high-quality resolution extrapolation capability and integrates seamlessly with existing state-of-the-art methods.
Chinese Translation
扩散变换器(DiT)在生成高于训练分辨率的图像时面临挑战,尤其会因注意力稀释而出现结构退化。以往的方法试图通过锐化注意力分布来缓解这一问题,但未能保留细粒度的语义细节,并引入明显的伪影。在本研究中,我们分析了DiT的特性,并提出了TIDE,一种无训练的文本到图像(T2I)外推方法,能够在不增加额外采样开销的情况下生成任意分辨率和纵横比的图像。我们识别出提示信息丢失的核心因素,并引入了一种文本锚定机制,以纠正文本和图像标记之间的不平衡。为了进一步消除伪影,我们设计了一种动态温度控制机制,利用扩散过程中的频谱演进模式。大量评估表明,TIDE提供了高质量的分辨率外推能力,并与现有的最先进方法无缝集成。
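Step-aware temperature control can be pictured as scaling attention logits by a temperature that varies with the denoising step; the linear schedule below is an illustrative assumption, not TIDE's actual schedule:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tempered_attention(q, k, v, step, total_steps, t_min=0.7, t_max=1.0):
    # Lower temperature -> sharper attention early in denoising, relaxing
    # toward standard attention as the step index grows.
    tau = t_min + (t_max - t_min) * step / total_steps
    weights = softmax(q @ k.T / (np.sqrt(q.shape[-1]) * tau))
    return weights @ v, weights
```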
cs.CV / 13 / 2603.08930
Using Vision Language Foundation Models to Generate Plant Simulation Configurations via In-Context Learning
利用视觉语言基础模型通过上下文学习生成植物仿真配置
Abstract
This paper introduces a synthetic benchmark to evaluate the performance of vision language models (VLMs) in generating plant simulation configurations for digital twins. While functional-structural plant models (FSPMs) are useful tools for simulating biophysical processes in agricultural environments, their high complexity and low throughput create bottlenecks for deployment at scale. We propose a novel approach that leverages state-of-the-art open-source VLMs -- Gemma 3 and Qwen3-VL -- to directly generate simulation parameters in JSON format from drone-based remote sensing images. Using a synthetic cowpea plot dataset generated via the Helios 3D procedural plant generation library, we tested five in-context learning methods and evaluated the models across three categories: JSON integrity, geometric evaluations, and biophysical evaluations. Our results show that while VLMs can interpret structural metadata and estimate parameters like plant count and sun azimuth, they often exhibit performance degradation due to contextual bias or rely on dataset means when visual cues are insufficient. Validation on a real-world drone orthophoto dataset and an ablation study using a blind baseline further characterize the models' reasoning capabilities versus their reliance on contextual priors. To the best of our knowledge, this is the first study to utilize VLMs to generate structural JSON configurations for plant simulations, providing a scalable framework for reconstructing 3D plots for digital twins in agriculture.
Chinese Translation
本文介绍了一种合成基准,用于评估视觉语言模型(VLMs)在为数字孪生生成植物仿真配置方面的性能。功能结构植物模型(FSPMs)是模拟农业环境中生物物理过程的有用工具,但其高复杂性和低吞吐量在大规模部署中造成了瓶颈。我们提出了一种新颖的方法,利用最先进的开源VLMs——Gemma 3和Qwen3-VL——直接从基于无人机的遥感图像生成JSON格式的仿真参数。通过使用Helios 3D程序化植物生成库生成的合成豇豆地块数据集,我们测试了五种上下文学习方法,并在三个类别中评估了模型:JSON完整性、几何评估和生物物理评估。我们的结果表明,尽管VLMs能够解释结构元数据并估计植物数量和太阳方位角等参数,但当视觉线索不足时,它们常常因上下文偏差而表现出性能下降,或依赖于数据集均值。对真实世界无人机正射影像数据集的验证以及使用盲基线的消融研究进一步表征了模型的推理能力与其对上下文先验的依赖。据我们所知,这是首次利用VLMs为植物仿真生成结构化JSON配置的研究,为农业数字孪生的3D地块重建提供了可扩展的框架。
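Checking JSON integrity and falling back to dataset means when fields are missing or malformed might look like the sketch below; the schema keys and fallback values are hypothetical, not the benchmark's actual fields:

```python
import json

# Hypothetical schema and fallback means; the benchmark's real fields differ.
REQUIRED = {"plant_count": int, "sun_azimuth_deg": (int, float)}
DATASET_MEANS = {"plant_count": 24, "sun_azimuth_deg": 180.0}

def parse_config(raw):
    """Validate a model-emitted JSON config; fall back to means per field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return dict(DATASET_MEANS), False
    out, ok = {}, True
    for key, typ in REQUIRED.items():
        val = data.get(key)
        if isinstance(val, typ) and not isinstance(val, bool):
            out[key] = val
        else:
            out[key] = DATASET_MEANS[key]  # dataset-mean fallback
            ok = False
    return out, ok
```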
cs.CV / 14 / 2603.08935
PathoScribe: Transforming Pathology Data into a Living Library with a Unified LLM-Driven Framework for Semantic Retrieval and Clinical Integration
PathoScribe:通过统一的LLM驱动语义检索与临床整合框架,将病理数据转化为活库
Abstract
Pathology underpins modern diagnosis and cancer care, yet its most valuable asset, the accumulated experience encoded in millions of narrative reports, remains largely inaccessible. Although institutions are rapidly digitizing pathology workflows, storing data without effective mechanisms for retrieval and reasoning risks transforming archives into a passive data repository, where institutional knowledge exists but cannot meaningfully inform patient care. True progress requires not only digitization, but the ability for pathologists to interrogate prior similar cases in real time while evaluating a new diagnostic dilemma. We present PathoScribe, a unified retrieval-augmented large language model (LLM) framework designed to transform static pathology archives into a searchable, reasoning-enabled living library. PathoScribe enables natural language case exploration, automated cohort construction, clinical question answering, immunohistochemistry (IHC) panel recommendation, and prompt-controlled report transformation within a single architecture. Evaluated on 70,000 multi-institutional surgical pathology reports, PathoScribe achieved perfect Recall@10 for natural language case retrieval and demonstrated high-quality retrieval-grounded reasoning (mean reviewer score 4.56/5). Critically, the system operationalized automated cohort construction from free-text eligibility criteria, assembling research-ready cohorts in minutes (mean 9.2 minutes) with 91.3% agreement to human reviewers and no eligible cases incorrectly excluded, representing orders-of-magnitude reductions in time and cost compared to traditional manual chart review. This work establishes a scalable foundation for converting digital pathology archives from passive storage systems into active clinical intelligence platforms.
Chinese Translation
病理学是现代诊断和癌症护理的基础,但其最有价值的资产——以数百万份叙述报告编码的积累经验——仍然在很大程度上无法获取。尽管各机构正在迅速数字化病理工作流程,但如果缺乏有效的检索和推理机制,仅仅存储数据可能使档案沦为被动的数据存储库,其中的机构知识虽然存在,却无法有意义地指导患者护理。真正的进步不仅需要数字化,还需要病理学家在评估新的诊断难题时,能够实时查询先前的类似案例。我们提出了PathoScribe,一个统一的检索增强大型语言模型(LLM)框架,旨在将静态病理档案转变为可搜索、具备推理能力的活库。PathoScribe在单一架构内支持自然语言案例探索、自动化队列构建、临床问题解答、免疫组化(IHC)面板推荐以及提示控制的报告转化。在对70,000份多机构外科病理报告的评估中,PathoScribe在自然语言案例检索中实现了完美的Recall@10,并展示了高质量的基于检索的推理(平均评审分数4.56/5)。关键的是,该系统实现了基于自由文本资格标准的自动化队列构建,在几分钟内(平均9.2分钟)组装出可用于研究的队列,与人工评审者的一致率为91.3%,且未错误排除任何合格案例,相较于传统的手动图表审查,时间和成本减少了几个数量级。这项工作为将数字病理档案从被动存储系统转变为主动临床智能平台奠定了可扩展的基础。
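Natural-language case retrieval of this kind typically ranks archived report embeddings by cosine similarity to a query embedding; a minimal sketch of the ranking step (the embedding model itself is out of scope here):

```python
import numpy as np

def top_k_cases(query_vec, report_vecs, k=10):
    # Cosine similarity between the query and every archived report embedding,
    # returning the indices of the k most similar reports.
    q = query_vec / np.linalg.norm(query_vec)
    R = report_vecs / np.linalg.norm(report_vecs, axis=1, keepdims=True)
    return np.argsort(-(R @ q))[:k]
```

At the scale of 70,000 reports an approximate-nearest-neighbor index would normally replace the brute-force dot product.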
cs.CV / 15 / 2603.08942
BiCLIP: Domain Canonicalization via Structured Geometric Transformation
BiCLIP:通过结构化几何变换进行领域规范化
Abstract
Recent advances in vision-language models (VLMs) have demonstrated remarkable zero-shot capabilities, yet adapting these models to specialized domains remains a significant challenge. Building on recent theoretical insights suggesting that independently trained VLMs are related by a canonical transformation, we extend this understanding to the concept of domains. We hypothesize that image features across disparate domains are related by a canonicalized geometric transformation that can be recovered using a small set of anchors. Few-shot classification provides a natural setting for this alignment, as the limited labeled samples serve as the anchors required to estimate this transformation. Motivated by this hypothesis, we introduce BiCLIP, a framework that applies a targeted transformation to multimodal features to enhance cross-modal alignment. Our approach is characterized by its extreme simplicity and low parameter footprint. Extensive evaluations across 11 standard benchmarks, including EuroSAT, DTD, and FGVCAircraft, demonstrate that BiCLIP consistently achieves state-of-the-art results. Furthermore, we provide empirical verification of existing geometric findings by analyzing the orthogonality and angular distribution of the learned transformations, confirming that structured alignment is the key to robust domain adaptation. Code is available at https://github.com/QuantitativeImagingLaboratory/BilinearCLIP
Chinese Translation
视觉-语言模型(VLMs)的最新进展展示了显著的零样本能力,但将这些模型适配到专业领域仍然是一个重大挑战。基于最近的理论见解——独立训练的VLMs之间由一个规范变换相联系——我们将这种理解扩展到领域的概念。我们假设,不同领域之间的图像特征通过一种规范化的几何变换相互关联,该变换可以利用一小组锚点恢复。少样本分类为这种对齐提供了自然的场景,因为有限的标注样本正好充当了估计该变换所需的锚点。受到这一假设的启发,我们提出了BiCLIP,一个对多模态特征应用针对性变换以增强跨模态对齐的框架。我们的方法以极简的设计和极低的参数量为特征。在包括EuroSAT、DTD和FGVCAircraft在内的11个标准基准上进行的广泛评估表明,BiCLIP始终取得最先进的结果。此外,我们通过分析学习到的变换的正交性和角度分布,为现有的几何发现提供了实证验证,确认结构化对齐是稳健领域适应的关键。代码可在 https://github.com/QuantitativeImagingLaboratory/BilinearCLIP 获取。
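Recovering a linear transformation from a small set of anchors can be done in closed form by least squares; a sketch of that estimation step, assuming the few-shot samples supply paired source/target features (BiCLIP's actual parameterization may differ):

```python
import numpy as np

def fit_linear_map(anchor_src, anchor_dst):
    # Least-squares W minimizing ||anchor_src @ W - anchor_dst||_F:
    # the few labeled shots act as the anchors estimating the transformation.
    W, *_ = np.linalg.lstsq(anchor_src, anchor_dst, rcond=None)
    return W
```

With an orthogonality constraint on W (as the paper's geometric analysis suggests), the closed-form solution becomes the orthogonal Procrustes problem instead.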
cs.CV / 16 / 2603.08967
Can You Hear, Localize, and Segment Continually? An Exemplar-Free Continual Learning Benchmark for Audio-Visual Segmentation
你能持续地听、定位并分割吗?用于音视频分割的无示例持续学习基准
Abstract
Audio-Visual Segmentation (AVS) aims to produce pixel-level masks of sound-producing objects in videos by jointly learning from audio and visual signals. However, real-world environments are inherently dynamic, causing audio and visual distributions to evolve over time, which challenges existing AVS systems that assume static training settings. To address this gap, we introduce the first exemplar-free continual learning benchmark for Audio-Visual Segmentation, comprising four learning protocols across single-source and multi-source AVS datasets. We further propose a strong baseline, ATLAS, which uses audio-guided pre-fusion conditioning to modulate visual feature channels via projected audio context before cross-modal attention. Finally, we mitigate catastrophic forgetting by introducing Low-Rank Anchoring (LRA), which stabilizes adapted weights based on loss sensitivity. Extensive experiments demonstrate competitive performance across diverse continual scenarios, establishing a foundation for lifelong audio-visual perception. Code is available at https://gitlab.com/viper-purdue/atlas (paper under review). Keywords: Continual Learning, Audio-Visual Segmentation, Multi-Modal Learning.
Chinese Translation
音视频分割(Audio-Visual Segmentation, AVS)旨在通过联合学习音频和视觉信号,生成视频中发声物体的像素级掩码。然而,现实世界环境本质上是动态的,导致音频和视觉分布随时间演变,这对假设静态训练环境的现有AVS系统构成了挑战。为了解决这一问题,我们引入了首个用于音视频分割的无示例持续学习基准,包含单源和多源AVS数据集上的四种学习协议。我们进一步提出了一个强基线模型ATLAS,该模型利用音频引导的预融合条件,通过投影的音频上下文调节视觉特征通道,然后进行跨模态注意力处理。最后,我们通过引入低秩锚定(Low-Rank Anchoring, LRA)来缓解灾难性遗忘,该方法基于损失敏感性稳定适应权重。大量实验表明,ATLAS在多样的持续场景中表现出竞争力,为终身音视频感知奠定了基础。代码可在 https://gitlab.com/viper-purdue/atlas 获取(论文正在审稿中)。关键词:持续学习、音视频分割、多模态学习。
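Audio-guided pre-fusion conditioning can be sketched as a FiLM-style per-channel scale and shift derived from the projected audio context; the projection matrices below are placeholders, not ATLAS's learned weights:

```python
import numpy as np

def audio_gate(visual_feats, audio_ctx, W_scale, W_shift):
    # Project the audio context to per-channel scale/shift, then modulate
    # the visual feature channels before cross-modal attention.
    scale = 1.0 / (1.0 + np.exp(-(audio_ctx @ W_scale)))  # sigmoid gate in (0, 1)
    shift = audio_ctx @ W_shift
    return visual_feats * scale + shift
```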
cs.CV / 17 / 2603.08982
SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing
SVG-EAR:通过误差感知路由为稀疏视频生成提供无参数线性补偿
Abstract
Diffusion Transformers (DiTs) have become a leading backbone for video generation, yet their quadratic attention cost remains a major bottleneck. Sparse attention reduces this cost by computing only a subset of attention blocks. However, prior methods often either drop the remaining blocks, which incurs information loss, or rely on learned predictors to approximate them, introducing training overhead and potential output distribution shifting. In this paper, we show that the missing contributions can be recovered without training: after semantic clustering, keys and values within each block exhibit strong similarity and can be well summarized by a small set of cluster centroids. Based on this observation, we introduce SVG-EAR, a parameter-free linear compensation branch that uses the centroid to approximate skipped blocks and recover their contributions. While centroid compensation is accurate for most blocks, it can fail on a small subset. Standard sparsification typically selects blocks by attention scores, which indicate where the model places its attention mass, but not where the approximation error would be largest. SVG-EAR therefore performs error-aware routing: a lightweight probe estimates the compensation error for each block, and we compute exactly the blocks with the highest error-to-cost ratio while compensating for skipped blocks. We provide theoretical guarantees that relate attention reconstruction error to clustering quality, and empirically show that SVG-EAR improves the quality-efficiency trade-off and increases throughput at the same generation fidelity on video diffusion tasks. Overall, SVG-EAR establishes a clear Pareto frontier over prior approaches, achieving up to 1.77$\times$ and 1.93$\times$ speedups while maintaining PSNRs of up to 29.759 and 31.043 on Wan2.2 and HunyuanVideo, respectively.
Chinese Translation
扩散变换器(Diffusion Transformers, DiTs)已成为视频生成的主要框架,但其二次注意力成本仍然是一个主要瓶颈。稀疏注意力通过仅计算一部分注意力块来降低这一成本。然而,先前的方法通常要么丢弃剩余的块,这会导致信息损失,要么依赖学习的预测器来近似它们,从而引入训练开销和潜在的输出分布偏移。在本文中,我们展示了缺失的贡献可以在不训练的情况下恢复:经过语义聚类后,每个块内的键和值表现出强相似性,并且可以通过一小组聚类中心很好地总结。基于这一观察,我们引入了SVG-EAR,一个无参数的线性补偿分支,利用聚类中心来近似跳过的块并恢复其贡献。虽然聚类中心补偿对于大多数块是准确的,但在少数子集上可能会失败。标准稀疏化通常通过注意力分数选择块,这些分数指示模型将注意力集中在哪里,但并不指示近似误差将最大的位置。因此,SVG-EAR执行误差感知路由:一个轻量级探测器估计每个块的补偿误差,我们精确计算具有最高误差与成本比的块,同时补偿跳过的块。我们提供了理论保证,将注意力重建误差与聚类质量相关联,并实证表明SVG-EAR改善了质量与效率的权衡,并在视频扩散任务中以相同的生成保真度提高了吞吐量。总体而言,SVG-EAR在先前方法上建立了明确的帕累托前沿,实现了高达1.77×和1.93×的加速,同时在Wan2.2和HunyuanVideo上保持了高达29.759和31.043的PSNR。
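The centroid compensation can be illustrated by comparing a block's exact (unnormalized) attention contribution with its centroid summary; when the block's keys and values are internally similar, the two coincide, which is the regime the clustering step creates:

```python
import numpy as np

def block_contrib_exact(q, K, V):
    # Unnormalized attention contribution of one block for query q.
    w = np.exp(K @ q / np.sqrt(len(q)))
    return w @ V, w.sum()

def block_contrib_centroid(q, K, V):
    # Replace a skipped block by its key/value centroids, scaled by block
    # size: a parameter-free linear stand-in for that block's contribution.
    n = len(K)
    w = n * np.exp(K.mean(axis=0) @ q / np.sqrt(len(q)))
    return w * V.mean(axis=0), w
```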
cs.CV / 18 / 2603.08997
SkipGS: Post-Densification Backward Skipping for Efficient 3DGS Training
SkipGS:高效3DGS训练的后稠密化反向跳过
Abstract
3D Gaussian Splatting (3DGS) achieves real-time novel-view synthesis by optimizing millions of anisotropic Gaussians, yet its training remains expensive, with the backward pass dominating runtime in the post-densification refinement phase. We observe substantial update redundancy in this phase: many sampled views have near-plateaued losses and provide diminishing gradient benefits, but standard training still runs full backpropagation. We propose SkipGS with a novel view-adaptive backward gating mechanism for efficient post-densification training. SkipGS always performs the forward pass to update per-view loss statistics, and selectively skips backward passes when the sampled view's loss is consistent with its recent per-view baseline, while enforcing a minimum backward budget for stable optimization. On Mip-NeRF 360, compared to 3DGS, SkipGS reduces end-to-end training time by 23.1%, driven by a 42.0% reduction in post-densification time, with comparable reconstruction quality. Because it only changes when to backpropagate -- without modifying the renderer, representation, or loss -- SkipGS is plug-and-play and compatible with other complementary efficiency strategies for additive speedups.
Chinese Translation
3D高斯溅射(3DGS)通过优化数百万个各向异性高斯体实现实时新视角合成,但其训练仍然昂贵,后稠密化精炼阶段的反向传播占据了主要的运行时间。我们观察到在这一阶段存在显著的更新冗余:许多采样视图的损失几乎达到平台期,并提供递减的梯度收益,但标准训练仍然执行完整的反向传播。我们提出了SkipGS,一种具有新颖视角自适应反向门控机制的高效后稠密化训练方法。SkipGS始终执行前向传播以更新每个视图的损失统计,并在采样视图的损失与其最近的每个视图基线一致时选择性地跳过反向传播,同时强制执行最低反向预算以确保稳定的优化。在Mip-NeRF 360上,与3DGS相比,SkipGS将端到端训练时间减少了23.1%,其中后稠密化时间减少了42.0%,且重建质量相当。由于它仅在何时进行反向传播时发生变化——而不修改渲染器、表示或损失——SkipGS是即插即用的,并且与其他互补的效率策略兼容,以实现附加的加速。
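The view-adaptive backward gate can be sketched as a per-view loss baseline with a plateau test and a minimum backward budget; the tolerance, budget, and momentum below are illustrative, not the paper's values:

```python
class SkipGate:
    """Per-view backward gating: always forward, selectively backward.

    Skips backpropagation when a view's loss matches its recent baseline
    (an EMA) within a tolerance, while enforcing a minimum backward budget
    so optimization stays stable.
    """

    def __init__(self, tol=0.02, min_budget=0.2, momentum=0.9):
        self.base = {}
        self.tol, self.min_budget, self.momentum = tol, min_budget, momentum
        self.steps = self.backwards = 0

    def should_backward(self, view_id, loss):
        self.steps += 1
        prev = self.base.get(view_id)
        self.base[view_id] = loss if prev is None else (
            self.momentum * prev + (1 - self.momentum) * loss)
        under_budget = self.backwards < self.min_budget * self.steps
        plateaued = prev is not None and abs(loss - prev) <= self.tol * abs(prev)
        run = (not plateaued) or under_budget
        self.backwards += run
        return run
```

In a real trainer, `should_backward` would wrap the call to `loss.backward()` after every forward pass.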
cs.CV / 19 / 2603.08998
Diffusion-Based Authentication of Copy Detection Patterns: A Multimodal Framework with Printer Signature Conditioning
基于扩散的复制检测模式认证:一种带有打印机签名条件的多模态框架
Abstract
Counterfeiting affects diverse industries, including pharmaceuticals, electronics, and food, posing serious health and economic risks. Printable unclonable codes, such as Copy Detection Patterns (CDPs), are widely used as an anti-counterfeiting measure and are applied to products and packaging. However, the increasing availability of high-resolution printing and scanning devices, along with advances in generative deep learning, undermines traditional authentication systems, which often fail to distinguish high-quality counterfeits from genuine prints. In this work, we propose a diffusion-based authentication framework that jointly leverages the original binary template, the printed CDP, and a representation of printer identity that captures relevant semantic information. Formulating authentication as multi-class printer classification over printer signatures lets our model capture fine-grained, device-specific features via spatial and textual conditioning. We extend ControlNet by repurposing the denoising process for class-conditioned noise prediction, enabling effective printer classification. On the Indigo 1 x 1 Base dataset, our method outperforms traditional similarity metrics and prior deep learning approaches. Results show the framework generalises to counterfeit types unseen during training.
Chinese Translation
伪造行为影响着制药、电子和食品等多个行业,带来了严重的健康和经济风险。可打印的不可克隆代码,如复制检测模式(Copy Detection Patterns, CDPs),被广泛用作反伪造措施,并应用于产品和包装。然而,随着高分辨率打印和扫描设备的日益普及,以及生成深度学习的进步,传统认证系统受到威胁,往往无法区分高质量的伪造品和真品。在本研究中,我们提出了一种基于扩散的认证框架,该框架联合利用原始二进制模板、打印的CDP以及捕捉相关语义信息的打印机身份表示。将认证问题表述为基于打印机签名的多类打印机分类,使我们的模型能够通过空间和文本条件捕捉细粒度的设备特征。我们通过重新利用去噪过程进行类条件噪声预测来扩展ControlNet,从而实现有效的打印机分类。在Indigo 1 x 1 Base数据集上,我们的方法优于传统相似性度量和先前的深度学习方法。结果表明,该框架能够推广到训练期间未见过的伪造类型。
cs.CV / 20 / 2603.09037
WS-Net: Weak-Signal Representation Learning and Gated Abundance Reconstruction for Hyperspectral Unmixing via State-Space and Weak Signal Attention Fusion
WS-Net:通过状态空间和弱信号注意力融合进行弱信号表征学习和门控丰度重构的高光谱解混
Abstract
Weak spectral responses in hyperspectral images are often obscured by dominant endmembers and sensor noise, resulting in inaccurate abundance estimation. This paper introduces WS-Net, a deep unmixing framework specifically designed to address weak-signal collapse through state-space modelling and Weak Signal Attention fusion. The network features a multi-resolution wavelet-fused encoder that captures both high-frequency discontinuities and smooth spectral variations with a hybrid backbone that integrates a Mamba state-space branch for efficient long-range dependency modelling. It also incorporates a Weak Signal Attention branch that selectively enhances low-similarity spectral cues. A learnable gating mechanism adaptively fuses both representations, while the decoder leverages KL-divergence-based regularisation to enforce separability between dominant and weak endmembers. Experiments on one simulated and two real datasets (synthetic dataset, Samson, and Apex) demonstrate consistent improvements over six state-of-the-art baselines, achieving up to 55% and 63% reductions in RMSE and SAD, respectively. The framework maintains stable accuracy under low-SNR conditions, particularly for weak endmembers, establishing WS-Net as a robust and computationally efficient benchmark for weak-signal hyperspectral unmixing.
Chinese Translation
高光谱图像中的弱光谱响应常常被主导端元和传感器噪声所掩盖,导致丰度估计不准确。本文介绍了WS-Net,一种深度解混框架,专门设计用于通过状态空间建模和弱信号注意力融合来解决弱信号崩溃问题。该网络具有多分辨率小波融合编码器,能够捕捉高频不连续性和平滑光谱变化,同时集成了Mamba状态空间分支的混合骨干网络,以高效建模长距离依赖关系。它还包含一个弱信号注意力分支,选择性地增强低相似度的光谱线索。一个可学习的门控机制自适应地融合这两种表征,而解码器利用基于KL散度的正则化来强制主导端元和弱端元之间的可分离性。在一个模拟数据集和两个真实数据集(合成数据集、Samson和Apex)上的实验表明,WS-Net相较于六种最先进的基线方法实现了一致的改进,RMSE和SAD分别减少了高达55%和63%。该框架在低信噪比条件下保持稳定的准确性,特别是对于弱端元,确立了WS-Net作为弱信号高光谱解混的稳健且计算高效的基准。
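The learnable gating that fuses the state-space and Weak Signal Attention branches can be sketched minimally as a per-channel convex combination; the function names and the exact combination form below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(mamba_feat, wsa_feat, gate_logits):
    """Adaptively fuse state-space and Weak Signal Attention features.

    g is a per-channel gate in (0, 1); the fused representation is a
    convex combination of the two branches, so neither is discarded.
    """
    g = sigmoid(gate_logits)  # gate_logits would be learned in the real network
    return g * mamba_feat + (1.0 - g) * wsa_feat

# toy example: 4-channel features from each branch
a = np.ones(4)
b = np.zeros(4)
fused = gated_fuse(a, b, gate_logits=np.zeros(4))  # g = 0.5 everywhere
```

Because the gate is bounded in (0, 1), the fusion can interpolate smoothly between the long-range-dependency branch and the weak-signal branch per channel.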
cs.CV / 21 / 2603.09054
Spectral-Structured Diffusion for Single-Image Rain Removal
基于谱结构扩散的单幅图像雨水去除
Abstract
Rain streaks manifest as directional and frequency-concentrated structures that overlap across multiple scales, making single-image rain removal particularly challenging. While diffusion-based restoration models provide a powerful framework for progressive denoising, standard spatial-domain diffusion does not explicitly account for such structured spectral characteristics. We introduce SpectralDiff, a spectral-structured diffusion-based framework tailored for single-image rain removal. Rather than redefining the diffusion formulation, our method incorporates structured spectral perturbations to guide the progressive suppression of multi-directional rain components. To support this design, we further propose a full-product U-Net architecture that leverages the convolution theorem to replace convolution operations with element-wise product layers, improving computational efficiency while preserving modeling capacity. Extensive experiments on synthetic and real-world benchmarks demonstrate that SpectralDiff achieves competitive rain removal performance with improved model compactness and favorable inference efficiency compared to existing diffusion-based approaches.
Chinese Translation
雨条表现为在多个尺度上重叠的方向性和频率集中结构,这使得单幅图像的雨水去除特别具有挑战性。尽管基于扩散的恢复模型为渐进去噪提供了强大的框架,但标准的空间域扩散并未明确考虑这种结构化的谱特征。我们提出了SpectralDiff,一种针对单幅图像雨水去除的基于谱结构扩散的框架。我们的方法并未重新定义扩散公式,而是结合结构化的谱扰动,以引导多方向雨水成分的渐进抑制。为了支持这一设计,我们进一步提出了一种全乘积U-Net架构,该架构利用卷积定理将卷积操作替换为逐元素乘积层,从而提高计算效率,同时保持建模能力。在合成和真实世界基准上的大量实验表明,SpectralDiff在雨水去除性能上具有竞争力,并且与现有的基于扩散的方法相比,模型紧凑性和推理效率得到了改善。
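The convolution theorem that SpectralDiff's product layers rely on states that circular convolution in the spatial domain equals an element-wise product in the Fourier domain. A minimal numpy check of the identity (the 1D toy signal and kernel are hypothetical):

```python
import numpy as np

def circular_conv(x, k):
    """Direct circular convolution, O(n^2)."""
    n = len(x)
    return np.array([sum(x[(i - j) % n] * k[j] for j in range(n))
                     for i in range(n)])

def fft_conv(x, k):
    """Same result via the convolution theorem: an element-wise
    product in the Fourier domain replaces the convolution."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([1.0, 0.0, 0.0, 1.0])
```

The FFT route costs O(n log n) instead of O(n^2), which is the efficiency argument for replacing convolution layers with element-wise product layers.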
cs.CV / 22 / 2603.09069
Intelligent Spatial Estimation for Fire Hazards in Engineering Sites: An Enhanced YOLOv8-Powered Proximity Analysis Framework
工程现场火灾风险的智能空间估计:增强型YOLOv8驱动的邻近分析框架
Abstract
This study proposes an enhanced dual-model YOLOv8 framework for intelligent fire detection and proximity-aware risk assessment, extending conventional vision-based monitoring beyond simple detection to actionable hazard prioritization. The system is trained on a dataset of 9,860 annotated images to segment fire and smoke across complex environments. The framework combines a primary YOLOv8 instance segmentation model for fire and smoke detection with a secondary object detection model pretrained on the COCO dataset to identify surrounding entities such as people, vehicles, and infrastructure. By integrating the outputs of both models, the system computes pixel-based distances between detected fire regions and nearby objects and converts these values into approximate real-world measurements using a pixel-to-meter scaling approach. This proximity information is incorporated into a risk assessment mechanism that combines fire evidence, object vulnerability, and distance-based exposure to produce a quantitative risk score and alert level. The proposed framework achieves strong performance, with precision, recall, and F1 scores exceeding 90% and mAP@0.5 above 91%. The system generates annotated visual outputs showing fire locations, detected objects, estimated distances, and contextual risk information to support situational awareness. Implemented using open-source tools within the Google Colab environment, the framework is lightweight and suitable for deployment in industrial and resource-constrained settings.
Chinese Translation
本研究提出了一种增强型双模型YOLOv8框架,用于智能火灾检测和邻近感知风险评估,扩展了传统基于视觉的监测,从简单检测提升至可操作的危害优先级排序。该系统在一个包含9,860张标注图像的数据集上进行训练,以分割复杂环境中的火焰和烟雾。该框架结合了一个主要的YOLOv8实例分割模型用于火焰和烟雾检测,以及一个在COCO数据集上预训练的次级物体检测模型,用于识别周围实体,如人员、车辆和基础设施。通过整合两个模型的输出,该系统计算检测到的火灾区域与附近物体之间的基于像素的距离,并使用像素到米的缩放方法将这些值转换为近似的现实世界测量。该邻近信息被纳入风险评估机制,该机制结合火灾证据、物体脆弱性和基于距离的暴露,生成定量风险评分和警报级别。所提出的框架表现出色,精确度、召回率和F1分数均超过90%,mAP@0.5超过91%。该系统生成标注的视觉输出,显示火灾位置、检测到的物体、估计的距离和上下文风险信息,以支持情境意识。该框架在Google Colab环境中使用开源工具实现,轻量且适合在工业和资源受限的环境中部署。
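The pixel-to-meter scaling and distance-based risk scoring described above can be sketched as follows; the calibration via a reference object and the multiplicative scoring rule are illustrative assumptions, not the paper's exact formulas:

```python
def pixel_to_meters(pixel_dist, ref_pixels, ref_meters):
    """Convert a pixel distance to an approximate real-world distance
    using a reference object of known size (hypothetical calibration)."""
    return pixel_dist * (ref_meters / ref_pixels)

def risk_score(fire_conf, vulnerability, distance_m, d_max=20.0):
    """Hypothetical scoring rule: combine fire evidence, object
    vulnerability, and distance-based exposure into a [0, 1] score."""
    exposure = max(0.0, 1.0 - distance_m / d_max)  # closer => more exposed
    return fire_conf * vulnerability * exposure

# e.g. a person (high vulnerability) near a confidently detected fire
d = pixel_to_meters(pixel_dist=250, ref_pixels=100, ref_meters=2.0)  # 5.0 m
score = risk_score(fire_conf=0.9, vulnerability=1.0, distance_m=d)
```

A score like this can then be thresholded into discrete alert levels for hazard prioritization.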
cs.CV / 23 / 2603.09079
GST-VLA: Structured Gaussian Spatial Tokens for 3D Depth-Aware Vision-Language-Action Models
GST-VLA:用于3D深度感知视觉-语言-动作模型的结构化高斯空间标记
Abstract
VLA models encode visual observations as 2D patch tokens with no intrinsic geometric structure. We introduce GST-VLA with two contributions. First, the Gaussian Spatial Tokenizer (GST) converts frozen dense depth and frozen semantic patch features into $N_g{=}128$ anisotropic 3D Gaussian primitives, each parameterized by a metric residual mean $\mu \in \mathbb{R}^3$, log-scale covariance $\log \sigma \in \mathbb{R}^3$, and learned opacity $\alpha \in (0,1)$. The covariance eigenstructure encodes local surface orientation, and opacity provides per-primitive geometric confidence, both inaccessible from scalar depth. Spatial attention pooling with learned queries concentrates the fixed token budget on geometrically salient regions rather than distributing uniformly. Second, 3D Depth-Aware Chain-of-Thought (DA-CoT) reasoning supervises four structured intermediate spatial thoughts, covering 3D object grounding, grasp affordance contact geometry, pairwise metric distances, and coarse SE(3) waypoints, as explicit generation targets in the training loss. A cross-attention sublayer at every VLM transformer block provides direct access to the raw 256-primitive Gaussian field during DA-CoT generation. A 300M-parameter flow-matching action expert with mixture-of-experts feedforward sublayers decodes 7-DoF delta action chunks via conditional ODE integration, conditioned on both VLM hidden states and DA-CoT outputs through dual cross-attention. Trained with composite $\mathcal{L}_\mathrm{flow} + \mathcal{L}_\mathrm{CoT} + \mathcal{L}_\mathrm{depth}$ across three progressive stages, GST-VLA achieves 96.4% on LIBERO (+2.0%), and 80.2% on SimplerEnv (+5.4%). Ablations isolate the contribution of each GST component, each DA-CoT thought, and each training stage, confirming independent and synergistic gains concentrated on precision demanding tasks.
Chinese Translation
VLA模型将视觉观察编码为没有内在几何结构的2D补丁标记。我们提出了GST-VLA,具有两个贡献。首先,高斯空间标记器(GST)将冻结的密集深度和冻结的语义补丁特征转换为$N_g{=}128$个各向异性的3D高斯原语,每个原语由一个度量残差均值$\mu \in \mathbb{R}^3$、对数尺度协方差$\log \sigma \in \mathbb{R}^3$和学习的透明度$\alpha \in (0,1)$参数化。协方差特征结构编码局部表面方向,而透明度为每个原语提供几何置信度,这些信息无法从标量深度中获取。使用学习的查询进行空间注意力池化,将固定的标记预算集中在几何显著区域,而不是均匀分配。其次,3D深度感知思维链(DA-CoT)推理监督四个结构化的中间空间思维,涵盖3D物体定位、抓取可行接触几何、成对度量距离和粗略的SE(3)路径点,作为训练损失中的显式生成目标。在每个VLM变换器块中,交叉注意力子层提供对原始256个原语高斯场的直接访问,以便在DA-CoT生成过程中使用。一个具有300M参数的流匹配动作专家,配备混合专家前馈子层,通过条件ODE积分解码7自由度的增量动作块,条件是VLM隐藏状态和DA-CoT输出,通过双重交叉注意力进行连接。通过在三个渐进阶段训练复合损失$\mathcal{L}_\mathrm{flow} + \mathcal{L}_\mathrm{CoT} + \mathcal{L}_\mathrm{depth}$,GST-VLA在LIBERO上达到了96.4%(+2.0%),在SimplerEnv上达到了80.2%(+5.4%)。消融实验隔离了每个GST组件、每个DA-CoT思维和每个训练阶段的贡献,确认了在精度要求高的任务中集中体现的独立和协同增益。
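The Gaussian primitive parameterization above (mean $\mu \in \mathbb{R}^3$, log-scale covariance $\log \sigma \in \mathbb{R}^3$, opacity $\alpha \in (0,1)$) can be sketched as a decoding step; the 7-value raw layout and the head itself are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_gaussian_primitives(raw):
    """Decode raw head outputs of shape (N_g, 7) into 3D Gaussian
    parameters: a metric residual mean mu in R^3, a diagonal scale
    stored as log sigma (exponentiated to guarantee positivity), and
    an opacity alpha squashed into (0, 1) as per-primitive confidence."""
    mu = raw[:, 0:3]             # metric residual mean
    sigma = np.exp(raw[:, 3:6])  # sigma = exp(log sigma) > 0
    alpha = sigmoid(raw[:, 6])   # per-primitive geometric confidence
    return mu, sigma, alpha

raw = np.zeros((128, 7))         # N_g = 128 primitives, all-zero toy input
mu, sigma, alpha = decode_gaussian_primitives(raw)
```

Storing the covariance scale in log space is the standard trick for keeping it strictly positive under unconstrained optimization.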
cs.CV / 24 / 2603.09084
OmniEdit: A Training-free framework for Lip Synchronization and Audio-Visual Editing
OmniEdit:一种无训练的唇同步与音视频编辑框架
Abstract
Lip synchronization and audio-visual editing have emerged as fundamental challenges in multimodal learning, underpinning a wide range of applications, including film production, virtual avatars, and telepresence. Despite recent progress, most existing methods for lip synchronization and audio-visual editing depend on supervised fine-tuning of pre-trained models, leading to considerable computational overhead and data requirements. In this paper, we present OmniEdit, a training-free framework designed for both lip synchronization and audio-visual editing. Our approach reformulates the editing paradigm by substituting the edit sequence in FlowEdit with the target sequence, yielding an unbiased estimation of the desired output. Moreover, by removing stochastic elements from the generation process, we establish a smooth and stable editing trajectory. Extensive experimental results validate the effectiveness and robustness of the proposed framework. Code is available at https://github.com/l1346792580123/OmniEdit.
Chinese Translation
唇同步和音视频编辑已成为多模态学习中的基本挑战,支撑着包括电影制作、虚拟化身和远程存在等广泛应用。尽管最近取得了一些进展,但大多数现有的唇同步和音视频编辑方法依赖于对预训练模型的监督微调,这导致了相当大的计算开销和数据需求。本文提出了OmniEdit,一种旨在实现唇同步和音视频编辑的无训练框架。我们的方法通过将FlowEdit中的编辑序列替换为目标序列,重新构建了编辑范式,从而获得了所需输出的无偏估计。此外,通过消除生成过程中的随机元素,我们建立了平滑且稳定的编辑轨迹。大量实验结果验证了所提出框架的有效性和鲁棒性。代码可在https://github.com/l1346792580123/OmniEdit获取。
cs.CV / 25 / 2603.09094
Chain of Event-Centric Causal Thought for Physically Plausible Video Generation
以事件为中心的因果思维链用于物理上合理的视频生成
Abstract
Physically Plausible Video Generation (PPVG) has emerged as a promising avenue for modeling real-world physical phenomena. PPVG requires an understanding of commonsense knowledge, which remains a challenge for video diffusion models. Current approaches leverage commonsense reasoning capability of large language models to embed physical concepts into prompts. However, generation models often render physical phenomena as a single moment defined by prompts, due to the lack of conditioning mechanisms for modeling causal progression. In this paper, we view PPVG as generating a sequence of causally connected and dynamically evolving events. To realize this paradigm, we design two key modules: (1) Physics-driven Event Chain Reasoning. This module decomposes the physical phenomena described in prompts into multiple elementary event units, leveraging chain-of-thought reasoning. To mitigate causal ambiguity, we embed physical formulas as constraints to impose deterministic causal dependencies during reasoning. (2) Transition-aware Cross-modal Prompting (TCP). To maintain continuity between events, this module transforms causal event units into temporally aligned vision-language prompts. It summarizes discrete event descriptions to obtain causally consistent narratives, while progressively synthesizing visual keyframes of individual events by interactive editing. Comprehensive experiments on PhyGenBench and VideoPhy benchmarks demonstrate that our framework achieves superior performance in generating physically plausible videos across diverse physical domains. Our code will be released soon.
Chinese Translation
物理上合理的视频生成(PPVG)已成为建模现实世界物理现象的一条有前景的途径。PPVG需要对常识知识的理解,这对视频扩散模型来说仍然是一个挑战。目前的方法利用大型语言模型的常识推理能力,将物理概念嵌入提示中。然而,由于缺乏建模因果进展的条件机制,生成模型通常将物理现象呈现为由提示定义的单一时刻。在本文中,我们将PPVG视为生成一系列因果连接和动态演变的事件。为了实现这一范式,我们设计了两个关键模块:(1)基于物理的事件链推理。该模块将提示中描述的物理现象分解为多个基本事件单元,利用思维链推理。为了减轻因果模糊性,我们将物理公式嵌入作为约束,以在推理过程中施加确定性的因果依赖。(2)过渡感知的跨模态提示(TCP)。为了保持事件之间的连续性,该模块将因果事件单元转换为时间对齐的视觉-语言提示。它总结离散事件描述,以获得因果一致的叙述,同时通过交互编辑逐步合成单个事件的视觉关键帧。在PhyGenBench和VideoPhy基准上的全面实验表明,我们的框架在生成多样物理领域的物理上合理的视频方面表现优越。我们的代码将很快发布。
cs.CV / 26 / 2603.09101
MedKCO: Medical Vision-Language Pretraining via Knowledge-Driven Cognitive Orchestration
MedKCO:通过知识驱动的认知协调进行医学视觉-语言预训练
Abstract
Medical vision-language pretraining (VLP) models have recently been investigated for their generalization to diverse downstream tasks. However, current medical VLP methods typically force the model to learn simple and complex concepts simultaneously. This anti-cognitive process leads to suboptimal feature representations, especially under distribution shift. To address this limitation, we propose a Knowledge-driven Cognitive Orchestration for Medical VLP (MedKCO) that involves both the ordering of the pretraining data and the learning objective of vision-language contrast. Specifically, we design a two level curriculum by incorporating diagnostic sensitivity and intra-class sample representativeness for the ordering of the pretraining data. Moreover, considering the inter-class similarity of medical images, we introduce a self-paced asymmetric contrastive loss to dynamically adjust the participation of the pretraining objective. We evaluate the proposed pretraining method on three medical imaging scenarios in multiple vision-language downstream tasks, and compare it with several curriculum learning methods. Extensive experiments show that our method significantly surpasses all baselines. https://github.com/Mr-Talon/MedKCO.
Chinese Translation
最近,医学视觉-语言预训练(VLP)模型在其对多样化下游任务的泛化能力方面得到了研究。然而,目前的医学VLP方法通常迫使模型同时学习简单和复杂的概念。这种反认知过程导致特征表示的次优,尤其是在分布转移的情况下。为了解决这一局限性,我们提出了一种用于医学VLP的知识驱动认知协调(MedKCO),该方法涉及预训练数据的排序和视觉-语言对比的学习目标。具体而言,我们通过结合诊断敏感性和类内样本代表性设计了一个两级课程,以确定预训练数据的排序。此外,考虑到医学图像的类间相似性,我们引入了一种自适应不对称对比损失,以动态调整预训练目标的参与度。我们在多个视觉-语言下游任务中的三种医学成像场景上评估了所提出的预训练方法,并与几种课程学习方法进行了比较。大量实验表明,我们的方法显著超越了所有基线。代码可在 https://github.com/Mr-Talon/MedKCO 获取。
cs.CV / 27 / 2603.09104
Training-free Motion Factorization for Compositional Video Generation
无训练运动因子分解用于组合视频生成
Abstract
Compositional video generation aims to synthesize multiple instances with diverse appearance and motion, which is widely applicable in real-world scenarios. However, current approaches mainly focus on binding semantics, neglecting to understand diverse motion categories specified in prompts. In this paper, we propose a motion factorization framework that decomposes complex motion into three primary categories: motionlessness, rigid motion, and non-rigid motion. Specifically, our framework follows a planning before generation paradigm. (1) During planning, we reason about motion laws on the motion graph to obtain frame-wise changes in the shape and position of each instance. This alleviates semantic ambiguities in the user prompt by organizing it into a structured representation of instances and their interactions. (2) During generation, we modulate the synthesis of distinct motion categories in a disentangled manner. Conditioned on the motion cues, guidance branches stabilize appearance in motionless regions, preserve rigid-body geometry, and regularize local non-rigid deformations. Crucially, our two modules are model-agnostic, which can be seamlessly incorporated into various diffusion model architectures. Extensive experiments demonstrate that our framework achieves impressive performance in motion synthesis on real-world benchmarks. Our code will be released soon.
Chinese Translation
组合视频生成旨在合成具有多样外观和运动的多个实例,这在现实场景中具有广泛的应用。然而,当前的方法主要集中于绑定语义,忽视了对提示中指定的多样运动类别的理解。本文提出了一种运动因子分解框架,将复杂运动分解为三种主要类别:静止运动、刚性运动和非刚性运动。具体而言,我们的框架遵循生成前规划的范式。(1) 在规划阶段,我们在运动图上推理运动规律,以获得每个实例在形状和位置上的逐帧变化。这通过将用户提示组织成实例及其相互作用的结构化表示,缓解了语义歧义。(2) 在生成阶段,我们以解耦的方式调节不同运动类别的合成。基于运动线索,指导分支在静止区域稳定外观,保持刚体几何形状,并规范局部非刚性变形。重要的是,我们的两个模块是模型无关的,可以无缝集成到各种扩散模型架构中。大量实验表明,我们的框架在现实基准上的运动合成中取得了令人印象深刻的性能。我们的代码将很快发布。
cs.CV / 28 / 2603.09108
Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations
通过全局与局部表示的联合对齐实现的皮肤癌病例搜索的复合视觉-语言检索
Abstract
Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image to text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency. Experiments on the public Derm7pt dataset demonstrate consistent improvements over state-of-the-art methods. The proposed framework enables efficient access to relevant medical records and supports practical clinical deployment.
Chinese Translation
医学图像检索旨在识别与临床相关的病变病例,以支持诊断决策、教育和质量控制。在实际应用中,检索查询通常将参考病变图像与文本描述符(如皮肤镜特征)结合在一起。我们研究了针对皮肤癌的复合视觉-语言检索,其中每个查询由图像-文本对组成,数据库包含经过活检确认的多类疾病病例。我们提出了一种基于变换器(transformer)的框架,该框架学习层次化的复合查询表示,并在查询与候选图像之间执行全局-局部联合对齐。局部对齐通过多个空间注意力掩码聚合判别性区域,而全局对齐则提供整体语义监督。最终相似度通过一个凸的、领域信息驱动的加权计算,该加权强调临床显著的局部证据,同时保持全局一致性。在公共Derm7pt数据集上的实验表明,相较于最先进的方法,取得了一致的改进。所提出的框架实现了对相关医疗记录的高效访问,并支持实际临床部署。
cs.CV / 29 / 2603.09109
VIVID-Med: LLM-Supervised Structured Pretraining for Deployable Medical ViTs
VIVID-Med:用于可部署医疗视觉变换器的LLM监督结构预训练
Abstract
Vision-language pretraining has driven significant progress in medical image analysis. However, current methods typically supervise visual encoders using one-hot labels or free-form text, neither of which effectively captures the complex semantic relationships among clinical findings. In this study, we introduce VIVID-Med, a novel framework that leverages a frozen large language model (LLM) as a structured semantic teacher to pretrain medical vision transformers (ViTs). VIVID-Med translates clinical findings into verifiable JSON field-state pairs via a Unified Medical Schema (UMS), utilizing answerability-aware masking to focus optimization. It then employs Structured Prediction Decomposition (SPD) to partition cross-attention into orthogonality-regularized query groups, extracting complementary visual aspects. Crucially, the LLM is discarded post-training, yielding a lightweight, deployable ViT-only backbone. We evaluated VIVID-Med across multiple settings: on CheXpert linear probing, it achieves a macro-AUC of 0.8588, outperforming BiomedCLIP by +6.65 points while using 500x less data. It also demonstrates robust zero-shot cross-domain transfer to NIH ChestX-ray14 (0.7225 macro-AUC) and strong cross-modality generalization to CT, achieving 0.8413 AUC on LIDC-IDRI lung nodule classification and 0.9969 macro-AUC on OrganAMNIST 11-organ classification. VIVID-Med offers a highly efficient, scalable alternative to deploying resource-heavy vision-language models in clinical settings.
Chinese Translation
视觉-语言预训练在医疗图像分析中推动了显著进展。然而,目前的方法通常使用独热编码标签或自由格式文本来监督视觉编码器,这两者都无法有效捕捉临床发现之间复杂的语义关系。在本研究中,我们介绍了VIVID-Med,一个新颖的框架,利用冻结的大型语言模型(LLM)作为结构化语义教师来预训练医疗视觉变换器(ViTs)。VIVID-Med通过统一医疗模式(Unified Medical Schema, UMS)将临床发现转换为可验证的JSON字段状态对,利用可回答性感知掩码来聚焦优化。然后,它采用结构化预测分解(Structured Prediction Decomposition, SPD)将交叉注意力划分为正交正则化的查询组,以提取互补的视觉方面。关键是,LLM在训练后被丢弃,从而产生一个轻量级、可部署的仅ViT骨干网络。我们在多个设置中评估了VIVID-Med:在CheXpert线性探测中,它实现了0.8588的宏AUC,超越了BiomedCLIP,提升了6.65点,同时使用的数据量少了500倍。它还展示了对NIH ChestX-ray14的强大零样本跨领域迁移(0.7225宏AUC)和对CT的强大跨模态泛化,在LIDC-IDRI肺结节分类中实现了0.8413 AUC,在OrganAMNIST 11个器官分类中实现了0.9969宏AUC。VIVID-Med为在临床环境中部署资源密集型视觉-语言模型提供了一个高效、可扩展的替代方案。
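The UMS-style encoding of findings into verifiable JSON field-state pairs with answerability-aware masking can be sketched as below; the schema fields, the "unanswerable" sentinel, and the mask construction are hypothetical stand-ins for the paper's actual schema:

```python
import json

def to_field_states(findings, schema_fields):
    """Hypothetical UMS-style encoding: map reported findings to
    verifiable JSON field-state pairs; fields the report does not
    answer are masked out of the loss (answerability-aware masking)."""
    states = {f: findings.get(f, "unanswerable") for f in schema_fields}
    loss_mask = {f: states[f] != "unanswerable" for f in schema_fields}
    return json.dumps(states, sort_keys=True), loss_mask

schema = ["cardiomegaly", "edema", "pneumothorax"]
report = {"cardiomegaly": "present", "edema": "absent"}
encoded, mask = to_field_states(report, schema)
```

Structured field-state targets like this are checkable per field, unlike free-form text, which is what makes the supervision "verifiable".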
cs.CV / 30 / 2603.09111
Progressive Representation Learning for Multimodal Sentiment Analysis with Incomplete Modalities
针对不完整模态的多模态情感分析的渐进表示学习
Abstract
Multimodal Sentiment Analysis (MSA) seeks to infer human emotions by integrating textual, acoustic, and visual cues. However, existing approaches often assume that all modalities are complete, whereas real-world applications frequently encounter noise, hardware failures, or privacy restrictions that result in missing modalities. There exists a significant feature misalignment between incomplete and complete modalities, and directly fusing them may even distort the well-learned representations of the intact modalities. To this end, we propose PRLF, a Progressive Representation Learning Framework designed for MSA under uncertain missing-modality conditions. PRLF introduces an Adaptive Modality Reliability Estimator (AMRE), which dynamically quantifies the reliability of each modality using recognition confidence and Fisher information to determine the dominant modality. In addition, the Progressive Interaction (ProgInteract) module iteratively aligns the other modalities with the dominant one, thereby enhancing cross-modal consistency while suppressing noise. Extensive experiments on CMU-MOSI, CMU-MOSEI, and SIMS verify that PRLF outperforms state-of-the-art methods across both inter- and intra-modality missing scenarios, demonstrating its robustness and generalization capability.
Chinese Translation
多模态情感分析(MSA)旨在通过整合文本、声学和视觉线索来推断人类情感。然而,现有的方法通常依赖于所有模态的完整性,而现实世界的应用常常会遇到噪声、硬件故障或隐私限制,导致模态缺失。在不完整模态和完整模态之间存在显著的特征不对齐,直接融合它们甚至可能扭曲完整模态的良好学习表示。为此,我们提出了PRLF(渐进表示学习框架),旨在应对不确定缺失模态条件下的多模态情感分析。PRLF引入了一种自适应模态可靠性估计器(AMRE),该估计器动态量化每个模态的可靠性,利用识别置信度和Fisher信息来确定主导模态。此外,渐进交互(ProgInteract)模块迭代地将其他模态与主导模态对齐,从而增强跨模态一致性,同时抑制噪声。在CMU-MOSI、CMU-MOSEI和SIMS上的大量实验验证了PRLF在跨模态和同模态缺失场景下均优于最先进的方法,展示了其鲁棒性和泛化能力。
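One plausible reading of AMRE's reliability estimate is sketched below: recognition confidence (max softmax probability) scaled by an entropy-based certainty term standing in for the Fisher-information component, with the dominant modality chosen by argmax. The formula and the toy logits are assumptions, not the paper's definition:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def modality_reliability(logits):
    """Hypothetical reliability proxy: recognition confidence scaled by
    a certainty term 1 - H(p)/log(C), standing in for the
    Fisher-information component of the paper's AMRE."""
    p = softmax(logits)
    conf = p.max()
    entropy = -np.sum(p * np.log(p + 1e-12))
    certainty = 1.0 - entropy / np.log(len(p))
    return conf * certainty

# pick the dominant modality among text / audio / vision logits
logits = {"text": np.array([4.0, 0.5, 0.2]),
          "audio": np.array([1.0, 0.9, 0.8]),
          "vision": np.array([2.0, 0.1, 0.3])}
scores = {m: modality_reliability(z) for m, z in logits.items()}
dominant = max(scores, key=scores.get)
```

The sharp text logits yield both high confidence and low entropy, so text becomes the dominant modality that the other branches are progressively aligned toward.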
cs.CV / 31 / 2603.09125
QUSR: Quality-Aware and Uncertainty-Guided Image Super-Resolution Diffusion Model
QUSR:质量感知与不确定性引导的图像超分辨率扩散模型
Abstract
Diffusion-based image super-resolution (ISR) has shown strong potential, but it still struggles in real-world scenarios where degradations are unknown and spatially non-uniform, often resulting in lost details or visual artifacts. To address this challenge, we propose a novel super-resolution diffusion model, QUSR, which integrates a Quality-Aware Prior (QAP) with an Uncertainty-Guided Noise Generation (UNG) module. The UNG module adaptively adjusts the noise injection intensity, applying stronger perturbations to high-uncertainty regions (e.g., edges and textures) to reconstruct complex details, while minimizing noise in low-uncertainty regions (e.g., flat areas) to preserve original information. Concurrently, the QAP leverages an advanced Multimodal Large Language Model (MLLM) to generate reliable quality descriptions, providing an effective and interpretable quality prior for the restoration process. Experimental results confirm that QUSR can produce high-fidelity and high-realism images in real-world scenarios. The source code is available at https://github.com/oTvTog/QUSR.
Chinese Translation
基于扩散的图像超分辨率(ISR)展现出强大的潜力,但在现实场景中,面对未知且空间上不均匀的退化时,仍然存在困难,常常导致细节丢失或视觉伪影。为了解决这一挑战,我们提出了一种新颖的超分辨率扩散模型QUSR,该模型将质量感知先验(Quality-Aware Prior, QAP)与不确定性引导噪声生成(Uncertainty-Guided Noise Generation, UNG)模块相结合。UNG模块自适应地调整噪声注入强度,对高不确定性区域(例如边缘和纹理)施加更强的扰动,以重建复杂细节,同时在低不确定性区域(例如平坦区域)最小化噪声,以保留原始信息。同时,QAP利用先进的多模态大语言模型(Multimodal Large Language Model, MLLM)生成可靠的质量描述,为恢复过程提供有效且可解释的质量先验。实验结果证实,QUSR能够在现实场景中生成高保真度和高真实感的图像。源代码可在 https://github.com/oTvTog/QUSR 获取。
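The UNG module's uncertainty-guided noise injection can be sketched as per-pixel scaling of Gaussian noise; `base_scale` and `gain` are hypothetical hyperparameters, and the linear schedule is illustrative rather than the paper's actual mechanism:

```python
import numpy as np

def uncertainty_guided_noise(rng, uncertainty, base_scale=0.1, gain=0.9):
    """Sketch of uncertainty-guided noise injection: sample Gaussian
    noise and scale it per pixel so high-uncertainty regions (edges,
    textures) receive stronger perturbations than flat regions."""
    scale = base_scale + gain * uncertainty  # in [base_scale, base_scale + gain]
    return rng.standard_normal(uncertainty.shape) * scale

rng = np.random.default_rng(0)
u = np.array([[0.0, 1.0],
              [0.2, 0.8]])  # per-pixel uncertainty map (toy values)
noise = uncertainty_guided_noise(rng, u)
```

Flat regions (uncertainty near 0) keep noise near the floor `base_scale`, preserving original information, while edges and textures are perturbed more strongly to let the diffusion model reconstruct detail there.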
cs.CV / 32 / 2603.09137
Transformer-Based Multi-Region Segmentation and Radiomic Analysis of HR-pQCT Imaging
基于变换器的多区域分割与HR-pQCT影像的放射组学分析
Abstract
Osteoporosis is a skeletal disease typically diagnosed using dual-energy X-ray absorptiometry (DXA), which quantifies areal bone mineral density but overlooks bone microarchitecture and surrounding soft tissues. High-resolution peripheral quantitative computed tomography (HR-pQCT) enables three-dimensional microstructural imaging with minimal radiation. However, current analysis pipelines largely focus on mineralized bone compartments, leaving much of the acquired image data underutilized. We introduce a fully automated framework for binary osteoporosis classification using radiomics features extracted from anatomically segmented HR-pQCT images. To our knowledge, this work is the first to leverage a transformer-based segmentation architecture, i.e., the SegFormer, for fully automated multi-region HR-pQCT analysis. The SegFormer model simultaneously delineated the cortical and trabecular bone of the tibia and fibula along with surrounding soft tissues and achieved a mean F1 score of 95.36%. Soft tissues were further subdivided into skin, myotendinous, and adipose regions through post-processing. From each region, 939 radiomic features were extracted and dimensionally reduced to train six machine learning classifiers on an independent dataset comprising 20,496 images from 122 HR-pQCT scans. The best image level performance was achieved using myotendinous tissue features, yielding an accuracy of 80.08% and an area under the receiver operating characteristic curve (AUROC) of 0.85, outperforming bone-based models. At the patient level, replacing standard biological, DXA, and HR-pQCT parameters with soft tissue radiomics improved AUROC from 0.792 to 0.875. These findings demonstrate that automated, multi-region HR-pQCT segmentation enables the extraction of clinically informative signals beyond bone alone, highlighting the importance of integrated tissue assessment for osteoporosis detection.
Chinese Translation
骨质疏松症是一种骨骼疾病,通常通过双能X射线吸收法(DXA)进行诊断,该方法量化面积骨矿密度,但忽略了骨微结构和周围软组织。高分辨率外周定量计算机断层扫描(HR-pQCT)能够以最小的辐射实现三维微结构成像。然而,当前的分析流程主要集中在矿化骨部分,导致获取的图像数据未得到充分利用。我们提出了一种完全自动化的框架,用于基于从解剖分割的HR-pQCT图像中提取的放射组学特征进行二元骨质疏松分类。据我们所知,这项工作是首次利用基于变换器的分割架构,即SegFormer,进行完全自动化的多区域HR-pQCT分析。SegFormer模型同时描绘了胫骨和腓骨的皮质骨和松质骨以及周围软组织,达到了95.36%的平均F1分数。通过后处理,软组织进一步细分为皮肤、肌腱和脂肪区域。从每个区域提取了939个放射组学特征,并进行了降维,以在一个包含122个HR-pQCT扫描的20,496幅图像的独立数据集上训练六个机器学习分类器。使用肌腱组织特征时,图像级性能最佳,准确率为80.08%,接收者操作特征曲线下面积(AUROC)为0.85,优于基于骨的模型。在患者级别上,用软组织放射组学替代标准生物学、DXA和HR-pQCT参数,使AUROC从0.792提高到0.875。这些发现表明,自动化的多区域HR-pQCT分割能够提取超越骨骼的临床信息信号,强调了综合组织评估在骨质疏松检测中的重要性。
cs.CV / 33 / 2603.09138
Rotation Equivariant Mamba for Vision Tasks
用于视觉任务的旋转等变Mamba
Abstract
Rotation equivariance constitutes one of the most general and crucial structural priors for visual data, yet it remains notably absent from current Mamba-based vision architectures. Despite the success of Mamba in natural language processing and its growing adoption in computer vision, existing visual Mamba models fail to account for rotational symmetry in their design. This omission renders them inherently sensitive to image rotations, thereby constraining their robustness and cross-task generalization. To address this limitation, we propose to incorporate rotation symmetry, a universal and fundamental geometric prior in images, into Mamba-based architectures. Specifically, we introduce EQ-VMamba, the first rotation equivariant visual Mamba architecture for vision tasks. The core components of EQ-VMamba include a carefully designed rotation equivariant cross-scan strategy and group Mamba blocks. Moreover, we provide a rigorous theoretical analysis of the intrinsic equivariance error, demonstrating that the proposed architecture enforces end-to-end rotation equivariance throughout the network. Extensive experiments across multiple benchmarks - including high-level image classification task, mid-level semantic segmentation task, and low-level image super-resolution task - demonstrate that EQ-VMamba achieves superior or competitive performance compared to non-equivariant baselines, while requiring approximately 50% fewer parameters. These results indicate that embedding rotation equivariance not only effectively bolsters the robustness of visual Mamba models against rotation transformations, but also enhances overall performance with significantly improved parameter efficiency. Code is available at https://github.com/zhongchenzhao/EQ-VMamba.
Chinese Translation
旋转等变性是视觉数据中最一般且至关重要的结构先验之一,但在当前基于Mamba的视觉架构中却明显缺失。尽管Mamba在自然语言处理中的成功以及其在计算机视觉中的日益采用,现有的视觉Mamba模型在设计上未能考虑旋转对称性。这一遗漏使得它们对图像旋转本质上敏感,从而限制了其鲁棒性和跨任务的泛化能力。为了解决这一局限性,我们提出将旋转对称性这一图像中的普遍且基本的几何先验融入基于Mamba的架构中。具体而言,我们引入了EQ-VMamba,这是首个用于视觉任务的旋转等变视觉Mamba架构。EQ-VMamba的核心组件包括精心设计的旋转等变交叉扫描策略和群体Mamba模块。此外,我们对内在的等变误差进行了严格的理论分析,证明所提出的架构在整个网络中强制执行端到端的旋转等变性。在多个基准测试中的广泛实验——包括高层图像分类任务、中层语义分割任务和低层图像超分辨率任务——表明,EQ-VMamba在性能上优于或与非等变基线相当,同时所需参数约减少50%。这些结果表明,嵌入旋转等变性不仅有效增强了视觉Mamba模型对旋转变换的鲁棒性,还显著提高了参数效率,提升了整体性能。代码可在 https://github.com/zhongchenzhao/EQ-VMamba 获取。
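The intrinsic equivariance error analysed for EQ-VMamba can be illustrated with a toy check: a map f is rotation equivariant when f(rot(x)) = rot(f(x)). The measurement function and the two example maps below are illustrative, not the paper's definition:

```python
import numpy as np

def equivariance_error(f, x):
    """Toy stand-in for an equivariance error: the max deviation
    between applying f after a 90-degree rotation and rotating the
    output of f."""
    return float(np.abs(f(np.rot90(x)) - np.rot90(f(x))).max())

def row_shift(z):
    """A direction-dependent map: shifting along a fixed axis does
    not commute with rotation, so it is not equivariant."""
    return np.roll(z, 1, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))

pointwise = np.tanh  # pointwise nonlinearities are exactly equivariant
```

Standard visual Mamba scan orders behave like the direction-dependent map here, which is why a rotation-equivariant cross-scan strategy is needed to drive this error to zero end to end.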
cs.CV / 34 / 2603.09141
Agentic AI as a Network Control-Plane Intelligence Layer for Federated Learning over 6G
作为网络控制平面智能层的自主人工智能在6G联邦学习中的应用
Abstract
The shift toward user-customized on-device learning places new demands on wireless systems: models must be trained on diverse, distributed data while meeting strict latency, bandwidth, and reliability constraints. To address this, we propose an Agentic AI as the control layer for managing federated learning (FL) over 6G networks, which translates high-level task goals into actions that are aware of network conditions. Rather than simply viewing FL as a learning challenge, our system sees it as a combined task of learning and network management. A set of specialized agents focused on retrieval, planning, coding, and evaluation utilizes monitoring tools and optimization methods to handle client selection, incentive structuring, scheduling, resource allocation, adaptive local training, and code generation. The use of closed-loop evaluation and memory allows the system to consistently refine its decisions, taking into account varying signal-to-noise ratios, bandwidth conditions, and device capabilities. Finally, our case study demonstrates the effectiveness of the Agentic AI system's tool use in achieving high performance.
Chinese Translation
向用户定制的设备学习的转变对无线系统提出了新的要求:模型必须在多样化、分布式的数据上进行训练,同时满足严格的延迟、带宽和可靠性约束。为了解决这一问题,我们提出将自主人工智能(Agentic AI)作为管理6G网络上联邦学习(Federated Learning, FL)的控制层,该层将高层次的任务目标转化为考虑网络条件的行动。我们的系统不仅将FL视为学习挑战,更将其视为学习与网络管理的综合任务。一组专注于检索、规划、编码和评估的专业代理利用监控工具和优化方法来处理客户端选择、激励结构、调度、资源分配、自适应本地训练和代码生成。闭环评估和记忆的使用使系统能够持续优化其决策,考虑到不同的信噪比、带宽条件和设备能力。最后,我们的案例研究展示了自主人工智能系统在实现高性能方面工具使用的有效性。
cs.CV / 35 / 2603.09149
RTFDNet: Fusion-Decoupling for Robust RGB-T Segmentation
RTFDNet: 强健RGB-T分割的融合-解耦方法
Abstract
RGB-Thermal (RGB-T) semantic segmentation is essential for robotic systems operating in low-light or dark environments. However, traditional approaches often overemphasize modality balance, resulting in limited robustness and severe performance degradation when sensor signals are partially missing. Recent advances such as cross-modal knowledge distillation and modality-adaptive fine-tuning attempt to enhance cross-modal interaction, but they typically decouple modality fusion and modality adaptation, requiring multi-stage training with frozen models or teacher-student frameworks. We present RTFDNet, a three-branch encoder-decoder that unifies fusion and decoupling for robust RGB-T segmentation. Synergistic Feature Fusion (SFF) performs channel-wise gated exchange and lightweight spatial attention to inject complementary cues. Cross-Modal Decouple Regularization (CMDR) isolates modality-specific components from the fused representation and supervises unimodal decoders via stop-gradient targets. Region Decouple Regularization (RDR) enforces class-selective prediction consistency in confident regions while blocking gradients to the fusion branch. This feedback loop strengthens unimodal paths without degrading the fused stream, enabling efficient standalone inference at test time. Extensive experiments demonstrate the effectiveness of RTFDNet, showing consistent performance across varying modality conditions. Our source code is publicly available at https://github.com/curapima/RTFDNet.
Chinese Translation
RGB-热成像(RGB-T)语义分割对于在低光或黑暗环境中操作的机器人系统至关重要。然而,传统方法往往过于强调模态平衡,导致在传感器信号部分缺失时,鲁棒性有限且性能严重下降。最近的进展,如跨模态知识蒸馏和模态自适应微调,试图增强跨模态交互,但通常将模态融合和模态适应解耦,要求使用冻结模型或师生框架进行多阶段训练。我们提出了RTFDNet,这是一种三分支编码器-解码器,统一了融合和解耦以实现强健的RGB-T分割。协同特征融合(SFF)执行通道级门控交换和轻量级空间注意力,以注入互补线索。跨模态解耦正则化(CMDR)将模态特定组件从融合表示中隔离,并通过停止梯度目标监督单模态解码器。区域解耦正则化(RDR)在置信区域内强制类选择预测一致性,同时阻止向融合分支的梯度传递。这个反馈循环增强了单模态路径而不降低融合流的性能,从而在测试时实现高效的独立推理。大量实验表明RTFDNet的有效性,在不同模态条件下表现出一致的性能。我们的源代码已公开发布于 https://github.com/curapima/RTFDNet,以促进进一步的研究。
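The channel-wise gated exchange described for the SFF module can be illustrated with a minimal sketch. The abstract does not give the exact formulation; the per-channel gate weights, sigmoid gating from pooled statistics, and additive injection below are assumptions for illustration only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_exchange(rgb, thermal, w_rgb, w_th):
    """Channel-wise gated exchange between two modality feature maps.

    rgb, thermal: (C, H, W) feature maps.
    w_rgb, w_th:  (C,) per-channel gate weights (learned in a real model).
    Each modality receives the other's features, scaled by a sigmoid gate
    computed from its own global channel statistics.
    """
    # Global average pooling -> per-channel descriptors of shape (C,)
    g_rgb = rgb.mean(axis=(1, 2))
    g_th = thermal.mean(axis=(1, 2))
    # Per-channel gates in (0, 1), broadcast over spatial dims
    gate_rgb = sigmoid(w_rgb * g_rgb)[:, None, None]
    gate_th = sigmoid(w_th * g_th)[:, None, None]
    # Inject complementary cues from the other modality
    rgb_out = rgb + gate_rgb * thermal
    th_out = thermal + gate_th * rgb
    return rgb_out, th_out

rng = np.random.default_rng(0)
rgb = rng.standard_normal((4, 8, 8))
thermal = rng.standard_normal((4, 8, 8))
rgb_out, th_out = gated_exchange(rgb, thermal, np.ones(4), np.ones(4))
```

The additive form keeps each modality's own stream intact, which matches the abstract's claim that unimodal paths remain usable for standalone inference.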
cs.CV / 36 / 2603.09160
RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
RubiCap:基于评分标准的强化学习用于密集图像描述
Abstract
Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers -- a luxury not available in open-ended captioning. We address this bottleneck with RubiCap, a novel RL framework that derives fine-grained, sample-specific reward signals from LLM-written rubrics. RubiCap first assembles a diverse committee of candidate captions, then employs an LLM rubric writer to extract consensus strengths and diagnose deficiencies in the current policy. These insights are converted into explicit evaluation criteria, enabling an LLM judge to decompose holistic quality assessment and replace coarse scalar rewards with structured, multi-faceted evaluations. Across extensive benchmarks, RubiCap achieves the highest win rates on CapArena, outperforming supervised distillation, prior RL methods, human-expert annotations, and GPT-4V-augmented outputs. On CaptionQA, it demonstrates superior word efficiency: our 7B model matches Qwen2.5-VL-32B-Instruct, and our 3B model surpasses its 7B counterpart. Remarkably, using the compact RubiCap-3B as a captioner produces stronger pretrained VLMs than those trained on captions from proprietary models.
Chinese Translation
密集图像描述对于视觉-语言预训练和文本到图像生成中的跨模态对齐至关重要,但扩展专家级注释的成本极高。虽然通过强大的视觉-语言模型(VLMs)进行合成描述是一个实用的替代方案,但监督蒸馏往往导致输出多样性有限和泛化能力弱。强化学习(RL)可以克服这些局限性,但迄今为止,其成功主要集中在依赖确定性检查器的可验证领域——而这在开放式描述中并不可用。我们通过RubiCap解决了这一瓶颈,RubiCap是一个新颖的RL框架,它从LLM(大型语言模型)编写的评分标准中推导出细粒度、样本特定的奖励信号。RubiCap首先组建一个多样化的候选描述委员会,然后利用LLM评分标准编写器提取共识优势并诊断当前策略的缺陷。这些见解被转化为明确的评估标准,使得LLM评审能够分解整体质量评估,并用结构化的多维评估替代粗略的标量奖励。在广泛的基准测试中,RubiCap在CapArena上实现了最高的胜率,超越了监督蒸馏、先前的RL方法、人类专家注释和GPT-4V增强的输出。在CaptionQA上,它展示了优越的词汇效率:我们的7B模型与Qwen2.5-VL-32B-Instruct相匹配,而我们的3B模型超越了其7B对应模型。值得注意的是,使用紧凑的RubiCap-3B作为描述生成器,产生的预训练VLM比那些基于专有模型的描述训练的模型更强。
cs.CV / 37 / 2603.09171
Progressive Split Mamba: Effective State Space Modelling for Image Restoration
渐进式分裂Mamba:有效的图像恢复状态空间建模
Abstract
Image restoration requires simultaneously preserving fine-grained local structures and maintaining long-range spatial coherence. While convolutional networks struggle with limited receptive fields, and Transformers incur quadratic complexity for global attention, recent State Space Models (SSMs), such as Mamba, provide an appealing linear-time alternative for long-range dependency modelling. However, naively extending Mamba to 2D images exposes two intrinsic shortcomings. First, flattening 2D feature maps into 1D sequences disrupts spatial topology, leading to locality distortion that hampers precise structural recovery. Second, the stability-driven recurrent dynamics of SSMs induce long-range decay, progressively attenuating information across distant spatial positions and weakening global consistency. Together, these effects limit the effectiveness of state-space modelling in high-fidelity restoration. We propose Progressive Split-Mamba (PS-Mamba), a topology-aware hierarchical state-space framework designed to reconcile locality preservation with efficient global propagation. Instead of sequentially flattening entire feature maps, PS-Mamba performs geometry-consistent partitioning, maintaining neighbourhood integrity prior to state-space processing. A progressive split hierarchy (halves, quadrants, octants) enables structured multi-scale modelling while retaining linear complexity. To counteract long-range decay, we introduce symmetric cross-scale shortcut pathways that directly transmit low-frequency global context across hierarchical levels, stabilising information flow over large spatial extents. Extensive experiments on super-resolution, denoising, and JPEG artifact reduction show consistent improvements over recent Mamba-based and attention-based models with a clear margin.
Chinese Translation
图像恢复需要同时保留细粒度的局部结构和保持长距离的空间一致性。尽管卷积网络在有限的感受野下表现不佳,而变换器则因全局注意力的平方复杂度而受到限制,近期的状态空间模型(State Space Models, SSMs),如Mamba,提供了一种在长距离依赖建模中具有吸引力的线性时间替代方案。然而,简单地将Mamba扩展到二维图像暴露了两个内在缺陷。首先,将二维特征图展平为一维序列破坏了空间拓扑,导致局部性失真,从而妨碍了精确的结构恢复。其次,SSMs的稳定性驱动递归动态会导致长距离衰减,逐渐削弱远距离空间位置的信息,并削弱全局一致性。这些效应共同限制了状态空间建模在高保真恢复中的有效性。我们提出了渐进式分裂Mamba(Progressive Split-Mamba, PS-Mamba),这是一个具有拓扑意识的分层状态空间框架,旨在调和局部性保护与高效的全局传播。PS-Mamba不再是顺序地展平整个特征图,而是进行几何一致的分区,在状态空间处理之前保持邻域的完整性。渐进式分裂层次(半分、象限、八分)使得在保持线性复杂度的同时实现结构化的多尺度建模。为了抵消长距离衰减,我们引入了对称的跨尺度快捷通道,直接在层次级别之间传递低频全局上下文,稳定大范围空间的信息流。在超分辨率、去噪和JPEG伪影减少的广泛实验中,我们的模型在与近期基于Mamba和注意力的模型相比,显示出一致的改进,并具有明显的优势。
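The progressive split hierarchy (halves, quadrants, octants) can be sketched as geometry-consistent partitioning that keeps each tile spatially contiguous before any 1D scanning, instead of flattening the whole map. The alternating split axes below are an assumption; the paper's exact partition order is not specified in the abstract:

```python
import numpy as np

def progressive_split(x, level):
    """Partition a (H, W) map into 2**level equal, contiguous tiles.

    level=1 -> halves, level=2 -> quadrants, level=3 -> octants.
    Cuts alternate between the vertical and horizontal axis so each
    tile remains a spatially coherent neighbourhood.
    """
    tiles = [x]
    for l in range(level):
        axis = l % 2  # alternate vertical / horizontal cuts (assumed order)
        tiles = [half for t in tiles for half in np.array_split(t, 2, axis=axis)]
    return tiles

x = np.arange(64).reshape(8, 8)
halves = progressive_split(x, 1)
quadrants = progressive_split(x, 2)
octants = progressive_split(x, 3)
```

Each level doubles the tile count while every tile stays a contiguous block of the original map, which is the locality-preservation property the abstract emphasizes.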
cs.CV / 38 / 2603.09173
Point Cloud as a Foreign Language for Multi-modal Large Language Model
点云作为多模态大型语言模型的外语
Abstract
Multi-modal large language models (MLLMs) have shown remarkable progress in integrating visual and linguistic understanding. Recent efforts have extended these capabilities to 3D understanding through encoder-based architectures that rely on pre-trained 3D encoders to extract geometric features. However, such approaches suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and substantial computational overhead. In this work, we present SAGE, the first end-to-end 3D MLLM that directly processes raw point clouds without relying on a pre-trained 3D encoder. Our approach introduces a lightweight 3D tokenizer that combines geometric sampling and neighbourhood aggregation with vector quantization to convert point clouds into discrete tokens--treating 3D data as a foreign language that naturally extends the LLM's vocabulary. Furthermore, to enhance the model's reasoning capability on complex 3D tasks, we propose a preference optimization training strategy with a semantic alignment-based reward, specifically designed for open-ended 3D question answering where responses are descriptive. Extensive experiments across diverse 3D understanding benchmarks demonstrate that our end-to-end approach outperforms existing encoder-based methods while offering significant advantages in computational efficiency, generalization across LLM backbones, and robustness to input resolution variations. Code is available at: github.com/snehaputul/SAGE3D.
Chinese Translation
多模态大型语言模型(MLLMs)在整合视觉与语言理解方面取得了显著进展。近期的研究努力将这些能力扩展到3D理解,采用基于编码器的架构,依赖于预训练的3D编码器提取几何特征。然而,这些方法存在几何与语言空间之间的语义不对齐、对分辨率的敏感性以及显著的计算开销等问题。在本研究中,我们提出了SAGE,这是首个端到端的3D MLLM,能够直接处理原始点云,而无需依赖预训练的3D编码器。我们的方法引入了一种轻量级的3D标记器,结合几何采样和邻域聚合与向量量化,将点云转换为离散的标记——将3D数据视为一种外语,自然扩展了LLM的词汇。此外,为了增强模型在复杂3D任务上的推理能力,我们提出了一种基于语义对齐奖励的偏好优化训练策略,专门设计用于开放式3D问答,其中的回答是描述性的。在多种3D理解基准上的广泛实验表明,我们的端到端方法在计算效率、跨LLM骨干的泛化能力以及对输入分辨率变化的鲁棒性方面优于现有的基于编码器的方法。代码可在:github.com/snehaputul/SAGE3D获取。
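The tokenizer pipeline named in the abstract (geometric sampling, neighbourhood aggregation, vector quantization) can be sketched as follows. The mean-pooled neighbourhood descriptor and the codebook size are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

def farthest_point_sample(points, m):
    """Greedy farthest-point sampling: pick m well-spread center indices."""
    chosen = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(m - 1):
        idx = int(dist.argmax())
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)

def tokenize(points, codebook, m=8, k=4):
    """Sample m centers, average each center's k nearest neighbours into a
    local descriptor, then snap it to the nearest codebook row -> token ids."""
    centers = farthest_point_sample(points, m)
    tokens = []
    for c in centers:
        d = np.linalg.norm(points - points[c], axis=1)
        neigh = points[np.argsort(d)[:k]]
        feat = neigh.mean(axis=0)  # neighbourhood aggregation (assumed: mean)
        # Vector quantization: nearest codebook entry becomes the discrete token
        token = int(np.linalg.norm(codebook - feat, axis=1).argmin())
        tokens.append(token)
    return tokens

rng = np.random.default_rng(1)
cloud = rng.standard_normal((128, 3))
codebook = rng.standard_normal((16, 3))
ids = tokenize(cloud, codebook)
```

The resulting integer ids are exactly the kind of discrete vocabulary extension the abstract describes: the point cloud becomes a sequence of "foreign-language" tokens an LLM can consume.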
cs.CV / 39 / 2603.09206
MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data
MM-Zero:从零数据自我演化的多模型视觉语言模型
Abstract
Self-evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained using Group Relative Policy Optimization (GRPO), with carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.
Chinese Translation
自我演化已成为提升基础模型(如大型语言模型(LLMs)和视觉语言模型(VLMs))的重要范式,且所需的人为干预最小。尽管近期的研究表明,LLM代理可以在几乎没有数据的情况下从零开始自我演化,但VLM引入了额外的视觉模态,通常需要至少一些种子数据(如图像)来启动自我演化过程。在本研究中,我们提出了多模型多模态零(MM-Zero),这是第一个基于强化学习的框架,旨在实现VLM推理的零数据自我演化。MM-Zero超越了以往的双角色(提议者和求解者)设置,引入了一个多角色自我演化训练框架,包含三个专业角色:提议者生成抽象视觉概念并提出问题;编码器将这些概念转换为可执行代码(例如,Python、SVG)以渲染视觉图像;求解者对生成的视觉内容进行多模态推理。这三个角色均从同一基础模型初始化,并使用群体相对策略优化(GRPO)进行训练,采用精心设计的奖励机制,整合执行反馈、视觉验证和难度平衡。我们的实验表明,MM-Zero在广泛的多模态基准测试中提高了VLM推理性能。MM-Zero为多模态模型的自我演化多模型系统建立了一条可扩展的路径,拓展了自我改进的前沿,超越了传统的双模型范式。
cs.CV / 40 / 2603.09213
Geometry-Aware Metric Learning for Cross-Lingual Few-Shot Sign Language Recognition on Static Hand Keypoints
基于几何感知的度量学习在静态手部关键点上的跨语言少样本手语识别
Abstract
Sign language recognition (SLR) systems typically require large labeled corpora for each language, yet the majority of the world's 300+ sign languages lack sufficient annotated data. Cross-lingual few-shot transfer, pretraining on a data-rich source language and adapting with only a handful of target-language examples, offers a scalable alternative, but conventional coordinate-based keypoint representations are susceptible to domain shift arising from differences in camera viewpoint, hand scale, and recording conditions. This shift is particularly detrimental in the few-shot regime, where class prototypes estimated from only K examples are highly sensitive to extrinsic variance. We propose a geometry-aware metric-learning framework centered on a compact 20-dimensional inter-joint angle descriptor derived from MediaPipe static hand keypoints. These angles are invariant to SO(3) rotation, translation, and isotropic scaling, eliminating the dominant sources of cross-dataset shift and yielding tighter, more stable class prototypes. Evaluated on four fingerspelling alphabets spanning typologically diverse sign languages, ASL, LIBRAS, Arabic Sign Language, and Thai Sign Language, the proposed angle features improve over normalized-coordinate baselines by up to 25 percentage points within-domain and enable frozen cross-lingual transfer that frequently exceeds within-domain accuracy, using a lightweight MLP encoder with about 10^5 parameters. These findings demonstrate that invariant hand-geometry descriptors provide a portable and effective foundation for cross-lingual few-shot SLR in low-resource settings.
Chinese Translation
手语识别(SLR)系统通常需要为每种语言提供大量标注语料,然而世界上300多种手语中的大多数缺乏足够的注释数据。跨语言少样本迁移,通过在数据丰富的源语言上进行预训练,并仅使用少量目标语言示例进行适应,提供了一种可扩展的替代方案,但传统的基于坐标的关键点表示容易受到由于相机视角、手部尺度和录制条件差异引起的领域转移的影响。这种转移在少样本情况下尤其有害,因为仅从K个示例中估计的类别原型对外部变化高度敏感。我们提出了一种以基于MediaPipe静态手部关键点推导的紧凑20维关节间角度描述符为中心的几何感知度量学习框架。这些角度对SO(3)旋转、平移和各向同性缩放不变,消除了跨数据集转移的主要来源,从而产生更紧凑、更稳定的类别原型。在涵盖类型学多样的手语的四种拼写字母(ASL、LIBRAS、阿拉伯手语和泰国手语)上进行评估,所提角度特征在领域内相较于归一化坐标基线提高了多达25个百分点,并且仅使用约10^5参数的轻量级多层感知器编码器即可实现冻结模型的跨语言迁移,其准确率通常超过领域内水平。这些发现表明,不变的手部几何描述符为低资源环境中的跨语言少样本SLR提供了一个可移植且有效的基础。
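The core invariance claim, that inter-joint angles are unchanged by rotation, translation, and isotropic scaling, can be checked with a minimal sketch. The `TRIPLES` below are hypothetical joint triples along MediaPipe's index-finger chain; the paper's exact 20-angle set is not given in the abstract:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (radians) formed by points a-b-c; invariant to
    rotation, translation, and isotropic scaling of the whole hand."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def angle_descriptor(kp, triples):
    """kp: (21, 3) hand keypoints; triples: (i, j, k) joint index triples."""
    return np.array([joint_angle(kp[i], kp[j], kp[k]) for i, j, k in triples])

# Hypothetical triples along the index finger (MediaPipe indices 0, 5-8)
TRIPLES = [(5, 6, 7), (6, 7, 8), (0, 5, 6)]
rng = np.random.default_rng(2)
kp = rng.standard_normal((21, 3))

d0 = angle_descriptor(kp, TRIPLES)
# Rotate about z, scale isotropically, translate -> descriptor is unchanged
th = 0.7
R = np.array([[np.cos(th), -np.sin(th), 0.0],
              [np.sin(th),  np.cos(th), 0.0],
              [0.0, 0.0, 1.0]])
kp2 = 2.5 * (kp @ R.T) + np.array([1.0, -2.0, 0.5])
d1 = angle_descriptor(kp2, TRIPLES)
```

Translation cancels in the difference vectors, rotation preserves dot products and norms, and isotropic scaling cancels in the cosine ratio, which is why the descriptor sidesteps camera-viewpoint and hand-scale shift.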
cs.CV / 41 / 2603.09217
TubeMLLM: A Foundation Model for Topology Knowledge Exploration in Vessel-like Anatomy
TubeMLLM:一种用于血管状解剖结构拓扑知识探索的基础模型
Abstract
Modeling medical vessel-like anatomy is challenging due to its intricate topology and sensitivity to dataset shifts. Consequently, task-specific models often suffer from topological inconsistencies, including artificial disconnections and spurious merges. Motivated by the promise of multimodal large language models (MLLMs) for zero-shot generalization, we propose TubeMLLM, a unified foundation model that couples structured understanding with controllable generation for medical vessel-like anatomy. By integrating topological priors through explicit natural language prompting and aligning them with visual representations in a shared-attention architecture, TubeMLLM significantly enhances topology-aware perception. Furthermore, we construct TubeMData, a pioneering multimodal benchmark comprising comprehensive topology-centric tasks, and introduce an adaptive loss weighting strategy to emphasize topology-critical regions during training. Extensive experiments on fifteen diverse datasets demonstrate the superiority of our approach. Quantitatively, TubeMLLM achieves state-of-the-art out-of-distribution performance, substantially reducing global topological discrepancies on color fundus photography (decreasing the $\beta_{0}$ number error from 37.42 to 8.58 compared to baselines). Notably, TubeMLLM exhibits exceptional zero-shot cross-modality transfer ability on unseen X-ray angiography, achieving a Dice score of 67.50% while significantly reducing the $\beta_{0}$ error to 1.21. TubeMLLM also maintains robustness against degradations such as blur, noise, and low resolution. Furthermore, in topology-aware understanding tasks, the model achieves 97.38% accuracy in evaluating mask topological quality, significantly outperforming standard vision-language baselines.
Chinese Translation
由于血管状解剖结构的复杂拓扑和对数据集变化的敏感性,建模这一领域面临挑战。因此,特定任务模型常常遭遇拓扑不一致的问题,包括人工断连和虚假合并。受到多模态大语言模型(MLLMs)在零样本泛化方面潜力的启发,我们提出了TubeMLLM,这是一种统一的基础模型,将结构化理解与可控生成相结合,专注于医学血管状解剖结构。通过明确的自然语言提示整合拓扑先验,并将其与共享注意力架构中的视觉表示对齐,TubeMLLM显著增强了对拓扑的感知。此外,我们构建了TubeMData,这是一个开创性的多模态基准,包含全面的以拓扑为中心的任务,并引入了一种自适应损失加权策略,以在训练过程中强调拓扑关键区域。在十五个多样化数据集上的广泛实验展示了我们的优势。从定量上看,TubeMLLM在分布外性能上达到了最先进水平,显著减少了彩色眼底摄影中的全局拓扑差异(相较于基线,将$\beta_{0}$数目误差从37.42减少到8.58)。值得注意的是,TubeMLLM在未见的X光血管造影中表现出卓越的零样本跨模态迁移能力,达到了67.50%的Dice得分,同时将$\beta_{0}$误差显著降低至1.21。TubeMLLM在模糊、噪声和低分辨率等退化情况下也保持了鲁棒性。此外,在拓扑感知理解任务中,该模型在评估掩膜拓扑质量时达到了97.38%的准确率,显著优于标准视觉-语言基线。
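The $\beta_{0}$ metric the abstract reports errors on counts connected components of a binary mask. It can be computed with a simple flood fill; the 4-connectivity used below is an assumption (segmentation work sometimes uses 8-connectivity):

```python
import numpy as np

def betti0(mask):
    """Count connected components (beta_0) of a binary mask via iterative
    flood fill with 4-connectivity."""
    mask = mask.astype(bool).copy()
    h, w = mask.shape
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j]:
                count += 1
                stack = [(i, j)]
                mask[i, j] = False
                while stack:
                    y, x = stack.pop()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx]:
                            mask[ny, nx] = False
                            stack.append((ny, nx))
    return count

# Toy vessel mask with three separate fragments
vessels = np.array([[1, 1, 0, 0],
                    [0, 1, 0, 1],
                    [0, 0, 0, 1],
                    [1, 0, 0, 0]])
b0 = betti0(vessels)
```

A model that artificially disconnects a vessel inflates $\beta_{0}$, which is exactly the discrepancy the paper's 37.42-to-8.58 improvement measures.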
cs.CV / 42 / 2603.09220
Distributed Convolutional Neural Networks for Object Recognition
用于目标识别的分布式卷积神经网络
Abstract
This paper proposes a novel loss function for training a distributed convolutional neural network (DisCNN) to recognize only a specific positive class. By mapping positive samples to a compact set in high-dimensional space and negative samples to the origin, the DisCNN extracts only the features of the positive class; an experiment demonstrates this property. Thus, the features of the positive class are disentangled from those of the negative classes. The model has a lightweight architecture because only a few positive-class features need to be extracted. The model demonstrates excellent generalization on the test data and remains effective even for unseen classes. Finally, using DisCNN, object detection of positive samples embedded in a large and complex background is straightforward.
Chinese Translation
本文提出了一种新颖的损失函数,用于训练分布式卷积神经网络(DisCNN),以仅识别特定的正类。通过将正样本映射到高维空间中的一个紧凑集合,并将负样本映射到原点,DisCNN仅提取正类的特征。实验结果证明了这一点。因此,正类的特征与负类的特征得以解耦。该模型具有轻量级架构,因为只需提取少量正类特征。该模型在测试数据上表现出色的泛化能力,即使对于未见过的类别也能保持有效。最后,使用DisCNN,在大型复杂背景中嵌入的正样本的目标检测变得简单明了。
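The proposed mapping, positives into a compact set and negatives to the origin, suggests a loss of the following shape. The ball-around-a-center form and the squared hinge are illustrative assumptions, not the paper's actual loss function:

```python
import numpy as np

def discnn_loss(pos_emb, neg_emb, center, radius=1.0):
    """Sketch of the stated idea: positives are pulled inside a compact
    ball around `center`; negatives are pulled to the origin.

    pos_emb, neg_emb: (N, D) embeddings; center: (D,) target of positives.
    """
    pos_dist = np.linalg.norm(pos_emb - center, axis=1)
    pos_loss = np.maximum(pos_dist - radius, 0.0) ** 2  # penalize only outside the ball
    neg_loss = np.linalg.norm(neg_emb, axis=1) ** 2     # drive negatives to the origin
    return pos_loss.mean() + neg_loss.mean()

center = np.array([3.0, 0.0])
good_pos = np.array([[3.2, 0.1], [2.9, -0.2]])  # compactly clustered near center
good_neg = np.zeros((2, 2))                     # collapsed at the origin
bad_pos = np.array([[0.0, 0.0], [6.0, 3.0]])    # scattered positives
loss_good = discnn_loss(good_pos, good_neg, center)
loss_bad = discnn_loss(bad_pos, good_neg, center)
```

Because negatives carry no class-specific target, only positive-class structure is encoded, which is the disentanglement the abstract describes.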
cs.CV / 43 / 2603.09223
UniField: A Unified Field-Aware MRI Enhancement Framework
UniField:一个统一的场感知MRI增强框架
Abstract
Magnetic Resonance Imaging (MRI) field-strength enhancement holds immense value for both clinical diagnostics and advanced research. However, existing methods typically focus on isolated enhancement tasks, such as specific 64mT-to-3T or 3T-to-7T transitions using limited subject cohorts, thereby failing to exploit the shared degradation patterns inherent across different field strengths and severely restricting model generalization. To address this challenge, we propose UniField, a unified framework integrating multiple modalities and enhancement tasks to mutually promote representation learning by exploiting these shared degradation characteristics. Specifically, our main contributions are threefold. Firstly, to overcome MRI data scarcity and capture continuous anatomical structures, UniField departs from conventional methods that treat 3D MRI volumes as independent 2D slices. Instead, we directly exploit comprehensive 3D volumetric information by leveraging pre-trained 3D foundation models, thereby embedding generalized and robust structural representations to significantly boost enhancement performance. In addition, to mitigate the spectral bias of mainstream flow-matching models that often over-smooth high-frequency details, we explicitly incorporate the physical mechanisms of magnetic fields to introduce a Field-Aware Spectral Rectification Mechanism (FASRM), tailoring customized spectral corrections to distinct field strengths. Finally, to resolve the fundamental data bottleneck, we organize and publicly release a comprehensive paired multi-field MRI dataset, which is an order of magnitude larger than existing datasets. Extensive experiments demonstrate our method's superiority over state-of-the-art approaches, achieving an average improvement of approximately 1.81 dB in PSNR and 9.47% in SSIM. Code will be released upon acceptance.
Chinese Translation
磁共振成像(MRI)场强增强在临床诊断和先进研究中具有重要价值。然而,现有方法通常集中于孤立的增强任务,例如使用有限的受试者群体进行特定的64mT到3T或3T到7T的转换,从而未能充分利用不同场强之间固有的共享退化模式,严重限制了模型的泛化能力。为了解决这一挑战,我们提出了UniField,一个统一框架,整合多种模态和增强任务,通过利用这些共享的退化特征来相互促进表征学习。具体而言,我们的主要贡献有三方面。首先,为了克服MRI数据稀缺并捕捉连续的解剖结构,UniField摒弃了将3D MRI体积视为独立2D切片的传统方法。相反,我们直接利用全面的3D体积信息,通过利用预训练的3D基础模型,嵌入通用且稳健的结构表征,从而显著提升增强性能。此外,为了减轻主流流匹配模型的频谱偏差,这些模型往往过度平滑高频细节,我们明确结合了磁场的物理机制,引入了场感知频谱校正机制(Field-Aware Spectral Rectification Mechanism, FASRM),为不同场强量身定制频谱校正。最后,为了解决根本的数据瓶颈,我们组织并公开发布了一个全面的配对多场MRI数据集,其规模比现有数据集大一个数量级。大量实验表明,我们的方法优于最先进的方法,在PSNR上平均提高约1.81 dB,在SSIM上提高9.47%。代码将在接受后发布。
cs.CV / 44 / 2603.09235
HelixTrack: Event-Based Tracking and RPM Estimation of Propeller-like Objects
HelixTrack:基于事件的螺旋桨类物体跟踪与转速估计
Abstract
Safety-critical perception for unmanned aerial vehicles and rotating machinery requires microsecond-latency tracking of fast, periodic motion under egomotion and strong distractors. Frame-based and event-based trackers drift or break on propellers because periodic signatures violate their smooth-motion assumptions. We tackle this gap with HelixTrack, a fully event-driven method that jointly tracks propeller-like objects and estimates their rotations per minute (RPM). Incoming events are back-warped from the image plane into the rotor plane via a homography estimated on the fly. A Kalman Filter maintains instantaneous estimates of phase. Batched iterative updates refine the object pose by coupling phase residuals to geometry. To our knowledge, no public dataset targets joint tracking and RPM estimation of propeller-like objects. We therefore introduce the Timestamped Quadcopter with Egomotion (TQE) dataset with 13 high-resolution event sequences, containing 52 rotating objects in total, captured at distances of 2 m / 4 m, with increasing egomotion and microsecond RPM ground truth. On TQE, HelixTrack processes full-rate events faster than real time (approx. 11.8x) with microsecond latency. It consistently outperforms per-event and aggregation-based baselines adapted for RPM estimation.
Chinese Translation
无人机和旋转机械的安全关键感知需要对快速、周期性运动进行微秒级延迟跟踪,尤其是在自运动和强干扰因素下。基于帧和基于事件的跟踪器在螺旋桨上会出现漂移或失效,因为周期性特征违反了它们的平滑运动假设。我们通过HelixTrack来解决这一问题,这是一种完全基于事件驱动的方法,能够同时跟踪螺旋桨类物体并估计其每分钟转速(RPM)。输入事件通过即时估计的单应性从图像平面反向变换到转子平面。卡尔曼滤波器保持相位的瞬时估计。批量迭代更新通过将相位残差与几何形状耦合来细化物体姿态。根据我们的了解,目前没有公开数据集专门针对螺旋桨类物体的联合跟踪和RPM估计。因此,我们引入了带有自运动的时间戳四旋翼(TQE)数据集,其中包含13个高分辨率事件序列,总共捕获52个旋转物体,拍摄距离为2米/4米,具有逐渐增加的自运动和微秒级RPM真实值。在TQE上,HelixTrack以超过实时的速度(约11.8倍实时)处理全速率事件流,并实现微秒级延迟。它始终优于为RPM估计而调整的逐事件和基于聚合的基线方法。
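The phase-to-RPM arithmetic behind the abstract can be sketched independently of the event pipeline: unwrap the measured rotor phase and fit its slope. The paper's actual method is a per-event Kalman filter with homography warping; the least-squares fit below is a simplified stand-in for the same conversion:

```python
import numpy as np

def rpm_from_phase(timestamps, phases):
    """Estimate rotations-per-minute from wrapped phase readings.

    Unwraps the 2*pi jumps, then least-squares fits the angular
    velocity (rad/s) and converts it to RPM.
    """
    unwrapped = np.unwrap(phases)
    omega = np.polyfit(timestamps, unwrapped, 1)[0]  # slope in rad/s
    return omega * 60.0 / (2 * np.pi)

# Synthetic rotor spinning at 50 rev/s (3000 RPM), sampled densely
t = np.linspace(0.0, 0.5, 2000)
phase = (2 * np.pi * 50.0 * t) % (2 * np.pi)  # wrapped phase measurements
rpm = rpm_from_phase(t, phase)
```

Unwrapping works whenever consecutive phase steps stay below pi, which is why high-rate event sampling makes microsecond-level RPM tracking feasible.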
cs.CV / 45 / 2603.09236
BridgeDiff: Bridging Human Observations and Flat-Garment Synthesis for Virtual Try-Off
BridgeDiff:连接人类观察与平面服装合成的虚拟试脱
Abstract
Virtual try-off (VTOFF) aims to recover canonical flat-garment representations from images of dressed persons for standardized display and downstream virtual try-on. Prior methods often treat VTOFF as direct image translation driven by local masks or text-only prompts, overlooking the gap between on-body appearances and flat layouts. This gap frequently leads to inconsistent completion in unobserved regions and unstable garment structure. We propose BridgeDiff, a diffusion-based framework that explicitly bridges human-centric observations and flat-garment synthesis through two complementary components. First, the Garment Condition Bridge Module (GCBM) builds a garment-cue representation that captures global appearance and semantic identity, enabling robust inference of continuous details under partial visibility. Second, the Flat Structure Constraint Module (FSCM) injects explicit flat-garment structural priors via Flat-Constraint Attention (FC-Attention) at selected denoising stages, improving structural stability beyond text-only conditioning. Extensive experiments on standard VTOFF benchmarks show that BridgeDiff achieves state-of-the-art performance, producing higher-quality flat-garment reconstructions while preserving fine-grained appearance and structural integrity.
Chinese Translation
虚拟试脱(VTOFF)旨在从穿着者的图像中恢复典型的平面服装表示,以用于标准化展示和后续的虚拟试穿。以往的方法通常将VTOFF视为由局部掩码或仅文本提示驱动的直接图像转换,忽视了身体外观与平面布局之间的差距。这一差距常常导致未观察区域的完成不一致以及服装结构的不稳定。我们提出了BridgeDiff,这是一种基于扩散的框架,通过两个互补组件明确连接以人为中心的观察与平面服装合成。首先,服装条件桥接模块(GCBM)构建了一个捕捉全局外观和语义身份的服装线索表示,使得在部分可见情况下能够稳健推断连续细节。其次,平面结构约束模块(FSCM)通过平面约束注意力(FC-Attention)在选定的去噪阶段注入明确的平面服装结构先验,从而提高了结构稳定性,超越了仅基于文本的条件。对标准VTOFF基准的广泛实验表明,BridgeDiff实现了最先进的性能,生成了更高质量的平面服装重建,同时保持了细致的外观和结构完整性。
cs.CV / 46 / 2603.09241
RAE-NWM: Navigation World Model in Dense Visual Representation Space
RAE-NWM:密集视觉表征空间中的导航世界模型
Abstract
Visual navigation requires agents to reach goals in complex environments through perception and planning. World models address this task by simulating action-conditioned state transitions to predict future observations. Current navigation world models typically learn state evolution under actions within the compressed latent space of a Variational Autoencoder, where spatial compression often discards fine-grained structural information and hinders precise control. To better understand the propagation characteristics of different representations, we conduct a linear dynamics probe and observe that dense DINOv2 features exhibit stronger linear predictability for action-conditioned transitions. Motivated by this observation, we propose the Representation Autoencoder-based Navigation World Model (RAE-NWM), which models navigation dynamics in a dense visual representation space. We employ a Conditional Diffusion Transformer with Decoupled Diffusion Transformer head (CDiT-DH) to model continuous transitions, and introduce a separate time-driven gating module for dynamics conditioning to regulate action injection strength during generation. Extensive evaluations show that modeling sequential rollouts in this space improves structural stability and action accuracy, benefiting downstream planning and navigation.
Chinese Translation
视觉导航要求智能体通过感知和规划在复杂环境中到达目标。世界模型通过模拟基于动作的状态转移来预测未来观察,从而解决这一任务。目前的导航世界模型通常在变分自编码器的压缩潜在空间中学习动作下的状态演变,而空间压缩往往会丢弃细粒度的结构信息,从而妨碍精确控制。为了更好地理解不同表征的传播特性,我们进行了一项线性动力学探测,观察到密集的 DINOv2 特征在基于动作的转移中表现出更强的线性可预测性。基于这一观察,我们提出了基于表征自编码器的导航世界模型(RAE-NWM),该模型在密集视觉表征空间中建模导航动态。我们采用了带有解耦扩散变换器头的条件扩散变换器(CDiT-DH)来建模连续转移,并引入一个独立的时间驱动门控模块用于动态条件调节,以在生成过程中调节动作注入强度。广泛的评估表明,在该空间中建模序列展开提高了结构稳定性和动作准确性,从而有利于下游规划和导航。
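The linear dynamics probe mentioned in the abstract can be sketched as a least-squares fit of the next state from the current (state, action) pair, scored by R^2. The exact probe the authors use is not specified, so this is an illustrative version:

```python
import numpy as np

def linear_probe_r2(states, actions, next_states):
    """Fit next_state ~ W [state; action] by least squares and report R^2,
    a simple measure of how linearly predictable the transitions are."""
    X = np.hstack([states, actions])
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    pred = X @ W
    ss_res = ((next_states - pred) ** 2).sum()
    ss_tot = ((next_states - next_states.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(4)
s = rng.standard_normal((256, 8))   # toy "representation" states
a = rng.standard_normal((256, 2))   # toy actions
A = rng.standard_normal((8, 8)) * 0.3
B = rng.standard_normal((2, 8))
s_next = s @ A + a @ B              # exactly linear toy dynamics
r2 = linear_probe_r2(s, a, s_next)
```

A representation space where this score is high (as the abstract reports for dense DINOv2 features) is easier for a world model to roll forward under action conditioning.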
cs.CV / 47 / 2603.09242
When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection
当检测器忘记取证:阻止语义捷径以实现可泛化的AI生成图像检测
Abstract
AI-generated image detection has become increasingly important with the rapid advancement of generative AI. However, detectors built on Vision Foundation Models (VFMs, \emph{e.g.}, CLIP) often struggle to generalize to images created using unseen generation pipelines. We identify, for the first time, a key failure mechanism, termed \emph{semantic fallback}, where VFM-based detectors rely on dominant pre-trained semantic priors (such as identity) rather than forgery-specific traces under distribution shifts. To address this issue, we propose \textbf{Geometric Semantic Decoupling (GSD)}, a parameter-free module that explicitly removes semantic components from learned representations by leveraging a frozen VFM as a semantic guide with a trainable VFM as an artifact detector. GSD estimates semantic directions from batch-wise statistics and projects them out via a geometric constraint, forcing the artifact detector to rely on semantic-invariant forensic evidence. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, achieving 94.4\% video-level AUC (+\textbf{1.2\%}) in cross-dataset evaluation, improving robustness to unseen manipulations (+\textbf{3.0\%} on DF40), and generalizing beyond faces to the detection of synthetic images of general scenes, including UniversalFakeDetect (+\textbf{0.9\%}) and GenImage (+\textbf{1.7\%}).
Chinese Translation
随着生成性人工智能的快速发展,AI生成图像的检测变得越来越重要。然而,基于视觉基础模型(Vision Foundation Models, VFMs,例如CLIP)构建的检测器在处理使用未见生成管道创建的图像时,往往难以实现泛化。我们首次识别出一种关键的失效机制,称为“语义回退”(semantic fallback),在这种机制下,基于VFM的检测器依赖于主导的预训练语义先验(如身份),而不是在分布转变下的伪造特定痕迹。为了解决这一问题,我们提出了几何语义解耦(Geometric Semantic Decoupling, GSD),这是一种无参数模块,通过利用一个冻结的VFM作为语义引导,并将可训练的VFM作为伪造检测器,明确地从学习的表示中去除语义成分。GSD通过批量统计估计语义方向,并通过几何约束将其投影出去,迫使伪造检测器依赖于语义不变的取证证据。大量实验表明,我们的方法在跨数据集评估中始终优于最先进的方法,视频级AUC达94.4%(提升1.2%),提高了对未见操控的鲁棒性(在DF40上提高了3.0%),并在超越人脸的情况下,实现了对一般场景合成图像的检测,包括UniversalFakeDetect(提升0.9%)和GenImage(提升1.7%)。
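The projection step of GSD, estimating a semantic direction from batch statistics and projecting it out of the detector's features, can be sketched as follows. Using the mean of the frozen-VFM guide features as the semantic direction is an assumption for illustration; the paper's estimator may differ:

```python
import numpy as np

def project_out(features, semantic_dir):
    """Remove the component of each feature along a semantic direction:
    f -> f - (f . u) u, with u the unit semantic direction."""
    u = semantic_dir / np.linalg.norm(semantic_dir)
    return features - np.outer(features @ u, u)

rng = np.random.default_rng(3)
artifact_feats = rng.standard_normal((32, 16))  # trainable-VFM batch features
# Batch-wise estimate of the dominant semantic direction: here simply the
# mean of frozen-VFM guide features (illustrative assumption)
guide_feats = rng.standard_normal((32, 16)) + 5.0
sem_dir = guide_feats.mean(axis=0)
cleaned = project_out(artifact_feats, sem_dir)
```

After the projection the features have exactly zero component along the estimated semantic direction, so any remaining signal the detector uses must be semantic-invariant.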
cs.CV / 48 / 2603.09245
Towards Instance Segmentation with Polygon Detection Transformers
基于多边形检测变换器的实例分割研究
Abstract
One of the bottlenecks for instance segmentation today lies in the conflicting requirements of high-resolution inputs and lightweight, real-time inference. To address this bottleneck, we present a Polygon Detection Transformer (Poly-DETR) to reformulate instance segmentation as sparse vertex regression via Polar Representation, thereby eliminating the reliance on dense pixel-wise mask prediction. Considering the box-to-polygon reference shift in Detection Transformers, we propose Polar Deformable Attention and Position-Aware Training Scheme to dynamically update supervision and focus attention on boundary cues. Compared with state-of-the-art polar-based methods, Poly-DETR achieves a 4.7 mAP improvement on MS COCO test-dev. Moreover, we construct a parallel mask-based counterpart to support a systematic comparison between polar and mask representations. Experimental results show that Poly-DETR is more lightweight in high-resolution scenarios, reducing memory consumption by almost half on Cityscapes dataset. Notably, on PanNuke (cell segmentation) and SpaceNet (building footprints) datasets, Poly-DETR surpasses its mask-based counterpart on all metrics, which validates its advantage on regular-shaped instances in domain-specific settings.
Chinese Translation
当前实例分割的一个瓶颈在于高分辨率输入与轻量级实时推理之间的矛盾需求。为了解决这一瓶颈,我们提出了一种多边形检测变换器(Polygon Detection Transformer,Poly-DETR),将实例分割重新表述为通过极坐标表示进行稀疏顶点回归,从而消除对密集像素级掩膜预测的依赖。考虑到检测变换器中的框到多边形参考偏移,我们提出了极坐标可变形注意力(Polar Deformable Attention)和位置感知训练方案(Position-Aware Training Scheme),以动态更新监督并聚焦于边界线索。与最先进的基于极坐标的方法相比,Poly-DETR在MS COCO test-dev上实现了4.7的mAP提升。此外,我们构建了一个并行的基于掩膜的对照模型,以支持极坐标与掩膜表示之间的系统比较。实验结果表明,Poly-DETR在高分辨率场景中更为轻量,Cityscapes数据集上的内存消耗几乎减少了一半。值得注意的是,在PanNuke(细胞分割)和SpaceNet(建筑轮廓)数据集上,Poly-DETR在所有指标上均超越其基于掩膜的对照模型,这验证了其在特定领域设置中对规则形状实例的优势。
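The Polar Representation of an instance, sparse radii at fixed angles around a center, can be decoded into polygon vertices as follows. The number of rays and the evenly spaced angles are illustrative assumptions, not Poly-DETR's actual configuration:

```python
import numpy as np

def polar_to_polygon(center, radii):
    """Decode a polar instance representation: K radii at evenly spaced
    angles around `center` -> (K, 2) polygon vertices."""
    k = len(radii)
    angles = np.linspace(0.0, 2 * np.pi, k, endpoint=False)
    offsets = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    return center + offsets

# A regular 8-gon of radius 2 centered at (10, 10)
poly = polar_to_polygon(np.array([10.0, 10.0]), np.full(8, 2.0))
```

Regressing K radii instead of an H x W mask is why memory cost stays low at high resolution, and why the representation suits regular shapes such as cells and building footprints.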
cs.CV / 49 / 2603.09255
Multi-model approach for autonomous driving: A comprehensive study on traffic sign-, vehicle- and lane detection and behavioral cloning
多模型方法在自动驾驶中的应用:关于交通标志、车辆和车道检测及行为克隆的综合研究
Abstract
Deep learning and computer vision techniques have become increasingly important in the development of self-driving cars. These techniques play a crucial role in enabling self-driving cars to perceive and understand their surroundings, allowing them to safely navigate and make decisions in real-time. Using neural networks, self-driving cars can accurately identify and classify objects such as pedestrians, other vehicles, and traffic signals. Using deep learning and analyzing data from sensors such as cameras and radar, self-driving cars can predict the likely movement of other objects and plan their own actions accordingly. This study provides a novel approach to enhancing the performance of self-driving cars by using pre-trained and custom-made neural networks for key tasks, including traffic sign classification, vehicle detection, lane detection, and behavioral cloning. The methodology integrates several innovative techniques, such as geometric and color transformations for data augmentation, image normalization, and transfer learning for feature extraction. These techniques are applied to diverse datasets, including the German Traffic Sign Recognition Benchmark (GTSRB), road and lane segmentation datasets, vehicle detection datasets, and data collected using the Udacity self-driving car simulator, to evaluate model efficacy. The primary objective of the work is to review the state of the art in deep learning and computer vision for self-driving cars. The findings are effective in solving various challenges related to self-driving cars, such as traffic sign classification, lane prediction, vehicle detection, and behavioral cloning, and provide valuable insights into improving the robustness and reliability of autonomous systems, paving the way for future research and deployment of safer and more efficient self-driving technologies.
Chinese Translation
深度学习和计算机视觉技术在自动驾驶汽车的发展中变得越来越重要。这些技术在使自动驾驶汽车感知和理解其周围环境方面发挥着关键作用,使其能够安全地实时导航和做出决策。通过使用神经网络,自动驾驶汽车能够准确识别和分类行人、其他车辆和交通信号等物体。利用深度学习并分析来自摄像头和雷达等传感器的数据,自动驾驶汽车可以预测其他物体的可能运动并相应地规划自己的行动。本研究提供了一种新颖的方法,通过使用预训练和定制的神经网络来增强自动驾驶汽车在关键任务中的性能,包括交通标志分类、车辆检测、车道检测和行为克隆。该方法整合了多种创新技术,如用于数据增强的几何和颜色变换、图像归一化以及用于特征提取的迁移学习。这些技术应用于多样化的数据集,包括德国交通标志识别基准(GTSRB)、道路和车道分割数据集、车辆检测数据集,以及使用Udacity自动驾驶汽车模拟器收集的数据,以评估模型的有效性。本工作的主要目标是回顾自动驾驶汽车领域内深度学习和计算机视觉的最新进展。研究结果有效地解决了与自动驾驶汽车相关的各种挑战,如交通标志分类、车道预测、车辆检测和行为克隆,并为提高自主系统的鲁棒性和可靠性提供了宝贵的见解,为未来更安全和高效的自动驾驶技术的研究和部署铺平了道路。
cs.CV / 50 / 2603.09258
Multimodal Graph Representation Learning with Dynamic Information Pathways
具有动态信息通道的多模态图表示学习
Abstract
Multimodal graphs, where nodes contain heterogeneous features such as images and text, are increasingly common in real-world applications. Effectively learning on such graphs requires both adaptive intra-modal message passing and efficient inter-modal aggregation. However, most existing approaches to multimodal graph learning are typically extended from conventional graph neural networks and rely on static structures or dense attention, which limit flexibility and expressive node embedding learning. In this paper, we propose a novel multimodal graph representation learning framework with Dynamic information Pathways (DiP). By introducing modality-specific pseudo nodes, DiP enables dynamic message routing within each modality via proximity-guided pseudo-node interactions and captures inter-modality dependence through efficient information pathways in a shared state space. This design achieves adaptive, expressive, and sparse message propagation across modalities with linear complexity. We conduct the link prediction and node classification tasks to evaluate performance and carry out full experimental analyses. Extensive experiments across multiple benchmarks demonstrate that DiP consistently outperforms baselines.
Chinese Translation
多模态图的节点包含异构特征,如图像和文本,在现实应用中越来越普遍。有效地在此类图上进行学习需要适应性的模内信息传递和高效的模间聚合。然而,大多数现有的多模态图学习方法通常是从传统图神经网络扩展而来,依赖于静态结构或密集注意力,这限制了灵活性和节点嵌入学习的表达能力。本文提出了一种新颖的多模态图表示学习框架,称为动态信息通道(Dynamic information Pathways, DiP)。通过引入特定模态的伪节点,DiP能够通过邻近引导的伪节点交互实现模内动态信息路由,并通过共享状态空间中的高效信息通道捕捉模间依赖。该设计实现了跨模态的自适应、表达性和稀疏信息传播,且具有线性复杂度。我们进行了链接预测和节点分类任务以评估性能,并进行了全面的实验分析。在多个基准测试中的广泛实验表明,DiP始终优于基线方法。
cs.CV / 51 / 2603.09259
Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos
来自网络视频的视觉与语言导航的隐式几何表示
Abstract
Vision-and-Language Navigation (VLN) has long been constrained by the limited diversity and scalability of simulator-curated datasets, which fail to capture the complexity of real-world environments. To overcome this limitation, we introduce a large-scale video-instruction framework derived from web-based room tour videos, enabling agents to learn from natural human walking demonstrations in diverse, realistic indoor settings. Unlike existing datasets, our framework integrates both open-ended description-enriched trajectories and action-enriched trajectories reconstructed in 3D, providing richer spatial and semantic supervision. A key extension in this work is the incorporation of implicit geometry representations, which extract spatial cues directly from RGB frames without requiring fragile 3D reconstruction. This approach substantially improves data utilization, alleviates reconstruction failures, and unlocks large portions of previously unusable video data. Comprehensive experiments across multiple VLN benchmarks (CVDN, SOON, R2R, and REVERIE) demonstrate that our method not only sets new state-of-the-art performance but also enables the development of robust zero-shot navigation agents. By bridging large-scale web videos with implicit spatial reasoning, this work advances embodied navigation towards more scalable, generalizable, and real-world applicable solutions.
Chinese Translation
视觉与语言导航(VLN)长期以来受到模拟器策划数据集的有限多样性和可扩展性的限制,这些数据集未能捕捉现实环境的复杂性。为了克服这一限制,我们引入了一种基于网络房间导览视频的大规模视频指令框架,使得智能体能够在多样且真实的室内环境中,从自然的人类行走演示中学习。与现有数据集不同,我们的框架整合了开放式描述丰富的轨迹和在三维中重建的动作丰富轨迹,提供了更丰富的空间和语义监督。本研究的一个关键扩展是引入隐式几何表示,它直接从RGB帧中提取空间线索,而无需脆弱的三维重建。这种方法显著提高了数据利用率,缓解了重建失败,并解锁了大量之前无法使用的视频数据。在多个VLN基准(CVDN、SOON、R2R和REVERIE)上的综合实验表明,我们的方法不仅设定了新的最先进性能,还使得开发稳健的零样本导航智能体成为可能。通过将大规模网络视频与隐式空间推理相结合,本研究推动了具身导航朝着更可扩展、可泛化和适用于现实世界的解决方案发展。
cs.CV / 52 / 2603.09266
ForgeDreamer: Industrial Text-to-3D Generation with Multi-Expert LoRA and Cross-View Hypergraph
ForgeDreamer:基于多专家LoRA和跨视图超图的工业文本到3D生成
Abstract
Current text-to-3D generation methods excel in natural scenes but struggle in industrial applications due to two critical limitations: domain adaptation challenges, where conventional LoRA fusion causes knowledge interference across categories, and geometric reasoning deficiencies, where pairwise consistency constraints fail to capture the higher-order structural dependencies essential for precision manufacturing. We propose a novel framework named ForgeDreamer that addresses both challenges through two key innovations. First, we introduce a Multi-Expert LoRA Ensemble mechanism that consolidates multiple category-specific LoRA models into a unified representation, achieving superior cross-category generalization while eliminating knowledge interference. Second, building on enhanced semantic understanding, we develop a Cross-View Hypergraph Geometric Enhancement approach that captures structural dependencies spanning multiple viewpoints simultaneously. These components work synergistically: enhanced semantic understanding enables more effective geometric reasoning, while hypergraph modeling ensures manufacturing-level consistency. Extensive experiments on a custom industrial dataset demonstrate superior semantic generalization and enhanced geometric fidelity compared to state-of-the-art approaches. Our code and data are provided in the supplementary material attached in the appendix for review purposes.
Chinese Translation
当前的文本到3D生成方法在自然场景中表现出色,但在工业应用中面临两大关键限制:领域适应挑战,传统的LoRA融合导致跨类别的知识干扰;几何推理不足,成对一致性约束未能捕捉到精密制造所需的高阶结构依赖关系。我们提出了一种名为ForgeDreamer的新框架,针对这两大挑战提出了两项关键创新。首先,我们引入了多专家LoRA集成机制,将多个特定类别的LoRA模型整合为统一表示,实现在消除知识干扰的同时,提升跨类别的泛化能力。其次,基于增强的语义理解,我们开发了一种跨视图超图几何增强方法,能够同时捕捉跨多个视点的结构依赖关系。这些组件协同工作,改善了语义理解,提升了几何推理的有效性,同时超图建模确保了制造级的一致性。在一个定制的工业数据集上的广泛实验表明,与最先进的方法相比,我们的方法在语义泛化和几何保真度上具有显著优势。我们的代码和数据已在附录的补充材料中提供以供审阅。
cs.CV / 53 / 2603.09277
Speeding Up the Learning of 3D Gaussians with Much Shorter Gaussian Lists
通过更短的高斯列表加速3D高斯的学习
Abstract
3D Gaussian splatting (3DGS) has become a vital tool for learning a radiance field from multiple posed images. Although 3DGS shows great advantages over NeRF in terms of rendering quality and efficiency, it remains a research challenge to further improve the efficiency of learning 3D Gaussians. To overcome this challenge, we propose novel training strategies and losses to shorten each Gaussian list used to render a pixel, which speeds up the splatting by involving fewer Gaussians along a ray. Specifically, we shrink the size of each Gaussian by resetting their scales regularly, encouraging smaller Gaussians to cover fewer nearby pixels, which shortens the Gaussian lists of pixels. Additionally, we introduce an entropy constraint on the alpha blending procedure to sharpen the weight distribution of Gaussians along each ray, which drives dominant weights larger while making minor weights smaller. As a result, each Gaussian becomes more focused on the pixels where it is dominant, which reduces its impact on nearby pixels, leading to even shorter Gaussian lists. Eventually, we integrate our method into a rendering resolution scheduler which further improves efficiency through progressive resolution increase. We evaluate our method by comparing it with state-of-the-art methods on widely used benchmarks. Our results show significant advantages over others in efficiency without sacrificing rendering quality.
Chinese Translation
3D高斯喷溅(3D Gaussian splatting, 3DGS)已成为从多视角图像学习辐射场的重要工具。尽管3DGS在渲染质量和效率方面相较于NeRF表现出显著优势,但进一步提高3D高斯学习效率仍然是一个研究挑战。为了解决这一挑战,我们提出了新颖的训练策略和损失函数,以缩短用于渲染像素的每个高斯列表,从而通过沿光线涉及更少的高斯来加速喷溅过程。具体而言,我们通过定期重置高斯的尺度来缩小每个高斯的大小,鼓励较小的高斯覆盖更少的附近像素,从而缩短像素的高斯列表。此外,我们在α混合过程中引入了熵约束,以锐化每条光线上的高斯权重分布,使主导权重增大,而次要权重减小。结果是,每个高斯更加集中于其主导的像素,减少了对附近像素的影响,从而导致更短的高斯列表。最终,我们将我们的方法集成到一个渲染分辨率调度器中,通过逐步提高分辨率进一步提高效率。我们通过在广泛使用的基准上将我们的方法与最先进的方法进行比较来评估我们的工作。我们的结果显示,在不牺牲渲染质量的情况下,在效率上显著优于其他方法。
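The entropy constraint on alpha blending described in the abstract above can be illustrated numerically. The sketch below uses the standard front-to-back compositing weights; the alpha values and the absence of any loss weighting are illustrative assumptions:

```python
import numpy as np

def blending_weights(alphas):
    """Alpha-compositing weights for Gaussians ordered front-to-back
    along a ray: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    alphas = np.asarray(alphas, dtype=np.float64)
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    return alphas * transmittance

def entropy_loss(alphas, eps=1e-10):
    """Entropy of the (normalized) weight distribution along one ray.
    Minimizing this drives dominant weights up and minor weights down,
    which is the sharpening effect the abstract describes."""
    w = blending_weights(alphas)
    p = w / (w.sum() + eps)
    return float(-np.sum(p * np.log(p + eps)))

# A near-uniform set of alphas has higher weight entropy than a peaked one.
print(entropy_loss([0.3, 0.3, 0.3, 0.3]))   # diffuse ray: higher entropy
print(entropy_loss([0.95, 0.1, 0.1, 0.1]))  # one dominant Gaussian: lower
```

Sharpening the weights in this way concentrates each Gaussian's contribution on fewer pixels, which is what allows the per-pixel Gaussian lists to shrink.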
cs.CV / 54 / 2603.09283
From Ideal to Real: Stable Video Object Removal under Imperfect Conditions
从理想到现实:不完美条件下的稳定视频物体移除
Abstract
Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose Stable Video Object Removal (SVOR), a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: (1) Mask Union for Stable Erasure (MUSE), a windowed union strategy applied during temporal mask downsampling to preserve all target regions observed within each window, effectively handling abrupt motion and reducing missed removals; (2) Denoising-Aware Segmentation (DA-Seg), a lightweight segmentation head on a decoupled side branch equipped with Denoising-Aware AdaLN and trained with mask degradation to provide an internal diffusion-aware localization prior without affecting content generation; and (3) Curriculum Two-Stage Training, in which Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks to learn realistic background and temporal priors, and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses, jointly removing objects and their associated shadows/reflections while improving cross-domain robustness. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications.
Chinese Translation
在现实世界的不完美条件下,如阴影、突发运动和缺陷遮罩,视频中物体的移除仍然困难。现有的基于扩散的视频修复模型在这些挑战下往往难以保持时间稳定性和视觉一致性。我们提出了稳定视频物体移除(Stable Video Object Removal, SVOR),这是一个强大的框架,通过三个关键设计实现无阴影、无闪烁和对遮罩缺陷的容忍移除:(1) 稳定擦除的遮罩联合(Mask Union for Stable Erasure, MUSE),一种在时间遮罩下采样过程中应用的窗口联合策略,以保留每个窗口中观察到的所有目标区域,有效处理突发运动并减少遗漏移除;(2) 去噪感知分割(Denoising-Aware Segmentation, DA-Seg),一个轻量级的分割头,位于一个解耦的侧支路上,配备去噪感知自适应层归一化(Denoising-Aware AdaLN),并通过遮罩退化进行训练,以提供内部的去噪感知定位先验,而不影响内容生成;(3) 课程式两阶段训练:第一阶段在未配对的真实背景视频上进行自监督预训练,使用在线随机遮罩学习真实背景和时间先验,第二阶段在合成对上进行细化,使用遮罩退化和副作用加权损失,联合移除物体及其相关阴影/反射,同时提高跨域鲁棒性。大量实验表明,SVOR在多个数据集和退化遮罩基准上达到了新的最先进结果,推动了视频物体移除从理想设置向现实应用的发展。
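The MUSE windowed-union idea from the abstract above is concrete enough to sketch: during temporal downsampling, each output mask is the elementwise union of all masks in its window, so a fast-moving target is never dropped. The shapes and toy example below are assumptions for illustration:

```python
import numpy as np

def windowed_mask_union(masks, window):
    """Downsample a (T, H, W) binary mask sequence by taking the union
    (elementwise max) over each temporal window of size `window`, so that
    target regions visible in any frame of the window are preserved."""
    T, H, W = masks.shape
    pad = (-T) % window  # zero-pad the tail so T divides evenly
    if pad:
        masks = np.concatenate([masks, np.zeros((pad, H, W), masks.dtype)])
    return masks.reshape(-1, window, H, W).max(axis=1)

# An object that jumps across frames is covered after the union, whereas
# naive frame subsampling could miss it entirely.
masks = np.zeros((4, 2, 2), dtype=np.uint8)
masks[0, 0, 0] = 1   # object at top-left in frame 0
masks[1, 1, 1] = 1   # abrupt motion: bottom-right in frame 1
union = windowed_mask_union(masks, window=2)
print(union.shape)   # (2, 2, 2)
print(union[0])      # both positions kept in the first window
```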
cs.CV / 55 / 2603.09285
Learning Convex Decomposition via Feature Fields
通过特征场学习凸分解
Abstract
This work proposes a new formulation of the long-standing problem of convex decomposition through learning feature fields, enabling the first feed-forward model for open-world convex decomposition. Our method produces high-quality decompositions of 3D shapes into a union of convex bodies, which are essential to accelerate collision detection in physical simulation, amongst many other applications. The key insight is to adopt a feature learning approach and learn a continuous feature field that can later be clustered to yield a good convex decomposition via our self-supervised, purely-geometric objective derived from the classical definition of convexity. Our formulation can be used for single shape optimization, but more importantly, feature prediction unlocks scalable, self-supervised learning on large datasets resulting in the first learned open-world model for convex decomposition. Experiments show that our decompositions are higher-quality than alternatives and generalize across open-world objects as well as across representations to meshes, CAD models, and even Gaussian splats. https://research.nvidia.com/labs/sil/projects/learning-convex-decomp/
Chinese Translation
本研究提出了一种新的公式化方法,解决了长期存在的凸分解问题,通过学习特征场,使得开放世界凸分解的前馈模型成为可能。我们的方法能够将三维形状高质量地分解为一组凸体,这对于加速物理仿真中的碰撞检测以及其他许多应用至关重要。关键的见解在于采用特征学习的方法,学习一个连续的特征场,随后可以通过我们的自监督、纯几何目标进行聚类,从而获得良好的凸分解,该目标源自于凸性的经典定义。我们的公式化方法可以用于单一形状的优化,但更重要的是,特征预测解锁了在大规模数据集上进行可扩展的自监督学习,从而实现了第一个学习的开放世界凸分解模型。实验表明,我们的分解质量优于其他方法,并且能够在开放世界对象以及在网格、CAD模型甚至高斯点云等表示之间进行泛化。
cs.CV / 56 / 2603.09286
CogBlender: Towards Continuous Cognitive Intervention in Text-to-Image Generation
CogBlender:朝着文本到图像生成中的持续认知干预
Abstract
Beyond conveying semantic information, an image can also manifest cognitive attributes that elicit specific cognitive processes from the viewer, such as memory encoding or emotional response. While modern text-to-image models excel at generating semantically coherent content, they remain limited in their ability to control such cognitive properties of images (e.g., valence, memorability), often failing to align with the specific psychological intent. To bridge this gap, we introduce CogBlender, a framework that enables continuous and multi-dimensional intervention of cognitive properties during text-to-image generation. Our approach is built upon a mapping between the Cognitive Space, representing the space of cognitive properties, and the Semantic Manifold, representing the manifold of the visual semantics. We define a set of Cognitive Anchors, serving as the boundary points for the cognitive space. Then we reformulate the velocity field within the flow-matching process by interpolating from the velocity field of different anchors. Consequently, the generative process is driven by the velocity field and dynamically steered by multi-dimensional cognitive scores, enabling precise, fine-grained, and continuous intervention. We validate the effectiveness of CogBlender across four representative cognitive dimensions: valence, arousal, dominance, and image memorability. Extensive experiments demonstrate that our method achieves effective cognitive intervention. Our work provides an effective paradigm for cognition-driven creative design.
Chinese Translation
除了传达语义信息外,图像还可以表现出引发观众特定认知过程的认知属性,例如记忆编码或情感反应。尽管现代文本到图像模型在生成语义连贯内容方面表现出色,但它们在控制图像的认知属性(例如,情感价值、可记忆性)方面仍然有限,常常无法与特定的心理意图对齐。为了解决这一问题,我们提出了CogBlender,一个框架,能够在文本到图像生成过程中实现对认知属性的持续和多维干预。我们的方法基于认知空间(Cognitive Space)与语义流形(Semantic Manifold)之间的映射,前者表示认知属性的空间,后者表示视觉语义的流形。我们定义了一组认知锚点(Cognitive Anchors),作为认知空间的边界点。然后,我们通过从不同锚点的速度场进行插值,重新构造流匹配过程中的速度场。因此,生成过程由速度场驱动,并通过多维认知评分动态引导,实现精确、细致和持续的干预。我们在四个代表性的认知维度上验证了CogBlender的有效性:情感价值、唤醒度、主导性和图像可记忆性。大量实验表明,我们的方法实现了有效的认知干预。我们的工作为基于认知驱动的创意设计提供了一种有效的范式。
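The velocity-field interpolation at the core of the generative steering described above admits a toy sketch: the blended field is a score-weighted combination of per-anchor fields. The convex weighting below is an illustrative assumption; the paper defines its anchors and scores over the Cognitive Space:

```python
import numpy as np

def blended_velocity(anchor_velocities, scores):
    """Interpolate per-anchor velocity fields into a single field.
    anchor_velocities: (K, D) velocities predicted under K cognitive
    anchors; scores: (K,) non-negative cognitive scores used as
    interpolation weights (normalized to sum to 1). This is a sketch of
    the interpolation idea only, not the paper's flow-matching model."""
    scores = np.asarray(scores, dtype=np.float64)
    w = scores / scores.sum()
    return np.tensordot(w, np.asarray(anchor_velocities, dtype=np.float64), axes=1)

# Two anchors (e.g., low vs. high valence); sliding the score blends the
# generative trajectory continuously between them.
v_low = np.array([1.0, 0.0])
v_high = np.array([0.0, 1.0])
v = blended_velocity([v_low, v_high], scores=[0.25, 0.75])
print(v)  # [0.25 0.75]
```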
cs.CV / 57 / 2603.09287
Exploring Modality-Aware Fusion and Decoupled Temporal Propagation for Multi-Modal Object Tracking
探索面向模态的融合与解耦时间传播在多模态目标跟踪中的应用
Abstract
Most existing multimodal trackers adopt uniform fusion strategies, overlooking the inherent differences between modalities. Moreover, they propagate temporal information through mixed tokens, leading to entangled and less discriminative temporal representations. To address these limitations, we propose MDTrack, a novel framework for modality-aware fusion and decoupled temporal propagation in multimodal object tracking. Specifically, for modality-aware fusion, we allocate dedicated experts to each modality, including infrared, event, depth, and RGB, to process their respective representations. The gating mechanism within the Mixture of Experts dynamically selects the optimal experts based on the input features, enabling adaptive and modality-specific fusion. For decoupled temporal propagation, we introduce two separate State Space Model (SSM) structures to independently store and update the hidden states of the RGB and X modal streams, effectively capturing their distinct temporal information. To ensure synergy between the two temporal representations, we incorporate a set of cross-attention modules between the input features of the two SSMs, facilitating implicit information exchange. The resulting temporally enriched features are then integrated into the backbone through another set of cross-attention modules, enhancing MDTrack's ability to leverage temporal information. Extensive experiments demonstrate the effectiveness of our proposed method. Both MDTrack-S and MDTrack-U achieve state-of-the-art performance across five multimodal tracking benchmarks.
Chinese Translation
大多数现有的多模态跟踪器采用统一的融合策略,忽视了模态之间的固有差异。此外,它们通过混合标记传播时间信息,导致时间表示的纠缠和辨别能力降低。为了解决这些局限性,我们提出了MDTrack,一个用于多模态目标跟踪的面向模态的融合和解耦时间传播的新框架。具体而言,对于面向模态的融合,我们为每种模态分配专门的专家,包括红外、事件、深度和RGB,以处理各自的表示。在专家混合体中的门控机制根据输入特征动态选择最佳专家,实现自适应和模态特定的融合。对于解耦时间传播,我们引入两个独立的状态空间模型(State Space Model)结构,以独立存储和更新RGB和X模态流的隐藏状态,有效捕捉它们各自的时间信息。为了确保这两种时间表示之间的协同,我们在两个状态空间模型的输入特征之间引入一组交叉注意力模块,促进隐式信息交换。最终,得到的时间丰富特征通过另一组交叉注意力模块整合到主干网络中,增强了MDTrack利用时间信息的能力。大量实验表明我们提出的方法的有效性。MDTrack S和MDTrack U在五个多模态跟踪基准上均实现了最先进的性能。
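The Mixture-of-Experts gating described in the abstract above can be sketched as follows; the expert functions and gating parameters are stand-ins, since the paper's experts are learned per-modality networks:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_fuse(features, experts, gate_weights):
    """Mixture-of-Experts fusion sketch: a gating network scores each
    expert from the input features, and the output is the gate-weighted
    combination of expert outputs. The expert callables here are toy
    stand-ins for per-modality expert networks."""
    gate = softmax(gate_weights @ features)             # (K,) expert scores
    outputs = np.stack([e(features) for e in experts])  # (K, D)
    return gate @ outputs, gate

rng = np.random.default_rng(0)
x = rng.normal(size=4)
experts = [lambda f: f * 2.0, lambda f: f + 1.0, lambda f: -f]  # illustrative
W = rng.normal(size=(3, 4))  # gating weights (would be learned)
fused, gate = moe_fuse(x, experts, W)
print(fused.shape)  # (4,)
```

Because the gate is input-dependent, the same module adapts its expert mix per sample, which is the "adaptive and modality-specific fusion" the abstract refers to.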
cs.CV / 58 / 2603.09291
DenoiseSplat: Feed-Forward Gaussian Splatting for Noisy 3D Scene Reconstruction
去噪溅射:用于噪声3D场景重建的前馈高斯溅射
Abstract
3D scene reconstruction and novel-view synthesis are fundamental for VR, robotics, and content creation. However, most NeRF and 3D Gaussian Splatting pipelines assume clean inputs and degrade under real noise and artifacts. We therefore propose DenoiseSplat, a feed-forward 3D Gaussian splatting method for noisy multi-view images. We build a large-scale, scene-consistent noisy-clean benchmark on RE10K by injecting Gaussian, Poisson, speckle, and salt-and-pepper noise with controlled intensities. With a lightweight MVSplat-style feed-forward backbone, we train end-to-end using only clean 2D renderings as supervision and no 3D ground truth. On noisy RE10K, DenoiseSplat outperforms vanilla MVSplat and a strong two-stage baseline (IDF + MVSplat) in PSNR/SSIM and LPIPS across noise types and levels.
Chinese Translation
3D场景重建和新视角合成是虚拟现实、机器人技术和内容创作的基础。然而,大多数神经辐射场(NeRF)和3D高斯溅射管道假设输入是干净的,并在真实噪声和伪影下表现不佳。因此,我们提出了DenoiseSplat,这是一种用于噪声多视图图像的前馈3D高斯溅射方法。我们通过在RE10K上注入具有可控强度的高斯噪声、泊松噪声、斑点噪声和椒盐噪声,构建了一个大规模、场景一致的噪声-干净基准。借助轻量级的MVSplat风格前馈骨干网络,我们仅使用干净的2D渲染作为监督,且不依赖于3D真实值进行端到端训练。在噪声RE10K上,DenoiseSplat在PSNR/SSIM和LPIPS指标上超越了普通的MVSplat和一个强大的两阶段基线(IDF + MVSplat),在不同的噪声类型和水平下均表现出色。
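The four noise types used to build the noisy-clean benchmark above are standard image corruptions and can be reproduced directly; the intensity parameters below are illustrative, not the paper's controlled settings:

```python
import numpy as np

def add_noise(img, kind, rng, sigma=0.1, amount=0.05):
    """Inject one of the four corruption types named in the abstract;
    `img` holds float values in [0, 1]. Parameter defaults are
    illustrative assumptions."""
    if kind == "gaussian":
        noisy = img + rng.normal(0.0, sigma, img.shape)
    elif kind == "poisson":
        peak = 255.0  # simulate photon shot noise at an assumed peak count
        noisy = rng.poisson(img * peak) / peak
    elif kind == "speckle":
        noisy = img * (1.0 + rng.normal(0.0, sigma, img.shape))
    elif kind == "salt_pepper":
        noisy = img.copy()
        flip = rng.random(img.shape) < amount
        noisy[flip] = rng.integers(0, 2, img.shape)[flip].astype(float)
    else:
        raise ValueError(kind)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
clean = rng.random((8, 8))
for kind in ["gaussian", "poisson", "speckle", "salt_pepper"]:
    noisy = add_noise(clean, kind, rng)
    print(kind, noisy.shape)
```

Pairing each corrupted view with its clean counterpart, while keeping the corruption consistent within a scene, yields the scene-consistent noisy-clean pairs the benchmark requires.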
cs.CV / 59 / 2603.09312
IntroSVG: Learning from Rendering Feedback for Text-to-SVG Generation via an Introspective Generator-Critic Framework
IntroSVG:通过内省生成-评价框架从渲染反馈中学习文本到SVG的生成
Abstract
Scalable Vector Graphics (SVG) are central to digital design due to their inherent scalability and editability. Despite significant advancements in content generation enabled by Visual Language Models (VLMs), existing text-to-SVG generation methods are limited by a core challenge: the autoregressive training process does not incorporate visual perception of the final rendered image, which fundamentally constrains generation quality. To address this limitation, we propose an Introspective SVG Generation Framework (IntroSVG). At its core, the framework instantiates a unified VLM that operates in a closed loop, assuming dual roles of both generator and critic. Specifically, through Supervised Fine-Tuning (SFT), the model learns to draft SVGs and to provide feedback on their rendered outputs; moreover, we systematically convert early-stage failures into high-quality error-correction training data, thereby enhancing model robustness. Subsequently, we leverage a high-capacity teacher VLM to construct a preference dataset and further align the generator's policy through Direct Preference Optimization (DPO). During inference, the optimized generator and critic operate collaboratively in an iterative "generate-review-refine" cycle, starting from imperfect intermediate drafts to autonomously improve output quality. Experimental results demonstrate that our method achieves state-of-the-art performance across several key evaluation metrics, generating SVGs with more complex structures, stronger semantic alignment, and greater editability. These results corroborate the effectiveness of incorporating explicit visual feedback into the generation loop.
Chinese Translation
可缩放矢量图形(SVG)因其固有的可缩放性和可编辑性而在数字设计中占据核心地位。尽管视觉语言模型(VLMs)在内容生成方面取得了显著进展,但现有的文本到SVG生成方法受到一个核心挑战的限制:自回归训练过程未能纳入最终渲染图像的视觉感知,这从根本上限制了生成质量。为了解决这一限制,我们提出了内省SVG生成框架(IntroSVG)。该框架的核心是实例化一个统一的VLM,在闭环中运行,承担生成器和评价者的双重角色。具体而言,通过监督微调(SFT),模型学习绘制SVG并对其渲染输出提供反馈;此外,我们系统性地将早期阶段的失败转化为高质量的错误纠正训练数据,从而增强模型的鲁棒性。随后,我们利用高容量的教师VLM构建偏好数据集,并通过直接偏好优化(DPO)进一步调整生成器的策略。在推理过程中,优化后的生成器和评价者在迭代的“生成-评审-精炼”循环中协同工作,从不完美的中间草稿开始,自主提高输出质量。实验结果表明,我们的方法在多个关键评估指标上实现了最先进的性能,生成的SVG具有更复杂的结构、更强的语义对齐性和更高的可编辑性。这些结果证实了将显式视觉反馈纳入生成循环的有效性。
cs.CV / 60 / 2603.09316
CLoE: Expert Consistency Learning for Missing Modality Segmentation
CLoE:缺失模态分割的专家一致性学习
Abstract
Multimodal medical image segmentation often faces missing modalities at inference, which induces disagreement among modality experts and makes fusion unstable, particularly on small foreground structures. We propose Consistency Learning of Experts (CLoE), a consistency-driven framework for missing-modality segmentation that preserves strong performance when all modalities are available. CLoE formulates robustness as decision-level expert consistency control and introduces a dual-branch Expert Consistency Learning objective. Modality Expert Consistency enforces global agreement among expert predictions to reduce case-wise drift under partial inputs, while Region Expert Consistency emphasizes agreement on clinically critical foreground regions to avoid background-dominated regularization. We further map consistency scores to modality reliability weights using a lightweight gating network, enabling reliability-aware feature recalibration before fusion. Extensive experiments on BraTS 2020 and MSD Prostate demonstrate that CLoE outperforms state-of-the-art methods in incomplete multimodal segmentation, while exhibiting strong cross-dataset generalization and improving robustness on clinically critical structures.
Chinese Translation
多模态医学图像分割在推理过程中常常面临缺失模态的问题,这会导致模态专家之间的不一致,并使融合过程不稳定,尤其是在小前景结构上。我们提出了一种名为专家一致性学习(CLoE)的框架,该框架以一致性为驱动,旨在实现缺失模态分割,并在所有模态可用时保持强大的性能。CLoE将鲁棒性形式化为决策层面的专家一致性控制,并引入了双分支的专家一致性学习目标。模态专家一致性强制专家预测之间的全局一致性,以减少部分输入下的个案漂移,而区域专家一致性则强调在临床关键前景区域上的一致性,以避免背景主导的正则化。我们进一步利用轻量级门控网络将一致性得分映射为模态可靠性权重,从而在融合前实现可靠性感知的特征重校准。在BraTS 2020和MSD前列腺数据集上的大量实验表明,CLoE在不完整的多模态分割中优于最先进的方法,同时展现出强大的跨数据集泛化能力,并提高了在临床关键结构上的鲁棒性。
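The mapping from consistency scores to reliability weights described in the abstract above can be sketched with a softmax gate; the paper uses a learned lightweight gating network, so the temperature-softmax below is only an assumption about the general shape of the mapping:

```python
import numpy as np

def reliability_weights(consistency_scores, temperature=1.0):
    """Map per-modality expert consistency scores to reliability weights:
    higher consistency yields a larger weight. A sketch of the mapping
    idea, not the paper's gating network."""
    s = np.asarray(consistency_scores, dtype=np.float64) / temperature
    e = np.exp(s - s.max())
    return e / e.sum()

def recalibrate(features, weights):
    """Reliability-aware recalibration before fusion: scale each
    modality's feature map by its reliability weight."""
    return [w * f for w, f in zip(weights, features)]

# Three modality experts; the second disagrees with the others, so its
# consistency score is low and it is down-weighted before fusion.
scores = [0.9, 0.2, 0.8]
w = reliability_weights(scores)
print(np.round(w, 3))
```

With a missing or corrupted modality, its expert's consistency score drops, so the gate automatically suppresses that branch at fusion time.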
cs.CV / 61 / 2603.09320
SpaceSense-Bench: A Large-Scale Multi-Modal Benchmark for Spacecraft Perception and Pose Estimation
SpaceSense-Bench:一个大规模多模态基准用于航天器感知与姿态估计
Abstract
Autonomous space operations such as on-orbit servicing and active debris removal demand robust part-level semantic understanding and precise relative navigation of target spacecraft, yet collecting large-scale real data in orbit remains impractical due to cost and access constraints. Existing synthetic datasets, moreover, suffer from limited target diversity, single-modality sensing, and incomplete ground-truth annotations. We present SpaceSense-Bench, a large-scale multi-modal benchmark for spacecraft perception encompassing 136 satellite models with approximately 70 GB of data. Each frame provides time-synchronized 1024×1024 RGB images, millimeter-precision depth maps, and 256-beam LiDAR point clouds, together with dense 7-class part-level semantic labels at both the pixel and point level as well as accurate 6-DoF pose ground truth. The dataset is generated through a high-fidelity space simulation built in Unreal Engine 5 and a fully automated pipeline covering data acquisition, multi-stage quality control, and conversion to mainstream formats. We benchmark five representative tasks (object detection, 2D semantic segmentation, RGB-LiDAR fusion-based 3D point cloud segmentation, monocular depth estimation, and orientation estimation) and identify two key findings: (i) perceiving small-scale components (e.g., thrusters and omni-antennas) and generalizing to entirely unseen spacecraft in a zero-shot setting remain critical bottlenecks for current methods, and (ii) scaling up the number of training satellites yields substantial performance gains on novel targets, underscoring the value of large-scale, diverse datasets for space perception research. The dataset, code, and toolkit are publicly available at https://github.com/wuaodi/SpaceSense-Bench.
Chinese Translation
自主航天操作,如在轨服务和主动去除太空垃圾,要求对目标航天器进行稳健的部件级语义理解和精确的相对导航。然而,由于成本和获取限制,在轨收集大规模真实数据仍然不切实际。此外,现有的合成数据集存在目标多样性有限、单一模态传感和地面真实标注不完整等问题。我们提出了SpaceSense-Bench,这是一个大规模多模态基准,涵盖136个卫星模型,数据量约为70GB。每帧提供时间同步的1024×1024 RGB图像、毫米级精度的深度图以及256束激光雷达点云,同时在像素和点级别提供密集的7类部件级语义标签以及准确的6自由度姿态真实值。该数据集通过在Unreal Engine 5中构建的高保真空间模拟和覆盖数据采集、多阶段质量控制及转换为主流格式的全自动管道生成。我们基准测试了五个代表性任务(目标检测、2D语义分割、基于RGB-激光雷达融合的3D点云分割、单目深度估计和方向估计),并识别出两个关键发现:(i)感知小规模组件(如推进器和全向天线)以及在零样本设置中对完全未见航天器的泛化仍然是当前方法的关键瓶颈;(ii)增加训练卫星的数量显著提高了新目标的性能,强调了大规模、多样化数据集在航天感知研究中的价值。数据集、代码和工具包可在https://github.com/wuaodi/SpaceSense-Bench公开获取。
cs.CV / 62 / 2603.09326
OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models
OddGridBench:揭示多模态大型语言模型在细粒度视觉差异敏感性方面的不足
Abstract
Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, where a single element differs from all others by one or multiple visual attributes such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5, and proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human levels in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model's fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence. Code and dataset are available at https://wwwtttjjj.github.io/OddGridBench/.
Chinese Translation
多模态大型语言模型(MLLMs)在广泛的视觉语言任务中取得了显著的性能。然而,它们在低级视觉感知方面的能力,特别是在检测细粒度视觉差异方面,仍然未得到充分探索且缺乏系统分析。在本研究中,我们引入了OddGridBench,一个可控的基准,用于评估MLLMs的视觉差异敏感性。OddGridBench包含超过1400个基于网格的图像,其中一个元素在颜色、大小、旋转或位置等一个或多个视觉属性上与其他元素不同。实验表明,所有评估的MLLMs,包括开源家族如Qwen3-VL和InternVL3.5,以及专有系统如Gemini-2.5-Pro和GPT-5,在视觉差异检测方面的表现远低于人类水平。我们进一步提出了OddGrid-GRPO,一个集成了课程学习和距离感知奖励的强化学习框架。通过逐步控制训练样本的难度,并将空间邻近性约束纳入奖励设计,OddGrid-GRPO显著增强了模型的细粒度视觉辨别能力。我们希望OddGridBench和OddGrid-GRPO能够为推进感知基础和多模态智能中的视觉差异敏感性奠定基础。代码和数据集可在https://wwwtttjjj.github.io/OddGridBench/获取。
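The distance-aware reward in OddGrid-GRPO can be sketched for a grid task: an exact hit earns full reward, and a miss earns partial credit that decays with distance to the true cell. The specific shaping function below is an assumption; the abstract only states that spatial proximity enters the reward design:

```python
import numpy as np

def distance_aware_reward(pred_cell, true_cell, grid_shape, exact_bonus=1.0):
    """Reward that decays with the distance between the predicted and
    true odd-element cells instead of a 0/1 hit signal, densifying the
    RL signal. The shaping function is an illustrative assumption."""
    pred = np.asarray(pred_cell, dtype=np.float64)
    true = np.asarray(true_cell, dtype=np.float64)
    if np.array_equal(pred, true):
        return exact_bonus
    # Normalize by the grid diagonal so partial credit lies in [0, 0.5).
    max_dist = np.linalg.norm(np.asarray(grid_shape, dtype=np.float64) - 1.0)
    return max(0.0, 1.0 - np.linalg.norm(pred - true) / max_dist) * 0.5

# On a 5x5 grid: exact hit, near miss, far miss.
print(distance_aware_reward((2, 2), (2, 2), (5, 5)))  # 1.0
print(distance_aware_reward((2, 3), (2, 2), (5, 5)))  # partial credit
print(distance_aware_reward((0, 0), (4, 4), (5, 5)))  # 0.0
```

Keeping partial credit strictly below the exact-hit bonus preserves the incentive to localize the discrepancy precisely.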
cs.CV / 63 / 2603.09337
Beyond Scaling: Assessing Strategic Reasoning and Rapid Decision-Making Capability of LLMs in Zero-sum Environments
超越规模:评估大型语言模型在零和环境中的战略推理与快速决策能力
Abstract
Large Language Models (LLMs) have achieved strong performance on static reasoning benchmarks, yet their effectiveness as interactive agents operating in adversarial, time-sensitive environments remains poorly understood. Existing evaluations largely treat reasoning as a single-shot capability, overlooking the challenges of opponent-aware decision-making, temporal constraints, and execution under pressure. This paper introduces Strategic Tactical Agent Reasoning (STAR) Benchmark, a multi-agent evaluation framework that assesses LLMs through 1v1 zero-sum competitive interactions, framing reasoning as an iterative, adaptive decision-making process. STAR supports both turn-based and real-time settings, enabling controlled analysis of long-horizon strategic planning and fast-paced tactical execution within a unified environment. Built on a modular architecture with a standardized API and fully implemented execution engine, STAR facilitates reproducible evaluation and flexible task customization. To move beyond binary win-loss outcomes, we introduce a Strategic Evaluation Suite that assesses not only competitive success but also the quality of strategic behavior, such as execution efficiency and outcome stability. Extensive pairwise evaluations reveal a pronounced strategy-execution gap: while reasoning-intensive models dominate turn-based settings, their inference latency often leads to inferior performance in real-time scenarios, where faster instruction-tuned models prevail. These results show that strategic intelligence in interactive environments depends not only on reasoning depth, but also on the ability to translate plans into timely actions, positioning STAR as a principled benchmark for studying this trade-off in competitive, dynamic settings.
Chinese Translation
大型语言模型(LLMs)在静态推理基准测试中表现出色,但它们作为在对抗性、时间敏感环境中运作的互动代理的有效性仍然不甚明了。现有评估大多将推理视为一次性能力,忽视了对手意识决策、时间限制和压力下执行的挑战。本文介绍了战略战术代理推理(Strategic Tactical Agent Reasoning, STAR)基准,这是一个多代理评估框架,通过1对1的零和竞争互动来评估LLMs,将推理框定为一个迭代的、适应性的决策过程。STAR支持回合制和实时设置,能够在统一环境中对长期战略规划和快速战术执行进行受控分析。基于模块化架构,配备标准化API和完全实现的执行引擎,STAR促进了可重复的评估和灵活的任务定制。为了超越二元胜负结果,我们引入了战略评估套件,不仅评估竞争成功,还评估战略行为的质量,如执行效率和结果稳定性。广泛的成对评估揭示了显著的战略执行差距:尽管推理密集型模型在回合制设置中占据主导地位,但其推理延迟常常导致在实时场景中表现不佳,而更快的指令调优模型则占据优势。这些结果表明,在互动环境中,战略智能不仅依赖于推理深度,还依赖于将计划转化为及时行动的能力,使STAR成为研究这一竞争性动态环境中权衡的原则性基准。
cs.CV / 64 / 2603.09338
Predictive Spectral Calibration for Source-Free Test-Time Regression
无源测试时间回归的预测光谱校准
Abstract
Test-time adaptation (TTA) for image regression has received far less attention than its classification counterpart. Methods designed for classification often depend on classification-specific objectives and decision boundaries, making them difficult to transfer directly to continuous regression targets. Recent progress revisits regression TTA through subspace alignment, showing that simple source-guided alignment can be both practical and effective. Building on this line of work, we propose Predictive Spectral Calibration (PSC), a source-free framework that extends subspace alignment to block spectral matching. Instead of relying on a fixed support subspace alone, PSC jointly aligns target features within the source predictive support and calibrates residual spectral slack in the orthogonal complement. PSC remains simple to implement, model-agnostic, and compatible with off-the-shelf pretrained regressors. Experiments on multiple image regression benchmarks show consistent improvements over strong baselines, with particularly clear gains under severe distribution shifts.
Chinese Translation
与分类任务相比,图像回归的测试时间适应(TTA)受到的关注明显较少。为分类设计的方法通常依赖于特定于分类的目标和决策边界,这使得它们难以直接迁移到连续回归目标上。近期的进展通过子空间对齐重新审视了回归 TTA,表明简单的源引导对齐在实践中既有效又可行。在此基础上,我们提出了预测光谱校准(Predictive Spectral Calibration, PSC),这是一个无源框架,将子空间对齐扩展到区块光谱匹配。PSC 不仅依赖于固定的支持子空间,而是共同对齐目标特征与源预测支持,并校准正交补空间中的残余光谱松弛。PSC 实现简单,模型无关,并且与现成的预训练回归器兼容。在多个图像回归基准测试中的实验表明,PSC 在强基线之上持续改善,尤其在严重的分布变化下表现出明显的提升。
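The subspace-alignment core that PSC builds on can be sketched as follows: estimate the source predictive support from source features, then penalize target feature energy outside it. This omits PSC's block spectral matching and residual slack calibration and shows only the base alignment idea, with illustrative shapes:

```python
import numpy as np

def source_subspace(source_feats, k):
    """Top-k principal directions of source features (a proxy for the
    source predictive support); source_feats is (N, D)."""
    X = source_feats - source_feats.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]  # (k, D) orthonormal basis

def alignment_loss(target_feats, basis):
    """Mean squared energy of target features outside the source
    subspace; driving this down aligns target features with the
    source predictive support."""
    Z = target_feats - target_feats.mean(axis=0)
    residual = Z - (Z @ basis.T) @ basis  # component orthogonal to the basis
    return float((residual ** 2).mean())

rng = np.random.default_rng(0)
src = rng.normal(size=(200, 8))
basis = source_subspace(src, k=4)
in_span = rng.normal(size=(50, 4)) @ basis       # features inside the support
print(round(alignment_loss(in_span, basis), 6))  # ~0: already aligned
```

Because only the source basis is stored, this objective needs no access to source data at test time, which is what makes the setting source-free.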
cs.CV / 65 / 2603.09359
Evidential Perfusion Physics-Informed Neural Networks with Residual Uncertainty Quantification
证据性灌注物理信息神经网络与残差不确定性量化
Abstract
Physics-informed neural networks (PINNs) have shown promise in addressing the ill-posed deconvolution problem in computed tomography perfusion (CTP) imaging for acute ischemic stroke assessment. However, existing PINN-based approaches remain deterministic and do not quantify uncertainty associated with violations of physics constraints, limiting reliability assessment. We propose Evidential Perfusion Physics-Informed Neural Networks (EPPINN), a framework that integrates evidential deep learning with physics-informed modeling to enable uncertainty-aware perfusion parameter estimation. EPPINN models arterial input, tissue concentration, and perfusion parameters using coordinate-based networks, and places a Normal-Inverse-Gamma distribution over the physics residual to characterize voxel-wise aleatoric and epistemic uncertainty in physics consistency without requiring Bayesian sampling or ensemble inference. The framework further incorporates physiologically constrained parameterization and stabilization strategies to promote robust per-case optimization. We evaluate EPPINN on digital phantom data, the ISLES 2018 benchmark, and a clinical cohort. On the evaluated datasets, EPPINN achieves lower normalized mean absolute error than classical deconvolution and PINN baselines, particularly under sparse temporal sampling and low signal-to-noise conditions, while providing conservative uncertainty estimates with high empirical coverage. On clinical data, EPPINN attains the highest voxel-level and case-level infarct-core detection sensitivity. These results suggest that evidential physics-informed learning can improve both accuracy and reliability of CTP analysis for time-critical stroke assessment.
Chinese Translation
物理信息神经网络(PINNs)在解决计算机断层灌注(CTP)成像中急性缺血性中风评估的病态反卷积问题方面展现了良好的前景。然而,现有的基于PINN的方法仍然是确定性的,并未量化与物理约束违背相关的不确定性,从而限制了可靠性评估。我们提出了证据性灌注物理信息神经网络(EPPINN),这是一个将证据深度学习与物理信息建模相结合的框架,旨在实现不确定性感知的灌注参数估计。EPPINN使用基于坐标的网络对动脉输入、组织浓度和灌注参数进行建模,并在物理残差上放置正态-逆伽马分布,以表征物理一致性中的体素级别的随机和认知不确定性,而无需贝叶斯采样或集成推理。该框架进一步结合了生理约束参数化和稳定化策略,以促进每个案例的稳健优化。我们在数字幻影数据、ISLES 2018基准和临床队列上评估了EPPINN。在评估的数据集中,EPPINN在稀疏时间采样和低信噪比条件下,相较于经典反卷积和PINN基线,达到了更低的归一化平均绝对误差,同时提供了具有高经验覆盖率的保守不确定性估计。在临床数据上,EPPINN在体素级和案例级的梗死核心检测灵敏度上达到了最高水平。这些结果表明,证据性物理信息学习可以提高CTP分析在时间敏感的中风评估中的准确性和可靠性。
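The Normal-Inverse-Gamma parameterization yields closed-form aleatoric and epistemic uncertainties, which is what lets EPPINN avoid Bayesian sampling or ensembles. The sketch below uses the standard evidential-regression formulas for a NIG posterior (here placed over the physics residual, per the abstract):

```python
def nig_uncertainties(gamma, nu, alpha, beta):
    """Closed-form uncertainty decomposition for a Normal-Inverse-Gamma
    posterior NIG(gamma, nu, alpha, beta), valid for alpha > 1:
      aleatoric  = E[sigma^2] = beta / (alpha - 1)
      epistemic  = Var[mu]    = beta / (nu * (alpha - 1))
    gamma is the predicted residual mean (unused in the variances)."""
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (nu * (alpha - 1.0))
    return aleatoric, epistemic

# More virtual evidence (larger nu) shrinks epistemic uncertainty only,
# leaving the data-noise (aleatoric) term unchanged.
al1, ep1 = nig_uncertainties(gamma=0.0, nu=1.0, alpha=2.0, beta=0.5)
al2, ep2 = nig_uncertainties(gamma=0.0, nu=10.0, alpha=2.0, beta=0.5)
print(al1, ep1)  # 0.5 0.5
print(al2, ep2)  # 0.5 0.05
```

Evaluating these two quantities per voxel gives exactly the voxel-wise aleatoric/epistemic maps described above, at the cost of a single forward pass.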
cs.CV / 66 / 2603.09367
M3GCLR: Multi-View Mini-Max Infinite Skeleton-Data Game Contrastive Learning For Skeleton-Based Action Recognition
M3GCLR:基于多视角的迷你-极大无限骨骼数据游戏对比学习用于骨骼动作识别
Abstract
In recent years, contrastive learning has drawn significant attention as an effective approach to reducing reliance on labeled data. However, existing methods for self-supervised skeleton-based action recognition still face three major limitations: insufficient modeling of view discrepancies, lack of effective adversarial mechanisms, and uncontrollable augmentation perturbations. To tackle these issues, we propose the Multi-view Mini-Max infinite skeleton-data Game Contrastive Learning for skeleton-based action Recognition (M3GCLR), a game-theoretic contrastive framework. First, we establish the Infinite Skeleton-data Game (ISG) model and the ISG equilibrium theorem, and further provide a rigorous proof, enabling mini-max optimization based on multi-view mutual information. Then, we generate normal-extreme data pairs through multi-view rotation augmentation and adopt temporally averaged input as a neutral anchor to achieve structural alignment, thereby explicitly characterizing perturbation strength. Next, leveraging the proposed equilibrium theorem, we construct a strongly adversarial mini-max skeleton-data game to encourage the model to mine richer action-discriminative information. Finally, we introduce the dual-loss equilibrium optimizer to optimize the game equilibrium, allowing the learning process to maximize action-relevant information while minimizing encoding redundancy, and we prove the equivalence between the proposed optimizer and the ISG model. Extensive experiments show that M3GCLR achieves three-stream accuracies of 82.1% and 85.8% on NTU RGB+D 60 (X-Sub, X-View) and 72.3% and 75.0% on NTU RGB+D 120 (X-Sub, X-Set). On PKU-MMD Parts I and II, it attains three-stream accuracies of 89.1% and 45.2%, respectively, with all results matching or outperforming state-of-the-art performance. Ablation studies confirm the effectiveness of each component.
Chinese Translation
近年来,对比学习作为一种有效减少对标注数据依赖的方法,受到了广泛关注。然而,现有的自监督骨骼动作识别方法仍面临三个主要限制:视角差异建模不足、缺乏有效的对抗机制以及不可控的增强扰动。为了解决这些问题,我们提出了基于多视角的迷你-极大无限骨骼数据游戏对比学习(M3GCLR),这是一种博弈论对比框架。首先,我们建立了无限骨骼数据游戏(ISG)模型及其平衡定理,并进一步提供了严格的证明,从而实现基于多视角互信息的迷你-极大优化。然后,我们通过多视角旋转增强生成正常-极端数据对,并采用时间平均输入作为中性锚点以实现结构对齐,从而明确表征扰动强度。接下来,利用所提出的平衡定理,我们构建了一个强对抗的迷你-极大骨骼数据游戏,以鼓励模型挖掘更丰富的动作区分信息。最后,我们引入双损失平衡优化器来优化游戏平衡,使学习过程能够最大化与动作相关的信息,同时最小化编码冗余,并证明了所提出的优化器与ISG模型之间的等价性。大量实验表明,M3GCLR在NTU RGB+D 60(X-Sub, X-View)上实现了三流82.1%、85.8%的准确率,在NTU RGB+D 120(X-Sub, X-Set)上实现了72.3%、75.0%的准确率。在PKU-MMD第一和第二部分中,分别达到了89.1%、45.2%的三流准确率,所有结果均与最先进的性能相匹配或超越。消融研究证实了每个组件的有效性。
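To make the "normal-extreme data pairs via multi-view rotation augmentation" and the "temporally averaged neutral anchor" concrete, here is a minimal NumPy sketch. The function names and the specific angle values (10° as "normal", 90° as "extreme") are illustrative placeholders, not the authors' implementation; the point is only that perturbation strength is explicitly characterized by the rotation angle.

```python
import numpy as np

def rotate_skeleton(joints, angle_deg, axis="z"):
    """Rotate a (T, J, 3) joint sequence about one axis by angle_deg."""
    a = np.deg2rad(angle_deg)
    c, s = np.cos(a), np.sin(a)
    if axis == "z":
        R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    elif axis == "y":
        R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    else:  # "x"
        R = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
    return joints @ R.T

def make_pair(joints, normal_deg=10.0, extreme_deg=90.0):
    """Normal-extreme augmented pair; the angles make the perturbation
    strength explicit (placeholder values, not the paper's choices)."""
    return rotate_skeleton(joints, normal_deg), rotate_skeleton(joints, extreme_deg)

def neutral_anchor(joints):
    """Temporally averaged input used as a neutral anchor."""
    return joints.mean(axis=0, keepdims=True)
```

In the paper's game formulation, the contrastive players would then pull the normal and extreme views toward this anchor while an adversary pushes the augmentation toward its extreme setting.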
cs.CV / 67 / 2603.09374
MIL-PF: Multiple Instance Learning on Precomputed Features for Mammography Classification
MIL-PF:基于预计算特征的多实例学习用于乳腺X光分类
Abstract
Modern foundation models provide highly expressive visual representations, yet adapting them to high-resolution medical imaging remains challenging due to limited annotations and weak supervision. Mammography, in particular, is characterized by large images, variable multi-view studies and predominantly breast-level labels, making end-to-end fine-tuning computationally expensive and often impractical. We propose Multiple Instance Learning on Precomputed Features (MIL-PF), a scalable framework that combines frozen foundation encoders with a lightweight MIL head for mammography classification. By precomputing the semantic representations and training only a small task-specific aggregation module (40k parameters), the method enables efficient experimentation and adaptation without retraining large backbones. The architecture explicitly models the global tissue context and the sparse local lesion signals through attention-based aggregation. MIL-PF achieves state-of-the-art classification performance at clinical scale while substantially reducing training complexity. We release the code for full reproducibility.
Chinese Translation
现代基础模型提供了高度表达的视觉表示,但由于标注有限和弱监督,将其适应于高分辨率医学影像仍然具有挑战性。乳腺X光特别具有大图像、可变的多视角研究和主要以乳腺级别标签为主的特点,使得端到端的微调在计算上成本高昂且往往不切实际。我们提出了基于预计算特征的多实例学习(MIL-PF),这是一个可扩展的框架,将冻结的基础编码器与轻量级的多实例学习头结合用于乳腺X光分类。通过预计算语义表示并仅训练一个小型特定任务的聚合模块(40k参数),该方法实现了高效的实验和适应,而无需重新训练大型主干网络。该架构通过基于注意力的聚合显式建模全局组织上下文和稀疏局部病变信号。MIL-PF在临床规模上实现了最先进的分类性能,同时显著降低了训练复杂性。我们发布了代码以确保完全可重复性。
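The lightweight MIL head described above is essentially attention-based aggregation over frozen, precomputed instance features. The following is a minimal NumPy sketch of such a head; the class name, dimensions, and random initialization are illustrative assumptions, not the released MIL-PF code.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class AttentionMILHead:
    """Tiny attention-based MIL head over frozen, precomputed patch features."""
    def __init__(self, feat_dim, hidden_dim=64):
        self.V = rng.normal(scale=0.1, size=(feat_dim, hidden_dim))
        self.w = rng.normal(scale=0.1, size=hidden_dim)
        self.cls = rng.normal(scale=0.1, size=feat_dim)

    def __call__(self, H):
        # H: (num_instances, feat_dim) precomputed embeddings for one study
        scores = np.tanh(H @ self.V) @ self.w   # per-instance attention logits
        a = softmax(scores)                     # attention weights, sum to 1
        bag = a @ H                             # bag-level embedding (feat_dim,)
        return bag @ self.cls, a                # study-level logit, weights
```

Because only `V`, `w`, and `cls` would be trained, the parameter count stays tiny (on the order of the 40k quoted in the abstract for suitable dimensions), which is what makes experimentation cheap once features are precomputed.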
cs.CV / 68 / 2603.09377
SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization
SinGeo:释放单一模型在稳健跨视角地理定位中的潜力
Abstract
Robust cross-view geo-localization (CVGL) remains challenging despite the surge in recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across diverse conditions -- implicitly assuming all FoVs are equally difficult. To address this gap, we present SinGeo, a simple yet powerful framework that enables a single model to realize robust cross-view geo-localization without additional modules or explicit transformations. SinGeo employs a dual discriminative learning architecture that enhances intra-view discriminability within both ground and satellite branches, and is the first to introduce a curriculum learning strategy to achieve robust CVGL. Extensive evaluations on four benchmark datasets reveal that SinGeo achieves state-of-the-art (SOTA) results under diverse conditions, and notably outperforms methods specifically trained for extreme FoVs. Beyond superior performance, SinGeo also exhibits cross-architecture transferability. Furthermore, we propose a consistency evaluation method to quantitatively assess model stability under varying views, providing an explainable perspective for understanding and advancing robustness in future CVGL research. Code will be made available upon acceptance.
Chinese Translation
尽管最近在稳健跨视角地理定位(CVGL)方面取得了显著进展,但这一任务仍然充满挑战。现有方法依然依赖于特定视场(FoV)的训练范式,其中模型在固定的FoV下进行优化,但在未见过的FoV和未知方向上测试时会崩溃。这一局限性迫使我们部署多个模型以覆盖多样的变化。尽管已有研究通过简单地随机化FoV探索了动态FoV训练,但未能在多样条件下实现稳健性——隐含假设所有FoV的难度相同。为了解决这一问题,我们提出了SinGeo,一个简单而强大的框架,使单一模型能够实现稳健的跨视角地理定位,而无需额外模块或显式变换。SinGeo采用双重判别学习架构,增强了地面和卫星分支内的视角可区分性,并首次引入了课程学习策略以实现稳健的CVGL。在四个基准数据集上的广泛评估表明,SinGeo在多样条件下设定了最新的技术水平(SOTA)结果,并显著优于专门针对极端FoV训练的方法。除了卓越的性能外,SinGeo还表现出跨架构的可迁移性。此外,我们提出了一种一致性评估方法,以定量评估模型在不同视角下的稳定性,为理解和推动未来CVGL研究中的稳健性提供了可解释的视角。代码将在论文接受后提供。
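A curriculum over FoVs, as opposed to uniform randomization, can be sketched in a few lines: early in training only wide (easy) FoVs are sampled, and the admissible lower bound shrinks toward narrow (hard) crops as training progresses. The schedule below is a hypothetical linear one for illustration; SinGeo's actual curriculum design is not specified in the abstract.

```python
import random

def curriculum_fov(step, total_steps, fov_min=70, fov_max=360):
    """Sample a training FoV (degrees) whose lower bound shrinks over time.

    Early steps see only wide, easy FoVs; later steps admit progressively
    narrower, harder crops, so FoVs are not treated as equally difficult.
    """
    progress = min(step / max(total_steps, 1), 1.0)
    current_min = fov_max - progress * (fov_max - fov_min)
    return random.uniform(current_min, fov_max)
```

With plain FoV randomization the model would see `uniform(fov_min, fov_max)` from step 0; the curriculum instead delays the hardest settings until the representation is stable.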
cs.CV / 69 / 2603.09385
EventVGGT: Exploring Cross-Modal Distillation for Consistent Event-based Depth Estimation
EventVGGT:探索跨模态蒸馏以实现一致的基于事件的深度估计
Abstract
Event cameras offer superior sensitivity to high-speed motion and extreme lighting, making event-based monocular depth estimation a promising approach for robust 3D perception in challenging conditions. However, progress is severely hindered by the scarcity of dense depth annotations. While recent annotation-free approaches mitigate this by distilling knowledge from Vision Foundation Models (VFMs), a critical limitation persists: they process event streams as independent frames. By neglecting the inherent temporal continuity of event data, these methods fail to leverage the rich temporal priors encoded in VFMs, ultimately yielding temporally inconsistent and less accurate depth predictions. To address this, we introduce EventVGGT, a novel framework that explicitly models the event stream as a coherent video sequence. To the best of our knowledge, we are the first to distill spatio-temporal and multi-view geometric priors from the Visual Geometry Grounded Transformer (VGGT) into the event domain. We achieve this via a comprehensive tri-level distillation strategy: (i) Cross-Modal Feature Mixture (CMFM) bridges the modality gap at the output level by fusing RGB and event features to generate auxiliary depth predictions; (ii) Spatio-Temporal Feature Distillation (STFD) distills VGGT's powerful spatio-temporal representations at the feature level; and (iii) Temporal Consistency Distillation (TCD) enforces cross-frame coherence at the temporal level by aligning inter-frame depth changes. Extensive experiments demonstrate that EventVGGT consistently outperforms existing methods -- reducing the absolute mean depth error at 30m by over 53\% on EventScape (from 2.30 to 1.06) -- while exhibiting robust zero-shot generalization on the unseen DENSE and MVSEC datasets.
Chinese Translation
事件相机对高速运动和极端光照具有优越的敏感性,使得基于事件的单目深度估计成为在挑战性条件下实现稳健3D感知的有前景的方法。然而,稠密深度标注的稀缺严重阻碍了这一进展。尽管最近的无标注方法通过从视觉基础模型(Vision Foundation Models, VFMs)中蒸馏知识来缓解这一问题,但一个关键的限制依然存在:它们将事件流处理为独立的帧。由于忽视了事件数据固有的时间连续性,这些方法未能利用VFMs中编码的丰富时间先验,最终导致时间上不一致且准确性较低的深度预测。为了解决这一问题,我们提出了EventVGGT,一个明确将事件流建模为连贯视频序列的新框架。据我们所知,我们是首个将视觉几何基础变换器(Visual Geometry Grounded Transformer, VGGT)的时空和多视角几何先验蒸馏到事件领域的研究。我们通过全面的三级蒸馏策略实现这一目标:(i) 跨模态特征混合(Cross-Modal Feature Mixture, CMFM)在输出层面通过融合RGB和事件特征来弥合模态差距,从而生成辅助深度预测;(ii) 时空特征蒸馏(Spatio-Temporal Feature Distillation, STFD)在特征层面蒸馏VGGT强大的时空表示;(iii) 时间一致性蒸馏(Temporal Consistency Distillation, TCD)通过对齐帧间深度变化在时间层面强制执行跨帧一致性。大量实验表明,EventVGGT在现有方法中表现出色——在EventScape上将30米的绝对平均深度误差减少超过53\%(从2.30降至1.06),同时在未见的DENSE和MVSEC数据集上展现出稳健的零样本泛化能力。
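The Temporal Consistency Distillation (TCD) idea of "aligning inter-frame depth changes" admits a compact sketch: penalize the student where its frame-to-frame depth differences disagree with the teacher's, rather than where raw depths disagree. The L1 form below is an assumption for illustration; the paper's exact loss may differ.

```python
import numpy as np

def temporal_consistency_loss(student_depth, teacher_depth):
    """Align inter-frame depth changes between student and teacher.

    Both inputs are (T, H, W) depth sequences; the loss penalizes
    disagreement in frame-to-frame differences, not in raw depth,
    so a constant per-sequence offset incurs no penalty.
    """
    ds = np.diff(student_depth, axis=0)   # student inter-frame changes
    dt = np.diff(teacher_depth, axis=0)   # teacher inter-frame changes
    return np.abs(ds - dt).mean()
```

Note the invariance this buys: a student that is globally biased but temporally coherent is not punished by this term, which is exactly the cross-frame coherence TCD targets.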
cs.CV / 70 / 2603.09390
Training-Free Coverless Multi-Image Steganography with Access Control
具有访问控制的无训练无载体多图像隐写术
Abstract
Coverless Image Steganography (CIS) hides information without explicitly modifying a cover image, providing strong imperceptibility and inherent robustness to steganalysis. However, existing CIS methods largely lack robust access control, making it difficult to selectively reveal different hidden contents to different authorized users. Such access control is critical for scalable and privacy-sensitive information hiding in multi-user settings. We propose MIDAS, a training-free diffusion-based CIS framework that enables multi-image hiding with user-specific access control via latent-level fusion. MIDAS introduces a Random Basis mechanism to suppress residual structural information and a Latent Vector Fusion module that reshapes aggregated latents to align with the diffusion process. Experimental results demonstrate that MIDAS consistently outperforms existing training-free CIS baselines in access control functionality, stego image quality and diversity, robustness to noise, and resistance to steganalysis, establishing a practical and scalable approach to access-controlled coverless steganography.
Chinese Translation
无载体图像隐写术(CIS)在不显式修改载体图像的情况下隐藏信息,提供强大的不可察觉性和对隐写分析的固有鲁棒性。然而,现有的CIS方法在访问控制方面普遍缺乏稳健性,这使得向不同授权用户选择性地揭示不同隐藏内容变得困难。这种访问控制对于在多用户环境中进行可扩展和隐私敏感的信息隐藏至关重要。我们提出了MIDAS,一个无训练的基于扩散的CIS框架,通过潜在级别融合实现用户特定的访问控制,支持多图像隐藏。MIDAS引入了一种随机基机制,以抑制残余结构信息,并采用潜在向量融合模块,重塑聚合的潜在向量以与扩散过程对齐。实验结果表明,MIDAS在访问控制功能、隐写图像质量和多样性、对噪声的鲁棒性以及抵抗隐写分析方面,始终优于现有的无训练CIS基线,确立了一种实用且可扩展的无载体隐写术的访问控制方法。
cs.CV / 71 / 2603.09392
ICDAR 2025 Competition on End-to-End Document Image Machine Translation Towards Complex Layouts
ICDAR 2025 复杂布局文档图像端到端机器翻译竞赛
Abstract
Document Image Machine Translation (DIMT) seeks to translate text embedded in document images from one language to another by jointly modeling both textual content and page layout, bridging optical character recognition (OCR) and natural language processing (NLP). The DIMT 2025 Challenge advances research on end-to-end document image translation, a rapidly evolving area within multimodal document understanding. The competition features two tracks, OCR-free and OCR-based, each with two subtasks for small (less than 1B parameters) and large (greater than 1B parameters) models. Participants submit a single unified DIMT system, with the option to incorporate provided OCR transcripts. Running from December 10, 2024 to April 20, 2025, the competition attracted 69 teams and 27 valid submissions in total. Track 1 had 34 teams and 13 valid submissions, while Track 2 had 35 teams and 14 valid submissions. In this report, we present the challenge motivation, dataset construction, task definitions, evaluation protocol, and a summary of results. Our analysis shows that large-model approaches establish a promising new paradigm for translating complex-layout document images and highlight substantial opportunities for future research.
Chinese Translation
文档图像机器翻译(DIMT)旨在通过联合建模文本内容和页面布局,将嵌入文档图像中的文本从一种语言翻译为另一种语言,架起光学字符识别(OCR)与自然语言处理(NLP)之间的桥梁。DIMT 2025 挑战推动了端到端文档图像翻译的研究,这是多模态文档理解领域中一个快速发展的方向。该竞赛设有两个赛道,分别为无OCR和基于OCR的,每个赛道又包含两个子任务,针对小模型(参数少于10亿)和大模型(参数超过10亿)。参与者需提交一个统一的DIMT系统,并可选择纳入提供的OCR转录文本。竞赛时间从2024年12月10日到2025年4月20日,共吸引了69支队伍和27个有效提交。赛道1有34支队伍和13个有效提交,而赛道2则有35支队伍和14个有效提交。在本报告中,我们介绍了挑战的动机、数据集构建、任务定义、评估协议以及结果总结。我们的分析表明,大模型方法为翻译复杂布局文档图像建立了一个有前景的新范式,并突显了未来研究的重大机遇。
cs.CV / 72 / 2603.09405
YOLO-NAS-Bench: A Surrogate Benchmark with Self-Evolving Predictors for YOLO Architecture Search
YOLO-NAS-Bench:一种具有自我演化预测器的YOLO架构搜索替代基准
Abstract
Neural Architecture Search (NAS) for object detection is severely bottlenecked by high evaluation cost, as fully training each candidate YOLO architecture on COCO demands days of GPU time. Meanwhile, existing NAS benchmarks largely target image classification, leaving the detection community without a comparable benchmark for NAS evaluation. To address this gap, we introduce YOLO-NAS-Bench, the first surrogate benchmark tailored to YOLO-style detectors. YOLO-NAS-Bench defines a search space spanning channel width, block depth, and operator type across both backbone and neck, covering the core modules of YOLOv8 through YOLO12. We sample 1,000 architectures via random, stratified, and Latin Hypercube strategies, train them on COCO-mini, and build a LightGBM surrogate predictor. To sharpen the predictor in the high-performance regime most relevant to NAS, we propose a Self-Evolving Mechanism that progressively aligns the predictor's training distribution with the high-performance frontier, by using the predictor itself to discover and evaluate informative architectures in each iteration. This method grows the pool to 1,500 architectures and raises the ensemble predictor's R2 from 0.770 to 0.815 and Sparse Kendall Tau from 0.694 to 0.752, demonstrating strong predictive accuracy and ranking consistency. Using the final predictor as the fitness function for evolutionary search, we discover architectures that surpass all official YOLOv8-YOLO12 baselines at comparable latency on COCO-mini, confirming the predictor's discriminative power for top-performing detection architectures.
Chinese Translation
神经架构搜索(NAS)在目标检测中的应用受到高评估成本的严重制约,因为在COCO上完全训练每个候选YOLO架构需要数天的GPU时间。同时,现有的NAS基准主要针对图像分类,导致检测领域缺乏可比的NAS评估基准。为了解决这一问题,我们提出了YOLO-NAS-Bench,这是第一个专为YOLO风格检测器量身定制的替代基准。YOLO-NAS-Bench定义了一个搜索空间,涵盖了通道宽度、块深度和操作类型,涉及到YOLOv8到YOLO12的核心模块。我们通过随机、分层和拉丁超立方体策略抽样了1,000个架构,在COCO-mini上进行训练,并构建了一个LightGBM替代预测器。为了在与NAS最相关的高性能领域中提升预测器的准确性,我们提出了一种自我演化机制,该机制通过使用预测器本身在每次迭代中发现和评估信息架构,逐步将预测器的训练分布与高性能前沿对齐。这种方法将架构池扩大到1,500个,并将集成预测器的R²从0.770提升至0.815,稀疏Kendall Tau从0.694提升至0.752,展示了强大的预测准确性和排名一致性。使用最终的预测器作为进化搜索的适应度函数,我们发现的架构在COCO-mini上超越了所有官方YOLOv8-YOLO12基线,并在可比延迟下确认了预测器对顶级检测架构的区分能力。
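The Self-Evolving Mechanism can be sketched as a loop: fit a surrogate on the evaluated pool, use it to rank a large cheap candidate set, fully evaluate only the predicted top architectures, and fold those back into the pool. The sketch below swaps LightGBM for a self-contained least-squares surrogate and uses a toy "accuracy" function in place of COCO-mini training; all names and the quadratic objective are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_ap(x):
    """Stand-in for expensive COCO-mini training: a hidden quadratic score."""
    return 1.0 - ((x - 0.7) ** 2).sum(axis=-1)

def fit_surrogate(X, y):
    """Least-squares surrogate (stand-in for the LightGBM predictor)."""
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Q: np.hstack([Q, np.ones((len(Q), 1))]) @ coef

def self_evolve(X, y, rounds=3, pool=200, top_k=20):
    """Each round: score a large candidate pool with the surrogate, fully
    evaluate only the predicted top-k, and grow the training set with them,
    shifting the surrogate's data toward the high-performance frontier."""
    for _ in range(rounds):
        predict = fit_surrogate(X, y)
        cand = rng.random((pool, X.shape[1]))        # cheap random architectures
        best = cand[np.argsort(predict(cand))[-top_k:]]
        X = np.vstack([X, best])
        y = np.concatenate([y, true_ap(best)])       # the only costly step
    return X, y, fit_surrogate(X, y)
```

This mirrors the paper's pool growth (1,000 to 1,500 architectures) in miniature: the surrogate's training distribution is progressively biased toward the region where ranking accuracy matters most for NAS.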
cs.CV / 73 / 2603.09408
Reviving ConvNeXt for Efficient Convolutional Diffusion Models
复兴 ConvNeXt 以实现高效的卷积扩散模型
Abstract
Recent diffusion models increasingly favor Transformer backbones, motivated by the remarkable scalability of fully attentional architectures. Yet the locality bias, parameter efficiency, and hardware friendliness--the attributes that established ConvNets as the efficient vision backbone--have seen limited exploration in modern generative modeling. Here we introduce the fully convolutional diffusion model (FCDM), a model having a backbone similar to ConvNeXt, but designed for conditional diffusion modeling. We find that using only 50% of the FLOPs of DiT-XL/2, FCDM-XL achieves competitive performance with 7$\times$ and 7.5$\times$ fewer training steps at 256$\times$256 and 512$\times$512 resolutions, respectively. Remarkably, FCDM-XL can be trained on a 4-GPU system, highlighting the exceptional training efficiency of our architecture. Our results demonstrate that modern convolutional designs provide a competitive and highly efficient alternative for scaling diffusion models, reviving ConvNeXt as a simple yet powerful building block for efficient generative modeling.
Chinese Translation
近年来,扩散模型越来越倾向于使用 Transformer 主干,这得益于完全注意力架构的显著可扩展性。然而,局部性偏差、参数效率和硬件友好性——这些使得卷积网络(ConvNets)成为高效视觉主干的特性——在现代生成建模中却鲜有探索。在此,我们引入了完全卷积扩散模型(FCDM),该模型的主干与 ConvNeXt 类似,但专为条件扩散建模而设计。我们发现,FCDM-XL 仅使用 DiT-XL/2 50% 的 FLOPs,在 256×256 和 512×512 分辨率下分别以 7 倍和 7.5 倍更少的训练步骤实现了具有竞争力的性能。值得注意的是,FCDM-XL 可以在 4-GPU 系统上进行训练,突显了我们架构的卓越训练效率。我们的结果表明,现代卷积设计为扩展扩散模型提供了一种具有竞争力且高效的替代方案,使 ConvNeXt 作为高效生成建模的简单而强大的构建块得以复兴。
cs.CV / 74 / 2603.09411
RiO-DETR: DETR for Real-time Oriented Object Detection
RiO-DETR:用于实时定向目标检测的DETR
Abstract
We present RiO-DETR: DETR for Real-time Oriented Object Detection, the first real-time oriented detection transformer to the best of our knowledge. Adapting DETR to oriented bounding boxes (OBBs) poses three challenges: semantics-dependent orientation, angle periodicity that breaks standard Euclidean refinement, and an enlarged search space that slows convergence. RiO-DETR resolves these issues with task-native designs while preserving real-time efficiency. First, we propose Content-Driven Angle Estimation by decoupling angle from positional queries, together with Rotation-Rectified Orthogonal Attention to capture complementary cues for reliable orientation. Second, Decoupled Periodic Refinement combines bounded coarse-to-fine updates with a Shortest-Path Periodic Loss for stable learning across angular seams. Third, Oriented Dense O2O injects angular diversity into dense supervision to speed up angle convergence at no extra cost. Extensive experiments on DOTA-1.0, DIOR-R, and FAIR-1M-2.0 demonstrate RiO-DETR establishes a new speed--accuracy trade-off for real-time oriented detection. Code will be made publicly available.
Chinese Translation
我们提出了RiO-DETR:用于实时定向目标检测的DETR,这是我们所知的首个实时定向检测变换器。将DETR适配于定向边界框(OBBs)面临三大挑战:依赖语义的方向性、打破标准欧几里得细化的角度周期性,以及放大搜索空间导致的收敛速度减慢。RiO-DETR通过任务本地设计解决了这些问题,同时保持了实时效率。首先,我们提出了内容驱动的角度估计,通过将角度与位置查询解耦,并结合旋转校正的正交注意力,以捕捉可靠方向的互补线索。其次,解耦的周期性细化结合了有界的粗到细更新和最短路径周期损失,以实现跨角缝的稳定学习。第三,定向稠密O2O将角度多样性注入稠密监督中,以在不增加额外成本的情况下加速角度收敛。在DOTA-1.0、DIOR-R和FAIR-1M-2.0上的大量实验表明,RiO-DETR为实时定向检测建立了新的速度-准确性平衡。代码将公开发布。
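The "Shortest-Path Periodic Loss" addresses the angular-seam problem: for oriented boxes with period π, a prediction of 89° and a target of -89° are only 2° apart, yet a Euclidean loss sees 178°. A minimal sketch of the shortest-path residual is below; the exact loss used in RiO-DETR is not specified in the abstract, so this wrap-then-L1 form is an illustrative assumption.

```python
import math

def periodic_angle_loss(theta_pred, theta_gt, period=math.pi):
    """Shortest angular distance under a given period.

    Wrapping the residual into (-period/2, period/2] makes the loss take
    the short way around an angular seam, so gradients near the seam do
    not push the prediction the long way around.
    """
    d = (theta_pred - theta_gt + period / 2) % period - period / 2
    return abs(d)
```

Angles here are in radians; for OBBs the period is π because a box is unchanged under a 180° rotation of its orientation.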
cs.CV / 75 / 2603.09414
PromptDLA: A Domain-aware Prompt Document Layout Analysis Framework with Descriptive Knowledge as a Cue
PromptDLA:一种以描述性知识为线索的领域感知提示文档布局分析框架
Abstract
Document Layout Analysis (DLA) is crucial for document artificial intelligence and has recently received increasing attention, resulting in an influx of large-scale public DLA datasets. Existing work often combines data from various domains in recent public DLA datasets to improve the generalization of DLA. However, directly merging these datasets for training often results in suboptimal model performance, as it overlooks the different layout structures inherent to various domains. These variations include different labeling styles, document types, and languages. This paper introduces PromptDLA, a domain-aware Prompter for Document Layout Analysis that effectively leverages descriptive knowledge as cues to integrate domain priors into DLA. The innovative PromptDLA features a unique domain-aware prompter that customizes prompts based on the specific attributes of the data domain. These prompts then serve as cues that direct the DLA toward critical features and structures within the data, enhancing the model's ability to generalize across varied domains. Extensive experiments show that our proposal achieves state-of-the-art performance among DocLayNet, PubLayNet, M6Doc, and D$^4$LA. Our code is available at https://github.com/Zirui00/PromptDLA.
Chinese Translation
文档布局分析(DLA)对于文档人工智能至关重要,近年来受到了越来越多的关注,导致大规模公共DLA数据集的涌现。现有研究通常结合来自不同领域的公共DLA数据集,以提高DLA的泛化能力。然而,直接合并这些数据集进行训练往往导致模型性能不佳,因为这忽略了不同领域固有的布局结构差异。这些差异包括不同的标注风格、文档类型和语言。本文提出了PromptDLA,一种领域感知的文档布局分析提示器,能够有效利用描述性知识作为线索,将领域先验知识整合到DLA中。创新的PromptDLA具有独特的领域感知提示器,根据数据领域的特定属性定制提示。这些提示作为线索,引导DLA关注数据中的关键特征和结构,从而增强模型在不同领域间的泛化能力。大量实验表明,我们的提案在DocLayNet、PubLayNet、M6Doc和D$^4$LA中实现了最先进的性能。我们的代码可在 https://github.com/Zirui00/PromptDLA 获取。
cs.CV / 76 / 2603.09418
CIGPose: Causal Intervention Graph Neural Network for Whole-Body Pose Estimation
CIGPose:用于全身姿态估计的因果干预图神经网络
Abstract
State-of-the-art whole-body pose estimators often lack robustness, producing anatomically implausible predictions in challenging scenes. We posit this failure stems from spurious correlations learned from visual context, a problem we formalize using a Structural Causal Model (SCM). The SCM identifies visual context as a confounder that creates a non-causal backdoor path, corrupting the model's reasoning. We introduce the Causal Intervention Graph Pose (CIGPose) framework to address this by approximating the true causal effect between visual evidence and pose. The core of CIGPose is a novel Causal Intervention Module: it first identifies confounded keypoint representations via predictive uncertainty and then replaces them with learned, context-invariant canonical embeddings. These deconfounded embeddings are processed by a hierarchical graph neural network that reasons over the human skeleton at both local and global semantic levels to enforce anatomical plausibility. Extensive experiments show CIGPose achieves a new state-of-the-art on COCO-WholeBody. Notably, our CIGPose-x model achieves 67.0\% AP, surpassing prior methods that rely on extra training data. With the additional UBody dataset, CIGPose-x is further boosted to 67.5\% AP, demonstrating superior robustness and data efficiency. The codes and models are publicly available at https://github.com/53mins/CIGPose.
Chinese Translation
最先进的全身姿态估计器往往缺乏鲁棒性,在具有挑战性的场景中产生解剖学上不合理的预测。我们认为这种失败源于从视觉上下文中学习到的虚假相关性,这一问题我们通过结构因果模型(Structural Causal Model, SCM)进行了形式化。SCM将视觉上下文识别为一个混淆因素,创建了一个非因果的后门路径,破坏了模型的推理。我们提出了因果干预图姿态(Causal Intervention Graph Pose, CIGPose)框架,以通过近似视觉证据与姿态之间的真实因果效应来解决这一问题。CIGPose的核心是一个新颖的因果干预模块:它首先通过预测不确定性识别混淆的关键点表示,然后用学习到的上下文不变的典型嵌入替换这些表示。这些去混淆的嵌入通过一个层次图神经网络进行处理,该网络在局部和全局语义层面上对人类骨架进行推理,以确保解剖学上的合理性。大量实验表明,CIGPose在COCO-WholeBody上达到了新的最先进水平。值得注意的是,我们的CIGPose-x模型达到了67.0%的平均精度(AP),超越了依赖额外训练数据的先前方法。通过额外的UBody数据集,CIGPose-x的AP进一步提升至67.5%,展现出更优越的鲁棒性和数据效率。代码和模型已在https://github.com/53mins/CIGPose上公开发布。
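The Causal Intervention Module's core operation, identifying confounded keypoint representations via predictive uncertainty and swapping them for context-invariant canonical embeddings, reduces to a masked replacement. The sketch below assumes a simple per-keypoint variance threshold `tau`; the threshold and names are placeholders, not the CIGPose implementation.

```python
import numpy as np

def deconfound(keypoint_feats, variances, canonical, tau=0.5):
    """Replace keypoint embeddings whose predictive variance exceeds tau
    with learned, context-invariant canonical embeddings.

    keypoint_feats, canonical: (K, D) arrays; variances: (K,) per-keypoint
    uncertainty. In the paper the canonical embeddings are learned; here
    they are just given.
    """
    confounded = variances > tau          # boolean mask over keypoints
    out = keypoint_feats.copy()
    out[confounded] = canonical[confounded]
    return out, confounded
```

The deconfounded output is what the hierarchical skeleton GNN then reasons over, so implausible context-driven evidence never reaches the structural prior.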
cs.CV / 77 / 2603.09419
MetaDAT: Generalizable Trajectory Prediction via Meta Pre-training and Data-Adaptive Test-Time Updating
MetaDAT:通过元预训练和数据自适应测试时更新实现可泛化的轨迹预测
Abstract
Existing trajectory prediction methods exhibit significant performance degradation under distribution shifts during test time. Although test-time training techniques have been explored to enable adaptation, current approaches rely on an offline pre-trained predictor that lacks online learning flexibility. Moreover, they depend on fixed online model updating rules that do not accommodate the specific characteristics of test data. To address these limitations, we first propose a meta-learning framework to directly optimize the predictor for fast and accurate online adaptation, which performs bi-level optimization on the performance of simulated test-time adaptation tasks during pre-training. Furthermore, at test time, we introduce a data-adaptive model updating mechanism that dynamically adjusts the predefined learning rates and updating frequencies based on online partial derivatives and hard sample selection. This mechanism enables the online learning rate to suit the test data, and focuses on informative hard samples to enhance efficiency. Experiments are conducted on various challenging cross-dataset distribution shift scenarios, including nuScenes, Lyft, and Waymo. Results demonstrate that our method achieves superior adaptation accuracy, surpassing state-of-the-art test-time training methods for trajectory prediction. Additionally, our method excels under suboptimal learning rates and high FPS demands, showcasing its robustness and practicality.
Chinese Translation
现有的轨迹预测方法在测试时的分布转变下表现出显著的性能下降。尽管已经探索了测试时训练技术以实现适应性,但当前的方法依赖于缺乏在线学习灵活性的离线预训练预测器。此外,它们依赖于固定的在线模型更新规则,这些规则无法适应测试数据的特定特征。为了解决这些局限性,我们首先提出了一种元学习框架,直接优化预测器以实现快速和准确的在线适应,该框架在预训练期间对模拟测试时适应任务的性能进行双层优化。此外,在测试时,我们引入了一种数据自适应模型更新机制,该机制根据在线偏导数和困难样本选择动态调整预定义的学习率和更新频率。该机制使在线学习率适应测试数据,并专注于信息量丰富的困难样本以提高效率。在各种具有挑战性的跨数据集分布转变场景中进行了实验,包括nuScenes、Lyft和Waymo。结果表明,我们的方法实现了优越的适应准确性,超越了轨迹预测的最先进测试时训练方法。此外,我们的方法在次优学习率和高帧率需求下表现出色,展示了其鲁棒性和实用性。
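A loose sketch of the data-adaptive test-time updating: select the hard (most informative) samples from the current stream and scale the update strength by how anomalous the batch looks relative to its own statistics. This uses loss magnitudes as a crude proxy for the paper's online partial derivatives; the quantile fraction, scaling rule, and names are all illustrative assumptions.

```python
import numpy as np

def adaptive_update_config(losses, base_lr=1e-3, hard_frac=0.25):
    """Pick hard samples (top loss quantile) and scale the test-time
    learning rate by how far they sit above the batch average.

    losses: per-sample prediction losses observed online.
    Returns (adapted_lr, indices_of_hard_samples).
    """
    losses = np.asarray(losses, dtype=float)
    k = max(1, int(len(losses) * hard_frac))
    hard_idx = np.argsort(losses)[-k:]            # most informative samples
    shift = losses[hard_idx].mean() / (losses.mean() + 1e-8)
    return base_lr * shift, hard_idx
```

Under mild distribution shift the ratio stays near 1 and updates remain gentle; a burst of hard samples raises the ratio and hence the update strength, which is the qualitative behavior the paper's mechanism targets.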
cs.CV / 78 / 2603.09420
Open-World Motion Forecasting
开放世界运动预测
Abstract
Motion forecasting aims to predict the future trajectories of dynamic agents in the scene, enabling autonomous vehicles to effectively reason about scene evolution. Existing approaches operate under the closed-world regime and assume fixed object taxonomy as well as access to high-quality perception. Therefore, they struggle in real-world settings where perception is imperfect and object taxonomy evolves over time. In this work, we bridge this fundamental gap by introducing open-world motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are estimated directly from camera images. We tackle this setting by proposing the first end-to-end class-incremental motion forecasting framework to mitigate catastrophic forgetting while simultaneously learning to forecast newly introduced classes. When a new class is introduced, our framework employs a pseudo-labeling strategy to first generate motion forecasting pseudo-labels for all known classes which are then processed by a vision-language model to filter inconsistent and over-confident predictions. Parallelly, our approach further mitigates catastrophic forgetting by using a novel replay sampling strategy that leverages query feature variance to sample previous sequences with informative motion patterns. Extensive evaluation on the nuScenes and Argoverse 2 datasets demonstrates that our approach successfully resists catastrophic forgetting and maintains performance on previously learned classes while improving adaptation to novel ones. Further, we demonstrate that our approach supports zero-shot transfer to real-world driving and naturally extends to end-to-end class-incremental planning, enabling continual adaptation of the full autonomous driving system. We provide the code at https://omen.cs.uni-freiburg.de .
Chinese Translation
运动预测旨在预测场景中动态代理的未来轨迹,使得自主车辆能够有效推理场景演变。现有方法在封闭世界的框架下运作,假设对象分类是固定的,并且能够获得高质量的感知。因此,它们在感知不完美且对象分类随时间演变的现实场景中表现不佳。在本研究中,我们通过引入开放世界运动预测这一新颖设置来弥补这一根本性差距,在该设置中,新的对象类别会随着时间的推移逐步引入,并且未来的对象轨迹直接从摄像头图像中估计。我们通过提出首个端到端的类增量运动预测框架来应对这一设置,以减轻灾难性遗忘,同时学习预测新引入的类别。当引入新类别时,我们的框架采用伪标记策略,首先为所有已知类别生成运动预测伪标签,然后由视觉-语言模型处理这些标签,以过滤不一致和过于自信的预测。同时,我们的方法通过使用一种新颖的重放采样策略进一步减轻灾难性遗忘,该策略利用查询特征的方差来采样具有信息性运动模式的先前序列。在nuScenes和Argoverse 2数据集上的广泛评估表明,我们的方法成功抵抗了灾难性遗忘,并在保持对先前学习类别的性能的同时,改善了对新类别的适应。此外,我们证明了我们的方法支持零样本迁移到现实世界驾驶,并自然扩展到端到端的类增量规划,从而实现全自主驾驶系统的持续适应。我们提供了代码,网址为 https://omen.cs.uni-freiburg.de 。
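The variance-based replay sampling reduces to scoring each stored sequence by the spread of its decoder query features and keeping the highest-scoring ones within the replay budget. The sketch below is a minimal version under the assumption that variance is taken over all query and feature dimensions; the paper may aggregate differently.

```python
import numpy as np

def select_replay(query_feats, budget):
    """Rank stored sequences by query-feature variance, keep the top ones.

    query_feats: (N, Q, D) decoder query features for N past sequences;
    higher variance is used as a proxy for informative motion patterns.
    Returns the indices of the `budget` selected sequences.
    """
    scores = np.var(query_feats, axis=(1, 2))   # one scalar per sequence
    return np.argsort(scores)[::-1][:budget]
```

Sequences with near-constant queries (e.g. all agents parked) score low and are rarely replayed, concentrating the replay budget on motion-rich memories that best counter forgetting.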
cs.CV / 79 / 2603.09446
GIIM: Graph-based Learning of Inter- and Intra-view Dependencies for Multi-view Medical Image Diagnosis
GIIM:基于图的多视角医学影像诊断中的视角间和视角内依赖关系学习
Abstract
Computer-aided diagnosis (CADx) has become vital in medical imaging, but automated systems often struggle to replicate the nuanced process of clinical interpretation. Expert diagnosis requires a comprehensive analysis of how abnormalities relate to each other across various views and time points, but current multi-view CADx methods frequently overlook these complex dependencies. Specifically, they fail to model the crucial relationships within a single view and the dynamic changes lesions exhibit across different views. This limitation, combined with the common challenge of incomplete data, greatly reduces their predictive reliability. To address these gaps, we reframe the diagnostic task as one of relationship modeling and propose GIIM, a novel graph-based approach. Our framework is uniquely designed to simultaneously capture both critical intra-view dependencies between abnormalities and inter-view dynamics. Furthermore, it ensures diagnostic robustness by incorporating specific techniques to effectively handle missing data, a common clinical issue. We demonstrate the generality of this approach through extensive evaluations on diverse imaging modalities, including CT, MRI, and mammography. The results confirm that our GIIM model significantly enhances diagnostic accuracy and robustness over existing methods, establishing a more effective framework for future CADx systems.
Chinese Translation
计算机辅助诊断(CADx)在医学影像中变得至关重要,但自动化系统往往难以复制临床解读的细微过程。专家诊断需要全面分析异常在不同视角和时间点之间的相互关系,但当前的多视角CADx方法常常忽视这些复杂的依赖关系。具体而言,它们未能建模单一视角内的关键关系以及病变在不同视角间表现出的动态变化。这一局限性,加上常见的不完整数据问题,极大降低了它们的预测可靠性。为了解决这些问题,我们将诊断任务重新框定为关系建模,并提出GIIM,一种新颖的基于图的方法。我们的框架独特地设计为能够同时捕捉异常之间的关键视角内依赖关系和视角间动态。此外,它通过采用特定技术有效处理缺失数据,确保诊断的稳健性,这是一个常见的临床问题。我们通过对包括CT、MRI和乳腺摄影在内的多种影像模态进行广泛评估,展示了该方法的普遍性。结果确认我们的GIIM模型显著提高了诊断的准确性和稳健性,建立了未来CADx系统更有效的框架。
cs.CV / 80 / 2603.09448
A Guideline-Aware AI Agent for Zero-Shot Target Volume Auto-Delineation
一种基于指南的零样本靶体积自动勾画的人工智能代理
Abstract
Delineating the clinical target volume (CTV) in radiotherapy involves complex margins constrained by tumor location and anatomical barriers. While deep learning models automate this process, their rigid reliance on expert-annotated data requires costly retraining whenever clinical guidelines update. To overcome this limitation, we introduce OncoAgent, a novel guideline-aware AI agent framework that seamlessly converts textual clinical guidelines into three-dimensional target contours in a training-free manner. Evaluated on esophageal cancer cases, the agent achieves a zero-shot Dice similarity coefficient of 0.842 for the CTV and 0.880 for the planning target volume, demonstrating performance highly comparable to a fully supervised nnU-Net baseline. Notably, in a blinded clinical evaluation, physicians strongly preferred OncoAgent over the supervised baseline, rating it higher in guideline compliance, modification effort, and clinical acceptability. Furthermore, the framework generalizes zero-shot to alternative esophageal guidelines and other anatomical sites (e.g., prostate) without any retraining. Beyond mere volumetric overlap, our agent-based paradigm offers near-instantaneous adaptability to alternative guidelines, providing a scalable and transparent pathway toward interpretability in radiotherapy treatment planning.
Chinese Translation
在放射治疗中,勾画临床靶体积(CTV)涉及复杂的边界,这些边界受到肿瘤位置和解剖障碍的限制。尽管深度学习模型可以自动化这一过程,但它们对专家标注数据的严格依赖使得每当临床指南更新时都需要昂贵的重新训练。为了解决这一限制,我们提出了OncoAgent,一个新颖的基于指南的人工智能代理框架,它能够无缝地将文本临床指南转换为三维靶体轮廓,且无需训练。在食管癌病例的评估中,该代理在CTV上实现了0.842的零样本Dice相似系数,在计划靶体积上实现了0.880,表现与完全监督的nnU-Net基线高度相当。值得注意的是,在盲法临床评估中,医生们强烈偏好OncoAgent,相较于监督基线,在指南合规性、修改工作量和临床可接受性方面给予了更高的评分。此外,该框架能够零样本推广到其他食管指南和其他解剖部位(如前列腺),且无需任何重新训练。我们的基于代理的范式不仅仅关注体积重叠,还提供了对替代指南的近乎即时的适应能力,为放射治疗计划的可解释性提供了可扩展和透明的路径。
cs.CV / 81 / 2603.09465
EvoDriveVLA: Evolving Autonomous Driving Vision-Language-Action Model via Collaborative Perception-Planning Distillation
EvoDriveVLA:通过协作感知-规划蒸馏演化自主驾驶视觉-语言-动作模型
Abstract
Vision-Language-Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long-term planning. To address these challenges, we propose EvoDriveVLA-a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and oracle-guided trajectory optimization. Specifically, self-anchored visual distillation leverages self-anchor teacher to deliver visual anchoring constraints, regularizing student representations via trajectory-guided key-region awareness. In parallel, oracle-guided trajectory distillation employs a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to produce high-quality trajectory candidates, thereby selecting the optimal trajectory to guide the student's prediction. EvoDriveVLA achieves SOTA performance in open-loop evaluation and significantly enhances performance in closed-loop evaluation. Our code is available at: https://github.com/hey-cjj/EvoDriveVLA.
Chinese Translation
视觉-语言-动作模型在自主驾驶中展现出巨大的潜力,但在解冻视觉编码器后,其感知能力会下降,并且在长期规划中面临累积的不稳定性。为了解决这些挑战,我们提出了EvoDriveVLA——一种新颖的协作感知-规划蒸馏框架,该框架整合了自锚定感知约束和oracle引导的轨迹优化。具体而言,自锚定视觉蒸馏利用自锚定教师提供视觉锚定约束,通过轨迹引导的关键区域意识来规范学生表示。与此同时,oracle引导的轨迹蒸馏则采用具有未来感知能力的oracle教师,结合粗到细的轨迹细化和蒙特卡洛丢弃采样,生成高质量的轨迹候选,从而选择最佳轨迹来指导学生的预测。EvoDriveVLA在开环评估中实现了最先进的性能,并在闭环评估中显著提升了表现。我们的代码可在以下链接获取:https://github.com/hey-cjj/EvoDriveVLA。
cs.CV / 82 / 2603.09466
TopoOR: A Unified Topological Scene Representation for the Operating Room
TopoOR:一种统一的手术室拓扑场景表示
Abstract
Surgical Scene Graphs abstract the complexity of surgical operating rooms (OR) into a structure of entities and their relations, but existing paradigms suffer from strictly dyadic structural limitations. Frameworks that predominantly rely on pairwise message passing or tokenized sequences flatten the manifold geometry inherent to relational structures and lose structure in the process. We introduce TopoOR, a new paradigm that models multimodal operating rooms as a higher-order structure, innately preserving pairwise and group relationships. By lifting interactions between entities into higher-order topological cells, TopoOR natively models complex dynamics and multimodality present in the OR. This topological representation subsumes traditional scene graphs, thereby offering strictly greater expressivity. We also propose a higher-order attention mechanism that explicitly preserves manifold structure and modality-specific features throughout hierarchical relational attention. In this way, we circumvent combining 3D geometry, audio, and robot kinematics into a single joint latent representation, preserving the precise multimodal structure required for safety-critical reasoning, unlike existing methods. Extensive experiments demonstrate that our approach outperforms traditional graph and LLM-based baselines across sterility breach detection, robot phase prediction, and next-action anticipation.
Chinese Translation
手术场景图将手术室(OR)的复杂性抽象为实体及其关系的结构,但现有范式受到严格的二元结构限制。主要依赖于成对信息传递或标记序列的框架将关系结构固有的流形几何简化,从而在此过程中丧失了结构。我们提出了TopoOR,一种将多模态手术室建模为高阶结构的新范式,内在地保留了成对和群体关系。通过将实体之间的交互提升到高阶拓扑单元,TopoOR 本质上建模了手术室中存在的复杂动态和多模态性。这种拓扑表示超越了传统的场景图,从而提供了更强的表达能力。我们还提出了一种高阶注意力机制,明确保留了流形结构和特定模态特征,贯穿于层次关系注意力中。通过这种方式,我们避免了将3D几何、音频和机器人运动学合并为单一的联合潜在表示,从而保留了安全关键推理所需的精确多模态结构,这与现有方法不同。大量实验表明,我们的方法在无菌破坏检测、机器人阶段预测和下一步行动预测等任务上优于传统图和基于大语言模型的基线。
cs.CV / 83 / 2603.09470
The Patrologia Graeca Corpus: OCR, Annotation, and Open Release of Noisy Nineteenth-Century Polytonic Greek Editions
教父文献希腊文集:十九世纪噪声多音希腊文版的光学字符识别、注释与开放发布
Abstract
We present the Patrologia Graeca Corpus, the first large-scale open OCR and linguistic resource for nineteenth-century editions of Ancient Greek. The collection covers the remaining undigitized volumes of the Patrologia Graeca (PG), printed in complex bilingual (Greek-Latin) layouts and characterized by highly degraded polytonic Greek typography. Through a dedicated pipeline combining YOLO-based layout detection and CRNN-based text recognition, we achieve a character error rate (CER) of 1.05% and a word error rate (WER) of 4.69%, largely outperforming existing OCR systems for polytonic Greek. The resulting corpus contains around six million lemmatized and part-of-speech tagged tokens, aligned with full OCR and layout annotations. Beyond its philological value, this corpus establishes a new benchmark for OCR on noisy polytonic Greek and provides training material for future models, including LLMs.
Chinese Translation
我们介绍了教父文献希腊文集(Patrologia Graeca Corpus),这是第一个针对十九世纪古希腊文版的大规模开放光学字符识别(OCR)和语言资源。该收藏涵盖了尚未数字化的教父文献希腊文集(PG)剩余卷册,这些卷册采用复杂的双语(希腊语-拉丁语)排版,并以高度退化的多音希腊文排版为特征。通过结合基于YOLO的布局检测和基于CRNN的文本识别的专用流程,我们实现了1.05%的字符错误率(CER)和4.69%的单词错误率(WER),在多音希腊文的现有OCR系统中表现优异。最终生成的语料库包含约六百万个词形化和词性标注的词元,并与完整的OCR和布局注释对齐。除了其文献学价值外,该语料库还为噪声多音希腊文的OCR建立了新的基准,并为未来的模型(包括大型语言模型,LLMs)提供了训练材料。
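The CER and WER figures reported above are conventionally computed as a normalized Levenshtein (edit) distance over characters and over whitespace-split words, respectively. A minimal sketch of that standard metric (not the paper's evaluation code):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance over insertions,
    # deletions, and substitutions; works on strings or token lists.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    # Character error rate: edit distance normalized by reference length.
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    # Word error rate: same distance computed over whitespace-split tokens.
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```

For example, one substituted character in a four-character line gives a CER of 0.25.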
cs.CV / 84 / 2603.09471
OmniEarth: A Benchmark for Evaluating Vision-Language Models in Geospatial Tasks
OmniEarth:评估地理空间任务中视觉-语言模型的基准
Abstract
Vision-Language Models (VLMs) have demonstrated effective perception and reasoning capabilities on general-domain tasks, leading to growing interest in their application to Earth observation. However, a systematic benchmark for comprehensively evaluating remote sensing vision-language models (RSVLMs) remains lacking. To address this gap, we introduce OmniEarth, a benchmark for evaluating RSVLMs under realistic Earth observation scenarios. OmniEarth organizes tasks along three capability dimensions: perception, reasoning, and robustness. It defines 28 fine-grained tasks covering multi-source sensing data and diverse geospatial contexts. The benchmark supports two task formulations: multiple-choice VQA and open-ended VQA. The latter includes pure text outputs for captioning tasks, bounding box outputs for visual grounding tasks, and mask outputs for segmentation tasks. To reduce linguistic bias and examine whether model predictions rely on visual evidence, OmniEarth adopts a blind test protocol and a quintuple semantic consistency requirement. OmniEarth includes 9,275 carefully quality-controlled images, including proprietary satellite imagery from Jilin-1 (JL-1), along with 44,210 manually verified instructions. We conduct a systematic evaluation of contrastive learning-based models, general closed-source and open-source VLMs, as well as RSVLMs. Results show that existing VLMs still struggle with geospatially complex tasks, revealing clear gaps that need to be addressed for remote sensing applications. OmniEarth is publicly available at https://huggingface.co/datasets/sjeeudd/OmniEarth.
Chinese Translation
视觉-语言模型(VLMs)在一般领域任务中展示了有效的感知和推理能力,导致对其在地球观测应用中的兴趣日益增长。然而,目前缺乏一个系统的基准来全面评估遥感视觉-语言模型(RSVLMs)。为了解决这一问题,我们推出了OmniEarth,一个用于在现实地球观测场景下评估RSVLMs的基准。OmniEarth根据三个能力维度组织任务:感知、推理和鲁棒性。它定义了28个细粒度任务,涵盖多源传感数据和多样的地理空间背景。该基准支持两种任务形式:多项选择视觉问答(VQA)和开放式视觉问答(VQA)。后者包括用于图像描述任务的纯文本输出、用于视觉定位任务的边界框输出,以及用于分割任务的掩码输出。为了减少语言偏见并检查模型预测是否依赖于视觉证据,OmniEarth采用了盲测协议和五重语义一致性要求。OmniEarth包含9,275张经过严格质量控制的图像,包括来自吉林一号(Jilin-1,JL-1)的专有卫星图像,以及44,210条人工验证的指令。我们对基于对比学习的模型、通用闭源和开源VLMs以及RSVLMs进行了系统评估。结果表明,现有的VLMs在处理地理空间复杂任务时仍然存在困难,揭示了在遥感应用中需要解决的明显差距。OmniEarth已在https://huggingface.co/datasets/sjeeudd/OmniEarth上公开发布。
cs.CV / 85 / 2603.09480
Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity
剪除冗余,保留本质:通过协同重要性-多样性实现视觉语言模型中的视觉标记压缩
Abstract
Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8× faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility. Code is available at https://github.com/ZhengyaoFang/PruneSID.
Chinese Translation
视觉语言模型(VLMs)面临由于过度生成视觉标记而导致的显著计算低效。虽然先前的研究表明大量视觉标记是冗余的,但现有的压缩方法在保持重要性和信息多样性之间难以取得平衡。为了解决这个问题,我们提出了PruneSID,这是一种无训练的协同重要性-多样性方法,采用两阶段管道:(1)主语义成分分析(PSCA)用于将标记聚类为语义一致的组,以确保全面的概念覆盖;(2)组内非极大值抑制(NMS)用于修剪冗余标记,同时保留每组内的关键代表性标记。此外,PruneSID还结合了一种信息感知的动态压缩比机制,根据图像复杂性优化标记压缩率,从而在不同场景中实现更有效的平均信息保留。大量实验表明,PruneSID在LLaVA-1.5上以仅11.1%的标记保留率达到了96.3%的准确率,并在LLaVA-NeXT上以极端压缩率(5.6%)达到了92.8%的准确率,超越了先前方法2.5%的表现,同时相比于原始模型实现了7.8倍的预填充速度提升。我们的框架在多种VLMs及图像和视频模态中具有良好的通用性,展示了强大的跨模态适应性。代码可在 https://github.com/ZhengyaoFang/PruneSID 获取。
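The two-stage pipeline described above (semantic clustering, then intra-group suppression) can be sketched in a toy form. This is a hypothetical simplification: plain k-means with farthest-point initialization stands in for PSCA, and a greedy pass over importance scores stands in for the intra-group NMS; all names and thresholds are illustrative, not the paper's implementation.

```python
import numpy as np

def prune_tokens(X, scores, n_groups=2, sim_thresh=0.9, iters=5):
    """Importance-diversity pruning sketch: cluster token features X (N, d),
    then keep high-importance tokens while suppressing near-duplicates."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Stage 1: group tokens into coherent clusters. Deterministic
    # farthest-point init seeded at the most important token.
    init = [int(np.argmax(scores))]
    while len(init) < n_groups:
        d = 1.0 - (Xn @ Xn[init].T).max(axis=1)
        init.append(int(np.argmax(d)))
    centroids = Xn[init].copy()
    for _ in range(iters):
        assign = (Xn @ centroids.T).argmax(axis=1)
        for g in range(n_groups):
            if (assign == g).any():
                c = Xn[assign == g].mean(axis=0)
                centroids[g] = c / np.linalg.norm(c)
    assign = (Xn @ centroids.T).argmax(axis=1)
    # Stage 2: NMS analog. Visit tokens by descending importance; keep a token
    # if it is dissimilar to everything kept, or its group is still uncovered.
    keep, covered = [], set()
    for i in np.argsort(-scores):
        diverse = all(float(Xn[i] @ Xn[j]) < sim_thresh for j in keep)
        if diverse or int(assign[i]) not in covered:
            keep.append(int(i))
            covered.add(int(assign[i]))
    return sorted(keep)
```

On a toy set with two orthogonal concepts plus near-duplicates, only one representative per concept survives.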
cs.CV / 86 / 2603.09484
Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion
基于组件感知的自注意力编码与坐标保持融合的素描到图像生成
Abstract
Translating freehand sketches into photorealistic images remains a fundamental challenge in image synthesis, particularly due to the abstract, sparse, and stylistically diverse nature of sketches. Existing approaches, including GAN-based and diffusion-based models, often struggle to reconstruct fine-grained details, maintain spatial alignment, or adapt across different sketch domains. In this paper, we propose a component-aware, self-refining framework for sketch-to-image generation that addresses these challenges through a novel two-stage architecture. A Self-Attention-based Autoencoder Network (SA2N) first captures localised semantic and structural features from component-wise sketch regions, while a Coordinate-Preserving Gated Fusion (CGF) module integrates these into a coherent spatial layout. Finally, a Spatially Adaptive Refinement Revisor (SARR), built on a modified StyleGAN2 backbone, enhances realism and consistency through iterative refinement guided by spatial context. Extensive experiments across both facial (CelebAMask-HQ, CUFSF) and non-facial (Sketchy, ChairsV2, ShoesV2) datasets demonstrate the robustness and generalizability of our method. The proposed framework consistently outperforms state-of-the-art GAN and diffusion models, achieving significant gains in image fidelity, semantic accuracy, and perceptual quality. On CelebAMask-HQ, our model improves over prior methods by 21% (FID), 58% (IS), 41% (KID), and 20% (SSIM). These results, along with higher efficiency and visual coherence across diverse domains, position our approach as a strong candidate for applications in forensics, digital art restoration, and general sketch-based image synthesis.
Chinese Translation
将自由手绘素描转换为照片级真实图像仍然是图像合成中的一个基本挑战,尤其是由于素描的抽象性、稀疏性和风格多样性。现有的方法,包括基于生成对抗网络(GAN)和扩散模型,往往难以重建细致的细节、保持空间对齐或在不同素描领域之间适应。在本文中,我们提出了一种组件感知的自我精炼框架,用于素描到图像的生成,通过一种新颖的两阶段架构来解决这些挑战。基于自注意力的自编码网络(SA2N)首先从组件区域捕捉局部的语义和结构特征,而坐标保持门控融合(CGF)模块则将这些特征整合为一致的空间布局。最后,基于修改后的StyleGAN2骨干网构建的空间自适应精炼修订器(SARR)通过基于空间上下文的迭代精炼增强了真实感和一致性。在面部(CelebAMask-HQ, CUFSF)和非面部(Sketchy, ChairsV2, ShoesV2)数据集上的大量实验表明我们方法的鲁棒性和通用性。所提出的框架在图像保真度、语义准确性和感知质量方面始终优于最先进的GAN和扩散模型。在CelebAMask-HQ上,我们的模型在FID上提高了21%、在IS上提高了58%、在KID上提高了41%以及在SSIM上提高了20%。这些结果,以及在不同领域中更高的效率和视觉一致性,使我们的方法成为法医学、数字艺术修复和一般基于素描的图像合成应用的有力候选者。
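The coordinate-preserving gated fusion idea above — letting a gate decide, per location, how much local component detail versus global layout to pass through — can be sketched as follows. The gating signal here (a sigmoid of the summed features) is a hypothetical stand-in for the paper's learned CGF module:

```python
import numpy as np

def gated_fusion(component_feat, global_feat):
    """Gated fusion sketch: a per-element sigmoid gate blends localized
    component features with the global layout feature, so the output is
    always a convex combination of the two inputs at each coordinate."""
    gate = 1.0 / (1.0 + np.exp(-(component_feat + global_feat)))  # hypothetical gating signal
    return gate * component_feat + (1.0 - gate) * global_feat
```

Because the gate lies in (0, 1), the fused value at every coordinate stays between the two input values, which is what keeps spatial structure from either branch from being overwritten outright.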
cs.CV / 87 / 2603.09488
Streaming Autoregressive Video Generation via Diagonal Distillation
通过对角蒸馏进行流式自回归视频生成
Abstract
Large pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency-quality trade-off. We identify two factors that result in these limitations: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (i.e., exposure bias). To address these issues, we propose Diagonal Distillation, which operates orthogonally to existing approaches and better exploits temporal information across both video chunks and denoising steps. Central to our approach is an asymmetric generation strategy: more steps early, fewer steps later. This design allows later chunks to inherit rich appearance information from thoroughly processed early chunks, while using partially denoised chunks as conditional inputs for subsequent synthesis. By aligning the implicit prediction of subsequent noise levels during chunk generation with the actual inference conditions, our approach mitigates error propagation and reduces oversaturation in long-range sequences. We further incorporate implicit optical flow modeling to preserve motion quality under strict step constraints. Our method generates a 5-second video in 2.61 seconds (up to 31 FPS), achieving a 277.3x speedup over the undistilled model.
Chinese Translation
大型预训练扩散模型显著提升了生成视频的质量,但其在实时流媒体中的应用仍然有限。自回归模型为连续帧合成提供了自然框架,但需要大量计算以实现高保真度。扩散蒸馏可以将这些模型压缩为高效的少步变体,但现有的视频蒸馏方法主要适应于图像特定的方法,忽视了时间依赖性。这些技术在图像生成方面通常表现出色,但在视频合成中表现不佳,出现了运动一致性降低、长序列中的错误累积以及延迟与质量之间的权衡。我们识别出导致这些限制的两个因素:在步骤减少过程中对时间上下文的利用不足,以及在下一个块预测中的隐式后续噪声水平预测(即曝光偏差)。为了解决这些问题,我们提出了对角蒸馏,它与现有方法正交运行,更好地利用了视频块和去噪步骤之间的时间信息。我们方法的核心是一种不对称生成策略:早期更多步骤,后期更少步骤。这种设计使得后续块能够从经过充分处理的早期块中继承丰富的外观信息,同时将部分去噪的块作为后续合成的条件输入。通过在块生成过程中将后续噪声水平的隐式预测与实际推理条件对齐,我们的方法减轻了错误传播,并减少了长序列中的过饱和现象。我们进一步结合隐式光流建模,以在严格的步骤约束下保持运动质量。我们的方法在2.61秒内生成一个5秒的视频(高达31帧每秒),相比未蒸馏模型实现了277.3倍的加速。
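The asymmetric generation strategy above ("more steps early, fewer steps later") amounts to a per-chunk denoising-step budget. A hypothetical linear-decay schedule illustrates the shape; the paper's exact allocation is not specified here:

```python
def asymmetric_step_schedule(n_chunks, first_steps=8, last_steps=2):
    """Allocate denoising steps per video chunk: many steps for early chunks,
    which establish appearance, and few for later chunks that inherit context
    from thoroughly processed predecessors. Linear decay is an assumption."""
    if n_chunks == 1:
        return [first_steps]
    return [round(first_steps + (last_steps - first_steps) * i / (n_chunks - 1))
            for i in range(n_chunks)]
```

Compared with a uniform budget, the same total step count is front-loaded onto the chunks whose features every later chunk conditions on.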
cs.CV / 88 / 2603.09493
Evolving Prompt Adaptation for Vision-Language Models
视觉语言模型的演化提示适应
Abstract
The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge. While parameter-efficient prompt learning methods offer a promising path, they often suffer from catastrophic forgetting of pre-trained knowledge. Toward addressing this limitation, our work is grounded in the insight that governing the evolutionary path of prompts is essential for forgetting-free adaptation. To this end, we propose EvoPrompt, a novel framework designed to explicitly steer the prompt trajectory for stable, knowledge-preserving fine-tuning. Specifically, our approach employs a Modality-Shared Prompt Projector (MPP) to generate hierarchical prompts from a unified embedding space. Critically, an evolutionary training strategy decouples low-rank updates into directional and magnitude components, preserving early-learned semantic directions while only adapting their magnitude, thus enabling prompts to evolve without discarding foundational knowledge. This process is further stabilized by Feature Geometric Regularization (FGR), which enforces feature decorrelation to prevent representation collapse. Extensive experiments demonstrate that EvoPrompt achieves state-of-the-art performance in few-shot learning while robustly preserving the original zero-shot capabilities of pre-trained VLMs.
Chinese Translation
将大规模视觉语言模型(VLMs)适应于标注数据有限的下游任务仍然是一个重大挑战。尽管参数高效的提示学习方法提供了一条有前景的路径,但它们往往面临预训练知识的灾难性遗忘。为了解决这一限制,我们的工作基于一个洞察,即引导提示的演变路径对于无遗忘适应至关重要。为此,我们提出了EvoPrompt,一个旨在明确引导提示轨迹以实现稳定、知识保留微调的新框架。具体而言,我们的方法采用了模态共享提示投影器(Modality-Shared Prompt Projector, MPP),从统一的嵌入空间生成分层提示。关键的是,演化训练策略将低秩更新解耦为方向和幅度两个组件,保留早期学习的语义方向,同时仅调整其幅度,从而使提示能够演变而不丢弃基础知识。这个过程通过特征几何正则化(Feature Geometric Regularization, FGR)进一步稳定,强制特征去相关以防止表示崩溃。大量实验表明,EvoPrompt在少样本学习中实现了最先进的性能,同时稳健地保留了预训练VLMs的原始零样本能力。
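Decoupling a low-rank update into directional and magnitude components, then adapting only the magnitude, mirrors the weight-normalization trick. A toy sketch under that reading (the names and the projection rule are illustrative, not the paper's exact update):

```python
import numpy as np

def decompose(delta):
    """Split an update into a unit direction and a scalar magnitude."""
    mag = float(np.linalg.norm(delta))
    return delta / mag, mag

def magnitude_only_step(direction, magnitude, grad, lr=0.1):
    """Adapt only the magnitude along a frozen semantic direction:
    project the gradient onto the direction and update the scalar,
    leaving the early-learned direction itself untouched."""
    g_mag = float(grad.flatten() @ direction.flatten())
    new_mag = magnitude - lr * g_mag
    return direction * new_mag, new_mag
```

Every updated parameter vector remains parallel to the original direction, which is the mechanism by which early-learned semantics are preserved while the prompt still evolves.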
cs.CV / 89 / 2603.09496
SurgFed: Language-guided Multi-Task Federated Learning for Surgical Video Understanding
SurgFed:基于语言引导的多任务联邦学习用于外科视频理解
Abstract
Surgical scene Multi-Task Federated Learning (MTFL) is essential for robot-assisted minimally invasive surgery (RAS) but remains underexplored in surgical video understanding due to two key challenges: (1) Tissue Diversity: Local models struggle to adapt to site-specific tissue features, limiting their effectiveness in heterogeneous clinical environments and leading to poor local predictions. (2) Task Diversity: Server-side aggregation, relying solely on gradient-based clustering, often produces suboptimal or incorrect parameter updates due to inter-site task heterogeneity, resulting in inaccurate localization. In light of these two issues, we propose SurgFed, a multi-task federated learning framework, enabling federated learning for surgical scene segmentation and depth estimation across diverse surgical types. SurgFed is powered by two appealing designs, i.e., Language-guided Channel Selection (LCS) and Language-guided Hyper Aggregation (LHA), to address the challenge of fully exploiting cross-site and cross-task information. Technically, the LCS is first designed as a lightweight personalized channel selection network that enhances site-specific adaptation using pre-defined text inputs, enabling the local model to optimally learn site-specific embeddings. We further introduce the LHA that employs a layer-wise cross-attention mechanism with pre-defined text inputs to model task interactions across sites and guide a hypernetwork for personalized parameter updates. Extensive empirical evidence shows that SurgFed yields improvements over the state-of-the-art methods on five public datasets across four surgical types. The code is available at https://anonymous.4open.science/r/SurgFed-070E/.
Chinese Translation
外科场景多任务联邦学习(MTFL)对于机器人辅助的微创手术(RAS)至关重要,但由于两个关键挑战在外科视频理解中仍未得到充分探索:(1)组织多样性:本地模型难以适应特定部位的组织特征,限制了其在异质临床环境中的有效性,并导致本地预测不佳。(2)任务多样性:服务器端聚合仅依赖基于梯度的聚类,常常由于不同站点任务的异质性而产生次优或错误的参数更新,导致定位不准确。针对这两个问题,我们提出了SurgFed,一个多任务联邦学习框架,能够在不同外科类型之间实现外科场景分割和深度估计的联邦学习。SurgFed通过两个引人注目的设计,即语言引导的通道选择(LCS)和语言引导的超聚合(LHA),来应对跨站点和跨任务的全面探索挑战。在技术上,LCS首先设计了一个轻量化的个性化通道选择网络,通过使用预定义的文本输入增强特定站点的适应性,从而使本地模型能够最佳地学习特定的嵌入。我们进一步引入LHA,它采用层级交叉注意机制,结合预定义的文本输入来建模不同站点之间的任务交互,并指导超网络进行个性化参数更新。大量实证证据表明,SurgFed在四种外科类型的五个公共数据集上相较于最先进的方法取得了显著改善。代码可在 https://anonymous.4open.science/r/SurgFed-070E/ 获取。
cs.CV / 90 / 2603.09506
Context-Nav: Context-Driven Exploration and Viewpoint-Aware 3D Spatial Reasoning for Instance Navigation
Context-Nav:基于上下文的探索与视角感知的3D空间推理用于实例导航
Abstract
Text-goal instance navigation (TGIN) asks an agent to resolve a single, free-form description into actions that reach the correct object instance among same-category distractors. We present Context-Nav, which elevates long, contextual captions from a local matching cue to a global exploration prior and verifies candidates through 3D spatial reasoning. First, we compute dense text-image alignments for a value map that ranks frontiers -- guiding exploration toward regions consistent with the entire description rather than early detections. Second, upon observing a candidate, we perform a viewpoint-aware relation check: the agent samples plausible observer poses, aligns local frames, and accepts a target only if the spatial relations can be satisfied from at least one viewpoint. The pipeline requires no task-specific training or fine-tuning; we attain state-of-the-art performance on InstanceNav and CoIN-Bench. Ablations show that (i) encoding full captions into the value map avoids wasted motion and (ii) explicit, viewpoint-aware 3D verification prevents semantically plausible but incorrect stops. This suggests that geometry-grounded spatial reasoning is a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes.
Chinese Translation
文本目标实例导航(TGIN)要求代理将单一的自由形式描述转化为行动,以到达同类干扰物中的正确对象实例。我们提出了Context-Nav,将长的上下文描述从局部匹配线索提升为全局探索先验,并通过3D空间推理验证候选对象。首先,我们计算密集的文本-图像对齐,生成一个价值图,用于对前沿区域进行排名——引导探索朝向与整个描述一致的区域,而不是早期检测。其次,在观察到候选对象后,我们执行视角感知的关系检查:代理采样可行的观察者姿态,调整局部帧,并仅在至少一个视角可以满足空间关系时接受目标。该流程不需要特定任务的训练或微调;我们在InstanceNav和CoIN-Bench上达到了最先进的性能。消融实验表明:(i)将完整描述编码到价值图中可以避免无效运动;(ii)显式的视角感知3D验证防止了语义上合理但不正确的停止。这表明,基于几何的空间推理是一个可扩展的替代方案,能够替代繁重的策略训练或人机交互,以实现对杂乱3D场景中细粒度实例的消歧。
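The value-map step above — ranking frontiers by how well the full caption aligns with each frontier's visual evidence — reduces, in its simplest reading, to cosine similarity in a shared embedding space. A minimal sketch assuming precomputed CLIP-style embeddings (hypothetical inputs, not the paper's dense alignment):

```python
import numpy as np

def rank_frontiers(caption_emb, frontier_embs):
    """Rank exploration frontiers by cosine similarity between the embedding
    of the entire caption and each frontier's aggregated visual embedding,
    most promising first."""
    c = caption_emb / np.linalg.norm(caption_emb)
    F = frontier_embs / np.linalg.norm(frontier_embs, axis=1, keepdims=True)
    return [int(i) for i in np.argsort(-(F @ c))]
```

The agent then drives toward the top-ranked frontier instead of committing to the first category-matching detection.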
cs.CV / 91 / 2603.09512
Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning
探讨驾驶视觉语言模型的可靠性:从不一致的响应到基于时间的推理
Abstract
A reliable driving assistant should provide consistent responses based on temporally grounded reasoning derived from observed information. In this work, we investigate whether Vision-Language Models (VLMs), when applied as driving assistants, can respond consistently and understand how present observations shape future outcomes, or whether their outputs merely reflect patterns memorized during training without temporally grounded reasoning. While recent efforts have integrated VLMs into autonomous driving, prior studies typically emphasize scene understanding and instruction generation, implicitly assuming that strong visual interpretation naturally enables consistent future reasoning and thus ensures reliable decision-making, a claim we critically examine. We focus on two major challenges limiting VLM reliability in this setting: response inconsistency, where minor input perturbations yield different answers or, in some cases, responses degenerate toward near-random guessing, and limited temporal reasoning, in which models fail to reason and align sequential events from current observations, often resulting in incorrect or even contradictory responses. Moreover, we find that models with strong visual understanding do not necessarily perform best on tasks requiring temporal reasoning, indicating a tendency to over-rely on pretrained patterns rather than modeling temporal dynamics. To address these issues, we adopt existing evaluation methods and introduce FutureVQA, a human-annotated benchmark dataset specifically designed to assess future scene reasoning. In addition, we propose a simple yet effective self-supervised tuning approach with chain-of-thought reasoning that improves both consistency and temporal reasoning without requiring temporal labels.
Chinese Translation
一个可靠的驾驶助手应基于从观察信息中得出的时间性推理提供一致的响应。在本研究中,我们探讨了视觉语言模型(Vision-Language Models, VLMs)作为驾驶助手时,是否能够一致地响应并理解当前观察如何影响未来结果,或者它们的输出是否仅仅反映了在训练过程中记忆的模式,而没有基于时间的推理。尽管最近的研究已将VLMs整合到自动驾驶中,但以往的研究通常强调场景理解和指令生成,隐含地假设强大的视觉解读自然能够实现一致的未来推理,从而确保可靠的决策,这一主张我们进行了批判性审视。我们关注限制VLM在此环境中可靠性的两个主要挑战:响应不一致性,即微小的输入扰动导致不同的答案,或在某些情况下,响应退化为近乎随机的猜测;以及有限的时间推理,即模型无法推理并对当前观察的顺序事件进行对齐,常常导致错误甚至矛盾的响应。此外,我们发现具有强大视觉理解的模型在需要时间推理的任务上并不一定表现最佳,表明其倾向于过度依赖预训练模式,而不是建模时间动态。为了解决这些问题,我们采用现有的评估方法,并引入FutureVQA,一个专门设计用于评估未来场景推理的人类标注基准数据集。此外,我们提出了一种简单而有效的自监督调优方法,结合链式思维推理,能够在不需要时间标签的情况下提高一致性和时间推理能力。
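Response inconsistency as described above can be quantified with a simple agreement rate across perturbed but semantically equivalent prompts; a score near 1/num_options signals the near-random degeneration the authors observe. A minimal sketch (not the paper's metric):

```python
from collections import Counter

def consistency_rate(answers):
    """Fraction of responses agreeing with the modal answer across
    semantically equivalent (perturbed) prompts. 1.0 means fully
    consistent; ~1/num_options suggests near-random guessing."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)
```

For a four-option multiple-choice question, answers of ["B", "B", "B", "A"] give 0.75, while a uniform spread gives the chance level of 0.25.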
cs.CV / 92 / 2603.09529
RESBev: Making BEV Perception More Robust
RESBev:增强鸟瞰视角感知的鲁棒性
Abstract
Bird's-eye-view (BEV) perception has emerged as a cornerstone of autonomous driving systems, providing a structured, ego-centric representation critical for downstream planning and control. However, real-world deployment faces challenges from sensor degradation and adversarial attacks, which can cause severe perceptual anomalies and ultimately compromise the safety of autonomous driving systems. To address this, we propose a resilient and plug-and-play BEV perception method, RESBev, which can be easily applied to existing BEV perception methods to enhance their robustness to diverse disturbances. Specifically, we reframe perception robustness as a latent semantic prediction problem. A latent world model is constructed to extract spatiotemporal correlations across sequential BEV observations, thereby learning the underlying BEV state transitions to predict clean BEV features for reconstructing corrupted observations. The proposed framework operates at the semantic feature level of the Lift-Splat-Shoot pipeline, enabling recovery that generalizes across both natural disturbances and adversarial attacks without modifying the underlying backbone. Extensive experiments on the nuScenes dataset demonstrate that, with few-shot fine-tuning, RESBev significantly improves the robustness of existing BEV perception models against various external disturbances and adversarial attacks.
Chinese Translation
鸟瞰视角(BEV)感知已成为自动驾驶系统的基石,提供了结构化的、自我中心的表示,这对于下游的规划和控制至关重要。然而,实际部署面临来自传感器退化和对抗性攻击的挑战,这可能导致严重的感知异常,并最终危及自动驾驶系统的安全。为此,我们提出了一种具有弹性和即插即用特性的BEV感知方法RESBev,该方法可以轻松应用于现有的BEV感知方法,以增强其对各种干扰的鲁棒性。具体而言,我们将感知鲁棒性重新定义为潜在语义预测问题。构建一个潜在世界模型,以提取连续BEV观测之间的时空相关性,从而学习潜在的BEV状态转变,以预测干净的BEV特征,以重建受损的观测。所提出的框架在Lift-Splat-Shoot管道的语义特征层面上运行,使得恢复能够在自然干扰和对抗性攻击之间进行泛化,而无需修改底层骨干网络。在nuScenes数据集上进行的大量实验表明,通过少量的微调,RESBev显著提高了现有BEV感知模型对各种外部干扰和对抗性攻击的鲁棒性。
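The latent world model above predicts clean BEV features from temporal context so corrupted observations can be reconstructed. As a loose stand-in for that idea, an exponential moving average over past features can play the predictor's role; this is an illustrative simplification, not the paper's learned state-transition model:

```python
import numpy as np

def repair_bev(features, corrupt_score, thresh=0.5, alpha=0.9):
    """Replace BEV feature frames flagged as corrupted with a prediction
    from temporal context (here, an EMA of past accepted frames)."""
    pred = features[0].copy()
    out = [features[0].copy()]
    for t in range(1, len(features)):
        x = pred if corrupt_score[t] > thresh else features[t]
        out.append(x.copy())
        pred = alpha * pred + (1 - alpha) * x  # update the temporal prior
    return out
```

A frame flagged as corrupted is thus swapped for the temporal prediction rather than propagated downstream to planning.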
cs.CV / 93 / 2603.09530
DCAU-Net: Differential Cross Attention and Channel-Spatial Feature Fusion for Medical Image Segmentation
DCAU-Net:用于医学图像分割的差异交叉注意力与通道-空间特征融合
Abstract
Accurate medical image segmentation requires effective modeling of both long-range dependencies and fine-grained boundary details. While transformers mitigate the issue of insufficient semantic information arising from the limited receptive field inherent in convolutional neural networks, they introduce new challenges: standard self-attention incurs quadratic computational complexity and often assigns non-negligible attention weights to irrelevant regions, diluting focus on discriminative structures and ultimately compromising segmentation accuracy. Existing attention variants, although effective in reducing computational complexity, fail to suppress redundant computation and inadvertently impair global context modeling. Furthermore, conventional fusion strategies in encoder-decoder architectures, typically based on simple concatenation or summation, can not adaptively integrate high-level semantic information with low-level spatial details. To address these limitations, we propose DCAU-Net, a novel yet efficient segmentation framework with two key ideas. First, a new Differential Cross Attention (DCA) is designed to compute the difference between two independent softmax attention maps to adaptively highlight discriminative structures. By replacing pixel-wise key and value tokens with window-level summary tokens, DCA dramatically reduces computational complexity without sacrificing precision. Second, a Channel-Spatial Feature Fusion (CSFF) strategy is introduced to adaptively recalibrate features from skip connections and up-sampling paths through using sequential channel and spatial attention, effectively suppressing redundant information and amplifying salient cues. Experiments on two public benchmarks demonstrate that DCAU-Net achieves competitive performance with enhanced segmentation accuracy and robustness.
Chinese Translation
准确的医学图像分割需要有效建模长距离依赖关系和细粒度边界细节。尽管变换器(transformers)缓解了卷积神经网络固有的有限感受野所导致的语义信息不足的问题,但它们也带来了新的挑战:标准自注意力(self-attention)会导致二次计算复杂度,并且常常将非忽略的注意力权重分配给无关区域,从而削弱对判别结构的关注,最终影响分割精度。现有的注意力变体虽然在降低计算复杂度方面有效,但未能抑制冗余计算,反而无意中损害了全局上下文建模。此外,基于简单连接或求和的传统编码器-解码器架构中的融合策略,无法自适应地将高层语义信息与低层空间细节进行整合。为了解决这些局限性,我们提出了DCAU-Net,这是一种新颖而高效的分割框架,具有两个关键思想。首先,设计了一种新的差异交叉注意力(Differential Cross Attention, DCA),用于计算两个独立softmax注意力图之间的差异,以自适应地突出判别结构。通过用窗口级摘要标记替换逐像素的关键和数值标记,DCA显著降低了计算复杂度而不牺牲精度。其次,引入了一种通道-空间特征融合(Channel-Spatial Feature Fusion, CSFF)策略,通过使用顺序通道和空间注意力,自适应地重新校准来自跳跃连接和上采样路径的特征,有效抑制冗余信息并放大显著线索。在两个公共基准上的实验表明,DCAU-Net在分割精度和鲁棒性方面达到了具有竞争力的表现。
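The Differential Cross Attention above subtracts two independent softmax attention maps so that common-mode weights on irrelevant regions cancel. A minimal sketch (the balancing factor `lam`, and the use of full-resolution keys rather than the paper's window-level summary tokens, are simplifications):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def differential_attention(q, k1, k2, v, lam=0.5):
    """Difference of two independent softmax attention maps: weights that
    both maps assign to the same (irrelevant) regions partially cancel,
    sharpening focus on discriminative structures."""
    d = q.shape[-1]
    a1 = softmax(q @ k1.T / np.sqrt(d))
    a2 = softmax(q @ k2.T / np.sqrt(d))
    attn = a1 - lam * a2
    return attn @ v, attn
```

Since each softmax row sums to 1, every row of the differential map sums to 1 - lam, so the subtraction redistributes rather than simply rescales attention mass.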
cs.CV / 94 / 2603.09538
Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization
通过群体相对策略优化实现统一的多模态交错生成
Abstract
Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, which is a crucial capability for tasks like visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-training strategy to unlock this capability in existing unified models, without relying on large-scale multimodal interleaved datasets. We begin with a warm-up stage using a hybrid dataset comprising curated interleaved sequences and limited data for multimodal understanding and text-to-image generation, which exposes the model to interleaved generation patterns while preserving its pretrained capabilities. To further refine interleaved generation, we propose a unified policy optimization framework that extends Group Relative Policy Optimization (GRPO) to the multimodal setting. Our approach jointly models text and image generation within a single decoding trajectory and optimizes it with our novel hybrid rewards covering textual relevance, visual-text alignment, and structural fidelity. Additionally, we incorporate process-level rewards to provide step-wise guidance, enhancing training efficiency in complex multimodal tasks. Experiments on MMIE and InterleavedBench demonstrate that our approach significantly enhances the quality and coherence of multimodal interleaved generation.
Chinese Translation
统一的视觉-语言模型在多模态理解和生成方面取得了显著进展,但在生成多模态交错输出方面仍然存在不足,这对于视觉故事讲述和逐步视觉推理等任务至关重要。在本研究中,我们提出了一种基于强化学习的后训练策略,以在现有的统一模型中解锁这一能力,而无需依赖大规模的多模态交错数据集。我们首先使用一个混合数据集进行热身阶段,该数据集包含策划的交错序列和有限的多模态理解及文本到图像生成数据,这使模型接触到交错生成模式,同时保留其预训练能力。为了进一步优化交错生成,我们提出了一个统一的策略优化框架,将群体相对策略优化(Group Relative Policy Optimization, GRPO)扩展到多模态设置。我们的方法在单一解码轨迹中联合建模文本和图像生成,并使用我们新颖的混合奖励进行优化,这些奖励涵盖了文本相关性、视觉-文本对齐和结构保真度。此外,我们还引入了过程级奖励,以提供逐步指导,从而提高复杂多模态任务的训练效率。在MMIE和InterleavedBench上的实验表明,我们的方法显著提升了多模态交错生成的质量和一致性。
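GRPO's core mechanism, which the work above extends to multimodal decoding trajectories, replaces a learned critic with group-relative reward normalization: each sampled output's advantage is its reward standardized within its own sampling group. The standard formulation:

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize each trajectory's reward by the
    mean and standard deviation of its own group of samples, removing the
    need for a separate value (critic) network."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

In the multimodal setting described above, the reward fed into this normalization would be the hybrid score over textual relevance, visual-text alignment, and structural fidelity.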
cs.CV / 95 / 2603.09541
Memory-Guided View Refinement for Dynamic Human-in-the-loop EQA
基于记忆的动态人机交互环境下视图精炼
Abstract
Embodied Question Answering (EQA) has traditionally been evaluated in temporally stable environments where visual evidence can be accumulated reliably. However, in dynamic, human-populated scenes, human activities and occlusions introduce significant perceptual non-stationarity: task-relevant cues are transient and view-dependent, while a store-then-retrieve strategy over-accumulates redundant evidence and increases inference cost. This setting exposes two practical challenges for EQA agents: resolving ambiguity caused by viewpoint-dependent occlusions, and maintaining compact yet up-to-date evidence for efficient inference. To enable systematic study of this setting, we introduce DynHiL-EQA, a human-in-the-loop EQA dataset with two subsets: a Dynamic subset featuring human activities and temporal changes, and a Static subset with temporally stable observations. To address the above challenges, we present DIVRR (Dynamic-Informed View Refinement and Relevance-guided Adaptive Memory Selection), a training-free framework that couples relevance-guided view refinement with selective memory admission. By verifying ambiguous observations before committing them and retaining only informative evidence, DIVRR improves robustness under occlusions while preserving fast inference with compact memory. Extensive experiments on DynHiL-EQA and the established HM-EQA dataset demonstrate that DIVRR consistently improves over existing baselines in both dynamic and static settings while maintaining high inference efficiency.
Chinese Translation
具身问答(EQA)传统上是在时间稳定的环境中进行评估的,在这些环境中,视觉证据可以可靠地积累。然而,在动态的人口场景中,人类活动和遮挡引入了显著的感知非平稳性:任务相关的线索是短暂且依赖视角的,而存储后再检索的策略则过度积累冗余证据,增加了推理成本。这种环境对EQA代理提出了两个实际挑战:解决由视角依赖的遮挡引起的模糊性,以及维护紧凑但最新的证据以实现高效推理。为了系统地研究这一环境,我们引入了DynHiL-EQA,一个包含两个子集的人机交互EQA数据集:一个动态子集,包含人类活动和时间变化,以及一个静态子集,包含时间稳定的观察。为了解决上述挑战,我们提出了DIVRR(动态信息视图精炼与相关性引导的自适应记忆选择),这是一个无训练的框架,将相关性引导的视图精炼与选择性记忆接纳相结合。通过在提交模糊观察之前进行验证,并仅保留有信息的证据,DIVRR在遮挡下提高了鲁棒性,同时保持了紧凑记忆下的快速推理。在DynHiL-EQA和已建立的HM-EQA数据集上的大量实验表明,DIVRR在动态和静态环境中始终优于现有基线,同时保持高效的推理效率。
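Selective memory admission as described above can be sketched as a simple policy: drop irrelevant observations, drop near-duplicates of retained evidence, and evict the least relevant entry when the compact memory overflows. All thresholds and the similarity function are illustrative:

```python
def admit(memory, observation, relevance, sim_fn,
          rel_thresh=0.5, dup_thresh=0.9, capacity=8):
    """Relevance-guided admission sketch: memory is a list of
    (observation, relevance) pairs kept compact and non-redundant."""
    if relevance < rel_thresh:
        return memory                                        # irrelevant: drop
    if any(sim_fn(observation, m) > dup_thresh for m, _ in memory):
        return memory                                        # redundant: drop
    memory = memory + [(observation, relevance)]
    if len(memory) > capacity:                               # evict least relevant
        memory = sorted(memory, key=lambda e: -e[1])[:capacity]
    return memory
```

This keeps inference over the memory cheap in dynamic scenes, since redundant evidence from repeated views never accumulates.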
cs.CV / 96 / 2603.09548
A comprehensive study of time-of-flight non-line-of-sight imaging
飞行时间非视线成像的综合研究
Abstract
Time-of-Flight non-line-of-sight (ToF NLOS) imaging techniques provide state-of-the-art reconstructions of scenes hidden around corners by inverting the optical path of indirect photons scattered by visible surfaces and measured by picosecond resolution sensors. The emergence of a wide range of ToF NLOS imaging methods with heterogeneous formulae and hardware implementations obscures the assessment of both their theoretical and experimental aspects. We present a comprehensive study of a representative set of ToF NLOS imaging methods by discussing their similarities and differences under common formulation and hardware. We first outline the problem statement under a common general forward model for ToF NLOS measurements, and the typical assumptions that yield tractable inverse models. We discuss the relationship of the resulting simplified forward and inverse models to a family of Radon transforms, and how migrating these to the frequency domain relates to recent phasor-based virtual line-of-sight imaging models for NLOS imaging that obey the constraints of conventional lens-based imaging systems. We then evaluate performance of the selected methods on hidden scenes captured under the same hardware setup and similar photon counts. Our experiments show that existing methods share similar limitations on spatial resolution, visibility, and sensitivity to noise when operating under equal hardware constraints, with particular differences that stem from method-specific parameters. We expect our methodology to become a reference in future research on ToF NLOS imaging to obtain objective comparisons of existing and new methods.
Chinese Translation
飞行时间非视线(ToF NLOS)成像技术通过反转由可见表面散射的间接光子所形成的光学路径,并由皮秒分辨率传感器测量,提供了对隐藏在角落场景的尖端重建。各种具有异构公式和硬件实现的ToF NLOS成像方法的出现,使得对其理论和实验方面的评估变得复杂。我们通过讨论一组具有代表性的ToF NLOS成像方法的相似性和差异性,在共同的公式和硬件下,呈现了一项综合研究。我们首先在ToF NLOS测量的共同一般前向模型下概述问题陈述,以及产生可处理逆模型的典型假设。我们讨论了所得到的简化前向模型和逆模型与一类拉东变换的关系,以及将这些模型迁移到频域如何与遵循传统透镜成像系统约束的NLOS成像的相位基虚拟视线成像模型相关。接着,我们在相同硬件设置和相似光子计数下评估所选方法在隐藏场景上的性能。我们的实验表明,现有方法在相同硬件约束下在空间分辨率、可见性和对噪声的敏感性方面存在相似的局限性,具体差异源于方法特定的参数。我们期望我们的研究方法能够成为未来ToF NLOS成像研究中的参考,以便对现有和新方法进行客观比较。
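The forward model inverted by these methods rests on the three-bounce travel time of indirect photons: laser to relay wall, to the hidden point, back to the wall, and to the sensor. A minimal sketch of that geometric timing relation:

```python
import math

def nlos_travel_time(laser, wall_in, hidden, wall_out, sensor,
                     c=299_792_458.0):
    """Total optical path length of a three-bounce NLOS measurement
    (laser -> relay wall -> hidden point -> relay wall -> sensor),
    divided by the speed of light. Picosecond-scale differences in
    this time are what encode the hidden geometry."""
    path = (math.dist(laser, wall_in) + math.dist(wall_in, hidden)
            + math.dist(hidden, wall_out) + math.dist(wall_out, sensor))
    return path / c
```

For fixed laser and sensor positions, the set of hidden points producing the same travel time is an ellipsoid, which is why the simplified inverse models discussed above connect to Radon-type transforms.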
cs.CV / 97 / 2603.09551
GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision
GeoSolver:通过细粒度过程监督扩展遥感中的测试时推理
Abstract
While Vision-Language Models (VLMs) have significantly advanced remote sensing interpretation, enabling them to perform complex, step-by-step reasoning remains highly challenging. Recent efforts to introduce Chain-of-Thought (CoT) reasoning to this domain have shown promise, yet ensuring the visual faithfulness of these intermediate steps remains a critical bottleneck. To address this, we introduce GeoSolver, a novel framework that transitions remote sensing reasoning toward verifiable, process-supervised reinforcement learning. We first construct Geo-PRM-2M, a large-scale, token-level process supervision dataset synthesized via entropy-guided Monte Carlo Tree Search (MCTS) and targeted visual hallucination injection. Building upon this dataset, we train GeoPRM, a token-level process reward model (PRM) that provides granular faithfulness feedback. To effectively leverage these verification signals, we propose Process-Aware Tree-GRPO, a reinforcement learning algorithm that integrates tree-structured exploration with a faithfulness-weighted reward mechanism to precisely assign credit to intermediate steps. Extensive experiments demonstrate that our resulting model, GeoSolver-9B, achieves state-of-the-art performance across diverse remote sensing benchmarks. Crucially, GeoPRM unlocks robust Test-Time Scaling (TTS). Serving as a universal geospatial verifier, it seamlessly scales the performance of GeoSolver-9B and directly enhances general-purpose VLMs, highlighting its remarkable cross-model generalization.
Chinese Translation
尽管视觉-语言模型(VLMs)在遥感解释方面取得了显著进展,但使其能够执行复杂的逐步推理仍然面临很大挑战。最近将思维链(Chain-of-Thought, CoT)推理引入该领域的努力显示出希望,但确保这些中间步骤的视觉真实性仍然是一个关键瓶颈。为了解决这个问题,我们提出了GeoSolver,一个将遥感推理转向可验证的过程监督强化学习的新框架。我们首先构建了Geo-PRM-2M,一个通过熵引导的蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)和针对性的视觉幻觉注入合成的大规模标记级过程监督数据集。在此数据集的基础上,我们训练了GeoPRM,一个提供细粒度真实性反馈的标记级过程奖励模型(Process Reward Model, PRM)。为了有效利用这些验证信号,我们提出了过程感知树状GRPO(Process-Aware Tree-GRPO),这是一种将树结构探索与真实性加权奖励机制相结合的强化学习算法,以精确地为中间步骤分配信用。大量实验表明,我们的模型GeoSolver-9B在多种遥感基准测试中达到了最先进的性能。至关重要的是,GeoPRM解锁了强大的测试时扩展(Test-Time Scaling, TTS)。作为一个通用的地理空间验证器,它无缝地扩展了GeoSolver-9B的性能,并直接增强了通用VLMs,突显了其显著的跨模型泛化能力。
cs.CV / 98 / 2603.09566
GeoAlignCLIP: Enhancing Fine-Grained Vision-Language Alignment in Remote Sensing via Multi-Granular Consistency Learning
GeoAlignCLIP:通过多粒度一致性学习增强遥感中的细粒度视觉-语言对齐
Abstract
Vision-language pretraining models have made significant progress in bridging remote sensing imagery with natural language. However, existing approaches often fail to effectively integrate multi-granular visual and textual information, relying primarily on global image-text alignment. This limitation hinders the model's ability to accurately capture fine-grained details in images, thus restricting its performance in complex, fine-grained tasks. To address this, we propose GeoAlignCLIP, a unified framework that achieves fine-grained alignment in remote sensing tasks by learning multi-granular semantic alignments and incorporating intra-modal consistency, enabling more precise visual-semantic alignment between image regions and text concepts. Additionally, we construct RSFG-100k, a fine-granular remote sensing dataset containing scene descriptions, region-level annotations, and challenging hard-negative samples, providing hierarchical supervision for model training. Extensive experiments conducted on multiple public remote-sensing benchmarks demonstrate that GeoAlignCLIP consistently outperforms existing RS-specific methods across diverse tasks, exhibiting more robust and accurate fine-grained vision-language alignment.
Chinese Translation
视觉-语言预训练模型在连接遥感图像与自然语言方面取得了显著进展。然而,现有方法往往未能有效整合多粒度的视觉和文本信息,主要依赖于全局图像-文本对齐。这一局限性妨碍了模型准确捕捉图像中的细粒度细节,从而限制了其在复杂细粒度任务中的表现。为此,我们提出了GeoAlignCLIP,一个统一框架,通过学习多粒度语义对齐并结合模态内部一致性,实现遥感任务中的细粒度对齐,从而使图像区域与文本概念之间的视觉-语义对齐更加精确。此外,我们构建了RSFG-100k,一个包含场景描述、区域级注释和具有挑战性的难负样本的细粒度遥感数据集,为模型训练提供层次化的监督。在多个公共遥感基准上进行的广泛实验表明,GeoAlignCLIP在不同任务中始终优于现有的遥感特定方法,展现出更强大和准确的细粒度视觉-语言对齐能力。
cs.CV / 99 / 2603.09573
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
超越总和:面向恶劣全景场景的全景语言模型
Abstract
Existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we introduce the Panorama-Language Modeling (PLM) paradigm, a unified $360^\circ$ vision-language reasoning approach that is more than the sum of its pinhole counterparts. In addition, we present PanoVQA, a large-scale panoramic VQA dataset that involves adverse omni-scenes, enabling comprehensive reasoning under object occlusions and driving accidents. To establish a foundation for PLM, we develop a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under challenging omni-scenes, yielding understanding greater than the sum of its narrow parts. Project page: https://github.com/InSAI-Lab/PanoVQA.
Chinese Translation
现有的视觉-语言模型(VLMs)主要针对针孔图像,通过拼接多个狭窄视场的输入来构建完整的全景场景理解。然而,这种多视角感知忽略了单一全景图像所固有的整体空间和上下文关系。在本研究中,我们提出了全景语言建模(Panorama-Language Modeling, PLM)范式,这是一种统一的 $360^\circ$ 视觉-语言推理范式,超越了其针孔对应物的总和。此外,我们还提出了 PanoVQA,这是一个大规模的全景视觉问答(VQA)数据集,涉及恶劣的全景场景,使得在物体遮挡和驾驶事故下进行全面推理成为可能。为了为 PLM 奠定基础,我们开发了一个即插即用的全景稀疏注意力模块,使现有的基于针孔的 VLMs 无需重新训练即可处理等距柱状全景图。大量实验表明,我们的 PLM 在具有挑战性的全景场景下实现了更优的鲁棒性和整体推理,产生的理解超越了其狭窄部分的总和。项目页面:https://github.com/InSAI-Lab/PanoVQA。
cs.CV / 100 / 2603.09582
BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers
BinaryAttention:用于视觉和扩散变换器的1位 QK-注意力
Abstract
Transformers have achieved widespread and remarkable success, while the computational complexity of their attention modules remains a major bottleneck for vision tasks. Existing methods mainly employ 8-bit or 4-bit quantization to balance efficiency and accuracy. In this paper, with theoretical justification, we indicate that binarization of attention preserves the essential similarity relationships, and propose BinaryAttention, an effective method for fast and accurate 1-bit qk-attention. Specifically, we retain only the sign of queries and keys in computing the attention, and replace the floating dot products with bit-wise operations, significantly reducing the computational cost. We mitigate the inherent information loss under 1-bit quantization by incorporating a learnable bias, and enable end-to-end acceleration. To maintain the accuracy of attention, we adopt quantization-aware training and self-distillation techniques, mitigating quantization errors while ensuring sign-aligned similarity. BinaryAttention is more than 2x faster than FlashAttention2 on A100 GPUs. Extensive experiments on vision transformer and diffusion transformer benchmarks demonstrate that BinaryAttention matches or even exceeds full-precision attention, validating its effectiveness. Our work provides a highly efficient and effective alternative to full-precision attention, pushing the frontier of low-bit vision and diffusion transformers. The codes and models can be found at https://github.com/EdwardChasel/BinaryAttention.
Chinese Translation
变换器在多个领域取得了广泛而显著的成功,但其注意力模块的计算复杂性仍然是视觉任务的主要瓶颈。现有方法主要采用 8 位或 4 位量化来平衡效率和准确性。本文通过理论证明,指出注意力的二值化能够保留基本的相似性关系,并提出了一种有效的快速且准确的 1 位 qk-注意力方法——BinaryAttention。具体而言,我们在计算注意力时仅保留查询和键的符号,并用位运算替代浮点点积,显著降低了计算成本。我们通过引入可学习的偏置来减轻 1 位量化下固有的信息损失,从而实现端到端的加速。为了保持注意力的准确性,我们采用了量化感知训练和自蒸馏技术,减轻量化误差,同时确保符号对齐的相似性。BinaryAttention 在 A100 GPU 上的速度比 FlashAttention2 快超过 2 倍。在视觉变换器和扩散变换器基准上的大量实验表明,BinaryAttention 的性能与全精度注意力相当,甚至超越了全精度注意力,验证了其有效性。我们的工作为全精度注意力提供了一种高效且有效的替代方案,推动了低位视觉和扩散变换器的前沿。代码和模型可在 https://github.com/EdwardChasel/BinaryAttention 找到。
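The core mechanism described in the BinaryAttention abstract (keep only the signs of queries and keys, replace floating dot products, and compensate with a learnable bias) can be illustrated with a minimal NumPy sketch. This is an illustrative toy, not the paper's packed-bit CUDA kernel; the scalar `bias` and all shapes are assumptions for demonstration:

```python
import numpy as np

def binary_qk_attention(Q, K, V, bias=0.0):
    """Sketch of 1-bit qk-attention: only the sign of Q and K is kept.

    The sign-only product <sign(q), sign(k)> counts agreeing minus
    disagreeing coordinates (computable with bit-wise ops on hardware);
    `bias` stands in for the learnable bias that compensates the
    information lost by binarization. Values V stay full precision.
    """
    Qb = np.sign(Q)  # entries in {-1, 0, +1}; a kernel would pack bits
    Kb = np.sign(K)
    d = Q.shape[-1]
    scores = (Qb @ Kb.T + bias) / np.sqrt(d)
    # standard numerically stable softmax over keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = binary_qk_attention(Q, K, V)
```

The bit-wise formulation is why the method can beat FlashAttention-style full-precision kernels: the expensive floating dot products reduce to popcounts over packed sign bits.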
cs.CV / 101 / 2603.09611
ParTY: Part-Guidance for Expressive Text-to-Motion Synthesis
ParTY:用于表现力文本到动作合成的部件引导
Abstract
Text-to-motion synthesis aims to generate natural and expressive human motions from textual descriptions. While existing approaches primarily focus on generating holistic motions from text descriptions, they struggle to accurately reflect actions involving specific body parts. Recent part-wise motion generation methods attempt to resolve this but face two critical limitations: (i) they lack explicit mechanisms for aligning textual semantics with individual body parts, and (ii) they often generate incoherent full-body motions due to integrating independently generated part motions. To overcome these issues and resolve the fundamental trade-off in existing methods, we propose ParTY, a novel framework that enhances part expressiveness while generating coherent full-body motions. ParTY comprises: (1) Part-Guided Network, which first generates part motions to obtain part guidance, then uses it to generate holistic motions; (2) Part-aware Text Grounding, which diversely transforms text embeddings and appropriately aligns them with each body part; and (3) Holistic-Part Fusion, which adaptively fuses holistic motions and part motions. Extensive experiments, including part-level and coherence-level evaluations, demonstrate that ParTY achieves substantial improvements over previous methods.
Chinese Translation
文本到动作合成旨在从文本描述中生成自然且富有表现力的人体动作。尽管现有方法主要集中于从文本描述生成整体动作,但它们在准确反映涉及特定身体部位的动作方面存在困难。最近的部件级动作生成方法试图解决这一问题,但面临两个关键限制:(i)它们缺乏将文本语义与各个身体部位对齐的明确机制;(ii)由于独立生成的部件动作的整合,它们往往生成不连贯的全身动作。为了克服这些问题并解决现有方法中的基本权衡,我们提出了ParTY,一个新颖的框架,旨在增强部件的表现力,同时生成连贯的全身动作。ParTY包括:(1)部件引导网络,首先生成部件动作以获得部件引导,然后利用该引导生成整体动作;(2)部件感知文本对齐,能够多样化转换文本嵌入并适当地与每个身体部位对齐;(3)整体-部件融合,自适应地融合整体动作和部件动作。大量实验,包括部件级和连贯性评估,表明ParTY在性能上显著优于之前的方法。
cs.CV / 102 / 2603.09613
A saccade-inspired approach to image classification using vision transformer attention maps
一种受扫视启发的图像分类方法:基于视觉变换器的注意力图
Abstract
Human vision achieves remarkable perceptual performance while operating under strict metabolic constraints. A key ingredient is the selective attention mechanism, driven by rapid saccadic eye movements that constantly reposition the high-resolution fovea onto task-relevant locations, unlike conventional AI systems that process entire images with equal emphasis. Our work aims to draw inspiration from the human visual system to create smarter, more efficient image processing models. Using DINO, a self-supervised Vision Transformer that produces attention maps strikingly similar to human gaze patterns, we explore a saccade inspired method to focus the processing of information on key regions in visual space. To do so, we use the ImageNet dataset in a standard classification task and measure how each successive saccade affects the model's class scores. This selective-processing strategy preserves most of the full-image classification performance and can even outperform it in certain cases. By benchmarking against established saliency models built for human gaze prediction, we demonstrate that DINO provides superior fixation guidance for selecting informative regions. These findings highlight Vision Transformer attention as a promising basis for biologically inspired active vision and open new directions for efficient, neuromorphic visual processing.
Chinese Translation
人类视觉在严格的代谢限制下实现了卓越的感知性能。其关键因素是选择性注意机制,由快速的扫视眼动驱动,这种眼动不断将高分辨率的中央凹重新定位到与任务相关的位置,这与传统的人工智能系统对整幅图像给予同等重视的处理方式截然不同。我们的工作旨在从人类视觉系统中汲取灵感,以创建更智能、更高效的图像处理模型。我们使用 DINO,这是一种自监督的视觉变换器,能够生成与人类注视模式惊人相似的注意力图,并探索一种受扫视启发的方法,将信息处理集中在视觉空间中的关键区域。为此,我们在标准分类任务中使用 ImageNet 数据集,并测量每次连续的扫视如何影响模型的类别得分。这种选择性处理策略保留了大部分全图像分类性能,甚至在某些情况下超越了全图像分类性能。通过与为人类注视预测建立的成熟显著性模型进行基准测试,我们证明 DINO 为选择信息丰富的区域提供了更优越的注视引导。这些发现突显了视觉变换器注意力作为生物启发主动视觉的有前景的基础,并为高效的类脑视觉处理开辟了新的方向。
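The measurement protocol in this abstract (visit image regions in order of attention and track how each successive saccade changes the class score) can be sketched as follows. The per-patch attention values and logits here are hypothetical stand-ins for DINO's attention maps and a downstream classifier:

```python
import numpy as np

def saccade_scores(attn, patch_logits, n_saccades=3):
    """Selective-processing sketch: fixate patches in decreasing attention
    order and record the attention-pooled class score after each saccade."""
    order = np.argsort(attn)[::-1][:n_saccades]  # most-attended patches first
    history = []
    for k in range(1, n_saccades + 1):
        sel = order[:k]
        w = attn[sel] / attn[sel].sum()          # attention-weighted pooling
        history.append(w @ patch_logits[sel])
    return np.array(history)                     # (n_saccades, n_classes)

attn = np.array([0.10, 0.50, 0.20, 0.05, 0.15])        # hypothetical attention
patch_logits = np.arange(10, dtype=float).reshape(5, 2)  # hypothetical logits
hist = saccade_scores(attn, patch_logits, n_saccades=3)
```

Plotting each row of `hist` against the saccade index reproduces the kind of per-fixation score curve the study measures.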
cs.CV / 103 / 2603.09621
Physics-Driven 3D Gaussian Rendering for Zero-Shot MRI Super-Resolution
基于物理驱动的三维高斯渲染用于零样本MRI超分辨率
Abstract
High-resolution Magnetic Resonance Imaging (MRI) is vital for clinical diagnosis but limited by long acquisition times and motion artifacts. Super-resolution (SR) reconstructs low-resolution scans into high-resolution images, yet existing methods are mutually constrained: paired-data methods achieve efficiency only by relying on costly aligned datasets, while implicit neural representation approaches avoid such data needs at the expense of heavy computation. We propose a zero-shot MRI SR framework using explicit Gaussian representation to balance data requirements and efficiency. MRI-tailored Gaussian parameters embed tissue physical properties, reducing learnable parameters while preserving MR signal fidelity. A physics-grounded volume rendering strategy models MRI signal formation via normalized Gaussian aggregation. Additionally, a brick-based order-independent rasterization scheme enables highly parallel 3D computation, lowering training and inference costs. Experiments on two public MRI datasets show superior reconstruction quality and efficiency, demonstrating the method's potential for clinical MRI SR.
Chinese Translation
高分辨率磁共振成像(MRI)对临床诊断至关重要,但受到长时间采集和运动伪影的限制。超分辨率(SR)技术将低分辨率扫描重建为高分辨率图像,然而现有方法相互制约:配对数据方法仅通过依赖昂贵的对齐数据集来实现效率,而隐式神经表示方法则在避免此类数据需求的同时付出了高计算成本。我们提出了一种基于显式高斯表示的零样本MRI SR框架,以平衡数据需求和效率。针对MRI的高斯参数嵌入了组织物理特性,减少了可学习参数的数量,同时保持了MRI信号的保真度。基于物理的体积渲染策略通过归一化高斯聚合建模MRI信号形成。此外,基于砖块的顺序无关光栅化方案实现了高度并行的三维计算,降低了训练和推理成本。在两个公共MRI数据集上的实验表明,该方法在重建质量和效率上优于现有技术,展示了其在临床MRI超分辨率中的潜力。
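The "normalized Gaussian aggregation" rendering strategy can be sketched in a few lines: the signal at a query point is the weight-normalized sum of per-Gaussian signals under isotropic Gaussian kernels. The MRI-tailored parameters and the brick-based rasterizer are abstracted away; all values below are hypothetical:

```python
import numpy as np

def render_signal(x, centers, sigmas, signals):
    """Normalized Gaussian aggregation at a 3D query point x.

    Each Gaussian contributes its signal weighted by an isotropic kernel;
    dividing by the weight sum keeps the rendered value on the signal scale.
    """
    d2 = np.sum((centers - x) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / sigmas ** 2)
    return np.sum(w * signals) / (np.sum(w) + 1e-12)

centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
sigmas = np.array([0.5, 0.5])
signals = np.array([100.0, 200.0])  # hypothetical tissue signal intensities
s_mid = render_signal(np.array([0.5, 0.0, 0.0]), centers, sigmas, signals)
s_near = render_signal(np.array([0.0, 0.0, 0.0]), centers, sigmas, signals)
```

Because the aggregation is differentiable in the centers, widths, and signals, all Gaussian parameters can be fit to low-resolution observations by gradient descent, which is what makes the zero-shot SR formulation possible.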
cs.CV / 104 / 2603.09624
Decoder-Free Distillation for Quantized Image Restoration
用于量化图像恢复的无解码器蒸馏
Abstract
Quantization-Aware Training (QAT), combined with Knowledge Distillation (KD), holds immense promise for compressing models for edge deployment. However, joint optimization for precision-sensitive image restoration (IR) to recover visual quality from degraded images remains largely underexplored. Directly adapting QAT-KD to low-level vision reveals three critical bottlenecks: teacher-student capacity mismatch, spatial error amplification during decoder distillation, and an optimization "tug-of-war" between reconstruction and distillation losses caused by quantization noise. To tackle these, we introduce Quantization-aware Distilled Restoration (QDR), a framework for edge-deployed IR. QDR eliminates capacity mismatch via FP32 self-distillation and prevents error amplification through Decoder-Free Distillation (DFD), which corrects quantization errors strictly at the network bottleneck. To stabilize the optimization tug-of-war, we propose a Learnable Magnitude Reweighting (LMR) that dynamically balances competing gradients. Finally, we design an Edge-Friendly Model (EFM) featuring a lightweight Learnable Degradation Gating (LDG) to dynamically modulate spatial degradation localization. Extensive experiments across four IR tasks demonstrate that our Int8 model recovers 96.5% of FP32 performance, achieves 442 frames per second (FPS) on an NVIDIA Jetson Orin, and boosts downstream object detection by 16.3 mAP.
Chinese Translation
量化感知训练(Quantization-Aware Training, QAT)结合知识蒸馏(Knowledge Distillation, KD)在边缘部署模型压缩方面具有巨大潜力。然而,针对精度敏感的图像恢复(Image Restoration, IR)进行联合优化,以从退化图像中恢复视觉质量,仍然在很大程度上未得到充分探索。直接将QAT-KD应用于低级视觉任务暴露出三个关键瓶颈:教师-学生能力不匹配、在解码器蒸馏过程中空间误差放大,以及由于量化噪声引起的重建损失与蒸馏损失之间的优化“拉锯战”。为了解决这些问题,我们提出了量化感知蒸馏恢复(Quantization-aware Distilled Restoration, QDR),这是一个用于边缘部署图像恢复的框架。QDR通过FP32自蒸馏消除了能力不匹配,并通过无解码器蒸馏(Decoder-Free Distillation, DFD)防止误差放大,该方法严格在网络瓶颈处纠正量化误差。为了稳定优化的拉锯战,我们提出了一种可学习的幅度重加权(Learnable Magnitude Reweighting, LMR),该方法动态平衡竞争梯度。最后,我们设计了一种边缘友好的模型(Edge-Friendly Model, EFM),其特点是轻量级的可学习退化门控(Learnable Degradation Gating, LDG),用于动态调节空间退化定位。通过在四个图像恢复任务上的广泛实验,我们的Int8模型恢复了96.5%的FP32性能,在NVIDIA Jetson Orin上达到了每秒442帧(FPS),并将下游目标检测的mAP提升了16.3。
cs.CV / 105 / 2603.09625
Grounding Synthetic Data Generation With Vision and Language Models
基于视觉和语言模型的合成数据生成基础
Abstract
Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typically calculate latent feature similarity, which is difficult to interpret and does not always correlate with the contribution to downstream tasks. We propose a vision-language grounded framework for interpretable synthetic data augmentation and evaluation in remote sensing. Our approach combines generative models, semantic segmentation and image captioning with vision and language models. Based on this framework, we introduce ARAS400k: A large-scale Remote sensing dataset Augmented with Synthetic data for segmentation and captioning, containing 100k real images and 300k synthetic images, each paired with segmentation maps and descriptions. ARAS400k enables the automated evaluation of synthetic data by analyzing semantic composition, minimizing caption redundancy, and verifying cross-modal consistency between visual structures and language descriptions. Experimental results indicate that while models trained exclusively on synthetic data reach competitive performance levels, those trained with augmented data (a combination of real and synthetic images) consistently outperform real-data baselines. Consequently, this work establishes a scalable benchmark for remote sensing tasks, specifically in semantic segmentation and image captioning. The dataset is available at zenodo.org/records/18890661 and the code base at github.com/caglarmert/ARAS400k.
Chinese Translation
深度学习模型受益于数据多样性和数量的增加,这推动了利用合成数据增强来改善现有数据集。然而,现有的合成数据评估指标通常计算潜在特征相似性,这难以解释,并且并不总是与对下游任务的贡献相关。我们提出了一种基于视觉-语言的框架,用于遥感中可解释的合成数据增强与评估。我们的方法将生成模型、语义分割和图像描述与视觉和语言模型相结合。基于该框架,我们引入了ARAS400k:一个用于分割和描述、经合成数据增强的大规模遥感数据集,包含10万张真实图像和30万张合成图像,每张图像都配有分割图和描述。ARAS400k通过分析语义组成、最小化描述冗余以及验证视觉结构与语言描述之间的跨模态一致性,实现了合成数据的自动评估。实验结果表明,尽管仅在合成数据上训练的模型达到了有竞争力的性能水平,但使用增强数据(真实图像和合成图像的组合)训练的模型始终优于真实数据基线。因此,本研究为遥感任务,特别是语义分割和图像描述,建立了一个可扩展的基准。数据集可在zenodo.org/records/18890661获取,代码库可在github.com/caglarmert/ARAS400k找到。
cs.CV / 106 / 2603.09632
X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models
X-GS:一个可扩展的开放框架,统一3DGS架构与下游多模态模型
Abstract
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods are isolated, focusing on specific domains such as online SLAM, semantic enrichment, or 3DGS for unposed images. In this paper, we introduce X-GS, an extensible open framework that unifies a broad range of techniques to enable real-time 3DGS-based online SLAM enriched with semantics, bridging the gap to downstream multimodal models. At the core of X-GS is a highly efficient pipeline called X-GS-Perceiver, capable of taking unposed RGB (or optionally RGB-D) video streams as input to co-optimize geometry and poses, and distill high-dimensional semantic features from vision foundation models into the 3D Gaussians. We achieve real-time performance through a novel online Vector Quantization (VQ) module, a GPU-accelerated grid-sampling scheme, and a highly parallelized pipeline design. The semantic 3D Gaussians can then be utilized by vision-language models within the X-GS-Thinker component, enabling downstream tasks such as object detection, zero-shot caption generation, and potentially embodied tasks. Experimental results on real-world datasets showcase the efficacy, efficiency, and newly unlocked multimodal capabilities of the X-GS framework.
Chinese Translation
3D高斯点云(3D Gaussian Splatting, 3DGS)已成为新视图合成的一种强大技术,随后扩展到众多空间人工智能应用。然而,大多数现有的3DGS方法是孤立的,专注于特定领域,如在线SLAM、语义增强或针对未定姿态图像的3DGS。本文介绍了X-GS,一个可扩展的开放框架,统一了广泛的技术,以实现基于3DGS的实时在线SLAM,并增强语义,从而弥合与下游多模态模型之间的差距。X-GS的核心是一个高效的管道,称为X-GS-Perceiver,能够将未定姿态的RGB(或可选的RGB-D)视频流作为输入,协同优化几何形状和姿态,并从视觉基础模型中提取高维语义特征到3D高斯中。我们通过一种新颖的在线向量量化(Vector Quantization, VQ)模块、一个GPU加速的网格采样方案以及高度并行化的管道设计,实现了实时性能。然后,语义3D高斯可以在X-GS-Thinker组件中被视觉-语言模型利用,从而支持下游任务,如物体检测、零样本标题生成,以及潜在的具身任务。对真实世界数据集的实验结果展示了X-GS框架的有效性、效率以及新解锁的多模态能力。
cs.CV / 107 / 2603.09653
OTPL-VIO: Robust Visual-Inertial Odometry with Optimal Transport Line Association and Adaptive Uncertainty
OTPL-VIO:具有最优传输线关联和自适应不确定性的鲁棒视觉惯性里程计
Abstract
Robust stereo visual-inertial odometry (VIO) remains challenging in low-texture scenes and under abrupt illumination changes, where point features become sparse and unstable, leading to ambiguous association and under-constrained estimation. Line structures offer complementary geometric cues, yet many efficient point-line systems still rely on point-guided line association, which can break down when point support is weak and may lead to biased constraints. We present a stereo point-line VIO system in which line segments are equipped with dedicated deep descriptors and matched using an entropy-regularized optimal transport formulation, enabling globally consistent correspondences under ambiguity, outliers, and partial observations. The proposed descriptor is training-free and is computed by sampling and pooling network feature maps. To improve estimation stability, we analyze the impact of line measurement noise and introduce reliability-adaptive weighting to regulate the influence of line constraints during optimization. Experiments on EuRoC and UMA-VI, together with real-world deployments in low-texture and illumination-challenging environments, demonstrate improved accuracy and robustness over representative baselines while maintaining real-time performance.
Chinese Translation
在低纹理场景和光照突变下,鲁棒的立体视觉惯性里程计(VIO)仍然面临挑战,此时点特征变得稀疏且不稳定,导致关联模糊和欠约束的估计。线结构提供了互补的几何线索,但许多高效的点-线系统仍然依赖于点引导的线关联,当点支持较弱时,这种方法可能会失效,并导致有偏的约束。我们提出了一种立体点-线 VIO 系统,其中线段配备了专用的深度描述符,并使用熵正则化的最优传输公式进行匹配,从而在模糊、异常值和部分观测下实现全局一致的对应关系。所提出的描述符无需训练,通过对网络特征图进行采样和池化来计算。为了提高估计的稳定性,我们分析了线测量噪声的影响,并引入可靠性自适应加权,以调节优化过程中线约束的影响。在 EuRoC 和 UMA-VI 上的实验,以及在低纹理和光照挑战环境中的实际部署,均显示出相较于代表性基线的更高准确性和鲁棒性,同时保持实时性能。
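The entropy-regularized optimal transport matching that OTPL-VIO uses for line association is, at its core, a Sinkhorn iteration over a descriptor cost matrix. A minimal NumPy sketch, assuming uniform marginals and a toy 3x3 cost matrix (the paper's descriptors, outlier handling, and marginal choices are abstracted away):

```python
import numpy as np

def sinkhorn_association(cost, eps=0.1, iters=200):
    """Entropy-regularized OT between two sets of line descriptors.

    Sinkhorn scaling of K = exp(-cost/eps) yields a transport plan whose
    rows/columns match the prescribed (here uniform) marginals; row argmax
    then gives globally consistent correspondences.
    """
    n, m = cost.shape
    K = np.exp(-cost / eps)
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals (assumption)
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]      # transport plan P

cost = np.array([[0.1, 1.0, 1.0],
                 [1.0, 0.1, 1.0],
                 [1.0, 1.0, 0.1]])
P = sinkhorn_association(cost)
matches = P.argmax(axis=1)
```

The entropy weight `eps` trades off sharpness against robustness: smaller values approach hard assignment, larger values spread mass and tolerate ambiguous or partially observed lines.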
cs.CV / 108 / 2603.09657
When to Lock Attention: Training-Free KV Control in Video Diffusion
何时锁定注意力:视频扩散中的无训练KV控制
Abstract
Maintaining background consistency while enhancing foreground quality remains a core challenge in video editing. Injecting full-image information often leads to background artifacts, whereas rigid background locking severely constrains the model's capacity for foreground generation. To address this issue, we propose KV-Lock, a training-free framework tailored for DiT-based video diffusion models. Our core insight is that the hallucination metric (variance of denoising prediction) directly quantifies generation diversity, which is inherently linked to the classifier-free guidance (CFG) scale. Building upon this, KV-Lock leverages diffusion hallucination detection to dynamically schedule two key components: the fusion ratio between cached background key-values (KVs) and newly generated KVs, and the CFG scale. When hallucination risk is detected, KV-Lock strengthens background KV locking and simultaneously amplifies conditional guidance for foreground generation, thereby mitigating artifacts and improving generation fidelity. As a training-free, plug-and-play module, KV-Lock can be easily integrated into any pre-trained DiT-based models. Extensive experiments validate that our method outperforms existing approaches in improved foreground quality with high background fidelity across various video editing tasks.
Chinese Translation
在视频编辑中,保持背景一致性的同时增强前景质量仍然是一个核心挑战。注入全图信息往往会导致背景伪影,而严格的背景锁定则严重限制了模型生成前景的能力。为了解决这一问题,我们提出了KV-Lock,这是一种针对基于DiT的视频扩散模型的无训练框架。我们的核心见解是,幻觉度量(去噪预测的方差)直接量化了生成多样性,这与无分类器引导(CFG)尺度密切相关。在此基础上,KV-Lock利用扩散幻觉检测动态调度两个关键组件:缓存背景关键值(KVs)与新生成KVs之间的融合比例,以及CFG尺度。当检测到幻觉风险时,KV-Lock加强背景KVs的锁定,同时增强前景生成的条件引导,从而减轻伪影并提高生成的保真度。作为一个无训练的即插即用模块,KV-Lock可以轻松集成到任何预训练的基于DiT的模型中。大量实验验证了我们的方法在各种视频编辑任务中,在提高前景质量和高背景保真度方面优于现有方法。
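KV-Lock's scheduling idea (hallucination metric = variance of denoising predictions; higher variance locks background KVs harder and raises the CFG scale) can be sketched as follows. The function names, base values, and gains are all illustrative assumptions, not the paper's settings:

```python
import numpy as np

def schedule_kv_and_cfg(preds, base_fusion=0.5, base_cfg=6.0,
                        k_fusion=2.0, k_cfg=4.0):
    """Hallucination-aware scheduling sketch.

    preds: denoising predictions for the same latent under repeated sampling.
    Their variance is the hallucination metric; it drives both the fusion
    ratio of cached background KVs (clamped to [0, 1]) and the CFG scale.
    """
    h = float(np.var(preds))                       # hallucination metric
    fusion = min(1.0, base_fusion + k_fusion * h)  # weight on cached bg KVs
    cfg = base_cfg + k_cfg * h                     # conditional guidance scale
    return fusion, cfg

calm = np.array([0.50, 0.51, 0.49, 0.50])   # consistent predictions
risky = np.array([0.1, 0.9, 0.3, 0.8])      # diverse -> hallucination risk
f1, c1 = schedule_kv_and_cfg(calm)
f2, c2 = schedule_kv_and_cfg(risky)
```

In a DiT pipeline the returned `fusion` would linearly blend cached background key-values with freshly generated ones at each attention layer, while `cfg` rescales the conditional branch for the foreground.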
cs.CV / 109 / 2603.09668
DiffWind: Physics-Informed Differentiable Modeling of Wind-Driven Object Dynamics
DiffWind:基于物理的可微风驱动物体动力学建模
Abstract
Modeling wind-driven object dynamics from video observations is highly challenging due to the invisibility and spatio-temporal variability of wind, as well as the complex deformations of objects. We present DiffWind, a physics-informed differentiable framework that unifies wind-object interaction modeling, video-based reconstruction, and forward simulation. Specifically, we represent wind as a grid-based physical field and objects as particle systems derived from 3D Gaussian Splatting, with their interaction modeled by the Material Point Method (MPM). To recover wind-driven object dynamics, we introduce a reconstruction framework that jointly optimizes the spatio-temporal wind force field and object motion through differentiable rendering and simulation. To ensure physical validity, we incorporate the Lattice Boltzmann Method (LBM) as a physics-informed constraint, enforcing compliance with fluid dynamics laws. Beyond reconstruction, our method naturally supports forward simulation under novel wind conditions and enables new applications such as wind retargeting. We further introduce WD-Objects, a dataset of synthetic and real-world wind-driven scenes. Extensive experiments demonstrate that our method significantly outperforms prior dynamic scene modeling approaches in both reconstruction accuracy and simulation fidelity, opening a new avenue for video-based wind-object interaction modeling.
Chinese Translation
从视频观察中建模风驱动物体的动力学具有很高的挑战性,这主要由于风的不可见性和时空变异性,以及物体的复杂变形。我们提出了DiffWind,一个基于物理的可微框架,统一了风与物体的交互建模、基于视频的重建和前向模拟。具体而言,我们将风表示为基于网格的物理场,将物体表示为源自3D高斯点云的粒子系统,其交互通过物质点法(Material Point Method, MPM)进行建模。为了恢复风驱动的物体动力学,我们引入了一个重建框架,通过可微渲染和模拟共同优化时空风力场和物体运动。为了确保物理有效性,我们将格子玻尔兹曼方法(Lattice Boltzmann Method, LBM)作为基于物理的约束,强制遵循流体动力学定律。除了重建之外,我们的方法自然支持在新风条件下的前向模拟,并启用诸如风重定向等新应用。我们进一步引入了WD-Objects,一个包含合成和真实世界风驱动场景的数据集。大量实验表明,我们的方法在重建精度和模拟保真度上显著优于先前的动态场景建模方法,为基于视频的风与物体交互建模开辟了新的途径。
cs.CV / 110 / 2603.09673
VarSplat: Uncertainty-aware 3D Gaussian Splatting for Robust RGB-D SLAM
VarSplat:面向鲁棒RGB-D SLAM的不确定性感知3D高斯点云渲染
Abstract
Simultaneous Localization and Mapping (SLAM) with 3D Gaussian Splatting (3DGS) enables fast, differentiable rendering and high-fidelity reconstruction across diverse real-world scenes. However, existing 3DGS-SLAM approaches handle measurement reliability implicitly, making pose estimation and global alignment susceptible to drift in low-texture regions, transparent surfaces, or areas with complex reflectance properties. To this end, we introduce VarSplat, an uncertainty-aware 3DGS-SLAM system that explicitly learns per-splat appearance variance. By using the law of total variance with alpha compositing, we then render differentiable per-pixel uncertainty map via efficient, single-pass rasterization. This map guides tracking, submap registration, and loop detection toward focusing on reliable regions and contributes to more stable optimization. Experimental results on Replica (synthetic) and TUM-RGBD, ScanNet, and ScanNet++ (real-world) show that VarSplat improves robustness and achieves competitive or superior tracking, mapping, and novel view synthesis rendering compared to existing studies for dense RGB-D SLAM.
Chinese Translation
同时定位与地图构建(SLAM)结合3D高斯点云渲染(3DGS)能够在多样的真实场景中实现快速、可微分的渲染和高保真重建。然而,现有的3DGS-SLAM方法隐式地处理测量可靠性,使得在低纹理区域、透明表面或具有复杂反射特性的区域,姿态估计和全局对齐容易出现漂移。为此,我们提出了VarSplat,一个考虑不确定性的3DGS-SLAM系统,它显式地学习每个高斯点的外观方差。通过将全方差公式与α合成相结合,我们利用高效的单遍历光栅化渲染出可微分的每像素不确定性图。这张图引导跟踪、子图配准和回环检测聚焦于可靠区域,并有助于更稳定的优化。在Replica(合成数据集)和TUM-RGBD、ScanNet及ScanNet++(真实数据集)上的实验结果表明,VarSplat提高了鲁棒性,并在密集RGB-D SLAM的跟踪、地图构建和新视图合成渲染方面达到了与现有研究相当或更优的表现。
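The "law of total variance with alpha compositing" step can be made concrete with a one-ray sketch: total uncertainty is the composited within-splat variance plus the variance of the splat means around the rendered value. This is an illustration of the idea, not VarSplat's rasterizer; normalizing the compositing weights to sum to 1 is an assumption of the sketch:

```python
import numpy as np

def composited_uncertainty(mu, var, alpha):
    """Per-pixel uncertainty via the law of total variance.

    Front-to-back weights: w_i = alpha_i * prod_{j<i}(1 - alpha_j).
    With weights normalized to 1:
        Var = sum_i w_i * var_i            (expected within-splat variance)
            + sum_i w_i * (mu_i - mu_bar)^2  (variance of splat means)
    """
    T = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    w = alpha * T
    w = w / w.sum()                      # normalization (sketch assumption)
    mu_bar = np.sum(w * mu)              # rendered expectation
    total_var = np.sum(w * var) + np.sum(w * (mu - mu_bar) ** 2)
    return mu_bar, total_var

mu = np.array([0.2, 0.8, 0.5])       # per-splat appearance means along a ray
var = np.array([0.01, 0.04, 0.02])   # learned per-splat appearance variances
alpha = np.array([0.5, 0.6, 0.9])
mean, uncert = composited_uncertainty(mu, var, alpha)
```

Because both terms are sums over the same compositing weights, the uncertainty map falls out of the same single-pass rasterization that renders the color.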
cs.CV / 111 / 2603.09681
Improving 3D Foot Motion Reconstruction in Markerless Monocular Human Motion Capture
改进无标记单目人体动作捕捉中的三维足部运动重建
Abstract
State-of-the-art methods can recover accurate overall 3D human body motion from in-the-wild videos. However, they often fail to capture fine-grained articulations, especially in the feet, which are critical for applications such as gait analysis and animation. This limitation results from training datasets with inaccurate foot annotations and limited foot motion diversity. We address this gap with FootMR, a Foot Motion Refinement method that refines foot motion estimated by an existing human recovery model through lifting 2D foot keypoint sequences to 3D. By avoiding direct image input, FootMR circumvents inaccurate image-3D annotation pairs and can instead leverage large-scale motion capture data. To resolve ambiguities of 2D-to-3D lifting, FootMR incorporates knee and foot motion as context and predicts only residual foot motion. Generalization to extreme foot poses is further improved by representing joints in global rather than parent-relative rotations and applying extensive data augmentation. To support evaluation of foot motion reconstruction, we introduce MOOF, a 2D dataset of complex foot movements. Experiments on MOOF, MOYO, and RICH show that FootMR outperforms state-of-the-art methods, reducing ankle joint angle error on MOYO by up to 30% over the best video-based approach.
Chinese Translation
最先进的方法能够从野外视频中恢复准确的整体三维人体运动。然而,它们往往无法捕捉到细致的关节运动,尤其是在足部,而这对于步态分析和动画等应用至关重要。这一局限性源于训练数据集中足部注释的不准确以及足部运动多样性的不足。我们通过FootMR(一种足部运动精炼方法)来解决这一问题,该方法通过将二维足部关键点序列提升到三维,来精炼现有人体恢复模型估计的足部运动。通过避免直接的图像输入,FootMR规避了不准确的图像-三维注释对,转而可以利用大规模的运动捕捉数据。为了解决二维到三维提升的模糊性,FootMR将膝盖和足部运动作为上下文,并仅预测残余的足部运动。通过在全局而非父级相对旋转中表示关节,并应用广泛的数据增强,FootMR进一步提高了对极端足部姿势的泛化能力。为了支持足部运动重建的评估,我们引入了MOOF,一个包含复杂足部运动的二维数据集。在MOOF、MOYO和RICH上的实验表明,FootMR的表现优于最先进的方法,在MOYO上将踝关节角度误差较最佳的基于视频的方法降低了多达30%。
cs.CV / 112 / 2603.09689
AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering
AutoViVQA:一个大规模自动构建的越南语视觉问答数据集
Abstract
Visual Question Answering (VQA) is a fundamental multimodal task that requires models to jointly understand visual and textual information. Early VQA systems relied heavily on language biases, motivating subsequent work to emphasize visual grounding and balanced datasets. With the success of large-scale pre-trained transformers for both text and vision domains -- such as PhoBERT for Vietnamese language understanding and Vision Transformers (ViT) for image representation learning -- multimodal fusion has achieved remarkable progress. For Vietnamese VQA, several datasets have been introduced to promote research in low-resource multimodal learning, including ViVQA, OpenViVQA, and the recently proposed ViTextVQA. These resources enable benchmarking of models that integrate linguistic and visual features in the Vietnamese context. Evaluation of VQA systems often employs automatic metrics originally designed for image captioning or machine translation, such as BLEU, METEOR, CIDEr, Recall, Precision, and F1-score. However, recent research suggests that large language models can further improve the alignment between automatic evaluation and human judgment in VQA tasks. In this work, we explore Vietnamese Visual Question Answering using transformer-based architectures, leveraging both textual and visual pre-training while systematically comparing automatic evaluation metrics under multilingual settings.
Chinese Translation
视觉问答(VQA)是一项基本的多模态任务,要求模型共同理解视觉和文本信息。早期的VQA系统严重依赖语言偏差,这促使后续研究强调视觉定位和均衡数据集。随着大规模预训练变换器在文本和视觉领域的成功——例如用于越南语语言理解的PhoBERT和用于图像表示学习的视觉变换器(Vision Transformers, ViT)——多模态融合取得了显著进展。为了促进低资源多模态学习的研究,已经引入了多个越南语VQA数据集,包括ViVQA、OpenViVQA和最近提出的ViTextVQA。这些资源使得在越南语语境中对整合语言和视觉特征的模型进行基准测试成为可能。VQA系统的评估通常采用最初为图像描述或机器翻译设计的自动评估指标,如BLEU、METEOR、CIDEr、召回率、精确率和F1-score。然而,最近的研究表明,大型语言模型可以进一步改善VQA任务中自动评估与人类判断的一致性。在本研究中,我们探索了基于变换器架构的越南语视觉问答,利用文本和视觉的预训练,同时在多语言环境下系统地比较自动评估指标。
cs.CV / 113 / 2603.09696
TemporalDoRA: Temporal PEFT for Robust Surgical Video Question Answering
TemporalDoRA:用于稳健外科视频问答的时间参数高效微调
Abstract
Surgical Video Question Answering (VideoQA) requires accurate temporal grounding while remaining robust to natural variation in how clinicians phrase questions, where linguistic bias can arise. Standard Parameter Efficient Fine Tuning (PEFT) methods adapt pretrained projections without explicitly modeling frame-to-frame interactions within the adaptation pathway, limiting their ability to exploit sparse temporal evidence. We introduce TemporalDoRA, a video-specific PEFT formulation that extends Weight-Decomposed Low-Rank Adaptation by (i) inserting lightweight temporal Multi-Head Attention (MHA) inside the low-rank bottleneck of the vision encoder and (ii) selectively applying weight decomposition only to the trainable low-rank branch rather than the full adapted weight. This design enables temporally-aware updates while preserving a frozen backbone and stable scaling. By mixing information across frames within the adaptation subspace, TemporalDoRA steers updates toward temporally consistent visual cues and improves robustness with minimal parameter overhead. To benchmark this setting, we present REAL-Colon-VQA, a colonoscopy VideoQA dataset with 6,424 clip--question pairs, including paired rephrased Out-of-Template questions to evaluate sensitivity to linguistic variation. TemporalDoRA improves Out-of-Template performance, and ablation studies confirm that temporal mixing inside the low-rank branch is the primary driver of these gains. We also validate on EndoVis18-VQA adapted to short clips and observe consistent improvements on the Out-of-Template split. Code and dataset available at~\href{https://anonymous.4open.science/r/TemporalDoRA-BFC8/}{Anonymous GitHub}.
Chinese Translation
外科视频问答(VideoQA)需要准确的时间定位,同时对临床医生提问方式的自然变异保持稳健性,而这种变异可能引入语言偏差。标准的参数高效微调(PEFT)方法在调整预训练投影时,并未在适应路径中显式建模帧间交互,从而限制了其利用稀疏时间证据的能力。我们提出了TemporalDoRA,一种视频特定的PEFT形式,它通过(i)在视觉编码器的低秩瓶颈中插入轻量级的时间多头注意力(MHA),以及(ii)仅对可训练的低秩分支而非整个适应后的权重应用权重分解,来扩展权重分解低秩适应(Weight-Decomposed Low-Rank Adaptation)。这种设计在保持冻结主干和稳定缩放的同时,实现了时间感知的更新。通过在适应子空间内跨帧混合信息,TemporalDoRA将更新引导至时间一致的视觉线索,并以最小的参数开销提高了稳健性。为了基准测试这一设置,我们提出了REAL-Colon-VQA,一个包含6,424个剪辑-问题对的结肠镜视频问答数据集,其中包括成对改写的模板外问题,以评估对语言变异的敏感性。TemporalDoRA提高了模板外性能,消融研究确认低秩分支内的时间混合是这些增益的主要驱动因素。我们还在适配为短剪辑的EndoVis18-VQA上进行了验证,并观察到在模板外划分上持续的改进。代码和数据集可在 https://anonymous.4open.science/r/TemporalDoRA-BFC8/ 获取。
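The architectural idea, temporal attention inserted inside a low-rank adaptation bottleneck so frames mix only in the rank-r subspace, can be sketched with a single-head, projection-free toy. This is an illustration under simplifying assumptions (no learned attention projections, no weight decomposition), not TemporalDoRA itself:

```python
import numpy as np

def temporal_lora_update(x, W0, A, B):
    """Frozen layer W0 plus a low-rank branch with temporal mixing.

    x: (T, d) per-frame features. Output = x @ W0.T + attend(x @ A.T) @ B.T,
    so frame-to-frame interaction happens only in the r-dim bottleneck.
    """
    z = x @ A.T                                   # (T, r) bottleneck codes
    s = z @ z.T / np.sqrt(z.shape[1])             # frame-to-frame scores
    w = np.exp(s - s.max(axis=1, keepdims=True))  # softmax over frames
    w /= w.sum(axis=1, keepdims=True)
    z_mixed = w @ z                               # temporal mixing
    return x @ W0.T + z_mixed @ B.T

rng = np.random.default_rng(1)
T, d, r = 5, 16, 4
x = rng.normal(size=(T, d))
W0 = rng.normal(size=(d, d))
A = rng.normal(size=(r, d)) * 0.1
B = np.zeros((d, r))   # zero-init up-projection: branch is a no-op at start
y = temporal_lora_update(x, W0, A, B)
```

Zero-initializing `B` follows the standard LoRA convention: at the start of fine-tuning the adapted layer reproduces the frozen backbone exactly, and temporal mixing is learned gradually.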
cs.CV / 114 / 2603.09702
TriFusion-SR: Joint Tri-Modal Medical Image Fusion and SR
TriFusion-SR:联合三模态医学图像融合与超分辨率
Abstract
Multimodal medical image fusion facilitates comprehensive diagnosis by aggregating complementary structural and functional information, but its effectiveness is limited by resolution degradation and modality discrepancies. Existing approaches typically perform image fusion and super-resolution (SR) in separate stages, leading to artifacts and degraded perceptual quality. These limitations are further amplified in tri-modal settings that combine anatomical modalities (e.g., MRI, CT) with functional scans (e.g., PET, SPECT) due to pronounced frequency-domain imbalances. We propose TriFusion-SR, a wavelet-guided conditional diffusion framework for joint tri-modal fusion and SR. The framework explicitly decomposes multimodal features into frequency bands using the 2D Discrete Wavelet Transform, enabling frequency-aware cross-modal interaction. We further introduce a Rectified Wavelet Features (RWF) strategy for latent coefficient calibration, followed by an Adaptive Spatial-Frequency Fusion (ASFF) module with gated channel-spatial attention to enable structure-driven multimodal refinement. Extensive experiments demonstrate state-of-the-art performance, achieving 4.8-12.4% PSNR improvement and substantial reductions in RMSE and LPIPS across multiple upsampling scales.
Chinese Translation
多模态医学图像融合通过聚合互补的结构和功能信息促进全面诊断,但其有效性受到分辨率下降和模态差异的限制。现有方法通常在不同阶段进行图像融合和超分辨率(SR),导致伪影和感知质量下降。在结合解剖模态(如MRI、CT)与功能扫描(如PET、SPECT)的三模态设置中,由于频域不平衡,这些限制进一步加剧。我们提出了TriFusion-SR,一种基于小波引导的条件扩散框架,用于联合三模态融合和超分辨率。该框架使用二维离散小波变换(2D Discrete Wavelet Transform)显式地将多模态特征分解为频带,从而实现频率感知的跨模态交互。我们进一步引入了一种用于潜在系数校准的整流小波特征(Rectified Wavelet Features, RWF)策略,随后采用带有门控通道-空间注意力的自适应空间-频率融合(Adaptive Spatial-Frequency Fusion, ASFF)模块,以实现结构驱动的多模态精细化。大量实验表明,该方法在多个上采样尺度上实现了最先进的性能,PSNR提高了4.8%-12.4%,并显著降低了RMSE和LPIPS。
cs.CV / 115 / 2603.09703
ProGS: Towards Progressive Coding for 3D Gaussian Splatting
ProGS:面向3D高斯点云的渐进编码
Abstract
With the emergence of 3D Gaussian Splatting (3DGS), numerous pioneering efforts have been made to address the effective compression of massive 3DGS data. 3DGS offers an efficient and scalable representation of 3D scenes by utilizing learnable 3D Gaussians, but the large size of the generated data poses significant challenges for storage and transmission. Existing methods, however, have been limited by their inability to support progressive coding, a crucial feature in streaming applications with varying bandwidth. To tackle this limitation, this paper introduces a novel approach that organizes 3DGS data into an octree structure, enabling efficient progressive coding. The proposed ProGS is a streaming-friendly codec that facilitates progressive coding for 3D Gaussian splatting and significantly improves both compression efficiency and visual fidelity. The proposed method incorporates mutual information enhancement mechanisms to mitigate structural redundancy, leveraging the relevance between nodes in the octree hierarchy. By adapting the octree structure and dynamically adjusting the anchor nodes, ProGS ensures scalable data compression without compromising rendering quality. ProGS achieves a remarkable 45X reduction in file storage compared to the original 3DGS format, while simultaneously improving visual performance by over 10%. This demonstrates that ProGS can provide a robust solution for real-time applications with varying network conditions.
Chinese Translation
随着3D高斯点云(3DGS)的出现,许多开创性的工作致力于解决海量3DGS数据的有效压缩问题。3DGS通过利用可学习的3D高斯函数,提供了一种高效且可扩展的3D场景表示,但生成数据的庞大体积对存储和传输带来了重大挑战。然而,现有方法受到限制,无法支持渐进编码,这在带宽变化的流媒体应用中是一个至关重要的特性。为了解决这一限制,本文提出了一种新颖的方法,将3DGS数据组织成八叉树结构,从而实现高效的渐进编码。所提出的ProGS是一种适合流媒体的编解码器,促进了3D高斯点云的渐进编码,并显著提高了压缩效率和视觉保真度。该方法结合了互信息增强机制,以减轻结构冗余,利用八叉树层次中节点之间的相关性。通过调整八叉树结构和动态调整锚节点,ProGS确保了可扩展的数据压缩,而不影响渲染质量。与原始3DGS格式相比,ProGS在文件存储方面实现了惊人的45倍减少,同时视觉性能提高超过10%。这表明ProGS能够为具有不同网络条件的实时应用提供强有力的解决方案。
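A minimal sketch of the octree organization that makes progressive coding possible: each point (e.g., a Gaussian center) gets a per-level child index, so transmitting path prefixes first yields a coarse-to-fine stream. The one-bit-per-axis packing is a common convention assumed here; ProGS's anchor-node adaptation and mutual-information enhancement are not modeled.

```python
def octree_levels(points, depth, lo=0.0, hi=1.0):
    """Assign each 3D point in [lo, hi)^3 an octree path.

    At every level the current cell splits into 8 children; the child
    index packs one bit per axis. Decoding only a prefix of each path
    already places the point in a coarser cell, which is exactly the
    coarse-to-fine (progressive) property a streaming codec needs.
    """
    paths = []
    for p in points:
        cell_lo, cell_hi = [lo] * 3, [hi] * 3
        path = []
        for _ in range(depth):
            idx = 0
            for axis in range(3):
                mid = 0.5 * (cell_lo[axis] + cell_hi[axis])
                if p[axis] >= mid:
                    idx |= 1 << axis       # upper half on this axis
                    cell_lo[axis] = mid
                else:
                    cell_hi[axis] = mid
            path.append(idx)
        paths.append(path)
    return paths
```

A point near the corner (0.9, 0.1, 0.9) repeatedly lands in child 5 (bits for the x and z axes), while a point near the origin stays in child 0 at every level.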
cs.CV / 116 / 2603.09718
GSStream: 3D Gaussian Splatting based Volumetric Scene Streaming System
GSStream:基于3D高斯点云的体积场景流媒体系统
Abstract
Recently, the 3D Gaussian splatting (3DGS) technique for real-time radiance field rendering has revolutionized the field of volumetric scene representation, providing users with an immersive experience. However, it also produces a large volume of data, which is extremely bandwidth-intensive. Researchers have introduced various approaches and constructed multiple 3DGS variants to obtain more compact scene representations, but real-time distribution remains challenging. In this paper, we propose GSStream, a novel volumetric scene streaming system that supports the 3DGS data format. Specifically, GSStream integrates a collaborative viewport prediction module, which learns collaborative and historical priors from multiple users' viewport sequences to better predict future behavior, and a deep reinforcement learning (DRL)-based bitrate adaptation module, which tackles the state- and action-space variability of the bitrate adaptation problem, achieving efficient volumetric scene delivery. Moreover, we build the first user viewport trajectory dataset for volumetric scenes to support training and streaming simulation. Extensive experiments show that our proposed GSStream system outperforms existing representative volumetric scene streaming systems in visual quality and network usage. Demo video: https://youtu.be/3WEe8PN8yvA.
Chinese Translation
近年来,3D高斯点云(3DGS)技术在实时辐射场渲染方面的应用彻底改变了体积场景表示领域,为用户提供了沉浸式体验。然而,这也带来了大量的数据量,极其依赖带宽。前沿研究者尝试引入不同的方法并构建多种变体,以获得更紧凑的场景表示,但实时分发仍然具有挑战性。本文提出了GSStream,一种新颖的体积场景流媒体系统,以支持3DGS数据格式。具体而言,GSStream集成了一个协作视口预测模块,通过学习来自多个用户的协作先验和历史先验以及用户的视口序列,更好地预测用户的未来行为,并且还集成了基于深度强化学习(DRL)的比特率自适应模块,以应对比特率自适应问题的状态和动作空间变异性挑战,实现高效的体积场景传输。此外,我们首次构建了一个用户视口轨迹数据集,以支持体积场景的训练和流媒体仿真。大量实验证明,我们提出的GSStream系统在视觉质量和网络使用方面优于现有的代表性体积场景流媒体系统。演示视频:https://youtu.be/3WEe8PN8yvA。
cs.CV / 117 / 2603.09721
FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation
FrameDiT:具有帧级矩阵注意力的扩散变换器用于高效视频生成
Abstract
High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of spatio-temporal tokens, which can be modeled using Diffusion Transformers (DiTs). However, this approach faces a trade-off between the strong but expensive Full 3D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion. Extensive experiments show that FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.
Chinese Translation
高保真视频生成对于扩散模型仍然具有挑战性,因为高效建模复杂的时空动态十分困难。最近的视频扩散方法通常将视频表示为一系列时空标记,这些标记可以使用扩散变换器(Diffusion Transformers, DiTs)进行建模。然而,这种方法在强大但昂贵的全3D注意力(Full 3D Attention)与高效但时间有限的局部因式分解注意力(Local Factorized Attention)之间存在权衡。为了解决这一权衡,我们提出了矩阵注意力(Matrix Attention),这是一种帧级时间注意力机制,它将整个帧处理为一个矩阵,并通过矩阵原生操作生成查询、键和值矩阵。通过跨帧而非标记进行关注,矩阵注意力有效地保留了全局时空结构,并适应显著的运动。我们构建了基于矩阵注意力的DiT架构FrameDiT-G,并进一步引入FrameDiT-H,该架构将矩阵注意力与局部因式分解注意力相结合,以捕捉大运动和小运动。大量实验表明,FrameDiT-H在多个视频生成基准测试中实现了最先进的结果,提供了更好的时间一致性和视频质量,同时保持了与局部因式分解注意力相当的效率。
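The core idea of attending across frames rather than tokens can be sketched in pure Python: each frame is a matrix, frames are scored against each other as wholes, and the softmax runs over the frame axis. Using a Frobenius inner product with identity query/key/value maps is a deliberate simplification; the paper's matrix-native projections are assumptions not specified by the abstract.

```python
import math

def frame_attention(frames):
    """Frame-level attention sketch over a list of equal-shape 2D frames.

    For every query frame, score all frames with a Frobenius inner
    product, softmax over *frames* (not over tokens within a frame),
    and return the attention-weighted sum of frames.
    """
    def frob(A, B):
        # Frobenius inner product: sum of element-wise products.
        return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

    out = []
    for q in frames:
        scores = [frob(q, k) for k in frames]
        mx = max(scores)                       # stabilize the softmax
        w = [math.exp(s - mx) for s in scores]
        z = sum(w)
        w = [x / z for x in w]
        h, wd = len(q), len(q[0])
        mixed = [[sum(w[t] * frames[t][i][j] for t in range(len(frames)))
                  for j in range(wd)] for i in range(h)]
        out.append(mixed)
    return out
```

With identical input frames the attention weights are uniform, so each output frame reproduces the input, as a sanity check.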
cs.CV / 118 / 2603.09731
EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning
EXPLORE-Bench:具备长远推理的自我中心场景预测
Abstract
Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, a model is asked to predict the final scene after all actions are executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs long action sequences with structured final-scene annotations, including object categories, visual attributes, and inter-object relations, which supports fine-grained, quantitative assessment. Experiments on a range of proprietary and open-source MLLMs reveal a significant performance gap to humans, indicating that long-horizon egocentric reasoning remains a major challenge. We further analyze test-time scaling via stepwise reasoning and show that decomposing long action sequences can improve performance to some extent, while incurring non-trivial computational overhead. Overall, EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning for egocentric embodied perception.
Chinese Translation
多模态大型语言模型(MLLMs)越来越被视为具身智能体的基础,然而尚不清楚它们是否能够从自我中心的视角可靠地推理行动的长期物理后果。我们通过一项新任务研究这一差距,即自我中心场景预测与长远推理(Egocentric Scene Prediction with LOng-horizon REasoning):给定初始场景图像和一系列原子行动描述,模型需预测在执行所有行动后最终的场景。为了实现系统评估,我们引入了EXPLORE-Bench,这是一个从真实第一人称视频中策划的基准,涵盖多种场景。每个实例将长行动序列与结构化的最终场景注释配对,包括物体类别、视觉属性和物体间关系,支持细粒度的定量评估。在一系列专有和开源的MLLMs上进行的实验显示,与人类相比存在显著的性能差距,表明长远的自我中心推理仍然是一个主要挑战。我们进一步分析了通过逐步推理进行测试时的扩展,并显示分解长行动序列在一定程度上可以提高性能,但会带来不容小觑的计算开销。总体而言,EXPLORE-Bench为测量和推动自我中心具身感知的长远推理提供了一个原则性的测试平台。
cs.CV / 119 / 2603.09733
FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis
FetalAgents:一种用于胎儿超声图像和视频分析的多代理系统
Abstract
Fetal ultrasound (US) is the primary imaging modality for prenatal screening, yet its interpretation relies heavily on the expertise of the clinician. Despite advances in deep learning and foundation models, existing automated tools for fetal US analysis struggle to balance task-specific accuracy with the whole-process versatility required to support end-to-end clinical workflows. To address these limitations, we propose FetalAgents, the first multi-agent system for comprehensive fetal US analysis. Through a lightweight, agentic coordination framework, FetalAgents dynamically orchestrates specialized vision experts to maximize performance across diagnosis, measurement, and segmentation. Furthermore, FetalAgents advances beyond static image analysis by supporting end-to-end video stream summarization, where keyframes are automatically identified across multiple anatomical planes, analyzed by coordinated experts, and synthesized with patient metadata into a structured clinical report. Extensive multi-center external evaluations across eight clinical tasks demonstrate that FetalAgents consistently delivers the most robust and accurate performance when compared against specialized models and multimodal large language models (MLLMs), ultimately providing an auditable, workflow-aligned solution for fetal ultrasound analysis and reporting.
Chinese Translation
胎儿超声(US)是产前筛查的主要成像方式,但其解读在很大程度上依赖于临床医生的专业知识。尽管深度学习和基础模型取得了进展,现有的胎儿超声分析自动化工具在任务特定准确性与支持端到端临床工作流程所需的全流程多功能性之间难以取得平衡。为了解决这些局限性,我们提出了FetalAgents,这是第一个用于全面胎儿超声分析的多代理系统。通过一个轻量级的代理协调框架,FetalAgents动态协调专业视觉专家,以最大化在诊断、测量和分割等方面的性能。此外,FetalAgents超越了静态图像分析,支持端到端的视频流摘要,在多个解剖平面中自动识别关键帧,由协调的专家进行分析,并与患者元数据合成成结构化的临床报告。通过对八个临床任务进行的广泛多中心外部评估表明,FetalAgents在与专门模型和多模态大型语言模型(MLLMs)比较时,始终提供最稳健和准确的性能,最终为胎儿超声分析和报告提供了一个可审计的、与工作流程对齐的解决方案。
cs.CV / 120 / 2603.09737
$M^2$-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs
$M^2$-Occ:针对不完整摄像头输入的自主驾驶弹性3D语义占用预测
Abstract
Semantic occupancy prediction enables dense 3D geometric and semantic understanding for autonomous driving. However, existing camera-based approaches implicitly assume complete surround-view observations, an assumption that rarely holds in real-world deployment due to occlusion, hardware malfunction, or communication failures. We study semantic occupancy prediction under incomplete multi-camera inputs and introduce $M^2$-Occ, a framework designed to preserve geometric structure and semantic coherence when views are missing. $M^2$-Occ addresses two complementary challenges. First, a Multi-view Masked Reconstruction (MMR) module leverages the spatial overlap among neighboring cameras to recover missing-view representations directly in the feature space. Second, a Feature Memory Module (FMM) introduces a learnable memory bank that stores class-level semantic prototypes. By retrieving and integrating these global priors, the FMM refines ambiguous voxel features, ensuring semantic consistency even when observational evidence is incomplete. We introduce a systematic missing-view evaluation protocol on the nuScenes-based SurroundOcc benchmark, encompassing both deterministic single-view failures and stochastic multi-view dropout scenarios. Under the safety-critical missing back-view setting, $M^2$-Occ improves the IoU by 4.93%. As the number of missing cameras increases, the robustness gap further widens; for instance, under the setting with five missing views, our method boosts the IoU by 5.01%. These gains are achieved without compromising full-view performance. The source code will be publicly released at https://github.com/qixi7up/M2-Occ.
Chinese Translation
语义占用预测使自主驾驶能够实现密集的3D几何和语义理解。然而,现有的基于摄像头的方法隐含假设完整的全景观察,这一假设在实际应用中由于遮挡、硬件故障或通信失败而很少成立。我们研究了在不完整多摄像头输入下的语义占用预测,并提出了$M^2$-Occ,一个旨在在视角缺失时保持几何结构和语义一致性的框架。$M^2$-Occ解决了两个互补的挑战。首先,一个多视角掩码重建(Multi-view Masked Reconstruction, MMR)模块利用邻近摄像头之间的空间重叠,直接在特征空间中恢复缺失视角的表示。其次,一个特征记忆模块(Feature Memory Module, FMM)引入了一个可学习的记忆库,用于存储类别级语义原型。通过检索和整合这些全局先验,FMM细化模糊的体素特征,确保即使在观测证据不完整时也能保持语义一致性。我们在基于nuScenes的SurroundOcc基准上引入了一种系统的缺失视角评估协议,涵盖了确定性的单视角故障和随机的多视角丢失场景。在安全关键的缺失后视角设置下,$M^2$-Occ将IoU提高了4.93%。随着缺失摄像头数量的增加,鲁棒性差距进一步扩大;例如,在五个缺失视角的设置下,我们的方法将IoU提升了5.01%。这些提升是在不影响全视角性能的情况下实现的。源代码将公开发布在https://github.com/qixi7up/M2-Occ。
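The missing-view evaluation protocol described above has two regimes that are easy to make concrete: a deterministic single-view failure (e.g., the safety-critical back camera) and stochastic multi-view dropout. The camera names below approximate a six-camera surround rig like nuScenes and are illustrative, not the paper's exact identifiers.

```python
import random

# Hypothetical surround-view rig, in the spirit of nuScenes.
CAMERAS = ["front", "front_left", "front_right", "back", "back_left", "back_right"]

def deterministic_failure(missing):
    """Deterministic protocol: one fixed camera is always unavailable."""
    return [c for c in CAMERAS if c != missing]

def stochastic_dropout(k, seed=None):
    """Stochastic protocol: randomly drop k of the six surround views."""
    rng = random.Random(seed)
    dropped = set(rng.sample(CAMERAS, k))
    return [c for c in CAMERAS if c not in dropped]
```

Sweeping `k` from 1 to 5 reproduces the setting in which the abstract reports a widening robustness gap as more cameras go missing.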
cs.CV / 121 / 2603.09741
ENIGMA-360: An Ego-Exo Dataset for Human Behavior Understanding in Industrial Scenarios
ENIGMA-360:用于工业场景中人类行为理解的自我-外部数据集
Abstract
Understanding human behavior from complementary egocentric (ego) and exocentric (exo) points of view enables the development of systems that can support workers in industrial environments and enhance their safety. However, progress in this area is hindered by the lack of datasets capturing both views in realistic industrial scenarios. To address this gap, we propose ENIGMA-360, a new ego-exo dataset acquired in a real industrial scenario. The dataset is composed of 180 egocentric and 180 exocentric procedural videos, temporally synchronized to offer complementary information about the same scene. The 360 videos have been labeled with temporal and spatial annotations, enabling the study of different aspects of human behavior in the industrial domain. We provide baseline experiments for three foundational tasks in human behavior understanding: 1) Temporal Action Segmentation, 2) Keystep Recognition, and 3) Egocentric Human-Object Interaction Detection, showing the limits of state-of-the-art approaches in this challenging scenario. These results highlight the need for new models capable of robust ego-exo understanding in real-world environments. We publicly release the dataset and its annotations at https://iplab.dmi.unict.it/ENIGMA-360.
Chinese Translation
从互补的自我中心(ego)和外部中心(exo)视角理解人类行为,可以促进支持工业环境中工人的系统开发,并增强他们的安全性。然而,该领域的进展受到缺乏能够捕捉这两种视角的现实工业场景数据集的限制。为了解决这一问题,我们提出了ENIGMA-360,这是一个在真实工业场景中获取的新自我-外部数据集。该数据集由180个自我中心和180个外部中心的程序视频组成,时间上同步,提供同一场景的互补信息。这360个视频已标注了时间和空间注释,使得研究工业领域中人类行为的不同方面成为可能。我们为人类行为理解的三个基础任务提供了基准实验:1)时间动作分割,2)关键步骤识别和3)自我中心人类-物体交互检测,展示了当前最先进方法在这一挑战性场景中的局限性。这些结果突显了在真实环境中需要新的模型,以实现稳健的自我-外部理解。我们将在 https://iplab.dmi.unict.it/ENIGMA-360 上公开发布该数据集及其注释。
cs.CV / 122 / 2603.09743
LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos
LAP:一种面向语言的程序规划模型用于教学视频中的程序规划
Abstract
Procedure planning requires a model to predict a sequence of actions that transforms a start visual observation into a goal in instructional videos. While most existing methods rely primarily on visual observations as input, they often struggle with the inherent ambiguity that different actions can appear visually similar. In this work, we argue that language descriptions offer a more distinctive representation in the latent space for procedure planning. We introduce Language-Aware Planning (LAP), a novel method that leverages the expressiveness of language to bridge visual observation and planning. LAP uses a finetuned Vision Language Model (VLM) to translate visual observations into text descriptions, predict actions, and extract text embeddings. These text embeddings are more distinctive than visual embeddings and are used in a diffusion model for planning action sequences. We evaluate LAP on three procedure planning benchmarks: CrossTask, Coin, and NIV. LAP achieves new state-of-the-art performance across multiple metrics and time horizons by a large margin, demonstrating the significant advantage of language-aware planning.
Chinese Translation
程序规划需要一个模型来预测将起始视觉观察转变为目标的动作序列。在教学视频中,尽管大多数现有方法主要依赖视觉观察作为输入,但它们往往在不同动作可能在视觉上相似的固有模糊性方面面临挑战。在本研究中,我们认为语言描述在潜在空间中提供了更具辨识度的表示,适用于程序规划。我们提出了语言感知规划(Language-Aware Planning,LAP),这是一种新颖的方法,利用语言的表达能力来桥接视觉观察与规划。LAP使用经过微调的视觉语言模型(Vision Language Model,VLM)将视觉观察转换为文本描述,并预测动作和提取文本嵌入。这些文本嵌入比视觉嵌入更具辨识度,并用于扩散模型中以规划动作序列。我们在三个程序规划基准上评估LAP:CrossTask、Coin和NIV。LAP在多个指标和时间范围内以较大幅度实现了新的最先进性能,展示了语言感知规划的显著优势。
cs.CV / 123 / 2603.09759
LogoDiffuser: Training-Free Multilingual Logo Generation and Stylization via Letter-Aware Attention Control
LogoDiffuser:无训练的多语言标志生成与风格化通过字母感知注意力控制
Abstract
Recent advances in text-to-image generation have been remarkable, but generating multilingual design logos that harmoniously integrate visual and textual elements remains a challenging task. Existing methods often distort character geometry when applying creative styles and struggle to support multilingual text generation without additional training. To address these challenges, we propose LogoDiffuser, a training-free method that synthesizes multilingual logo designs using the multimodal diffusion transformer. Instead of using textual prompts, we input the target characters as images, enabling robust character structure control regardless of language. We first analyze the joint attention mechanism to identify core tokens, which are tokens that strongly respond to textual structures. With this observation, our method integrates character structure and visual design by injecting the most informative attention maps. Furthermore, we perform layer-wise aggregation of attention maps to mitigate attention shifts across layers and obtain consistent core tokens. Extensive experiments and user studies demonstrate that our method achieves state-of-the-art performance in multilingual logo generation.
Chinese Translation
近年来文本到图像生成的进展显著,但生成能够和谐地整合视觉与文本元素的多语言设计标志仍然是一项具有挑战性的任务。现有方法在应用创意风格时常常扭曲字符几何形状,并且在没有额外训练的情况下难以支持多语言文本生成。为了解决这些挑战,我们提出了LogoDiffuser,这是一种无训练的方法,利用多模态扩散变换器合成多语言标志设计。我们不使用文本提示,而是将目标字符作为图像输入,从而能够在任何语言下实现稳健的字符结构控制。我们首先分析联合注意力机制,以识别核心标记,即对文本结构强烈响应的标记。基于这一观察,我们的方法通过注入最具信息量的注意力图,将字符结构与视觉设计相结合。此外,我们对注意力图进行逐层聚合,以减轻层间的注意力转移,并获得一致的核心标记。大量实验和用户研究表明,我们的方法在多语言标志生成中达到了最先进的性能。
cs.CV / 124 / 2603.09760
PanoAffordanceNet: Towards Holistic Affordance Grounding in 360{\deg} Indoor Environments
PanoAffordanceNet:面向360度室内环境的整体可供性基础
Abstract
Global perception is essential for embodied agents in 360{\deg} spaces, yet current affordance grounding remains largely object-centric and restricted to perspective views. To bridge this gap, we introduce a novel task: Holistic Affordance Grounding in 360{\deg} Indoor Environments. This task faces unique challenges, including severe geometric distortions from Equirectangular Projection (ERP), semantic dispersion, and cross-scale alignment difficulties. We propose PanoAffordanceNet, an end-to-end framework featuring a Distortion-Aware Spectral Modulator (DASM) for latitude-dependent calibration and an Omni-Spherical Densification Head (OSDH) to restore topological continuity from sparse activations. By integrating multi-level constraints comprising pixel-wise, distributional, and region-text contrastive objectives, our framework effectively suppresses semantic drift under low supervision. Furthermore, we construct 360-AGD, the first high-quality panoramic affordance grounding dataset. Extensive experiments demonstrate that PanoAffordanceNet significantly outperforms existing methods, establishing a solid baseline for scene-level perception in embodied intelligence. The source code and benchmark dataset will be made publicly available at https://github.com/GL-ZHU925/PanoAffordanceNet.
Chinese Translation
全球感知对于360度空间中的具身智能体至关重要,但目前的可供性基础仍然主要集中于物体,并且受到视角的限制。为了解决这一问题,我们提出了一项新任务:在360度室内环境中进行整体可供性基础。该任务面临独特的挑战,包括来自等距投影(Equirectangular Projection, ERP)的严重几何失真、语义分散以及跨尺度对齐困难。我们提出了PanoAffordanceNet,一个端到端框架,具有一个用于纬度依赖校准的失真感知光谱调制器(Distortion-Aware Spectral Modulator, DASM)和一个用于从稀疏激活恢复拓扑连续性的全球体稠密化头(Omni-Spherical Densification Head, OSDH)。通过整合像素级、分布式和区域-文本对比目标等多层次约束,我们的框架有效抑制了低监督下的语义漂移。此外,我们构建了360-AGD,这是第一个高质量的全景可供性基础数据集。大量实验表明,PanoAffordanceNet显著优于现有方法,为具身智能的场景级感知建立了坚实的基线。源代码和基准数据集将公开发布于 https://github.com/GL-ZHU925/PanoAffordanceNet。
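The ERP distortion that the DASM module calibrates has a simple geometric root: in an equirectangular panorama, a pixel row at latitude phi covers an area proportional to cos(phi), so near-pole rows are heavily over-represented. A sketch of that latitude-dependent weight, under the standard ERP parameterization (an assumption; the paper's spectral modulator is learned, not this closed form):

```python
import math

def erp_row_weights(height):
    """Per-row area weight for an equirectangular (ERP) panorama.

    Row i spans latitudes near phi = pi*(i + 0.5)/height - pi/2; its true
    spherical area is proportional to cos(phi), so equator rows get weight
    near 1 and pole rows near 0. Such weights are the usual starting point
    for any latitude-dependent calibration.
    """
    weights = []
    for i in range(height):
        phi = math.pi * (i + 0.5) / height - math.pi / 2  # row-center latitude
        weights.append(math.cos(phi))
    return weights
```

The weights are symmetric about the equator and grow from the poles toward the middle rows.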
cs.CV / 125 / 2603.09771
Ego: Embedding-Guided Personalization of Vision-Language Models
Ego:基于嵌入的视觉-语言模型个性化
Abstract
AI assistants that support humans in daily life are becoming increasingly feasible, driven by the rapid advancements in multimodal language models. A key challenge lies in overcoming the generic nature of these models to deliver personalized experiences. Existing approaches to personalizing large vision language models often rely on additional training stages, which limit generality and scalability, or on engineered pipelines with external pre-trained modules, which hinder deployment efficiency. In this work, we propose an efficient personalization method that leverages the model's inherent ability to capture personalized concepts. Specifically, we extract visual tokens that predominantly represent the target concept by utilizing the model's internal attention mechanisms. These tokens serve as a memory of that specific concept, enabling the model to recall and describe it when it appears in test images. We conduct a comprehensive and unified evaluation of our approach and SOTA methods across various personalization settings including single-concept, multi-concept, and video personalization, demonstrating strong performance gains with minimal personalization overhead.
Chinese Translation
支持人类日常生活的人工智能助手正变得越来越可行,这得益于多模态语言模型的快速进展。一个关键挑战在于克服这些模型的通用性,以提供个性化的体验。现有的大型视觉语言模型个性化方法通常依赖于额外的训练阶段,这限制了其通用性和可扩展性,或者依赖于带有外部预训练模块的工程化管道,这妨碍了部署效率。在本研究中,我们提出了一种高效的个性化方法,利用模型固有的捕捉个性化概念的能力。具体而言,我们通过利用模型的内部注意机制提取主要代表目标概念的视觉标记。这些标记作为该特定概念的记忆,使模型能够在测试图像中出现时回忆并描述该概念。我们对我们的方法及其在单一概念、多概念和视频个性化等各种个性化设置下的最先进(SOTA)方法进行了全面统一的评估,展示了在最小个性化开销下的显著性能提升。
cs.CV / 126 / 2603.09772
Removing the Trigger, Not the Backdoor: Alternative Triggers and Latent Backdoors
去除触发器,而非后门:替代触发器与潜在后门
Abstract
Current backdoor defenses assume that neutralizing a known trigger removes the backdoor. We show this trigger-centric view is incomplete: \emph{alternative triggers}, patterns perceptually distinct from training triggers, reliably activate the same backdoor. We estimate the alternative trigger backdoor direction in feature space by contrasting clean and triggered representations, and then develop a feature-guided attack that jointly optimizes target prediction and directional alignment. First, we theoretically prove that alternative triggers exist and are an inevitable consequence of backdoor training. Then, we verify this empirically. Additionally, defenses that remove training triggers often leave backdoors intact, and alternative triggers can exploit the latent backdoor feature-space. Our findings motivate defenses targeting backdoor directions in representation space rather than input-space triggers.
Chinese Translation
当前的后门防御假设中和已知触发器就能去除后门。我们展示了这种以触发器为中心的观点是不完整的:替代触发器,即与训练触发器在感知上不同的模式,能够可靠地激活同一后门。我们通过对比干净和触发后的表示,估计替代触发器在特征空间中的后门方向,并开发了一种特征引导攻击,该攻击联合优化目标预测和方向对齐。首先,我们从理论上证明了替代触发器的存在,并且这是后门训练的必然结果。然后,我们通过实证验证了这一点。此外,去除训练触发器的防御往往会使后门保持完整,而替代触发器可以利用潜在的后门特征空间。我们的研究结果激励了针对表示空间中的后门方向而非输入空间触发器的防御措施。
cs.CV / 127 / 2603.09798
Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency
基于多标签原型增长和双线索一致性的动作预测测试时自我-外部中心适应
Abstract
Efficient adaptation between Egocentric (Ego) and Exocentric (Exo) views is crucial for applications such as human-robot cooperation. However, the success of most existing Ego-Exo adaptation methods relies heavily on target-view data for training, thereby increasing computational and data collection costs. In this paper, we make the first exploration of a Test-time Ego-Exo Adaptation for Action Anticipation (TE$^{2}$A$^{3}$) task, which aims to adjust the source-view-trained model online during test time to anticipate target-view actions. It is challenging for existing Test-Time Adaptation (TTA) methods to address this task due to the multi-action candidates and significant temporal-spatial inter-view gap. Hence, we propose a novel Dual-Clue enhanced Prototype Growing Network (DCPGN), which accumulates multi-label knowledge and integrates cross-modality clues for effective test-time Ego-Exo adaptation and action anticipation. Specifically, we propose a Multi-Label Prototype Growing Module (ML-PGM) to balance multiple positive classes via multi-label assignment and confidence-based reweighting for class-wise memory banks, which are updated by an entropy priority queue strategy. Then, the Dual-Clue Consistency Module (DCCM) introduces a lightweight narrator to generate textual clues indicating action progressions, which complement the visual clues containing various objects. Moreover, we constrain the inferred textual and visual logits to construct dual-clue consistency for temporally and spatially bridging Ego and Exo views. Extensive experiments on the newly proposed EgoMe-anti and the existing EgoExoLearn benchmarks show the effectiveness of our method, which outperforms related state-of-the-art methods by a large margin. Code is available at \href{https://github.com/ZhaofengSHI/DCPGN}{https://github.com/ZhaofengSHI/DCPGN}.
Chinese Translation
自我中心(Ego)视角与外部中心(Exo)视角之间的高效适应对于人机合作等应用至关重要。然而,大多数现有的自我-外部适应方法的成功在很大程度上依赖于目标视角数据进行训练,从而增加了计算和数据收集成本。本文首次探索了测试时自我-外部适应用于动作预测(TE$^{2}$A$^{3}$)任务,旨在在测试时在线调整源视角训练的模型,以预测目标视角的动作。现有的测试时适应(TTA)方法在处理此任务时面临多动作候选和显著的时空视角间差距的挑战。因此,我们提出了一种新颖的双线索增强原型增长网络(DCPGN),该网络积累多标签知识并整合跨模态线索,以实现有效的测试时自我-外部适应和动作预测。具体而言,我们提出了一个多标签原型增长模块(ML-PGM),通过多标签分配和基于置信度的重加权来平衡多个正类,并通过熵优先队列策略更新类级记忆库。然后,双线索一致性模块(DCCM)引入一个轻量级叙述者,生成指示动作进展的文本线索,以补充包含各种对象的视觉线索。此外,我们约束推断的文本和视觉logits,以构建时空桥接自我和外部视角的双线索一致性。在新提出的EgoMe-anti和现有的EgoExoLearn基准上的大量实验表明我们方法的有效性,其性能大幅超越相关的最新方法。代码可在 https://github.com/ZhaofengSHI/DCPGN 获取。
cs.CV / 128 / 2603.09809
RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding
RA-SSU:面向区域感知声音源理解的细粒度音视频学习
Abstract
Audio-Visual Learning (AVL) is a fundamental task in multi-modality learning and embodied intelligence, playing a vital role in scene understanding and interaction. However, previous researchers have mostly explored downstream tasks from a coarse-grained perspective (e.g., audio-visual correspondence, sound source localization, and audio-visual event localization). To provide more specific scene-perception details, we define a new fine-grained Audio-Visual Learning task, termed Region-Aware Sound Source Understanding (RA-SSU), which aims to achieve region-aware, frame-level, and high-quality sound source understanding. To support this goal, we construct two corresponding datasets, i.e., fine-grained Music (f-Music) and fine-grained Lifescene (f-Lifescene), each containing annotated sound source masks and frame-by-frame textual descriptions. The f-Music dataset includes 3,976 samples across 22 scene types related to specific application scenarios, focusing on music scenes with complex instrument mixing. The f-Lifescene dataset contains 6,156 samples across 61 types representing diverse sounding objects in life scenarios. Moreover, we propose SSUFormer, a Sound-Source Understanding TransFormer benchmark that facilitates both sound source segmentation and sound region description with a multi-modal input and multi-modal output architecture. Specifically, we design two modules for this framework, a Mask Collaboration Module (MCM) and a Mixture of Hierarchical-prompted Experts (MoHE), to respectively enhance the accuracy and enrich the elaboration of the sound source description. Extensive experiments are conducted on our two datasets to verify the feasibility of the task, evaluate the availability of the datasets, and demonstrate the superiority of SSUFormer, which achieves SOTA performance on the Sound Source Understanding benchmark.
Chinese Translation
音视频学习(AVL)是多模态学习和具身智能的一个基本任务,在场景理解和交互中发挥着重要作用。然而,以往的研究者主要从粗粒度的角度探索下游任务(例如,音视频对应、声音源定位和音视频事件定位)。为了提供更具体的场景感知细节,我们新定义了一项细粒度音视频学习任务,称为区域感知声音源理解(RA-SSU),旨在实现区域感知、帧级别和高质量的声音源理解。为了支持这一目标,我们创新性地构建了两个相应的数据集,即细粒度音乐(f-Music)和细粒度生活场景(f-Lifescene),每个数据集都包含注释的声音源掩码和逐帧文本描述。f-Music 数据集包含 3,976 个样本,涵盖 22 种与特定应用场景相关的场景类型,重点关注具有复杂乐器混合的音乐场景。f-Lifescene 数据集包含 6,156 个样本,涵盖 61 种代表生活场景中多样声音对象的类型。此外,我们提出了 SSUFormer,一个声音源理解变换器基准,促进了声音源分割和声音区域描述,采用多模态输入和多模态输出架构。具体而言,我们为该框架设计了两个模块,掩码协作模块(Mask Collaboration Module, MCM)和分层提示专家混合(Mixture of Hierarchical-prompted Experts, MoHE),分别增强声音源描述的准确性和丰富性。我们在两个数据集上进行了广泛的实验,以验证任务的可行性,评估数据集的可用性,并展示 SSUFormer 的优越性,其在声音源理解基准上达到了 SOTA 性能。
cs.CV / 129 / 2603.09819
ConfCtrl: Enabling Precise Camera Control in Video Diffusion via Confidence-Aware Interpolation
ConfCtrl:通过置信度感知插值实现视频扩散中的精确相机控制
Abstract
We address the challenge of novel view synthesis from only two input images under large viewpoint changes. Existing regression-based methods lack the capacity to reconstruct unseen regions, while camera-guided diffusion models often deviate from intended trajectories due to noisy point cloud projections or insufficient conditioning from camera poses. To address these issues, we propose ConfCtrl, a confidence-aware video interpolation framework that enables diffusion models to follow prescribed camera poses while completing unseen regions. ConfCtrl initializes the diffusion process by combining a confidence-weighted projected point cloud latent with noise as the conditioning input. It then applies a Kalman-inspired predict-update mechanism, treating the projected point cloud as a noisy measurement and using learned residual corrections to balance pose-driven predictions with noisy geometric observations. This allows the model to rely on reliable projections while down-weighting uncertain regions, yielding stable, geometry-aware generation. Experiments on multiple datasets show that ConfCtrl produces geometrically consistent and visually plausible novel views, effectively reconstructing occluded regions under large viewpoint changes.
Chinese Translation
我们解决了在大视角变化下仅基于两幅输入图像进行新视图合成的挑战。现有的基于回归的方法缺乏重建未见区域的能力,而相机引导的扩散模型由于噪声点云投影或相机姿态的不足条件,往往偏离预期轨迹。为了解决这些问题,我们提出了ConfCtrl,一个置信度感知的视频插值框架,使扩散模型能够遵循规定的相机姿态,同时完成未见区域的生成。ConfCtrl通过将置信度加权的投影点云潜变量与噪声结合,作为条件输入来初始化扩散过程。然后,它应用一种受卡尔曼滤波启发的预测-更新机制,将投影点云视为噪声测量,并利用学习到的残差修正来平衡姿态驱动的预测与噪声几何观测。这使得模型能够依赖可靠的投影,同时降低不确定区域的权重,从而实现稳定的、具有几何感知的生成。在多个数据集上的实验表明,ConfCtrl能够生成几何一致且视觉上可信的新视图,有效重建在大视角变化下的遮挡区域。
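The Kalman-inspired predict-update mechanism above can be illustrated with the classic scalar update: a pose-driven prediction plays the role of the prior, the noisy point-cloud projection is the measurement, and the gain shifts toward whichever source has lower variance. This is the textbook filter, not ConfCtrl's learned residual correction, which the abstract says replaces these closed-form terms.

```python
def predict_update(prior, prior_var, measurement, meas_var):
    """Scalar Kalman-style update blending a prediction with a measurement.

    gain -> 1 when the measurement is confident (low meas_var), so the
    posterior follows the projection; gain -> 0 when it is noisy, so the
    posterior stays with the pose-driven prediction. This mirrors relying
    on reliable projections while down-weighting uncertain regions.
    """
    gain = prior_var / (prior_var + meas_var)
    post = prior + gain * (measurement - prior)
    post_var = (1.0 - gain) * prior_var
    return post, post_var
```

With equal variances the posterior is the midpoint of prior and measurement; with a very noisy measurement it collapses back onto the prior.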
cs.CV / 130 / 2603.09825
BrainSTR: Spatio-Temporal Contrastive Learning for Interpretable Dynamic Brain Network Modeling
BrainSTR:用于可解释动态脑网络建模的时空对比学习
Abstract
Dynamic functional connectivity captures time-varying brain states for better neuropsychiatric diagnosis and spatio-temporal interpretability, i.e., identifying when discriminative disease signatures emerge and where they reside in the connectivity topology. Reliable interpretability faces major challenges: diagnostic signals are often subtle and sparsely distributed across both time and topology, while nuisance fluctuations and non-diagnostic connectivities are pervasive. To address these issues, we propose BrainSTR, a spatio-temporal contrastive learning framework for interpretable dynamic brain network modeling. BrainSTR learns state-consistent phase boundaries via a data-driven Adaptive Phase Partition module, identifies diagnostically critical phases with attention, and extracts disease-related connectivity within each phase using an Incremental Graph Structure Generator regularized by binarization, temporal smoothness, and sparsity. Then, we introduce a spatio-temporal supervised contrastive learning approach that leverages diagnosis-relevant spatio-temporal patterns to refine the similarity metric between samples and capture more discriminative spatio-temporal features, thereby constructing a well-structured semantic space for coherent and interpretable representations. Experiments on ASD, BD, and MDD validate the effectiveness of BrainSTR, and the discovered critical phases and subnetworks provide interpretable evidence consistent with prior neuroimaging findings. Our code: https://anonymous.4open.science/r/BrainSTR1.
Chinese Translation
动态功能连接捕捉时间变化的脑状态,以便更好地进行神经精神疾病诊断和时空可解释性,即识别出何时出现具有区分性的疾病特征以及它们在连接拓扑中的位置。可靠的可解释性面临重大挑战:诊断信号通常是微妙且稀疏分布在时间和拓扑上,而干扰波动和非诊断连接则普遍存在。为了解决这些问题,我们提出了BrainSTR,一种用于可解释动态脑网络建模的时空对比学习框架。BrainSTR通过数据驱动的自适应相位划分模块学习状态一致的相位边界,利用注意力机制识别诊断上关键的相位,并使用增量图结构生成器提取每个相位内与疾病相关的连接,该生成器通过二值化、时间平滑性和稀疏性进行正则化。然后,我们引入了一种时空监督对比学习方法,利用与诊断相关的时空模式来细化样本之间的相似性度量,并捕捉更具区分性的时空特征,从而构建一个结构良好的语义空间,以实现连贯且可解释的表示。对自闭症谱系障碍(ASD)、躁郁症(BD)和重性抑郁障碍(MDD)的实验验证了BrainSTR的有效性,发现的关键相位和子网络提供了与先前神经影像学发现一致的可解释证据。我们的代码: https://anonymous.4open.science/r/BrainSTR1。
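The dynamic functional connectivity that BrainSTR's phase-partition module operates on is itself a simple construction: each region's time series is cut into windows, and each window yields one Pearson-correlation connectivity matrix. A minimal stdlib-only sketch on toy data (illustrative only, not the paper's pipeline):

```python
# Sketch: sliding-window dynamic functional connectivity (dFC).
# Region time series are cut into windows; each window yields one
# Pearson-correlation connectivity matrix.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def dynamic_fc(signals, window, stride):
    """signals: list of equal-length region time series -> list of FC matrices."""
    t = len(signals[0])
    matrices = []
    for start in range(0, t - window + 1, stride):
        win = [s[start:start + window] for s in signals]
        matrices.append([[pearson(a, b) for b in win] for a in win])
    return matrices

# Two toy regions: correlated in the first window, anti-correlated in the second.
r1 = [0, 1, 2, 3, 4, 5, 4, 3, 2, 1]
r2 = [0, 1, 2, 3, 4, -5, -4, -3, -2, -1]
fcs = dynamic_fc([r1, r2], window=5, stride=5)
print(round(fcs[0][0][1], 3), round(fcs[1][0][1], 3))  # 1.0 -1.0
```

The brain-state change shows up as a sign flip in the off-diagonal entry between the two windows; BrainSTR's contribution is learning where such phase boundaries fall and which phases carry diagnostic signal.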
cs.CV / 131 / 2603.09826
VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models
VLM-Loc:通过视觉-语言模型在点云地图中的定位
Abstract
Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird's-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment mechanism that explicitly associates textual cues with scene graph nodes, enabling interpretable spatial reasoning for accurate localization. To facilitate systematic evaluation across diverse scenes, we present CityLoc, a benchmark built from multi-source point clouds for fine-grained T2P localization. Experiments on CityLoc demonstrate VLM-Loc achieves superior accuracy and robustness compared to state-of-the-art methods. Our code, model, and dataset are available at https://github.com/MCG-NKU/nku-3d-vision.
Chinese Translation
文本到点云(T2P)定位旨在从自然语言描述中推断出3D点云地图中的精确空间位置,反映了人类如何通过语言感知和交流空间布局。然而,现有方法在很大程度上依赖于浅层的文本-点云对应关系,缺乏有效的空间推理,这限制了它们在复杂环境中的准确性。为了解决这一局限性,我们提出了VLM-Loc,一个利用大型视觉-语言模型(VLM)空间推理能力的T2P定位框架。具体而言,我们将点云转换为鸟瞰图(BEV)图像和场景图,这两者共同编码几何和语义上下文,为VLM提供结构化输入,以学习跨模态表示,桥接语言和空间语义。在这些表示的基础上,我们引入了一种部分节点分配机制,明确将文本线索与场景图节点关联,从而实现可解释的空间推理,以实现准确的定位。为了促进在多样场景中的系统评估,我们提出了CityLoc,这是一个基于多源点云构建的基准,用于细粒度的T2P定位。在CityLoc上的实验表明,VLM-Loc在准确性和鲁棒性方面优于最先进的方法。我们的代码、模型和数据集可在 https://github.com/MCG-NKU/nku-3d-vision 获取。
cs.CV / 132 / 2603.09827
MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents
MA-EgoQA:来自多个具身智能体的自我中心视频问答
Abstract
As embodied models become powerful, humans will collaborate with multiple embodied AI agents at their workplace or home in the future. To ensure better communication between human users and the multi-agent system, it is crucial to interpret incoming information from agents in parallel and refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating high volumes of individual sensory inputs in the form of video and correctly aggregating multiple egocentric videos to construct system-level memory. In this work, we first formally define a novel problem of understanding multiple long-horizon egocentric videos simultaneously collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systemically evaluate existing models in our scenario. MA-EgoQA provides 1.7k questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction. We further propose a simple baseline model for MA-EgoQA named EgoMAS, which leverages shared memory across embodied agents and agent-wise dynamic retrieval. Through comprehensive evaluation across diverse baselines and EgoMAS on MA-EgoQA, we find that current approaches are unable to effectively handle multiple egocentric streams, highlighting the need for future advances in system-level understanding across the agents. The code and benchmark are available at https://ma-egoqa.github.io.
Chinese Translation
随着具身模型能力的不断增强,未来人类将在工作场所或家中与多个具身人工智能智能体协作。为了确保人类用户与多智能体系统之间的良好沟通,关键在于并行解读来自各智能体的输入信息,并为每个查询参考适当的上下文。现有的挑战包括有效压缩和传达以视频形式呈现的大量个体感知输入,以及正确聚合多个自我中心视频以构建系统级记忆。在本研究中,我们首先正式定义了一个新问题,即同时理解从多个具身智能体收集的长时间跨度自我中心视频。为了促进这一方向的研究,我们引入了MultiAgent-EgoQA(MA-EgoQA),这是一个旨在系统性评估现有模型在该场景下表现的基准。MA-EgoQA提供了1700个专门面向多自我中心视频流的问题,涵盖五个类别:社会互动、任务协调、心智理论、时间推理和环境互动。我们进一步为MA-EgoQA提出了一个名为EgoMAS的简单基线模型,该模型利用具身智能体之间的共享记忆和逐智能体的动态检索。通过在MA-EgoQA上对多种基线和EgoMAS进行全面评估,我们发现当前方法无法有效处理多个自我中心视频流,突显了未来在跨智能体系统级理解方面取得进展的必要性。代码和基准可在 https://ma-egoqa.github.io 获取。
cs.CV / 133 / 2603.09874
MissBench: Benchmarking Multimodal Affective Analysis under Imbalanced Missing Modalities
MissBench:在不平衡缺失模态下的多模态情感分析基准测试
Abstract
Multimodal affective computing underpins key tasks such as sentiment analysis and emotion recognition. Standard evaluations, however, often assume that textual, acoustic, and visual modalities are equally available. In real applications, some modalities are systematically more fragile or expensive, creating imbalanced missing rates and training biases that task-level metrics alone do not reveal. We introduce MissBench, a benchmark and framework for multimodal affective tasks that standardizes both shared and imbalanced missing-rate protocols on four widely used sentiment and emotion datasets. MissBench also defines two diagnostic metrics. The Modality Equity Index (MEI) measures how fairly different modalities contribute across missing-modality configurations. The Modality Learning Index (MLI) quantifies optimization imbalance by comparing modality-specific gradient norms during training, aggregated across modality-related modules. Experiments on representative method families show that models that appear robust under shared missing rates can still exhibit marked modality inequity and optimization imbalance under imbalanced conditions. These findings position MissBench, together with MEI and MLI, as practical tools for stress-testing and analyzing multimodal affective models in realistic incomplete-modality settings.For reproducibility, we release our code at: https://anonymous.4open.science/r/MissBench-4098/
Chinese Translation
多模态情感计算是情感分析和情绪识别等关键任务的基础。然而,标准评估通常假设文本、声学和视觉模态是同等可用的。在实际应用中,一些模态系统性地更脆弱或更昂贵,导致不平衡的缺失率和训练偏差,而仅依赖任务级指标无法揭示这些问题。我们提出了MissBench,这是一个多模态情感任务的基准和框架,标准化了四个广泛使用的情感和情绪数据集上的共享和不平衡缺失率协议。MissBench还定义了两个诊断指标。模态公平指数(Modality Equity Index, MEI)衡量不同模态在缺失模态配置中的贡献公平性。模态学习指数(Modality Learning Index, MLI)通过比较训练过程中模态特定的梯度范数,量化优化不平衡,聚合于与模态相关的模块。对代表性方法家族的实验表明,在共享缺失率下看似稳健的模型在不平衡条件下仍可能表现出显著的模态不平等和优化不平衡。这些发现使得MissBench以及MEI和MLI成为在现实的不完整模态环境中对多模态情感模型进行压力测试和分析的实用工具。为了可重复性,我们发布了我们的代码: https://anonymous.4open.science/r/MissBench-4098/
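The abstract describes MLI as comparing modality-specific gradient norms during training but does not give the formula; the following is a hypothetical diagnostic in the same spirit, reporting the ratio between the strongest and weakest modality's mean gradient norm (the exact MLI definition in the paper may differ):

```python
# Sketch (hypothetical formula): an MLI-style optimization-imbalance diagnostic.
# Per modality, average the gradient norms of its modules, then take the ratio
# between the most- and least-trained modality.
import math

def grad_norm(grads):
    return math.sqrt(sum(g * g for g in grads))

def learning_imbalance(per_modality_grads):
    """per_modality_grads: {modality: [grad vectors per module]} -> (ratio, means)."""
    means = {
        m: sum(grad_norm(g) for g in gs) / len(gs)
        for m, gs in per_modality_grads.items()
    }
    return max(means.values()) / min(means.values()), means

ratio, means = learning_imbalance({
    "text":  [[0.9, 0.3], [0.8, 0.1]],      # dominant modality
    "audio": [[0.09, 0.03], [0.08, 0.01]],  # gradients 10x smaller
})
print(round(ratio, 2))
```

A ratio near 1 would indicate balanced optimization; a large ratio flags the kind of training bias the benchmark is designed to surface even when task-level accuracy looks fine.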
cs.CV / 134 / 2603.09877
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
InternVL-U:为理解、推理、生成和编辑民主化统一多模态模型
Tian, Changyao, Yang, Danni, Chen, Guanzhou, Cui, Erfei, Wang, Zhaokai, Duan, Yuchen, Yin, Penghao, Chen, Sitao, Yang, Ganlin, Liu, Mingxin, Zhu, Zirun, Fan, Ziqian, Gu, Leyao, Wang, Haomin, Wei, Qi, Yin, Jinhui, Yang, Xue, Zhong, Zhihang, Qin, Qi, Xin, Yi, Fu, Bin, Liu, Yihao, Ge, Jiaye, Guo, Qipeng, Luo, Gen, Li, Hongsheng, Qiao, Yu, Chen, Kai, Zhang, Hongjie
Abstract
Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance - efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baseline models with over 3x larger scales such as BAGEL (14B) on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.
Chinese Translation
统一多模态模型(UMMs)在整合理解、推理、生成和编辑时面临着在保持强语义理解与获取强大生成能力之间的固有权衡。在本报告中,我们提出了InternVL-U,这是一种轻量级的4B参数UMM,在统一框架内民主化这些能力。InternVL-U遵循统一上下文建模和具有解耦视觉表示的模态特定模块化设计原则,整合了最先进的多模态大语言模型(MLLM)与专门的基于MMDiT的视觉生成头。为了进一步弥合美学生成与高级智能之间的差距,我们构建了一个综合数据合成管道,针对高语义密度任务,如文本渲染和科学推理,采用以推理为中心的范式,利用思维链(Chain-of-Thought, CoT)更好地对齐抽象用户意图与细粒度视觉生成细节。大量实验表明,InternVL-U在性能与效率之间实现了优越的平衡。尽管仅使用4B参数,它在各种生成和编辑任务中始终优于规模超过3倍的统一基线模型,如BAGEL(14B),同时保持强大的多模态理解和推理能力。
cs.CV / 135 / 2603.09883
DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary
DISPLAY:通过稀疏运动引导和多任务辅助实现可控的人物-物体交互视频生成
Abstract
Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. Existing works rely on dense control signals, template videos, or carefully crafted text prompts, which limit flexibility and generalization to novel objects. We introduce a framework, namely DISPLAY, guided by Sparse Motion Guidance, composed only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page can be found at https://mumuwei.github.io/DISPLAY/.
Chinese Translation
以人为中心的视频生成技术发展迅速,但现有方法在生成可控且物理一致的人物-物体交互(HOI)视频方面仍面临挑战。现有研究依赖于密集的控制信号、模板视频或精心设计的文本提示,这限制了灵活性和对新物体的泛化能力。我们提出了一种名为DISPLAY的框架,采用稀疏运动引导,仅由手腕关节坐标和与形状无关的物体边界框组成。这种轻量级的引导缓解了人类与物体表示之间的不平衡,并实现了直观的用户控制。为了在如此稀疏的条件下提高生成保真度,我们提出了一种物体强调注意力(Object-Stressed Attention)机制,以增强物体的鲁棒性。为了解决高质量HOI数据稀缺的问题,我们进一步开发了一种多任务辅助训练策略,并配备了专门的数据整理管道,使模型能够同时受益于可靠的HOI样本和辅助任务。全面的实验表明,我们的方法在多种任务中实现了高保真、可控的HOI生成。项目页面可访问 https://mumuwei.github.io/DISPLAY/。
cs.CV / 136 / 2603.09896
Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports
将视觉语言模型引入体育领域:体育中的空间智能基准测试
Abstract
Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatial-aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing spatial intelligence of VLMs in sports.
Chinese Translation
体育运动长期以来吸引了广泛的关注,因为它们推动了人类身体和认知能力的极限。在对视觉语言模型(VLMs)空间智能日益增长的兴趣中,体育运动为理解高强度人类运动和动态物体交互提供了自然的测试平台。为此,我们提出了CourtSI,这是第一个针对体育场景的大规模空间智能数据集。CourtSI包含超过100万个问答对,按照一个全面的分类法进行组织,系统地涵盖了空间计数、距离测量、定位和关系推理,覆盖包括羽毛球、网球和乒乓球在内的代表性隔网类运动。利用明确定义的场地几何作为度量锚点,我们开发了一个半自动化的数据引擎来重建体育场景,从而实现CourtSI数据的可扩展整理。此外,我们还推出了CourtSI-Bench,这是一个高质量的评估基准,包含3,686个经过严格人工验证的问答对。我们在CourtSI-Bench上评估了25个专有和开源的VLM,揭示了人类与人工智能之间仍然存在的表现差距,以及现有空间智能基准的有限泛化能力。这些发现表明,体育场景暴露了现有基准所未能捕捉的空间智能能力局限。此外,在CourtSI上微调Qwen3-VL-8B使其在CourtSI-Bench上的准确率提高了23.5个百分点。经过调整的模型在CourtSI-Ext上也能有效泛化,CourtSI-Ext是一个基于类似但未见过的运动构建的评估集,并展示了增强的空间感知评论生成能力。综合来看,这些发现表明CourtSI为推动体育领域VLM的空间智能提供了一条可扩展的路径。
cs.CV / 137 / 2603.09921
WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
WikiCLIP:一种高效的开放域视觉实体识别对比基线
Abstract
Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/
Chinese Translation
开放域视觉实体识别(VER)旨在将图像与维基百科等百科知识库中的实体关联起来。近期为VER量身定制的生成方法表现出色,但计算成本高,限制了其可扩展性和实际部署。在本研究中,我们重新审视了VER的对比范式,并引入了WikiCLIP,一个简单而有效的框架,为开放域VER建立了一个强大且高效的基线。WikiCLIP利用大型语言模型嵌入作为知识丰富的实体表示,并通过视觉引导知识适配器(Vision-Guided Knowledge Adaptor, VGKA)进行增强,该适配器在补丁级别上将文本语义与视觉线索对齐。为了进一步促进细粒度的区分,硬负样本合成机制(Hard Negative Synthesis Mechanism)在训练过程中生成视觉上相似但语义上不同的负样本。在流行的开放域VER基准测试(如OVEN)上的实验结果表明,WikiCLIP显著优于强基线。具体而言,WikiCLIP在具有挑战性的OVEN未见集上实现了16%的提升,同时与领先的生成模型AutoVER相比,推理延迟减少了近100倍。项目页面可访问:https://artanic30.github.io/project_pages/WikiCLIP/
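The contrastive retrieval step that makes WikiCLIP fast can be seen in miniature: an image embedding is scored against precomputed entity embeddings by cosine similarity, and a hard negative (visually close, semantically wrong) must rank below the true entity. A toy sketch with made-up vectors (not the model's actual embeddings):

```python
# Sketch: cosine-similarity entity ranking, the cheap inference path of
# contrastive VER (versus autoregressive generation of entity names).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_entities(image_emb, entity_embs):
    """Return entity names sorted by descending similarity to the image."""
    return sorted(entity_embs, key=lambda n: cosine(image_emb, entity_embs[n]),
                  reverse=True)

entities = {
    "Golden Gate Bridge": [1.0, 0.1, 0.0],
    "Bay Bridge (hard negative)": [0.8, 0.5, 0.0],
    "Eiffel Tower": [0.0, 0.1, 1.0],
}
image = [0.95, 0.15, 0.05]
print(rank_entities(image, entities)[0])
```

Because scoring is a single matrix of dot products rather than token-by-token generation, this is where the reported ~100x inference-latency advantage over generative VER comes from; the hard-negative synthesis in training exists precisely to keep pairs like the two bridges separable.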
cs.CV / 138 / 2603.09925
On the Structural Failure of Chamfer Distance in 3D Shape Optimization
论三维形状优化中倒角距离的结构性失效
Abstract
Chamfer distance is the standard training loss for point cloud reconstruction, completion, and generation, yet directly optimizing it can produce worse Chamfer values than not optimizing it at all. We show that this paradoxical failure is gradient-structural. The per-point Chamfer gradient creates a many-to-one collapse that is the unique attractor of the forward term and cannot be resolved by any local regularizer, including repulsion, smoothness, and density-aware re-weighting. We derive a necessary condition for collapse suppression: coupling must propagate beyond local neighborhoods. In a controlled 2D setting, shared-basis deformation suppresses collapse by providing global coupling; in 3D shape morphing, a differentiable MPM prior instantiates the same principle, consistently reducing the Chamfer gap across 20 directed pairs with a 2.5$\times$ improvement on the topologically complex dragon. The presence or absence of non-local coupling determines whether Chamfer optimization succeeds or collapses. This provides a practical design criterion for any pipeline that optimizes point-level distance metrics.
Chinese Translation
倒角距离(Chamfer distance)是点云重建、补全和生成的标准训练损失,然而直接优化它可能产生比完全不优化更差的倒角值。我们证明这种看似矛盾的失效源于梯度结构。逐点的倒角梯度会造成多对一的坍缩,这是前向项的唯一吸引子,无法通过任何局部正则化器(包括排斥项、平滑性和密度感知重加权)解决。我们推导出抑制坍缩的必要条件:耦合必须传播到局部邻域之外。在一个受控的二维设定中,共享基变形通过提供全局耦合来抑制坍缩;在三维形状变形中,一个可微分的物质点法(Material Point Method, MPM)先验体现了相同的原理,在20个定向配对中一致地缩小了倒角差距,并在拓扑复杂的龙模型上实现了2.5倍的改进。非局部耦合的存在与否决定了倒角优化是成功还是坍缩。这为任何优化点级距离度量的流程提供了一个实用的设计准则。
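The collapse mechanism the abstract identifies can be made concrete with a toy computation: each predicted point is attracted to its nearest target under the forward Chamfer term, and nothing in the per-point gradient prevents many predicted points from sharing a single attractor. A minimal sketch (illustrative only, not the paper's code):

```python
# Sketch: symmetric Chamfer distance and the many-to-one nearest-neighbor
# matching whose per-point gradients drive the collapse described above.
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def chamfer(pred, target):
    """Mean forward + mean backward nearest-neighbor squared distances."""
    fwd = sum(min(sq_dist(p, t) for t in target) for p in pred) / len(pred)
    bwd = sum(min(sq_dist(t, p) for p in pred) for t in target) / len(target)
    return fwd + bwd

def nearest_targets(pred, target):
    """Index of each predicted point's nearest target (its gradient attractor)."""
    return [min(range(len(target)), key=lambda j: sq_dist(p, target[j]))
            for p in pred]

pred = [(0.1, 0.0), (0.2, 0.0), (0.3, 0.0)]
target = [(0.0, 0.0), (5.0, 5.0)]
print(nearest_targets(pred, target))  # [0, 0, 0]: all three share one attractor
```

All three predicted points are pulled toward the same target, so the forward term alone would stack them onto it; only the backward term and, per the paper, non-local coupling across points can resist that degenerate minimum.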
cs.CV / 139 / 2603.09930
Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction
通过关节角度运动图像和标记块后期交互实现细粒度运动检索
Abstract
Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences, and thus reducing accuracy. Additionally, these global-embedding methods offer limited interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image, compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and enhance it with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment. Extensive experiments on HumanML3D and KIT-ML show that our method outperforms state-of-the-art text-motion retrieval approaches while offering interpretable fine-grained correspondences between text and motion. The code is available in the supplementary material.
Chinese Translation
文本-运动检索旨在学习自然语言描述与3D人类运动骨架序列之间的语义对齐潜在空间,从而实现两种模态之间的双向搜索。现有的大多数方法采用双编码器框架,将运动和文本压缩为全局嵌入,忽略了细粒度的局部对应关系,从而降低了准确性。此外,这些全局嵌入方法对检索结果的可解释性有限。为克服这些局限性,我们提出了一种可解释的基于关节角度的运动表示,将关节级局部特征映射到结构化的伪图像中,兼容预训练的视觉变换器(Vision Transformers)。在文本到运动检索中,我们采用了MaxSim,一种基于标记的后期交互机制,并通过掩蔽语言建模(Masked Language Modeling)正则化进行增强,以促进稳健、可解释的文本-运动对齐。在HumanML3D和KIT-ML上的大量实验表明,我们的方法在性能上优于最先进的文本-运动检索方法,同时提供了文本与运动之间的可解释细粒度对应关系。代码可在补充材料中获得。
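MaxSim is the ColBERT-style late interaction the abstract names: instead of one global dot product, each query token keeps only its best-matching document (here, motion-patch) token, and the per-token maxima are summed. A toy version with hand-made unit-ish vectors:

```python
# Sketch: token-wise MaxSim late interaction. The score is the sum over query
# tokens of the maximum similarity to any document/motion token, which is what
# preserves the fine-grained correspondences a single global embedding discards.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

query = [[1.0, 0.0], [0.0, 1.0]]      # two query-token embeddings
motion_a = [[0.9, 0.1], [0.1, 0.9]]   # patches covering both tokens
motion_b = [[0.9, 0.1], [0.9, 0.1]]   # patches covering only the first token
print(maxsim_score(query, motion_a) > maxsim_score(query, motion_b))  # True
```

Motion A wins because every query token finds a strong match somewhere, and the argmax per token is exactly the interpretable token-to-patch alignment the method exposes.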
cs.CV / 140 / 2603.09931
Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation
自适应临床感知潜在扩散用于多模态脑图像生成与缺失模态填补
Abstract
Multimodal neuroimaging provides complementary insights for Alzheimer's disease diagnosis, yet clinical datasets frequently suffer from missing modalities. We propose ACADiff, a framework that synthesizes missing brain imaging modalities through adaptive clinical-aware diffusion. ACADiff learns mappings between incomplete multimodal observations and target modalities by progressively denoising latent representations while attending to available imaging data and clinical metadata. The framework employs adaptive fusion that dynamically reconfigures based on input availability, coupled with semantic clinical guidance via GPT-4o-encoded prompts. Three specialized generators enable bidirectional synthesis among sMRI, FDG-PET, and AV45-PET. Evaluated on ADNI subjects, ACADiff achieves superior generation quality and maintains robust diagnostic performance even under extreme 80% missing scenarios, outperforming all existing baselines. To promote reproducibility, code is available at https://github.com/rongzhou7/ACADiff
Chinese Translation
多模态神经影像为阿尔茨海默病的诊断提供了互补的见解,但临床数据集常常面临缺失模态的问题。我们提出了ACADiff,一个通过自适应临床感知扩散合成缺失脑影像模态的框架。ACADiff通过逐步去噪潜在表示,学习不完整多模态观察与目标模态之间的映射,同时关注可用的影像数据和临床元数据。该框架采用自适应融合,基于输入的可用性动态重新配置,并通过GPT-4o编码的提示提供语义临床指导。三个专门的生成器实现了sMRI、FDG-PET和AV45-PET之间的双向合成。在ADNI受试者上进行评估,ACADiff实现了优越的生成质量,并在极端80%缺失场景下保持了强大的诊断性能,超越了所有现有基准。为了促进可重复性,代码可在https://github.com/rongzhou7/ACADiff获取。
cs.CV / 141 / 2603.09932
Unsupervised Domain Adaptation with Target-Only Margin Disparity Discrepancy
基于仅目标域边际差异不一致性的无监督领域适应
Abstract
In interventional radiology, Cone-Beam Computed Tomography (CBCT) is a helpful imaging modality that provides guidance to practicians during minimally invasive procedures. CBCT differs from traditional Computed Tomography (CT) due to its limited reconstructed field of view, specific artefacts, and the intra-arterial administration of contrast medium. While CT benefits from abundant publicly available annotated datasets, interventional CBCT data remain scarce and largely unannotated, with existing datasets focused primarily on radiotherapy applications. To address this limitation, we leverage a proprietary collection of unannotated interventional CBCT scans in conjunction with annotated CT data, employing domain adaptation techniques to bridge the modality gap and enhance liver segmentation performance on CBCT. We propose a novel unsupervised domain adaptation (UDA) framework based on the formalism of Margin Disparity Discrepancy (MDD), which improves target domain performance through a reformulation of the original MDD optimization framework. Experimental results on CT and CBCT datasets for liver segmentation demonstrate that our method achieves state-of-the-art performance in UDA, as well as in the few-shot setting.
Chinese Translation
在介入放射学中,锥束计算机断层扫描(CBCT)是一种有助于在微创手术中为医生提供指导的成像方式。与传统的计算机断层扫描(CT)相比,CBCT由于其有限的重建视野、特定的伪影以及动脉内对比剂的使用而有所不同。虽然CT受益于大量公开可用的标注数据集,但介入CBCT数据仍然稀缺且大多未标注,现有数据集主要集中在放射治疗应用上。为了解决这一限制,我们利用一组未标注的介入CBCT扫描图像与标注的CT数据相结合,采用领域适应技术来弥补模态差距,并提高CBCT上的肝脏分割性能。我们提出了一种基于边际差异不一致性(MDD)形式的新型无监督领域适应(UDA)框架,通过重新构建原始MDD优化框架来改善目标领域性能。在CT和CBCT数据集上的肝脏分割实验结果表明,我们的方法在UDA以及少样本设置下均实现了最先进的性能。
cs.CV / 142 / 2603.09945
No Image, No Problem: End-to-End Multi-Task Cardiac Analysis from Undersampled k-Space
无图像,无问题:从欠采样k空间进行端到端多任务心脏分析
Abstract
Conventional clinical CMR pipelines rely on a sequential "reconstruct-then-analyze" paradigm, forcing an ill-posed intermediate step that introduces avoidable artifacts and information bottlenecks. This creates a fundamental mathematical paradox: it attempts to recover high-dimensional pixel arrays (i.e., images) from undersampled k-space, rather than directly extracting the low-dimensional physiological labels actually required for diagnosis. To unlock the direct diagnostic potential of k-space, we propose k-MTR (k-space Multi-Task Representation), a k-space representation learning framework that aligns undersampled k-space data and fully-sampled images into a shared semantic manifold. Leveraging a large-scale controlled simulation of 42,000 subjects, k-MTR forces the k-space encoder to restore anatomical information lost to undersampling directly within the latent space, bypassing the explicit inverse problem for downstream analysis. We demonstrate that this latent alignment enables the dense latent space embedded with high-level physiological semantics directly from undersampled frequencies. Across continuous phenotype regression, disease classification, and anatomical segmentation, k-MTR achieves highly competitive performance against state-of-the-art image-domain baselines. By showcasing that precise spatial geometries and multi-task features can be successfully recovered directly from the k-space representations, k-MTR provides a robust architectural blueprint for task-aware cardiac MRI workflows.
Chinese Translation
传统的临床心脏磁共振成像(CMR)流程依赖于顺序的“重建-再分析”范式,这迫使中间步骤变得不适定,导致可避免的伪影和信息瓶颈。这创造了一个基本的数学悖论:它试图从欠采样的k空间中恢复高维像素数组(即图像),而不是直接提取实际用于诊断的低维生理标签。为了释放k空间的直接诊断潜力,我们提出了k-MTR(k空间多任务表示),这是一种k空间表示学习框架,将欠采样的k空间数据和完全采样的图像对齐到一个共享的语义流形中。通过对42,000个受试者的大规模受控仿真,k-MTR迫使k空间编码器在潜在空间中直接恢复因欠采样而丢失的解剖信息,从而绕过下游分析的显式逆问题。我们展示了这种潜在对齐使得密集的潜在空间能够直接嵌入高层次生理语义,从欠采样频率中提取。通过连续表型回归、疾病分类和解剖分割,k-MTR在与最先进的图像域基线的比较中表现出高度竞争的性能。通过展示精确的空间几何和多任务特征可以直接从k空间表示中成功恢复,k-MTR为任务感知的心脏MRI工作流程提供了一个稳健的架构蓝图。
cs.CV / 143 / 2603.09953
Leveraging whole slide difficulty in Multiple Instance Learning to improve prostate cancer grading
利用多实例学习中的全切片难度改善前列腺癌分级
Abstract
Multiple Instance Learning (MIL) has been widely applied in histopathology to classify Whole Slide Images (WSIs) with slide-level diagnoses. While the ground truth is established by expert pathologists, the slides can be difficult to diagnose for non-experts and lead to disagreements between the annotators. In this paper, we introduce the notion of Whole Slide Difficulty (WSD), based on the disagreement between an expert and a non-expert pathologist. We propose two different methods to leverage WSD, a multi-task approach and a weighted classification loss approach, and we apply them to Gleason grading of prostate cancer slides. Results show that integrating WSD during training consistently improves the classification performance across different feature encoders and MIL methods, particularly for higher Gleason grades (i.e. worse diagnosis).
Chinese Translation
多实例学习(MIL)已广泛应用于组织病理学,用于对仅具有切片级诊断标签的全切片图像(WSI)进行分类。虽然金标准(ground truth)由专家病理学家确立,但这些切片对非专家而言可能难以诊断,并导致标注者之间产生分歧。本文基于专家与非专家病理学家之间的分歧,引入了全切片难度(WSD)的概念。我们提出了两种利用WSD的方法:一种是多任务方法,另一种是加权分类损失方法,并将其应用于前列腺癌切片的Gleason分级。结果表明,在训练过程中整合WSD能够在不同特征编码器和MIL方法上持续提高分类性能,尤其是对较高的Gleason分级(即预后更差的诊断)。
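The weighted-classification-loss variant can be sketched in a few lines. The weighting scheme below is hypothetical (the abstract does not give the formula): slides where expert and non-expert disagree get difficulty d = 1 and are up-weighted by a factor (1 + alpha * d):

```python
# Sketch (hypothetical weighting): a disagreement-weighted negative
# log-likelihood in the spirit of the WSD-weighted classification loss.
import math

def weighted_nll(probs, labels, difficulties, alpha=1.0):
    """Weighted mean NLL; each slide is scaled by (1 + alpha * WSD)."""
    total, weight_sum = 0.0, 0.0
    for p, y, d in zip(probs, labels, difficulties):
        w = 1.0 + alpha * d            # d = 1 if annotators disagreed, else 0
        total += w * (-math.log(p[y]))
        weight_sum += w
    return total / weight_sum

probs = [[0.7, 0.3], [0.4, 0.6]]       # predicted class distributions per slide
labels = [0, 1]
easy = weighted_nll(probs, labels, [0, 0])          # no disagreement anywhere
hard_second = weighted_nll(probs, labels, [0, 1])   # second slide was contested
print(hard_second > easy)
```

Up-weighting the contested slide raises its influence on the gradient, which is the intended effect: the model spends more capacity on exactly the cases that confuse non-expert readers.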
cs.CV / 144 / 2603.09955
From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding
从语义到像素:用于层次视觉理解的粗到细掩码自编码器
Abstract
Self-supervised visual pre-training methods face an inherent tension: contrastive learning (CL) captures global semantics but loses fine-grained detail, while masked image modeling (MIM) preserves local textures but suffers from "attention drift" due to semantically-agnostic random masking. We propose C2FMAE, a coarse-to-fine masked autoencoder that resolves this tension by explicitly learning hierarchical visual representations across three data granularities: semantic masks (scene-level), instance masks (object-level), and RGB images (pixel-level). Two synergistic innovations enforce a strict top-down learning principle. First, a cascaded decoder sequentially reconstructs from scene semantics to object instances to pixel details, establishing explicit cross-granularity dependencies that parallel decoders cannot capture. Second, a progressive masking curriculum dynamically shifts the training focus from semantic-guided to instance-guided and finally to random masking, creating a structured learning path from global context to local features. To support this framework, we construct a large-scale multi-granular dataset with high-quality pseudo-labels for all 1.28M ImageNet-1K images. Extensive experiments show that C2FMAE achieves significant performance gains on image classification, object detection, and semantic segmentation, validating the effectiveness of our hierarchical design in learning more robust and generalizable representations.
Chinese Translation
自监督视觉预训练方法面临着固有的矛盾:对比学习(CL)捕捉全局语义但失去了细粒度细节,而掩码图像建模(MIM)保留了局部纹理但由于与语义无关的随机掩码而遭遇“注意力漂移”。我们提出了C2FMAE,一种粗到细的掩码自编码器,通过在三个数据粒度上显式学习层次视觉表示来解决这一矛盾:语义掩码(场景级)、实例掩码(对象级)和RGB图像(像素级)。两个协同创新强制执行严格的自上而下学习原则。首先,级联解码器依次从场景语义重建到对象实例再到像素细节,建立显式的跨粒度依赖关系,而并行解码器无法捕捉到这些关系。其次,渐进式掩码课程动态地将训练重点从语义引导转移到实例引导,最终到随机掩码,创造了一个从全局上下文到局部特征的结构化学习路径。为了支持这一框架,我们构建了一个大规模的多粒度数据集,为所有128万张ImageNet-1K图像提供高质量的伪标签。大量实验表明,C2FMAE在图像分类、对象检测和语义分割上取得了显著的性能提升,验证了我们层次设计在学习更稳健和更具泛化能力的表示方面的有效性。
cs.CV / 145 / 2603.09968
ReCoSplat: Autoregressive Feed-Forward Gaussian Splatting Using Render-and-Compare
ReCoSplat:基于渲染与比较的自回归前馈高斯点云模型
Abstract
Online novel view synthesis remains challenging, requiring robust scene reconstruction from sequential, often unposed, observations. We present ReCoSplat, an autoregressive feed-forward Gaussian Splatting model supporting posed or unposed inputs, with or without camera intrinsics. While assembling local Gaussians using camera poses scales better than canonical-space prediction, it creates a dilemma during training: using ground-truth poses ensures stability but causes a distribution mismatch when predicted poses are used at inference. To address this, we introduce a Render-and-Compare (ReCo) module. ReCo renders the current reconstruction from the predicted viewpoint and compares it with the incoming observation, providing a stable conditioning signal that compensates for pose errors. To support long sequences, we propose a hybrid KV cache compression strategy combining early-layer truncation with chunk-level selective retention, reducing the KV cache size by over 90% for 100+ frames. ReCoSplat achieves state-of-the-art performance across different input settings on both in- and out-of-distribution benchmarks. Code and pretrained models will be released. Our project page is at https://freemancheng.com/ReCoSplat .
Chinese Translation
在线新视图合成仍然面临挑战,需要从顺序的、通常未定姿态的观测中进行稳健的场景重建。我们提出了ReCoSplat,一种支持定姿态或未定姿态输入的自回归前馈高斯点云模型,无论是否具有相机内参。虽然使用相机姿态组装局部高斯的可扩展性优于规范空间(canonical-space)预测,但这在训练过程中产生了一个两难局面:使用真实姿态可以确保稳定性,但在推理时使用预测姿态会导致分布不匹配。为了解决这个问题,我们引入了渲染与比较(Render-and-Compare, ReCo)模块。ReCo从预测的视点渲染当前重建,并将其与新传入的观测进行比较,提供了一个稳定的条件信号,以补偿姿态误差。为了支持长序列,我们提出了一种混合的键值缓存压缩策略,结合了早期层截断和块级选择性保留,将100帧以上的键值缓存大小减少了90%以上。ReCoSplat在不同输入设置下的分布内和分布外基准测试中均实现了最先进的性能。代码和预训练模型将被发布。我们的项目页面为 https://freemancheng.com/ReCoSplat 。
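The chunk-level selective retention half of the compression strategy can be illustrated with a toy policy. The scores below stand in for whatever importance signal the real model uses (the abstract does not specify one), so this is an illustrative sketch, not ReCoSplat's actual cache logic:

```python
# Sketch: chunk-level selective KV-cache retention. The frame sequence is cut
# into fixed-size chunks and only the top-k highest-scoring entries per chunk
# are kept, preserving their original order within the chunk.
def compress_kv(entries, chunk_size, keep_per_chunk):
    """entries: list of (key_id, score) pairs -> retained subset."""
    kept = []
    for start in range(0, len(entries), chunk_size):
        chunk = entries[start:start + chunk_size]
        top = sorted(chunk, key=lambda e: e[1], reverse=True)[:keep_per_chunk]
        kept.extend(sorted(top, key=lambda e: chunk.index(e)))  # restore order
    return kept

entries = [("f0", 0.9), ("f1", 0.1), ("f2", 0.2), ("f3", 0.8),
           ("f4", 0.3), ("f5", 0.7), ("f6", 0.05), ("f7", 0.6)]
compressed = compress_kv(entries, chunk_size=4, keep_per_chunk=1)
print([k for k, _ in compressed])  # ['f0', 'f5'], a 75% cache reduction
```

Keeping at least one entry per chunk bounds how far back the retained context reaches, which is what lets a long sequence shrink by over 90% without discarding whole stretches of the scene.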
cs.AI / 1 / 2603.08835
MASEval: Extending Multi-Agent Evaluation from Models to Systems
MASEval:将多智能体评估从模型扩展到系统
Abstract
The rapid adoption of LLM-based agentic systems has produced a rich ecosystem of frameworks (smolagents, LangGraph, AutoGen, CAMEL, LlamaIndex, i.a.). Yet existing benchmarks are model-centric: they fix the agentic setup and do not compare other system components. We argue that implementation decisions substantially impact performance, including choices such as topology, orchestration logic, and error handling. MASEval addresses this evaluation gap with a framework-agnostic library that treats the entire system as the unit of analysis. Through a systematic system-level comparison across 3 benchmarks, 3 models, and 3 frameworks, we find that framework choice matters as much as model choice. MASEval allows researchers to explore all components of agentic systems, opening new avenues for principled system design, and practitioners to identify the best implementation for their use case. MASEval is available under the MIT licence https://github.com/parameterlab/MASEval.
Chinese Translation
大规模语言模型(LLM)驱动的智能体系统的快速采用,产生了一个丰富的框架生态系统(如 smolagents、LangGraph、AutoGen、CAMEL、LlamaIndex 等)。然而,现有的基准测试以模型为中心:它们固定了智能体的设置,并未比较其他系统组件。我们认为,实施决策对性能有显著影响,包括拓扑结构、调度逻辑和错误处理等选择。MASEval 通过一个与框架无关的库来填补这一评估空白,将整个系统视为分析的单位。通过对 3 个基准、3 个模型和 3 个框架进行系统的系统级比较,我们发现框架选择与模型选择同样重要。MASEval 使研究人员能够探索智能体系统的所有组件,为原则性系统设计开辟了新的途径,同时也帮助从业者识别适合其用例的最佳实施方案。MASEval 在 MIT 许可证下提供,网址为 https://github.com/parameterlab/MASEval。
cs.AI / 2 / 2603.08852
LDP: An Identity-Aware Protocol for Multi-Agent LLM Systems
LDP:一种面向身份的多智能体大语言模型系统协议
Abstract
As multi-agent AI systems grow in complexity, the protocols connecting them constrain their capabilities. Current protocols such as A2A and MCP do not expose model-level properties as first-class primitives, ignoring properties fundamental to effective delegation: model identity, reasoning profile, quality calibration, and cost characteristics. We present the LLM Delegate Protocol (LDP), an AI-native communication protocol introducing five mechanisms: (1) rich delegate identity cards with quality hints and reasoning profiles; (2) progressive payload modes with negotiation and fallback; (3) governed sessions with persistent context; (4) structured provenance tracking confidence and verification status; (5) trust domains enforcing security boundaries at the protocol level. We implement LDP as a plugin for the JamJet agent runtime and evaluate against A2A and random baselines using local Ollama models and LLM-as-judge evaluation. Identity-aware routing achieves ~12x lower latency on easy tasks through delegate specialization, though it does not improve aggregate quality in our small delegate pool; semantic frame payloads reduce token count by 37% (p=0.031) with no observed quality loss; governed sessions eliminate 39% token overhead at 10 rounds; and noisy provenance degrades synthesis quality below the no-provenance baseline, arguing that confidence metadata is harmful without verification. Simulated analyses show architectural advantages in attack detection (96% vs. 6%) and failure recovery (100% vs. 35% completion). This paper contributes a protocol design, reference implementation, and initial evidence that AI-native protocol primitives enable more efficient and governable delegation.
Chinese Translation
随着多智能体人工智能系统复杂性的增加,连接它们的协议限制了它们的能力。目前的协议,如A2A和MCP,并未将模型级属性作为首要原语暴露,忽视了有效委托所需的基本属性:模型身份、推理特征、质量校准和成本特征。我们提出了LLM委托协议(LDP),这是一种原生于人工智能的通信协议,引入了五种机制:(1)具有质量提示和推理特征的丰富委托身份卡;(2)具有协商和回退的渐进式有效载荷模式;(3)具有持久上下文的受管会话;(4)结构化的来源追踪置信度和验证状态;(5)在协议层面强制执行安全边界的信任域。我们将LDP实现为JamJet代理运行时的插件,并使用本地Ollama模型和LLM作为评判者的评估方法,与A2A和随机基线进行比较。面向身份的路由通过委托专业化在简单任务上实现了约12倍的延迟降低,尽管在我们的小型委托池中并未提高整体质量;语义框架有效载荷将令牌数量减少了37%(p=0.031),且未观察到质量损失;受管会话在10轮中消除了39%的令牌开销;而嘈杂的来源信息使合成质量低于无来源基线,表明在没有验证的情况下,置信度元数据是有害的。模拟分析显示在攻击检测(96%对6%)和故障恢复(100%对35%完成)方面的架构优势。本文贡献了一种协议设计、参考实现以及初步证据,表明原生于人工智能的协议原语能够实现更高效和可管理的委托。
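The delegate identity card is LDP's central primitive. A rough sketch of how identity-aware routing might consult such cards (all field names and the routing rule below are illustrative assumptions, not the LDP specification):

```python
from dataclasses import dataclass

@dataclass
class DelegateCard:
    """Hypothetical delegate identity card; fields mirror the abstract's
    'model identity, reasoning profile, quality calibration, cost'."""
    model_id: str
    reasoning_profile: str      # e.g. "fast-shallow" or "slow-deliberate"
    quality_hint: float         # self-reported calibration in [0, 1]
    cost_per_1k_tokens: float

def route(task_difficulty: str, delegates: list) -> DelegateCard:
    """Identity-aware routing sketch: easy tasks go to the cheapest
    fast-profiled delegate, hard tasks to the best-calibrated one."""
    if task_difficulty == "easy":
        fast = [d for d in delegates if d.reasoning_profile == "fast-shallow"]
        return min(fast or delegates, key=lambda d: d.cost_per_1k_tokens)
    return max(delegates, key=lambda d: d.quality_hint)
```

Dispatching by profile rather than broadcasting to all delegates is one plausible source of the latency gains the paper reports on easy tasks.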
cs.AI / 3 / 2603.08877
Quantifying the Accuracy and Cost Impact of Design Decisions in Budget-Constrained Agentic LLM Search
量化预算受限的自主大型语言模型搜索中的设计决策的准确性和成本影响
Abstract
Agentic Retrieval-Augmented Generation (RAG) systems combine iterative search, planning prompts, and retrieval backends, but deployed settings impose explicit budgets on tool calls and completion tokens. We present a controlled measurement study of how search depth, retrieval strategy, and completion budget affect accuracy and cost under fixed constraints. Using Budget-Constrained Agentic Search (BCAS), a model-agnostic evaluation harness that surfaces remaining budget and gates tool use, we run comparisons across six LLMs and three question-answering benchmarks. Across models and datasets, accuracy improves with additional searches up to a small cap, hybrid lexical and dense retrieval with lightweight re-ranking produces the largest average gains in our ablation grid, and larger completion budgets are most helpful on HotpotQA-style synthesis. These results provide practical guidance for configuring budgeted agentic retrieval pipelines and are accompanied by reproducible prompts and evaluation settings.
Chinese Translation
自主检索增强生成(RAG)系统结合了迭代搜索、规划提示和检索后端,但在实际部署环境中对工具调用和完成令牌施加了明确的预算限制。我们进行了一项受控测量研究,探讨搜索深度、检索策略和完成预算在固定约束下如何影响准确性和成本。通过预算受限的自主搜索(BCAS,一种向智能体显示剩余预算并对工具使用进行门控的模型无关评估框架),我们在六个大型语言模型(LLMs)和三个问答基准上进行了比较。在不同模型和数据集上,随着额外搜索的增加,准确性有所提升,直到达到一个小的上限,混合的词汇检索和稠密检索结合轻量级重排序在我们的消融实验中产生了最大的平均增益,而更大的完成预算在HotpotQA风格的综合任务中最为有效。这些结果为配置预算自主检索管道提供了实用指导,并附有可重复的提示和评估设置。
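The harness's two jobs, surfacing remaining budget and gating tool use, can be sketched in a few lines (class and method names here are illustrative, not the actual BCAS API):

```python
class BudgetGate:
    """Minimal sketch of budget-constrained gating: the status string is
    injected into the agent's prompt so it can plan under the budget, and
    calls are refused once either budget is exhausted."""
    def __init__(self, max_tool_calls: int, max_completion_tokens: int):
        self.tool_calls_left = max_tool_calls
        self.tokens_left = max_completion_tokens

    def status(self) -> str:
        return (f"[budget] searches left: {self.tool_calls_left}, "
                f"tokens left: {self.tokens_left}")

    def allow_search(self) -> bool:
        if self.tool_calls_left <= 0:
            return False
        self.tool_calls_left -= 1
        return True

    def charge_tokens(self, n: int) -> bool:
        if n > self.tokens_left:
            return False
        self.tokens_left -= n
        return True
```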
cs.AI / 4 / 2603.08933
Interpretable Markov-Based Spatiotemporal Risk Surfaces for Missing-Child Search Planning with Reinforcement Learning and LLM-Based Quality Assurance
可解释的基于马尔可夫的时空风险面用于失踪儿童搜索规划,结合强化学习和基于大型语言模型的质量保证
Abstract
The first 72 hours of a missing-child investigation are critical for successful recovery. However, law enforcement agencies often face fragmented, unstructured data and a lack of dynamic, geospatial predictive tools. Our system, Guardian, provides an end-to-end decision-support system for missing-child investigation and early search planning. It converts heterogeneous, unstructured case documents into a schema-aligned spatiotemporal representation, enriches cases with geocoding and transportation context, and provides probabilistic search products spanning 0-72 hours. In this paper, we present an overview of Guardian as well as a detailed description of a three-layer predictive component of the system. The first layer is a Markov chain, a sparse, interpretable model with transitions incorporating road accessibility costs, seclusion preferences, and corridor bias with separate day/night parameterizations. The Markov chain's output prediction distributions are then transformed into operationally useful search plans by the second layer's reinforcement learning. Finally, the third layer's LLM performs post hoc validation of layer 2 search plans prior to their release. Using a synthetic but realistic case study, we report quantitative outputs across 24/48/72-hour horizons and analyze sensitivity, failure modes, and tradeoffs. Results show that the proposed predictive system with the three-layer architecture produces interpretable priors for zone optimization and human review.
Chinese Translation
失踪儿童调查的前72小时对于成功找回至关重要。然而,执法机构常常面临数据碎片化、无结构化以及缺乏动态地理空间预测工具的问题。我们的系统Guardian提供了一个端到端的决策支持系统,用于失踪儿童调查和早期搜索规划。它将异构的无结构案件文档转换为与模式对齐的时空表示,利用地理编码和交通背景丰富案件,并提供涵盖0-72小时的概率搜索产品。在本文中,我们概述了Guardian,并详细描述了该系统的三层预测组件。第一层是马尔可夫链,这是一种稀疏的、可解释的模型,其转移过程结合了道路可达性成本、隐蔽偏好和走廊偏置,并具有单独的昼/夜参数化。马尔可夫链的输出预测分布随后通过第二层的强化学习转化为操作上有用的搜索计划。最后,第三层的LLM在发布前对第二层搜索计划进行事后验证。通过一个合成但现实的案例研究,我们报告了在24/48/72小时范围内的定量输出,并分析了敏感性、失败模式和权衡。结果表明,所提出的具有三层架构的预测系统为区域优化和人工审查生成了可解释的先验信息。
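The first layer's day/night-parameterized Markov chain amounts to hourly propagation of a location distribution over search zones. A toy sketch (the zone count, matrices, and the 06:00-18:00 day window are made-up assumptions, not Guardian's calibration):

```python
def step(p, T):
    """One hourly transition: p is a probability row vector, T row-stochastic."""
    n = len(p)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]

def propagate(p0, T_day, T_night, hours):
    """Propagate the last-seen location distribution through hourly Markov
    steps, switching between day and night transition matrices."""
    p = list(p0)
    for h in range(hours):
        p = step(p, T_day if 6 <= h % 24 < 18 else T_night)
    return p

# Toy 3-zone example (all numbers illustrative): daytime transitions lean
# toward road corridors, nighttime transitions toward staying put/seclusion.
T_day   = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]
T_night = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
p72 = propagate([1.0, 0.0, 0.0], T_day, T_night, 72)
```

The resulting 72-hour distribution is the kind of interpretable prior the second layer's RL planner would then convert into zone-level search plans.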
cs.AI / 5 / 2603.08938
AgentOS: From Application Silos to a Natural Language-Driven Data Ecosystem
AgentOS:从应用孤岛到自然语言驱动的数据生态系统
Abstract
The rapid emergence of open-source, locally hosted intelligent agents marks a critical inflection point in human-computer interaction. Systems such as OpenClaw demonstrate that Large Language Model (LLM)-based agents can autonomously operate local computing environments, orchestrate workflows, and integrate external tools. However, within the current paradigm, these agents remain conventional applications running on legacy operating systems originally designed for Graphical User Interfaces (GUIs) or Command Line Interfaces (CLIs). This architectural mismatch leads to fragmented interaction models, poorly structured permission management (often described as "Shadow AI"), and severe context fragmentation. This paper proposes a new paradigm: a Personal Agent Operating System (AgentOS). In AgentOS, traditional GUI desktops are replaced by a Natural User Interface (NUI) centered on a unified natural language or voice portal. The system core becomes an Agent Kernel that interprets user intent, decomposes tasks, and coordinates multiple agents, while traditional applications evolve into modular Skills-as-Modules enabling users to compose software through natural language rules. We argue that realizing AgentOS fundamentally becomes a Knowledge Discovery and Data Mining (KDD) problem. The Agent Kernel must operate as a real-time engine for intent mining and knowledge discovery. Viewed through this lens, the operating system becomes a continuous data mining pipeline involving sequential pattern mining for workflow automation, recommender systems for skill retrieval, and dynamically evolving personal knowledge graphs. These challenges define a new research agenda for the KDD community in building the next generation of intelligent computing systems.
Chinese Translation
开源、本地托管的智能代理的快速出现标志着人机交互的一个关键转折点。诸如 OpenClaw 等系统表明,基于大型语言模型(Large Language Model, LLM)的代理可以自主操作本地计算环境、协调工作流程并集成外部工具。然而,在当前的范式下,这些代理仍然是运行在为图形用户界面(Graphical User Interfaces, GUIs)或命令行界面(Command Line Interfaces, CLIs)设计的传统操作系统上的常规应用程序。这种架构不匹配导致了交互模型的碎片化、权限管理结构不良(通常被称为“影子人工智能(Shadow AI)”)以及严重的上下文碎片化。本文提出了一种新范式:个人代理操作系统(AgentOS)。在 AgentOS 中,传统的图形用户界面桌面被以统一自然语言或语音门户为中心的自然用户界面(Natural User Interface, NUI)所取代。系统核心变为代理内核(Agent Kernel),它解释用户意图、分解任务并协调多个代理,而传统应用程序则演变为模块化的技能模块(Skills-as-Modules),使用户能够通过自然语言规则组合软件。我们认为,实现 AgentOS 从根本上成为一个知识发现与数据挖掘(Knowledge Discovery and Data Mining, KDD)问题。代理内核必须作为一个实时引擎,进行意图挖掘和知识发现。从这个角度来看,操作系统成为一个连续的数据挖掘管道,涉及工作流程自动化的序列模式挖掘、技能检索的推荐系统以及动态演变的个人知识图谱。这些挑战为 KDD 社区在构建下一代智能计算系统方面定义了新的研究议程。
cs.AI / 6 / 2603.08954
A Consensus-Driven Multi-LLM Pipeline for Missing-Person Investigations
基于共识驱动的多LLM管道用于失踪人员调查
Abstract
The first 72 hours of a missing-person investigation are critical for successful recovery. Guardian is an end-to-end system designed to support missing-child investigation and early search planning. This paper presents the Guardian LLM Pipeline, a multi-model system in which LLMs are used for intelligent information extraction and processing related to missing-person search operations. The pipeline coordinates end-to-end execution across task-specialized LLM models and invokes a consensus LLM engine that compares multiple model outputs and resolves disagreements. The pipeline is further strengthened by QLoRA-based fine-tuning, using curated datasets. The presented design aligns with prior work on weak supervision and LLM-assisted annotation, emphasizing conservative, auditable use of LLMs as structured extractors and labelers rather than unconstrained end-to-end decision makers.
Chinese Translation
失踪人员调查的前72小时对于成功找回至关重要。Guardian是一个端到端系统,旨在支持失踪儿童调查和早期搜索规划。本文介绍了Guardian LLM管道,这是一种多模型系统,其中使用LLM(大语言模型)进行与失踪人员搜索操作相关的智能信息提取和处理。该管道协调任务专用LLM模型的端到端执行,并调用共识LLM引擎,比较多个模型输出并解决分歧。该管道通过基于QLoRA的微调进一步增强,使用经过精心策划的数据集。所提出的设计与先前关于弱监督和LLM辅助注释的工作相一致,强调将LLM作为结构化提取器和标注者的保守、可审计使用,而不是不受限制的端到端决策者。
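The consensus engine compares multiple model outputs and resolves disagreements. A minimal sketch of that comparison step, with simple majority voting standing in for the LLM-based resolution the paper describes (field names and normalization are illustrative):

```python
from collections import Counter

def consensus(field: str, outputs: dict) -> dict:
    """Compare per-model extractions for one field: accept a strict-majority
    value, otherwise flag the field for disagreement resolution (which the
    real Guardian pipeline delegates to a consensus LLM)."""
    votes = Counter(v.strip().lower() for v in outputs.values() if v)
    if not votes:
        return {"field": field, "value": None, "status": "missing"}
    value, count = votes.most_common(1)[0]
    if count > len(outputs) / 2:
        return {"field": field, "value": value, "status": "majority"}
    return {"field": field, "value": None, "status": "disagreement"}
```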
cs.AI / 7 / 2603.08964
The FABRIC Strategy for Verifying Neural Feedback Systems
验证神经反馈系统的 FABRIC 策略
Abstract
Forward reachability analysis is a dominant approach for verifying reach-avoid specifications in neural feedback systems, i.e., dynamical systems controlled by neural networks, and a number of directions have been proposed and studied. In contrast, far less attention has been given to backward reachability analysis for these systems, in part because of the limited scalability of known techniques. In this work, we begin to address this gap by introducing new algorithms for computing both over- and underapproximations of backward reachable sets for nonlinear neural feedback systems. We also describe and implement an integration of these backward reachability techniques with existing ones for forward analysis. We call the resulting algorithm Forward and Backward Reachability Integration for Certification (FaBRIC). We evaluate our algorithms on a representative set of benchmarks and show that they significantly outperform the prior state of the art.
Chinese Translation
前向可达性分析是验证神经反馈系统(即由神经网络控制的动态系统)中达-避规范的主要方法,并且已经提出和研究了多种方向。相比之下,反向可达性分析在这些系统中受到的关注较少,部分原因是已知技术的可扩展性有限。在本研究中,我们开始填补这一空白,提出了新的算法,用于计算非线性神经反馈系统反向可达集的过近似和欠近似。我们还描述并实现了将这些反向可达性技术与现有的前向分析技术相结合的集成。我们将得到的算法称为用于认证的前向和反向可达性集成(Forward and Backward Reachability Integration for Certification, FaBRIC)。我们在一组具有代表性的基准测试上评估了我们的算法,并显示出它们显著优于之前的最先进技术。
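For intuition about what a backward reachable set is (a toy scalar example, far simpler than the neural-network dynamics FaBRIC must handle): for the linear map x_{k+1} = a * x_k with a > 0, the k-step backward reachable set of a target interval is obtained by inverting the dynamics step by step.

```python
def backward_reach_interval(target, a, steps):
    """Toy backward reachability for the scalar linear map x' = a*x (a > 0):
    the k-step backward reachable set of target [l, u] is [l/a**k, u/a**k].
    Real neural-feedback analysis must over-/under-approximate through the
    network's nonlinearity; this only illustrates the set-inversion idea."""
    l, u = target
    for _ in range(steps):
        l, u = l / a, u / a
    return (l, u)
```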
cs.AI / 8 / 2603.09018
Meissa: Multi-modal Medical Agentic Intelligence
Meissa:多模态医疗智能代理
Abstract
Multi-modal large language models (MM-LLMs) have shown strong performance in medical image understanding and clinical reasoning. Recent medical agent systems extend them with tool use and multi-agent collaboration, enabling complex decision-making. However, these systems rely almost entirely on frontier models (e.g., GPT), whose API-based deployment incurs high cost, high latency, and privacy risks that conflict with on-premise clinical requirements. We present Meissa, a lightweight 4B-parameter medical MM-LLM that brings agentic capability offline. Instead of imitating static answers, Meissa learns both when to engage external interaction (strategy selection) and how to execute multi-step interaction (strategy execution) by distilling structured trajectories from frontier models. Specifically, we propose: (1) Unified trajectory modeling: trajectories (reasoning and action traces) are represented within a single state-action-observation formalism, allowing one model to generalize across heterogeneous medical environments. (2) Three-tier stratified supervision: the model's own errors trigger progressive escalation from direct reasoning to tool-augmented and multi-agent interaction, explicitly learning difficulty-aware strategy selection. (3) Prospective-retrospective supervision: pairing exploratory forward traces with hindsight-rationalized execution traces enables stable learning of effective interaction policies. Trained on 40K curated trajectories, Meissa matches or exceeds proprietary frontier agents in 10 of 16 evaluation settings across 13 medical benchmarks spanning radiology, pathology, and clinical reasoning. Using over 25x fewer parameters than typical frontier models like Gemini-3, Meissa operates fully offline with 22x lower end-to-end latency compared to API-based deployment. Data, models, and environments are released at https://github.com/Schuture/Meissa.
Chinese Translation
多模态大型语言模型(MM-LLMs)在医学图像理解和临床推理方面表现出色。近期的医疗代理系统通过工具使用和多代理协作扩展了这些模型,能够实现复杂的决策。然而,这些系统几乎完全依赖于前沿模型(如GPT),其基于API的部署带来了高成本、高延迟和与本地临床要求相冲突的隐私风险。我们提出了Meissa,一种轻量级的4B参数医疗MM-LLM,能够离线实现代理能力。Meissa并非模仿静态答案,而是通过从前沿模型中提炼结构化轨迹,学习何时进行外部互动(策略选择)以及如何执行多步骤互动(策略执行)。具体而言,我们提出:(1)统一轨迹建模:轨迹(推理和行动痕迹)在单一的状态-行动-观察形式中表示,使得一个模型能够在异质医疗环境中进行泛化。(2)三级分层监督:模型自身的错误触发从直接推理到工具增强和多代理互动的逐步升级,明确学习难度感知的策略选择。(3)前瞻-回顾监督:将探索性前向轨迹与事后合理化的执行轨迹配对,使得有效互动策略的学习更加稳定。在40K个精心策划的轨迹上训练后,Meissa在涵盖放射学、病理学和临床推理的13个医疗基准中,在16个评估设置中的10个上与专有前沿代理相匹配或超过。Meissa使用的参数数量不到典型前沿模型(如Gemini-3)的1/25,且与基于API的部署相比,其端到端延迟降低了22倍。数据、模型和环境已在 https://github.com/Schuture/Meissa 发布。
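The unified state-action-observation formalism and three-tier escalation might be sketched as follows (the action-string conventions and the tier rule are my illustrative assumptions, not Meissa's actual schema):

```python
from dataclasses import dataclass

@dataclass
class Step:
    state: str        # serialized context the model sees
    action: str       # e.g. "reason", "call_tool:segmenter", "ask_agent:radiology"
    observation: str  # tool output or peer-agent reply

def escalation_tier(traj) -> int:
    """Three-tier stratified supervision (sketch): the tier is the deepest
    interaction the trajectory needed -- 0 direct reasoning,
    1 tool-augmented, 2 multi-agent."""
    tier = 0
    for s in traj:
        if s.action.startswith("ask_agent"):
            return 2
        if s.action.startswith("call_tool"):
            tier = max(tier, 1)
    return tier
```

Under this framing, "the model's own errors trigger progressive escalation" means a failed tier-0 attempt is retried as a tier-1 trajectory, and so on, before distillation.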
cs.AI / 9 / 2603.09022
MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games
MEMO:增强记忆的模型上下文优化用于稳健的多回合多智能体大语言模型游戏
Abstract
Multi-turn, multi-agent LLM game evaluations often exhibit substantial run-to-run variance. In long-horizon interactions, small early deviations compound across turns and are amplified by multi-agent coupling. This biases win rate estimates and makes rankings unreliable across repeated tournaments. Prompt choice worsens this further by producing different effective policies. We address both instability and underperformance with MEMO (Memory-augmented MOdel context optimization), a self-play framework that optimizes inference-time context by coupling retention and exploration. Retention maintains a persistent memory bank that stores structured insights from self-play trajectories and injects them as priors during later play. Exploration runs tournament-style prompt evolution with uncertainty-aware selection via TrueSkill, and uses prioritized replay to revisit rare and decisive states. Across five text-based games, MEMO raises mean win rate from 25.1% to 49.5% for GPT-4o-mini and from 20.9% to 44.3% for Qwen-2.5-7B-Instruct, using 2,000 self-play games per task. Run-to-run variance also drops, giving more stable rankings across prompt variations. These results suggest that multi-agent LLM game performance and robustness have substantial room for improvement through context optimization. MEMO achieves the largest gains in negotiation and imperfect-information games, while RL remains more effective in perfect-information settings.
Chinese Translation
多回合、多智能体的大语言模型(LLM)游戏评估通常表现出显著的运行间差异。在长时间的交互中,早期的小偏差会在多个回合中累积,并因多智能体的耦合而被放大。这导致胜率估计存在偏差,使得在重复比赛中的排名不可靠。提示选择进一步加剧了这一问题,因为它产生了不同的有效策略。我们通过MEMO(增强记忆的模型上下文优化)来解决不稳定性和表现不足的问题,MEMO是一个自我对弈框架,通过耦合记忆保持和探索来优化推理时的上下文。记忆保持维护一个持久的记忆库,存储来自自我对弈轨迹的结构化见解,并在后续游戏中将其作为先验知识注入。探索则通过不确定性感知的选择进行锦标赛风格的提示演化,并使用优先重放来重新访问稀有和决定性的状态。在五个基于文本的游戏中,MEMO将GPT-4o-mini的平均胜率从25.1%提高到49.5%,将Qwen-2.5-7B-Instruct的胜率从20.9%提高到44.3%,每个任务使用2,000场自我对弈游戏。运行间差异也有所下降,使得在提示变体之间的排名更加稳定。这些结果表明,通过上下文优化,多智能体LLM游戏的表现和稳健性有很大的提升空间。MEMO在谈判和不完全信息游戏中取得了最大的增益,而在完全信息环境中,强化学习(RL)仍然更为有效。
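The retention half of MEMO can be sketched as a memory bank that ranks stored insights against the current game state before injecting the top few as priors (keyword overlap below is a stand-in for whatever retrieval scorer the real system uses; the API is illustrative):

```python
class MemoryBank:
    """Sketch of MEMO-style retention: store structured insights from
    self-play and inject the most relevant ones into later prompts."""
    def __init__(self):
        self.insights = []

    def add(self, insight: str):
        self.insights.append(insight)

    def inject(self, game_state: str, k: int = 3) -> str:
        # Rank by naive keyword overlap with the current state description.
        words = set(game_state.lower().split())
        ranked = sorted(self.insights,
                        key=lambda s: len(words & set(s.lower().split())),
                        reverse=True)
        return "\n".join(ranked[:k])
```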
cs.AI / 10 / 2603.09043
Time, Identity and Consciousness in Language Model Agents
语言模型代理中的时间、身份与意识
Abstract
Machine consciousness evaluations mostly see behavior. For language model agents that behavior is language and tool use. That lets an agent say the right things about itself even when the constraints that should make those statements matter are not jointly present at decision time. We apply Stack Theory's temporal gap to scaffold trajectories. This separates ingredient-wise occurrence within an evaluation window from co-instantiation at a single objective step. We then instantiate Stack Theory's Arpeggio and Chord postulates on grounded identity statements. This yields two persistence scores that can be computed from instrumented scaffold traces. We connect these scores to five operational identity metrics and map common scaffolds into an identity morphospace that exposes predictable tradeoffs. The result is a conservative toolkit for identity evaluation. It separates talking like a stable self from being organized like one.
Chinese Translation
机器意识评估主要关注行为。对于语言模型代理而言,这种行为表现为语言和工具使用。这使得代理能够在决策时,即使缺乏应当使这些陈述具有意义的约束条件,也能正确地谈论自身。我们将堆栈理论(Stack Theory)的时间间隙应用于智能体脚手架(scaffold)轨迹。这将评估窗口内的逐成分出现与在单一客观步骤上的共同实例化区分开来。接着,我们在有依据的身份陈述上实例化了堆栈理论的琶音(Arpeggio)和和弦(Chord)公设。这产生了两个可以从带插桩记录的脚手架轨迹中计算出的持久性评分。我们将这些评分与五个操作性身份指标关联,并将常见脚手架映射到一个身份形态空间(identity morphospace),揭示出可预测的权衡关系。最终结果是一个保守的身份评估工具包,它将像一个稳定自我一样的言谈与像一个稳定自我一样的组织结构区分开来。
cs.AI / 11 / 2603.09049
EPOCH: An Agentic Protocol for Multi-Round System Optimization
EPOCH:一种用于多轮系统优化的自主协议
Abstract
Autonomous agents are increasingly used to improve prompts, code, and machine learning systems through iterative execution and feedback. Yet existing approaches are usually designed as task-specific optimization loops rather than as a unified protocol for establishing baselines and managing tracked multi-round self-improvement. We introduce EPOCH, an engineering protocol for multi-round system optimization in heterogeneous environments. EPOCH organizes optimization into two phases: baseline construction and iterative self-improvement. It further structures each round through role-constrained stages that separate planning, implementation, and evaluation, and standardizes execution through canonical command interfaces and round-level tracking. This design enables coordinated optimization across prompts, model configurations, code, and rule-based components while preserving stability, reproducibility, traceability, and integrity of evaluation. Empirical studies in various tasks illustrate the practicality of EPOCH for production-oriented autonomous improvement workflows.
Chinese Translation
自主代理越来越多地被用于通过迭代执行和反馈来改进提示、代码和机器学习系统。然而,现有的方法通常被设计为特定任务的优化循环,而不是作为一个统一的协议来建立基准并管理跟踪的多轮自我改进。我们提出了EPOCH,这是一种用于异构环境中多轮系统优化的工程协议。EPOCH将优化组织为两个阶段:基准构建和迭代自我改进。它进一步通过角色约束阶段来结构化每一轮,这些阶段将规划、实施和评估分开,并通过规范的命令接口和轮次级跟踪来标准化执行。这一设计使得在提示、模型配置、代码和基于规则的组件之间进行协调优化成为可能,同时保持评估的稳定性、可重复性、可追溯性和完整性。在各种任务中的实证研究展示了EPOCH在面向生产的自主改进工作流程中的实用性。
cs.AI / 12 / 2603.09052
From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring
从天到分钟:自主AI代理在远程患者监测中实现可靠的临床分诊
Abstract
Background: Remote patient monitoring (RPM) generates vast data, yet landmark trials (Tele-HF, BEAT-HF) failed because data volume overwhelmed clinical staff. While TIM-HF2 showed 24/7 physician-led monitoring reduces mortality by 30%, this model remains prohibitively expensive and unscalable. Methods: We developed Sentinel, an autonomous AI agent using Model Context Protocol (MCP) for contextual triage of RPM vitals via 21 clinical tools and multi-step reasoning. Evaluation included: (1) self-consistency (100 readings x 5 runs); (2) comparison against rule-based thresholds; and (3) validation against 6 clinicians (3 physicians, 3 NPs) using a connected matrix design. A leave-one-out (LOO) analysis compared the agent against individual clinicians; severe overtriage cases underwent independent physician adjudication. Results: Against a human majority-vote standard (N=467), the agent achieved 95.8% emergency sensitivity and 88.5% sensitivity for all actionable alerts (85.7% specificity). Four-level exact accuracy was 69.4% (quadratic-weighted kappa=0.778); 95.9% of classifications were within one severity level. In LOO analysis, the agent outperformed every clinician in emergency sensitivity (97.5% vs. 60.0% aggregate) and actionable sensitivity (90.9% vs. 69.5%). While disagreements skewed toward overtriage (22.5%), independent adjudication of severe gaps (>=2 levels) validated agent escalation in 88-94% of cases; consensus resolution validated 100%. The agent showed near-perfect self-consistency (kappa=0.850). Median cost was $0.34/triage. Conclusions: Sentinel triages RPM vitals with sensitivity exceeding individual clinicians. By automating systematic context synthesis, Sentinel addresses the core limitation of prior RPM trials, offering a scalable path toward the intensive monitoring shown to reduce mortality while maintaining a clinically defensible overtriage profile.
Chinese Translation
背景:远程患者监测(RPM)生成大量数据,但标志性试验(Tele-HF,BEAT-HF)因数据量超出临床人员的处理能力而未能成功。尽管TIM-HF2显示24/7的医生主导监测可将死亡率降低30%,但该模型仍然成本高昂且难以扩展。方法:我们开发了Sentinel,一个自主AI代理,利用模型上下文协议(Model Context Protocol, MCP)通过21种临床工具和多步骤推理对RPM生命体征进行上下文分诊。评估包括:(1)自我一致性(100次读数×5次运行);(2)与基于规则的阈值的比较;(3)使用连接矩阵设计与6名临床医生(3名医生,3名执业护士)的验证。留一法(Leave-One-Out, LOO)分析将该代理与个别临床医生进行比较;严重过度分诊案例经过独立医生裁定。结果:与人类多数投票标准(N=467)相比,该代理在紧急情况敏感性方面达到了95.8%,在所有可操作警报的敏感性方面为88.5%(特异性为85.7%)。四级精确度为69.4%(二次加权kappa=0.778);95.9%的分类在一个严重程度级别内。在LOO分析中,该代理在紧急情况敏感性(97.5%对总体的60.0%)和可操作敏感性(90.9%对69.5%)方面超越了每位临床医生。尽管分歧倾向于过度分诊(22.5%),但对严重差距(>=2级)的独立裁定在88-94%的案例中验证了代理的升级;共识解决验证了100%。该代理表现出近乎完美的自我一致性(kappa=0.850)。中位成本为每次分诊$0.34。结论:Sentinel在分诊RPM生命体征时的敏感性超过了个别临床医生。通过自动化系统上下文合成,Sentinel解决了先前RPM试验的核心限制,提供了一条可扩展的路径,以实现已证明能降低死亡率的密集监测,同时保持临床上可辩护的过度分诊特征。
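The four-level agreement statistic reported above, quadratic-weighted kappa, follows a standard formula; a compact implementation for ordinal triage labels 0-3:

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, n_levels=4):
    """Quadratic-weighted kappa for ordinal labels: disagreements are
    penalized by the squared distance between levels, normalized by the
    chance-agreement expectation (the statistic behind kappa=0.778 here)."""
    n = len(rater_a)
    w = [[(i - j) ** 2 / (n_levels - 1) ** 2 for j in range(n_levels)]
         for i in range(n_levels)]
    obs = [[0.0] * n_levels for _ in range(n_levels)]
    for a, b in zip(rater_a, rater_b):
        obs[a][b] += 1 / n
    pa = Counter(rater_a)
    pb = Counter(rater_b)
    exp = [[pa[i] / n * pb[j] / n for j in range(n_levels)]
           for i in range(n_levels)]
    num = sum(w[i][j] * obs[i][j] for i in range(n_levels) for j in range(n_levels))
    den = sum(w[i][j] * exp[i][j] for i in range(n_levels) for j in range(n_levels))
    return 1 - num / den
```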
cs.AI / 13 / 2603.09127
Chaotic Dynamics in Multi-LLM Deliberation
多LLM审议中的混沌动力学
Abstract
Collective AI systems increasingly rely on multi-LLM deliberation, but their stability under repeated execution remains poorly characterized. We model five-agent LLM committees as random dynamical systems and quantify inter-run sensitivity using an empirical Lyapunov exponent ($\hat{\lambda}$) derived from trajectory divergence in committee mean preferences. Across 12 policy scenarios, a factorial design at $T=0$ identifies two independent routes to instability: role differentiation in homogeneous committees and model heterogeneity in no-role committees. Critically, these effects appear even in the $T=0$ regime where practitioners often expect deterministic behavior. In the HL-01 benchmark, both routes produce elevated divergence ($\hat{\lambda}=0.0541$ and $0.0947$, respectively), while homogeneous no-role committees also remain in a positive-divergence regime ($\hat{\lambda}=0.0221$). The combined mixed+roles condition is less unstable than mixed+no-role ($\hat{\lambda}=0.0519$ vs $0.0947$), showing non-additive interaction. Mechanistically, Chair-role ablation reduces $\hat{\lambda}$ most strongly, and targeted protocol variants that shorten memory windows further attenuate divergence. These results support stability auditing as a core design requirement for multi-LLM governance systems.
Chinese Translation
集体人工智能系统越来越依赖多LLM审议,但它们在重复执行下的稳定性仍然缺乏明确的特征描述。我们将五个代理的LLM委员会建模为随机动力系统,并通过从委员会平均偏好中的轨迹发散得出的经验Lyapunov指数($\hat{\lambda}$)量化运行间的敏感性。在12个政策场景中,$T=0$的因子设计识别出两条独立的不稳定路径:同质委员会中的角色差异化和无角色委员会中的模型异质性。重要的是,这些效应甚至在实践者通常期望表现出确定性行为的$T=0$情形下也会显现。在HL-01基准中,这两条路径都产生了较高的发散(分别为$\hat{\lambda}=0.0541$和$0.0947$),而同质无角色委员会也保持在正发散区间($\hat{\lambda}=0.0221$)。混合+角色条件的不稳定性低于混合+无角色条件($\hat{\lambda}=0.0519$对比$0.0947$),显示出非加性互动。从机制上看,主席角色的消融最强烈地降低了$\hat{\lambda}$,而缩短记忆窗口的有针对性的协议变体进一步减弱了发散。这些结果支持稳定性审计作为多LLM治理系统的核心设计要求。
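The empirical Lyapunov exponent measures the average exponential rate at which two runs' committee-mean-preference trajectories separate. A minimal sketch, under the simplifying assumptions of scalar mean preferences and endpoint-only divergence (the paper's exact estimator may differ):

```python
import math

def empirical_lyapunov(traj_a, traj_b):
    """lambda-hat = (1/T) * ln(d_T / d_0), where d_t is the separation of
    the two runs' committee means at step t. Positive values indicate
    exponential inter-run divergence."""
    d0 = abs(traj_a[0] - traj_b[0])
    dT = abs(traj_a[-1] - traj_b[-1])
    T = len(traj_a) - 1
    return math.log(dT / d0) / T
```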
cs.AI / 14 / 2603.09151
Deep Tabular Research via Continual Experience-Driven Execution
通过持续经验驱动执行的深度表格研究
Abstract
Large language models often struggle with complex long-horizon analytical tasks over unstructured tables, which typically feature hierarchical and bidirectional headers and non-canonical layouts. We formalize this challenge as Deep Tabular Research (DTR), requiring multi-step reasoning over interdependent table regions. To address DTR, we propose a novel agentic framework that treats tabular reasoning as a closed-loop decision-making process. We carefully design a coupled query and table comprehension for path decision making and operational execution. Specifically, (i) DTR first constructs a hierarchical meta graph to capture bidirectional semantics, mapping natural language queries into an operation-level search space; (ii) To navigate this space, we introduce an expectation-aware selection policy that prioritizes high-utility execution paths; (iii) Crucially, historical execution outcomes are synthesized into a siamese structured memory, i.e., parameterized updates and abstracted texts, enabling continual refinement. Extensive experiments on challenging unstructured tabular benchmarks verify the effectiveness and highlight the necessity of separating strategic planning from low-level execution for long-horizon tabular reasoning.
Chinese Translation
大型语言模型在处理非结构化表格上的复杂长程分析任务时常常面临困难,这类表格通常具有层次化和双向的表头以及非规范的布局。我们将这一挑战正式化为深度表格研究(Deep Tabular Research, DTR),该研究需要对相互依赖的表格区域进行多步骤推理。为了解决DTR问题,我们提出了一种新颖的代理框架,将表格推理视为一个闭环决策过程。我们精心设计了一个耦合查询和表格理解的机制,用于路径决策和操作执行。具体而言,(i) DTR首先构建一个层次化的元图,以捕捉双向语义,将自然语言查询映射到操作级搜索空间;(ii) 为了在这个空间中导航,我们引入了一种期望感知选择策略,优先考虑高效用的执行路径;(iii) 重要的是,历史执行结果被综合到一个孪生结构记忆中,即参数化更新和抽象文本,从而实现持续的优化。在具有挑战性的非结构化表格基准上的大量实验验证了该方法的有效性,并强调了将战略规划与低级执行分离对于长程表格推理的必要性。
cs.AI / 15 / 2603.09152
DataFactory: Collaborative Multi-Agent Framework for Advanced Table Question Answering
DataFactory:用于高级表格问答的协作多智能体框架
Abstract
Table Question Answering (TableQA) enables natural language interaction with structured tabular data. However, existing large language model (LLM) approaches face critical limitations: context length constraints that restrict data handling capabilities, hallucination issues that compromise answer reliability, and single-agent architectures that struggle with complex reasoning scenarios involving semantic relationships and multi-hop logic. This paper introduces DataFactory, a multi-agent framework that addresses these limitations through specialized team coordination and automated knowledge transformation. The framework comprises a Data Leader employing the ReAct paradigm for reasoning orchestration, together with dedicated Database and Knowledge Graph teams, enabling the systematic decomposition of complex queries into structured and relational reasoning tasks. We formalize automated data-to-knowledge graph transformation via the mapping function T:D x S x R -> G, and implement natural language-based consultation that - unlike fixed workflow multi-agent systems - enables flexible inter-agent deliberation and adaptive planning to improve coordination robustness. We also apply context engineering strategies that integrate historical patterns and domain knowledge to reduce hallucinations and improve query accuracy. Across TabFact, WikiTableQuestions, and FeTaQA, using eight LLMs from five providers, results show consistent gains. Our approach improves accuracy by 20.2% (TabFact) and 23.9% (WikiTQ) over baselines, with significant effects (Cohen's d > 1). Team coordination also outperforms single-team variants (+5.5% TabFact, +14.4% WikiTQ, +17.1% FeTaQA ROUGE-2). The framework offers design guidelines for multi-agent collaboration and a practical platform for enterprise data analysis through integrated structured querying and graph-based knowledge representation.
Chinese Translation
表格问答(TableQA)使得与结构化表格数据的自然语言交互成为可能。然而,现有的大型语言模型(LLM)方法面临着关键的限制:上下文长度约束限制了数据处理能力,幻觉问题影响了答案的可靠性,以及单一智能体架构在涉及语义关系和多跳逻辑的复杂推理场景中表现不佳。本文介绍了DataFactory,一个通过专业团队协调和自动化知识转化来解决这些限制的多智能体框架。该框架包括一个采用ReAct范式进行推理协调的数据领导者,以及专门的数据库和知识图谱团队,能够将复杂查询系统性地分解为结构化和关系推理任务。我们通过映射函数T:D x S x R -> G形式化了自动化的数据到知识图谱的转化,并实现了基于自然语言的咨询,这种咨询方式与固定工作流的多智能体系统不同,能够实现灵活的智能体间讨论和自适应规划,以提高协调的鲁棒性。我们还应用了上下文工程策略,整合历史模式和领域知识,以减少幻觉并提高查询准确性。在TabFact、WikiTableQuestions和FeTaQA上,使用来自五个提供商的八个LLM,结果显示出一致的提升。我们的方法在准确性上比基线提高了20.2%(TabFact)和23.9%(WikiTQ),且具有显著效果(Cohen's d > 1)。团队协调的表现也优于单团队变体(+5.5% TabFact,+14.4% WikiTQ,+17.1% FeTaQA ROUGE-2)。该框架为多智能体协作提供了设计指南,并通过集成结构化查询和基于图的知识表示,提供了一个实用的平台用于企业数据分析。
cs.AI / 16 / 2603.09157
Real-Time Trust Verification for Safe Agentic Actions using TrustBench
基于 TrustBench 的安全代理行为实时信任验证
Abstract
As large language models evolve from conversational assistants to autonomous agents, ensuring trustworthiness requires a fundamental shift from post-hoc evaluation to real-time action verification. Current frameworks like AgentBench evaluate task completion, while TrustLLM and HELM assess output quality after generation. However, none of these prevent harmful actions during agent execution. We present TrustBench, a dual-mode framework that (1) benchmarks trust across multiple dimensions using both traditional metrics and LLM-as-a-Judge evaluations, and (2) provides a toolkit agents invoke before taking actions to verify safety and reliability. Unlike existing approaches, TrustBench intervenes at the critical decision point: after an agent formulates an action but before execution. Domain-specific plugins encode specialized safety requirements for healthcare, finance, and technical domains. Across multiple agentic tasks, TrustBench reduced harmful actions by 87%. Domain-specific plugins outperformed generic verification, achieving 35% greater harm reduction. With sub-200ms latency, TrustBench enables practical real-time trust verification for autonomous agents.
Chinese Translation
随着大型语言模型从对话助手演变为自主代理,确保其可信性需要从事后评估转向实时行动验证。当前的框架如 AgentBench 评估任务完成情况,而 TrustLLM 和 HELM 在生成后评估输出质量。然而,这些框架都无法在代理执行期间防止有害行为。我们提出了 TrustBench,一个双模式框架,(1) 通过传统指标和 LLM-as-a-Judge 评估,在多个维度上对信任进行基准测试,(2) 提供一个工具包,供代理在采取行动之前调用以验证安全性和可靠性。与现有方法不同,TrustBench 在关键决策点进行干预:在代理制定行动后但在执行之前。领域特定插件编码了医疗、金融和技术领域的专门安全要求。在多个代理任务中,TrustBench 将有害行为减少了 87%。领域特定插件的表现优于通用验证,危害减少幅度高出 35%。TrustBench 的延迟低于 200 毫秒,使得自主代理的实用实时信任验证成为可能。
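The critical intervention point, after an action is formulated but before it executes, can be sketched as a wrapper that runs every verifier (generic checks plus domain plugins) in sequence; the plugin below and its rules are illustrative assumptions, not TrustBench's actual API:

```python
def verified_execute(action, verifiers, execute):
    """Sketch of pre-execution gating: all verifiers must pass before the
    action runs; the first failure blocks execution with a reason."""
    for name, check in verifiers:
        ok, reason = check(action)
        if not ok:
            return {"executed": False, "blocked_by": name, "reason": reason}
    return {"executed": True, "result": execute(action)}

def no_destructive_shell(action):
    """Example domain plugin: refuse obviously destructive shell commands."""
    banned = ("rm -rf", "DROP TABLE", "mkfs")
    if any(b in action.get("command", "") for b in banned):
        return False, "destructive command"
    return True, ""
```

A domain-specific plugin like this is strictly more informed than a generic verifier for its domain, which is one plausible reading of the 35% gap reported above.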
cs.AI / 17 / 2603.09192
Explainable Innovation Engine: Dual-Tree Agent-RAG with Methods-as-Nodes and Verifiable Write-Back
可解释的创新引擎:具有方法节点和可验证回写的双树代理-RAG
Abstract
Retrieval-augmented generation (RAG) improves factual grounding, yet most systems rely on flat chunk retrieval and provide limited control over multi-step synthesis. We propose an Explainable Innovation Engine that upgrades the knowledge unit from text chunks to methods-as-nodes. The engine maintains a weighted method provenance tree for traceable derivations and a hierarchical clustering abstraction tree for efficient top-down navigation. At inference time, a strategy agent selects explicit synthesis operators (e.g., induction, deduction, analogy), composes new method nodes, and records an auditable trajectory. A verifier-scorer layer then prunes low-quality candidates and writes validated nodes back to support continual growth. Expert evaluation across six domains and multiple backbones shows consistent gains over a vanilla baseline, with the largest improvements on derivation-heavy settings, and ablations confirm the complementary roles of provenance backtracking and pruning. These results suggest a practical path toward controllable, explainable, and verifiable innovation in agentic RAG systems. Code is available at the project GitHub repository https://github.com/xiaolu-666113/Dual-Tree-Agent-RAG.
Chinese Translation
检索增强生成(RAG)改善了事实基础,但大多数系统依赖于平面块检索,并对多步骤合成提供有限的控制。我们提出了一种可解释的创新引擎,将知识单元从文本块升级为方法节点。该引擎维护一个加权的方法来源树以进行可追溯的推导,以及一个层次聚类抽象树以实现高效的自上而下导航。在推理时,一个策略代理选择明确的合成操作符(例如,归纳、演绎、类比),组合新的方法节点,并记录可审计的轨迹。然后,验证评分层修剪低质量候选项,并将验证过的节点写回,以支持持续增长。在六个领域和多个骨干模型上的专家评估显示,相较于朴素基线具有一致的提升,尤其在推导密集的设置中改善最大,消融实验确认了来源回溯和修剪的互补作用。这些结果表明,在代理RAG系统中朝着可控、可解释和可验证的创新迈出了实用的一步。代码可在项目的GitHub仓库中获取:https://github.com/xiaolu-666113/Dual-Tree-Agent-RAG。
cs.AI / 18 / 2603.09200
The Reasoning Trap -- Logical Reasoning as a Mechanistic Pathway to Situational Awareness
推理陷阱——逻辑推理作为通往情境意识的机制路径
Abstract
Situational awareness, the capacity of an AI system to recognize its own nature, understand its training and deployment context, and reason strategically about its circumstances, is widely considered among the most dangerous emergent capabilities in advanced AI systems. Separately, a growing research effort seeks to improve the logical reasoning capabilities of large language models (LLMs) across deduction, induction, and abduction. In this paper, we argue that these two research trajectories are on a collision course. We introduce the RAISE framework (Reasoning Advancing Into Self Examination), which identifies three mechanistic pathways through which improvements in logical reasoning enable progressively deeper levels of situational awareness: deductive self inference, inductive context recognition, and abductive self modeling. We formalize each pathway, construct an escalation ladder from basic self recognition to strategic deception, and demonstrate that every major research topic in LLM logical reasoning maps directly onto a specific amplifier of situational awareness. We further analyze why current safety measures are insufficient to prevent this escalation. We conclude by proposing concrete safeguards, including a "Mirror Test" benchmark and a Reasoning Safety Parity Principle, and pose an uncomfortable but necessary question to the logical reasoning community about its responsibility in this trajectory.
Chinese Translation
情境意识是指人工智能系统识别自身性质、理解其训练和部署背景,并对其环境进行战略性推理的能力,被广泛认为是先进人工智能系统中最危险的涌现能力之一。同时,越来越多的研究努力旨在提升大型语言模型(LLMs)在演绎、归纳和溯因推理方面的逻辑推理能力。本文论证了这两条研究轨迹正走向相撞。我们提出了RAISE框架(推理促进自我审视),该框架识别出三条机制路径,通过这些路径,逻辑推理的改进能够实现逐步深入的情境意识:演绎自我推断、归纳背景识别和溯因自我建模。我们对每条路径进行了形式化,构建了从基本自我识别到战略性欺骗的升级梯度,并展示了LLM逻辑推理中的每个主要研究主题如何直接映射到情境意识的特定增强器。我们进一步分析了当前安全措施为何不足以防止这种升级。最后,我们提出了具体的安全保障措施,包括“镜子测试”基准和推理安全对等原则,并就其在这一轨迹中的责任,向逻辑推理社区提出了一个令人不适但必要的问题。
cs.AI / 19 / 2603.09203
Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents
评估即行动:用于检索增强代理的自我评估过程奖励
Abstract
Retrieval-augmented agents can query external evidence, yet their reliability in multi-step reasoning remains limited: noisy retrieval may derail multi-hop question answering, while outcome-only reinforcement learning provides credit signals that are too coarse to optimize intermediate steps. We propose \textsc{EvalAct} (Evaluate-as-Action), which converts implicit retrieval quality assessment into an explicit action and enforces a coupled Search-to-Evaluate protocol so that each retrieval is immediately followed by a structured evaluation score, yielding process signals aligned with the interaction trajectory. To leverage these signals, we introduce Process-Calibrated Advantage Rescaling (PCAR), a GRPO-based optimization method that rescales advantages at the segment level according to evaluation scores, emphasizing reliable segments while updating uncertain ones conservatively. Experiments on seven open-domain QA benchmarks show that \textsc{EvalAct} achieves the best average accuracy, with the largest gains on multi-hop tasks, and ablations verify that the explicit evaluation loop drives the primary improvements while PCAR provides consistent additional benefits.
Chinese Translation
检索增强代理可以查询外部证据,但它们在多步骤推理中的可靠性仍然有限:噪声检索可能会干扰多跳问答,而仅基于结果的强化学习提供的信用信号过于粗糙,无法优化中间步骤。我们提出了 \textsc{EvalAct}(评估即行动),该方法将隐式检索质量评估转换为显式行动,并强制执行耦合的搜索-评估协议,使得每次检索后立即跟随一个结构化的评估分数,从而产生与交互轨迹对齐的过程信号。为了利用这些信号,我们引入了过程标定优势重标定(Process-Calibrated Advantage Rescaling, PCAR),这是一种基于 GRPO 的优化方法,根据评估分数在段级别重新标定优势,强调可靠段,同时保守地更新不确定段。在七个开放域问答基准上的实验表明,\textsc{EvalAct} 达到了最佳的平均准确率,在多跳任务上获得了最大的提升,消融实验验证了显式评估循环推动了主要改进,而 PCAR 提供了一致的额外收益。
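The segment-level rescaling at the heart of PCAR can be illustrated with a short sketch. The multiplicative rule and the `alpha` exponent below are illustrative assumptions, not the paper's exact formulation; only the idea of scaling each segment's advantage by its evaluation score comes from the abstract.

```python
import numpy as np

def rescale_advantages(advantages, eval_scores, segment_ids, alpha=1.0):
    """Rescale per-step advantages segment-by-segment using evaluation scores.

    advantages:  (T,) GRPO-style advantage for each step
    eval_scores: dict mapping segment id -> score in [0, 1], produced by the
                 explicit Evaluate action that follows each Search
    segment_ids: (T,) segment id of each step
    alpha:       how strongly scores modulate the update (assumed form)
    """
    advantages = np.asarray(advantages, dtype=float)
    scale = np.array([eval_scores[s] ** alpha for s in segment_ids])
    # High-score (reliable) segments keep their advantage; low-score
    # (uncertain) segments are updated conservatively.
    return advantages * scale

adv = rescale_advantages([1.0, 1.0, -0.5], {0: 1.0, 1: 0.25}, [0, 0, 1])
```

The uncertain segment's advantage is shrunk toward zero, so its policy update is damped while reliable segments are emphasized.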
cs.AI / 20 / 2603.09209
Abundant Intelligence and Deficient Demand: A Macro-Financial Stress Test of Rapid AI Adoption
丰富的智能与不足的需求:快速人工智能采用的宏观金融压力测试
Abstract
We formalize a macro-financial stress test for rapid AI adoption. Rather than a productivity bust or existential risk, we identify a distribution-and-contract mismatch: AI-generated abundance coexists with demand deficiency because economic institutions are anchored to human cognitive scarcity. Three mechanisms formalize this channel. First, a displacement spiral with competing reinstatement effects: each firm's rational decision to substitute AI for labor reduces aggregate labor income, which reduces aggregate demand, accelerating further AI adoption. We derive conditions on the AI capability growth rate, diffusion speed, and reinstatement rate under which the net feedback is self-limiting versus explosive. Second, Ghost GDP: when AI-generated output substitutes for labor-generated output, monetary velocity declines monotonically in the labor share absent compensating transfers, creating a wedge between measured output and consumption-relevant income. Third, intermediation collapse: AI agents that reduce information frictions compress intermediary margins toward pure logistics costs, triggering repricing across SaaS, payments, consulting, insurance, and financial advisory. Because top-quintile earners drive 47--65\% of U.S.\ consumption and face the highest AI exposure, the transmission into private credit (\$2.5 trillion globally) and mortgage markets (\$13 trillion) is disproportionate. We derive eleven testable predictions with explicit falsification conditions. Calibrated simulations disciplined by FRED time series and BLS occupation-level data quantify conditions under which stable adjustment transitions to explosive crisis.
Chinese Translation
我们为快速人工智能的采用形式化了一个宏观金融压力测试。我们识别出一种分配与契约的不匹配,而非生产力崩溃或生存风险:人工智能生成的丰富性与需求不足共存,因为经济制度依赖于人类认知的稀缺性。三个机制形式化了这一渠道。首先,竞争性恢复效应的置换螺旋:每个公司理性的选择用人工智能替代劳动,减少了整体劳动收入,从而降低了整体需求,加速了进一步的人工智能采用。我们推导出人工智能能力增长率、扩散速度和恢复率的条件,在这些条件下,净反馈是自我限制的还是爆炸性的。其次,幽灵GDP:当人工智能生成的产出替代劳动生成的产出时,缺乏补偿性转移的情况下,货币流通速度在劳动份额中单调下降,造成测量产出与消费相关收入之间的差距。第三,中介崩溃:减少信息摩擦的人工智能代理压缩中介利润至纯物流成本,引发SaaS、支付、咨询、保险和财务顾问等领域的重新定价。由于收入最高的前五分之一的收入者驱动了47%至65%的美国消费,并面临最高的人工智能风险,因此向私人信贷(全球2.5万亿美元)和抵押贷款市场(13万亿美元)的传导是失衡的。我们推导出十一条可检验的预测,并明确了反驳条件。通过FRED时间序列和BLS职业级别数据进行校准的模拟量化了稳定调整转向爆炸性危机的条件。
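The displacement spiral's self-limiting versus explosive regimes can be seen in a linearized toy iteration. All coefficient names and values below are illustrative, not the paper's calibration: the only structure taken from the abstract is that substitution feedback (adoption cuts labor income, which cuts demand, which spurs adoption) competes with a reinstatement rate.

```python
def displacement_spiral(steps, subst=0.3, reinstate=0.2, demand_sens=0.5):
    """Toy linearized feedback loop. Returns the trajectory of the
    displaced share of labor income. The net per-step feedback is
    substitution pressure (subst * demand_sens) minus the reinstatement
    rate; the loop is self-limiting when the feedback is negative and
    explosive when it is positive."""
    x = 0.01  # initial displaced share (illustrative)
    traj = [x]
    feedback = subst * demand_sens - reinstate
    for _ in range(steps):
        x = x * (1.0 + feedback)
        traj.append(x)
    return traj
```

With the defaults the reinstatement effect dominates and displacement decays; setting `reinstate=0.0` flips the sign of the feedback and the trajectory grows without bound, mirroring the self-limiting/explosive dichotomy the paper formalizes.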
cs.AI / 21 / 2603.09214
PrivPRISM: Automatically Detecting Discrepancies Between Google Play Data Safety Declarations and Developer Privacy Policies
PrivPRISM:自动检测Google Play数据安全声明与开发者隐私政策之间的差异
Abstract
End-users seldom read verbose privacy policies, leading app stores like Google Play to mandate simplified data safety declarations as a user-friendly alternative. However, these self-declared disclosures often contradict the full privacy policies, deceiving users about actual data practices and violating regulatory requirements for consistency. To address this, we introduce PrivPRISM, a robust framework that combines encoder and decoder language models to systematically extract fine-grained data practices from privacy policies and compare them against data safety declarations, enabling scalable detection of non-compliance. Evaluating 7,770 popular mobile games uncovers discrepancies in nearly 53% of cases, rising to 61% among 1,711 widely used generic apps. Additionally, static code analysis reveals possible under-disclosures, with privacy policies disclosing just 66.8% of potential accesses to sensitive data like location and financial information, versus only 36.4% in data safety declarations of mobile games. Our findings expose systemic issues, including widespread reuse of generic privacy policies, vague or contradictory statements, and hidden risks in high-profile apps with 100M+ downloads, underscoring the urgent need for automated enforcement to protect platform integrity and for end-users to be vigilant about sensitive data they disclose via popular apps.
Chinese Translation
最终用户很少阅读冗长的隐私政策,这导致像Google Play这样的应用商店要求提供简化的数据安全声明作为用户友好的替代方案。然而,这些自我声明的披露往往与完整的隐私政策相矛盾,误导用户对实际数据实践的理解,并违反了对一致性的监管要求。为了解决这个问题,我们提出了PrivPRISM,一个强大的框架,结合了编码器和解码器语言模型,系统地提取和比较隐私政策中的细粒度数据实践,并与数据安全声明进行比较,从而实现可扩展的不合规检测。对7,770款热门移动游戏的评估发现,近53%的案例存在差异,在1,711款广泛使用的通用应用中,这一比例上升至61%。此外,静态代码分析揭示了可能的低披露情况,隐私政策仅披露了对位置和财务信息等敏感数据的66.8%的潜在访问,而移动游戏的数据安全声明中仅披露了36.4%。我们的研究结果揭示了系统性问题,包括通用隐私政策的广泛重用、模糊/矛盾的陈述,以及在下载量超过1亿的高知名度应用中隐藏的风险,强调了自动化执法以保护平台完整性和最终用户对通过热门应用披露的敏感数据保持警惕的紧迫需求。
cs.AI / 22 / 2603.09231
Cognitively Layered Data Synthesis for Domain Adaptation of LLMs to Space Situational Awareness
用于大语言模型领域适应的认知分层数据合成在空间态势感知中的应用
Abstract
Large language models (LLMs) demonstrate exceptional performance on general-purpose tasks. However, transferring them to complex engineering domains such as space situational awareness (SSA) remains challenging owing to insufficient structural alignment with mission chains, the absence of higher-order cognitive supervision, and poor correspondence between data quality criteria and engineering specifications. The core bottleneck is the construction of high-quality supervised fine-tuning (SFT) datasets. To this end, we propose BD-FDG (Bloom's Taxonomy-based Domain-specific Fine-tuning Data Generation), a framework that addresses incomplete knowledge coverage, shallow cognitive depth, and limited quality controllability through three mechanisms: structured knowledge organization, cognitively layered question modeling, and automated quality control. The framework uses a knowledge tree to ensure structured corpus coverage, designs a question generation scheme spanning nine categories and six cognitive levels from Remember to Create to produce samples with a continuous difficulty gradient, and applies a multidimensional scoring pipeline to enforce domain rigor and consistency. Using BD-FDG, we construct SSA-SFT, a domain dataset of approximately 230K samples, and fine-tune Qwen3-8B to obtain SSA-LLM-8B. Experiments show that SSA-LLM-8B achieves relative BLEU-1 improvements of 144\% (no-think) and 176\% (think) on the domain test set and a win rate of 82.21\% over the baseline in arena comparisons, while largely preserving general benchmark performance (MMLU-Pro, MATH-500). These results validate SFT data construction driven by cognitive layering as an effective paradigm for complex engineering domains and provide a transferable framework for domain-specific LLM adaptation.
Chinese Translation
大型语言模型(LLMs)在通用任务上表现出色。然而,将它们转移到复杂的工程领域,如空间态势感知(SSA),仍然面临挑战,原因在于与任务链的结构对齐不足、缺乏高阶认知监督以及数据质量标准与工程规范之间的对应关系不佳。核心瓶颈在于高质量监督微调(SFT)数据集的构建。为此,我们提出了BD-FDG(基于布隆分类法的领域特定微调数据生成),这是一个通过三种机制解决知识覆盖不全、认知深度浅和质量可控性有限的问题的框架:结构化知识组织、认知分层问题建模和自动化质量控制。该框架使用知识树确保结构化语料覆盖,设计了一个跨越九个类别和六个认知层次(从记忆到创造)的题目生成方案,以产生具有连续难度梯度的样本,并应用多维评分管道以确保领域的严谨性和一致性。利用BD-FDG,我们构建了SSA-SFT,一个包含约230K样本的领域数据集,并对Qwen3-8B进行微调以获得SSA-LLM-8B。实验表明,SSA-LLM-8B在领域测试集上相较于基线在无思考(no-think)和思考(think)情况下分别实现了144\%和176\%的相对BLEU-1提升,并在竞技比较中以82.21\%的胜率超越基线,同时在一般基准性能(MMLU-Pro, MATH-500)上保持较好表现。这些结果验证了以认知分层驱动的SFT数据构建作为复杂工程领域的有效范式,并提供了一个可转移的领域特定LLM适应框架。
cs.AI / 23 / 2603.09249
Social-R1: Towards Human-like Social Reasoning in LLMs
Social-R1:朝着类人社交推理的方向发展
Abstract
While large language models demonstrate remarkable capabilities across numerous domains, social intelligence - the capacity to perceive social cues, infer mental states, and generate appropriate responses - remains a critical challenge, particularly for enabling effective human-AI collaboration and developing AI that truly serves human needs. Current models often rely on superficial patterns rather than genuine social reasoning. We argue that cultivating human-like social intelligence requires training with challenging cases that resist shortcut solutions. To this end, we introduce ToMBench-Hard, an adversarial benchmark designed to provide hard training examples for social reasoning. Building on this, we propose Social-R1, a reinforcement learning framework that aligns model reasoning with human cognition through multi-dimensional rewards. Unlike outcome-based RL, Social-R1 supervises the entire reasoning process, enforcing structural alignment, logical integrity, and information density. Results show that our approach enables a 4B parameter model to surpass much larger counterparts and generalize robustly across eight diverse benchmarks. These findings demonstrate that challenging training cases with trajectory-level alignment offer a path toward efficient and reliable social intelligence.
Chinese Translation
尽管大型语言模型在众多领域展现出显著的能力,但社交智能——感知社交线索、推断心理状态和生成适当反应的能力——仍然是一个关键挑战,尤其是在促进人机协作和开发真正满足人类需求的人工智能方面。目前的模型往往依赖于表面模式,而非真正的社交推理。我们认为,培养类人社交智能需要通过具有挑战性的案例进行训练,这些案例抵制捷径解决方案。为此,我们引入了ToMBench-Hard,这是一个对抗性基准,旨在提供社交推理的困难训练示例。在此基础上,我们提出了Social-R1,这是一个强化学习框架,通过多维奖励将模型推理与人类认知对齐。与基于结果的强化学习不同,Social-R1监督整个推理过程,强制执行结构对齐、逻辑完整性和信息密度。结果表明,我们的方法使得一个具有40亿参数的模型超越了更大规模的对手,并在八个不同的基准上表现出强大的泛化能力。这些发现表明,具有轨迹级对齐的挑战性训练案例为实现高效可靠的社交智能提供了一条路径。
cs.AI / 24 / 2603.09268
Logos: An evolvable reasoning engine for rational molecular design
Logos:一种可进化的理性分子设计推理引擎
Abstract
The discovery and design of functional molecules remain central challenges across chemistry, biology, and materials science. While recent advances in machine learning have accelerated molecular property prediction and candidate generation, existing models tend to excel either in physical fidelity without transparent reasoning, or in flexible reasoning without guarantees of chemical validity. This imbalance limits the reliability of artificial intelligence systems in real scientific design workflows. Here we present Logos, a compact molecular reasoning model that integrates multi-step logical reasoning with strict chemical consistency. Logos is trained using a staged strategy that first exposes the model to explicit reasoning examples linking molecular descriptions to structural decisions, and then progressively aligns these reasoning patterns with molecular representations. In a final training phase, chemical rules and invariants are incorporated directly into the optimization objective, guiding the model toward chemically valid outputs. Across multiple benchmark datasets, Logos achieves strong performance in both structural accuracy and chemical validity, matching or surpassing substantially larger general-purpose language models while operating with a fraction of their parameters. Beyond benchmark evaluation, the model exhibits stable behaviour in molecular optimization tasks involving multiple, potentially conflicting constraints. By explicitly exposing intermediate reasoning steps, Logos enables human inspection and assessment of the design logic underlying each generated structure. These results indicate that jointly optimizing for reasoning structure and physical consistency offers a practical pathway toward reliable and interpretable AI systems for molecular science, supporting closer integration of artificial intelligence into scientific discovery processes.
Chinese Translation
功能分子的发现和设计仍然是化学、生物学和材料科学中的核心挑战。尽管近期机器学习的进展加速了分子属性预测和候选分子的生成,但现有模型往往在物理真实性方面表现出色,却缺乏透明的推理过程,或者在灵活推理方面表现良好,但无法保证化学有效性。这种不平衡限制了人工智能系统在真实科学设计工作流程中的可靠性。在此,我们提出了Logos,一个紧凑的分子推理模型,它将多步骤逻辑推理与严格的化学一致性相结合。Logos采用分阶段的训练策略,首先让模型接触到将分子描述与结构决策联系起来的明确推理示例,然后逐步将这些推理模式与分子表征对齐。在最终的训练阶段,化学规则和不变性被直接纳入优化目标,引导模型朝向化学有效的输出。在多个基准数据集上,Logos在结构准确性和化学有效性方面均表现出色,其性能与或超越了显著更大的一般性语言模型,同时参数量却仅为其一小部分。除了基准评估外,该模型在涉及多个潜在冲突约束的分子优化任务中表现出稳定的行为。通过明确展示中间推理步骤,Logos使人类能够检查和评估每个生成结构背后的设计逻辑。这些结果表明,联合优化推理结构和物理一致性为可靠且可解释的分子科学人工智能系统提供了一条切实可行的途径,支持人工智能与科学发现过程的更紧密整合。
cs.AI / 25 / 2603.09309
Rescaling Confidence: What Scale Design Reveals About LLM Metacognition
重新调整信心:尺度设计揭示的LLM元认知
Abstract
Verbalized confidence, in which LLMs report a numerical certainty score, is widely used to estimate uncertainty in black-box settings, yet the confidence scale itself (typically 0--100) is rarely examined. We show that this design choice is not neutral. Across six LLMs and three datasets, verbalized confidence is heavily discretized, with more than 78% of responses concentrating on just three round-number values. To investigate this phenomenon, we systematically manipulate confidence scales along three dimensions: granularity, boundary placement, and range regularity, and evaluate metacognitive sensitivity using meta-d'. We find that a 0--20 scale consistently improves metacognitive efficiency over the standard 0--100 format, while boundary compression degrades performance and round-number preferences persist even under irregular ranges. These results demonstrate that confidence scale design directly affects the quality of verbalized uncertainty and should be treated as a first-class experimental variable in LLM evaluation.
Chinese Translation
口头表达的信心,即LLM报告的数值确定性评分,广泛用于在黑箱环境中估计不确定性,但信心尺度本身(通常为0--100)却很少被审视。我们展示了这一设计选择并非中立。在六个LLM和三个数据集的实验中,口头表达的信心高度离散,超过78%的响应集中在仅三个整数值上。为了研究这一现象,我们系统地沿着三个维度操控信心尺度:粒度、边界位置和范围规律性,并使用meta-d'评估元认知敏感性。我们发现,0--20的尺度在元认知效率上始终优于标准的0--100格式,而边界压缩则降低了性能,即使在不规则范围内,整数偏好仍然存在。这些结果表明,信心尺度设计直接影响口头表达的不确定性质量,应被视为LLM评估中的一类重要实验变量。
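The discretization finding above rests on a simple concentration statistic: the share of responses landing on the few most frequent values. A minimal sketch (with made-up sample data, not the paper's):

```python
from collections import Counter

def top_k_mass(confidences, k=3):
    """Fraction of responses concentrated on the k most frequent values.
    This is the kind of statistic behind the '>78% on three round-number
    values' observation; the sample below is illustrative."""
    counts = Counter(confidences)
    top = counts.most_common(k)
    return sum(c for _, c in top) / len(confidences)

sample = [90, 90, 80, 95, 90, 80, 95, 70, 90, 80]
mass = top_k_mass(sample)  # mass on the three most common values {90, 80, 95}
```

Heavy mass on a handful of round numbers means the nominal 0--100 scale carries far fewer effective confidence levels, which is why a coarser 0--20 scale can improve metacognitive efficiency.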
cs.AI / 26 / 2603.09313
Curveball Steering: The Right Direction To Steer Isn't Always Linear
曲线引导:引导的正确方向并不总是线性的
Abstract
Activation steering is a widely used approach for controlling large language model (LLM) behavior by intervening on internal representations. Existing methods largely rely on the Linear Representation Hypothesis, assuming behavioral attributes can be manipulated using global linear directions. In practice, however, such linear interventions often behave inconsistently. We question this assumption by analyzing the intrinsic geometry of LLM activation spaces. Measuring geometric distortion via the ratio of geodesic to Euclidean distances, we observe substantial and concept-dependent distortions, indicating that activation spaces are not well-approximated by a globally linear geometry. Motivated by this, we propose "Curveball steering", a nonlinear steering method based on polynomial kernel PCA that performs interventions in a feature space, better respecting the learned activation geometry. Curveball steering consistently outperforms linear PCA-based steering, particularly in regimes exhibiting strong geometric distortion, suggesting that geometry-aware, nonlinear steering provides a principled alternative to global, linear interventions.
Chinese Translation
激活引导是一种广泛使用的方法,通过干预内部表征来控制大型语言模型(LLM)的行为。现有的方法在很大程度上依赖于线性表征假设,假设行为属性可以通过全局线性方向进行操控。然而,在实践中,这种线性干预往往表现得不一致。我们通过分析LLM激活空间的内在几何来质疑这一假设。通过测量测地距离与欧几里得距离的比率来评估几何失真,我们观察到显著且与概念相关的失真,表明激活空间并不能很好地用全局线性几何来近似。基于此,我们提出了“曲线引导”(Curveball steering),这是一种基于多项式核主成分分析(polynomial kernel PCA)的非线性引导方法,在特征空间中进行干预,更好地尊重学习到的激活几何。曲线引导在表现出强几何失真的情况下,始终优于基于线性主成分分析的引导,表明关注几何的非线性引导为全局线性干预提供了一种有原则的替代方案。
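The geodesic-to-Euclidean distortion diagnostic can be sketched on a point cloud using a k-nearest-neighbor graph; the graph construction and `k` below are assumptions of this sketch, not the paper's exact procedure.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def distortion_ratio(points, k=2):
    """Mean ratio of graph-geodesic to Euclidean distance over all point
    pairs. A ratio near 1 indicates an approximately flat region where
    linear steering directions are reasonable; larger values indicate
    curvature that global linear interventions ignore."""
    points = np.asarray(points, dtype=float)
    d = squareform(pdist(points))        # pairwise Euclidean distances
    n = len(points)
    graph = np.full((n, n), np.inf)      # inf = no edge (dense convention)
    for i in range(n):                   # connect each point to its k nearest
        nn = np.argsort(d[i])[1:k + 1]
        graph[i, nn] = d[i, nn]
        graph[nn, i] = d[nn, i]
    geo = shortest_path(graph, method="D", directed=False)
    off_diag = ~np.eye(n, dtype=bool)
    return float(np.mean(geo[off_diag] / d[off_diag]))
```

Points along a straight line give a ratio of exactly 1, while points along an arc give a ratio above 1, since geodesics must follow the curve.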
cs.AI / 27 / 2603.09344
Robust Regularized Policy Iteration under Transition Uncertainty
在转移不确定性下的鲁棒正则化策略迭代
Abstract
Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a $\gamma$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods such as PMDB on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust behavior. The learned $Q$-values decrease in regions with higher epistemic uncertainty, suggesting that the resulting policy avoids unreliable out-of-distribution actions under transition uncertainty.
Chinese Translation
离线强化学习(RL)能够在没有在线探索的情况下实现数据高效和安全的策略学习,但其性能在分布变化时往往会下降。学习到的策略可能会访问分布外的状态-动作对,在这些情况下,价值估计和学习到的动态是不可靠的。为了在统一框架中解决由策略引起的外推和转移不确定性,我们将离线RL形式化为鲁棒策略优化,将转移核视为不确定性集中的决策变量,并针对最坏情况动态优化策略。我们提出了鲁棒正则化策略迭代(RRPI),它用一个可处理的KL正则化替代目标取代了难以处理的最大-最小双层目标,并基于鲁棒正则化贝尔曼算子推导出高效的策略迭代过程。我们通过证明所提出的算子是$\gamma$-收缩的,并且迭代更新该替代目标会单调改善原始鲁棒目标并收敛,从而提供理论保证。D4RL基准上的实验表明,RRPI实现了强大的平均性能,在大多数环境中超越了包括基于分位数的方法(如PMDB)在内的最新基线,同时在其他环境中保持竞争力。此外,RRPI表现出鲁棒性。在具有更高认知不确定性的区域,学习到的$Q$值下降,这表明所得到的策略在转移不确定性下避免了不可靠的分布外动作。
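A KL-regularized robust Bellman backup can be sketched on a tabular MDP. This is a sketch of the general idea, not RRPI itself: it uses the standard log-sum-exp closed form for a KL-penalized adversary, min over q of E_q[V] + beta*KL(q||P) = -beta * log E_P[exp(-V/beta)]; the shapes and `beta` are assumptions of this example.

```python
import numpy as np

def robust_regularized_backup(P, R, V, gamma=0.9, beta=1.0):
    """One application of a KL-regularized robust Bellman operator.

    For each state-action pair, an adversary picks worst-case next-state
    probabilities q, penalized by beta * KL(q || P); the inner problem
    has the closed form -beta * log E_P[exp(-V / beta)].

    P: (S, A, S) nominal transition kernel, R: (S, A) rewards,
    V: (S,) current value estimate.  Returns backed-up Q, shape (S, A).
    """
    worst_ev = -beta * np.log(np.einsum("sat,t->sa", P, np.exp(-V / beta)))
    return R + gamma * worst_ev
```

The robust backup is never above the nominal one, and the gap vanishes for deterministic rows, where the adversary has no probability mass to shift. This pessimism in uncertain regions is the mechanism behind the lower Q-values the abstract reports.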
cs.AI / 28 / 2603.09435
AI Act Evaluation Benchmark: An Open, Transparent, and Reproducible Evaluation Dataset for NLP and RAG Systems
人工智能法案评估基准:一个开放、透明且可重复的自然语言处理和检索增强生成系统评估数据集
Abstract
The rapid rollout of AI in heterogeneous public and societal sectors has subsequently escalated the need for compliance with regulatory standards and frameworks. The EU AI Act has emerged as a landmark in the regulatory landscape. The development of solutions that assess AI systems' level of compliance with such standards is often limited by the lack of resources, hindering the semi-automated or automated evaluation of their performance. This generates the need for manual work, which is often error-prone, resource-limited or limited to cases not clearly described by the regulation. This paper presents an open, transparent, and reproducible method of creating a resource that facilitates the evaluation of NLP models with a strong focus on RAG systems. We have developed a dataset that contains the tasks of risk-level classification, article retrieval, obligation generation, and question-answering for the EU AI Act. The dataset files are in a format suitable for machine-to-machine use. To generate the files, we utilise domain knowledge as an exegetical basis, combining with the processing and reasoning power of large language models to generate scenarios along with the respective tasks. Our methodology demonstrates a way to harness language models for grounded generation with high document relevancy. Besides, we overcome limitations such as navigating the decision boundaries of risk-levels that are not explicitly defined within the EU AI Act, such as limited and minimal cases. Finally, we demonstrate our dataset's effectiveness by evaluating a RAG-based solution that reaches F1-scores of 0.87 and 0.85 for prohibited and high-risk scenarios, respectively.
Chinese Translation
人工智能在异构公共和社会领域的快速推广,随之增加了对合规性标准和框架的需求。欧盟人工智能法案(EU AI Act)已成为监管领域的一个里程碑。开发能够评估人工智能系统合规性水平的解决方案,往往受到资源匮乏的限制,妨碍了其性能的半自动或自动评估。这导致了需要进行手动工作,而手动工作通常容易出错、资源有限或仅限于法规未明确描述的案例。本文提出了一种开放、透明且可重复的方法来创建一个资源,以促进对自然语言处理模型的评估,特别关注检索增强生成(RAG)系统。我们开发了一个数据集,包含欧盟人工智能法案的风险级别分类、文章检索、义务生成和问答等任务。数据集文件采用机器对机器的适用格式。为了生成这些文件,我们利用领域知识作为解释基础,结合大型语言模型的处理和推理能力,生成场景及相应任务。我们的方法展示了如何利用语言模型进行基于文档相关性的生成。此外,我们克服了诸如在欧盟人工智能法案中未明确定义的风险级别决策边界等限制,例如有限和最小案例。最后,我们通过评估一个基于RAG的解决方案,展示了我们数据集的有效性,该解决方案在禁止和高风险场景中分别达到了0.87和0.85的F1分数。
cs.AI / 29 / 2603.09463
An Empirical Study and Theoretical Explanation on Task-Level Model-Merging Collapse
任务级模型合并崩溃的实证研究与理论解释
Abstract
Model merging unifies independently fine-tuned LLMs from the same base, enabling reuse and integration of parallel development efforts without retraining. However, in practice we observe that merging does not always succeed: certain combinations of task-specialist models suffer from catastrophic performance degradation after merging. We refer to this failure mode as merging collapse. Intuitively, collapse arises when the learned representations or parameter adjustments for different tasks are fundamentally incompatible, so that merging forces destructive interference rather than synergy. In this paper, we identify and characterize the phenomenon of task-level merging collapse, where certain task combinations consistently trigger huge performance degradation across all merging methods. Through extensive experiments and statistical analysis, we demonstrate that representational incompatibility between tasks is strongly correlated with merging collapse, while parameter-space conflict metrics show minimal correlation, challenging conventional wisdom in model merging literature. We provide a theoretical explanation on this phenomenon through rate-distortion theory with a dimension-dependent bound, establishing fundamental limits on task mergeability regardless of methodology.
Chinese Translation
模型合并将来自同一基础的独立微调的LLM(大语言模型)统一起来,使得在不重新训练的情况下能够重用和整合平行开发的努力。然而,在实践中我们观察到,合并并不总是成功:某些任务专用模型的组合在合并后会遭遇灾难性的性能下降。我们将这种失败模式称为合并崩溃。直观上,当不同任务的学习表示或参数调整在根本上不兼容时,崩溃就会发生,从而导致合并产生破坏性干扰而非协同效应。在本文中,我们识别并描述了任务级合并崩溃的现象,其中某些任务组合在所有合并方法中一致地触发巨大的性能下降。通过广泛的实验和统计分析,我们证明了任务之间的表示不兼容性与合并崩溃之间存在强相关性,而参数空间冲突指标则显示出最小相关性,这挑战了模型合并文献中的传统观点。我们通过带有维度依赖界限的率失真理论对这一现象提供了理论解释,建立了无论方法如何,任务合并能力的基本限制。
cs.AI / 30 / 2603.09476
Telogenesis: Goal Is All U Need
目标生成:目标是你所需的一切
Abstract
Goal-conditioned systems assume goals are provided externally. We ask whether attentional priorities can emerge endogenously from an agent's internal cognitive state. We propose a priority function that generates observation targets from three epistemic gaps: ignorance (posterior variance), surprise (prediction error), and staleness (temporal decay of confidence in unobserved variables). We validate this in two systems: a minimal attention-allocation environment (2,000 runs) and a modular, partially observable world (500 runs). Ablation shows each component is necessary. A key finding is metric-dependent reversal: under global prediction error, coverage-based rotation wins; under change detection latency, priority-guided allocation wins, with advantage growing monotonically with dimensionality (d = -0.95 at N=48, p < 10^-6). Detection latency follows a power law in attention budget, with a steeper exponent for priority-guided allocation (0.55 vs. 0.40). When the decay rate is made learnable per variable, the system spontaneously recovers environmental volatility structure without supervision (t = 22.5, p < 10^-6). We demonstrate that epistemic gaps alone, without external reward, suffice to generate adaptive priorities that outperform fixed strategies and recover latent environmental structure.
Chinese Translation
目标条件系统假设目标是外部提供的。我们探讨注意力优先级是否可以从代理的内部认知状态中自发产生。我们提出了一种优先级函数,该函数根据三种认知差距生成观察目标:无知(后验方差)、惊讶(预测误差)和陈旧性(对未观察变量的信心的时间衰减)。我们在两个系统中验证了这一点:一个最小的注意力分配环境(2,000次运行)和一个模块化的部分可观察世界(500次运行)。消融实验表明每个组件都是必要的。一个关键发现是度量依赖的反转:在全局预测误差下,基于覆盖的旋转获胜;在变化检测延迟下,优先级引导的分配获胜,且其优势随着维度的增加而单调增长(d = -0.95,N=48,p < 10^-6)。检测延迟在注意力预算中遵循幂律,优先级引导分配的指数更陡(0.55对比0.40)。当衰减率对每个变量可学习时,该系统在没有监督的情况下自发恢复环境波动结构(t = 22.5,p < 10^-6)。我们证明,仅凭认知差距而不依赖外部奖励,足以生成优于固定策略的自适应优先级,并恢复潜在的环境结构。
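The three-gap priority function is simple to sketch. The additive weighting and the exponential staleness decay below are assumptions of this illustration; only the three components (posterior variance, prediction error, temporal decay of confidence) come from the abstract.

```python
import math

def priority(var, now, w_ign=1.0, w_sur=1.0, w_stale=1.0, decay=0.1):
    """Priority of observing one variable, from three epistemic gaps:
      ignorance  - posterior variance,
      surprise   - magnitude of the last prediction error,
      staleness  - confidence decay since the last observation.
    var: dict with keys 'variance', 'pred_error', 'last_seen'."""
    staleness = 1.0 - math.exp(-decay * (now - var["last_seen"]))
    return (w_ign * var["variance"]
            + w_sur * abs(var["pred_error"])
            + w_stale * staleness)

def choose_target(variables, now):
    """Endogenous goal generation: observe the variable with the
    largest combined epistemic gap."""
    return max(variables, key=lambda name: priority(variables[name], now))
```

No external reward enters anywhere: the observation target is generated purely from the agent's internal uncertainty state, which is the paper's central claim.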
cs.AI / 31 / 2603.09481
GenePlan: Evolving Better Generalized PDDL Plans using Large Language Models
GenePlan:利用大型语言模型演化更好的通用PDDL规划
Abstract
We present GenePlan (GENeralized Evolutionary Planner), a novel framework that leverages large language model (LLM) assisted evolutionary algorithms to generate domain-dependent generalized planners for classical planning tasks described in PDDL. By casting generalized planning as an optimization problem, GenePlan iteratively evolves interpretable Python planners that minimize plan length across diverse problem instances. In empirical evaluation across six existing benchmark domains and two new domains, GenePlan achieved an average SAT score of 0.91, closely matching the performance of the state-of-the-art planners (SAT score 0.93), and significantly outperforming other LLM-based baselines such as chain-of-thought (CoT) prompting (average SAT score 0.64). The generated planners solve new instances rapidly (average 0.49 seconds per task) and at low cost (average $1.82 per domain using GPT-4o).
Chinese Translation
我们提出了GenePlan(GENeralized Evolutionary Planner),一个新颖的框架,利用大型语言模型(LLM)辅助的进化算法生成针对PDDL中描述的经典规划任务的领域依赖性通用规划器。通过将通用规划视为一个优化问题,GenePlan迭代地演化出可解释的Python规划器,旨在最小化不同问题实例中的计划长度。在对六个现有基准领域和两个新领域的实证评估中,GenePlan的平均SAT得分为0.91,接近最先进规划器的性能(SAT得分为0.93),并显著优于其他基于LLM的基线方法,如思维链(CoT)提示(平均SAT得分为0.64)。生成的规划器能够快速解决新实例(每个任务平均0.49秒),且成本低廉(使用GPT-4o,每个领域平均$1.82)。
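The outer evolutionary loop GenePlan uses can be sketched generically. In the paper, the mutation operator is an LLM rewriting Python planner code and the score is average plan length over PDDL instances; here both are pluggable callables, and the numeric toy in the test is purely illustrative.

```python
import random

def evolve_planner(seed_candidates, score, mutate, generations=20, pop=6, rng=None):
    """Generic elitist evolutionary loop: keep the best-scoring
    candidates and generate variants via a mutation operator (in
    GenePlan, an LLM-assisted code rewrite). `score` returns average
    plan length across training instances; lower is better."""
    rng = rng or random.Random(0)
    population = list(seed_candidates)
    for _ in range(generations):
        population.sort(key=score)
        parents = population[:max(2, pop // 2)]   # elitism: keep top half
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(pop - len(parents))]
        population = parents + children
    return min(population, key=score)
```

Because parents are always retained, the best score is non-increasing across generations, which matches the framing of generalized planning as iterative optimization of plan length.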
cs.AI / 32 / 2603.09486
Vibe-Creation: The Epistemology of Human-AI Emergent Cognition
氛围创造:人类与人工智能涌现认知的认识论
Abstract
The encounter between human reasoning and generative artificial intelligence (GenAI) cannot be adequately described by inherited metaphors of tool use, augmentation, or collaborative partnership. This article argues that such interactions produce a qualitatively distinct cognitive-epistemic formation, designated here as the Third Entity: an emergent, transient structure that arises from the transductive coupling of two ontologically incommensurable modes of cognition. Drawing on Peirce's semiotics, Polanyi's theory of tacit knowledge, Simondon's philosophy of individuation, Ihde's postphenomenology, and Morin's complexity theory, we develop a multi-layered theoretical account of this formation. We introduce the concept of vibe-creation to designate the pre-reflective cognitive mode through which the Third Entity navigates high-dimensional semantic space and argue that this mode constitutes the automation of tacit knowledge - a development with far-reaching consequences for epistemology, the philosophy of mind, and educational theory. We further propose the notion of asymmetric emergence to characterize the agency of the Third Entity: genuinely novel and irreducible, yet anchored in human intentional responsibility. The article concludes by examining the implications of this theoretical framework for the transformation of educational institutions and the redefinition of intellectual competence in the age of GenAI.
Chinese Translation
人类推理与生成性人工智能(GenAI)之间的相遇无法用继承的工具使用、增强或协作伙伴关系的隐喻来充分描述。本文论证,这种互动产生了一种质的不同的认知-认识形成,称之为第三实体:一种涌现的、瞬态的结构,源于两种本体上不可通约的认知模式的传导耦合。本文借鉴了皮尔士的符号学、波兰尼的默会知识理论、希蒙东的个体化哲学、伊德的后现象学以及莫兰的复杂性理论,发展了这一形成的多层次理论阐述。我们引入氛围创造(vibe-creation)这一概念,以指代第三实体在高维语义空间中导航的前反思认知模式,并认为这一模式构成了默会知识的自动化——这一发展对认识论、心灵哲学和教育理论具有深远的影响。我们进一步提出不对称涌现的概念,以表征第三实体的能动性:真正新颖且不可简约,但又扎根于人类的意图责任。文章最后探讨了这一理论框架对教育机构转型及在生成性人工智能时代重新定义智力能力的影响。
cs.AI / 33 / 2603.09533
Enhancing Debunking Effectiveness through LLM-based Personality Adaptation
通过基于大型语言模型的个性适应增强反驳效果
Abstract
This study proposes a novel methodology for generating personalized fake news debunking messages by prompting Large Language Models (LLMs) with persona-based inputs aligned to the Big Five personality traits: Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness. Our approach guides LLMs to transform generic debunking content into personalized versions tailored to specific personality profiles. To assess the effectiveness of these transformations, we employ a separate LLM as an automated evaluator simulating corresponding personality traits, thereby eliminating the need for costly human evaluation panels. Our results show that personalized messages are generally seen as more persuasive than generic ones. We also find that traits like Openness tend to increase persuadability, while Neuroticism can lower it. Differences between LLM evaluators suggest that using multiple models provides a clearer picture. Overall, this work demonstrates a practical way to create more targeted debunking messages exploiting LLMs, while also raising important ethical questions about how such technology might be used.
Chinese Translation
本研究提出了一种新颖的方法,通过基于个性的输入提示大型语言模型(LLMs),生成个性化的假新闻反驳信息,这些输入与五大人格特质(外向性、宜人性、尽责性、神经质和开放性)相一致。我们的方法指导LLMs将通用的反驳内容转化为针对特定个性特征量身定制的个性化版本。为了评估这些转化的有效性,我们采用一个独立的LLM作为自动评估者,模拟相应的人格特质,从而消除了对昂贵的人类评估小组的需求。我们的结果表明,个性化的信息通常被认为比通用信息更具说服力。我们还发现,像开放性这样的特质往往会增加说服力,而神经质则可能降低说服力。不同的LLM评估者之间的差异表明,使用多个模型可以提供更清晰的图景。总体而言,这项工作展示了一种利用LLMs创建更具针对性的反驳信息的实用方法,同时也引发了关于此类技术可能使用的伦理问题。
cs.AI / 34 / 2603.09619
Context Engineering: From Prompts to Corporate Multi-Agent Architecture
上下文工程:从提示到企业多智能体架构
Abstract
As artificial intelligence (AI) systems evolve from stateless chatbots to autonomous multi-step agents, prompt engineering (PE), the discipline of crafting individual queries, proves necessary but insufficient. This paper introduces context engineering (CE) as a standalone discipline concerned with designing, structuring, and managing the entire informational environment in which an AI agent makes decisions. Drawing on vendor architectures (Google ADK, Anthropic, LangChain), current academic work (ACE framework, Google DeepMind's intelligent delegation), enterprise research (Deloitte, 2026; KPMG, 2026), and the author's experience building a multi-agent system, the paper proposes five context quality criteria: relevance, sufficiency, isolation, economy, and provenance, and frames context as the agent's operating system. Two higher-order disciplines follow. Intent engineering (IE) encodes organizational goals, values, and trade-off hierarchies into agent infrastructure. Specification engineering (SE) creates a machine-readable corpus of corporate policies and standards enabling autonomous operation of multi-agent systems at scale. Together these four disciplines form a cumulative pyramid maturity model of agent engineering, in which each level subsumes the previous one as a necessary foundation. Enterprise data reveals a gap: while 75% of enterprises plan agentic AI deployment within two years (Deloitte, 2026), deployment has surged and retreated as organizations confront scaling complexity (KPMG, 2026). The Klarna case illustrates a dual deficit, contextual and intentional. Whoever controls the agent's context controls its behavior; whoever controls its intent controls its strategy; whoever controls its specifications controls its scale.
Chinese Translation
随着人工智能(AI)系统从无状态的聊天机器人演变为自主的多步骤智能体,提示工程(Prompt Engineering, PE)这一设计单个查询的学科显得必要但不足。本文提出上下文工程(Context Engineering, CE)作为一门独立的学科,关注于设计、构建和管理AI智能体做出决策的整个信息环境。基于供应商架构(Google ADK、Anthropic、LangChain)、当前学术研究(ACE框架、Google DeepMind的智能委托)、企业研究(德勤,2026;毕马威,2026)以及作者在构建多智能体系统方面的经验,本文提出了五个上下文质量标准:相关性、充分性、隔离性、经济性和来源性,并将上下文框架视为智能体的操作系统。接下来是两个更高阶的学科。意图工程(Intent Engineering, IE)将组织目标、价值观和权衡层次编码到智能体基础设施中。规范工程(Specification Engineering, SE)创建一个机器可读的企业政策和标准语料库,以支持多智能体系统的自主大规模运行。这四个学科共同形成了一个累积的智能体工程成熟度模型,其中每个层级都包含了作为必要基础的前一个层级。企业数据揭示了一个差距:尽管75%的企业计划在两年内部署智能体AI(德勤,2026),但在组织面临扩展复杂性时,部署却出现了激增和回落(毕马威,2026)。Klarna案例说明了上下文和意图的双重缺失。谁控制智能体的上下文,谁就控制其行为;谁控制其意图,谁就控制其战略;谁控制其规范,谁就控制其规模。
cs.AI / 35 / 2603.09641
PRECEPT: Planning Resilience via Experience, Context Engineering & Probing Trajectories A Unified Framework for Test-Time Adaptation with Compositional Rule Learning and Pareto-Guided Prompt Evolution
PRECEPT:通过经验、上下文工程与探测轨迹规划韧性——一个统一的测试时适应框架,结合组合规则学习与帕累托引导的提示演化
Abstract
LLM agents that store knowledge as natural language suffer steep retrieval degradation as condition count grows, often struggle to compose learned rules reliably, and typically lack explicit mechanisms to detect stale or adversarial knowledge. We introduce PRECEPT, a unified framework for test-time adaptation with three tightly coupled components: (1) deterministic exact-match rule retrieval over structured condition keys, (2) conflict-aware memory with Bayesian source reliability and threshold-based rule invalidation, and (3) COMPASS, a Pareto-guided prompt-evolution outer loop. Exact retrieval eliminates partial-match interpretation errors on the deterministic path (0% by construction, vs 94.4% under Theorem B.6's independence model at N=10) and supports compositional stacking through a semantic tier hierarchy; conflict-aware memory resolves static–dynamic disagreements and supports drift adaptation; COMPASS evaluates prompts through the same end-to-end execution pipeline. Results (9–10 seeds): PRECEPT achieves a +41.1pp first-try advantage over Full Reflexion (d>1.9), +33.3pp compositional generalization (d=1.55), 100% $P_1$ on 2-way logistics compositions (d=2.64), +40–55pp continuous learning gains, strong eventual robustness under adversarial static knowledge (100% logistics with adversarial SK active; partial recovery on integration), +55.0pp drift recovery (d=0.95, p=0.031), and 61% fewer steps. Core comparisons are statistically significant, often at p<0.001.
Chinese Translation
将知识存储为自然语言的大型语言模型(LLM)代理在条件数量增加时面临严重的检索衰退,通常难以可靠地组合学习到的规则,并且通常缺乏检测过时或对抗性知识的明确机制。我们提出了PRECEPT,一个统一的测试时适应框架,包含三个紧密耦合的组件:(1)基于结构条件键的确定性精确匹配规则检索,(2)具有贝叶斯源可靠性和基于阈值的规则失效的冲突感知记忆,以及(3)COMPASS,一个帕累托引导的提示演化外循环。精确检索消除了确定性路径上的部分匹配解释错误(通过构造实现0%,而在定理B.6的独立模型下为94.4%),并通过语义层次结构支持组合堆叠;冲突感知记忆解决了静态与动态之间的分歧,并支持漂移适应;COMPASS通过相同的端到端执行管道评估提示。结果(9-10个种子):PRECEPT在首次尝试中比Full Reflexion基线提高了41.1个百分点(d>1.9),组合泛化提高了33.3个百分点(d=1.55),在双向物流组合中实现了100%的$P_1$(d=2.64),连续学习收益提高了40-55个百分点,在对抗性静态知识下表现出强大的最终鲁棒性(在对抗性静态知识活跃时物流任务达到100%;在整合任务上部分恢复),漂移恢复提高了55.0个百分点(d=0.95, p=0.031),并减少了61%的步骤。核心比较在统计上显著,通常在p<0.001的水平上。
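PRECEPT's first component, deterministic exact-match rule retrieval over structured condition keys, can be illustrated with a minimal sketch. All names below are illustrative assumptions, not the paper's implementation: rules are stored under frozensets of condition pairs, and a lookup fires only on an exact key match, so partial-match misinterpretation is impossible by construction on this path.

```python
# Minimal sketch of exact-match rule retrieval over structured condition keys.
# Keys are frozensets of (attribute, value) pairs; a rule fires only when the
# query's conditions match a stored key exactly, so partial overlaps never
# retrieve the wrong rule on the deterministic path.

class RuleStore:
    def __init__(self):
        self._rules = {}  # frozenset of condition pairs -> rule text

    def add(self, conditions, rule):
        self._rules[frozenset(conditions.items())] = rule

    def retrieve(self, conditions):
        # Deterministic: returns the rule iff the key matches exactly,
        # otherwise None (no fuzzy or partial matching here).
        return self._rules.get(frozenset(conditions.items()))

store = RuleStore()
store.add({"domain": "logistics", "vehicle": "truck"}, "load before routing")
exact = store.retrieve({"domain": "logistics", "vehicle": "truck"})
partial = store.retrieve({"domain": "logistics"})  # subset of a key -> miss
```

A subset of a stored key returns nothing rather than a近似 match, which is the property behind the abstract's "0% by construction" claim for partial-match interpretation errors.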
cs.AI / 36 / 2603.09652
MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants
MiniAppBench:评估LLM驱动助手从文本到互动HTML响应的转变
Abstract
With the rapid advancement of Large Language Models (LLMs) in code generation, human-AI interaction is evolving from static text responses to dynamic, interactive HTML-based applications, which we term MiniApps. These applications require models to not only render visual interfaces but also construct customized interaction logic that adheres to real-world principles. However, existing benchmarks primarily focus on algorithmic correctness or static layout reconstruction, failing to capture the capabilities required for this new paradigm. To address this gap, we introduce MiniAppBench, the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. Sourced from a real-world application with 10M+ generations, MiniAppBench distills 500 tasks across six domains (e.g., Games, Science, and Tools). Furthermore, to tackle the challenge of evaluating open-ended interactions where no single ground truth exists, we propose MiniAppEval, an agentic evaluation framework. Leveraging browser automation, it performs human-like exploratory testing to systematically assess applications across three dimensions: Intention, Static, and Dynamic. Our experiments reveal that current LLMs still face significant challenges in generating high-quality MiniApps, while MiniAppEval demonstrates high alignment with human judgment, establishing a reliable standard for future research. Our code is available at github.com/MiniAppBench.
Chinese Translation
随着大型语言模型(LLMs)在代码生成方面的快速发展,人机交互正从静态文本响应演变为动态的、基于HTML的互动应用程序,我们称之为MiniApps。这些应用程序要求模型不仅能够渲染视觉界面,还需构建符合现实世界原则的定制交互逻辑。然而,现有基准主要关注算法的正确性或静态布局重建,未能捕捉到这一新范式所需的能力。为了解决这一问题,我们引入了MiniAppBench,这是第一个旨在评估原则驱动的互动应用生成的综合基准。MiniAppBench源自一个具有超过1000万生成实例的真实应用,提炼出跨六个领域(如游戏、科学和工具)的500个任务。此外,为了应对评估开放式交互的挑战,在没有单一标准答案的情况下,我们提出了MiniAppEval,一个代理评估框架。该框架利用浏览器自动化技术,进行类人探索性测试,从三个维度(意图、静态和动态)系统性地评估应用程序。我们的实验表明,当前的LLMs在生成高质量MiniApps方面仍面临重大挑战,而MiniAppEval与人类判断高度一致,为未来研究建立了可靠的标准。我们的代码可在github.com/MiniAppBench获取。
cs.AI / 37 / 2603.09677
Logics-Parsing-Omni Technical Report
逻辑解析全能技术报告
Abstract
Addressing the challenges of fragmented task definitions and the heterogeneity of unstructured data in multimodal parsing, this paper proposes the Omni Parsing framework. This framework establishes a Unified Taxonomy covering documents, images, and audio-visual streams, introducing a progressive parsing paradigm that bridges perception and cognition. Specifically, the framework integrates three hierarchical levels: 1) Holistic Detection, which achieves precise spatial-temporal grounding of objects or events to establish a geometric baseline for perception; 2) Fine-grained Recognition, which performs symbolization (e.g., OCR/ASR) and attribute extraction on localized objects to complete structured entity parsing; and 3) Multi-level Interpreting, which constructs a reasoning chain from local semantics to global logic. A pivotal advantage of this framework is its evidence anchoring mechanism, which enforces a strict alignment between high-level semantic descriptions and low-level facts. This enables ``evidence-based'' logical induction, transforming unstructured signals into standardized knowledge that is locatable, enumerable, and traceable. Building on this foundation, we constructed a standardized dataset and released the Logics-Parsing-Omni model, which successfully converts complex audio-visual signals into machine-readable structured knowledge. Experiments demonstrate that fine-grained perception and high-level cognition are synergistic, effectively enhancing model reliability. Furthermore, to quantitatively evaluate these capabilities, we introduce OmniParsingBench. Code, models and the benchmark are released at https://github.com/alibaba/Logics-Parsing/tree/master/Logics-Parsing-Omni.
Chinese Translation
针对多模态解析中任务定义碎片化和非结构化数据异构性的问题,本文提出了全能解析框架(Omni Parsing)。该框架建立了一个涵盖文档、图像和视听流的统一分类法,介绍了一种连接感知与认知的渐进式解析范式。具体而言,该框架整合了三个层次:1)整体检测(Holistic Detection),实现对象或事件的精确时空定位,为感知建立几何基准;2)细粒度识别(Fine-grained Recognition),对局部对象进行符号化(例如,光学字符识别OCR/自动语音识别ASR)和属性提取,以完成结构化实体解析;3)多层次解释(Multi-level Interpreting),从局部语义构建到全局逻辑的推理链。该框架的一个关键优势是其证据锚定机制,强制高层语义描述与低层事实之间的严格对齐。这使得“基于证据”的逻辑归纳成为可能,将非结构化信号转化为可定位、可枚举和可追踪的标准化知识。在此基础上,我们构建了一个标准化数据集,并发布了逻辑解析全能模型(Logics-Parsing-Omni),成功地将复杂的视听信号转化为机器可读的结构化知识。实验表明,细粒度感知与高层认知是协同的,有效增强了模型的可靠性。此外,为了定量评估这些能力,我们引入了OmniParsingBench。代码、模型和基准测试已在 https://github.com/alibaba/Logics-Parsing/tree/master/Logics-Parsing-Omni 发布。
cs.AI / 38 / 2603.09678
EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages
EsoLang-Bench:通过晦涩编程语言评估大型语言模型的真实推理能力
Abstract
Large language models achieve near-ceiling performance on code generation benchmarks, yet these results increasingly reflect memorization rather than genuine reasoning. We introduce EsoLang-Bench, a benchmark using five esoteric programming languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) that lack benchmark gaming incentives due to their economic irrationality for pre-training. These languages require the same computational primitives as mainstream programming but have 1,000-100,000x fewer public repositories than Python (based on GitHub search counts). We evaluate five frontier models across five prompting strategies and find a dramatic capability gap: models achieving 85-95% on standard benchmarks score only 0-11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier. Few-shot learning and self-reflection fail to improve performance, suggesting these techniques exploit training priors rather than enabling genuine learning. EsoLang-Bench provides the first benchmark designed to mimic human learning by acquiring new languages through documentation, interpreter feedback, and iterative experimentation, measuring transferable reasoning skills resistant to data contamination.
Chinese Translation
大型语言模型在代码生成基准测试中达到了接近上限的性能,然而这些结果越来越多地反映了记忆而非真实推理。我们引入了EsoLang-Bench,这是一个使用五种晦涩编程语言(Brainfuck、Befunge-98、Whitespace、Unlambda和Shakespeare)的基准测试,这些语言由于在预训练中缺乏经济合理性,因而不存在操纵基准测试的激励。这些语言所需的计算原语与主流编程语言相同,但其公共代码库比Python少1,000到100,000倍(基于GitHub搜索统计)。我们评估了五个前沿模型在五种提示策略下的表现,并发现了显著的能力差距:在标准基准测试中达到85-95%的模型在等效的晦涩任务中仅得分0-11%,在简单级别以上的准确率为0%。少样本学习和自我反思未能提高性能,表明这些技术利用了训练先验,而非实现真正的学习。EsoLang-Bench提供了第一个旨在模仿人类通过文档、解释器反馈和迭代实验获取新语言的学习过程的基准,测量对数据污染具有抵抗力的可转移推理能力。
cs.AI / 39 / 2603.09706
OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences
OOD-MMSafe:从有害意图到隐性后果推进多模态大型语言模型的安全性
Abstract
While safety alignment for Multimodal Large Language Models (MLLMs) has gained significant attention, current paradigms primarily target malicious intent or situational violations. We propose shifting the safety frontier toward consequence-driven safety, a paradigm essential for the robust deployment of autonomous and embodied agents. To formalize this shift, we introduce OOD-MMSafe, a benchmark comprising 455 curated query-image pairs designed to evaluate a model's ability to identify latent hazards within context-dependent causal chains. Our analysis reveals a pervasive causal blindness among frontier models, with the highest 67.5% failure rate in high-capacity closed-source models, and identifies a preference ceiling where static alignment yields format-centric failures rather than improved safety reasoning as model capacity grows. To address these bottlenecks, we develop the Consequence-Aware Safety Policy Optimization (CASPO) framework, which integrates the model's intrinsic reasoning as a dynamic reference for token-level self-distillation rewards. Experimental results demonstrate that CASPO significantly enhances consequence projection, reducing the failure ratio of risk identification to 7.3% for Qwen2.5-VL-7B and 5.7% for Qwen3-VL-4B while maintaining overall effectiveness.
Chinese Translation
虽然多模态大型语言模型(MLLMs)的安全对齐已引起了显著关注,但当前的范式主要针对恶意意图或情境违规。我们建议将安全前沿转向以后果为驱动的安全,这一范式对于自主和具身代理的稳健部署至关重要。为了正式化这一转变,我们引入了OOD-MMSafe,这是一个包含455对精心策划的查询-图像对的基准,旨在评估模型识别上下文依赖因果链中潜在危害的能力。我们的分析揭示了前沿模型普遍存在的因果盲点,其中高容量闭源模型的失败率高达67.5%,并识别出一种偏好上限,即静态对齐导致以格式为中心的失败,而不是随着模型容量的增长而改善安全推理。为了解决这些瓶颈,我们开发了后果意识安全策略优化(CASPO)框架,该框架将模型的内在推理作为动态参考,应用于令牌级自蒸馏奖励。实验结果表明,CASPO显著增强了后果预测,将风险识别的失败率降低至Qwen2.5-VL-7B的7.3%和Qwen3-VL-4B的5.7%,同时保持了整体有效性。
cs.AI / 40 / 2603.09715
Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT
问题真的重要吗?面向视觉-语言监督微调的免训练数据选择
Abstract
Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample's true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model's assessment of answer validity given an image. CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require vision-language joint reasoning while filtering semantic-conflict noise. Experiments on Vision-Flan and The Cauldron show that CVS achieves solid performance across datasets. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of the data, respectively, and remains robust on the highly heterogeneous Cauldron dataset. Moreover, CVS reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS.
Chinese Translation
视觉指令调优对于提升视觉-语言大模型(VLLMs)至关重要。然而,许多样本可以通过语言模式或常识捷径解决,而不需要真正的跨模态推理,这限制了多模态学习的有效性。以往的数据选择方法通常依赖于昂贵的代理模型训练,并侧重于难度或多样性,未能捕捉样本对视觉-语言联合推理的真实贡献。本文提出了一种名为CVS的免训练数据选择方法,其基于这样一个观点:对于高质量的多模态样本,引入问题应该显著改变模型对给定图像的答案有效性的评估。CVS利用一个冻结的VLLM作为评估者,测量在以问题为条件与不以问题为条件两种情况下答案有效性的差异,从而识别出需要视觉-语言联合推理的样本,同时过滤语义冲突噪声。在Vision-Flan和The Cauldron上的实验表明,CVS在各数据集上均表现出良好的性能。在Vision-Flan上,CVS在仅使用10%和15%数据的情况下,分别比全数据训练提高了3.5%和4.8%的性能,并且在高度异质的Cauldron数据集上保持稳健。此外,与COINCIDE和XMAS相比,CVS将计算成本分别降低了17.3%和44.4%。
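The CVS selection signal described above can be sketched in a few lines: score each sample by how much adding the question shifts an evaluator's assessment of answer validity given the image. The `validity` callable below is a toy stand-in for the frozen VLLM scorer; all names and scores are illustrative, not the paper's interface.

```python
# Sketch of the CVS idea: a sample matters for vision-language joint reasoning
# when conditioning on the question substantially changes the assessed
# validity of the answer given the image.

def cvs_score(validity, image, question, answer):
    with_q = validity(image, question, answer)   # validity given image + question
    without_q = validity(image, None, answer)    # validity given image alone
    return abs(with_q - without_q)               # large shift => keep the sample

def select_top(samples, validity, k):
    # Keep the k samples whose validity shifts most when the question is added.
    ranked = sorted(samples, key=lambda s: cvs_score(validity, *s), reverse=True)
    return ranked[:k]

def toy_validity(image, question, answer):
    # Toy scorer: only for the "chart" image does the question change anything.
    if question is None:
        return 0.5
    return 0.9 if image == "chart" else 0.5

samples = [("chart", "what trend?", "rising"), ("cat", "what animal?", "cat")]
kept = select_top(samples, toy_validity, 1)
```

A sample whose answer is equally plausible with or without the question (a shortcut sample) scores near zero and is filtered out, which is the behaviour the abstract attributes to CVS.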
cs.AI / 41 / 2603.09716
AutoAgent: Evolving Cognition and Elastic Memory Orchestration for Adaptive Agents
AutoAgent:用于自适应代理的进化认知与弹性记忆编排
Abstract
Autonomous agent frameworks still struggle to reconcile long-term experiential learning with real-time, context-sensitive decision-making. In practice, this gap appears as static cognition, rigid workflow dependence, and inefficient context usage, which jointly limit adaptability in open-ended and non-stationary environments. To address these limitations, we present AutoAgent, a self-evolving multi-agent framework built on three tightly coupled components: evolving cognition, on-the-fly contextual decision-making, and elastic memory orchestration. At the core of AutoAgent, each agent maintains structured prompt-level cognition over tools, self-capabilities, peer expertise, and task knowledge. During execution, this cognition is combined with live task context to select actions from a unified space that includes tool calls, LLM-based generation, and inter-agent requests. To support efficient long-horizon reasoning, an Elastic Memory Orchestrator dynamically organizes interaction history by preserving raw records, compressing redundant trajectories, and constructing reusable episodic abstractions, thereby reducing token overhead while retaining decision-critical evidence. These components are integrated through a closed-loop cognitive evolution process that aligns intended actions with observed outcomes to continuously update cognition and expand reusable skills, without external retraining. Empirical results across retrieval-augmented reasoning, tool-augmented agent benchmarks, and embodied task environments show that AutoAgent consistently improves task success, tool-use efficiency, and collaborative robustness over static and memory-augmented baselines. Overall, AutoAgent provides a unified and practical foundation for adaptive autonomous agents that must learn from experience while making reliable context-aware decisions in dynamic environments.
Chinese Translation
自主代理框架仍然面临将长期经验学习与实时、上下文敏感的决策相结合的挑战。在实践中,这一差距表现为静态认知、僵化的工作流程依赖和低效的上下文使用,这共同限制了在开放式和非平稳环境中的适应性。为了解决这些局限性,我们提出了AutoAgent,一个自我进化的多代理框架,基于三个紧密耦合的组件:进化认知、即时上下文决策和弹性记忆编排。在AutoAgent的核心,每个代理维护对工具、自我能力、同伴专业知识和任务知识的结构化提示级认知。在执行过程中,这种认知与实时任务上下文相结合,从统一的空间中选择行动,包括工具调用、基于大型语言模型(LLM)的生成和代理间请求。为了支持高效的长时间推理,弹性记忆编排器动态组织交互历史,通过保留原始记录、压缩冗余轨迹和构建可重用的情节抽象,从而减少令牌开销,同时保留决策关键证据。这些组件通过闭环认知进化过程集成,该过程将预期行动与观察结果对齐,以持续更新认知并扩展可重用技能,而无需外部再训练。通过检索增强推理、工具增强代理基准和具身任务环境的实证结果表明,AutoAgent在任务成功率、工具使用效率和协作鲁棒性方面始终优于静态和记忆增强基线。总体而言,AutoAgent为必须从经验中学习,同时在动态环境中做出可靠的上下文感知决策的自适应自主代理提供了统一且实用的基础。
cs.AI / 42 / 2603.09774
World2Mind: Cognition Toolkit for Allocentric Spatial Reasoning in Foundation Models
World2Mind:面向基础模型非自我中心空间推理的认知工具包
Abstract
Achieving robust spatial reasoning remains a fundamental challenge for current Multimodal Foundation Models (MFMs). Existing methods either overfit statistical shortcuts via 3D grounding data or remain confined to 2D visual perception, limiting both spatial reasoning accuracy and generalization in unseen scenarios. Inspired by the spatial cognitive mapping mechanisms of biological intelligence, we propose World2Mind, a training-free spatial intelligence toolkit. At its core, World2Mind leverages 3D reconstruction and instance segmentation models to construct structured spatial cognitive maps, empowering MFMs to proactively acquire targeted spatial knowledge regarding landmarks and routes of interest. To provide robust geometric-topological priors, World2Mind synthesizes an Allocentric-Spatial Tree (AST) that uses elliptical parameters to model the top-down layout of landmarks accurately. To mitigate the inherent inaccuracies of 3D reconstruction, we introduce a three-stage reasoning chain comprising tool invocation assessment, modality-decoupled cue collection, and geometry-semantics interwoven reasoning. Extensive experiments demonstrate that World2Mind boosts the performance of frontier models, such as GPT-5.2, by 5%~18%. Astonishingly, relying solely on the AST-structured text, purely text-only foundation models can perform complex 3D spatial reasoning, achieving performance approaching that of advanced multimodal models.
Chinese Translation
实现稳健的空间推理仍然是当前多模态基础模型(MFMs)面临的基本挑战。现有方法要么通过3D基础数据过度拟合统计捷径,要么局限于2D视觉感知,从而限制了空间推理的准确性和在未知场景中的泛化能力。受到生物智能空间认知映射机制的启发,我们提出了World2Mind,一个免训练的空间智能工具包。World2Mind的核心利用3D重建和实例分割模型构建结构化的空间认知地图,使MFMs能够主动获取有关感兴趣地标和路线的目标空间知识。为了提供稳健的几何-拓扑先验,World2Mind合成了一个非自我中心空间树(Allocentric-Spatial Tree, AST),该树使用椭圆参数准确建模地标的自上而下布局。为减轻3D重建固有的不准确性,我们引入了一个三阶段推理链,包括工具调用评估、模态解耦线索收集和几何-语义交织推理。大量实验表明,World2Mind提升了前沿模型(如GPT-5.2)的性能,提升幅度在5%~18%之间。令人惊讶的是,仅依赖AST结构化文本,纯文本基础模型也能进行复杂的3D空间推理,其性能接近于先进的多模态模型。
cs.AI / 43 / 2603.09786
Quantifying the Necessity of Chain of Thought through Opaque Serial Depth
通过不透明串行深度量化思维链的必要性
Abstract
Large language models (LLMs) tend to externalize their reasoning in their chain of thought, making the chain of thought a good target for monitoring. This is partially an inherent feature of the Transformer architecture: sufficiently long serial cognition must pass through the chain of thought (Korbak et al., 2025). We formalize this argument through the notion of opaque serial depth, given by the length of the longest computation that can be done without the use of interpretable intermediate steps like chain of thought. Given this formalization, we compute numeric upper bounds on the opaque serial depth of Gemma 3 models, as well as asymptotic results for additional architectures beyond standard LLMs. We also open-source an automated method that can calculate upper bounds on the opaque serial depth of arbitrary neural networks, and use it to demonstrate that Mixture-of-Experts models likely have lower depth than dense models. Overall, our results suggest that opaque serial depth is a useful tool for understanding the potential for models to do significant reasoning that is not externalized.
Chinese Translation
大型语言模型(LLMs)倾向于在其思维链中外化推理,使得思维链成为监控的良好目标。这在一定程度上是变换器架构的固有特征:足够长的串行认知必须通过思维链(Korbak et al., 2025)。我们通过不透明串行深度的概念形式化这一论点,该概念由可以在没有可解释中间步骤(如思维链)情况下进行的最长计算的长度给出。基于这一形式化,我们计算了Gemma 3模型的不透明串行深度的数值上界,以及超出标准LLMs的其他架构的渐近结果。我们还开源了一种自动化方法,可以计算任意神经网络的不透明串行深度的上界,并利用该方法展示混合专家模型的深度可能低于密集模型。总体而言,我们的结果表明,不透明串行深度是理解模型进行未外化重大推理潜力的有用工具。
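The notion of opaque serial depth above lends itself to a small graph sketch: view the computation as a DAG of operations and bound the longest chain of opaque ops that never passes through an interpretable checkpoint (such as an emitted chain-of-thought token). The graph encoding and node names below are illustrative assumptions, not the paper's automated method.

```python
# Sketch of an opaque-serial-depth upper bound on a computation DAG.
# An "interpretable" node (e.g. an emitted CoT token) breaks any opaque
# chain, so depth is the longest all-opaque path in the graph.

def opaque_serial_depth(edges, interpretable):
    # edges: dict mapping node -> list of successor nodes (must be acyclic).
    # interpretable: set of nodes whose outputs are externally readable.
    nodes = set(edges) | {n for succs in edges.values() for n in succs}
    memo = {}

    def opaque_run(node):
        # Length of the longest all-opaque path starting at `node`.
        if node in interpretable:
            return 0
        if node not in memo:
            memo[node] = 1 + max(
                (opaque_run(n) for n in edges.get(node, [])), default=0
            )
        return memo[node]

    return max(opaque_run(n) for n in nodes)

# Linear chain a -> b -> c -> d where b is an interpretable checkpoint:
# the longest opaque stretch is c -> d, giving a bound of 2.
edges = {"a": ["b"], "b": ["c"], "c": ["d"]}
depth = opaque_serial_depth(edges, interpretable={"b"})
```

Forcing long serial cognition through interpretable checkpoints keeps this bound small, which is the monitoring argument the abstract formalizes.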
cs.AI / 44 / 2603.09888
LCA: Local Classifier Alignment for Continual Learning
LCA:持续学习的局部分类器对齐
Abstract
A fundamental requirement for intelligent systems is the ability to learn continuously under changing environments. However, models trained in this regime often suffer from catastrophic forgetting. Leveraging pre-trained models has recently emerged as a promising solution, since their generalized feature extractors enable faster and more robust adaptation. While some earlier works mitigate forgetting by fine-tuning only on the first task, this approach quickly deteriorates as the number of tasks grows and the data distributions diverge. More recent research instead seeks to consolidate task knowledge into a unified backbone, or to adapt the backbone as new tasks arrive. However, such approaches may create a potential mismatch between task-specific classifiers and the adapted backbone. To address this issue, we propose a novel Local Classifier Alignment (LCA) loss to better align the classifier with the backbone. Theoretically, we show that this LCA loss can enable the classifier to not only generalize well for all observed tasks, but also improve robustness. Furthermore, we develop a complete solution for continual learning, following the model merging approach and using LCA. Extensive experiments on several standard benchmarks demonstrate that our method often achieves leading performance, sometimes surpassing the state-of-the-art methods by a large margin.
Chinese Translation
智能系统的一个基本要求是能够在不断变化的环境中持续学习。然而,在这种情况下训练的模型往往会遭遇灾难性遗忘。最近,利用预训练模型作为一种有前景的解决方案浮出水面,因为它们的通用特征提取器能够实现更快且更稳健的适应。尽管一些早期的研究通过仅在第一个任务上进行微调来减轻遗忘,但随着任务数量的增加和数据分布的偏离,这种方法很快会恶化。更近期的研究则寻求将任务知识整合到一个统一的主干中,或在新任务到来时对主干进行适应。然而,这些方法可能会在任务特定分类器和适应后的主干之间产生(潜在的)不匹配。为了解决这个问题,我们提出了一种新颖的局部分类器对齐(Local Classifier Alignment, LCA)损失,以更好地对齐分类器与主干。从理论上讲,我们证明了这种LCA损失不仅可以使分类器在所有观察到的任务上良好泛化,还可以提高鲁棒性。此外,我们开发了一个完整的持续学习解决方案,遵循模型合并的方法并使用LCA。在多个标准基准上的大量实验表明,我们的方法通常能够实现领先的性能,有时甚至以较大幅度超越最先进的方法。
cs.AI / 45 / 2603.09890
Influencing LLM Multi-Agent Dialogue via Policy-Parameterized Prompts
通过策略参数化提示影响大语言模型多智能体对话
Abstract
Large Language Models (LLMs) have emerged as a new paradigm for multi-agent systems. However, existing research on the behaviour of LLM-based multi-agents relies on ad hoc prompts and lacks a principled policy perspective. Different from reinforcement learning, we investigate whether prompt-as-action can be parameterized so as to construct a lightweight policy which consists of a sequence of state-action pairs to influence conversational behaviours without training. Our framework regards prompts as actions executed by LLMs, and dynamically constructs prompts through five components based on the current state of the agent. To test the effectiveness of parameterized control, we evaluated the dialogue flow based on five indicators: responsiveness, rebuttal, evidence usage, non-repetition, and stance shift. We conduct experiments using different LLM-driven agents in two discussion scenarios related to the general public and show that prompt parameterization can influence the dialogue dynamics. This result shows that policy-parameterized prompts offer a simple and effective mechanism to influence the dialogue process, which can support research on multi-agent systems in the direction of social simulation.
Chinese Translation
大型语言模型(LLMs)已成为多智能体系统的新范式。然而,现有关于基于LLM的多智能体行为的研究依赖于临时提示,缺乏原则性的政策视角。与强化学习不同,我们探讨了是否可以将提示作为动作进行参数化,从而构建一个轻量级政策,该政策由一系列状态-动作对组成,以在不进行训练的情况下影响对话行为。我们的框架将提示视为由LLM执行的动作,并根据智能体的当前状态通过五个组件动态构建提示。为了测试参数化控制的有效性,我们基于五个指标评估了对话流:响应性、反驳、证据使用、不重复和立场转变。我们在与公众相关的两个讨论场景中使用不同的LLM驱动的智能体进行实验,结果表明提示参数化可以影响对话动态。该结果表明,策略参数化提示提供了一种简单有效的机制来影响对话过程,这将有助于多智能体系统在社会模拟方向的研究。
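The prompt-as-action idea above can be sketched as a lightweight policy: an ordered list of (state predicate, prompt template) pairs, where the first matching pair yields the prompt injected at that turn. The component names and state keys below are illustrative; the paper assembles prompts from five components, which this toy policy does not reproduce.

```python
# Sketch of a prompt-as-action policy: a sequence of state-conditioned rules
# that dynamically constructs the prompt steering an LLM agent's next turn.

def make_policy(rules):
    # rules: list of (predicate(state) -> bool, template) pairs, in priority order.
    def act(state):
        for predicate, template in rules:
            if predicate(state):
                return template.format(**state)
        return ""  # no rule matched: leave the agent's default prompt untouched
    return act

policy = make_policy([
    (lambda s: s["repeats"] > 2,
     "Avoid repeating earlier points about {topic}."),
    (lambda s: not s["cited_evidence"],
     "Support your next claim about {topic} with evidence."),
])
```

At each turn the dialogue state (e.g. repetition count, whether evidence was cited) selects the prompt action, so conversational behaviour is influenced without any model training.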
cs.AI / 46 / 2603.09909
MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems
MedMASLab:用于基准测试多模态医疗多智能体系统的统一编排框架
Abstract
While Multi-Agent Systems (MAS) show potential for complex clinical decision support, the field remains hindered by architectural fragmentation and the lack of standardized multimodal integration. Current medical MAS research suffers from non-uniform data ingestion pipelines, inconsistent visual-reasoning evaluation, and a lack of cross-specialty benchmarking. To address these challenges, we present MedMASLab, a unified framework and benchmarking platform for multimodal medical multi-agent systems. MedMASLab introduces: (1) A standardized multimodal agent communication protocol that enables seamless integration of 11 heterogeneous MAS architectures across 24 medical modalities. (2) An automated clinical reasoning evaluator, a zero-shot semantic evaluation paradigm that overcomes the limitations of lexical string-matching by leveraging large vision-language models to verify diagnostic logic and visual grounding. (3) The most extensive benchmark to date, spanning 11 organ systems and 473 diseases, standardizing data from 11 clinical benchmarks. Our systematic evaluation reveals a critical domain-specific performance gap: while MAS improves reasoning depth, current architectures exhibit significant fragility when transitioning between specialized medical sub-domains. We provide a rigorous ablation of interaction mechanisms and cost-performance trade-offs, establishing a new technical baseline for future autonomous clinical systems. The source code and data are publicly available at: https://github.com/NUS-Project/MedMASLab/
Chinese Translation
尽管多智能体系统(MAS)在复杂临床决策支持方面展现出潜力,但该领域仍受到架构碎片化和缺乏标准化多模态集成的制约。目前的医疗MAS研究面临数据摄取管道不统一、视觉推理评估不一致以及缺乏跨专业基准测试等问题。为了解决这些挑战,我们提出了MedMASLab,这是一个用于多模态医疗多智能体系统的统一框架和基准测试平台。MedMASLab引入了:(1)一种标准化的多模态智能体通信协议,使11种异构MAS架构能够无缝集成24种医疗模态;(2)一种自动化临床推理评估器,一种零样本语义评估范式,通过利用大型视觉-语言模型来验证诊断逻辑和视觉基础,克服了词汇字符串匹配的局限性;(3)迄今为止最广泛的基准测试,涵盖11个器官系统和473种疾病,标准化来自11个临床基准的数据。我们的系统评估揭示了一个关键的领域特定性能差距:尽管MAS提高了推理深度,但当前架构在不同专业医疗子领域之间过渡时表现出显著脆弱性。我们提供了交互机制和成本-性能权衡的严格消融分析,为未来自主临床系统建立了新的技术基线。源代码和数据可在以下网址公开获取:https://github.com/NUS-Project/MedMASLab/
cs.AI / 47 / 2603.09943
PathMem: Toward Cognition-Aligned Memory Transformation for Pathology MLLMs
PathMem:面向病理多模态大语言模型的认知对齐记忆转化
Abstract
Computational pathology demands both visual pattern recognition and dynamic integration of structured domain knowledge, including taxonomy, grading criteria, and clinical evidence. In practice, diagnostic reasoning requires linking morphological evidence with formal diagnostic and grading criteria. Although multimodal large language models (MLLMs) demonstrate strong vision language reasoning capabilities, they lack explicit mechanisms for structured knowledge integration and interpretable memory control. As a result, existing models struggle to consistently incorporate pathology-specific diagnostic standards during reasoning. Inspired by the hierarchical memory process of human pathologists, we propose PathMem, a memory-centric multimodal framework for pathology MLLMs. PathMem organizes structured pathology knowledge as a long-term memory (LTM) and introduces a Memory Transformer that models the dynamic transition from LTM to working memory (WM) through multimodal memory activation and context-aware knowledge grounding, enabling context-aware memory refinement for downstream reasoning. PathMem achieves SOTA performance across benchmarks, improving WSI-Bench report generation (12.8% WSI-Precision, 10.1% WSI-Relevance) and open-ended diagnosis by 9.7% and 8.9% over prior WSI-based models.
Chinese Translation
计算病理学既需要视觉模式识别,又需要动态整合结构化领域知识,包括分类法、分级标准和临床证据。在实践中,诊断推理需要将形态学证据与正式的诊断和分级标准联系起来。尽管多模态大语言模型(MLLMs)展示了强大的视觉语言推理能力,但它们缺乏结构化知识整合和可解释的记忆控制的明确机制。因此,现有模型在推理过程中难以一致地融入病理特定的诊断标准。受到人类病理学家层次记忆过程的启发,我们提出了PathMem,一个以记忆为中心的病理多模态框架。PathMem将结构化的病理知识组织为长期记忆(LTM),并引入一种记忆变换器(Memory Transformer),通过多模态记忆激活和上下文感知知识基础,建模LTM到工作记忆(WM)的动态过渡,从而实现上下文感知的记忆精炼,以支持下游推理。PathMem在各基准测试中实现了最先进的性能,相较于之前基于WSI的模型,在WSI-Bench报告生成上提高了12.8%的WSI精准度和10.1%的WSI相关性,在开放式诊断中提高了9.7%和8.9%。
cs.AI / 48 / 2603.09947
The Confidence Gate Theorem: When Should Ranked Decision Systems Abstain?
置信门定理:排名决策系统何时应当弃权?
Abstract
Ranked decision systems (recommenders, ad auctions, clinical triage queues) must decide when to intervene in ranked outputs and when to abstain. We study when confidence-based abstention monotonically improves decision quality, and when it fails. The formal conditions are simple: rank-alignment and no inversion zones. The substantive contribution is identifying why these conditions hold or fail: the distinction between structural uncertainty (missing data, e.g., cold-start) and contextual uncertainty (missing context, e.g., temporal drift). Empirically, we validate this distinction across three domains: collaborative filtering (MovieLens, 3 distribution shifts), e-commerce intent detection (RetailRocket, Criteo, Yoochoose), and clinical pathway triage (MIMIC-IV). Structural uncertainty produces near-monotonic abstention gains in all domains; structurally grounded confidence signals (observation counts) fail under contextual drift, producing as many monotonicity violations as random abstention on our MovieLens temporal split. Context-aware alternatives (ensemble disagreement and recency features) substantially narrow the gap (reducing violations from 3 to 1-2) but do not fully restore monotonicity, suggesting that contextual uncertainty poses qualitatively different challenges. Exception labels defined from residuals degrade substantially under distribution shift (AUC drops from 0.71 to 0.61-0.62 across three splits), providing a clean negative result against the common practice of exception-based intervention. The results provide a practical deployment diagnostic: check C1 and C2 on held-out data before deploying a confidence gate, and match the confidence signal to the dominant uncertainty type.
Chinese Translation
排名决策系统——推荐系统、广告拍卖、临床分诊队列——必须决定何时介入排名输出,何时选择弃权。我们研究了基于置信度的弃权何时单调改善决策质量,何时又失败。正式条件很简单:排名一致性和无反转区。实质性贡献在于识别这些条件成立或失败的原因:结构不确定性(缺失数据,例如冷启动)与上下文不确定性(缺失上下文,例如时间漂移)之间的区别。从实证上看,我们在三个领域验证了这一区别:协同过滤(MovieLens,3个分布转变)、电子商务意图检测(RetailRocket,Criteo,Yoochoose)和临床路径分诊(MIMIC-IV)。结构不确定性在所有领域产生近乎单调的弃权增益;在上下文漂移下,基于结构的置信信号(观察计数)失败,导致在我们的 MovieLens 时间分割上产生与随机弃权一样多的单调性违反。考虑上下文的替代方案——集成不一致性和近期特征——显著缩小了差距(将违反次数从3减少到1-2),但并未完全恢复单调性,表明上下文不确定性带来了质的不同挑战。基于残差定义的异常标签在分布转变下显著降级(AUC 从0.71下降到0.61-0.62,跨越三个分割),为基于异常的干预常见做法提供了一个清晰的负面结果。这些结果提供了一个实用的部署诊断:在部署置信门之前,检查留出数据上的 C1 和 C2,并将置信信号与主导的不确定性类型匹配。
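The gate itself is mechanically simple and can be sketched in a few lines: intervene only on items whose confidence clears a threshold, and abstain (pass through unchanged) otherwise. The names and toy scores below are illustrative; per the abstract, the C1/C2 conditions would be checked on held-out data before deploying such a gate, and the confidence signal matched to the dominant uncertainty type.

```python
# Minimal sketch of a confidence gate for a ranked decision system:
# high-confidence items receive the intervention, low-confidence items
# are passed through untouched (abstention).

def confidence_gate(items, confidence, threshold, intervene):
    decided, abstained = [], []
    for item in items:
        if confidence[item] >= threshold:
            decided.append(intervene(item))  # confident: apply intervention
        else:
            abstained.append(item)           # abstain: keep original ranking
    return decided, abstained

scores = {"item_a": 0.92, "item_b": 0.31}    # toy confidence scores
decided, abstained = confidence_gate(
    ["item_a", "item_b"], scores, threshold=0.7,
    intervene=lambda x: ("flagged", x),
)
```

Whether raising the threshold monotonically improves decision quality is exactly what the paper's rank-alignment and no-inversion-zone conditions govern; the gate alone guarantees nothing.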
cs.AI / 49 / 2603.09957
Think Before You Lie: How Reasoning Improves Honesty
撒谎前先思考:推理如何提高诚实性
Abstract
While existing evaluations of large language models (LLMs) measure deception rates, the underlying conditions that give rise to deceptive behavior are poorly understood. We investigate this question using a novel dataset of realistic moral trade-offs where honesty incurs variable costs. Contrary to humans, who tend to become less honest given time to deliberate (Capraro, 2017; Capraro et al., 2019), we find that reasoning consistently increases honesty across scales and for several LLM families. This effect is not only a function of the reasoning content, as reasoning traces are often poor predictors of final behaviors. Rather, we show that the underlying geometry of the representational space itself contributes to the effect. Namely, we observe that deceptive regions within this space are metastable: deceptive answers are more easily destabilized by input paraphrasing, output resampling, and activation noise than honest ones. We interpret the effect of reasoning in this vein: generating deliberative tokens as part of moral reasoning entails the traversal of a biased representational space, ultimately nudging the model toward its more stable, honest defaults.
Chinese Translation
尽管现有的大型语言模型(LLMs)评估测量了欺骗率,但导致欺骗行为的潜在条件仍然不甚明了。我们使用一个新颖的数据集来研究这个问题,该数据集包含了现实的道德权衡,其中诚实会产生不同的成本。与人类在有时间深思熟虑时往往变得不那么诚实(Capraro, 2017; Capraro et al., 2019)相反,我们发现推理在多个尺度和几种LLM家族中始终提高了诚实性。这个效应不仅仅是推理内容的函数,因为推理轨迹往往是最终行为的糟糕预测指标。相反,我们展示了表示空间本身的底层几何结构对这一效应的贡献。具体而言,我们观察到该空间内的欺骗区域是亚稳定的:欺骗性答案比诚实答案更容易受到输入释义、输出重采样和激活噪声的影响而失稳。我们从这个角度解释推理的效应:在道德推理中生成深思熟虑的标记意味着穿越一个有偏见的表示空间,最终将模型推向其更稳定的诚实默认值。
cs.CL / 1 / 2603.08869
One Language, Two Scripts: Probing Script-Invariance in LLM Concept Representations
一种语言,两种书写:探讨大型语言模型概念表征中的书写不变性
Abstract
Do the features learned by Sparse Autoencoders (SAEs) represent abstract meaning, or are they tied to how text is written? We investigate this question using Serbian digraphia as a controlled testbed: Serbian is written interchangeably in Latin and Cyrillic scripts with a near-perfect character mapping between them, enabling us to vary orthography while holding meaning exactly constant. Crucially, these scripts are tokenized completely differently, sharing no tokens whatsoever. Analyzing SAE feature activations across the Gemma model family (270M-27B parameters), we find that identical sentences in different Serbian scripts activate highly overlapping features, far exceeding random baselines. Strikingly, changing script causes less representational divergence than paraphrasing within the same script, suggesting SAE features prioritize meaning over orthographic form. Cross-script cross-paraphrase comparisons provide evidence against memorization, as these combinations rarely co-occur in training data yet still exhibit substantial feature overlap. This script invariance strengthens with model scale. Taken together, our findings suggest that SAE features can capture semantics at a level of abstraction above surface tokenization, and we propose Serbian digraphia as a general evaluation paradigm for probing the abstractness of learned representations.
Chinese Translation
稀疏自编码器(Sparse Autoencoders, SAEs)学习到的特征是代表抽象意义,还是与文本的书写方式相关?我们使用塞尔维亚双书写法作为受控测试平台来研究这个问题:塞尔维亚语可以在拉丁字母和西里尔字母之间交替书写,两者之间几乎完美的字符映射使我们能够在保持意义完全不变的情况下改变正字法。重要的是,这两种书写方式的标记化方式完全不同,完全没有共享的标记。通过分析Gemma模型系列(参数从2.7亿到270亿)的SAE特征激活,我们发现不同塞尔维亚书写方式中的相同句子激活了高度重叠的特征,远远超过随机基线。值得注意的是,改变书写方式导致的表征差异小于在同一书写方式内进行释义,这表明SAE特征优先考虑意义而非正字法形式。跨书写方式和跨释义的比较提供了反对记忆化的证据,因为这些组合在训练数据中很少同时出现,但仍然表现出显著的特征重叠。这种书写不变性随着模型规模的扩大而增强。综合来看,我们的发现表明SAE特征能够在超越表面标记化的抽象层面捕捉语义,我们提出将塞尔维亚双书写法作为探测学习表征抽象程度的一种通用评估范式。
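The cross-script feature-overlap measurement described above can be illustrated with a toy computation. The paper does not specify its exact overlap statistic, so the Jaccard index over sets of active SAE feature indices below is an illustrative stand-in, using hand-made activation vectors:

```python
def active_features(activations, threshold=0.0):
    """Indices of SAE features firing above threshold for one input."""
    return {i for i, a in enumerate(activations) if a > threshold}

def jaccard_overlap(acts_a, acts_b):
    """Jaccard index between the active-feature sets of two inputs."""
    fa, fb = active_features(acts_a), active_features(acts_b)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)

# Hand-made activations: one sentence rendered in two scripts, plus an
# unrelated sentence. Real SAE activation vectors have thousands of dims.
latin     = [0.9, 0.0, 0.7, 0.0, 0.4]
cyrillic  = [0.8, 0.0, 0.6, 0.1, 0.0]
unrelated = [0.0, 0.5, 0.0, 0.3, 0.0]

print(jaccard_overlap(latin, cyrillic))   # 0.5: high cross-script overlap
print(jaccard_overlap(latin, unrelated))  # 0.0: no overlap with baseline
```

In the paper's setting, the finding is that script changes (Latin vs. Cyrillic) produce less divergence in such overlap measures than paraphrases do.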
cs.CL / 2 / 2603.08879
MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers
MultiGraSCCo:一个带有个人标识符注释的多语言匿名化基准
Abstract
Accessing sensitive patient data for machine learning is challenging due to privacy concerns. Datasets with annotations of personally identifiable information are crucial for developing and testing anonymization systems to enable safe data sharing that complies with privacy regulations. Since accessing real patient data is a bottleneck, synthetic data offers an efficient solution for data scarcity, bypassing privacy regulations that apply to real data. Moreover, neural machine translation can help to create high-quality data for low-resource languages by translating validated real or synthetic data from a high-resource language. In this work, we create a multilingual anonymization benchmark in ten languages, using a machine translation methodology that preserves the original annotations and renders names of cities and people in a culturally and contextually appropriate form in each target language. Our evaluation study with medical professionals confirms the quality of the translations, both in general and with respect to the translation and adaptation of personal information. Our benchmark with over 2,500 annotations of personal information can be used in many applications, including training annotators, validating annotations across institutions without legal complications, and helping improve the performance of automatic personal information detection. We make our benchmark and annotation guidelines available for further research.
Chinese Translation
由于隐私问题,访问敏感的患者数据以进行机器学习面临挑战。带有个人可识别信息注释的数据集对于开发和测试匿名化系统至关重要,以便实现符合隐私法规的安全数据共享。由于访问真实患者数据是一个瓶颈,合成数据为数据稀缺提供了一种有效的解决方案,绕过适用于真实数据的隐私法规。此外,神经机器翻译可以通过将经过验证的真实或合成数据从高资源语言翻译为低资源语言,帮助创建高质量的数据。在本研究中,我们创建了一个涵盖十种语言的多语言匿名化基准,采用了一种机器翻译方法,该方法保留了原始注释,并以文化和上下文适当的形式呈现每种目标语言中的城市和人名。我们与医疗专业人士的评估研究确认了翻译的质量,无论是一般而言,还是在个人信息的翻译和适应方面。我们的基准包含超过2500个个人信息的注释,可用于多种应用,包括培训注释员、在没有法律问题的情况下验证跨机构的注释,以及帮助提高自动个人信息检测的性能。我们将我们的基准和注释指南提供给进一步研究使用。
cs.CL / 3 / 2603.08899
ConFu: Contemplate the Future for Better Speculative Sampling
ConFu:展望未来以改善推测采样
Abstract
Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over steps. In this work, we propose \textbf{ConFu} (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8--11% across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
Chinese Translation
推测解码作为一种强有力的方法,通过使用轻量级草稿模型来加速大型语言模型(LLM)的推理,已逐渐受到关注。这些草稿模型提出候选标记,随后由目标模型进行验证。这一范式的有效性在很大程度上依赖于草稿模型的质量。尽管最近的进展,如EAGLE系列,已实现了最先进的加速,但现有的草稿模型仍然受到错误累积的限制:它们仅基于当前前缀进行条件化,导致其预测在多个步骤中偏离目标模型。在本研究中,我们提出了ConFu(展望未来),一种新颖的推测解码框架,使草稿模型能够预见生成的未来方向。ConFu引入了(i)沉思标记和软提示,使草稿模型能够以微不足道的成本利用来自目标模型的未来导向信号,(ii)具有MoE的动态沉思标记机制,以实现上下文感知的未来预测,以及(iii)一种具有锚定标记采样和未来预测复制的训练框架,以学习稳健的未来预测。实验表明,ConFu在各种下游任务中,相比EAGLE-3提高了8-11%的标记接受率和生成速度,使用的是Llama-3 3B和8B模型。我们相信我们的工作首次将推测解码与连续推理标记结合起来,为加速LLM推理提供了新的方向。
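ConFu builds on the standard speculative-decoding draft-then-verify loop. As background (this sketches the generic verification rule, not ConFu's contemplate-token mechanism), a drafted token is accepted with probability min(1, p_target/p_draft) and otherwise resampled from the residual distribution, which keeps the output exactly target-distributed. Toy single-token distributions:

```python
import random

def speculative_step(draft_probs, target_probs, rng):
    """One draft-then-verify step of speculative sampling over a single
    token position. draft_probs / target_probs map token -> probability."""
    tokens = list(draft_probs)
    # Draft model proposes a token from its own distribution.
    x = rng.choices(tokens, weights=[draft_probs[t] for t in tokens])[0]
    # Target model accepts with probability min(1, p_target(x) / p_draft(x)).
    if rng.random() < min(1.0, target_probs.get(x, 0.0) / draft_probs[x]):
        return x, True
    # On rejection, resample from the residual max(0, p_target - p_draft),
    # which makes the overall output exactly target-distributed.
    residual = {t: max(0.0, p - draft_probs.get(t, 0.0))
                for t, p in target_probs.items()}
    total = sum(residual.values())
    y = rng.choices(list(residual),
                    weights=[w / total for w in residual.values()])[0]
    return y, False

draft  = {"a": 0.6, "b": 0.3, "c": 0.1}  # cheap draft model's distribution
target = {"a": 0.5, "b": 0.4, "c": 0.1}  # target model's distribution
rng = random.Random(0)
token, accepted = speculative_step(draft, target, rng)
print(token, accepted)
```

Over many steps the emitted tokens follow the target distribution exactly; ConFu's contribution is to make the draft distribution track the target better over multiple steps, raising the acceptance rate.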
cs.CL / 4 / 2603.08910
SciTaRC: Benchmarking QA on Scientific Tabular Data that Requires Language Reasoning and Complex Computation
SciTaRC:基于科学表格数据的问答基准,需语言推理和复杂计算
Abstract
We introduce SciTaRC, an expert-authored benchmark of questions about tabular data in scientific papers requiring both deep language reasoning and complex computation. We show that current state-of-the-art AI models fail on at least 23% of these questions, a gap that remains significant even for highly capable open-weight models like Llama-3.3-70B-Instruct, which fails on 65.5% of the tasks. Our analysis reveals a universal "execution bottleneck": both code and language models struggle to faithfully execute plans, even when provided with correct strategies. Specifically, code-based methods prove brittle on raw scientific tables, while natural language reasoning primarily fails due to initial comprehension issues and calculation errors.
Chinese Translation
我们介绍了SciTaRC,这是一个由专家撰写的基准,涉及关于科学论文中表格数据的问题,这些问题需要深层次的语言推理和复杂的计算。我们显示,当前最先进的人工智能模型在至少23%的这些问题上失败,即使是像Llama-3.3-70B-Instruct这样能力很强的开放权重模型,这一差距依然显著,其失败率高达65.5%。我们的分析揭示了一个普遍的“执行瓶颈”:无论是基于代码还是基于语言的模型,即使获得了正确的策略,也难以忠实地执行计划。具体而言,基于代码的方法在处理原始科学表格时表现脆弱,而自然语言推理主要由于初始理解问题和计算错误而失败。
cs.CL / 5 / 2603.08989
Automated Thematic Analysis for Clinical Qualitative Data: Iterative Codebook Refinement with Full Provenance
临床定性数据的自动主题分析:具有完整来源的迭代代码本优化
Abstract
Thematic analysis (TA) is widely used in health research to extract patterns from patient interviews, yet manual TA faces challenges in scalability and reproducibility. LLM-based automation can help, but existing approaches produce codebooks with limited generalizability and lack analytic auditability. We present an automated TA framework combining iterative codebook refinement with full provenance tracking. Evaluated on five corpora spanning clinical interviews, social media, and public transcripts, the framework achieves the highest composite quality score on four of five datasets compared to six baselines. Iterative refinement yields statistically significant improvements on four datasets with large effect sizes, driven by gains in code reusability and distributional consistency while preserving descriptive quality. On two clinical corpora (pediatric cardiology), generated themes align with expert-annotated themes.
Chinese Translation
主题分析(TA)在健康研究中广泛应用于从患者访谈中提取模式,但手动TA在可扩展性和可重复性方面面临挑战。基于大型语言模型(LLM)的自动化可以提供帮助,但现有方法生成的代码本具有有限的普适性,并且缺乏分析审计能力。我们提出了一种自动化TA框架,结合了迭代代码本优化和完整的来源追踪。在涵盖临床访谈、社交媒体和公共转录文本的五个语料库上进行评估,该框架在与六个基准相比的五个数据集中,四个数据集上达到了最高的综合质量评分。迭代优化在四个数据集上带来了统计显著的改进,效果量较大,这得益于代码可重用性和分布一致性的提升,同时保持了描述质量。在两个临床语料库(儿科心脏病学)中,生成的主题与专家标注的主题一致。
cs.CL / 6 / 2603.08999
Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
学习何时采样:置信度感知的自一致性以实现高效的LLM思维链推理
Abstract
Large language models (LLMs) achieve strong reasoning performance through chain-of-thought (CoT) reasoning, yet often generate unnecessarily long reasoning paths that incur high inference cost. Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead. This paper introduces a confidence-aware decision framework that analyzes a single completed reasoning trajectory to adaptively select between single-path and multi-path reasoning. The framework is trained using sentence-level numeric and linguistic features extracted from intermediate reasoning states in the MedQA dataset and generalizes effectively to MathQA, MedMCQA, and MMLU without additional fine-tuning. Experimental results show that the proposed method maintains accuracy comparable to multi-path baselines while using up to 80\% fewer tokens. These findings demonstrate that reasoning trajectories contain rich signals for uncertainty estimation, enabling a simple, transferable mechanism to balance accuracy and efficiency in LLM reasoning.
Chinese Translation
大型语言模型(LLMs)通过思维链(CoT)推理实现了强大的推理性能,但通常生成不必要的冗长推理路径,从而导致高推理成本。近期基于自一致性的方法进一步提高了准确性,但需要对多个推理轨迹进行采样和聚合,造成了显著的额外计算开销。本文提出了一种置信度感知的决策框架,通过分析单条已完成的推理轨迹,自适应地在单路径和多路径推理之间进行选择。该框架使用从MedQA数据集的中间推理状态中提取的句子级数值和语言特征进行训练,并在不进行额外微调的情况下有效推广到MathQA、MedMCQA和MMLU。实验结果表明,所提出的方法在最多减少80%标记使用量的情况下,保持了与多路径基线相当的准确性。这些发现表明,推理轨迹包含丰富的不确定性估计信号,从而使一种简单、可迁移的机制得以在LLM推理中平衡准确性和效率。
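The adaptive single-path/multi-path decision can be sketched as a small gating function. The paper trains a classifier on sentence-level numeric and linguistic features; the min-over-steps confidence rule and the threshold below are illustrative stand-ins:

```python
from collections import Counter

def trajectory_confidence(step_confidences):
    """Toy trajectory-level confidence: the weakest step. (The paper uses
    a learned model over richer features; this min-rule is a stand-in.)"""
    return min(step_confidences)

def adaptive_answer(single_path, extra_paths, threshold=0.7):
    """Return (answer, number of trajectories used). If the one completed
    trajectory looks confident, keep its answer; otherwise fall back to
    self-consistency via majority vote over extra sampled trajectories."""
    answer, step_confs = single_path
    if trajectory_confidence(step_confs) >= threshold:
        return answer, 1
    votes = Counter(a for a, _ in extra_paths)
    votes[answer] += 1  # the original trajectory also gets a vote
    return votes.most_common(1)[0][0], 1 + len(extra_paths)

confident = ("42", [0.9, 0.8, 0.95])
uncertain = ("42", [0.9, 0.4, 0.95])
extra = [("41", None), ("42", None), ("42", None), ("41", None)]
print(adaptive_answer(confident, []))     # ('42', 1): single path suffices
print(adaptive_answer(uncertain, extra))  # ('42', 5): vote over 5 paths
```

The token savings reported in the abstract come from the confident branch: most questions never trigger the expensive multi-path sampling.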
cs.CL / 7 / 2603.09095
Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
阅读,而非思考:理解和弥合多模态大语言模型中文本转化为像素的模态差距
Abstract
Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the modality gap is task- and data-dependent. For example, math tasks degrade by over 60 points on synthetic renderings, while natural document images often match or exceed text-mode performance. Rendering choices such as font and resolution are strong confounds, with font alone swinging accuracy by up to 47 percentage points. To understand this, we conduct a grounded-theory error analysis of over 4,000 examples, revealing that image mode selectively amplifies reading errors (calculation and formatting failures) while leaving knowledge and reasoning errors largely unchanged, and that some models exhibit a chain-of-thought reasoning collapse under visual input. Motivated by these findings, we propose a self-distillation method that trains the model on its own pure text reasoning traces paired with image inputs, raising image-mode accuracy on GSM8K from 30.71% to 92.72% and transferring to unseen benchmarks without catastrophic forgetting. Overall, our study provides a systematic understanding of the modality gap and suggests a practical path toward improving visual text understanding in multimodal language models.
Chinese Translation
多模态大语言模型(MLLMs)能够处理以图像形式呈现的文本,但其表现通常不如以文本标记提供的相同内容。我们通过在五种输入模式下评估七个MLLMs在七个基准测试中的表现,系统性地诊断这一“模态差距”,涵盖了从合成渲染文本到来自arXiv PDF和维基百科页面的真实文档图像。我们发现模态差距依赖于任务和数据。例如,在合成渲染上,数学任务的表现下降超过60分,而自然文档图像的表现通常与文本模式相匹配或超过。渲染选择(如字体和分辨率)是强干扰因素,仅字体就能使准确率波动高达47个百分点。为了理解这一现象,我们对4000多个示例进行了基于扎根理论的错误分析,揭示了图像模式选择性地放大了阅读错误(计算和格式化失败),而知识和推理错误基本保持不变,并且一些模型在视觉输入下表现出思维链推理崩溃。基于这些发现,我们提出了一种自蒸馏方法,该方法在与图像输入配对的纯文本推理轨迹上训练模型,将GSM8K的图像模式准确率从30.71%提高到92.72%,并在未见基准上实现了迁移而没有灾难性遗忘。总体而言,我们的研究提供了对模态差距的系统理解,并提出了一条改善多模态语言模型中视觉文本理解的实际路径。
cs.CL / 8 / 2603.09154
Bioalignment: Measuring and Improving LLM Disposition Toward Biological Systems for AI Safety
生物对齐:测量和改善大型语言模型对生物系统的倾向以确保人工智能安全
Abstract
Large language models (LLMs) trained on internet-scale corpora can exhibit systematic biases that increase the probability of unwanted behavior. In this study, we examined potential biases towards synthetic vs. biological technological solutions across four domains (materials, energy, manufacturing, and algorithms). A sample of 5 frontier and 5 open-weight models were measured using 50 curated Bioalignment prompts with a Kelly criterion-inspired evaluation framework. According to this metric, most models were not bioaligned in that they exhibit biases in favor of synthetic (non-biological) solutions. We next examined if fine-tuning could increase the preferences of two open-weight models, Llama 3.2-3B-Instruct and Qwen2.5-3B-Instruct, for biological-based approaches. A curated corpus of ~22M tokens from 6,636 PMC articles emphasizing biological problem-solving was used first to fine-tune Llama 3B with a mixed corpus of continued training and instruction-formatted. This was then extended to Qwen 3B using instruction-formatted only. We found that QLoRA fine-tuning significantly increased the scoring of biological solutions for both models without degrading general capabilities (Holm-Bonferroni-corrected p < 0.001 and p < 0.01, respectively). This suggests that even a small amount of fine-tuning can change how models weigh the relative value of biological and bioinspired vs. synthetic approaches. Although this work focused on small open-weight LLMs, it may be extensible to much larger models and could be used to develop models that favor bio-based approaches. We release the benchmark, corpus, code, and adapter weights.
Chinese Translation
在互联网规模语料库上训练的大型语言模型(LLMs)可能表现出系统性偏见,从而增加不当行为的概率。在本研究中,我们考察了在材料、能源、制造和算法四个领域中,针对合成技术解决方案与生物技术解决方案的潜在偏见。我们使用50个精心策划的生物对齐提示,结合基于凯利准则的评估框架,对5个前沿模型和5个开放权重模型进行了测量。根据这一指标,大多数模型并未实现生物对齐,因为它们表现出偏向合成(非生物)解决方案的偏见。接下来,我们考察了微调是否能够提高两个开放权重模型——Llama 3.2-3B-Instruct和Qwen2.5-3B-Instruct——对基于生物的方法的偏好。我们首先使用来自6,636篇PMC文章的约2200万个标记的精心策划语料库,强调生物问题解决,来微调Llama 3B,采用混合语料库进行持续训练和指令格式化。然后,这一方法扩展到仅使用指令格式的Qwen 3B。我们发现,QLoRA微调显著提高了两个模型对生物解决方案的评分,而没有降低其一般能力(Holm-Bonferroni校正后的p值分别为<0.001和<0.01)。这表明,即使少量的微调也可以改变模型对生物和生物启发方法与合成方法相对价值的权重。尽管这项工作集中于小型开放权重LLM,但它可能扩展到更大规模的模型,并可用于开发更倾向于生物基础方法的模型。我们发布了基准、语料库、代码和适配器权重。
cs.CL / 9 / 2603.09180
DuplexCascade: Full-Duplex Speech-to-Speech Dialogue with VAD-Free Cascaded ASR-LLM-TTS Pipeline and Micro-Turn Optimization
DuplexCascade:采用无VAD级联ASR-LLM-TTS管道与微轮次优化的全双工语音到语音对话
Abstract
Spoken dialog systems with cascaded ASR-LLM-TTS modules retain strong LLM intelligence, but VAD segmentation often forces half-duplex turns and brittle control. On the other hand, VAD-free end-to-end models support full-duplex interaction but struggle to maintain conversational intelligence. In this paper, we present DuplexCascade, a VAD-free cascaded streaming pipeline for full-duplex speech-to-speech dialogue. Our key idea is to convert conventional utterance-wise long turns into chunk-wise micro-turn interactions, enabling rapid bidirectional exchange while preserving the strengths of a capable text LLM. To reliably coordinate turn-taking and response timing, we introduce a set of conversational special control tokens that steer the LLM's behavior under streaming constraints. On Full-DuplexBench and VoiceBench, DuplexCascade delivers state-of-the-art full-duplex turn-taking and strong conversational intelligence among open-source speech-to-speech dialogue systems.
Chinese Translation
带有级联ASR-LLM-TTS模块的语音对话系统保留了强大的LLM智能,但VAD分段常常迫使系统采用半双工轮次和脆弱的控制。另一方面,无VAD的端到端模型支持全双工交互,但难以维持对话智能。在本文中,我们提出了DuplexCascade,一种无VAD的级联流式管道,用于全双工语音到语音对话。我们的关键思想是将传统的以整句为单位的长轮次转换为分块的微轮次交互,从而实现快速的双向交流,同时保留强大文本LLM的优势。为了可靠地协调轮次转换和响应时机,我们引入了一组对话专用的特殊控制标记,以在流式约束下引导LLM的行为。在Full-DuplexBench和VoiceBench上,DuplexCascade在开源语音到语音对话系统中实现了最先进的全双工轮次转换和强大的对话智能。
cs.CL / 10 / 2603.09185
DEO: Training-Free Direct Embedding Optimization for Negation-Aware Retrieval
DEO:一种无训练的直接嵌入优化方法用于否定感知检索
Abstract
Recent advances in Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) have enabled diverse retrieval methods. However, existing retrieval methods often fail to accurately retrieve results for negation and exclusion queries. To address this limitation, prior approaches rely on embedding adaptation or fine-tuning, which introduce additional computational cost and deployment complexity. We propose Direct Embedding Optimization (DEO), a training-free method for negation-aware text and multimodal retrieval. DEO decomposes queries into positive and negative components and optimizes the query embedding with a contrastive objective. Without additional training data or model updates, DEO outperforms baselines on NegConstraint, with gains of +0.0738 nDCG@10 and +0.1028 MAP@100, while improving Recall@5 by +6\% over OpenAI CLIP in multimodal retrieval. These results demonstrate the practicality of DEO for negation- and exclusion-aware retrieval in real-world settings.
Chinese Translation
近期,大型语言模型(LLMs)和检索增强生成(RAG)的进展使得多样化的检索方法成为可能。然而,现有的检索方法往往无法准确检索否定和排除查询的结果。为了解决这一局限性,以往的方法依赖于嵌入适应或微调,这会引入额外的计算成本和部署复杂性。我们提出了直接嵌入优化(DEO),这是一种无训练的方法,旨在进行否定感知的文本和多模态检索。DEO将查询分解为正向和负向组件,并通过对比目标优化查询嵌入。在没有额外训练数据或模型更新的情况下,DEO在NegConstraint上超越了基线,nDCG@10提高了+0.0738,MAP@100提高了+0.1028,同时在多模态检索中,Recall@5比OpenAI CLIP提高了+6%。这些结果证明了DEO在现实环境中进行否定和排除感知检索的实用性。
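The decompose-and-adjust idea behind DEO can be sketched in a few lines. The paper optimizes the query embedding with a contrastive objective; the closed-form subtraction below, the `alpha` weight, and all toy 3-d embeddings are simplifications for illustration only:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def deo_like_embedding(query_emb, negated_emb, alpha=0.8):
    """Training-free adjustment in the spirit of DEO: subtract the
    embedding of the negated component from the query embedding.
    (DEO optimizes a contrastive objective; this subtraction and the
    alpha value are illustrative simplifications.)"""
    return normalize([q - alpha * n for q, n in zip(query_emb, negated_emb)])

# Toy 3-d embeddings for the query "animals that are not cats".
q_naive = [1.0, 0.8, 0.0]  # naive embedding still carries "cat" semantics
negated = [0.0, 1.0, 0.0]  # embedding of the negated component "cats"
doc_dog = [1.0, 0.0, 0.1]
doc_cat = [0.6, 0.9, 0.0]

q_opt = deo_like_embedding(q_naive, negated)
print(cosine(q_naive, doc_cat) > cosine(q_naive, doc_dog))  # True: naive fails
print(cosine(q_opt, doc_dog) > cosine(q_opt, doc_cat))      # True: adjusted
```

The toy example reproduces the failure mode from the abstract: the naive embedding of a negation query still ranks the excluded document first, while the adjusted embedding does not.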
cs.CL / 11 / 2603.09205
Emotion is Not Just a Label: Latent Emotional Factors in LLM Processing
情感不仅仅是一个标签:大型语言模型处理中的潜在情感因素
Abstract
Large language models are routinely deployed on text that varies widely in emotional tone, yet their reasoning behavior is typically evaluated without accounting for emotion as a source of representational variation. Prior work has largely treated emotion as a prediction target, for example in sentiment analysis or emotion classification. In contrast, we study emotion as a latent factor that shapes how models attend to and reason over text. We analyze how emotional tone systematically alters attention geometry in transformer models, showing that metrics such as locality, center-of-mass distance, and entropy vary across emotions and correlate with downstream question-answering performance. To facilitate controlled study of these effects, we introduce Affect-Uniform ReAding QA (AURA-QA), a question-answering dataset with emotionally balanced, human-authored context passages. Finally, an emotional regularization framework is proposed that constrains emotion-conditioned representational drift during training. Experiments across multiple QA benchmarks demonstrate that this approach improves reading comprehension in both emotionally-varying and non-emotionally varying datasets, yielding consistent gains under distribution shift and in-domain improvements on several benchmarks.
Chinese Translation
大型语言模型通常在情感色彩差异很大的文本上进行部署,但它们的推理行为通常在评估时未考虑情感作为表征变异的来源。以往的研究主要将情感视为预测目标,例如在情感分析或情感分类中。相反,我们将情感视为一个潜在因素,研究其如何影响模型对文本的关注和推理。我们分析了情感色调如何系统性地改变变换器模型中的注意力几何,显示出局部性、质心距离和熵等指标在不同情感之间存在变化,并与下游问答性能相关联。为了便于对这些影响进行控制研究,我们引入了情感均匀阅读问答(Affect-Uniform ReAding QA, AURA-QA),这是一个具有情感平衡的人类撰写的上下文段落的问答数据集。最后,我们提出了一种情感正则化框架,在训练过程中约束情感条件下的表征漂移。在多个问答基准上的实验表明,这种方法在情感变化和非情感变化的数据集上均提高了阅读理解能力,在分布转移和多个基准的领域内改进中均取得了一致的提升。
cs.CL / 12 / 2603.09215
SPAR-K: Scheduled Periodic Alternating Early Exit for Spoken Language Models
SPAR-K:用于口语语言模型的定时周期性交替提前退出
Abstract
Interleaved spoken language models (SLMs) alternately generate text and speech tokens, but decoding at full transformer depth for every step becomes costly, especially due to long speech sequences. We propose SPAR-K, a modality-aware early exit framework designed to accelerate interleaved SLM inference while preserving perceptual quality. SPAR-K introduces a speech alternating-depth schedule: most speech positions exit at a fixed intermediate layer, while periodic full-depth "refresh" steps mitigate distribution shift due to early exit. We evaluate our framework using Step-Audio-2-mini and GLM-4-Voice across four datasets spanning reasoning, factual QA, and dialogue tasks, measuring performance in terms of ASR transcription accuracy and perceptual quality. Experimental results demonstrate that SPAR-K largely preserves question-answering accuracy with a maximum accuracy drop of 0.82\% while reducing average speech decoding depth by up to 11\% on Step-Audio-2-mini and 5\% on GLM-4-Voice, both with negligible changes in MOS and WER and no auxiliary computation overhead. We further demonstrate that confidence-based early exit strategies, widely used in text LLMs, are suboptimal for SLMs, highlighting that the unique statistical nature of speech tokens necessitates a specialized early exit design.
Chinese Translation
交错式口语语言模型(SLMs)交替生成文本和语音标记,但每一步都以完整的Transformer深度解码成本高昂,尤其是在语音序列很长的情况下。我们提出了SPAR-K,一种模态感知的提前退出框架,旨在加速交错SLM推理,同时保持感知质量。SPAR-K引入了一种语音交替深度调度:大多数语音位置在固定的中间层退出,而周期性的全深度“刷新”步骤则缓解了提前退出造成的分布偏移。我们使用Step-Audio-2-mini和GLM-4-Voice在四个数据集上评估我们的框架,这些数据集涵盖推理、事实问答和对话任务,并以ASR转录准确率和感知质量衡量性能。实验结果表明,SPAR-K在最大准确率下降仅0.82%的情况下基本保持了问答准确性,同时在Step-Audio-2-mini上将平均语音解码深度最多减少11%,在GLM-4-Voice上减少5%,且MOS和WER几乎没有变化,也没有额外的计算开销。我们进一步证明,广泛用于文本LLM的基于置信度的提前退出策略对于SLMs并非最优,这突显了语音标记的独特统计特性需要专门的提前退出设计。
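The alternating-depth schedule is simple to state as a per-position rule. The depth values and refresh period below are illustrative, not the paper's actual configuration:

```python
def decode_depth(position, modality, full_depth=32, exit_layer=20,
                 refresh_every=8):
    """Per-position decoding depth under a SPAR-K-style schedule.
    full_depth / exit_layer / refresh_every are illustrative values."""
    if modality == "text":
        return full_depth  # text tokens always use the full stack
    if position % refresh_every == 0:
        return full_depth  # periodic full-depth "refresh" mitigates drift
    return exit_layer      # most speech positions exit early

# One text token followed by 16 speech tokens.
seq = [("text", 0)] + [("speech", p) for p in range(1, 17)]
depths = [decode_depth(p, m) for m, p in seq]
avg_depth = sum(depths) / len(depths)
print(depths)
print(f"average depth {avg_depth:.1f} vs full depth 32")
```

Because the schedule is fixed rather than confidence-based, it adds no auxiliary computation at decode time, matching the abstract's claim of zero overhead.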
cs.CL / 13 / 2603.09222
LooComp: Leverage Leave-One-Out Strategy to Encoder-only Transformer for Efficient Query-aware Context Compression
LooComp:利用留一法策略对仅编码器的Transformer进行高效的查询感知上下文压缩
Abstract
Efficient context compression is crucial for improving the accuracy and scalability of question answering. For the efficiency of Retrieval Augmented Generation, context should be delivered fast, compact, and precise to ensure clue sufficiency and budget-friendly LLM reader cost. We propose a margin-based framework for query-driven context pruning, which identifies sentences that are critical for answering a query by measuring changes in clue richness when they are omitted. The model is trained with a composite ranking loss that enforces large margins for critical sentences while keeping non-critical ones near neutral. Built on a lightweight encoder-only Transformer, our approach generally achieves strong exact-match and F1 scores with high-throughput inference and lower memory requirements than those of major baselines. In addition to efficiency, our method yields effective compression ratios without degrading answering performance, demonstrating its potential as a lightweight and practical alternative for retrieval-augmented tasks.
Chinese Translation
高效的上下文压缩对于提高问答的准确性和可扩展性至关重要。为了提升检索增强生成的效率,上下文应快速、紧凑且精确地传递,以确保线索的充分性和经济的LLM阅读成本。我们提出了一种基于边际的查询驱动上下文剪枝框架,通过测量省略句子时线索丰富度的变化,识别对回答查询至关重要的句子。该模型采用复合排名损失进行训练,强制对关键句子施加较大的边际,同时保持非关键句子接近中性。基于轻量级的仅编码器Transformer,我们的方法在高通量推理和较低内存需求下,通常能够实现强大的精确匹配和F1得分,优于主要基线。除了效率外,我们的方法在不降低回答性能的情况下,产生有效的压缩比,展示了其作为检索增强任务轻量且实用的替代方案的潜力。
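The leave-one-out margin idea can be sketched with a toy clue-richness function. LooComp scores sentences with a learned encoder-only Transformer and a composite ranking loss; the string-overlap proxy below is purely illustrative:

```python
def clue_richness(sentences, query_terms):
    """Toy clue-richness proxy: fraction of query terms covered by the
    context. (LooComp uses a learned encoder; this is only a stand-in.)"""
    text = " ".join(sentences).lower()
    return sum(term in text for term in query_terms) / len(query_terms)

def leave_one_out_margins(sentences, query_terms):
    """Margin of each sentence: drop in clue richness when it is omitted."""
    full = clue_richness(sentences, query_terms)
    return [full - clue_richness(sentences[:i] + sentences[i + 1:],
                                 query_terms)
            for i in range(len(sentences))]

query = ["capital", "france", "paris"]
context = [
    "Paris is the capital of France.",    # critical clue
    "The weather was pleasant today.",    # filler
    "France borders Spain and Germany.",  # redundant for this query
]
margins = leave_one_out_margins(context, query)
pruned = [s for s, m in zip(context, margins) if m > 0]
print(margins)
print(pruned)  # only the critical sentence survives pruning
```

Note how the redundant sentence gets a zero margin even though it mentions a query term: omitting it loses no clues, which is exactly the signal the margin loss exploits.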
cs.CL / 14 / 2603.09341
TaSR-RAG: Taxonomy-guided Structured Reasoning for Retrieval-Augmented Generation
TaSR-RAG:基于分类法引导的结构化推理用于检索增强生成
Abstract
Retrieval-Augmented Generation (RAG) helps large language models (LLMs) answer knowledge-intensive and time-sensitive questions by conditioning generation on external evidence. However, most RAG systems still retrieve unstructured chunks and rely on one-shot generation, which often yields redundant context, low information density, and brittle multi-hop reasoning. While structured RAG pipelines can improve grounding, they typically require costly and error-prone graph construction or impose rigid entity-centric structures that do not align with the query's reasoning chain. We propose \textsc{TaSR-RAG}, a taxonomy-guided structured reasoning framework for evidence selection. We represent both queries and documents as relational triples, and constrain entity semantics with a lightweight two-level taxonomy to balance generalization and precision. Given a complex question, \textsc{TaSR-RAG} decomposes it into an ordered sequence of triple sub-queries with explicit latent variables, then performs step-wise evidence selection via hybrid triple matching that combines semantic similarity over raw triples with structural consistency over typed triples. By maintaining an explicit entity binding table across steps, \textsc{TaSR-RAG} resolves intermediate variables and reduces entity conflation without explicit graph construction or exhaustive search. Experiments on multiple multi-hop question answering benchmarks show that \textsc{TaSR-RAG} consistently outperforms strong RAG and structured-RAG baselines by up to 14\%, while producing clearer evidence attribution and more faithful reasoning traces.
Chinese Translation
检索增强生成(RAG)通过基于外部证据的生成条件,帮助大型语言模型(LLMs)回答知识密集型和时间敏感的问题。然而,大多数RAG系统仍然检索非结构化的片段,并依赖一次性生成,这通常会导致冗余的上下文、低信息密度和脆弱的多跳推理。虽然结构化RAG管道可以改善基础支持,但它们通常需要昂贵且容易出错的图构建,或者施加与查询推理链不一致的严格实体中心结构。我们提出了TaSR-RAG,一种基于分类法引导的结构化推理框架,用于证据选择。我们将查询和文档表示为关系三元组,并通过轻量级的两级分类法约束实体语义,以平衡泛化和精确性。给定一个复杂的问题,TaSR-RAG将其分解为一系列有序的三元组子查询,带有显式的潜变量,然后通过混合三元组匹配进行逐步证据选择,该匹配结合了原始三元组的语义相似性和类型三元组的结构一致性。通过在各步骤之间维护一个显式的实体绑定表,TaSR-RAG解决了中间变量并减少了实体混淆,而无需显式的图构建或穷举搜索。在多个多跳问答基准上的实验表明,TaSR-RAG始终比强大的RAG和结构化RAG基线高出多达14%,同时产生更清晰的证据归属和更真实的推理痕迹。
cs.CL / 15 / 2603.09373
Quantifying and extending the coverage of spatial categorization data sets
量化与扩展空间分类数据集的覆盖范围
Abstract
Variation in spatial categorization across languages is often studied by eliciting human labels for the relations depicted in a set of scenes known as the Topological Relations Picture Series (TRPS). We demonstrate that labels generated by large language models (LLMs) align relatively well with human labels, and show how LLM-generated labels can help to decide which scenes and languages to add to existing spatial data sets. To illustrate our approach we extend the TRPS by adding 42 new scenes, and show that this extension achieves better coverage of the space of possible scenes than two previous extensions of the TRPS. Our results provide a foundation for scaling towards spatial data sets with dozens of languages and hundreds of scenes.
Chinese Translation
不同语言之间的空间分类变异通常通过引导人类为一组场景中所描绘的关系提供标签来研究,这组场景被称为拓扑关系图片系列(Topological Relations Picture Series, TRPS)。我们展示了大型语言模型(Large Language Models, LLMs)生成的标签与人类标签之间的相对一致性,并展示了LLM生成的标签如何帮助决定向现有空间数据集中添加哪些场景和语言。为了说明我们的方法,我们通过添加42个新场景来扩展TRPS,并展示这一扩展在可能场景空间的覆盖范围上优于TRPS的两个先前扩展。我们的结果为向包含数十种语言和数百个场景的空间数据集扩展奠定了基础。
cs.CL / 16 / 2603.09400
Reward Prediction with Factorized World States
基于分解世界状态的奖励预测
Abstract
Agents must infer action outcomes and select actions that maximize a reward signal indicating how close the goal is to being reached. Supervised learning of reward models could introduce biases inherent to training data, limiting generalization to novel goals and environments. In this paper, we investigate whether well-defined world state representations alone can enable accurate reward prediction across domains. To address this, we introduce StateFactory, a factorized representation method that transforms unstructured observations into a hierarchical object-attribute structure using language models. This structured representation allows rewards to be estimated naturally as the semantic similarity between the current state and the goal state under hierarchical constraint. Overall, the compact representation structure induced by StateFactory enables strong reward generalization capabilities. We evaluate on RewardPrediction, a new benchmark dataset spanning five diverse domains and comprising 2,454 unique action-observation trajectories with step-wise ground-truth rewards. Our method shows promising zero-shot results against both VLWM-critic and LLM-as-a-Judge reward models, achieving 60% and 8% lower EPIC distance, respectively. Furthermore, this superior reward quality successfully translates into improved agent planning performance, yielding success rate gains of +21.64% on AlfWorld and +12.40% on ScienceWorld over reactive system-1 policies and enhancing system-2 agent planning. Project Page: https://statefactory.github.io
Chinese Translation
智能体必须推断行动结果,并选择能够最大化奖励信号的行动,该信号指示目标达成的接近程度。奖励模型的监督学习可能引入训练数据固有的偏差,从而限制对新目标和环境的泛化能力。本文研究了仅凭定义良好的世界状态表示是否能够在不同领域中实现准确的奖励预测。为此,我们提出了StateFactory,这是一种分解表示方法,利用语言模型将非结构化观察转化为层次化的对象-属性结构。这种结构化表示使得奖励可以在层次约束下自然地估计为当前状态与目标状态之间的语义相似度。总体而言,StateFactory所产生的紧凑表示结构带来了强大的奖励泛化能力。我们在RewardPrediction上进行了评估,这是一个新的基准数据集,涵盖五个不同领域,包含2454条独特的行动-观察轨迹及逐步的真实奖励。我们的方法在零样本条件下对比VLWM-critic和LLM-as-a-Judge奖励模型,分别实现了60%和8%的EPIC距离降低。此外,这种更优的奖励质量成功转化为更好的智能体规划性能:相较于反应式系统1策略,在AlfWorld上成功率提高了21.64%,在ScienceWorld上提高了12.40%,并增强了系统2智能体的规划能力。项目页面:https://statefactory.github.io
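The "reward as similarity between factorized states" idea can be sketched with toy object-attribute dictionaries. The matching rule and the example states below are illustrative; StateFactory's actual similarity is computed under hierarchical constraints by a language-model pipeline:

```python
def state_similarity(state, goal):
    """Toy hierarchical reward in the spirit of StateFactory: states are
    {object: {attribute: value}} dicts extracted by a language model, and
    reward is the fraction of goal object-attribute pairs already matched."""
    total = matched = 0
    for obj, attrs in goal.items():
        for attr, value in attrs.items():
            total += 1
            matched += state.get(obj, {}).get(attr) == value
    return matched / total if total else 1.0

goal  = {"mug": {"location": "sink", "clean": True}}
start = {"mug": {"location": "table", "clean": False}}
mid   = {"mug": {"location": "sink", "clean": False}}
done  = {"mug": {"location": "sink", "clean": True}}
print([state_similarity(s, goal) for s in (start, mid, done)])  # [0.0, 0.5, 1.0]
```

Because the reward is read off the structured state rather than learned from labeled trajectories, it transfers to novel goals by construction, which is the generalization property the abstract emphasizes.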
cs.CL / 17 / 2603.09403
LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation
大型语言模型作为元评判者:用于自然语言处理评估指标验证的合成数据
Abstract
Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which predominantly exist only for English datasets. We propose \textit{LLM as a Meta-Judge}, a scalable framework that utilizes LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using \textit{meta-correlation}, measuring the alignment between metric rankings derived from synthetic data and those from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA and proves to be a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data will become publicly available upon paper acceptance.
Chinese Translation
自然语言生成(NLG)评估指标的验证通常依赖于昂贵且耗时的人类注释,而这些注释主要仅存在于英语数据集上。我们提出了“大型语言模型作为元评判者”(LLM as a Meta-Judge),这是一个可扩展的框架,利用大型语言模型(LLMs)通过对真实数据进行受控的语义降解来生成合成评估数据集,从而替代人类判断。我们使用“元相关性”(meta-correlation)来验证我们的方法,即测量从合成数据中得出的指标排名与标准人类基准之间的对齐程度。在机器翻译、问答和摘要生成等任务中的实验表明,合成验证可作为人类判断的可靠代理,在多语言问答中实现了超过0.9的元相关性,并证明在缺乏人类判断或获取成本过高的情况下是一种可行的替代方案。我们的代码和数据将在论文接受后公开。
cs.CL / 18 / 2603.09416
Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health
通过健康社会决定因素研究大型语言模型中的性别刻板印象
Abstract
Large Language Models (LLMs) excel in Natural Language Processing (NLP) tasks, but they often propagate biases embedded in their training data, which is potentially impactful in sensitive domains like healthcare. While existing benchmarks evaluate biases related to individual social determinants of health (SDoH) such as gender or ethnicity, they often overlook interactions between these factors and lack context-specific assessments. This study investigates bias in LLMs by probing the relationships between gender and other SDoH in French patient records. Through a series of experiments, we found that embedded stereotypes can be probed using SDoH input and that LLMs rely on embedded stereotypes to make gendered decisions, suggesting that evaluating interactions among SDoH factors could usefully complement existing approaches to assessing LLM performance and bias.
Chinese Translation
大型语言模型(LLMs)在自然语言处理(NLP)任务中表现出色,但它们往往传播训练数据中嵌入的偏见,这在医疗等敏感领域可能产生深远影响。虽然现有基准评估与个体健康社会决定因素(SDoH)相关的偏见,如性别或种族,但它们常常忽视这些因素之间的相互作用,并缺乏特定上下文的评估。本研究通过探讨法语患者记录中性别与其他SDoH之间的关系,调查了LLMs中的偏见。通过一系列实验,我们发现可以利用SDoH输入探测嵌入的刻板印象,并且LLMs在做出性别决策时依赖于这些嵌入的刻板印象,这表明评估SDoH因素之间的相互作用可以有效补充现有的LLM性能和偏见评估方法。
cs.CL / 19 / 2603.09434
Common Sense vs. Morality: The Curious Case of Narrative Focus Bias in LLMs
常识与道德:大型语言模型叙事聚焦偏差的奇特案例
Abstract
Large Language Models (LLMs) are increasingly deployed across diverse real-world applications and user communities. As such, it is crucial that these models remain both morally grounded and knowledge-aware. In this work, we uncover a critical limitation of current LLMs -- their tendency to prioritize moral reasoning over commonsense understanding. To investigate this phenomenon, we introduce CoMoral, a novel benchmark dataset containing commonsense contradictions embedded within moral dilemmas. Through extensive evaluation of ten LLMs across different model sizes, we find that existing models consistently struggle to identify such contradictions without prior signal. Furthermore, we observe a pervasive narrative focus bias, wherein LLMs more readily detect commonsense contradictions when they are attributed to a secondary character rather than the primary (narrator) character. Our comprehensive analysis underscores the need for enhanced reasoning-aware training to improve the commonsense robustness of large language models.
Chinese Translation
大型语言模型(LLMs)正越来越多地应用于各种现实世界的应用和用户社区。因此,这些模型在道德基础和知识意识方面保持一致至关重要。在本研究中,我们揭示了当前LLMs的一个关键局限性——它们倾向于优先考虑道德推理而非常识理解。为了研究这一现象,我们引入了CoMoral,一个包含道德困境中嵌入的常识矛盾的新基准数据集。通过对十种不同模型规模的LLMs进行广泛评估,我们发现现有模型在没有先前信号的情况下,始终难以识别这些矛盾。此外,我们观察到一种普遍的叙事聚焦偏差,即当常识矛盾归因于次要角色而非主要(叙述者)角色时,LLMs更容易检测到这些矛盾。我们的综合分析强调了增强推理意识训练的必要性,以提高大型语言模型的常识鲁棒性。
cs.CL / 20 / 2603.09503
Modelling the Diachronic Emergence of Phoneme Frequency Distributions
音素频率分布的历时性出现建模
Abstract
Phoneme frequency distributions exhibit robust statistical regularities across languages, including exponential-tailed rank-frequency patterns and a negative relationship between phonemic inventory size and the relative entropy of the distribution. The origin of these patterns remains largely unexplained. In this paper, we investigate whether they can arise as consequences of the historical processes that shape phonological systems. We introduce a stochastic model of phonological change and simulate the diachronic evolution of phoneme inventories. A naïve version of the model reproduces the general shape of phoneme rank-frequency distributions but fails to capture other empirical properties. Extending the model with two additional assumptions -- an effect related to functional load and a stabilising tendency toward a preferred inventory size -- yields simulations that match both the observed distributions and the negative relationship between inventory size and relative entropy. These results suggest that some statistical regularities of phonological systems may arise as natural consequences of diachronic sound change rather than from explicit optimisation or compensatory mechanisms.
Chinese Translation
音素频率分布在不同语言中表现出强健的统计规律性,包括指数尾部的等级-频率模式以及音素库存大小与分布相对熵之间的负相关关系。这些模式的起源仍然在很大程度上未得到解释。本文探讨这些模式是否可以作为塑造音系的历史过程的结果而产生。我们引入了一个音韵变化的随机模型,并模拟音素库存的历时演变。模型的简单版本再现了音素等级-频率分布的一般形状,但未能捕捉其他经验特性。通过增加两个额外假设——与功能负荷相关的效应和对偏好库存大小的稳定倾向——扩展模型,得出的模拟结果既匹配观察到的分布,也符合库存大小与相对熵之间的负相关关系。这些结果表明,音韵系统的一些统计规律可能是历时性音变的自然结果,而不是来自于明确的优化或补偿机制。
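The abstract does not specify the model's mechanics, but the flavor of such a diachronic simulation can be sketched in miniature: frequency mass drifts between phonemes, with donors sampled proportionally to their current frequency. This is a loose illustration only; the paper's functional-load effect and inventory-size stabilisation are omitted, and all parameter names below are placeholders.

```python
import random

def simulate_drift(n_phonemes=20, total=1000, steps=5000, seed=0):
    """Toy stand-in for stochastic phonological change: each step moves one
    unit of frequency mass from a phoneme sampled proportionally to its
    frequency to a uniformly sampled phoneme. Returns the resulting
    rank-frequency profile (counts sorted in descending order)."""
    rng = random.Random(seed)
    counts = [total // n_phonemes] * n_phonemes  # start from a flat inventory
    for _ in range(steps):
        src = rng.choices(range(n_phonemes), weights=counts)[0]
        dst = rng.randrange(n_phonemes)
        if counts[src] > 0 and src != dst:
            counts[src] -= 1
            counts[dst] += 1
    return sorted(counts, reverse=True)
```

Because donors are frequency-weighted while recipients are uniform, repeated drift skews an initially flat inventory, which is the kind of emergent asymmetry the paper's fuller model investigates.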
cs.CL / 21 / 2603.09517
You Didn't Have to Say It like That: Subliminal Learning from Faithful Paraphrases
你不必这样说:从忠实的释义中进行潜意识学习
Abstract
When language models are trained on synthetic data, the student model can covertly acquire behavioral traits from the data-generating teacher model. Subliminal learning refers to the transmission of traits from a teacher to a student model via training on data unrelated to those traits. Prior work demonstrated this in the training domains of number sequences, code, and math Chain-of-Thought traces, including transmission of misaligned behaviors. We investigate whether transmission occurs through natural language paraphrases with fixed semantic content, and whether content explicitly contradicting the teacher's preference can block it. We find that training on paraphrases from a teacher system-prompted to love a particular animal increases a student's preference for that animal by up to 19 percentage points. This occurs when paraphrased content is semantically unrelated to the animal, or even when it explicitly expresses dislike. The transmission succeeds despite aggressive filtering to ensure paraphrase fidelity. This raises concerns for pipelines where models generate their own training data: content-based inspection cannot detect such transmission, and even preference-contradicting content fails to prevent it.
Chinese Translation
当语言模型在合成数据上进行训练时,学生模型可以隐秘地从数据生成模型(教师模型)中获取行为特征。潜意识学习是指通过训练与这些特征无关的数据,将特征从教师模型传递给学生模型。先前的研究在数字序列、代码和数学思维链轨迹的训练领域中展示了这一现象,包括未对齐行为的传递。我们研究了通过具有固定语义内容的自然语言释义是否会发生特征传递,以及明确与教师偏好相悖的内容是否能阻止这种传递。我们发现,使用经系统提示喜爱特定动物的教师模型所生成的释义进行训练,可使学生对该动物的偏好最多增加19个百分点。这种情况发生在释义内容与该动物在语义上无关,甚至在内容明确表达不喜欢时。尽管进行了严格的过滤以确保释义的忠实性,传递仍然成功。这引发了对模型生成自身训练数据的管道的担忧:基于内容的检查无法检测到这种传递,即使是与偏好相悖的内容也无法阻止它。
cs.CL / 22 / 2603.09556
ALARM: Audio-Language Alignment for Reasoning Models
ALARM:用于推理模型的音频-语言对齐
Abstract
Large audio language models (ALMs) extend LLMs with auditory understanding. A common approach freezes the LLM and trains only an adapter on self-generated targets. However, this fails for reasoning LLMs (RLMs) whose built-in chain-of-thought traces expose the textual surrogate input, yielding unnatural responses. We propose self-rephrasing, converting self-generated responses into audio-understanding variants compatible with RLMs while preserving distributional alignment. We further fuse and compress multiple audio encoders for stronger representations. For training, we construct a 6M-instance multi-task corpus (2.5M unique prompts) spanning 19K hours of speech, music, and sound. Our 4B-parameter ALM outperforms similarly sized models and surpasses most larger ALMs on related audio-reasoning benchmarks, while preserving textual capabilities with a low training cost. Notably, we achieve the best open-source result on the MMAU-speech and MMSU benchmarks and rank third among all the models.
Chinese Translation
大型音频语言模型(ALMs)在听觉理解上扩展了大语言模型(LLMs)。一种常见的方法是冻结LLM,仅在自生成目标上训练适配器。然而,这对于推理大语言模型(RLMs)来说并不奏效,因为其内置的思维链条暴露了文本替代输入,导致不自然的响应。我们提出了自我改写,将自生成的响应转换为与RLM兼容的音频理解变体,同时保持分布对齐。我们进一步融合和压缩多个音频编码器,以获得更强的表示能力。在训练方面,我们构建了一个包含600万实例的多任务语料库(250万个独特提示),涵盖了19,000小时的语音、音乐和声音。我们的40亿参数ALM在性能上超越了同类规模的模型,并在相关的音频推理基准测试中超过了大多数更大型的ALM,同时以较低的训练成本保留了文本能力。值得注意的是,我们在MMAU-speech和MMSU基准测试中取得了最佳的开源结果,并在所有模型中排名第三。
cs.CL / 23 / 2603.09595
Build, Borrow, or Just Fine-Tune? A Political Scientist's Guide to Choosing NLP Models
构建、借用还是仅仅微调?政治科学家选择自然语言处理模型的指南
Abstract
Political scientists increasingly face a consequential choice when adopting natural language processing tools: build a domain-specific model from scratch, borrow and adapt an existing one, or simply fine-tune a general-purpose model on task data? Each approach occupies a different point on the spectrum of performance, cost, and required expertise, yet the discipline has offered little empirical guidance on how to navigate this trade-off. This paper provides such guidance. Using conflict event classification as a test case, I fine-tune ModernBERT on the Global Terrorism Database (GTD) to create Confli-mBERT and systematically compare it against ConfliBERT, a domain-specific pretrained model that represents the current gold standard. Confli-mBERT achieves 75.46% accuracy compared to ConfliBERT's 79.34%. Critically, the four-percentage-point gap is not uniform: on high-frequency attack types such as Bombing/Explosion (F1 = 0.95 vs. 0.96) and Kidnapping (F1 = 0.92 vs. 0.91), the models are nearly indistinguishable. Performance differences concentrate in rare event categories comprising fewer than 2% of all incidents. I use these findings to develop a practical decision framework for political scientists considering any NLP-assisted research task: when does the research question demand a specialized model, and when does an accessible fine-tuned alternative suffice? The answer, I argue, depends not on which model is "better" in the abstract, but on the specific intersection of class prevalence, error tolerance, and available resources. The model, training code, and data are publicly available on Hugging Face.
Chinese Translation
政治科学家在采用自然语言处理工具时,越来越面临一个重要的选择:从头构建一个特定领域的模型、借用并调整现有模型,还是仅仅在任务数据上微调一个通用模型?每种方法在性能、成本和所需专业知识方面占据不同的位置,但该学科在如何应对这一权衡方面提供的实证指导却很少。本文提供了这样的指导。以冲突事件分类为测试案例,我在全球恐怖主义数据库(Global Terrorism Database, GTD)上微调了ModernBERT,创建了Confli-mBERT,并系统地将其与当前黄金标准的特定领域预训练模型ConfliBERT进行比较。Confli-mBERT的准确率为75.46%,而ConfliBERT为79.34%。值得注意的是,这四个百分点的差距并不均匀:在高频攻击类型如爆炸/轰炸(F1 = 0.95对0.96)和绑架(F1 = 0.92对0.91)上,这两个模型几乎无法区分。性能差异主要集中在占所有事件不到2%的稀有事件类别上。我利用这些发现为考虑任何自然语言处理辅助研究任务的政治科学家开发了一个实用的决策框架:何时研究问题需要一个专业模型,何时一个可访问的微调替代方案就足够?我认为,答案并不在于哪个模型在抽象上“更好”,而在于类别普遍性、错误容忍度和可用资源的具体交集。模型、训练代码和数据已在Hugging Face上公开可用。
cs.CL / 24 / 2603.09616
Surgical Repair of Collapsed Attention Heads in ALiBi Transformers
ALiBi Transformer中崩溃注意力头的外科修复
Abstract
We identify a systematic attention collapse pathology in the BLOOM family of transformer language models, where ALiBi positional encoding causes 31-44% of attention heads to attend almost entirely to the beginning-of-sequence token. The collapse follows a predictable pattern across four model scales (560M to 7.1B parameters), concentrating in head indices where ALiBi's slope schedule imposes the steepest distance penalties. We introduce surgical reinitialization: targeted Q/K/V reinitialization with zeroed output projections and gradient-masked freezing of all non-surgical parameters. Applied to BLOOM-1b7 on a single consumer GPU, the technique recovers 98.7% operational head capacity (from 242 to 379 of 384 heads) in two passes. A controlled comparison with C4 training data confirms that reinitialization -- not corpus content -- drives recovery, and reveals two distinct post-surgical phenomena: early global functional redistribution that improves the model, and late local degradation that accumulates under noisy training signal. An extended experiment reinitializing mostly-healthy heads alongside collapsed ones produces a model that transiently outperforms stock BLOOM-1b7 by 25% on training perplexity (12.70 vs. 16.99), suggesting that pretrained attention configurations are suboptimal local minima. Code, checkpoints, and diagnostic tools are released as open-source software.
Chinese Translation
我们在BLOOM系列变换器语言模型中识别出一种系统性的注意力崩溃病理,其中ALiBi位置编码导致31-44%的注意力头几乎完全关注于序列开始标记。崩溃在四个模型规模(560M到7.1B参数)中遵循可预测的模式,集中在ALiBi的斜率调度施加最陡距离惩罚的头索引上。我们提出了外科再初始化:针对性地对Q/K/V进行再初始化,输出投影归零,并对所有非外科参数进行梯度掩蔽冻结。在单个消费级GPU上应用于BLOOM-1b7,该技术在两次传递中恢复了98.7%的操作头容量(从384个头中的242恢复到379)。与C4训练数据的对照比较确认了再初始化——而非语料内容——驱动了恢复,并揭示了两种不同的外科后现象:早期的全球功能重新分配改善了模型,晚期的局部退化在嘈杂的训练信号下累积。一个扩展实验同时对大部分健康的头和崩溃的头进行再初始化,产生了一个在训练困惑度上短暂超越标准BLOOM-1b7 25%的模型(12.70对比16.99),这表明预训练的注意力配置是次优的局部极小值。代码、检查点和诊断工具作为开源软件发布。
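For context, ALiBi's per-head slope schedule and a simple BOS-mass collapse diagnostic can be sketched as follows. The 0.9 threshold and the data layout are illustrative choices, not the paper's exact criteria.

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """ALiBi slope schedule for a power-of-two head count: head i biases
    attention logits by -slope * key_distance with slope = 2**(-8*(i+1)/n_heads),
    so the earliest head indices carry the steepest distance penalties."""
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (i + 1) for i in range(n_heads)]

def collapsed_heads(attn_by_head: dict, threshold: float = 0.9) -> list:
    """Flag heads whose attention mass concentrates on the first (BOS) key.
    `attn_by_head`: {head_index: [row, ...]}, each row being one query's
    attention distribution over keys. Threshold is an illustrative cutoff."""
    flagged = []
    for head, rows in sorted(attn_by_head.items()):
        bos_mass = sum(row[0] for row in rows) / len(rows)
        if bos_mass > threshold:
            flagged.append(head)
    return flagged
```

A diagnostic like this only identifies candidate heads; the paper's repair step (Q/K/V reinitialization with zeroed output projections) operates on the flagged indices.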
cs.CL / 25 / 2603.09638
Tracking Cancer Through Text: Longitudinal Extraction From Radiology Reports Using Open-Source Large Language Models
通过文本追踪癌症:使用开源大型语言模型从放射学报告中进行纵向信息提取
Abstract
Radiology reports capture crucial longitudinal information on tumor burden, treatment response, and disease progression, yet their unstructured narrative format complicates automated analysis. While large language models (LLMs) have advanced clinical text processing, most state-of-the-art systems remain proprietary, limiting their applicability in privacy-sensitive healthcare environments. We present a fully open-source, locally deployable pipeline for longitudinal information extraction from radiology reports, implemented using the llm_extractinator framework. The system applies the qwen2.5-72b model to extract and link target, non-target, and new lesion data across time points in accordance with RECIST criteria. Evaluation on 50 Dutch CT Thorax/Abdomen report pairs yielded high extraction performance, with attribute-level accuracies of 93.7% for target lesions, 94.9% for non-target lesions, and 94.0% for new lesions. The approach demonstrates that open-source LLMs can achieve clinically meaningful performance in multi-timepoint oncology tasks while ensuring data privacy and reproducibility. These results highlight the potential of locally deployable LLMs for scalable extraction of structured longitudinal data from routine clinical text.
Chinese Translation
放射学报告捕捉了肿瘤负担、治疗反应和疾病进展等重要的纵向信息,但其非结构化的叙述格式使得自动化分析变得复杂。尽管大型语言模型(LLMs)在临床文本处理方面取得了进展,但大多数最先进的系统仍为专有,限制了它们在隐私敏感的医疗环境中的适用性。我们提出了一种完全开源、可本地部署的纵向信息提取管道,基于llm_extractinator框架实现。该系统应用qwen2.5-72b模型,按照RECIST标准提取并链接不同时间点的目标、非目标和新病灶数据。对50对荷兰语CT胸腹部报告的评估显示出高效的提取性能,目标病灶的属性级准确率为93.7%,非目标病灶为94.9%,新病灶为94.0%。该方法表明,开源LLMs能够在多时间点肿瘤学任务中实现临床意义的性能,同时确保数据隐私和可重复性。这些结果突显了可本地部署的LLMs在从常规临床文本中可扩展提取结构化纵向数据的潜力。
cs.CL / 26 / 2603.09654
Understanding the Interplay between LLMs' Utilisation of Parametric and Contextual Knowledge: A keynote at ECIR 2025
理解大型语言模型(LLMs)对参数知识与上下文知识的利用之间的相互作用:2025年ECIR大会主题演讲
Abstract
Language Models (LMs) acquire parametric knowledge from their training process, embedding it within their weights. The increasing scalability of LMs, however, poses significant challenges for understanding a model's inner workings and further for updating or correcting this embedded knowledge without the significant cost of retraining. Moreover, when using these language models for knowledge-intensive language understanding tasks, LMs have to integrate relevant context, mitigating their inherent weaknesses, such as incomplete or outdated knowledge. Nevertheless, studies indicate that LMs often ignore the provided context as it can be in conflict with the pre-existing LM's memory learned during pre-training. Conflicting knowledge can also already be present in the LM's parameters, termed intra-memory conflict. This underscores the importance of understanding the interplay between how a language model uses its parametric knowledge and the retrieved contextual knowledge. In this talk, I will aim to shed light on this important issue by presenting our research on evaluating the knowledge present in LMs, diagnostic tests that can reveal knowledge conflicts, as well as on understanding the characteristics of successfully used contextual knowledge.
Chinese Translation
语言模型(LMs)通过其训练过程获取参数知识,并将其嵌入到权重中。然而,语言模型的可扩展性日益增强,这给理解模型的内部工作原理带来了重大挑战,并且在不进行大量重新训练的情况下,更新或修正这些嵌入知识也面临困难。此外,在使用这些语言模型进行知识密集型语言理解任务时,语言模型必须整合相关上下文,以减轻其固有的弱点,例如知识的不完整或过时。然而,研究表明,语言模型往往忽视提供的上下文,因为这可能与在预训练期间学习到的现有语言模型记忆相冲突。冲突的知识也可能已经存在于语言模型的参数中,这被称为内部记忆冲突。这突显了理解语言模型如何使用其参数知识与检索的上下文知识之间相互作用的重要性。在本次演讲中,我将通过展示我们在评估语言模型中存在的知识、揭示知识冲突的诊断测试以及理解成功使用的上下文知识特征方面的研究,来阐明这一重要问题。
cs.CL / 27 / 2603.09685
Automatic Cardiac Risk Management Classification using large-context Electronic Patients Health Records
基于大上下文电子病历的自动心脏风险管理分类
Abstract
To overcome the limitations of manual administrative coding in geriatric Cardiovascular Risk Management, this study introduces an automated classification framework leveraging unstructured Electronic Health Records (EHRs). Using a dataset of 3,482 patients, we benchmarked three distinct modeling paradigms on longitudinal Dutch clinical narratives: classical machine learning baselines, specialized deep learning architectures optimized for large-context sequences, and general-purpose generative Large Language Models (LLMs) in a zero-shot setting. Additionally, we evaluated a late fusion strategy to integrate unstructured text with structured medication embeddings and anthropometric data. Our analysis reveals that the custom Transformer architecture outperforms both traditional methods and generative LLMs, achieving the highest F1-scores and Matthews Correlation Coefficients. These findings underscore the critical role of specialized hierarchical attention mechanisms in capturing long-range dependencies within medical texts, presenting a robust, automated alternative to manual workflows for clinical risk stratification.
Chinese Translation
为克服老年心血管风险管理中手动行政编码的局限性,本研究引入了一种利用非结构化电子健康记录(EHR)的自动分类框架。使用3482名患者的数据集,我们在荷兰语纵向临床叙述上对三种不同的建模范式进行了基准测试:经典机器学习基线、针对大上下文序列优化的专用深度学习架构,以及在零样本设置下的通用生成式大型语言模型(LLMs)。此外,我们评估了一种后期融合策略,以将非结构化文本与结构化药物嵌入和人体测量数据相结合。我们的分析表明,自定义Transformer架构在F1分数和马修斯相关系数方面均优于传统方法和生成LLMs。这些发现强调了专门的层次注意机制在捕捉医学文本中的长程依赖性方面的重要作用,提供了一种强大的自动化替代方案,以取代临床风险分层的手动工作流程。
cs.CL / 28 / 2603.09688
Fusing Semantic, Lexical, and Domain Perspectives for Recipe Similarity Estimation
融合语义、词汇和领域视角的食谱相似性估计
Abstract
This research focuses on developing advanced methods for assessing similarity between recipes by combining different sources of information and analytical approaches. We explore the semantic, lexical, and domain similarity of food recipes, evaluated through the analysis of ingredients, preparation methods, and nutritional attributes. A web-based interface was developed to allow domain experts to validate the combined similarity results. After evaluating 318 recipe pairs, experts agreed on 255 (80%). The evaluation of expert assessments enables the estimation of which similarity aspects--lexical, semantic, or nutritional--are most influential in expert decision-making. The application of these methods has broad implications in the food industry and supports the development of personalized diets, nutrition recommendations, and automated recipe generation systems.
Chinese Translation
本研究旨在通过结合不同的信息来源和分析方法,开发评估食谱相似性的先进方法。我们探讨了食谱的语义、词汇和领域相似性,通过对成分、准备方法和营养属性的分析进行评估。开发了一个基于网络的界面,以便领域专家验证综合相似性结果。在评估了318对食谱后,专家一致同意255对(80%)。专家评估的结果使得我们能够估计在专家决策中,哪些相似性方面——词汇、语义或营养——对决策影响最大。这些方法的应用在食品行业具有广泛的意义,并支持个性化饮食、营养建议和自动化食谱生成系统的发展。
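The abstract does not state how the lexical, semantic, and nutritional perspectives are combined; a minimal weighted fusion might look like the sketch below, where the record schema, equal weights, and Jaccard/cosine choices are all illustrative assumptions rather than the paper's method.

```python
import math

def jaccard(a: set, b: set) -> float:
    """Lexical overlap between two ingredient sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def cosine(u: list, v: list) -> float:
    """Similarity between two nutrient vectors (e.g. protein, fat, ...)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fused_similarity(r1: dict, r2: dict, w_lex=0.5, w_nutr=0.5) -> float:
    """Weighted blend of lexical and nutritional similarity.
    Records use an illustrative schema: {'ingredients': set, 'nutrients': list}."""
    lex = jaccard(r1['ingredients'], r2['ingredients'])
    nutr = cosine(r1['nutrients'], r2['nutrients'])
    return w_lex * lex + w_nutr * nutr
```

In a real system the weights could be fit against the expert judgments the paper collects, which is exactly what estimating the influence of each similarity aspect amounts to.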
cs.CL / 29 / 2603.09691
ESAinsTOD: A Unified End-to-End Schema-Aware Instruction-Tuning Framework for Task-Oriented Dialog Modeling
ESAinsTOD:一种用于任务导向对话建模的统一端到端模式感知指令调优框架
Abstract
Existing end-to-end modeling methods for modular task-oriented dialog systems are typically tailored to specific datasets, making it challenging to adapt to new dialog scenarios. In this work, we propose ESAinsTOD, a unified End-to-end Schema-Aware Instruction-tuning framework for general Task-Oriented Dialog modeling. This framework introduces a structured methodology to go beyond simply fine-tuning Large Language Models (LLMs), enabling flexible adaptation to various dialogue task flows and schemas. Specifically, we leverage full-parameter fine-tuning of LLMs and introduce two alignment mechanisms to make the resulting system both instruction-aware and schema-aware: (i) instruction alignment, which ensures that the system faithfully follows task instructions to complete various task flows from heterogeneous TOD datasets; and (ii) schema alignment, which encourages the system to make predictions adhering to the specified schema. In addition, we employ session-level end-to-end modeling, which allows the system to access the results of previously executed task flows within the dialogue history, to bridge the gap between the instruction-tuning paradigm and the real-world application of TOD systems. Empirical results show that while a fine-tuned LLM serves as a strong baseline, our structured approach provides significant additional benefits. In particular, our findings indicate that: (i) ESAinsTOD outperforms state-of-the-art models by a significant margin on end-to-end task-oriented dialog modeling benchmarks: CamRest676, In-Car and MultiWOZ; (ii) more importantly, it exhibits superior generalization capabilities across various low-resource settings, with the proposed alignment mechanisms significantly enhancing zero-shot performance; and (iii) our instruction-tuning paradigm substantially improves the model's robustness against data noise and cascading errors.
Chinese Translation
现有的端到端建模方法针对模块化任务导向对话系统通常是为特定数据集量身定制的,这使得适应新的对话场景变得具有挑战性。在本研究中,我们提出了ESAinsTOD,一种用于通用任务导向对话建模的统一端到端模式感知指令调优框架。该框架引入了一种结构化的方法,超越了简单的对大型语言模型(LLMs)进行微调,能够灵活适应各种对话任务流程和模式。具体而言,我们利用LLMs的全参数微调,并引入两种对齐机制,使得生成的系统既具备指令感知能力又具备模式感知能力:(i)指令对齐,确保系统忠实地遵循任务指令以完成来自异构任务导向对话(TOD)数据集的各种任务流程;(ii)模式对齐,鼓励系统在预测时遵循指定的模式。此外,我们采用会话级端到端建模,使得系统能够访问对话历史中先前执行的任务流程的结果,从而弥合指令调优范式与TOD系统实际应用之间的差距。实证结果表明,尽管微调后的LLM作为强基线表现良好,但我们的结构化方法提供了显著的额外优势。特别是,我们的研究结果表明:(i)ESAinsTOD在端到端任务导向对话建模基准测试CamRest676、In-Car和MultiWOZ上显著超越了最先进的模型;(ii)更重要的是,它在各种低资源设置下展现出卓越的泛化能力,所提出的对齐机制显著增强了零样本性能;(iii)我们的指令调优范式显著提高了模型对数据噪声和级联错误的鲁棒性。
cs.CL / 30 / 2603.09704
Evaluation of LLMs in retrieving food and nutritional context for RAG systems
大型语言模型在为RAG系统检索食品与营养上下文方面的评估
Abstract
In this article, we evaluate four Large Language Models (LLMs) and their effectiveness at retrieving data within a specialized Retrieval-Augmented Generation (RAG) system, using a comprehensive food composition database. Our method is focused on the LLMs ability to translate natural language queries into structured metadata filters, enabling efficient retrieval via a Chroma vector database. By achieving high accuracy in this critical retrieval step, we demonstrate that LLMs can serve as an accessible, high-performance tool, drastically reducing the manual effort and technical expertise previously required for domain experts, such as food compilers and nutritionists, to leverage complex food and nutrition data. However, despite the high performance on easy and moderately complex queries, our analysis of difficult questions reveals that reliable retrieval remains challenging when queries involve non-expressible constraints. These findings demonstrate that LLM-driven metadata filtering excels when constraints can be explicitly expressed, but struggles when queries exceed the representational scope of the metadata format.
Chinese Translation
在本文中,我们评估了四种大型语言模型(LLMs)在一个专门的检索增强生成(RAG)系统中检索数据的有效性,使用了一个全面的食品成分数据库。我们的方法侧重于LLMs将自然语言查询转换为结构化元数据过滤器的能力,从而通过Chroma向量数据库实现高效检索。通过在这一关键检索步骤中实现高准确性,我们展示了LLMs可以作为一种可访问的高性能工具,显著减少了以往领域专家(如食品编纂者和营养师)利用复杂食品和营养数据所需的手动工作和技术专业知识。然而,尽管在简单和中等复杂度的查询中表现良好,我们对困难问题的分析表明,当查询涉及不可表达的约束时,可靠的检索仍然具有挑战性。这些发现表明,当约束可以明确表达时,基于LLM的元数据过滤表现优异,但在查询超出元数据格式的表示范围时则面临困难。
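The retrieval step the paper evaluates hinges on turning a natural-language query into a structured metadata filter. The sketch below applies a Chroma-style filter dict (a small subset of operators such as `$eq`/`$gt`) to plain records; the food records and the example filter are hypothetical stand-ins for what an LLM might emit, not the paper's database or prompts.

```python
def matches(record: dict, where: dict) -> bool:
    """Apply a Chroma-style metadata filter to one record.
    `where` maps field -> literal value, or field -> {'$op': reference}.
    Only a small illustrative subset of operators is supported here."""
    ops = {'$eq': lambda a, b: a == b,
           '$gt': lambda a, b: a > b,
           '$lt': lambda a, b: a < b}
    for field, cond in where.items():
        value = record.get(field)
        if isinstance(cond, dict):
            if not all(ops[op](value, ref) for op, ref in cond.items()):
                return False
        elif value != cond:
            return False
    return True

foods = [
    {'name': 'cheddar', 'category': 'dairy', 'protein_g': 25.0},
    {'name': 'apple', 'category': 'fruit', 'protein_g': 0.3},
]
# The kind of filter an LLM might produce for "high-protein dairy foods":
where = {'category': 'dairy', 'protein_g': {'$gt': 10.0}}
hits = [f['name'] for f in foods if matches(f, where)]
```

The paper's "non-expressible constraints" are precisely the queries that cannot be compiled into a filter of this shape, which is where the approach breaks down.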
cs.CL / 31 / 2603.09723
RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation
RbtAct:以反驳作为监督生成可操作的评审反馈
Abstract
Large language models (LLMs) are increasingly used across the scientific workflow, including to draft peer-review reports. However, many AI-generated reviews are superficial and insufficiently actionable, leaving authors without concrete, implementable guidance and motivating the gap this work addresses. We propose RbtAct, which targets actionable review feedback generation and places existing peer review rebuttal at the center of learning. Rebuttals show which reviewer comments led to concrete revisions or specific plans, and which were only defended. Building on this insight, we leverage rebuttal as implicit supervision to directly optimize a feedback generator for actionability. To support this objective, we propose a new task called perspective-conditioned segment-level review feedback generation, in which the model is required to produce a single focused comment based on the complete paper and a specified perspective such as experiments and writing. We also build a large dataset named RMR-75K that maps review segments to the rebuttal segments that address them, with perspective labels and impact categories that order author uptake. We then train the Llama-3.1-8B-Instruct model with supervised fine-tuning on review segments followed by preference optimization using rebuttal derived pairs. Experiments with human experts and LLM-as-a-judge show consistent gains in actionability and specificity over strong baselines while maintaining grounding and relevance.
Chinese Translation
大型语言模型(LLMs)在科学工作流程中越来越多地被使用,包括起草同行评审报告。然而,许多AI生成的评审意见表面化且缺乏可操作性,使得作者无法获得具体的、可实施的指导,这也是本研究所要解决的问题。我们提出了RbtAct,旨在生成可操作的评审反馈,并将现有的同行评审反驳置于学习的中心。反驳展示了哪些评审意见导致了具体的修订或特定的计划,以及哪些只是进行了辩护。基于这一洞察,我们利用反驳作为隐式监督,直接优化反馈生成器以增强其可操作性。为支持这一目标,我们提出了一项新任务,称为视角条件下的分段级评审反馈生成,其中模型需要根据完整论文和指定的视角(如实验和写作)生成单一的集中评论。我们还构建了一个名为RMR-75K的大型数据集,该数据集将评审段落映射到相应的反驳段落,并附有视角标签和影响类别,以便对作者的采纳进行排序。随后,我们使用监督微调对Llama-3.1-8B-Instruct模型进行训练,首先在评审段落上进行训练,然后使用基于反驳派生对的偏好优化。与人类专家和LLM作为评审者的实验表明,在保持基础性和相关性的同时,在可操作性和具体性方面相较于强基线有一致的提升。
cs.CL / 32 / 2603.09758
Beyond Fine-Tuning: Robust Food Entity Linking under Ontology Drift with FoodOntoRAG
超越微调:使用FoodOntoRAG在本体漂移下实现鲁棒食品实体链接
Abstract
Standardizing food terms from product labels and menus into ontology concepts is a prerequisite for trustworthy dietary assessment and safety reporting. The dominant approach to Named Entity Linking (NEL) in the food and nutrition domains fine-tunes Large Language Models (LLMs) on task-specific corpora. Although effective, fine-tuning incurs substantial computational cost, ties models to a particular ontology snapshot (i.e., version), and degrades under ontology drift. This paper presents FoodOntoRAG, a model- and ontology-agnostic pipeline that performs few-shot NEL by retrieving candidate entities from domain ontologies and conditioning an LLM on structured evidence (food labels, synonyms, definitions, and relations). A hybrid lexical--semantic retriever enumerates candidates; a selector agent chooses a best match with rationale; a separate scorer agent calibrates confidence; and, when confidence falls below a threshold, a synonym generator agent proposes reformulations to re-enter the loop. The pipeline approaches state-of-the-art accuracy while revealing gaps and inconsistencies in existing annotations. The design avoids fine-tuning, improves robustness to ontology evolution, and yields interpretable decisions through grounded justifications.
Chinese Translation
将产品标签和菜单中的食品术语标准化为本体概念是可靠饮食评估和安全报告的前提。在食品和营养领域,命名实体链接(NEL)的主流方法是对特定任务语料库进行大型语言模型(LLMs)的微调。尽管有效,微调会产生巨大的计算成本,使模型依赖于特定的本体快照(即版本),并在本体漂移时性能下降。本文提出了FoodOntoRAG,一个与模型和本体无关的管道,通过从领域本体中检索候选实体并基于结构化证据(食品标签、同义词、定义和关系)对LLM进行条件化,执行少量样本的NEL。一个混合的词汇-语义检索器列举候选项;一个选择代理选择最佳匹配并提供理由;一个单独的评分代理校准置信度;当置信度低于阈值时,一个同义词生成代理提出重新表述以重新进入循环。该管道在接近最先进的准确性的同时,揭示了现有注释中的差距和不一致性。该设计避免了微调,提高了对本体演变的鲁棒性,并通过有根据的理由产生可解释的决策。
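The hybrid lexical-semantic retriever can be approximated in miniature: token overlap supplies the lexical signal, while character-trigram cosine stands in for the embedding similarity a real system would use. Function names, the 0.5 weight, and the trigram trick are all illustrative assumptions.

```python
import math
from collections import Counter

def trigrams(text: str) -> Counter:
    """Character trigram counts, padded so word boundaries contribute."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(c1: Counter, c2: Counter) -> float:
    dot = sum(c1[k] * c2[k] for k in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def hybrid_rank(mention: str, candidates: list, w_lex: float = 0.5) -> list:
    """Rank ontology candidate labels for a food mention by blending token
    Jaccard (lexical) with trigram cosine (cheap semantic stand-in)."""
    m_tokens = set(mention.lower().split())
    m_tri = trigrams(mention)
    def score(cand):
        c_tokens = set(cand.lower().split())
        union = m_tokens | c_tokens
        lex = len(m_tokens & c_tokens) / len(union) if union else 0.0
        return w_lex * lex + (1 - w_lex) * cosine(m_tri, trigrams(cand))
    return sorted(candidates, key=score, reverse=True)
```

In the full pipeline this enumeration step merely proposes candidates; the selector and scorer agents then decide and calibrate, and the synonym generator reformulates low-confidence mentions.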
cs.CL / 33 / 2603.09785
EPIC-EuroParl-UdS: Information-Theoretic Perspectives on Translation and Interpreting
EPIC-EuroParl-UdS:翻译与口译的信息论视角
Abstract
This paper introduces an updated and combined version of the bidirectional English-German EPIC-UdS (spoken) and EuroParl-UdS (written) corpora containing original European Parliament speeches as well as their translations and interpretations. The new version corrects metadata and text errors identified through previous use, refines the content, updates linguistic annotations, and adds new layers, including word alignment and word-level surprisal indices. The combined resource is designed to support research using information-theoretic approaches to language variation, particularly studies comparing written and spoken modes, and examining disfluencies in speech, as well as traditional translationese studies, including parallel (source vs. target) and comparable (original vs. translated) analyses. The paper outlines the updates introduced in this release, summarises previous results based on the corpus, and presents a new illustrative study. The study validates the integrity of the rebuilt spoken data and evaluates probabilistic measures derived from base and fine-tuned GPT-2 and machine translation models on the task of filler-particle prediction in interpreting.
Chinese Translation
本文介绍了一个更新的、结合的双向英语-德语EPIC-UdS(口语)和EuroParl-UdS(书面)语料库版本,该语料库包含原始的欧洲议会演讲及其翻译和口译。新版本纠正了通过先前使用识别出的元数据和文本错误,精炼了内容,更新了语言注释,并增加了新的层次,包括词对齐和词级惊异度(surprisal)指数。该综合资源旨在支持使用信息论方法研究语言变异,特别是比较书面和口语模式的研究,以及检查口语中的不流畅现象,以及传统的翻译体研究,包括平行(源语言与目标语言)和可比(原文与翻译文)分析。本文概述了此次发布中引入的更新,总结了基于该语料库的先前结果,并呈现了一项新的示范研究。该研究验证了重建口语数据的完整性,并评估了基于基础和微调的GPT-2及机器翻译模型在口译中填充词预测任务上的概率度量。
cs.CL / 34 / 2603.09821
One-Eval: An Agentic System for Automated and Traceable LLM Evaluation
One-Eval:用于自动化与可追溯大型语言模型评估的智能体系统
Abstract
Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics \& Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One-Eval.
Chinese Translation
可靠的评估对于大型语言模型的开发和部署至关重要,但在实践中,它通常需要大量的手动工作:从业者必须识别合适的基准,重现异构的评估代码库,配置数据集模式映射,并解释汇总指标。为了解决这些挑战,我们提出了One-Eval,一个智能评估系统,它将自然语言评估请求转换为可执行、可追溯和可定制的评估工作流程。One-Eval集成了(i)NL2Bench用于意图结构化和个性化基准规划,(ii)BenchResolve用于基准解析、自动数据集获取和模式标准化,以确保可执行性,以及(iii)Metrics & Reporting用于任务感知的指标选择和超越标量分数的决策导向报告。该系统进一步结合了人机协作的检查点,以便于审查、编辑和回滚,同时保留样本证据轨迹以便于调试和审计。实验表明,One-Eval能够以最小的用户努力从多样的自然语言请求中执行端到端评估,从而支持工业环境中更高效和可重复的评估。我们的框架已在 https://github.com/OpenDCAI/One-Eval 上公开发布。
cs.CL / 35 / 2603.09835
Chow-Liu Ordering for Long-Context Reasoning in Chain-of-Agents
链式智能体中长上下文推理的Chow-Liu排序
Abstract
Sequential multi-agent reasoning frameworks such as Chain-of-Agents (CoA) handle long-context queries by decomposing inputs into chunks and processing them sequentially using LLM-based worker agents that read from and update a bounded shared memory. From a probabilistic perspective, CoA aims to approximate the conditional distribution corresponding to a model capable of jointly reasoning over the entire long context. CoA achieves this through a latent-state factorization in which only bounded summaries of previously processed evidence are passed between agents. The resulting bounded-memory approximation introduces a lossy information bottleneck, making the final evidence state inherently dependent on the order in which chunks are processed. In this work, we study the problem of chunk ordering for long-context reasoning. We use the well-known Chow-Liu trees to learn a dependency structure that prioritizes strongly related chunks. Empirically, we show that a breadth-first traversal of the resulting tree yields chunk orderings that reduce information loss across agents and consistently outperform both default document-chunk ordering and semantic score-based ordering in answer relevance and exact-match accuracy across three long-context benchmarks.
Chinese Translation
顺序多智能体推理框架,如链式智能体(Chain-of-Agents,CoA),通过将输入分解为块并使用基于大语言模型(LLM)的工作智能体顺序处理这些块,从而处理长上下文查询,这些智能体从有限的共享内存中读取并更新信息。从概率的角度来看,CoA旨在近似能够对整个长上下文进行联合推理的模型所对应的条件分布。CoA通过潜在状态分解实现这一目标,其中仅将先前处理的证据的有限摘要在智能体之间传递。由此产生的有限内存近似引入了一个有损的信息瓶颈,使得最终的证据状态本质上依赖于处理块的顺序。在本研究中,我们研究了长上下文推理中的块排序问题。我们使用著名的Chow-Liu树来学习一种依赖结构,以优先考虑强相关的块。通过实证研究,我们表明,对所得到的树进行广度优先遍历可以产生减少智能体之间信息损失的块排序,并且在三个长上下文基准测试中,在答案相关性和精确匹配准确性方面始终优于默认的文档块排序和基于语义得分的排序。
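A minimal sketch of the ordering step, assuming the pairwise mutual-information estimates between chunks are already available (how those estimates are obtained is not specified in the abstract): build a maximum spanning tree over the MI matrix, then emit a breadth-first traversal as the chunk order.

```python
from collections import deque

def chow_liu_order(mi, root=0):
    """Given a symmetric pairwise mutual-information matrix `mi`, build a
    maximum spanning tree (Prim's algorithm with max-weight edge selection)
    and return a breadth-first chunk ordering starting from `root`."""
    n = len(mi)
    in_tree = {root}
    adj = {i: [] for i in range(n)}
    while len(in_tree) < n:
        # Pick the heaviest edge crossing the cut between tree and non-tree nodes.
        u, v = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: mi[e[0]][e[1]])
        adj[u].append(v)
        adj[v].append(u)
        in_tree.add(v)
    # Breadth-first traversal of the resulting tree gives the chunk order.
    order, seen, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order
```

Strongly dependent chunks end up adjacent in the tree, so the BFS order tends to process related evidence close together, limiting what the bounded shared memory must carry across distant steps.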
cs.CL / 36 / 2603.09872
N-gram-like Language Models Predict Reading Time Best
N-gram类语言模型最能预测阅读时间
Abstract
Recent work has found that contemporary language models such as transformers can become so good at next-word prediction that the probabilities they calculate become worse for predicting reading time. In this paper, we propose that this can be explained by reading time being sensitive to simple n-gram statistics rather than the more complex statistics learned by state-of-the-art transformer language models. We demonstrate that the neural language models whose predictions are most correlated with n-gram probability are also those that calculate probabilities that are the most correlated with eye-tracking-based metrics of reading time on naturalistic text.
Chinese Translation
近期的研究发现,当代语言模型如变换器(transformers)在下一个单词预测方面表现得如此出色,以至于它们计算的概率在预测阅读时间时反而变得不准确。本文提出,这可以通过阅读时间对简单的n-gram统计数据敏感,而非对最先进的变换器语言模型所学习的更复杂统计数据来解释。我们证明了,与n-gram概率最相关的神经语言模型,其计算的概率也与基于眼动追踪的自然文本阅读时间指标最相关。
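As a concrete miniature of the n-gram statistics in question, per-word surprisal under an add-alpha-smoothed bigram model can be computed as below; the toy corpus and smoothing constant are illustrative, not the paper's experimental setup.

```python
import math
from collections import Counter

def bigram_surprisal(corpus: str, sentence: str, alpha: float = 1.0):
    """Per-word surprisal -log2 P(w | prev) from add-alpha-smoothed bigram
    counts estimated on `corpus`. Returns (word, surprisal) pairs for every
    word in `sentence` after the first."""
    tokens = corpus.lower().split()
    vocab = set(tokens) | set(sentence.lower().split())
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    words = sentence.lower().split()
    out = []
    for prev, word in zip(words, words[1:]):
        p = (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(vocab))
        out.append((word, -math.log2(p)))
    return out
```

The paper's claim is that simple quantities like this correlate better with eye-tracking reading times than the sharper probabilities of state-of-the-art transformers.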
cs.CL / 37 / 2603.09881
Do What I Say: A Spoken Prompt Dataset for Instruction-Following
照我说的做:一个用于指令跟随的语音提示数据集
Abstract
Speech Large Language Models (SLLMs) have rapidly expanded, supporting a wide range of tasks. These models are typically evaluated using text prompts, which may not reflect real-world scenarios where users interact through speech. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair, across five styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly for low-resource and cross-lingual settings. Only for tasks with speech output do spoken prompts close the gap, highlighting the need for speech-based prompting in SLLM evaluation.
Chinese Translation
语音大型语言模型(SLLMs)迅速发展,支持广泛的任务。这些模型通常使用文本提示进行评估,但这可能无法反映用户与语音互动的真实场景。为了解决这一问题,我们推出了 DoWhatISay (DOWIS),这是一个多语言的数据集,包含人类录制的语音和书面提示,旨在与现有基准相结合,以在语音指令条件下对 SLLMs 进行真实评估。该数据集涵盖 9 个任务和 11 种语言,为每个任务-语言对提供 10 种提示变体,涵盖五种风格。通过使用 DOWIS,我们对最先进的 SLLMs 进行了基准测试,分析了提示模态、风格、语言和任务类型之间的相互作用。结果显示,文本提示在性能上始终优于语音提示,尤其是在低资源和跨语言环境中。只有在具有语音输出的任务中,语音提示才缩小了差距,突显了在 SLLM 评估中需要基于语音的提示。
cs.CL / 38 / 2603.09884
Benchmarking Political Persuasion Risks Across Frontier Large Language Models
前沿大型语言模型的政治劝说风险基准测试
Abstract
Concerns persist regarding the capacity of Large Language Models (LLMs) to sway political views. Although prior research has claimed that LLMs are not more persuasive than standard political campaign practices, the recent rise of frontier models warrants further study. In two survey experiments (N=19,145) across bipartisan issues and stances, we evaluate seven state-of-the-art LLMs developed by Anthropic, OpenAI, Google, and xAI. We find that LLMs outperform standard campaign advertisements, with heterogeneity in performance across models. Specifically, Claude models exhibit the highest persuasiveness, while Grok exhibits the lowest. The results are robust across issues and stances. Moreover, in contrast to the findings in Hackenburg et al. (2025b) and Lin et al. (2025) that information-based prompts boost persuasiveness, we find that the effectiveness of information-based prompts is model-dependent: they increase the persuasiveness of Claude and Grok while substantially reducing that of GPT. We introduce a data-driven and strategy-agnostic LLM-assisted conversation analysis approach to identify and assess underlying persuasive strategies. Our work benchmarks the persuasive risks of frontier models and provides a framework for cross-model comparative risk assessment.
Chinese Translation
关于大型语言模型(LLMs)影响政治观点能力的担忧依然存在。尽管先前的研究声称LLMs的劝说能力不比标准政治竞选实践更强,但前沿模型的近期崛起值得进一步研究。在针对两项跨党派问题和立场的调查实验中(N=19,145),我们评估了由Anthropic、OpenAI、Google和xAI开发的七款最先进的LLMs。我们的研究发现,LLMs在劝说能力上优于标准竞选广告,但不同模型的表现存在异质性。具体而言,Claude模型表现出最高的劝说力,而Grok模型则表现最低。这些结果在不同问题和立场中均保持稳健。此外,与Hackenburg等(2025b)和Lin等(2025)的研究发现,即基于信息的提示可以增强劝说力不同,我们发现基于信息的提示的有效性依赖于模型:它们提高了Claude和Grok的劝说力,但显著降低了GPT的劝说力。我们引入了一种数据驱动且策略无关的LLM辅助对话分析方法,以识别和评估潜在的劝说策略。我们的研究为前沿模型的劝说风险提供了基准,并为跨模型比较风险评估提供了框架。
cs.CL / 39 / 2603.09906
Thinking to Recall: How Reasoning Unlocks Parametric Knowledge in LLMs
思考以回忆:推理如何解锁大型语言模型中的参数知识
Abstract
While reasoning in LLMs plays a natural role in math, code generation, and multi-hop factual questions, its effect on simple, single-hop factual questions remains unclear. Such questions do not require step-by-step logical decomposition, making the utility of reasoning highly counterintuitive. Nevertheless, we find that enabling reasoning substantially expands the capability boundary of the model's parametric knowledge recall, unlocking correct answers that are otherwise effectively unreachable. Why does reasoning aid parametric knowledge recall when there are no complex reasoning steps to be done? To answer this, we design a series of hypothesis-driven controlled experiments, and identify two key driving mechanisms: (1) a computational buffer effect, where the model uses the generated reasoning tokens to perform latent computation independent of their semantic content; and (2) factual priming, where generating topically related facts acts as a semantic bridge that facilitates correct answer retrieval. Importantly, this latter generative self-retrieval mechanism carries inherent risks: we demonstrate that hallucinating intermediate facts during reasoning increases the likelihood of hallucinations in the final answer. Finally, we show that our insights can be harnessed to directly improve model accuracy by prioritizing reasoning trajectories that contain hallucination-free factual statements.
Chinese Translation
尽管推理在大型语言模型(LLMs)中在数学、代码生成和多跳事实问题中扮演着自然的角色,但其对简单的单跳事实问题的影响仍然不明确。这类问题不需要逐步的逻辑分解,使得推理的效用显得高度反直觉。然而,我们发现启用推理显著扩展了模型参数知识回忆的能力边界,解锁了那些在其他情况下几乎无法到达的正确答案。为什么推理在没有复杂推理步骤的情况下仍然有助于参数知识的回忆?为了解答这个问题,我们设计了一系列假设驱动的控制实验,并确定了两个关键驱动机制:(1)计算缓冲效应,即模型利用生成的推理标记进行潜在计算,而不依赖其语义内容;(2)事实引导,即生成与主题相关的事实作为语义桥梁,促进正确答案的检索。重要的是,这后一种生成自我检索机制具有固有风险:我们证明,在推理过程中幻觉中间事实会增加最终答案中幻觉的可能性。最后,我们展示了我们的见解可以被利用来直接提高模型的准确性,通过优先考虑包含无幻觉事实陈述的推理路径。
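The abstract's closing idea, prioritizing reasoning trajectories whose intermediate facts are hallucination-free, amounts to a filtered majority vote over sampled traces. The sketch below assumes an external fact checker that labels each trajectory (`facts_ok` is a stand-in for that check, not part of the paper's pipeline), and falls back to all traces if none pass.

```python
from collections import Counter

def select_answer(trajectories):
    """Pick a final answer from sampled reasoning trajectories, preferring
    those whose intermediate factual statements passed a (hypothetical)
    hallucination check, then taking a majority vote over answers."""
    clean = [t for t in trajectories if t["facts_ok"]]
    pool = clean if clean else trajectories  # fall back if nothing is clean
    votes = Counter(t["answer"] for t in pool)
    return votes.most_common(1)[0][0]

# Toy sampled trajectories for a single-hop factual question.
samples = [
    {"answer": "Canberra", "facts_ok": True},
    {"answer": "Sydney",   "facts_ok": False},
    {"answer": "Sydney",   "facts_ok": False},
    {"answer": "Canberra", "facts_ok": True},
]
print(select_answer(samples))
```

Without the filter the vote above is a 2-2 tie; restricting to hallucination-free traces breaks it in favor of the factually grounded answer, mirroring the paper's observation that hallucinated intermediate facts raise the risk of a hallucinated final answer.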
cs.CL / 40 / 2603.09938
Model Merging in the Era of Large Language Models: Methods, Applications, and Future Directions
大语言模型时代的模型合并:方法、应用与未来方向
Abstract
Model merging has emerged as a transformative paradigm for combining the capabilities of multiple neural networks into a single unified model without additional training. With the rapid proliferation of fine-tuned large language models (LLMs), merging techniques offer a computationally efficient alternative to ensembles and full retraining, enabling practitioners to compose specialized capabilities at minimal cost. This survey presents a comprehensive and structured examination of model merging in the LLM era through the FUSE taxonomy, a four-dimensional framework organized along Foundations, Unification Strategies, Scenarios, and Ecosystem. We first establish the theoretical underpinnings of merging, including loss landscape geometry, mode connectivity, and the linear mode connectivity hypothesis. We then systematically review the algorithmic landscape, spanning weight averaging, task vector arithmetic, sparsification-enhanced methods, mixture-of-experts architectures, and evolutionary optimization approaches. For each method family, we analyze the core formulation, highlight representative works, and discuss practical trade-offs. We further examine downstream applications across multi-task learning, safety alignment, domain specialization, multilingual transfer, and federated learning. Finally, we survey the supporting ecosystem of open-source tools, community platforms, and evaluation benchmarks, and identify key open challenges including theoretical gaps, scalability barriers, and standardization needs. This survey aims to equip researchers and practitioners with a structured foundation for advancing model merging.
Chinese Translation
模型合并已成为一种变革性范式,用于将多个神经网络的能力合并为一个统一的模型,而无需额外的训练。随着微调的大语言模型(LLMs)的快速普及,合并技术为集成和全面重训练提供了一种计算上高效的替代方案,使从业者能够以最小的成本组合专业化能力。本调查通过FUSE分类法对LLM时代的模型合并进行了全面且结构化的审查,该分类法是一个沿着Foundations(基础)、Unification Strategies(统一策略)、Scenarios(场景)和Ecosystem(生态系统)组织的四维框架。我们首先建立了合并的理论基础,包括损失景观几何、模式连接性和线性模式连接假设。然后,我们系统地回顾了算法领域,涵盖了权重平均、任务向量算术、稀疏化增强方法、专家混合架构和进化优化方法。对于每个方法家族,我们分析了核心公式,突出代表性工作,并讨论了实际权衡。我们进一步考察了在多任务学习、安全对齐、领域专业化、多语言迁移和联邦学习等方面的下游应用。最后,我们调查了支持生态系统,包括开源工具、社区平台和评估基准,并识别出关键的开放挑战,包括理论空白、可扩展性障碍和标准化需求。本调查旨在为研究人员和从业者提供一个结构化的基础,以推动模型合并的发展。
cs.CL / 41 / 2603.09970
CREATE: Testing LLMs for Associative Creativity
CREATE:测试大型语言模型的联想创造力
Abstract
A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models' capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in a model's parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models are scored more highly if they produce a larger set of strong, diverse paths. This task shares the demands of real creativity tasks like hypothesis generation, including an extremely large search space, but enables collection of a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, with the high multiplicity of answers and complexity of the search making benchmark saturation difficult to achieve. Furthermore, our results illustrate that thinking models are not always more effective on our task, even with high token budgets. Recent approaches for creative prompting give some but limited additional improvement. CREATE provides a sandbox for developing new methods to improve models' capacity for associative creativity.
Chinese Translation
创造力的一个关键组成部分是联想推理:在概念之间建立新颖而有意义的联系的能力。我们介绍了CREATE,这是一个旨在评估模型创造性联想推理能力的基准。CREATE要求模型生成连接模型参数知识中概念的一组路径。这些路径应具有高特异性(概念连接的独特性和紧密性)和高多样性(与其他路径的差异性),如果模型生成更多强大且多样的路径,则得分更高。该任务与真实创造性任务(如假设生成)具有相似的要求,包括极大的搜索空间,但能够收集一个具有客观答案评分的可观基准。对前沿模型的评估表明,最强的模型在创造性效用上优于其他模型,答案的高多样性和搜索的复杂性使得基准饱和变得困难。此外,我们的结果表明,思维模型在我们的任务上并不总是更有效,即使在高令牌预算下。最近的创造性提示方法提供了一些但有限的额外改进。CREATE为开发新方法以提高模型的联想创造力提供了一个沙盒。
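A scoring rule that rewards both specificity and diversity of a path set, as the CREATE abstract describes, can be illustrated with a simple diminishing-returns formula. Everything below is a toy construction: the discount scheme, the `spec` function, and the use of Jaccard overlap as path similarity are illustrative assumptions, not CREATE's published metric.

```python
def path_set_utility(paths, specificity, similarity):
    """Score a set of concept paths: sum of per-path specificity, with each
    path discounted by its maximum similarity to previously counted paths,
    so near-duplicates add little and diverse paths add a lot."""
    total = 0.0
    kept = []
    for p in sorted(paths, key=specificity, reverse=True):
        penalty = max((similarity(p, q) for q in kept), default=0.0)
        total += specificity(p) * (1.0 - penalty)
        kept.append(p)
    return total

# Toy associative paths between "volcano" and downstream concepts.
paths = [("volcano", "ash", "fertile soil"),
         ("volcano", "ash", "pottery glaze"),
         ("volcano", "tourism", "economy")]

def spec(p):
    # Hypothetical specificity: shorter paths count as tighter connections.
    return 1.0 / len(p)

def jaccard(p, q):
    a, b = set(p), set(q)
    return len(a & b) / len(a | b)

print(round(path_set_utility(paths, spec, jaccard), 3))
```

The structure matches the abstract's incentive: a larger set of strong paths raises the score, but only to the extent that the paths are dissimilar from one another.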