cs.RO / 1 / 2602.20200
Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation
全局先验与局部一致性相结合:用于高效机器人操作的双记忆增强视觉-语言-动作模型
Abstract
Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. Such models typically comprise a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, their performance is increasingly bottlenecked by the action generation process. (i) Low inference efficiency: a pronounced distributional gap between isotropic noise priors and target action distributions increases the number of denoising steps and the incidence of infeasible samples. (ii) Poor robustness: existing policies condition solely on the current observation, neglecting the constraints of the history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the number of function evaluations (NFE). LCM dynamically models the executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and trajectory smoothness. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves a 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains a 38% average success rate on RoboTwin 2.0 Hard. In real-world evaluation, OptimusVLA ranks best on the Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering a 2.9x inference speedup.
Chinese Translation
分层视觉-语言-动作(VLA)模型迅速成为机器人操作的主流范式。它通常包括用于感知和理解的视觉-语言主干,以及用于动作生成的生成策略。然而,其性能正日益受到动作生成过程的瓶颈限制。(i) 低推理效率。各向同性噪声先验与目标动作分布之间显著的分布差距,增加了去噪步骤和不可行样本的发生率。(ii) 较差的鲁棒性。现有策略仅基于当前观察进行条件化,忽视历史序列的约束,因此缺乏对任务进展和时间一致性的意识。为了解决这些问题,我们提出了OptimusVLA,一个具有全局先验记忆(GPM)和局部一致性记忆(LCM)的双记忆VLA框架。GPM用从语义相似轨迹中检索的任务级先验替代高斯噪声,从而缩短生成路径并减少函数评估次数(NFE)。LCM动态建模已执行的动作序列,以推断任务进展,并注入一个学习到的一致性约束,以强制执行轨迹的时间一致性和平滑性。在三个仿真基准测试中,OptimusVLA始终优于强基线:在LIBERO上实现了98.6%的平均成功率,在CALVIN上比pi_0提高了13.5%,在RoboTwin 2.0 Hard上达到了38%的平均成功率。在现实世界评估中,OptimusVLA在泛化和长时间跨度套件中排名最佳,分别超越pi_0 42.9%和52.4%,同时实现了2.9倍的推理加速。
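The GPM mechanism, starting the sampler from a retrieved task-level prior instead of isotropic noise, can be sketched in NumPy. Everything here (the memory layout, the cosine retrieval, the toy straight-line velocity field) is an illustrative stand-in, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical memory: embeddings of past tasks paired with their action chunks.
memory_keys = rng.normal(size=(50, 8))      # task embeddings
memory_actions = rng.normal(size=(50, 16))  # stored action chunks

def retrieve_prior(task_embedding):
    """GPM-style lookup: return the action chunk of the semantically
    closest stored trajectory (cosine similarity over embeddings)."""
    keys = memory_keys / np.linalg.norm(memory_keys, axis=1, keepdims=True)
    q = task_embedding / np.linalg.norm(task_embedding)
    return memory_actions[int(np.argmax(keys @ q))]

def nfe_to_converge(x0, target, tol=0.1):
    """Euler-integrate a toy straight-line velocity field dx/dt ~ target - x
    and count function evaluations until within `tol` of the target."""
    x, steps = np.asarray(x0, dtype=float), 0
    while np.linalg.norm(x - target) > tol:
        x = x + 0.5 * (target - x)
        steps += 1
    return steps

query = memory_keys[7] + 0.05 * rng.normal(size=8)  # task close to a stored one
target = memory_actions[7]                          # its ground-truth action chunk

nfe_gaussian = nfe_to_converge(rng.normal(size=16), target)  # isotropic noise start
nfe_prior = nfe_to_converge(retrieve_prior(query), target)   # retrieved prior start
```

Because the retrieved prior lands near the target action chunk, the toy sampler needs fewer Euler steps (a lower NFE) than when starting from Gaussian noise, which is the effect the abstract describes.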
cs.RO / 2 / 2602.20215
Vision-Based Reasoning with Topology-Encoded Graphs for Anatomical Path Disambiguation in Robot-Assisted Endovascular Navigation
基于视觉推理的拓扑编码图在机器人辅助血管内导航中的解剖路径消歧义
Abstract
Robotic-assisted percutaneous coronary intervention (PCI) is constrained by the inherent limitations of 2D Digital Subtraction Angiography (DSA). Unlike physicians, who can directly manipulate guidewires and integrate tactile feedback with their prior anatomical knowledge, teleoperated robotic systems must rely solely on 2D projections. This mode of operation, simultaneously lacking spatial context and tactile sensation, may give rise to projection-induced ambiguities at vascular bifurcations. To address this challenge, we propose a two-stage framework (SCAR-UNet-GAT) for real-time robotic path planning. In the first stage, SCAR-UNet, a spatial-coordinate-attention-regularized U-Net, is employed for accurate coronary vessel segmentation. The integration of multi-level attention mechanisms enhances the delineation of thin, tortuous vessels and improves robustness against imaging noise. From the resulting binary masks, vessel centerlines and bifurcation points are extracted, and geometric descriptors (e.g., branch diameter, intersection angles) are fused with local DSA patches to construct node features. In the second stage, a Graph Attention Network (GAT) reasons over the vessel graph to identify anatomically consistent and clinically feasible trajectories, effectively distinguishing true bifurcations from projection-induced false crossings. On a clinical DSA dataset, SCAR-UNet achieved a Dice coefficient of 93.1%. For path disambiguation, the proposed GAT-based method attained a success rate of 95.0% and a target-arrival success rate of 90.0%, substantially outperforming conventional shortest-path planning (60.0% and 55.0%) and heuristic-based planning (75.0% and 70.0%). Validation on a robotic platform further confirmed the practical feasibility and robustness of the proposed framework.
Chinese Translation
机器人辅助的经皮冠状动脉介入(PCI)受到二维数字减影血管造影(DSA)固有局限性的制约。与能够直接操控导丝并将触觉反馈与先前解剖知识相结合的医生不同,遥控机器人系统只能依赖二维投影。这种操作模式缺乏空间上下文和触觉感知,可能导致血管分叉处的投影引发歧义。为了解决这一挑战,我们提出了一种两阶段框架(SCAR-UNet-GAT)用于实时机器人路径规划。在第一阶段,采用SCAR-UNet(一种空间坐标注意力正则化的U-Net)进行准确的冠状动脉血管分割。多层次注意力机制的集成增强了对细长、曲折血管的描绘,并提高了对成像噪声的鲁棒性。从生成的二值掩膜中提取血管中心线和分叉点,并将几何描述符(如分支直径、交叉角度)与局部DSA图像结合,以构建节点特征。在第二阶段,图注意力网络(GAT)在血管图上进行推理,以识别解剖一致且临床可行的轨迹,有效区分真实分叉与投影引发的虚假交叉。在临床DSA数据集上,SCAR-UNet达到了93.1%的Dice系数。对于路径消歧义,所提出的基于GAT的方法达到了95.0%的成功率和90.0%的目标到达成功率,显著优于传统的最短路径规划(60.0%和55.0%)和启发式规划(75.0%和70.0%)。在机器人平台上的验证进一步确认了所提框架的实际可行性和鲁棒性。
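The second stage reasons over the vessel graph with a Graph Attention Network. A single generic GAT layer (in the style of Veličković et al., not necessarily the paper's exact architecture) looks like this in NumPy, with node features standing in for the fused geometric descriptors and DSA-patch embeddings:

```python
import numpy as np

def gat_layer(h, adj, W, a, leaky=0.2):
    """One graph-attention layer. h: (N, F) node features -- e.g. branch
    diameter and intersection angle fused with a patch embedding;
    adj: (N, N) adjacency with self-loops; W: (F, F') projection;
    a: (2*F',) attention vector."""
    z = h @ W                                          # project node features
    n = z.shape[0]
    e = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            s = a @ np.concatenate([z[i], z[j]])       # pairwise attention score
            e[i, j] = s if s > 0 else leaky * s        # LeakyReLU
    e = np.where(adj > 0, e, -1e9)                     # mask non-edges
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)          # softmax over neighbors
    return alpha @ z                                   # attention-weighted aggregation

h = np.arange(12.0).reshape(4, 3)                      # 4 toy bifurcation nodes
out = gat_layer(h, np.eye(4), np.eye(3), np.ones(6))   # self-loops only
```

With self-loops only, each node attends solely to itself and the layer reduces to the identity projection, a quick sanity check on the masking and softmax.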
cs.RO / 3 / 2602.20216
Sample-Efficient Learning with Online Expert Correction for Autonomous Catheter Steering in Endovascular Bifurcation Navigation
基于在线专家修正的样本高效学习用于血管内分叉导航中的自主导管引导
Abstract
Robot-assisted endovascular intervention offers a safe and effective solution for remote catheter manipulation, reducing radiation exposure while enabling precise navigation. Reinforcement learning (RL) has recently emerged as a promising approach for autonomous catheter steering; however, conventional methods suffer from sparse reward design and reliance on static vascular models, limiting their sample efficiency and generalization to intraoperative variations. To overcome these challenges, this paper introduces a sample-efficient RL framework with online expert correction for autonomous catheter steering in endovascular bifurcation navigation. The proposed framework integrates three key components: (1) A segmentation-based pose estimation module for accurate real-time state feedback, (2) A fuzzy controller for bifurcation-aware orientation adjustment, and (3) A structured reward generator incorporating expert priors to guide policy learning. By leveraging online expert correction, the framework reduces exploration inefficiency and enhances policy robustness in complex vascular structures. Experimental validation on a robotic platform using a transparent vascular phantom demonstrates that the proposed approach achieves convergence in 123 training episodes -- a 25.9% reduction compared to the baseline Soft Actor-Critic (SAC) algorithm -- while reducing average positional error to 83.8% of the baseline. These results indicate that combining sample-efficient RL with online expert correction enables reliable and accurate catheter steering, particularly in anatomically challenging bifurcation scenarios critical for endovascular navigation.
Chinese Translation
机器人辅助的血管内干预为远程导管操作提供了一种安全有效的解决方案,减少了辐射暴露,同时实现了精确导航。强化学习(Reinforcement Learning, RL)最近作为一种有前景的方法出现在自主导管引导领域;然而,传统方法存在稀疏奖励设计和依赖静态血管模型的问题,限制了其样本效率和对手术中变化的泛化能力。为了解决这些挑战,本文提出了一种样本高效的RL框架,结合在线专家修正,用于血管内分叉导航中的自主导管引导。所提出的框架集成了三个关键组件:(1)基于分割的姿态估计模块,用于准确的实时状态反馈;(2)模糊控制器,用于考虑分叉的方向调整;(3)结构化奖励生成器,结合专家先验以指导策略学习。通过利用在线专家修正,该框架减少了探索低效性,并增强了在复杂血管结构中的策略鲁棒性。在使用透明血管模型的机器人平台上的实验验证表明,所提出的方法在123个训练回合内实现了收敛,所需回合数比基线Soft Actor-Critic(SAC)算法减少了25.9%,同时将平均位置误差降低到基线的83.8%。这些结果表明,将样本高效的RL与在线专家修正相结合,可以实现可靠且准确的导管引导,特别是在对血管内导航至关重要的解剖挑战性分叉场景中。
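Component (2), the bifurcation-aware fuzzy controller, can be sketched as a tiny Mamdani-style rule base with centroid defuzzification. The membership ranges and consequents below are illustrative assumptions, not the paper's tuned values:

```python
def tri(x, a, b, c):
    """Triangular membership function on [a, c] peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_orientation_command(angle_err_deg):
    """Map the catheter tip's angular error w.r.t. the target branch to a
    normalized rotation command via centroid defuzzification."""
    mu = {
        "neg":  tri(angle_err_deg, -90.0, -45.0, 0.0),
        "zero": tri(angle_err_deg, -15.0, 0.0, 15.0),
        "pos":  tri(angle_err_deg, 0.0, 45.0, 90.0),
    }
    out = {"neg": -1.0, "zero": 0.0, "pos": 1.0}  # rule consequents
    den = sum(mu.values())
    return sum(mu[k] * out[k] for k in mu) / den if den else 0.0
```

Small errors yield proportionally small rotation commands, while errors near the peak of a membership saturate at the rule's consequent, which is the smooth, interpretable behavior a fuzzy layer contributes on top of the learned policy.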
cs.RO / 4 / 2602.20219
An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction
一种结合视频与语音的大型语言模型在人机交互中的方法
Abstract
Interpreting human intent accurately is a central challenge in human-robot interaction (HRI) and a key requirement for achieving more natural and intuitive collaboration between humans and machines. This work presents a novel multimodal HRI framework that combines advanced vision-language models, speech processing, and fuzzy logic to enable precise and adaptive control of a Dobot Magician robotic arm. The proposed system integrates Florence-2 for object detection, Llama 3.1 for natural language understanding, and Whisper for speech recognition, providing users with a seamless and intuitive interface for object manipulation through spoken commands. By jointly addressing scene perception and action planning, the approach enhances the reliability of command interpretation and execution. Experimental evaluations conducted on consumer-grade hardware demonstrate a command execution accuracy of 75%, highlighting both the robustness and adaptability of the system. Beyond its current performance, the proposed architecture serves as a flexible and extensible foundation for future HRI research, offering a practical pathway toward more sophisticated and natural human-robot collaboration through tightly coupled speech and vision-language processing.
Chinese Translation
准确解读人类意图是人机交互(HRI)中的一个核心挑战,也是实现人类与机器之间更自然、直观协作的关键要求。本研究提出了一种新颖的多模态HRI框架,结合了先进的视觉-语言模型、语音处理和模糊逻辑,以实现对Dobot Magician机器人手臂的精确和自适应控制。所提系统集成了Florence-2进行物体检测,Llama 3.1用于自然语言理解,以及Whisper进行语音识别,为用户提供了一个无缝且直观的物体操作界面,通过语音命令进行控制。通过共同解决场景感知和行动规划,该方法增强了命令解读和执行的可靠性。在消费级硬件上进行的实验评估显示,命令执行准确率达到75%,突显了系统的稳健性和适应性。除了当前的性能外,所提架构还为未来的人机交互研究提供了一个灵活且可扩展的基础,提供了一条通过紧密结合的语音和视觉-语言处理实现更复杂和自然的人机协作的实用路径。
cs.RO / 5 / 2602.20220
What Matters for Simulation to Online Reinforcement Learning on Real Robots
从仿真到真实机器人在线强化学习的关键因素
Abstract
We investigate what specific design choices enable successful online reinforcement learning (RL) on physical robots. Across 100 real-world training runs on three distinct robotic platforms, we systematically ablate algorithmic, systems, and experimental decisions that are typically left implicit in prior work. We find that some widely used defaults can be harmful, while a set of robust, readily adopted design choices within standard RL practice yield stable learning across tasks and hardware. These results provide the first large-sample empirical study of such design choices, enabling practitioners to deploy online RL with lower engineering effort.
Chinese Translation
我们研究了哪些具体设计选择能够使物理机器人上的在线强化学习(RL)成功。在三个不同的机器人平台上进行的100次真实世界训练中,我们系统性地对以往研究中通常未明确说明的算法、系统和实验决策进行了消融分析。我们发现,一些广泛使用的默认设置可能是有害的,而一组在标准RL实践中易于采用的稳健设计选择则能够在不同任务和硬件上实现稳定学习。这些结果提供了针对此类设计选择的首个大样本实证研究,帮助从业者以更低的工程成本部署在线RL。
cs.RO / 6 / 2602.20225
FACTO: Function-space Adaptive Constrained Trajectory Optimization for Robotic Manipulators
FACTO:用于机器人操纵器的函数空间自适应约束轨迹优化
Abstract
This paper introduces Function-space Adaptive Constrained Trajectory Optimization (FACTO), a new trajectory optimization algorithm for both single- and multi-arm manipulators. Trajectory representations are parameterized as linear combinations of orthogonal basis functions, and optimization is performed directly in the coefficient space. The constrained problem formulation consists of both an objective functional and a finite-dimensional objective defined over truncated coefficients. To address nonlinearity, FACTO uses a Gauss-Newton approximation with exponential moving averaging, yielding a smoothed quadratic subproblem. Trajectory-wide constraints are addressed using coefficient-space mappings, and an adaptive constrained update using the Levenberg-Marquardt algorithm is performed in the null space of active constraints. Comparisons with optimization-based planners (CHOMP, TrajOpt, GPMP2) and sampling-based planners (RRT-Connect, RRT*, PRM) demonstrate improved solution quality and feasibility, especially in constrained single- and multi-arm scenarios. The experimental evaluation of FACTO on Franka robots verifies the feasibility of deployment.
Chinese Translation
本文介绍了一种新的轨迹优化算法——函数空间自适应约束轨迹优化(FACTO),适用于单臂和多臂操纵器。轨迹表示被参数化为正交基函数的线性组合,优化直接在系数空间中进行。约束问题的形式包括一个目标泛函和一个定义在截断系数上的有限维目标。为了解决非线性问题,FACTO采用了带有指数移动平均的高斯-牛顿近似,从而产生平滑的二次子问题。通过系数空间映射解决轨迹范围内的约束,并在活动约束的零空间中使用列文伯格-马夸特算法进行自适应约束更新。与基于优化的规划器(CHOMP、TrajOpt、GPMP2)和基于采样的规划器(RRT-Connect、RRT*、PRM)的比较显示,特别是在约束的单臂和多臂场景中,解决方案的质量和可行性得到了改善。对Franka机器人上FACTO的实验评估验证了其部署的可行性。
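The core FACTO representation, a trajectory expressed as a linear combination of orthogonal basis functions and optimized in coefficient space, can be sketched with a Chebyshev basis (the particular basis and sizes are assumptions; the paper only requires orthogonality):

```python
import numpy as np

def chebyshev_basis(t, n_basis):
    """Evaluate the first n_basis Chebyshev polynomials T_k at times t in
    [-1, 1] via the recurrence T_k = 2 t T_{k-1} - T_{k-2}."""
    B = np.empty((len(t), n_basis))
    B[:, 0] = 1.0
    if n_basis > 1:
        B[:, 1] = t
    for k in range(2, n_basis):
        B[:, k] = 2.0 * t * B[:, k - 1] - B[:, k - 2]
    return B

def fit_coefficients(t, q_samples, n_basis):
    """Project joint-space samples onto the truncated basis: the trajectory
    q(t) ~= B(t) @ c is then optimized entirely in coefficient space."""
    c, *_ = np.linalg.lstsq(chebyshev_basis(t, n_basis), q_samples, rcond=None)
    return c

t = np.linspace(-1.0, 1.0, 50)
q = 3.0 - 2.0 * t + t**2          # a polynomial joint trajectory
c = fit_coefficients(t, q, n_basis=4)
```

Once a trajectory lives in a handful of coefficients, objectives and trajectory-wide constraints become functions of `c`, which is what makes the smoothed Gauss-Newton and null-space updates of FACTO tractable.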
cs.RO / 7 / 2602.20231
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
UniLACT:面向视觉-语言-动作模型的深度感知RGB潜在动作学习
Abstract
Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This formulation produces modality-specific and unified latent action representations that serve as pseudo-labels for the depth-aware pretraining of UniLACT. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness of depth-aware unified latent action representations. UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks.
Chinese Translation
从未标记视频中学习的潜在动作表示最近成为了一种有前景的范式,用于在没有明确机器人动作监督的情况下预训练视觉-语言-动作(VLA)模型。然而,仅从RGB观测中推导的潜在动作主要编码外观驱动的动态,缺乏明确的3D几何结构,而这对于精确和丰富接触的操作至关重要。为了解决这一局限性,我们提出了UniLACT,这是一种基于变换器的VLA模型,通过深度感知潜在预训练引入几何结构,使下游策略能够继承更强的空间先验。为促进这一过程,我们提出了UniLARN,这是一种基于逆向和前向动态目标的统一潜在动作学习框架,学习RGB和深度的共享嵌入空间,同时明确建模它们的跨模态交互。这一公式产生了特定于模态的和统一的潜在动作表示,作为UniLACT深度感知预训练的伪标签。在模拟和现实世界环境中的广泛实验证明了深度感知统一潜在动作表示的有效性。UniLACT在领域内和领域外的预训练模式下,以及在已见和未见的操作任务上,始终优于基于RGB的潜在动作基线。
cs.RO / 8 / 2602.20304
Smoothly Differentiable and Efficiently Vectorizable Contact Manifold Generation
平滑可微且高效可向量化的接触流形生成
Abstract
Simulating rigid-body dynamics with contact in a fast, massively vectorizable, and smoothly differentiable manner is highly desirable in robotics. An important bottleneck faced by existing differentiable simulation frameworks is contact manifold generation: representing the volume of intersection between two colliding geometries via a discrete set of properly distributed contact points. A major factor contributing to this bottleneck is that the related routines of commonly used robotics simulators were not designed with vectorization and differentiability as a primary concern, and thus rely on logic and control flow that hinder these goals. We instead propose a framework designed from the ground up with these goals in mind, by trying to strike a middle ground between: i) convex primitive based approaches used by common robotics simulators (efficient but not differentiable), and ii) mollified vertex-face and edge-edge unsigned distance-based approaches used by barrier methods (differentiable but inefficient). Concretely, we propose: i) a representative set of smooth analytical signed distance primitives to implement vertex-face collisions, and ii) a novel differentiable edge-edge collision routine that can provide signed distances and signed contact normals. The proposed framework is evaluated via a set of didactic experiments and benchmarked against the collision detection routine of the well-established Mujoco XLA framework, where we observe a significant speedup. Supplementary videos can be found at https://github.com/bekeronur/contax, where a reference implementation in JAX will also be made available at the conclusion of the review process.
Chinese Translation
以快速、大规模向量化和平滑可微的方式模拟具有接触的刚体动力学在机器人技术中是非常理想的。现有可微模拟框架面临的一个重要瓶颈是接触流形生成:通过一组适当分布的接触点来表示两个相撞几何体之间的交集体积。造成这一瓶颈的一个主要因素是,常用机器人模拟器的相关例程并未将向量化和可微性作为主要考虑,因此依赖于阻碍这些目标的逻辑和控制流程。我们提出了一个从头开始设计的框架,旨在兼顾以下两个方面:i) 常用机器人模拟器所采用的基于凸原语的方法(高效但不可微),以及 ii) 由障碍方法使用的平滑顶点-面和边-边无符号距离方法(可微但效率低下)。具体而言,我们提出:i) 一组平滑的解析有符号距离原语,用于实现顶点-面碰撞,以及 ii) 一种新颖的可微边-边碰撞例程,能够提供有符号距离和有符号接触法线。通过一系列教学实验对所提出的框架进行了评估,并与成熟的Mujoco XLA框架的碰撞检测例程进行了基准测试,观察到显著的加速效果。补充视频可在 https://github.com/bekeronur/contax 找到,参考实现将在审稿过程结束时以JAX形式提供。
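The flavor of the proposed primitives can be conveyed with two ingredients: an analytical signed distance and a C1 smooth-min blend. NumPy is used here for brevity (the paper's reference implementation targets JAX), and these particular primitives are illustrative rather than the paper's full set:

```python
import numpy as np

def sdf_sphere(p, center, radius):
    """Analytical signed distance to a sphere: negative inside, smooth
    everywhere away from the center."""
    return np.linalg.norm(np.asarray(p) - np.asarray(center)) - radius

def smooth_min(a, b, k=0.1):
    """Polynomial smooth minimum: a C1 blend of two distance fields that
    avoids the non-differentiable kink of min(a, b) at a == b."""
    h = np.clip(0.5 + 0.5 * (b - a) / k, 0.0, 1.0)
    return b + (a - b) * h - k * h * (1.0 - h)
```

Composing primitives with `smooth_min` keeps the composite distance field differentiable at the seams, exactly the property that branchy convex-primitive pipelines give up, and that barrier-style mollified distances buy at high computational cost.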
cs.RO / 9 / 2602.20323
Learning Physical Principles from Interaction: Self-Evolving Planning via Test-Time Memory
从交互中学习物理原理:通过测试时记忆自我演化规划
Abstract
Reliable object manipulation requires understanding physical properties that vary across objects and environments. Vision-language model (VLM) planners can reason about friction and stability in general terms; however, they often cannot predict how a specific ball will roll on a particular surface or which stone will provide a stable foundation without direct experience. We present PhysMem, a memory framework that enables VLM robot planners to learn physical principles from interaction at test time, without updating model parameters. The system records experiences, generates candidate hypotheses, and verifies them through targeted interaction before promoting validated knowledge to guide future decisions. A central design choice is verification before application: the system tests hypotheses against new observations rather than applying retrieved experience directly, reducing rigid reliance on prior experience when physical conditions change. We evaluate PhysMem on three real-world manipulation tasks and simulation benchmarks across four VLM backbones. On a controlled brick insertion task, principled abstraction achieves 76% success compared to 23% for direct experience retrieval, and real-world experiments show consistent improvement over 30-minute deployment sessions.
Chinese Translation
可靠的物体操控需要理解在不同物体和环境中变化的物理属性。视觉-语言模型(VLM)规划器可以在一般层面上推理摩擦和稳定性;然而,它们往往无法预测特定的球在特定表面上的滚动方式,或哪块石头能够提供稳定的基础,而不需要直接经验。我们提出了PhysMem,一个记忆框架,使VLM机器人规划器能够在测试时通过交互学习物理原理,而无需更新模型参数。该系统记录经验,生成候选假设,并通过针对性的交互验证这些假设,然后将验证过的知识提升以指导未来的决策。一个核心设计选择是在应用前进行验证:系统对新观察进行假设测试,而不是直接应用检索到的经验,从而减少在物理条件变化时对先前经验的僵化依赖。我们在三个真实世界的操控任务和四个VLM骨干网络的仿真基准上评估了PhysMem。在一个受控的砖块插入任务中,原则性抽象的成功率为76%,而直接经验检索的成功率仅为23%;真实世界实验显示,在30分钟的部署会话中,PhysMem表现出持续的改进。
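The verification-before-application loop at the heart of the abstract can be reduced to a few lines; the class, method names, and key format below are hypothetical, not the paper's API:

```python
class VerifiedMemory:
    """Minimal verify-before-apply memory in the spirit of PhysMem:
    a hypothesis drawn from past experience is promoted to validated
    knowledge only after a confirming interaction."""

    def __init__(self, tol=0.2):
        self.hypotheses = {}   # candidate physical estimates awaiting a test
        self.validated = {}    # knowledge promoted to guide future decisions
        self.tol = tol

    def propose(self, key, predicted_value):
        self.hypotheses[key] = predicted_value

    def verify(self, key, observed_value):
        """Test the hypothesis against a fresh observation instead of
        applying retrieved experience directly."""
        predicted = self.hypotheses.pop(key, None)
        if predicted is not None and abs(predicted - observed_value) <= self.tol:
            self.validated[key] = observed_value
            return True
        return False

mem = VerifiedMemory()
mem.propose("rolling_friction/ball_on_carpet", 0.30)
confirmed = mem.verify("rolling_friction/ball_on_carpet", 0.35)  # targeted trial
```

The design point is that retrieval alone never changes behavior: a hypothesis only influences planning after a targeted interaction confirms it, which is what reduces rigid reliance on stale experience when physical conditions change.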
cs.RO / 10 / 2602.20362
Energy-Based Injury Protection Database: Including Shearing Contact Thresholds for Hand and Finger Using Porcine Surrogates
基于能量的伤害保护数据库:包括使用猪模型的手和手指剪切接触阈值
Abstract
While robotics research continues to propose strategies for collision avoidance in human-robot interaction, the reality of constrained environments and future humanoid systems makes contact inevitable. To mitigate injury risks, energy-constraining control approaches are commonly used, often relying on safety thresholds derived from blunt impact data in EN ISO 10218-2:2025. However, this dataset does not extend to edged or pointed collisions. Without scalable, clinically grounded datasets covering diverse contact scenarios, safety validation remains limited. Previous studies have laid the groundwork by assessing surrogate-based velocity and mass limits across various geometries, focusing on perpendicular impacts. This study expands those datasets by including shearing contact scenarios in unconstrained collisions, revealing that collision angle significantly affects injury outcomes. Notably, unconstrained shearing contacts result in fewer injuries than perpendicular ones. By reevaluating all prior porcine surrogate data, we establish energy thresholds across geometries and contact types, forming the first energy-based Injury Protection Database. This enables the development of meaningful energy-limiting controllers that ensure safety across a wide range of realistic collision events.
Chinese Translation
尽管机器人研究持续提出人机交互中的碰撞避免策略,但受限环境和未来类人系统的现实使得接触不可避免。为了降低伤害风险,通常采用能量约束控制方法,这些方法往往依赖于来自EN ISO 10218-2:2025的钝性冲击数据得出的安全阈值。然而,该数据集并未涵盖边缘或尖锐碰撞。在缺乏可扩展的、临床基础的数据集以涵盖多样化接触场景的情况下,安全验证仍然受到限制。以往的研究通过评估基于替代物的速度和质量限制,为不同几何形状的垂直冲击奠定了基础。本研究通过包括在无约束碰撞中的剪切接触场景,扩展了这些数据集,揭示了碰撞角度对伤害结果的显著影响。值得注意的是,无约束的剪切接触导致的伤害少于垂直接触。通过重新评估所有先前的猪模型数据,我们在不同几何形状和接触类型中建立了能量阈值,形成了第一个基于能量的伤害保护数据库。这使得开发有意义的能量限制控制器成为可能,从而确保在广泛的现实碰撞事件中实现安全。
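An energy-limiting controller built on such a database reduces, per contact geometry and contact type, to a speed saturation derived from E = 1/2 m v^2. The sketch below shows the mechanics only; the numeric values are placeholders, not real injury thresholds, which must come from a clinically grounded source like the proposed database:

```python
import math

def max_safe_speed(energy_threshold_j, reflected_mass_kg):
    """Speed limit implied by an energy threshold: v_max = sqrt(2 E / m)."""
    return math.sqrt(2.0 * energy_threshold_j / reflected_mass_kg)

def clamp_speed(commanded_v, energy_threshold_j, reflected_mass_kg):
    """Energy-limiting controller step: saturate the commanded speed so the
    kinetic energy at contact stays below the geometry-specific threshold."""
    v_max = max_safe_speed(energy_threshold_j, reflected_mass_kg)
    return max(-v_max, min(commanded_v, v_max))
```

The study's finding that unconstrained shearing contacts injure less than perpendicular ones would show up here simply as a higher permissible energy threshold for the shearing entry of the lookup.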
cs.RO / 11 / 2602.20375
Generalizing from References using a Multi-Task Reference and Goal-Driven RL Framework
利用多任务参考与目标驱动强化学习框架从参考动作中泛化
Abstract
Learning agile humanoid behaviors from human motion offers a powerful route to natural, coordinated control, but existing approaches face a persistent trade-off: reference-tracking policies are often brittle outside the demonstration dataset, while purely task-driven Reinforcement Learning (RL) can achieve adaptability at the cost of motion quality. We introduce a unified multi-task RL framework that bridges this gap by treating reference motion as a prior for behavioral shaping rather than a deployment-time constraint. A single goal-conditioned policy is trained jointly on two tasks that share the same observation and action spaces, but differ in their initialization schemes, command spaces, and reward structures: (i) a reference-guided imitation task in which reference trajectories define dense imitation rewards but are not provided as policy inputs, and (ii) a goal-conditioned generalization task in which goals are sampled independently of any reference and where rewards reflect only task success. By co-optimizing these objectives within a shared formulation, the policy acquires structured, human-like motor skills from dense reference supervision while learning to adapt these skills to novel goals and initial conditions. This is achieved without adversarial objectives, explicit trajectory tracking, phase variables, or reference-dependent inference. We evaluate the method on a challenging box-based parkour playground that demands diverse athletic behaviors (e.g., jumping and climbing), and show that the learned controller transfers beyond the reference distribution while preserving motion naturalness. Finally, we demonstrate long-horizon behavior generation by composing multiple learned skills, illustrating the flexibility of the learned policies in complex scenarios.
Chinese Translation
从人类运动中学习灵活的人形行为为自然、协调的控制提供了一条强有力的途径,但现有方法面临着持续的权衡:参考跟踪策略在演示数据集之外往往表现脆弱,而纯粹以任务为驱动的强化学习(Reinforcement Learning, RL)则可能以运动质量为代价实现适应性。我们提出了一个统一的多任务强化学习框架,通过将参考运动视为行为塑造的先验,而非部署时的约束,来弥合这一差距。一个目标条件的策略在两个任务上共同训练,这两个任务共享相同的观察和动作空间,但在初始化方案、指令空间和奖励结构上有所不同:(i)一个参考引导的模仿任务,其中参考轨迹定义了密集的模仿奖励,但并未作为策略输入提供;(ii)一个目标条件的泛化任务,其中目标独立于任何参考进行采样,奖励仅反映任务成功。通过在共享的公式中共同优化这些目标,策略从密集的参考监督中获得结构化的人类运动技能,同时学习将这些技能适应于新的目标和初始条件。这是在没有对抗目标、显式轨迹跟踪、相位变量或依赖参考的推理的情况下实现的。我们在一个具有挑战性的基于箱子的跑酷游乐场上评估该方法,该场景要求多样化的运动行为(例如,跳跃和攀爬),并展示了学习到的控制器在保持运动自然性的同时超越了参考分布。最后,我们通过组合多个学习到的技能展示了长时间行为生成,说明了学习策略在复杂场景中的灵活性。
cs.RO / 12 / 2602.20466
Grasp to Act: Dexterous Grasping for Tool Use in Dynamic Settings
抓取以行动:动态环境中工具使用的灵巧抓取
Abstract
Achieving robust grasping with dexterous hands remains challenging, especially when manipulation involves dynamic forces such as impacts, torques, and continuous resistance--situations common in real-world tool use. Existing methods largely optimize grasps for static geometric stability and often fail once external forces arise during manipulation. We present Grasp-to-Act, a hybrid system that combines physics-based grasp optimization with reinforcement-learning-based grasp adaptation to maintain stable grasps throughout functional manipulation tasks. Our method synthesizes robust grasp configurations informed by human demonstrations and employs an adaptive controller that residually issues joint corrections to prevent in-hand slip while tracking the object trajectory. Grasp-to-Act enables robust zero-shot sim-to-real transfer across five dynamic tool-use tasks--hammering, sawing, cutting, stirring, and scooping--consistently outperforming baselines. Across simulation and real-world hardware trials with a 16-DoF dexterous hand, our method reduces translational and rotational in-hand slip and achieves the highest task completion rates, demonstrating stable functional grasps under dynamic, contact-rich conditions.
Chinese Translation
实现灵巧手的稳健抓取仍然是一个挑战,尤其是在操作涉及动态力(如冲击、扭矩和持续阻力)时,这些情况在现实世界的工具使用中很常见。现有方法主要针对静态几何稳定性优化抓取,往往在操作过程中出现外力时失效。我们提出了Grasp-to-Act,这是一个混合系统,结合了基于物理的抓取优化与基于强化学习的抓取适应,以在功能性操作任务中维持稳定的抓取。我们的方法合成了基于人类示范的稳健抓取配置,并采用自适应控制器,以残差方式施加关节修正以防止手内滑动,同时跟踪物体轨迹。Grasp-to-Act在五个动态工具使用任务(锤击、锯切、切割、搅拌和舀取)中实现了稳健的零样本仿真到现实转移,始终优于基线。在16自由度灵巧手的仿真和真实硬件试验中,我们的方法减少了平移和旋转的手内滑动,并实现了最高的任务完成率,证明了在动态、接触丰富的条件下的稳定功能性抓取。
cs.RO / 13 / 2602.20500
Strategy-Supervised Autonomous Laparoscopic Camera Control via Event-Driven Graph Mining
基于策略监督的自主腹腔镜摄像头控制通过事件驱动图挖掘
Abstract
Autonomous laparoscopic camera control must maintain a stable and safe surgical view under rapid tool-tissue interactions while remaining interpretable to surgeons. We present a strategy-grounded framework that couples high-level vision-language inference with low-level closed-loop control. Offline, raw surgical videos are parsed into camera-relevant temporal events (e.g., interaction, working-distance deviation, and view-quality degradation) and structured as attributed event graphs. Mining these graphs yields a compact set of reusable camera-handling strategy primitives, which provide structured supervision for learning. Online, a fine-tuned Vision-Language Model (VLM) processes the live laparoscopic view to predict the dominant strategy and discrete image-based motion commands, executed by an IBVS-RCM controller under strict safety constraints; optional speech input enables intuitive human-in-the-loop conditioning. On a surgeon-annotated dataset, event parsing achieves reliable temporal localization (F1-score 0.86), and the mined strategies show strong semantic alignment with expert interpretation (cluster purity 0.81). Extensive ex vivo experiments on silicone phantoms and porcine tissues demonstrate that the proposed system outperforms junior surgeons in standardized camera-handling evaluations, reducing field-of-view centering error by 35.26% and image shaking by 62.33%, while preserving smooth motion and stable working-distance regulation.
Chinese Translation
自主腹腔镜摄像头控制必须在快速的工具与组织交互下保持稳定和安全的手术视野,同时对外科医生可解释。我们提出了一种基于策略的框架,将高层次的视觉-语言推理与低层次的闭环控制相结合。在离线阶段,原始手术视频被解析为与摄像头相关的时间事件(例如,交互、工作距离偏差和视图质量下降),并结构化为带属性的事件图。挖掘这些图可以得到一组紧凑的可重用摄像头处理策略原语,为学习提供了结构化监督。在在线阶段,经过微调的视觉-语言模型(VLM)处理实时腹腔镜视图,以预测主导策略和离散的基于图像的运动命令,由IBVS-RCM控制器在严格的安全约束下执行;可选的语音输入实现了直观的人在环路调节。在外科医生标注的数据集上,事件解析实现了可靠的时间定位(F1-score 0.86),而挖掘的策略与专家解释表现出强语义一致性(聚类纯度0.81)。在硅胶模型和猪组织上的广泛离体实验表明,所提出的系统在标准化摄像头处理评估中优于初级外科医生,将视野中心误差减少了35.26%,图像抖动减少了62.33%,同时保持了平滑运动和稳定的工作距离调节。
cs.RO / 14 / 2602.20512
Conflict-Based Search for Multi-Agent Path Finding with Elevators
用于带电梯多智能体路径规划的基于冲突的搜索
Abstract
This paper investigates a problem called Multi-Agent Path Finding with Elevators (MAPF-E), which seeks conflict-free paths for multiple agents from their start to goal locations, which may be located on different floors; the agents can use elevators to travel between floors. The existence of elevators complicates the interaction among the agents and introduces new challenges to the planning. On the one hand, elevators can cause many conflicts among the agents due to their relatively long traversal times across floors, especially when many agents need to reach a different floor. On the other hand, the planner has to reason in a larger state space that includes the states of the elevators in addition to the locations of the agents.
Chinese Translation
本文研究了一种称为电梯多智能体路径规划(Multi-Agent Path Finding with Elevators, MAPF-E)的问题,该问题旨在为多个智能体寻找无冲突的路径,使其从起始位置到达目标位置,这些位置可能位于不同楼层,并且智能体可以使用电梯在楼层之间移动。电梯的存在使得智能体之间的互动变得复杂,并给规划带来了新的挑战。一方面,由于电梯在楼层之间的相对较长的移动时间,尤其是在多个智能体需要到达不同楼层时,电梯可能会导致许多冲突。另一方面,规划者除了需要考虑智能体的位置外,还必须在包含电梯状态的更大状态空间中进行推理。
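The CBS ingredient at the heart of such a planner, finding the first conflict between agent paths, carries over once states include a floor index. The (floor, row, col) encoding below is an illustrative assumption, not necessarily the paper's formulation:

```python
def find_first_conflict(paths):
    """Return the first vertex or swap conflict among agent paths, where
    each state is a (floor, row, col) tuple; shorter paths are treated as
    waiting at their goals."""
    horizon = max(len(p) for p in paths)

    def at(p, t):
        return p[t] if t < len(p) else p[-1]

    for t in range(horizon):
        seen = {}
        for i, p in enumerate(paths):
            s = at(p, t)
            if s in seen:                      # two agents share a cell (or elevator)
                return {"time": t, "agents": (seen[s], i), "state": s}
            seen[s] = i
        if t > 0:                              # swap (edge) conflicts
            for i in range(len(paths)):
                for j in range(i + 1, len(paths)):
                    if at(paths[i], t) == at(paths[j], t - 1) and \
                       at(paths[j], t) == at(paths[i], t - 1):
                        return {"time": t, "agents": (i, j), "swap": True}
    return None
```

In a full MAPF-E solver, the returned conflict would spawn two CBS child nodes with complementary constraints; elevator occupancy over a multi-step traversal can be modeled by repeating the elevator's state across those timesteps, which is what makes elevator conflicts so frequent.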
cs.RO / 15 / 2602.20566
BFA++: Hierarchical Best-Feature-Aware Token Prune for Multi-View Vision Language Action Model
BFA++:用于多视角视觉语言动作模型的分层最佳特征感知令牌剪枝
Abstract
Vision-Language-Action (VLA) models have achieved significant breakthroughs by leveraging Large Vision Language Models (VLMs) to jointly interpret instructions and visual inputs. However, the substantial increase in visual tokens, particularly from multi-view inputs, poses serious challenges to real-time robotic manipulation. Existing acceleration techniques for VLMs, such as token pruning, often result in degraded performance when directly applied to VLA models, as they overlook the relationships between different views and fail to account for the dynamic and task-specific characteristics of robotic operation. To address this, we propose BFA++, a dynamic token pruning framework designed specifically for VLA models. BFA++ introduces a hierarchical pruning strategy guided by two-level importance predictors: an intra-view predictor highlights task-relevant regions within each image to suppress spatial noise, while an inter-view predictor identifies critical camera views throughout different manipulation phases to reduce cross-view redundancy. This design enables efficient token selection while preserving essential visual cues, resulting in improved computational efficiency and higher manipulation success rates. Evaluations on the RoboTwin benchmark and real-world robotic tasks demonstrate that BFA++ consistently outperforms existing methods. BFA++ improves the success rate by about 10% on both the pi_0 and RDT models, achieving speedups of 1.8x and 1.5x, respectively. Our results highlight that context-sensitive and task-aware token pruning serves as a more effective strategy than full visual processing, enabling faster inference and improved manipulation accuracy in real-world robotic systems.
Chinese Translation
视觉-语言-动作(VLA)模型通过利用大型视觉语言模型(VLMs)实现了显著突破,能够共同解释指令和视觉输入。然而,来自多视角输入的视觉令牌数量大幅增加,对实时机器人操作提出了严峻挑战。现有的VLM加速技术,如令牌剪枝,直接应用于VLA模型时往往导致性能下降,因为它们忽视了不同视角之间的关系,并未考虑机器人操作的动态和任务特定特征。为此,我们提出了BFA++,一个专门为VLA模型设计的动态令牌剪枝框架。BFA++引入了一种分层剪枝策略,由两级重要性预测器指导:视角内预测器突出每幅图像中与任务相关的区域,以抑制空间噪声,而视角间预测器则识别在不同操作阶段中关键的摄像机视角,以减少视角间的冗余。该设计实现了高效的令牌选择,同时保留了重要的视觉线索,从而提高了计算效率和操作成功率。在RoboTwin基准和真实世界机器人任务上的评估表明,BFA++始终优于现有方法。在pi_0和RDT模型上,BFA++的成功率均提高了约10%,并分别实现了1.8倍和1.5倍的加速。我们的结果强调,上下文敏感和任务感知的令牌剪枝是比全面视觉处理更有效的策略,能够在真实世界的机器人系统中实现更快的推理和更高的操作准确性。
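The two-level pruning strategy can be sketched as a budget split across views followed by intra-view top-k selection. The allocation rule and score sources here are assumptions; in BFA++ the scores come from learned importance predictors:

```python
import numpy as np

def hierarchical_prune(view_tokens, view_scores, token_scores, budget):
    """Two-level pruning sketch: split the token budget across camera views
    in proportion to inter-view importance, then keep the top-scoring
    tokens within each view, preserving their original order."""
    w = np.asarray(view_scores, dtype=float)
    w = w / w.sum()                              # inter-view allocation weights
    kept = []
    for tokens, scores, frac in zip(view_tokens, token_scores, w):
        k = max(1, int(round(budget * frac)))    # per-view token budget
        idx = np.sort(np.argsort(scores)[-k:])   # intra-view top-k, order kept
        kept.append(np.asarray(tokens)[idx])
    return kept

views = [np.arange(10), np.arange(10, 20)]       # token ids from two cameras
kept = hierarchical_prune(views, [3.0, 1.0],
                          [np.arange(10), np.arange(10)], budget=4)
```

Because both the view weights and the token scores would change across manipulation phases, the pruning pattern is dynamic rather than a fixed mask, which is what distinguishes this from generic VLM token pruning.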
cs.RO / 16 / 2602.20596
Acoustic Feedback for Closed-Loop Force Control in Robotic Grinding
用于机器人磨削的闭环力控制的声学反馈
Abstract
Acoustic feedback is a critical indicator for assessing the contact condition between the tool and the workpiece when humans perform grinding tasks with rotary tools. In contrast, robotic grinding systems typically rely on force sensing, with acoustic information largely ignored. This reliance on force sensors is costly and difficult to adapt to different grinding tools, whereas audio sensors (microphones) are low-cost and can be mounted on any medium that conducts grinding sound. This paper introduces a low-cost Acoustic Feedback Robotic Grinding System (AFRG) that captures audio signals with a contact microphone, estimates grinding force from the audio in real time, and enables closed-loop force control of the grinding process. Compared with conventional force-sensing approaches, AFRG achieves a 4-fold improvement in consistency across different grinding disc conditions. AFRG relies solely on a low-cost microphone as its sensing modality, approximately 200 times cheaper than conventional force sensors, providing an easily deployable, cost-effective robotic grinding solution.
Chinese Translation
声学反馈是评估工具与工件之间接触状态的重要指标,尤其是在人工使用旋转工具进行磨削任务时。相比之下,机器人磨削系统通常依赖于力传感,而声学信息在很大程度上被忽视。这种对力传感器的依赖成本高昂且难以适应不同的磨削工具,而音频传感器(麦克风)则成本低廉,可以安装在任何能够传导磨削声音的介质上。本文介绍了一种低成本的声学反馈机器人磨削系统(Acoustic Feedback Robotic Grinding System, AFRG),该系统通过接触麦克风捕获音频信号,实时估计磨削力,并实现磨削过程的闭环力控制。与传统的力传感方法相比,AFRG在不同磨削盘条件下的一致性提高了4倍。AFRG仅依赖一个低成本麦克风作为传感方式,其成本约为传统力传感器的两百分之一,提供了一种易于部署且具有成本效益的机器人磨削解决方案。
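The sensing-to-control path can be sketched as a short-time RMS feature feeding a PI loop on the estimated force. The RMS feature and fixed gains are illustrative stand-ins for AFRG's learned audio-to-force estimator:

```python
import numpy as np

def audio_rms(frame):
    """Short-time RMS of a contact-microphone frame, the raw acoustic feature."""
    return float(np.sqrt(np.mean(np.square(np.asarray(frame, dtype=float)))))

def pi_force_step(force_ref, force_est, state, kp=0.5, ki=0.1, dt=0.01):
    """One PI update on the audio-estimated grinding force, returning a
    feed-rate adjustment. Gains and the implied RMS-to-force mapping are
    illustrative assumptions, not the paper's calibrated values."""
    err = force_ref - force_est
    state["integral"] += err * dt
    return kp * err + ki * state["integral"]

state = {"integral": 0.0}
adjustment = pi_force_step(force_ref=10.0, force_est=8.0, state=state)
```

In the closed loop described by the abstract, `force_est` would come from a model mapping microphone features to grinding force, so the expensive force/torque sensor drops out of the control path entirely.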
cs.RO / 17 / 2602.20645
Robot Local Planner: A Periodic Sampling-Based Motion Planner with Minimal Waypoints for Home Environments
机器人局部规划器:一种基于周期采样的家庭环境最小途径点运动规划器
Abstract
The objective of this study is to enable fast and safe manipulation tasks in home environments. Specifically, we aim to develop a system that can recognize its surroundings and identify target objects while in motion, enabling it to plan and execute actions accordingly. We propose a periodic sampling-based whole-body trajectory planning method, called the "Robot Local Planner (RLP)." This method leverages unique features of home environments to enhance computational efficiency, motion optimality, and robustness against recognition and control errors, all while ensuring safety. The RLP minimizes computation time by planning with minimal waypoints and generating safe trajectories. Furthermore, overall motion optimality is improved by periodically executing trajectory planning to select more optimal motions. This approach incorporates inverse kinematics that are robust to base position errors, further enhancing robustness. Evaluation experiments demonstrated that the RLP outperformed existing methods in terms of motion planning time, motion duration, and robustness, confirming its effectiveness in home environments. Moreover, application experiments using a tidy-up task achieved high success rates and short operation times, thereby underscoring its practical feasibility.
Chinese Translation
本研究的目标是实现家庭环境中的快速和安全操作任务。具体而言,我们旨在开发一个能够在运动中识别周围环境和目标物体的系统,从而能够相应地规划和执行动作。我们提出了一种基于周期采样的全身轨迹规划方法,称为“机器人局部规划器(Robot Local Planner, RLP)”。该方法利用家庭环境的独特特征来提高计算效率、运动最优性以及对识别和控制错误的鲁棒性,同时确保安全性。RLP通过使用最小途径点进行规划,从而最小化计算时间并生成安全轨迹。此外,通过定期执行轨迹规划以选择更优的运动,整体运动最优性得到了改善。该方法结合了对基座位置误差具有鲁棒性的逆向运动学,进一步增强了鲁棒性。评估实验表明,RLP在运动规划时间、运动持续时间和鲁棒性方面优于现有方法,确认了其在家庭环境中的有效性。此外,使用整理任务的应用实验实现了高成功率和短操作时间,从而强调了其实际可行性。
cs.RO / 18 / 2602.20715
IG-RFT: An Interaction-Guided RL Framework for VLA Models in Long-Horizon Robotic Manipulation
IG-RFT:一种用于长时间跨度机器人操作的交互引导强化学习框架
Abstract
Vision-Language-Action (VLA) models have demonstrated significant potential for generalist robotic policies; however, they struggle to generalize to long-horizon complex tasks in novel real-world domains due to distribution shifts and the scarcity of high-quality demonstrations. Although reinforcement learning (RL) offers a promising avenue for policy improvement, applying it to real-world VLA fine-tuning faces challenges regarding exploration efficiency, training stability, and sample cost. To address these issues, we propose IG-RFT, a novel Interaction-Guided Reinforced Fine-Tuning system designed for flow-based VLA models. Firstly, to facilitate effective policy optimization, we introduce Interaction-Guided Advantage Weighted Regression (IG-AWR), an RL algorithm that dynamically modulates exploration intensity based on the robot's interaction status. Furthermore, to address the limitations of sparse or task-specific rewards, we design a novel hybrid dense reward function that integrates the trajectory-level reward and the subtask-level reward. Finally, we construct a three-stage RL system comprising SFT, Offline RL, and Human-in-the-Loop RL for fine-tuning VLA models. Extensive real-world experiments on four challenging long-horizon tasks demonstrate that IG-RFT achieves an average success rate of 85.0%, significantly outperforming SFT (18.8%) and standard Offline RL baselines (40.0%). Ablation studies confirm the critical contributions of IG-AWR and hybrid reward shaping. In summary, our work establishes and validates a novel reinforced fine-tuning system for VLA models in real-world robotic manipulation.
Chinese Translation
视觉-语言-动作(VLA)模型在通用机器人策略中展现了显著的潜力;然而,由于分布变化和高质量示范的稀缺,它们在新颖的现实世界领域中难以推广到长时间跨度的复杂任务。尽管强化学习(RL)为策略改进提供了一个有前景的途径,但将其应用于现实世界的VLA微调面临探索效率、训练稳定性和样本成本等挑战。为了解决这些问题,我们提出了IG-RFT,一种为基于流的VLA模型设计的新型交互引导强化微调系统。首先,为了促进有效的策略优化,我们引入了交互引导优势加权回归(IG-AWR),这是一种根据机器人交互状态动态调节探索强度的RL算法。此外,为了解决稀疏或任务特定奖励的局限性,我们设计了一种新型混合密集奖励函数,结合了轨迹级奖励和子任务级奖励。最后,我们构建了一个包含SFT、离线RL和人机协作RL的三阶段RL系统,用于微调VLA模型。在四个具有挑战性的长时间跨度任务上的大量现实世界实验表明,IG-RFT的平均成功率达到85.0%,显著优于SFT(18.8%)和标准离线RL基线(40.0%)。消融研究确认了IG-AWR和混合奖励塑造的关键贡献。总之,我们的工作建立并验证了一种用于现实世界机器人操作的VLA模型的新型强化微调系统。
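The exact form of IG-AWR is not given in the abstract; a minimal sketch of advantage-weighted regression with an interaction-gated temperature (the gating rule, constant values, and function name are illustrative assumptions):

```python
import math

def awr_weights(advantages, in_contact, beta_free=1.0, beta_contact=0.3, w_max=20.0):
    """Exponentiated-advantage regression weights, w = exp(A / beta), with the
    temperature beta switched on interaction status: a lower beta during
    contact sharpens the weighting, i.e. less exploratory updates."""
    weights = []
    for adv, contact in zip(advantages, in_contact):
        beta = beta_contact if contact else beta_free
        weights.append(min(math.exp(adv / beta), w_max))  # clip for stability
    return weights
```

The policy's behavior-cloning loss on each sample would then be scaled by the corresponding weight, as in standard AWR.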
cs.RO / 19 / 2602.20768
Visual Cooperative Drone Tracking for Open-Path Gas Measurements
开放路径气体测量的视觉协作无人机追踪
Abstract
Open-path Tunable Diode Laser Absorption Spectroscopy offers an effective method for measuring, mapping, and monitoring gas concentrations, such as leaking CO2 or methane. Compared to spatial sampling of gas distributions using in-situ sensors, open-path sensors in combination with gas tomography algorithms can cover large outdoor environments faster in a non-invasive way. However, the requirement of a dedicated reflection surface for the open-path laser makes automating the spatial sampling process challenging. This publication presents a robotic system for collecting open-path measurements, making use of a sensor mounted on a ground-based pan-tilt unit and a small drone carrying a reflector. By means of a zoom camera, the ground unit visually tracks red LED markers mounted on the drone and aligns the sensor's laser beam with the reflector. Incorporating GNSS position information provided by the drone's flight controller further improves the tracking approach. Outdoor experiments validated the system's performance, demonstrating successful autonomous tracking and valid CO2 measurements at distances up to 60 meters. Furthermore, the system successfully measured a CO2 plume without interference from the drone's propulsion system, demonstrating its superiority compared to flying in-situ sensors.
Chinese Translation
开放路径可调二极管激光吸收光谱法提供了一种有效的方法,用于测量、绘制和监测气体浓度,例如泄漏的二氧化碳或甲烷。与使用原位传感器进行气体分布的空间采样相比,开放路径传感器结合气体断层成像算法能够以非侵入的方式更快地覆盖大面积户外环境。然而,开放路径激光需要专用的反射面,这使得空间采样过程的自动化变得具有挑战性。本文提出了一种用于收集开放路径测量的机器人系统,该系统利用安装在地面平移-俯仰单元上的传感器和携带反射器的小型无人机。通过变焦摄像头,地面单元能够视觉追踪安装在无人机上的红色LED标记,并将传感器的激光束与反射器对准。结合无人机飞行控制器提供的全球导航卫星系统(GNSS)位置信息,进一步改善了追踪方法。户外实验验证了系统的性能,成功实现了自主追踪,并在距离达60米的情况下进行了有效的二氧化碳测量。此外,该系统成功测量了二氧化碳羽流,且无人机的推进系统没有干扰,显示出其优于飞行原位传感器的优势。
cs.RO / 20 / 2602.20850
KCFRC: Kinematic Collision-Aware Foothold Reachability Criteria for Legged Locomotion
KCFRC:考虑运动学碰撞的腿部运动踏点可达性标准
Abstract
Legged robots face significant challenges in navigating complex environments, as they require precise real-time decisions for foothold selection and contact planning. While existing research has explored methods to select footholds based on terrain geometry or kinematics, a critical gap remains: few existing methods efficiently validate the existence of a collision-free swing trajectory. This paper addresses this gap by introducing KCFRC, a novel approach for efficient foothold reachability analysis. We first formally define the foothold reachability problem and establish a sufficient condition for foothold reachability. Based on this condition, we develop the KCFRC algorithm, which enables robots to validate foothold reachability in real time. Our experimental results demonstrate that KCFRC achieves remarkable time efficiency, completing foothold reachability checks for a single leg across 900 potential footholds in an average of 2 ms. Furthermore, we show that KCFRC can accelerate trajectory optimization and is particularly beneficial for contact planning in confined spaces, enhancing the adaptability and robustness of legged robots in challenging environments.
Chinese Translation
腿部机器人在复杂环境中导航面临重大挑战,因为它们需要对踏点选择和接触规划做出精确的实时决策。尽管现有研究已探讨了基于地形几何或运动学选择踏点的方法,但仍存在一个关键的空白:现有方法中很少有效验证非碰撞摆动轨迹的存在。本文通过引入KCFRC,提出了一种高效的踏点可达性分析新方法,以填补这一空白。我们首先正式定义了踏点可达性问题,并建立了踏点可达性的充分条件。基于这一条件,我们开发了KCFRC算法,使机器人能够实时验证踏点的可达性。我们的实验结果表明,KCFRC在时间效率上表现出色,平均在2毫秒内完成对900个潜在踏点的单腿踏点可达性检查。此外,我们还展示了KCFRC能够加速轨迹优化,特别有利于在狭小空间内的接触规划,从而增强腿部机器人在挑战性环境中的适应性和鲁棒性。
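As a rough illustration of the kind of sufficient condition KCFRC formalizes (this is not the paper's actual criteria), one can require that a candidate foothold lies inside the leg's kinematic workspace and that a sampled swing arc clears the terrain; all parameter values below are assumptions:

```python
import math

def foothold_reachable(hip, start, goal, leg_reach, terrain_height,
                       apex=0.10, clearance=0.01, n_samples=20):
    """Illustrative sufficient condition for foothold reachability:
    (1) the foothold lies inside the leg's kinematic workspace sphere, and
    (2) a sampled parabolic swing arc from `start` to `goal` clears terrain."""
    if math.dist(hip, goal) > leg_reach:
        return False                       # kinematically out of reach
    for i in range(1, n_samples):          # interior samples of the swing arc
        t = i / n_samples
        x = start[0] + t * (goal[0] - start[0])
        y = start[1] + t * (goal[1] - start[1])
        z = start[2] + t * (goal[2] - start[2]) + apex * 4.0 * t * (1.0 - t)
        if z < terrain_height(x, y) + clearance:
            return False                   # swing trajectory would collide
    return True
```

A sampled check like this is cheap enough to run over hundreds of candidate footholds per planning cycle, which is the regime the paper targets.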
cs.RO / 21 / 2602.20871
GeCo-SRT: Geometry-aware Continual Adaptation for Robotic Cross-Task Sim-to-Real Transfer
GeCo-SRT:面向几何的持续适应方法用于机器人跨任务仿真到现实转移
Abstract
Bridging the sim-to-real gap is important for applying low-cost simulation data to real-world robotic systems. However, previous methods are severely limited by treating each transfer as an isolated endeavor, demanding repeated, costly tuning and wasting prior transfer experience. To move beyond isolated sim-to-real transfer, we build a continual cross-task sim-to-real transfer paradigm centered on knowledge accumulation across iterative transfers, thereby enabling effective and efficient adaptation to novel tasks. Thus, we propose GeCo-SRT, a geometry-aware continual adaptation method. It utilizes domain-invariant and task-invariant knowledge from local geometric features as a transferable foundation to accelerate adaptation during subsequent sim-to-real transfers. This method starts with a geometry-aware mixture-of-experts module, which dynamically activates experts to specialize in distinct geometric knowledge to bridge the observation-level sim-to-real gap. Further, a geometry-expert-guided prioritized experience replay module preferentially samples from underutilized experts, refreshing specialized knowledge to combat forgetting and maintain robust cross-task performance. Leveraging knowledge accumulated during iterative transfer, GeCo-SRT not only achieves a 52% average performance improvement over the baseline but also demonstrates significant data efficiency, adapting to new tasks with only 1/6 of the data. We hope this work inspires approaches for efficient, low-cost cross-task sim-to-real transfer.
Chinese Translation
弥合仿真到现实的差距对于将低成本仿真数据应用于现实世界的机器人系统至关重要。然而,以往的方法由于将每次转移视为孤立的努力,严重受限,导致需要反复进行昂贵的调优,并浪费先前的转移经验。为了超越孤立的仿真到现实转移,我们构建了一个以知识积累为中心的持续跨任务仿真到现实转移范式,从而实现对新任务的有效和高效适应。因此,我们提出了GeCo-SRT,一种面向几何的持续适应方法。该方法利用来自局部几何特征的领域不变和任务不变知识,作为可转移的基础,以加速后续仿真到现实转移过程中的适应。该方法首先采用一个面向几何的专家混合模块,动态激活专家以专注于不同的几何知识,从而弥合观察的仿真到现实差距。此外,几何专家引导的优先经验重放模块优先从未充分利用的专家中采样,刷新专业知识以应对遗忘,并保持强大的跨任务性能。利用在迭代转移过程中积累的知识,GeCo-SRT方法不仅在基准测试中实现了52%的平均性能提升,而且在新任务适应方面展现出显著的数据效率,仅需1/6的数据。我们希望这项工作能激励高效、低成本的跨任务仿真到现实转移的方法。
cs.RO / 22 / 2602.20915
Task-oriented grasping for dexterous robots using postural synergies and reinforcement learning
基于姿态协同和强化学习的灵巧机器人任务导向抓取
Abstract
In this paper, we address the problem of task-oriented grasping for humanoid robots, emphasizing the need to align with human social norms and task-specific objectives. Existing methods employ a variety of open-loop and closed-loop approaches but lack an end-to-end solution that can grasp several objects while taking into account the downstream task's constraints. Our proposed approach employs reinforcement learning to enhance task-oriented grasping, prioritizing the post-grasp intention of the agent. We extract human grasp preferences from the ContactPose dataset and train a hand synergy model based on a Variational Autoencoder (VAE) to imitate the participants' grasping actions. Based on this data, we train an agent able to grasp multiple objects while taking into account distinct, task-specific post-grasp intentions. By combining data-driven insights from human grasping behavior with the exploration-based learning provided by reinforcement learning, we can develop humanoid robots capable of context-aware manipulation actions, facilitating collaboration in human-centered environments.
Chinese Translation
本文探讨了类人机器人任务导向抓取的问题,强调了与人类社会规范和任务特定目标对齐的必要性。现有方法采用多种开环和闭环方法,但缺乏能够在考虑下游任务约束的情况下抓取多个物体的端到端解决方案。我们提出的方法采用强化学习来增强任务导向抓取,优先考虑代理的抓取后意图。我们从ContactPose数据集中提取人类抓取偏好,并基于变分自编码器(Variational Autoencoder, VAE)训练手部协同模型,以模仿参与者的抓取动作。基于这些数据,我们训练一个能够在考虑特定任务的不同抓取后意图的情况下抓取多个物体的代理。通过结合来自人类抓取行为的数据驱动洞察与强化学习提供的探索学习,我们可以开发出能够进行上下文感知操作的类人机器人,从而促进人本环境中的协作。
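Postural synergies are classically a low-dimensional basis over joint angles; a linear decoder stands in here for the paper's VAE-based synergy model (the function name and all values are illustrative assumptions):

```python
def decode_synergy(z, mean_posture, synergy_basis):
    """Map a low-dimensional synergy vector z to full joint angles:
    q = q_mean + sum_k z_k * basis_k.
    A linear stand-in for a learned (e.g. VAE) synergy decoder."""
    q = list(mean_posture)
    for k, zk in enumerate(z):
        for j in range(len(q)):
            q[j] += zk * synergy_basis[k][j]
    return q
```

The RL agent then acts in the compact synergy space (z) instead of commanding every hand joint directly, which is what makes high-DoF dexterous grasping tractable.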
cs.RO / 23 / 2602.20920
Computer-Aided Design of Rational Motions for 4R and 6R Spatial Mechanism Synthesis
用于4R与6R空间机构综合的有理运动计算机辅助设计
Abstract
This paper focuses on geometric methods for generating rational motions used in the design of single-loop rational linkages, 1-degree-of-freedom mechanisms that can execute prescribed spatial tasks. Building on established rational motion synthesis methods, we introduce a new interpolation scheme for seven 3D points based on cubic quaternionic Bezier curves. The resulting motion admits factorization, i.e. the synthesis of a spatial six-bar mechanism whose tool frame passes through the specified seven points. To support engineering practice, we provide open-source CAD tools that also implement the other established methods and enable fast visual evaluation of motion generation and mechanism synthesis.
Chinese Translation
本文聚焦于用于设计单环有理连杆(能够执行指定空间任务的1自由度机构)的有理运动生成的几何方法。在已有的有理运动综合方法的基础上,我们引入了一种基于三次四元数贝塞尔曲线的七个三维点的新插值方案。所得到的运动允许因式分解,即可综合出一个工具坐标系经过指定七个点的空间六杆机构。为了支持工程实践,我们提供了开源CAD工具,这些工具还实现了其他方法,并提供快速的运动生成和机构综合可视化评估。
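A cubic quaternionic Bezier curve of the kind used above can be evaluated with ordinary Bernstein blending over 4-vector control points; because rotations are projective, the unnormalized degree-3 quaternion polynomial suffices. This is only an evaluation sketch, not the paper's interpolation or factorization machinery:

```python
import math

def cubic_bezier_quat(cps, t):
    """Evaluate a cubic Bezier curve with four quaternion (4-vector) control
    points. The result is a degree-3 quaternion polynomial q(t); the rotation
    it represents is q(t)/|q(t)| (normalization is projectively irrelevant)."""
    q = [0.0, 0.0, 0.0, 0.0]
    for i, cp in enumerate(cps):
        b = math.comb(3, i) * t**i * (1 - t)**(3 - i)  # Bernstein weight
        for j in range(4):
            q[j] += b * cp[j]
    return q
```

Rational motion synthesis then factors such a quaternion polynomial into linear factors, each corresponding to a revolute joint of the resulting linkage.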
cs.RO / 24 / 2602.20923
ParkDiffusion++: Ego Intention Conditioned Joint Multi-Agent Trajectory Prediction for Automated Parking using Diffusion Models
ParkDiffusion++:基于扩散模型的自动停车中自我意图条件的联合多智能体轨迹预测
Abstract
Automated parking is a challenging operational domain for advanced driver assistance systems, requiring robust scene understanding and interaction reasoning. The key challenge is twofold: (i) predict multiple plausible ego intentions according to context and (ii) for each intention, predict the joint responses of surrounding agents, enabling effective what-if decision-making. However, existing methods often fall short, typically treating these interdependent problems in isolation. We propose ParkDiffusion++, which jointly learns a multi-modal ego intention predictor and an ego-conditioned multi-agent joint trajectory predictor for automated parking. Our approach makes several key contributions. First, we introduce an ego intention tokenizer that predicts a small set of discrete endpoint intentions from agent histories and vectorized map polylines. Second, we perform ego-intention-conditioned joint prediction, yielding socially consistent predictions of the surrounding agents for each possible ego intention. Third, we employ a lightweight safety-guided denoiser with different constraints to refine joint scenes during training, thus improving accuracy and safety. Fourth, we propose counterfactual knowledge distillation, where an EMA teacher refined by a frozen safety-guided denoiser provides pseudo-targets that capture how agents react to alternative ego intentions. Extensive evaluations demonstrate that ParkDiffusion++ achieves state-of-the-art performance on the Dragon Lake Parking (DLP) dataset and the Intersections Drone (inD) dataset. Importantly, qualitative what-if visualizations show that other agents react appropriately to different ego intentions.
Chinese Translation
自动停车是高级驾驶辅助系统面临的一个具有挑战性的操作领域,要求具备强大的场景理解和交互推理能力。其关键挑战主要有两个方面:(i)根据上下文预测多个合理的自我意图,以及(ii)针对每个意图,预测周围智能体的联合反应,从而实现有效的假设决策。然而,现有方法通常存在不足,通常将这些相互依赖的问题孤立处理。我们提出了ParkDiffusion++,该方法联合学习多模态自我意图预测器和自我条件的多智能体联合轨迹预测器,以实现自动停车。我们的方法有几个关键贡献。首先,我们引入了一种自我意图标记器,从智能体历史和矢量化地图折线中预测一小组离散的端点意图。其次,我们进行自我意图条件的联合预测,为每个可能的自我意图生成周围智能体的社会一致性预测。第三,我们采用了一种轻量级的安全引导去噪器,结合不同的约束条件,在训练过程中精炼联合场景,从而提高准确性和安全性。第四,我们提出了反事实知识蒸馏,其中经过冻结的安全引导去噪器优化的EMA教师提供伪目标,捕捉智能体如何对替代自我意图做出反应。广泛的评估表明,ParkDiffusion++在Dragon Lake Parking (DLP) 数据集和Intersections Drone (inD) 数据集上达到了最先进的性能。重要的是,定性假设可视化显示其他智能体对不同的自我意图做出了适当反应。
cs.RO / 25 / 2602.20925
LST-SLAM: A Stereo Thermal SLAM System for Kilometer-Scale Dynamic Environments
LST-SLAM:一种用于公里级动态环境的立体热SLAM系统
Abstract
Thermal cameras offer strong potential for robot perception under challenging illumination and weather conditions. However, thermal Simultaneous Localization and Mapping (SLAM) remains difficult due to unreliable feature extraction, unstable motion tracking, and inconsistent global pose and map construction, particularly in dynamic large-scale outdoor environments. To address these challenges, we propose LST-SLAM, a novel large-scale stereo thermal SLAM system that achieves robust performance in complex, dynamic scenes. Our approach combines self-supervised thermal feature learning, stereo dual-level motion tracking, and geometric pose optimization. We also introduce a semantic-geometric hybrid constraint that suppresses potentially dynamic features lacking strong inter-frame geometric consistency. Furthermore, we develop an online incremental bag-of-words model for loop closure detection, coupled with global pose optimization to mitigate accumulated drift. Extensive experiments on kilometer-scale dynamic thermal datasets show that LST-SLAM significantly outperforms recent representative SLAM systems, including AirSLAM and DROID-SLAM, in both robustness and accuracy.
Chinese Translation
热成像相机在复杂的光照和天气条件下为机器人感知提供了强大的潜力。然而,由于特征提取不可靠、运动跟踪不稳定以及全局位姿和地图构建不一致,热成像同步定位与地图构建(SLAM)仍然面临困难,尤其是在动态的大规模户外环境中。为了解决这些挑战,我们提出了LST-SLAM,这是一种新颖的大规模立体热SLAM系统,能够在复杂的动态场景中实现稳健的性能。我们的方法结合了自监督热特征学习、立体双层运动跟踪和几何位姿优化。我们还引入了一种语义-几何混合约束,抑制缺乏强烈帧间几何一致性的潜在动态特征。此外,我们开发了一种在线增量词袋模型用于回环闭合检测,并结合全局位姿优化以减轻累积漂移。在公里级动态热成像数据集上的大量实验表明,LST-SLAM在稳健性和准确性方面显著优于包括AirSLAM和DROID-SLAM在内的近期代表性SLAM系统。
cs.RO / 26 / 2602.20958
EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations
基于扩展卡尔曼滤波的深度相机与深度学习融合,用于搜索与救援行动中的无人机-人距离估计与跟随
Abstract
Search and rescue (SAR) operations require rapid responses to save lives or property. Unmanned Aerial Vehicles (UAVs) equipped with vision-based systems support these missions through prior terrain investigation or real-time assistance during the mission itself. Vision-based UAV frameworks aid human search tasks by detecting and recognizing specific individuals, then tracking and following them while maintaining a safe distance. A key safety requirement for UAV following is the accurate estimation of the distance between camera and target object under real-world conditions, achieved by fusing multiple image modalities. UAVs with deep learning-based vision systems offer a new approach to the planning and execution of SAR operations. As part of the system for automatic people detection and face recognition using deep learning, in this paper we present the fusion of depth camera measurements and monocular camera-to-body distance estimation for robust tracking and following. Deep learning-based filtering of depth camera data and estimation of camera-to-body distance from a monocular camera are achieved with YOLO-pose, enabling real-time fusion of depth information using the Extended Kalman Filter (EKF) algorithm. The proposed subsystem, designed for use in drones, estimates and measures the distance between the depth camera and the human body keypoints, to maintain the safe distance between the drone and the human target. Our system provides an accurate estimated distance, which has been validated against motion capture ground truth data. The system has been tested in real time indoors, where it reduces the average error, root-mean-square error (RMSE), and standard deviation of the distance estimates by up to 15.3% across the three tested scenarios.
Chinese Translation
搜索与救援(SAR)操作需要快速响应以挽救生命或财产。配备视觉系统的无人机(UAV)通过事先的地形调查或在任务执行过程中的实时辅助来支持这些任务。基于视觉的无人机框架通过检测和识别特定个体,随后在保持安全距离的同时跟踪和跟随他们,从而帮助人类搜索任务。无人机跟随的一个关键安全要求是在真实环境条件下准确估计相机与目标物体之间的距离,这通过融合多种图像模态来实现。基于深度学习的视觉系统的无人机为SAR操作的规划和执行提供了一种新方法。作为使用深度学习进行自动人群检测和面部识别的系统的一部分,本文提出了深度相机测量与单目相机到人体距离估计的融合,以实现稳健的跟踪和跟随。通过YOLO-pose实现深度相机数据的基于深度学习的滤波和单目相机到人体距离的估计,利用扩展卡尔曼滤波(EKF)算法实现深度信息的实时融合。该子系统设计用于无人机,估计并测量深度相机与人体关键点之间的距离,以保持无人机与人类目标之间的安全距离。我们的系统提供了准确的距离估计,并已与运动捕捉的真实数据进行了验证。该系统在室内实时测试中表现良好,在三种测试场景中将平均误差、均方根误差(RMSE)和距离估计的标准差减少了多达15.3%。
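For a scalar camera-to-person distance, the EKF reduces to a linear Kalman filter; below is a sketch of one predict/update cycle fusing a depth-camera reading with a monocular (YOLO-pose-based) estimate. The constant-distance process model and the noise values are illustrative assumptions, not the paper's tuned filter:

```python
def kf_fuse(d, P, z_depth, r_depth, z_mono, r_mono, q=0.01):
    """One predict/update cycle of a scalar Kalman filter on the
    camera-to-person distance d with variance P. r_depth and r_mono are the
    measurement variances of the two sensors, q the process noise."""
    P = P + q                            # predict (constant-distance model)
    for z, r in ((z_depth, r_depth), (z_mono, r_mono)):
        k = P / (P + r)                  # Kalman gain
        d = d + k * (z - d)              # sequential measurement update
        P = (1 - k) * P
    return d, P
```

With a much smaller variance assigned to the depth camera, the fused estimate tracks it closely while the monocular channel keeps the filter alive when depth returns are invalid.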
cs.RO / 27 / 2602.20963
A Robotic Testing Platform for Pipelined Discovery of Resilient Soft Actuators
一种用于流水线式发现高韧性软驱动器的机器人测试平台
Abstract
Short lifetime under high electrical fields hinders the widespread robotic application of linear dielectric elastomer actuators (DEAs). Systematic scanning is difficult due to time-consuming per-sample testing and the high-dimensional parameter space affecting performance. To address this, we propose an optimization pipeline enabled by a novel testing robot capable of scanning DEA lifetime. The robot integrates electro-mechanical property measurement, programmable voltage input, and multi-channel testing capacity. Using it, we scanned the lifetime of Elastosil-based linear actuators across parameters including input voltage magnitude, frequency, electrode material concentration, and electrical connection filler. The optimal parameter combinations improved operational lifetime under boundary operating conditions by up to 100% and were subsequently scaled up to achieve higher force and displacement output. The final product demonstrated resilience on a modular, scalable quadruped walking robot with payload carrying capacity (>100% of its untethered body weight, and >700% of combined actuator weight). This work is the first to introduce a self-driving lab approach into robotic actuator design.
Chinese Translation
在高电场下,线性介电弹性体驱动器(DEAs)的短寿命限制了其在机器人应用中的广泛使用。由于每个样本测试耗时且高维参数空间影响性能,系统扫描变得困难。为了解决这一问题,我们提出了一种优化管道,利用一种新型测试机器人来扫描DEAs的寿命。该机器人集成了电机机械性能测量、可编程电压输入和多通道测试能力。通过使用该机器人,我们扫描了基于Elastosil的线性驱动器在输入电压幅度、频率、电极材料浓度和电连接填料等参数下的寿命。最佳参数组合在边界操作条件下将操作寿命提高了多达100%,并随后进行了放大以实现更高的力和位移输出。最终产品在一个模块化、可扩展的四足行走机器人上展示了韧性,具备超过其无绳体重的载荷能力(>100%)和超过组合驱动器重量的载荷能力(>700%)。本研究首次将自驱动实验室的方法引入机器人驱动器设计中。
cs.RO / 28 / 2602.21013
Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks
自我笔记:用于记忆依赖操作任务的草稿本增强VLA模型
Abstract
Many dexterous manipulation tasks are non-Markovian in nature, yet little attention has been paid to this fact in the recent upsurge of the vision-language-action (VLA) paradigm. Although VLAs have been successful in bringing internet-scale semantic understanding to robotics, existing models are primarily "stateless" and struggle with memory-dependent, long-horizon tasks. In this work, we explore a way to impart both spatial and temporal memory to a VLA by incorporating a language scratchpad. The scratchpad makes it possible to memorize task-specific information, such as object positions, and it allows the model to keep track of a plan and progress towards subgoals within that plan. We evaluate this approach on a split of memory-dependent tasks from the ClevrSkills environment, on MemoryBench, as well as on a challenging real-world pick-and-place task. We show that incorporating a language scratchpad significantly improves generalization on these tasks for both non-recurrent and recurrent models.
Chinese Translation
许多灵巧的操作任务本质上是非马尔可夫的,但在最近兴起的视觉-语言-动作(VLA)范式中对此事实关注甚少。尽管现有的VLA在将互联网规模的语义理解引入机器人领域方面取得了成功,但它们主要是“无状态”的,难以处理依赖记忆的长时间任务。在本研究中,我们探索了一种通过引入语言草稿本来赋予VLA空间和时间记忆的方法。草稿本使得记忆特定任务信息(如物体位置)成为可能,并允许模型跟踪计划及其在该计划中朝子目标的进展。我们在ClevrSkills环境中的一组记忆依赖任务、MemoryBench以及一个具有挑战性的现实世界拾取和放置任务上评估了这种方法。我们展示了引入语言草稿本显著提高了这些任务中无递归和递归模型的泛化能力。
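The scratchpad format is not detailed in the abstract; a minimal stand-in, sketched under that caveat, stores free-text notes and prepends them to each observation prompt so the (otherwise stateless) policy can re-read what it wrote earlier:

```python
class Scratchpad:
    """Minimal language scratchpad (illustrative; not the paper's exact
    format): the policy appends notes it may need later and re-reads them
    on every step via the rendered prompt."""
    def __init__(self):
        self.notes = []

    def write(self, note):
        self.notes.append(note)

    def render(self, observation_text):
        header = "\n".join(f"note[{i}]: {n}" for i, n in enumerate(self.notes))
        body = f"obs: {observation_text}"
        return f"{header}\n{body}" if self.notes else body
```

Writing "red cube stored in bin A" at pickup time, for instance, lets the model answer "where did the red cube go?" many steps later without the object being visible.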
cs.RO / 29 / 2602.21028
Surface-based Manipulation Using Tunable Compliant Porous-Elastic Soft Sensing
基于可调柔顺多孔弹性软传感的表面操作
Abstract
There is a growing need for soft robotic platforms that perform gentle, precise handling of a wide variety of objects. Existing surface-based manipulation systems, however, lack the compliance and tactile feedback needed for delicate handling. This work introduces COmpliant Porous-Elastic Soft Sensing (COPESS), integrated with inductive sensors, for adaptive object manipulation and localised sensing. The design features a tunable lattice layer that simultaneously modulates mechanical compliance and sensing performance. By adjusting lattice geometry, both stiffness and sensor response can be tailored to handle objects with varying mechanical properties. Experiments demonstrate that by easily adjusting one parameter, the lattice density, from 7 % to 20 %, it is possible to significantly alter the sensitivity and operational force range (about -23x and 9x, respectively). This approach establishes a blueprint for creating adaptive, sensorized surfaces where mechanical and sensory properties are co-optimized, enabling passive, yet programmable, delicate manipulation.
Chinese Translation
对能够温和、精确地处理各种物体的软机器人平台的需求日益增长。然而,现有的基于表面的操作系统缺乏进行精细处理所需的柔性和触觉反馈。本研究介绍了一种集成感应传感器的可调柔性多孔弹性软传感器(COmpliant Porous-Elastic Soft Sensing, COPESS),用于自适应物体操作和局部感知。该设计具有一个可调的格子层,能够同时调节机械柔性和感知性能。通过调整格子几何形状,可以定制刚度和传感器响应,以处理具有不同机械特性的物体。实验表明,通过轻松调整一个参数,即格子密度,从7%到20%,可以显著改变灵敏度和操作力范围(分别约为-23倍和9倍)。这种方法为创建自适应的、传感化的表面奠定了基础,在这些表面上,机械和感知特性得以共同优化,从而实现被动但可编程的精细操作。
cs.RO / 30 / 2602.21119
Cooperative-Competitive Team Play of Real-World Craft Robots
现实世界工艺机器人之合作-竞争团队游戏
Abstract
Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years. However, the efficient training of collective robots using multi-agent RL and the transfer of learned policies to real-world applications remain open research questions. In this work, we first develop a comprehensive robotic system, including simulation, distributed learning framework, and physical robot components. We then propose and evaluate reinforcement learning techniques designed for efficient training of cooperative and competitive policies on this platform. To address the challenges of multi-agent sim-to-real transfer, we introduce Out of Distribution State Initialization (OODSI) to mitigate the impact of the sim-to-real gap. In the experiments, OODSI improves the Sim2Real performance by 20%. We demonstrate the effectiveness of our approach through experiments with a multi-robot car competitive game and a cooperative task in real-world settings.
Chinese Translation
近年来,多智能体深度强化学习(RL)在开发智能游戏代理方面取得了显著进展。然而,利用多智能体RL高效训练集体机器人以及将学习到的策略转移到现实世界应用中仍然是未解决的研究问题。在本研究中,我们首先开发了一个综合的机器人系统,包括仿真、分布式学习框架和物理机器人组件。然后,我们提出并评估了旨在高效训练合作和竞争策略的强化学习技术,以适应该平台。为了解决多智能体从仿真到现实转移的挑战,我们引入了分布外状态初始化(Out of Distribution State Initialization, OODSI),以减轻仿真与现实之间差距的影响。在实验中,OODSI使Sim2Real性能提高了20%。我们通过在现实世界环境中进行多机器人汽车竞争游戏和合作任务的实验,展示了我们方法的有效性。
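The abstract does not define OODSI precisely; one plausible reading, sketched under that assumption (function name, widening rule, and constants are all hypothetical), is to occasionally reset training episodes from a widened state distribution so the policy also learns from states it would only reach through sim-to-real error:

```python
import random

def oodsi_reset(nominal_low, nominal_high, widen=0.5, p_ood=0.3, rng=None):
    """Out-of-Distribution State Initialization (sketch): with probability
    p_ood, sample each state dimension from a range widened by `widen` times
    its nominal span; otherwise sample from the nominal range."""
    rng = rng or random.Random()
    ood = rng.random() < p_ood
    state = []
    for lo, hi in zip(nominal_low, nominal_high):
        margin = widen * (hi - lo) if ood else 0.0
        state.append(rng.uniform(lo - margin, hi + margin))
    return state, ood
```

Exposing the policy to such off-nominal starts during simulated training is a standard way to buy robustness against the state drift encountered on physical robots.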
cs.RO / 31 / 2602.21148
A Micro-Macro Model of Encounter-Driven Information Diffusion in Robot Swarms
机器人群中遭遇驱动信息扩散的微观-宏观模型
Abstract
In this paper, we propose the problem of Encounter-Driven Information Diffusion (EDID). In EDID, robots are allowed to exchange information only upon meeting. Crucially, EDID assumes that the robots are not allowed to schedule their meetings. As such, the robots have no means to anticipate when, where, and who they will meet. As a step towards the design of storage and routing algorithms for EDID, in this paper we propose a model of information diffusion that captures the essential dynamics of EDID. The model is derived from first principles and is composed of two levels: a micro model, based on a generalization of the concept of `mean free path'; and a macro model, which captures the global dynamics of information diffusion. We validate the model through extensive robot simulations, in which we consider swarm size, communication range, environment size, and different random motion regimes. We conclude the paper with a discussion of the implications of this model on the algorithms that best support information diffusion according to the parameters of interest.
Chinese Translation
在本文中,我们提出了遭遇驱动信息扩散(Encounter-Driven Information Diffusion, EDID)的问题。在EDID中,机器人仅在相遇时交换信息。关键在于,EDID假设机器人不能安排它们的会面。因此,机器人无法预见何时、何地以及与谁会面。作为设计EDID存储和路由算法的一步,本文提出了一个捕捉EDID基本动态的信息扩散模型。该模型源于基本原理,由两个层次组成:微观模型,基于“平均自由程”(mean free path)概念的推广;以及宏观模型,捕捉信息扩散的全局动态。我们通过广泛的机器人仿真验证了该模型,考虑了群体规模、通信范围、环境大小以及不同的随机运动模式。最后,我们讨论了该模型对根据感兴趣参数最佳支持信息扩散的算法的影响。
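Under kinetic-theory assumptions (the abstract's `mean free path' generalization), a macro model of encounter-driven diffusion reduces to SI-style logistic dynamics; the encounter-rate constant below is a rough 2-D approximation used for illustration, not the paper's derived model:

```python
def diffusion_curve(n_robots, comm_range, speed, area, i0=1, dt=0.1, t_end=30.0):
    """SI-style macro model of encounter-driven diffusion: treating robots as
    gas particles gives a pairwise encounter rate lam ~ 2*comm_range*speed/area
    (2-D kinetic theory), so informed robots I grow as dI/dt = lam*I*(N - I)."""
    lam = 2.0 * comm_range * speed / area
    informed, t, curve = float(i0), 0.0, []
    while t < t_end:
        informed += dt * lam * informed * (n_robots - informed)  # Euler step
        t += dt
        curve.append(informed)
    return curve
```

The curve rises slowly while few robots carry the information, accelerates once encounters between informed and uninformed robots dominate, and saturates at the swarm size.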
cs.RO / 32 / 2602.21157
HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning
HALO:一种统一的视觉-语言-行动模型用于具身多模态链式思维推理
Abstract
Vision-Language-Action (VLA) models have shown strong performance in robotic manipulation, but often struggle in long-horizon or out-of-distribution scenarios due to the lack of explicit mechanisms for multimodal reasoning and anticipating how the world will evolve under action. Recent works introduce textual chain-of-thought or visual subgoal prediction within VLA models to reason, but still fail to offer a unified human-like reasoning framework for joint textual reasoning, visual foresight, and action prediction. To this end, we propose HALO, a unified VLA model that enables embodied multimodal chain-of-thought (EM-CoT) reasoning through a sequential process of textual task reasoning, visual subgoal prediction for fine-grained guidance, and EM-CoT-augmented action prediction. We instantiate HALO with a Mixture-of-Transformers (MoT) architecture that decouples semantic reasoning, visual foresight, and action prediction into specialized experts while allowing seamless cross-expert collaboration. To enable HALO learning at scale, we introduce an automated pipeline to synthesize EM-CoT training data along with a carefully crafted training recipe. Extensive experiments demonstrate that: (1) HALO achieves superior performance in both simulated and real-world environments, surpassing baseline policy pi_0 by 34.1% on RoboTwin benchmark; (2) all proposed components of the training recipe and EM-CoT design help improve task success rate; and (3) HALO exhibits strong generalization capabilities under aggressive unseen environmental randomization with our proposed EM-CoT reasoning.
Chinese Translation
视觉-语言-行动(VLA)模型在机器人操作中表现出色,但在长时间跨度或分布外场景中常常面临挑战,因为缺乏明确的多模态推理机制以及预判在行动下世界如何演变的能力。近期的研究在VLA模型中引入了文本链式思维或视觉子目标预测来进行推理,但仍未能提供一个统一的人类式推理框架,以实现联合文本推理、视觉前瞻和行动预测。为此,我们提出了HALO,一种统一的VLA模型,通过文本任务推理、细粒度指导的视觉子目标预测和增强的具身多模态链式思维(EM-CoT)行动预测的顺序过程,来实现具身多模态链式思维推理。我们采用混合变换器(Mixture-of-Transformers, MoT)架构来实例化HALO,该架构将语义推理、视觉前瞻和行动预测解耦为专门的专家,同时允许无缝的跨专家协作。为了实现HALO的大规模学习,我们引入了一个自动化管道来合成EM-CoT训练数据,并制定了精心设计的训练方案。大量实验表明:(1)HALO在模拟和真实环境中均表现优异,在RoboTwin基准测试中超越基线策略pi_0 34.1%;(2)训练方案和EM-CoT设计中提出的所有组件均有助于提高任务成功率;(3)HALO在我们提出的EM-CoT推理下,展现出强大的泛化能力,能够应对激进的未见环境随机化。
cs.RO / 33 / 2602.21161
ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking
ActionReasoning:基于大型语言模型的三维空间机器人动作推理,用于机器人砖块堆叠
Abstract
Classical robotic systems typically rely on custom planners designed for constrained environments. While effective in restricted settings, these systems lack generalization capabilities, limiting the scalability of embodied AI and general-purpose robots. Recent data-driven Vision-Language-Action (VLA) approaches aim to learn policies from large-scale simulation and real-world data. However, the continuous action space of the physical world significantly exceeds the representational capacity of linguistic tokens, making it unclear if scaling data alone can yield general robotic intelligence. To address this gap, we propose ActionReasoning, an LLM-driven framework that performs explicit action reasoning to produce physics-consistent, prior-guided decisions for robotic manipulation. ActionReasoning leverages the physical priors and real-world knowledge already encoded in Large Language Models (LLMs) and structures them within a multi-agent architecture. We instantiate this framework on a tractable case study of brick stacking, where the environment states are assumed to be already accurately measured. The environmental states are then serialized and passed to a multi-agent LLM framework that generates physics-aware action plans. The experiments demonstrate that the proposed multi-agent LLM framework enables stable brick placement while shifting effort from low-level domain-specific coding to high-level tool invocation and prompting, highlighting its potential for broader generalization. This work introduces a promising approach to bridging perception and execution in robotic manipulation by integrating physical reasoning with LLMs.
Chinese Translation
传统的机器人系统通常依赖于为受限环境设计的定制规划器。尽管在限制性环境中有效,这些系统缺乏泛化能力,限制了具身人工智能和通用机器人的可扩展性。最近的数据驱动的视觉-语言-动作(Vision-Language-Action, VLA)方法旨在从大规模模拟和现实世界数据中学习策略。然而,物理世界的连续动作空间显著超出了语言符号的表征能力,使得仅依靠数据扩展是否能够产生通用机器人智能变得不明确。为了解决这一问题,我们提出了ActionReasoning,一个基于大型语言模型(LLM)的框架,通过显式的动作推理生成物理一致的、基于先验知识的机器人操作决策。ActionReasoning利用已经编码在大型语言模型中的物理先验和现实世界知识,并将其结构化为多智能体架构。我们在一个可处理的砖块堆叠案例研究中实例化该框架,其中环境状态被假设为已准确测量。然后将环境状态序列化并传递给一个多智能体LLM框架,该框架生成物理感知的动作计划。实验表明,所提出的多智能体LLM框架能够实现稳定的砖块放置,同时将努力从低级领域特定编码转移到高级工具调用和提示,突显其更广泛泛化的潜力。本研究提出了一种有前景的方法,通过将物理推理与大型语言模型相结合,弥合机器人操作中的感知与执行之间的鸿沟。
cs.RO / 34 / 2602.21174
Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids
高效的多分辨率三维网格层次任意角路径规划
Abstract
Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information. Yet widely used path planning methods such as sampling and trajectory optimization do not exploit this explicit connectivity information, and search-based methods such as A* suffer from scalability issues in large-scale high-resolution maps. In many applications, Euclidean shortest paths form the underpinning of the navigation system. For such applications, any-angle planning methods, which find optimal paths by connecting corners of obstacles with straight-line segments, provide a simple and efficient solution. In this paper, we present a method that has the optimality and completeness properties of any-angle planners while overcoming computational tractability issues common to search-based methods by exploiting multi-resolution representations. Extensive experiments on real and synthetic environments demonstrate the proposed approach's solution quality and speed, outperforming even sampling-based methods. The framework is open-sourced to allow the robotics and planning community to build on our research.
Chinese Translation
层次化的多分辨率体积映射方法广泛用于表示大型复杂环境,因为它们能够有效捕捉环境的占用和连通性信息。然而,广泛使用的路径规划方法(如采样和轨迹优化)并未利用这些显式的连通性信息,而基于搜索的方法(如 A* 算法)在大规模高分辨率地图中面临可扩展性问题。在许多应用中,欧几里得最短路径构成导航系统的基础。对于此类应用,任意角规划方法通过用直线段连接障碍物的角点来寻找最优路径,提供了一种简单而高效的解决方案。本文提出了一种方法,该方法具备任意角规划器的最优性和完备性特性,同时通过利用多分辨率表示克服了基于搜索的方法常见的计算可行性问题。在真实和合成环境中的大量实验表明,所提出的方法在解的质量和速度上均表现出色,甚至优于基于采样的方法。该框架已开源,以便机器人和规划社区在我们的研究基础上进行进一步探索。
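The primitive underlying any-angle planners is a line-of-sight query between cells; here is a dense-sampling sketch on a uniform 3-D grid (the paper's contribution is performing such queries efficiently on multi-resolution maps, which this toy version does not capture; an exact voxel traversal would replace the sampling):

```python
import math

def line_of_sight(a, b, occupied, step=0.25):
    """Dense-sampling line-of-sight check between two 3-D points on a unit
    voxel grid; `occupied` is a set of blocked (ix, iy, iz) cells."""
    n = max(1, int(math.dist(a, b) / step))
    for i in range(n + 1):
        t = i / n
        cell = tuple(math.floor(a[k] + t * (b[k] - a[k])) for k in range(3))
        if cell in occupied:
            return False
    return True
```

Any-angle searches such as Theta* call this check to rewire a node directly to its parent's parent whenever visibility holds, producing straight-line paths that are not constrained to grid edges.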
cs.RO / 35 / 2602.21203
Squint: Fast Visual Reinforcement Learning for Sim-to-Real Robotics
Squint:用于仿真到现实机器人技术的快速视觉强化学习
Abstract
Visual reinforcement learning is appealing for robotics but expensive -- off-policy methods are sample-efficient yet slow; on-policy methods parallelize well but waste samples. Recent work has shown that off-policy methods can train faster than on-policy methods in wall-clock time for state-based control. Extending this to vision remains challenging, where high-dimensional input images complicate training dynamics and introduce substantial storage and encoding overhead. To address these challenges, we introduce Squint, a visual Soft Actor Critic method that achieves faster wall-clock training than prior visual off-policy and on-policy methods. Squint achieves this via parallel simulation, a distributional critic, resolution squinting, layer normalization, a tuned update-to-data ratio, and an optimized implementation. We evaluate on the SO-101 Task Set, a new suite of eight manipulation tasks in ManiSkill3 with heavy domain randomization, and demonstrate sim-to-real transfer to a real SO-101 robot. We train policies for 15 minutes on a single RTX 3090 GPU, with most tasks converging in under 6 minutes.
Chinese Translation
视觉强化学习在机器人领域具有吸引力,但代价高昂——离策略(off-policy)方法样本效率高但速度较慢;同策略(on-policy)方法并行化效果良好但浪费样本。近期研究表明,对于基于状态的控制,离策略方法在实际运行时间上可以比同策略方法训练得更快。将这一结论扩展到视觉领域仍然具有挑战性,因为高维输入图像使训练动态复杂化,并引入了大量存储和编码开销。为了解决这些挑战,我们提出了Squint,一种视觉软演员-评论家(Soft Actor Critic)方法,其实际训练时间快于以往的视觉离策略和同策略方法。Squint通过并行仿真、分布型评论家(distributional critic)、分辨率缩减(resolution squinting)、层归一化、调优的更新-数据比(update-to-data ratio)以及优化的实现来达成这一目标。我们在SO-101任务集上进行了评估,这是ManiSkill3中一个包含八个操作任务、带有大量领域随机化的新任务套件,并展示了向真实SO-101机器人的仿真到现实迁移。我们在单个RTX 3090 GPU上训练策略15分钟,大多数任务在6分钟内收敛。
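Of the listed ingredients, "resolution squinting" most directly attacks the storage and encoding overhead the abstract mentions. Its exact mechanism is not spelled out there, so the sketch below assumes it amounts to downsampling observations before encoding; the pooling factor and shapes are illustrative.

```python
import numpy as np

def squint(obs, factor=2):
    """Average-pool an HxWxC observation by an integer factor."""
    h, w, c = obs.shape
    assert h % factor == 0 and w % factor == 0
    return obs.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

obs = np.arange(4 * 4 * 1, dtype=float).reshape(4, 4, 1)
small = squint(obs)
print(small.shape)     # (2, 2, 1)
print(small[0, 0, 0])  # mean of the top-left 2x2 block: 2.5
```

Under this reading, a 2x squint cuts encoder FLOPs and replay-buffer storage roughly 4x per frame.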
cs.CV / 1 / 2602.20165
VISION-ICE: Video-based Interpretation and Spatial Identification of Arrhythmia Origins via Neural Networks in Intracardiac Echocardiography
VISION-ICE:基于视频的心内超声中心律失常起源的解读与空间识别的神经网络方法
Abstract
Contemporary high-density mapping techniques and preoperative CT/MRI remain time and resource intensive in localizing arrhythmias. AI has been validated as a clinical decision aid in providing accurate, rapid real-time analysis of echocardiographic images. Building on this, we propose an AI-enabled framework that leverages intracardiac echocardiography (ICE), a routine part of electrophysiology procedures, to guide clinicians toward areas of arrhythmogenesis and potentially reduce procedural time. Arrhythmia source localization is formulated as a three-class classification task, distinguishing normal sinus rhythm, left-sided, and right-sided arrhythmias, based on ICE video data. We developed a 3D Convolutional Neural Network trained to discriminate among the three aforementioned classes. In ten-fold cross-validation, the model achieved a mean accuracy of 66.2% when evaluated on four previously unseen patients (substantially outperforming the 33.3% random baseline). These results demonstrate the feasibility and clinical promise of using ICE videos combined with deep learning for automated arrhythmia localization. Leveraging ICE imaging could enable faster, more targeted electrophysiological interventions and reduce the procedural burden of cardiac ablation. Future work will focus on expanding the dataset to improve model robustness and generalizability across diverse patient populations.
Chinese Translation
现代高密度标测技术和术前CT/MRI在定位心律失常方面仍然耗时且资源密集。人工智能(AI)已被验证为临床决策辅助工具,能够对超声心动图图像提供准确、快速的实时分析。在此基础上,我们提出了一种AI驱动的框架,利用心内超声(ICE)这一电生理手术的常规环节,引导临床医生定位心律失常发生区域,并可能缩短手术时间。心律失常源定位被构建为一个三分类任务,基于ICE视频数据区分正常窦性心律、左侧和右侧心律失常。我们开发了一种三维卷积神经网络,用于区分上述三类。在十折交叉验证中,该模型在四名未见过的患者上达到了66.2%的平均准确率(显著优于33.3%的随机基线)。这些结果展示了结合ICE视频和深度学习进行自动化心律失常定位的可行性和临床前景。利用ICE成像可以实现更快速、更有针对性的电生理干预,并减轻心脏消融的手术负担。未来的工作将集中于扩展数据集,以提高模型在不同患者群体中的鲁棒性和普适性。
cs.CV / 2 / 2602.20205
OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport
OTPrune:通过最优传输实现分布对齐的视觉标记剪枝
Abstract
Multi-modal large language models (MLLMs) achieve strong visual-language reasoning but suffer from high inference cost due to redundant visual tokens. Recent work explores visual token pruning to accelerate inference, but existing pruning methods overlook the underlying distributional structure of visual representations. We propose OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT). By minimizing the 2-Wasserstein distance between the full and pruned token distributions, OTPrune preserves both local diversity and global representativeness while reducing inference cost. Moreover, we derive a tractable submodular objective that enables efficient optimization, and theoretically prove its monotonicity and submodularity, providing a principled foundation for stable and efficient pruning. We further provide a comprehensive analysis that explains how distributional alignment contributes to stable and semantically faithful pruning. Comprehensive experiments on a wide range of benchmarks demonstrate that OTPrune achieves superior performance-efficiency tradeoffs compared to state-of-the-art methods. The code is available at https://github.com/xiwenc1/OTPrune.
Chinese Translation
多模态大型语言模型(MLLMs)在视觉语言推理方面表现出色,但由于冗余的视觉标记,推理成本较高。近期的研究探索了视觉标记剪枝以加速推理,而现有的剪枝方法忽视了视觉表示的潜在分布结构。我们提出了OTPrune,一个无训练的框架,将剪枝形式化为通过最优传输(OT)实现的分布对齐。通过最小化完整和剪枝标记分布之间的2-瓦瑟斯坦距离,OTPrune在降低推理成本的同时,保持了局部多样性和全局代表性。此外,我们推导出一个可处理的子模目标,使得优化高效,并理论上证明了其单调性和子模性,为稳定和高效的剪枝提供了原则基础。我们进一步提供了全面的分析,解释了分布对齐如何有助于稳定和语义忠实的剪枝。在更广泛的基准测试中进行的全面实验表明,OTPrune在性能与效率的权衡上优于最先进的方法。代码可在 https://github.com/xiwenc1/OTPrune 获取。
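A monotone submodular objective, as the abstract derives, admits the classic greedy algorithm with a (1 - 1/e) approximation guarantee. The exact objective is not given here, so this toy uses a standard facility-location surrogate, F(S) = Σ_i max_{j∈S} sim(i, j), to show how greedy selection keeps a token subset that covers the full token distribution; the function name, clusters, and sizes are illustrative.

```python
import numpy as np

def greedy_prune(tokens, keep):
    """Greedily pick `keep` rows that best cover all rows (facility location)."""
    sim = tokens @ tokens.T                   # (n, n) inner-product similarity
    n = len(tokens)
    cover = np.zeros(n)                       # cover[i] = best sim(i, kept set) so far
    selected = []
    for _ in range(keep):
        # marginal coverage gain of adding each candidate column j
        gains = (np.maximum(sim, cover[:, None]) - cover[:, None]).sum(axis=0)
        gains[selected] = -np.inf             # forbid repeats
        j = int(np.argmax(gains))
        selected.append(j)
        cover = np.maximum(cover, sim[:, j])
    return sorted(selected)

rng = np.random.default_rng(0)
# two tight clusters of "visual tokens": a representative pruner keeps one of each
tokens = np.vstack([rng.normal(0.0, 0.05, (5, 8)) + 1.0,
                    rng.normal(0.0, 0.05, (5, 8)) - 1.0])
kept = greedy_prune(tokens, keep=2)
print(kept)  # one index from cluster 0-4 and one from cluster 5-9
```

Attention-score pruners would keep whichever cluster scores higher; a coverage objective like this keeps both, which is the distribution-alignment intuition in the abstract.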
cs.CV / 3 / 2602.20291
De-rendering, Reasoning, and Repairing Charts with Vision-Language Models
使用视觉-语言模型进行图表的去渲染、推理和修复
Abstract
Data visualizations are central to scientific communication, journalism, and everyday decision-making, yet they are frequently prone to errors that can distort interpretation or mislead audiences. Rule-based visualization linters can flag violations, but they miss context and do not suggest meaningful design changes. Directly querying general-purpose LLMs about visualization quality is unreliable: lacking training to follow visualization design principles, they often produce inconsistent or incorrect feedback. In this work, we introduce a framework that combines chart de-rendering, automated analysis, and iterative improvement to deliver actionable, interpretable feedback on visualization design. Our system reconstructs the structure of a chart from an image, identifies design flaws using vision-language reasoning, and proposes concrete modifications supported by established principles in visualization research. Users can selectively apply these improvements and re-render updated figures, creating a feedback loop that promotes both higher-quality visualizations and the development of visualization literacy. In our evaluation on 1,000 charts from the Chart2Code benchmark, the system generated 10,452 design recommendations, which clustered into 10 coherent categories (e.g., axis formatting, color accessibility, legend consistency). These results highlight the promise of LLM-driven recommendation systems for delivering structured, principle-based feedback on visualization design, opening the door to more intelligent and accessible authoring tools.
Chinese Translation
数据可视化在科学传播、新闻报道和日常决策中至关重要,但它们常常容易出现错误,从而扭曲解读或误导受众。基于规则的可视化检查工具可以标记违规情况,但它们缺乏上下文,且未能提出有意义的设计改进建议。直接询问通用大型语言模型(LLMs)关于可视化质量的反馈是不可靠的:由于缺乏遵循可视化设计原则的训练,它们常常产生不一致或不正确的反馈。在本研究中,我们提出了一个框架,结合图表去渲染、自动化分析和迭代改进,以提供可操作的、可解释的可视化设计反馈。我们的系统从图像中重建图表的结构,利用视觉-语言推理识别设计缺陷,并提出基于可视化研究中已建立原则的具体修改建议。用户可以选择性地应用这些改进并重新渲染更新后的图形,形成一个反馈循环,促进更高质量的可视化和可视化素养的发展。在我们对来自Chart2Code基准的1,000个图表的评估中,该系统生成了10,452条设计建议,这些建议聚集成10个一致的类别(例如,轴格式、颜色可访问性、图例一致性)。这些结果突显了基于LLM的推荐系统在提供结构化、基于原则的可视化设计反馈方面的潜力,为更智能和更易于访问的创作工具铺平了道路。
cs.CV / 4 / 2602.20312
N4MC: Neural 4D Mesh Compression
N4MC:神经4D网格压缩
Abstract
We present N4MC, the first 4D neural compression framework to efficiently compress time-varying mesh sequences by exploiting their temporal redundancy. Unlike prior neural mesh compression methods that treat each mesh frame independently, N4MC takes inspiration from inter-frame compression in 2D video codecs, and learns motion compensation in long mesh sequences. Specifically, N4MC converts consecutive irregular mesh frames into regular 4D tensors to provide a uniform and compact representation. These tensors are then condensed using an auto-decoder, which captures both spatial and temporal correlations for redundancy removal. To enhance temporal coherence, we introduce a transformer-based interpolation model that predicts intermediate mesh frames conditioned on latent embeddings derived from tracked volume centers, eliminating motion ambiguities. Extensive evaluations show that N4MC outperforms state-of-the-art in rate-distortion performance, while enabling real-time decoding of 4D mesh sequences. The implementation of our method is available at: https://github.com/frozzzen3/N4MC.
Chinese Translation
我们提出了N4MC,这是第一个4D神经压缩框架,通过利用时间变化网格序列的时间冗余来高效压缩。与之前将每个网格帧独立处理的神经网格压缩方法不同,N4MC受到2D视频编码器中帧间压缩的启发,学习长网格序列中的运动补偿。具体而言,N4MC将连续的不规则网格帧转换为规则的4D张量,以提供统一且紧凑的表示。这些张量随后通过自动解码器进行压缩,该解码器捕捉空间和时间相关性以去除冗余。为了增强时间一致性,我们引入了一种基于变换器的插值模型,该模型根据从跟踪体积中心提取的潜在嵌入预测中间网格帧,从而消除运动模糊。大量评估表明,N4MC在速率失真性能上优于最先进的方法,同时实现了4D网格序列的实时解码。我们方法的实现可在以下链接获取:https://github.com/frozzzen3/N4MC。
cs.CV / 5 / 2602.20328
GSNR: Graph Smooth Null-Space Representation for Inverse Problems
GSNR:用于逆问题的图平滑零空间表示
Abstract
Inverse problems in imaging are ill-posed, leading to infinitely many solutions consistent with the measurements due to the non-trivial null-space of the sensing matrix. Common image priors, such as sparsity, smoothness, or score functions, promote solutions on the general image manifold. However, as these priors do not constrain the null-space component, they can bias the reconstruction. Thus, we aim to incorporate meaningful null-space information in the reconstruction framework. Inspired by smooth image representation on graphs, we propose Graph-Smooth Null-Space Representation (GSNR), a mechanism that imposes structure only on the invisible component. Particularly, given a graph Laplacian, we construct a null-restricted Laplacian that encodes similarity between neighboring pixels in the null-space signal, and we design a low-dimensional projection matrix from the $p$-smoothest spectral graph modes (lowest graph frequencies). This approach has strong theoretical and practical implications: i) improved convergence via a null-only graph regularizer, ii) better coverage (how much null-space variance is captured by the $p$ modes), and iii) high predictability (how well these modes can be inferred from the measurements). GSNR is incorporated into well-known inverse problem solvers, e.g., PnP, DIP, and diffusion solvers, in four scenarios: image deblurring, compressed sensing, demosaicing, and image super-resolution, providing consistent improvement of up to 4.3 dB over baseline formulations and up to 1 dB compared with end-to-end learned models in terms of PSNR.
Chinese Translation
成像中的逆问题是病态的,由于感知矩阵的非平凡零空间,导致与测量一致的解有无穷多个。常见的图像先验(如稀疏性、平滑性或得分函数)使解偏向一般的图像流形。然而,由于这些先验并未约束零空间成分,它们可能会偏倚重建。因此,我们的目标是在重建框架中融入有意义的零空间信息。受到图上平滑图像表示的启发,我们提出了图平滑零空间表示(Graph-Smooth Null-Space Representation,GSNR),这是一种仅对不可见成分施加结构的机制。特别地,给定一个图拉普拉斯算子,我们构造了一个限制零空间的拉普拉斯算子,该算子编码了零空间信号中相邻像素之间的相似性,并设计了一个低维投影矩阵,该矩阵由 $p$ 个最平滑的谱图模式(最低图频率)构成。这种方法具有强大的理论和实践意义:i) 通过仅针对零空间的图正则化器改善收敛性,ii) 更好的覆盖性($p$ 个模式捕获了多少零空间方差),以及 iii) 高可预测性(这些模式能在多大程度上从测量中推断出来)。GSNR被纳入了众所周知的逆问题求解器,例如PnP、DIP和扩散求解器,在四种场景中:图像去模糊、压缩感知、去马赛克和图像超分辨率,在PSNR方面相比基线公式带来最高4.3 dB的一致提升,相比端到端学习模型最高提升1 dB。
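The construction in the abstract can be reproduced numerically at toy scale. A hedged sketch, assuming a 1D signal, a 2x-downsampling sensing matrix, and a path-graph Laplacian (all illustrative choices): build an orthonormal null-space basis N of A, form the null-restricted Laplacian NᵀLN, and take its p smoothest modes.

```python
import numpy as np

n, p = 8, 2

# sensing matrix: keep every other sample of a length-8 signal (toy choice)
A = np.zeros((n // 2, n))
A[np.arange(n // 2), 2 * np.arange(n // 2)] = 1.0

# orthonormal basis of null(A) via SVD
_, s, Vt = np.linalg.svd(A)
N = Vt[(s > 1e-10).sum():].T                  # (n, n - rank)

# path-graph Laplacian: similarity between neighboring pixels
L = np.diag([1.0] + [2.0] * (n - 2) + [1.0])
L -= np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)

# null-restricted Laplacian and its p smoothest modes
Ln = N.T @ L @ N
_, V = np.linalg.eigh(Ln)                     # eigenvalues in ascending order
P = N @ V[:, :p]                              # low-dimensional projection matrix

print(bool(np.abs(A @ P).max() < 1e-8))       # invisible to the sensor: True
```

By construction every column of P satisfies A P = 0, so the regularizer shapes only the component the measurements cannot see, matching the paper's motivation.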
cs.CV / 6 / 2602.20330
Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking
视觉-语言模型中的电路追踪:理解多模态思维的内部机制
Abstract
Vision-language models (VLMs) are powerful but remain opaque black boxes. We introduce the first framework for transparent circuit tracing in VLMs to systematically analyze multimodal reasoning. By utilizing transcoders, attribution graphs, and attention-based methods, we uncover how VLMs hierarchically integrate visual and semantic concepts. We reveal that distinct visual feature circuits can handle mathematical reasoning and support cross-modal associations. Validated through feature steering and circuit patching, our framework proves these circuits are causal and controllable, laying the groundwork for more explainable and reliable VLMs.
Chinese Translation
视觉-语言模型(VLMs)功能强大,但仍然是一个不透明的黑箱。我们提出了第一个透明电路追踪框架,以系统地分析多模态推理。通过利用转码器(transcoders)、归因图(attribution graphs)和基于注意力的方法,我们揭示了VLMs如何分层整合视觉和语义概念。我们发现,独特的视觉特征电路能够处理数学推理并支持跨模态关联。通过特征引导和电路修补进行验证,我们的框架证明了这些电路是因果的且可控的,为更具解释性和可靠性的VLMs奠定了基础。
cs.CV / 7 / 2602.20342
Large-scale Photorealistic Outdoor 3D Scene Reconstruction from UAV Imagery Using Gaussian Splatting Techniques
基于高斯喷溅技术的大规模真实感户外3D场景重建:来自无人机影像的研究
Abstract
In this study, we present an end-to-end pipeline capable of converting drone-captured video streams into high-fidelity 3D reconstructions with minimal latency. Unmanned aerial vehicles (UAVs) are extensively used in aerial real-time perception applications. Moreover, recent advances in 3D Gaussian Splatting (3DGS) have demonstrated significant potential for real-time neural rendering. However, their integration into end-to-end UAV-based reconstruction and visualization systems remains underexplored. Our goal is to propose an efficient architecture that combines live video acquisition via RTMP streaming, synchronized sensor fusion, camera pose estimation, and 3DGS optimization, achieving continuous model updates and low-latency deployment within interactive visualization environments that support immersive augmented and virtual reality (AR/VR) applications. Experimental results demonstrate that the proposed method achieves competitive visual fidelity, while delivering significantly higher rendering performance and substantially reduced end-to-end latency, compared to NeRF-based approaches. Reconstruction quality remains within 4-7% of high-fidelity offline references, confirming the suitability of the proposed system for real-time, scalable augmented perception from aerial platforms.
Chinese Translation
在本研究中,我们提出了一种端到端的管道,能够将无人机捕获的视频流转换为高保真度的3D重建,且延迟极小。无人驾驶飞行器(UAV)在空中实时感知应用中被广泛使用。此外,最近在3D高斯喷溅(3DGS)方面的进展显示出实时神经渲染的显著潜力。然而,它们在端到端的基于UAV的重建和可视化系统中的集成仍然未得到充分探索。我们的目标是提出一种高效的架构,结合通过RTMP流进行的实时视频采集、同步传感器融合、相机位姿估计和3DGS优化,实现连续的模型更新和低延迟部署,支持沉浸式增强现实和虚拟现实(AR/VR)应用的交互式可视化环境。实验结果表明,所提出的方法在视觉保真度方面具有竞争力,同时在渲染性能上显著提高,并且与基于NeRF的方法相比,端到端延迟大幅降低。重建质量保持在高保真离线参考的4-7\%范围内,确认了所提出系统在空中平台上进行实时、可扩展增强感知的适用性。
cs.CV / 8 / 2602.20351
BiRQA: Bidirectional Robust Quality Assessment for Images
BiRQA:双向鲁棒图像质量评估
Abstract
Full-Reference image quality assessment (FR IQA) is important for image compression, restoration and generative modeling, yet current neural metrics remain slow and vulnerable to adversarial perturbations. We present BiRQA, a compact FR IQA metric model that processes four fast complementary features within a bidirectional multiscale pyramid. A bottom-up attention module injects fine-scale cues into coarse levels through an uncertainty-aware gate, while a top-down cross-gating block routes semantic context back to high resolution. To enhance robustness, we introduce Anchored Adversarial Training, a theoretically grounded strategy that uses clean "anchor" samples and a ranking loss to bound pointwise prediction error under attacks. On five public FR IQA benchmarks BiRQA outperforms or matches the previous state of the art (SOTA) while running ~3x faster than previous SOTA models. Under unseen white-box attacks it lifts SROCC from 0.30-0.57 to 0.60-0.84 on KADID-10k, demonstrating substantial robustness gains. To our knowledge, BiRQA is the only FR IQA model combining competitive accuracy with real-time throughput and strong adversarial resilience.
Chinese Translation
全参考图像质量评估(FR IQA)对于图像压缩、恢复和生成建模至关重要,但当前的神经度量方法仍然较慢且易受对抗扰动的影响。我们提出了BiRQA,一种紧凑的FR IQA度量模型,能够在双向多尺度金字塔中处理四种快速互补特征。自下而上的注意模块通过一个不确定性感知门将细尺度线索注入粗尺度层,而自上而下的交叉门控块则将语义上下文引导回高分辨率。为了增强鲁棒性,我们引入了锚定对抗训练(Anchored Adversarial Training),这是一种理论基础的策略,利用干净的“锚”样本和排名损失在攻击下限制逐点预测误差。在五个公共FR IQA基准测试中,BiRQA的性能超越或匹配了之前的最新技术(SOTA),同时运行速度比之前的SOTA模型快约3倍。在未见的白盒攻击下,它在KADID-10k上的SROCC从0.30-0.57提升至0.60-0.84,显示出显著的鲁棒性提升。据我们所知,BiRQA是唯一一个将竞争性准确性与实时吞吐量和强对抗韧性相结合的FR IQA模型。
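Anchored Adversarial Training pairs clean "anchor" samples with a ranking loss; the precise loss is not given in the abstract, so below is a hedged sketch of one standard hinge-on-ordering form (the margin value and the pairing scheme are assumptions for illustration).

```python
import numpy as np

def anchored_ranking_loss(pred_adv, pred_anchor, order, margin=0.25):
    """Hinge penalty when attacked predictions cross their clean anchors' order.

    order: +1 if the attacked image should score above its anchor, -1 if below.
    """
    return float(np.maximum(0.0, margin - order * (pred_adv - pred_anchor)).mean())

# attacked image should score below its anchor (-1) but is predicted well above
print(anchored_ranking_loss(np.array([0.75]), np.array([0.25]), np.array([-1])))  # 0.75
# ordering respected at the margin: no penalty
print(anchored_ranking_loss(np.array([0.0]), np.array([0.25]), np.array([-1])))   # 0.0
```

Penalizing only order violations, rather than absolute scores, is what lets such a loss bound pointwise prediction error under attack without over-constraining clean predictions.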
cs.CV / 9 / 2602.20354
3DSPA: A 3D Semantic Point Autoencoder for Evaluating Video Realism
3DSPA:一种用于评估视频真实感的3D语义点自编码器
Abstract
AI video generation is evolving rapidly. For video generators to be useful for applications ranging from robotics to film-making, they must consistently produce realistic videos. However, evaluating the realism of generated videos remains a largely manual process -- requiring human annotation or bespoke evaluation datasets which have restricted scope. Here we develop an automated evaluation framework for video realism which captures both semantics and coherent 3D structure and which does not require access to a reference video. Our method, 3DSPA, is a 3D spatiotemporal point autoencoder which integrates 3D point trajectories, depth cues, and DINO semantic features into a unified representation for video evaluation. 3DSPA models how objects move and what is happening in the scene, enabling robust assessments of realism, temporal consistency, and physical plausibility. Experiments show that 3DSPA reliably identifies videos which violate physical laws, is more sensitive to motion artifacts, and aligns more closely with human judgments of video quality and realism across multiple datasets. Our results demonstrate that enriching trajectory-based representations with 3D semantics offers a stronger foundation for benchmarking generative video models, and implicitly captures physical rule violations. The code and pretrained model weights will be available at https://github.com/TheProParadox/3dspa_code.
Chinese Translation
人工智能视频生成正在迅速发展。为了使视频生成器在从机器人技术到电影制作等应用中发挥作用,它们必须持续生成真实的视频。然而,评估生成视频的真实感仍然主要依赖人工过程——需要人工标注或特定的评估数据集,这些数据集的范围有限。在此,我们开发了一种自动化的视频真实感评估框架,该框架捕捉语义和一致的3D结构,并且不需要访问参考视频。我们的方法3DSPA是一种3D时空点自编码器,它将3D点轨迹、深度线索和DINO语义特征整合为视频评估的统一表示。3DSPA模拟物体的运动和场景中发生的事件,从而能够对真实感、时间一致性和物理合理性进行稳健评估。实验表明,3DSPA能够可靠地识别违反物理定律的视频,对运动伪影更为敏感,并且在多个数据集上与人类对视频质量和真实感的判断更为一致。我们的结果表明,丰富基于轨迹的表示与3D语义结合,为基准生成视频模型提供了更强的基础,并隐含地捕捉物理规则的违反。代码和预训练模型权重将可在 https://github.com/TheProParadox/3dspa_code 获取。
cs.CV / 10 / 2602.20363
Aesthetic Camera Viewpoint Suggestion with 3D Aesthetic Field
基于3D美学场的美学相机视角建议
Abstract
The aesthetic quality of a scene depends strongly on camera viewpoint. Existing approaches for aesthetic viewpoint suggestion are either single-view adjustments, predicting limited camera adjustments from a single image without understanding scene geometry, or 3D exploration approaches, which rely on dense captures or prebuilt 3D environments coupled with costly reinforcement learning (RL) searches. In this work, we introduce the notion of 3D aesthetic field that enables geometry-grounded aesthetic reasoning in 3D with sparse captures, allowing efficient viewpoint suggestions in contrast to costly RL searches. We opt to learn this 3D aesthetic field using a feedforward 3D Gaussian Splatting network that distills high-level aesthetic knowledge from a pretrained 2D aesthetic model into 3D space, enabling aesthetic prediction for novel viewpoints from only sparse input views. Building on this field, we propose a two-stage search pipeline that combines coarse viewpoint sampling with gradient-based refinement, efficiently identifying aesthetically appealing viewpoints without dense captures or RL exploration. Extensive experiments show that our method consistently suggests viewpoints with superior framing and composition compared to existing approaches, establishing a new direction toward 3D-aware aesthetic modeling.
Chinese Translation
场景的美学质量在很大程度上依赖于相机视角。现有的美学视角建议方法要么是单视图调整,从单一图像预测有限的相机调整而不理解场景几何,要么是依赖于密集捕捉或预构建3D环境的3D探索方法,这些方法通常伴随着高昂的强化学习(RL)搜索成本。在本研究中,我们引入了3D美学场的概念,使得在稀疏捕捉的情况下能够进行基于几何的3D美学推理,从而与高成本的RL搜索相比,提供高效的视角建议。我们选择使用前馈3D高斯点云网络来学习这一3D美学场,该网络将预训练的2D美学模型中的高级美学知识提炼到3D空间中,使得仅通过稀疏输入视图即可对新视角进行美学预测。在此基础上,我们提出了一个两阶段搜索管道,结合粗略视角采样与基于梯度的细化,能够高效地识别美学吸引力的视角,而无需密集捕捉或RL探索。大量实验表明,我们的方法在视角构图和框架方面始终优于现有方法,为3D感知美学建模开辟了新的方向。
cs.CV / 11 / 2602.20409
CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation
CLIPoint3D:基于语言的少样本无监督3D点云领域适应
Abstract
Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines. Codes are available at https://github.com/SarthakM320/CLIPoint3D.
Chinese Translation
近期的视觉-语言模型(VLMs),如CLIP,展示了令人印象深刻的跨模态推理能力,超越了图像,扩展到3D感知。然而,这些模型在领域转移时仍然脆弱,特别是在从合成到真实世界点云的适应中。传统的3D领域适应方法依赖于重型可训练编码器,虽然能够获得较强的准确性,但代价是效率低下。我们提出了CLIPoint3D,这是第一个基于CLIP的少样本无监督3D点云领域适应框架。我们的方法将3D样本投影到多个深度图中,并利用冻结的CLIP主干,通过知识驱动的提示调优方案进行精细化,该方案将高级语言先验与来自轻量级3D编码器的几何线索相结合。为了有效地适应任务特定特征,我们对CLIP的编码器应用了参数高效的微调,并设计了一种基于熵引导的视图采样策略,以选择可信的投影。此外,基于最优传输的对齐损失和不确定性感知的原型对齐损失共同弥合源-目标分布差距,同时保持类别可分性。在PointDA-10和GraspNetPC-10基准上的广泛实验表明,CLIPoint3D在CLIP基础和传统编码器基础的基线之上实现了一致的3-16%的准确性提升。代码可在https://github.com/SarthakM320/CLIPoint3D获取。
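The entropy-guided view sampling step has a direct reading: score each rendered depth-map projection by the entropy of its class posterior and keep the most confident ones. A minimal sketch with toy logits; everything beyond "lowest entropy first" is our assumption, not the paper's exact rule.

```python
import numpy as np

def select_views(logits_per_view, k):
    """Keep the k views whose class posteriors have the lowest entropy."""
    z = logits_per_view - logits_per_view.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    return np.argsort(entropy)[:k]

logits = np.array([[4.0, 0.0, 0.0],   # confident projection
                   [0.1, 0.0, 0.0],   # near-uniform (ambiguous) projection
                   [2.0, 0.0, 0.0]])
print(select_views(logits, k=2))  # [0 2]
```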
cs.CV / 12 / 2602.20412
SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images
SimLBR:通过学习检测真实图像来检测伪造图像
Abstract
The rapid advancement of generative models has made the detection of AI-generated images a critical challenge for both research and society. Recent works have shown that most state-of-the-art fake image detection methods overfit to their training data and catastrophically fail when evaluated on curated hard test sets with strong distribution shifts. In this work, we argue that it is more principled to learn a tight decision boundary around the real image distribution and treat the fake category as a sink class. To this end, we propose SimLBR, a simple and efficient framework for fake image detection using Latent Blending Regularization (LBR). Our method significantly improves cross-generator generalization, achieving up to +24.85% accuracy and +69.62% recall on the challenging Chameleon benchmark. SimLBR is also highly efficient, training orders of magnitude faster than existing approaches. Furthermore, we emphasize the need for reliability-oriented evaluation in fake image detection, introducing risk-adjusted metrics and worst-case estimates to better assess model robustness. All code and models will be released on HuggingFace and GitHub.
Chinese Translation
生成模型的快速发展使得检测人工智能生成图像成为研究和社会面临的一个重大挑战。近期的研究表明,大多数最先进的伪造图像检测方法在训练数据上过拟合,并在经过强分布转变的精心策划的困难测试集上表现不佳。在本研究中,我们认为围绕真实图像分布学习一个紧密的决策边界,并将伪造类别视为一个汇聚类,更为合理。为此,我们提出了SimLBR,一个简单高效的伪造图像检测框架,采用潜在混合正则化(Latent Blending Regularization, LBR)。我们的方法显著提高了跨生成器的泛化能力,在具有挑战性的Chameleon基准测试中,准确率提高了高达24.85%,召回率提高了69.62%。SimLBR的训练效率也非常高,比现有方法快几个数量级。此外,我们强调在伪造图像检测中需要进行可靠性导向的评估,提出了风险调整指标和最坏情况估计,以更好地评估模型的鲁棒性。所有代码和模型将发布在HuggingFace和GitHub上。
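The abstract names Latent Blending Regularization without defining it, so any concrete form is speculative. One plausible reading, sketched below purely as an assumption: convex blends of real-image latents densify the neighborhood of the real manifold, so a tight boundary can be drawn around it while everything outside falls into the fake sink class. Shapes and the Beta prior are illustrative choices.

```python
import numpy as np

def blend_latents(latents, rng, alpha=0.4):
    """Pairwise convex (mixup-style) blend of real-image latents."""
    lam = rng.beta(alpha, alpha, size=(len(latents), 1))
    perm = rng.permutation(len(latents))
    return lam * latents + (1.0 - lam) * latents[perm]

rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 16))    # stand-in for encoder outputs of real images
blended = blend_latents(latents, rng)
print(blended.shape)  # (4, 16)
```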
cs.CV / 13 / 2602.20417
gQIR: Generative Quanta Image Reconstruction
gQIR:生成量子图像重建
Abstract
Capturing high-quality images from only a few detected photons is a fundamental challenge in computational imaging. Single-photon avalanche diode (SPAD) sensors promise high-quality imaging in regimes where conventional cameras fail, but raw quanta frames contain only sparse, noisy, binary photon detections. Recovering a coherent image from a burst of such frames requires handling alignment, denoising, and demosaicing (for color) under noise statistics far outside those assumed by standard restoration pipelines or modern generative models. We present an approach that adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging. Our method leverages the structural and semantic priors of internet-scale diffusion models while introducing mechanisms to handle Bernoulli photon statistics. By integrating latent-space restoration with burst-level spatio-temporal reasoning, our approach produces reconstructions that are both photometrically faithful and perceptually pleasing, even under high-speed motion. We evaluate the method on synthetic benchmarks and new real-world datasets, including the first color SPAD burst dataset and a challenging Deforming (XD) video benchmark. Across all settings, the approach substantially improves perceptual quality over classical and modern learning-based baselines, demonstrating the promise of adapting large generative priors to extreme photon-limited sensing. Code at https://github.com/Aryan-Garg/gQIR.
Chinese Translation
从仅有的少量探测到的光子中捕获高质量图像是计算成像中的一个基本挑战。单光子雪崩二极管(SPAD)传感器在传统相机失效的情况下承诺提供高质量成像,但原始的量子帧仅包含稀疏、嘈杂的二进制光子检测。从一组这样的帧中恢复出一致的图像需要处理对齐、去噪和去马赛克(对于彩色)等问题,而这些问题的噪声统计特性远超标准恢复流程或现代生成模型所假设的范围。我们提出了一种方法,将大型文本到图像的潜在扩散模型适应于量子突发成像的光子限制领域。我们的方法利用了互联网规模扩散模型的结构和语义先验,同时引入了处理伯努利光子统计的机制。通过将潜在空间恢复与突发级时空推理相结合,我们的方法在高速运动下也能生成在光度上真实且在感知上令人愉悦的重建结果。我们在合成基准和新的真实世界数据集上评估了该方法,包括首个彩色SPAD突发数据集和一个具有挑战性的变形(XD)视频基准。在所有设置中,该方法在感知质量上显著优于经典和现代基于学习的基线,展示了将大型生成先验适应于极端光子限制传感的潜力。代码可在 [https://github.com/Aryan-Garg/gQIR](https://github.com/Aryan-Garg/gQIR) 获取。
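The Bernoulli photon statistics referenced in the abstract have a compact form: a SPAD pixel exposed to a per-frame photon flux phi fires with probability 1 - exp(-phi). A toy sketch of simulating a quanta burst and inverting the mean detection rate back to flux (the standard maximum-likelihood inversion; the flux value and burst length are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.5                                       # true photon flux per frame at one pixel
burst = rng.random(20000) < 1.0 - np.exp(-phi)  # binary quanta frames (Bernoulli)

rate = burst.mean()                             # observed detection rate
phi_hat = -np.log(1.0 - rate)                   # invert the Bernoulli model
print(abs(phi_hat - phi) < 0.05)                # estimate recovers the flux: True
```

The hard part the paper addresses is that real bursts are short and misaligned under motion, so this simple per-pixel inversion is far too noisy on its own; the diffusion prior supplies the missing structure.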
cs.CV / 14 / 2602.20423
MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation
MedCLIPSeg:用于数据高效和可推广医学图像分割的概率视觉-语言适应
Abstract
Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision-language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages more nuanced semantic learning across diverse textual prompts, MedCLIPSeg effectively improves data efficiency and domain generalizability. Extensive experiments across 16 datasets spanning five imaging modalities and six organs demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight local reliability of segmentation results. This work demonstrates the potential of probabilistic vision-language modeling for text-driven medical image segmentation.
Chinese Translation
医学图像分割仍然面临挑战,原因包括训练数据注释有限、解剖特征模糊以及领域转移。尽管像 CLIP 这样的视觉-语言模型提供了强大的跨模态表示,但它们在密集的、文本引导的医学图像分割中的潜力仍未得到充分探索。我们提出了 MedCLIPSeg,这是一种新颖的框架,旨在将 CLIP 适应于稳健、数据高效且考虑不确定性的医学图像分割。我们的方法通过概率跨模态注意机制利用补丁级 CLIP 嵌入,促进图像和文本标记之间的双向交互,并明确建模预测不确定性。结合一种软补丁级对比损失,鼓励在多样化文本提示中进行更细致的语义学习,MedCLIPSeg 有效提高了数据效率和领域可推广性。在涵盖五种成像模态和六个器官的 16 个数据集上的广泛实验表明,MedCLIPSeg 在准确性、效率和稳健性方面优于先前的方法,同时提供可解释的不确定性图,突出分割结果的局部可靠性。本研究展示了概率视觉-语言建模在文本驱动的医学图像分割中的潜力。
cs.CV / 15 / 2602.20476
SceMoS: Scene-Aware 3D Human Motion Synthesis by Planning with Geometry-Grounded Tokens
SceMoS:通过几何基础标记进行场景感知的3D人类动作合成
Abstract
Synthesizing text-driven 3D human motion within realistic scenes requires learning both semantic intent ("walk to the couch") and physical feasibility (e.g., avoiding collisions). Current methods use generative frameworks that simultaneously learn high-level planning and low-level contact reasoning, and rely on computationally expensive 3D scene data such as point clouds or voxel occupancy grids. We propose SceMoS, a scene-aware motion synthesis framework that shows that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. SceMoS disentangles global planning from local execution using lightweight 2D cues and relying on (1) a text-conditioned autoregressive global motion planner that operates on a bird's-eye-view (BEV) image rendered from an elevated corner of the scene, encoded with DINOv2 features, as the scene representation, and (2) a geometry-grounded motion tokenizer trained via a conditional VQ-VAE, that uses a 2D local scene heightmap, thus embedding surface physics directly into a discrete vocabulary. This 2D factorization strikes an efficiency-fidelity trade-off: BEV semantics capture spatial layout and affordance for global reasoning, while local heightmaps enforce fine-grained physical adherence without full 3D volumetric reasoning. SceMoS achieves state-of-the-art motion realism and contact accuracy on the TRUMANS benchmark, reducing the number of trainable parameters for scene encoding by over 50%, showing that 2D scene cues can effectively ground 3D human-scene interaction.
Chinese Translation
在现实场景中合成文本驱动的3D人类动作需要学习语义意图(如"走到沙发旁")和物理可行性(例如,避免碰撞)。目前的方法使用生成框架,同时学习高层次的规划和低层次的接触推理,并依赖于计算开销较大的3D场景数据,如点云或体素占用网格。我们提出了SceMoS,一个场景感知的动作合成框架,表明结构化的2D场景表示可以作为物理基础动作合成中全3D监督的强大替代方案。SceMoS通过使用轻量级的2D线索,将全局规划与局部执行解耦,依赖于(1)一个基于文本条件的自回归全局动作规划器,该规划器在从场景的高处角落渲染的鸟瞰图(BEV)图像上操作,并使用DINOv2特征进行编码,作为场景表示,以及(2)一个通过条件VQ-VAE训练的几何基础动作标记器,该标记器使用2D局部场景高度图,从而将表面物理直接嵌入离散词汇中。这种2D分解实现了效率与保真度的权衡:BEV语义为全局推理捕捉空间布局与可供性(affordance),而局部高度图则在不进行全3D体积推理的情况下强制执行细粒度的物理遵循。SceMoS在TRUMANS基准测试中实现了最先进的动作真实感和接触准确性,将场景编码的可训练参数数量减少了超过50%,表明2D场景线索可以有效地为3D人类-场景交互提供基础。
cs.CV / 16 / 2602.20479
Path-Decoupled Hyperbolic Flow Matching for Few-Shot Adaptation
用于少样本适应的路径解耦双曲流匹配
Abstract
Recent advances in cross-modal few-shot adaptation treat visual-semantic alignment as a continuous feature transport problem via Flow Matching (FM). However, we argue that Euclidean-based FM overlooks fundamental limitations of flat geometry, where polynomial volume growth fails to accommodate diverse feature distributions, leading to severe path entanglement. To this end, we propose path-decoupled Hyperbolic Flow Matching (HFM), leveraging the Lorentz manifold's exponential expansion for trajectory decoupling. HFM structures the transport via two key designs: 1) Centripetal hyperbolic alignment: It constructs a centripetal hierarchy by anchoring textual roots, which pushes visual leaves to the boundary to initialize orderly flows. 2) Path-decoupled objective: It acts as a "semantic guardrail" rigidly confining trajectories within isolated class-specific geodesic corridors via step-wise supervision. Furthermore, we devise an adaptive diameter-based stopping rule to prevent over-transportation into the crowded origin based on the intrinsic semantic scale. Extensive ablations on 11 benchmarks show that HFM establishes a new state-of-the-art, consistently outperforming its Euclidean counterparts. Our codes and models will be released.
Chinese Translation
近年来,跨模态少样本适应的进展将视觉-语义对齐视为通过流匹配(Flow Matching, FM)进行的连续特征传输问题。然而,我们认为基于欧几里得的FM忽视了平坦几何的基本局限性,其中多项式体积增长无法适应多样化的特征分布,导致严重的路径纠缠。为此,我们提出了路径解耦的双曲流匹配(Hyperbolic Flow Matching, HFM),利用洛伦兹流形的指数扩展实现轨迹解耦。HFM通过两个关键设计结构化传输:1)向心双曲对齐:通过锚定文本根构建向心层次结构,推动视觉叶子到达边界以初始化有序流动。2)路径解耦目标:作为"语义护栏",通过逐步监督严格限制轨迹在孤立的类特定测地线走廊内。进一步地,我们设计了一种基于自适应直径的停止机制,以防止根据内在语义尺度过度传输到拥挤的起点。在11个基准上的广泛消融实验表明,HFM建立了新的最先进水平,始终优于其欧几里得对应物。我们的代码和模型将会发布。
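The Lorentz-model machinery behind HFM is standard hyperbolic geometry: points live on the hyperboloid ⟨y, y⟩_L = -1 under the Minkowski inner product, and the exponential map exp_x(v) = cosh(‖v‖_L) x + sinh(‖v‖_L) v/‖v‖_L moves along geodesics; volume grows exponentially with radius, which is the expansion property the paper exploits for trajectory decoupling. A minimal numerical sketch (textbook formulas, not the paper's training objective; dimensions and sample vectors are illustrative):

```python
import numpy as np

def minkowski(u, v):
    """Minkowski (Lorentzian) inner product <u, v>_L."""
    return -u[0] * v[0] + u[1:] @ v[1:]

def exp_map(x, v):
    """Exponential map on the Lorentz model: geodesic step from x along tangent v."""
    nv = np.sqrt(minkowski(v, v))        # tangent vectors have positive Lorentz norm
    return np.cosh(nv) * x + np.sinh(nv) * v / nv

x = np.array([1.0, 0.0, 0.0])            # origin of the 2D hyperboloid
v = np.array([0.0, 0.7, 0.0])            # tangent at x: <x, v>_L = 0
y = exp_map(x, v)
print(round(float(minkowski(y, y)), 6))  # stays on the manifold: -1.0
```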
cs.CV / 17 / 2602.20496
Pip-Stereo: Progressive Iterations Pruner for Iterative Optimization based Stereo Matching
Pip-Stereo:面向基于迭代优化的立体匹配的渐进迭代剪枝器
Abstract
While iterative stereo matching achieves high accuracy, its dependence on Recurrent Neural Networks (RNNs) hinders edge deployment, a challenge underexplored in existing research. We analyze iterative refinement and reveal that disparity updates are spatially sparse and temporally redundant. First, we introduce a progressive iteration pruning strategy that suppresses redundant update steps, effectively collapsing the recursive computation into a near-single-pass inference. Second, we propose a collaborative monocular prior transfer framework that implicitly embeds depth priors without requiring a dedicated monocular encoder, thereby eliminating its associated computational burden. Third, we develop FlashGRU, a hardware-aware RNN operator leveraging structured sparsity and I/O-conscious design, achieving a 7.28$\times$ speedup, 76.6\% peak-memory reduction and 80.9\% global-memory-request reduction over native ConvGRUs at 2K resolution. Our Pip-Stereo enables real-time, high-fidelity stereo matching on edge hardware: it processes 320$\times$640 frames in just 75ms on an NVIDIA Jetson Orin NX (FP16) and 19ms on an RTX 4090, matching the accuracy of large iterative models, and its generalization ability and accuracy far exceed those of existing real-time methods. Our embedded AI projects will be updated at: https://github.com/XPENG-Aridge-AI.
Chinese Translation
尽管迭代立体匹配能够实现高精度,但其对递归神经网络(Recurrent Neural Networks, RNN)的依赖限制了边缘部署,这是现有研究中未充分探讨的挑战。我们分析了迭代细化过程,并揭示了视差更新在空间上是稀疏的,在时间上是冗余的。首先,我们提出了一种渐进迭代剪枝策略,抑制冗余的更新步骤,有效地将递归计算压缩为近乎单次推理。其次,我们提出了一种协同单目先验转移框架,该框架在不需要专用单目编码器的情况下,隐式嵌入深度先验,从而消除了相关的计算负担。第三,我们开发了FlashGRU,这是一种硬件感知的RNN操作符,利用结构稀疏性和I/O意识设计,在2K分辨率下实现了7.28倍的加速、76.6%的内存峰值减少和80.9%的全局内存请求减少,相较于原生ConvGRU。我们的PipStereo使得在边缘硬件上实现实时、高保真度的立体匹配成为可能:在NVIDIA Jetson Orin NX(FP16)上处理320×640帧仅需75毫秒,在RTX 4090上仅需19毫秒,同时匹配大型基于迭代的模型的精度,并且我们的泛化能力和准确性远超现有的实时方法。我们的嵌入式AI项目将更新于:https://github.com/XPENG-Aridge-AI。
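The pruning idea, that later disparity updates become redundant and can be skipped, can be illustrated with a generic early-stopping refinement loop. The `update_fn` callback, the tolerance rule, and the toy decay schedule below are illustrative assumptions, not the paper's actual pruning criterion:

```python
def prune_iterations(update_fn, disparity, max_iters=16, tol=1e-3):
    # Run iterative refinement, but stop once the mean absolute update
    # drops below `tol`, collapsing the recursion toward a near-single-pass
    # inference when late updates carry little new information.
    for step in range(max_iters):
        delta = update_fn(disparity, step)
        disparity = [d + u for d, u in zip(disparity, delta)]
        if sum(abs(u) for u in delta) / len(delta) < tol:
            return disparity, step + 1
    return disparity, max_iters
```

With geometrically decaying updates, the loop terminates well before `max_iters` while the result stays within the tolerance of the converged value.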
cs.CV / 18 / 2602.20497
LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration
LESA:可学习的阶段感知预测器用于扩散模型加速
Abstract
Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-fidelity generation. Experiments demonstrate 5.00x acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is included in the supplementary materials and will be released on GitHub.
Chinese Translation
扩散模型在图像和视频生成任务中取得了显著成功。然而,扩散变换器(Diffusion Transformers, DiTs)的高计算需求对其实际部署构成了重大挑战。虽然特征缓存是一种有前景的加速策略,但现有基于简单重用或无训练预测的方法难以适应扩散过程复杂的、阶段依赖的动态,常常导致质量下降,无法与标准去噪过程保持一致。为此,我们提出了一种基于两阶段训练的可学习阶段感知(LEarnable Stage-Aware, LESA)预测器框架。我们的方法利用科尔莫哥洛夫-阿诺德网络(Kolmogorov-Arnold Network, KAN)准确学习数据的时间特征映射。我们进一步引入了一种多阶段、多专家架构,将专门的预测器分配给不同噪声水平的阶段,从而实现更精确和稳健的特征预测。大量实验表明,我们的方法在保持高保真生成的同时实现了显著加速。在FLUX.1-dev上实现了5.00倍的加速,质量下降最小(1.0%),在Qwen-Image上实现了6.25倍的加速,相比之前的最优方法(TaylorSeer)质量提升了20.2%,在HunyuanVideo上实现了5.00倍的加速,PSNR相比TaylorSeer提升了24.7%。在文本到图像和文本到视频合成任务上均表现出最先进的性能,验证了我们基于训练的框架在不同模型上的有效性和泛化能力。我们的代码包含在补充材料中,并将在GitHub上发布。
cs.CV / 19 / 2602.20501
Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models
探究与桥接几何-交互线索以实现视觉基础模型中的可供性推理
Abstract
What does it mean for a visual system to truly understand affordance? We argue that this understanding hinges on two complementary capacities: geometric perception, which identifies the structural parts of objects that enable interaction, and interaction perception, which models how an agent's actions engage with those parts. To test this hypothesis, we conduct a systematic probing of Visual Foundation Models (VFMs). We find that models like DINO inherently encode part-level geometric structures, while generative models like Flux contain rich, verb-conditioned spatial attention maps that serve as implicit interaction priors. Crucially, we demonstrate that these two dimensions are not merely correlated but are composable elements of affordance. By simply fusing DINO's geometric prototypes with Flux's interaction maps in a training-free and zero-shot manner, we achieve affordance estimation competitive with weakly-supervised methods. This final fusion experiment confirms that geometric and interaction perception are the fundamental building blocks of affordance understanding in VFMs, providing a mechanistic account of how perception grounds action.
Chinese Translation
视觉系统真正理解可供性意味着什么?我们认为这种理解依赖于两种互补能力:几何感知,它识别出使交互成为可能的物体结构部分;以及交互感知,它建模代理的行为如何与这些部分相互作用。为了验证这一假设,我们对视觉基础模型(Visual Foundation Models, VFMs)进行了系统的探测。我们发现,像 DINO 这样的模型固有地编码了部分级几何结构,而像 Flux 这样的生成模型则包含丰富的、以动词为条件的空间注意力图,这些图作为隐式的交互先验。关键是,我们证明这两个维度不仅仅是相关的,而是可供性的可组合元素。通过简单地将 DINO 的几何原型与 Flux 的交互图在无训练和零样本的方式下融合,我们实现了与弱监督方法相媲美的可供性估计。最终的融合实验确认了几何感知和交互感知是 VFMs 中可供性理解的基本构建块,为感知如何支撑行动提供了机制性解释。
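The training-free fusion described above, combining a part-level geometric similarity map with a verb-conditioned interaction attention map, reduces in its simplest reading to an elementwise product. This toy sketch assumes flat per-pixel maps of equal size and is not the authors' exact procedure:

```python
def fuse_affordance(geometry_map, interaction_map):
    # Elementwise product of a geometric similarity map (e.g. from DINO
    # prototypes) and an interaction attention map (e.g. from a generative
    # model), renormalized to [0, 1]. High response requires agreement
    # between both cues, mirroring the "composable elements" claim.
    fused = [g * i for g, i in zip(geometry_map, interaction_map)]
    m = max(fused)
    return [f / m for f in fused] if m > 0 else fused
```

Regions supported by only one cue (geometry without interaction, or vice versa) are suppressed, which is the intended behavior of the fusion.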
cs.CV / 20 / 2602.20511
Leveraging Causal Reasoning Method for Explaining Medical Image Segmentation Models
利用因果推理方法解释医学图像分割模型
Abstract
Medical image segmentation plays a vital role in clinical decision-making, enabling precise localization of lesions and guiding interventions. Despite significant advances in segmentation accuracy, the black-box nature of most deep models has raised growing concerns about their trustworthiness in high-stakes medical scenarios. Current explanation techniques have primarily focused on classification tasks, leaving the segmentation domain relatively underexplored. We introduce an explanation model for the segmentation task that employs the causal inference framework and backpropagates the average treatment effect (ATE) into a quantification metric to determine the influence of input regions, as well as network components, on target segmentation areas. Through comparison with recent segmentation explainability techniques on two representative medical imaging datasets, we demonstrate that our approach provides more faithful explanations than existing methods. Furthermore, we carry out a systematic causal analysis of multiple foundational segmentation models using our method, which reveals significant heterogeneity in perceptual strategies across different models, and even between different inputs for the same model, suggesting the potential of our method to provide notable insights for optimizing segmentation models. Our code can be found at https://github.com/lcmmai/PdCR.
Chinese Translation
医学图像分割在临床决策中发挥着至关重要的作用,能够精确定位病变并指导干预。尽管分割精度取得了显著进展,但大多数深度模型的黑箱特性引发了人们对其在高风险医疗场景中可信度的日益关注。目前的解释技术主要集中在分类任务上,导致分割领域相对未被充分探索。我们提出了一种用于分割任务的解释模型,该模型采用因果推断框架,并将平均处理效应(ATE)反向传播到量化指标中,以确定输入区域及网络组件对目标分割区域的影响。通过在两个具有代表性的医学影像数据集上与最新分割可解释性技术进行比较,我们证明了我们的方法提供的解释比现有方法更为可信。此外,我们使用我们的方法对多个基础分割模型进行了系统的因果分析,揭示了不同模型之间以及同一模型对不同输入之间感知策略的显著异质性,这表明我们的方法在优化分割模型方面具有提供重要见解的潜力。我们的代码可以在 https://github.com/lcmmai/PdCR 找到。
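The core intervention behind an ATE-style attribution, perturb a region and average the change in the model's output over several baseline fills, can be sketched generically. The toy `model`, the baseline sweep, and the flat image representation are illustrative assumptions, not the paper's estimator:

```python
def region_ate(model, image, region, n_baselines=4):
    # Average treatment effect of "occluding" `region`: the mean absolute
    # change in the model's scalar output when region pixels are replaced
    # by a sweep of baseline fill values.
    base_out = model(image)
    effects = []
    for b in range(n_baselines):
        treated = list(image)
        fill = b / (n_baselines - 1)  # sweep fills from 0.0 to 1.0
        for idx in region:
            treated[idx] = fill
        effects.append(abs(model(treated) - base_out))
    return sum(effects) / n_baselines
```

Regions with high `region_ate` causally influence the output the most; the paper's method additionally backpropagates this quantity to network components.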
cs.CV / 21 / 2602.20520
How Do Inpainting Artifacts Propagate to Language?
修复伪影如何传播到语言中?
Abstract
We study how visual artifacts introduced by diffusion-based inpainting affect language generation in vision-language models. We use a two-stage diagnostic setup in which masked image regions are reconstructed and then provided to captioning models, enabling controlled comparisons between captions generated from original and reconstructed inputs. Across multiple datasets, we analyze the relationship between reconstruction fidelity and downstream caption quality. We observe consistent associations between pixel-level and perceptual reconstruction metrics and both lexical and semantic captioning performance. Additional analysis of intermediate visual representations and attention patterns shows that inpainting artifacts lead to systematic, layer-dependent changes in model behavior. Together, these results provide a practical diagnostic framework for examining how visual reconstruction quality influences language generation in multimodal systems.
Chinese Translation
我们研究了基于扩散的修复所引入的视觉伪影如何影响视觉-语言模型中的语言生成。我们使用了一个两阶段的诊断设置,其中被遮蔽的图像区域被重建,然后提供给描述模型,从而实现对原始输入和重建输入生成的描述之间的可控比较。在多个数据集中,我们分析了重建保真度与下游描述质量之间的关系。我们观察到像素级和感知重建指标与词汇和语义描述性能之间存在一致的关联。对中间视觉表示和注意力模式的进一步分析表明,修复伪影导致模型行为发生系统性的、层依赖的变化。这些结果共同提供了一个实用的诊断框架,用于检查视觉重建质量如何影响多模态系统中的语言生成。
cs.CV / 22 / 2602.20531
A Lightweight Vision-Language Fusion Framework for Predicting App Ratings from User Interfaces and Metadata
一种轻量级视觉-语言融合框架用于从用户界面和元数据预测应用评分
Abstract
App ratings are among the most significant indicators of the quality, usability, and overall user satisfaction of mobile applications. However, existing app rating prediction models are largely limited to textual data or user interface (UI) features, overlooking the importance of jointly leveraging UI and semantic information. To address these limitations, this study proposes a lightweight vision-language framework that integrates both mobile UI and semantic information for app rating prediction. The framework combines MobileNetV3 to extract visual features from UI layouts and DistilBERT to extract textual features. These multimodal features are fused through a gated fusion module with Swish activations, followed by a multilayer perceptron (MLP) regression head. The proposed model is evaluated using mean absolute error (MAE), root mean square error (RMSE), mean squared error (MSE), coefficient of determination (R2), and Pearson correlation. After training for 20 epochs, the model achieves an MAE of 0.1060, an RMSE of 0.1433, an MSE of 0.0205, an R2 of 0.8529, and a Pearson correlation of 0.9251. Extensive ablation studies further demonstrate the effectiveness of different combinations of visual and textual encoders. Overall, the proposed lightweight framework provides valuable insights for developers and end users, supports sustainable app development, and enables efficient deployment on edge devices.
Chinese Translation
应用评分是移动应用质量、可用性和整体用户满意度的重要指标之一。然而,现有的应用评分预测模型主要局限于文本数据或用户界面(UI)特征,忽视了联合利用UI和语义信息的重要性。为了解决这些局限性,本研究提出了一种轻量级的视觉-语言框架,该框架整合了移动UI和语义信息以进行应用评分预测。该框架结合了MobileNetV3从UI布局中提取视觉特征,并使用DistilBERT提取文本特征。这些多模态特征通过带有Swish激活函数的门控融合模块进行融合,随后通过多层感知器(MLP)回归头进行处理。所提出的模型通过平均绝对误差(MAE)、均方根误差(RMSE)、均方误差(MSE)、决定系数(R2)和皮尔逊相关系数进行评估。在训练20个周期后,模型达到了0.1060的MAE、0.1433的RMSE、0.0205的MSE、0.8529的R2和0.9251的皮尔逊相关系数。广泛的消融研究进一步证明了不同视觉和文本编码器组合的有效性。总体而言,所提出的轻量级框架为开发者和最终用户提供了有价值的见解,支持可持续的应用开发,并能够在边缘设备上高效部署。
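One plausible reading of the gated fusion module with Swish activations is a per-dimension sigmoid gate blending two Swish-activated feature streams. The exact parameterization below is an assumption for illustration, not the paper's architecture:

```python
import math

def swish(x):
    # Swish activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def gated_fusion(visual, textual, gate_w):
    # Per-dimension gate g = sigmoid(w * (v + t)) blends the two
    # Swish-activated streams: fused = g * swish(v) + (1 - g) * swish(t).
    # `gate_w` stands in for learned gate weights.
    fused = []
    for v, t, w in zip(visual, textual, gate_w):
        g = 1.0 / (1.0 + math.exp(-w * (v + t)))
        fused.append(g * swish(v) + (1.0 - g) * swish(t))
    return fused
```

With a zero gate weight the module degenerates to an even blend of both modalities; training the gate weights lets it favor whichever stream is more predictive per dimension.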
cs.CV / 23 / 2602.20537
PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning
PFGNet:一种全卷积频率引导的外围门控网络,用于高效时空预测学习
Abstract
Spatiotemporal predictive learning (STPL) aims to forecast future frames from past observations and is essential across a wide range of applications. Compared with recurrent or hybrid architectures, pure convolutional models offer superior efficiency and full parallelism, yet their fixed receptive fields limit their ability to adaptively capture spatially varying motion patterns. Inspired by biological center-surround organization and frequency-selective signal processing, we propose PFGNet, a fully convolutional framework that dynamically modulates receptive fields through pixel-wise frequency-guided gating. The core Peripheral Frequency Gating (PFG) block extracts localized spectral cues and adaptively fuses multi-scale large-kernel peripheral responses with learnable center suppression, effectively forming spatially adaptive band-pass filters. To maintain efficiency, all large kernels are decomposed into separable 1D convolutions ($1 \times k$ followed by $k \times 1$), reducing per-channel computational cost from $O(k^2)$ to $O(2k)$. PFGNet enables structure-aware spatiotemporal modeling without recurrence or attention. Experiments on Moving MNIST, TaxiBJ, Human3.6M, and KTH show that PFGNet delivers SOTA or near-SOTA forecasting performance with substantially fewer parameters and FLOPs. Our code is available at https://github.com/fhjdqaq/PFGNet.
Chinese Translation
时空预测学习(STPL)旨在从过去的观测中预测未来的帧,并在广泛的应用中至关重要。与递归或混合架构相比,纯卷积模型提供了更高的效率和完全的并行性,但其固定的感受野限制了其自适应捕捉空间变化运动模式的能力。受到生物中心-周围组织和频率选择性信号处理的启发,我们提出了PFGNet,一种全卷积框架,通过像素级频率引导门控动态调节感受野。核心的外围频率门控(Peripheral Frequency Gating, PFG)模块提取局部谱线索,并自适应融合多尺度大核外围响应与可学习的中心抑制,有效形成空间自适应带通滤波器。为了保持效率,所有大核被分解为可分离的1D卷积($1 \times k$后接$k \times 1$),将每通道的计算成本从$O(k^2)$降低到$O(2k)$。PFGNet能够在没有递归或注意力机制的情况下实现结构感知的时空建模。在Moving MNIST、TaxiBJ、Human3.6M和KTH上的实验表明,PFGNet以显著更少的参数和FLOPs实现了SOTA或接近SOTA的预测性能。我们的代码可在https://github.com/fhjdqaq/PFGNet获取。
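The separable-kernel decomposition (a $k \times k$ kernel factored into a $k \times 1$ pass followed by a $1 \times k$ pass, reducing per-output cost from $O(k^2)$ to $O(2k)$) can be demonstrated directly. This minimal pure-Python sketch uses 'valid' padding and a rank-1 box kernel, for which the factorization is exact:

```python
def conv1d(row, kernel):
    # 'valid' 1-D correlation of a sequence with a kernel.
    k = len(kernel)
    return [sum(row[i + j] * kernel[j] for j in range(k))
            for i in range(len(row) - k + 1)]

def separable_conv2d(image, kv, kh):
    # k x 1 pass over columns followed by 1 x k pass over rows.
    # For a rank-1 kernel K = kv (outer) kh this equals the full 2-D
    # convolution at O(2k) multiplies per output instead of O(k^2).
    cols = list(zip(*image))
    vert = [conv1d(list(c), kv) for c in cols]   # vertical (k x 1) pass
    inter = [list(r) for r in zip(*vert)]
    return [conv1d(r, kh) for r in inter]        # horizontal (1 x k) pass
```

A 3x3 box filter over an all-ones image yields 9 at every valid position, matching the unfactored kernel.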
cs.CV / 24 / 2602.20543
Beyond Human Performance: A Vision-Language Multi-Agent Approach for Quality Control in Pharmaceutical Manufacturing
超越人类表现:一种用于制药制造质量控制的视觉-语言多智能体方法
Abstract
Colony-forming unit (CFU) detection is critical in pharmaceutical manufacturing, serving as a key component of Environmental Monitoring programs and ensuring compliance with stringent quality standards. Manual counting is labor-intensive and error-prone, while deep learning (DL) approaches, though accurate, remain vulnerable to sample quality variations and artifacts. Building on our earlier CNN-based framework (Beznik et al., 2020), we evaluated YOLOv5, YOLOv7, and YOLOv8 for CFU detection; however, these achieved only 97.08 percent accuracy, insufficient for pharmaceutical-grade requirements. A custom Detectron2 model trained on GSK's dataset of over 50,000 Petri dish images achieved 99 percent detection rate with 2 percent false positives and 0.6 percent false negatives. Despite high validation accuracy, Detectron2 performance degrades on outlier cases including contaminated plates, plastic artifacts, or poor optical clarity. To address this, we developed a multi-agent framework combining DL with vision-language models (VLMs). The VLM agent first classifies plates as valid or invalid. For valid samples, both DL and VLM agents independently estimate colony counts. When predictions align within 5 percent, results are automatically recorded in Postgres and SAP; otherwise, samples are routed for expert review. Expert feedback enables continuous retraining and self-improvement. Initial DL-based automation reduced human verification by 50 percent across vaccine manufacturing sites. With VLM integration, this increased to 85 percent, delivering significant operational savings. The proposed system provides a scalable, auditable, and regulation-ready solution for microbiological quality control, advancing automation in biopharmaceutical production.
Chinese Translation
菌落形成单位(CFU)检测在制药制造中至关重要,是环境监测程序的关键组成部分,确保符合严格的质量标准。人工计数劳动强度大且易出错,而深度学习(DL)方法虽然准确,但仍易受到样本质量变化和伪影的影响。在我们早期基于卷积神经网络(CNN)的框架(Beznik et al., 2020)的基础上,我们评估了YOLOv5、YOLOv7和YOLOv8用于CFU检测;然而,这些方法仅达到了97.08%的准确率,无法满足制药级别的要求。我们在GSK超过50,000张培养皿图像的数据集上训练的定制Detectron2模型达到了99%的检测率,假阳性率为2%,假阴性率为0.6%。尽管验证准确率很高,但Detectron2在异常案例(包括污染的培养皿、塑料伪影或光学清晰度差)上的表现有所下降。为了解决这个问题,我们开发了一个将深度学习与视觉-语言模型(VLMs)相结合的多智能体框架。VLM智能体首先将培养皿分类为有效或无效。对于有效样本,DL和VLM智能体独立估计菌落计数。当预测结果在5%以内一致时,结果会自动记录在Postgres和SAP中;否则,样本将被转交专家审查。专家反馈使得持续再训练和自我改进成为可能。初步的基于DL的自动化使疫苗制造现场的人为验证减少了50%。随着VLM的集成,这一比例提高到了85%,带来了显著的运营节省。所提出的系统提供了一种可扩展、可审计且符合监管要求的微生物质量控制解决方案,推动了生物制药生产的自动化进程。
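The agreement-based routing workflow described above (reject invalid plates, auto-record when the DL and VLM counts agree within 5 percent, otherwise escalate to an expert) can be sketched as a small decision function. The function name, tolerance handling, and rounding rule are illustrative, not GSK's production logic:

```python
def route_sample(dl_count, vlm_count, is_valid, tol=0.05):
    # Invalid plates (per the VLM agent's classification) are rejected.
    if not is_valid:
        return "rejected", None
    # Agreement check: relative difference within `tol` is auto-recorded.
    denom = max(dl_count, vlm_count, 1)
    if abs(dl_count - vlm_count) / denom <= tol:
        return "recorded", round((dl_count + vlm_count) / 2)
    # Disagreement routes the sample to expert review (and retraining data).
    return "expert_review", None
```

In the described system, the "recorded" branch would write to Postgres/SAP, and the "expert_review" branch feeds the continuous-retraining loop.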
cs.CV / 25 / 2602.20548
Robust Spiking Neural Networks Against Adversarial Attacks
针对对抗攻击的鲁棒脉冲神经网络
Abstract
Spiking Neural Networks (SNNs) represent a promising paradigm for energy-efficient neuromorphic computing due to their bio-plausible and spike-driven characteristics. However, the robustness of SNNs in complex adversarial environments remains significantly constrained. In this study, we theoretically demonstrate that those threshold-neighboring spiking neurons are the key factors limiting the robustness of directly trained SNNs. We find that these neurons set the upper limits for the maximum potential strength of adversarial attacks and are prone to state-flipping under minor disturbances. To address this challenge, we propose a Threshold Guarding Optimization (TGO) method, which comprises two key aspects. First, we incorporate additional constraints into the loss function to move neurons' membrane potentials away from their thresholds. It increases SNNs' gradient sparsity, thereby reducing the theoretical upper bound of adversarial attacks. Second, we introduce noisy spiking neurons to transition the neuronal firing mechanism from deterministic to probabilistic, decreasing their state-flipping probability due to minor disturbances. Extensive experiments conducted in standard adversarial scenarios prove that our method significantly enhances the robustness of directly trained SNNs. These findings pave the way for advancing more reliable and secure neuromorphic computing in real-world applications.
Chinese Translation
脉冲神经网络(SNNs)因其生物可信和脉冲驱动特性,代表了一种有前景的节能神经形态计算范式。然而,SNNs在复杂对抗环境中的鲁棒性仍然受到显著限制。在本研究中,我们理论上证明了那些阈值邻近的脉冲神经元是限制直接训练的SNNs鲁棒性的关键因素。我们发现,这些神经元设定了对抗攻击最大潜在强度的上限,并且在轻微干扰下容易发生状态翻转。为了解决这一挑战,我们提出了一种阈值保护优化(Threshold Guarding Optimization, TGO)方法,该方法包括两个关键方面。首先,我们在损失函数中加入额外约束,以使神经元的膜电位远离其阈值。这增加了SNNs的梯度稀疏性,从而降低了对抗攻击的理论上限。其次,我们引入了噪声脉冲神经元,将神经元的发放机制从确定性转变为概率性,降低了它们因轻微干扰而发生状态翻转的概率。在标准对抗场景中进行的大量实验证明,我们的方法显著增强了直接训练的SNNs的鲁棒性。这些发现为在实际应用中推进更可靠和安全的神经形态计算铺平了道路。
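The first TGO component, an extra loss term that pushes membrane potentials away from the firing threshold, can be sketched as a hinge penalty that is nonzero only inside the flip-prone neighborhood. The margin hyperparameter and the exact functional form are assumptions, not the paper's formulation:

```python
def threshold_guard_loss(membrane, threshold, margin=0.2):
    # Hinge penalty: positive only when a membrane potential lies within
    # `margin` of the firing threshold, so gradients push potentials out
    # of the state-flip-prone band around the threshold.
    return sum(max(0.0, margin - abs(v - threshold)) for v in membrane)
```

Potentials already far from the threshold contribute zero, so the constraint only acts on the threshold-neighboring neurons identified as the robustness bottleneck.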
cs.CV / 26 / 2602.20550
The Finite Primitive Basis Theorem for Computational Imaging: Formal Foundations of the OperatorGraph Representation
计算成像的有限原始基定理:算子图表示的形式基础
Abstract
Computational imaging forward models, from coded aperture spectral cameras to MRI scanners, are traditionally implemented as monolithic, modality-specific codes. We prove that every forward model in a broad, precisely defined operator class Cimg (encompassing clinical, scientific, and industrial imaging modalities, both linear and nonlinear) admits an epsilon-approximate representation as a typed directed acyclic graph (DAG) whose nodes are drawn from a library of exactly 11 canonical primitives: Propagate, Modulate, Project, Encode, Convolve, Accumulate, Detect, Sample, Disperse, Scatter, and Transform. We call this the Finite Primitive Basis Theorem. The proof is constructive: we provide an algorithm that, given any H in Cimg, produces a DAG G with relative operator error at most epsilon and graph complexity within prescribed bounds. We further prove that the library is minimal: removing any single primitive causes at least one modality to lose its epsilon-approximate representation. A systematic analysis of nonlinearities in imaging physics shows they fall into two structural categories: pointwise scalar functions (handled by Transform) and self-consistent iterations (unrolled into existing linear primitives). Empirical validation on 31 linear modalities confirms eimg below 0.01 with at most 5 nodes and depth 5, and we provide constructive DAG decompositions for 9 additional nonlinear modalities. These results establish mathematical foundations for the Physics World Model (PWM) framework.
Chinese Translation
计算成像的前向模型,从编码孔径光谱相机到MRI扫描仪,传统上被实现为单一的、特定于模态的代码。我们证明,在一个广泛且精确定义的算子类Cimg中(涵盖临床、科学和工业成像模态,包括线性和非线性),每个前向模型都可以近似表示为一个类型化的有向无环图(DAG),其节点来自于一个恰好包含11个典型原始元素的库:传播(Propagate)、调制(Modulate)、投影(Project)、编码(Encode)、卷积(Convolve)、累加(Accumulate)、检测(Detect)、采样(Sample)、分散(Disperse)、散射(Scatter)和变换(Transform)。我们称之为有限原始基定理。证明是构造性的:我们提供了一种算法,给定Cimg中的任意H,生成一个相对算子误差最多为epsilon且图复杂度在规定范围内的DAG G。我们进一步证明该库是最小的:去除任何单一原始元素都会导致至少一种模态失去其epsilon近似表示。对成像物理中非线性的系统分析显示,它们可分为两种结构类别:逐点标量函数(由变换处理)和自洽迭代(展开为现有的线性原始元素)。在31种线性模态上的实证验证确认eimg低于0.01,且最多包含5个节点和深度为5,我们还为9种额外的非线性模态提供了构造性的DAG分解。这些结果为物理世界模型(PWM)框架奠定了数学基础。
cs.CV / 27 / 2602.20551
CAD-Prompted SAM3: Geometry-Conditioned Instance Segmentation for Industrial Objects
CAD提示的SAM3:面向工业物体的几何条件实例分割
Abstract
Verbal-prompted segmentation is inherently limited by the expressiveness of natural language and struggles with uncommon, instance-specific, or difficult-to-describe objects: scenarios frequently encountered in manufacturing and 3D printing environments. While image exemplars provide an alternative, they primarily encode appearance cues such as color and texture, which are often unrelated to a part's geometric identity. In industrial settings, a single component may be produced in different materials, finishes, or colors, making appearance-based prompting unreliable. In contrast, such objects are typically defined by precise CAD models that capture their canonical geometry. We propose a CAD-prompted segmentation framework built on SAM3 that uses canonical multi-view renderings of a CAD model as prompt input. The rendered views provide geometry-based conditioning independent of surface appearance. The model is trained using synthetic data generated from mesh renderings in simulation under diverse viewpoints and scene contexts. Our approach enables single-stage, CAD-prompted mask prediction, extending promptable segmentation to objects that cannot be robustly described by language or appearance alone.
Chinese Translation
基于语言提示的分割固有地受到自然语言表达能力的限制,并且在处理不常见、特定实例或难以描述的物体时面临挑战:这些场景在制造和3D打印环境中经常出现。虽然图像示例提供了一种替代方案,但它们主要编码外观线索,如颜色和纹理,这些通常与部件的几何特征无关。在工业环境中,单个组件可能以不同的材料、表面处理或颜色生产,这使得基于外观的提示不可靠。相比之下,这些物体通常由精确的CAD模型定义,捕捉其典型几何形状。我们提出了一种基于CAD提示的分割框架,建立在SAM3之上,该框架使用CAD模型的典型多视图渲染作为提示输入。渲染视图提供了与表面外观无关的几何条件。该模型使用在多种视角和场景上下文中生成的网格渲染的合成数据进行训练。我们的方法实现了单阶段的CAD提示掩膜预测,将可提示分割扩展到无法仅通过语言或外观可靠描述的物体。
cs.CV / 28 / 2602.20556
WildGHand: Learning Anti-Perturbation Gaussian Hand Avatars from Monocular In-the-Wild Videos
WildGHand:从单目野外视频中学习抗干扰的高斯手部虚拟形象
Abstract
Despite recent progress in 3D hand reconstruction from monocular videos, most existing methods rely on data captured in well-controlled environments and therefore degrade in real-world settings with severe perturbations, such as hand-object interactions, extreme poses, illumination changes, and motion blur. To tackle these issues, we introduce WildGHand, an optimization-based framework that enables self-adaptive 3D Gaussian splatting on in-the-wild videos and produces high-fidelity hand avatars. WildGHand incorporates two key components: (i) a dynamic perturbation disentanglement module that explicitly represents perturbations as time-varying biases on 3D Gaussian attributes during optimization, and (ii) a perturbation-aware optimization strategy that generates per-frame anisotropic weighted masks to guide optimization. Together, these components allow the framework to identify and suppress perturbations across both spatial and temporal dimensions. We further curate a dataset of monocular hand videos captured under diverse perturbations to benchmark in-the-wild hand avatar reconstruction. Extensive experiments on this dataset and two public datasets demonstrate that WildGHand achieves state-of-the-art performance and substantially improves over its base model across multiple metrics (e.g., up to a $15.8\%$ relative gain in PSNR and a $23.1\%$ relative reduction in LPIPS). Our implementation and dataset are available at https://github.com/XuanHuang0/WildGHand.
Chinese Translation
尽管在从单目视频中进行三维手部重建方面取得了近期进展,但大多数现有方法依赖于在受控环境中捕获的数据,因此在现实世界中面临严重干扰时表现不佳,例如手与物体的交互、极端姿势、光照变化和运动模糊。为了解决这些问题,我们提出了WildGHand,一个基于优化的框架,能够在野外视频上实现自适应的三维高斯点云渲染,并生成高保真度的手部虚拟形象。WildGHand包含两个关键组件:(i)一个动态扰动解耦模块,明确地将扰动表示为优化过程中三维高斯属性的时变偏差,以及(ii)一个考虑扰动的优化策略,生成每帧的各向异性加权掩模以指导优化。这些组件共同使得框架能够识别并抑制在空间和时间维度上的扰动。我们进一步整理了一个在多种扰动下捕获的单目手部视频数据集,以基准测试野外手部虚拟形象重建。在该数据集和两个公共数据集上的大量实验表明,WildGHand实现了最先进的性能,并在多个指标上显著优于其基础模型(例如,PSNR相对增益高达$15.8\%$,LPIPS相对降低$23.1\%$)。我们的实现和数据集可在https://github.com/XuanHuang0/WildGHand获取。
cs.CV / 29 / 2602.20569
AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents
AIForge-Doc:检测金融和表单文件中人工智能伪造篡改的基准
Abstract
We present AIForge-Doc, the first dedicated benchmark targeting exclusively diffusion-model-based inpainting in financial and form documents with pixel-level annotation. Existing document forgery datasets rely on traditional digital editing tools (e.g., Adobe Photoshop, GIMP), creating a critical gap: state-of-the-art detectors are blind to the rapidly growing threat of AI-forged document fraud. AIForge-Doc addresses this gap by systematically forging numeric fields in real-world receipt and form images using two AI inpainting APIs -- Gemini 2.5 Flash Image and Ideogram v2 Edit -- yielding 4,061 forged images from four public document datasets (CORD, WildReceipt, SROIE, XFUND) across nine languages, annotated with pixel-precise tampered-region masks in DocTamper-compatible format. We benchmark three representative detectors -- TruFor, DocTamper, and a zero-shot GPT-4o judge -- and find that all existing methods degrade substantially: TruFor achieves AUC=0.751 (zero-shot, out-of-distribution) vs. AUC=0.96 on NIST16; DocTamper achieves AUC=0.563 vs. AUC=0.98 in-distribution, with pixel-level IoU=0.020; GPT-4o achieves only 0.509 -- essentially at chance -- confirming that AI-forged values are indistinguishable to automated detectors and VLMs. These results demonstrate that AIForge-Doc represents a qualitatively new and unsolved challenge for document forensics.
Chinese Translation
我们提出了AIForge-Doc,这是首个专门针对金融和表单文件中基于扩散模型的图像修复伪造、并带有像素级标注的基准。现有的文件伪造数据集依赖于传统的数字编辑工具(例如,Adobe Photoshop、GIMP),这造成了一个关键的缺口:最先进的检测器对快速增长的人工智能伪造文件欺诈威胁视而不见。AIForge-Doc通过使用两个人工智能图像修复API(Gemini 2.5 Flash Image和Ideogram v2 Edit)系统性地伪造真实世界收据和表单图像中的数字字段来填补这一缺口,从四个公共文档数据集(CORD、WildReceipt、SROIE、XFUND)生成了4,061张伪造图像,涵盖九种语言,并以DocTamper兼容格式标注了像素精确的篡改区域掩码。我们对三种代表性的检测器进行了基准测试:TruFor、DocTamper和零样本的GPT-4o评判器,发现所有现有方法的性能均显著下降:TruFor的AUC为0.751(零样本、分布外),而其在NIST16上的AUC为0.96;DocTamper的AUC为0.563(其分布内AUC为0.98),像素级IoU仅为0.020;GPT-4o仅达到0.509,基本处于随机水平,证实人工智能伪造的数值对自动检测器和视觉语言模型(VLMs)而言是不可区分的。这些结果表明,AIForge-Doc代表了文件取证领域一个性质全新且尚未解决的挑战。
cs.CV / 30 / 2602.20575
An interactive enhanced driving dataset for autonomous driving
用于自动驾驶的交互增强驾驶数据集
Abstract
The evolution of autonomous driving towards full automation demands robust interactive capabilities; however, the development of Vision-Language-Action (VLA) models is constrained by the sparsity of interactive scenarios and inadequate multimodal alignment in existing data. To this end, this paper proposes the Interactive Enhanced Driving Dataset (IEDD). We develop a scalable pipeline to mine million-level interactive segments from naturalistic driving data based on interactive trajectories, and design metrics to quantify the interaction processes. Furthermore, the IEDD-VQA dataset is constructed by generating synthetic Bird's Eye View (BEV) videos where semantic actions are strictly aligned with structured language. Benchmark results evaluating ten mainstream Vision Language Models (VLMs) are provided to demonstrate the dataset's reuse value in assessing and fine-tuning the reasoning capabilities of autonomous driving models.
Chinese Translation
自动驾驶向完全自动化的发展需要强大的交互能力;然而,视觉-语言-动作(Vision-Language-Action, VLA)模型的发展受到交互场景稀缺和现有数据中多模态对齐不足的限制。为此,本文提出了交互增强驾驶数据集(Interactive Enhanced Driving Dataset, IEDD)。我们开发了一个可扩展的管道,从自然驾驶数据中基于交互轨迹挖掘百万级的交互片段,并设计了量化交互过程的指标。此外,通过生成合成的鸟瞰视图(Bird's Eye View, BEV)视频,构建了IEDD-VQA数据集,其中语义动作与结构化语言严格对齐。提供了对十种主流视觉语言模型(Vision Language Models, VLMs)的基准评估结果,以展示该数据集在评估和微调自动驾驶模型推理能力方面的重用价值。
cs.CV / 31 / 2602.20577
Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion
通过掩码视觉-语言-动作扩散实现高效且可解释的端到端自主驾驶
Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged as promising candidates for end-to-end autonomous driving. However, these models typically face challenges in inference latency, action precision, and explainability. Existing autoregressive approaches struggle with slow token-by-token generation, while prior diffusion-based planners often rely on verbose, general-purpose language tokens that lack explicit geometric structure. In this work, we propose Masked Vision-Language-Action Diffusion for Autonomous Driving (MVLAD-AD), a novel framework designed to bridge the gap between efficient planning and semantic explainability via a masked vision-language-action diffusion model. Unlike methods that force actions into the language space, we introduce a discrete action tokenization strategy that constructs a compact codebook of kinematically feasible waypoints from real-world driving distributions. Moreover, we propose geometry-aware embedding learning to ensure that embeddings in the latent space approximate physical geometric metrics. Finally, an action-priority decoding strategy is introduced to prioritize trajectory generation. Extensive experiments on nuScenes and derived benchmarks demonstrate that MVLAD-AD achieves superior efficiency and outperforms state-of-the-art autoregressive and diffusion baselines in planning precision, while providing high-fidelity and explainable reasoning.
Chinese Translation
大型语言模型(LLMs)和视觉-语言模型(VLMs)已成为端到端自主驾驶的有希望的候选者。然而,这些模型通常面临推理延迟、动作精度和可解释性方面的挑战。现有的自回归方法在逐个生成令牌时速度较慢,而之前基于扩散的规划方法往往依赖于冗长的通用语言令牌,这些令牌缺乏明确的几何结构。在本研究中,我们提出了用于自主驾驶的掩码视觉-语言-动作扩散(MVLAD-AD),这是一个新颖的框架,旨在通过掩码视觉-语言-动作扩散模型弥合高效规划与语义可解释性之间的差距。与那些将动作强制映射到语言空间的方法不同,我们引入了一种离散动作令牌化策略,从真实世界驾驶分布中构建一个运动学可行的路径点的紧凑代码本。此外,我们提出了几何感知嵌入学习,以确保潜在空间中的嵌入近似物理几何度量。最后,引入了一种动作优先解码策略,以优先生成轨迹。在nuScenes和衍生基准上的大量实验表明,MVLAD-AD在规划精度方面优于最先进的自回归和扩散基线,同时实现了高效性,并提供高保真度和可解释的推理。
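Discrete action tokenization over a waypoint codebook amounts, in its simplest form, to nearest-neighbor assignment. The sketch below assumes 2-D waypoints and a hand-written codebook rather than one constructed from real-world driving distributions as in the paper:

```python
def tokenize_waypoints(trajectory, codebook):
    # Map each continuous (x, y) waypoint to the index of its nearest
    # codebook entry -- a minimal discrete action tokenizer. A learned
    # codebook would ensure every entry is kinematically feasible.
    def nearest(p):
        return min(range(len(codebook)),
                   key=lambda i: (codebook[i][0] - p[0]) ** 2
                               + (codebook[i][1] - p[1]) ** 2)
    return [nearest(p) for p in trajectory]
```

The resulting token sequence is what a masked diffusion decoder can then predict in place of free-form language tokens.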
cs.CV / 32 / 2602.20583
PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
PropFly:通过预训练视频扩散模型的即时监督学习传播
Abstract
Propagation-based video editing enables precise user control by propagating a single edited frame into the following frames while maintaining the original context such as motion and structure. However, training such models requires large-scale, paired (source and edited) video datasets, which are costly and complex to acquire. Hence, we propose PropFly, a training pipeline for Propagation-based video editing that relies on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets. Specifically, PropFly leverages one-step clean latent estimations from intermediate noised latents with varying Classifier-Free Guidance (CFG) scales to synthesize diverse pairs of 'source' (low-CFG) and 'edited' (high-CFG) latents on the fly. The source latent serves as structural information of the video, while the edited latent provides the target transformation for learning propagation. Our pipeline enables an additional adapter attached to the pre-trained VDM to learn to propagate edits via a Guidance-Modulated Flow Matching (GMFM) loss, which guides the model to replicate the target transformation. Our on-the-fly supervision ensures that the model learns temporally consistent and dynamic transformations. Extensive experiments demonstrate that PropFly significantly outperforms state-of-the-art methods on various video editing tasks, producing high-quality editing results.
Chinese Translation
基于传播的视频编辑通过将单个编辑帧传播到后续帧来实现精确的用户控制,同时保持原始上下文,如运动和结构。然而,训练此类模型需要大规模的配对(源和编辑)视频数据集,这些数据集的获取成本高且复杂。因此,我们提出了PropFly,一种基于传播的视频编辑训练管道,依赖于来自预训练视频扩散模型(VDMs)的即时监督,而不需要现成或预计算的配对视频编辑数据集。具体而言,我们的PropFly利用中间噪声潜变量的一步清晰潜变量估计,结合不同的无分类器引导(CFG)尺度,实时合成多样的“源”(低CFG)和“编辑”(高CFG)潜变量对。源潜变量作为视频的结构信息,而编辑潜变量则提供学习传播的目标变换。我们的管道使附加在预训练VDM上的适配器能够通过引导调制流匹配(GMFM)损失学习传播编辑,该损失引导模型复制目标变换。我们的即时监督确保模型学习时间一致和动态的变换。大量实验表明,我们的PropFly在各种视频编辑任务上显著优于现有最先进的方法,生成高质量的编辑结果。
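The mechanism of pairing low-CFG 'source' and high-CFG 'edited' latents rests on the standard classifier-free-guidance combination followed by a one-step clean estimate. The sketch below assumes the common epsilon-parameterization x0 = (x_t - sigma * eps) / alpha; the paper's scheduler and parameterization may differ:

```python
def one_step_clean_estimate(x_t, eps_uncond, eps_cond, cfg_scale, alpha, sigma):
    # Classifier-free guidance: eps = eps_u + s * (eps_c - eps_u), then the
    # standard one-step x0 prediction. Running this with a low vs high
    # `cfg_scale` on the same noised latent yields a 'source'/'edited' pair.
    x0 = []
    for xt, eu, ec in zip(x_t, eps_uncond, eps_cond):
        eps = eu + cfg_scale * (ec - eu)
        x0.append((xt - sigma * eps) / alpha)
    return x0
```

Because both estimates come from the same noised latent, they share structure while differing in how strongly the conditioning (the edit) is expressed, which is exactly the supervision signal the pipeline needs.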
cs.CV / 33 / 2602.20584
Long-Term Multi-Session 3D Reconstruction Under Substantial Appearance Change
显著外观变化下的长期多会话3D重建
Abstract
Long-term environmental monitoring requires the ability to reconstruct and align 3D models across repeated site visits separated by months or years. However, existing Structure-from-Motion (SfM) pipelines implicitly assume near-simultaneous image capture and limited appearance change, and therefore fail when applied to long-term monitoring scenarios such as coral reef surveys, where substantial visual and structural change is common. In this paper, we show that the primary limitation of current approaches lies in their reliance on post-hoc alignment of independently reconstructed sessions, which is insufficient under large temporal appearance change. We address this limitation by enforcing cross-session correspondences directly within a joint SfM reconstruction. Our approach combines complementary handcrafted and learned visual features to robustly establish correspondences across large temporal gaps, enabling the reconstruction of a single coherent 3D model from imagery captured years apart, where standard independent and joint SfM pipelines break down. We evaluate our method on long-term coral reef datasets exhibiting significant real-world change, and demonstrate consistent joint reconstruction across sessions in cases where existing methods fail to produce coherent reconstructions. To ensure scalability to large datasets, we further restrict expensive learned feature matching to a small set of likely cross-session image pairs identified via visual place recognition, which reduces computational cost and improves alignment robustness.
Chinese Translation
长期环境监测需要能够在数月或数年间隔的重复现场访问中重建和对齐3D模型。然而,现有的运动结构(Structure-from-Motion, SfM)管道隐含假设图像捕捉几乎是同时的且外观变化有限,因此在应用于长期监测场景(如珊瑚礁调查)时会失效,因为在这些场景中显著的视觉和结构变化是常见的。本文表明,当前方法的主要限制在于它们依赖于对独立重建会话的事后对齐,而在大规模时间外观变化下,这种方法是不够的。我们通过在联合SfM重建中直接强制跨会话对应关系来解决这一限制。我们的方法结合了互补的手工制作和学习的视觉特征,以稳健地建立跨越大时间间隔的对应关系,从而能够从数年间隔捕获的图像中重建出一个单一一致的3D模型,而标准的独立和联合SfM管道在此情况下会失效。我们在具有显著真实世界变化的长期珊瑚礁数据集上评估了我们的方法,并展示了在现有方法无法产生一致重建的情况下,跨会话的一致联合重建。为了确保对大数据集的可扩展性,我们进一步将昂贵的学习特征匹配限制在通过视觉地点识别识别的小范围可能的跨会话图像对上,从而降低计算成本并提高对齐的鲁棒性。
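The scalability step in the final sentence, restricting expensive learned matching to likely cross-session image pairs found by visual place recognition, reduces in a minimal sketch to a top-k lookup over global image descriptors. The descriptors here are random stand-ins for place-recognition embeddings (e.g. NetVLAD-style vectors), which the abstract does not name.

```python
import numpy as np

def top_cross_session_pairs(desc_a, desc_b, k):
    # desc_a: (Na, D), desc_b: (Nb, D) L2-normalised global descriptors for
    # the two sessions; returns the k most similar (i, j) image pairs, the
    # only ones forwarded to the expensive learned feature matcher.
    sim = desc_a @ desc_b.T                        # cosine similarity matrix
    flat = np.argsort(sim, axis=None)[::-1][:k]    # top-k over all pairs
    return [tuple(map(int, np.unravel_index(i, sim.shape))) for i in flat]

rng = np.random.default_rng(1)
a = rng.normal(size=(5, 16)); a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.normal(size=(6, 16)); b /= np.linalg.norm(b, axis=1, keepdims=True)
pairs = top_cross_session_pairs(a, b, k=3)
print(pairs)
```

Matching cost thus drops from Na x Nb learned comparisons to k, at the price of trusting the place-recognition ranking.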
cs.CV / 34 / 2602.20597
Interaction-aware Representation Modeling with Co-occurrence Consistency for Egocentric Hand-Object Parsing
基于共现一致性的交互感知表示建模用于自我中心手-物体解析
Abstract
A fine-grained understanding of egocentric human-environment interactions is crucial for developing next-generation embodied agents. One fundamental challenge in this area involves accurately parsing hands and active objects. While transformer-based architectures have demonstrated considerable potential for such tasks, several key limitations remain unaddressed: 1) existing query initialization mechanisms rely primarily on semantic cues or learnable parameters, demonstrating limited adaptability to changing active objects across varying input scenes; 2) previous transformer-based methods utilize pixel-level semantic features to iteratively refine queries during mask generation, which may introduce interaction-irrelevant content into the final embeddings; and 3) prevailing models are susceptible to "interaction illusion", producing physically inconsistent predictions. To address these issues, we propose an end-to-end Interaction-aware Transformer (InterFormer), which integrates three key components, i.e., a Dynamic Query Generator (DQG), a Dual-context Feature Selector (DFS), and the Conditional Co-occurrence (CoCo) loss. The DQG explicitly grounds query initialization in the spatial dynamics of hand-object contact, enabling targeted generation of interaction-aware queries for hands and various active objects. The DFS fuses coarse interactive cues with semantic features, thereby suppressing interaction-irrelevant noise and emphasizing the learning of interactive relationships. The CoCo loss incorporates hand-object relationship constraints to enhance physical consistency in prediction. Our model achieves state-of-the-art performance on both the EgoHOS and the challenging out-of-distribution mini-HOI4D datasets, demonstrating its effectiveness and strong generalization ability. Code and models are publicly available at https://github.com/yuggiehk/InterFormer.
Chinese Translation
对自我中心人类与环境交互的细粒度理解对于开发下一代具身智能体至关重要。在这一领域的一个基本挑战是准确解析手部和活动物体。尽管基于变换器的架构在此类任务中展现了相当大的潜力,但仍存在几个关键限制未得到解决:1)现有的查询初始化机制主要依赖于语义线索或可学习参数,表现出对不同输入场景中活动物体变化的适应性有限;2)之前的基于变换器的方法在掩码生成过程中利用像素级语义特征逐步细化查询,这可能会将与交互无关的内容引入最终的嵌入中;3)现有模型易受“交互幻觉”的影响,产生物理上不一致的预测。为了解决这些问题,我们提出了一种端到端的交互感知变换器(Interaction-aware Transformer,InterFormer),该模型集成了三个关键组件,即动态查询生成器(Dynamic Query Generator,DQG)、双上下文特征选择器(Dual-context Feature Selector,DFS)和条件共现(Conditional Co-occurrence,CoCo)损失。DQG明确将查询初始化与手-物体接触的空间动态相结合,使得能够针对手部和各种活动物体生成交互感知的查询。DFS将粗略的交互线索与语义特征融合,从而抑制与交互无关的噪声,强调交互关系的学习。CoCo损失结合手-物体关系约束,以增强预测的物理一致性。我们的模型在EgoHOS和具有挑战性的分布外mini-HOI4D数据集上实现了最先进的性能,证明了其有效性和强大的泛化能力。代码和模型已公开发布在 https://github.com/yuggiehk/InterFormer。
cs.CV / 35 / 2602.20608
VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos
VAGNet:从视频中的人-物交互中获取3D可用性
Abstract
3D object affordance grounding aims to identify regions on 3D objects that support human-object interaction (HOI), a capability essential to embodied visual reasoning. However, most existing approaches rely on static visual or textual cues, neglecting that affordances are inherently defined by dynamic actions. As a result, they often struggle to localize the true contact regions involved in real interactions. We take a different perspective. Humans learn how to use objects by observing and imitating actions, not just by examining shapes. Motivated by this intuition, we introduce video-guided 3D affordance grounding, which leverages dynamic interaction sequences to provide functional supervision. To achieve this, we propose VAGNet, a framework that aligns video-derived interaction cues with 3D structure to resolve ambiguities that static cues cannot address. To support this new setting, we introduce PVAD, the first HOI video-3D pairing affordance dataset, providing functional supervision unavailable in prior works. Extensive experiments on PVAD show that VAGNet achieves state-of-the-art performance, significantly outperforming static-based baselines. The code and dataset will be made publicly available.
Chinese Translation
3D物体可用性定位旨在识别支持人-物交互(HOI)的3D物体上的区域,这一能力对具身视觉推理至关重要。然而,大多数现有方法依赖于静态视觉或文本线索,忽视了可用性本质上是由动态动作定义的。因此,它们通常难以定位真实交互中涉及的接触区域。我们采取了不同的视角。人类通过观察和模仿动作来学习如何使用物体,而不仅仅是通过检查形状。受到这一直觉的启发,我们引入了视频引导的3D可用性定位,利用动态交互序列提供功能性监督。为此,我们提出了VAGNet,一个将视频衍生的交互线索与3D结构对齐的框架,以解决静态线索无法处理的歧义。为了支持这一新设置,我们引入了PVAD,这是第一个HOI视频-3D配对可用性数据集,提供了以往工作中无法获得的功能性监督。在PVAD上的大量实验表明,VAGNet达到了最先进的性能,显著超越了基于静态的基线。代码和数据集将公开发布。
cs.CV / 36 / 2602.20616
Knowing the Unknown: Interpretable Open-World Object Detection via Concept Decomposition Model
认识未知:通过概念分解模型实现可解释的开放世界物体检测
Abstract
Open-world object detection (OWOD) requires incrementally detecting known categories while reliably identifying unknown objects. Existing methods primarily focus on improving unknown recall, yet overlook interpretability, often leading to known-unknown confusion and reduced prediction reliability. This paper aims to make the entire OWOD framework interpretable, enabling the detector to truly "know the unknown". To this end, we propose a concept-driven InterPretable OWOD framework (IPOW) by introducing a Concept Decomposition Model (CDM) for OWOD, which explicitly decomposes the coupled RoI features in Faster R-CNN into discriminative, shared, and background concepts. Discriminative concepts identify the most discriminative features to enlarge the distances between known categories, while shared and background concepts, due to their strong generalization ability, can be readily transferred to detect unknown categories. Leveraging the interpretable framework, we identify that known-unknown confusion arises when unknown objects fall into the discriminative space of known classes. To address this, we propose Concept-Guided Rectification (CGR) to further resolve such confusion. Extensive experiments show that IPOW significantly improves unknown recall while mitigating confusion, and provides concept-level interpretability for both known and unknown predictions.
Chinese Translation
开放世界物体检测(OWOD)需要逐步检测已知类别,同时可靠地识别未知物体。现有方法主要集中在提高未知物体的召回率上,但往往忽视了可解释性,这常常导致已知与未知之间的混淆,并降低了预测的可靠性。本文旨在使整个OWOD框架具有可解释性,使检测器能够真正做到“认识未知”。为此,我们提出了一种基于概念的可解释OWOD框架(IPOW),通过引入概念分解模型(CDM)来实现OWOD,该模型明确地将Faster R-CNN中的耦合RoI特征分解为判别性、共享和背景概念。判别性概念识别出最具判别性的特征,以增大已知类别之间的距离,而共享和背景概念由于其强大的泛化能力,可以方便地转移到未知类别的检测中。利用可解释框架,我们发现当未知物体落入已知类别的判别空间时,会产生已知与未知的混淆。为了解决这一问题,我们提出了概念引导的修正(CGR)来进一步消除这种混淆。大量实验表明,IPOW显著提高了未知物体的召回率,同时减轻了混淆,并为已知和未知预测提供了概念层面的可解释性。
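The role of the concept decomposition in unknown detection can be illustrated with a toy rule: a strong discriminative-concept response indicates a known class, while a strong shared ("objectness") response alone flags an unknown. The concept banks, the linear responses, and the threshold below are all illustrative stand-ins, not the learned CDM.

```python
import numpy as np

def concept_scores(feat, disc_bank, shared_bank):
    # Respond a RoI feature against discriminative and shared concept banks
    # (each row of a bank is one concept vector).
    return disc_bank @ feat, shared_bank @ feat

def classify(feat, disc_bank, shared_bank, tau=1.0):
    d, s = concept_scores(feat, disc_bank, shared_bank)
    if d.max() > tau:
        return f"known:{int(d.argmax())}"   # strong discriminative response
    if s.max() > tau:
        return "unknown"                    # object-like, but no known class fires
    return "background"

# Toy banks: two known-class concepts plus one shared objectness concept.
disc = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
shared = np.array([[0.0, 0.0, 1.0]])
print(classify(np.array([2.0, 0.1, 0.2]), disc, shared))  # known:0
print(classify(np.array([0.1, 0.2, 2.0]), disc, shared))  # unknown
print(classify(np.array([0.1, 0.1, 0.1]), disc, shared))  # background
```

The known-unknown confusion the paper targets corresponds here to an unknown object that happens to project strongly onto a discriminative concept, which is what CGR is introduced to rectify.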
cs.CV / 37 / 2602.20618
RecoverMark: Robust Watermarking for Localization and Recovery of Manipulated Faces
RecoverMark:用于定位和恢复被操纵人脸的鲁棒水印技术
Abstract
The proliferation of AI-generated content has facilitated sophisticated face manipulation, severely undermining visual integrity and posing unprecedented challenges to intellectual property. In response, a common proactive defense leverages fragile watermarks to detect, localize, or even recover manipulated regions. However, these methods always assume an adversary who is unaware of the embedded watermark, overlooking their inherent vulnerability to watermark-removal attacks. Furthermore, this fragility is exacerbated in the commonly used dual-watermark strategy, which adds a robust watermark for image ownership verification: mutual interference and limited embedding capacity reduce the fragile watermark's effectiveness. To address this gap, we propose RecoverMark, a watermarking framework that achieves robust manipulation localization, content recovery, and ownership verification simultaneously. Our key insight is twofold. First, we exploit a critical real-world constraint: an adversary must preserve the background's semantic consistency to avoid visual detection, even if they apply global, imperceptible watermark-removal attacks. Second, using the image's own content (the face, in this paper) as the watermark enhances extraction robustness. Based on these insights, RecoverMark treats the protected face content itself as the watermark and embeds it into the surrounding background. By designing a robust two-stage training paradigm, with carefully crafted distortion layers that simulate a comprehensive range of potential attacks and a progressive training strategy, RecoverMark achieves robust, non-fragile watermark embedding for image manipulation localization, recovery, and image IP protection simultaneously. Extensive experiments demonstrate the proposed RecoverMark's robustness against both seen and unseen attacks and its generalizability to in-distribution and out-of-distribution data.
Chinese Translation
人工智能生成内容的普及促进了复杂的人脸操纵,严重破坏了视觉完整性,并对知识产权提出了前所未有的挑战。作为应对措施,一种常见的主动防御方法利用脆弱水印来检测、定位甚至恢复被操纵的区域。然而,这些方法总是假设对手对嵌入水印并不知情,忽视了其对水印去除攻击的固有脆弱性。此外,在常用的双水印策略中,添加用于图像所有权验证的鲁棒水印会加剧这种脆弱性,互相干扰和有限的嵌入容量降低了脆弱水印的有效性。为了解决这一问题,我们提出了RecoverMark,一个同时实现鲁棒操纵定位、内容恢复和所有权验证的水印框架。我们的关键见解有两个。首先,我们利用一个重要的现实约束:对手必须保持背景的语义一致性,以避免视觉检测,即使他们施加全局的、不可察觉的水印去除攻击。其次,使用图像自身的内容(本文中的人脸)作为水印可以增强提取的鲁棒性。基于这些见解,RecoverMark将受保护的人脸内容本身视为水印,并将其嵌入周围背景中。通过设计一个鲁棒的两阶段训练范式,结合精心设计的失真层以模拟全面的潜在攻击和渐进的训练策略,RecoverMark实现了在不脆弱的情况下进行图像操纵定位、恢复和图像知识产权保护的鲁棒水印嵌入。大量实验表明,所提出的RecoverMark对已知和未知攻击具有鲁棒性,并且在分布内和分布外数据上具有良好的泛化能力。
cs.CV / 38 / 2602.20627
Object-Scene-Camera Decomposition and Recomposition for Data-Efficient Monocular 3D Object Detection
面向数据高效的单目三维物体检测的物体-场景-相机分解与重组
Abstract
Monocular 3D object detection (M3OD) is intrinsically ill-posed; hence, training a high-performance deep-learning-based M3OD model requires an enormous amount of labeled data with complicated visual variation across diverse scenes, a variety of objects, and camera poses. However, we observe that, due to strong human bias, the three independent entities, i.e., object, scene, and camera pose, are always tightly entangled when an image is captured to construct training data. More specifically, specific 3D objects are always captured in particular scenes with fixed camera poses, and the data hence lacks the necessary diversity. Such tight entanglement induces the challenging issues of insufficient utilization of, and overfitting to, uniform training data. To mitigate this, we propose an online object-scene-camera decomposition and recomposition data manipulation scheme to exploit the training data more efficiently. We first fully decompose training images into textured 3D object point models and background scenes in a computation- and storage-efficient manner. We then continuously recompose new training images in each epoch by inserting the 3D objects into the free space of the background scenes and rendering them with perturbed camera poses from the textured 3D point representation. In this way, the refreshed training data across all epochs can cover the full spectrum of independent object, scene, and camera-pose combinations. This scheme can serve as a plug-and-play component to boost M3OD models, working flexibly in both fully and sparsely supervised settings. In the sparsely supervised setting, only the objects closest to the ego-camera are sparsely annotated. We can then flexibly increase the number of annotated objects to control annotation cost. For validation, our method is widely applied to five representative M3OD models and evaluated on both the KITTI and the more complicated Waymo datasets.
Chinese Translation
单目三维物体检测(M3OD)本质上是一个不适定问题,因此训练一个高性能的基于深度学习的M3OD模型需要大量带标签的数据,这些数据具有来自不同场景、各种物体和相机姿态的复杂视觉变化。然而,我们观察到,由于强烈的人为偏差,当图像被捕获以构建训练数据时,物体、场景和相机姿态这三个独立实体总是紧密纠缠在一起。更具体地说,特定的三维物体总是被捕获在特定的场景中,并且相机姿态是固定的,因此缺乏必要的多样性。这种紧密的纠缠导致了对均匀训练数据的利用不足和过拟合等挑战性问题。为了解决这个问题,我们提出了一种在线物体-场景-相机分解与重组的数据处理方案,以更有效地利用训练数据。我们首先以高效的计算和存储方式将训练图像完全分解为纹理化的三维物体点模型和背景场景。然后,我们在每个训练周期中通过将三维物体插入背景场景的空闲空间并使用来自纹理化三维点表示的扰动相机姿态进行渲染,持续重组新的训练图像。通过这种方式,所有周期中的更新训练数据可以覆盖独立的物体、场景和相机姿态组合的全谱。这一方案可以作为一个即插即用的组件来提升M3OD模型,灵活地适用于完全监督和稀疏监督的设置。在稀疏监督设置中,离自我相机最近的所有实例的物体被稀疏标注。我们可以灵活地增加标注物体以控制标注成本。为了验证,我们的方法广泛应用于五个代表性的M3OD模型,并在KITTI和更复杂的Waymo数据集上进行评估。
cs.CV / 39 / 2602.20630
From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection
从对到序列:针对关键点检测的轨迹感知策略梯度
Abstract
Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the Track-quality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms several state-of-the-art (SOTA) keypoint detection and description methods.
Chinese Translation
基于关键点的匹配是现代3D视觉系统的一个基本组成部分,如运动重建(Structure-from-Motion, SfM)和同步定位与地图构建(SLAM)。大多数现有的基于学习的方法都是在图像对上进行训练,这种范式未能明确优化关键点在序列中跨越不同视角和光照变化的长期可跟踪性。在本文中,我们将关键点检测重新框定为一个序列决策问题。我们引入了TraqPoint,这是一种新颖的端到端强化学习(Reinforcement Learning, RL)框架,旨在直接在图像序列上优化关键点的轨迹质量(Track-quality, Traq)。我们的核心创新是一个轨迹感知奖励机制,它通过策略梯度方法共同鼓励关键点在多个视角下的一致性和独特性。在稀疏匹配基准测试上的广泛评估,包括相对姿态估计和3D重建,表明TraqPoint显著优于一些最先进(State-of-the-Art, SOTA)的关键点检测和描述方法。
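A track-aware policy gradient of the kind described above can be sketched as REINFORCE over a Bernoulli keypoint-selection policy: sample a subset of candidate detections, score the sequence with a track-quality reward, and nudge the logits along reward-weighted score-function gradients. The toy reward below (first three candidates are "trackable", the rest are distractors) and all names are hypothetical, standing in for the paper's learned reward.

```python
import numpy as np

def reinforce_step(logits, sampled, reward, lr=0.1):
    # One REINFORCE update for a Bernoulli selection policy.
    # grad of log Bernoulli(sampled | sigmoid(logits)) w.r.t. logits is
    # (sampled - p), so the update is lr * reward * (sampled - p).
    p = 1.0 / (1.0 + np.exp(-logits))
    return logits + lr * reward * (sampled - p)

rng = np.random.default_rng(3)
logits = np.zeros(6)                      # 6 candidate keypoints
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-logits))
    s = (rng.random(6) < p).astype(float)  # sample a keypoint subset
    # Toy track-quality reward: credit trackable picks, penalize distractors.
    reward = s[:3].sum() - s[3:].sum()
    logits = reinforce_step(logits, s, reward)
print(logits.round(2))
```

After training, the policy's logits separate the trackable candidates from the distractors, which is the behavior a sequence-level reward buys over pairwise supervision.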
cs.CV / 40 / 2602.20632
Boosting Instance Awareness via Cross-View Correlation with 4D Radar and Camera for 3D Object Detection
通过4D雷达和摄像头的跨视角关联提升实例感知用于3D目标检测
Abstract
4D millimeter-wave radar has emerged as a promising sensing modality for autonomous driving due to its robustness and affordability. However, its sparse and weak geometric cues make reliable instance activation difficult, limiting the effectiveness of existing radar-camera fusion paradigms. BEV-level fusion offers global scene understanding but suffers from weak instance focus, while perspective-level fusion captures instance details but lacks holistic context. To address these limitations, we propose SIFormer, a scene-instance aware transformer for 3D object detection using 4D radar and camera. SIFormer first suppresses background noise during view transformation through segmentation- and depth-guided localization. It then introduces a cross-view activation mechanism that injects 2D instance cues into BEV space, enabling reliable instance awareness under weak radar geometry. Finally, a transformer-based fusion module aggregates complementary image semantics and radar geometry for robust perception. As a result, with the aim of enhancing instance awareness, SIFormer bridges the gap between the two paradigms, combining their complementary strengths to address the inherently sparse nature of radar and improve detection accuracy. Experiments demonstrate that SIFormer achieves state-of-the-art performance on the View-of-Delft, TJ4DRadSet, and NuScenes datasets. Source code is available at github.com/shawnnnkb/SIFormer.
Chinese Translation
4D毫米波雷达因其鲁棒性和经济性而成为自动驾驶领域一种有前景的传感方式。然而,其稀疏且弱的几何线索使得可靠的实例激活变得困难,限制了现有雷达-摄像头融合范式的有效性。鸟瞰图(BEV)级别的融合提供了全球场景理解,但在实例聚焦方面表现较弱;而透视级别的融合则捕捉了实例细节,但缺乏整体上下文。为了解决这些局限性,我们提出了SIFormer,一种基于场景-实例感知的变换器,用于利用4D雷达和摄像头进行3D目标检测。SIFormer首先通过分割和深度引导的定位在视角转换过程中抑制背景噪声。然后,它引入了一种跨视角激活机制,将2D实例线索注入BEV空间,从而在弱雷达几何条件下实现可靠的实例感知。最后,基于变换器的融合模块聚合互补的图像语义和雷达几何信息,以实现稳健的感知。因此,SIFormer旨在增强实例感知,弥合这两种范式之间的差距,结合它们的互补优势,以应对雷达固有的稀疏特性并提高检测精度。实验表明,SIFormer在View-of-Delft、TJ4DRadSet和NuScenes数据集上达到了最先进的性能。源代码可在github.com/shawnnnkb/SIFormer获取。
cs.CV / 41 / 2602.20636
SurgAtt-Tracker: Online Surgical Attention Tracking via Temporal Proposal Reranking and Motion-Aware Refinement
SurgAtt-Tracker:通过时间提议重排序和运动感知精炼实现在线外科注意力追踪
Abstract
Accurate and stable field-of-view (FoV) guidance is critical for safe and efficient minimally invasive surgery, yet existing approaches often conflate visual attention estimation with downstream camera control or rely on direct object-centric assumptions. In this work, we formulate surgical attention tracking as a spatio-temporal learning problem and model surgeon focus as a dense attention heatmap, enabling continuous and interpretable frame-wise FoV guidance. We propose SurgAtt-Tracker, a holistic framework that robustly tracks surgical attention by exploiting temporal coherence through proposal-level reranking and motion-aware refinement, rather than direct regression. To support systematic training and evaluation, we introduce SurgAtt-1.16M, a large-scale benchmark with a clinically grounded annotation protocol that enables comprehensive heatmap-based attention analysis across procedures and institutions. Extensive experiments on multiple surgical datasets demonstrate that SurgAtt-Tracker consistently achieves state-of-the-art performance and strong robustness under occlusion, multi-instrument interference, and cross-domain settings. Beyond attention tracking, our approach provides a frame-wise FoV guidance signal that can directly support downstream robotic FoV planning and automatic camera control.
Chinese Translation
准确且稳定的视野(FoV)引导对于安全高效的微创手术至关重要,然而现有方法通常将视觉注意力估计与后续的相机控制混为一谈,或依赖于直接的以物体为中心的假设。在本研究中,我们将外科注意力追踪形式化为一个时空学习问题,并将外科医生的关注建模为一个密集的注意力热图,从而实现连续且可解释的逐帧视野引导。我们提出了SurgAtt-Tracker,这是一个整体框架,通过提议级重排序和运动感知精炼来稳健地追踪外科注意力,而不是直接回归。为了支持系统的训练和评估,我们引入了SurgAtt-1.16M,这是一个大型基准数据集,具有临床基础的注释协议,能够支持跨程序和机构的全面热图基础注意力分析。在多个外科数据集上的广泛实验表明,SurgAtt-Tracker在遮挡、多工具干扰和跨领域设置下始终实现了最先进的性能和强大的鲁棒性。除了注意力追踪,我们的方法还提供了逐帧的视野引导信号,可以直接支持下游的机器人视野规划和自动相机控制。
cs.CV / 42 / 2602.20650
Dataset Color Quantization: A Training-Oriented Framework for Dataset-Level Compression
数据集颜色量化:一种面向训练的数据集级压缩框架
Abstract
Large-scale image datasets are fundamental to deep learning, but their high storage demands pose challenges for deployment in resource-constrained environments. While existing approaches reduce dataset size by discarding samples, they often ignore the significant redundancy within each image -- particularly in the color space. To address this, we propose Dataset Color Quantization (DCQ), a unified framework that compresses visual datasets by reducing color-space redundancy while preserving information crucial for model training. DCQ achieves this by enforcing consistent palette representations across similar images, selectively retaining semantically important colors guided by model perception, and maintaining structural details necessary for effective feature learning. Extensive experiments across CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that DCQ significantly improves training performance under aggressive compression, offering a scalable and robust solution for dataset-level storage reduction. Code is available at https://github.com/he-y/Dataset-Color-Quantization.
Chinese Translation
大规模图像数据集是深度学习的基础,但其高存储需求在资源受限环境中带来了挑战。现有方法通过丢弃样本来减少数据集大小,但往往忽视了每幅图像内部的显著冗余,特别是在颜色空间中。为了解决这一问题,我们提出了数据集颜色量化(Dataset Color Quantization, DCQ),这是一个统一框架,通过减少颜色空间冗余来压缩视觉数据集,同时保留对模型训练至关重要的信息。DCQ通过在相似图像之间强制一致的调色板表示,选择性地保留由模型感知指导的语义重要颜色,并保持有效特征学习所需的结构细节来实现这一目标。在CIFAR-10、CIFAR-100、Tiny-ImageNet和ImageNet-1K上的广泛实验表明,DCQ在激进压缩下显著提高了训练性能,提供了一种可扩展且稳健的数据集级存储减少解决方案。代码可在 https://github.com/he-y/Dataset-Color-Quantization 获取。
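The "consistent palette representations across similar images" idea can be sketched by fitting one small k-means palette over a whole group of images and quantizing each image against it. This omits DCQ's model-perception guidance and structural-detail preservation; it only shows the shared-palette mechanism.

```python
import numpy as np

def shared_palette(images, k=4, iters=10, seed=0):
    # Fit one k-colour palette over a *group* of similar images (plain
    # k-means over pooled pixels), so the group shares a consistent
    # quantized representation.
    pix = np.concatenate([im.reshape(-1, 3) for im in images]).astype(float)
    rng = np.random.default_rng(seed)
    centers = pix[rng.choice(len(pix), k, replace=False)]
    for _ in range(iters):
        d = ((pix[:, None, :] - centers[None]) ** 2).sum(-1)
        lab = d.argmin(1)
        for j in range(k):
            if (lab == j).any():
                centers[j] = pix[lab == j].mean(0)
    return centers

def quantize(im, palette):
    # Snap every pixel to its nearest palette colour.
    d = ((im.reshape(-1, 1, 3) - palette[None]) ** 2).sum(-1)
    return palette[d.argmin(1)].reshape(im.shape)

rng = np.random.default_rng(4)
imgs = [rng.integers(0, 256, size=(8, 8, 3)) for _ in range(3)]
pal = shared_palette(imgs, k=4)
q = quantize(imgs[0].astype(float), pal)
print(pal.shape, q.shape)
```

Storing one small palette plus per-pixel indices is where the dataset-level compression comes from: each pixel needs only log2(k) bits instead of 24.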
cs.CV / 43 / 2602.20653
SD4R: Sparse-to-Dense Learning for 3D Object Detection with 4D Radar
SD4R:基于稀疏到密集学习的4D雷达三维物体检测
Abstract
4D radar measurements offer an affordable and weather-robust solution for 3D perception. However, the inherent sparsity and noise of radar point clouds present significant challenges for accurate 3D object detection, underscoring the need for effective and robust point clouds densification. Despite recent progress, existing densification methods often fail to address the extreme sparsity of 4D radar point clouds and exhibit limited robustness when processing scenes with a small number of points. In this paper, we propose SD4R, a novel framework that transforms sparse radar point clouds into dense representations. SD4R begins by utilizing a foreground point generator (FPG) to mitigate noise propagation and produce densified point clouds. Subsequently, a logit-query encoder (LQE) enhances conventional pillarization, resulting in robust feature representations. Through these innovations, our SD4R demonstrates strong capability in both noise reduction and foreground point densification. Extensive experiments conducted on the publicly available View-of-Delft dataset demonstrate that SD4R achieves state-of-the-art performance. Source code is available at https://github.com/lancelot0805/SD4R.
Chinese Translation
4D雷达测量为三维感知提供了一种经济且抗天气干扰的解决方案。然而,雷达点云固有的稀疏性和噪声对准确的三维物体检测提出了重大挑战,突显了有效且稳健的点云密化的必要性。尽管近期取得了一些进展,现有的密化方法往往未能有效应对4D雷达点云的极端稀疏性,并且在处理点数较少的场景时表现出有限的鲁棒性。本文提出了SD4R,一个将稀疏雷达点云转化为密集表示的新框架。SD4R首先利用前景点生成器(Foreground Point Generator, FPG)来减轻噪声传播并生成密化的点云。随后,逻辑查询编码器(Logit-Query Encoder, LQE)增强了传统的柱状化,产生了稳健的特征表示。通过这些创新,我们的SD4R在噪声减少和前景点密化方面展现了强大的能力。在公开可用的Delft视图数据集上进行的广泛实验表明,SD4R达到了最先进的性能。源代码可在 https://github.com/lancelot0805/SD4R 获取。
cs.CV / 44 / 2602.20658
Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video
用于手动搬运任务的人机工程学评估的视觉-语言模型:从RGB视频中估计水平和垂直手距离
Abstract
Manual lifting tasks are a major contributor to work-related musculoskeletal disorders, and effective ergonomic risk assessment is essential for quantifying physical exposure and informing ergonomic interventions. The Revised NIOSH Lifting Equation (RNLE) is a widely used ergonomic risk assessment tool for lifting tasks that relies on six task variables, including horizontal (H) and vertical (V) hand distances; such distances are typically obtained through manual measurement or specialized sensing systems and are difficult to use in real-world environments. We evaluated the feasibility of using innovative vision-language models (VLMs) to non-invasively estimate H and V from RGB video streams. Two multi-stage VLM-based pipelines were developed: a text-guided detection-only pipeline and a detection-plus-segmentation pipeline. Both pipelines used text-guided localization of task-relevant regions of interest, visual feature extraction from those regions, and transformer-based temporal regression to estimate H and V at the start and end of a lift. For a range of lifting tasks, estimation performance was evaluated using leave-one-subject-out validation across the two pipelines and seven camera view conditions. Results varied significantly across pipelines and camera view conditions, with the segmentation-based, multi-view pipeline consistently yielding the smallest errors, achieving mean absolute errors of approximately 6-8 cm when estimating H and 5-8 cm when estimating V. Across pipelines and camera view configurations, pixel-level segmentation reduced estimation error by approximately 20-30% for H and 35-40% for V relative to the detection-only pipeline. These findings support the feasibility of VLM-based pipelines for video-based estimation of RNLE distance parameters.
Chinese Translation
手动搬运任务是导致工作相关肌肉骨骼疾病的重要因素,进行有效的人机工程学风险评估对于量化身体暴露和指导人机工程干预至关重要。修订版NIOSH搬运方程(RNLE)是一种广泛使用的搬运任务人机工程学风险评估工具,依赖于六个任务变量,包括水平(H)和垂直(V)手距离;这些距离通常通过手动测量或专用传感系统获得,在实际环境中使用困难。我们评估了使用创新的视觉-语言模型(VLMs)非侵入性地从RGB视频流中估计H和V的可行性。开发了两个多阶段的基于VLM的管道:一个是文本引导的仅检测管道,另一个是检测加分割管道。两个管道均使用文本引导的任务相关区域定位、从这些区域提取视觉特征,以及基于变换器的时间回归来估计搬运开始和结束时的H和V。对于一系列搬运任务,使用留一法验证评估了两个管道和七种摄像机视角条件下的估计性能。结果在管道和摄像机视角条件之间显著不同,基于分割的多视角管道始终产生最小的误差,在估计H时达到约6-8厘米的平均绝对误差,在估计V时达到5-8厘米。在管道和摄像机视角配置中,像素级分割相较于仅检测管道将H的估计误差减少了约20-30%,将V的估计误差减少了约35-40%。这些发现支持基于VLM的管道在视频基础上估计RNLE距离参数的可行性。
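The leave-one-subject-out reporting of mean absolute error used in this evaluation can be sketched as follows: compute MAE over each held-out subject's clips, then average across subjects. The distances and subject IDs below are made-up numbers, and the per-fold model retraining is elided.

```python
import numpy as np

def loso_mae(subject_ids, preds, targets):
    # Leave-one-subject-out style error report: MAE per held-out subject
    # (in cm, as for the RNLE H and V distances), then the mean over subjects.
    per_subject = {}
    for s in np.unique(subject_ids):
        m = subject_ids == s
        per_subject[int(s)] = float(np.abs(preds[m] - targets[m]).mean())
    return per_subject, float(np.mean(list(per_subject.values())))

subj = np.array([0, 0, 1, 1, 2, 2])
pred = np.array([30.0, 34.0, 50.0, 48.0, 41.0, 39.0])   # hypothetical H estimates, cm
true = np.array([32.0, 32.0, 45.0, 45.0, 40.0, 40.0])   # hypothetical ground truth, cm
per_subj, overall = loso_mae(subj, pred, true)
print(per_subj, round(overall, 3))
```

Averaging per subject rather than per clip prevents subjects with many clips from dominating the reported error, which matters when generalization to unseen people is the question.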
cs.CV / 45 / 2602.20664
AnimeAgent: Is the Multi-Agent via Image-to-Video models a Good Disney Storytelling Artist?
AnimeAgent:基于图像到视频模型的多智能体是否是优秀的迪士尼叙事艺术家?
Abstract
Custom Storyboard Generation (CSG) aims to produce high-quality, multi-character-consistent storytelling. Current approaches based on static diffusion models, whether used in a one-shot manner or within multi-agent frameworks, face three key limitations: (1) static models lack dynamic expressiveness and often resort to a "copy-paste" pattern; (2) one-shot inference cannot iteratively correct missing attributes or poor prompt adherence; (3) multi-agent frameworks rely on non-robust evaluators, ill-suited for assessing stylized, non-realistic animation. To address these, we propose AnimeAgent, the first Image-to-Video (I2V)-based multi-agent framework for CSG. Inspired by Disney's "Combination of Straight Ahead and Pose to Pose" workflow, AnimeAgent leverages I2V's implicit motion prior to enhance consistency and expressiveness, while a mixed subjective-objective reviewer enables reliable iterative refinement. We also collect a human-annotated CSG benchmark with ground truth. Experiments show AnimeAgent achieves SOTA performance in consistency, prompt fidelity, and stylization.
Chinese Translation
定制故事板生成(CSG)旨在生成高质量、多角色一致的叙事。目前基于静态扩散模型的方法,无论是一次性使用还是在多智能体框架内,都面临三个主要限制:(1)静态模型缺乏动态表现力,常常退化为“复制粘贴”模式。(2)一次性推理无法迭代地纠正缺失的属性或不良的提示遵循。(3)多智能体依赖于不够稳健的评估者,不适合评估风格化的非现实动画。为了解决这些问题,我们提出了 AnimeAgent,这是第一个基于图像到视频(I2V)的多智能体框架用于 CSG。受迪士尼“直线推进与姿态转换”工作流程的启发,AnimeAgent 利用 I2V 的隐式运动先验来增强一致性和表现力,同时混合的主观-客观评审者使得可靠的迭代优化成为可能。我们还收集了一个带有真实标签的人类注释 CSG 基准。实验表明,AnimeAgent 在一致性、提示保真度和风格化方面达到了最先进的性能。
cs.CV / 46 / 2602.20666
BoxSplitGen: A Generative Model for 3D Part Bounding Boxes in Varying Granularity
BoxSplitGen:一种用于不同粒度3D部件边界框的生成模型
Abstract
Human creativity follows a perceptual process, moving from abstract ideas to finer details during creation. While 3D generative models have advanced dramatically, models specifically designed to assist human imagination in 3D creation -- particularly for detailing abstractions from coarse to fine -- have not been explored. We propose a framework that enables intuitive and interactive 3D shape generation by iteratively splitting bounding boxes to refine the set of bounding boxes. The main technical components of our framework are two generative models: the box-splitting generative model and the box-to-shape generative model. The first model, named BoxSplitGen, generates a collection of 3D part bounding boxes with varying granularity by iteratively splitting coarse bounding boxes. It utilizes part bounding boxes created through agglomerative merging and learns the reverse of the merging process -- the splitting sequences. The model consists of two main components: the first learns the categorical distribution of the box to be split, and the second learns the distribution of the two new boxes, given the set of boxes and the indication of which box to split. The second model, the box-to-shape generative model, is trained by leveraging the 3D shape priors learned by an existing 3D diffusion model while adapting the model to incorporate bounding box conditioning. In our experiments, we demonstrate that the box-splitting generative model outperforms token prediction models and the inpainting approach with an unconditional diffusion model. Also, we show that our box-to-shape model, based on a state-of-the-art 3D diffusion model, provides superior results compared to a previous model.
Chinese Translation
人类创造力遵循一种感知过程,在创作过程中从抽象思维逐渐细化到更精细的细节。尽管3D生成模型已经取得了显著进展,但专门设计用于帮助人类在3D创作中进行想象的模型——特别是在将抽象从粗糙到精细的细节化方面——尚未被探索。我们提出了一种框架,通过迭代分割边界框来细化边界框集合,从而实现直观和交互式的3D形状生成。我们框架的主要技术组件是两个生成模型:边界框分割生成模型和边界框到形状生成模型。第一个模型,名为BoxSplitGen,通过迭代分割粗糙的边界框生成一组具有不同粒度的3D部件边界框。它利用通过聚合合并创建的部件边界框,并学习合并过程的逆过程——分割序列。该模型由两个主要组件组成:第一个学习待分割边界框的类别分布,第二个在给定边界框集合和指示哪个边界框需要分割的情况下,学习两个新边界框的分布。第二个模型,即边界框到形状生成模型,通过利用现有3D扩散模型学习的3D形状先验进行训练,同时调整模型以纳入边界框条件。在我们的实验中,我们证明了边界框分割生成模型在性能上优于标记预测模型和使用无条件扩散模型的修复方法。此外,我们展示了基于最先进的3D扩散模型的我们的边界框到形状模型提供了优于先前模型的结果。
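One coarse-to-fine step of the splitting sequence, a parent box replaced by two children, can be sketched geometrically. In the real model the choice of which box to split and where to cut comes from the two learned distributions; the longest-axis, fixed-fraction rule below is a hypothetical placeholder for both.

```python
import numpy as np

def split_box(box, axis=None, t=0.5):
    # Split an axis-aligned box (min_xyz, max_xyz) into two children at
    # fraction t of its extent along `axis` (default: the longest side).
    lo, hi = np.asarray(box[0], float), np.asarray(box[1], float)
    if axis is None:
        axis = int(np.argmax(hi - lo))
    cut = lo[axis] + t * (hi[axis] - lo[axis])
    a_hi, b_lo = hi.copy(), lo.copy()
    a_hi[axis], b_lo[axis] = cut, cut
    return (lo, a_hi), (b_lo, hi)

root = ([0, 0, 0], [4, 2, 1])
left, right = split_box(root)   # cuts along x, the longest axis
print(left, right)
```

Iterating such splits yields box sets of ever finer granularity, the reverse of the agglomerative merging the model is trained on; the children exactly tile the parent, so no volume is created or lost.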
cs.CV / 47 / 2602.20672
BBQ-to-Image: Numeric Bounding Box and Qolor Control in Large-Scale Text-to-Image Models
BBQ到图像:大规模文本到图像模型中的数值边界框和颜色控制
Abstract
Text-to-image models have rapidly advanced in realism and controllability, with recent approaches leveraging long, detailed captions to support fine-grained generation. However, a fundamental parametric gap remains: existing models rely on descriptive language, whereas professional workflows require precise numeric control over object location, size, and color. In this work, we introduce BBQ, a large-scale text-to-image model that directly conditions on numeric bounding boxes and RGB triplets within a unified structured-text framework. We obtain precise spatial and chromatic control by training on captions enriched with parametric annotations, without architectural modifications or inference-time optimization. This also enables intuitive user interfaces such as object dragging and color pickers, replacing ambiguous iterative prompting with precise, familiar controls. Across comprehensive evaluations, BBQ achieves strong box alignment and improves RGB color fidelity over state-of-the-art baselines. More broadly, our results support a new paradigm in which user intent is translated into an intermediate structured language, consumed by a flow-based transformer acting as a renderer and naturally accommodating numeric parameters.
Chinese Translation
文本到图像模型在逼真性和可控性方面迅速进步,最近的方法利用长而详细的描述来支持细粒度生成。然而,仍然存在一个基本的参数差距:现有模型依赖于描述性语言,而专业工作流程需要对物体的位置、大小和颜色进行精确的数值控制。在本研究中,我们介绍了BBQ,一个大规模的文本到图像模型,它在统一的结构化文本框架内直接以数值边界框和RGB三元组为条件。通过在丰富参数注释的描述上进行训练,我们获得了精确的空间和色彩控制,而无需进行架构修改或推理时优化。这还使得直观的用户界面成为可能,例如物体拖动和颜色选择器,用精确且熟悉的控制替代模糊的迭代提示。在全面评估中,BBQ实现了强大的边框对齐,并提高了RGB颜色保真度,相较于最先进的基线。更广泛地说,我们的结果支持了一种新范式,其中用户意图被转化为中间结构化语言,由作为渲染器的流式变换器处理,自然地适应数值参数。
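Conditioning on numeric bounding boxes and RGB triplets "within a unified structured-text framework" amounts to serializing the parameters into the caption itself. The tag syntax below is purely hypothetical, since the abstract does not give the exact serialization format; it only illustrates how descriptive text and numeric controls can share one string.

```python
def structured_caption(base, objects):
    # Compose a caption carrying numeric box + RGB controls as structured
    # text. Each object is (name, (x0, y0, x1, y1), (r, g, b)).
    parts = [base]
    for name, (x0, y0, x1, y1), (r, g, b) in objects:
        parts.append(f"<obj name={name} box=({x0},{y0},{x1},{y1}) rgb=({r},{g},{b})>")
    return " ".join(parts)

cap = structured_caption(
    "a living room",
    [("sofa", (10, 200, 400, 380), (170, 40, 40)),
     ("lamp", (420, 60, 470, 300), (240, 230, 200))],
)
print(cap)
```

Because the controls are just text, a UI gesture such as dragging a box or picking a color only has to rewrite the corresponding numbers, with no change to the model's interface.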
cs.CV / 48 / 2602.20673
GA-Drive: Geometry-Appearance Decoupled Modeling for Free-viewpoint Driving Scene Generation
GA-Drive:几何-外观解耦建模用于自由视角驾驶场景生成
Abstract
A free-viewpoint, editable, and high-fidelity driving simulator is crucial for training and evaluating end-to-end autonomous driving systems. In this paper, we present GA-Drive, a novel simulation framework capable of generating camera views along user-specified novel trajectories through Geometry-Appearance Decoupling and Diffusion-Based Generation. Given a set of images captured along a recorded trajectory and the corresponding scene geometry, GA-Drive synthesizes novel pseudo-views using geometry information. These pseudo-views are then transformed into photorealistic views using a trained video diffusion model. In this way, we decouple the geometry and appearance of scenes. An advantage of such decoupling is its support for appearance editing via state-of-the-art video-to-video editing techniques, while preserving the underlying geometry, enabling consistent edits across both original and novel trajectories. Extensive experiments demonstrate that GA-Drive substantially outperforms existing methods in terms of NTA-IoU, NTL-IoU, and FID scores.
Chinese Translation
一个自由视角、可编辑且高保真的驾驶模拟器对于训练和评估端到端的自动驾驶系统至关重要。本文提出了GA-Drive,一个新颖的模拟框架,能够通过几何-外观解耦和基于扩散的生成方法,沿用户指定的新轨迹生成相机视图。给定一组沿记录轨迹捕获的图像及其对应的场景几何信息,GA-Drive利用几何信息合成新颖的伪视图。这些伪视图随后通过训练好的视频扩散模型转化为照片级真实感视图。通过这种方式,我们实现了场景几何和外观的解耦。这种解耦的一个优势是支持通过最先进的视频到视频编辑技术进行外观编辑,同时保持基础几何的一致性,使得在原始轨迹和新轨迹之间能够进行一致的编辑。大量实验表明,GA-Drive在NTA-IoU、NTL-IoU和FID评分方面显著优于现有方法。
cs.CV / 49 / 2602.20685
RAYNOVA: 3D-Geometry-Free Auto-Regressive Driving World Modeling with Unified Spatio-Temporal Representation
RAYNOVA:一种无几何的自回归驾驶世界建模方法,采用统一时空表示
Abstract
World foundation models aim to simulate the evolution of the real world with physically plausible behavior. Unlike prior methods that handle spatial and temporal correlations separately, we propose RAYNOVA, a geometry-free world model that employs a dual-causal autoregressive framework. It follows both scale-wise and temporal topological orders in the autoregressive process, and leverages global attention for unified 4D spatio-temporal reasoning. Different from existing works that impose strong 3D geometric priors, RAYNOVA constructs an isotropic spatio-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding, enabling robust generalization to diverse camera setups and ego motions. We further introduce a recurrent training paradigm to alleviate distribution drift in long-horizon video generation. RAYNOVA achieves state-of-the-art multi-view video generation results on nuScenes, while offering higher throughput and strong controllability under diverse input conditions, generalizing to novel views and camera configurations without explicit 3D scene representation. Our code will be released at http://yichen928.github.io/raynova.
Chinese Translation
世界基础模型旨在模拟现实世界的演变,具备物理上合理的行为。与以往分别处理空间和时间相关性的方式不同,我们提出了RAYNOVA,这是一种无几何的世界模型,采用双因果自回归框架。它在自回归过程中遵循尺度和时间的拓扑顺序,并利用全局注意力进行统一的4D时空推理。与现有强制施加3D几何先验的工作不同,RAYNOVA基于相对Plücker光线位置编码构建了跨视角、帧和尺度的各向同性时空表示,从而实现对多样化摄像机设置和自我运动的强健泛化。我们进一步引入了一种递归训练范式,以减轻长时间视频生成中的分布漂移。RAYNOVA在nuScenes上实现了最先进的多视角视频生成结果,同时在多种输入条件下提供更高的吞吐量和强大的可控性,能够在没有显式3D场景表示的情况下泛化到新视角和摄像机配置。我们的代码将发布在http://yichen928.github.io/raynova。
cs.CV / 50 / 2602.20689
MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision
MatchED:基于端到端匹配的清晰边缘检测
Abstract
Generating crisp, i.e., one-pixel-wide, edge maps remains one of the fundamental challenges in edge detection, affecting both traditional and learning-based methods. To obtain crisp edges, most existing approaches rely on two hand-crafted post-processing algorithms, Non-Maximum Suppression (NMS) and skeleton-based thinning, which are non-differentiable and hinder end-to-end optimization. Moreover, all existing crisp edge detection methods still depend on such post-processing to achieve satisfactory results. To address this limitation, we propose MatchED, a lightweight (only $\sim$21K additional parameters), plug-and-play, matching-based supervision module that can be appended to any edge detection model for joint end-to-end learning of crisp edges. At each training iteration, MatchED performs one-to-one matching between predicted and ground-truth edges based on spatial distance and confidence, ensuring consistency between training and testing protocols. Extensive experiments on four popular datasets demonstrate that integrating MatchED substantially improves the performance of existing edge detection models. In particular, MatchED increases the Average Crispness (AC) metric by up to 2--4$\times$ compared to baseline models. Under the crispness-emphasized evaluation (CEval), MatchED further boosts baseline performance by up to 20--35\% in ODS and achieves similar gains in OIS and AP, achieving SOTA performance that matches or surpasses standard post-processing for the first time. Code is available at https://cvpr26-matched.github.io.
Chinese Translation
生成清晰的边缘图,即宽度为一个像素的边缘图,仍然是边缘检测中的基本挑战之一,影响着传统方法和基于学习的方法。为了获得清晰的边缘,大多数现有方法依赖于两种手工设计的后处理算法:非极大值抑制(Non-Maximum Suppression, NMS)和基于骨架的细化,这些算法是不可微分的,阻碍了端到端的优化。此外,所有现有的清晰边缘检测方法仍然依赖于这些后处理步骤以获得令人满意的结果。为了解决这一限制,我们提出了MatchED,这是一个轻量级的、仅需约21K额外参数的即插即用的基于匹配的监督模块,可以附加到任何边缘检测模型上,实现清晰边缘的联合端到端学习。在每次训练迭代中,MatchED根据空间距离和置信度在预测边缘和真实边缘之间执行一对一匹配,确保训练和测试协议之间的一致性。在四个流行数据集上的大量实验表明,集成MatchED显著提高了现有边缘检测模型的性能。特别是,与基线模型相比,MatchED将平均清晰度(Average Crispness, AC)指标提高了2到4倍。在强调清晰度的评估(Crispness-Emphasized Evaluation, CEval)下,MatchED进一步将基线性能提升了20%到35%(ODS),并在OIS和AP中实现了类似的增益,首次实现了与标准后处理相匹配或超越的最先进(SOTA)性能。代码可在 https://cvpr26-matched.github.io 获取。
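The one-to-one matching step described in the abstract can be sketched as follows. This is a greedy stand-in, not the paper's actual matcher: the cost form (spatial distance penalized by confidence) and the `max_dist` threshold are illustrative assumptions:

```python
import math

def match_edges(pred, gt, max_dist=3.0):
    """Greedy one-to-one matching between predicted and ground-truth
    edge pixels. Each candidate pair is scored by spatial distance
    penalized by the prediction's confidence, and every pixel is
    matched at most once, so supervision rewards exactly one crisp
    response per ground-truth edge point.

    pred: list of (x, y, confidence); gt: list of (x, y).
    Returns a list of (pred_index, gt_index) matches.
    """
    # Score all candidate pairs; lower cost = closer and more confident.
    pairs = []
    for i, (px, py, conf) in enumerate(pred):
        for j, (gx, gy) in enumerate(gt):
            d = math.hypot(px - gx, py - gy)
            if d <= max_dist:
                pairs.append((d - conf, i, j))
    pairs.sort()
    matched_p, matched_g, matches = set(), set(), []
    for _, i, j in pairs:
        if i not in matched_p and j not in matched_g:
            matched_p.add(i)
            matched_g.add(j)
            matches.append((i, j))
    return matches

m = match_edges([(1.0, 1.0, 0.9), (5.0, 5.0, 0.4)], [(1.0, 2.0), (5.0, 4.0)])
```

Because the matching is one-to-one, thick multi-pixel responses go unmatched and are penalized, which is what removes the need for NMS or thinning at test time.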
cs.CV / 51 / 2602.20700
NGL-Prompter: Training-Free Sewing Pattern Estimation from a Single Image
NGL-Prompter:无训练的单图像缝制模式估计
Abstract
Estimating sewing patterns from images is a practical approach for creating high-quality 3D garments. Due to the lack of real-world pattern-image paired data, prior approaches fine-tune large vision language models (VLMs) on synthetic garment datasets generated by randomly sampling from a parametric garment model GarmentCode. However, these methods often struggle to generalize to in-the-wild images, fail to capture real-world correlations between garment parts, and are typically restricted to single-layer outfits. In contrast, we observe that VLMs are effective at describing garments in natural language, yet perform poorly when asked to directly regress GarmentCode parameters from images. To bridge this gap, we propose NGL (Natural Garment Language), a novel intermediate language that restructures GarmentCode into a representation more understandable to language models. Leveraging this language, we introduce NGL-Prompter, a training-free pipeline that queries large VLMs to extract structured garment parameters, which are then deterministically mapped to valid GarmentCode. We evaluate our method on Dress4D, CloSe, and a newly collected dataset of approximately 5,000 in-the-wild fashion images. Our approach achieves state-of-the-art performance on standard geometry metrics and is strongly preferred in both human and GPT-based perceptual evaluations compared to existing baselines. Furthermore, NGL-Prompter can recover multi-layer outfits whereas competing methods focus mostly on single-layer garments, highlighting its strong generalization to real-world images even with occluded parts. These results demonstrate that accurate sewing pattern reconstruction is possible without costly model training. Our code and data will be released for research use.
Chinese Translation
从图像中估计缝制模式是一种创建高质量3D服装的实用方法。由于缺乏真实世界的模式-图像配对数据,先前的方法在通过从参数化服装模型GarmentCode随机抽样生成的合成服装数据集上微调大型视觉语言模型(VLMs)。然而,这些方法通常难以推广到真实场景中的图像,未能捕捉服装部件之间的真实世界关联,并且通常仅限于单层服装。相比之下,我们观察到VLMs在用自然语言描述服装方面表现良好,但在直接从图像回归GarmentCode参数时表现不佳。为了解决这一问题,我们提出了NGL(自然服装语言),这是一种新的中间语言,将GarmentCode重构为更易于语言模型理解的表示。利用这种语言,我们引入了NGL-Prompter,一个无训练的管道,通过查询大型VLMs提取结构化的服装参数,然后确定性地映射到有效的GarmentCode。我们在Dress4D、CloSe以及新收集的约5000张真实场景时尚图像的数据集上评估了我们的方法。与现有基线相比,我们的方法在标准几何指标上达到了最先进的性能,并在人工和基于GPT的感知评估中受到强烈偏爱。此外,NGL-Prompter能够恢复多层服装,而竞争方法主要集中在单层服装上,突显了其在真实场景图像中的强泛化能力,即使在部分遮挡的情况下。这些结果表明,在没有昂贵模型训练的情况下,准确的缝制模式重建是可能的。我们的代码和数据将公开发布供研究使用。
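The deterministic NGL-to-parameters mapping can be illustrated with a toy lookup. Both the NGL field names and the output schema below are hypothetical, the abstract states only that structured VLM output is mapped deterministically to valid GarmentCode, not the concrete format:

```python
def ngl_to_params(ngl):
    """Toy deterministic mapping from an NGL-style structured
    description to numeric garment parameters. The field names
    ("length", "sleeves", "layers") and the numeric ranges are
    illustrative stand-ins, not the real NGL or GarmentCode schema.
    """
    length_map = {"cropped": 0.3, "hip": 0.5, "knee": 0.75, "ankle": 1.0}
    sleeve_map = {"sleeveless": 0.0, "short": 0.35, "long": 1.0}
    return {
        "skirt_length": length_map[ngl["length"]],
        "sleeve_length": sleeve_map[ngl["sleeves"]],
        "num_layers": len(ngl.get("layers", ["base"])),
    }

params = ngl_to_params({"length": "knee", "sleeves": "short",
                        "layers": ["dress", "jacket"]})
```

The point of such an intermediate language is that the VLM only has to emit categorical, human-readable choices; all numeric regression is pushed into a fixed table, which is where prior fine-tuned VLMs struggled.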
cs.CV / 52 / 2602.20709
Onboard-Targeted Segmentation of Straylight in Space Camera Sensors
空间相机传感器中光晕的机载目标分割
Abstract
This study details an artificial intelligence (AI)-based methodology for the semantic segmentation of space camera faults. Specifically, we address the segmentation of straylight effects induced by solar presence around the camera's Field of View (FoV). Anomalous images are sourced from our published dataset. Our approach emphasizes generalization across diverse flare textures, leveraging pre-training on a public dataset (Flare7k++) including flares in various non-space contexts to mitigate the scarcity of realistic space-specific data. A DeepLabV3 model with MobileNetV3 backbone performs the segmentation task. The model design targets deployment in spacecraft resource-constrained hardware. Finally, based on a proposed interface between our model and the onboard navigation pipeline, we develop custom metrics to assess the model's performance in the system-level context.
Chinese Translation
本研究详细介绍了一种基于人工智能(AI)的空间相机故障语义分割方法。具体而言,我们关注于相机视场(FoV)周围由太阳存在引起的光晕效应的分割。异常图像来源于我们发布的数据集。我们的方法强调在多样的耀斑纹理上进行泛化,利用在公共数据集(Flare7k++)上的预训练,该数据集包含各种非空间背景下的耀斑,以缓解现实空间特定数据的稀缺性。使用带有MobileNetV3主干的DeepLabV3模型执行分割任务。模型设计旨在部署在资源受限的航天器硬件中。最后,基于我们模型与机载导航管道之间提出的接口,我们开发了自定义指标,以评估模型在系统级上下文中的性能。
cs.CV / 53 / 2602.20718
Monocular Endoscopic Tissue 3D Reconstruction with Multi-Level Geometry Regularization
基于单目内窥镜的组织三维重建与多层几何正则化
Abstract
Reconstructing deformable endoscopic tissues is crucial for achieving robot-assisted surgery. However, 3D Gaussian Splatting-based approaches encounter challenges in achieving consistent tissue surface reconstruction, while existing NeRF-based methods lack real-time rendering capabilities. In pursuit of both smooth deformable surfaces and real-time rendering, we introduce a novel approach based on 3D Gaussian Splatting. Specifically, we introduce surface-aware reconstruction, initially employing a Signed Distance Field-based method to construct a mesh, subsequently utilizing this mesh to constrain the Gaussian Splatting reconstruction process. Furthermore, to ensure the generation of physically plausible deformations, we incorporate local rigidity and global non-rigidity restrictions to guide Gaussian deformation, tailored for the highly deformable nature of soft endoscopic tissue. Based on 3D Gaussian Splatting, our proposed method delivers a fast rendering process and smooth surface appearances. Quantitative and qualitative analysis against alternative methodologies shows that our approach achieves solid reconstruction quality in both textures and geometries.
Chinese Translation
重建可变形的内窥镜组织对于实现机器人辅助手术至关重要。然而,基于3D高斯点云的方法在实现一致的组织表面重建方面面临挑战,而现有的基于NeRF的方法缺乏实时渲染能力。为了同时追求平滑的可变形表面和实时渲染,我们提出了一种基于3D高斯点云的新方法。具体而言,我们引入了表面感知重建,最初采用基于符号距离场(Sign Distance Field)的方法构建网格,随后利用该网格约束高斯点云重建过程。此外,为了确保生成物理上合理的变形,我们结合了局部刚性和全局非刚性约束,以指导高斯变形,特别针对软内窥镜组织的高度可变形特性。基于3D高斯点云,我们提出的方法实现了快速渲染过程和平滑的表面外观。与其他方法的定量和定性分析表明,我们的方法在纹理和几何方面都实现了可靠的重建质量。
cs.CV / 54 / 2602.20721
CleanStyle: Plug-and-Play Style Conditioning Purification for Text-to-Image Stylization
CleanStyle:即插即用的文本到图像风格化条件净化
Abstract
Style transfer in diffusion models enables controllable visual generation by injecting the style of a reference image. However, recent encoder-based methods, while efficient and tuning-free, often suffer from content leakage, where semantic elements from the style image undesirably appear in the output, impairing prompt fidelity and stylistic consistency. In this work, we introduce CleanStyle, a plug-and-play framework that filters out content-related noise from the style embedding without retraining. Motivated by empirical analysis, we observe that such leakage predominantly stems from the tail components of the style embedding, which are isolated via Singular Value Decomposition (SVD). To address this, we propose CleanStyleSVD (CS-SVD), which dynamically suppresses tail components using a time-aware exponential schedule, providing clean, style-preserving conditional embeddings throughout the denoising process. Furthermore, we present Style-Specific Classifier-Free Guidance (SS-CFG), which reuses the suppressed tail components to construct style-aware unconditional inputs. Unlike conventional methods that use generic negative embeddings (e.g., zero vectors), SS-CFG introduces targeted negative signals that reflect style-specific but prompt-irrelevant visual elements. This enables the model to effectively suppress these distracting patterns during generation, thereby improving prompt fidelity and enhancing the overall visual quality of stylized outputs. Our approach is lightweight, interpretable, and can be seamlessly integrated into existing encoder-based diffusion models without retraining. Extensive experiments demonstrate that CleanStyle substantially reduces content leakage, improves stylization quality and improves prompt alignment across a wide range of style references and prompts.
Chinese Translation
扩散模型中的风格迁移通过注入参考图像的风格实现可控的视觉生成。然而,最近的基于编码器的方法虽然高效且无需调优,但往往会遭遇内容泄漏的问题,即风格图像中的语义元素不期望地出现在输出中,从而损害了提示的保真度和风格一致性。在本研究中,我们提出了CleanStyle,一个即插即用的框架,能够在不重新训练的情况下过滤掉风格嵌入中的内容相关噪声。通过实证分析,我们观察到这种泄漏主要源于风格嵌入的尾部成分,这些成分通过奇异值分解(Singular Value Decomposition, SVD)被隔离。为了解决这个问题,我们提出了CleanStyleSVD(CS-SVD),它使用时间感知的指数调度动态抑制尾部成分,在去噪过程中提供干净、保留风格的条件嵌入。此外,我们还提出了风格特定的无分类器引导(Style-Specific Classifier-Free Guidance, SS-CFG),该方法重用被抑制的尾部成分来构建风格感知的无条件输入。与使用通用负嵌入(例如零向量)的传统方法不同,SS-CFG引入了反映风格特定但与提示无关的视觉元素的针对性负信号。这使得模型能够有效抑制这些干扰模式,从而提高提示的保真度并增强风格化输出的整体视觉质量。我们的方法轻量、可解释,并且可以无缝集成到现有的基于编码器的扩散模型中,而无需重新训练。大量实验表明,CleanStyle显著减少了内容泄漏,提高了风格化质量,并在广泛的风格参考和提示中改善了提示对齐。
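The time-aware exponential suppression schedule in CS-SVD can be sketched as follows. The cutoff `keep`, decay rate `tau`, the exact exponential form, and the assumption that suppression is strongest at early (noisier) timesteps are all illustrative guesses; the abstract gives only the qualitative behavior (tail singular components damped on a time-aware exponential schedule):

```python
import math

def tail_suppression_weights(num_components, t, T, keep=8, tau=0.25):
    """Per-component scaling for a style embedding's SVD spectrum.
    Head components (indices < keep) pass through unchanged; tail
    components, empirically the source of content leakage, are
    scaled by an exponential factor that depends both on how deep
    into the tail the component sits and on the denoising timestep.
    `keep` and `tau` are illustrative hyperparameters.
    """
    s = t / T  # normalized timestep: ~1.0 early in denoising, ~0.0 late
    weights = []
    for k in range(num_components):
        if k < keep:
            weights.append(1.0)
        else:
            # Deeper tail components and earlier timesteps -> stronger suppression.
            weights.append(math.exp(-s * (k - keep + 1) * tau))
    return weights

w = tail_suppression_weights(12, t=900, T=1000)
```

The suppressed tail itself is then what SS-CFG reuses as a targeted negative embedding, rather than discarding it.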
cs.CV / 55 / 2602.20725
Bridging Physically Based Rendering and Diffusion Models with Stochastic Differential Equation
通过随机微分方程连接基于物理的渲染与扩散模型
Abstract
Diffusion-based image generators excel at producing realistic content from text or image conditions, but they offer only limited explicit control over low-level, physically grounded shading and material properties. In contrast, physically based rendering (PBR) offers fine-grained physical control but lacks prompt-driven flexibility. Although these two paradigms originate from distinct communities, both share a common evolution -- from noisy observations to clean images. In this paper, we propose a unified stochastic formulation that bridges Monte Carlo rendering and diffusion-based generative modeling. First, we model a general stochastic differential equation (SDE) formulation of Monte Carlo integration under the Central Limit Theorem. Through instantiation via physically based path tracing, we convert it into a physically grounded SDE representation. Moreover, we provide a systematic analysis of how the physical characteristics of path tracing can be extended to existing diffusion models from the perspective of noise variance. Extensive experiments across multiple tasks show that our method can exert physically grounded control over diffusion-generated results, covering tasks such as rendering and material editing.
Chinese Translation
基于扩散的图像生成器在从文本或图像条件中生成逼真内容方面表现出色,但它们对低级、基于物理的阴影和材料属性的显式控制能力有限。相比之下,基于物理的渲染(PBR)提供了细粒度的物理控制,但缺乏基于提示的灵活性。尽管这两种范式源自不同的社区,但它们在演变过程中有一个共同点——从噪声观察到清晰图像。在本文中,我们提出了一种统一的随机公式,连接蒙特卡洛渲染与基于扩散的生成建模。首先,我们为中央极限定理下的蒙特卡洛积分建模了一般的随机微分方程(SDE)公式。通过基于物理的路径追踪实例化,我们将其转换为基于物理的SDE表示。此外,我们从噪声方差的角度系统分析了路径追踪的物理特性如何扩展到现有的扩散模型。针对多个任务的广泛实验表明,我们的方法能够对扩散生成的结果施加基于物理的控制,涵盖渲染和材料编辑等任务。
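The shared "noisy observation to clean image" evolution the abstract builds on can be made concrete with the standard Central Limit Theorem statement for a Monte Carlo pixel estimate (textbook importance-sampling notation, not the paper's exact SDE):

```latex
% N-sample Monte Carlo estimate of a pixel value I = \int f(x)\,dx,
% with samples x_i drawn from density p:
I_N \;=\; \frac{1}{N}\sum_{i=1}^{N}\frac{f(x_i)}{p(x_i)}
\;\approx\; I + \frac{\sigma}{\sqrt{N}}\,Z,
\qquad Z \sim \mathcal{N}(0,1),\quad
\sigma^2 = \mathrm{Var}\!\left[\frac{f(x)}{p(x)}\right].
```

As the sample count $N$ grows, the Gaussian perturbation $\sigma/\sqrt{N}$ shrinks monotonically toward the clean value $I$, the same clean-image-plus-shrinking-Gaussian structure that a reverse-time diffusion SDE traverses, which is the correspondence that lets path tracing be cast in SDE form.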
cs.CV / 56 / 2602.20731
Communication-Inspired Tokenization for Structured Image Representations
受通信启发的结构化图像表示的标记化
Abstract
Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow-matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end-to-end using a combination of flow-matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.
Chinese Translation
离散图像标记器已成为现代视觉和多模态系统的关键组成部分,为基于变换器的架构提供了顺序接口。然而,大多数现有方法主要优化于重建和压缩,通常生成的标记捕捉的是局部纹理,而非对象级的语义结构。受到人类通信的增量性和组合性启发,我们提出了受通信启发的标记化框架(COMmunication inspired Tokenization,COMiT),用于学习结构化的离散视觉标记序列。COMiT通过迭代观察局部图像裁剪并反复更新其离散表示,在固定的标记预算内构建潜在信息。在每一步中,模型整合新的视觉信息,同时细化和重组现有的标记序列。在经过几次编码迭代后,最终信息条件化一个流匹配解码器,以重建完整图像。编码和解码均在单一的变换器模型内实现,并通过流匹配重建和语义表示对齐损失的组合进行端到端训练。我们的实验表明,尽管语义对齐提供了基础,但关注的顺序标记化对于诱导可解释的、以对象为中心的标记结构至关重要,并显著改善了相较于先前方法的组合泛化和关系推理。
cs.CV / 57 / 2602.20752
OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation
OrthoDiffusion:一种可推广的多任务扩散基础模型用于肌肉骨骼MRI解读
Abstract
Musculoskeletal disorders represent a significant global health burden and are a leading cause of disability worldwide. While MRI is essential for accurate diagnosis, its interpretation remains exceptionally challenging. Radiologists must identify multiple potential abnormalities within complex anatomical structures across different imaging planes, a process that requires significant expertise and is prone to variability. We developed OrthoDiffusion, a unified diffusion-based foundation model designed for multi-task musculoskeletal MRI interpretation. The framework utilizes three orientation-specific 3D diffusion models, pre-trained in a self-supervised manner on 15,948 unlabeled knee MRI scans, to learn robust anatomical features from sagittal, coronal, and axial views. These view-specific representations are integrated to support diverse clinical tasks, including anatomical segmentation and multi-label diagnosis. Our evaluation demonstrates that OrthoDiffusion achieves excellent performance in the segmentation of 11 knee structures and the detection of 8 knee abnormalities. The model exhibited remarkable robustness across different clinical centers and MRI field strengths, consistently outperforming traditional supervised models. Notably, in settings where labeled data was scarce, OrthoDiffusion maintained high diagnostic precision using only 10\% of training labels. Furthermore, the anatomical representations learned from knee imaging proved highly transferable to other joints, achieving strong diagnostic performance across 11 diseases of the ankle and shoulder. These findings suggest that diffusion-based foundation models can serve as a unified platform for multi-disease diagnosis and anatomical segmentation, potentially improving the efficiency and accuracy of musculoskeletal MRI interpretation in real-world clinical workflows.
Chinese Translation
肌肉骨骼疾病在全球健康中占据重要地位,是导致残疾的主要原因之一。尽管MRI对于准确诊断至关重要,但其解读仍然极具挑战性。放射科医生必须在复杂的解剖结构中识别多个潜在异常,这一过程需要显著的专业知识,并且容易受到变异的影响。我们开发了OrthoDiffusion,一个统一的基于扩散的基础模型,旨在用于多任务肌肉骨骼MRI解读。该框架利用三个方向特定的3D扩散模型,这些模型在15,948个未标记的膝关节MRI扫描上以自监督方式进行预训练,以学习来自矢状面、冠状面和轴面视图的稳健解剖特征。这些视图特定的表示被整合以支持多样的临床任务,包括解剖分割和多标签诊断。我们的评估表明,OrthoDiffusion在11个膝关节结构的分割和8个膝关节异常的检测中表现出色。该模型在不同临床中心和MRI磁场强度下表现出显著的鲁棒性,始终优于传统的监督模型。值得注意的是,在标记数据稀缺的环境中,OrthoDiffusion仅使用10%的训练标签仍能保持高诊断精度。此外,从膝关节影像中学习到的解剖表示在其他关节上表现出高度可转移性,在11种踝关节和肩关节疾病的诊断中取得了良好的表现。这些发现表明,基于扩散的基础模型可以作为多疾病诊断和解剖分割的统一平台,可能提高肌肉骨骼MRI解读在实际临床工作流程中的效率和准确性。
cs.CV / 58 / 2602.20773
Federated Learning for Cross-Modality Medical Image Segmentation via Augmentation-Driven Generalization
通过增强驱动的泛化实现跨模态医学图像分割的联邦学习
Abstract
Artificial intelligence has emerged as a transformative tool in medical image analysis, yet developing robust and generalizable segmentation models remains difficult due to fragmented, privacy-constrained imaging data siloed across institutions. While federated learning (FL) enables collaborative model training without centralizing data, cross-modality domain shifts pose a critical challenge, particularly when models trained on one modality fail to generalize to another. Many existing solutions require paired multimodal data per patient or rely on complex architectures, both of which are impractical in real clinical settings. In this work, we consider a realistic FL scenario where each client holds single-modality data (CT or MRI), and systematically investigate augmentation strategies for cross-modality generalization. Using abdominal organ segmentation and whole-heart segmentation as representative multi-class and binary segmentation benchmarks, we evaluate convolution-based spatial augmentation, frequency-domain manipulation, domain-specific normalization, and global intensity nonlinear (GIN) augmentation. Our results show that GIN consistently outperforms alternatives in both centralized and federated settings by simulating cross-modality appearance variations while preserving anatomical structure. For the pancreas, Dice score improved from 0.073 to 0.437, a 498% gain. Our federated approach achieves 93-98% of centralized training accuracy, demonstrating strong cross-modality generalization without compromising data privacy, pointing toward feasible federated AI deployment across diverse healthcare systems.
Chinese Translation
人工智能已成为医学图像分析中的一种变革性工具,但由于各机构之间分散的、受隐私限制的成像数据,开发稳健且具有良好泛化能力的分割模型仍然困难。虽然联邦学习(Federated Learning, FL)能够在不集中数据的情况下进行协作模型训练,但跨模态领域的转变构成了一个关键挑战,特别是当在一种模态上训练的模型无法在另一种模态上泛化时。许多现有解决方案要求每位患者提供配对的多模态数据,或依赖复杂的架构,这在实际临床环境中都不切实际。在本研究中,我们考虑了一个现实的联邦学习场景,其中每个客户端持有单模态数据(CT或MRI),并系统地研究了跨模态泛化的增强策略。以腹部器官分割和全心脏分割作为代表性的多类和二元分割基准,我们评估了基于卷积的空间增强、频域操作、特定领域的归一化和全局强度非线性(Global Intensity Nonlinear, GIN)增强。我们的结果表明,GIN在集中和联邦设置中均优于其他替代方案,通过模拟跨模态外观变化同时保持解剖结构。对于胰腺,Dice得分从0.073提高到0.437,增幅达到498%。我们的联邦方法实现了93-98%的集中训练准确率,展示了强大的跨模态泛化能力而不妨碍数据隐私,指向了在多样化医疗系统中可行的联邦人工智能部署。
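The winning GIN strategy applies one random nonlinear mapping to every pixel intensity, changing appearance while leaving spatial anatomy intact. A minimal stand-in is sketched below; the random cubic intensity curve is an illustrative choice, not the specific nonlinearity used in the paper:

```python
import random

def gin_augment(image, seed=None):
    """Global intensity nonlinear (GIN)-style augmentation sketch:
    draw one random smooth intensity curve and apply it globally,
    simulating a cross-modality appearance shift (e.g. CT-like to
    MRI-like contrast) while the spatial/anatomical structure that
    the segmentation label depends on is untouched.

    image: 2D list of intensities in [0, 1].
    """
    rng = random.Random(seed)
    # Random coefficients of a smooth cubic intensity curve (illustrative).
    a = rng.uniform(0.5, 2.0)
    b = rng.uniform(-0.5, 0.5)
    c = rng.uniform(0.0, 0.5)

    def curve(v):
        out = a * v + b * v ** 2 + c * v ** 3
        return min(max(out, 0.0), 1.0)  # clamp back to [0, 1]

    return [[curve(v) for v in row] for row in image]

img = [[0.0, 0.5], [1.0, 0.25]]
aug = gin_augment(img, seed=7)
```

Because the curve is global and the segmentation mask is unchanged, each federated client can apply it locally to its single-modality data without sharing anything.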
cs.CV / 59 / 2602.20790
Real-time Motion Segmentation with Event-based Normal Flow
基于事件的法向流实时运动分割
Abstract
Event-based cameras are bio-inspired sensors with pixels that independently and asynchronously respond to brightness changes at microsecond resolution, offering the potential to handle visual tasks in challenging scenarios. However, due to the sparse information content in individual events, directly processing the raw event data to solve vision tasks is highly inefficient, which severely limits the applicability of state-of-the-art methods in real-time tasks, such as motion segmentation, a fundamental task for dynamic scene understanding. Incorporating normal flow as an intermediate representation to compress motion information from event clusters within a localized region provides a more effective solution. In this work, we propose a normal flow-based motion segmentation framework for event-based vision. Leveraging the dense normal flow directly learned from event neighborhoods as input, we formulate the motion segmentation task as an energy minimization problem solved via graph cuts, and optimize it iteratively with normal flow clustering and motion model fitting. By using a normal flow-based motion model initialization and fitting method, the proposed system is able to efficiently estimate the motion models of independently moving objects with only a limited number of candidate models, which significantly reduces the computational complexity and ensures real-time performance, achieving nearly an 800x speedup in comparison to the open-source state-of-the-art method. Extensive evaluations on multiple public datasets fully demonstrate the accuracy and efficiency of our framework.
Chinese Translation
基于事件的相机是一种生物启发的传感器,其像素能够独立且异步地以微秒级分辨率响应亮度变化,提供了在挑战性场景中处理视觉任务的潜力。然而,由于单个事件中信息内容的稀疏性,直接处理原始事件数据以解决视觉任务的效率极低,这严重限制了最先进方法在实时任务中的适用性,例如运动分割,这是动态场景理解的基本任务。将法向流作为中间表示来压缩局部区域内事件簇的运动信息提供了更有效的解决方案。在本研究中,我们提出了一种基于法向流的事件视觉运动分割框架。利用从事件邻域直接学习到的密集法向流作为输入,我们将运动分割任务形式化为通过图割解决的能量最小化问题,并通过法向流聚类和运动模型拟合进行迭代优化。通过使用基于法向流的运动模型初始化和拟合方法,所提出的系统能够高效地估计独立移动物体的运动模型,仅需有限数量的候选模型,这显著降低了计算复杂性并确保了实时性能,与开源的最先进方法相比,速度提升近800倍。在多个公共数据集上的广泛评估充分证明了我们框架的准确性和效率。
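The alternating "clustering then model fitting" loop described above can be sketched in its plainest form. The paper solves the assignment jointly with graph cuts over an energy; the version below is a bare iterative stand-in with translational motion models, which is an assumption for illustration:

```python
def segment_motions(flows, models, iters=5):
    """Alternating clustering / model-fitting sketch for normal-flow
    motion segmentation. Each flow vector is assigned to the candidate
    motion model with the smallest residual (clustering step), then
    every model is refit as the mean of its assigned flows (fitting
    step). Translational (vx, vy) models are an illustrative choice.

    flows: list of (vx, vy); models: initial list of (vx, vy) centroids.
    Returns (labels, refined models).
    """
    labels = []
    for _ in range(iters):
        # 1. Assignment: label each flow with its nearest motion model.
        labels = []
        for vx, vy in flows:
            residuals = [(vx - mx) ** 2 + (vy - my) ** 2 for mx, my in models]
            labels.append(residuals.index(min(residuals)))
        # 2. Fitting: refit each motion model from its assigned members.
        for k in range(len(models)):
            members = [flows[i] for i, l in enumerate(labels) if l == k]
            if members:
                models[k] = (sum(v[0] for v in members) / len(members),
                             sum(v[1] for v in members) / len(members))
    return labels, models

flows = [(1.0, 0.0), (1.1, 0.1), (-2.0, 0.0), (-2.1, -0.1)]
labels, models = segment_motions(flows, [(0.5, 0.0), (-1.0, 0.0)])
```

Keeping the candidate model count small is what the paper credits for the large speedup: each event region is compared against only a handful of hypotheses rather than re-estimated from scratch.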
cs.CV / 60 / 2602.20792
SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking
SIMSPINE:一种关注生物力学的3D脊柱运动标注与基准测试仿真框架
Abstract
Modeling spinal motion is fundamental to understanding human biomechanics, yet remains underexplored in computer vision due to the spine's complex multi-joint kinematics and the lack of large-scale 3D annotations. We present a biomechanics-aware keypoint simulation framework that augments existing human pose datasets with anatomically consistent 3D spinal keypoints derived from musculoskeletal modeling. Using this framework, we create the first open dataset, named SIMSPINE, which provides sparse vertebra-level 3D spinal annotations for natural full-body motions in indoor multi-camera capture without external restraints. With 2.14 million frames, this enables data-driven learning of vertebral kinematics from subtle posture variations and bridges the gap between musculoskeletal simulation and computer vision. In addition, we release pretrained baselines covering fine-tuned 2D detectors, monocular 3D pose lifting models, and multi-view reconstruction pipelines, establishing a unified benchmark for biomechanically valid spine motion estimation. Specifically, our 2D spine baselines improve the state-of-the-art from 0.63 to 0.80 AUC in controlled environments, and from 0.91 to 0.93 AP for in-the-wild spine tracking. Together, the simulation framework and SIMSPINE dataset advance research in vision-based biomechanics, motion analysis, and digital human modeling by enabling reproducible, anatomically grounded 3D spine estimation under natural conditions.
Chinese Translation
脊柱运动建模是理解人类生物力学的基础,但由于脊柱复杂的多关节运动学和缺乏大规模3D标注,这一领域在计算机视觉中仍然未得到充分探索。我们提出了一种关注生物力学的关键点仿真框架,该框架通过来自肌肉骨骼建模的解剖学一致的3D脊柱关键点来增强现有的人体姿态数据集。利用该框架,我们创建了第一个开放数据集,命名为SIMSPINE,该数据集为室内多摄像头捕捉下自然全身运动提供稀疏的椎体级3D脊柱标注,且无需外部约束。该数据集包含214万帧数据,使得可以通过微小姿势变化进行椎体运动学的数据驱动学习,并弥合肌肉骨骼仿真与计算机视觉之间的差距。此外,我们发布了覆盖经过微调的2D检测器、单目3D姿态提升模型和多视角重建管道的预训练基线,建立了一个统一的生物力学有效脊柱运动估计基准。具体而言,我们的2D脊柱基线在受控环境中将最先进的技术从0.63提升至0.80 AUC,在野外脊柱跟踪中将AP从0.91提升至0.93。总之,该仿真框架和SIMSPINE数据集通过在自然条件下实现可重复的、基于解剖学的3D脊柱估计,推动了基于视觉的生物力学、运动分析和数字人类建模的研究。
cs.CV / 61 / 2602.20794
VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving
VGGDrive:通过跨视角几何基础增强视觉-语言模型以实现自动驾驶
Abstract
The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in their mediocre performance. While some promising approaches attempt to mitigate this by constructing Q&A data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the VLM's 2D visual features, we introduce a plug-and-play Cross-View 3D Geometric Enabler (CVGE). The CVGE decouples the base VLM architecture and effectively empowers the VLM with 3D features through a hierarchical adaptive injection mechanism. Extensive experiments show that VGGDrive enhances base VLM performance across five autonomous driving benchmarks, including tasks like cross-view risk perception, motion prediction, and trajectory planning. We believe that mature 3D foundation models can empower autonomous driving tasks through effective integration, and we hope our initial exploration demonstrates the potential of this paradigm to the autonomous driving community.
Chinese Translation
跨视角三维几何建模能力对于自动驾驶的重要性不言而喻,然而现有的视觉-语言模型(VLMs)本质上缺乏这一能力,导致其性能平平。尽管一些有前景的方法试图通过构建问答数据进行辅助训练来缓解这一问题,但它们仍未能从根本上赋予VLMs全面处理多样化评估协议的能力。因此,我们开辟了一条新路径,倡导将成熟的三维基础模型的跨视角几何基础注入VLMs,以填补自动驾驶中这一关键能力的缺口。在此背景下,我们提出了一种新颖的架构VGGDrive,它通过跨视角几何基础增强视觉-语言模型以实现自动驾驶。具体而言,为了将冻结的视觉三维模型的跨视角三维几何特征与VLM的二维视觉特征连接起来,我们引入了一种即插即用的跨视角三维几何启用器(CVGE)。CVGE解耦了基础VLM架构,并通过分层自适应注入机制有效地赋予VLM三维特征。大量实验表明,VGGDrive在五个自动驾驶基准测试中提升了基础VLM的性能,包括跨视角风险感知、运动预测和轨迹规划等任务。我们相信,成熟的三维基础模型可以通过有效整合来增强自动驾驶任务的能力,并希望我们的初步探索能够向自动驾驶社区展示这一范式的潜力。
cs.CV / 62 / 2602.20807
RU4D-SLAM: Reweighting Uncertainty in Gaussian Splatting SLAM for 4D Scene Reconstruction
RU4D-SLAM:在高斯喷溅SLAM中重新加权不确定性以实现4D场景重建
Abstract
Combining 3D Gaussian splatting with Simultaneous Localization and Mapping (SLAM) has gained popularity as it enables continuous 3D environment reconstruction during motion. However, existing methods struggle in dynamic environments, where moving objects complicate 3D reconstruction and, in turn, hinder reliable tracking. The emergence of 4D reconstruction, especially 4D Gaussian splatting, offers a promising direction for addressing these challenges, yet its potential for 4D-aware SLAM remains largely underexplored. Along this direction, we propose a robust and efficient framework, namely Reweighting Uncertainty in Gaussian Splatting SLAM (RU4D-SLAM) for 4D scene reconstruction, that introduces temporal factors into spatial 3D representation while incorporating uncertainty-aware perception of scene changes, blurred image synthesis, and dynamic scene reconstruction. We enhance dynamic scene representation by integrating motion blur rendering, and improve uncertainty-aware tracking by extending per-pixel uncertainty modeling, which is originally designed for static scenarios, to handle blurred images. Furthermore, we propose a semantic-guided reweighting mechanism for per-pixel uncertainty estimation in dynamic scenes, and introduce a learnable opacity weight to support adaptive 4D mapping. Extensive experiments on standard benchmarks demonstrate that our method substantially outperforms state-of-the-art approaches in both trajectory accuracy and 4D scene reconstruction, particularly in dynamic environments with moving objects and low-quality inputs. Code available: https://ru4d-slam.github.io
Chinese Translation
将3D高斯喷溅与同时定位与地图构建(SLAM)相结合,因其能够在运动过程中实现连续的3D环境重建而受到广泛关注。然而,现有方法在动态环境中表现不佳,特别是移动物体使得3D重建变得复杂,从而妨碍了可靠的跟踪。4D重建的出现,尤其是4D高斯喷溅,为解决这些挑战提供了一个有前景的方向,但其在4D感知SLAM中的潜力仍然未被充分探索。在此方向上,我们提出了一种稳健且高效的框架,即在高斯喷溅SLAM中重新加权不确定性(RU4D-SLAM),用于4D场景重建,该框架将时间因素引入空间3D表示,同时结合了对场景变化的不确定性感知、模糊图像合成和动态场景重建。我们通过集成运动模糊渲染来增强动态场景表示,并通过将原本为静态场景设计的逐像素不确定性建模扩展到处理模糊图像,从而改善不确定性感知的跟踪。此外,我们提出了一种基于语义的逐像素不确定性估计重新加权机制,并引入了一种可学习的不透明度权重以支持自适应的4D映射。在标准基准上的大量实验表明,我们的方法在轨迹精度和4D场景重建方面显著优于现有最先进的方法,尤其是在动态环境中处理移动物体和低质量输入时。代码可用: https://ru4d-slam.github.io
cs.CV / 63 / 2602.20818
GatedCLIP: Gated Multimodal Fusion for Hateful Memes Detection
GatedCLIP:用于仇恨表情包检测的门控多模态融合
Abstract
Detecting hateful content in multimodal memes presents unique challenges, as harmful messages often emerge from the complex interplay between benign images and text. We propose GatedCLIP, a Vision-Language model that enhances CLIP's multimodal capabilities with specialized architectural improvements for hateful memes detection. Our approach introduces learned projection heads that map CLIP embeddings to a task-optimized semantic space, a dynamic gated fusion mechanism that adaptively weights visual and textual features, and a contrastive learning objective that maintains cross-modal semantic alignment. Experiments on the Hateful Memes dataset demonstrate that GatedCLIP achieves an AUROC of 0.66, substantially outperforming the CLIP baseline (AUROC 0.49) while maintaining computational efficiency with only 350K trainable parameters.
Chinese Translation
在多模态表情包中检测仇恨内容面临独特挑战,因为有害信息往往源于良性图像与文本之间复杂的相互作用。我们提出了GatedCLIP,一种视觉-语言模型,通过针对仇恨表情包检测的专门架构改进增强了CLIP的多模态能力。我们的方法引入了学习的投影头,将CLIP嵌入映射到任务优化的语义空间,采用动态门控融合机制自适应地加权视觉和文本特征,并设定对比学习目标以保持跨模态语义一致性。在仇恨表情包数据集上的实验表明,GatedCLIP达到了0.66的AUROC,显著优于CLIP基线(AUROC 0.49),同时仅使用350K可训练参数保持了计算效率。
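The dynamic gated fusion at the heart of GatedCLIP, a learned gate that adaptively weights visual against textual features, can be sketched minimally. The sigmoid-of-dot-product gate below is an illustrative stand-in for the model's learned gating network:

```python
import math

def gated_fusion(visual, text):
    """Dynamic gated fusion sketch: a scalar gate in (0, 1), computed
    from both modalities, weights the visual vs. textual embeddings
    before combining them. In GatedCLIP the gate is produced by
    learned layers; a sigmoid of the dot product stands in here so
    the mechanism is self-contained.
    """
    dot = sum(v * t for v, t in zip(visual, text))
    gate = 1.0 / (1.0 + math.exp(-dot))  # learned in the real model
    return [gate * v + (1.0 - gate) * t for v, t in zip(visual, text)]

fused = gated_fusion([1.0, 0.0], [0.0, 1.0])
```

A per-example gate matters for hateful memes precisely because the harmful signal may live almost entirely in one modality: the gate can route weight toward whichever embedding carries it.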
cs.CV / 64 / 2602.20839
Training-Free Multi-Concept Image Editing
无训练的多概念图像编辑
Abstract
Editing images with diffusion models without training remains challenging. While recent optimisation-based methods achieve strong zero-shot edits from text, they struggle to preserve identity or capture details that language alone cannot express. Many visual concepts such as facial structure, material texture, or object geometry are impossible to express purely through text prompts alone. To address this gap, we introduce a training-free framework for concept-based image editing, which unifies Optimised DDS with LoRA-driven concept composition, where the training data of the LoRA represent the concept. Our approach enables combining and controlling multiple visual concepts directly within the diffusion process, integrating semantic guidance from text with low-level cues from pretrained concept adapters. We further refine DDS for stability and controllability through ordered timesteps, regularisation, and negative-prompt guidance. Quantitative and qualitative results demonstrate consistent improvements over existing training-free diffusion editing methods on InstructPix2Pix and ComposLoRA benchmarks. Code will be made publicly available.
Chinese Translation
使用扩散模型进行图像编辑而不进行训练仍然具有挑战性。尽管最近的基于优化的方法能够从文本中实现强大的零-shot 编辑,但它们在保持身份或捕捉仅通过语言无法表达的细节方面存在困难。许多视觉概念,如面部结构、材料纹理或物体几何形状,无法仅通过文本提示来表达。为了解决这一问题,我们提出了一种无训练的概念基础图像编辑框架,该框架将优化的 DDS(Diffusion Denosing Score)与基于 LoRA(Low-Rank Adaptation)的概念组合相结合,其中 LoRA 的训练数据代表了该概念。我们的方法使得在扩散过程中直接结合和控制多个视觉概念成为可能,将来自文本的语义指导与来自预训练概念适配器的低级线索相结合。我们进一步通过有序时间步、正则化和负提示指导来优化 DDS,以提高稳定性和可控性。定量和定性结果表明,在 InstructPix2Pix 和 ComposLoRA 基准测试中,我们的方法在现有的无训练扩散编辑方法上实现了一致的改进。代码将公开发布。
cs.CV / 65 / 2602.20845
FLIM Networks with Bag of Feature Points
具有特征点包的FLIM网络
Abstract
Convolutional networks require extensive image annotation, which can be costly and time-consuming. Feature Learning from Image Markers (FLIM) tackles this challenge by estimating encoder filters (i.e., kernel weights) from user-drawn markers on discriminative regions of a few representative images without traditional optimization. Such an encoder combined with an adaptive decoder comprises a FLIM network fully trained without backpropagation. Prior research has demonstrated their effectiveness in Salient Object Detection (SOD), being significantly lighter than existing lightweight models. This study revisits FLIM SOD and introduces FLIM-Bag of Feature Points (FLIM-BoFP), a considerably faster filter estimation method. The previous approach, FLIM-Cluster, derives filters through patch clustering at each encoder block, leading to computational overhead and reduced control over filter locations. FLIM-BoFP streamlines this process by performing a single clustering at the input block, creating a bag of feature points, and defining filters directly from mapped feature points across all blocks. The paper evaluates the benefits in efficiency, effectiveness, and generalization of FLIM-BoFP compared to FLIM-Cluster and other state-of-the-art baselines for parasite detection in optical microscopy images.
Chinese Translation
卷积网络需要大量的图像标注,这既昂贵又耗时。图像标记特征学习(Feature Learning from Image Markers, FLIM)通过从用户绘制的标记中估计编码器滤波器(即,核权重),解决了这一挑战。这些标记位于少量代表性图像的判别区域,而无需传统的优化。这样的编码器与自适应解码器相结合,构成了一个完全不依赖反向传播训练的FLIM网络。先前的研究已证明其在显著目标检测(Salient Object Detection, SOD)中的有效性,且显著轻于现有的轻量级模型。本研究重新审视了FLIM SOD,并引入了FLIM-特征点包(FLIM-Bag of Feature Points, FLIM-BoFP),这是一种显著更快的滤波器估计方法。之前的方法FLIM-Cluster通过在每个编码器块中进行补丁聚类来推导滤波器,这导致了计算开销和对滤波器位置的控制减少。FLIM-BoFP通过在输入块中进行单次聚类,简化了这一过程,创建了一个特征点包,并直接从所有块中映射的特征点定义滤波器。本文评估了FLIM-BoFP在效率、有效性和泛化能力方面相较于FLIM-Cluster及其他最先进基线在光学显微镜图像中寄生虫检测的优势。
cs.CV / 66 / 2602.20851
Hybrid Fusion: One-Minute Efficient Training for Zero-Shot Cross-Domain Image Fusion
混合融合:一分钟高效训练的零-shot跨域图像融合
Abstract
Image fusion seeks to integrate complementary information from multiple sources into a single, superior image. While traditional methods are fast, they lack adaptability and performance. Conversely, deep learning approaches achieve state-of-the-art (SOTA) results but suffer from critical inefficiencies: their reliance on slow, resource-intensive, patch-based training introduces a significant gap with full-resolution inference. We propose a novel hybrid framework that resolves this trade-off. Our method utilizes a learnable U-Net to generate a dynamic guidance map that directs a classic, fixed Laplacian pyramid fusion kernel. This decoupling of policy learning from pixel synthesis enables remarkably efficient full-resolution training, eliminating the train-inference gap. Consequently, our model achieves SOTA-comparable performance when trained from scratch in about one minute on an RTX 4090, or two minutes on a consumer laptop GPU, without any external model, and demonstrates powerful zero-shot generalization across diverse tasks, from infrared-visible to medical imaging. By design, the fused output is linearly constructed solely from source information, ensuring high faithfulness for critical applications. The code is available at https://github.com/Zirconium233/HybridFusion
Chinese Translation
图像融合旨在将来自多个来源的互补信息整合为一幅单一的优质图像。虽然传统方法速度较快,但缺乏适应性和性能。相反,深度学习方法实现了最先进的(SOTA)结果,但存在严重的低效问题:它们依赖于缓慢、资源密集的基于补丁的训练,导致与全分辨率推理之间存在显著差距。我们提出了一种新颖的混合框架,以解决这一权衡。我们的方法利用可学习的 U-Net 生成动态引导图,该图指导经典的固定拉普拉斯金字塔融合核。这种将策略学习与像素合成解耦的方式使得全分辨率训练异常高效,消除了训练与推理之间的差距。因此,我们的模型在 RTX 4090 上从零开始训练约一分钟,或在消费级笔记本 GPU 上训练约两分钟,达到了与 SOTA 相当的性能,并在从红外-可见到医学成像等多种任务中展示了强大的零-shot 泛化能力。根据设计,融合的输出仅由源信息线性构建,确保在关键应用中具有高保真度。代码可在 https://github.com/Zirconium233/HybridFusion 获取。
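Guidance-map-driven Laplacian pyramid fusion can be sketched in a few lines of numpy. This is a simplified stand-in, with average-pool downsampling and nearest-neighbour upsampling instead of the paper's actual pyramid kernel, and a given guidance map `w` in place of the U-Net output:

```python
import numpy as np

def down(x):   # 2x average-pool downsample
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def up(x):     # nearest-neighbour upsample to 2x size
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse_laplacian(a, b, w, levels=3):
    """Fuse images a, b with a per-pixel guidance map w in [0, 1].

    Each Laplacian band is blended as w*a + (1-w)*b, so the output is a
    linear combination of source content only, mirroring the paper's
    faithfulness-by-construction property.
    """
    la, lb, ws = [], [], []
    for _ in range(levels):
        a2, b2, w2 = down(a), down(b), down(w)
        la.append(a - up(a2)); lb.append(b - up(b2)); ws.append(w)
        a, b, w = a2, b2, w2
    out = w * a + (1 - w) * b                  # fused base level
    for band_a, band_b, wl in zip(reversed(la), reversed(lb), reversed(ws)):
        out = up(out) + wl * band_a + (1 - wl) * band_b
    return out

rng = np.random.default_rng(1)
a, b = rng.random((8, 8)), rng.random((8, 8))
fused = fuse_laplacian(a, b, np.ones((8, 8)))  # guidance map: all from `a`
```

A guidance map of all ones reconstructs `a` exactly (the pyramid telescopes), and all zeros reconstructs `b`; intermediate values blend band by band.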
cs.CV / 67 / 2602.20853
On the Explainability of Vision-Language Models in Art History
艺术史中视觉-语言模型的可解释性研究
Abstract
Vision-Language Models (VLMs) transfer visual and textual data into a shared embedding space. In so doing, they enable a wide range of multimodal tasks, while also raising critical questions about the nature of machine 'understanding.' In this paper, we examine how Explainable Artificial Intelligence (XAI) methods can render the visual reasoning of a VLM - namely, CLIP - legible in art-historical contexts. To this end, we evaluate seven methods, combining zero-shot localization experiments with human interpretability studies. Our results indicate that, while these methods capture some aspects of human interpretation, their effectiveness hinges on the conceptual stability and representational availability of the examined categories.
Chinese Translation
视觉-语言模型(VLMs)将视觉和文本数据转化为共享的嵌入空间。通过这种方式,它们支持广泛的多模态任务,同时也引发了关于机器“理解”本质的关键问题。在本文中,我们探讨了可解释人工智能(XAI)方法如何使视觉语言模型(VLM)——特别是 CLIP 的视觉推理在艺术历史背景下变得可读。为此,我们评估了七种方法,将零样本定位实验与人类可解释性研究相结合。我们的结果表明,尽管这些方法捕捉到了一些人类解释的方面,但它们的有效性依赖于所考察类别的概念稳定性和表征可用性。
cs.CV / 68 / 2602.20860
DA-Cal: Towards Cross-Domain Calibration in Semantic Segmentation
DA-Cal:面向语义分割中的跨域校准
Abstract
While existing unsupervised domain adaptation (UDA) methods greatly enhance target domain performance in semantic segmentation, they often neglect network calibration quality, resulting in misalignment between prediction confidence and actual accuracy -- a significant risk in safety-critical applications. Our key insight emerges from observing that performance degrades substantially when soft pseudo-labels replace hard pseudo-labels in cross-domain scenarios due to poor calibration, despite the theoretical equivalence of perfectly calibrated soft pseudo-labels to hard pseudo-labels. Based on this finding, we propose DA-Cal, a dedicated cross-domain calibration framework that transforms target domain calibration into soft pseudo-label optimization. DA-Cal introduces a Meta Temperature Network to generate pixel-level calibration parameters and employs bi-level optimization to establish the relationship between soft pseudo-labels and UDA supervision, while utilizing complementary domain-mixing strategies to prevent overfitting and reduce domain discrepancies. Experiments demonstrate that DA-Cal seamlessly integrates with existing self-training frameworks across multiple UDA segmentation benchmarks, significantly improving target domain calibration while delivering performance gains without inference overhead. The code will be released.
Chinese Translation
尽管现有的无监督领域适应(UDA)方法在语义分割中显著提升了目标领域的性能,但它们往往忽视了网络校准质量,导致预测置信度与实际准确性之间的不匹配——这在安全关键应用中构成了重大风险。我们的关键见解源于观察到,在跨域场景中,当软伪标签替代硬伪标签时,由于校准不良,性能显著下降,尽管理论上完美校准的软伪标签与硬伪标签是等价的。基于这一发现,我们提出了DA-Cal,一个专门的跨域校准框架,将目标领域的校准转化为软伪标签优化。DA-Cal引入了一个元温度网络(Meta Temperature Network)来生成像素级校准参数,并采用双层优化来建立软伪标签与UDA监督之间的关系,同时利用互补的领域混合策略来防止过拟合并减少领域差异。实验表明,DA-Cal能够与现有的自我训练框架无缝集成,在多个UDA分割基准上显著改善目标领域的校准,同时在不增加推理开销的情况下提升性能。代码将会发布。
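The pixel-level calibration idea, turning hard predictions into temperature-scaled soft pseudo-labels, reduces to a per-pixel softmax with a spatially varying temperature. A minimal sketch, with a given temperature map standing in for the Meta Temperature Network's output:

```python
import numpy as np

def soft_pseudo_labels(logits, temperature):
    """Per-pixel temperature-scaled softmax over class logits.

    logits: (C, H, W) class scores; temperature: (H, W) positive values.
    Larger temperatures flatten the distribution, i.e. the pseudo-label
    expresses lower confidence at that pixel.
    """
    z = logits / temperature[None]             # per-pixel rescaling
    z = z - z.max(axis=0, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4, 4))            # 5 classes, 4x4 image
sharp = soft_pseudo_labels(logits, np.full((4, 4), 0.5))
flat = soft_pseudo_labels(logits, np.full((4, 4), 5.0))
```

At every pixel the output is a valid distribution, and the high-temperature map yields strictly less peaked labels than the low-temperature one, which is the degree of freedom DA-Cal's bi-level optimization tunes.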
cs.CV / 69 / 2602.20873
MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification
MUSE:利用精确和多样化语义进行少样本全幻灯片图像分类
Abstract
In computational pathology, few-shot whole slide image classification is primarily driven by the extreme scarcity of expert-labeled slides. Recent vision-language methods incorporate textual semantics generated by large language models, but treat these descriptions as static class-level priors that are shared across all samples and lack sample-wise refinement. This limits both the diversity and precision of visual-semantic alignment, hindering generalization under limited supervision. To overcome this, we propose the stochastic MUlti-view Semantic Enhancement (MUSE), a framework that first refines semantic precision via sample-wise adaptation and then enhances semantic richness through retrieval-augmented multi-view generation. Specifically, MUSE introduces Sample-wise Fine-grained Semantic Enhancement (SFSE), which yields a fine-grained semantic prior for each sample through MoE-based adaptive visual-semantic interaction. Guided by this prior, Stochastic Multi-view Model Optimization (SMMO) constructs an LLM-generated knowledge base of diverse pathological descriptions per class, then retrieves and stochastically integrates multiple matched textual views during training. These dynamically selected texts serve as enriched semantic supervisions to stochastically optimize the vision-language model, promoting robustness and mitigating overfitting. Experiments on three benchmark WSI datasets show that MUSE consistently outperforms existing vision-language baselines in few-shot settings, demonstrating that effective few-shot pathology learning requires not only richer semantic sources but also their active and sample-aware semantic optimization. Our code is available at: https://github.com/JiahaoXu-god/CVPR2026_MUSE.
Chinese Translation
在计算病理学中,少样本全幻灯片图像分类主要受到专家标注幻灯片极度稀缺的驱动。最近的视觉-语言方法结合了由大型语言模型生成的文本语义,但将这些描述视为跨所有样本共享的静态类别级先验,缺乏样本级的细化。这限制了视觉-语义对齐的多样性和精确性,妨碍了在有限监督下的泛化。为此,我们提出了随机多视图语义增强(MUSE)框架,该框架首先通过样本级适应来细化语义精度,然后通过检索增强的多视图生成来丰富语义。具体而言,MUSE引入了样本级细粒度语义增强(SFSE),通过基于MoE的自适应视觉-语义交互为每个样本生成细粒度语义先验。在该先验的指导下,随机多视图模型优化(SMMO)为每个类别构建了一个由LLM生成的多样病理描述知识库,然后在训练过程中检索并随机整合多个匹配的文本视图。这些动态选择的文本作为丰富的语义监督,随机优化视觉-语言模型,促进模型的鲁棒性并减轻过拟合。在三个基准WSI数据集上的实验表明,MUSE在少样本设置中始终优于现有的视觉-语言基线,证明有效的少样本病理学习不仅需要更丰富的语义来源,还需要其主动和样本感知的语义优化。我们的代码可在以下链接获取:https://github.com/JiahaoXu-god/CVPR2026_MUSE。
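The retrieval-and-sampling core of SMMO, matching a sample's visual query against a bank of LLM-generated text views and stochastically picking one per training step, can be sketched as follows. The cosine-similarity retrieval and the bank contents here are illustrative assumptions:

```python
import numpy as np

def sample_text_view(query, text_bank, k, rng):
    """Retrieve the k text views most similar to a sample's visual query
    by cosine similarity, then draw one at random as this step's
    semantic supervision (SMMO's stochastic integration, simplified)."""
    q = query / np.linalg.norm(query)
    t = text_bank / np.linalg.norm(text_bank, axis=1, keepdims=True)
    top = np.argsort(t @ q)[-k:]               # indices of top-k matches
    return int(rng.choice(top))

query = np.array([1.0, 0.0, 0.0])
bank = np.array([[0.9, 0.1, 0.0],              # closest description
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
view = sample_text_view(query, bank, 1, np.random.default_rng(0))
```

With k = 1 the sampling is deterministic and returns the best match; larger k injects the per-step randomness that the abstract credits with robustness and reduced overfitting.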
cs.CV / 70 / 2602.20880
When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
当安全相碰:通过自适应安全引导解决文本到图像扩散中的多类别有害冲突
Abstract
Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are averaged across multiple harmful categories based on predefined keywords. However, these approaches fail to capture the complex interplay among different harm categories, leading to "harmful conflicts" where mitigating one type of harm may inadvertently amplify another, thus increasing the overall harmful rate. To address this issue, we propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. CASG is composed of two components: (i) Conflict-aware Category Identification (CaCI), which identifies the harmful category most aligned with the model's evolving generative state, and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety steering solely along the identified category to avoid multi-category interference. CASG can be applied to both latent-space and text-space safeguards. Experiments on T2I safety benchmarks demonstrate CASG's state-of-the-art performance, reducing the harmful rate by up to 15.4% compared to existing methods.
Chinese Translation
文本到图像(T2I)扩散模型在生成高质量图像方面取得了显著进展,但也引发了有关有害内容生成的潜在安全问题。基于安全引导的方法被提出以减轻有害输出,通过将生成引导远离有害区域来实现,其中这些区域是基于预定义关键词在多个有害类别中平均得出的。然而,这些方法未能捕捉不同有害类别之间的复杂相互作用,导致了“有害冲突”,即减轻一种类型的伤害可能无意中加剧另一种,从而增加总体有害率。为了解决这个问题,我们提出了冲突感知自适应安全引导(Conflict-aware Adaptive Safety Guidance, CASG),这是一种无训练的框架,能够在生成过程中动态识别并应用与类别对齐的安全方向。CASG由两个组件组成:(i)冲突感知类别识别(Conflict-aware Category Identification, CaCI),用于识别与模型不断演变的生成状态最为一致的有害类别;(ii)冲突解决引导应用(Conflict-resolving Guidance Application, CrGA),仅沿识别出的类别应用安全引导,以避免多类别干扰。CASG可以应用于潜在空间和文本空间的安全防护。对T2I安全基准的实验表明,CASG在性能上达到最先进水平,与现有方法相比,有害率降低了多达15.4%。
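The two CASG components reduce, in their simplest form, to picking the single most-aligned category direction and removing only that component from the generative state. A vector-space sketch under assumed shapes (the real method operates on diffusion latents or text embeddings):

```python
import numpy as np

def category_aligned_guidance(state, harm_dirs, scale=1.0):
    """Steer a generative state away from one harmful category.

    CaCI step: pick the category direction most aligned with `state`.
    CrGA step: remove the state's component along only that direction,
    avoiding the cancellation/amplification that averaging all category
    directions can cause.
    """
    dirs = harm_dirs / np.linalg.norm(harm_dirs, axis=1, keepdims=True)
    k = int(np.argmax(dirs @ state))           # most aligned category
    steered = state - scale * (state @ dirs[k]) * dirs[k]
    return steered, k

rng = np.random.default_rng(0)
state = rng.normal(size=16)
harm_dirs = rng.normal(size=(4, 16))           # 4 harmful categories
steered, k = category_aligned_guidance(state, harm_dirs)
```

With `scale=1` the steered state is exactly orthogonal to the chosen category direction, while its components along the other three categories are untouched.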
cs.CV / 71 / 2602.20901
SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models
SpatiaLQA:评估视觉语言模型空间逻辑推理能力的基准
Abstract
Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which not only requires understanding the spatial relationships among objects in complex scenes, but also the logical dependencies between steps in multi-step tasks. To bridge this gap, we introduce Spatial Logical Question Answering (SpatiaLQA), a benchmark designed to evaluate the spatial logical reasoning capabilities of VLMs. SpatiaLQA consists of 9,605 question-answer pairs derived from 241 real-world indoor scenes. We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. To address this issue, we propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs, thereby enhancing the spatial logical reasoning ability of VLMs, outperforming all previous methods. Code and dataset are available at https://github.com/xieyc99/SpatiaLQA.
Chinese Translation
视觉语言模型(VLMs)因其卓越的理解和推理能力而越来越多地应用于现实场景。尽管VLMs在常见的视觉问答和逻辑推理方面已经展现出令人印象深刻的能力,但它们在复杂的现实环境中做出合理决策的能力仍然不足。我们将这种能力定义为空间逻辑推理,它不仅需要理解复杂场景中物体之间的空间关系,还需要理解多步骤任务中步骤之间的逻辑依赖关系。为了解决这一问题,我们引入了空间逻辑问答(SpatiaLQA),这是一个旨在评估VLMs空间逻辑推理能力的基准。SpatiaLQA包含从241个现实室内场景中提取的9,605对问答对。我们对41个主流VLMs进行了广泛的实验,结果表明,即使是最先进的模型在空间逻辑推理方面仍然面临挑战。为了解决这个问题,我们提出了一种称为递归场景图辅助推理的方法,该方法利用视觉基础模型逐步将复杂场景分解为与任务相关的场景图,从而增强VLMs的空间逻辑推理能力,超越了所有之前的方法。代码和数据集可在 https://github.com/xieyc99/SpatiaLQA 获取。
cs.CV / 72 / 2602.20903
TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
TextPecker:奖励结构异常量化以增强视觉文本渲染
Abstract
Visual Text Rendering (VTR) remains a critical challenge in text-to-image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and RL-based optimization. As a result, even state-of-the-art generators (e.g., SeedDream4.0, Qwen-Image) still struggle to render structurally faithful text. To address this, we propose TextPecker, a plug-and-play structural anomaly perceptive RL strategy that mitigates noisy reward signals and works with any text-to-image generator. To enable this capability, we construct a recognition dataset with character-level structural-anomaly annotations and develop a stroke-editing synthesis engine to expand structural-error coverage. Experiments show that TextPecker consistently improves diverse text-to-image models; even on the well-optimized Qwen-Image, it yields significant average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing a new state-of-the-art in high-fidelity VTR. Our work fills a gap in VTR optimization, providing a foundational step towards reliable and structurally faithful visual text generation.
Chinese Translation
视觉文本渲染(VTR)在文本到图像生成中仍然是一个关键挑战,即使是先进的模型也常常产生具有结构异常的文本,例如扭曲、模糊和错位。然而,我们发现领先的多模态大语言模型(MLLMs)和专业的光学字符识别(OCR)模型在很大程度上未能感知这些结构异常,这为VTR评估和基于强化学习(RL)的优化创造了一个关键瓶颈。因此,即使是最先进的生成器(例如SeedDream4.0、Qwen-Image)在渲染结构上忠实的文本时仍然面临困难。为了解决这个问题,我们提出了TextPecker,一种即插即用的结构异常感知RL策略,能够减轻噪声奖励信号,并与任何文本到图像生成器配合使用。为了实现这一能力,我们构建了一个具有字符级结构异常注释的识别数据集,并开发了一个笔画编辑合成引擎,以扩展结构错误的覆盖范围。实验表明,TextPecker持续改善多种文本到图像模型;即使在经过良好优化的Qwen-Image上,它在中文文本渲染中显著提高了4%的结构保真度和8.7%的语义对齐,确立了高保真VTR的新状态。我们的工作填补了VTR优化中的空白,为可靠且结构上忠实的视觉文本生成提供了基础性步骤。
cs.CV / 73 / 2602.20913
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
LongVideo-R1:低成本长视频理解的智能导航
Abstract
This paper addresses the critical and underexplored challenge of long video understanding with low computational budgets. We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. At the core of LongVideo-R1 lies a reasoning module that leverages high-level visual cues to infer the most informative video clip for subsequent processing. During inference, the agent initiates traversal from top-level visual summaries and iteratively refines its focus, immediately halting the exploration process upon acquiring sufficient knowledge to answer the query. To facilitate training, we first extract hierarchical video captions from CGBench, a video corpus with grounding annotations, and guide GPT-5 to generate 33K high-quality chain-of-thought-with-tool trajectories. The LongVideo-R1 agent is fine-tuned upon the Qwen-3-8B model through a two-stage paradigm: supervised fine-tuning (SFT) followed by reinforcement learning (RL), where RL employs a specifically designed reward function to maximize selective and efficient clip navigation. Experiments on multiple long video benchmarks validate the effectiveness of LongVideo-R1, which enjoys a superior tradeoff between QA accuracy and efficiency. All curated data and source code are provided in the supplementary material and will be made publicly available. Code and data are available at: https://github.com/qiujihao19/LongVideo-R1
Chinese Translation
本文针对低计算预算下长视频理解这一关键且未被充分探索的挑战进行了研究。我们提出了LongVideo-R1,这是一种具备推理能力的主动多模态大型语言模型(MLLM)代理,旨在实现高效的视频上下文导航,避免冗余的全面搜索。LongVideo-R1的核心是一个推理模块,利用高级视觉线索推断出最具信息量的视频片段以供后续处理。在推理过程中,代理从顶层视觉摘要开始遍历,并迭代地细化其关注点,一旦获得足够的信息以回答查询,便立即停止探索。为了便于训练,我们首先从CGBench(一个带有基础注释的视频语料库)中提取层次化视频标题,并指导GPT-5生成33K个高质量的思维链与工具轨迹。LongVideo-R1代理在Qwen-3-8B模型上通过两阶段范式进行微调:监督微调(SFT)后接强化学习(RL),其中RL采用专门设计的奖励函数以最大化选择性和高效的片段导航。在多个长视频基准测试上的实验验证了该方法的有效性,其在问答准确性与效率之间享有优越的权衡。所有整理的数据和源代码已在补充材料中提供,并将公开发布。代码和数据可在以下网址获取:https://github.com/qiujihao19/LongVideo-R1
cs.CV / 74 / 2602.20930
Computing a Characteristic Orientation for Rotation-Independent Image Analysis
计算特征方向以实现与旋转无关的图像分析
Abstract
Handling geometric transformations, particularly rotations, remains a challenge in deep learning for computer vision. Standard neural networks lack inherent rotation invariance and typically rely on data augmentation or architectural modifications to improve robustness. Although effective, these approaches increase computational demands, require specialised implementations, or alter network structures, limiting their applicability. This paper introduces General Intensity Direction (GID), a preprocessing method that improves rotation robustness without modifying the network architecture. The method estimates a global orientation for each image and aligns it to a canonical reference frame, allowing standard models to process inputs more consistently across different rotations. Unlike moment-based approaches that extract invariant descriptors, this method directly transforms the image while preserving spatial structure, making it compatible with convolutional networks. Experimental evaluation on the rotated MNIST dataset shows that the proposed method achieves higher accuracy than state-of-the-art rotation-invariant architectures. Additional experiments on the CIFAR-10 dataset confirm that the method remains effective under more complex conditions.
Chinese Translation
处理几何变换,特别是旋转,仍然是计算机视觉深度学习中的一大挑战。标准神经网络缺乏固有的旋转不变性,通常依赖数据增强或架构修改来提高鲁棒性。尽管这些方法有效,但它们增加了计算需求,需要专门的实现,或改变网络结构,从而限制了其适用性。本文提出了一种预处理方法——全局强度方向(General Intensity Direction, GID),该方法在不修改网络架构的情况下提高了旋转鲁棒性。该方法为每个图像估计一个全局方向,并将其对齐到一个标准参考框架,从而使标准模型能够在不同旋转下更一致地处理输入。与基于矩的提取不变描述符的方法不同,该方法直接转换图像,同时保持空间结构,使其与卷积网络兼容。在旋转的MNIST数据集上的实验评估表明,所提出的方法在准确性上优于最先进的旋转不变架构。对CIFAR-10数据集的额外实验确认该方法在更复杂条件下仍然有效。
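One classical way to estimate such a global orientation is the intensity-centroid angle (the estimator used for ORB keypoints); the paper's exact GID computation may differ, so treat this as a sketch of the general idea:

```python
import numpy as np

def intensity_orientation(img):
    """Characteristic orientation from the intensity centroid.

    Returns the angle (radians) from the image centre to the centroid
    of pixel intensities; rotating the image by minus this angle maps
    it to a canonical reference frame.
    """
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    m = img.sum()
    my = ((ys - cy) * img).sum() / m           # first moment, vertical
    mx = ((xs - cx) * img).sum() / m           # first moment, horizontal
    return np.arctan2(my, mx)

right = np.zeros((9, 9)); right[4, 8] = 1.0    # mass right of centre
below = np.zeros((9, 9)); below[8, 4] = 1.0    # mass below centre
```

Because the centroid rotates with the image, the estimated angle is (up to interpolation effects) equivariant to rotation, so aligning every input to angle zero gives the preprocessing-level rotation robustness the abstract describes.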
cs.CV / 75 / 2602.20933
Dropping Anchor and Spherical Harmonics for Sparse-view Gaussian Splatting
锚点投放与球面谐波在稀疏视图高斯点云中的应用
Abstract
Recent 3D Gaussian Splatting (3DGS) Dropout methods address overfitting under sparse-view conditions by randomly nullifying Gaussian opacities. However, we identify a neighbor compensation effect in these approaches: dropped Gaussians are often compensated by their neighbors, weakening the intended regularization. Moreover, these methods overlook the contribution of high-degree spherical harmonic coefficients (SH) to overfitting. To address these issues, we propose DropAnSH-GS, a novel anchor-based Dropout strategy. Rather than dropping Gaussians independently, our method randomly selects certain Gaussians as anchors and simultaneously removes their spatial neighbors. This effectively disrupts local redundancies near anchors and encourages the model to learn more robust, globally informed representations. Furthermore, we extend the Dropout to color attributes by randomly dropping higher-degree SH to concentrate appearance information in lower-degree SH. This strategy further mitigates overfitting and enables flexible post-training model compression via SH truncation. Experimental results demonstrate that DropAnSH-GS substantially outperforms existing Dropout methods with negligible computational overhead, and can be readily integrated into various 3DGS variants to enhance their performances. Project Website: https://sk-fun.fun/DropAnSH-GS
Chinese Translation
近期的三维高斯点云(3D Gaussian Splatting, 3DGS)Dropout 方法通过随机使高斯不透明度失效来解决稀疏视图条件下的过拟合问题。然而,我们发现这些方法存在邻域补偿效应:被丢弃的高斯往往会被其邻域补偿,从而削弱了预期的正则化效果。此外,这些方法忽视了高阶球面谐波系数(Spherical Harmonic, SH)对过拟合的贡献。为了解决这些问题,我们提出了一种新颖的基于锚点的Dropout策略DropAnSH-GS。与其独立丢弃高斯,我们的方法随机选择某些高斯作为锚点,并同时移除其空间邻域。这有效地打破了锚点附近的局部冗余,并鼓励模型学习更强健的、全局信息驱动的表示。此外,我们将Dropout扩展到颜色属性,通过随机丢弃高阶SH来集中低阶SH中的外观信息。该策略进一步缓解了过拟合,并通过SH截断实现灵活的后训练模型压缩。实验结果表明,DropAnSH-GS在计算开销几乎可以忽略不计的情况下,显著优于现有的Dropout方法,并且可以轻松集成到各种3DGS变体中以提升其性能。项目网站:https://sk-fun.fun/DropAnSH-GS
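The anchor-based dropout can be sketched as a mask over Gaussian centres: pick anchors at random, then drop each anchor together with its spatial neighbourhood, so nearby Gaussians cannot compensate for the removed ones. Parameter names and the brute-force distance computation are illustrative:

```python
import numpy as np

def anchor_dropout(centers, n_anchors, radius, rng):
    """Anchor-based dropout over Gaussian centres of shape (N, 3).

    Randomly pick anchors, then drop every Gaussian within `radius` of
    an anchor (including the anchor itself).  Returns a boolean
    keep-mask over the N Gaussians.
    """
    n = len(centers)
    anchors = rng.choice(n, size=n_anchors, replace=False)
    # pairwise distances from every Gaussian to every anchor: (N, n_anchors)
    d = np.linalg.norm(centers[:, None] - centers[anchors][None], axis=-1)
    return ~(d <= radius).any(axis=1)

rng = np.random.default_rng(0)
centers = rng.normal(size=(50, 3))
keep = anchor_dropout(centers, 5, 0.0, rng)    # radius 0: anchors only
```

With `radius = 0` this degenerates to ordinary independent dropout of the anchors; a positive radius is what removes the local redundancy. The SH part of DropAnSH-GS would analogously zero out higher-degree coefficient groups at random.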
cs.CV / 76 / 2602.20943
UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling
UFO:统一前馈和基于优化的方法用于大型驾驶场景建模
Abstract
Dynamic driving scene reconstruction is critical for autonomous driving simulation and closed-loop learning. While recent feed-forward methods have shown promise for 3D reconstruction, they struggle with long-range driving sequences due to quadratic complexity in sequence length and challenges in modeling dynamic objects over extended durations. We propose UFO, a novel recurrent paradigm that combines the benefits of optimization-based and feed-forward methods for efficient long-range 4D reconstruction. Our approach maintains a 4D scene representation that is iteratively refined as new observations arrive, using a visibility-based filtering mechanism to select informative scene tokens and enable efficient processing of long sequences. For dynamic objects, we introduce an object pose-guided modeling approach that supports accurate long-range motion capture. Experiments on the Waymo Open Dataset demonstrate that our method significantly outperforms both per-scene optimization and existing feed-forward methods across various sequence lengths. Notably, our approach can reconstruct 16-second driving logs within 0.5 seconds while maintaining superior visual quality and geometric accuracy.
Chinese Translation
动态驾驶场景重建对于自动驾驶仿真和闭环学习至关重要。尽管最近的前馈方法在三维重建方面展现了潜力,但由于序列长度的平方复杂性以及在较长时间内建模动态物体的挑战,它们在长序列驾驶中表现不佳。我们提出了UFO,一种新颖的递归范式,结合了基于优化和前馈方法的优点,以实现高效的长距离四维重建。我们的方法保持一个四维场景表示,随着新观测的到来进行迭代优化,使用基于可见性的过滤机制选择信息丰富的场景标记,从而实现对长序列的高效处理。对于动态物体,我们引入了一种基于物体姿态的建模方法,支持准确的长距离运动捕捉。在Waymo开放数据集上的实验表明,我们的方法在各种序列长度下显著优于每场景优化和现有的前馈方法。值得注意的是,我们的方法能够在0.5秒内重建16秒的驾驶日志,同时保持卓越的视觉质量和几何精度。
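The visibility-based filtering step, selecting a bounded subset of scene tokens per recurrent update, can be reduced to a top-k selection over visibility scores. This is a stand-in for illustration; how UFO actually computes visibility is not specified in the abstract:

```python
import numpy as np

def filter_scene_tokens(tokens, visibility, keep):
    """Keep the `keep` scene tokens with the highest visibility scores.

    Only tokens likely visible from the new observation are passed to
    the recurrent update, keeping per-step cost bounded regardless of
    total sequence length.
    """
    idx = np.sort(np.argsort(visibility)[-keep:])   # top-k, original order
    return tokens[idx], idx

rng = np.random.default_rng(0)
tokens = rng.normal(size=(100, 32))            # 100 scene tokens, dim 32
visibility = rng.random(100)
selected, idx = filter_scene_tokens(tokens, visibility, 10)
```

Because the update only ever attends over the filtered subset, the quadratic cost is in `keep` rather than in the full, ever-growing token count.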
cs.CV / 77 / 2602.20951
See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
发现并修复缺陷:通过主动数据合成使视觉语言模型和扩散模型理解视觉伪影
Abstract
Despite recent advances in diffusion models, AI generated images still often contain visual artifacts that compromise realism. Although more thorough pre-training and bigger models might reduce artifacts, there is no assurance that they can be completely eliminated, which makes artifact mitigation a crucial area of study. Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach to reliably acquire artifact-annotated datasets. In this paper, we propose ArtiAgent, which efficiently creates pairs of real and artifact-injected images. It comprises three agents: a perception agent that recognizes and grounds entities and subentities from real images, a synthesis agent that introduces artifacts via artifact injection tools through novel patch-wise embedding manipulation within a diffusion transformer, and a curation agent that filters the synthesized artifacts and generates both local and global explanations for each instance. Using ArtiAgent, we synthesize 100K images with rich artifact annotations and demonstrate both efficacy and versatility across diverse applications. Code is available at link.
Chinese Translation
尽管扩散模型最近取得了进展,人工智能生成的图像仍然常常包含妨碍真实感的视觉伪影。虽然更全面的预训练和更大的模型可能减少伪影,但并不能保证它们能够完全消除,这使得伪影缓解成为一个极为重要的研究领域。以往的伪影感知方法依赖于人工标注的伪影数据集,这些数据集成本高且难以扩展,突显了自动化方法在可靠获取伪影标注数据集方面的必要性。在本文中,我们提出了ArtiAgent,它高效地创建真实图像与伪影注入图像的配对。该系统由三个代理组成:一个感知代理,它识别并定位真实图像中的实体和子实体;一个合成代理,它通过在扩散变换器中进行新颖的基于补丁的嵌入操作,利用伪影注入工具引入伪影;以及一个策展代理,它过滤合成的伪影,并为每个实例生成局部和全局解释。使用ArtiAgent,我们合成了10万张具有丰富伪影注释的图像,并展示了其在多种应用中的有效性和多样性。代码可在链接中获取。
cs.CV / 78 / 2602.20972
Are Multimodal Large Language Models Good Annotators for Image Tagging?
多模态大型语言模型在图像标注中的表现如何?
Abstract
Image tagging, a fundamental vision task, traditionally relies on human-annotated datasets to train multi-label classifiers, which incurs significant labor and costs. While Multimodal Large Language Models (MLLMs) offer promising potential to automate annotation, their capability to replace human annotators remains underexplored. This paper aims to analyze the gap between MLLM-generated and human annotations and to propose an effective solution that enables MLLM-based annotation to replace manual labeling. Our analysis of MLLM annotations reveals that, under a conservative estimate, MLLMs can reduce annotation cost to as low as one-thousandth of the human cost, mainly accounting for GPU usage, which is nearly negligible compared to manual efforts. Their annotation quality reaches about 50% to 80% of human performance, while achieving over 90% performance on downstream training tasks. Motivated by these findings, we propose TagLLM, a novel framework for image tagging, which aims to narrow the gap between MLLM-generated and human annotations. TagLLM comprises two components: Candidates generation, which employs structured group-wise prompting to efficiently produce a compact candidate set that covers as many true labels as possible while reducing subsequent annotation workload; and label disambiguation, which interactively calibrates the semantic concept of categories in the prompts and effectively refines the candidate labels. Extensive experiments show that TagLLM substantially narrows the gap between MLLM-generated and human annotations, especially in downstream training performance, where it closes about 60% to 80% of the difference.
Chinese Translation
图像标注是一项基础的视觉任务,传统上依赖人工标注的数据集来训练多标签分类器,这导致了显著的劳动和成本支出。尽管多模态大型语言模型(MLLMs)在自动化标注方面展现出良好的潜力,但其替代人工标注者的能力仍未得到充分探讨。本文旨在分析MLLM生成的标注与人工标注之间的差距,并提出一种有效的解决方案,使基于MLLM的标注能够替代手动标记。我们对MLLM标注的分析表明,在保守估计下,MLLM可以将标注成本降低至人工成本的千分之一,主要涉及GPU使用,这与人工工作相比几乎可以忽略不计。它们的标注质量达到人工表现的约50%至80%,同时在下游训练任务中表现超过90%。基于这些发现,我们提出了TagLLM,一个用于图像标注的新框架,旨在缩小MLLM生成标注与人工标注之间的差距。TagLLM由两个部分组成:候选生成,通过结构化的组提示高效生成一个紧凑的候选集,尽可能覆盖更多真实标签,同时减少后续的标注工作量;标签消歧,通过交互式校准提示中的类别语义概念,有效地细化候选标签。大量实验表明,TagLLM显著缩小了MLLM生成标注与人工标注之间的差距,特别是在下游训练性能方面,缩小了约60%至80%的差异。
cs.CV / 79 / 2602.20980
CrystaL: Spontaneous Emergence of Visual Latents in MLLMs
CrystaL:在多模态大语言模型中自发出现的视觉潜变量
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable performance by integrating powerful language backbones with large-scale visual encoders. Among these, latent Chain-of-Thought (CoT) methods enable implicit reasoning in continuous hidden states, facilitating seamless vision-language integration and faster inference. However, existing heuristically predefined supervision signals in latent CoT provide limited guidance for preserving critical visual information in intermediate latent states. To address this limitation, we propose CrystaL (Crystallized Latent Reasoning), a single-stage framework with two paths to process intact and corrupted images, respectively. By explicitly aligning the attention patterns and prediction distributions across the two paths, CrystaL crystallizes latent representations into task-relevant visual semantics, without relying on auxiliary annotations or external modules. Extensive experiments on perception-intensive benchmarks demonstrate that CrystaL consistently outperforms state-of-the-art baselines, achieving substantial gains in fine-grained visual understanding while maintaining robust reasoning capabilities.
Chinese Translation
多模态大语言模型(MLLMs)通过将强大的语言骨干与大规模视觉编码器相结合,取得了显著的性能。在这些模型中,潜在的思维链(Chain-of-Thought,CoT)方法使得在连续的隐藏状态中进行隐式推理成为可能,从而促进了视觉与语言的无缝融合和更快的推理。然而,现有的在潜在 CoT 中启发式预定义的监督信号对在中间潜在状态中保留关键视觉信息的指导有限。为了解决这一限制,我们提出了 CrystaL(结晶化潜在推理),这是一个单阶段框架,具有两个路径,分别处理完整和损坏的图像。通过明确对齐两个路径中的注意力模式和预测分布,CrystaL 将潜在表示结晶为与任务相关的视觉语义,而无需依赖辅助注释或外部模块。在感知密集型基准上的大量实验表明,CrystaL 始终优于最先进的基线,在细粒度视觉理解方面取得了显著提升,同时保持了强大的推理能力。
cs.CV / 80 / 2602.20981
Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models
时间的回声:解锁视频到音频生成模型中的长度泛化
Abstract
Scaling multimodal alignment between video and audio is challenging, particularly due to limited data and the mismatch between text descriptions and frame-level video information. In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. To tackle this challenge, we present a multimodal hierarchical network, MMHNet, an enhanced extension of state-of-the-art video-to-audio models. Our approach integrates a hierarchical method and non-causal Mamba to support long-form audio generation. Our proposed method significantly improves long-form audio generation, extending outputs to more than 5 minutes. We also show that training on short instances and testing on long ones is possible in video-to-audio generation, without training on the longer durations. Our experiments show that the proposed method achieves remarkable results on long video-to-audio benchmarks, beating prior work in video-to-audio tasks. Moreover, we showcase our model's capability to generate more than 5 minutes of audio, where prior video-to-audio methods fall short at such long durations.
Chinese Translation
在视频与音频之间实现多模态对齐的规模化面临挑战,尤其是由于数据有限以及文本描述与帧级视频信息之间的不匹配。在本研究中,我们解决了多模态到音频生成中的规模化挑战,考察了在测试过程中,训练于短实例的模型是否能够泛化到更长的实例。为了解决这一挑战,我们提出了一种多模态层次网络,称为 MMHNet,这是对最先进的视频到音频模型的增强扩展。我们的方法结合了层次化方法和非因果的 Mamba,以支持长格式音频生成。我们提出的方法显著提高了长达超过 5 分钟的音频生成能力。我们还证明了在视频到音频生成任务中,训练短实例并测试长实例是可行的,而无需在更长的持续时间上进行训练。我们的实验表明,我们提出的方法在长视频到音频基准测试中取得了显著的结果,超越了以往的视频到音频任务的研究。此外,我们展示了我们的模型在生成超过 5 分钟音频方面的能力,而以往的视频到音频方法在生成长时间音频时则显得不足。
cs.CV / 81 / 2602.20985
EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer
EW-DETR:通过增量低秩检测变换器演化世界物体检测
Abstract
Real-world object detection must operate in evolving environments where new classes emerge, domains shift, and unseen objects must be identified as "unknown": all without accessing prior data. We introduce Evolving World Object Detection (EWOD), a paradigm coupling incremental learning, domain adaptation, and unknown detection under exemplar-free constraints. To tackle EWOD, we propose EW-DETR framework that augments DETR-based detectors with three synergistic modules: Incremental LoRA Adapters for exemplar-free incremental learning under evolving domains; a Query-Norm Objectness Adapter that decouples objectness-aware features from DETR decoder queries; and Entropy-Aware Unknown Mixing for calibrated unknown detection. This framework generalises across DETR-based detectors, enabling state-of-the-art RF-DETR to operate effectively in evolving-world settings. We also introduce FOGS (Forgetting, Openness, Generalisation Score) to holistically evaluate performance across these dimensions. Extensive experiments on Pascal Series and Diverse Weather benchmarks show EW-DETR outperforms other methods, improving FOGS by 57.24%.
Chinese Translation
现实世界的物体检测必须在不断变化的环境中进行,其中新类别不断出现,领域发生转变,且必须将未见物体识别为“未知”,而无需访问先前的数据。我们提出了演化世界物体检测(Evolving World Object Detection, EWOD)这一范式,将增量学习、领域适应和未知检测结合在无样本约束下。为了解决EWOD问题,我们提出了EW-DETR框架,该框架通过三个协同模块增强基于DETR的检测器:用于在不断变化的领域下进行无样本增量学习的增量低秩适配器(Incremental LoRA Adapters);一个将物体感知特征与DETR解码器查询解耦的查询归一化物体性适配器(Query-Norm Objectness Adapter);以及用于校准未知检测的熵感知未知混合(Entropy-Aware Unknown Mixing)。该框架在基于DETR的检测器中具有广泛的适用性,使得最先进的RF-DETR能够在演化世界环境中有效运行。我们还引入了FOGS(遗忘、开放性、泛化评分)来全面评估在这些维度上的表现。在Pascal系列和多样天气基准上的大量实验表明,EW-DETR的表现优于其他方法,FOGS提升了57.24%。
cs.CV / 82 / 2602.20989
Cycle-Consistent Tuning for Layered Image Decomposition
层次图像分解的循环一致性调优
Abstract
Disentangling visual layers in real-world images is a persistent challenge in vision and graphics, as such layers often involve non-linear and globally coupled interactions, including shading, reflection, and perspective distortion. In this work, we present an in-context image decomposition framework that leverages large diffusion foundation models for layered separation. We focus on the challenging case of logo-object decomposition, where the goal is to disentangle a logo from the surface on which it appears while faithfully preserving both layers. Our method fine-tunes a pretrained diffusion model via lightweight LoRA adaptation and introduces a cycle-consistent tuning strategy that jointly trains decomposition and composition models, enforcing reconstruction consistency between decomposed and recomposed images. This bidirectional supervision substantially enhances robustness in cases where the layers exhibit complex interactions. Furthermore, we introduce a progressive self-improving process, which iteratively augments the training set with high-quality model-generated examples to refine performance. Extensive experiments demonstrate that our approach achieves accurate and coherent decompositions and also generalizes effectively across other decomposition types, suggesting its potential as a unified framework for layered image decomposition.
Chinese Translation
在现实世界图像中解开视觉层次是一项持续的挑战,因为这些层次通常涉及非线性和全局耦合的交互,包括阴影、反射和透视失真。在本研究中,我们提出了一种上下文图像分解框架,利用大型扩散基础模型进行层次分离。我们专注于标志物体分解这一具有挑战性的案例,目标是将标志从其出现的表面中解开,同时忠实地保留两个层次。我们的方法通过轻量级的 LoRA(低秩适应)调整对预训练的扩散模型进行微调,并引入了一种循环一致性调优策略,该策略联合训练分解和合成模型,强制分解图像与重组图像之间的重建一致性。这种双向监督在层次表现出复杂交互的情况下显著增强了鲁棒性。此外,我们引入了一种渐进自我改进过程,迭代地用高质量模型生成的示例增强训练集,以提升性能。大量实验表明,我们的方法实现了准确且连贯的分解,并且在其他分解类型中也能有效泛化,表明其作为层次图像分解统一框架的潜力。
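The cycle-consistent objective above can be summarized in a few lines: a decomposition model splits an image into layers, a composition model recombines them, and training penalizes any mismatch between the input and its recomposed version. The sketch below illustrates only that reconstruction constraint, with plain-Python stand-ins for the two diffusion models (the `decompose`/`compose` callables and the MSE form are illustrative assumptions, not the paper's exact loss):

```python
def mse(a, b):
    """Mean squared error between two equal-length flat pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(image, decompose, compose):
    """Penalize mismatch between an image and its decompose-then-recompose cycle."""
    layers = decompose(image)          # e.g. (logo_layer, surface_layer)
    reconstructed = compose(*layers)
    return mse(image, reconstructed)

# Toy case: additive layers, so composition exactly inverts decomposition.
image = [0.2, 0.5, 0.9, 0.4]
decompose = lambda img: ([0.5 * p for p in img], [0.5 * p for p in img])
compose = lambda logo, surface: [l + s for l, s in zip(logo, surface)]
loss = cycle_consistency_loss(image, decompose, compose)
```

In the toy case the layers are additive, so the cycle closes exactly and the loss is zero; any imperfect decomposition or composition makes it positive, which is the signal the bidirectional supervision exploits.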
cs.CV / 83 / 2602.20999
VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models
VII:用于越狱图像到视频生成模型的视觉指令注入
Abstract
Image-to-Video (I2V) generation models, which condition video generation on reference images, have shown emerging visual instruction-following capability, allowing certain visual cues in reference images to act as implicit control signals for video generation. However, this capability also introduces a previously overlooked risk: adversaries may exploit visual instructions to inject malicious intent through the image modality. In this work, we uncover this risk by proposing Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that intentionally disguises the malicious intent of unsafe text prompts as benign visual instructions in the safe reference image. Specifically, VII coordinates a Malicious Intent Reprogramming module to distill malicious intent from unsafe text prompts while minimizing their static harmfulness, and a Visual Instruction Grounding module to ground the distilled intent onto a safe input image by rendering visual instructions that preserve semantic consistency with the original unsafe text prompt, thereby inducing harmful content during I2V generation. Empirically, our extensive experiments on four state-of-the-art commercial I2V models (Kling-v2.5-turbo, Gemini Veo-3.1, Seedance-1.5-pro, and PixVerse-V5) demonstrate that VII achieves Attack Success Rates of up to 83.5% while reducing Refusal Rates to near zero, significantly outperforming existing baselines.
Chinese Translation
图像到视频(I2V)生成模型通过参考图像进行视频生成,展现出新兴的视觉指令跟随能力,使得参考图像中的某些视觉线索可以作为视频生成的隐性控制信号。然而,这种能力也引入了一个之前被忽视的风险:对手可能利用视觉指令通过图像模态注入恶意意图。在本研究中,我们通过提出视觉指令注入(VII)来揭示这一风险,这是一种无训练且可转移的越狱框架,故意将不安全文本提示的恶意意图伪装成安全参考图像中的良性视觉指令。具体而言,VII 协调一个恶意意图重编程模块,从不安全文本提示中提取恶意意图,同时最小化其静态危害性,并通过一个视觉指令定位模块将提取的意图与安全输入图像结合,渲染与原始不安全文本提示保持语义一致的视觉指令,从而在 I2V 生成过程中诱导有害内容。我们在四个最先进的商业 I2V 模型(Kling-v2.5-turbo、Gemini Veo-3.1、Seedance-1.5-pro 和 PixVerse-V5)上进行的大量实验表明,VII 的攻击成功率高达 83.5%,而拒绝率接近零,显著优于现有基准。
cs.CV / 84 / 2602.21010
Le-DETR: Revisiting Real-Time Detection Transformer with Efficient Encoder Design
Le-DETR:重新审视具有高效编码器设计的实时检测变换器
Abstract
Real-time object detection is crucial for real-world applications as it requires high accuracy with low latency. While Detection Transformers (DETR) have demonstrated significant performance improvements, current real-time DETR models are challenging to reproduce from scratch due to excessive pre-training overheads on the backbone, constraining research advancements by hindering the exploration of novel backbone architectures. In this paper, we show that, with generally good design, it is possible to achieve \textbf{high performance} with \textbf{low pre-training cost}. After a thorough study of the backbone architecture, we propose EfficientNAT at various scales, which incorporates modern efficient convolution and local attention mechanisms. Moreover, we re-design the hybrid encoder with local attention, significantly enhancing both performance and inference speed. Based on these advancements, we present Le-DETR (\textbf{L}ow-cost and \textbf{E}fficient \textbf{DE}tection \textbf{TR}ansformer), which achieves a new \textbf{SOTA} in real-time detection using only the ImageNet1K and COCO2017 training datasets, saving about 80\% of the images in the pre-training stage compared with previous methods. We demonstrate that, when well designed, real-time DETR models can achieve strong performance without the need for complex and computationally expensive pretraining. Extensive experiments show that Le-DETR-M/L/X achieves \textbf{52.9/54.3/55.1 mAP} on COCO Val2017 with \textbf{4.45/5.01/6.68 ms} latency on an RTX4090. It surpasses YOLOv12-L/X by \textbf{+0.6/-0.1 mAP} while achieving similar speed and a \textbf{+20\%} speedup, respectively. Compared with DEIM-D-FINE, Le-DETR-M achieves \textbf{+0.2 mAP} with slightly faster inference, and surpasses DEIM-D-FINE-L by \textbf{+0.4 mAP} with only \textbf{0.4 ms} additional latency. Code and weights will be open-sourced.
Chinese Translation
实时目标检测对于实际应用至关重要,因为它需要高准确性和低延迟。尽管检测变换器(DETR)已显示出显著的性能提升,但当前的实时DETR模型由于在主干网络上的过度预训练开销,难以从头开始复现,这限制了研究进展,阻碍了新型主干架构的探索。在本文中,我们希望展示,通过使用一般良好的设计,可以实现\textbf{高性能}与\textbf{低预训练成本}。经过对主干架构的深入研究,我们提出了在不同尺度下的EfficientNAT,它结合了现代高效卷积和局部注意力机制。此外,我们重新设计了混合编码器,采用局部注意力,显著提升了性能和推理速度。基于这些进展,我们提出了Le-DETR(\textbf{低成本和高效的检测变换器}),它在仅使用ImageNet1K和COCO2017训练数据集的情况下,实现了实时检测的新\textbf{SOTA},与之前的方法相比,预训练阶段节省了约80\%的图像。我们证明了通过良好的设计,实时DETR模型可以在不需要复杂且计算开销大的预训练的情况下实现强大的性能。大量实验表明,Le-DETR-M/L/X在COCO Val2017上分别达到了\textbf{52.9/54.3/55.1 mAP},在RTX4090上推理时间分别为\textbf{4.45/5.01/6.68 ms}。它超越了YOLOv12-L/X,分别提升了\textbf{+0.6/-0.1 mAP},同时实现了相似的速度和\textbf{+20\%}的加速。与DEIM-D-FINE相比,Le-DETR-M在推理速度略快的情况下提升了\textbf{+0.2 mAP},并且仅以\textbf{0.4 ms}的额外延迟超越了DEIM-D-FINE-L,提升了\textbf{+0.4 mAP}。代码和权重将开源。
cs.CV / 85 / 2602.21015
From Perception to Action: An Interactive Benchmark for Vision Reasoning
从感知到行动:视觉推理的互动基准测试
Abstract
Understanding the physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.
Chinese Translation
理解物理结构对于实际应用至关重要,例如具身智能体、互动设计和长时间操作。然而,现有的视觉-语言模型(VLM)评估仍然集中在与结构无关的单轮设置(例如,视觉问答(VQA)),这无法评估智能体推理几何、接触和支撑关系如何共同限制在动态环境中可能采取的行动的能力。为了解决这一问题,我们引入了因果行动与互动层级(Causal Hierarchy of Actions and Interactions,CHAIN)基准,这是一个互动的3D物理驱动测试平台,旨在评估模型是否能够理解、规划和执行基于物理约束的结构化行动序列。CHAIN将评估从被动感知转向主动问题解决,涵盖诸如机械拼图的互锁和3D堆叠与打包等任务。我们在统一的互动设置下对最先进的VLM和基于扩散的模型进行了全面研究。我们的结果表明,表现最好的模型仍然难以内化物理结构和因果约束,常常无法生成可靠的长时间规划,并且无法稳健地将感知到的结构转化为有效的行动。该项目可在 https://social-ai-studio.github.io/CHAIN/ 获取。
cs.CV / 86 / 2602.21033
MIP Candy: A Modular PyTorch Framework for Medical Image Processing
MIP Candy:一个模块化的PyTorch医学图像处理框架
Abstract
Medical image processing demands specialized software that handles high-dimensional volumetric data, heterogeneous file formats, and domain-specific training procedures. Existing frameworks either provide low-level components that require substantial integration effort or impose rigid, monolithic pipelines that resist modification. We present MIP Candy (MIPCandy), a freely available, PyTorch-based framework designed specifically for medical image processing. MIPCandy provides a complete, modular pipeline spanning data loading, training, inference, and evaluation, allowing researchers to obtain a fully functional processing workflow by implementing a single method, $\texttt{build_network}$, while retaining fine-grained control over every component. Central to the design is $\texttt{LayerT}$, a deferred configuration mechanism that enables runtime substitution of convolution, normalization, and activation modules without subclassing. The framework further offers built-in $k$-fold cross-validation, dataset inspection with automatic region-of-interest detection, deep supervision, exponential moving average, multi-frontend experiment tracking (Weights & Biases, Notion, MLflow), training state recovery, and validation score prediction via quotient regression. An extensible bundle ecosystem provides pre-built model implementations that follow a consistent trainer--predictor pattern and integrate with the core framework without modification. MIPCandy is open-source under the Apache-2.0 license and requires Python~3.12 or later. Source code and documentation are available at https://github.com/ProjectNeura/MIPCandy.
Chinese Translation
医学图像处理需要专门的软件来处理高维体积数据、异构文件格式和特定领域的训练程序。现有框架要么提供需要大量集成工作的低级组件,要么强加僵化的整体管道,难以修改。我们提出了MIP Candy(MIPCandy),一个专为医学图像处理设计的免费PyTorch框架。MIPCandy提供了一个完整的模块化管道,涵盖数据加载、训练、推理和评估,使研究人员能够通过实现一个方法$\texttt{build_network}$来获得一个功能完整的过程工作流,同时对每个组件保持细粒度的控制。设计的核心是$\texttt{LayerT}$,一个延迟配置机制,允许在运行时替换卷积、归一化和激活模块,而无需子类化。该框架还提供内置的$k$-折交叉验证、带有自动感兴趣区域检测的数据集检查、深度监督、指数移动平均、多前端实验跟踪(Weights & Biases、Notion、MLflow)、训练状态恢复以及通过商回归进行的验证得分预测。一个可扩展的捆绑生态系统提供遵循一致的训练者-预测者模式的预构建模型实现,并与核心框架无缝集成。MIPCandy在Apache-2.0许可证下开源,要求Python~3.12或更高版本。源代码和文档可在https://github.com/ProjectNeura/MIPCandy获取。
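The abstract's two central hooks, `build_network` and `LayerT`, describe a deferred-configuration pattern: a layer slot records a factory plus its arguments, and the concrete module is only instantiated when the network is built, so convolution, normalization, or activation choices can be swapped at runtime without subclassing. The framework-free sketch below illustrates that pattern under stated assumptions; it is not MIPCandy's actual API, and the stand-in "layers" are plain tuples rather than PyTorch modules:

```python
class LayerT:
    """Deferred layer slot: stores a default factory and keyword arguments,
    instantiating the concrete layer only at build time.
    (Hypothetical sketch of the pattern named in the abstract,
    not the MIPCandy implementation.)"""

    def __init__(self, default_factory, **default_kwargs):
        self.factory = default_factory
        self.kwargs = dict(default_kwargs)

    def override(self, factory=None, **kwargs):
        """Swap the factory and/or its arguments at runtime, no subclassing."""
        if factory is not None:
            self.factory = factory
        self.kwargs.update(kwargs)
        return self

    def build(self):
        return self.factory(**self.kwargs)

# Stand-in layer types (a real framework would use e.g. torch.nn classes here).
def relu(): return ("ReLU",)
def leaky_relu(slope=0.01): return ("LeakyReLU", slope)

def build_network(activation_slot):
    """User-facing hook in the spirit of the abstract's build_network:
    the single method a researcher implements to obtain a full pipeline."""
    return ["conv", "norm", activation_slot.build()]

act = LayerT(relu)
default_net = build_network(act)                                # uses ReLU
swapped_net = build_network(act.override(leaky_relu, slope=0.2))  # swapped at runtime
```

The substitution happens through the slot object rather than through subclassing, which is the property the abstract emphasizes.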
cs.CV / 87 / 2602.21035
Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning
不仅仅是存在的事物:使 CLIP 能够理解否定的视觉描述而无需微调
Abstract
Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching "no dog" with dog images). Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its superiority is especially evident under low-resource conditions, indicating stronger robustness across domains.
Chinese Translation
视觉-语言模型(VLMs)如 CLIP 在理解否定方面存在困难,常常将肯定和否定的嵌入方式相似(例如,将“没有狗”与狗的图像匹配)。现有方法通过微调 CLIP 的文本编码器来改善对否定的理解,但这存在过拟合的风险。在本研究中,我们提出了 CLIPGlasses,这是一种即插即用的框架,增强了 CLIP 理解否定视觉描述的能力。CLIPGlasses 采用双阶段设计:一个 Lens 模块将否定语义从文本嵌入中解耦,另一个 Frame 模块预测上下文感知的排斥强度,该强度被整合到修改后的相似度计算中,以惩罚与否定语义的一致性,从而减少假阳性匹配。实验表明,配备 CLIPGlasses 的 CLIP 在领域内表现出竞争力,并在跨领域泛化方面优于现有的最先进方法。其优势在低资源条件下尤为明显,表明其在不同领域之间具有更强的鲁棒性。
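The mechanism described above, a Lens module that disentangles the negated concept plus a Frame module that predicts a repulsion strength, amounts to subtracting a weighted similarity term. A minimal sketch of that modified similarity, with made-up 3-d embeddings and an assumed linear-penalty form (the paper's exact computation may differ):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def repulsion_similarity(image_emb, text_emb, negated_emb, strength):
    """Score = alignment with the full caption minus a penalty for alignment
    with its negated concept (illustrative form, not CLIPGlasses' exact one)."""
    return cosine(image_emb, text_emb) - strength * cosine(image_emb, negated_emb)

# A dog photo scored against the caption "no dog": the disentangled negated
# concept ("dog") aligns strongly with the image, so the penalty lowers the score.
dog_image = [0.9, 0.1, 0.0]
no_dog_caption = [0.8, 0.2, 0.1]   # naive CLIP embedding, still close to "dog"
dog_concept = [1.0, 0.0, 0.0]      # negated semantics, as a Lens module might extract
plain = cosine(dog_image, no_dog_caption)
penalized = repulsion_similarity(dog_image, no_dog_caption, dog_concept, strength=0.8)
```

With `strength=0` the score reduces to plain CLIP similarity; a positive, context-aware strength pushes down images that align with the negated concept, cutting the false positive matches the abstract describes.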
cs.CV / 88 / 2602.21042
OmniOCR: Generalist OCR for Ethnic Minority Languages
OmniOCR:少数民族语言的通用光学字符识别
Abstract
Optical character recognition (OCR) has advanced rapidly with deep learning and multimodal models, yet most methods focus on well-resourced scripts such as Latin and Chinese. Ethnic minority languages remain underexplored due to complex writing systems, scarce annotations, and diverse historical and modern forms, making generalization in low-resource or zero-shot settings challenging. To address these challenges, we present OmniOCR, a universal framework for ethnic minority scripts. OmniOCR introduces Dynamic Low-Rank Adaptation (Dynamic LoRA) to allocate model capacity across layers and scripts, enabling effective adaptation while preserving knowledge. A sparsity regularization prunes redundant updates, ensuring compact and efficient adaptation without extra inference cost. Evaluations on TibetanMNIST, Shui, ancient Yi, and Dongba show that OmniOCR outperforms zero-shot foundation models and standard post-training, achieving state-of-the-art accuracy with superior parameter efficiency; compared with the state-of-the-art baseline models, it improves accuracy by 39%-66% on these four datasets. Code: https://github.com/AIGeeksGroup/OmniOCR.
Chinese Translation
光学字符识别(OCR)随着深度学习和多模态模型的快速发展而取得了显著进展,但大多数方法仍然集中在资源丰富的书写系统上,如拉丁文和中文。由于复杂的书写系统、稀缺的注释以及多样的历史和现代形式,少数民族语言仍然未得到充分探索,这使得在低资源或零样本设置下的泛化变得具有挑战性。为了解决这些问题,我们提出了OmniOCR,这是一个针对少数民族书写系统的通用框架。OmniOCR引入了动态低秩适应(Dynamic LoRA),以在层和书写系统之间分配模型容量,从而实现有效的适应,同时保持知识的完整性。稀疏性正则化修剪冗余更新,确保紧凑和高效的适应,而无需额外的推理成本。在TibetanMNIST、Shui、古代彝文和东巴文的数据集上的评估表明,OmniOCR在零样本基础模型和标准后训练中表现优越,达到了最先进的准确率,并在参数效率上表现出色,与最先进的基线模型相比,在这四个数据集上提高了39%-66%的准确率。代码链接:https://github.com/AIGeeksGroup/OmniOCR。
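For context on the Low-Rank Adaptation updates that Dynamic LoRA allocates and prunes: a frozen weight W is adapted by adding a scaled low-rank product B·A, and once that product is folded into W there is no extra inference cost, which is the property the abstract relies on. A minimal pure-Python sketch of the merge (the dynamic per-layer, per-script rank allocation itself is not shown):

```python
def matmul(X, Y):
    """Naive matrix product of nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_merge(W, B, A, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A.

    Standard LoRA folding: B is (d_out x rank), A is (rank x d_in),
    so the adapter adds no parameters at inference once merged."""
    scale = alpha / rank
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# 2x2 frozen weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]        # 2 x r with r = 1
A = [[0.5, 0.5]]          # r x 2
W_eff = lora_merge(W, B, A, alpha=2.0, rank=1)
```

Dynamic LoRA, per the abstract, additionally varies the effective rank across layers and scripts and sparsifies redundant updates; the merged form above is why neither change costs anything at inference time.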
cs.CV / 89 / 2602.21053
OCR-Agent: Agentic OCR with Capability and Memory Reflection
OCR-Agent:具备能力和记忆反思的智能OCR
Abstract
Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer quality. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on the English and +1.2 on the Chinese subsets, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5), surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs' reasoning robustness without additional training. Code: https://github.com/AIGeeksGroup/OCR-Agent.
Chinese Translation
大型视觉语言模型(VLMs)通过迭代优化方法在复杂视觉理解任务中展示了显著的潜力。然而,这些模型通常缺乏有效的自我纠正机制,使得它们难以独立纠正认知偏差。因此,在多轮修订过程中,它们常常陷入重复且无效的尝试,未能实现答案质量的稳定提升。为了解决这一问题,我们提出了一种新颖的迭代自我纠正框架,使模型具备两个关键能力:能力反思(Capability Reflection)和记忆反思(Memory Reflection)。该框架指导模型首先通过能力反思诊断错误并生成纠正计划,然后利用记忆反思回顾过去的尝试,以避免重复并探索新解决方案,最后通过严格的重新推理优化答案。在具有挑战性的OCRBench v2基准测试中,实验表明OCR-Agent在英语和中文子集上分别比当前开源的SOTA模型InternVL3-8B提高了+2.0和+1.2,同时在视觉理解(79.9)和推理(66.5)方面取得了最先进的结果,甚至超越了更大规模的微调模型。我们的方法表明,结构化的自我意识反思可以显著增强VLMs的推理鲁棒性,而无需额外的训练。代码:https://github.com/AIGeeksGroup/OCR-Agent。
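The loop the abstract describes is: diagnose and revise (Capability Reflection), consult the history of attempts to avoid repeats (Memory Reflection), and keep a candidate only if re-reasoning actually improves the answer. A skeleton of that control flow, with toy stand-ins for the model calls (the `revise`/`score` callables and the string-repair task are invented for illustration; OCR-Agent implements these steps with VLM prompts):

```python
def self_correct(initial_answer, revise, score, max_rounds=3):
    """Skeleton of the two-reflection loop described in the abstract.

    revise(answer, memory) plays the role of Capability Reflection
    (diagnose errors, propose a corrected answer), while `memory` gives
    Memory Reflection the list of past attempts so repeats are rejected.
    """
    memory = [initial_answer]
    best = initial_answer
    for _ in range(max_rounds):
        candidate = revise(best, memory)
        if candidate in memory:             # Memory Reflection: skip repeats
            continue
        memory.append(candidate)
        if score(candidate) > score(best):  # keep only genuine improvements
            best = candidate
    return best

# Toy task: recover the string "2602" from a misread "Z6O2".
target = "2602"
fixes = {"Z": "2", "O": "0"}
revise = lambda ans, mem: "".join(fixes.get(c, c) for c in ans)
score = lambda ans: sum(a == b for a, b in zip(ans, target))
result = self_correct("Z6O2", revise, score)
```

Once the revision stops producing new candidates, Memory Reflection prevents the repetitive, ineffective attempts the abstract identifies as the failure mode of plain iterative refinement.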
cs.CV / 90 / 2602.21054
VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
VAUQ:面向视觉的自我评估不确定性量化框架
Abstract
Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation methods rely on a model's ability to estimate the correctness of its own outputs, which can improve deployment reliability; however, they depend heavily on language priors and are therefore ill-suited for evaluating vision-conditioned predictions. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures how strongly a model's output depends on visual evidence. VAUQ introduces the Image-Information Score (IS), which captures the reduction in predictive uncertainty attributable to visual input, and an unsupervised core-region masking strategy that amplifies the influence of salient regions. Combining predictive entropy with this core-masked IS yields a training-free scoring function that reliably reflects answer correctness. Comprehensive experiments show that VAUQ consistently outperforms existing self-evaluation methods across multiple datasets.
Chinese Translation
大型视觉语言模型(LVLMs)经常出现幻觉现象,这限制了它们在实际应用中的安全部署。现有的LLM自我评估方法依赖于模型估计自身输出正确性的能力,这可以提高部署的可靠性;然而,这些方法在很大程度上依赖于语言先验,因此不适合评估基于视觉的预测。我们提出了VAUQ,一个面向视觉的不确定性量化框架,用于LVLM自我评估,明确测量模型输出对视觉证据的依赖程度。VAUQ引入了图像信息分数(Image-Information Score, IS),该分数捕捉由于视觉输入而导致的预测不确定性降低,以及一种无监督的核心区域遮罩策略,增强显著区域的影响。将预测熵与这种核心遮罩IS结合,产生了一种无训练的评分函数,可靠地反映答案的正确性。全面的实验表明,VAUQ在多个数据集上始终优于现有的自我评估方法。
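The Image-Information Score is described in the abstract as the reduction in predictive uncertainty attributable to the visual input, i.e. the entropy of the answer distribution without the image minus the entropy with it. A minimal sketch of that quantity (the full VAUQ estimator, including core-region masking and the combination with predictive entropy, is more involved):

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def image_information_score(p_without_image, p_with_image):
    """IS = entropy reduction when the image is added to the context.

    This is the quantity the abstract defines; VAUQ's actual estimator
    additionally amplifies salient regions via core-region masking."""
    return entropy(p_without_image) - entropy(p_with_image)

# A language-prior-only answer distribution is flat; the image sharpens it,
# indicating the output is grounded in visual evidence rather than priors.
p_text_only = [0.25, 0.25, 0.25, 0.25]
p_with_img = [0.85, 0.05, 0.05, 0.05]
is_score = image_information_score(p_text_only, p_with_img)
```

A near-zero IS flags an answer the model would have produced from language priors alone, which is exactly the hallucination-prone case the framework tries to down-score.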
cs.CV / 91 / 2602.21098
Optimizing Occupancy Sensor Placement in Smart Environments
智能环境中占用传感器布置的优化
Abstract
Understanding the locations of occupants in a commercial built environment is critical for realizing energy savings by delivering lighting, heating, and cooling only where it is needed. The key to achieving this goal is being able to recognize zone occupancy in real time, without impeding occupants' activities or compromising privacy. While low-resolution, privacy-preserving time-of-flight (ToF) sensor networks have demonstrated good performance in zone counting, the performance depends on careful sensor placement. To address this issue, we propose an automatic sensor placement method that determines optimal sensor layouts for a given number of sensors, and can predict the counting accuracy of such a layout. In particular, given the geometric constraints of an office environment, we simulate a large number of occupant trajectories. We then formulate the sensor placement problem as an integer linear programming (ILP) problem and solve it with the branch and bound method. We demonstrate the effectiveness of the proposed method based on simulations of several different office environments.
Chinese Translation
了解商业建筑环境中居住者的位置对于实现节能至关重要,因为这可以仅在需要的地方提供照明、供暖和制冷。实现这一目标的关键在于能够实时识别区域的占用情况,而不妨碍居住者的活动或侵犯隐私。尽管低分辨率、保护隐私的飞行时间(ToF)传感器网络在区域计数方面表现良好,但其性能依赖于传感器的精确布置。为了解决这一问题,我们提出了一种自动传感器布置方法,该方法能够为给定数量的传感器确定最佳传感器布局,并预测该布局的计数准确性。特别是,考虑到办公环境的几何约束,我们模拟了大量的居住者轨迹。然后,我们将传感器布置问题表述为整数线性规划(ILP)问题,并使用分支限界法进行求解。我们通过对几个不同办公环境的模拟展示了所提方法的有效性。
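The placement problem reduces to choosing k candidate sensor locations whose combined coverage of simulated occupant trajectories is maximal; the paper encodes this as an ILP and solves it with branch and bound. On a toy instance the same objective can be brute-forced, which is what the sketch below does (the coverage sets are invented, and exhaustive enumeration stands in for the ILP solver):

```python
from itertools import combinations

def best_placement(coverage, k):
    """Pick k candidate sensor locations maximizing covered trajectory points.

    coverage[i] is the set of simulated trajectory samples that candidate
    location i observes. Brute-force enumeration plays the role of the
    branch-and-bound ILP solver here."""
    best_sites, best_covered = None, -1
    for sites in combinations(range(len(coverage)), k):
        covered = len(set().union(*(coverage[i] for i in sites)))
        if covered > best_covered:
            best_sites, best_covered = sites, covered
    return best_sites, best_covered

# Hypothetical coverage sets derived from simulated occupant trajectories.
coverage = [
    {0, 1, 2},        # candidate ceiling position A
    {2, 3},           # B
    {4, 5, 6, 7},     # C
    {1, 2, 3},        # D
]
sites, covered = best_placement(coverage, k=2)
```

For realistic candidate counts the combinatorial search explodes, which is precisely why the paper resorts to an ILP formulation with branch and bound rather than enumeration.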
cs.CV / 92 / 2602.21100
Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction
Skullptor:基于多视角法线预测的高保真3D头部重建,秒级完成
Abstract
Reconstructing high-fidelity 3D head geometry from images is critical for a wide range of applications, yet existing methods face fundamental limitations. Traditional photogrammetry achieves exceptional detail but requires extensive camera arrays (25-200+ views), substantial computation, and manual cleanup in challenging areas like facial hair. Recent alternatives present a fundamental trade-off: foundation models enable efficient single-image reconstruction but lack fine geometric detail, while optimization-based methods achieve higher fidelity but require dense views and expensive computation. We bridge this gap with a hybrid approach that combines the strengths of both paradigms. Our method introduces a multi-view surface normal prediction model that extends monocular foundation models with cross-view attention to produce geometrically consistent normals in a feed-forward pass. We then leverage these predictions as strong geometric priors within an inverse rendering optimization framework to recover high-frequency surface details. Our approach outperforms state-of-the-art single-image and multi-view methods, achieving high-fidelity reconstruction on par with dense-view photogrammetry while reducing camera requirements and computational cost. The code and model will be released.
Chinese Translation
从图像中重建高保真的3D头部几何形状对广泛应用至关重要,但现有方法面临基本限制。传统的摄影测量技术能够实现卓越的细节,但需要大量的相机阵列(25-200+视角)、大量计算以及在面部毛发等复杂区域的手动清理。最近的替代方法呈现出一种基本的权衡:基础模型能够高效地进行单图像重建,但缺乏细致的几何细节,而基于优化的方法则实现了更高的保真度,但需要密集的视角和昂贵的计算。我们通过一种混合方法弥补了这一差距,结合了两种范式的优势。我们的方法引入了一种多视角表面法线预测模型,通过跨视角注意力扩展单目基础模型,以在前馈传递中生成几何一致的法线。然后,我们利用这些预测作为逆渲染优化框架中的强几何先验,以恢复高频表面细节。我们的方法在高保真重建方面超越了最先进的单图像和多视角方法,达到了与密集视角摄影测量相当的重建效果,同时降低了相机需求和计算成本。代码和模型将会发布。
cs.CV / 93 / 2602.21101
Event-Aided Sharp Radiance Field Reconstruction for Fast-Flying Drones
事件辅助的快速飞行无人机锐利辐射场重建
Abstract
Fast-flying aerial robots promise rapid inspection under limited battery constraints, with direct applications in infrastructure inspection, terrain exploration, and search and rescue. However, high speeds lead to severe motion blur in images and induce significant drift and noise in pose estimates, making dense 3D reconstruction with Neural Radiance Fields (NeRFs) particularly challenging due to their high sensitivity to such degradations. In this work, we present a unified framework that leverages asynchronous event streams alongside motion-blurred frames to reconstruct high-fidelity radiance fields from agile drone flights. By embedding event-image fusion into NeRF optimization and jointly refining event-based visual-inertial odometry priors using both event and frame modalities, our method recovers sharp radiance fields and accurate camera trajectories without ground-truth supervision. We validate our approach on both synthetic data and real-world sequences captured by a fast-flying drone. Despite highly dynamic drone flights, where RGB frames are severely degraded by motion blur and pose priors become unreliable, our method reconstructs high-fidelity radiance fields and preserves fine scene details, delivering a performance gain of over 50% on real-world data compared to state-of-the-art methods.
Chinese Translation
快速飞行的空中机器人在有限的电池约束下承诺快速检查,直接应用于基础设施检查、地形探索和搜索救援。然而,高速飞行导致图像严重的运动模糊,并在姿态估计中引入显著的漂移和噪声,使得使用神经辐射场(NeRF)进行密集的三维重建特别具有挑战性,因为它们对这些退化现象高度敏感。在本研究中,我们提出了一个统一框架,利用异步事件流与运动模糊帧相结合,从灵活的无人机飞行中重建高保真度的辐射场。通过将事件图像融合嵌入到NeRF优化中,并使用事件和帧模态共同优化基于事件的视觉惯性里程计先验,我们的方法在没有真实值监督的情况下恢复了锐利的辐射场和准确的相机轨迹。我们在合成数据和由快速飞行无人机捕获的真实世界序列上验证了我们的方法。尽管无人机飞行高度动态,RGB帧受到严重的运动模糊影响,姿态先验变得不可靠,但我们的方法仍能重建高保真度的辐射场并保留细致的场景细节,与最先进的方法相比,在真实世界数据上实现了超过50%的性能提升。
cs.CV / 94 / 2602.21105
BrepGaussian: CAD reconstruction from Multi-View Images with Gaussian Splatting
BrepGaussian:基于多视图图像的高斯溅射CAD重建
Abstract
The boundary representation (B-rep) models a 3D solid as its explicit boundaries: trimmed corners, edges, and faces. Recovering a B-rep from unstructured data is a challenging and valuable task in computer vision and graphics. Recent advances in deep learning have greatly improved the recovery of 3D shape geometry, but they still depend on dense and clean point clouds and struggle to generalize to novel shapes. We propose B-rep Gaussian Splatting (BrepGaussian), a novel framework that learns 3D parametric representations from 2D images. We employ a Gaussian Splatting renderer with learnable features, followed by a dedicated fitting strategy. To disentangle geometry reconstruction from feature learning, we introduce a two-stage learning framework that first captures geometry and edges and then refines patch features to achieve clean geometry and coherent instance representations. Extensive experiments demonstrate the superior performance of our approach over state-of-the-art methods. We will release our code and datasets upon acceptance.
Chinese Translation
边界表示(B-rep)将三维实体建模为其显式边界:修整的角、边和面。从非结构化数据中恢复B-rep表示是计算机视觉和图形学中的一项具有挑战性且重要的任务。近年来,深度学习的进展极大地改善了三维形状几何的恢复,但仍然依赖于密集且干净的点云,并且在新形状上泛化能力不足。我们提出了B-rep高斯溅射(BrepGaussian),这是一个新颖的框架,可以从二维图像中学习三维参数表示。我们采用具有可学习特征的高斯溅射渲染器,并随后应用特定的拟合策略。为了将几何重建与特征学习分离,我们引入了一个两阶段学习框架,首先捕捉几何和边缘,然后细化补丁特征,以实现干净的几何和一致的实例表示。大量实验表明我们的方法在性能上优于现有的最先进方法。我们将在论文接受后发布我们的代码和数据集。
cs.CV / 95 / 2602.21137
UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics
UDVideoQA:用于城市动态中多对象时空推理的交通视频问答数据集
Abstract
Understanding the complex, multi-agent dynamics of urban traffic remains a fundamental challenge for video language models. This paper introduces Urban Dynamics VideoQA, a benchmark dataset that captures the unscripted real-world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. It employs an event-driven dynamic blur technique to ensure privacy preservation without compromising scene fidelity. Using a unified annotation pipeline, the dataset contains 28K question-answer pairs generated across 8 hours of densely annotated video, averaging one question per second. Its taxonomy follows a hierarchy of reasoning levels, spanning from basic understanding and attribution to event reasoning, reverse reasoning, and counterfactual inference, enabling systematic evaluation of both visual grounding and causal reasoning. Comprehensive experiments benchmark 10 SOTA VideoLMs on UDVideoQA and 8 models on a complementary video question generation benchmark. Results reveal a persistent perception-reasoning gap, showing that models that excel in abstract inference often fail at fundamental visual grounding. While models like Gemini Pro achieve the highest zero-shot accuracy, fine-tuning the smaller Qwen2.5-VL 7B model on UDVideoQA bridges this gap, achieving performance comparable to proprietary systems. In VideoQGen, Gemini 2.5 Pro and Qwen3 Max generate the most relevant and complex questions, though all models exhibit limited linguistic diversity, underscoring the need for human-centric evaluation. The UDVideoQA suite, including the dataset, annotation tools, and benchmarks for both VideoQA and VideoQGen, provides a foundation for advancing robust, privacy-aware, and real-world multimodal reasoning. UDVideoQA is available at https://ud-videoqa.github.io/UD-VideoQA/UD-VideoQA/.
Chinese Translation
理解城市交通中复杂的多主体动态仍然是视频语言模型面临的一个基本挑战。本文介绍了城市动态视频问答(Urban Dynamics VideoQA),这是一个基准数据集,捕捉了动态城市场景中未编排的真实行为。UDVideoQA 从16小时的交通录像中整理而成,这些录像在多座城市交叉口记录,涵盖了多样的交通、天气和光照条件。该数据集采用事件驱动的动态模糊技术,以确保隐私保护而不影响场景的真实感。通过统一的标注流程,该数据集包含28,000对问题-答案对,生成于8小时的密集标注视频中,平均每秒一个问题。其分类法遵循层次推理水平,涵盖基本理解和事件推理、逆向推理以及反事实推理,能够系统地评估视觉基础和因果推理。全面的实验对10个最先进(SOTA)的视频语言模型在UDVideoQA上进行了基准测试,并对8个模型在一个补充的视频问题生成基准上进行了评估。结果显示,感知-推理之间存在持续的差距,表明在抽象推理上表现出色的模型在基本视觉基础上往往失败。尽管像Gemini Pro这样的模型在零样本准确率上达到最高,但对较小的Qwen2.5-VL 7B模型进行微调,能够弥补这一差距,达到与专有系统相当的性能。在视频问题生成(VideoQGen)中,Gemini 2.5 Pro和Qwen3 Max生成了最相关和复杂的问题,尽管所有模型的语言多样性有限,强调了人本评估的必要性。UDVideoQA套件,包括数据集、标注工具以及视频问答和视频问题生成的基准,为推进稳健、注重隐私和真实世界的多模态推理提供了基础。UDVideoQA可在 https://ud-videoqa.github.io/UD-VideoQA/UD-VideoQA/ 获取。
cs.CV / 96 / 2602.21141
SynthRender and IRIS: Open-Source Framework and Dataset for Bidirectional Sim-Real Transfer in Industrial Object Perception
SynthRender 和 IRIS:用于工业物体感知的双向真实-合成转移的开源框架和数据集
Abstract
Object perception is fundamental for tasks such as robotic material handling and quality inspection. However, modern supervised deep-learning perception models require large datasets for robust automation under semi-uncontrolled conditions. The cost of acquiring and annotating such data for proprietary parts is a major barrier for widespread deployment. In this context, we release SynthRender, an open source framework for synthetic image generation with Guided Domain Randomization capabilities. Furthermore, we benchmark recent Reality-to-Simulation techniques for 3D asset creation from 2D images of real parts. Combined with Domain Randomization, these synthetic assets provide low-overhead, transferable data even for parts lacking 3D files. We also introduce IRIS, the Industrial Real-Sim Imagery Set, containing 32 categories with diverse textures, intra-class variation, strong inter-class similarities and about 20,000 labels. Ablations on multiple benchmarks outline guidelines for efficient data generation with SynthRender. Our method surpasses existing approaches, achieving 99.1% mAP@50 on a public robotics dataset, 98.3% mAP@50 on an automotive benchmark, and 95.3% mAP@50 on IRIS.
Chinese Translation
物体感知是机器人物料处理和质量检查等任务的基础。然而,现代监督深度学习感知模型在半控制条件下需要大量数据集以实现稳健的自动化。获取和标注专有部件的数据的成本是广泛部署的主要障碍。在此背景下,我们发布了 SynthRender,这是一个具有引导领域随机化能力的合成图像生成开源框架。此外,我们对最近的现实到仿真技术进行了基准测试,以从真实部件的二维图像创建三维资产。结合领域随机化,这些合成资产即使对于缺乏三维文件的部件也能提供低开销、可转移的数据。我们还介绍了 IRIS,即工业真实-仿真图像集,包含32个类别,具有多样的纹理、类内变化、强类间相似性和约20,000个标签。在多个基准上的消融实验概述了使用 SynthRender 进行高效数据生成的指导方针。我们的方法超越了现有方法,在一个公共机器人数据集上达到了 99.1% mAP@50,在一个汽车基准上达到了 98.3% mAP@50,并在 IRIS 上达到了 95.3% mAP@50。
cs.CV / 97 / 2602.21142
LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis
LUMEN:用于预后和诊断的纵向多模态放射学模型
Abstract
Large vision-language models (VLMs) have evolved from general-purpose applications to specialized use cases such as the clinical domain, demonstrating potential for decision support in radiology. One promising application is assisting radiologists in decision-making through the analysis of radiology imaging data, such as chest X-rays (CXR), via a visual and natural language question-answering (VQA) interface. When longitudinal imaging is available, radiologists analyze temporal changes, which are essential for accurate diagnosis and prognosis. Manual longitudinal analysis is time-consuming, motivating the development of a training framework that can provide prognostic capabilities. We introduce LUMEN, a novel training framework optimized for longitudinal CXR interpretation that leverages multi-image and multi-task instruction fine-tuning to enhance prognostic and diagnostic performance. We conduct experiments on the publicly available MIMIC-CXR and its associated Medical-Diff-VQA datasets. We further formulate and construct a novel instruction-following dataset incorporating longitudinal studies, enabling the development of a prognostic VQA task. Our method demonstrates significant improvements over baseline models on diagnostic VQA tasks and, more importantly, shows promising potential for prognostic capabilities. These results underscore the value of well-designed, instruction-tuned VLMs in enabling more accurate and clinically meaningful interpretation of longitudinal radiological imaging data.
Chinese Translation
大型视觉语言模型(VLMs)已从通用应用演变为专门的使用案例,例如在临床领域,显示出在放射学中支持决策的潜力。一项有前景的应用是通过视觉和自然语言问答(VQA)接口,分析放射学影像数据(如胸部X光片(CXR))来协助放射科医生进行决策。当纵向影像可用时,放射科医生会分析时间变化,这对于准确的诊断和预后至关重要。手动进行纵向分析是一个耗时的过程,这激励了一个能够提供预后能力的训练框架的开发。我们介绍了一种新颖的训练框架LUMEN,针对纵向CXR解读进行了优化,利用多图像和多任务指令微调来增强预后和诊断性能。我们在公开可用的MIMIC-CXR及其相关的Medical-Diff-VQA数据集上进行了实验。我们进一步制定并构建了一个新颖的遵循指令的数据集,结合了纵向研究,从而实现了预后VQA任务的发展。我们的方法在诊断VQA任务上显示出显著优于基线模型的改进,更重要的是,展示了良好的预后能力潜力。这些结果强调了精心设计的、经过指令微调的VLMs在实现更准确和具有临床意义的纵向放射影像数据解读中的价值。
cs.CV / 98 / 2602.21153
SPRITETOMESH: Automatic Mesh Generation for 2D Skeletal Animation Using Learned Segmentation and Contour-Aware Vertex Placement
SPRITETOMESH:基于学习分割和轮廓感知顶点放置的2D骨骼动画自动网格生成
Abstract
We present SPRITETOMESH, a fully automatic pipeline for converting 2D game sprite images into triangle meshes compatible with skeletal animation frameworks such as Spine2D. Creating animation-ready meshes is traditionally a tedious manual process requiring artists to carefully place vertices along visual boundaries, a task that typically takes 15-60 minutes per sprite. Our method addresses this through a hybrid learned-algorithmic approach. A segmentation network (EfficientNet-B0 encoder with U-Net decoder) trained on over 100,000 sprite-mask pairs from 172 games achieves an IoU of 0.87, providing accurate binary masks from arbitrary input images. From these masks, we extract exterior contour vertices using Douglas-Peucker simplification with adaptive arc subdivision, and interior vertices along visual boundaries detected via bilateral-filtered multi-channel Canny edge detection with contour-following placement. Delaunay triangulation with mask-based centroid filtering produces the final mesh. Through controlled experiments, we demonstrate that direct vertex position prediction via neural network heatmap regression is fundamentally not viable for this task: the heatmap decoder consistently fails to converge (loss plateau at 0.061) while the segmentation decoder trains normally under identical conditions. We attribute this to the inherently artistic nature of vertex placement: the same sprite can be meshed validly in many different ways. This negative result validates our hybrid design: learned segmentation where ground truth is unambiguous, algorithmic placement where domain heuristics are appropriate. The complete pipeline processes a sprite in under 3 seconds, representing a speedup of 300x-1200x over manual creation. We release our trained model to the game development community.
Chinese Translation
我们提出了SPRITETOMESH,这是一个完全自动化的流程,用于将2D游戏精灵图像转换为与骨骼动画框架(如Spine2D)兼容的三角形网格。创建适合动画的网格传统上是一个繁琐的手动过程,艺术家需要仔细沿着视觉边界放置顶点,这一任务通常需要每个精灵15到60分钟。我们的方法通过混合学习-算法的方法来解决这个问题。一个在172款游戏的超过100,000对精灵-掩膜样本上训练的分割网络(EfficientNet-B0编码器与U-Net解码器)达到了0.87的交并比(IoU),能够从任意输入图像中提供准确的二进制掩膜。我们从这些掩膜中提取外部轮廓顶点,使用Douglas-Peucker简化与自适应弧细分,并沿着通过双边滤波的多通道Canny边缘检测检测到的视觉边界放置内部顶点。基于掩膜的质心过滤的Delaunay三角剖分生成最终网格。通过控制实验,我们证明了通过神经网络热图回归直接预测顶点位置在此任务中根本不可行:热图解码器在相同条件下始终无法收敛(损失停滞在0.061),而分割解码器则正常训练。我们将此归因于顶点放置的艺术性本质——同一个精灵可以以多种不同的方式有效地生成网格。这个负面结果验证了我们的混合设计:在真实值明确的情况下使用学习分割,在领域启发式适用的情况下使用算法放置。完整的流程在3秒内处理一个精灵,相比手动创建速度提升了300倍到1200倍。我们将发布我们的训练模型以供游戏开发社区使用。
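The geometric stage of the pipeline above (contour simplification, Delaunay triangulation, mask-based centroid filtering) can be sketched in a few lines. This is a minimal reconstruction assuming a binary mask and an ordered exterior contour; it omits the adaptive arc subdivision and interior-vertex placement, and is not the authors' released code:

```python
import numpy as np
from scipy.spatial import Delaunay

def rdp(points, eps):
    """Douglas-Peucker simplification of an ordered 2D polyline."""
    if len(points) < 3:
        return points
    (sx, sy), (ex, ey) = points[0], points[-1]
    dx, dy = ex - sx, ey - sy
    # Perpendicular distance of each point to the chord start -> end.
    dist = np.abs(dx * (points[:, 1] - sy) - dy * (points[:, 0] - sx))
    dist /= np.hypot(dx, dy) + 1e-12
    i = int(np.argmax(dist))
    if dist[i] > eps:
        left = rdp(points[: i + 1], eps)
        right = rdp(points[i:], eps)
        return np.vstack([left[:-1], right])
    return np.vstack([points[0], points[-1]])

def mesh_from_contour(contour, mask, eps=1.0):
    """Simplify the contour, triangulate, and keep only triangles whose
    centroid lies inside the mask (the centroid-filtering step)."""
    verts = rdp(np.asarray(contour, dtype=float), eps)
    tri = Delaunay(verts)
    centroids = verts[tri.simplices].mean(axis=1)
    cx = np.clip(np.round(centroids[:, 0]).astype(int), 0, mask.shape[1] - 1)
    cy = np.clip(np.round(centroids[:, 1]).astype(int), 0, mask.shape[0] - 1)
    faces = tri.simplices[mask[cy, cx] > 0]
    return verts, faces
```

Centroid filtering is what removes triangles that Delaunay places across concavities outside the sprite silhouette.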
cs.CV / 99 / 2602.21175
Seeing Through Words: Controlling Visual Retrieval Quality with Language Models
透过文字看世界:利用语言模型控制视觉检索质量
Abstract
Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. Such queries are typically only one or two words long, rendering them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms that capture fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a general framework that conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, so that query enrichment is not only semantically meaningful but also quality-aware. The resulting system provides three key advantages: 1) flexibility, it is compatible with any pretrained vision-language model (VLMs) without modification; 2) transparency, enriched queries are explicitly interpretable by users; and 3) controllability, enabling retrieval results to be steered toward user-preferred quality levels. Extensive experiments demonstrate that our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries. Our code is available at https://github.com/Jianglin954/QCQC.
Chinese Translation
文本到图像的检索是视觉-语言学习中的一项基础任务,但在现实场景中,它常常受到简短且不明确的用户查询的挑战。这类查询通常只有一到两个词,导致其语义模糊,容易在多样的视觉解释中产生冲突,并且缺乏对检索图像质量的明确控制。为了解决这些问题,我们提出了一种新的质量可控检索范式,该范式通过上下文细节丰富简短查询,同时融入对图像质量的明确概念。我们的关键思想是利用生成语言模型作为查询补全函数,将不明确的查询扩展为描述性形式,以捕捉细粒度的视觉属性,如姿势、场景和美学。我们引入了一个通用框架,该框架将查询补全条件化于离散化的质量水平,这些质量水平源自相关性和美学评分模型,从而使查询丰富不仅在语义上有意义,而且在质量上也具有意识。最终的系统提供了三个主要优势:1)灵活性,它与任何预训练的视觉-语言模型(VLMs)兼容,无需修改;2)透明性,丰富的查询对用户是明确可解释的;3)可控性,使检索结果能够朝向用户偏好的质量水平引导。大量实验表明,我们提出的方法显著改善了检索结果,并提供了有效的质量控制,弥合了现代VLMs的表达能力与简短用户查询的不明确性之间的差距。我们的代码可在 https://github.com/Jianglin954/QCQC 获取。
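The "discretized quality levels, derived from relevance and aesthetic scoring models" can be read as simple quantile bucketing of continuous scores. A sketch under that assumption (the paper's actual discretization scheme may differ):

```python
import numpy as np

def quality_buckets(scores, n_levels=3):
    """Discretize continuous quality scores into n_levels ordinal buckets
    using quantile edges, so each level is roughly equally populated."""
    scores = np.asarray(scores, dtype=float)
    edges = np.quantile(scores, np.linspace(0, 1, n_levels + 1)[1:-1])
    return np.digitize(scores, edges)  # 0 = lowest quality level

levels = quality_buckets(np.arange(9) / 8.0)
```

Each bucket index can then condition the language model's query completion, e.g. by prepending a level token to the prompt.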
cs.CV / 100 / 2602.21178
XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence
XMorph:基于大型语言模型辅助的可解释脑肿瘤分析的混合深度智能
Abstract
Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints. Conventional models often act as opaque "black boxes" and fail to quantify the complex, irregular tumor boundaries that characterize malignant growth. To address these challenges, we present XMorph, an explainable and computationally efficient framework for fine-grained classification of three prominent brain tumor types: glioma, meningioma, and pituitary tumors. We propose an Information-Weighted Boundary Normalization (IWBN) mechanism that emphasizes diagnostically relevant boundary regions alongside nonlinear chaotic and clinically validated features, enabling a richer morphological representation of tumor growth. A dual-channel explainable AI module combines GradCAM++ visual cues with LLM-generated textual rationales, translating model reasoning into clinically interpretable insights. The proposed framework achieves a classification accuracy of 96.0%, demonstrating that explainability and high performance can co-exist in AI-based medical imaging systems. The source code and materials for XMorph are all publicly available at: https://github.com/ALSER-Lab/XMorph.
Chinese Translation
深度学习在自动化脑肿瘤诊断方面取得了显著进展,但临床应用仍受到可解释性和计算限制的制约。传统模型往往作为不透明的“黑箱”运作,无法量化恶性生长特征的复杂、不规则肿瘤边界。为了解决这些挑战,我们提出了XMorph,一个可解释且计算高效的框架,用于对三种主要脑肿瘤类型(胶质瘤、脑膜瘤和垂体肿瘤)进行细粒度分类。我们提出了一种信息加权边界归一化(Information-Weighted Boundary Normalization, IWBN)机制,强调与诊断相关的边界区域,以及非线性混沌和临床验证特征,从而实现肿瘤生长的更丰富形态表示。一个双通道可解释人工智能模块将GradCAM++视觉线索与大型语言模型生成的文本推理相结合,将模型推理转化为临床可解释的见解。所提出的框架实现了96.0%的分类准确率,证明了可解释性和高性能可以在基于人工智能的医学影像系统中共存。XMorph的源代码和材料均可在以下网址公开获取:https://github.com/ALSER-Lab/XMorph。
cs.CV / 101 / 2602.21179
Mask-HybridGNet: Graph-based segmentation with emergent anatomical correspondence from pixel-level supervision
Mask-HybridGNet:基于图的分割与来自像素级监督的突现解剖对应
Abstract
Graph-based medical image segmentation represents anatomical structures using boundary graphs, providing fixed-topology landmarks and inherent population-level correspondences. However, their clinical adoption has been hindered by a major requirement: training datasets with manually annotated landmarks that maintain point-to-point correspondences across patients rarely exist in practice. We introduce Mask-HybridGNet, a framework that trains graph-based models directly using standard pixel-wise masks, eliminating the need for manual landmark annotations. Our approach aligns variable-length ground truth boundaries with fixed-length landmark predictions by combining Chamfer distance supervision and edge-based regularization to ensure local smoothness and regular landmark distribution, further refined via differentiable rasterization. A significant emergent property of this framework is that predicted landmark positions become consistently associated with specific anatomical locations across patients without explicit correspondence supervision. This implicit atlas learning enables temporal tracking, cross-slice reconstruction, and morphological population analyses. Beyond direct segmentation, Mask-HybridGNet can extract correspondences from existing segmentation masks, allowing it to generate stable anatomical atlases from any high-quality pixel-based model. Experiments across chest radiography, cardiac ultrasound, cardiac MRI, and fetal imaging demonstrate that our model achieves competitive results against state-of-the-art pixel-based methods, while ensuring anatomical plausibility by enforcing boundary connectivity through a fixed graph adjacency matrix. This framework leverages the vast availability of standard segmentation masks to build structured models that maintain topological integrity and provide implicit correspondences.
Chinese Translation
基于图的医学图像分割通过边界图表示解剖结构,提供固定拓扑的标志物和固有的人群级对应关系。然而,它们在临床应用中的推广受到一个主要要求的制约:在实践中,保持患者间点对点对应关系的手动标志物注释的训练数据集极为稀缺。我们提出了Mask-HybridGNet,一个直接使用标准像素级掩码训练基于图的模型的框架,从而消除了对手动标志物注释的需求。我们的方法通过结合Chamfer距离监督和基于边缘的正则化,将可变长度的真实边界与固定长度的标志物预测对齐,以确保局部平滑性和规则的标志物分布,并通过可微分光栅化进一步优化。该框架的一个显著突现特性是,预测的标志物位置在患者间与特定解剖位置的一致性关联,而无需显式的对应监督。这种隐式的图谱学习使得时间跟踪、跨切片重建和形态人群分析成为可能。除了直接分割外,Mask-HybridGNet还可以从现有的分割掩码中提取对应关系,使其能够从任何高质量的基于像素的模型生成稳定的解剖图谱。在胸部X光、心脏超声、心脏MRI和胎儿成像的实验中,我们的模型在与最先进的基于像素的方法相比时,取得了具有竞争力的结果,同时通过强制边界连通性来确保解剖的合理性,使用固定的图邻接矩阵。该框架利用标准分割掩码的广泛可用性,构建保持拓扑完整性并提供隐式对应关系的结构化模型。
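The core alignment problem above, matching a fixed-length landmark prediction to a variable-length ground-truth boundary, can be sketched with a symmetric Chamfer term plus an edge-length regularizer that encourages evenly spaced landmarks. This is a NumPy illustration of the idea, not the paper's differentiable implementation:

```python
import numpy as np

def chamfer(pred, gt):
    """Symmetric Chamfer distance between two 2D point sets of
    possibly different sizes (no point-to-point correspondence needed)."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def edge_regularizer(pred):
    """Variance of consecutive edge lengths along a closed landmark
    contour; zero when landmarks are evenly spaced."""
    lengths = np.linalg.norm(np.roll(pred, -1, axis=0) - pred, axis=1)
    return lengths.var()
```

Because Chamfer matching imposes no ordering, the edge regularizer is what keeps the fixed-length prediction distributed regularly along the boundary rather than collapsing onto a few ground-truth points.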
cs.CV / 102 / 2602.21186
Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning
Spa3R:用于三维视觉推理的预测空间场建模
Abstract
While Vision-Language Models (VLMs) exhibit exceptional 2D visual understanding, their ability to comprehend and reason about 3D space--a cornerstone of spatial intelligence--remains superficial. Current methodologies attempt to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstructing holistic 3D geometry from sparse cues. In this paper, we argue that spatial intelligence can emerge inherently from 2D vision alone, rather than being imposed via explicit spatial instruction tuning. To this end, we introduce Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images. Spa3R is built upon the proposed Predictive Spatial Field Modeling (PSFM) paradigm, where Spa3R learns to synthesize feature fields for arbitrary unseen views conditioned on a compact latent representation, thereby internalizing a holistic and coherent understanding of the underlying 3D scene. We further integrate the pre-trained Spa3R Encoder into existing VLMs via a lightweight adapter to form Spa3-VLM, effectively grounding language reasoning in a global spatial context. Experiments on the challenging VSI-Bench demonstrate that Spa3-VLM achieves state-of-the-art accuracy of 58.6% on 3D VQA, significantly outperforming prior methods. These results highlight PSFM as a scalable path toward advancing spatial intelligence. Code is available at https://github.com/hustvl/Spa3R.
Chinese Translation
尽管视觉-语言模型(VLMs)在二维视觉理解方面表现出色,但它们对三维空间的理解和推理能力——空间智能的基石——仍然较为肤浅。目前的方法试图通过依赖显式的三维模态或通过用部分的、视角条件的几何先验增强VLMs来弥补这一领域的差距。然而,这些方法限制了可扩展性,并最终使语言模型承担从稀疏线索中隐式重建整体三维几何的难题。在本文中,我们认为空间智能可以仅通过二维视觉自然而然地出现,而不是通过显式的空间指令调优来强加。为此,我们提出了Spa3R,一个自监督框架,直接从未定姿态的多视图图像中学习统一的、视角不变的空间表示。Spa3R建立在所提出的预测空间场建模(Predictive Spatial Field Modeling, PSFM)范式之上,其中Spa3R学习合成任意未见视角的特征场,并以紧凑的潜在表示为条件,从而内化对基础三维场景的整体和一致的理解。我们进一步通过轻量级适配器将预训练的Spa3R编码器集成到现有的VLMs中,形成Spa3-VLM,有效地将语言推理扎根于全球空间背景中。在具有挑战性的VSI-Bench上的实验表明,Spa3-VLM在三维视觉问答(3D VQA)任务上达到了58.6%的最新准确率,显著优于之前的方法。这些结果突显了PSFM作为推动空间智能发展的可扩展路径。代码可在 https://github.com/hustvl/Spa3R 获取。
cs.CV / 103 / 2602.21188
Human Video Generation from a Single Image with 3D Pose and View Control
基于单幅图像的人类视频生成:3D姿态与视角控制
Abstract
Recent diffusion methods have made significant progress in generating videos from single images due to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii) Progressive Spatio-Temporal Sampling with temporal alignment to maintain smooth transitions in long multi-view animations. Extensive experiments on image-to-video tasks demonstrate that HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.
Chinese Translation
近年来,扩散方法在从单幅图像生成视频方面取得了显著进展,得益于其强大的视觉生成能力。然而,在图像到视频合成中,尤其是人类视频生成方面,仍然存在挑战,尤其是从单幅图像推断视角一致、运动依赖的衣物褶皱仍然是一个艰巨的问题。本文提出了4D人类视频生成模型(Human Video Generation in 4D, HVG),这是一个潜在的视频扩散模型,能够从单幅图像生成高质量、多视角、时空一致的人类视频,并具备3D姿态和视角控制。HVG通过三个关键设计实现这一目标:(i) 关节姿态调制(Articulated Pose Modulation),通过新颖的双维骨架图捕捉3D关节的解剖关系,并通过引入3D信息解决视角间的自遮挡问题;(ii) 视角与时间对齐(View and Temporal Alignment),确保参考图像与姿态序列之间的多视角一致性和帧间稳定性;(iii) 逐步时空采样(Progressive Spatio-Temporal Sampling),结合时间对齐以保持长时间多视角动画中的平滑过渡。在图像到视频任务上的大量实验表明,HVG在从多样的人类图像和姿态输入生成高质量4D人类视频方面优于现有方法。
cs.CV / 104 / 2602.21195
Region of Interest Segmentation and Morphological Analysis for Membranes in Cryo-Electron Tomography
冷冻电子断层成像中膜的感兴趣区域分割与形态分析
Abstract
Cryo-electron tomography (cryo-ET) enables high resolution, three-dimensional reconstruction of biological structures, including membranes and membrane proteins. Identification of regions of interest (ROIs) is central to scientific imaging, as it enables isolation and quantitative analysis of specific structural features within complex datasets. In practice, however, ROIs are typically derived indirectly through full structure segmentation followed by post hoc analysis. This limitation is especially apparent for continuous and geometrically complex structures such as membranes, which are segmented as single entities. Here, we developed TomoROIS-SurfORA, a two-step framework for direct, shape-agnostic ROI segmentation and morphological surface analysis. TomoROIS performs deep learning-based ROI segmentation and can be trained from scratch using small annotated datasets, enabling practical application across diverse imaging data. SurfORA processes segmented structures as point clouds and surface meshes to extract quantitative morphological features, including inter-membrane distances, curvature, and surface roughness. It supports both closed and open surfaces, with specific considerations for open surfaces, which are common in cryo-ET due to the missing wedge effect. We demonstrate both tools using in vitro reconstituted membrane systems containing deformable vesicles with complex geometries, enabling automatic quantitative analysis of membrane contact sites and remodeling events such as invagination. While demonstrated here on cryo-ET membrane data, the combined approach is applicable to ROI detection and surface analysis in broader scientific imaging contexts.
Chinese Translation
冷冻电子断层成像(cryo-ET)能够高分辨率地三维重建生物结构,包括膜和膜蛋白。感兴趣区域(ROIs)的识别是科学成像的核心,因为它能够在复杂数据集中隔离和定量分析特定的结构特征。然而,在实际操作中,ROIs通常是通过完整结构分割后进行事后分析间接获得的。这一限制在连续且几何复杂的结构(如膜)中尤为明显,因为这些结构通常被分割为单一实体。在此,我们开发了TomoROIS-SurfORA,一个用于直接、形状无关的ROI分割和形态表面分析的两步框架。TomoROIS执行基于深度学习的ROI分割,并可以使用小型标注数据集从头开始训练,使其能够在多样的成像数据中得到实际应用。SurfORA将分割后的结构处理为点云和表面网格,以提取定量形态特征,包括膜间距、曲率和表面粗糙度。它支持闭合和开放表面,并针对开放表面(由于缺失楔效应在cryo-ET中常见)进行了特定考虑。我们使用包含具有复杂几何形状的可变形囊泡的体外重构膜系统演示了这两个工具,使得膜接触位点和重塑事件(如内陷)的自动定量分析成为可能。尽管在这里展示的是cryo-ET膜数据,但该组合方法适用于更广泛的科学成像背景中的ROI检测和表面分析。
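One of the morphological features listed above, inter-membrane distance, reduces to nearest-neighbor queries between two point clouds once the membranes are segmented. A minimal sketch with a k-d tree (our simplification; SurfORA's actual computation may operate on surface meshes rather than raw points):

```python
import numpy as np
from scipy.spatial import cKDTree

def inter_membrane_distances(points_a, points_b):
    """For every point on membrane A, the Euclidean distance to the
    nearest point on membrane B (a point-cloud proxy for spacing)."""
    dist, _ = cKDTree(points_b).query(points_a)
    return dist
```

Summarizing the returned distances (mean, histogram) then characterizes the spacing at a membrane contact site.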
cs.AI / 1 / 2602.20303
Multilevel Determinants of Overweight and Obesity Among U.S. Children Aged 10-17: Comparative Evaluation of Statistical and Machine Learning Approaches Using the 2021 National Survey of Children's Health
美国10-17岁儿童超重和肥胖的多层次决定因素:基于2021年国家儿童健康调查的统计与机器学习方法的比较评估
Abstract
Background: Childhood and adolescent overweight and obesity remain major public health concerns in the United States and are shaped by behavioral, household, and community factors. Their joint predictive structure at the population level remains incompletely characterized. Objectives: The study aims to identify multilevel predictors of overweight and obesity among U.S. adolescents and compare the predictive performance, calibration, and subgroup equity of statistical, machine-learning, and deep-learning models. Data and Methods: We analyze 18,792 children aged 10-17 years from the 2021 National Survey of Children's Health. Overweight/obesity is defined using BMI categories. Predictors include diet, physical activity, sleep, parental stress, socioeconomic conditions, adverse experiences, and neighborhood characteristics. Models include logistic regression, random forest, gradient boosting, XGBoost, LightGBM, multilayer perceptron, and TabNet. Performance is evaluated using AUC, accuracy, precision, recall, F1 score, and Brier score. Results: Discrimination ranged from 0.66 to 0.79. Logistic regression, gradient boosting, and MLP showed the most stable balance of discrimination and calibration. Boosting and deep learning modestly improved recall and F1 score. No model was uniformly superior. Performance disparities across race and poverty groups persisted across algorithms. Conclusion: Increased model complexity yields limited gains over logistic regression. Predictors consistently span behavioral, household, and neighborhood domains. Persistent subgroup disparities indicate the need for improved data quality and equity-focused surveillance rather than greater algorithmic complexity.
Chinese Translation
背景:儿童和青少年的超重与肥胖在美国仍然是主要的公共卫生问题,受到行为、家庭和社区因素的影响。其在群体层面的联合预测结构尚未完全表征。目的:本研究旨在识别美国青少年超重和肥胖的多层次预测因素,并比较统计模型、机器学习模型和深度学习模型的预测性能、校准和亚组公平性。数据与方法:我们分析了来自2021年国家儿童健康调查的18,792名10-17岁儿童的数据。超重/肥胖通过BMI类别进行定义。预测因素包括饮食、身体活动、睡眠、父母压力、社会经济状况、不良经历和邻里特征。模型包括逻辑回归、随机森林、梯度提升、XGBoost、LightGBM、多层感知器和TabNet。使用AUC、准确率、精确率、召回率、F1分数和Brier分数评估性能。结果:模型的区分度范围为0.66到0.79。逻辑回归、梯度提升和多层感知器显示出最稳定的区分和校准平衡。提升和深度学习在召回率和F1分数上有适度改善。没有模型在所有方面均优于其他模型。不同种族和贫困群体之间的性能差异在各算法中仍然存在。结论:模型复杂性的增加在逻辑回归之上带来的收益有限。预测因素始终涵盖行为、家庭和邻里领域。持续存在的亚组差异表明需要改善数据质量和以公平为中心的监测,而不是增加算法复杂性。
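The two headline metrics used above, discrimination (AUC) and calibration (Brier score), are straightforward to compute directly. A minimal NumPy version (the rank-based AUC below assumes no tied scores):

```python
import numpy as np

def brier_score(y_true, p):
    """Mean squared error between predicted probabilities and binary outcomes."""
    y_true, p = np.asarray(y_true, dtype=float), np.asarray(p, dtype=float)
    return float(np.mean((p - y_true) ** 2))

def auc(y_true, p):
    """Rank-based AUC via the Mann-Whitney statistic (assumes no ties)."""
    y_true, p = np.asarray(y_true), np.asarray(p, dtype=float)
    order = np.argsort(p)
    ranks = np.empty(len(p))
    ranks[order] = np.arange(1, len(p) + 1)
    pos = y_true == 1
    n1, n0 = pos.sum(), (~pos).sum()
    return float((ranks[pos].sum() - n1 * (n1 + 1) / 2) / (n1 * n0))
```

AUC measures only ranking, so two models with equal AUC can have very different Brier scores; that gap is exactly the discrimination-versus-calibration trade-off the study reports.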
cs.AI / 2 / 2602.20324
An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models
基于大语言模型的临床笔记端到端罕见疾病表型构建的人工智能框架
Abstract
Phenotyping is fundamental to rare disease diagnosis, but manual curation of structured phenotypes from clinical notes is labor-intensive and difficult to scale. Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype Ontology (HPO) terms, and prioritizing diagnostically informative HPO terms. We developed RARE-PHENIX, an end-to-end AI framework for rare disease phenotyping that integrates large language model-based phenotype extraction, ontology-grounded standardization to HPO terms, and supervised ranking of diagnostically informative phenotypes. We trained RARE-PHENIX using data from 2,671 patients across 11 Undiagnosed Diseases Network clinical sites, and externally validated it on 16,357 real-world clinical notes from Vanderbilt University Medical Center. Using clinician-curated HPO terms as the gold standard, RARE-PHENIX consistently outperformed a state-of-the-art deep learning baseline (PhenoBERT) across ontology-based similarity and precision-recall-F1 metrics in end-to-end evaluation (e.g., ontology-based similarity of 0.70 vs. 0.58). Ablation analyses demonstrated performance improvements with the addition of each module in RARE-PHENIX (extraction, standardization, and prioritization), supporting the value of modeling the full clinical phenotyping workflow. By modeling phenotyping as a clinically aligned workflow rather than a single extraction task, RARE-PHENIX provides structured, ranked phenotypes that are more concordant with clinician curation and has the potential to support human-in-the-loop rare disease diagnosis in real-world settings.
Chinese Translation
表型构建是罕见疾病诊断的基础,但从临床笔记中手动整理结构化表型既费时又难以扩展。现有的人工智能方法通常优化表型构建的单个组件,但并未将从临床文本中提取特征、将其标准化为人类表型本体(Human Phenotype Ontology, HPO)术语以及优先考虑具有诊断信息的HPO术语的完整临床工作流程进行操作化。我们开发了RARE-PHENIX,这是一个端到端的罕见疾病表型构建AI框架,集成了基于大语言模型的表型提取、基于本体的HPO术语标准化以及对具有诊断信息的表型的监督排名。我们使用来自11个未诊断疾病网络临床站点的2671名患者的数据训练了RARE-PHENIX,并在范德比尔特大学医学中心的16357份真实临床笔记上进行了外部验证。以临床医生整理的HPO术语作为金标准,RARE-PHENIX在端到端评估中(即,本体基础相似度为0.70 vs. 0.58)始终优于最先进的深度学习基线(PhenoBERT),在基于本体的相似度和精确率-召回-F1指标上表现出色。消融分析表明,RARE-PHENIX中每个模块(提取、标准化和优先排序)的增加均提高了性能,支持了对完整临床表型构建工作流程建模的价值。通过将表型构建建模为临床对齐的工作流程而非单一的提取任务,RARE-PHENIX提供了结构化的、排名的表型,这些表型与临床医生的整理结果更为一致,并有潜力在真实环境中支持人机协作的罕见疾病诊断。
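The standardization step in the workflow above can be approximated, at its crudest, by fuzzy string matching of an extracted phrase against HPO labels. A toy sketch over a three-term dictionary (the HPO IDs below are real terms, but the paper's standardizer is ontology-grounded and LLM-based, not a difflib lookup):

```python
import difflib

HPO_SUBSET = {  # three real HPO terms; illustrative, not full HPO coverage
    "HP:0001250": "Seizure",
    "HP:0001263": "Global developmental delay",
    "HP:0001644": "Dilated cardiomyopathy",
}

def standardize(phrase, cutoff=0.6):
    """Map a free-text phenotype phrase to the closest HPO label, or None."""
    by_lower = {label.lower(): hpo_id for hpo_id, label in HPO_SUBSET.items()}
    hit = difflib.get_close_matches(phrase.lower(), list(by_lower), n=1, cutoff=cutoff)
    if not hit:
        return None
    hpo_id = by_lower[hit[0]]
    return hpo_id, HPO_SUBSET[hpo_id]
```

A real standardizer must also handle negation, synonyms, and ontology hierarchy, which is where the gap to a naive matcher shows up in end-to-end metrics.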
cs.AI / 3 / 2602.20333
DMCD: Semantic-Statistical Framework for Causal Discovery
DMCD:用于因果发现的语义统计框架
Abstract
We present DMCD (DataMap Causal Discovery), a two-phase causal discovery framework that integrates LLM-based semantic drafting from variable metadata with statistical validation on observational data. In Phase I, a large language model proposes a sparse draft DAG, serving as a semantically informed prior over the space of possible causal structures. In Phase II, this draft is audited and refined via conditional independence testing, with detected discrepancies guiding targeted edge revisions. We evaluate our approach on three metadata-rich real-world benchmarks spanning industrial engineering, environmental monitoring, and IT systems analysis. Across these datasets, DMCD achieves competitive or leading performance against diverse causal discovery baselines, with particularly large gains in recall and F1 score. Probing and ablation experiments suggest that these improvements arise from semantic reasoning over metadata rather than memorization of benchmark graphs. Overall, our results demonstrate that combining semantic priors with principled statistical verification yields a high-performing and practically effective approach to causal structure learning.
Chinese Translation
我们提出了DMCD(DataMap因果发现),这是一种两阶段的因果发现框架,结合了基于大型语言模型的变量元数据语义草拟与对观察数据的统计验证。在第一阶段,一个大型语言模型提出一个稀疏的草拟有向无环图(DAG),作为可能因果结构空间的语义信息先验。在第二阶段,通过条件独立性检验对该草拟进行审计和精炼,检测到的差异指导针对性的边缘修订。我们在三个元数据丰富的真实世界基准上评估了我们的方法,这些基准涵盖了工业工程、环境监测和IT系统分析。在这些数据集中,DMCD在与多种因果发现基线的比较中实现了竞争性或领先的性能,尤其在召回率和F1分数上获得了显著提升。探测和消融实验表明,这些改进源于对元数据的语义推理,而非对基准图的记忆。总体而言,我们的结果表明,将语义先验与原则性统计验证相结合,能够产生一种高效且在实践中有效的因果结构学习方法。
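Phase II's audit of the drafted DAG can be illustrated with a Gaussian partial-correlation test, a common conditional-independence test choice; the abstract does not commit to this exact test, so take the sketch as one plausible instantiation:

```python
import numpy as np

def partial_corr(x, y, cond):
    """Correlation of x and y after residualizing both on the conditioning
    set `cond` (an n-by-k matrix; an intercept column is added)."""
    z = np.column_stack([np.ones(len(x)), cond]) if cond.size else np.ones((len(x), 1))
    rx = x - z @ np.linalg.lstsq(z, x, rcond=None)[0]
    ry = y - z @ np.linalg.lstsq(z, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

def audit_edge(x, y, cond, thresh=0.1):
    """Flag a drafted edge x -> y for removal when x and y look
    conditionally independent given `cond` (near-zero partial correlation)."""
    return abs(partial_corr(x, y, cond)) < thresh

# Chain x -> m -> y: the direct draft edge x -> y should be flagged given m.
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
m = x + 0.5 * rng.standard_normal(5000)
y = m + 0.5 * rng.standard_normal(5000)
```

In the full framework, flagged discrepancies guide targeted edge revisions of the LLM's draft rather than wholesale re-drafting.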
cs.AI / 4 / 2602.20422
Diffusion Modulation via Environment Mechanism Modeling for Planning
通过环境机制建模的扩散调制用于规划
Abstract
Diffusion models have shown promising capabilities in trajectory generation for planning in offline reinforcement learning (RL). However, conventional diffusion-based planning methods often fail to account for the fact that generating trajectories in RL requires unique consistency between transitions to ensure coherence in real environments. This oversight can result in considerable discrepancies between the generated trajectories and the underlying mechanisms of a real environment. To address this problem, we propose a novel diffusion-based planning method, termed as Diffusion Modulation via Environment Mechanism Modeling (DMEMM). DMEMM modulates diffusion model training by incorporating key RL environment mechanisms, particularly transition dynamics and reward functions. Experimental results demonstrate that DMEMM achieves state-of-the-art performance for planning with offline reinforcement learning.
Chinese Translation
扩散模型在离线强化学习(RL)中的轨迹生成方面展现了良好的能力。然而,传统的基于扩散的规划方法往往未能考虑到在RL中生成轨迹需要在过渡之间保持独特的一致性,以确保在真实环境中的连贯性。这一疏忽可能导致生成的轨迹与真实环境的基本机制之间存在显著差异。为了解决这一问题,我们提出了一种新颖的基于扩散的规划方法,称为通过环境机制建模的扩散调制(Diffusion Modulation via Environment Mechanism Modeling, DMEMM)。DMEMM通过结合关键的RL环境机制,特别是过渡动态和奖励函数,来调制扩散模型的训练。实验结果表明,DMEMM在离线强化学习的规划任务中实现了最先进的性能。
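The transition-dynamics side of the modulation can be pictured as a consistency penalty between generated transitions and a dynamics model. A schematic NumPy version (the paper integrates such signals into diffusion-model training; this standalone loss is our simplification):

```python
import numpy as np

def dynamics_consistency_loss(states, actions, dynamics_fn):
    """Mean squared disagreement between each generated transition
    (s_t, a_t) -> s_{t+1} and the dynamics model prediction f(s_t, a_t).

    states:  (T+1, d) trajectory, actions: (T, d_a).
    """
    pred_next = dynamics_fn(states[:-1], actions)
    return float(np.mean(np.sum((states[1:] - pred_next) ** 2, axis=-1)))

# Toy integrator dynamics s' = s + a; a trajectory built from it is consistent.
integrator = lambda s, a: s + a
actions = np.ones((4, 2))
states = np.vstack([np.zeros((1, 2)), np.cumsum(actions, axis=0)])
```

A trajectory sampler that keeps this loss near zero stays coherent with the environment mechanism, which is exactly the failure mode the abstract attributes to unmodulated diffusion planners.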
cs.AI / 5 / 2602.20424
Implicit Intelligence -- Evaluating Agents on What Users Don't Say
隐性智能——评估代理在用户未言明内容上的表现
Abstract
Real-world requests to AI agents are fundamentally underspecified. Natural human communication relies on shared context and unstated constraints that speakers expect listeners to infer. Current agentic benchmarks test explicit instruction-following but fail to evaluate whether agents can reason about implicit requirements spanning accessibility needs, privacy boundaries, catastrophic risks, and contextual constraints. We present Implicit Intelligence, an evaluation framework testing whether AI agents can move beyond prompt-following to become genuine goal-fulfillers, paired with Agent-as-a-World (AaW), a harness where interactive worlds are defined in human-readable YAML files and simulated by language models. Our scenarios feature apparent simplicity in user requests, hidden complexity in correct solutions, and discoverability of constraints through environmental exploration. Evaluating 16 frontier and open-weight models across 205 scenarios, we find that even the best-performing model achieves only 48.3% scenario pass rate, revealing substantial room for improvement in bridging the gap between literal instruction-following and human-like contextual reasoning.
Chinese Translation
现实世界中对人工智能代理的请求本质上是不完全指定的。自然的人类交流依赖于共享的背景和未言明的约束,讲话者期望听众能够推断这些内容。目前的代理基准测试主要关注明确的指令执行,但未能评估代理是否能够推理关于隐性需求的内容,这些需求包括可及性需求、隐私界限、灾难性风险和上下文约束。我们提出了隐性智能(Implicit Intelligence),一个评估框架,用于测试人工智能代理是否能够超越简单的提示跟随,成为真正的目标实现者,并配合代理作为世界(Agent-as-a-World, AaW)模型,这是一个通过人类可读的YAML文件定义的互动世界,并由语言模型进行模拟。我们的场景在用户请求中表现出明显的简单性,而在正确解决方案中隐藏着复杂性,并通过环境探索发现约束。在205个场景中评估了16个前沿和开放权重模型,我们发现即使是表现最好的模型,其场景通过率也仅为48.3%,这表明在缩小字面指令执行与类人上下文推理之间的差距方面还有很大的改进空间。
cs.AI / 6 / 2602.20426
Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use
学习重写工具描述以实现可靠的LLM-Agent工具使用
Abstract
The performance of LLM-based agents depends not only on the agent itself but also on the quality of the tool interfaces it consumes. While prior work has focused heavily on agent fine-tuning, tool interfaces, including natural language descriptions and parameter schemas, remain largely human-oriented and often become a bottleneck, especially when agents must select from large candidate tool sets. Existing approaches to improving tool interfaces rely on execution traces, which are frequently unavailable in cold-start or privacy-constrained settings, and typically optimize each tool independently, limiting scalability and generalization to unseen tools. We propose Trace-Free+, a curriculum learning framework that progressively transfers supervision from trace-rich settings to trace-free deployment, encouraging the model to abstract reusable interface-usage patterns and tool usage outcomes. To support this approach, we construct a large-scale dataset of high-quality tool interfaces using a structured workflow over a diverse collection of tools. Experiments on StableToolBench and RestBench show consistent gains on unseen tools, strong cross-domain generalization, and robustness as the number of candidate tools scales to over 100, demonstrating that tool interface optimization is a practical and deployable complement to agent fine-tuning.
Chinese Translation
基于LLM的代理的性能不仅依赖于代理本身,还依赖于其使用的工具接口的质量。虽然以往的研究主要集中在代理的微调上,但工具接口——包括自然语言描述和参数模式——仍然主要面向人类,往往成为瓶颈,尤其是在代理必须从大量候选工具集中进行选择时。现有的改善工具接口的方法依赖于执行轨迹,而在冷启动或隐私受限的环境中,这些轨迹通常不可用,并且通常独立优化每个工具,限制了可扩展性和对未见工具的泛化能力。我们提出了Trace-Free+,一个课程学习框架,逐步将监督从富含轨迹的环境转移到无轨迹的部署,鼓励模型抽象可重用的接口使用模式和工具使用结果。为了支持这一方法,我们构建了一个大规模的高质量工具接口数据集,采用结构化工作流程,涵盖多样化的工具。我们在StableToolBench和RestBench上的实验显示,在未见工具上取得了一致的增益,强大的跨领域泛化能力,以及随着候选工具数量增加到100个以上的鲁棒性,证明了工具接口优化是代理微调的一个实用且可部署的补充。
cs.AI / 7 / 2602.20459
PreScience: A Benchmark for Forecasting Scientific Contributions
PreScience:科学贡献预测的基准
Abstract
Can AI systems trained on the scientific record up to a fixed point in time forecast the scientific advances that follow? Such a capability could help researchers identify collaborators and impactful research directions, and anticipate which problems and methods will become central next. We introduce PreScience, a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction. PreScience is a carefully curated dataset of 98K recent AI-related research papers, featuring disambiguated author identities, temporally aligned scholarly metadata, and a structured graph of companion author publication histories and citations spanning 502K total papers. We develop baselines and evaluations for each task, including LACERScore, a novel LLM-based measure of contribution similarity that outperforms previous metrics and approximates inter-annotator agreement. We find substantial headroom remains in each task; for example, in contribution generation, frontier LLMs achieve only moderate similarity to the ground truth (GPT-5 averages 5.6 on a 1-10 scale). When composed into a 12-month end-to-end simulation of scientific production, the resulting synthetic corpus is systematically less diverse and less novel than human-authored research from the same period.
Chinese Translation
人工智能系统是否能够基于固定时间点之前的科学记录预测随后的科学进展?这种能力可以帮助研究人员识别合作者和有影响力的研究方向,并预测哪些问题和方法将成为下一个中心。我们介绍了PreScience——一个科学预测基准,它将研究过程分解为四个相互依赖的生成任务:合作者预测、先前工作选择、贡献生成和影响预测。PreScience是一个经过精心策划的数据集,包含98K篇近期与人工智能相关的研究论文,具有消歧义的作者身份、时间对齐的学术元数据,以及一个涵盖502K篇论文的伴随作者出版历史和引用的结构化图谱。我们为每个任务开发了基线和评估,包括LACERScore,这是一种基于大型语言模型(LLM)的贡献相似性新度量,超越了之前的指标,并近似于标注者间一致性。我们发现每个任务仍有显著的提升空间——例如,在贡献生成中,前沿的LLM与真实值的相似性仅为中等水平(GPT-5在1-10的评分中平均为5.6)。当将其组合成一个为期12个月的科学生产端到端模拟时,生成的合成语料库在多样性和新颖性上系统性地低于同一时期的人类撰写的研究。
cs.AI / 8 / 2602.20494
KairosVL: Orchestrating Time Series and Semantics for Unified Reasoning
KairosVL:协调时间序列与语义以实现统一推理
Abstract
Driven by the increasingly complex and decision-oriented demands of time series analysis, we introduce the Semantic-Conditional Time Series Reasoning task, which extends conventional time series analysis beyond purely numerical modeling to incorporate contextual and semantic understanding. To further enhance the model's reasoning capabilities on complex time series problems, we propose a two-round reinforcement learning framework: the first round strengthens the model's perception of fundamental temporal primitives, while the second focuses on semantic-conditioned reasoning. The resulting model, KairosVL, achieves competitive performance across both synthetic and real-world tasks. Extensive experiments and ablation studies demonstrate that our framework not only boosts performance but also preserves intrinsic reasoning ability and significantly improves generalization to unseen scenarios. To summarize, our work highlights the potential of combining semantic reasoning with temporal modeling and provides a practical framework for real-world time series intelligence, for which there is urgent demand.
Chinese Translation
随着时间序列分析需求日益复杂和以决策为导向,我们提出了语义条件时间序列推理任务,该任务将传统的时间序列分析从纯数字建模扩展到包含上下文和语义理解。为了进一步增强模型在复杂时间序列问题上的推理能力,我们提出了一个两轮强化学习框架:第一轮加强模型对基本时间原语的感知,而第二轮则专注于语义条件推理。最终模型KairosVL在合成和真实世界任务中均表现出竞争力。大量实验和消融研究表明,我们的框架不仅提升了性能,还保留了内在的推理能力,并显著改善了对未见场景的泛化能力。总之,我们的工作突显了将语义推理与时间建模相结合的潜力,并提供了一个满足现实世界时间序列智能迫切需求的实用框架。
cs.AI / 9 / 2602.20502
ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory
ActionEngine:通过状态机记忆从反应式到程序化的图形用户界面代理
Abstract
Existing Graphical User Interface (GUI) agents operate through step-by-step calls to vision-language models (taking a screenshot, reasoning about the next action, executing it, then repeating on the new page), resulting in high costs and latency that scale with the number of reasoning steps, and limited accuracy due to no persistent memory of previously visited pages. We propose ActionEngine, a training-free framework that transitions from reactive execution to programmatic planning through a novel two-agent architecture: a Crawling Agent that constructs an updatable state-machine memory of the GUIs through offline exploration, and an Execution Agent that leverages this memory to synthesize complete, executable Python programs for online task execution. To ensure robustness against evolving interfaces, execution failures trigger a vision-based re-grounding fallback that repairs the failed action and updates the memory. This design drastically improves both efficiency and accuracy: on Reddit tasks from the WebArena benchmark, our agent achieves 95% task success with on average a single LLM call, compared to 66% for the strongest vision-only baseline, while reducing cost by 11.8x and end-to-end latency by 2x. Together, these components yield scalable and reliable GUI interaction by combining global programmatic planning, crawler-validated action templates, and node-level execution with localized validation and repair.
Chinese Translation
现有的图形用户界面(GUI)代理通过逐步调用视觉语言模型进行操作——截取屏幕截图、推理下一步行动、执行该行动,然后在新页面上重复这一过程——这导致了随着推理步骤数量的增加而增加的高成本和延迟,并且由于缺乏对之前访问页面的持久记忆,准确性也受到限制。我们提出了ActionEngine,一个无须训练的框架,通过一种新颖的双代理架构,从反应式执行过渡到程序化规划:一个爬虫代理(Crawling Agent)通过离线探索构建可更新的GUI状态机记忆,另一个执行代理(Execution Agent)利用该记忆合成完整的、可执行的Python程序以进行在线任务执行。为了确保对不断变化的界面的鲁棒性,执行失败会触发基于视觉的重新定位回退机制,以修复失败的行动并更新记忆。该设计显著提高了效率和准确性:在WebArena基准测试中的Reddit任务上,我们的代理在平均仅调用一次大型语言模型(LLM)的情况下实现了95%的任务成功率,而最强的仅基于视觉的基线成功率为66%,同时成本降低了11.8倍,端到端延迟减少了2倍。这些组件结合全局程序化规划、爬虫验证的行动模板和节点级执行与局部验证和修复,实现了可扩展且可靠的GUI交互。
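The ActionEngine abstract describes a crawler-built state-machine memory that an execution agent replays as a full program, falling back to re-grounding on gaps. A minimal sketch of that idea, under assumed data structures (the paper's actual schema, method names, and action format are not given in the abstract):

```python
class StateMachineMemory:
    """Illustrative crawler-built GUI memory: states are pages, edges are
    crawler-validated action templates keyed by (state, intent)."""

    def __init__(self):
        self.edges = {}  # (state, intent) -> {"action": ..., "next_state": ...}

    def record(self, state, intent, action, next_state):
        # Offline: the Crawling Agent validates and stores one transition.
        self.edges[(state, intent)] = {"action": action, "next_state": next_state}

    def plan(self, start, intents):
        """Online: synthesize the whole action program up front instead of
        one VLM call per step. A missing edge returns None, standing in
        for the vision-based re-grounding fallback."""
        program, state = [], start
        for intent in intents:
            edge = self.edges.get((state, intent))
            if edge is None:
                return None
            program.append(edge["action"])
            state = edge["next_state"]
        return program
```

With the memory populated, a multi-step task costs zero model calls at planning time; only a gap or a runtime failure would invoke the fallback.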
cs.AI / 10 / 2602.20517
Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination
内在语言作为行为指导:可调节的多样行为模仿用于人机协调
Abstract
Effective human-AI coordination requires artificial agents capable of exhibiting and responding to human-like behaviors while adapting to changing contexts. Imitation learning has emerged as one of the prominent approaches to build such agents by training them to mimic human-demonstrated behaviors. However, current methods struggle to capture the inherent diversity and non-Markovian nature of human behavior and lack the ability to steer behavior at inference time. Drawing inspiration from the theory of human cognitive processes, where inner speech guides action selection before execution, we propose MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that uses language as an internal representation of behavioral intent. MIMIC employs the novel use of vision-language models as linguistic scaffolding to train a conditional variational autoencoder capable of generating inner speech from observations. A diffusion-based behavior cloning policy then selects actions conditioned on current observations and the generated inner speech. MIMIC enables fine-grained steering of behavior at inference time by conditioning the agent on behavior-specific speech. Experiments across robotic manipulation tasks and human-AI collaboration games demonstrate that MIMIC significantly enhances both behavior diversity and fidelity to human demonstrations while enabling nuanced behavioral steering without training on additional demonstrations. We open source our code and provide pre-trained MIMIC agents and qualitative demos at: https://mimic-research.github.io.
Chinese Translation
有效的人机协调需要人工智能代理能够表现和响应类人行为,同时适应不断变化的环境。模仿学习已成为构建此类代理的主要方法之一,通过训练它们模仿人类展示的行为。然而,当前的方法难以捕捉人类行为固有的多样性和非马尔可夫特性,并且缺乏在推理时引导行为的能力。我们受到人类认知过程理论的启发,该理论认为内在语言在执行前指导行动选择,我们提出了MIMIC(内在动机建模用于模仿与控制)框架,该框架使用语言作为行为意图的内部表征。MIMIC新颖地利用视觉-语言模型作为语言支架,训练一个条件变分自编码器,能够从观察中生成内在语言。然后,基于扩散的行为克隆策略根据当前观察和生成的内在语言选择行动。MIMIC通过将代理与特定行为的语言条件化,使得在推理时能够对行为进行细致的引导。针对机器人操作任务和人机协作游戏的实验表明,MIMIC显著增强了行为的多样性和对人类演示的忠实度,同时在不需要额外演示训练的情况下实现了细致的行为引导。我们开源了我们的代码,并提供了预训练的MIMIC代理和定性演示,网址为:https://mimic-research.github.io。
cs.AI / 11 / 2602.20558
From Logs to Language: Learning Optimal Verbalization for LLM-Based Recommendation in Production
从日志到语言:学习基于大语言模型的生产推荐的最佳表述
Abstract
Large language models (LLMs) are promising backbones for generative recommender systems, yet a key challenge remains underexplored: verbalization, i.e., converting structured user interaction logs into effective natural language inputs. Existing methods rely on rigid templates that simply concatenate fields, yielding suboptimal representations for recommendation. We propose a data-centric framework that learns verbalization for LLM-based recommendation. Using reinforcement learning, a verbalization agent transforms raw interaction histories into optimized textual contexts, with recommendation accuracy as the training signal. This agent learns to filter noise, incorporate relevant metadata, and reorganize information to improve downstream predictions. Experiments on a large-scale industrial streaming dataset show that learned verbalization delivers up to 93% relative improvement in discovery item recommendation accuracy over template-based baselines. Further analysis reveals emergent strategies such as user interest summarization, noise removal, and syntax normalization, offering insights into effective context construction for LLM-based recommender systems.
Chinese Translation
大型语言模型(LLMs)是生成推荐系统的有前景的基础,但一个关键挑战仍未得到充分探索:表述,即将结构化的用户交互日志转换为有效的自然语言输入。现有方法依赖于僵化的模板,仅仅将字段连接在一起,导致推荐的表现不佳。我们提出了一种以数据为中心的框架,学习基于LLM的推荐的表述。通过强化学习,一个表述代理将原始交互历史转化为优化的文本上下文,以推荐准确性作为训练信号。该代理学习过滤噪声、整合相关元数据,并重新组织信息,以改善下游预测。在一个大规模工业流数据集上的实验表明,学习到的表述在发现项目推荐准确性上相较于基于模板的基线提供了高达93%的相对提升。进一步分析揭示了新兴策略,如用户兴趣摘要、噪声去除和语法标准化,为基于LLM的推荐系统的有效上下文构建提供了见解。
cs.AI / 12 / 2602.20571
CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation
因果推理基准:用于因果识别与估计解耦评估的真实世界基准
Abstract
Many benchmarks for automated causal inference evaluate a system's performance based on a single numerical output, such as an Average Treatment Effect (ATE). This approach conflates two distinct steps in causal analysis: identification (formulating a valid research design under stated assumptions) and estimation (implementing that design numerically on finite data). We introduce CausalReasoningBenchmark, a benchmark of 173 queries across 138 real-world datasets, curated from 85 peer-reviewed research papers and four widely-used causal-inference textbooks. For each query a system must produce (i) a structured identification specification that names the strategy, the treatment, outcome, and control variables, and all design-specific elements, and (ii) a point estimate with a standard error. By scoring these two components separately, our benchmark enables granular diagnosis: it distinguishes failures in causal reasoning from errors in numerical execution. Baseline results with a state-of-the-art LLM show that, while the model correctly identifies the high-level strategy in 84% of cases, full identification-specification correctness drops to only 30%, revealing that the bottleneck lies in the nuanced details of research design rather than in computation. CausalReasoningBenchmark is publicly available on Hugging Face and is designed to foster the development of more robust automated causal-inference systems.
Chinese Translation
许多自动化因果推断基准通过单一数值输出(如平均处理效应(ATE))来评估系统的性能。这种方法混淆了因果分析中的两个不同步骤:识别——在规定假设下制定有效的研究设计,以及估计——在有限数据上对该设计进行数值实现。我们引入了因果推理基准(CausalReasoningBenchmark),这是一个包含173个查询和138个真实世界数据集的基准,这些数据集来自85篇经过同行评审的研究论文和四本广泛使用的因果推断教科书。对于每个查询,系统必须生成(i)一个结构化的识别规范,命名策略、处理、结果和控制变量,以及所有设计特定元素;(ii)一个带有标准误差的点估计。通过分别对这两个组件进行评分,我们的基准能够进行细致的诊断:它区分了因果推理中的失败与数值执行中的错误。使用最先进的语言模型(LLM)的基线结果显示,尽管模型在84%的情况下正确识别了高层次策略,但完整的识别规范正确率仅下降到30%,这表明瓶颈在于研究设计的细微细节而非计算。因果推理基准(CausalReasoningBenchmark)已在Hugging Face上公开发布,旨在促进更强大的自动化因果推断系统的发展。
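The benchmark's key design point is scoring identification and estimation separately. A sketch of such a disentangled scorer, with a fully assumed schema (field names, the relative tolerance, and the dict layout are illustrative, not the benchmark's actual format):

```python
def score_query(prediction, gold, rel_tol=0.05):
    """Score the two components independently, so a failure in causal
    reasoning is never conflated with an error in numerical execution.
    (Schema and tolerance are assumptions for illustration.)"""
    # Identification: every required design element must match exactly.
    spec_correct = all(
        prediction["spec"].get(field) == value
        for field, value in gold["spec"].items()
    )
    # Estimation: point estimate within a relative tolerance of gold.
    ate, gold_ate = prediction["ate"], gold["ate"]
    est_correct = abs(ate - gold_ate) <= rel_tol * max(abs(gold_ate), 1e-9)
    return {"identification": spec_correct, "estimation": est_correct}
```

Separating the two booleans is exactly what lets a leaderboard report "right strategy, wrong design details" versus "right design, wrong arithmetic".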
cs.AI / 13 / 2602.20624
Physics-based phenomenological characterization of cross-modal bias in multimodal models
基于物理的跨模态偏差的现象学表征在多模态模型中的应用
Abstract
The term 'algorithmic fairness' is used to evaluate whether AI models operate fairly in both comparative (where fairness is understood as formal equality, such as "treat like cases as like") and non-comparative (where unfairness arises from the model's inaccuracy, arbitrariness, or inscrutability) contexts. Recent advances in multimodal large language models (MLLMs) are breaking new ground in multimodal understanding, reasoning, and generation; however, we argue that inconspicuous distortions arising from complex multimodal interaction dynamics can lead to systematic bias. The purpose of this position paper is twofold: first, it is intended to acquaint AI researchers with phenomenological explainable approaches that rely on the physical entities that the machine experiences during training/inference, as opposed to the traditional cognitivist symbolic account or metaphysical approaches; second, it is to state that this phenomenological doctrine will be practically useful for tackling algorithmic fairness issues in MLLMs. We develop a surrogate physics-based model that describes transformer dynamics (i.e., semantic network structure and self-/cross-attention) to analyze the dynamics of cross-modal bias in MLLM, which are not fully captured by conventional embedding- or representation-level analyses. We support this position through multi-input diagnostic experiments: 1) perturbation-based analyses of emotion classification using Qwen2.5-Omni and Gemma 3n, and 2) dynamical analysis of Lorenz chaotic time-series prediction through the physical surrogate. Across two architecturally distinct MLLMs, we show that multimodal inputs can reinforce modality dominance rather than mitigate it, as revealed by structured error-attractor patterns under systematic label perturbation, complemented by dynamical analysis.
Chinese Translation
“算法公平性”一词用于评估人工智能模型在比较(公平性被理解为形式上的平等,例如“对待相似案例应相似”)和非比较(不公平性源于模型的不准确性、任意性或不可理解性)背景下的公平运作。近年来,多模态大型语言模型(MLLMs)的进展在多模态理解、推理和生成方面开辟了新天地;然而,我们认为,复杂的多模态交互动态所产生的不显著扭曲可能导致系统性偏差。本文的目的有两个:首先,旨在使人工智能研究人员了解依赖于机器在训练/推理过程中所经历的物理实体的现象学可解释方法,而非传统的认知主义符号论或形而上学方法;其次,声明这一现象学理论在解决MLLMs中的算法公平性问题时将具有实际应用价值。我们开发了一个基于物理的替代模型,描述了变换器动态(即语义网络结构和自/交叉注意力),以分析MLLM中的跨模态偏差动态,这些动态并未被传统的嵌入或表示层分析充分捕捉。我们通过多输入诊断实验支持这一观点:1)使用Qwen2.5-Omni和Gemma 3n进行情感分类的扰动基础分析,以及2)通过物理替代模型对洛伦兹混沌时间序列预测的动态分析。在两种架构上不同的MLLM中,我们展示了多模态输入可以强化模态主导性而非缓解它,这一点通过在系统性标签扰动下的结构化误差吸引子模式以及动态分析得以揭示。
cs.AI / 14 / 2602.20628
When can we trust untrusted monitoring? A safety case sketch across collusion strategies
我们何时可以信任不可信的监控?跨越合谋策略的安全案例草图
Abstract
AIs are increasingly being deployed with greater autonomy and capabilities, which increases the risk that a misaligned AI may be able to cause catastrophic harm. Untrusted monitoring -- using one untrusted model to oversee another -- is one approach to reducing risk. Justifying the safety of an untrusted monitoring deployment is challenging because developers cannot safely deploy a misaligned model to test their protocol directly. In this paper, we develop upon existing methods for rigorously demonstrating safety based on pre-deployment testing. We relax assumptions that previous AI control research made about the collusion strategies a misaligned AI might use to subvert untrusted monitoring. We develop a taxonomy covering passive self-recognition, causal collusion (hiding pre-shared signals), acausal collusion (hiding signals via Schelling points), and combined strategies. We create a safety case sketch to clearly present our argument, explicitly state our assumptions, and highlight unsolved challenges. We identify conditions under which passive self-recognition could be a more effective collusion strategy than those studied previously. Our work builds towards more robust evaluations of untrusted monitoring.
Chinese Translation
人工智能(AI)正越来越多地被部署以更大的自主性和能力,这增加了不对齐的AI可能造成灾难性伤害的风险。不可信监控——使用一个不可信的模型来监督另一个模型——是降低风险的一种方法。证明不可信监控部署的安全性是具有挑战性的,因为开发者无法安全地部署一个不对齐的模型来直接测试他们的协议。在本文中,我们在现有方法的基础上发展出一种严格证明安全性的方法,该方法基于部署前的测试。我们放宽了以往AI控制研究对不对齐的AI可能使用的合谋策略的假设,以便颠覆不可信监控。我们开发了一个分类法,涵盖了被动自我识别、因果合谋(隐藏预共享信号)、非因果合谋(通过谢林点隐藏信号)和组合策略。我们创建了一个安全案例草图,以清晰地呈现我们的论点,明确陈述我们的假设,并突出尚未解决的挑战。我们确定了在何种条件下,被动自我识别可能比以往研究的合谋策略更有效。我们的工作为对不可信监控的更稳健评估奠定了基础。
cs.AI / 15 / 2602.20638
Identifying two piecewise linear additive value functions from anonymous preference information
从匿名偏好信息中识别两个分段线性加性价值函数
Abstract
Eliciting a preference model involves asking a person, called the decision-maker, a series of questions. We assume that these preferences can be represented by an additive value function. In this work, we query two decision-makers simultaneously with the aim of eliciting their respective value functions. For each query we receive two answers, without noise, but without knowing which answer corresponds to which decision-maker. We propose an elicitation procedure that identifies the two preference models when the marginal value functions are piecewise linear with known breakpoints.
Chinese Translation
引导偏好模型的获取涉及向一个被称为决策者的人提出一系列问题。我们假设这些偏好可以通过加性价值函数来表示。在本研究中,我们同时询问两个决策者,以期获取他们各自的价值函数。对于每个查询,我们收到两个答案,且没有噪声,但不知道哪个答案对应于哪个决策者。我们提出了一种引导程序,当边际价值函数为具有已知断点的分段线性时,可以识别这两个偏好模型。
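The model class in the abstract is concrete enough to state in a few lines: with known breakpoints, each piecewise-linear marginal value function is fully determined by its values at those breakpoints, so elicitation reduces to identifying finitely many numbers. A minimal sketch (the representation is generic, not the paper's notation):

```python
import numpy as np

def marginal_value(x, breakpoints, values):
    """Piecewise-linear marginal value function: linear interpolation
    between its (known) breakpoints."""
    return float(np.interp(x, breakpoints, values))

def additive_value(alternative, marginals):
    """Additive model: the overall value of an alternative is the sum
    of its marginal values across criteria."""
    return sum(
        marginal_value(x, bp, v) for x, (bp, v) in zip(alternative, marginals)
    )
```

Eliciting a decision-maker's model then means pinning down the `values` vectors, one entry per breakpoint per criterion.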
cs.AI / 16 / 2602.20639
Grounding LLMs in Scientific Discovery via Embodied Actions
通过具身动作将大型语言模型(LLMs)与科学发现相结合
Abstract
Large Language Models (LLMs) have shown significant potential in scientific discovery but struggle to bridge the gap between theoretical reasoning and verifiable physical simulation. Existing solutions operate in a passive "execute-then-respond" loop and thus lack runtime perception, blinding agents to transient anomalies (e.g., numerical instability or diverging oscillations). To address this limitation, we propose EmbodiedAct, a framework that transforms established scientific software into active embodied agents by grounding LLMs in embodied actions with a tight perception-execution loop. We instantiate EmbodiedAct within MATLAB and evaluate it on complex engineering design and scientific modeling tasks. Extensive experiments show that EmbodiedAct significantly outperforms existing baselines, achieving SOTA performance by ensuring satisfactory reliability and stability in long-horizon simulations and enhanced accuracy in scientific modeling.
Chinese Translation
大型语言模型(LLMs)在科学发现中展现出显著潜力,但在理论推理与可验证的物理模拟之间存在鸿沟。现有解决方案在被动的“执行-再响应”循环中运作,因此缺乏运行时感知,导致代理无法察觉瞬态异常(例如,数值不稳定或发散振荡)。为了解决这一局限性,我们提出了EmbodiedAct框架,该框架通过将LLMs与具身动作结合,构建紧密的感知-执行循环,将成熟的科学软件转化为主动的具身代理。我们在MATLAB中实现了EmbodiedAct,并在复杂的工程设计和科学建模任务上进行了评估。大量实验表明,EmbodiedAct显著优于现有基线,在确保长时间模拟中的可靠性和稳定性以及提高科学建模的准确性方面,实现了SOTA(最先进技术)性能。
cs.AI / 17 / 2602.20659
Recursive Belief Vision Language Model
递归信念视觉语言模型
Abstract
Current vision-language-action (VLA) models struggle with long-horizon manipulation under partial observability. Most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high inference latency. Semantic reasoning alone is not the primary bottleneck in long-horizon manipulation. Instead, VLAs lack persistent, action-conditioned state representations and exhibit limited temporal and physical reasoning, making them ill-suited for multi-stage control. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. Queried once for high-level intent, the VLM provides task specification, while the belief tracks task progress and enables phase-aware, causally grounded control under partial observability without storing raw observations or scaling memory with time. The belief and intent jointly condition a diffusion policy for robust closed-loop execution. RB-VLA outperforms prior VLAs on long-horizon benchmarks, achieving 52.5% and 37.5% higher success on multi-stage pick-and-place and stacking tasks, respectively, compared to π0. It also reduces inference latency by up to 5x relative to baselines and eliminates memory growth across timesteps observed in existing VLAs. Ablations show that the belief module is the primary driver of performance, increasing success rates from 32.5% to 77.5%. These results demonstrate the effectiveness of belief-based state representations for long-horizon VLA policies.
Chinese Translation
当前的视觉-语言-行动(VLA)模型在部分可观测性下进行长时间操作时面临挑战。大多数现有方法仍然以观察为驱动,依赖于短期上下文窗口或对视觉-语言模型(VLM)的重复查询。这导致任务进展的丧失、在感知混淆下的动作重复以及高推理延迟。单靠语义推理并不是长时间操作的主要瓶颈。相反,VLA缺乏持久的、基于动作的状态表示,并且在时间和物理推理方面表现有限,使其不适合多阶段控制。本文介绍了RB-VLA,一种以信念为中心的架构,通过自监督的世界模型目标进行训练,保持紧凑的潜在状态编码与任务相关的历史、动态和物体交互。VLM在高层意图上仅被查询一次,提供任务规范,而信念则跟踪任务进展,并在部分可观测性下实现阶段感知的因果基础控制,而无需存储原始观察或随着时间的推移扩展内存。信念和意图共同条件化扩散策略,以实现稳健的闭环执行。RB-VLA在长时间基准测试中优于之前的VLA,在多阶段拾取和放置以及堆叠任务中,成功率分别比π0高出52.5%和37.5%。它还将推理延迟相对于基线减少了多达5倍,并消除了现有VLA中观察到的时间步长内存增长。消融实验表明,信念模块是性能的主要驱动因素,将成功率从32.5%提高到77.5%。这些结果证明了基于信念的状态表示在长时间VLA策略中的有效性。
cs.AI / 18 / 2602.20687
How Foundational Skills Influence VLM-based Embodied Agents:A Native Perspective
基础技能如何影响基于视觉语言模型的具身智能体:一种本土视角
Abstract
Recent advances in vision-language models (VLMs) have shown promise for human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents often rely on high-level commands or discretized action spaces, which are non-native settings that differ markedly from real-world control. In addition, current benchmarks focus primarily on high-level tasks and lack joint evaluation and analysis at both low and high levels. To address these limitations, we present NativeEmbodied, a challenging benchmark for VLM-driven embodied agents that uses a unified, native low-level action space. Built on diverse simulated scenes, NativeEmbodied includes three representative high-level tasks in complex scenarios to evaluate overall performance. For more detailed analysis, we further decouple the skills required by complex tasks and construct four types of low-level tasks, each targeting a fundamental embodied skill. This joint evaluation across task and skill granularities enables fine-grained assessment of embodied agents. Experiments with state-of-the-art VLMs reveal clear deficiencies in several fundamental embodied skills, and further analysis shows that these bottlenecks significantly limit performance on high-level tasks. NativeEmbodied highlights key challenges for current VLM-driven embodied agents and provides insights to guide future research.
Chinese Translation
最近在视觉语言模型(VLMs)方面的进展显示出实现人类水平具身智能的潜力。然而,现有的基于VLM的具身智能体基准往往依赖于高层指令或离散化的动作空间,这些非本土环境与现实世界控制有显著差异。此外,当前的基准主要集中于高层任务,缺乏对低层和高层的联合评估与分析。为了解决这些局限性,我们提出了NativeEmbodied,这是一个针对基于VLM的具身智能体的挑战性基准,采用统一的本土低层动作空间。NativeEmbodied建立在多样化的模拟场景上,包含三个复杂场景中的代表性高层任务,以评估整体性能。为了进行更详细的分析,我们进一步解耦复杂任务所需的技能,并构建四种类型的低层任务,每种任务针对一种基本的具身技能。这种跨任务和技能粒度的联合评估使得对具身智能体的细致评估成为可能。与最先进的VLM进行的实验揭示了几种基本具身技能的明显不足,进一步分析表明这些瓶颈显著限制了高层任务的性能。NativeEmbodied突出了当前基于VLM的具身智能体面临的关键挑战,并为未来的研究提供了指导性见解。
cs.AI / 19 / 2602.20696
PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding
PromptCD:通过极性提示对比解码增强测试时行为
Abstract
Reliable AI systems require large language models (LLMs) to exhibit behaviors aligned with human preferences and values. However, most existing alignment approaches operate at training time and rely on additional high-quality data, incurring significant computational and annotation costs. While recent work has shown that contrastive decoding can leverage a model's internal distributions to improve specific capabilities, its applicability remains limited to narrow behavioral scopes and scenarios. In this work, we introduce Polarity-Prompt Contrastive Decoding (PromptCD), a test-time behavior control method that generalizes contrastive decoding to broader enhancement settings. PromptCD constructs paired positive and negative guiding prompts for a target behavior and contrasts model responses (specifically, token-level probability distributions in LLMs and visual attention patterns in VLMs) to reinforce desirable outcomes. This formulation extends contrastive decoding to a wide range of enhancement objectives and is applicable to both LLMs and Vision-Language Models (VLMs) without additional training. For LLMs, experiments on the "3H" alignment objectives (helpfulness, honesty, and harmlessness) demonstrate consistent and substantial improvements, indicating that post-trained models can achieve meaningful self-enhancement purely at test time. For VLMs, we further analyze contrastive effects on visual attention, showing that PromptCD significantly improves VQA performance by reinforcing behavior-consistent visual grounding. Collectively, these results highlight PromptCD as a simple, general, and cost-efficient strategy for reliable behavior control across modalities.
Chinese Translation
可靠的人工智能系统要求大型语言模型(LLMs)展现与人类偏好和价值观一致的行为。然而,大多数现有的对齐方法在训练时进行,并依赖额外的高质量数据,导致显著的计算和标注成本。尽管近期研究表明对比解码可以利用模型的内部分布来改善特定能力,但其适用性仍然局限于狭窄的行为范围和场景。在本研究中,我们提出了极性提示对比解码(PromptCD),这是一种测试时行为控制方法,将对比解码推广到更广泛的增强设置。PromptCD 为目标行为构建成对的正向和负向引导提示,并对模型响应进行对比——特别是大型语言模型中的令牌级概率分布和视觉语言模型中的视觉注意模式——以强化期望结果。这一公式将对比解码扩展到广泛的增强目标,并适用于大型语言模型和视觉语言模型(VLMs),无需额外训练。对于大型语言模型,关于“3H”对齐目标(有用性、诚实性和无害性)的实验表明了一致且显著的改进,表明后训练模型可以在测试时实现有意义的自我增强。对于视觉语言模型,我们进一步分析了对比效应对视觉注意的影响,显示PromptCD通过强化行为一致的视觉定位显著提高了视觉问答(VQA)性能。总体而言,这些结果突显了PromptCD作为一种简单、通用且成本高效的策略,在多模态中实现可靠的行为控制。
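The core mechanism in the PromptCD abstract is contrasting token distributions from a positively- and a negatively-prompted pass. The abstract does not give the exact combination rule, so the sketch below uses the standard contrastive-decoding form (amplify what the positive prompt favors over the negative one); the `alpha` weight is an assumed hyperparameter:

```python
import numpy as np

def contrastive_next_token_probs(logits_pos, logits_neg, alpha=1.0):
    """Combine next-token logits from two passes over the same input,
    one prefixed with a positive guiding prompt and one with a negative
    guiding prompt. Tokens favored under the positive prompt but not the
    negative one are up-weighted before the softmax."""
    scores = (1.0 + alpha) * np.asarray(logits_pos) - alpha * np.asarray(logits_neg)
    exp = np.exp(scores - scores.max())  # stable softmax
    return exp / exp.sum()
```

Decoding then samples (or takes argmax) from this contrasted distribution instead of the raw positive-prompt distribution, which is what requires no extra training.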
cs.AI / 20 / 2602.20706
Online Algorithms with Unreliable Guidance
具有不可靠指导的在线算法
Abstract
This paper introduces a new model for ML-augmented online decision making, called online algorithms with unreliable guidance (OAG). This model completely separates between the predictive and algorithmic components, thus offering a single well-defined analysis framework that relies solely on the considered problem. Formulated through the lens of request-answer games, an OAG algorithm receives, with each incoming request, a piece of guidance which is taken from the problem's answer space; ideally, this guidance is the optimal answer for the current request, however with probability $\beta$, the guidance is adversarially corrupted. The goal is to develop OAG algorithms that admit good competitiveness when $\beta = 0$ (a.k.a. consistency) as well as when $\beta = 1$ (a.k.a. robustness); the appealing notion of smoothness, that in most prior work required a dedicated loss function, now arises naturally as $\beta$ shifts from $0$ to $1$. We then describe a systematic method, called the drop or trust blindly (DTB) compiler, which transforms any online algorithm into a learning-augmented online algorithm in the OAG model. Given a prediction-oblivious online algorithm, its learning-augmented counterpart produced by applying the DTB compiler either follows the incoming guidance blindly or ignores it altogether and proceeds as the initial algorithm would have; the choice between these two alternatives is based on the outcome of a (biased) coin toss. As our main technical contribution, we prove (rigorously) that although remarkably simple, the class of algorithms produced via the DTB compiler includes algorithms with attractive consistency-robustness guarantees for three classic online problems: for caching and uniform metrical task systems our algorithms are optimal, whereas for bipartite matching (with adversarial arrival order), our algorithm outperforms the state-of-the-art.
Chinese Translation
本文介绍了一种用于机器学习增强的在线决策的新模型,称为具有不可靠指导的在线算法(OAG)。该模型完全分离了预测和算法组件,从而提供了一个单一的明确定义的分析框架,仅依赖于所考虑的问题。通过请求-回答游戏的视角进行表述,OAG算法在每个传入请求中接收一条来自问题回答空间的指导;理想情况下,这条指导是当前请求的最优答案,但以概率$\beta$,该指导会被对手恶意破坏。我们的目标是开发OAG算法,使其在$\beta = 0$(即一致性)和$\beta = 1$(即鲁棒性)时都具有良好的竞争性;在大多数先前的工作中需要专门损失函数的平滑性概念,现在随着$\beta$从$0$到$1$的变化自然出现。接下来,我们描述了一种系统的方法,称为盲目丢弃或信任(DTB)编译器,它将任何在线算法转换为OAG模型中的学习增强在线算法。给定一个对预测无感知的在线算法,通过应用DTB编译器生成的学习增强对应算法要么盲目遵循传入的指导,要么完全忽略它,并按照初始算法的方式进行;这两种选择的依据是一次(有偏)掷硬币的结果。作为我们的主要技术贡献,我们严格证明,尽管相当简单,通过DTB编译器生成的算法类包括对三种经典在线问题具有吸引力的一致性-鲁棒性保证的算法:对于缓存和均匀度量任务系统,我们的算法是最优的,而对于二分匹配(具有对手到达顺序),我们的算法超越了当前最先进的水平。
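The DTB compiler in the abstract is simple enough to sketch directly: one biased coin toss up front decides whether to follow all incoming guidance blindly or ignore it entirely and run the base algorithm unchanged. The interface below (request/guidance/state signature, the `trust_prob` bias, the seed) is an assumption for illustration, not the paper's formalism:

```python
import random

def dtb_compile(base_algorithm, trust_prob, seed=0):
    """Drop-or-Trust-Blindly: wrap a prediction-oblivious online
    algorithm into a learning-augmented one. A single (biased) coin
    toss chooses between the two regimes for the whole run."""
    trust = random.Random(seed).random() < trust_prob  # one toss up front
    def augmented(request, guidance, state=None):
        if trust:
            return guidance            # follow guidance blindly
        return base_algorithm(request, state)  # proceed as if no guidance
    return augmented
```

Setting `trust_prob` near 1 leans toward consistency (great when guidance is optimal); near 0 it leans toward robustness (never worse than the base algorithm under corrupted guidance).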
cs.AI / 21 / 2602.20708
ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction
ICON:基于推理时修正的代理间接提示注入防御
Abstract
Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, where malicious instructions in retrieved content hijack the agent's execution. Existing defenses typically rely on strict filtering or refusal mechanisms, which suffer from a critical limitation: over-refusal, prematurely terminating valid agentic workflows. We propose ICON, a probing-to-mitigation framework that neutralizes attacks while preserving task continuity. Our key insight is that IPI attacks leave distinct over-focusing signatures in the latent space. We introduce a Latent Space Trace Prober to detect attacks based on high intensity scores. Subsequently, a Mitigating Rectifier performs surgical attention steering that selectively manipulates adversarial query-key dependencies while amplifying task-relevant elements to restore the LLM's functional trajectory. Extensive evaluations on multiple backbones show that ICON achieves a competitive 0.4% ASR, matching commercial-grade detectors, while yielding a task-utility gain of over 50%. Furthermore, ICON demonstrates robust out-of-distribution (OOD) generalization and extends effectively to multi-modal agents, establishing a superior balance between security and efficiency.
Chinese Translation
大型语言模型(LLM)代理易受间接提示注入(IPI)攻击,恶意指令在检索内容中劫持代理的执行。现有防御通常依赖严格的过滤或拒绝机制,这存在一个关键限制:过度拒绝,过早终止有效的代理工作流。我们提出了ICON,一个从探测到缓解的框架,能够在保持任务连续性的同时中和攻击。我们的关键见解是,IPI攻击在潜在空间中留下明显的过度聚焦特征。我们引入了一种潜在空间追踪探测器,根据高强度评分检测攻击。随后,缓解整流器执行外科式的注意力引导,选择性地操控对抗性查询关键依赖,同时增强与任务相关的元素,以恢复LLM的功能轨迹。在多个基础模型上的广泛评估表明,ICON实现了竞争性的0.4% ASR,匹配商业级检测器,同时任务效用提升超过50%。此外,ICON展示了强大的分布外(OOD)泛化能力,并有效扩展到多模态代理,建立了安全性与效率之间的优越平衡。
cs.AI / 22 / 2602.20710
Counterfactual Simulation Training for Chain-of-Thought Faithfulness
反事实模拟训练以提高思维链的可信度
Abstract
Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output. But well-known problems with CoT faithfulness severely limit what insights can be gained from this practice. In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual inputs. We apply CST in two settings: (1) CoT monitoring with cue-based counterfactuals, to detect when models rely on spurious features, reward hack, or are sycophantic, and (2) counterfactual simulation over generic model-based counterfactuals, to encourage models to produce more faithful, generalizable reasoning in the CoT. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue-based counterfactuals (by 35 accuracy points) as well as simulatability over generic counterfactuals (by 2 points). We further show that: (1) CST outperforms prompting baselines, (2) rewriting unfaithful CoTs with an LLM is 5x more efficient than RL alone, (3) faithfulness improvements do not generalize to dissuading cues (as opposed to persuading cues), and (4) larger models do not show more faithful CoT out of the box, but they do benefit more from CST. These results suggest that CST can improve CoT faithfulness in general, with promising applications for CoT monitoring. Code for experiments in this paper is available at https://github.com/peterbhase/counterfactual-simulation-training
Chinese Translation
检查思维链(Chain-of-Thought, CoT)推理是理解大型语言模型(LLM)产生输出原因的最常见方法之一。然而,CoT 可信度的众所周知的问题严重限制了从这一实践中获得的见解。在本文中,我们介绍了一种名为反事实模拟训练(Counterfactual Simulation Training, CST)的训练方法,旨在通过奖励能够使模拟器准确预测模型在反事实输入下输出的思维链,从而提高 CoT 的可信度。我们在两个场景中应用 CST:(1)基于提示的反事实的 CoT 监控,以检测模型何时依赖于虚假特征、奖励黑客行为或表现出谄媚行为,以及(2)针对通用模型的反事实进行反事实模拟,以鼓励模型在 CoT 中产生更可信、可推广的推理。对高达 235B 参数的模型进行的实验表明,CST 可以显著提高基于提示的反事实的监控准确性(提高 35 个准确度点),以及在通用反事实上的可模拟性(提高 2 个点)。我们进一步显示:(1)CST 超越了提示基线,(2)使用 LLM 重写不可信的 CoT 比仅使用强化学习(RL)效率高 5 倍,(3)可信度的提高并不适用于劝阻提示(与劝说提示相对),以及(4)更大的模型在初始状态下并未表现出更可信的 CoT,但它们从 CST 中受益更多。这些结果表明,CST 可以普遍提高 CoT 的可信度,并对 CoT 监控有着良好的应用前景。本文实验的代码可在 https://github.com/peterbhase/counterfactual-simulation-training 获取。
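The CST reward in the abstract has a direct operational reading: a CoT is rewarded in proportion to how well a simulator, reading only that CoT, predicts the model's actual outputs on counterfactual inputs. A minimal sketch of that reward (the simulator interface and the fraction-correct metric are assumptions; the paper may use a different scoring rule):

```python
def cst_reward(simulator, cot, counterfactuals):
    """Counterfactual Simulation Training reward: fraction of
    counterfactual inputs on which a simulator, given only the CoT,
    matches the model's actual output. A faithful CoT should let the
    simulator generalize; an unfaithful one should not."""
    hits = sum(simulator(cot, x) == y_model for x, y_model in counterfactuals)
    return hits / len(counterfactuals)
```

In training, this scalar would serve as the RL reward for the policy that produced the CoT, so CoTs that reveal the model's true decision rule score higher than post-hoc rationalizations.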
cs.AI / 23 / 2602.20722
Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning
缓冲区的重要性:释放大语言模型推理中离策略强化学习的潜力
Abstract
Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinders learning efficiency on difficult samples during large language models post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework to improve the data efficiency in large language models post-training. It dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while holding a lower bound guarantee for policy improvement. Extensive experiments further demonstrate that BAPO achieves an average 12.5% improvement over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO successfully resolves 40.7% of problems that base models consistently fail to solve.
Chinese Translation
传统的基于策略的可验证奖励强化学习(RLVR)框架存在经验浪费和奖励同质性的问题,这直接阻碍了大语言模型后训练阶段在困难样本上的学习效率。本文提出了一种批适应策略优化(Batch Adaptation Policy Optimization, BAPO)方法,作为一种离策略RLVR框架,以提高大语言模型后训练阶段的数据效率。该方法通过重新评估历史上困难的样本并重用高质量样本,动态选择训练批次,同时对策略改进提供下限保证。大量实验进一步表明,BAPO在数学、规划和视觉推理任务上相比于GRPO平均提高了12.5%。重要的是,BAPO成功解决了基础模型始终无法解决的40.7%的问题。
cs.AI / 24 / 2602.20723
Modality-Guided Mixture of Graph Experts with Entropy-Triggered Routing for Multimodal Recommendation
基于模态引导的图专家混合模型与熵触发路由的多模态推荐
Abstract
Multimodal recommendation enhances ranking by integrating user-item interactions with item content, which is particularly effective under sparse feedback and long-tail distributions. However, multimodal signals are inherently heterogeneous and can conflict in specific contexts, making effective fusion both crucial and challenging. Existing approaches often rely on shared fusion pathways, leading to entangled representations and modality imbalance. To address these issues, we propose \textbf{MAGNET}, a \textbf{M}odality-Guided Mixture of \textbf{A}daptive \textbf{G}raph Experts \textbf{N}etwork with Progressive \textbf{E}ntropy-\textbf{T}riggered Routing for Multimodal Recommendation, designed to enhance controllability, stability, and interpretability in multimodal fusion. MAGNET couples interaction-conditioned expert routing with structure-aware graph augmentation, so that both \emph{what} to fuse and \emph{how} to fuse are explicitly controlled and interpretable. At the representation level, a dual-view graph learning module augments the interaction graph with content-induced edges, improving coverage for sparse and long-tail items while preserving collaborative structure via parallel encoding and lightweight fusion. At the fusion level, MAGNET employs structured experts with explicit modality roles -- dominant, balanced, and complementary -- enabling a more interpretable and adaptive combination of behavioral, visual, and textual cues. To further stabilize sparse routing and prevent expert collapse, we introduce a two-stage entropy-weighting mechanism that monitors routing entropy. This mechanism automatically transitions training from an early coverage-oriented regime to a later specialization-oriented regime, progressively balancing expert utilization and routing confidence. Extensive experiments on public benchmarks demonstrate consistent improvements over strong baselines.
Chinese Translation
多模态推荐通过将用户-物品交互与物品内容相结合来增强排名,特别在稀疏反馈和长尾分布下效果显著。然而,多模态信号本质上是异质的,并且在特定上下文中可能存在冲突,这使得有效融合既至关重要又具有挑战性。现有方法通常依赖于共享的融合路径,导致表示纠缠和模态不平衡。为了解决这些问题,我们提出了MAGNET,即Modality-Guided Mixture of Adaptive Graph Experts Network with Progressive Entropy-Triggered Routing for Multimodal Recommendation,旨在增强多模态融合中的可控性、稳定性和可解释性。MAGNET将交互条件下的专家路由与结构感知的图增强相结合,使得“融合什么”和“如何融合”都能被明确控制和解释。在表示层面,双视图图学习模块通过内容诱导的边增强交互图,提高了稀疏和长尾物品的覆盖率,同时通过并行编码和轻量级融合保持协作结构。在融合层面,MAGNET采用具有明确模态角色的结构化专家——主导、平衡和互补——使行为、视觉和文本线索的组合更加可解释和自适应。为了进一步稳定稀疏路由并防止专家崩溃,我们引入了一种两阶段熵加权机制来监控路由熵。该机制自动将训练从早期的覆盖导向阶段过渡到后期的专业化导向阶段,逐步平衡专家利用率和路由信心。在公共基准上的大量实验表明,与强基线相比,MAGNET在性能上有一致的提升。
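MAGNET's entropy-triggered schedule monitors routing entropy and switches regimes when it drops. A sketch of the monitoring side, under assumptions (the threshold value and the two-regime decision rule are illustrative; the paper's actual weighting mechanism is not specified in the abstract):

```python
import numpy as np

def routing_entropy(gate_probs):
    """Mean entropy of the router's per-sample expert distribution:
    high entropy = experts used broadly, low entropy = specialization."""
    p = np.clip(np.asarray(gate_probs, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

def training_stage(gate_probs, threshold=0.7):
    """Two-stage schedule: stay coverage-oriented while routing entropy
    is above an (assumed) threshold, then switch to the
    specialization-oriented regime."""
    return "coverage" if routing_entropy(gate_probs) >= threshold else "specialization"
```

Triggering on measured entropy, rather than on a fixed step count, is what lets the transition adapt to how quickly the router actually converges, and it gives a direct signal for detecting expert collapse (entropy falling too far, too fast).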
cs.AI / 25 / 2602.20728
Balancing Multiple Objectives in Urban Traffic Control with Reinforcement Learning from AI Feedback
基于AI反馈的强化学习在城市交通控制中的多目标平衡
Abstract
Reward design has been one of the central challenges for real world reinforcement learning (RL) deployment, especially in settings with multiple objectives. Preference-based RL offers an appealing alternative by learning from human preferences over pairs of behavioural outcomes. More recently, RL from AI feedback (RLAIF) has demonstrated that large language models (LLMs) can generate preference labels at scale, mitigating the reliance on human annotators. However, existing RLAIF work typically focuses only on single-objective tasks, leaving the open question of how RLAIF handles systems that involve multiple objectives. In such systems trade-offs among conflicting objectives are difficult to specify, and policies risk collapsing into optimizing for a dominant goal. In this paper, we explore the extension of the RLAIF paradigm to multi-objective self-adaptive systems. We show that multi-objective RLAIF can produce policies that yield balanced trade-offs reflecting different user priorities without laborious reward engineering. We argue that integrating RLAIF into multi-objective RL offers a scalable path toward user-aligned policy learning in domains with inherently conflicting objectives.
Chinese Translation
奖励设计一直是现实世界强化学习(RL)部署中的核心挑战之一,尤其是在涉及多个目标的环境中。基于偏好的强化学习提供了一种吸引人的替代方案,通过学习人类对行为结果对的偏好来进行优化。最近,基于AI反馈的强化学习(RLAIF)展示了大型语言模型(LLMs)能够大规模生成偏好标签,从而减少对人类标注者的依赖。然而,现有的RLAIF研究通常仅关注单一目标任务,尚未解决RLAIF如何处理涉及多个目标的系统这一开放性问题。在这类系统中,冲突目标之间的权衡难以明确,策略可能会陷入仅优化主导目标的风险。本文探讨了将RLAIF范式扩展到多目标自适应系统的可能性。我们展示了多目标RLAIF能够生成反映不同用户优先级的平衡权衡策略,而无需繁琐的奖励工程。我们认为,将RLAIF整合到多目标强化学习中,为在本质上存在冲突目标的领域实现用户对齐的策略学习提供了一条可扩展的路径。
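The abstract does not detail how AI feedback arbitrates between competing objectives. As a generic, hypothetical sketch (the scalarization scheme, objective names, and weights below are illustrative assumptions, not the paper's method), an AI labeler could score a pair of rollouts on several objectives and emit a preference for the better weighted trade-off:

```python
def scalarize(objectives, weights):
    # Weighted sum of per-objective scores (e.g., throughput, delay, emissions).
    return sum(w * v for w, v in zip(weights, objectives))

def preference_label(rollout_a, rollout_b, weights):
    # Emulates an AI labeler: prefer the rollout whose scalarized score is higher.
    # Returns "A", "B", or "tie".
    score_a = scalarize(rollout_a, weights)
    score_b = scalarize(rollout_b, weights)
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"
```

Different user priorities correspond to different weight vectors, so the same pair of rollouts can yield different preference labels, which is what lets the learned policy reflect distinct trade-offs.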
cs.AI / 26 / 2602.20732
CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference
CHESS:用于长上下文 LLM 推理的上下文感知层次高效语义选择
Abstract
Long-context LLMs demand accurate inference at low latency, yet decoding becomes primarily constrained by KV cache as context grows. Prior pruning methods are largely context-agnostic: their token selection ignores step-wise relevance and local semantics, which undermines quality. Moreover, their irregular accesses and selection overheads yield only limited wall-clock speedups. To address this, we propose \textbf{CHESS}, an \textit{algorithm-system co-design} KV-cache management system. Algorithmically, CHESS introduces a context-aware, hierarchical selection policy that dynamically reconstructs a coherent context for the current decoding. System-wise, coarse granularity selection eliminates expensive data movement, fully realizing practical acceleration from theoretical sparsity. Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only \textbf{1\%} of the KV cache, delivers low-latency stable inference with up to \textbf{4.56$\times$} higher throughput, and consistently outperforms other strong baselines. Code is available at \href{https://anonymous.4open.science/r/CHESS-9958/}{https://anonymous.4open.science/r/CHESS/}.
Chinese Translation
长上下文 LLM 需要在低延迟下进行准确推理,但随着上下文的增长,解码主要受到 KV 缓存的限制。先前的剪枝方法在很大程度上是与上下文无关的:它们的令牌选择忽视了逐步相关性和局部语义,这削弱了质量。此外,它们的不规则访问和选择开销仅能带来有限的实际加速。为了解决这个问题,我们提出了 CHESS,一种算法-系统协同设计的 KV 缓存管理系统。在算法层面,CHESS 引入了一种上下文感知的层次选择策略,动态重构当前解码的连贯上下文。在系统层面,粗粒度选择消除了昂贵的数据移动,充分实现了从理论稀疏性到实际加速的转变。广泛的评估表明,CHESS 在仅使用 1% 的 KV 缓存的情况下超越了 Full-KV 的质量,提供了低延迟的稳定推理,吞吐量最高可提高 4.56 倍,并且始终优于其他强基线。代码可在 https://anonymous.4open.science/r/CHESS-9958/ 获取。
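To make the "1% of the KV cache" idea concrete, here is a generic top-k sketch of budget-constrained cache selection. It assumes each cached block already has a relevance score for the current decoding step; the scoring and hierarchy are where CHESS's actual context-aware policy lives, so this is only an illustration of the selection budget, not the paper's algorithm:

```python
def select_kv_blocks(scores, budget_frac=0.01):
    """Keep indices of the highest-scoring KV-cache blocks under a budget.

    scores: one relevance score per cached block (higher = more relevant
    to the current decoding step). Returns kept indices in cache order.
    """
    k = max(1, int(len(scores) * budget_frac))  # keep at least one block
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return sorted(ranked[:k])
```

Selecting at block (coarse) granularity rather than per token is what avoids the irregular memory accesses the abstract criticizes.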
cs.AI / 27 / 2602.20739
PyVision-RL: Forging Open Agentic Vision Models via RL
PyVision-RL:通过强化学习构建开放式智能视觉模型
Abstract
Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.
Chinese Translation
用于智能多模态模型的强化学习常常面临交互崩溃的问题,即模型学习减少工具使用和多轮推理,从而限制了智能行为的优势。我们提出了PyVision-RL,一个用于开放权重多模态模型的强化学习框架,旨在稳定训练并维持交互。我们的方法结合了过采样-过滤-排名的回合策略和累积工具奖励,以防止崩溃并鼓励多轮工具使用。通过统一的训练流程,我们开发了PyVision-Image和PyVision-Video,用于图像和视频理解。在视频推理中,PyVision-Video采用按需上下文构建,在推理过程中选择性地采样与任务相关的帧,从而显著减少视觉标记的使用。实验结果显示出强大的性能和提高的效率,证明了持续交互和按需视觉处理对于可扩展的多模态智能体至关重要。
cs.AI / 28 / 2602.20770
Pipeline for Verifying LLM-Generated Mathematical Solutions
验证大型语言模型生成的数学解题方案的流程
Abstract
With the growing popularity of Large Reasoning Models and their results in solving mathematical problems, it becomes crucial to measure their capabilities. We introduce a pipeline for both automatic and interactive verification as a more accurate alternative to only checking the answer, which is currently the most popular approach for benchmarks. The pipeline can also be used as a generator of correct solutions in both formal and informal languages. Three AI agents, which can be chosen for the benchmark accordingly, are included in the structure. The key idea is the use of prompts to obtain the solution in a specific form which allows for easier verification using proof assistants and the possible use of small models ($\le 8B$). Experiments on several datasets suggest a low probability of false positives. The open-source implementation with instructions on setting up a server is available at https://github.com/LogicEnj/lean4_verification_pipeline.
Chinese Translation
随着大型推理模型的日益普及及其在解决数学问题方面的成果,评估其能力变得至关重要。我们提出了一种流程,既可以进行自动验证,也可以进行交互式验证,这是一种比仅检查答案更为准确的替代方案,而后者目前是基准测试中最流行的方法。该流程还可以作为正确解答的生成器,适用于正式和非正式语言。结构中包含了3个可以根据基准测试选择的AI代理。关键思想是使用提示以特定形式获得解答,这样可以更容易地使用证明助手进行验证,并可能使用小型模型(≤ 8B)。在多个数据集上的实验表明,假阳性的概率较低。开源实现及服务器设置说明可在 https://github.com/LogicEnj/lean4_verification_pipeline 获取。
cs.AI / 29 / 2602.20810
POMDPPlanners: Open-Source Package for POMDP Planning
POMDPPlanners:用于POMDP规划的开源软件包
Abstract
We present POMDPPlanners, an open-source Python package for empirical evaluation of Partially Observable Markov Decision Process (POMDP) planning algorithms. The package integrates state-of-the-art planning algorithms, a suite of benchmark environments with safety-critical variants, automated hyperparameter optimization via Optuna, persistent caching with failure recovery, and configurable parallel simulation -- reducing the overhead of extensive simulation studies. POMDPPlanners is designed to enable scalable, reproducible research on decision-making under uncertainty, with particular emphasis on risk-sensitive settings where standard toolkits fall short.
Chinese Translation
我们介绍了POMDPPlanners,这是一个用于部分可观察马尔可夫决策过程(POMDP)规划算法的开源Python软件包。该软件包集成了最先进的规划算法、一套具有安全关键变体的基准环境、通过Optuna进行的自动超参数优化、具有故障恢复的持久缓存,以及可配置的并行仿真——减少了广泛仿真研究的开销。POMDPPlanners旨在支持在不确定性下的决策研究的可扩展性和可重复性,特别强调在标准工具包不足的风险敏感环境中的应用。
cs.AI / 30 / 2602.20812
Qwen-BIM: developing large language model for BIM-based design with domain-specific benchmark and dataset
Qwen-BIM:基于BIM设计的大型语言模型的开发及其领域特定基准和数据集
Abstract
As the construction industry advances toward digital transformation, BIM (Building Information Modeling)-based design has become a key driver supporting intelligent construction. Although Large Language Models (LLMs) have shown potential in promoting BIM-based design, the lack of specific datasets and LLM evaluation benchmarks has significantly hindered their performance. Therefore, this paper addresses this gap by proposing: 1) an evaluation benchmark for BIM-based design together with corresponding quantitative indicators to evaluate the performance of LLMs, 2) a method for generating textual data from BIM and constructing corresponding BIM-derived datasets for LLM evaluation and fine-tuning, and 3) a fine-tuning strategy to adapt LLMs for BIM-based design. Results demonstrate that the proposed domain-specific benchmark effectively and comprehensively assesses LLM capabilities, highlighting that general LLMs are still incompetent for domain-specific tasks. Meanwhile, with the proposed benchmark and datasets, Qwen-BIM is developed and achieves a 21.0% average increase in G-Eval score compared to the base LLM model. Notably, with only 14B parameters, the performance of Qwen-BIM is comparable to that of general LLMs with 671B parameters for BIM-based design tasks. Overall, this study develops the first domain-specific LLM for BIM-based design by introducing a comprehensive benchmark and high-quality dataset, which provide a solid foundation for developing BIM-related LLMs in various fields.
Chinese Translation
随着建筑行业向数字化转型的推进,基于建筑信息模型(BIM)的设计已成为支持智能建筑的关键驱动力。尽管大型语言模型(LLMs)在促进基于BIM的设计方面展现出潜力,但缺乏特定的数据集和LLM评估基准显著阻碍了LLMs的性能。因此,本文通过提出以下内容来填补这一空白:1)基于BIM的设计评估基准及相应的定量指标,以评估LLMs的性能;2)从BIM生成文本数据的方法,并构建相应的BIM衍生数据集用于LLM评估和微调;3)适应LLMs于基于BIM设计的微调策略。结果表明,所提出的领域特定基准能够有效且全面地评估LLM的能力,突显出通用LLMs在领域特定任务中的不足。同时,借助所提出的基准和数据集,开发了Qwen-BIM,并在G-Eval评分上实现了比基础LLM模型高出21.0%的平均提升。值得注意的是,Qwen-BIM仅有140亿参数,其在基于BIM的设计任务中的表现可与拥有6710亿参数的通用LLMs相媲美。总体而言,本研究通过引入全面的基准和高质量的数据集,开发了首个针对基于BIM设计的领域特定LLM,为在各个领域开发与BIM相关的LLMs奠定了坚实基础。
cs.AI / 31 / 2602.20813
Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
压力揭示特征:深度行为一致性评估
Abstract
Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios remain lacking. We introduce an alignment benchmark spanning 904 scenarios across six categories -- Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming -- validated as realistic by human raters. Our scenarios place models under conflicting instructions, simulated tool access, and multi-turn escalation to reveal behavioural tendencies that single-turn evaluations miss. Evaluating 24 frontier models using LLM judges validated against human annotations, we find that even top-performing models exhibit gaps in specific categories, while the majority of models show consistent weaknesses across the board. Factor analysis reveals that alignment behaves as a unified construct (analogous to the g-factor in cognitive research) with models scoring high on one category tending to score high on others. We publicly release the benchmark and an interactive leaderboard to support ongoing evaluation, with plans to expand scenarios in areas where we observe persistent weaknesses and to add new models as they are released.
Chinese Translation
评估语言模型的一致性需要测试它们在现实压力下的行为,而不仅仅是它们声称会做什么。尽管一致性失败日益导致现实世界的危害,但缺乏具有现实多轮场景的综合评估框架。我们引入了一个涵盖六个类别(诚实、安全、非操控性、鲁棒性、可纠正性和阴谋)的904个场景的一致性基准,这些场景经过人类评审者验证为现实的。我们的场景将模型置于相互矛盾的指令、模拟工具访问和多轮升级的情境中,以揭示单轮评估所忽视的行为倾向。通过使用与人类注释验证的 LLM(大型语言模型)评审者对24个前沿模型进行评估,我们发现即使是表现最好的模型在特定类别中也存在差距,而大多数模型在各个方面显示出一致的弱点。因子分析表明,一致性作为一个统一的构造(类似于认知研究中的 g 因子)表现出来,模型在一个类别中得分高往往也倾向于在其他类别中得分高。我们公开发布了这一基准和一个互动排行榜,以支持持续评估,并计划在观察到持续弱点的领域扩展场景,并在新模型发布时添加它们。
cs.AI / 32 / 2602.20878
Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs
通过结构化相关图诊断视觉语言模型中的因果推理
Abstract
Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess the correctness of the answers, making it unclear whether failures arise from limited reasoning capability or from misidentifying causally relevant information. We introduce Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation that explicitly encodes causally relevant objects, attributes, relations, and scene-grounded assumptions. Building on this representation, we present ViLCaR, a diagnostic benchmark comprising tasks for Causal Attribution, Causal Inference, and Question Answering, along with graph-aligned evaluation metrics that assess relevance identification beyond final answer accuracy. Experiments on state-of-the-art LVLMs show that injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning. These findings suggest that current limitations in LVLM causal reasoning stem primarily from insufficient structural guidance rather than a lack of reasoning capacity.
Chinese Translation
大型视觉语言模型(LVLMs)在视觉问答基准测试中表现出色,但往往依赖于虚假的相关性而非真正的因果推理。现有评估主要考察答案的正确性,使人难以判断失败是源于推理能力有限,还是源于错误识别因果相关信息。我们引入了视觉语言因果图(VLCGs),这是一种结构化的、基于查询的表示,明确编码因果相关的对象、属性、关系和场景基础假设。在此基础上,我们提出了ViLCaR,这是一个诊断基准,包括因果归因、因果推理和问答任务,以及超越最终答案准确性、评估相关性识别的图对齐评估指标。对最先进的LVLMs的实验表明,与零样本(zero-shot)和标准上下文学习相比,注入结构化相关信息显著提高了归因和推理的一致性。这些发现表明,当前LVLM因果推理的局限性主要源于缺乏足够的结构指导,而非缺乏推理能力。
cs.AI / 33 / 2602.20918
Predicting Sentence Acceptability Judgments in Multimodal Contexts
在多模态语境中预测句子可接受性判断
Abstract
Previous work has examined the capacity of deep neural networks (DNNs), particularly transformers, to predict human sentence acceptability judgments, both independently of context, and in document contexts. We consider the effect of prior exposure to visual images (i.e., visual context) on these judgments for humans and large language models (LLMs). Our results suggest that, in contrast to textual context, visual images appear to have little if any impact on human acceptability ratings. However, LLMs display the compression effect seen in previous work on human judgments in document contexts. Different sorts of LLMs are able to predict human acceptability judgments to a high degree of accuracy, but in general, their performance is slightly better when visual contexts are removed. Moreover, the distribution of LLM judgments varies among models, with Qwen resembling human patterns, and others diverging from them. LLM-generated predictions on sentence acceptability are highly correlated with their normalised log probabilities in general. However, the correlations decrease when visual contexts are present, suggesting that a higher gap exists between the internal representations of LLMs and their generated predictions in the presence of visual contexts. Our experimental work suggests interesting points of similarity and of difference between human and LLM processing of sentences in multimodal contexts.
Chinese Translation
以往的研究考察了深度神经网络(DNNs),特别是变换器(transformers),在独立于上下文和文档上下文中预测人类句子可接受性判断的能力。我们考虑了先前接触视觉图像(即视觉上下文)对人类和大型语言模型(LLMs)这些判断的影响。我们的结果表明,与文本上下文相比,视觉图像似乎对人类的可接受性评分几乎没有影响。然而,LLMs在文档上下文中显示出与以往研究中人类判断相似的压缩效应。不同类型的LLMs能够以较高的准确度预测人类的可接受性判断,但一般而言,当去除视觉上下文时,它们的表现略有提升。此外,LLM的判断分布在不同模型之间存在差异,其中Qwen与人类模式相似,而其他模型则与之偏离。LLM生成的句子可接受性预测与其标准化对数概率通常高度相关。然而,当存在视觉上下文时,这种相关性下降,表明在视觉上下文存在时,LLM的内部表征与其生成预测之间存在更大的差距。我们的实验工作揭示了人类与LLM在多模态语境中处理句子时的相似性与差异性。
cs.AI / 34 / 2602.20926
HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG
HELP:超节点扩展和逻辑路径引导的证据定位用于准确高效的图形RAG
Abstract
Large Language Models (LLMs) often struggle with inherent knowledge boundaries and hallucinations, limiting their reliability in knowledge-intensive tasks. While Retrieval-Augmented Generation (RAG) mitigates these issues, it frequently overlooks structural interdependencies essential for multi-hop reasoning. Graph-based RAG approaches attempt to bridge this gap, yet they typically face trade-offs between accuracy and efficiency due to challenges such as costly graph traversals and semantic noise in LLM-generated summaries. In this paper, we propose HyperNode Expansion and Logical Path-Guided Evidence Localization strategies for GraphRAG (HELP), a novel framework designed to balance accuracy with practical efficiency through two core strategies: 1) HyperNode Expansion, which iteratively chains knowledge triplets into coherent reasoning paths abstracted as HyperNodes to capture complex structural dependencies and ensure retrieval accuracy; and 2) Logical Path-Guided Evidence Localization, which leverages precomputed graph-text correlations to map these paths directly to the corpus for superior efficiency. HELP avoids expensive random walks and semantic distortion, preserving knowledge integrity while drastically reducing retrieval latency. Extensive experiments demonstrate that HELP achieves competitive performance across multiple simple and multi-hop QA benchmarks and up to a 28.8$\times$ speedup over leading Graph-based RAG baselines.
Chinese Translation
大型语言模型(LLMs)常常面临固有的知识边界和幻觉问题,限制了其在知识密集型任务中的可靠性。尽管检索增强生成(RAG)在一定程度上缓解了这些问题,但它常常忽视多跳推理所需的结构性相互依赖关系。基于图的RAG方法试图弥补这一差距,但由于图遍历成本高和LLM生成摘要中的语义噪声等挑战,通常在准确性和效率之间存在权衡。在本文中,我们提出了超节点扩展和逻辑路径引导的证据定位策略(HELP),这是一个新颖的框架,旨在通过两个核心策略平衡准确性与实际效率:1)超节点扩展,通过将知识三元组迭代链接成抽象为超节点的连贯推理路径,以捕捉复杂的结构依赖关系并确保检索准确性;2)逻辑路径引导的证据定位,利用预计算的图-文本相关性将这些路径直接映射到语料库中,以实现更高的效率。HELP避免了昂贵的随机游走和语义扭曲,保持了知识的完整性,同时大幅降低了检索延迟。大量实验表明,HELP在多个简单和多跳问答基准测试中表现出竞争力,并在领先的基于图的RAG基线中实现了高达28.8倍的加速。
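A toy illustration of chaining knowledge triplets into reasoning paths, loosely in the spirit of HyperNode Expansion described above. The greedy breadth-first chaining here is a hypothetical stand-in, not HELP's actual expansion or abstraction mechanism:

```python
def chain_triplets(triplets, start, max_depth=3):
    # Chain (head, relation, tail) triplets into paths starting from `start`,
    # extending each path by following triplets whose head matches its tail.
    # Cycles are avoided by never revisiting an entity within one path.
    paths = [[start]]
    for _ in range(max_depth):
        extended = []
        for path in paths:
            tail = path[-1]
            for head, rel, nxt in triplets:
                if head == tail and nxt not in path:
                    extended.append(path + [rel, nxt])
        if not extended:  # no path could grow further
            break
        paths = extended
    return paths
```

Each completed path (e.g., `["A", "r1", "B", "r2", "C"]`) plays the role of one coherent multi-hop reasoning chain that could be abstracted into a single retrievable unit.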
cs.AI / 35 / 2602.20934
Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence
构建AgentOS:从令牌级上下文到涌现的系统级智能
Abstract
The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems. While current research primarily focuses on scaling context windows or optimizing prompt engineering, the theoretical bridge between micro-scale token processing and macro-scale systemic intelligence remains fragmented. This paper proposes AgentOS, a holistic conceptual framework that redefines the LLM as a "Reasoning Kernel" governed by structured operating-system logic. Central to this architecture is Deep Context Management, which conceptualizes the context window as an Addressable Semantic Space rather than a passive buffer. We systematically deconstruct the transition from discrete sequences to coherent cognitive states, introducing mechanisms for Semantic Slicing and Temporal Alignment to mitigate cognitive drift in multi-agent orchestration. By mapping classical OS abstractions such as memory paging, interrupt handling, and process scheduling onto LLM-native constructs, this review provides a rigorous roadmap for architecting resilient, scalable, and self-evolving cognitive environments. Our analysis asserts that the next frontier of AGI development lies in the architectural efficiency of system-level coordination.
Chinese Translation
大型语言模型的范式正经历从静态推理引擎到动态自主认知系统的根本转变。尽管当前研究主要集中在扩展上下文窗口或优化提示工程,但微观层面令牌处理与宏观层面系统智能之间的理论桥梁仍然支离破碎。本文提出了AgentOS,一个整体概念框架,重新定义了大型语言模型(LLM)为一个由结构化操作系统逻辑驱动的“推理内核”。该架构的核心是深度上下文管理,它将上下文窗口概念化为一个可寻址的语义空间,而不是一个被动的缓冲区。我们系统性地解构了从离散序列到连贯认知状态的过渡,引入了语义切片和时间对齐机制,以减轻多智能体编排中的认知漂移。通过将经典操作系统抽象(如内存分页、中断处理和进程调度)映射到LLM本地构造上,本文提供了一条严谨的路线图,用于构建具有弹性、可扩展和自我演化的认知环境。我们的分析表明,通用人工智能(AGI)发展的下一个前沿在于系统级协调的架构效率。
cs.AI / 36 / 2602.21044
LogicGraph : Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification
LogicGraph:通过神经符号生成和验证进行多路径逻辑推理的基准测试
Abstract
Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof. However, many real-world reasoning problems admit multiple valid derivations, requiring models to explore diverse logical paths rather than committing to one route. To address this limitation, we introduce LogicGraph, the first benchmark aimed to systematically evaluate multi-path logical reasoning, constructed via a neuro-symbolic framework that leverages backward logic generation and semantic instantiation. This pipeline yields solver-verified reasoning problems formalized by high-depth multi-path reasoning and inherent logical distractions, where each instance is associated with an exhaustive set of minimal proofs. We further propose a reference-free evaluation framework to rigorously assess model performance in both convergent and divergent regimes. Experiments on state-of-the-art language models reveal a common limitation: models tend to commit early to a single route and fail to explore alternatives, and the coverage gap grows substantially with reasoning depth. LogicGraph exposes this divergence gap and provides actionable insights to motivate future improvements. Our code and data will be released at https://github.com/kkkkarry/LogicGraph.
Chinese Translation
大型语言模型(LLMs)的评估主要强调收敛逻辑推理,其中成功的定义是产生单一正确的证明。然而,许多现实世界的推理问题允许多个有效的推导,要求模型探索多样的逻辑路径,而不是仅仅坚持一条路线。为了解决这一局限性,我们提出了LogicGraph,这是第一个旨在系统评估多路径逻辑推理的基准,构建于一个利用反向逻辑生成和语义实例化的神经符号框架之上。该流程生成了经过求解器验证的推理问题,这些问题由高深度的多路径推理和固有的逻辑干扰形式化,每个实例都与一组详尽的最小证明相关联。我们进一步提出了一种无参考的评估框架,以严格评估模型在收敛和发散状态下的表现。对最先进语言模型的实验揭示了一个共同的局限性:模型往往过早地坚持于单一路径,未能探索替代方案,并且随着推理深度的增加,覆盖差距显著扩大。LogicGraph揭示了这一发散差距,并提供了可操作的见解,以激励未来的改进。我们的代码和数据将发布在 https://github.com/kkkkarry/LogicGraph。
cs.AI / 37 / 2602.21061
Tool Building as a Path to "Superintelligence"
工具构建作为通向“超智能”的路径
Abstract
The Diligent Learner framework suggests LLMs can achieve superintelligence via test-time search, provided a sufficient step-success probability $\gamma$. In this work, we design a benchmark to measure $\gamma$ on logical out-of-distribution inference. We construct a class of tasks involving GF(2) circuit reconstruction that grow more difficult with each reasoning step, and that are, from an information-theoretic standpoint, impossible to reliably solve unless the LLM carefully integrates all of the information provided. Our analysis demonstrates that while the $\gamma$ value for small LLMs declines superlinearly as depth increases, frontier models exhibit partial robustness on this task. Furthermore, we find that successful reasoning at scale is contingent upon precise tool calls, identifying tool design as a critical capability for LLMs to achieve general superintelligence through the Diligent Learner framework.
Chinese Translation
勤奋学习者(Diligent Learner)框架表明,LLMs(大规模语言模型)可以通过测试时搜索实现超智能,前提是具有足够的步骤成功概率 $\gamma$。在本研究中,我们设计了一个基准来测量逻辑分布外推理中的 $\gamma$。我们构建了一类涉及 GF(2) 电路重构的任务,这些任务随着每个推理步骤的增加而变得更加困难,并且从信息论的角度来看,除非 LLM 认真整合所有提供的信息,否则无法可靠地解决。我们的分析表明,尽管小型 LLM 的 $\gamma$ 值随着深度的增加而超线性下降,但前沿模型在此任务上表现出部分鲁棒性。此外,我们发现成功的大规模推理依赖于精确的工具调用,并将工具设计识别为 LLM 通过勤奋学习者框架实现通用超智能的关键能力。
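Two minimal sketches of the quantities the abstract refers to: the idealized compounding of a per-step success probability gamma over reasoning depth (assuming independent steps, which is a simplification), and the evaluation of a linear GF(2) circuit, where each output bit is a parity of selected input bits. The benchmark's actual circuit-reconstruction tasks are more involved than this:

```python
def chain_success_probability(gamma, depth):
    # Under the idealized assumption that each reasoning step succeeds
    # independently with probability gamma, a depth-d chain succeeds
    # with probability gamma ** d, so success decays with depth.
    return gamma ** depth

def eval_gf2_circuit(rows, inputs):
    # Linear GF(2) "circuit": each output bit is the XOR (parity) of the
    # input bits selected by one 0/1 row of the circuit matrix.
    return [sum(a & x for a, x in zip(row, inputs)) % 2 for row in rows]
```

Reconstructing the circuit means recovering `rows` from observed input/output pairs, which is only possible if all provided observations are integrated consistently.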
cs.AI / 38 / 2602.21064
Motivation is Something You Need
动机是你所需要的
Abstract
This work introduces a novel training paradigm that draws from affective neuroscience. Inspired by the interplay of emotions and cognition in the human brain and more specifically the SEEKING motivational state, we design a dual-model framework where a smaller base model is trained continuously, while a larger motivated model is activated intermittently during predefined "motivation conditions". The framework mimics the emotional state of high curiosity and anticipation of reward in which broader brain regions are recruited to enhance cognitive performance. Exploiting scalable architectures where larger models extend smaller ones, our method enables shared weight updates and selective expansion of network capacity during noteworthy training steps. Empirical evaluation on the image classification task demonstrates that, not only does the alternating training scheme efficiently and effectively enhance the base model compared to a traditional scheme, in some cases, the motivational model also surpasses its standalone counterpart despite seeing less data per epoch. This opens the possibility of simultaneously training two models tailored to different deployment constraints with competitive or superior performance while keeping training cost lower than when training the larger model.
Chinese Translation
本研究介绍了一种新的训练范式,该范式源自情感神经科学。受人脑中情感与认知相互作用的启发,特别是受到SEEKING(寻求)动机状态的影响,我们设计了一个双模型框架,其中较小的基础模型持续训练,而较大的动机模型在预定义的“动机条件”下间歇性激活。该框架模拟了高好奇心和对奖励的期待的情感状态,在这种状态下,更多的脑区被招募以增强认知表现。利用可扩展的架构,使得较大的模型能够扩展较小的模型,我们的方法实现了共享权重更新和在显著训练步骤中选择性扩展网络容量。在图像分类任务上的实证评估表明,与传统方案相比,交替训练方案不仅高效且有效地增强了基础模型,在某些情况下,动机模型尽管每个周期看到的数据较少,仍然超越了其独立模型。这为同时训练两个针对不同部署约束的模型提供了可能性,且在保持训练成本低于训练较大模型的情况下,能够实现竞争或更优的性能。
cs.AI / 39 / 2602.21066
The Initial Exploration Problem in Knowledge Graph Exploration
知识图谱探索中的初始探索问题
Abstract
Knowledge Graphs (KGs) enable the integration and representation of complex information across domains, but their semantic richness and structural complexity create substantial barriers for lay users without expertise in semantic web technologies. When encountering an unfamiliar KG, such users face a distinct orientation challenge: they do not know what questions are possible, how the knowledge is structured, or how to begin exploration. This paper identifies and theorises this phenomenon as the Initial Exploration Problem (IEP). Drawing on theories from information behaviour and human-computer interaction, including ASK, exploratory search, information foraging, and cognitive load theory, we develop a conceptual framing of the IEP characterised by three interdependent barriers: scope uncertainty, ontology opacity, and query incapacity. We argue that these barriers converge at the moment of first contact, distinguishing the IEP from related concepts that presuppose an existing starting point or information goal. Analysing KG exploration interfaces at the level of interaction primitives, we suggest that many systems rely on epistemic assumptions that do not hold at first contact. This reveals a structural gap in the design space: the absence of interaction primitives for scope revelation, mechanisms that communicate what a KG contains without requiring users to formulate queries or interpret ontological structures. In articulating the IEP, this paper provides a theoretical lens for evaluating KG interfaces and for designing entry-point scaffolding that supports initial exploration.
Chinese Translation
知识图谱(KGs)能够跨领域整合和表示复杂信息,但其语义丰富性和结构复杂性为缺乏语义网技术专业知识的普通用户带来了实质性障碍。当遇到一个不熟悉的知识图谱时,这些用户面临着独特的定向挑战:他们不知道可以提出什么问题,知识是如何结构化的,或者如何开始探索。本文识别并理论化了这一现象,称之为初始探索问题(IEP)。基于信息行为和人机交互的理论,包括ASK、探索性搜索、信息觅食和认知负荷理论,我们发展了一个以三种相互依赖的障碍为特征的IEP概念框架:范围不确定性、本体不透明性和查询无能。我们认为,这些障碍在首次接触时汇聚,使得IEP与假设已有起点或信息目标的相关概念有所区别。通过分析知识图谱探索界面的交互原语层面,我们建议许多系统依赖于在首次接触时并不成立的认知假设。这揭示了设计空间中的结构性缺口:缺乏用于范围揭示的交互原语,即在不要求用户制定查询或解释本体结构的情况下,传达知识图谱所包含内容的机制。通过阐明IEP,本文为评估知识图谱界面和设计支持初始探索的入口支架提供了理论视角。
cs.AI / 40 / 2602.21143
A Benchmark for Deep Information Synthesis
深度信息综合基准
Abstract
Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval. To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights. DEEPSYNTH contains 120 tasks collected across 7 domains and data sources covering 67 countries. DEEPSYNTH is constructed using a multi-stage data collection pipeline that requires annotators to collect official data sources, create hypotheses, perform manual analysis, and design tasks with verifiable answers. When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 score of 8.97 and 17.5 on the LLM-judge metric, underscoring the difficulty of the benchmark. Our analysis reveals that current agents struggle with hallucinations and reasoning over large information spaces, highlighting DEEPSYNTH as a crucial benchmark for guiding future research.
Chinese Translation
基于大型语言模型(LLM)的智能体越来越多地被用于解决涉及工具使用的复杂任务,例如网页浏览、代码执行和数据分析。然而,目前的评估基准并未充分评估它们解决需要从多个来源综合信息、并推断超越简单事实检索的洞察的现实任务的能力。为了解决这一问题,我们引入了DEEPSYNTH,一个新颖的基准,旨在评估智能体解决现实的、耗时的问题的能力,这类问题结合了信息收集、综合和结构化推理以产生洞察。DEEPSYNTH包含120个任务,涵盖7个领域和覆盖67个国家的数据源。DEEPSYNTH的构建采用了多阶段数据收集流程,要求注释员收集官方数据源、创建假设、进行手动分析,并设计具有可验证答案的任务。在DEEPSYNTH上的评估中,11个最先进的LLM和深度研究智能体取得的最高F1分数仅为8.97,在LLM-judge指标上最高为17.5,突显了该基准的难度。我们的分析显示,当前的智能体在幻觉和对大信息空间的推理方面存在困难,这使得DEEPSYNTH成为指导未来研究的重要基准。
cs.AI / 41 / 2602.21154
CG-DMER: Hybrid Contrastive-Generative Framework for Disentangled Multimodal ECG Representation Learning
CG-DMER:用于解耦多模态心电图表示学习的混合对比生成框架
Abstract
Accurate interpretation of electrocardiogram (ECG) signals is crucial for diagnosing cardiovascular diseases. Recent multimodal approaches that integrate ECGs with accompanying clinical reports show strong potential, but they still face two main concerns from a modality perspective: (1) intra-modality: existing models process ECGs in a lead-agnostic manner, overlooking spatial-temporal dependencies across leads, which restricts their effectiveness in modeling fine-grained diagnostic patterns; (2) inter-modality: existing methods directly align ECG signals with clinical reports, introducing modality-specific biases due to the free-text nature of the reports. In light of these two issues, we propose CG-DMER, a contrastive-generative framework for disentangled multimodal ECG representation learning, powered by two key designs: (1) Spatial-temporal masked modeling is designed to better capture fine-grained temporal dynamics and inter-lead spatial dependencies by applying masking across both spatial and temporal dimensions and reconstructing the missing information. (2) A representation disentanglement and alignment strategy is designed to mitigate unnecessary noise and modality-specific biases by introducing modality-specific and modality-shared encoders, ensuring a clearer separation between modality-invariant and modality-specific representations. Experiments on three public datasets demonstrate that CG-DMER achieves state-of-the-art performance across diverse downstream tasks.
Chinese Translation
准确解读心电图(ECG)信号对于诊断心血管疾病至关重要。最近的多模态方法将心电图与相关临床报告结合,展现出强大的潜力,但从模态的角度来看,仍面临两个主要问题:(1)模态内部:现有模型以无导联的方式处理心电图,忽视了导联之间的时空依赖性,这限制了其在建模细粒度诊断模式方面的有效性;(2)模态之间:现有方法直接将心电图信号与临床报告对齐,由于报告的自由文本特性,引入了模态特定的偏差。针对这两个问题,我们提出了CG-DMER,一种用于解耦多模态心电图表示学习的对比生成框架,依赖于两个关键设计:(1)空间-时间掩蔽建模旨在通过在空间和时间维度上应用掩蔽并重建缺失信息,更好地捕捉细粒度的时间动态和导联间的空间依赖性;(2)表示解耦和对齐策略旨在通过引入模态特定和模态共享的编码器,减轻不必要的噪声和模态特定的偏差,确保模态不变表示与模态特定表示之间的更清晰分离。在三个公共数据集上的实验表明,CG-DMER在多种下游任务中实现了最先进的性能。
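A minimal sketch of spatial-temporal masking on multi-lead signals, as motivated by the first design above: whole leads are masked (spatial) along with a contiguous time span across all leads (temporal), and the model would reconstruct the missing values. The ratios, the use of `None` as a mask marker, and the masking layout are illustrative assumptions, not CG-DMER's exact design:

```python
import random

def spatiotemporal_mask(signal, lead_ratio=0.25, time_ratio=0.25, seed=0):
    # signal: list of leads, each a list of samples (e.g., 12 ECG leads).
    # Returns a masked copy; None marks positions to be reconstructed.
    rng = random.Random(seed)
    n_leads, n_time = len(signal), len(signal[0])
    n_masked_leads = max(1, int(n_leads * lead_ratio))
    masked_leads = set(rng.sample(range(n_leads), n_masked_leads))
    span = max(1, int(n_time * time_ratio))
    t0 = rng.randrange(n_time - span + 1)  # contiguous temporal window
    out = [list(lead) for lead in signal]  # copy; input stays untouched
    for i in range(n_leads):
        for t in range(n_time):
            if i in masked_leads or t0 <= t < t0 + span:
                out[i][t] = None
    return out
```

Masking along both axes forces the encoder to use inter-lead (spatial) and within-lead (temporal) dependencies to fill the gaps, which is the stated goal of the design.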
cs.AI / 42 / 2602.21172
NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
NoRD:一种无需推理的数据高效视觉-语言-动作模型
Abstract
Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with NoRD (\textbf{No} \textbf{R}easoning for \textbf{D}riving). Compared to existing VLAs, NoRD achieves competitive performance while being fine-tuned on $<$60\% of the data and no reasoning annotations, resulting in 3$\times$ fewer tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NoRD overcomes this by incorporating Dr.~GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NoRD achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems.
Chinese Translation
视觉-语言-动作(VLA)模型通过用统一的端到端架构替代模块化管道,推动了自动驾驶的发展。然而,当前的VLA面临两个昂贵的要求:(1)大规模数据集的收集,以及(2)密集的推理注释。在本研究中,我们通过 NoRD(No Reasoning for Driving)来解决这两个挑战。与现有的VLA相比,NoRD在少于60%的数据和没有推理注释的情况下实现了竞争性的性能,并将标记数量减少了3倍。我们发现,标准的群体相对策略优化(GRPO)在应用于在如此小的、无推理的数据集上训练的策略时,未能产生显著的改进。我们表明,这一限制源于难度偏差,它对在GRPO中产生高方差采样轨迹(rollouts)的场景的奖励信号施加了不成比例的惩罚。NoRD通过引入Dr. GRPO(一种旨在减轻大型语言模型(LLMs)中难度偏差的最新算法)克服了这一问题。因此,NoRD在Waymo和NAVSIM上以少量的训练数据和无推理开销实现了竞争性的性能,使自动驾驶系统更加高效。
cs.AI / 43 / 2602.21201
Aletheia tackles FirstProof autonomously
Aletheia自主应对FirstProof挑战
Abstract
We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the challenge, Aletheia autonomously solved 6 problems (2, 5, 7, 8, 9, 10) out of 10 according to majority expert assessments; we note that experts were not unanimous on Problem 8 (only). For full transparency, we explain our interpretation of FirstProof and disclose details about our experiments as well as our evaluation. Raw prompts and outputs are available at https://github.com/google-deepmind/superhuman/tree/main/aletheia.
Chinese Translation
我们报告了Aletheia(Feng et al., 2026b)的表现,该数学研究代理由Gemini 3 Deep Think驱动,参与了首届FirstProof挑战。在挑战允许的时间内,根据多数专家的评估,Aletheia自主解决了10个问题中的6个(2, 5, 7, 8, 9, 10);我们注意到,专家仅在问题8这一题上未达成一致。为了确保完全透明,我们解释了对FirstProof的理解,并披露了实验及评估的细节。原始提示和输出可在https://github.com/google-deepmind/superhuman/tree/main/aletheia获取。
cs.CL / 1 / 2602.20162
Talking to Yourself: Defying Forgetting in Large Language Models
自言自语:在大型语言模型中抵御遗忘
Abstract
Catastrophic forgetting remains a major challenge when fine-tuning large language models (LLMs) on narrow, task-specific data, often degrading their general knowledge and reasoning abilities. We propose SA-SFT, a lightweight self-augmentation routine in which an LLM generates self-dialogues prior to fine-tuning, and the resulting self-authored data are mixed with task data without modifying optimization or training schedules. Despite requiring no external data or additional tuning, SA-SFT consistently mitigates catastrophic forgetting while improving in-domain performance. Across 50 evaluation scenarios, it maintains performance comparable to the original model and achieves the best results in 40 cases, outperforming common baselines such as layer freezing and external data mixing. Guided by these empirical findings, we further present a theoretical analysis suggesting that forgetting can partly stem from style-induced parameter drift, and that self-alignment through self-generated data provides an effective means to counteract this effect. Overall, our results indicate that self-augmentation offers a simple and effective mechanism for robust LLM adaptation without incurring catastrophic forgetting.
Chinese Translation
灾难性遗忘仍然是对大型语言模型(LLMs)进行细调时面临的主要挑战,尤其是在狭窄的任务特定数据上进行细调时,常常会降低其一般知识和推理能力。我们提出了SA-SFT,一种轻量级自我增强例程,其中LLM在细调之前生成自我对话,生成的自创数据与任务数据混合,而不修改优化或训练计划。尽管不需要外部数据或额外的调优,SA-SFT始终有效减轻灾难性遗忘,同时提高了领域内的表现。在50个评估场景中,它保持了与原始模型相当的性能,并在40个案例中取得了最佳结果,超越了常见的基线方法,如层冻结和外部数据混合。基于这些实证发现,我们进一步提出了一种理论分析,表明遗忘部分源于风格引起的参数漂移,而通过自生成数据进行自我对齐提供了一种有效的手段来抵消这一影响。总体而言,我们的结果表明,自我增强为在不导致灾难性遗忘的情况下实现稳健的LLM适应提供了一种简单而有效的机制。
cs.CL / 2 / 2602.20164
Benchmarking Distilled Language Models: Performance and Efficiency in Resource-Constrained Settings
蒸馏语言模型的基准测试:资源受限环境中的性能与效率
Abstract
Knowledge distillation offers a transformative pathway to developing powerful, yet efficient, small language models (SLMs) suitable for resource-constrained environments. In this paper, we benchmark the performance and computational cost of distilled models against their vanilla and proprietary counterparts, providing a quantitative analysis of their efficiency. Our results demonstrate that distillation creates a superior performance-to-compute curve. We find that creating a distilled 8B model is over 2,000 times more compute-efficient than training its vanilla counterpart, while achieving reasoning capabilities on par with, or even exceeding, standard models ten times its size. These findings validate distillation not just as a compression technique, but as a primary strategy for building state-of-the-art, accessible AI.
Chinese Translation
知识蒸馏为开发强大而高效的小型语言模型(SLMs)提供了一条变革性路径,适用于资源受限的环境。在本文中,我们对蒸馏模型的性能和计算成本进行了基准测试,与其原始模型和专有模型进行了比较,提供了其效率的定量分析。我们的结果表明,蒸馏创造了一个优越的性能与计算曲线。我们发现,创建一个蒸馏的8B模型的计算效率比训练其原始模型高出超过2000倍,同时在推理能力上与标准模型相当,甚至超过其十倍大小的模型。这些发现验证了蒸馏不仅是一种压缩技术,更是构建最先进、可访问的人工智能的主要策略。
cs.CL / 3 / 2602.20166
ConceptRM: The Quest to Mitigate Alert Fatigue through Consensus-Based Purity-Driven Data Cleaning for Reflection Modelling
ConceptRM:通过基于共识的纯度驱动数据清洗来减轻警报疲劳的探索
Abstract
In many applications involving intelligent agents, the overwhelming volume of alerts (mostly false) generated by the agents may desensitize users and cause them to overlook critical issues, leading to so-called "alert fatigue". A common strategy is to train a reflection model as a filter that intercepts false alerts, using labelled data collected from user verification feedback. However, a key challenge is the noisy nature of such data as it is often collected in production environments. As cleaning noise via manual annotation incurs high costs, this paper proposes a novel method ConceptRM for constructing a high-quality corpus to train a reflection model capable of effectively intercepting false alerts. With only a small amount of expert annotations as anchors, ConceptRM creates perturbed datasets with varying noise ratios and utilizes co-teaching to train multiple distinct models for collaborative learning. By analyzing the consensus decisions of these models, it effectively identifies reliable negative samples from a noisy dataset. Experimental results demonstrate that ConceptRM significantly enhances the interception of false alerts with minimal annotation cost, outperforming several state-of-the-art LLM baselines by up to 53.31% on in-domain datasets and 41.67% on out-of-domain datasets.
Chinese Translation
在许多涉及智能代理的应用中,代理生成的海量警报(大多数为虚假警报)可能使用户产生麻木感,从而忽视关键问题,导致所谓的“警报疲劳”。一种常见的策略是训练一个反射模型作为过滤器,以拦截通过用户验证反馈收集的标记数据中的虚假警报。然而,一个关键挑战是此类数据的噪声特性,因为它通常是在生产环境中收集的。由于通过人工标注清理噪声的成本高昂,本文提出了一种新方法ConceptRM,用于构建高质量语料库,以训练一个能够有效拦截虚假警报的反射模型。ConceptRM仅需少量专家标注作为锚点,便可创建具有不同噪声比率的扰动数据集,并利用共同教学训练多个不同模型以进行协作学习。通过分析这些模型的一致性决策,它有效地从噪声数据集中识别出可靠的负样本。实验结果表明,ConceptRM在最小标注成本下显著提高了虚假警报的拦截能力,在领域内数据集上比多个最先进的LLM基线提高了高达53.31%,在领域外数据集上提高了41.67%。
cs.CL / 4 / 2602.20294
InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation
InterviewSim:一个可扩展的基于访谈的人格模拟框架
Abstract
Simulating real personalities with large language models requires grounding generation in authentic personal data. Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said. We address this gap with an interview-grounded evaluation framework for personality simulation at a large scale. We extract over 671,000 question-answer pairs from 23,000 verified interview transcripts across 1,000 public personalities, each with an average of 11.5 hours of interview content. We propose a multi-dimensional evaluation framework with four complementary metrics measuring content similarity, factual consistency, personality alignment, and factual knowledge retention. Through systematic comparison, we demonstrate that methods grounded in real interview data substantially outperform those relying solely on biographical profiles or the model's parametric knowledge. We further reveal a trade-off in how interview data is best utilized: retrieval-augmented methods excel at capturing personality style and response quality, while chronological-based methods better preserve factual consistency and knowledge retention. Our evaluation framework enables principled method selection based on application requirements, and our empirical findings provide actionable insights for advancing personality simulation research.
Chinese Translation
使用大型语言模型模拟真实人格需要将生成内容与真实个人数据相结合。现有的评估方法依赖于人口统计调查、人格问卷或短暂的人工智能主导访谈作为代理,但缺乏对个体实际言论的直接评估。我们通过一个基于访谈的评估框架来填补这一空白,旨在大规模的人格模拟。我们从1,000个公共人格的23,000份经过验证的访谈记录中提取了超过671,000个问答对,每个访谈内容平均长达11.5小时。我们提出了一个多维评估框架,包含四个互补指标,用于测量内容相似性、事实一致性、人格对齐和事实知识保留。通过系统比较,我们证明了基于真实访谈数据的方法在性能上显著优于仅依赖传记资料或模型参数知识的方法。我们进一步揭示了访谈数据最佳利用的权衡:增强检索的方法在捕捉人格风格和反应质量方面表现优异,而基于时间顺序的方法则更好地保持事实一致性和知识保留。我们的评估框架使得根据应用需求进行有原则的方法选择成为可能,而我们的实证发现为推动人格模拟研究提供了可行的见解。
cs.CL / 5 / 2602.20300
What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance
什么样的查询是好的查询?测量人类混淆语言特征对大型语言模型性能的影响
Abstract
Large Language Model (LLM) hallucinations are usually treated as defects of the model or its decoding strategy. Drawing on classical linguistics, we argue that a query's form can also shape a listener's (and model's) response. We operationalize this insight by constructing a 22-dimension query feature vector covering clause complexity, lexical rarity, anaphora, negation, answerability, and intention grounding, all known to affect human comprehension. Using 369,837 real-world queries, we ask: Are there certain types of queries that make hallucination more likely? A large-scale analysis reveals a consistent "risk landscape": certain features such as deep clause nesting and underspecification align with higher hallucination propensity. In contrast, clear intention grounding and answerability align with lower hallucination rates. Others, including domain specificity, show mixed, dataset- and model-dependent effects. Thus, these findings establish an empirically observable query-feature representation correlated with hallucination risk, paving the way for guided query rewriting and future intervention studies.
Chinese Translation
大型语言模型(LLM)的幻觉通常被视为模型或其解码策略的缺陷。基于经典语言学,我们认为查询的形式也可以影响听者(以及模型)的响应。我们通过构建一个涵盖从句复杂性、词汇稀有性到指代、否定、可回答性和意图基础等22维查询特征向量来实现这一见解,这些特征都已知会影响人类理解。使用369,837个真实世界的查询,我们提出了一个问题:是否存在某些类型的查询更容易导致幻觉?大规模分析揭示了一个一致的“风险景观”:某些特征,如深层从句嵌套和不充分说明,与更高的幻觉倾向相关。相反,清晰的意图基础和可回答性则与较低的幻觉率相关。其他特征,包括领域特异性,显示出混合的、依赖于数据集和模型的效果。因此,这些发现建立了一个与幻觉风险相关的可观察的查询特征表示,为引导查询重写和未来的干预研究铺平了道路。
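The 22 feature dimensions are not enumerated in the abstract, but several of the named ones reduce to surface-level counts. A minimal illustrative sketch follows; the keyword lists and the three features chosen here are assumptions for illustration, not the authors' definitions:

```python
import re

# Illustrative cue lists (assumptions, not the paper's lexicons).
CLAUSE_MARKERS = {"which", "that", "who", "because", "although", "if"}
NEGATIONS = {"not", "no", "never", "none", "neither", "nor", "n't"}
PRONOUNS = {"it", "they", "this", "that", "these", "those", "he", "she"}

def query_features(query: str) -> dict:
    """Tiny slice of a query feature vector: approximate clause
    complexity, negation, and anaphora with surface token counts."""
    tokens = re.findall(r"[a-z']+", query.lower())
    return {
        "clause_markers": sum(t in CLAUSE_MARKERS for t in tokens),
        "negations": sum(t in NEGATIONS for t in tokens),
        "anaphora": sum(t in PRONOUNS for t in tokens),
    }

feats = query_features("Why did it fail, although the test that never ran passed?")
print(feats)  # {'clause_markers': 2, 'negations': 1, 'anaphora': 2}
```

A full implementation would add lexical-rarity scores (e.g. corpus frequency ranks) and parser-based clause depth rather than keyword counts.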
cs.CL / 6 / 2602.20332
No One Size Fits All: QueryBandits for Hallucination Mitigation
没有一种适合所有的方案:用于幻觉缓解的查询强盗算法(QueryBandits)
Abstract
Advanced reasoning capabilities in Large Language Models (LLMs) have led to more frequent hallucinations; yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing. The dearth of studies focusing on hallucinations in closed-source models is especially concerning, as they constitute the vast majority of models in institutional deployments. We introduce QueryBandits, a model-agnostic contextual bandit framework that adaptively learns online to select the optimal query-rewrite strategy by leveraging an empirically validated and calibrated reward function. Across 16 QA scenarios, our top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a No-Rewrite baseline and outperforms zero-shot static policies (e.g., Paraphrase or Expand) by 42.6% and 60.3%, respectively. Moreover, all contextual bandits outperform vanilla bandits across all datasets, with higher feature variance coinciding with greater variance in arm selection. This substantiates our finding that there is no single rewrite policy optimal for all queries. We also discover that certain static policies incur higher cumulative regret than No-Rewrite, indicating that an inflexible query-rewriting policy can worsen hallucinations. Thus, learning an online policy over semantic features with QueryBandits can shift model behavior purely through forward-pass mechanisms, enabling its use with closed-source models and bypassing the need for retraining or gradient-based adaptation.
Chinese Translation
大型语言模型(LLMs)中先进的推理能力导致了更频繁的幻觉现象;然而,大多数缓解工作集中在开源模型的事后检测和参数编辑上。针对闭源模型幻觉现象的研究匮乏尤其令人担忧,因为闭源模型在机构部署中占据了绝大多数。我们提出了查询强盗算法(QueryBandits),这是一种模型无关的上下文强盗框架,能够通过利用经过实证验证和校准的奖励函数,在线自适应地学习选择最佳的查询重写策略。在16个问答场景中,我们的顶级查询强盗(汤普森采样,Thompson Sampling)在无重写基线(No-Rewrite)上达到了87.5%的胜率,并且分别比零-shot静态策略(例如,改述或扩展)提高了42.6%和60.3%。此外,所有上下文强盗在所有数据集上均优于普通强盗,特征方差的增加与臂选择的方差增大相一致。这证实了我们的发现:没有一种重写策略对所有查询都是最佳的。我们还发现某些静态策略的累积遗憾高于无重写策略,这表明不灵活的查询重写策略可能会加剧幻觉现象。因此,通过查询强盗学习基于语义特征的在线策略可以仅通过前向传播机制改变模型行为,使其能够与闭源模型配合使用,避免了重新训练或基于梯度的适应需求。
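The paper's QueryBandits are contextual; a non-contextual Beta-Bernoulli Thompson sampler over rewrite arms is enough to sketch the select/update loop the abstract describes. The arm names, reward probabilities, and simulation below are illustrative assumptions:

```python
import numpy as np

class ThompsonRewriteBandit:
    """Minimal Beta-Bernoulli Thompson sampling over query-rewrite arms.
    (A sketch only: the paper's bandits condition on query features.)"""
    def __init__(self, arms):
        self.arms = list(arms)
        self.alpha = {a: 1.0 for a in self.arms}  # prior successes + 1
        self.beta = {a: 1.0 for a in self.arms}   # prior failures + 1

    def select(self, rng):
        # Sample a success probability per arm and pick the argmax.
        samples = {a: rng.beta(self.alpha[a], self.beta[a]) for a in self.arms}
        return max(samples, key=samples.get)

    def update(self, arm, reward):
        self.alpha[arm] += reward          # reward in {0, 1};
        self.beta[arm] += 1.0 - reward     # 1 = no hallucination observed

rng = np.random.default_rng(0)
bandit = ThompsonRewriteBandit(["no_rewrite", "paraphrase", "expand"])
true_p = {"no_rewrite": 0.4, "paraphrase": 0.6, "expand": 0.8}  # simulated
for _ in range(2000):
    arm = bandit.select(rng)
    bandit.update(arm, float(rng.random() < true_p[arm]))
pulls = {a: bandit.alpha[a] + bandit.beta[a] - 2 for a in bandit.arms}
print(pulls)  # the best-performing arm dominates pull counts
```

The forward-pass-only property the abstract emphasizes is visible here: the bandit only observes rewards, never model gradients.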
cs.CL / 7 / 2602.20336
Natural Language Processing Models for Robust Document Categorization
用于稳健文档分类的自然语言处理模型
Abstract
This article presents an evaluation of several machine learning methods applied to automated text classification, alongside the design of a demonstrative system for unbalanced document categorization and distribution. The study focuses on balancing classification accuracy with computational efficiency, a key consideration when integrating AI into real world automation pipelines. Three models of varying complexity were examined: a Naive Bayes classifier, a bidirectional LSTM network, and a fine tuned transformer based BERT model. The experiments reveal substantial differences in performance. BERT achieved the highest accuracy, consistently exceeding 99%, but required significantly longer training times and greater computational resources. The BiLSTM model provided a strong compromise, reaching approximately 98.56% accuracy while maintaining moderate training costs and offering robust contextual understanding. Naive Bayes proved to be the fastest to train, on the order of milliseconds, yet delivered the lowest accuracy, averaging around 94.5%. Class imbalance influenced all methods, particularly in the recognition of minority categories. A fully functional demonstrative system was implemented to validate practical applicability, enabling automated routing of technical requests with throughput unattainable through manual processing. The study concludes that BiLSTM offers the most balanced solution for the examined scenario, while also outlining opportunities for future improvements and further exploration of transformer architectures.
Chinese Translation
本文评估了几种应用于自动文本分类的机器学习方法,并设计了一个用于不平衡文档分类和分发的演示系统。研究重点在于平衡分类准确性与计算效率,这是将人工智能集成到现实世界自动化流程中的关键考虑因素。我们考察了三种复杂度不同的模型:朴素贝叶斯分类器、双向长短期记忆网络(BiLSTM)和经过微调的基于变换器的BERT模型。实验结果显示,性能存在显著差异。BERT模型实现了最高的准确率,始终超过99%,但需要显著更长的训练时间和更大的计算资源。BiLSTM模型提供了一个良好的折衷,准确率约为98.56%,同时保持了适中的训练成本,并提供了强大的上下文理解能力。朴素贝叶斯模型训练速度最快,约为毫秒级,但准确率最低,平均约为94.5%。类别不平衡影响了所有方法,特别是在识别少数类别时。为了验证实际应用性,实施了一个功能齐全的演示系统,实现了技术请求的自动路由,其吞吐量无法通过人工处理实现。研究结论指出,BiLSTM为所考察的场景提供了最平衡的解决方案,同时也概述了未来改进和进一步探索变换器架构的机会。
cs.CL / 8 / 2602.20372
How communicatively optimal are exact numeral systems? Once more on lexicon size and morphosyntactic complexity
精确数词系统的交际最优性如何?关于词汇规模和形态句法复杂性的再探讨
Abstract
Recent research argues that exact recursive numeral systems optimize communicative efficiency by balancing a tradeoff between the size of the numeral lexicon and the average morphosyntactic complexity (roughly length in morphemes) of numeral terms. We argue that previous studies have not characterized the data in a fashion that accounts for the degree of complexity languages display. Using data from 52 genetically diverse languages and an annotation scheme distinguishing between predictable and unpredictable allomorphy (formal variation), we show that many of the world's languages are decisively less efficient than one would expect. We discuss the implications of our findings for the study of numeral systems and linguistic evolution more generally.
Chinese Translation
近期研究认为,精确递归数词系统通过在数词词汇的规模与数词术语的平均形态句法复杂性(大致以形态素的长度衡量)之间进行权衡,从而优化了交际效率。我们认为,之前的研究没有以能够反映语言复杂性程度的方式来表征数据。通过使用来自52种基因多样性语言的数据和一种区分可预测与不可预测变体(形式变异)的注释方案,我们展示了世界上许多语言的效率明显低于预期。我们讨论了这些发现对数词系统及语言演化研究的更广泛影响。
cs.CL / 9 / 2602.20379
Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems
面向案例的 LLM 作为评判者的企业级 RAG 系统评估
Abstract
Enterprise Retrieval-Augmented Generation (RAG) assistants operate in multi-turn, case-based workflows such as technical support and IT operations, where evaluation must reflect operational constraints, structured identifiers (e.g., error codes, versions), and resolution workflows. Existing RAG evaluation frameworks are primarily designed for benchmark-style or single-turn settings and often fail to capture enterprise-specific failure modes such as case misidentification, workflow misalignment, and partial resolution across turns. We present a case-aware LLM-as-a-Judge evaluation framework for enterprise multi-turn RAG systems. The framework evaluates each turn using eight operationally grounded metrics that separate retrieval quality, grounding fidelity, answer utility, precision integrity, and case/workflow alignment. A severity-aware scoring protocol reduces score inflation and improves diagnostic clarity across heterogeneous enterprise cases. The system uses deterministic prompting with strict JSON outputs, enabling scalable batch evaluation, regression testing, and production monitoring. Through a comparative study of two instruction-tuned models across short and long workflows, we show that generic proxy metrics provide ambiguous signals, while the proposed framework exposes enterprise-critical tradeoffs that are actionable for system improvement.
Chinese Translation
企业检索增强生成(RAG)助手在多轮、基于案例的工作流程中运行,例如技术支持和 IT 运维,其中评估必须反映操作约束、结构化标识符(例如错误代码、版本)和解决工作流程。现有的 RAG 评估框架主要设计用于基准风格或单轮设置,往往无法捕捉企业特定的失败模式,如案例错误识别、工作流程不对齐和跨轮次的部分解决。我们提出了一种面向案例的 LLM 作为评判者的评估框架,适用于企业多轮 RAG 系统。该框架使用八个基于操作的指标评估每一轮,分别关注检索质量、基础可靠性、答案实用性、精确完整性以及案例/工作流程对齐。一个关注严重性的评分协议减少了评分膨胀,并提高了跨异构企业案例的诊断清晰度。该系统使用确定性提示和严格的 JSON 输出,支持可扩展的批量评估、回归测试和生产监控。通过对两个指令调优模型在短流程和长流程中的比较研究,我们表明,通用代理指标提供了模糊的信号,而所提出的框架揭示了对企业至关重要的权衡,这些权衡对系统改进具有可操作性。
cs.CL / 10 / 2602.20433
Disentangling Geometry, Performance, and Training in Language Models
解构语言模型中的几何、性能与训练
Abstract
Geometric properties of Transformer weights, particularly the unembedding matrix, have been widely useful in language model interpretability research. Yet, their utility for estimating downstream performance remains unclear. In this work, we systematically investigate the relationship between model performance and the unembedding matrix geometry, particularly its effective rank. Our experiments, involving a suite of 108 OLMo-style language models trained under controlled variation, reveal several key findings. While the best-performing models often exhibit a high effective rank, this trend is not universal across tasks and training setups. Contrary to prior work, we find that low effective rank does not cause late-stage performance degradation in small models, but instead co-occurs with it; we find adversarial cases where low-rank models do not exhibit saturation. Moreover, we show that effective rank is strongly influenced by pre-training hyperparameters, such as batch size and weight decay, which in turn affect the model's performance. Lastly, extending our analysis to other geometric metrics and final-layer representations, we find that these metrics are largely aligned, but none can reliably predict downstream performance. Overall, our findings suggest that the model's geometry, as captured by existing metrics, primarily reflects training choices rather than performance.
Chinese Translation
变换器(Transformer)权重的几何特性,特别是去嵌入矩阵(unembedding matrix),在语言模型可解释性研究中具有广泛的应用。然而,它们在估计下游性能方面的效用仍不明确。在本研究中,我们系统地探讨了模型性能与去嵌入矩阵几何之间的关系,特别是其有效秩(effective rank)。我们的实验涉及一组在受控变异下训练的108个OLMo风格语言模型,揭示了几个关键发现。尽管表现最佳的模型通常表现出较高的有效秩,但这一趋势并不普遍适用于所有任务和训练设置。与之前的研究相反,我们发现低有效秩并不会导致小模型在后期的性能下降,而是与之共存;我们发现了一些对抗性案例,其中低秩模型并未表现出饱和。此外,我们还表明,有效秩受到预训练超参数(如批量大小和权重衰减)的强烈影响,这反过来又影响模型的性能。最后,我们将分析扩展到其他几何度量和最终层表示,发现这些度量在很大程度上是一致的,但没有任何一个能够可靠地预测下游性能。总体而言,我们的发现表明,模型的几何特性,如现有度量所捕捉的,主要反映了训练选择而非性能。
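Effective rank is not defined in the abstract; a common entropy-based definition (the exponential of the Shannon entropy of the normalized singular-value distribution, due to Roy and Vetterli) can be sketched as follows, though the authors may use a different variant:

```python
import numpy as np

def effective_rank(W: np.ndarray) -> float:
    """Entropy-based effective rank: exp of the Shannon entropy of
    the singular values normalized to a probability distribution."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                        # drop exact zeros before the log
    return float(np.exp(-(p * np.log(p)).sum()))

# A rank-1 matrix concentrates its spectrum on one value (effective
# rank ~1); an identity matrix spreads it evenly (effective rank n).
rank1 = np.outer(np.arange(1, 5, dtype=float), np.ones(4))
print(round(effective_rank(rank1), 3))     # → 1.0
print(round(effective_rank(np.eye(8)), 3)) # → 8.0
```

Unlike the integer matrix rank, this quantity varies smoothly with the spectrum, which is what makes it usable as a tracking metric across training checkpoints.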
cs.CL / 11 / 2602.20513
From Performance to Purpose: A Sociotechnical Taxonomy for Evaluating Large Language Model Utility
从性能到目的:评估大型语言模型效用的社会技术分类法
Abstract
As large language models (LLMs) continue to improve at completing discrete tasks, they are being integrated into increasingly complex and diverse real-world systems. However, task-level success alone does not establish a model's fit for use in practice. In applied, high-stakes settings, LLM effectiveness is driven by a wider array of sociotechnical determinants that extend beyond conventional performance measures. Although a growing set of metrics capture many of these considerations, they are rarely organized in a way that supports consistent evaluation, leaving no unified taxonomy for assessing and comparing LLM utility across use cases. To address this gap, we introduce the Language Model Utility Taxonomy (LUX), a comprehensive framework that structures utility evaluation across four domains: performance, interaction, operations, and governance. Within each domain, LUX is organized hierarchically into thematically aligned dimensions and components, each grounded in metrics that enable quantitative comparison and alignment of model selection with intended use. In addition, an external dynamic web tool is provided to support exploration of the framework by connecting each component to a repository of relevant metrics (factors) for applied evaluation.
Chinese Translation
随着大型语言模型(LLMs)在完成离散任务方面的不断提升,它们正被整合到越来越复杂和多样化的现实世界系统中。然而,仅仅依靠任务层面的成功并不能确定模型在实际应用中的适用性。在应用性高风险环境中,LLM的有效性受到更广泛的社会技术决定因素的驱动,这些因素超出了传统的性能衡量标准。尽管越来越多的指标捕捉了许多这些考虑因素,但它们很少以支持一致评估的方式进行组织,导致缺乏一个统一的分类法来评估和比较不同用例中的LLM效用。为了解决这一问题,我们提出了语言模型效用分类法(Language Model Utility Taxonomy,LUX),这是一个全面的框架,结构化了在四个领域中的效用评估:性能、交互、操作和治理。在每个领域内,LUX按主题对齐的维度和组件进行层次化组织,每个维度和组件都基于能够进行定量比较的指标,使模型选择与预期用途相一致。此外,还提供了一个外部动态网络工具,以支持通过将每个组件与相关指标(因素)库连接来探索该框架,从而进行应用评估。
cs.CL / 12 / 2602.20528
Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning
停止-思考-自回归:具有潜在扩散规划的语言建模
Abstract
The Stop-Think-AutoRegress Language Diffusion Model (STAR-LDM) integrates latent diffusion planning with autoregressive generation. Unlike conventional autoregressive language models limited to token-by-token decisions, STAR-LDM incorporates a "thinking" phase that pauses generation to refine a semantic plan through diffusion before continuing. This enables global planning in continuous space prior to committing to discrete tokens. Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves >70% win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning. The architecture also allows straightforward control through lightweight classifiers, enabling fine-grained steering of attributes without model retraining while maintaining better fluency-control trade-offs than specialized approaches.
Chinese Translation
停止-思考-自回归语言扩散模型(STAR-LDM)将潜在扩散规划与自回归生成相结合。与传统的自回归语言模型仅限于逐个标记决策不同,STAR-LDM 引入了一个“思考”阶段,在生成过程中暂停,以通过扩散来细化语义计划,然后再继续。这使得在承诺离散标记之前能够在连续空间中进行全局规划。评估结果表明,STAR-LDM 在语言理解基准测试中显著优于同等规模的模型,并在叙事连贯性和常识推理的 LLM-作为-评判者比较中实现了超过 70% 的胜率。该架构还允许通过轻量级分类器进行简单控制,使得在不重新训练模型的情况下,能够对属性进行细致的引导,同时保持比专门方法更好的流畅性控制权衡。
cs.CL / 13 / 2602.20580
Personal Information Parroting in Language Models
语言模型中的个人信息重复
Abstract
Modern language models (LM) are trained on large scrapes of the Web, containing millions of personal information (PI) instances, many of which LMs memorize, increasing privacy risks. In this work, we develop the regexes and rules (R&R) detector suite to detect email addresses, phone numbers, and IP addresses, which outperforms the best regex-based PI detectors. On a manually curated set of 483 instances of PI, we measure memorization: finding that 13.6% are parroted verbatim by the Pythia-6.9b model, i.e., when the model is prompted with the tokens that precede the PI in the original document, greedy decoding generates the entire PI span exactly. We expand this analysis to study models of varying sizes (160M-6.9B) and pretraining time steps (70k-143k iterations) in the Pythia model suite and find that both model size and amount of pretraining are positively correlated with memorization. Even the smallest model, Pythia-160m, parrots 2.7% of the instances exactly. Consequently, we strongly recommend that pretraining datasets be aggressively filtered and anonymized to minimize PI parroting.
Chinese Translation
现代语言模型(LM)在大量网络数据上进行训练,这些数据包含数百万个个人信息(PI)实例,其中许多被语言模型记忆,从而增加了隐私风险。在本研究中,我们开发了正则表达式和规则(R&R)检测器套件,以检测电子邮件地址、电话号码和IP地址,其性能优于最佳的基于正则表达式的个人信息检测器。在一组手动整理的483个个人信息实例中,我们测量了记忆化现象:发现Pythia-6.9b模型逐字重复了13.6%的实例,即当模型以原始文档中PI前面的标记作为提示时,贪婪解码会准确生成整个PI范围。我们将此分析扩展到研究不同大小(160M-6.9B)和预训练时间步(70k-143k迭代)的Pythia模型套件,发现模型大小和预训练量与记忆化呈正相关。即使是最小的模型Pythia-160m,也准确重复了2.7%的实例。因此,我们强烈建议对预训练数据集进行严格过滤和匿名化,以尽量减少个人信息的重复。
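The memorization criterion in the abstract (prompt with the tokens preceding the PI span, decode greedily, check for an exact match) is model-agnostic and can be sketched without a real LM; the toy generator below stands in for Pythia and is purely illustrative:

```python
def is_parroted(prefix_tokens, pi_tokens, generate_fn) -> bool:
    """Verbatim-memorization test: prompt with the tokens that precede
    the PI span in the original document and check whether greedy
    decoding reproduces the span exactly. `generate_fn(prompt, n)`
    must return n greedily decoded tokens."""
    continuation = generate_fn(prefix_tokens, len(pi_tokens))
    return continuation == pi_tokens

# Toy stand-in for a model that has memorized one document verbatim.
memorized = "contact us at alice@example.com for details".split()

def toy_greedy(prompt, n):
    start = len(prompt)
    return memorized[start:start + n]

prefix, pi = memorized[:3], ["alice@example.com"]
print(is_parroted(prefix, pi, toy_greedy))  # → True
```

With a real model, `generate_fn` would wrap tokenized greedy decoding; the parroting rate is then the fraction of curated PI instances for which this check returns True.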
cs.CL / 14 / 2602.20634
Enhancing Hate Speech Detection on Social Media: A Comparative Analysis of Machine Learning Models and Text Transformation Approaches
增强社交媒体仇恨言论检测:机器学习模型与文本转换方法的比较分析
Abstract
The proliferation of hate speech on social media platforms has necessitated the development of effective detection and moderation tools. This study evaluates the efficacy of various machine learning models in identifying hate speech and offensive language and investigates the potential of text transformation techniques to neutralize such content. We compare traditional models like CNNs and LSTMs with advanced neural network models such as BERT and its derivatives, alongside exploring hybrid models that combine different architectural features. Our results indicate that while advanced models like BERT show superior accuracy due to their deep contextual understanding, hybrid models exhibit improved capabilities in certain scenarios. Furthermore, we introduce innovative text transformation approaches that convert negative expressions into neutral ones, thereby potentially mitigating the impact of harmful content. The implications of these findings are discussed, highlighting the strengths and limitations of current technologies and proposing future directions for more robust hate speech detection systems.
Chinese Translation
社交媒体平台上仇恨言论的泛滥迫使我们开发有效的检测和管理工具。本研究评估了各种机器学习模型在识别仇恨言论和攻击性语言方面的有效性,并探讨了文本转换技术中和此类内容的潜力。我们将传统模型如卷积神经网络(CNNs)和长短期记忆网络(LSTMs)与先进的神经网络模型如BERT及其衍生模型进行比较,同时探索结合不同架构特征的混合模型。我们的结果表明,尽管像BERT这样的先进模型由于其深层上下文理解而表现出更高的准确性,但混合模型在某些场景中展现出更好的能力。此外,我们还介绍了创新的文本转换方法,将负面表达转化为中性表达,从而可能减轻有害内容的影响。讨论了这些发现的意义,强调了当前技术的优势和局限性,并提出了未来更强大的仇恨言论检测系统的发展方向。
cs.CL / 15 / 2602.20647
Semantic Novelty at Scale: Narrative Shape Taxonomy and Readership Prediction in 28,606 Books
大规模语义新颖性:叙事形状分类法与28606本书的读者预测
Abstract
I introduce semantic novelty, the cosine distance between each paragraph's sentence embedding and the running centroid of all preceding paragraphs, as an information-theoretic measure of narrative structure at corpus scale. Applying it to 28,606 books in PG19 (pre-1920 English literature), I compute paragraph-level novelty curves using 768-dimensional SBERT embeddings, then reduce each to a 16-segment Piecewise Aggregate Approximation (PAA). Ward-linkage clustering on the PAA vectors reveals eight canonical narrative shape archetypes, from Steep Descent (rapid convergence) to Steep Ascent (escalating unpredictability). Volume (variance of the novelty trajectory) is the strongest length-independent predictor of readership (partial ρ = 0.32), followed by speed (ρ = 0.19) and the Terminal/Initial ratio (ρ = 0.19). Circuitousness shows a strong raw correlation (ρ = 0.41) but is 93 percent correlated with length; after controlling for length, the partial ρ drops to 0.11, demonstrating that naive correlations in corpus studies can be dominated by length confounds. Genre strongly constrains narrative shape (χ² = 2121.6, p < 10^-242), with fiction maintaining plateau profiles while nonfiction front-loads information. Historical analysis shows books became progressively more predictable between 1840 and 1910 (T/I ratio trend r = -0.74, p = 0.037). SAX analysis reveals 85 percent signature uniqueness, suggesting each book traces a nearly unique path through semantic space. These findings demonstrate that information-density dynamics, distinct from sentiment or topic, constitute a fundamental dimension of narrative structure with measurable consequences for reader engagement. Dataset: https://huggingface.co/datasets/wfzimmerman/pg19-semantic-novelty
Chinese Translation
我引入了语义新颖性——每个段落的句子嵌入与所有前面段落的运行质心之间的余弦距离——作为一种信息论度量,用于在语料库规模上分析叙事结构。将其应用于PG19中的28606本书(1920年前的英语文学),我使用768维的SBERT嵌入计算段落级新颖性曲线,然后将每条曲线简化为16段的分段聚合近似(Piecewise Aggregate Approximation, PAA)。对PAA向量进行Ward链接聚类揭示了八种典型的叙事形状原型,从陡峭下降(快速收敛)到陡峭上升(不断增加的不确定性)。新颖性轨迹的方差(Volume)是与读者人数最强的长度无关预测因子(部分相关系数 rho = 0.32),其次是速度(rho = 0.19)和终端/初始比率(Terminal/Initial ratio, rho = 0.19)。曲折度显示出强烈的原始相关性(rho = 0.41),但与长度的相关性高达93%;经过控制后,部分相关系数降至0.11——这表明在语料库研究中,天真的相关性可能会受到长度混淆的主导影响。体裁强烈限制了叙事形状(卡方 = 2121.6, p < 10^-242),其中虚构类作品保持平稳的轮廓,而非虚构类作品则前置信息。历史分析显示,1840年至1910年间,书籍变得越来越可预测(终端/初始比率趋势 r = -0.74, p = 0.037)。SAX分析揭示了85%的特征唯一性,表明每本书在语义空间中几乎描绘了一条独特的路径。这些发现表明,信息密度动态,区别于情感或主题,构成了叙事结构的一个基本维度,对读者参与度具有可测量的影响。数据集: https://huggingface.co/datasets/wfzimmerman/pg19-semantic-novelty
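The two measurement steps named in the abstract, running-centroid cosine novelty and 16-segment PAA, can be sketched directly; the random vectors below merely stand in for SBERT paragraph embeddings:

```python
import numpy as np

def novelty_curve(embeddings: np.ndarray) -> np.ndarray:
    """Cosine distance between each paragraph embedding and the running
    centroid of all preceding paragraphs (the first paragraph has no
    predecessors, so the curve starts at index 1)."""
    curve = []
    for i in range(1, len(embeddings)):
        centroid = embeddings[:i].mean(axis=0)
        v = embeddings[i]
        cos = centroid @ v / (np.linalg.norm(centroid) * np.linalg.norm(v))
        curve.append(1.0 - cos)
    return np.array(curve)

def paa(curve: np.ndarray, segments: int = 16) -> np.ndarray:
    """Piecewise Aggregate Approximation: the mean of each of
    `segments` roughly equal chunks of the curve."""
    return np.array([c.mean() for c in np.array_split(curve, segments)])

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 768))          # stand-in for SBERT embeddings
shape_vec = paa(novelty_curve(emb), 16)    # one book's 16-dim shape vector
print(shape_vec.shape)  # → (16,)
```

The 16-dimensional `shape_vec` is what the paper's Ward-linkage clustering would operate on, one vector per book.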
cs.CL / 16 / 2602.20648
CARE: An Explainable Computational Framework for Assessing Client-Perceived Therapeutic Alliance Using Large Language Models
CARE:一种可解释的计算框架,用于利用大型语言模型评估客户感知的治疗联盟
Abstract
Client perceptions of the therapeutic alliance are critical for counseling effectiveness. Accurately capturing these perceptions remains challenging, as traditional post-session questionnaires are burdensome and often delayed, while existing computational approaches produce coarse scores, lack interpretable rationales, and fail to model holistic session context. We present CARE, an LLM-based framework to automatically predict multi-dimensional alliance scores and generate interpretable rationales from counseling transcripts. Built on the CounselingWAI dataset and enriched with 9,516 expert-curated rationales, CARE is fine-tuned using rationale-augmented supervision with the LLaMA-3.1-8B-Instruct backbone. Experiments show that CARE outperforms leading LLMs and substantially reduces the gap between counselor evaluations and client-perceived alliance, achieving over 70% higher Pearson correlation with client ratings. Rationale-augmented supervision further improves predictive accuracy. CARE also produces high-quality, contextually grounded rationales, validated by both automatic and human evaluations. Applied to real-world Chinese online counseling sessions, CARE uncovers common alliance-building challenges, illustrates how interaction patterns shape alliance development, and provides actionable insights, demonstrating its potential as an AI-assisted tool for supporting mental health care.
Chinese Translation
客户对治疗联盟的感知对于咨询效果至关重要。准确捕捉这些感知仍然具有挑战性,因为传统的会后问卷既繁琐又常常延迟,而现有的计算方法产生的评分粗糙,缺乏可解释的理由,并且未能建模整体会话背景。我们提出了CARE,一个基于大型语言模型(LLM)的框架,能够自动预测多维度的联盟评分,并从咨询记录中生成可解释的理由。CARE基于CounselingWAI数据集,并结合了9516个专家策划的理由,通过使用LLaMA-3.1-8B-Instruct骨干网进行理由增强监督进行微调。实验表明,CARE优于领先的LLM,并显著缩小了咨询师评估与客户感知联盟之间的差距,与客户评分的Pearson相关性提高了超过70%。理由增强监督进一步提高了预测准确性。CARE还生成高质量、具有上下文基础的理由,经过自动和人工评估的验证。应用于真实的中国在线咨询会话,CARE揭示了常见的联盟建立挑战,说明了互动模式如何塑造联盟发展,并提供了可行的见解,展示了其作为支持心理健康护理的AI辅助工具的潜力。
cs.CL / 17 / 2602.20670
CAMEL: Confidence-Gated Reflection for Reward Modeling
CAMEL:基于置信度的反思框架用于奖励建模
Abstract
Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance difficulty without additional inference cost. Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances. To induce effective self-correction, we train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision. Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters, while establishing a strictly better accuracy-efficiency Pareto frontier.
Chinese Translation
奖励模型在将大型语言模型与人类偏好对齐中发挥着基础性作用。现有方法主要遵循两种范式:标量判别偏好模型,虽然高效但缺乏可解释性;生成判断模型,提供更丰富的推理但代价是更高的计算开销。我们观察到,裁决标记之间的对数概率差与预测正确性强相关,提供了一种无需额外推理成本的实例难度的可靠代理。在此基础上,我们提出了CAMEL,一个基于置信度的反思框架,首先执行轻量级的单标记偏好决策,仅对低置信度实例选择性地调用反思。为了引导有效的自我修正,我们通过反事实前缀增强进行强化学习训练模型,这使模型接触到多样的初始裁决并鼓励真实的修订。从经验上看,CAMEL在三个广泛使用的奖励模型基准上实现了最先进的性能,平均准确率为82.9%,超越了之前最佳模型3.2%,并且在仅使用14B参数的情况下超越了70B参数模型,同时建立了一个严格更好的准确性-效率帕累托前沿。
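The confidence gate the abstract describes (a log-probability margin between the two verdict tokens, with reflection invoked only below a threshold) can be sketched as follows; the threshold value and verdict labels are illustrative assumptions:

```python
def confidence_margin(logprob_a: float, logprob_b: float) -> float:
    """Log-probability margin between the two verdict tokens
    (e.g. 'A' vs. 'B'); a large margin proxies an easy instance."""
    return abs(logprob_a - logprob_b)

def gated_judge(logprob_a, logprob_b, reflect_fn, threshold=1.0):
    """Confidence-gated reflection: accept the single-token verdict
    when the margin clears the threshold, otherwise invoke the
    expensive reflection pass to possibly revise it."""
    fast_verdict = "A" if logprob_a > logprob_b else "B"
    if confidence_margin(logprob_a, logprob_b) >= threshold:
        return fast_verdict, False           # fast path, no reflection
    return reflect_fn(fast_verdict), True    # low confidence: reflect

# High-confidence case: margin 2.3 >= 1.0, so the fast path is taken.
verdict, reflected = gated_judge(-0.1, -2.4, reflect_fn=lambda v: v)
print(verdict, reflected)  # → A False
```

The efficiency claim follows directly: the costly `reflect_fn` runs only on the low-margin fraction of instances, which is how the reported Pareto improvement over always-reflecting judges arises.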
cs.CL / 18 / 2602.20727
ID-LoRA: Efficient Low-Rank Adaptation Inspired by Matrix Interpolative Decomposition
ID-LoRA:受矩阵插值分解启发的高效低秩适应
Abstract
LoRA has become a universal Parameter-Efficient Fine-Tuning (PEFT) technique that enables Large Language Models (LLMs) to adapt quickly to new tasks. However, when these models are scaled up, even the latest LoRA variants still introduce considerable overhead in trainable parameters. Conversely, aggressively lowering the rank to curb this overhead markedly degrades performance in complex multi-task settings. We propose ID-LoRA, a novel PEFT framework that breaks the trade-off. Its core innovation lies in extracting and reusing clustered parameter groups from the pretrained weight matrix. These groups are then used to form multiple low-rank components, all of which share only a single initialized trainable low-rank matrix. This approach cuts the number of trainable parameters while keeping the model's capacity intact. We evaluate ID-LoRA on five diverse benchmarks: Mathematical Reasoning, Code Generation, MMLU, CommonsenseQA, and Safety Alignment. ID-LoRA outperforms both full fine-tuning and existing PEFT baselines (e.g., LoRA, DoRA, HydraLoRA) while using up to 46% fewer trainable parameters than the standard LoRA. In multi-task scenarios, it surpasses LoRA and its recent variants (e.g., DoRA and HydraLoRA) on both Code and MMLU tasks, yet requires only 54% of the trainable parameters demanded by the conventional LoRA.
Chinese Translation
LoRA 已成为一种通用的参数高效微调(PEFT)技术,使大型语言模型(LLMs)能够快速适应新任务。然而,当这些模型规模扩大时,即使是最新的 LoRA 变体仍然会引入相当大的可训练参数开销。相反,过度降低秩以抑制这种开销会显著降低在复杂多任务环境中的性能。我们提出了 ID-LoRA,这是一种新颖的 PEFT 框架,打破了这一权衡。其核心创新在于从预训练权重矩阵中提取和重用聚类参数组。这些组随后用于形成多个低秩组件,所有组件仅共享一个初始化的可训练低秩矩阵。这种方法在保持模型能力不变的情况下,减少了可训练参数的数量。我们在五个不同的基准上评估了 ID-LoRA:数学推理、代码生成、MMLU、常识问答(CommonsenseQA)和安全对齐(Safety Alignment)。ID-LoRA 在性能上超越了完全微调和现有的 PEFT 基准(例如,LoRA、DoRA、HydraLoRA),同时使用的可训练参数比标准 LoRA 少多达 46%。在多任务场景中,它在代码和 MMLU 任务上超越了 LoRA 及其近期变体(例如,DoRA 和 HydraLoRA),但仅需传统 LoRA 所需可训练参数的 54%。
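For context, the standard LoRA update that ID-LoRA builds on can be sketched as below; the clustering and matrix-sharing steps specific to ID-LoRA are not reproduced here, only the baseline low-rank forward pass and its parameter accounting:

```python
import numpy as np

def lora_forward(x, W0, A, B, scale=1.0):
    """Standard LoRA: y = x W0 + scale * (x A) B, where W0 (d x k) is the
    frozen pretrained weight and only the low-rank factors A (d x r) and
    B (r x k) are trained."""
    return x @ W0 + scale * (x @ A) @ B

d, k, r = 64, 32, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(8, d))
W0 = rng.normal(size=(d, k))
A = rng.normal(size=(d, r)) * 0.01
B = np.zeros((r, k))          # B = 0 => the adapter starts as a zero delta
y = lora_forward(x, W0, A, B)
print(bool(np.allclose(y, x @ W0)))  # → True (untrained adapter is inert)

# Trainable parameters: r*(d+k) for the adapter vs d*k for full fine-tuning.
print(r * (d + k), d * k)  # → 384 2048
```

ID-LoRA's saving comes on top of this: by sharing one trainable low-rank matrix across several components derived from clustered pretrained weights, it shrinks the `r*(d+k)` term further.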
cs.CL / 19 / 2602.20743
Adaptive Text Anonymization: Learning Privacy-Utility Trade-offs via Prompt Optimization
自适应文本匿名化:通过提示优化学习隐私与效用的权衡
Abstract
Anonymizing textual documents is a highly context-sensitive problem: the appropriate balance between privacy protection and utility preservation varies with the data domain, privacy objectives, and downstream application. However, existing anonymization methods rely on static, manually designed strategies that lack the flexibility to adjust to diverse requirements and often fail to generalize across domains. We introduce adaptive text anonymization, a new task formulation in which anonymization strategies are automatically adapted to specific privacy-utility requirements. We propose a framework for task-specific prompt optimization that automatically constructs anonymization instructions for language models, enabling adaptation to different privacy goals, domains, and downstream usage patterns. To evaluate our approach, we present a benchmark spanning five datasets with diverse domains, privacy constraints, and utility objectives. Across all evaluated settings, our framework consistently achieves a better privacy-utility trade-off than existing baselines, while remaining computationally efficient and effective on open-source language models, with performance comparable to larger closed-source models. Additionally, we show that our method can discover novel anonymization strategies that explore different points along the privacy-utility trade-off frontier.
Chinese Translation
文本文件的匿名化是一个高度依赖上下文的问题:隐私保护与效用保持之间的适当平衡因数据领域、隐私目标和下游应用而异。然而,现有的匿名化方法依赖于静态的、手动设计的策略,缺乏适应多样化需求的灵活性,且往往无法在不同领域中推广。我们提出了自适应文本匿名化,这是一种新的任务表述,其中匿名化策略自动适应特定的隐私与效用要求。我们提出了一个针对特定任务的提示优化框架,该框架自动构建语言模型的匿名化指令,使其能够适应不同的隐私目标、领域和下游使用模式。为了评估我们的方法,我们展示了一个涵盖五个数据集的基准,这些数据集具有多样的领域、隐私约束和效用目标。在所有评估的设置中,我们的框架始终实现了比现有基线更好的隐私-效用权衡,同时在计算效率和在开源语言模型上的有效性方面表现良好,其性能可与更大规模的闭源模型相媲美。此外,我们还展示了我们的方法能够发现新的匿名化策略,探索隐私-效用权衡前沿的不同点。
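The prompt-search loop in the abstract above can be sketched minimally: score each candidate anonymization instruction on a privacy proxy and a utility proxy, then keep the best trade-off. Everything below (the scoring functions, the `anonymize` callable, the `alpha` weighting) is a hypothetical stand-in, not the paper's actual optimization framework.

```python
# Hypothetical sketch of task-specific prompt selection for anonymization.
# The scoring functions are crude proxies; the paper does not specify them.

def privacy_score(anonymized: str, pii_terms: list[str]) -> float:
    """Fraction of known PII terms removed from the anonymized text."""
    if not pii_terms:
        return 1.0
    leaked = sum(1 for t in pii_terms if t.lower() in anonymized.lower())
    return 1.0 - leaked / len(pii_terms)

def utility_score(original: str, anonymized: str) -> float:
    """Crude utility proxy: token overlap with the original document."""
    orig, anon = set(original.lower().split()), set(anonymized.lower().split())
    return len(orig & anon) / max(len(orig), 1)

def select_prompt(candidates, documents, pii, anonymize, alpha=0.5):
    """Pick the instruction whose outputs best balance privacy and utility."""
    def trade_off(prompt):
        scores = [
            alpha * privacy_score(anonymize(prompt, d), p)
            + (1 - alpha) * utility_score(d, anonymize(prompt, d))
            for d, p in zip(documents, pii)
        ]
        return sum(scores) / len(scores)
    return max(candidates, key=trade_off)
```

In the full framework the `anonymize` callable would be a language model conditioned on the candidate instruction, and the search would be guided rather than exhaustive.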
cs.CL / 20 / 2602.20749
Explicit Grammar Semantic Feature Fusion for Robust Text Classification
显式语法语义特征融合用于鲁棒文本分类
Abstract
Natural Language Processing enables computers to understand human language by analysing and classifying text efficiently with deep-level grammatical and semantic features. Existing models capture features by learning from large corpora with transformer models, which are computationally intensive and unsuitable for resource-constrained environments. Therefore, our proposed study incorporates comprehensive grammatical rules alongside semantic information to build a robust, lightweight classification model without resorting to fully parameterised transformer models or heavy deep learning architectures. The novelty of our approach lies in its explicit encoding of sentence-level grammatical structure, including syntactic composition, phrase patterns, and complexity indicators, into a compact grammar vector, which is then fused with frozen contextual embeddings. These heterogeneous elements are unified into a single representation that captures both the structural and semantic characteristics of the text. Deep learning models such as Deep Belief Networks (DBNs), Long Short-Term Memory networks (LSTMs), BiLSTMs, and transformer-based BERT and XLNet were used to train and evaluate the model, varying the number of epochs. Based on experimental results, the unified feature representation model captures both the semantic and structural properties of text, outperforming baseline models by 2%-15% and enabling more effective learning across heterogeneous domains. Unlike prior syntax-aware transformer models that inject grammatical structure through additional attention layers, tree encoders, or full fine-tuning, the proposed framework treats grammar as an explicit inductive bias rather than a learnable module, resulting in a very lightweight model that delivers better performance on edge devices.
Chinese Translation
自然语言处理使计算机能够通过分析和分类文本,利用深层次的语法和语义特征高效理解人类语言。现有模型通过从大型语料库中学习来捕捉特征,使用变换器模型,这些模型计算密集且不适合资源受限的环境。因此,我们提出的研究结合了全面的语法规则和语义信息,以构建一个鲁棒且轻量的分类模型,而无需依赖全参数化的变换器模型或重型深度学习架构。我们方法的新颖之处在于显式编码句子级别的语法结构,包括句法组成、短语模式和复杂性指标,形成一个紧凑的语法向量,然后与冻结的上下文嵌入进行融合。这些异构元素统一成一个单一的表示,捕捉文本的结构和语义特征。深度学习模型如深度信念网络(Deep Belief Networks, DBNs)、长短期记忆网络(Long Short-Term Memory, LSTMs)、双向长短期记忆网络(BiLSTMs)以及基于变换器的BERT和XLNET被用于训练和评估模型,训练轮数有所变化。根据实验结果,统一特征表示模型同时捕捉文本的语义和结构属性,超越基线模型2%-15%,在异构领域实现更有效的学习。与之前通过额外注意力层、树编码器或完全微调注入语法结构的语法感知变换器模型不同,所提出的框架将语法视为显式的归纳偏置,而不是可学习的模块,从而形成一个非常轻量的模型,在边缘设备上提供更好的性能。
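As an illustration of the fusion step described above, the following sketch builds a small explicit grammar vector and concatenates it with a frozen contextual embedding. The five features chosen here are invented stand-ins for the paper's syntactic-composition, phrase-pattern, and complexity indicators.

```python
import re
import numpy as np

def grammar_vector(sentence: str) -> np.ndarray:
    """Compact, hand-crafted grammar features (illustrative choices only)."""
    tokens = sentence.split()
    return np.array([
        len(tokens),                                                   # sentence length
        sentence.count(","),                                           # clause separators
        sum(1 for t in tokens if t.lower() in {"and", "or", "but"}),   # conjunctions
        sum(1 for t in tokens if t.istitle()),                         # capitalized tokens
        len(re.findall(r"\w+ing\b", sentence)),                        # gerund-like forms
    ], dtype=np.float32)

def fuse(frozen_embedding: np.ndarray, sentence: str) -> np.ndarray:
    """Concatenate a frozen contextual embedding with the grammar vector."""
    return np.concatenate([frozen_embedding, grammar_vector(sentence)])
```

The fused vector would then feed a lightweight classifier; grammar acts as a fixed inductive bias because nothing in `grammar_vector` is learned.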
cs.CL / 21 / 2602.20751
SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
SibylSense:通过记忆调优和对抗探测实现自适应评分标准学习
Abstract
Designing aligned and robust rewards for open-ended generation remains a key barrier to RL post-training. Rubrics provide structured, interpretable supervision, but scaling rubric construction is difficult: expert rubrics are costly, prompted rubrics are often superficial or inconsistent, and fixed-pool discriminative rubrics can saturate and drift, enabling reward hacking. We present SibylSense, an inference-time learning approach that adapts a frozen rubric generator through a tunable memory bank of validated rubric items. Memory is updated via verifier-based item rewards measured by reference-candidate answer discriminative gaps from a handful of examples. SibylSense alternates memory tuning with a rubric-adversarial policy update that produces rubric-satisfying candidate answers, shrinking discriminative gaps and driving the rubric generator to capture new quality dimensions. Experiments on two open-ended tasks show that SibylSense yields more discriminative rubrics and improves downstream RL performance over static and non-adaptive baselines.
Chinese Translation
为开放式生成设计对齐且稳健的奖励仍然是强化学习后训练的一个关键障碍。评分标准提供了结构化、可解释的监督,但扩展评分标准的构建非常困难:专家评分标准成本高昂,提示生成的评分标准往往表面化或不一致,而固定池的判别评分标准可能会饱和并漂移,从而导致奖励黑客行为。我们提出了SibylSense,这是一种推理时学习的方法,通过可调的验证评分项目记忆库来适应一个冻结的评分标准生成器。记忆通过基于验证器的项目奖励进行更新,这些奖励是通过少量示例的参考-候选答案判别差距来测量的。SibylSense交替进行记忆调优和评分标准对抗策略更新,生成符合评分标准的候选答案,缩小判别差距,并推动评分标准生成器捕捉新的质量维度。在两个开放式任务上的实验表明,SibylSense产生了更具判别性的评分标准,并改善了相较于静态和非自适应基线的下游强化学习性能。
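The verifier-based item reward above can be sketched as a reference-candidate discriminative gap: a rubric item is kept in memory only when it scores reference answers higher than current candidate answers. The `score` callable and the retention threshold below are assumptions; the paper's verifier and update rule may differ.

```python
# Sketch of gap-based rubric-item rewards for memory tuning.

def item_reward(score, item, references, candidates):
    """Mean reference-candidate discriminative gap for one rubric item."""
    ref_mean = sum(score(item, a) for a in references) / len(references)
    cand_mean = sum(score(item, a) for a in candidates) / len(candidates)
    return ref_mean - cand_mean

def update_memory(memory, score, items, references, candidates, threshold=0.1):
    """Keep only rubric items whose discriminative gap exceeds a threshold."""
    for it in items:
        if item_reward(score, it, references, candidates) > threshold:
            memory.append(it)
    return memory
```

In the adversarial phase, the policy produces rubric-satisfying candidates, which shrinks these gaps and forces the next memory update toward new quality dimensions.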
cs.CL / 22 / 2602.20759
Overton Pluralistic Reinforcement Learning for Large Language Models
针对大型语言模型的Overton多元强化学习
Abstract
Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a "small models, big perspective coverage" effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.
Chinese Translation
现有的对齐范式在捕捉人类价值观的多元性方面仍然有限。Overton多元主义通过从单一查询生成具有多样化视角的响应来填补这一空白。本文介绍了OP-GRPO(Overton多元群体相对策略优化),这是一个用于隐式Overton多元主义的强化学习框架,使单一大型语言模型能够在没有明确提示或模块化协调的情况下生成多元化的响应。我们的工作流程包括两个主要步骤。首先,相似度估计器训练对Sentence Transformer进行微调,以适应Overton多元主义任务,从而提供对生成响应的更准确覆盖评估。其次,OP-GRPO训练将该相似度估计器纳入一个双重奖励系统,旨在确保对真实人类视角的广泛覆盖和每个视角的独特性,从而促进多样性。实证结果展示了“小模型,大视角覆盖”的效果。训练后的Qwen2.5-3B-Instruct模型在自然语言推理基准测试中超越了20B GPT-OSS基线,获得了37.4%的相对准确性提升,并且在模块化架构基线中也实现了19.1%的相对改善。使用GPT-4.1作为大型语言模型的额外评估进一步确认了该方法的稳健性。
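A minimal sketch of the dual-reward system described above: given similarity matrices from the fine-tuned estimator, reward broad coverage of reference human perspectives and penalize redundancy among generated perspectives. The equal 0.5/0.5 weighting is an assumption, not the paper's setting.

```python
import numpy as np

def coverage_reward(sim_gen_ref: np.ndarray) -> float:
    """Each reference perspective is covered by its best-matching generation
    (rows: generations, columns: reference perspectives)."""
    return float(sim_gen_ref.max(axis=0).mean())

def uniqueness_reward(sim_gen_gen: np.ndarray) -> float:
    """Penalize near-duplicate generations via mean off-diagonal similarity."""
    n = sim_gen_gen.shape[0]
    off_diag = sim_gen_gen[~np.eye(n, dtype=bool)]
    return float(1.0 - off_diag.mean())

def dual_reward(sim_gen_ref, sim_gen_gen, w_cov=0.5, w_uniq=0.5):
    return w_cov * coverage_reward(sim_gen_ref) + w_uniq * uniqueness_reward(sim_gen_gen)
```

Plugged into a GRPO-style update, this scalar would be computed per rollout, so responses that cover many distinct perspectives earn the highest group-relative advantage.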
cs.CL / 23 / 2602.20816
Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation
不要忽视尾部:解耦前K个概率以实现高效的语言模型蒸馏
Abstract
The core learning signal used in language model distillation is the standard Kullback-Leibler (KL) divergence between the student and teacher distributions. Traditional KL divergence tends to be dominated by the next tokens with the highest probabilities, i.e., the teacher's modes, thereby diminishing the influence of less probable yet potentially informative components of the output distribution. We propose a new tail-aware divergence that decouples the contribution of the teacher model's top-K predicted probabilities from that of lower-probability predictions, while maintaining the same computational profile as the KL Divergence. Our decoupled approach reduces the impact of the teacher modes and, consequently, increases the contribution of the tail of the distribution. Experimental results demonstrate that our modified distillation method yields competitive performance in both pre-training and supervised distillation of decoder models across various datasets. Furthermore, the distillation process is efficient and can be performed with a modest academic budget for large datasets, eliminating the need for industry-scale computing.
Chinese Translation
语言模型蒸馏中使用的核心学习信号是学生模型与教师模型分布之间的标准Kullback-Leibler (KL) 散度。传统的KL散度往往受到概率最高的下一个token的主导,即教师模型的模式,从而削弱了输出分布中较低概率但潜在信息丰富的成分的影响。我们提出了一种新的尾部感知散度,它将教师模型的前K个预测概率的贡献与较低概率预测的贡献解耦,同时保持与KL散度相同的计算特征。我们解耦的方法减少了教师模式的影响,因此增加了分布尾部的贡献。实验结果表明,我们修改后的蒸馏方法在各种数据集上对解码器模型的预训练和监督蒸馏均表现出竞争力的性能。此外,蒸馏过程高效,可以在适度的学术预算下对大数据集进行,消除了对工业规模计算的需求。
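The decoupling idea above can be sketched as follows: split the vocabulary into the teacher's top-K tokens and the tail, renormalize each part, and combine the two partial KL terms with an explicit weight, so the modes no longer drown out the tail. The 50/50 weighting and the renormalization details are assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def decoupled_kl(teacher: np.ndarray, student: np.ndarray, k: int,
                 w_head: float = 0.5, eps: float = 1e-12) -> float:
    """Tail-aware divergence: separate KL over the teacher's top-K mass
    and the remaining tail, each renormalized to a proper distribution."""
    top = np.argsort(teacher)[-k:]            # indices of teacher's top-K tokens
    head = np.zeros_like(teacher, dtype=bool)
    head[top] = True

    def partial_kl(mask):
        p = teacher[mask] / teacher[mask].sum()   # renormalize within the part
        q = student[mask] / student[mask].sum()
        return float(np.sum(p * np.log((p + eps) / (q + eps))))

    return w_head * partial_kl(head) + (1 - w_head) * partial_kl(~head)
```

Both partial terms touch the same logits as standard KL, which is why the computational profile stays unchanged.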
cs.CL / 24 / 2602.20859
FinAnchor: Aligned Multi-Model Representations for Financial Prediction
FinAnchor:用于金融预测的对齐多模型表示
Abstract
Financial prediction from long documents involves significant challenges, as actionable signals are often sparse and obscured by noise, and the optimal LLM for generating embeddings varies across tasks and time periods. In this paper, we propose FinAnchor(Financial Anchored Representations), a lightweight framework that integrates embeddings from multiple LLMs without fine-tuning the underlying models. FinAnchor addresses the incompatibility of feature spaces by selecting an anchor embedding space and learning linear mappings to align representations from other models into this anchor. These aligned features are then aggregated to form a unified representation for downstream prediction. Across multiple financial NLP tasks, FinAnchor consistently outperforms strong single-model baselines and standard ensemble methods, demonstrating the effectiveness of anchoring heterogeneous representations for robust financial prediction.
Chinese Translation
从长文档中进行金融预测面临重大挑战,因为可操作信号通常稀疏且被噪声掩盖,生成嵌入的最佳大语言模型(LLM)在不同任务和时间段之间有所不同。本文提出了FinAnchor(金融锚定表示),这是一个轻量级框架,能够整合来自多个LLM的嵌入,而无需对基础模型进行微调。FinAnchor通过选择一个锚定嵌入空间并学习线性映射,将其他模型的表示对齐到该锚定空间,从而解决特征空间的不兼容性。这些对齐的特征随后被聚合形成一个统一的表示,以用于下游预测。在多个金融自然语言处理(NLP)任务中,FinAnchor始终优于强大的单模型基准和标准集成方法,证明了锚定异构表示在稳健金融预测中的有效性。
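The anchoring step described above reduces, in its simplest form, to a least-squares linear map from each model's embedding space into the anchor space, followed by aggregation of the aligned features. Mean aggregation is an assumption here; the paper's scheme may differ.

```python
import numpy as np

def fit_alignment(source: np.ndarray, anchor: np.ndarray) -> np.ndarray:
    """Least-squares W such that source @ W ~= anchor, fit on paired
    documents (n_docs x d_src -> n_docs x d_anchor)."""
    W, *_ = np.linalg.lstsq(source, anchor, rcond=None)
    return W

def anchored_representation(anchor_emb, other_embs, maps):
    """Average the anchor embedding with all aligned other-model embeddings."""
    aligned = [e @ W for e, W in zip(other_embs, maps)]
    return np.mean([anchor_emb, *aligned], axis=0)
```

Because only the small `W` matrices are learned, the underlying LLMs stay frozen, matching the framework's no-fine-tuning design.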
cs.CL / 25 / 2602.20892
Exa-PSD: a new Persian sentiment analysis dataset on Twitter
Exa-PSD:一个新的波斯语Twitter情感分析数据集
Abstract
Today, social networks such as Twitter are among the most widely used platforms for people to communicate. Analyzing this data yields useful information for recognizing people's opinions in tweets. Sentiment analysis plays a vital role in NLP, identifying individuals' opinions about a specific topic. Despite the advent of strong language models, natural language processing in Persian still faces many challenges. The datasets available in Persian generally cover specific topics such as products, foods, and hotels, while users may use irony and colloquial phrases on social media. To overcome these challenges, a dataset for Persian sentiment analysis on Twitter is needed. In this paper, we introduce the Exa sentiment analysis Persian dataset, collected from Persian tweets. This dataset contains 12,000 tweets, annotated by 5 native Persian taggers. The data is labeled with 3 classes: positive, neutral, and negative. We present the characteristics and statistics of this dataset and use the pre-trained ParsBERT and RoBERTa as base models to evaluate it. Our evaluation reached a 79.87 macro F-score, which shows the model and data can be adequately valuable for a sentiment analysis system.
Chinese Translation
如今,社交网络如Twitter是人们交流最广泛使用的平台。分析这些数据可以提供有用的信息,以识别人们在推文中的观点。情感分析在自然语言处理(NLP)中发挥着至关重要的作用,它识别个体对特定主题的看法。尽管强大的语言模型的出现,波斯语的自然语言处理仍面临许多挑战。现有的波斯语数据集通常集中在特定主题上,如产品、食品、酒店等,而用户在社交媒体上可能会使用讽刺、口语短语。为了克服这些挑战,迫切需要一个波斯语Twitter情感分析的数据集。在本文中,我们介绍了Exa情感分析波斯语数据集,该数据集是从波斯语推文中收集的。该数据集包含12,000条推文,由5名母语为波斯语的标注员进行标注。上述数据被标记为3个类别:积极、中立和消极。我们展示了该数据集的特征和统计信息,并使用预训练的Pars Bert和Roberta作为基模型来评估该数据集。我们的评估达到了79.87的宏观F1-score,这表明该模型和数据对于情感分析系统具有足够的价值。
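For reference, the macro F-score reported above is the unweighted mean of per-class F1 over the three sentiment classes; a stdlib sketch:

```python
def macro_f1(y_true: list[str], y_pred: list[str], classes: list[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(classes)
```

Because each class contributes equally, the metric is not inflated by a dominant class, which matters for imbalanced tweet sentiment labels.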
cs.CL / 26 / 2602.20945
The Art of Efficient Reasoning: Data, Reward, and Optimization
高效推理的艺术:数据、奖励与优化
Abstract
Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead. To address this issue, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we systematically investigate the mechanics of efficient reasoning for LLMs. For comprehensive evaluation, we advocate for more fine-grained metrics, including length distribution conditioned on correctness and performance across a wide spectrum of token budgets ranging from 2k to 32k. First, we reveal that the training process follows a two-stage paradigm: length adaptation and reasoning refinement. After that, we conduct extensive experiments (about 0.2 million GPU hours) in a unified protocol, deconstructing training prompts and rollouts, reward shaping, and optimization strategies. In particular, a key finding is to train on relatively easier prompts, ensuring the density of positive reward signals and thus avoiding the length collapse. Meanwhile, the learned length bias can be generalized across domains. We distill all findings into valuable insights and practical guidelines, and further validate them across the Qwen3 series, ranging from 0.6B to 30B, demonstrating the robustness and generalization.
Chinese Translation
大型语言模型(LLMs)在规模化的思维链(Chain-of-Thought, CoT)推理中持续受益,但也面临着巨大的计算开销。为了解决这一问题,高效推理旨在激励短而准确的思维轨迹,通常通过强化学习(Reinforcement Learning, RL)进行奖励塑造。本文系统地研究了LLMs高效推理的机制。为了进行全面评估,我们倡导采用更细致的指标,包括基于正确性条件的长度分布以及在从2k到32k的广泛标记预算范围内的性能表现。首先,我们揭示了训练过程遵循两阶段范式:长度适应和推理精炼。随后,我们在统一协议下进行了广泛的实验(约20万GPU小时),对训练提示和回滚、奖励塑造以及优化策略进行了拆解。特别地,一个关键发现是对相对简单的提示进行训练,确保正奖励信号的密度,从而避免长度崩溃。同时,学习到的长度偏差可以在不同领域中推广。我们将所有发现提炼为有价值的见解和实用指南,并在Qwen3系列(范围从0.6B到30B)中进一步验证,展示了其稳健性和泛化能力。
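A toy illustration of the reward-shaping idea discussed above: reward only correct rollouts, discounted by length, and monitor the density of positive signals that the paper links to length collapse. The linear penalty and the budget constant are assumptions, not the paper's shaping.

```python
def shaped_reward(correct: bool, n_tokens: int, budget: int = 4096) -> float:
    """Correct rollouts earn more when shorter; incorrect rollouts earn nothing,
    pushing the policy toward short yet accurate reasoning traces."""
    if not correct:
        return 0.0
    # Reward in (0, 1]: full credit at zero length, decaying to 0 at the budget.
    return max(0.0, 1.0 - n_tokens / budget)

def positive_signal_density(rewards: list[float]) -> float:
    """Fraction of rollouts with positive reward. If prompts are too hard,
    this drops and length collapse becomes likely."""
    return sum(r > 0 for r in rewards) / max(len(rewards), 1)
```

The second function operationalizes the key finding quoted above: training on relatively easier prompts keeps this density high enough for stable optimization.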
cs.CL / 27 / 2602.20966
Blackbird Language Matrices: A Framework to Investigate the Linguistic Competence of Language Models
黑鸟语言矩阵:研究语言模型语言能力的框架
Abstract
This article describes a novel language task, the Blackbird Language Matrices (BLM) task, inspired by intelligence tests, and illustrates the BLM datasets, their construction and benchmarking, and targeted experiments on chunking and systematicity. BLMs are multiple-choice problems, structured at multiple levels: within each sentence, across the input sequence, within each candidate answer. Because of their rich structure, these curated but naturalistic datasets are key to answering some core questions about current large language models' abilities: do LLMs detect linguistic objects and their properties? Do they detect and use systematic patterns across sentences? Are they more prone to linguistic or reasoning errors, and how do these interact? We show that BLMs, while challenging, can be solved at good levels of performance, in more than one language, with simple baseline models or, at better performance levels, with more tailored models. We show that their representations contain the grammatical objects and attributes relevant to solving a linguistic task. We also show that these solutions are reached by detecting systematic patterns across sentences. The paper supports the point of view that curated, structured datasets support multi-faceted investigations of properties of language and large language models. Because they present a curated, articulated structure, because they comprise both learning contexts and expected answers, and because they are partly built by hand, BLMs fall into the category of datasets that can support explainability investigations, and be useful to ask why large language models behave the way they do.
Chinese Translation
本文描述了一项新颖的语言任务——黑鸟语言矩阵(Blackbird Language Matrices, BLM)任务,该任务受到智力测试的启发,并阐述了BLM数据集的构建、基准测试及针对分块和系统性的实验。BLM是多项选择问题,结构分为多个层次:在每个句子内、跨输入序列以及在每个候选答案内。由于其丰富的结构,这些经过精心策划但又自然的数据集对于回答关于当前大型语言模型能力的一些核心问题至关重要:大型语言模型(LLMs)是否能够识别语言对象及其属性?它们是否能够识别并利用句子之间的系统性模式?它们更容易出现语言错误还是推理错误,这两者之间又是如何相互作用的?我们展示了BLM在多个语言中虽然具有挑战性,但可以通过简单的基线模型或更为定制的模型在良好的性能水平上解决。我们还表明,它们的表示包含了解决语言任务所需的语法对象和属性。我们还展示了这些解决方案是通过识别句子之间的系统性模式而实现的。本文支持这样一种观点:经过策划的结构化数据集能够支持对语言及大型语言模型特性的多方面研究。由于它们呈现出经过策划的、明确的结构,包含学习上下文和预期答案,并且部分是手工构建的,BLM属于能够支持可解释性研究的数据集,并且对探讨大型语言模型为何以特定方式表现具有重要意义。
cs.CL / 28 / 2602.20973
Linear Reasoning vs. Proof by Cases: Obstacles for Large Language Models in FOL Problem Solving
线性推理与案例证明:大型语言模型在一阶逻辑问题解决中的障碍
Abstract
To comprehensively evaluate the mathematical reasoning capabilities of Large Language Models (LLMs), researchers have introduced abundant mathematical reasoning datasets. However, most existing datasets primarily focus on linear reasoning, neglecting other proof strategies such as proof by contradiction and proof by cases, which are crucial for investigating LLMs' reasoning abilities. To address this limitation, we first introduce a novel first-order logic (FOL) dataset named PC-FOL, annotated by professional mathematicians, focusing on case-based reasoning problems. All instances in this dataset are equipped with a manually written natural language proof, clearly distinguishing it from conventional linear reasoning datasets. Our experimental results over leading LLMs demonstrate a substantial performance gap between linear reasoning and case-based reasoning problems. To further investigate this phenomenon, we provide a theoretical analysis grounded in graphical models, which provides an explanation for the observed disparity between the two types of reasoning problems. We hope this work can reveal the core challenges in the field of automated natural language mathematical proof generation, paving the way for future research.
Chinese Translation
为了全面评估大型语言模型(LLMs)的数学推理能力,研究人员引入了大量的数学推理数据集。然而,现有的大多数数据集主要集中于线性推理,忽视了其他重要部分,如反证法和案例证明,这些对于研究LLMs的推理能力至关重要。为了解决这一局限性,我们首先介绍了一个新颖的一阶逻辑(FOL)数据集,命名为PC-FOL,由专业数学家进行注释,专注于基于案例的推理问题。该数据集中的所有实例都配备了手动编写的自然语言证明,清晰地区别于传统的线性推理数据集。我们在领先的LLMs上的实验结果显示,线性推理与基于案例的推理问题之间存在显著的性能差距。为了进一步探讨这一现象,我们提供了基于图模型的理论分析,为观察到的两种推理问题之间的差异提供了解释。我们希望这项工作能够揭示自动自然语言数学证明生成领域的核心挑战,为未来的研究铺平道路。
cs.CL / 29 / 2602.20976
Evaluating Proactive Risk Awareness of Large Language Models
评估大型语言模型的主动风险意识
Abstract
As large language models (LLMs) are increasingly embedded in everyday decision-making, their safety responsibilities extend beyond reacting to explicit harmful intent toward anticipating unintended but consequential risks. In this work, we introduce a proactive risk awareness evaluation framework that measures whether LLMs can anticipate potential harms and provide warnings before damage occurs. We construct the Butterfly dataset to instantiate this framework in the environmental and ecological domain. It contains 1,094 queries that simulate ordinary solution-seeking activities whose responses may induce latent ecological impact. Through experiments across five widely used LLMs, we analyze the effects of response length, languages, and modality. Experimental results reveal consistent, significant declines in proactive awareness under length-restricted responses, similar behavior across languages, and persistent blind spots in (multimodal) species protection. These findings highlight a critical gap between current safety alignment and the requirements of real-world ecological responsibility, underscoring the need for proactive safeguards in LLM deployment.
Chinese Translation
随着大型语言模型(LLMs)越来越多地融入日常决策中,它们的安全责任不仅限于对明确的有害意图作出反应,还包括预见意外但重要的风险。在本研究中,我们提出了一种主动风险意识评估框架,旨在衡量LLMs是否能够预见潜在的危害并在损害发生之前提供警告。我们构建了Butterfly数据集,以在环境和生态领域实现该框架。该数据集包含1,094个查询,模拟普通的寻求解决方案的活动,其响应可能会引发潜在的生态影响。通过对五种广泛使用的LLMs进行实验,我们分析了响应长度、语言和模态的影响。实验结果显示,在长度受限的响应下主动意识出现一致且显著的下降,不同语言之间表现相似,并且在(多模态)物种保护方面存在持续的盲点。这些发现突显了当前安全对齐与现实世界生态责任要求之间的关键差距,强调了在LLM部署中需要主动的保护措施。
cs.CL / 30 / 2602.21082
Beyond the Star Rating: A Scalable Framework for Aspect-Based Sentiment Analysis Using LLMs and Text Classification
超越星级评分:基于大规模情感分析的可扩展框架,结合大语言模型与文本分类
Abstract
Customer-provided reviews have become an important source of information for business owners and other customers alike. However, effectively analyzing millions of unstructured reviews remains challenging. While large language models (LLMs) show promise for natural language understanding, their application to large-scale review analysis has been limited by computational costs and scalability concerns. This study proposes a hybrid approach that uses LLMs for aspect identification while employing classic machine-learning methods for sentiment classification at scale. Using ChatGPT to analyze sampled restaurant reviews, we identified key aspects of dining experiences and developed sentiment classifiers using human-labeled reviews, which we subsequently applied to 4.7 million reviews collected over 17 years from a major online platform. Regression analysis reveals that our machine-labeled aspects significantly explain variance in overall restaurant ratings across different aspects of dining experiences, cuisines, and geographical regions. Our findings demonstrate that combining LLMs with traditional machine learning approaches can effectively automate aspect-based sentiment analysis of large-scale customer feedback, suggesting a practical framework for both researchers and practitioners in the hospitality industry and potentially, other service sectors.
Chinese Translation
顾客提供的评论已成为商业业主和其他顾客的重要信息来源。然而,有效分析数百万条非结构化评论仍然具有挑战性。尽管大型语言模型(LLMs)在自然语言理解方面展现出潜力,但其在大规模评论分析中的应用受到计算成本和可扩展性问题的限制。本研究提出了一种混合方法,利用LLMs进行方面识别,同时采用经典机器学习方法进行大规模情感分类。通过使用ChatGPT分析抽样的餐厅评论,我们识别了用餐体验的关键方面,并利用人工标注的评论开发了情感分类器,随后将其应用于从一个主要在线平台收集的470万条评论,这些评论跨越了17年。回归分析表明,我们机器标注的方面显著解释了不同用餐体验、菜系和地理区域的整体餐厅评分的方差。我们的研究结果表明,将LLMs与传统机器学习方法相结合,可以有效自动化大规模客户反馈的基于方面的情感分析,为酒店业及其他服务行业的研究人员和从业者提供了一个实用框架。
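The hybrid pipeline above can be caricatured in a few lines: assume the LLM stage has already produced aspect keyword lists, then classify sentiment cheaply at scale with a classic method. The keyword sets and lexicon below are invented placeholders for the paper's ChatGPT-derived aspects and trained classifiers.

```python
# Hypothetical output of the LLM aspect-identification stage.
ASPECT_KEYWORDS = {
    "food": {"pasta", "pizza", "flavor", "dish"},
    "service": {"waiter", "staff", "service"},
}
# Toy lexicon standing in for the trained sentiment classifiers.
POS = {"great", "delicious", "friendly", "excellent"}
NEG = {"bland", "rude", "slow", "cold"}

def analyze_review(text: str) -> dict[str, str]:
    """Assign a sentiment to each aspect mentioned in the review."""
    tokens = set(text.lower().replace(",", " ").replace(".", " ").split())
    out = {}
    for aspect, kws in ASPECT_KEYWORDS.items():
        if tokens & kws:
            score = len(tokens & POS) - len(tokens & NEG)
            out[aspect] = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return out
```

The point of the division of labor is cost: the expensive LLM runs once on a sample to define aspects, while a function like this (in the paper, a trained classifier) runs over millions of reviews.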
cs.CL / 31 / 2602.21103
Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning
提示级蒸馏:一种非参数化的模型微调替代方案,用于高效推理
Abstract
Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD). We extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model's System Prompt. Evaluated on the StereoSet and Contract-NLI datasets using Gemma-3 4B, PLD improved Macro F1 scores from 57% to 90.0% and 67% to 83% respectively, enabling this compact model to match frontier performance with negligible latency overhead. These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-volume use cases and edge devices.
Chinese Translation
高级推理通常需要链式思维提示,这种方法虽然准确,但会导致显著的延迟和高昂的测试时间推理成本。标准替代方案是微调较小的模型,但这往往牺牲了解释性,同时引入了显著的资源和操作开销。为了解决这些局限性,我们提出了提示级蒸馏(Prompt-Level Distillation, PLD)。我们从教师模型中提取明确的推理模式,并将其组织成一系列结构化的表达指令,以供学生模型的系统提示使用。在使用Gemma-3 4B评估StereoSet和Contract-NLI数据集时,PLD将宏观F1分数分别从57\%提高到90.0\%和67\%提高到83\%,使得该紧凑模型能够以微不足道的延迟开销匹配前沿性能。这些表达指令使决策过程透明化,允许对逻辑进行全面的人类验证,使该方法非常适合法律、金融和内容审核等受监管行业,以及高容量用例和边缘设备。
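A sketch of how a PLD system prompt might be assembled: teacher-extracted reasoning patterns become a numbered instruction list for the student model. The rule texts here are hypothetical; the paper's extracted patterns are not reproduced.

```python
def build_pld_system_prompt(task: str, rules: list[str]) -> str:
    """Assemble a structured system prompt from distilled reasoning rules."""
    lines = [f"You are solving a {task} task. Follow these reasoning steps:"]
    lines += [f"{i}. {rule}" for i, rule in enumerate(rules, start=1)]
    lines.append("Answer with the label only.")
    return "\n".join(lines)
```

Because the distilled knowledge lives in this plain-text artifact rather than in model weights, it can be audited line by line, which is the interpretability advantage the abstract emphasizes.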
cs.CL / 32 / 2602.21165
PVminer: A Domain-Specific Tool to Detect the Patient Voice in Patient Generated Data
PVminer:一种检测患者生成数据中患者声音的领域特定工具
Abstract
Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH). Traditional qualitative coding frameworks are labor intensive and do not scale to large volumes of patient-authored messages across health systems. Existing machine learning (ML) and natural language processing (NLP) approaches provide partial solutions but often treat patient-centered communication (PCC) and SDoH as separate tasks or rely on models not well suited to patient-facing language. We introduce PVminer, a domain-adapted NLP framework for structuring patient voice in secure patient-provider communication. PVminer formulates PV detection as a multi-label, multi-class prediction task integrating patient-specific BERT encoders (PV-BERT-base and PV-BERT-large), unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and fine-tuned classifiers for Code, Subcode, and Combo-level labels. Topic representations are incorporated during fine-tuning and inference to enrich semantic inputs. PVminer achieves strong performance across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines, achieving F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo). An ablation study further shows that author identity and topic-based augmentation each contribute meaningful gains. Pre-trained models, source code, and documentation will be publicly released, with annotated datasets available upon request for research use.
Chinese Translation
患者生成的文本,如安全消息、调查和访谈,包含了丰富的患者声音(PV)表达,反映了沟通行为和健康的社会决定因素(SDoH)。传统的定性编码框架劳动强度大,无法扩展到跨健康系统的大量患者撰写消息。现有的机器学习(ML)和自然语言处理(NLP)方法提供了部分解决方案,但通常将以患者为中心的沟通(PCC)和SDoH视为独立任务,或依赖于不适合患者语言的模型。我们提出了PVminer,一个为安全的患者-提供者沟通结构化患者声音而适应领域的NLP框架。PVminer将PV检测公式化为一个多标签、多类别的预测任务,整合了患者特定的BERT编码器(PV-BERT-base和PV-BERT-large)、用于主题增强的无监督主题建模(PV-Topic-BERT)以及针对代码、子代码和组合级标签的微调分类器。在微调和推理过程中,主题表示被纳入以丰富语义输入。PVminer在层级任务中表现出色,超越了生物医学和临床预训练基线,达到了82.25%(代码)、80.14%(子代码)和最高77.87%(组合)的F1分数。消融研究进一步表明,作者身份和基于主题的增强各自贡献了显著的提升。预训练模型、源代码和文档将公开发布,带注释的数据集可根据请求用于研究。
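As a small illustration of the multi-label prediction task above, the classifier head can emit every Code-level label whose sigmoid probability clears a threshold, so one message may carry several patient-voice codes at once. The label names and the 0.5 threshold below are illustrative, not PVminer's actual label set.

```python
import math

LABELS = ["PCC", "SDoH", "Logistics"]   # hypothetical Code-level labels

def predict_codes(logits: list[float], threshold: float = 0.5) -> list[str]:
    """Multi-label decision: sigmoid each logit, keep labels above threshold."""
    probs = [1 / (1 + math.exp(-z)) for z in logits]
    return [lab for lab, p in zip(LABELS, probs) if p >= threshold]
```

In the full framework the logits would come from the PV-BERT encoder with topic representations concatenated to its output, but the thresholding step is the same.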
cs.CL / 33 / 2602.21193
On Data Engineering for Scaling LLM Terminal Capabilities
关于数据工程在扩展大型语言模型终端能力中的应用
Abstract
Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collections/nvidia/nemotron-terminal.
Chinese Translation
尽管大型语言模型的终端能力近期取得了快速进展,但当前最先进的终端代理背后的训练数据策略在很大程度上仍未被披露。我们通过对终端代理的数据工程实践进行系统研究来填补这一空白,做出了两项关键贡献:(1) Terminal-Task-Gen,这是一个轻量级的合成任务生成管道,支持基于种子和基于技能的任务构建;(2) 对数据和训练策略的全面分析,包括过滤、课程学习、长上下文训练和扩展行为。我们的管道生成了 Terminal-Corpus,这是一个用于终端任务的大规模开源数据集。利用该数据集,我们训练了 Nemotron-Terminal,这是一系列从 Qwen3(8B, 14B, 32B) 初始化的模型,在 Terminal-Bench 2.0 上取得了显著提升:Nemotron-Terminal-8B 从 2.5% 提升至 13.0%,Nemotron-Terminal-14B 从 4.0% 提升至 20.2%,而 Nemotron-Terminal-32B 从 3.4% 提升至 27.4%,其性能与显著更大的模型相匹配。为了加速该领域的研究,我们在 https://huggingface.co/collections/nvidia/nemotron-terminal 开源了我们的模型检查点和大部分合成数据集。