cs.RO / 1 / 2603.16952
Embodied Foundation Models at the Edge: A Survey of Deployment Constraints and Mitigation Strategies
Abstract
Deploying foundation models in embodied edge systems is fundamentally a systems problem, not just a problem of model compression. Real-time control must operate within strict size, weight, and power constraints, where memory traffic, compute latency, timing variability, and safety margins interact directly. The Deployment Gauntlet organizes these constraints into eight coupled barriers that determine whether embodied foundation models can run reliably in practice. Across representative edge workloads, autoregressive Vision-Language-Action policies are constrained primarily by memory bandwidth, whereas diffusion-based controllers are limited more by compute latency and sustained execution cost. Reliable deployment therefore depends on system-level co-design across memory, scheduling, communication, and model architecture, including decompositions that separate fast control from slower semantic reasoning.
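The memory-bound versus compute-bound distinction drawn for autoregressive versus diffusion policies can be checked with classic roofline reasoning. The sketch below is illustrative only: the accelerator figures (40 TFLOP/s peak, 200 GB/s DRAM bandwidth), the 7B-parameter model, and the batch size are hypothetical numbers, not measurements from the survey.

```python
def attainable_gflops(flops, bytes_moved, peak_gflops, bw_gbs):
    """Roofline model: achieved throughput is capped either by peak compute
    or by arithmetic intensity times memory bandwidth."""
    ai = flops / bytes_moved                      # FLOP per byte
    return min(peak_gflops, ai * bw_gbs), ai

# Hypothetical edge accelerator: 40 TFLOP/s peak, 200 GB/s DRAM bandwidth.
PEAK, BW = 40_000.0, 200.0                        # GFLOP/s, GB/s
RIDGE = PEAK / BW                                 # ridge point: 200 FLOP/byte

# Autoregressive decode reads every weight once per generated token:
# ~2*N FLOPs against ~N bytes (8-bit weights) -> intensity ~2, memory-bound.
perf_ar, ai_ar = attainable_gflops(2 * 7e9, 7e9, PEAK, BW)

# Batched diffusion-style denoising reuses the same weights across a large
# batch, pushing intensity past the ridge point -> compute-bound.
perf_diff, ai_diff = attainable_gflops(2 * 7e9 * 256, 7e9, PEAK, BW)

assert ai_ar < RIDGE < ai_diff
```

Under these toy numbers the decode step tops out at 400 GFLOP/s (bandwidth-limited) while the batched workload hits the 40 TFLOP/s compute roof, which is the qualitative pattern the survey reports.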
cs.RO / 2 / 2603.16978
Rewarding DINO: Predicting Dense Rewards with Vision Foundation Models
Abstract
Well-designed dense reward functions in robot manipulation not only indicate whether a task is completed but also encode progress along the way. Designing dense rewards is challenging, however, and usually requires access to privileged state information that is available only in simulation, not in real-world experiments. This makes reward prediction models that infer task state information from camera images attractive. A common approach is to predict rewards from expert demonstrations based on visual similarity or sequential frame ordering. However, this biases the resulting reward function towards a specific solution and leaves it undefined in states not covered by the demonstrations. In this work, we introduce Rewarding DINO, a method for language-conditioned reward modeling that learns actual reward functions rather than specific trajectories. The model's compact size allows it to serve as a direct replacement for analytical reward functions with comparatively low computational overhead. We train our model on data sampled from 24 Meta-World+ tasks using a rank-based loss and evaluate pairwise accuracy, rank correlation, and calibration. Rewarding DINO achieves competitive performance on tasks from the training set and generalizes to new settings in simulation and the real world, indicating that it learns task semantics. We also test the model with off-the-shelf reinforcement learning algorithms to solve tasks from our Meta-World+ training set.
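The abstract names a rank-based loss without giving its exact form. A common choice for enforcing that frames later in a demonstration score higher than earlier ones is a pairwise logistic ranking loss, sketched here with toy predicted rewards (not values from the paper):

```python
import math

def pairwise_rank_loss(r_early, r_late):
    """Logistic loss that pushes the later (more progressed) frame's
    predicted reward above the earlier frame's: log(1 + exp(-(r_l - r_e)))."""
    return math.log1p(math.exp(-(r_late - r_early)))

# Rewards predicted for three frames of one demonstration, ordered by
# task progress (toy numbers).
rewards = [0.1, 0.4, 0.9]
pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)]
loss = sum(pairwise_rank_loss(rewards[i], rewards[j]) for i, j in pairs) / len(pairs)
```

Because every pair is correctly ordered, the average loss falls below log 2, the value at a tie; a misordered pair would push it above.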
cs.RO / 3 / 2603.17016
Efficient and Reliable Teleoperation through Real-to-Sim-to-Real Shared Autonomy
Abstract
Fine-grained, contact-rich teleoperation remains slow, error-prone, and unreliable in real-world manipulation tasks, even for experienced operators. Shared autonomy offers a promising way to improve performance by combining human intent with automated assistance, but learning effective assistance in simulation requires a faithful model of human behavior, which is difficult to obtain in practice. We propose a real-to-sim-to-real shared autonomy framework that augments human teleoperation with learned corrective behaviors, using a simple yet effective k-nearest-neighbor (kNN) human surrogate to model operator actions in simulation. The surrogate is fit from less than five minutes of real-world teleoperation data and enables stable training of a residual copilot policy with model-free reinforcement learning. The resulting copilot is deployed to assist human operators in real-world fine-grained manipulation tasks. Through simulation experiments and a user study with sixteen participants on industry-relevant tasks, including nut threading, gear meshing, and peg insertion, we show that our system improves task success for novice operators and execution efficiency for experienced operators compared to direct teleoperation and shared-autonomy baselines that rely on expert priors or behavioral-cloning pilots. In addition, copilot-assisted teleoperation produces higher-quality demonstrations for downstream imitation learning.
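A kNN human surrogate of the kind described fits in a few lines: predict the operator's action at a queried state as the average of the actions logged at the k nearest states. The state/action encoding below is a toy stand-in for whatever representation the system actually logs.

```python
def knn_action(query, states, actions, k=3):
    """Predict the operator's action at `query` as the mean action of the
    k nearest logged states (squared Euclidean distance, toy encoding)."""
    dists = [sum((q - s) ** 2 for q, s in zip(query, st)) for st in states]
    nearest = sorted(range(len(states)), key=dists.__getitem__)[:k]
    dim = len(actions[0])
    return [sum(actions[i][d] for i in nearest) / k for d in range(dim)]

# A few seconds of logged teleoperation: gripper (x, y) -> commanded action.
states = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
actions = [(0.0,), (1.0,), (2.0,), (10.0,)]
```

In simulation this surrogate stands in for the human pilot, so the residual copilot can be trained with model-free RL against repeatable "operator" behavior.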
cs.RO / 4 / 2603.17022
Contingency-Aware Planning via Certified Neural Hamilton-Jacobi Reachability
Abstract
Hamilton-Jacobi (HJ) reachability provides formal safety guarantees for dynamical systems, but solving high-dimensional HJ partial differential equations limits its use in real-time planning. This paper presents a contingency-aware multi-goal navigation framework that integrates learning-based reachability with sampling-based planning in unknown environments. We use a Fourier Neural Operator (FNO) to approximate the solution operator of the Hamilton-Jacobi-Isaacs variational inequality under varying obstacle configurations. We first provide a theoretical under-approximation guarantee on the safe backward reach-avoid set, which enables formal safety certification of the learned reachable sets. Then, we integrate the certified reachable sets with an incremental multi-goal planner, which enforces reachable-set constraints and a recovery policy that guarantees finite-time return to a safe region. Overall, we demonstrate that the proposed framework achieves asymptotically optimal navigation with provable contingency behavior, and validate its performance through real-time deployment on KUKA's youBot in Webots simulation.
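The under-approximation principle (certify a state as safe only if the learned value clears the approximation error bound) can be illustrated on a toy 1-D value function. The real guarantee concerns the HJI variational inequality solution; this sketch only shows the margin-shrinking step, with an assumed sup-norm error bound.

```python
def certified_safe(v_learned, eps):
    """Certify a state as safe only if the learned value clears the
    certified approximation error bound eps (margin-shrunk sublevel set)."""
    return v_learned <= -eps

# Toy learned value V(x) = x on a 1-D grid; eps is an assumed error bound.
grid = [i / 10 for i in range(-10, 11)]
eps = 0.2
learned_set = {x for x in grid if x <= 0}                    # raw sublevel set
certified_set = {x for x in grid if certified_safe(x, eps)}  # shrunk by eps
assert certified_set < learned_set   # certified set under-approximates
```

States within eps of the learned zero level set (here x in (-0.2, 0]) are excluded, so any certification error stays on the conservative side.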
cs.RO / 5 / 2603.17065
TeleDex: Accessible Dexterous Teleoperation
Abstract
Despite increasing dataset scale and model capacity, robot manipulation policies still struggle to generalize beyond their training distributions. As a result, deploying state-of-the-art policies in new environments, tasks, or robot embodiments often requires collecting additional demonstrations. Enabling this in real-world deployment settings requires tools that allow users to collect demonstrations quickly, affordably, and with minimal setup. We present TeleDex, an open-source system for intuitive teleoperation of dexterous hands and robotic manipulators using any readily available phone. The system streams low-latency 6-DoF wrist poses and articulated 21-DoF hand state estimates from the phone, which are retargeted to robot arms and multi-fingered hands without requiring external tracking infrastructure. TeleDex supports both a handheld phone-only mode and an optional 3D-printable hand-mounted interface for finger-level teleoperation. By lowering the hardware and setup barriers to dexterous teleoperation, TeleDex enables users to quickly collect demonstrations during deployment to support policy fine-tuning. We evaluate the system across simulation and real-world manipulation tasks, demonstrating its effectiveness as a unified scalable interface for robot teleoperation. All software and hardware designs, along with demonstration videos, are open-source and available at orayyan.com/teledex.
cs.RO / 6 / 2603.17092
SLowRL: Safe Low-Rank Adaptation Reinforcement Learning for Locomotion
Abstract
Sim-to-real transfer of locomotion policies often leads to performance degradation due to the inevitable sim-to-real gap. Naively fine-tuning these policies directly on hardware is problematic, as it poses risks of mechanical failure and suffers from low sample efficiency. In this paper, we address the challenge of safely and efficiently fine-tuning reinforcement learning (RL) policies for dynamic locomotion tasks. Specifically, we focus on fine-tuning policies learned in simulation directly on hardware, while explicitly enforcing safety constraints. In doing so, we introduce SLowRL, a framework that combines Low-Rank Adaptation (LoRA) with training-time safety enforcement via a recovery policy. We evaluate our method both in simulation and on a real Unitree Go2 quadruped robot for jump and trot tasks. Experimental results show that our method achieves a 46.5% reduction in fine-tuning time and near-zero safety violations compared to standard proximal policy optimization (PPO) baselines. Notably, we find that a rank-1 adaptation alone is sufficient to recover pre-trained performance in the real world, while maintaining stable and safe real-world fine-tuning. These results demonstrate the practicality of safe, efficient fine-tuning for dynamic real-world robotic applications.
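The rank-1 adaptation highlighted in the results can be sketched as a factored weight update y = (W + alpha * b a^T) x that is never materialized as a full matrix: only the vectors a and b (n + m extra parameters instead of n * m) are trained. The layer sizes and values below are arbitrary, not the paper's.

```python
def lora_forward(x, W, b_vec, a_vec, alpha=1.0):
    """y = (W + alpha * b a^T) x, with the rank-1 term applied factored:
    one extra dot product (a^T x) and one scaled vector add."""
    base = [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]
    ax = sum(a_vec[j] * x[j] for j in range(len(x)))   # a^T x, a scalar
    return [base[i] + alpha * b_vec[i] * ax for i in range(len(W))]
```

Freezing W and updating only (a, b) is what makes on-hardware fine-tuning cheap; at rank 1 the adapter adds a single outer-product direction to each adapted layer.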
Chinese Translation
运动策略的仿真到现实转移常常导致性能下降,这是由于不可避免的仿真与现实之间的差距。直接在硬件上对这些策略进行简单的微调存在问题,因为这可能导致机械故障的风险,并且样本效率低下。在本文中,我们解决了安全且高效地微调动态运动任务的强化学习(RL)策略的挑战。具体而言,我们专注于在硬件上直接微调在仿真中学习到的策略,同时明确执行安全约束。在此过程中,我们引入了SLowRL,一个将低秩适应(Low-Rank Adaptation, LoRA)与训练时安全执行通过恢复策略相结合的框架。我们在仿真环境和真实的Unitree Go2四足机器人上评估了我们的方法,针对跳跃和小跑任务进行实验。实验结果表明,与标准的近端策略优化(Proximal Policy Optimization, PPO)基线相比,我们的方法在微调时间上减少了46.5%,并且几乎没有安全违规事件。值得注意的是,我们发现仅通过秩-1适应就足以在现实世界中恢复预训练性能,同时保持稳定和安全的现实世界微调。这些结果证明了安全、高效微调在动态现实世界机器人应用中的实用性。
cs.RO / 7 / 2603.17152
Shielded Reinforcement Learning Under Dynamic Temporal Logic Constraints
Abstract
Reinforcement Learning (RL) has shown promise in various robotics applications, yet its deployment on real systems is still limited due to safety and operational constraints. The field of safe RL, which focuses on imposing safety constraints throughout the learning process, has gained considerable attention in recent years. However, real systems often require more complex constraints than just safety, such as periodic recharging or time-bounded visits to specific regions. Imposing such spatio-temporal tasks during learning remains a challenge. Signal Temporal Logic (STL) is a formal language for specifying temporal properties of real-valued signals and provides a way to express such complex tasks. In this paper, we propose a framework that leverages sequential control barrier functions and model-free RL to ensure that the given STL tasks are satisfied throughout the learning process. Our method extends beyond traditional safety constraints by enforcing rich STL specifications, which can involve visits to dynamic targets with unknown trajectories. We also demonstrate the effectiveness of our framework through various simulations.
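An STL task such as a time-bounded visit to a moving target has a standard quantitative robustness semantics: the formula F_[t0,t1](||x_t - target(t)|| <= r) is satisfied exactly when its robustness is positive. This is a minimal sketch with a toy trajectory and target, not the paper's barrier-function construction.

```python
import math

def robustness_visit(traj, target, radius, t0, t1):
    """Quantitative robustness of F_[t0,t1]( ||x_t - target(t)|| <= radius ):
    positive iff some step in the window gets within `radius` of the
    (possibly moving) target."""
    vals = []
    for t in range(t0, t1 + 1):
        gx, gy = target(t)
        x, y = traj[t]
        vals.append(radius - math.hypot(x - gx, y - gy))
    return max(vals)   # "eventually" = max over the window

# Toy: a target drifting along +x; the agent closes in at t = 2.
traj = [(0.0, 5.0), (1.0, 3.0), (2.0, 0.5), (3.0, 2.0)]
target = lambda t: (float(t), 0.0)
rho = robustness_visit(traj, target, 1.0, 0, 3)   # > 0: formula satisfied
```

Shrinking the radius to 0.25 flips the robustness negative, which is the kind of margin signal a learning-time enforcement scheme can act on.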
cs.RO / 8 / 2603.17165
SLAM Adversarial Lab: An Extensible Framework for Visual SLAM Robustness Evaluation under Adverse Conditions
Abstract
We present SAL (SLAM Adversarial Lab), a modular framework for evaluating visual SLAM systems under adversarial conditions such as fog and rain. SAL represents each adversarial condition as a perturbation that transforms an existing dataset into an adversarial dataset. When transforming a dataset, SAL supports severity levels using easily interpretable real-world units, such as meters for fog visibility. SAL's extensible architecture decouples datasets, perturbations, and SLAM algorithms through common interfaces, so users can add new components without rewriting integration code. Moreover, SAL includes a search procedure that finds the severity level of a perturbation at which a SLAM system fails. To showcase the capabilities of SAL, our evaluation integrates seven SLAM algorithms and tests them across three datasets under weather, camera, and video transport perturbations.
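The abstract does not specify the search procedure; if failure is assumed monotone in severity, bisection over severity levels is one natural sketch (the SLAM evaluation is stubbed out with a hypothetical threshold function).

```python
def failure_threshold(runs_ok, lo, hi, tol=0.5):
    """Bisect for the severity at which a SLAM run starts failing, assuming
    failure is monotone in severity and runs_ok(hi) is already a failure."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if runs_ok(mid):
            lo = mid          # still succeeds: failure point lies above mid
        else:
            hi = mid          # fails: failure point is at or below mid
    return hi

# Stand-in for an actual SLAM evaluation: fails once severity reaches 120.
sev = failure_threshold(lambda s: s < 120.0, 0.0, 200.0)
```

Each probe of `runs_ok` would be a full SLAM run on a perturbed dataset, so logarithmic convergence in the severity range matters.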
cs.RO / 9 / 2603.17189
Influence of Gripper Design on Human Demonstration Quality for Robot Learning
Abstract
Opening sterile medical packaging is routine for healthcare workers but remains challenging for robots. Learning from demonstration enables robots to acquire manipulation skills directly from humans, and handheld gripper tools such as the Universal Manipulation Interface (UMI) offer a pathway for efficient data collection. However, the effectiveness of these tools depends heavily on their usability. We evaluated UMI in demonstrating a bandage opening task, a common manipulation task in hospital settings, by testing three conditions: distributed load grippers, concentrated load grippers, and bare hands. Eight participants performed timed trials, with task performance assessed by success rate, completion time, and damage, alongside perceived workload using the NASA-TLX questionnaire. Concentrated load grippers improved performance relative to distributed load grippers but remained substantially slower and less effective than hands. These results underscore the importance of ergonomic and mechanical refinements in handheld grippers to reduce user burden and improve demonstration quality, especially for applications in healthcare robotics.
cs.RO / 10 / 2603.17201
FastLoop: Parallel Loop Closing with GPU-Acceleration in Visual SLAM
Abstract
Visual SLAM systems combine visual tracking with global loop closure to maintain a consistent map and accurate localization. Loop closure is computationally expensive because candidate matches must be searched across the whole map. This paper presents FastLoop, a GPU-accelerated loop closing module that alleviates this computational cost. We identify key performance bottlenecks in the loop closing pipeline of visual SLAM and address them through parallel optimizations on the GPU. Specifically, we use task-level and data-level parallelism and integrate a GPU-accelerated pose graph optimization. Our implementation is built on top of ORB-SLAM3 and leverages CUDA for GPU programming. Experimental results show that FastLoop speeds up the loop closing module by an average of 1.4x (desktop) and 1.3x (embedded) on the EuRoC dataset, and 3.0x (desktop) and 2.4x (embedded) on the TUM-VI dataset, while maintaining the accuracy of the original system.
cs.RO / 11 / 2603.17229
Visual SLAM with DEM Anchoring for Lunar Surface Navigation
Abstract
Future lunar missions will require autonomous rovers capable of traversing tens of kilometers across challenging terrain while maintaining accurate localization and producing globally consistent maps. However, the absence of global positioning systems, extreme illumination, and low-texture regolith make long-range navigation on the Moon particularly difficult, as visual-inertial odometry pipelines accumulate drift over extended traverses. To address this challenge, we present a stereo visual simultaneous localization and mapping (SLAM) system that integrates learned feature detection and matching with global constraints from digital elevation models (DEMs). Our front-end employs learning-based feature extraction and matching to achieve robustness to illumination extremes and repetitive terrain, while the back-end incorporates DEM-derived height and surface-normal factors into a pose graph, providing absolute surface constraints that mitigate long-term drift. We validate our approach using both simulated lunar traverse data generated in Unreal Engine and real Moon/Mars analog data collected from Mt. Etna. Results demonstrate that DEM anchoring consistently reduces absolute trajectory error compared to baseline SLAM methods, lowering drift in long-range navigation even in repetitive or visually aliased terrain.
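One plausible form of the DEM-derived height factor is a residual between the estimated altitude and the interpolated DEM height under the estimated planar position; a positive residual pulls the pose down toward the terrain. The bilinear interpolation, unit grid spacing, and toy DEM below are assumptions for illustration, not the paper's exact factor.

```python
def dem_height(dem, x, y):
    """Bilinear interpolation on a regular DEM grid (unit spacing assumed,
    dem[i][j] is the height at grid point (i, j), non-negative coords)."""
    i, j = int(x), int(y)
    fx, fy = x - i, y - j
    return ((1 - fx) * (1 - fy) * dem[i][j] + fx * (1 - fy) * dem[i + 1][j]
            + (1 - fx) * fy * dem[i][j + 1] + fx * fy * dem[i + 1][j + 1])

def height_residual(pose_xyz, dem):
    """Residual a DEM height factor would add to the pose graph: estimated
    altitude minus terrain height under the estimated planar position."""
    x, y, z = pose_xyz
    return z - dem_height(dem, x, y)
```

In a factor-graph backend this residual (weighted by the DEM's vertical uncertainty) supplies the absolute surface constraint that bounds long-traverse drift.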
cs.RO / 12 / 2603.17232
Full Stack Navigation, Mapping, and Planning for the Lunar Autonomy Challenge
Abstract
We present a modular, full-stack autonomy system for lunar surface navigation and mapping developed for the Lunar Autonomy Challenge. Operating in a GNSS-denied, visually challenging environment, our pipeline integrates semantic segmentation, stereo visual odometry, pose graph SLAM with loop closures, and layered planning and control. We leverage lightweight learning-based perception models for real-time segmentation and feature tracking and use a factor-graph backend to maintain globally consistent localization. High-level waypoint planning is designed to promote mapping coverage while encouraging frequent loop closures, and local motion planning uses arc sampling with geometric obstacle checks for efficient, reactive control. We evaluate our approach in the competition's high-fidelity lunar simulator, demonstrating centimeter-level localization accuracy, high-fidelity map generation, and strong repeatability across random seeds and rock distributions. Our solution achieved first place in the final competition evaluation.
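Arc sampling with geometric obstacle checks can be sketched as rolling out constant-curvature unicycle arcs and keeping the collision-free arc with the most forward progress. The velocities, horizon, clearance radius, and progress criterion below are illustrative choices, not the competition system's parameters.

```python
import math

def arc_points(v, omega, horizon, steps):
    """Euler rollout of a constant (v, omega) unicycle arc from the origin."""
    pts, x, y, th, dt = [], 0.0, 0.0, 0.0, horizon / steps
    for _ in range(steps):
        x += v * math.cos(th) * dt
        y += v * math.sin(th) * dt
        th += omega * dt
        pts.append((x, y))
    return pts

def best_arc(omegas, obstacles, radius, v=0.5, horizon=2.0, steps=20):
    """Pick the arc with the most forward progress whose rollout keeps at
    least `radius` clearance from every obstacle point; None if all fail."""
    def safe(pts):
        return all(math.hypot(px - ox, py - oy) >= radius
                   for px, py in pts for ox, oy in obstacles)
    feasible = [(w, pts) for w in omegas
                if safe(pts := arc_points(v, w, horizon, steps))]
    return max(feasible, key=lambda wp: wp[1][-1][0])[0] if feasible else None

chosen = best_arc([-0.8, 0.0, 0.8], [(1.0, 0.0)], 0.3)  # straight arc blocked
```

With a rock directly ahead the straight arc is rejected and a curved arc is returned; with nothing ahead the straight arc wins on forward progress.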
cs.RO / 13 / 2603.17236
Neural Radiance Maps for Extraterrestrial Navigation and Path Planning
Abstract
Autonomous vehicles such as the Mars rovers currently lead the vanguard of surface exploration on extraterrestrial planets and moons. In order to accelerate the pace of exploration and science objectives, it is critical to plan safe and efficient paths for these vehicles. However, current rover autonomy is limited by a lack of global maps which can be easily constructed and stored for onboard re-planning. Recently, Neural Radiance Fields (NeRFs) have been introduced as a detailed 3D scene representation which can be trained from sparse 2D images and efficiently stored. We propose to use NeRFs to construct maps for online use in autonomous navigation, and present a planning framework which leverages the NeRF map to integrate local and global information. Our approach interpolates local cost observations across global regions using kernel ridge regression over terrain features extracted from the NeRF map, allowing the rover to re-route itself around untraversable areas discovered during online operation. We validate our approach in high-fidelity simulation and demonstrate lower cost and higher percentage success rate path planning compared to various baselines.
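Kernel ridge regression over terrain features, used above to spread local cost observations across global regions, has a compact closed form: solve (K + lam*I) alpha = c, then predict with kernel evaluations against the training features. The RBF kernel and toy 1-D features below are assumptions for illustration; the paper's features come from the NeRF map.

```python
import numpy as np

def krr_fit(F, c, sigma=1.0, lam=1e-3):
    """Kernel ridge regression: dual weights alpha = (K + lam*I)^-1 c with
    an RBF kernel over terrain feature vectors F and observed costs c."""
    F = np.asarray(F, float)
    sq = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    return np.linalg.solve(K + lam * np.eye(len(F)), np.asarray(c, float))

def krr_predict(Fq, F, alpha, sigma=1.0):
    """Predicted cost at query features Fq: k(Fq, F) @ alpha."""
    Fq, F = np.asarray(Fq, float), np.asarray(F, float)
    sq = np.sum((Fq[:, None, :] - F[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2)) @ alpha

# Toy 1-D "terrain features" with locally observed traversal costs.
F = [[0.0], [1.0], [2.0]]
alpha = krr_fit(F, [0.0, 1.0, 2.0])
pred = krr_predict([[1.0], [0.5]], F, alpha)
```

A new high-cost observation (e.g. an untraversable patch found online) only requires re-solving the small dual system, which is what makes onboard re-planning practical.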
cs.RO / 14 / 2603.17300
ReSteer: Quantifying and Refining the Steerability of Multitask Robot Policies
Abstract
Despite strong multi-task pretraining, existing policies often exhibit poor task steerability. For example, a robot may fail to respond to a new instruction "put the bowl in the sink" while moving towards the oven to execute "close the oven", even though it can complete both tasks when executed separately. We propose ReSteer, a framework to quantify and improve task steerability in multitask robot policies. We conduct an exhaustive evaluation of state-of-the-art policies, revealing a common lack of steerability. We find that steerability is associated with limited overlap among training task trajectory distributions, and introduce a proxy metric to measure this overlap from policy behavior. Building on this insight, ReSteer improves steerability via three components: (i) a steerability estimator that identifies low-steerability states without full-rollout evaluation, (ii) a steerable data generator that synthesizes motion segments from these states, and (iii) a self-refinement pipeline that improves policy steerability using the generated data. In simulation on LIBERO, ReSteer improves steerability by 11% over 18k rollouts. In real-world experiments, we show that improved steerability is critical for interactive use, enabling users to instruct robots to perform any task at any time. We hope this work motivates further study on quantifying steerability and data collection strategies for large robot policies.
cs.RO / 15 / 2603.17323
DexEXO: A Wearability-First Dexterous Exoskeleton for Operator-Agnostic Demonstration and Learning
Abstract
Scaling dexterous robot learning is constrained by the difficulty of collecting high-quality demonstrations across diverse operators. Existing wearable interfaces often trade comfort and cross-user adaptability for kinematic fidelity, while embodiment mismatch between demonstration and deployment requires visual post-processing before policy training. We present DexEXO, a wearability-first hand exoskeleton that aligns visual appearance, contact geometry, and kinematics at the hardware level. DexEXO features a pose-tolerant thumb mechanism and a slider-based finger interface analytically modeled to support hand lengths from 140 mm to 217 mm, reducing operator-specific fitting and enabling scalable cross-operator data collection. A passive hand visually matches the deployed robot, allowing direct policy training from raw wrist-mounted RGB observations. User studies demonstrate improved comfort and usability compared to prior wearable systems. Using visually aligned observations alone, we train diffusion policies that achieve competitive performance while substantially simplifying the end-to-end pipeline. These results show that prioritizing wearability and hardware-level embodiment alignment reduces both human and algorithmic bottlenecks without sacrificing task performance. Project Page: https://dexexo-research.github.io/
cs.RO / 16 / 2603.17351
OmniVLN: Omnidirectional 3D Perception and Token-Efficient LLM Reasoning for Visual-Language Navigation across Air and Ground Platforms
Abstract
Language-guided embodied navigation requires an agent to interpret object-referential instructions, search across multiple rooms, localize the referenced target, and execute reliable motion toward it. Existing systems remain limited in real indoor environments because narrow field-of-view sensing exposes only a partial local scene at each step, often forcing repeated rotations, delaying target discovery, and producing fragmented spatial understanding; meanwhile, directly prompting LLMs with dense 3D maps or exhaustive object lists quickly exceeds the context budget. We present OmniVLN, a zero-shot visual-language navigation framework that couples omnidirectional 3D perception with token-efficient hierarchical reasoning for both aerial and ground robots. OmniVLN fuses a rotating LiDAR and panoramic vision into a hardware-agnostic mapping stack, incrementally constructs a five-layer Dynamic Scene Graph (DSG) from mesh geometry to room- and building-level structure, and stabilizes high-level topology through persistent-homology-based room partitioning and hybrid geometric/VLM relation verification. For navigation, the global DSG is transformed into an agent-centric 3D octant representation with multi-resolution spatial attention prompting, enabling the LLM to progressively filter candidate rooms, infer egocentric orientation, localize target objects, and emit executable navigation primitives while preserving fine local detail and compact long-range memory. Experiments show that the proposed hierarchical interface improves spatial referring accuracy from 77.27% to 93.18%, reduces cumulative prompt tokens by up to 61.7% in cluttered multi-room settings, and improves navigation success by up to 11.68% over a flat-list baseline. We will release the code and an omnidirectional multimodal dataset to support reproducible research.
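One simple way to realize an agent-centric octant representation is a 3-bit sign code of each object's displacement from the agent, so scene-graph nodes can be grouped into eight spatial buckets before prompting. The bit layout below is an assumption for illustration, not the paper's exact encoding.

```python
def octant_index(agent_xyz, obj_xyz):
    """3-bit agent-centric octant code from the sign of each relative axis
    (bit 0: +x half-space, bit 1: +y, bit 2: +z). Layout is an assumption."""
    dx, dy, dz = (o - a for o, a in zip(obj_xyz, agent_xyz))
    return int(dx >= 0) | (int(dy >= 0) << 1) | (int(dz >= 0) << 2)
```

Bucketing candidate objects by octant index lets a prompt name whole regions ("behind and above you") instead of enumerating every node, which is where the token savings come from.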
cs.RO / 17 / 2603.17416
Physics-informed Deep Mixture-of-Koopmans Vehicle Dynamics Model with Dual-branch Encoder for Distributed Electric-drive Trucks
Abstract
Advanced autonomous driving systems require accurate vehicle dynamics modeling. However, identifying a precise dynamics model remains challenging due to strong nonlinearities and the coupled longitudinal and lateral dynamic characteristics. Previous research has employed physics-based analytical models or neural networks to construct vehicle dynamics representations. Nevertheless, these approaches often struggle to simultaneously achieve satisfactory performance in terms of system identification efficiency, modeling accuracy, and compatibility with linear control strategies. In this paper, we propose a fully data-driven dynamics modeling method tailored for complex distributed electric-drive trucks (DETs), leveraging Koopman operator theory to represent highly nonlinear dynamics in a lifted linear embedding space. To achieve high-precision modeling, we first propose a novel dual-branch encoder which encodes dynamic states and provides a powerful basis for the proposed Koopman-based method, termed KODE. A physics-informed supervision mechanism, grounded in the geometric consistency of temporal vehicle motion, is incorporated into the training process to facilitate effective learning of both the encoder and the Koopman operator. Furthermore, to accommodate the diverse driving patterns of DETs, we extend the vanilla Koopman operator to a mixture-of-Koopman operator framework, enhancing modeling capability. Simulations conducted in a high-fidelity TruckSim environment and real-world experiments demonstrate that the proposed approach achieves state-of-the-art performance in long-term dynamics state estimation.
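Koopman-style lifted linear modeling is commonly fit with extended DMD: lift state snapshots through a dictionary of observables and solve a least-squares problem for the linear operator. The sketch uses a fixed toy dictionary and scalar dynamics where KODE learns an encoder and models a truck; only the fitting principle carries over.

```python
import numpy as np

def edmd(X, Y, lift):
    """Extended DMD: fit a linear operator K in the lifted space so that
    lift(x_next) ≈ K @ lift(x), via least squares over snapshot pairs."""
    PX = np.array([lift(x) for x in X])          # (N, d_lift)
    PY = np.array([lift(y) for y in Y])
    K, *_ = np.linalg.lstsq(PX, PY, rcond=None)  # solves PX @ K ≈ PY
    return K.T                                   # lifted dynamics matrix

lift = lambda x: [x, x * x]        # toy dictionary psi(x) = (x, x^2)
lam = 0.9                          # toy dynamics x_{k+1} = 0.9 x_k
X = [0.1, 0.5, 1.0, 1.5, 2.0]
Y = [lam * x for x in X]
K = edmd(X, Y, lift)
# Here the lifted dynamics are exactly linear: K = diag(0.9, 0.81),
# since x evolves by 0.9 and x^2 by 0.81.
```

The payoff is that prediction and control in the lifted space are linear-algebra operations, which is why Koopman models pair naturally with linear control strategies.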
Chinese Translation
先进的自动驾驶系统需要准确的车辆动力学建模。然而,由于强非线性以及纵向和横向动态特性的耦合,识别精确的动力学模型仍然具有挑战性。以往的研究采用基于物理的分析模型或神经网络来构建车辆动力学表示。然而,这些方法往往难以在系统识别效率、建模准确性和与线性控制策略的兼容性方面同时取得令人满意的性能。本文提出了一种完全数据驱动的动力学建模方法,专门针对复杂的分布式电驱动卡车(DETs),利用库普曼算子理论在提升的线性嵌入空间中表示高度非线性的动力学。为了实现高精度建模,我们首先提出了一种新颖的双分支编码器,该编码器对动态状态进行编码,并为所提出的基于库普曼的方法(称为 KODE)提供了强大的基础。训练过程中引入了一种基于时间车辆运动几何一致性的物理知识监督机制,以促进编码器和库普曼算子的有效学习。此外,为了适应 DETs 的多样化驾驶模式,我们将传统的库普曼算子扩展到混合库普曼算子框架,从而增强建模能力。在高保真度的 TruckSim 环境和真实世界实验中进行的仿真表明,所提出的方法在长期动力学状态估计中达到了最先进的性能。
cs.RO / 18 / 2603.17430
SafeLand: Safe Autonomous Landing in Unknown Environments with Bayesian Semantic Mapping
Abstract
Autonomous landing of uncrewed aerial vehicles (UAVs) in unknown, dynamic environments poses significant safety challenges, particularly near people and infrastructure, as UAVs transition to routine urban and rural operations. Existing methods often rely on prior maps, heavy sensors like LiDAR, static markers, or fail to handle non-cooperative dynamic obstacles like humans, limiting generalization and real-time performance. To address these challenges, we introduce SafeLand, a lean, vision-based system for safe autonomous landing (SAL) that requires no prior information and operates only with a camera and a lightweight height sensor. Our approach constructs an online semantic ground map via deep learning-based semantic segmentation, optimized for embedded deployment and trained on a consolidation of seven curated public aerial datasets (achieving 70.22% mIoU across 20 classes), which is further refined through Bayesian probabilistic filtering with temporal semantic decay to robustly identify metric-scale landing spots. A behavior tree then governs adaptive landing, iteratively validates the spot, and reacts in real time to dynamic obstacles by pausing, climbing, or rerouting to alternative spots, maximizing human safety. We extensively evaluate our method in 200 simulations and 60 end-to-end field tests across industrial, urban, and rural environments at altitudes up to 100m, demonstrating zero false negatives for human detection. Compared to the state of the art, SafeLand achieves sub-second response latency, substantially lower than previous methods, while maintaining a superior success rate of 95%. To facilitate further research in aerial robotics, we release SafeLand's segmentation model as a plug-and-play ROS package, available at https://github.com/markus-42/SafeLand.
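The Bayesian filtering with temporal semantic decay can be sketched as a per-cell class update that first relaxes the prior toward uniform (so stale evidence fades over time) before fusing the new segmentation likelihood. The decay rule and two-class toy example below are an assumed form, not the paper's exact filter.

```python
def update_cell(prior, likelihood, decay=0.9):
    """Per-cell Bayesian class update with temporal decay: relax the prior
    toward uniform, multiply by the new segmentation likelihood, renormalise."""
    n = len(prior)
    relaxed = [decay * p + (1 - decay) / n for p in prior]   # evidence fades
    post = [r * l for r, l in zip(relaxed, likelihood)]
    s = sum(post)
    return [p / s for p in post]

# Toy two-class cell ("safe ground", "human"): a cell believed safe flips
# once fresh frames consistently report a person there.
post = update_cell([0.7, 0.3], [0.2, 0.8])
```

The decay term is what lets a previously "safe" cell lose confidence when a pedestrian walks in, triggering the behavior tree's pause/climb/reroute reactions.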
Chinese Translation
在未知动态环境中,无人驾驶飞行器(UAV)的自主着陆面临重大安全挑战,尤其是在靠近人员和基础设施的情况下,因为无人机正在向常规城市和农村操作过渡。现有方法通常依赖于先验地图、重型传感器(如激光雷达)、静态标记,或无法处理像人类这样的非合作动态障碍物,从而限制了其泛化能力和实时性能。为了解决这些挑战,我们提出了SafeLand,这是一种精简的基于视觉的安全自主着陆(SAL)系统,无需先验信息,仅依靠相机和轻量级高度传感器进行操作。我们的方法通过基于深度学习的语义分割构建在线语义地面地图,优化用于嵌入式部署,并在七个精心策划的公共航空数据集的整合上进行训练(在20个类别中实现70.22%的mIoU),进一步通过贝叶斯概率滤波和时间语义衰减进行精炼,以稳健地识别度量尺度的着陆点。然后,行为树管理自适应着陆,迭代验证着陆点,并实时对动态障碍物作出反应,通过暂停、爬升或重新规划到替代着陆点来最大化人类安全。我们在200次模拟和60次端到端的实地测试中对我们的方法进行了广泛评估,涵盖工业、城市和农村环境,飞行高度达到100米,人体检测实现零漏检(零假阴性)。与现有技术相比,SafeLand实现了亚秒级响应延迟,显著低于以前的方法,同时保持了95%的优越成功率。为了促进空中机器人领域的进一步研究,我们将SafeLand的分割模型作为即插即用的ROS包发布,地址为https://github.com/markus-42/SafeLand。
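The map-refinement step — Bayesian probabilistic filtering with temporal semantic decay — can be sketched per grid cell as a recursive posterior update plus a decay toward the uniform prior for unobserved frames. The class names, likelihoods, and decay rule below are illustrative assumptions, not SafeLand's exact formulation:

```python
# Per-cell recursive Bayesian update over semantic classes, with a temporal
# decay that relaxes the posterior back toward uniform when the cell goes
# unobserved (illustrative sketch; not SafeLand's implementation).
CLASSES = ["safe_ground", "obstacle", "person"]

def normalize(p):
    s = sum(p.values())
    return {k: v / s for k, v in p.items()}

def bayes_update(prior, likelihood):
    # posterior ∝ likelihood * prior
    return normalize({k: likelihood[k] * prior[k] for k in prior})

def temporal_decay(p, lam=0.2):
    # blend toward the uniform prior: stale evidence counts for less
    u = 1.0 / len(p)
    return normalize({k: (1 - lam) * v + lam * u for k, v in p.items()})

cell = {k: 1.0 / len(CLASSES) for k in CLASSES}
# two consistent segmentation observations favouring "safe_ground" ...
obs = {"safe_ground": 0.7, "obstacle": 0.2, "person": 0.1}
for _ in range(2):
    cell = bayes_update(cell, obs)
# ... then several unobserved frames decay the confidence
for _ in range(5):
    cell = temporal_decay(cell)
```

The decay keeps a landing spot from staying "safe" indefinitely after the camera stops observing it, which matters when people can walk into the scene.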
cs.RO / 19 / 2603.17437
FloorPlan-VLN: A New Paradigm for Floor Plan Guided Vision-Language Navigation
FloorPlan-VLN:一种基于平面图引导的视觉-语言导航新范式
Abstract
Existing Vision-Language Navigation (VLN) task requires agents to follow verbose instructions, ignoring some potentially useful global spatial priors, limiting their capability to reason about spatial structures. Although human-readable spatial schematics (e.g., floor plans) are ubiquitous in real-world buildings, current agents lack the cognitive ability to comprehend and utilize them. To bridge this gap, we introduce \textbf{FloorPlan-VLN}, a new paradigm that leverages structured semantic floor plans as global spatial priors to enable navigation with only concise instructions. We first construct the FloorPlan-VLN dataset, which comprises over 10k episodes across 72 scenes. It pairs more than 100 semantically annotated floor plans with Matterport3D-based navigation trajectories and concise instructions that omit step-by-step guidance. Then, we propose a simple yet effective method \textbf{FP-Nav} that uses a dual-view, spatio-temporally aligned video sequence, and auxiliary reasoning tasks to align observations, floor plans, and instructions. When evaluated under this new benchmark, our method significantly outperforms adapted state-of-the-art VLN baselines, achieving more than a 60\% relative improvement in navigation success rate. Furthermore, comprehensive noise modeling and real-world deployments demonstrate the feasibility and robustness of FP-Nav to actuation drift and floor plan distortions. These results validate the effectiveness of floor plan guided navigation and highlight FloorPlan-VLN as a promising step toward more spatially intelligent navigation.
Chinese Translation
现有的视觉-语言导航(VLN)任务要求代理遵循冗长的指令,忽视了一些潜在有用的全局空间先验,限制了其推理空间结构的能力。尽管人类可读的空间示意图(例如,平面图)在现实建筑中无处不在,但当前的代理缺乏理解和利用这些信息的认知能力。为了解决这一问题,我们提出了\textbf{FloorPlan-VLN},一种利用结构化语义平面图作为全局空间先验的新范式,以仅依赖简洁指令进行导航。我们首先构建了FloorPlan-VLN数据集,该数据集包含72个场景中的超过10,000个导航情节。它将100多个语义标注的平面图与基于Matterport3D的导航轨迹和省略逐步指导的简洁指令配对。然后,我们提出了一种简单而有效的方法\textbf{FP-Nav},该方法使用双视图、时空对齐的视频序列和辅助推理任务来对齐观察结果、平面图和指令。在这一新基准下进行评估时,我们的方法显著优于适应的最新VLN基线,导航成功率相对提高超过60\%。此外,全面的噪声建模和现实世界的部署展示了FP-Nav对执行漂移和平面图失真的可行性与鲁棒性。这些结果验证了平面图引导导航的有效性,并突出了FloorPlan-VLN作为迈向更具空间智能导航的有前景的一步。
cs.RO / 20 / 2603.17459
P$^{3}$Nav: End-to-End Perception, Prediction and Planning for Vision-and-Language Navigation
P$^{3}$Nav:用于视觉与语言导航的端到端感知、预测与规划
Abstract
In Vision-and-Language Navigation (VLN), an agent is required to plan a path to the target specified by the language instruction, using its visual observations. Consequently, prevailing VLN methods primarily focus on building powerful planners through visual-textual alignment. However, these approaches often bypass the imperative of comprehensive scene understanding prior to planning, leaving the agent with insufficient perception or prediction capabilities. Thus, we propose P$^{3}$Nav, a novel end-to-end framework integrating perception, prediction, and planning in a unified pipeline to strengthen the VLN agent's scene understanding and boost navigation success. Specifically, P$^{3}$Nav augments perception by extracting complementary cues from object-level and map-level perspectives. Subsequently, our P$^{3}$Nav predicts waypoints to model the agent's potential future states, endowing the agent with intrinsic awareness of candidate positions during navigation. Conditioned on these future waypoints, P$^{3}$Nav further forecasts semantic map cues, enabling proactive planning and reducing the strict reliance on purely historical context. Integrating these perceptual and predictive cues, a holistic planning module finally carries out the VLN tasks. Extensive experiments demonstrate that our P$^{3}$Nav achieves new state-of-the-art performance on the REVERIE, R2R-CE, and RxR-CE benchmarks.
Chinese Translation
在视觉与语言导航(VLN)中,代理需要根据语言指令规划通往目标的路径,并利用其视觉观察。因此,现有的VLN方法主要集中在通过视觉-文本对齐构建强大的规划器。然而,这些方法往往忽视了在规划之前进行全面场景理解的必要性,导致代理的感知或预测能力不足。因此,我们提出了P$^{3}$Nav,一个新颖的端到端框架,将感知、预测和规划整合在一个统一的流程中,以增强VLN代理的场景理解并提高导航成功率。具体而言,P$^{3}$Nav通过从对象级和地图级视角提取互补线索来增强感知。随后,我们的P$^{3}$Nav预测航点,以建模代理的潜在未来状态,使代理在导航过程中对候选位置具有内在意识。在这些未来航点的条件下,P$^{3}$Nav进一步预测语义地图线索,从而实现主动规划,减少对纯历史上下文的严格依赖。通过整合这些感知和预测线索,最终的整体规划模块执行VLN任务。大量实验表明,我们的P$^{3}$Nav在REVERIE、R2R-CE和RxR-CE基准测试中达到了新的最先进性能。
cs.RO / 21 / 2603.17472
Bringing Network Coding into Multi-Robot Systems: Interplay Study for Autonomous Systems over Wireless Communications
将网络编码引入多机器人系统:无线通信中自主系统的相互作用研究
Abstract
Communication is a core enabler for multi-robot systems (MRS), providing the mechanism through which robots exchange state information, coordinate actions, and satisfy safety constraints. While many MRS autonomy algorithms assume reliable and timely message delivery, realistic wireless channels introduce delay, erasures, and ordering stalls that can degrade performance and compromise safety-critical decisions of the robot task. In this paper, we investigate how transport-layer reliability mechanisms that mitigate communication losses and delays shape the autonomy-communication loop. We show that conventional non-coded retransmission-based protocols introduce long delays that are misaligned with the timeliness requirements of MRS applications, and may render the received data irrelevant. As an alternative, we advocate for adaptive and causal network coding, which proactively injects coded redundancy to achieve the desired delay and throughput that enable relevant data delivery to the robotic task. Specifically, this method adapts to channel conditions between robots and causally tunes the communication rates via efficient algorithms. We present two case studies: cooperative localization under delayed and lossy inter-robot communication, and a safety-critical overtaking maneuver where timely vehicle-to-vehicle message availability determines whether an ego vehicle can abort to avoid a crash. Our results demonstrate that coding-based communication significantly reduces in-order delivery stalls, preserves estimation consistency under delay, and improves deadline reliability relative to retransmission-based transport. Overall, the study highlights the need to jointly design autonomy algorithms and communication mechanisms, and positions network coding as a principled tool for dependable multi-robot operation over wireless networks.
Chinese Translation
通信是多机器人系统(MRS)的核心推动力,为机器人交换状态信息、协调行动和满足安全约束提供了机制。虽然许多MRS自主算法假设消息能够可靠及时地传递,但现实中的无线信道引入了延迟、丢包和顺序停滞,这可能会降低性能并危及机器人任务的安全关键决策。本文研究了用于缓解通信损失和延迟的传输层可靠性机制如何塑造自主-通信循环。我们表明,传统的基于非编码重传的协议引入了与MRS应用的时效性要求不匹配的长延迟,并可能使接收到的数据变得无关。作为替代,我们提倡采用自适应和因果网络编码,该方法主动注入编码冗余,以实现所需的延迟和吞吐量,从而使相关数据能够传递给机器人任务。具体而言,该方法根据机器人之间的信道条件进行适应,并通过高效算法因果调节通信速率。我们展示了两个案例研究:在延迟和丢包的机器人间通信下的协作定位,以及一个安全关键的超车操作,其中及时的车对车消息可用性决定了自车是否可以中止以避免碰撞。我们的结果表明,基于编码的通信显著减少了顺序交付停滞,在延迟下保持了估计一致性,并相对于基于重传的传输提高了截止时间的可靠性。总体而言,本研究强调了自主算法与通信机制的联合设计需求,并将网络编码定位为在无线网络上实现可靠多机器人操作的原则性工具。
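The in-order delivery stall the abstract describes can be made concrete with a toy simulation comparing retransmission-based delivery against proactive coded redundancy. The channel model and the idealized code (any k received coded packets reconstruct a k-packet generation, as in rateless/fountain-style coding) are illustrative assumptions, not the paper's protocol:

```python
import random

# Toy in-order delivery delay: non-coded ARQ retransmission vs. proactive
# coded redundancy (illustrative assumption: an ideal rateless code where
# ANY k received coded packets reconstruct a k-packet generation).
LOSS, RTT, SLOT = 0.3, 8, 1   # erasure prob., retransmission delay, slot time

def arq_delay(k, rng):
    # Each packet is retransmitted after an RTT until delivered; delivery is
    # in-order, so later packets stall behind earlier losses.
    t = 0
    for _ in range(k):
        t += SLOT
        while rng.random() < LOSS:
            t += RTT          # NACK/timeout round trip before the resend
    return t

def coded_delay(k, rng):
    # Stream coded packets back-to-back; decoding (hence in-order release of
    # the whole generation) completes once any k of them survive the channel.
    got, t = 0, 0
    while got < k:
        t += SLOT
        if rng.random() >= LOSS:
            got += 1
    return t

rng = random.Random(0)
arq = sum(arq_delay(8, rng) for _ in range(2000)) / 2000
rng = random.Random(0)
coded = sum(coded_delay(8, rng) for _ in range(2000)) / 2000
```

Under lossy channels the coded stream never waits a round trip to repair an erasure, which is exactly the deadline-friendliness the case studies exploit.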
cs.RO / 22 / 2603.17497
From Optimizable to Interactable: Mixed Digital Twin-Empowered Testing of Vehicle-Infrastructure Cooperation Systems
从可优化到可交互:混合数字双胞胎赋能的车辆-基础设施合作系统测试
Abstract
Sufficient testing under corner cases is critical for the long-term operation of vehicle-infrastructure cooperation systems (VICS). However, existing corner-case generation methods are primarily AI-driven, and VICS testing under corner cases is typically limited to simulation. In this paper, we introduce an L5 ''Interactable'' level to the VICS digital twin (VICS-DT) taxonomy, extending beyond the conventional L4 ''Optimizable'' level. We further propose an L5-level VICS testing framework, IMPACT (Interactive Mixed-digital-twin Paradigm for Advanced Cooperative vehicle-infrastructure Testing). By enabling direct human interactions with VICS entities, IMPACT incorporates highly uncertain and unpredictable human behaviors into the testing loop, naturally generating high-quality corner cases that complement AI-based methods. Furthermore, the mixedDT-enabled ''Physical-Virtual Action Interaction'' facilitates safe VICS testing under corner cases, incorporating real-world environments and entities rather than purely in simulation. Finally, we implement IMPACT on the I-VIT (Interactive Vehicle-Infrastructure Testbed), and experiments demonstrate its effectiveness. The experimental videos are available at our project website: https://dongjh20.github.io/IMPACT.
Chinese Translation
在极端情况下进行充分测试对于车辆-基础设施合作系统(VICS)的长期运行至关重要。然而,现有的极端情况生成方法主要依赖于人工智能,且VICS在极端情况下的测试通常仅限于模拟。在本文中,我们为VICS数字双胞胎(VICS-DT)分类法引入了L5“可交互”级别,超越了传统的L4“可优化”级别。我们进一步提出了一个L5级别的VICS测试框架IMPACT(用于先进合作车辆-基础设施测试的交互式混合数字双胞胎范式)。通过使人类能够直接与VICS实体进行交互,IMPACT将高度不确定和不可预测的人类行为纳入测试循环,自然生成高质量的极端情况,以补充基于人工智能的方法。此外,混合数字双胞胎支持的“物理-虚拟动作交互”促进了在极端情况下安全的VICS测试,结合了真实世界的环境和实体,而不仅仅是在模拟中进行。最后,我们在I-VIT(交互式车辆-基础设施测试平台)上实现了IMPACT,实验结果证明了其有效性。实验视频可在我们的项目网站上查看:https://dongjh20.github.io/IMPACT。
cs.RO / 23 / 2603.17510
Interpreting Context-Aware Human Preferences for Multi-Objective Robot Navigation
解释上下文感知的人类偏好用于多目标机器人导航
Abstract
Robots operating in human-shared environments must not only achieve task-level navigation objectives such as safety and efficiency, but also adapt their behavior to human preferences. However, as human preferences are typically expressed in natural language and depend on environmental context, it is difficult to directly integrate them into low-level robot control policies. In this work, we present a pipeline that enables robots to understand and apply context-dependent navigation preferences by combining foundational models with a Multi-Objective Reinforcement Learning (MORL) navigation policy. Thus, our approach integrates high-level semantic reasoning with low-level motion control. A Vision-Language Model (VLM) extracts structured environmental context from onboard visual observations, while Large Language Models (LLM) convert natural language user feedback into interpretable, context-dependent behavioral rules stored in a persistent but updatable rule memory. A preference translation module then maps contextual information and stored rules into numerical preference vectors that parameterize a pretrained MORL policy for real-time navigation adaptation. We evaluate the proposed framework through quantitative component-level evaluations, a user study, and real-world robot deployments in various indoor environments. Our results demonstrate that the system reliably captures user intent, generates consistent preference vectors, and enables controllable behavior adaptation across diverse contexts. Overall, the proposed pipeline improves the adaptability, transparency, and usability of robots operating in shared human environments, while maintaining safe and responsive real-time control.
Chinese Translation
在与人类共享的环境中运行的机器人不仅必须实现安全性和效率等任务级导航目标,还必须根据人类偏好调整其行为。然而,由于人类偏好通常以自然语言表达,并且依赖于环境上下文,因此将其直接整合到低级机器人控制策略中是困难的。在本研究中,我们提出了一种管道,使机器人能够理解和应用上下文依赖的导航偏好,通过将基础模型与多目标强化学习(Multi-Objective Reinforcement Learning, MORL)导航策略相结合。因此,我们的方法将高级语义推理与低级运动控制相结合。视觉-语言模型(Vision-Language Model, VLM)从车载视觉观测中提取结构化的环境上下文,而大型语言模型(Large Language Models, LLM)将自然语言用户反馈转换为可解释的、上下文依赖的行为规则,这些规则存储在一个持久但可更新的规则记忆中。然后,偏好翻译模块将上下文信息和存储的规则映射为数值偏好向量,这些向量为预训练的MORL策略参数化,以实现实时导航适应。我们通过定量组件级评估、用户研究以及在各种室内环境中的实际机器人部署来评估所提出的框架。我们的结果表明,该系统可靠地捕捉用户意图,生成一致的偏好向量,并在多样化的上下文中实现可控的行为适应。总体而言,所提出的管道提高了在共享人类环境中运行的机器人的适应性、透明性和可用性,同时保持安全和响应迅速的实时控制。
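The preference translation step — mapping context and stored language-derived rules to a numerical preference vector that scalarizes the MORL objectives — can be sketched as follows. The objective names, rules, and weight adjustments are hypothetical stand-ins, not the paper's learned modules:

```python
# Sketch: map context + stored rules (distilled from natural-language
# feedback) into a normalized preference vector, then scalarize
# multi-objective rewards with it. All names/weights are illustrative.
OBJECTIVES = ("safety", "efficiency", "comfort")

RULES = [
    # (context predicate, preference adjustment)
    (lambda ctx: ctx.get("humans_nearby", False),
     {"safety": +0.4, "efficiency": -0.2}),
    (lambda ctx: ctx.get("corridor_empty", False),
     {"efficiency": +0.3}),
]

def preference_vector(ctx):
    w = {k: 1.0 for k in OBJECTIVES}
    for pred, adj in RULES:
        if pred(ctx):                      # rule fires in this context
            for k, d in adj.items():
                w[k] = max(0.0, w[k] + d)
    s = sum(w.values())
    return {k: v / s for k, v in w.items()}   # weights fed to the MORL policy

def scalarize(rewards, w):
    # linear scalarization of the per-objective rewards
    return sum(w[k] * rewards[k] for k in OBJECTIVES)

w = preference_vector({"humans_nearby": True})
r = scalarize({"safety": 1.0, "efficiency": 0.2, "comfort": 0.5}, w)
```

Keeping the rules as an inspectable table is what makes the adaptation transparent: a user can see which rule fired and why the weights shifted.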
cs.RO / 24 / 2603.17524
KineVLA: Towards Kinematics-Aware Vision-Language-Action Models with Bi-Level Action Decomposition
KineVLA:朝着具有双层动作分解的运动学感知视觉-语言-动作模型迈进
Abstract
In this paper, we introduce a novel kinematics-rich vision-language-action (VLA) task, in which language commands densely encode diverse kinematic attributes (such as direction, trajectory, orientation, and relative displacement) from initiation through completion, at key moments, unlike existing action instructions that capture kinematics only coarsely or partially, thereby supporting fine-grained and personalized manipulation. In this setting, task goals remain invariant while execution trajectories must adapt to instruction-level kinematic specifications. To address this challenge, we propose KineVLA, a vision-language-action framework that explicitly decouples goal-level invariance from kinematics-level variability through a bi-level action representation, with bi-level reasoning tokens serving as explicit, supervised intermediate variables that align language and action. To support this task, we construct kinematics-aware VLA datasets spanning both simulation and real-world robotic platforms, featuring instruction-level kinematic variations and bi-level annotations. Extensive experiments on LIBERO and a Realman-75 robot demonstrate that KineVLA consistently outperforms strong VLA baselines on kinematics-sensitive benchmarks, achieving more precise, controllable, and generalizable manipulation behaviors.
Chinese Translation
在本文中,我们介绍了一种新颖的运动学丰富的视觉-语言-动作(VLA)任务,其中语言指令密集编码了从开始到完成的多样化运动学属性(如方向、轨迹、方位和相对位移),与现有的仅粗略或部分捕捉运动学的动作指令不同,从而支持细粒度和个性化的操作。在这种情况下,任务目标保持不变,而执行轨迹必须适应指令级运动学规范。为了解决这一挑战,我们提出了KineVLA,一个视觉-语言-动作框架,通过双层动作表示和双层推理标记明确地将目标级不变性与运动学级变异性解耦,以作为明确的、监督的中间变量,来对齐语言和动作。为了支持这一任务,我们构建了运动学感知的VLA数据集,涵盖了模拟和真实世界的机器人平台,具有指令级运动学变异和双层注释。在LIBERO和Realman-75机器人上的大量实验表明,KineVLA在运动学敏感基准上始终优于强大的VLA基线,达到了更精确、可控和可推广的操作行为。
cs.RO / 25 / 2603.17573
HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness
HeiSD:具有运动意识的具身视觉-语言-动作模型的混合推测解码
Abstract
Vision-Language-Action (VLA) Models have become the mainstream solution for robot control, but suffer from slow inference speeds. Speculative Decoding (SD) is a promising acceleration method which can be divided into two categories: drafter-based SD and retrieval-based SD. Existing methods fail to analyze the advantages and disadvantages of these two types of SD in VLA models, leading to their sole application or optimization. In this paper, we analyze the trajectory patterns of robots controlled by the VLA model and derive a key insight: the two types of SD should be used in a hybrid manner. However, achieving hybrid SD in VLA models poses several challenges: (1) draft rejection and persistent errors in retrieval-based SD; (2) difficulty in determining the hybrid boundary. To address these, we propose the HeiSD framework. In HeiSD, we propose a retrieval-based SD optimization method, which contains a verify-skip mechanism and a sequence-wise relaxed acceptance strategy. Moreover, we propose a kinematics-based fused metric in HeiSD to automatically determine the hybrid boundary. Experimental results demonstrate that HeiSD attains a speedup of up to 2.45x in simulation benchmarks and 2.06x~2.41x in real-world scenarios, while sustaining a high task success rate.
Chinese Translation
视觉-语言-动作(VLA)模型已成为机器人控制的主流解决方案,但其推理速度较慢。推测解码(SD)是一种有前景的加速方法,可分为两类:基于草稿的SD和基于检索的SD。现有方法未能分析这两种类型的SD在VLA模型中的优缺点,导致它们仅被单独应用或优化。本文分析了由VLA模型控制的机器人轨迹模式,并得出了一个关键见解:这两种类型的SD应以混合方式使用。然而,在VLA模型中实现混合SD面临几个挑战:(1)基于检索的SD中的草稿拒绝和持续错误;(2)确定混合边界的困难。为了解决这些问题,我们提出了HeiSD框架。在HeiSD中,我们提出了一种基于检索的SD优化方法,该方法包含验证跳过机制和序列级放宽接受策略。此外,我们在HeiSD中提出了一种基于运动学的融合度量,以自动确定混合边界。实验结果表明,HeiSD在仿真基准测试中实现了最高2.45倍的加速,在现实场景中实现了2.06倍至2.41倍的加速,同时保持了高任务成功率。
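The relaxed-acceptance idea for action tokens can be sketched in a few lines: a retrieved draft is verified token by token, but near-miss tokens within a tolerance are accepted rather than rejected. The toy "target model" and the tolerance rule are illustrative assumptions, not HeiSD's actual mechanism:

```python
# Sketch of sequence-wise relaxed acceptance for retrieval-based speculative
# decoding of discretized action tokens. Because nearby action bins produce
# near-identical motions, a drafted token within TOL bins of the target
# model's choice is accepted. Toy target model; illustrative only.
TOL = 1  # action-token bins within +/-1 are treated as equivalent

def target_model(prefix):
    # stand-in for the expensive VLA forward pass: next token = last + 2
    return prefix[-1] + 2

def accept_draft(prefix, draft, tol=TOL):
    accepted = []
    for tok in draft:
        want = target_model(prefix + accepted)
        if abs(tok - want) <= tol:
            accepted.append(tok)      # relaxed acceptance: close enough
        else:
            accepted.append(want)     # fall back to the target model's token
            break                     # and stop verifying this draft
    return accepted

out = accept_draft([0], [2, 5, 6, 20])   # last drafted token is rejected
```

Relaxing exact-match verification raises the acceptance length per expensive forward pass, which is where the speedup on smooth trajectory segments comes from.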
cs.RO / 26 / 2603.17652
VectorWorld: Efficient Streaming World Model via Diffusion Flow on Vector Graphs
VectorWorld:通过向量图上的扩散流实现高效流式世界模型
Abstract
Closed-loop evaluation of autonomous-driving policies requires interactive simulation beyond log replay. However, existing generative world models often degrade in closed loop due to (i) history-free initialization that mismatches policy inputs, (ii) multi-step sampling latency that violates real-time budgets, and (iii) compounding kinematic infeasibility over long horizons. We propose VectorWorld, a streaming world model that incrementally generates ego-centric $64 \mathrm{m}\times 64\mathrm{m}$ lane--agent vector-graph tiles during rollout. VectorWorld aligns initialization with history-conditioned policies by producing a policy-compatible interaction state via a motion-aware gated VAE. It enables real-time outpainting via solver-free one-step masked completion with an edge-gated relational DiT trained with interval-conditioned MeanFlow and JVP-based large-step supervision. To stabilize long-horizon rollouts, we introduce $\Delta$Sim, a physics-aligned non-ego (NPC) policy with hybrid discrete--continuous actions and differentiable kinematic logit shaping. On Waymo open motion and nuPlan, VectorWorld improves map-structure fidelity and initialization validity, and supports stable, real-time $1\mathrm{km}+$ closed-loop rollouts (\href{https://github.com/jiangchaokang/VectorWorld}{code}).
Chinese Translation
自主驾驶策略的闭环评估需要超越日志重放的交互式模拟。然而,现有的生成世界模型在闭环中往往会退化,原因包括:(i) 无历史初始化与策略输入不匹配,(ii) 多步采样延迟违反实时预算,以及 (iii) 长时间范围内运动学不可行性的累积。我们提出了VectorWorld,这是一种流式世界模型,在推演(rollout)过程中增量生成以自我为中心的 $64\mathrm{m}\times 64\mathrm{m}$ 车道-代理向量图块。VectorWorld利用运动感知的门控变分自编码器(gated VAE)生成与策略兼容的交互状态,从而将初始化与历史条件策略对齐。它通过无求解器的一步掩蔽补全实现实时外绘(outpainting),其中边缘门控的关系型DiT使用区间条件化的MeanFlow和基于JVP的大步长监督进行训练。为了稳定长时间范围的推演,我们引入了$\Delta$Sim,这是一种与物理对齐的非自我(NPC)策略,具有混合离散-连续动作和可微分的运动学logit整形。在Waymo开放运动和nuPlan上,VectorWorld提高了地图结构的保真度和初始化的有效性,并支持稳定的实时 $1\mathrm{km}+$ 闭环推演(代码:https://github.com/jiangchaokang/VectorWorld)。
cs.RO / 27 / 2603.17653
REAL: Robust Extreme Agility via Spatio-Temporal Policy Learning and Physics-Guided Filtering
REAL:通过时空策略学习和物理引导过滤实现鲁棒极限灵活性
Abstract
Extreme legged parkour demands rapid terrain assessment and precise foot placement under highly dynamic conditions. While recent learning-based systems achieve impressive agility, they remain fundamentally fragile to perceptual degradation, where even brief visual noise or latency can cause catastrophic failure. To overcome this, we propose Robust Extreme Agility Learning (REAL), an end-to-end framework for reliable parkour under sensory corruption. Instead of relying on perfectly clean perception, REAL tightly couples vision, proprioceptive history, and temporal memory. We distill a cross-modal teacher policy into a deployable student equipped with a FiLM-modulated Mamba backbone to actively filter visual noise and build short-term terrain memory. Furthermore, a physics-guided Bayesian state estimator enforces rigid-body consistency during high-impact maneuvers. Validated on a Unitree Go2 quadruped, REAL successfully traverses extreme obstacles even with a 1-meter visual blind zone, while strictly satisfying real-time control constraints with a bounded 13.1 ms inference time.
Chinese Translation
极限足式跑酷要求在高度动态的条件下快速评估地形和精确放置脚步。尽管最近的基于学习的系统在灵活性方面取得了令人印象深刻的成果,但它们在感知退化面前本质上仍然脆弱,甚至短暂的视觉噪声或延迟都可能导致灾难性失败。为了解决这个问题,我们提出了鲁棒极限灵活性学习(REAL),这是一个在感官干扰下实现可靠跑酷的端到端框架。REAL不依赖于完美清晰的感知,而是紧密结合视觉、本体感觉历史和时间记忆。我们将跨模态教师策略提炼为一个可部署的学生模型,该模型配备了FiLM调制的Mamba骨干网络,以主动过滤视觉噪声并构建短期地形记忆。此外,物理引导的贝叶斯状态估计器在高冲击动作中强制执行刚体一致性。在Unitree Go2四足机器人上进行验证,REAL即使存在1米的视觉盲区也能成功穿越极端障碍,同时以有界的13.1毫秒推理时间严格满足实时控制约束。
cs.RO / 28 / 2603.17670
AgentVLN: Towards Agentic Vision-and-Language Navigation
AgentVLN:面向自主视觉与语言导航
Abstract
Vision-and-Language Navigation (VLN) requires an embodied agent to ground complex natural-language instructions into long-horizon navigation in unseen environments. While Vision-Language Models (VLMs) offer strong 2D semantic understanding, current VLN systems remain constrained by limited spatial perception, 2D-3D representation mismatch, and monocular scale ambiguity. In this paper, we propose AgentVLN, a novel and efficient embodied navigation framework that can be deployed on edge computing platforms. We formulate VLN as a Partially Observable Semi-Markov Decision Process (POSMDP) and introduce a VLM-as-Brain paradigm that decouples high-level semantic reasoning from perception and planning via a plug-and-play skill library. To resolve multi-level representation inconsistency, we design a cross-space representation mapping that projects perception-layer 3D topological waypoints into the image plane, yielding pixel-aligned visual prompts for the VLM. Building on this bridge, we integrate a context-aware self-correction and active exploration strategy to recover from occlusions and suppress error accumulation over long trajectories. To further address the spatial ambiguity of instructions in unstructured environments, we propose a Query-Driven Perceptual Chain-of-Thought (QD-PCoT) scheme, enabling the agent with the metacognitive ability to actively seek geometric depth information. Finally, we construct AgentVLN-Instruct, a large-scale instruction-tuning dataset with dynamic stage routing conditioned on target visibility. Extensive experiments show that AgentVLN consistently outperforms prior state-of-the-art methods (SOTA) on long-horizon VLN benchmarks, offering a practical paradigm for lightweight deployment of next-generation embodied navigation models. Code: https://github.com/Allenxinn/AgentVLN.
Chinese Translation
视觉与语言导航(VLN)要求一个具身代理将复杂的自然语言指令转化为在未知环境中的长距离导航。尽管视觉语言模型(VLMs)提供了强大的二维语义理解,目前的VLN系统仍受到有限空间感知、二维-三维表示不匹配和单目尺度模糊的限制。本文提出了AgentVLN,一种新颖且高效的具身导航框架,可在边缘计算平台上部署。我们将VLN形式化为部分可观测半马尔可夫决策过程(POSMDP),并引入VLM作为大脑的范式,通过即插即用的技能库将高层语义推理与感知和规划解耦。为了解决多层表示不一致的问题,我们设计了一种跨空间表示映射,将感知层的三维拓扑航点投影到图像平面,生成与像素对齐的视觉提示供VLM使用。在此基础上,我们整合了一种上下文感知的自我修正和主动探索策略,以从遮挡中恢复并抑制长轨迹上的误差积累。为了进一步解决非结构化环境中指令的空间模糊性,我们提出了一种基于查询驱动的感知思维链(QD-PCoT)方案,使代理具备主动寻求几何深度信息的元认知能力。最后,我们构建了AgentVLN-Instruct,一个大规模的指令调优数据集,具有基于目标可见性的动态阶段路由。大量实验表明,AgentVLN在长距离VLN基准测试中始终优于先前的最先进方法(SOTA),为下一代具身导航模型的轻量级部署提供了实用范式。代码链接:https://github.com/Allenxinn/AgentVLN。
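The cross-space representation mapping — projecting perception-layer 3D waypoints into the image plane to produce pixel-aligned prompts — boils down to a pinhole projection. The intrinsic values below are assumed placeholders, not AgentVLN's calibration:

```python
# Sketch of the cross-space mapping idea: project 3D topological waypoints
# (camera frame, meters) through a pinhole intrinsic model to get
# pixel-aligned prompts for the VLM. Intrinsics are illustrative values.
FX, FY, CX, CY = 525.0, 525.0, 320.0, 240.0   # assumed camera intrinsics

def project(point_cam):
    x, y, z = point_cam
    if z <= 0:
        return None          # behind the camera: no valid pixel prompt
    u = FX * x / z + CX      # standard pinhole: u = fx * x/z + cx
    v = FY * y / z + CY
    return (u, v)

waypoints = [(0.0, 0.0, 2.0), (0.5, -0.2, 2.0), (0.0, 0.0, -1.0)]
pixels = [project(p) for p in waypoints]   # last waypoint is unprojectable
```

Handing the VLM pixel coordinates instead of raw 3D waypoints sidesteps the 2D-3D representation mismatch the abstract identifies: the model reasons in the space it was pretrained on.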
cs.RO / 29 / 2603.17672
Consistency-Driven Dual LSTM Models for Kinematic Control of a Wearable Soft Robotic Arm
基于一致性驱动的双LSTM模型用于可穿戴软体机器人手臂的运动控制
Abstract
In this paper, we introduce a consistency-driven dual LSTM framework for accurately learning both the forward and inverse kinematics of a pneumatically actuated soft robotic arm integrated into a wearable device. This approach effectively captures the nonlinear and hysteretic behaviors of soft pneumatic actuators while addressing the one-to-many mapping challenge between actuation inputs and end-effector positions. By incorporating a cycle consistency loss, we enhance physical realism and improve the stability of inverse predictions. Extensive experiments, including trajectory tracking, ablation studies, and wearable demonstrations, confirm the effectiveness of our method. Results indicate that the inclusion of the consistency loss significantly boosts prediction accuracy and promotes physical consistency over conventional approaches. Moreover, the wearable soft robotic arm demonstrates strong human-robot collaboration capabilities and adaptability in everyday tasks such as object handover, obstacle-aware pick-and-place, and drawer operation. This work underscores the promising potential of learning-based kinematic models for human-centric, wearable robotic systems.
Chinese Translation
本文提出了一种基于一致性驱动的双LSTM框架,用于准确学习集成在可穿戴设备中的气动驱动软体机器人手臂的正向和逆向运动学。该方法有效捕捉了软气动执行器的非线性和滞后特性,同时解决了驱动输入与末端执行器位置之间的一对多映射挑战。通过引入循环一致性损失,我们增强了物理现实感,并提高了逆向预测的稳定性。大量实验,包括轨迹跟踪、消融研究和可穿戴演示,证实了我们方法的有效性。结果表明,加入一致性损失显著提高了预测精度,并在传统方法上促进了物理一致性。此外,可穿戴软体机器人手臂在日常任务中表现出强大的人机协作能力和适应性,如物体交接、障碍物感知的拾取与放置以及抽屉操作。本研究强调了基于学习的运动学模型在以人为中心的可穿戴机器人系统中的广阔潜力。
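The cycle consistency term couples the two models: an actuation command passed through the forward model and back through the inverse model should reconstruct itself. The toy forward/inverse maps below stand in for the two LSTMs (an illustrative assumption, not the paper's networks):

```python
# Minimal sketch of the cycle-consistency idea: penalize the mismatch between
# an actuation command and the inverse model's reconstruction of it after a
# forward-kinematics pass. Toy maps stand in for the dual LSTMs.
def forward_kin(pressure):
    # toy forward model: actuation pressures -> end-effector coordinates
    return [0.1 * p + 0.01 * p * p for p in pressure]

def inverse_kin(position):
    # toy (slightly biased) inverse model: coordinates -> actuation pressures
    return [9.5 * x for x in position]

def cycle_consistency_loss(pressures):
    # mean squared error between p and inverse(forward(p)) over a batch
    loss = 0.0
    for p in pressures:
        p_hat = inverse_kin(forward_kin(p))
        loss += sum((a - b) ** 2 for a, b in zip(p, p_hat)) / len(p)
    return loss / len(pressures)

loss = cycle_consistency_loss([[1.0, 2.0], [0.5, 1.5]])
```

In training, this term is added to the usual supervised losses; it discourages the inverse model from picking physically implausible branches of the one-to-many mapping.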
cs.RO / 30 / 2603.17712
AERR-Nav: Adaptive Exploration-Recovery-Reminiscing Strategy for Zero-Shot Object Navigation
AERR-Nav:用于零样本物体导航的自适应探索-恢复-回忆策略
Abstract
Zero-Shot Object Navigation (ZSON) in unknown multi-floor environments presents a significant challenge. Recent methods, mostly based on semantic value greedy waypoint selection, spatial topology-enhanced memory, and Multimodal Large Language Model (MLLM) as a decision-making framework, have led to improvements. However, these architectures struggle to balance exploration and exploitation for ZSON when encountering unseen environments, especially in multi-floor settings, such as robots getting stuck at narrow intersections, endlessly wandering, or failing to find stair entrances. To overcome these challenges, we propose AERR-Nav, a Zero-Shot Object Navigation framework that dynamically adjusts its state based on the robot's environment. Specifically, AERR-Nav has the following two key advantages: (1) An Adaptive Exploration-Recovery-Reminiscing Strategy, enables robots to dynamically transition between three states, facilitating specialized responses to diverse navigation scenarios. (2) An Adaptive Exploration State featuring Fast and Slow-Thinking modes helps robots better balance exploration, exploitation, and higher-level reasoning based on evolving environmental information. Extensive experiments on the HM3D and MP3D benchmarks demonstrate that our AERR-Nav achieves state-of-the-art performance among zero-shot methods. Comprehensive ablation studies further validate the efficacy of our proposed strategy and modules.
Chinese Translation
在未知的多层环境中,零样本物体导航(ZSON)面临着重大挑战。最近的方法主要基于语义值贪婪路径点选择、空间拓扑增强记忆以及多模态大型语言模型(MLLM)作为决策框架,取得了一定的进展。然而,当遇到未见过的环境时,这些架构在ZSON中平衡探索与利用的能力仍显不足,尤其是在多层设置中,例如机器人在狭窄交叉口被卡住、无休止地徘徊或无法找到楼梯入口。为了解决这些挑战,我们提出了AERR-Nav,这是一种零样本物体导航框架,能够根据机器人所处环境动态调整其状态。具体而言,AERR-Nav具有以下两个关键优势:(1)自适应探索-恢复-回忆策略,使机器人能够在三种状态之间动态转换,从而针对多样的导航场景提供专业化响应。(2)自适应探索状态,具有快速和慢速思维模式,帮助机器人根据不断变化的环境信息更好地平衡探索、利用和更高层次的推理。在HM3D和MP3D基准上的大量实验表明,我们的AERR-Nav在零样本方法中达到了最先进的性能。全面的消融研究进一步验证了我们提出的策略和模块的有效性。
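The Adaptive Exploration-Recovery-Reminiscing strategy is, at its core, a small state machine driven by progress signals. The trigger conditions below are illustrative assumptions, not AERR-Nav's exact rules:

```python
# Sketch of the Exploration / Recovery / Reminiscing transitions as a small
# state machine driven by simple progress signals (illustrative conditions).
EXPLORE, RECOVER, REMINISCE = "explore", "recover", "reminisce"

def next_state(state, stuck_steps, target_seen_before):
    if state == EXPLORE and stuck_steps >= 5:
        # e.g. wedged at a narrow intersection -> back out and replan
        return RECOVER
    if state == RECOVER and stuck_steps == 0:
        # unstuck: revisit remembered observations of the target, if any
        return REMINISCE if target_seen_before else EXPLORE
    if state == REMINISCE and not target_seen_before:
        return EXPLORE
    return state

s = EXPLORE
s = next_state(s, stuck_steps=6, target_seen_before=False)   # -> recover
s = next_state(s, stuck_steps=0, target_seen_before=True)    # -> reminisce
```

Separating the three modes lets each one use a specialized policy (including the fast/slow-thinking split inside exploration) instead of one greedy waypoint selector handling every situation.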
cs.RO / 31 / 2603.17720
VolumeDP: Modeling Volumetric Representation for Manipulation Policy Learning
VolumeDP:用于操控策略学习的体积表示建模
Abstract
Imitation learning is a prominent paradigm for robotic manipulation. However, existing visual imitation methods map 2D image observations directly to 3D action outputs, imposing a 2D-3D mismatch that hinders spatial reasoning and degrades robustness. We present VolumeDP, a policy architecture that restores spatial alignment by explicitly reasoning in 3D. VolumeDP first lifts image features into a Volumetric Representation via cross-attention. It then selects task-relevant voxels with a learnable module and converts them into a compact set of spatial tokens, markedly reducing computation while preserving action-critical geometry. Finally, a multi-token decoder conditions on the entire token set to predict actions, thereby avoiding lossy aggregation that collapses multiple spatial tokens into a single descriptor. VolumeDP achieves a state-of-the-art average success rate of 88.8% on the LIBERO simulation benchmark, outperforming the strongest baseline by a substantial 14.8% improvement. It also delivers large performance gains over prior methods on the ManiSkill and LIBERO-Plus benchmarks. Real-world experiments further demonstrate higher success rates and robust generalization to novel spatial layouts, camera viewpoints, and environment backgrounds. Code will be released.
Chinese Translation
模仿学习是机器人操控中的一种重要范式。然而,现有的视觉模仿方法将2D图像观测直接映射到3D动作输出,这导致了2D与3D之间的不匹配,妨碍了空间推理并降低了鲁棒性。我们提出了VolumeDP,这是一种通过明确的3D推理恢复空间对齐的策略架构。VolumeDP首先通过交叉注意力将图像特征提升为体积表示。然后,它使用一个可学习模块选择与任务相关的体素,并将其转换为一组紧凑的空间标记,从而显著减少计算量,同时保留与动作相关的几何信息。最后,一个多标记解码器基于整个标记集来预测动作,从而避免将多个空间标记聚合为单一描述符时造成的信息损失。VolumeDP在LIBERO模拟基准上实现了88.8%的最新平均成功率,较最强基线提高了14.8%。在ManiSkill和LIBERO-Plus基准上,它也相较于之前的方法实现了显著的性能提升。实际实验进一步证明了在新颖空间布局、摄像机视角和环境背景下具有更高的成功率和鲁棒性泛化。代码将会发布。
cs.RO / 32 / 2603.17751
Multi-Source Human-in-the-Loop Digital Twin Testbed for Connected and Autonomous Vehicles in Mixed Traffic Flow
面向混合交通流中连接与自动驾驶车辆的多源人在环数字双胞胎测试平台
Abstract
In the emerging mixed traffic environments, Connected and Autonomous Vehicles (CAVs) have to interact with surrounding human-driven vehicles (HDVs). This paper introduces MSH-MCCT (Multi-Source Human-in-the-Loop Mixed Cloud Control Testbed), a novel CAV testbed that captures complex interactions between various CAVs and HDVs. Utilizing the Mixed Digital Twin concept, which combines Mixed Reality with Digital Twin, MSH-MCCT integrates physical, virtual, and mixed platforms, along with multi-source control inputs. Bridged by the mixed platform, MSH-MCCT allows human drivers and CAV algorithms to operate both physical and virtual vehicles within multiple fields of view. Particularly, this testbed facilitates the coexistence and real-time interaction of physical and virtual CAVs & HDVs, significantly enhancing the experimental flexibility and scalability. Experiments on vehicle platooning in mixed traffic showcase the potential of MSH-MCCT to conduct CAV testing with multi-source real human drivers in the loop through driving simulators of diverse fidelity. The videos for the experiments are available at our project website: https://dongjh20.github.io/MSH-MCCT.
Chinese Translation
在新兴的混合交通环境中,连接与自动驾驶车辆(CAVs)必须与周围的人驾驶车辆(HDVs)进行互动。本文介绍了MSH-MCCT(多源人在环混合云控制测试平台),这是一个新颖的CAV测试平台,能够捕捉各种CAV与HDV之间的复杂互动。MSH-MCCT利用混合数字双胞胎概念,将混合现实与数字双胞胎相结合,整合了物理、虚拟和混合平台,以及多源控制输入。通过混合平台的桥接,MSH-MCCT允许人类驾驶员和CAV算法在多个视野内操作物理和虚拟车辆。特别是,该测试平台促进了物理和虚拟CAV与HDV的共存和实时互动,显著增强了实验的灵活性和可扩展性。在混合交通中的车辆编队实验展示了MSH-MCCT通过不同保真度的驾驶模拟器、将多源真实人类驾驶员纳入回路进行CAV测试的潜力。实验视频可在我们的项目网站上查看:https://dongjh20.github.io/MSH-MCCT。
cs.RO / 33 / 2603.17768
Huddle: Parallel Shape Assembly using Decentralized, Minimalistic Robots
Huddle:使用去中心化、简约机器人进行并行形状组装
Abstract
We propose a novel algorithm for forming arbitrarily shaped assemblies using decentralized robots. By relying on local interactions, the algorithm ensures there are no unreachable states or gaps in the assembly, which are global properties. The in-assembly robots attract passing-by robots into expanding the assembly via a simple implementation of signaling and alignment. Our approach is minimalistic, requiring only communication between attached, immediate neighbors. It is motion-agnostic and requires no pose localization, enabling asynchronous and order-independent assembly. We prove the algorithm's correctness and demonstrate its effectiveness in forming a 107-robot assembly.
Chinese Translation
我们提出了一种新颖的算法,用于利用去中心化机器人形成任意形状的组装。该算法依赖于局部交互,确保组装中不存在不可达状态或间隙,这些都是全局特性。组装中的机器人通过简单的信号与对齐机制吸引路过的机器人加入,从而扩展组装。我们的方法是简约的,仅需要相连的直接邻居之间的通信。它与运动方式无关,且不需要位姿定位,从而实现异步且与顺序无关的组装。我们证明了算法的正确性,并展示了其在形成107个机器人的组装中的有效性。
cs.RO / 34 / 2603.17808
EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards
EVA:通过逆动力学奖励将视频世界模型与可执行机器人动作对齐
Abstract
Video generative models are increasingly used as world models for robotics, where a model generates a future visual rollout conditioned on the current observation and task instruction, and an inverse dynamics model (IDM) converts the generated frames into executable robot actions. However, current video world models lack explicit executability constraints. As a result, visually coherent rollouts may still violate rigid-body and kinematic consistency, producing unstable or infeasible control commands when decoded by an IDM. We refer to this mismatch between visual generation and physically executable control as the executability gap. While this gap can be mitigated at inference time using techniques such as rejection sampling, such approaches are inefficient due to the high cost of video generation. In this paper, we leverage the executability gap as a training signal and introduce Executable Video Alignment (EVA), a reinforcement-learning post-training framework for aligning video world models. EVA trains an inverse dynamics model on real robot trajectories and repurposes it as a reward model that evaluates generated videos through the action sequences they induce, encouraging smooth motions measured by velocity, acceleration, and jerk while penalizing actions that violate embodiment constraints. Importantly, the reward remains informative even when generated videos contain severe visual artifacts, since such artifacts typically translate into unstable or out-of-bound actions. Experiments on the RoboTwin benchmark and a real bimanual robot show that EVA reduces embodiment-specific artifacts in generated rollouts and improves downstream task execution success.
Chinese Translation
视频生成模型越来越多地被用作机器人领域的世界模型,其中模型根据当前观察和任务指令生成未来的视觉展开,而逆动力学模型(IDM)将生成的帧转换为可执行的机器人动作。然而,目前的视频世界模型缺乏明确的可执行性约束。因此,视觉上连贯的展开可能仍然违反刚体和运动学的一致性,在通过IDM解码时产生不稳定或不可行的控制命令。我们将视觉生成与物理可执行控制之间的这种不匹配称为可执行性差距。虽然可以在推理时使用拒绝采样等技术来缓解这种差距,但由于视频生成的高成本,这种方法效率低下。在本文中,我们利用可执行性差距作为训练信号,并引入可执行视频对齐(EVA),这是一种用于对齐视频世界模型的强化学习后训练框架。EVA在真实机器人轨迹上训练逆动力学模型,并将其重新用作奖励模型,通过所诱导的动作序列评估生成的视频,鼓励以速度、加速度和加加速度(jerk)衡量的平滑运动,同时惩罚违反具身约束的动作。重要的是,即使生成的视频包含严重的视觉伪影,奖励仍然具有信息量,因为这些伪影通常会转化为不稳定或超出边界的动作。在RoboTwin基准和一个真实的双臂机器人上的实验表明,EVA减少了生成展开中的具身相关伪影,并提高了下游任务执行的成功率。
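The abstract names the signals EVA's reward uses (velocity, acceleration, jerk, and embodiment bound violations) but not its exact formula. A minimal sketch, assuming finite-difference smoothness penalties over a decoded action sequence; the function name, weights, and bound convention below are illustrative, not the paper's implementation:

```python
import numpy as np

def smoothness_reward(actions, dt=0.1, bounds=(-1.0, 1.0),
                      w_vel=1.0, w_acc=1.0, w_jerk=1.0, w_bound=10.0):
    """Score an action sequence decoded from a generated rollout (sketch).

    Penalizes large velocity, acceleration, and jerk (first, second, and
    third finite differences of the action trajectory) plus out-of-bound
    actions. Higher return means smoother, more executable motion.
    """
    a = np.asarray(actions, dtype=float)      # shape (T, dof)
    vel = np.diff(a, n=1, axis=0) / dt        # first difference
    acc = np.diff(a, n=2, axis=0) / dt**2     # second difference
    jerk = np.diff(a, n=3, axis=0) / dt**3    # third difference
    lo, hi = bounds
    violation = np.clip(lo - a, 0, None) + np.clip(a - hi, 0, None)
    penalty = (w_vel * np.mean(vel**2)
               + w_acc * np.mean(acc**2)
               + w_jerk * np.mean(jerk**2)
               + w_bound * np.mean(violation))
    return -penalty
```

Note how a severely artifacted video would score poorly even without any visual check: the IDM-decoded actions it induces become jittery or leave the bounds, which this penalty captures directly.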
cs.RO / 35 / 2603.17834
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
生成控制作为优化:用于自适应和鲁棒机器人控制的时间无条件流匹配
Abstract
Diffusion models and flow matching have become a cornerstone of robotic imitation learning, yet they suffer from a structural inefficiency where inference is often bound to a fixed integration schedule that is agnostic to state complexity. This paradigm forces the policy to expend the same computational budget on trivial motions as it does on complex tasks. We introduce Generative Control as Optimization (GeCO), a time-unconditional framework that transforms action synthesis from trajectory integration into iterative optimization. GeCO learns a stationary velocity field in the action-sequence space where expert behaviors form stable attractors. Consequently, test-time inference becomes an adaptive process that allocates computation based on convergence--exiting early for simple states while refining longer for difficult ones. Furthermore, this stationary geometry yields an intrinsic, training-free safety signal, as the field norm at the optimized action serves as a robust out-of-distribution (OOD) detector, remaining low for in-distribution states while significantly increasing for anomalies. We validate GeCO on standard simulation benchmarks and demonstrate seamless scaling to pi0-series Vision-Language-Action (VLA) models. As a plug-and-play replacement for standard flow-matching heads, GeCO improves success rates and efficiency with an optimization-native mechanism for safe deployment. Video and code can be found at https://hrh6666.github.io/GeCO/
Chinese Translation
扩散模型和流匹配已成为机器人模仿学习的基石,但它们存在结构性低效的问题,推断通常受限于一个对状态复杂性无感的固定积分调度。这一范式迫使策略在简单动作和复杂任务上消耗相同的计算预算。我们提出了生成控制作为优化(Generative Control as Optimization, GeCO),这是一个时间无条件框架,将动作合成从轨迹积分转变为迭代优化。GeCO在动作序列空间中学习一个稳定的速度场,其中专家行为形成稳定的吸引子。因此,测试时的推断成为一个自适应过程,根据收敛情况分配计算——对于简单状态提前退出,而对于困难状态则进行更长时间的精细化。此外,这种静态几何结构产生了一种内在的、无训练的安全信号,因为在优化动作下的场范数作为一个鲁棒的分布外(out-of-distribution, OOD)检测器,对于分布内状态保持较低,而对于异常状态则显著增加。我们在标准仿真基准上验证了GeCO,并展示了其在pi0系列视觉-语言-动作(Vision-Language-Action, VLA)模型上的无缝扩展。作为标准流匹配头的即插即用替代方案,GeCO通过一种优化原生机制提高了成功率和效率,确保安全部署。视频和代码可在 https://hrh6666.github.io/GeCO/ 找到。
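As a toy illustration of the inference loop described above (not GeCO's learned field), the sketch below iterates a stationary velocity field until its norm falls below a tolerance, so easy states exit early, and reuses the residual field norm as the OOD score. The attractor, step size, and tolerance are made-up values:

```python
import numpy as np

def geco_infer(velocity_field, a0, step=0.5, tol=1e-3, max_iters=100):
    """Time-unconditional inference as iterative optimization (sketch).

    Repeatedly moves the action along a stationary velocity field and
    exits as soon as the field norm drops below `tol`. The residual
    field norm at the returned action doubles as an OOD score: it stays
    near zero at an attractor and grows for anomalous states.
    """
    a = np.asarray(a0, dtype=float)
    for i in range(1, max_iters + 1):
        v = velocity_field(a)
        if np.linalg.norm(v) < tol:
            break
        a = a + step * v
    ood_score = np.linalg.norm(velocity_field(a))
    return a, i, ood_score

# Toy field whose single attractor is the "expert" action [0.3, -0.2].
expert = np.array([0.3, -0.2])
field = lambda a: expert - a
action, iters, score = geco_infer(field, np.zeros(2))
```

Starting points far from the attractor take more iterations, which mirrors the paper's convergence-based allocation of compute.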
cs.RO / 36 / 2603.17850
ProbeFlow: Training-Free Adaptive Flow Matching for Vision-Language-Action Models
ProbeFlow:无训练自适应流匹配用于视觉-语言-动作模型
Abstract
Recent Vision-Language-Action (VLA) models equipped with Flow Matching (FM) action heads achieve state-of-the-art performance in complex robot manipulation. However, the multi-step iterative ODE solving required by FM introduces inference latency that precludes responsive physical control. While current acceleration efforts optimize the Vision-Language Model (VLM) backbone, the action head bottleneck remains overlooked. To address this, we propose ProbeFlow, a training-free adaptive inference framework tailored for continuous robotic control. By evaluating geometric trajectory complexity via the cosine similarity between initial and lookahead velocity vectors, ProbeFlow dynamically schedules integration steps to prune redundant network evaluations. On the MetaWorld benchmark, it accelerates action decoding by 14.8x (reducing average steps from N = 50 to 2.6) and cuts end-to-end system latency by 2.8x without compromising the manipulation success rate. On the long-horizon LIBERO benchmark, the probe automatically allocates a denser schedule to navigate semantic bottlenecks, effectively resolving the flow solver delay. Real-world physical deployments confirm that ProbeFlow successfully mitigates action decoding latency while ensuring execution stability, offering a highly practical solution for low-latency continuous generative policies.
Chinese Translation
最近,配备流匹配(Flow Matching, FM)动作头的视觉-语言-动作(Vision-Language-Action, VLA)模型在复杂的机器人操作中实现了最先进的性能。然而,FM所需的多步迭代常微分方程(ODE)求解引入了推理延迟,妨碍了响应式物理控制。尽管当前的加速努力优化了视觉-语言模型(Vision-Language Model, VLM)主干,但动作头瓶颈仍未得到重视。为了解决这个问题,我们提出了ProbeFlow,这是一种无训练的自适应推理框架,专为连续机器人控制量身定制。通过评估初始和前瞻速度向量之间的余弦相似度来衡量几何轨迹复杂性,ProbeFlow动态调度积分步骤,以修剪冗余的网络评估。在MetaWorld基准测试中,它将动作解码加速了14.8倍(将平均步骤从N=50减少到2.6),并将端到端系统延迟减少了2.8倍,而不影响操作成功率。在长时间范围的LIBERO基准测试中,探针自动分配更密集的调度以导航语义瓶颈,有效解决流求解器延迟。现实世界的物理部署确认ProbeFlow成功减轻了动作解码延迟,同时确保了执行稳定性,为低延迟连续生成策略提供了高度实用的解决方案。
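The probe described above, comparing the initial velocity with a lookahead velocity to gauge trajectory straightness, can be sketched as follows. The mapping from cosine similarity to step count is an assumption for illustration; the paper's actual schedule is not specified in the abstract:

```python
import numpy as np

def probe_num_steps(velocity_fn, x0, n_max=50, n_min=2, lookahead=0.2):
    """Pick an ODE step budget from trajectory straightness (sketch).

    Compares the velocity at the initial sample with the velocity after
    a small lookahead step. Cosine similarity near 1 means the flow is
    nearly straight and can be integrated in very few steps, while a
    low similarity triggers a denser schedule.
    """
    v0 = velocity_fn(x0, t=0.0)
    x1 = x0 + lookahead * v0
    v1 = velocity_fn(x1, t=lookahead)
    cos = float(np.dot(v0.ravel(), v1.ravel())
                / (np.linalg.norm(v0) * np.linalg.norm(v1) + 1e-8))
    # Map similarity in [-1, 1] to a step count in [n_min, n_max].
    frac = (1.0 - cos) / 2.0
    return int(round(n_min + frac * (n_max - n_min)))
```

A constant (perfectly straight) field collapses to the minimum budget, while a rotating field receives more steps.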
cs.RO / 37 / 2603.17851
DexViTac: Collecting Human Visuo-Tactile-Kinematic Demonstrations for Contact-Rich Dexterous Manipulation
DexViTac:收集用于接触丰富的灵巧操作的人类视觉-触觉-运动演示
Abstract
Large-scale, high-quality multimodal demonstrations are essential for robot learning of contact-rich dexterous manipulation. While human-centric data collection systems lower the barrier to scaling, they struggle to capture the tactile information during physical interactions. Motivated by this, we present DexViTac, a portable, human-centric data collection system tailored for contact-rich dexterous manipulation. The system enables the high-fidelity acquisition of first-person vision, high-density tactile sensing, end-effector poses, and hand kinematics within unstructured, in-the-wild environments. Building upon this hardware, we propose a kinematics-grounded tactile representation learning algorithm that effectively resolves semantic ambiguities within tactile signals. Leveraging the efficiency of DexViTac, we construct a multimodal dataset comprising over 2,400 visuo-tactile-kinematic demonstrations. Experiments demonstrate that DexViTac achieves a collection efficiency exceeding 248 demonstrations per hour and remains robust against complex visual occlusions. Real-world deployment confirms that policies trained with the proposed dataset and learning strategy achieve an average success rate exceeding 85% across four challenging tasks. This performance significantly outperforms baseline methods, thereby validating the substantial improvement the system provides for learning contact-rich dexterous manipulation. Project page: https://xitong-c.github.io/DexViTac/.
Chinese Translation
大规模、高质量的多模态演示对于机器人学习接触丰富的灵巧操作至关重要。尽管以人为中心的数据收集系统降低了扩展的门槛,但在物理交互过程中,它们难以捕捉触觉信息。基于此,我们提出了DexViTac,一个便携式、以人为中心的数据收集系统,专为接触丰富的灵巧操作而设计。该系统能够在非结构化的自然环境中高保真地获取第一人称视觉、高密度触觉感知、末端执行器姿态和手部运动学。基于这一硬件,我们提出了一种基于运动学的触觉表示学习算法,有效解决了触觉信号中的语义模糊性。利用DexViTac的高效性,我们构建了一个包含2400多个视觉-触觉-运动演示的多模态数据集。实验表明,DexViTac的收集效率超过每小时248个演示,并且在复杂的视觉遮挡下保持稳健。实际部署确认,使用所提数据集和学习策略训练的策略在四个具有挑战性的任务中平均成功率超过85%。这一性能显著优于基线方法,从而验证了该系统在学习接触丰富的灵巧操作方面所提供的显著改进。项目页面:https://xitong-c.github.io/DexViTac/
cs.RO / 38 / 2603.17927
RoboForge: Physically Optimized Text-guided Whole-Body Locomotion for Humanoids
RoboForge:针对类人机器人进行物理优化的文本引导全身运动
Abstract
While generative models have become effective at producing human-like motions from text, transferring these motions to humanoid robots for physical execution remains challenging. Existing pipelines are often limited by retargeting, where kinematic quality is undermined by physical infeasibility, contact-transition errors, and the high cost of real-world dynamical data. We present a unified latent-driven framework that bridges natural language and whole-body humanoid locomotion through a retarget-free, physics-optimized pipeline. Rather than treating generation and control as separate stages, our key insight is to couple them bidirectionally under physical constraints. We introduce a Physical Plausibility Optimization (PP-Opt) module as the coupling interface. In the forward direction, PP-Opt refines a teacher-student distillation policy with a plausibility-centric reward to suppress artifacts such as floating, skating, and penetration. In the backward direction, it converts reward-optimized simulation rollouts into high-quality explicit motion data, which is used to fine-tune the motion generator toward a more physically plausible latent distribution. This bidirectional design forms a self-improving cycle: the generator learns a physically grounded latent space, while the controller learns to execute latent-conditioned behaviors with dynamical integrity. Extensive experiments on the Unitree G1 humanoid show that our bidirectional optimization improves tracking accuracy and success rates. Across IsaacLab and MuJoCo, the implicit latent-driven pipeline consistently outperforms conventional explicit retargeting baselines in both precision and stability. By coupling diffusion-based motion generation with physical plausibility optimization, our framework provides a practical path toward deployable text-guided humanoid intelligence.
Chinese Translation
尽管生成模型在根据文本生成类人运动方面已变得有效,但将这些运动转移到类人机器人以进行物理执行仍然具有挑战性。现有的流程常常受到重定向的限制,其中运动学质量受到物理不可行性、接触过渡错误以及现实世界动态数据高成本的影响。我们提出了一个统一的潜在驱动框架,通过无重定向的物理优化流程将自然语言与类人机器人全身运动连接起来。我们的关键见解是将生成与控制在物理约束下双向耦合,而不是将其视为独立的阶段。我们引入了一个物理合理性优化(PP-Opt)模块作为耦合接口。在正向过程中,PP-Opt通过以合理性为中心的奖励来精炼教师-学生蒸馏策略,以抑制漂浮、滑行和穿透等伪影。在反向过程中,它将奖励优化的模拟结果转换为高质量的显式运动数据,这些数据用于微调运动生成器,使其朝向更具物理合理性的潜在分布。这个双向设计形成了一个自我改进的循环:生成器学习一个物理基础的潜在空间,而控制器学习以动态完整性执行潜在条件行为。在Unitree G1类人机器人上进行的广泛实验表明,我们的双向优化提高了跟踪精度和成功率。在IsaacLab和MuJoCo上,隐式潜在驱动流程在精度和稳定性方面始终优于传统的显式重定向基线。通过将基于扩散的运动生成与物理合理性优化相结合,我们的框架为可部署的文本引导类人智能提供了一条实用路径。
cs.RO / 39 / 2603.17969
Specification-Aware Distribution Shaping for Robotics Foundation Models
面向规范的分布塑形用于机器人基础模型
Abstract
Robotics foundation models have demonstrated strong capabilities in executing natural language instructions across diverse tasks and environments. However, they remain largely data-driven and lack formal guarantees on safety and satisfaction of time-dependent specifications during deployment. In practice, robots often need to comply with operational constraints involving rich spatio-temporal requirements such as time-bounded goal visits, sequential objectives, and persistent safety conditions. In this work, we propose a specification-aware action distribution optimization framework that enforces a broad class of Signal Temporal Logic (STL) constraints during execution of a pretrained robotics foundation model without modifying its parameters. At each decision step, the method computes a minimally modified action distribution that satisfies a hard STL feasibility constraint by reasoning over the remaining horizon using forward dynamics propagation. We validate the proposed framework in simulation using a state-of-the-art robotics foundation model across multiple environments and complex specifications.
Chinese Translation
机器人基础模型在执行自然语言指令方面展现了强大的能力,能够适应多样的任务和环境。然而,它们仍然主要依赖数据,缺乏在部署过程中对安全性和时间依赖规范满足的正式保证。在实际应用中,机器人往往需要遵循涉及丰富时空要求的操作约束,例如时间限制的目标访问、顺序目标和持续的安全条件。在本研究中,我们提出了一种面向规范的动作分布优化框架,该框架在执行预训练的机器人基础模型时强制执行广泛的信号时序逻辑(Signal Temporal Logic, STL)约束,而无需修改其参数。在每个决策步骤中,该方法通过对剩余时间范围进行前向动力学传播推理,计算出满足严格STL可行性约束的最小修改动作分布。我们在多个环境和复杂规范下,使用一种先进的机器人基础模型在仿真中验证了所提出的框架。
cs.RO / 40 / 2603.17990
A Single-Fiber Optical Frequency Domain Reflectometry (OFDR)-Based Shape Sensing of Concentric Tube Steerable Drilling Robots
基于单光纤光频域反射测量(OFDR)的同心管可转向钻孔机器人形状感知
Abstract
This paper introduces a novel shape-sensing approach for Concentric Tube Steerable Drilling Robots (CT-SDRs) based on Optical Frequency Domain Reflectometry (OFDR). Unlike traditional FBG-based methods, OFDR enables continuous strain measurement along the entire fiber length with enhanced spatial resolution. In the proposed method, a Shape Sensing Assembly (SSA) is first fabricated by integrating a single OFDR fiber with a flat NiTi wire. The calibrated SSA is then routed through and housed within the internal channel of a flexible drilling instrument, which is guided by the pre-shaped NiTi tube of the CT-SDR. In this configuration, the drilling instrument serves as a protective sheath for the SSA during drilling, eliminating the need for integration or adhesion to the instrument surface that is typical of conventional optical sensor approaches. The performance of the proposed SSA, integrated within the cannulated CT-SDR, was thoroughly evaluated under free-bending conditions and during drilling along multiple J-shaped trajectories in synthetic Sawbones phantoms. Results demonstrate accurate and reliable shape-sensing capability, confirming the feasibility and robustness of this integration strategy.
Chinese Translation
本文介绍了一种基于光频域反射测量(OFDR)的同心管可转向钻孔机器人(CT-SDR)的新型形状感知方法。与传统的基于光纤布拉格光栅(FBG)的方法不同,OFDR能够沿整个光纤长度进行连续应变测量,并具有更高的空间分辨率。在所提出的方法中,首先通过将单根OFDR光纤与扁平镍钛(NiTi)丝结合,制造出形状感知组件(SSA)。经过校准的SSA随后被引导并安置在柔性钻孔工具的内部通道中,该工具由CT-SDR的预成型镍钛管引导。在这种配置中,钻孔工具在钻孔过程中充当SSA的保护外壳,消除了传统光学传感器方法中通常需要的与工具表面的集成或粘附。所提出的SSA在套管式CT-SDR内的性能在自由弯曲条件下以及在合成Sawbones模型中沿多个J形轨迹钻孔时得到了全面评估。结果表明,该方法具有准确可靠的形状感知能力,证实了这种集成策略的可行性和稳健性。
cs.CV / 1 / 2603.16876
Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation: Radiologist-Like Workflow with Clinically Verifiable Rewards
用于放射学报告生成的多模态多智能体强化学习:类似放射科医生的工作流程与临床可验证奖励
Abstract
We propose MARL-Rad, a novel multi-modal multi-agent reinforcement learning framework for radiology report generation that coordinates region-specific agents and a global integrating agent, optimized via clinically verifiable rewards. Unlike prior single-model reinforcement learning or post-hoc agentization of independently trained models, our method jointly trains multiple agents and optimizes the entire agent system through reinforcement learning. Experiments on the MIMIC-CXR and IU X-ray datasets show that MARL-Rad consistently improves clinical efficacy (CE) metrics such as RadGraph, CheXbert, and GREEN scores, achieving state-of-the-art CE performance. Further analyses confirm that MARL-Rad enhances laterality consistency and produces more accurate, detail-informed reports.
Chinese Translation
我们提出了MARL-Rad,一种新颖的多模态多智能体强化学习框架,用于放射学报告生成,该框架协调区域特定智能体和全局整合智能体,并通过临床可验证的奖励进行优化。与之前的单模型强化学习或独立训练模型的事后智能体化不同,我们的方法联合训练多个智能体,并通过强化学习优化整个智能体系统。在MIMIC-CXR和IU X-ray数据集上的实验表明,MARL-Rad在临床有效性(CE)指标方面持续改善,如RadGraph、CheXbert和GREEN评分,达到了最先进的CE性能。进一步的分析确认,MARL-Rad增强了侧别一致性,并生成了更准确、信息更详细的报告。
cs.CV / 2 / 2603.16883
Tokenization vs. Augmentation: A Systematic Study of Writer Variance in IMU-Based Online Handwriting Recognition
分词与增强:基于惯性测量单元的在线手写识别中书写者差异的系统研究
Abstract
Inertial measurement unit-based online handwriting recognition enables the recognition of input signals collected across different writing surfaces but remains challenged by uneven character distributions and inter-writer variability. In this work, we systematically investigate two strategies to address these issues: sub-word tokenization and concatenation-based data augmentation. Our experiments on the OnHW-Words500 dataset reveal a clear dichotomy between handling inter-writer and intra-writer variance. On the writer-independent split, structural abstraction via Bigram tokenization significantly improves generalization to unseen writing styles, reducing the word error rate (WER) from 15.40% to 12.99%. In contrast, on the writer-dependent split, tokenization degrades performance due to vocabulary distribution shifts between the training and validation sets. Instead, our proposed concatenation-based data augmentation acts as a powerful regularizer, reducing the character error rate by 34.5% and the WER by 25.4%. Further analysis shows that short, low-level tokens benefit model performance and that concatenation-based data augmentation performance gain surpasses those achieved by proportionally extended training. These findings reveal a clear variance-dependent effect: sub-word tokenization primarily mitigates inter-writer stylistic variability, whereas concatenation-based data augmentation effectively compensates for intra-writer distributional sparsity.
Chinese Translation
基于惯性测量单元的在线手写识别能够识别在不同书写表面收集的输入信号,但仍面临字符分布不均和书写者间变异性的问题。在本研究中,我们系统地探讨了两种应对这些问题的策略:子词分词和基于连接的数据增强。我们在 OnHW-Words500 数据集上的实验揭示了处理书写者间和书写者内变异性的明显二分法。在书写者独立的划分中,通过 Bigram 分词的结构抽象显著提高了对未见书写风格的泛化能力,将单词错误率(WER)从 15.40% 降低到 12.99%。相反,在书写者依赖的划分中,由于训练集和验证集之间词汇分布的变化,分词导致性能下降。相对而言,我们提出的基于连接的数据增强作为一种强有力的正则化方法,将字符错误率降低了 34.5%,将 WER 降低了 25.4%。进一步分析表明,短小的低级别标记有利于模型性能,而基于连接的数据增强所带来的性能提升超过了通过按比例扩展训练所获得的提升。这些发现揭示了明显的随变异类型而异的效应:子词分词主要缓解书写者间的风格变异,而基于连接的数据增强有效补偿了书写者内分布的稀疏性。
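The concatenation-based augmentation described above, stitching character-level sensor segments into synthetic words, can be sketched as below. The segment format (lists of IMU frames) and the sampling scheme are illustrative assumptions, not the paper's exact pipeline:

```python
import random

def concat_augment(char_segments, labels, n_new, max_len=6, seed=0):
    """Concatenation-based augmentation for IMU handwriting (sketch).

    Builds synthetic word samples by stitching together per-character
    sensor segments drawn from the training pool, which counters the
    sparsity of rare character contexts. `char_segments[i]` is a list
    of IMU frames for the character `labels[i]`.
    """
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        k = rng.randint(2, max_len)                       # synthetic word length
        idx = [rng.randrange(len(labels)) for _ in range(k)]
        signal = [frame for i in idx for frame in char_segments[i]]
        word = "".join(labels[i] for i in idx)
        out.append((signal, word))
    return out
```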
cs.CV / 3 / 2603.16927
Leveraging Large Vision Model for Multi-UAV Co-perception in Low-Altitude Wireless Networks
利用大型视觉模型实现低空无线网络中的多无人机协同感知
Abstract
Multi-uncrewed aerial vehicle (UAV) cooperative perception has emerged as a promising paradigm for diverse low-altitude economy applications, where complementary multi-view observations are leveraged to enhance perception performance via wireless communications. However, the massive visual data generated by multiple UAVs poses significant challenges in terms of communication latency and resource efficiency. To address these challenges, this paper proposes a communication-efficient cooperative perception framework, termed Base-Station-Helped UAV (BHU), which reduces communication overhead while enhancing perception performance. Specifically, we employ a Top-K selection mechanism to identify the most informative pixels from UAV-captured RGB images, enabling sparsified visual transmission with reduced data volume and latency. The sparsified images are transmitted to a ground server via multi-user MIMO (MU-MIMO), where a Swin-large-based MaskDINO encoder extracts bird's-eye-view (BEV) features and performs cooperative feature fusion for ground vehicle perception. Furthermore, we develop a diffusion model-based deep reinforcement learning (DRL) algorithm to jointly select cooperative UAVs, sparsification ratios, and precoding matrices, achieving a balance between communication efficiency and perception utility. Simulation results on the Air-Co-Pred dataset demonstrate that, compared with traditional CNN-based BEV fusion baselines, the proposed BHU framework improves perception performance by over 5% while reducing communication overhead by 85%, providing an effective solution for multi-UAV cooperative perception under resource-constrained wireless environments.
Chinese Translation
多无人机(UAV)协同感知已成为多种低空经济应用中一种有前景的范式,通过无线通信利用互补的多视角观测来增强感知性能。然而,多个无人机生成的大量视觉数据在通信延迟和资源效率方面带来了重大挑战。为了解决这些挑战,本文提出了一种通信高效的协同感知框架,称为基站辅助无人机(BHU),该框架在提高感知性能的同时减少了通信开销。具体而言,我们采用了一种Top-K选择机制,从无人机捕获的RGB图像中识别出最具信息量的像素,从而实现数据量和延迟减少的稀疏视觉传输。稀疏图像通过多用户MIMO(MU-MIMO)传输到地面服务器,在那里基于Swin-large的MaskDINO编码器提取鸟瞰视图(BEV)特征,并进行地面车辆感知的协同特征融合。此外,我们开发了一种基于扩散模型的深度强化学习(DRL)算法,联合选择协同无人机、稀疏化比例和预编码矩阵,实现通信效率与感知效用之间的平衡。在Air-Co-Pred数据集上的仿真结果表明,与传统的基于CNN的BEV融合基线相比,所提出的BHU框架在提高感知性能超过5%的同时,通信开销减少了85%,为资源受限的无线环境下的多无人机协同感知提供了有效的解决方案。
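The Top-K pixel selection step can be sketched as follows. In the paper, the per-pixel informativeness scores would come from a learned criterion; here they are simply an input array, and the mask-and-zero transmission format is an assumption:

```python
import numpy as np

def topk_sparsify(image, scores, keep_ratio=0.15):
    """Keep only the top-K most informative pixels of an HxWxC image (sketch).

    `scores` is a per-pixel informativeness map of shape (H, W).
    Everything below the top `keep_ratio` fraction is zeroed out
    before transmission, shrinking the payload to be sent over MU-MIMO.
    """
    h, w = scores.shape
    k = max(1, int(keep_ratio * h * w))
    # k-th largest score is the keep threshold.
    thresh = np.partition(scores.ravel(), -k)[-k]
    mask = scores >= thresh
    return image * mask[..., None], mask
```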
cs.CV / 4 / 2603.16930
Facial beauty prediction fusing transfer learning and broad learning system
融合迁移学习与宽度学习系统的面部美感预测
Abstract
Facial beauty prediction (FBP) is an important and challenging problem in the fields of computer vision and machine learning. Not only is it prone to overfitting due to the lack of large-scale, effective data, but it is also difficult to quickly build robust and effective facial beauty evaluation models because of the variability of facial appearance and the complexity of human perception. Transfer learning can reduce the dependence on large amounts of data and avoid overfitting, while the broad learning system (BLS) can complete model building and training quickly. For this purpose, this paper fuses transfer learning with BLS for FBP. First, a feature extractor is constructed from CNN models based on transfer learning (EfficientNets in this paper) for facial feature extraction, and the fused facial beauty features are passed to BLS for FBP, called E-BLS. Second, on the basis of E-BLS, a connection layer is designed to connect the feature extractor and BLS, called ER-BLS. Finally, experimental results show that, compared with existing BLS and CNN methods, E-BLS and ER-BLS improve FBP accuracy, demonstrating the effectiveness and superiority of the presented method, which can also be widely applied to pattern recognition, object detection, and image classification.
Chinese Translation
面部美感预测(FBP)是计算机视觉和机器学习领域中的一个重要且具有挑战性的问题。由于缺乏大规模有效的数据,它不仅容易出现过拟合现象,而且由于面部外观的变化性和人类感知的复杂性,快速构建稳健有效的面部美感评估模型也十分困难。迁移学习能够减少对大量数据的依赖,并避免过拟合问题。宽度学习系统(BLS)能够快速完成模型的构建和训练。为此,本文将迁移学习与BLS融合用于FBP。首先,基于迁移学习构建一个特征提取器,通过CNN模型进行面部特征提取,本文中使用了EfficientNets,并将提取的面部美感融合特征转移到BLS中,称为E-BLS。其次,在E-BLS的基础上,设计了一个连接层,将特征提取器与BLS连接起来,称为ER-BLS。最后,实验结果表明,与现有的BLS和CNN方法相比,E-BLS和ER-BLS提高了FBP的准确性,证明了所提出方法的有效性和优越性,该方法也可以广泛应用于模式识别、目标检测和图像分类。
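A broad learning system head of the kind fused here can be trained in closed form: random feature nodes plus nonlinear enhancement nodes, with output weights solved by ridge regression instead of gradient descent, which is why model building is fast. This is a generic BLS sketch on pre-extracted features, not the paper's exact E-BLS/ER-BLS configuration; all sizes and the regularizer are illustrative:

```python
import numpy as np

def broad_learning_fit(X, Y, n_feature=40, n_enhance=60, reg=1e-3, seed=0):
    """Fit a minimal broad learning system head (sketch).

    X: (n, d) extracted features (e.g. from a transfer-learned CNN).
    Y: (n, k) targets. Random linear feature nodes and tanh enhancement
    nodes are concatenated, and the output weights are solved by ridge
    regression in a single linear solve.
    """
    rng = np.random.default_rng(seed)
    Wf = rng.normal(size=(X.shape[1], n_feature))
    Z = X @ Wf                              # feature nodes
    We = rng.normal(size=(n_feature, n_enhance))
    H = np.tanh(Z @ We)                     # enhancement nodes
    A = np.hstack([Z, H])
    W = np.linalg.solve(A.T @ A + reg * np.eye(A.shape[1]), A.T @ Y)
    return Wf, We, W

def broad_learning_predict(model, X):
    Wf, We, W = model
    Z = X @ Wf
    A = np.hstack([Z, np.tanh(Z @ We)])
    return A @ W
```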
cs.CV / 5 / 2603.16931
Script-to-Slide Grounding: Grounding Script Sentences to Slide Objects for Automatic Instructional Video Generation
脚本与幻灯片对接:将脚本句子与幻灯片对象对接以实现自动化教学视频生成
Abstract
While slide-based videos augmented with visual effects are widely utilized in education and research presentations, the video editing process -- particularly applying visual effects to ground spoken content to slide objects -- remains highly labor-intensive. This study aims to develop a system that automatically generates such instructional videos from slides and corresponding scripts. As a foundational step, this paper proposes and formulates Script-to-Slide Grounding (S2SG), defined as the task of grounding script sentences to their corresponding slide objects. Furthermore, as an initial step, we propose "Text-S2SG," a method that utilizes a large language model (LLM) to perform this grounding task for text objects. Our experiments demonstrate that the proposed method achieves high performance (F1-score: 0.924). The contribution of this work is the formalization of a previously implicit slide-based video editing process into a computable task, thereby paving the way for its automation.
Chinese Translation
尽管增强视觉效果的幻灯片视频在教育和研究演示中被广泛使用,但视频编辑过程——特别是将视觉效果应用于将口语内容与幻灯片对象对接——仍然非常劳动密集。本研究旨在开发一个系统,能够从幻灯片和相应的脚本自动生成此类教学视频。作为基础步骤,本文提出并定义了脚本与幻灯片对接(Script-to-Slide Grounding, S2SG),即将脚本句子与其对应的幻灯片对象对接的任务。此外,作为初步步骤,我们提出了“文本-S2SG”(Text-S2SG),这是一种利用大型语言模型(Large Language Model, LLM)来执行文本对象对接任务的方法。我们的实验表明,所提出的方法达到了高性能(F1-score: 0.924)。本工作的贡献在于将之前隐含的基于幻灯片的视频编辑过程形式化为一个可计算的任务,从而为其自动化铺平了道路。
cs.CV / 6 / 2603.16932
Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs
关注重要区域:面向高效视觉语言模型的高分辨率图像裁剪检索
Abstract
Vision-language models (VLMs) typically process images at a native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs favor efficiency but potentially miss critical visual information, like small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
Chinese Translation
视觉语言模型(VLMs)通常以原生高分辨率处理图像,这在准确性与计算效率之间形成了权衡:高分辨率输入能够捕捉细节,但会带来显著的计算成本,而低分辨率输入虽然有利于效率,但可能会错过关键的视觉信息,例如小文本。我们提出了 AwaRes,一个按需空间框架,通过在低分辨率全局视图上操作并使用工具调用仅检索特定查询所需的高分辨率片段,从而解决了这一准确性与效率的权衡。我们自动构建监督数据:一个评审模型比较低分辨率与高分辨率的答案,以标记是否需要裁剪,而一个oracle定位模型则定位正确答案的证据,我们将其映射到一个离散裁剪集,以形成多轮工具使用轨迹。我们用冷启动的 SFT 训练我们的框架,随后进行多轮 GRPO,采用复合奖励,将语义答案的正确性与显式的裁剪成本惩罚相结合。项目页面:https://nimrodshabtay.github.io/AwaRes
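The mapping from grounded evidence to a discrete crop set can be illustrated with a simple fixed-grid scheme; the grid size, box format, and cell indexing below are assumptions for illustration, not the paper's actual crop set:

```python
def box_to_crops(box, img_w, img_h, grid=3):
    """Map an evidence box to cells of a fixed crop grid (sketch).

    The image is divided into a `grid` x `grid` set of candidate
    high-resolution crops; returns the indices of every cell the
    evidence box overlaps, forming the target of the retrieval
    tool call. `box` is (x0, y0, x1, y1) in pixels.
    """
    x0, y0, x1, y1 = box
    cw, ch = img_w / grid, img_h / grid
    cells = set()
    for gy in range(grid):
        for gx in range(grid):
            cx0, cy0 = gx * cw, gy * ch
            cx1, cy1 = cx0 + cw, cy0 + ch
            # Standard axis-aligned rectangle overlap test.
            if x0 < cx1 and x1 > cx0 and y0 < cy1 and y1 > cy0:
                cells.add(gy * grid + gx)
    return sorted(cells)
```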
cs.CV / 7 / 2603.16934
AgriChat: A Multimodal Large Language Model for Agriculture Image Understanding
AgriChat:一种用于农业图像理解的多模态大型语言模型
Abstract
The deployment of Multimodal Large Language Models (MLLMs) in agriculture is currently stalled by a critical trade-off: the existing literature lacks the large-scale agricultural datasets required for robust model development and evaluation, while current state-of-the-art models lack the verified domain expertise necessary to reason across diverse taxonomies. To address these challenges, we propose the Vision-to-Verified-Knowledge (V2VK) pipeline, a novel generative AI-driven annotation framework that integrates visual captioning with web-augmented scientific retrieval to autonomously generate the AgriMM benchmark, effectively eliminating biological hallucinations by grounding training data in verified phytopathological literature. The AgriMM benchmark contains over 3,000 agricultural classes and more than 607k VQAs spanning multiple tasks, including fine-grained plant species identification, plant disease symptom recognition, crop counting, and ripeness assessment. Leveraging this verifiable data, we present AgriChat, a specialized MLLM that presents broad knowledge across thousands of agricultural classes and provides detailed agricultural assessments with extensive explanations. Extensive evaluation across diverse tasks, datasets, and evaluation conditions reveals both the capabilities and limitations of current agricultural MLLMs, while demonstrating AgriChat's superior performance over other open-source models, including internal and external benchmarks. The results validate that preserving visual detail combined with web-verified knowledge constitutes a reliable pathway toward robust and trustworthy agricultural AI. The code and dataset are publicly available at https://github.com/boudiafA/AgriChat .
Chinese Translation
多模态大型语言模型(MLLMs)在农业中的应用目前受到一个关键权衡的制约:现有文献缺乏大规模农业数据集,这些数据集对于稳健的模型开发和评估至关重要,而当前的最先进模型缺乏必要的经过验证的领域专业知识,以便在多样化的分类法中进行推理。为了解决这些挑战,我们提出了视觉到验证知识(Vision-to-Verified-Knowledge, V2VK)管道,这是一种新颖的生成式人工智能驱动的注释框架,结合了视觉描述和网络增强的科学检索,能够自主生成AgriMM基准,有效消除生物幻觉,通过将训练数据基于经过验证的植物病理学文献进行基础化。AgriMM基准包含超过3000个农业类别和超过607,000个视觉问答(VQAs),涵盖多个任务,包括细粒度植物物种识别、植物病害症状识别、作物计数和成熟度评估。利用这些可验证的数据,我们提出了AgriChat,一种专门的MLLM,展示了数千个农业类别的广泛知识,并提供详细的农业评估和广泛的解释。在多样化的任务、数据集和评估条件下的广泛评估揭示了当前农业MLLMs的能力和局限性,同时展示了AgriChat在其他开源模型(包括内部和外部基准)上的优越性能。结果验证了保留视觉细节与网络验证知识相结合,构成了通向稳健和可信赖的农业人工智能的可靠途径。代码和数据集可在 https://github.com/boudiafA/AgriChat 上公开获取。
cs.CV / 8 / 2603.16935
GenLie: A Global-Enhanced Lie Detection Network under Sparsity and Semantic Interference
GenLie:一种在稀疏性和语义干扰下的全局增强谎言检测网络
Abstract
Video-based lie detection aims to identify deceptive behaviors from visual cues. Despite recent progress, its core challenge lies in learning sparse yet discriminative representations. Deceptive signals are typically subtle and short-lived, easily overwhelmed by redundant information, while individual and contextual variations introduce strong identity-related noise. To address this issue, we propose GenLie, a Global-Enhanced Lie Detection Network that performs local feature modeling under global supervision. Specifically, sparse and subtle deceptive cues are captured at the local level, while global supervision and optimization ensure robust and discriminative representations by suppressing identity-related noise. Experiments on three public datasets, covering both high- and low-stakes scenarios, show that GenLie consistently outperforms state-of-the-art methods. Source code is available at https://github.com/AliasDictusZ1/GenLie.
Chinese Translation
基于视频的谎言检测旨在从视觉线索中识别欺骗行为。尽管近期取得了一些进展,其核心挑战在于学习稀疏而具有区分性的表征。欺骗信号通常是微妙且短暂的,容易被冗余信息淹没,而个体和上下文的变化则引入了强烈的身份相关噪声。为了解决这一问题,我们提出了GenLie,一种在全局监督下进行局部特征建模的全局增强谎言检测网络。具体而言,稀疏且微妙的欺骗线索在局部层面被捕获,而全局监督和优化则通过抑制身份相关噪声来确保稳健且具有区分性的表征。在三个公共数据集上的实验结果表明,GenLie在高风险和低风险场景中均优于最先进的方法。源代码可在 https://github.com/AliasDictusZ1/GenLie 获取。
cs.CV / 9 / 2603.16936
TDMM-LM: Bridging Facial Understanding and Animation via Language Models
TDMM-LM:通过语言模型连接面部理解与动画
Abstract
Text-guided human body animation has advanced rapidly, yet facial animation lags due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. We design a prompt suite covering emotions and head motions, generate about 80 hours of facial videos with multiple generators, and fit per-frame 3D facial parameters, yielding large-scale (prompt and parameter) pairs for training. Building on this dataset, we probe language models for bidirectional competence over facial motion via two complementary tasks: (1) Motion2Language: given a sequence of 3D facial parameters, the model produces natural-language descriptions capturing content, style, and dynamics; and (2) Language2Motion: given a prompt, the model synthesizes the corresponding sequence of 3D facial parameters via quantized motion tokens for downstream animation. Extensive experiments show that in this setting language models can both interpret and synthesize facial motion with strong generalization. To the best of our knowledge, this is the first work to cast facial-parameter modeling as a language problem, establishing a unified path for text-conditioned facial animation and motion understanding.
Chinese Translation
文本引导的人体动画发展迅速,但面部动画由于缺乏良好注释的文本配对面部语料库而滞后。为了填补这一空白,我们利用基础生成模型合成了一个大型、平衡的面部行为语料库。我们设计了一套涵盖情感和头部动作的提示,使用多个生成器生成约80小时的面部视频,并为每帧拟合3D面部参数,从而产生用于训练的大规模(提示和参数)对。基于该数据集,我们探讨了语言模型在面部运动上的双向能力,通过两个互补任务进行验证:(1)Motion2Language:给定一系列3D面部参数,模型生成自然语言描述,捕捉内容、风格和动态;(2)Language2Motion:给定一个提示,模型通过量化运动标记合成相应的3D面部参数序列以用于后续动画。大量实验表明,在这种设置下,语言模型能够有效地解释和合成面部运动,并具有强大的泛化能力。据我们所知,这是首次将面部参数建模视为语言问题的研究,为文本条件的面部动画和运动理解建立了统一的路径。
cs.CV / 10 / 2603.16939
Solution for 10th Competition on Ambivalence/Hesitancy (AH) Video Recognition Challenge using Divergence-Based Multimodal Fusion
基于散度的多模态融合解决第十届模棱两可/犹豫(AH)视频识别挑战
Abstract
We address the Ambivalence/Hesitancy (A/H) Video Recognition Challenge at the 10th ABAW Competition (CVPR 2026). We propose a divergence-based multimodal fusion that explicitly measures cross-modal conflict between visual, audio, and textual channels. Visual features are encoded as Action Units (AUs) extracted via Py-Feat, audio via Wav2Vec 2.0, and text via BERT. Each modality is processed by a BiLSTM with attention pooling and projected into a shared embedding space. The fusion module computes pairwise absolute differences between modality embeddings, directly capturing the incongruence that characterizes A/H. On the BAH dataset, our approach achieves a Macro F1 of 0.6808 on the validation test set, outperforming the challenge baseline of 0.2827. Statistical analysis across 1,132 videos confirms that temporal variability of AUs is the dominant visual discriminator of A/H.
Chinese Translation
我们针对第十届ABAW竞赛(CVPR 2026)的模棱两可/犹豫(A/H)视频识别挑战提出了解决方案。我们提出了一种基于散度的多模态融合方法,该方法明确测量视觉、音频和文本通道之间的跨模态冲突。视觉特征通过Py-Feat提取的动作单元(Action Units, AUs)进行编码,音频通过Wav2Vec 2.0处理,文本通过BERT处理。每种模态都通过带有注意力池化的双向长短期记忆网络(BiLSTM)进行处理,并投影到共享的嵌入空间。融合模块计算模态嵌入之间的成对绝对差异,直接捕捉A/H特征的不一致性。在BAH数据集上,我们的方法在验证测试集上实现了0.6808的宏F1值,超越了挑战基线的0.2827。对1,132个视频的统计分析确认,AUs的时间变异性是A/H的主要视觉区分因素。
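The fusion step described above, pairwise absolute differences between modality embeddings, can be sketched directly. Concatenating the raw embeddings alongside the divergence features as the classifier input is an assumption about the architecture, not a detail given in the abstract:

```python
import numpy as np

def divergence_fusion(z_vis, z_aud, z_txt):
    """Divergence-based fusion of modality embeddings (sketch).

    Computes pairwise absolute differences between the visual, audio,
    and text embeddings and concatenates them with the embeddings
    themselves, so a downstream classifier sees explicit cross-modal
    conflict features that characterize ambivalence/hesitancy.
    """
    d_va = np.abs(z_vis - z_aud)
    d_vt = np.abs(z_vis - z_txt)
    d_at = np.abs(z_aud - z_txt)
    return np.concatenate([z_vis, z_aud, z_txt, d_va, d_vt, d_at])
```

Congruent modalities produce all-zero divergence features, while conflicting channels (e.g. calm speech with agitated facial AUs) light them up.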
cs.CV / 11 / 2603.16943
KGS-GCN: Enhancing Sparse Skeleton Sensing via Kinematics-Driven Gaussian Splatting and Probabilistic Topology for Action Recognition
KGS-GCN:通过运动学驱动的高斯溅射和概率拓扑增强稀疏骨架感知以进行动作识别
Abstract
Skeleton-based action recognition is widely utilized in sensor systems including human-computer interaction and intelligent surveillance. Nevertheless, current sensor devices typically generate sparse skeleton data as discrete coordinates, which inevitably discards fine-grained spatiotemporal details during highly dynamic movements. Moreover, the rigid constraints of predefined physical sensor topologies hinder the modeling of latent long-range dependencies. To overcome these limitations, we propose KGS-GCN, a graph convolutional network that integrates kinematics-driven Gaussian splatting with probabilistic topology. Our framework explicitly addresses the challenges of sensor data sparsity and topological rigidity by transforming discrete joints into continuous generative representations. Firstly, a kinematics-driven Gaussian splatting module is designed to dynamically construct anisotropic covariance matrices using instantaneous joint velocity vectors. This module enhances visual representation by rendering sparse skeleton sequences into multi-view continuous heatmaps rich in spatiotemporal semantics. Secondly, to transcend the limitations of fixed physical connections, a probabilistic topology construction method is proposed. This approach generates an adaptive prior adjacency matrix by quantifying statistical correlations via the Bhattacharyya distance between joint Gaussian distributions. Ultimately, the GCN backbone is adaptively modulated by the rendered visual features via a visual context gating mechanism. Empirical results demonstrate that KGS-GCN significantly enhances the modeling of complex spatiotemporal dynamics. By addressing the inherent limitations of sparse inputs, our framework offers a robust solution for processing low-fidelity sensor data. This approach establishes a practical pathway for improving perceptual reliability in real-world sensing applications.
Chinese Translation
基于骨架的动作识别广泛应用于包括人机交互和智能监控在内的传感器系统。然而,目前的传感器设备通常以离散坐标的形式生成稀疏的骨架数据,这不可避免地在高度动态的运动中丢弃了细粒度的时空细节。此外,预定义物理传感器拓扑的刚性约束阻碍了潜在长距离依赖关系的建模。为了克服这些限制,我们提出了KGS-GCN,一种将运动学驱动的高斯溅射与概率拓扑相结合的图卷积网络。我们的框架通过将离散关节转化为连续的生成式表示,明确解决了传感器数据稀疏性和拓扑刚性的挑战。首先,设计了一个运动学驱动的高斯溅射模块,利用瞬时关节速度向量动态构建各向异性协方差矩阵。该模块通过将稀疏骨架序列渲染为富含时空语义的多视角连续热图,增强了视觉表示。其次,为了超越固定物理连接的限制,提出了一种概率拓扑构建方法。该方法基于关节高斯分布之间的Bhattacharyya距离量化统计相关性,从而生成自适应的先验邻接矩阵。最终,渲染的视觉特征通过视觉上下文门控机制自适应地调制GCN主干。实证结果表明,KGS-GCN显著增强了复杂时空动态的建模能力。通过解决稀疏输入的固有限制,我们的框架为处理低保真度传感器数据提供了一种稳健的解决方案。这种方法为提高现实世界传感应用中的感知可靠性建立了一条实用路径。
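The abstract does not give the exact formulas, but its two core operations, velocity-driven anisotropic covariances and a Bhattacharyya-distance adjacency prior, can be sketched as follows. This is an illustrative sketch only: the specific covariance form and the `sigma` and `alpha` parameters are assumptions, not the paper's.

```python
import numpy as np

def velocity_covariance(v, sigma=0.05, alpha=0.5):
    """Illustrative anisotropic covariance: an isotropic base term
    elongated along the joint's instantaneous velocity direction."""
    return sigma ** 2 * np.eye(3) + alpha * np.outer(v, v)

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    maha = 0.125 * diff @ np.linalg.solve(cov, diff)
    logdet = 0.5 * np.log(
        np.linalg.det(cov) / np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2))
    )
    return maha + logdet

def adjacency_prior(mus, covs):
    """Soft prior adjacency: small Bhattacharyya distance -> strong edge,
    row-normalized so each joint's outgoing edge weights sum to 1."""
    n = len(mus)
    A = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            A[i, j] = np.exp(-bhattacharyya(mus[i], covs[i], mus[j], covs[j]))
    return A / A.sum(axis=1, keepdims=True)
```

Statistically similar joint distributions thus receive strong edges even without a physical bone connection, which is how the prior can capture long-range dependencies beyond the fixed sensor topology.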
cs.CV / 12 / 2603.16944
Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models
Omni IIE Bench:图像编辑模型实际能力的基准测试
Abstract
While Instruction-based Image Editing (IIE) has achieved significant progress, existing benchmarks pursue task breadth via mixed evaluations. This paradigm obscures a critical failure mode crucial in professional applications: the inconsistent performance of models across tasks of varying semantic scales. To address this gap, we introduce Omni IIE Bench, a high-quality, human-annotated benchmark specifically designed to diagnose the editing consistency of IIE models in practical application scenarios. Omni IIE Bench features an innovative dual-track diagnostic design: (1) Single-turn Consistency, comprising shared-context task pairs of attribute modification and entity replacement; and (2) Multi-turn Coordination, involving continuous dialogue tasks that traverse semantic scales. The benchmark is constructed via an exceptionally rigorous multi-stage human filtering process, incorporating a quality standard enforced by computer vision graduate students and an industry relevance review conducted by professional designers. We perform a comprehensive evaluation of 8 mainstream IIE models using Omni IIE Bench. Our analysis quantifies, for the first time, a prevalent performance gap: nearly all models exhibit a significant performance degradation when transitioning from low-semantic-scale to high-semantic-scale tasks. Omni IIE Bench provides critical diagnostic tools and insights for the development of next-generation, more reliable, and stable IIE models.
Chinese Translation
尽管基于指令的图像编辑(IIE)已取得显著进展,但现有基准通过混合评估追求任务广度。这一范式掩盖了在专业应用中至关重要的一个失败模式:模型在不同语义尺度任务中的不一致表现。为了解决这一问题,我们引入了Omni IIE Bench,这是一个高质量的人类注释基准,专门设计用于诊断IIE模型在实际应用场景中的编辑一致性。Omni IIE Bench具有创新的双轨诊断设计:(1)单轮一致性,包括属性修改和实体替换的共享上下文任务对;(2)多轮协调,涉及跨越语义尺度的连续对话任务。该基准通过一个极其严格的多阶段人类筛选过程构建,结合了计算机视觉研究生执行的质量标准和专业设计师进行的行业相关性审查。我们使用Omni IIE Bench对8个主流IIE模型进行了全面评估。我们的分析首次量化了一个普遍的性能差距:几乎所有模型在从低语义尺度任务转向高语义尺度任务时表现出显著的性能下降。Omni IIE Bench为下一代更可靠和稳定的IIE模型的发展提供了关键的诊断工具和见解。
cs.CV / 13 / 2603.16945
Joint Optimization of Storage and Loading for High-Performance 3D Point Cloud Data Processing
高性能三维点云数据处理的存储与加载联合优化
Abstract
With the rapid development of computer vision and deep learning, significant advancements have been made in 3D vision, particularly in autonomous driving, robotic perception, and augmented reality. 3D point cloud data, as a crucial representation of 3D information, has gained widespread attention. However, the vast scale and complexity of point cloud data present significant challenges for loading and processing, and traditional algorithms struggle to handle large-scale datasets. The diversity of storage formats for point cloud datasets (e.g., PLY, XYZ, BIN) adds complexity to data handling and results in inefficiencies in data preparation. Although binary formats like BIN and NPY have been used to speed up data access, they still do not fully address the time-consuming data loading and processing phase. To overcome these challenges, we propose the .PcRecord format, a unified data storage solution designed to reduce storage occupation and accelerate the processing of point cloud data. We also introduce a high-performance data processing pipeline equipped with multiple modules. By leveraging a multi-stage parallel pipeline architecture, our system optimizes the use of computational resources, significantly improving processing speed and efficiency. This paper details the implementation of this system and demonstrates its effectiveness in addressing the challenges of handling large-scale point cloud datasets. On average, our system achieves performance improvements of 6.61x (ModelNet40), 2.69x (S3DIS), 2.23x (ShapeNet), 3.09x (KITTI), 8.07x (SUN RGB-D), and 5.67x (ScanNet) with GPU, and 6.9x, 1.88x, 1.29x, 2.28x, 25.4x, and 19.3x with Ascend.
Chinese Translation
随着计算机视觉和深度学习的快速发展,三维视觉领域取得了显著进展,特别是在自动驾驶、机器人感知和增强现实方面。三维点云数据作为三维信息的重要表示形式,受到了广泛关注。然而,点云数据的庞大规模和复杂性给加载和处理带来了重大挑战,传统算法难以处理大规模数据集。点云数据集的存储格式多样性(例如,PLY、XYZ、BIN)增加了数据处理的复杂性,并导致数据准备效率低下。尽管像BIN和NPY这样的二进制格式已被用于加速数据访问,但它们仍未完全解决耗时的数据加载和处理阶段。为了解决这些挑战,我们提出了.PcRecord格式,这是一种统一的数据存储解决方案,旨在减少存储占用并加速点云数据的处理。我们还引入了一种高性能的数据处理管道,配备多个模块。通过利用多阶段并行管道架构,我们的系统优化了计算资源的使用,显著提高了处理速度和效率。本文详细介绍了该系统的实现,并展示了其在处理大规模点云数据集方面的有效性。平均而言,我们的系统在GPU上分别实现了6.61倍(ModelNet40)、2.69倍(S3DIS)、2.23倍(ShapeNet)、3.09倍(KITTI)、8.07倍(SUN RGB-D)和5.67倍(ScanNet)的性能提升;在Ascend上则分别为6.9倍、1.88倍、1.29倍、2.28倍、25.4倍和19.3倍。
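The actual on-disk layout of .PcRecord is not specified in the abstract; the general idea of a unified binary record, a small fixed header followed by a contiguous payload so each sample loads in one sequential read, can be sketched as follows. The header layout and file extension here are hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical record layout: an int32 header [num_points, num_features]
# followed by a contiguous float32 payload. A single sequential read
# replaces per-format parsing of PLY/XYZ/BIN files.
def write_record(path, points):
    points = np.ascontiguousarray(points, dtype=np.float32)
    with open(path, "wb") as f:
        np.array(points.shape, dtype=np.int32).tofile(f)
        points.tofile(f)

def read_record(path):
    with open(path, "rb") as f:
        n, d = np.fromfile(f, dtype=np.int32, count=2)
        return np.fromfile(f, dtype=np.float32, count=int(n) * int(d)).reshape(n, d)

# Round-trip a small cloud (x, y, z, intensity).
cloud = np.random.rand(1024, 4).astype(np.float32)
rec = os.path.join(tempfile.mkdtemp(), "sample.pcrecord")
write_record(rec, cloud)
assert np.array_equal(read_record(rec), cloud)
```

In a multi-stage parallel pipeline, one worker pool would perform such raw reads while others handle decoding and augmentation, overlapping I/O with compute.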
cs.CV / 14 / 2603.16947
EmergeNav: Structured Embodied Inference for Zero-Shot Vision-and-Language Navigation in Continuous Environments
EmergeNav:用于连续环境中零样本视觉与语言导航的结构化具身推理
Abstract
Zero-shot vision-and-language navigation in continuous environments (VLN-CE) remains challenging for modern vision-language models (VLMs). Although these models encode useful semantic priors, their open-ended reasoning does not directly translate into stable long-horizon embodied execution. We argue that the key bottleneck is not missing knowledge alone, but missing an execution structure for organizing instruction following, perceptual grounding, temporal progress, and stage verification. We propose EmergeNav, a zero-shot framework that formulates continuous VLN as structured embodied inference. EmergeNav combines a Plan--Solve--Transition hierarchy for stage-structured execution, GIPE for goal-conditioned perceptual extraction, contrastive dual-memory reasoning for progress grounding, and role-separated Dual-FOV sensing for time-aligned local control and boundary verification. On VLN-CE, EmergeNav achieves strong zero-shot performance using only open-source VLM backbones and no task-specific training, explicit maps, graph search, or waypoint predictors, reaching 30.00 SR with Qwen3-VL-8B and 37.00 SR with Qwen3-VL-32B. These results suggest that explicit execution structure is a key ingredient for turning VLM priors into stable embodied navigation behavior.
Chinese Translation
在连续环境中进行零样本视觉与语言导航(VLN-CE)对现代视觉语言模型(VLMs)仍然具有挑战性。尽管这些模型编码了有用的语义先验,但它们的开放式推理并未直接转化为稳定的长时程具身执行。我们认为,关键瓶颈不仅在于缺乏知识,还在于缺乏组织指令跟随、感知基础、时间进展和阶段验证的执行结构。我们提出了EmergeNav,一个将连续VLN形式化为结构化具身推理的零样本框架。EmergeNav结合了用于阶段结构执行的计划-解决-过渡层次结构(Plan--Solve--Transition hierarchy)、用于目标条件感知提取的GIPE、用于进展基础的对比双重记忆推理,以及用于时间对齐的局部控制和边界验证的角色分离双视场(Dual-FOV)感知。在VLN-CE上,EmergeNav仅使用开源VLM骨干网络,且不进行任务特定训练、显式地图、图搜索或路径点预测,便实现了强大的零样本性能,使用Qwen3-VL-8B达到30.00的成功率(SR),使用Qwen3-VL-32B达到37.00的成功率(SR)。这些结果表明,显式执行结构是将VLM先验转化为稳定具身导航行为的关键因素。
cs.CV / 15 / 2603.16958
PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models
PhysQuantAgent:一种用于视觉-语言模型的质量估计推理管道
Abstract
Vision-Language Models (VLMs) are increasingly applied to robotic perception and manipulation, yet their ability to infer physical properties required for manipulation remains limited. In particular, estimating the mass of real-world objects is essential for determining appropriate grasp force and ensuring safe interaction. However, current VLMs lack reliable mass reasoning capabilities, and most existing benchmarks do not explicitly evaluate physical quantity estimation under realistic sensing conditions. In this work, we propose PhysQuantAgent, a framework for real-world object mass estimation using VLMs, together with VisPhysQuant, a new benchmark dataset for evaluation. VisPhysQuant consists of RGB-D videos of real objects captured from multiple viewpoints, annotated with precise mass measurements. To improve estimation accuracy, we introduce three visual prompting methods that enhance the input image with object detection, scale estimation, and cross-sectional image generation to help the model comprehend the size and internal structure of the target object. Experiments show that visual prompting significantly improves mass estimation accuracy on real-world data, suggesting the efficacy of integrating spatial reasoning with VLM knowledge for physical inference.
Chinese Translation
视觉-语言模型(VLMs)越来越多地应用于机器人感知和操作,但它们推断操作所需的物理属性的能力仍然有限。特别是,估计现实世界物体的质量对于确定适当的抓取力和确保安全交互至关重要。然而,目前的VLMs缺乏可靠的质量推理能力,大多数现有基准并未明确评估在现实感知条件下的物理量估计。在本研究中,我们提出了PhysQuantAgent,一个基于VLMs的现实世界物体质量估计框架,并推出了VisPhysQuant,一个用于评估的新基准数据集。VisPhysQuant包含从多个视角捕获的真实物体的RGB-D视频,并附有精确的质量测量注释。为了提高估计准确性,我们引入了三种视觉提示方法,通过物体检测、尺度估计和横截面图像生成增强输入图像,帮助模型理解目标物体的大小和内部结构。实验表明,视觉提示显著提高了在现实世界数据上的质量估计准确性,表明将空间推理与VLM知识结合用于物理推理的有效性。
cs.CV / 16 / 2603.16964
Behavior-Centric Extraction of Scenarios from Highway Traffic Data and their Domain-Knowledge-Guided Clustering using CVQ-VAE
基于行为的高速公路交通数据场景提取及其领域知识引导的聚类方法研究
Abstract
Approval of an Automated Driving System (ADS) depends on evaluating its behavior within representative real-world traffic scenarios. A common way to obtain such scenarios is to extract them from real-world data recordings. These can then be grouped and serve as the basis on which the ADS is subsequently tested. This poses two central challenges: how scenarios are extracted and how they are grouped. Existing extraction methods rely on heterogeneous definitions, hindering scenario comparability. For the grouping of scenarios, rule-based or ML-based methods can be utilized. However, while modern ML-based approaches can handle the complexity of traffic scenarios, unlike rule-based approaches they lack interpretability and may not align with domain knowledge. This work contributes a standardized scenario extraction based on the Scenario-as-Specification concept, as well as a domain-knowledge-guided scenario clustering process. Experiments on the highD dataset demonstrate that scenarios can be extracted reliably and that domain knowledge can be effectively integrated into the clustering process. As a result, the proposed methodology supports a more standardized process for deriving scenario categories from highway data recordings and thus enables a more efficient validation process for automated vehicles.
Chinese Translation
自动驾驶系统(ADS)的批准依赖于在具有代表性的真实交通场景中评估其行为。获取此类场景的常见方法是从真实数据记录中提取。这些场景可以被分组,并作为后续测试ADS的基础。这带来了两个核心挑战:场景的提取方式和分组方式。现有的提取方法依赖于异构定义,妨碍了场景的可比性。在场景的分组方面,可以利用基于规则或基于机器学习(ML)的方法。然而,尽管现代基于机器学习的方法能够处理交通场景的复杂性,但与基于规则的方法不同,它们缺乏可解释性,并且可能与领域知识不一致。本研究基于场景作为规范(Scenario-as-Specification)概念,贡献了一种标准化的场景提取方法,以及一种领域知识引导的场景聚类过程。在highD数据集上的实验表明,场景可以可靠地提取,并且领域知识可以有效地融入聚类过程。因此,所提出的方法论支持从高速公路数据记录中推导场景类别的更标准化过程,从而使自动驾驶车辆的验证过程更加高效。
cs.CV / 17 / 2603.16966
CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization
CineSRD:利用视觉、声学和语言线索进行开放世界视觉媒体说话人分离
Abstract
Traditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration & Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers and then integrates an audio language model for speaker turn detection, refining annotations and supplementing unregistered off-screen speakers. Furthermore, we construct and release a dedicated speaker diarization benchmark for visual media that includes Chinese and English programs. Experimental results demonstrate that CineSRD achieves superior performance on the proposed benchmark and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings.
Chinese Translation
传统的说话人分离系统主要集中在会议和访谈等受限场景中,这些场景中的说话人数有限且声学条件相对干净。为了探索开放世界的说话人分离,我们将这一任务扩展到视觉媒体领域,涵盖电影和电视剧等复杂的视听节目。这一新环境带来了若干挑战,包括长视频理解、大量说话人、音频与视觉线索之间的跨模态异步性以及不可控的野外变异性。为了解决这些挑战,我们提出了电影说话人注册与分离(CineSRD),这是一个统一的多模态框架,利用来自视频、语音和字幕的视觉、声学和语言线索进行说话人标注。CineSRD首先执行视觉锚点聚类以注册初始说话人,然后集成音频语言模型以进行说话人轮次检测,细化标注并补充未注册的离屏说话人。此外,我们构建并发布了一个专门针对视觉媒体的说话人分离基准,包括中文和英文节目。实验结果表明,CineSRD在所提出的基准上实现了优越的性能,并在传统数据集上取得了竞争性结果,验证了其在开放世界视觉媒体环境中的鲁棒性和泛化能力。
cs.CV / 18 / 2603.16967
MSRAMIE: Multimodal Structured Reasoning Agent for Multi-instruction Image Editing
MSRAMIE:用于多指令图像编辑的多模态结构推理代理
Abstract
Existing instruction-based image editing models perform well with simple, single-step instructions but degrade in realistic scenarios that involve multiple, lengthy, and interdependent directives. A main cause is the scarcity of training data with complex multi-instruction annotations. However, it is costly to collect such data and retrain these models. To address this challenge, we propose MSRAMIE, a training-free agent framework built on a Multimodal Large Language Model (MLLM). MSRAMIE takes existing editing models as plug-in components and handles multi-instruction tasks via structured multimodal reasoning. It orchestrates iterative interactions between an MLLM-based Instructor and an image editing Actor, introducing a novel reasoning topology that comprises the proposed Tree-of-States and Graph-of-References. During inference, complex instructions are decomposed into multiple editing steps that enable state transitions, cross-step information aggregation, and original input recall, which allows systematic exploration of the image editing space and flexible progressive output refinement. The visualizable inference topology further provides interpretable and controllable decision pathways. Experiments show that as instruction complexity increases, MSRAMIE improves instruction following by over 15% and increases the probability of finishing all modifications in a single run by over 100%, while preserving perceptual quality and maintaining visual consistency.
Chinese Translation
现有的基于指令的图像编辑模型在处理简单的单步指令时表现良好,但在涉及多个、冗长且相互依赖的指令的现实场景中表现不佳。主要原因是缺乏具有复杂多指令注释的训练数据。然而,收集此类数据并重新训练这些模型的成本较高。为了解决这一挑战,我们提出了MSRAMIE,一个基于多模态大型语言模型(MLLM)的无训练代理框架。MSRAMIE将现有的编辑模型作为插件组件,并通过结构化的多模态推理处理多指令任务。它协调基于MLLM的指导者与图像编辑者之间的迭代交互,引入了一种新颖的推理拓扑结构,包括所提出的状态树(Tree-of-States)和参考图(Graph-of-References)。在推理过程中,复杂指令被分解为多个编辑步骤,从而实现状态转换、跨步骤信息聚合和原始输入回忆,这使得对图像编辑空间的系统探索和灵活的渐进输出优化成为可能。可视化的推理拓扑进一步提供了可解释和可控的决策路径。实验表明,随着指令复杂性的增加,MSRAMIE能够提高指令遵循率超过15%,并在单次运行中完成所有修改的概率提高超过100%,同时保持感知质量和视觉一致性。
cs.CV / 19 / 2603.16970
Continual Multimodal Egocentric Activity Recognition via Modality-Aware Novel Detection
通过模态感知的新颖检测实现持续的多模态自我中心活动识别
Abstract
Multimodal egocentric activity recognition integrates visual and inertial cues for robust first-person behavior understanding. However, deploying such systems in open-world environments requires detecting novel activities while continuously learning from non-stationary streams. Existing methods rely on the main logits for novelty scoring, without fully exploiting the complementary evidence available from individual modalities. Because these logits are often dominated by RGB, cues from other modalities, particularly IMU, remain underutilized, and this imbalance worsens over time under catastrophic forgetting. To address this, we propose MAND, a modality-aware framework for multimodal egocentric open-world continual learning. At inference, Modality-aware Adaptive Scoring (MoAS) estimates sample-wise modality reliability from energy scores and adaptively integrates modality logits to better exploit complementary modality cues for novelty detection. During training, Modality-wise Representation Stabilization Training (MoRST) preserves modality-specific discriminability across tasks via auxiliary heads and modality-wise logit distillation. Experiments on a public multimodal egocentric benchmark show that MAND improves novel activity detection AUC by up to 10% and known-class classification accuracy by up to 2.8% over state-of-the-art baselines.
Chinese Translation
多模态自我中心活动识别整合了视觉和惯性线索,以实现对第一人称行为的稳健理解。然而,在开放世界环境中部署此类系统需要在不断学习非平稳流的同时检测新颖活动。现有方法依赖于主logits进行新颖性评分,而未充分利用来自各个模态的互补证据。由于这些logits通常被RGB所主导,其他模态的线索,特别是IMU,仍然未得到充分利用,并且这种不平衡在灾难性遗忘下随着时间的推移而加剧。为了解决这个问题,我们提出了MAND,这是一个针对多模态自我中心开放世界持续学习的模态感知框架。在推理过程中,模态感知自适应评分(MoAS)根据能量分数估计样本级的模态可靠性,并自适应地整合各模态的logits,以更好地利用互补模态线索进行新颖性检测。在训练过程中,模态级表示稳定训练(MoRST)通过辅助头和模态级logit蒸馏在任务之间保持各模态特定的可区分性。在公共多模态自我中心基准上的实验表明,MAND相较于最先进的基线,将新颖活动检测的AUC提高了多达10%,将已知类别分类准确率提高了多达2.8%。
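MoAS's exact weighting scheme is not given in the abstract, but a minimal sketch of the general idea, scoring each modality with the standard free-energy score and fusing logits with a softmax over negative energies, could look like this. The temperature `T` and the softmax-based weighting are illustrative assumptions.

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Free-energy score E(x) = -T * logsumexp(logits / T); lower energy
    indicates a more confident, in-distribution prediction."""
    z = logits / T
    m = np.max(z)
    return -T * (m + np.log(np.sum(np.exp(z - m))))

def fuse_modality_logits(modality_logits, T=1.0):
    """Weight each modality by a softmax over its negative energy, then
    combine the per-modality logits with those weights."""
    neg_e = np.array([-energy_score(l, T) for l in modality_logits])
    w = np.exp(neg_e - neg_e.max())
    w /= w.sum()
    fused = sum(wi * li for wi, li in zip(w, modality_logits))
    return fused, w
```

A sharply peaked modality (e.g. RGB on a clear frame) gets low energy and a large weight, while a near-uniform modality contributes less, so a blurry RGB stream would automatically cede weight to IMU rather than dominating by default.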
cs.CV / 20 / 2603.16974
Are a Thousand Words Better Than a Single Picture? Beyond Images -- A Framework for Multi-Modal Knowledge Graph Dataset Enrichment
千言万语真的比一幅画更好吗?超越图像——多模态知识图谱数据集增强框架
Abstract
Multi-Modal Knowledge Graphs (MMKGs) benefit from visual information, yet large-scale image collection is hard to curate and often excludes ambiguous but relevant visuals (e.g., logos, symbols, abstract scenes). We present Beyond Images, an automatic data-centric enrichment pipeline with optional human auditing. This pipeline operates in three stages: (1) large-scale retrieval of additional entity-related images, (2) conversion of all visual inputs into textual descriptions to ensure that ambiguous images contribute usable semantics rather than noise, and (3) fusion of multi-source descriptions using a large language model (LLM) to generate concise, entity-aligned summaries. These summaries replace or augment the text modality in standard MMKG models without changing their architectures or loss functions. Across three public MMKG datasets and multiple baseline models, we observe consistent gains (up to 7% Hits@1 overall). Furthermore, on a challenging subset of entities with visually ambiguous logos and symbols, converting images into text yields large improvements (201.35% MRR and 333.33% Hits@1). Additionally, we release a lightweight Text-Image Consistency Check Interface for optional targeted audits, improving description quality and dataset reliability. Our results show that scaling image coverage and converting ambiguous visuals into text is a practical path to stronger MMKG completion. Code, datasets, and supplementary materials are available at https://github.com/pengyu-zhang/Beyond-Images.
Chinese Translation
多模态知识图谱(MMKGs)受益于视觉信息,但大规模图像收集难以管理,且常常排除模糊但相关的视觉内容(例如,标志、符号、抽象场景)。我们提出了超越图像(Beyond Images),这是一种自动化的数据中心增强流程,带有可选的人为审核。该流程分为三个阶段:(1)大规模检索额外的与实体相关的图像;(2)将所有视觉输入转换为文本描述,以确保模糊图像贡献可用的语义而非噪声;(3)使用大型语言模型(LLM)融合多源描述,以生成简洁的、与实体对齐的摘要。这些摘要替代或增强标准MMKG模型中的文本模态,而不改变其架构或损失函数。在三个公共MMKG数据集和多个基线模型上,我们观察到一致的提升(整体最高可达7%的Hits@1)。此外,在具有视觉模糊标志和符号的具有挑战性的实体子集上,将图像转换为文本带来了显著的改善(201.35%的MRR和333.33%的Hits@1)。此外,我们发布了一个轻量级的文本-图像一致性检查接口,以便进行可选的针对性审核,提高描述质量和数据集可靠性。我们的结果表明,扩大图像覆盖范围并将模糊视觉内容转换为文本是增强MMKG补全的可行路径。代码、数据集和补充材料可在 https://github.com/pengyu-zhang/Beyond-Images 获取。
cs.CV / 21 / 2603.16987
Empirical Recipes for Efficient and Compact Vision-Language Models
高效紧凑的视觉-语言模型的经验配方
Abstract
Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fall short of the inference speedups their smaller parameter counts suggest. To explain this discrepancy, we conduct an empirical end-to-end efficiency analysis and systematically profile inference to identify the dominant bottlenecks. Based on these findings, we develop optimization recipes tailored to compact VLMs that substantially reduce latency while preserving accuracy. These techniques cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M. Our recipes are broadly applicable across both VLM architectures and common serving frameworks, providing practical guidance for building efficient VLM systems. Beyond efficiency, we study how to extend compact VLMs with structured perception outputs and introduce the resulting model family, ArgusVLM. Across diverse benchmarks, ArgusVLM achieves strong performance while maintaining a compact and efficient design.
Chinese Translation
在资源受限的环境中部署视觉-语言模型(VLMs)需要低延迟和高吞吐量,然而现有的紧凑型VLM往往未能实现其较小参数量所暗示的推理速度提升。为了解释这一差异,我们进行了经验性的端到端效率分析,并系统性地分析推理过程,以识别主要瓶颈。基于这些发现,我们开发了针对紧凑型VLM的优化配方,显著降低延迟,同时保持准确性。这些技术在InternVL3-2B上将首次令牌时间(TTFT)减少了53%,在SmolVLM-256M上减少了93%。我们的配方广泛适用于各种VLM架构和常见服务框架,为构建高效的VLM系统提供了实用指导。除了效率,我们还研究了如何通过结构化感知输出扩展紧凑型VLM,并引入了由此产生的模型家族ArgusVLM。在多种基准测试中,ArgusVLM在保持紧凑和高效设计的同时,取得了强劲的性能。
cs.CV / 22 / 2603.17024
HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
HopChain:用于可泛化视觉-语言推理的多跳数据合成
Abstract
VLMs show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long CoT reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for RLVR does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We add the multi-hop data synthesized by HopChain to the original RLVR data used to train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and compare against RLVR on the original RLVR data alone across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized to target any specific benchmark, adding it improves 20 out of 24 benchmarks on both models, indicating broad and generalizable gains. To demonstrate that full chained queries are important, we replace them with half-multi-hop or single-hop variants, reducing the 24-benchmark average accuracy by 5.3 and 7.0 points, respectively. Multi-hop training also strengthens long-CoT vision-language reasoning, with gains peaking at more than 50 accuracy points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.
Chinese Translation
视觉语言模型(VLMs)展现出强大的多模态能力,但在细粒度的视觉-语言推理方面仍然存在困难。我们发现,长思维链(CoT)推理暴露出多种失败模式,包括感知、推理、知识和幻觉错误,这些错误可能在中间步骤中相互叠加。然而,现有用于可验证奖励强化学习(RLVR)的视觉-语言数据大多不涉及全程依赖视觉证据的复杂推理链,从而使这些弱点在很大程度上未被暴露。因此,我们提出了HopChain,这是一个可扩展的框架,用于合成多跳视觉-语言推理数据,专门用于VLMs的RLVR训练。每个合成的多跳查询形成一条逻辑上相互依赖、基于实例的跳跃链,其中早期的跳跃建立了后续跳跃所需的实例、集合或条件,而最终答案则是一个特定的、明确的数字,适合用于可验证的奖励。我们将HopChain合成的多跳数据添加到用于训练Qwen3.5-35B-A3B和Qwen3.5-397B-A17B的原始RLVR数据中,并在涵盖STEM和Puzzle、一般视觉问答(General VQA)、文本识别和文档理解以及视频理解的24个基准上,与仅使用原始RLVR数据的RLVR进行比较。尽管这些多跳数据并非针对任何特定基准合成,但添加它们使两个模型在24个基准中的20个上得到改善,表明其提升是广泛且可泛化的。为了证明完整链式查询的重要性,我们用半多跳或单跳变体替换它们,分别使24个基准的平均准确率降低了5.3和7.0个点。多跳训练还增强了长思维链视觉-语言推理,在超长思维链(ultra-long-CoT)情形下,准确率提升最高超过50个点。这些实验确立了HopChain作为一个有效的、可扩展的框架,用于合成改善可泛化视觉-语言推理的多跳数据。
cs.CV / 23 / 2603.17043
OpenQlaw: An Agentic AI Assistant for Analysis of 2D Quantum Materials
OpenQlaw:用于分析二维量子材料的智能代理AI助手
Abstract
The transition from optical identification of 2D quantum materials to practical device fabrication requires dynamic reasoning beyond detection accuracy. While recent domain-specific Multimodal Large Language Models (MLLMs) successfully ground visual features using physics-informed reasoning, their outputs are optimized for step-by-step cognitive transparency. This yields verbose candidate enumerations followed by dense reasoning that, while accurate, may induce cognitive overload and lack immediate utility for real-world interaction with researchers. To address this challenge, we introduce OpenQlaw, an agentic orchestration system for analyzing 2D materials. The architecture is built upon NanoBot, a lightweight agentic framework inspired by OpenClaw, and QuPAINT, one of the first Physics-Aware Instruction Multi-modal platforms for Quantum Material Discovery. This makes the system accessible from the lab floor via a variety of messaging channels. OpenQlaw allows the core Large Language Model (LLM) agent to orchestrate a domain-expert MLLM, with QuPAINT as a specialized node, successfully decoupling visual identification from reasoning and deterministic image rendering. By parsing spatial data from the expert, the agent can dynamically process user queries, such as performing scale-aware physical computation or generating isolated visual annotations, and answer in a naturalistic manner. Crucially, the system features a persistent memory that enables the agent to save physical scale ratios (e.g., 1 pixel = 0.25 μm) for area computations and store sample preparation methods for efficacy comparison. The application of an agentic architecture, together with the extension that uses the core agent as an orchestrator for domain-specific experts, transforms isolated inferences into a context-aware assistant capable of accelerating high-throughput device fabrication.
Chinese Translation
从二维量子材料的光学识别到实际设备制造的过渡需要超越检测准确性的动态推理。尽管最近的领域特定多模态大型语言模型(MLLMs)成功地利用物理知识推理来基础视觉特征,但其输出优化为逐步认知透明性。这导致了冗长的候选枚举,随后是密集的推理,尽管准确,但可能导致认知过载,并且缺乏与研究人员进行现实世界互动的即时实用性。为了解决这一挑战,我们引入了OpenQlaw,一个用于分析二维材料的智能协调系统。该架构基于NanoBot,一个轻量级的智能代理框架,灵感来自OpenClaw,以及QuPAINT,这是量子材料发现的首个物理感知指令多模态平台之一。这使得可以通过多种消息渠道在实验室现场访问该系统。OpenQlaw允许核心大型语言模型(LLM)代理协调领域专家MLLM,并将QuPAINT作为一个专门节点,成功地将视觉识别与推理和确定性图像渲染解耦。通过解析来自专家的空间数据,代理可以动态处理用户查询,例如执行考虑比例的物理计算或生成独立的视觉注释,并以自然的方式回答。关键是,该系统具有持久内存,使代理能够保存物理比例(例如,1像素 = 0.25 μm)以进行面积计算,并存储样品制备方法以进行效能比较。智能架构的应用,以及使用核心代理作为领域特定专家的协调者的扩展,将孤立推理转变为一个上下文感知的助手,能够加速高通量设备制造。
cs.CV / 24 / 2603.17051
Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models
Astrolabe:为蒸馏自回归视频模型引导前向过程强化学习
Abstract
Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.
Chinese Translation
蒸馏自回归(AR)视频模型能够实现高效的流式生成,但往往与人类视觉偏好不一致。现有的强化学习(RL)框架并不自然适用于这些架构,通常需要昂贵的重新蒸馏或与求解器耦合的反向过程优化,这会引入相当大的内存和计算开销。我们提出了Astrolabe,一种针对蒸馏AR模型的高效在线RL框架。为了克服现有瓶颈,我们引入了一种基于负样本意识微调的前向过程RL公式。通过在推理端点直接对比正负样本,这种方法在不需要反向过程展开的情况下建立了隐式的策略改进方向。为了将这种对齐扩展到长视频,我们提出了一种流式训练方案,通过滚动的KV缓存逐步生成序列,仅对局部剪辑窗口应用RL更新,同时依赖先前的上下文以确保长距离一致性。最后,为了减轻奖励黑客行为,我们整合了一个由不确定性感知选择性正则化和动态参考更新稳定的多奖励目标。大量实验表明,我们的方法在多个蒸馏AR视频模型中始终提升生成质量,成为一种稳健且可扩展的对齐解决方案。
cs.CV / 25 / 2603.17055
PaAgent: Portrait-Aware Image Restoration Agent via Subjective-Objective Reinforcement Learning
PaAgent:通过主观-客观强化学习的肖像感知图像修复代理
Abstract
Image Restoration (IR) agents, leveraging multimodal large language models to perceive degradation and invoke restoration tools, have shown promise in automating IR tasks. However, existing IR agents typically lack an insight summarization mechanism for past interactions, which results in an exhaustive search for the optimal IR tool. To address this limitation, we propose a portrait-aware IR agent, dubbed PaAgent, which incorporates a self-evolving portrait bank for IR tools and Retrieval-Augmented Generation (RAG) to select a suitable IR tool for each input. Specifically, to construct and evolve the portrait bank, the PaAgent continuously enriches it by summarizing the characteristics of various IR tools with restored images, selected IR tools, and degraded images. In addition, the RAG is employed to select the optimal IR tool for the input image by retrieving relevant insights from the portrait bank. Furthermore, to enhance PaAgent's ability to perceive degradation in complex scenes, we propose a subjective-objective reinforcement learning strategy that considers both image quality scores and semantic insights in reward generation, which accurately provides the degradation information even under partial and non-uniform degradation. Extensive experiments across 8 IR benchmarks, covering six single-degradation and eight mixed-degradation scenarios, validate PaAgent's superiority in addressing complex IR tasks. Our project page is available at https://wyjgr.github.io/PaAgent.html.
Chinese Translation
图像修复(IR)代理利用多模态大型语言模型来感知退化并调用修复工具,在自动化IR任务中展现出良好的前景。然而,现有的IR代理通常缺乏对过去交互的洞察总结机制,这导致需要对最优IR工具进行穷举搜索。为了解决这一局限性,我们提出了一种肖像感知的IR代理,称为PaAgent,它结合了自我演化的IR工具肖像库和检索增强生成(RAG),为输入选择合适的IR工具。具体而言,为了构建和演化肖像库,PaAgent通过结合修复后的图像、所选的IR工具和退化图像,持续总结各种IR工具的特征来丰富肖像库。此外,RAG被用于通过从肖像库中检索相关见解来为输入图像选择最优的IR工具。此外,为了增强PaAgent在复杂场景中感知退化的能力,我们提出了一种主观-客观强化学习策略,该策略在奖励生成中同时考虑图像质量评分和语义见解,从而即使在部分和非均匀退化情况下也能准确提供退化信息。在涵盖六种单一退化和八种混合退化场景的8个IR基准上进行的大量实验验证了PaAgent在处理复杂IR任务中的优越性。我们的项目页面为 https://wyjgr.github.io/PaAgent.html 。
cs.CV / 26 / 2603.17056
DesertFormer: Transformer-Based Semantic Segmentation for Off-Road Desert Terrain Classification in Autonomous Navigation Systems
DesertFormer:基于变换器的语义分割用于自主导航系统中的越野沙漠地形分类
Abstract
Reliable terrain perception is a fundamental requirement for autonomous navigation in unstructured, off-road environments. Desert landscapes present unique challenges due to low chromatic contrast between terrain categories, extreme lighting variability, and sparse vegetation that defy the assumptions of standard road-scene segmentation models. We present DesertFormer, a semantic segmentation pipeline for off-road desert terrain analysis based on SegFormer B2 with a hierarchical Mix Transformer (MiT-B2) backbone. The system classifies terrain into ten ecologically meaningful categories -- Trees, Lush Bushes, Dry Grass, Dry Bushes, Ground Clutter, Flowers, Logs, Rocks, Landscape, and Sky -- enabling safety-aware path planning for ground robots and autonomous vehicles. Trained on a purpose-built dataset of 4,176 annotated off-road images at 512x512 resolution, DesertFormer achieves a mean Intersection-over-Union (mIoU) of 64.4% and pixel accuracy of 86.1%, representing a +24.2% absolute improvement over a DeepLabV3 MobileNetV2 baseline (41.0% mIoU). We further contribute a systematic failure analysis identifying the primary confusion patterns -- Ground Clutter to Landscape and Dry Grass to Landscape -- and propose class-weighted training and copy-paste augmentation for rare terrain categories. Code, checkpoints, and an interactive inference dashboard are released at https://github.com/Yasaswini-ch/Vision-based-Desert-Terrain-Segmentation-using-SegFormer.
Chinese Translation
可靠的地形感知是自主导航在非结构化越野环境中的基本要求。沙漠景观由于地形类别之间的低色差、极端的光照变化以及稀疏的植被,给标准道路场景分割模型带来了独特的挑战。我们提出了DesertFormer,这是一个基于SegFormer B2和分层混合变换器(MiT-B2)骨干网的越野沙漠地形分析语义分割管道。该系统将地形分类为十个生态意义明确的类别——树木、茂盛灌木、干草、干灌木、地面杂物、花朵、木头、岩石、景观和天空——从而为地面机器人和自主车辆提供安全意识的路径规划。DesertFormer在一个专门构建的包含4,176张标注越野图像(分辨率为512x512)的数据集上进行训练,达到了64.4%的平均交并比(mIoU)和86.1%的像素准确率,相较于DeepLabV3 MobileNetV2基线(41.0% mIoU)实现了+24.2%的绝对提升。我们还进行了系统的失败分析,识别出主要的混淆模式——地面杂物与景观、干草与景观,并提出了针对稀有地形类别的类加权训练和复制粘贴增强。代码、检查点和交互式推理仪表板已发布在https://github.com/Yasaswini-ch/Vision-based-Desert-Terrain-Segmentation-using-SegFormer。
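The reported metrics, mean Intersection-over-Union and pixel accuracy, are the standard segmentation measures and can be computed from a per-class confusion matrix as follows (a generic sketch, not the paper's evaluation code).

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Accumulate a (num_classes x num_classes) confusion matrix from
    flat ground-truth / prediction label arrays; rows are ground truth."""
    valid = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[valid].astype(np.int64) + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_pixel_acc(cm):
    """mIoU averages per-class intersection / union over classes that
    appear; pixel accuracy is the diagonal mass over all pixels."""
    inter = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)
    return iou[union > 0].mean(), inter.sum() / cm.sum()
```

Because mIoU averages over classes regardless of pixel count, rare categories such as Logs or Flowers weigh as much as Landscape or Sky, which is why class-weighted training and copy-paste augmentation for rare classes can move mIoU substantially while barely changing pixel accuracy.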
cs.CV / 27 / 2603.17068
TrackDeform3D: Markerless and Autonomous 3D Keypoint Tracking and Dataset Collection for Deformable Objects
TrackDeform3D:无标记和自主的可变形物体三维关键点跟踪与数据集收集
Abstract
Structured 3D representations such as keypoints and meshes offer compact, expressive descriptions of deformable objects, jointly capturing geometric and topological information useful for downstream tasks such as dynamics modeling and motion planning. However, robustly extracting such representations remains challenging, as current perception methods struggle to handle complex deformations. Moreover, large-scale 3D data collection remains a bottleneck: existing approaches either require prohibitive data collection efforts, such as labor-intensive annotation or expensive motion capture setups, or rely on simplifying assumptions that break down in unstructured environments. As a result, large-scale 3D datasets and benchmarks for deformable objects remain scarce. To address these challenges, this paper presents an affordable and autonomous framework for collecting 3D datasets of deformable objects using only RGB-D cameras. The proposed method identifies 3D keypoints and robustly tracks their trajectories, incorporating motion consistency constraints to produce temporally smooth and geometrically coherent data. TrackDeform3D is evaluated against several state-of-the-art tracking methods across diverse object categories and demonstrates consistent improvements in both geometric and tracking accuracy. Using this framework, this paper presents a high-quality, large-scale dataset consisting of 6 deformable objects, totaling 110 minutes of trajectory data.
Chinese Translation
结构化的三维表示(如关键点和网格)提供了对可变形物体的紧凑且富有表现力的描述,能够共同捕捉几何和拓扑信息,这对于动态建模和运动规划等下游任务非常有用。然而,稳健地提取这些表示仍然具有挑战性,因为当前的感知方法难以处理复杂的变形。此外,大规模三维数据收集仍然是一个瓶颈:现有的方法要么需要巨大的数据收集工作量,例如劳动密集型的标注或昂贵的运动捕捉设备,要么依赖于在非结构化环境中失效的简化假设。因此,针对可变形物体的大规模三维数据集和基准仍然稀缺。为了解决这些挑战,本文提出了一种经济实惠且自主的框架,仅使用RGB-D相机收集可变形物体的三维数据集。所提方法识别三维关键点并稳健地跟踪其轨迹,结合运动一致性约束以生成时间上平滑且几何上连贯的数据。TrackDeform3D在多个物体类别上与几种最先进的跟踪方法进行了评估,并在几何和跟踪精度上均表现出一致的改进。利用该框架,本文呈现了一个高质量的大规模数据集,包含6个可变形物体,总计110分钟的轨迹数据。
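The motion consistency constraint mentioned above is not detailed in the abstract; a common choice, sketched here purely as an assumption, is a second-difference (acceleration) smoothness penalty over the tracked keypoint trajectories:

```python
import numpy as np

def smoothness_penalty(traj):
    """Mean squared second difference of a (T, K, 3) keypoint trajectory.
    Zero for constant-velocity motion; grows with temporal jitter."""
    accel = traj[2:] - 2 * traj[1:-1] + traj[:-2]
    return float((accel ** 2).mean())

# A constant-velocity trajectory incurs (numerically) zero penalty.
t = np.linspace(0, 1, 10)[:, None, None]
linear = np.tile(t, (1, 4, 3))  # 4 keypoints moving linearly in 3D
jittered = linear + np.random.default_rng(0).normal(0, 0.05, linear.shape)
```

Minimizing such a term alongside the per-frame keypoint fit yields temporally smooth, geometrically coherent tracks.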
cs.CV / 28 / 2603.17069
Edge-Efficient Two-Stream Multimodal Architecture for Non-Intrusive Bathroom Fall Detection
边缘高效的双流多模态架构用于非侵入式浴室跌倒检测
Abstract
Falls in wet bathroom environments are a major safety risk for seniors living alone. Recent work has shown that mmWave-only, vibration-only, and existing multimodal schemes, such as vibration-triggered radar activation, early feature concatenation, and decision-level score fusion, can support privacy-preserving, non-intrusive fall detection. However, these designs still treat motion and impact as loosely coupled streams, depending on coarse temporal alignment and amplitude thresholds, and do not explicitly encode the causal link between radar-observed collapse and floor impact or address timing drift, object drop confounders, and latency and energy constraints on low-power edge devices. To this end, we propose a two-stream architecture that encodes radar signals with a Motion--Mamba branch for long-range motion patterns and processes floor vibration with an Impact--Griffin branch that emphasizes impact transients and cross-axis coupling. Cross-conditioned fusion uses low-rank bilinear interaction and a Switch--MoE head to align motion and impact tokens and suppress object-drop confounders. The model keeps inference cost suitable for real-time execution on a Raspberry Pi 4B gateway. We construct a bathroom fall detection benchmark dataset with frame-level annotations, comprising more than 3 h of synchronized mmWave radar and triaxial vibration recordings across eight scenarios under running water, together with subject-independent training, validation, and test splits. On the test split, our model attains 96.1% accuracy, 94.8% precision, 88.0% recall, a 91.1% macro F1 score, and an AUC of 0.968. Compared with the strongest baseline, it improves accuracy by 2.0 percentage points and fall recall by 1.3 percentage points, while reducing latency from 35.9 ms to 15.8 ms and lowering energy per 2.56 s window from 14200 mJ to 10750 mJ on the Raspberry Pi 4B gateway.
Chinese Translation
在潮湿的浴室环境中跌倒是独居老年人的主要安全风险。最近的研究表明,仅使用毫米波、仅使用振动以及现有的多模态方案(如振动触发的雷达激活、早期特征连接和决策级得分融合)可以支持保护隐私的非侵入式跌倒检测。然而,这些设计仍然将运动和冲击视为松散耦合的流,依赖于粗略的时间对齐和幅度阈值,并未明确编码雷达观察到的倒塌与地面冲击之间的因果关系,也未解决时间漂移、物体掉落混淆因素以及低功耗边缘设备上的延迟和能量限制。为此,我们提出了一种双流架构,通过一个运动-曼巴(Motion--Mamba)分支编码雷达信号以捕捉远程运动模式,并通过一个冲击-狮鹫(Impact--Griffin)分支处理地面振动,强调冲击瞬态和交叉轴耦合。交叉条件融合使用低秩双线性交互和一个切换-混合专家(Switch--MoE)头对齐运动和冲击标记,并抑制物体掉落混淆因素。该模型保持推理成本适合在树莓派4B网关上实时执行。我们构建了一个浴室跌倒检测基准数据集,包含帧级注释,涵盖在流动水下的八种场景中超过3小时的同步毫米波雷达和三轴振动记录,以及独立于受试者的训练、验证和测试划分。在测试划分中,我们的模型达到了96.1%的准确率、94.8%的精确率、88.0%的召回率、91.1%的宏观F1分数和0.968的AUC。与最强基线相比,准确率提高了2.0个百分点,跌倒召回率提高了1.3个百分点,同时将延迟从35.9毫秒降低到15.8毫秒,并将每2.56秒窗口的能量从14200毫焦降低到10750毫焦。
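The low-rank bilinear interaction used in the cross-conditioned fusion can be sketched as a Hadamard-product factorization of a full bilinear form. All dimensions and weight matrices below are hypothetical placeholders, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_bilinear(x, y, U, V, P):
    """Low-rank bilinear interaction: project both modality features to a
    shared rank-r space, fuse by elementwise product, then map to the
    output dimension. Equivalent to a rank-r factorization of z = x^T W y
    per output unit, at far lower cost than the full bilinear tensor."""
    return (U @ x) * (V @ y) @ P  # (r,) * (r,) -> (r,) @ (r, d_out) -> (d_out,)

d_radar, d_vib, rank, d_out = 16, 8, 4, 6   # illustrative sizes only
U = rng.normal(size=(rank, d_radar))
V = rng.normal(size=(rank, d_vib))
P = rng.normal(size=(rank, d_out))
z = low_rank_bilinear(rng.normal(size=d_radar), rng.normal(size=d_vib), U, V, P)
```

The fused vector `z` would feed the Switch--MoE head in the described architecture.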
cs.CV / 29 / 2603.17079
ACE-LoRA: Graph-Attentive Context Enhancement for Parameter-Efficient Adaptation of Medical Vision-Language Models
ACE-LoRA:用于医疗视觉语言模型参数高效适应的图注意力上下文增强
Abstract
The success of CLIP-like vision-language models (VLMs) on natural images has inspired medical counterparts, yet existing approaches largely fall into two extremes: specialist models trained on single-domain data, which capture domain-specific details but generalize poorly, and generalist medical VLMs trained on multi-domain data, which retain broad semantics but dilute fine-grained diagnostic cues. Bridging this specialization-generalization trade-off remains challenging. To address this problem, we propose ACE-LoRA, a parameter-efficient adaptation framework for generalist medical VLMs that maintains robust zero-shot generalization. ACE-LoRA integrates Low-Rank Adaptation (LoRA) modules into frozen image-text encoders and introduces an Attention-based Context Enhancement Hypergraph Neural Network (ACE-HGNN) module that captures higher-order contextual interactions beyond pairwise similarity to enrich global representations with localized diagnostic cues, addressing a key limitation of prior Parameter-Efficient Fine-Tuning (PEFT) methods that overlook fine-grained details. To further enhance cross-modal alignment, we formulate a label-guided InfoNCE loss to effectively suppress false negatives between semantically related image-text pairs. Despite adding only 0.95M trainable parameters, ACE-LoRA consistently outperforms state-of-the-art medical VLMs and PEFT baselines across zero-shot classification, segmentation, and detection benchmarks spanning multiple domains. Our code is available at https://github.com/icon-lab/ACE-LoRA.
Chinese Translation
CLIP类视觉语言模型(VLMs)在自然图像上的成功激发了医疗领域的相应研究,然而现有的方法大多陷入两个极端:一是基于单一领域数据训练的专业模型,能够捕捉领域特定的细节但泛化能力较差;二是基于多领域数据训练的通用医疗VLMs,能够保留广泛的语义但稀释了细粒度的诊断线索。弥合这种专业化与泛化之间的权衡仍然具有挑战性。为了解决这个问题,我们提出了ACE-LoRA,一个参数高效的适应框架,旨在保持通用医疗VLMs的强大零样本泛化能力。ACE-LoRA将低秩适应(LoRA)模块集成到冻结的图像-文本编码器中,并引入了一种基于注意力的上下文增强超图神经网络(ACE-HGNN)模块,该模块捕捉超越成对相似性的高阶上下文交互,以局部化的诊断线索丰富全局表示,从而解决了先前参数高效微调(PEFT)方法忽视细粒度细节的关键限制。为了进一步增强跨模态对齐,我们制定了一种标签引导的InfoNCE损失,有效抑制语义相关的图像-文本对之间的假阴性。尽管仅增加了0.95M可训练参数,ACE-LoRA在跨多个领域的零样本分类、分割和检测基准测试中始终优于最先进的医疗VLMs和PEFT基线。我们的代码可在 https://github.com/icon-lab/ACE-LoRA 获取。
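The label-guided InfoNCE loss that suppresses false negatives can be sketched as standard InfoNCE with same-label in-batch pairs masked out of the denominator. This is a minimal reading of the abstract, not the authors' exact formulation:

```python
import numpy as np

def label_guided_infonce(sim, labels, tau=0.1):
    """InfoNCE over an (N, N) image-text similarity matrix where negatives
    sharing the anchor's label are masked out of the denominator: pairs
    that are semantically related are false negatives, not true negatives."""
    logits = sim / tau
    n = sim.shape[0]
    same = labels[:, None] == labels[None, :]
    mask = same & ~np.eye(n, dtype=bool)      # off-diagonal same-label pairs
    logits = np.where(mask, -np.inf, logits)  # exp(-inf) = 0: drop them
    log_prob = logits[np.arange(n), np.arange(n)] - \
        np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())

sim = np.array([[0.9, 0.8],
                [0.1, 0.9]])
loss_masked = label_guided_infonce(sim, np.array([0, 0]))  # pair shares a label
loss_naive  = label_guided_infonce(sim, np.array([0, 1]))  # labels distinct
```

With the semantically related off-diagonal pair removed, the loss no longer penalizes it as a negative.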
cs.CV / 30 / 2603.17098
Accurate Shift Invariant Convolutional Neural Networks Using Gaussian-Hermite Moments
基于高斯-厄米特矩的准确平移不变卷积神经网络
Abstract
Convolutional neural networks (CNNs) are not inherently shift invariant or equivariant. The downsampling operation used in CNNs is one of the key reasons the shift-invariance property breaks. Conversely, downsampling is important for improving computational efficiency and enlarging the receptive field to capture more contextual information. In this work, we propose Gaussian-Hermite Sampling (GHS), a novel downsampling strategy designed to achieve accurate shift invariance. GHS leverages Gaussian-Hermite polynomials to perform shift-consistent sampling, enabling CNN layers to maintain invariance to arbitrary spatial shifts prior to training. When integrated into standard CNN architectures, the proposed method embeds shift invariance directly at the layer level without requiring architectural modifications or additional training procedures. We evaluate the proposed approach on CIFAR-10, CIFAR-100, and MNIST-rot datasets. Experimental results demonstrate that GHS significantly improves shift consistency, achieving 100% classification consistency under spatial shifts, while also improving classification accuracy compared to baseline CNN models.
Chinese Translation
卷积神经网络(CNN)本质上并不具备平移不变性或平移等变性。CNN中使用的下采样操作是破坏CNN平移不变性的重要原因之一。相反,下采样操作对于提高计算效率和增加感受野以获取更多上下文信息是至关重要的。在本研究中,我们提出了一种新颖的下采样策略——高斯-厄米特采样(Gaussian-Hermite Sampling, GHS),旨在实现准确的平移不变性。GHS利用高斯-厄米特多项式进行平移一致的采样,使得CNN层在训练前能够保持对任意空间平移的不变性。当将该方法集成到标准CNN架构中时,所提方法在层级上直接嵌入平移不变性,而无需对架构进行修改或额外的训练过程。我们在CIFAR-10、CIFAR-100和MNIST-rot数据集上评估了所提方法。实验结果表明,GHS显著提高了平移一致性,在空间平移下实现了100%的分类一致性,同时相较于基线CNN模型提高了分类准确性。
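The 100% classification consistency under spatial shifts reported above corresponds to a simple metric: the fraction of inputs whose predicted class survives every circular shift up to some bound. A sketch, using a trivially shift-invariant toy classifier for illustration:

```python
import numpy as np

def shift_consistency(predict, images, max_shift=3):
    """Fraction of images whose predicted class is unchanged under all
    circular spatial shifts of up to max_shift pixels in each direction."""
    base = np.array([predict(im) for im in images])
    consistent = np.ones(len(images), dtype=bool)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = [np.roll(im, (dy, dx), axis=(0, 1)) for im in images]
            preds = np.array([predict(im) for im in shifted])
            consistent &= preds == base
    return float(consistent.mean())

# A global-pooling classifier is shift invariant by construction,
# so it scores a perfect 1.0 on this metric.
invariant = lambda im: int(im.mean() > 0.5)
imgs = [np.random.default_rng(i).random((8, 8)) for i in range(5)]
```

An ordinary strided-downsampling CNN would generally score below 1.0 here, which is the gap GHS targets.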
cs.CV / 31 / 2603.17108
LLM-Powered Flood Depth Estimation from Social Media Imagery: A Vision-Language Model Framework with Mechanistic Interpretability for Transportation Resilience
基于大型语言模型的社交媒体图像洪水深度估计:具有机制可解释性的视觉-语言模型框架以增强交通韧性
Abstract
Urban flooding poses an escalating threat to transportation network continuity, yet no operational system currently provides real-time, street-level flood depth information at the centimeter resolution required for dynamic routing, electric vehicle (EV) safety, and autonomous vehicle (AV) operations. This study presents FloodLlama, a fine-tuned open-source vision-language model (VLM) for continuous flood depth estimation from single street-level images, supported by a multimodal sensing pipeline using TikTok data. A synthetic dataset of approximately 190000 images was generated, covering seven vehicle types, four weather conditions, and 41 depth levels (0-40 cm at 1 cm resolution). Progressive curriculum training enabled coarse-to-fine learning, while LLaMA 3.2-11B Vision was fine-tuned using QLoRA. Evaluation across 34797 trials reveals a depth-dependent prompt effect: simple prompts perform better for shallow flooding, whereas chain-of-thought (CoT) reasoning improves performance at greater depths. FloodLlama achieves a mean absolute error (MAE) below 0.97 cm and Acc@5cm above 93.7% for deep flooding, exceeding 96.8% for shallow depths. A five-phase mechanistic interpretability framework identifies layer L23 as the critical depth-encoding transition and enables selective fine-tuning that reduces trainable parameters by 76-80% while maintaining accuracy. The Tier 3 configuration achieves 98.62% accuracy on real-world data and shows strong robustness under visual occlusion. A TikTok-based data pipeline, validated on 676 annotated flood frames from Detroit, demonstrates the feasibility of real-time, crowd-sourced flood sensing. The proposed framework provides a scalable, infrastructure-free solution with direct implications for EV safety, AV deployment, and resilient transportation management.
Chinese Translation
城市洪水对交通网络的连续性构成日益严重的威胁,但目前尚无任何可运行的系统能够以动态路由、电动汽车(EV)安全和自动驾驶汽车(AV)运行所需的厘米级分辨率提供实时、街道级的洪水深度信息。本研究提出了FloodLlama,这是一种经过微调的开源视觉-语言模型(VLM),用于从单张街道级图像中连续估计洪水深度,并通过使用TikTok数据的多模态传感管道进行支持。生成了一个包含约190000张图像的合成数据集,涵盖七种车辆类型、四种天气条件和41个深度级别(0-40厘米,分辨率为1厘米)。渐进式课程训练使得从粗到细的学习成为可能,而LLaMA 3.2-11B Vision则使用QLoRA进行了微调。在34797次试验中的评估揭示了深度依赖的提示效应:简单提示在浅洪水情况下表现更佳,而链式思维(CoT)推理在更大深度下提高了性能。FloodLlama在深洪水情况下的平均绝对误差(MAE)低于0.97厘米,Acc@5cm超过93.7%,在浅水深度时超过96.8%。一个五阶段的机制可解释性框架识别出第L23层作为关键的深度编码转换,并实现选择性微调,将可训练参数减少76-80%,同时保持准确性。Tier 3配置在真实世界数据上的准确率达到98.62%,并在视觉遮挡下表现出强大的鲁棒性。基于TikTok的数据管道在676个来自底特律的注释洪水帧上进行了验证,展示了实时众包洪水感知的可行性。所提出的框架提供了一种可扩展的、无需基础设施的解决方案,对电动汽车安全、自动驾驶汽车部署和韧性交通管理具有直接影响。
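The MAE and Acc@5cm figures quoted above are standard depth-regression metrics; a minimal sketch on toy numbers:

```python
import numpy as np

def depth_metrics(pred_cm, true_cm, k=5):
    """Mean absolute error in cm, and Acc@k: the fraction of predictions
    within k cm of the reference depth."""
    err = np.abs(np.asarray(pred_cm, float) - np.asarray(true_cm, float))
    return float(err.mean()), float((err <= k).mean())

# Toy predictions: three within 5 cm, one badly off.
mae, acc5 = depth_metrics([10, 22, 33, 40], [12, 20, 30, 25])
```

Here the errors are [2, 2, 3, 15] cm, giving an MAE of 5.5 cm and Acc@5cm of 0.75.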
cs.CV / 32 / 2603.17110
Pixel-level Counterfactual Contrastive Learning for Medical Image Segmentation
用于医学图像分割的像素级反事实对比学习
Abstract
Image segmentation relies on large annotated datasets, which are expensive and slow to produce. Silver-standard (AI-generated) labels are easier to obtain, but they risk introducing bias. Self-supervised learning, needing only images, has become key for pre-training. Recent work combining contrastive learning with counterfactual generation improves representation learning for classification but does not readily extend to pixel-level tasks. We propose a pipeline combining counterfactual generation with dense contrastive learning via Dual-View (DVD-CL) and Multi-View (MVD-CL) methods, along with supervised variants that utilize available silver-standard annotations. A new visualisation algorithm, the Color-coded High Resolution Overlay map (CHRO-map) is also introduced. Experiments show annotation-free DVD-CL outperforms other dense contrastive learning methods, while supervised variants using silver-standard labels outperform training on the silver-standard labeled data directly, achieving $\sim$94% DSC on challenging data. These results highlight that pixel-level contrastive learning, enhanced by counterfactuals and silver-standard annotations, improves robustness to acquisition and pathological variations.
Chinese Translation
图像分割依赖于大量标注数据集,这些数据集的生产成本高且耗时。银标准(AI生成)标签更易获得,但可能引入偏差。自监督学习仅需图像,已成为预训练的关键。近期研究将对比学习与反事实生成相结合,提升了分类的表征学习,但不易扩展到像素级任务。我们提出了一种将反事实生成与密集对比学习相结合的流程,通过双视图(Dual-View, DVD-CL)和多视图(Multi-View, MVD-CL)方法,以及利用现有银标准注释的监督变体。还引入了一种新的可视化算法,即彩色编码高分辨率叠加图(Color-coded High Resolution Overlay map, CHRO-map)。实验表明,无需注释的DVD-CL优于其他密集对比学习方法,而使用银标准标签的监督变体优于直接在银标准标注数据上训练的模型,在具有挑战性的数据上达到了约94%的DSC。这些结果强调了通过反事实和银标准注释增强的像素级对比学习,提高了对获取和病理变化的鲁棒性。
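The DSC figure quoted above is the Dice similarity coefficient between predicted and reference masks; a minimal sketch:

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice similarity coefficient (DSC) between two binary masks:
    2|A ∩ B| / (|A| + |B|), ranging from 0 (disjoint) to 1 (identical)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return float(2 * inter / (pred.sum() + target.sum() + eps))

a = np.zeros((4, 4), dtype=int); a[1:3, 1:3] = 1   # 4 foreground pixels
b = np.zeros((4, 4), dtype=int); b[1:3, 1:4] = 1   # 6 pixels, 4 shared
```

For the toy masks above, DSC = 2·4 / (4 + 6) = 0.8.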
cs.CV / 33 / 2603.17111
Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles
隐藏克隆:揭示和修复视觉-语言模型集成中的家族偏差
Abstract
Ensembling Vision-Language Models (VLMs) from different providers maximizes benchmark accuracy, yet models from the same architectural family share correlated errors that standard voting ignores. We study this structure across 17 VLMs from 8 families on VQAv2, TextVQA, and GQA. Family-correlated errors reduce effective ensemble dimensionality to 2.5-3.6 independent voters and create a Misleading tier (1.5-6.5% of questions) where correlated majority errors drive accuracy to 0% despite the best model being correct. We propose three family-aware methods. Hierarchical Family Voting (HFV) aggregates within families before voting across them, recovering +18-26 pp on the Misleading tier. QualRCCV, a training-free method weighting models by calibration, family quality, and inverse family size, is the first to beat calibrated voting on all three benchmarks (p<0.05). Learned Candidate Scoring (LCS) trains a cross-validated classifier to re-rank candidate answers using support breadth, family diversity, and model quality, achieving the largest gains: +0.68% VQAv2, +0.61% TextVQA, +2.45% GQA -- all significant -- and is the only learned method that never degrades any benchmark. On VQAv2 test-standard (EvalAI), LCS reaches 87.83% with 12 models, confirming generalization.
Chinese Translation
从不同提供者的视觉-语言模型(VLMs)中进行集成可以最大化基准准确率,但来自同一架构家族的模型共享相关错误,而标准投票对此无能为力。我们在 VQAv2、TextVQA 和 GQA 上研究了来自 8 个家族的 17 个 VLM 的这种结构。家族相关错误将有效集成维度降低到 2.5-3.6 个独立投票者,并创建了一个误导层(占 1.5-6.5% 的问题),在该层中,相关的多数错误使准确率降至 0%,尽管最佳模型是正确的。我们提出了三种关注家族的方法。层次家族投票(HFV)在投票之前在家族内部进行聚合,使误导层的准确率恢复了 +18-26 个百分点。QualRCCV 是一种无训练方法,通过校准、家族质量和逆家族大小对模型进行加权,是首个在所有三个基准上超越校准投票的方法(p<0.05)。学习候选评分(LCS)训练一个交叉验证的分类器,利用支持广度、家族多样性和模型质量对候选答案进行重新排序,取得了最大的增益:VQAv2 +0.68%、TextVQA +0.61%、GQA +2.45% -- 所有增益均显著 -- 并且是唯一一个在任何基准上都没有降级的学习方法。在 VQAv2 测试标准(EvalAI)上,LCS 在 12 个模型下达到了 87.83%,验证了其泛化能力。
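Hierarchical Family Voting as described above (aggregate within families, then vote across them) can be sketched in a few lines; the tie-breaking below is arbitrary and only illustrative:

```python
from collections import Counter

def hfv(answers, families):
    """Hierarchical Family Voting: majority vote within each model family
    first, then a majority vote across the per-family answers, so that
    architectural 'clones' from one family count only once."""
    by_family = {}
    for ans, fam in zip(answers, families):
        by_family.setdefault(fam, []).append(ans)
    family_votes = [Counter(v).most_common(1)[0][0] for v in by_family.values()]
    return Counter(family_votes).most_common(1)[0][0]

# Three clones from family A share the same wrong answer:
# flat majority voting fails, HFV does not.
answers  = ["cat", "cat", "cat", "dog", "dog"]
families = ["A",   "A",   "A",   "B",   "C"]
flat = Counter(answers).most_common(1)[0][0]
```

Here the flat vote returns "cat" (the correlated family-A error), while HFV collapses family A to a single vote and returns "dog".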
cs.CV / 34 / 2603.17117
MosaicMem: Hybrid Spatial Memory for Controllable Video World Models
MosaicMem:可控视频世界模型的混合空间记忆
Abstract
Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.
Chinese Translation
视频扩散模型正从短小而合理的片段向世界模拟器发展,这些模拟器必须在相机运动、重访和干预下保持一致性。然而,空间记忆仍然是一个关键瓶颈:显式的三维结构可以改善基于重投影的一致性,但在描绘移动物体时却面临困难,而隐式记忆即使在正确的姿态下也常常产生不准确的相机运动。我们提出了Mosaic Memory(MosaicMem),一种混合空间记忆,它将补丁提升到三维,以实现可靠的定位和有针对性的检索,同时利用模型的原生条件来保持遵循提示的生成。MosaicMem通过补丁与组合接口在查询视图中组合空间对齐的补丁,保留应持续存在的内容,同时允许模型进行应演变的内容的修补。通过PRoPE相机条件和两种新的记忆对齐方法,实验显示与隐式记忆相比,姿态遵循得到了改善,并且动态建模优于显式基线。MosaicMem进一步实现了分钟级导航、基于记忆的场景编辑和自回归展开。
cs.CV / 35 / 2603.17131
SMAL-pets: SMAL Based Avatars of Pets from Single Image
SMAL-pets:基于SMAL的单图像宠物头像生成
Abstract
Creating high-fidelity, animatable 3D dog avatars remains a formidable challenge in computer vision. Unlike human digital doubles, animal reconstruction faces a critical shortage of large-scale, annotated datasets for specialized applications. Furthermore, the immense morphological diversity across species, breeds, and crosses, which varies significantly in size, proportions, and features, complicates the generalization of existing models. Current reconstruction methods often struggle to capture realistic fur textures. Additionally, ensuring these avatars are fully editable and capable of performing complex, naturalistic movements typically necessitates labor-intensive manual mesh manipulation and expert rigging. This paper introduces SMAL-pets, a comprehensive framework that generates high-quality, editable animal avatars from a single input image. Our approach bridges the gap between reconstruction and generative modeling by leveraging a hybrid architecture. Our method integrates 3D Gaussian Splatting with the SMAL parametric model to provide a representation that is both visually high-fidelity and anatomically grounded. We introduce a multimodal editing suite that enables users to refine the avatar's appearance and execute complex animations through direct textual prompts. By allowing users to control both the aesthetic and behavioral aspects of the model via natural language, SMAL-pets provides a flexible, robust tool for animation and virtual reality.
Chinese Translation
创建高保真、可动画的3D狗头像在计算机视觉领域仍然是一项艰巨的挑战。与人类数字双胞胎不同,动物重建面临着专用应用所需的大规模标注数据集的严重短缺。此外,物种、品种和杂交之间巨大的形态多样性,在大小、比例和特征上有显著差异,进一步复杂化了现有模型的泛化能力。目前的重建方法往往难以捕捉逼真的毛发纹理。此外,确保这些头像完全可编辑并能够执行复杂的自然运动通常需要劳动密集型的手动网格操作和专业的绑定技术。本文介绍了SMAL-pets,一个综合框架,能够从单一输入图像生成高质量、可编辑的动物头像。我们的方法通过利用混合架构,弥合了重建与生成建模之间的差距。我们的方法将3D高斯点云与SMAL参数模型相结合,提供了一种既视觉高保真又解剖学上合理的表示。我们引入了一套多模态编辑工具,使用户能够通过直接的文本提示来细化头像的外观并执行复杂的动画。通过允许用户通过自然语言控制模型的美学和行为特征,SMAL-pets提供了一种灵活、强大的动画和虚拟现实工具。
cs.CV / 36 / 2603.17159
BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird's-Eye View Images
BEV-SLD:基于自监督的场景地标检测方法用于激光雷达鸟瞰图的全局定位
Abstract
We present BEV-SLD, a LiDAR global localization method building on the Scene Landmark Detection (SLD) concept. Unlike scene-agnostic pipelines, our self-supervised approach leverages bird's-eye-view (BEV) images to discover scene-specific patterns at a prescribed spatial density and treat them as landmarks. A consistency loss aligns learnable global landmark coordinates with per-frame heatmaps, yielding consistent landmark detections across the scene. Across campus, industrial, and forest environments, BEV-SLD delivers robust localization and achieves strong performance compared to state-of-the-art methods.
Chinese Translation
我们提出了BEV-SLD,这是一种基于场景地标检测(SLD)概念的激光雷达全局定位方法。与场景无关的处理流程不同,我们的自监督方法利用鸟瞰图(BEV)图像在规定的空间密度下发现场景特定的模式,并将其视为地标。一致性损失将可学习的全局地标坐标与每帧热图对齐,从而在整个场景中实现一致的地标检测。在校园、工业和森林环境中,BEV-SLD提供了稳健的定位,并与最先进的方法相比取得了优异的性能。
cs.CV / 37 / 2603.17161
GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global-Local Feature Fusion
GazeOnce360:基于鱼眼镜头的360°多人物注视估计与全局-局部特征融合
Abstract
We present GazeOnce360, a novel end-to-end model for multi-person gaze estimation from a single tabletop-mounted upward-facing fisheye camera. Unlike conventional approaches that rely on forward-facing cameras in constrained viewpoints, we address the underexplored setting of estimating the 3D gaze direction of multiple people distributed across a 360° scene from an upward fisheye perspective. To support research in this setting, we introduce MPSGaze360, a large-scale synthetic dataset rendered using Unreal Engine, featuring diverse multi-person configurations with accurate 3D gaze and eye landmark annotations. Our model tackles the severe distortion and perspective variation inherent in fisheye imagery by incorporating rotational convolutions and eye landmark supervision. To better capture fine-grained eye features crucial for gaze estimation, we propose a dual-resolution architecture that fuses global low-resolution context with high-resolution local eye regions. Experimental results demonstrate the effectiveness of each component in our model. This work highlights the feasibility and potential of fisheye-based 360° gaze estimation in practical multi-person scenarios. Project page: https://caizhuojiang.github.io/GazeOnce360/.
Chinese Translation
我们提出了GazeOnce360,这是一种新颖的端到端模型,用于从单个桌面安装的向上鱼眼相机进行多人物注视估计。与依赖于受限视角的前向相机的传统方法不同,我们解决了从向上鱼眼视角估计分布在360°场景中的多个人的3D注视方向这一未被充分探索的设置。为了支持该领域的研究,我们引入了MPSGaze360,这是一个使用虚幻引擎渲染的大规模合成数据集,包含多样化的多人物配置,并提供准确的3D注视和眼部特征标注。我们的模型通过结合旋转卷积和眼部特征监督,解决了鱼眼图像固有的严重畸变和视角变化问题。为了更好地捕捉对注视估计至关重要的细粒度眼部特征,我们提出了一种双分辨率架构,将全局低分辨率上下文与高分辨率局部眼部区域融合。实验结果证明了我们模型中每个组件的有效性。这项工作突显了基于鱼眼镜头的360°注视估计在实际多人物场景中的可行性和潜力。项目页面:https://caizhuojiang.github.io/GazeOnce360/
cs.CV / 38 / 2603.17173
Generalist Multimodal LLMs Gain Biometric Expertise via Human Salience
通用多模态大语言模型通过人类显著性获得生物识别专业知识
Abstract
Iris presentation attack detection (PAD) is critical for secure biometric deployments, yet developing specialized models faces significant practical barriers: collecting data representing future unknown attacks is impossible, and collecting diverse-enough data, yet still limited in terms of its predictive power, is expensive. Additionally, sharing biometric data raises privacy concerns. Due to rapid emergence of new attack vectors demanding adaptable solutions, we thus investigate in this paper whether general-purpose multimodal large language models (MLLMs) can perform iris PAD when augmented with human expert knowledge, operating under strict privacy constraints that prohibit sending biometric data to public cloud MLLM services. Through analysis of vision encoder embeddings applied to our dataset, we demonstrate that pre-trained vision transformers in MLLMs inherently cluster many iris attack types despite never being explicitly trained for this task. However, where clustering shows overlap between attack classes, we find that structured prompts incorporating human salience (verbal descriptions from subjects identifying attack indicators) enable these models to resolve ambiguities. Testing on an IRB-restricted dataset of 224 iris images spanning seven attack types, using only university-approved services (Gemini 2.5 Pro) or locally-hosted models (e.g., Llama 3.2-Vision), we show that Gemini with expert-informed prompts outperforms both a specialized convolutional neural networks (CNN)-based baseline and human examiners, while the locally-deployable Llama achieves near-human performance. Our results establish that MLLMs deployable within institutional privacy constraints offer a viable path for iris PAD.
Chinese Translation
虹膜呈现攻击检测(PAD)对于安全的生物识别部署至关重要,但开发专门模型面临显著的实际障碍:收集代表未来未知攻击的数据是不可能的,而收集足够多样的数据又在其预测能力上受到限制,且成本高昂。此外,分享生物识别数据引发隐私问题。由于新攻击向量的快速出现需要适应性解决方案,因此我们在本文中探讨通用多模态大语言模型(MLLMs)在增强人类专家知识的情况下,是否能够在严格的隐私约束下进行虹膜PAD,这些约束禁止将生物识别数据发送到公共云MLLM服务。通过对应用于我们数据集的视觉编码器嵌入的分析,我们证明了在MLLMs中预训练的视觉变换器固有地聚类了许多虹膜攻击类型,尽管从未明确针对这一任务进行训练。然而,在聚类显示攻击类别之间存在重叠的情况下,我们发现结合人类显著性(来自受试者识别攻击指标的口头描述)的结构化提示使这些模型能够解决模糊性。在一个受IRB限制的数据集中测试224张涵盖七种攻击类型的虹膜图像,仅使用大学批准的服务(Gemini 2.5 Pro)或本地托管模型(例如,Llama 3.2-Vision),我们展示了使用专家信息提示的Gemini在性能上优于基于卷积神经网络(CNN)的专门基线和人类检查员,而本地可部署的Llama则达到了接近人类的表现。我们的结果表明,在机构隐私约束下可部署的MLLMs为虹膜PAD提供了一条可行的路径。
cs.CV / 39 / 2603.17178
Patient4D: Temporally Consistent Patient Body Mesh Recovery from Monocular Operating Room Video
Patient4D:从单目手术室视频中恢复时间一致的患者身体网格
Abstract
Recovering a dense 3D body mesh from monocular video remains challenging under occlusion from draping and continuously moving camera viewpoints. This configuration arises in surgical augmented reality (AR), where an anesthetized patient lies under surgical draping while a surgeon's head-mounted camera continuously changes viewpoint. Existing human mesh recovery (HMR) methods are typically trained on upright, moving subjects captured from relatively stable cameras, leading to performance degradation under such conditions. To address this, we present Patient4D, a stationarity-constrained reconstruction pipeline that explicitly exploits the stationarity prior. The pipeline combines image-level foundation models for perception with lightweight geometric mechanisms that enforce temporal consistency across frames. Two key components enable robust reconstruction: Pose Locking, which anchors pose parameters using stable keyframes, and Rigid Fallback, which recovers meshes under severe occlusion through silhouette-guided rigid alignment. Together, these mechanisms stabilize predictions while remaining compatible with off-the-shelf HMR models. We evaluate Patient4D on 4,680 synthetic surgical sequences and three public HMR video benchmarks. Under surgical drape occlusion, Patient4D achieves a 0.75 mean IoU, reducing failure frames from 30.5% to 1.3% compared to the best baseline. Our findings demonstrate that exploiting stationarity priors can substantially improve monocular reconstruction in clinical AR scenarios.
Chinese Translation
从单目视频中恢复稠密的三维身体网格在覆盖物遮挡和持续移动的摄像机视角下仍然具有挑战性。这种配置出现在外科增强现实(AR)中,麻醉患者躺在手术覆盖物下,而外科医生的头戴式摄像机不断改变视角。现有的人体网格恢复(HMR)方法通常是在相对稳定的摄像机下对直立移动的对象进行训练,因此在这种条件下性能下降。为了解决这个问题,我们提出了Patient4D,一个受限于静止性的重建管道,明确利用静止性先验。该管道结合了用于感知的图像级基础模型和轻量几何机制,以确保帧间的时间一致性。两个关键组件使得稳健重建成为可能:姿态锁定(Pose Locking),通过稳定的关键帧锚定姿态参数,以及刚性回退(Rigid Fallback),通过轮廓引导的刚性对齐在严重遮挡下恢复网格。这些机制共同稳定了预测,同时与现成的HMR模型兼容。我们在4680个合成手术序列和三个公共HMR视频基准上评估了Patient4D。在手术覆盖物遮挡下,与最佳基线相比,Patient4D实现了0.75的平均交并比(mean IoU),并将失败帧从30.5%减少到1.3%。我们的研究结果表明,利用静止性先验可以显著改善临床AR场景中的单目重建。
cs.CV / 40 / 2603.17186
Visual Product Search Benchmark
视觉产品搜索基准
Abstract
Reliable product identification from images is a critical requirement in industrial and commercial applications, particularly in maintenance, procurement, and operational workflows where incorrect matches can lead to costly downstream failures. At the core of such systems lies the visual search component, which must retrieve and rank the exact object instance from large and continuously evolving catalogs under diverse imaging conditions. This report presents a structured benchmark of modern visual embedding models for instance-level image retrieval, with a focus on industrial applications. A curated set of open-source foundation embedding models, proprietary multi-modal embedding systems, and domain-specific vision-only models is evaluated under a unified image-to-image retrieval protocol. The benchmark spans curated datasets, including industrial datasets derived from production deployments in Manufacturing, Automotive, DIY, and Retail, as well as established public benchmarks. Evaluation is conducted without post-processing, isolating the retrieval capability of each model. The results provide insight into how well contemporary foundation and unified embedding models transfer to fine-grained instance retrieval tasks, and how they compare to models explicitly trained for industrial applications. By emphasizing realistic constraints, heterogeneous image conditions, and exact instance matching requirements, this benchmark aims to inform both practitioners and researchers about the strengths and limitations of current visual embedding approaches in production-level product identification systems. An interactive companion website presenting the benchmark results, evaluation details, and additional visualizations is available at https://benchmark.nyris.io.
Chinese Translation
从图像中可靠地识别产品是工业和商业应用中的一项关键需求,尤其是在维护、采购和操作工作流程中,错误匹配可能导致昂贵的下游故障。这类系统的核心是视觉搜索组件,它必须在多样的成像条件下,从大型且不断演变的目录中检索和排序准确的对象实例。本报告提出了一个针对实例级图像检索的现代视觉嵌入模型的结构化基准,重点关注工业应用。我们评估了一组经过策划的开源基础嵌入模型、专有的多模态嵌入系统以及特定领域的视觉专用模型,采用统一的图像到图像检索协议。该基准包括经过策划的数据集,其中包含来自制造、汽车、DIY和零售等领域的生产部署的工业数据集,以及已建立的公共基准。评估是在没有后处理的情况下进行的,以隔离每个模型的检索能力。结果提供了对当代基础和统一嵌入模型在细粒度实例检索任务中的迁移能力的洞察,以及它们与专门为工业应用训练的模型的比较。通过强调现实约束、异构图像条件和精确实例匹配要求,该基准旨在向从业者和研究人员提供当前视觉嵌入方法在生产级产品识别系统中的优缺点的信息。一个互动的伴随网站展示了基准结果、评估细节和其他可视化内容,网址为 https://benchmark.nyris.io。
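At its core, the unified image-to-image retrieval protocol reduces to cosine-similarity ranking of catalog embeddings; a minimal sketch with random stand-in embeddings (the real benchmark of course uses model-produced features):

```python
import numpy as np

def rank_catalog(query, catalog):
    """Cosine-similarity ranking of catalog embeddings for a single query
    embedding; returns catalog indices, best match first."""
    q = query / np.linalg.norm(query)
    c = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    return np.argsort(-(c @ q))

rng = np.random.default_rng(0)
catalog = rng.normal(size=(100, 32))              # 100 catalog items
query = catalog[42] + rng.normal(0, 0.01, 32)     # noisy view of item 42
ranking = rank_catalog(query, catalog)
```

Metrics such as recall@k then simply check whether the correct instance index appears in the top k of `ranking`.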
cs.CV / 41 / 2603.17219
SA-CycleGAN-2.5D: Self-Attention CycleGAN with Tri-Planar Context for Multi-Site MRI Harmonization
SA-CycleGAN-2.5D:具有三平面上下文的自注意力CycleGAN用于多站点MRI协调
Abstract
Multi-site neuroimaging analysis is fundamentally confounded by scanner-induced covariate shifts, where the marginal distribution of voxel intensities $P(\mathbf{x})$ varies non-linearly across acquisition protocols while the conditional anatomy $P(\mathbf{y}|\mathbf{x})$ remains constant. This is particularly detrimental to radiomic reproducibility, where acquisition variance often exceeds biological pathology variance. Existing statistical harmonization methods (e.g., ComBat) operate in feature space, precluding spatial downstream tasks, while standard deep learning approaches are theoretically bounded by local effective receptive fields (ERF), failing to model the global intensity correlations characteristic of field-strength bias. We propose SA-CycleGAN-2.5D, a domain adaptation framework motivated by the $H\Delta H$-divergence bound of Ben-David et al., integrating three architectural innovations: (1) A 2.5D tri-planar manifold injection preserving through-plane gradients $\nabla_z$ at $O(HW)$ complexity; (2) A U-ResNet generator with dense voxel-to-voxel self-attention, surpassing the $O(\sqrt{L})$ receptive field limit of CNNs to model global scanner field biases; and (3) A spectrally-normalized discriminator constraining the Lipschitz constant ($K_D \le 1$) for stable adversarial optimization. Evaluated on 654 glioma patients across two institutional domains (BraTS and UPenn-GBM), our method reduces Maximum Mean Discrepancy (MMD) by 99.1% ($1.729 \to 0.015$) and degrades domain classifier accuracy to near-chance (59.7%). Ablation confirms that global attention is statistically essential (Cohen's $d = 1.32$, $p < 0.001$) for the harder heterogeneous-to-homogeneous translation direction. By bridging 2D efficiency and 3D consistency, our framework yields voxel-level harmonized images that preserve tumor pathophysiology, enabling reproducible multi-center radiomic analysis.
Chinese Translation
多站点神经影像分析受到扫描仪引起的协变量偏移的根本干扰,其中体素强度的边际分布 $P(\mathbf{x})$ 在采集协议之间非线性变化,而条件解剖结构 $P(\mathbf{y}|\mathbf{x})$ 保持不变。这对放射组学的可重复性尤其有害,因为采集变异通常超过生物病理变异。现有的统计协调方法(例如,ComBat)在特征空间中操作,限制了空间下游任务,而标准深度学习方法在理论上受限于局部有效感受野(ERF),未能建模场强偏差所特有的全局强度相关性。我们提出了SA-CycleGAN-2.5D,这是一种受Ben-David等人提出的 $H\Delta H$-散度界限启发的领域适应框架,整合了三项架构创新:(1)一种2.5D三平面流形注入,以 $O(HW)$ 复杂度保持穿层梯度 $\nabla_z$;(2)一种具有密集体素到体素自注意力的U-ResNet生成器,超越了CNN的 $O(\sqrt{L})$ 感受野限制,以建模全局扫描仪场偏差;(3)一种谱归一化判别器,约束Lipschitz常数($K_D \le 1$),以实现稳定的对抗优化。在654名胶质瘤患者的两个机构域(BraTS和UPenn-GBM)上进行评估,我们的方法将最大均值差异(MMD)降低了99.1%($1.729 \to 0.015$),并将领域分类器的准确性降至接近随机水平(59.7%)。消融实验确认,全局注意力在更困难的异质到同质的转换方向上在统计上是必不可少的(Cohen's $d = 1.32$,$p < 0.001$)。通过桥接2D效率和3D一致性,我们的框架生成了保持肿瘤病理生理特征的体素级协调图像,从而实现可重复的多中心放射组学分析。
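The Maximum Mean Discrepancy (MMD) reduction reported above can be illustrated with a biased RBF-kernel MMD estimate; the kernel bandwidth and toy distributions below are arbitrary choices, not the paper's evaluation setup:

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy with an RBF
    kernel k(a, b) = exp(-gamma * ||a - b||^2); near zero when X and Y
    are drawn from the same distribution, larger under domain shift."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return float(k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean())

rng = np.random.default_rng(0)
same = mmd_rbf(rng.normal(size=(64, 4)), rng.normal(size=(64, 4)))
shifted = mmd_rbf(rng.normal(size=(64, 4)),
                  rng.normal(2.0, 1.0, size=(64, 4)))  # mean-shifted domain
```

Harmonization aims to move the "shifted" value toward the "same" value, which is what the reported 99.1% MMD reduction measures at scale.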
cs.CV / 42 / 2603.17227
Adaptive Anchor Policies for Efficient 4D Gaussian Streaming
高效4D高斯流的自适应锚点策略
Abstract
Dynamic scene reconstruction with Gaussian Splatting has enabled efficient streaming for real-time rendering and free-viewpoint video. However, most pipelines rely on fixed anchor selection such as Farthest Point Sampling (FPS), typically using 8,192 anchors regardless of scene complexity, which over-allocates computation under strict budgets. We propose Efficient Gaussian Streaming (EGS), a plug-in, budget-aware anchor sampler that replaces FPS with a reinforcement-learned policy while keeping the Gaussian streaming reconstruction backbone unchanged. The policy jointly selects an anchor budget and a subset of informative anchors under discrete constraints, balancing reconstruction quality and runtime using spatial features of the Gaussian representation. We evaluate EGS in two settings: fast rendering, which prioritizes runtime efficiency, and high-quality refinement, which enables additional optimization. Experiments on dynamic multi-view datasets show consistent improvements in the quality--efficiency trade-off over FPS sampling. On unseen data, in fast rendering at 256 anchors ($32\times$ fewer than 8,192), EGS improves PSNR by $+0.52$--$0.61$\,dB while running $1.29$--$1.35\times$ faster than IGS@8192 (N3DV and MeetingRoom). In high-quality refinement, EGS remains competitive with the full-anchor baseline at substantially lower anchor budgets. \emph{Code and pretrained checkpoints will be released upon acceptance.} \keywords{4D Gaussian Splatting \and 4D Gaussian Streaming \and Reinforcement Learning}
Chinese Translation
基于高斯点云的动态场景重建使得实时渲染和自由视角视频的高效流媒体传输成为可能。然而,大多数流程依赖于固定的锚点选择方法,如最远点采样(Farthest Point Sampling, FPS),通常无论场景复杂度如何都使用8,192个锚点,这在严格预算下导致计算资源的过度分配。我们提出了高效高斯流(Efficient Gaussian Streaming, EGS),这是一种即插即用、预算感知的锚点采样器,使用强化学习策略替代FPS,同时保持高斯流重建的主干不变。该策略在离散约束下联合选择锚点预算和一组信息丰富的锚点,利用高斯表示的空间特征平衡重建质量和运行时间。我们在两种设置中评估EGS:快速渲染,优先考虑运行效率;以及高质量细化,允许额外的优化。在动态多视角数据集上的实验表明,EGS在质量与效率的权衡上相较于FPS采样有一致的改善。在未见数据上,在256个锚点的快速渲染中(比8,192少$32\times$),EGS的峰值信噪比(PSNR)提高了$+0.52$--$0.61$ dB,同时运行速度比IGS@8192快$1.29$--$1.35\times$(N3DV和MeetingRoom)。在高质量细化中,EGS在显著较低的锚点预算下仍然与全锚点基线保持竞争力。\emph{代码和预训练检查点将在论文接收后发布。}\keywords{4D高斯点云 \and 4D高斯流 \and 强化学习}
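The budget-aware selection described in the abstract above can be caricatured with a toy sketch. Everything here is an illustrative assumption: `scores` stands in for the learned policy's per-anchor preferences, and the linear runtime penalty stands in for the paper's quality-efficiency trade-off; the actual EGS sampler is reinforcement-learned, not this greedy rule.

```python
def choose_budget_and_anchors(scores, budgets, runtime_cost=0.2):
    """Toy stand-in for a budget-aware anchor sampler: for each candidate
    budget, greedily take the top-scoring anchors, then keep the budget
    whose quality-minus-runtime utility is highest."""
    best = None
    for b in budgets:
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = sorted(ranked[:b])
        utility = sum(scores[i] for i in chosen) - runtime_cost * b
        if best is None or utility > best[0]:
            best = (utility, b, chosen)
    return best[1], best[2]
```

With scores concentrated on a few anchors, the utility peaks at a small budget, mirroring how EGS runs at 256 anchors instead of 8,192 when the scene allows it.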
cs.CV / 43 / 2603.17228
From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLMs
从下降到恢复:对多模态大语言模型中分割机制的分析
Abstract
Multimodal Large Language Models (MLLMs) are increasingly applied to pixel-level vision tasks, yet their intrinsic capacity for spatial understanding remains poorly understood. We investigate segmentation capacity through a layerwise linear probing evaluation across the entire MLLM pipeline: vision encoder, adapter, and LLM. We further conduct an intervention based attention knockout analysis to test whether cross-token attention progressively refines visual representations, and an evaluation of bidirectional attention among image tokens on spatial consistency. Our analysis reveals that the adapter introduces a segmentation representation drop-off, but LLM layers progressively recover through attention-mediated refinement, where correctly classified tokens steer misclassified neighbors toward the correct label. At early image token positions, this recovery is bounded by causal attention, which bidirectional attention among image tokens alleviates. These findings provide a mechanistic account of how MLLMs process visual information for segmentation, informing the design of future segmentation-capable models.
Chinese Translation
多模态大语言模型(MLLMs)越来越多地应用于像素级视觉任务,但它们在空间理解方面的内在能力仍然不甚了解。我们通过对整个MLLM管道(视觉编码器、适配器和LLM)进行逐层线性探测评估,研究了分割能力。我们进一步进行了一项干预式注意力消失分析,以测试跨标记注意力是否逐步优化视觉表征,并评估图像标记之间的双向注意力对空间一致性的影响。我们的分析揭示,适配器引入了分割表征的下降,但LLM层通过注意力介导的优化逐步恢复,其中正确分类的标记引导错误分类的邻近标记朝向正确标签。在早期图像标记位置,这种恢复受到因果注意力的限制,而图像标记之间的双向注意力则缓解了这一限制。这些发现为MLLMs如何处理视觉信息以进行分割提供了机制性解释,为未来具备分割能力的模型设计提供了参考。
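The attention-knockout intervention mentioned in the abstract above can be sketched as zeroing selected attention weights and renormalizing the row. This is a generic illustration of the knockout technique, not the authors' implementation.

```python
def knockout(attn_row, blocked):
    """Attention knockout for one query position: zero the weights to the
    blocked key positions, then renormalize so the row still sums to 1.
    If everything is blocked, the row stays all-zero."""
    row = [0.0 if j in blocked else w for j, w in enumerate(attn_row)]
    total = sum(row)
    return [w / total for w in row] if total > 0 else row
```

Running the model with and without such knockouts on image-token positions is what lets one test whether cross-token attention drives the representation recovery.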
cs.CV / 44 / 2603.17240
GigaWorld-Policy: An Efficient Action-Centered World--Action Model
GigaWorld-Policy:一种高效的以动作为中心的世界-动作模型
Ye, Angen, Wang, Boyuan, Ni, Chaojun, Huang, Guan, Zhao, Guosheng, Li, Hao, Li, Hengtao, Li, Jie, Lv, Jindi, Liu, Jingyu, Cao, Min, Li, Peng, Deng, Qiuping, Mei, Wenjun, Wang, Xiaofeng, Chen, Xinze, Zhou, Xinyu, Wang, Yang, Chang, Yifan, Li, Yifan, Zhou, Yukun, Ye, Yun, Liu, Zhichao, Zhu, Zheng
Abstract
World-Action Models (WAM) initialized from pre-trained video generation backbones have demonstrated remarkable potential for robot policy learning. However, existing approaches face two critical bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, making motion prediction accuracy heavily dependent on the quality of future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with optional video generation. Specifically, we formulate policy training into two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. The policy is supervised by both action prediction and video generation, providing richer learning signals and encouraging physically plausible actions through visual-dynamics constraints. With a causal design that prevents future-video tokens from influencing action tokens, explicit future-video generation is optional at inference time, allowing faster action prediction during deployment. To support this paradigm, we curate a diverse, large-scale robot dataset to pre-train an action-centered video generation model, which is then adapted as the backbone for robot policy learning. Experimental results on real-world robotic platforms show that GigaWorld-Policy runs 9x faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Moreover, compared with pi-0.5, GigaWorld-Policy improves performance by 95% on RoboTwin 2.0.
Chinese Translation
基于预训练视频生成骨干网络初始化的世界-动作模型(World-Action Models, WAM)在机器人策略学习中展现了显著的潜力。然而,现有方法面临两个关键瓶颈,阻碍了性能和部署。首先,联合推理未来视觉动态和相应动作会带来显著的推理开销。其次,联合建模往往将视觉和运动表示纠缠在一起,使得运动预测的准确性严重依赖于未来视频预测的质量。为了解决这些问题,我们提出了GigaWorld-Policy,这是一种以动作为中心的WAM,能够学习二维像素-动作动态,同时实现高效的动作解码,并可选生成视频。具体而言,我们将策略训练形式化为两个耦合组件:模型根据当前观察预测未来动作序列,同时生成基于预测动作和相同观察的未来视频。该策略通过动作预测和视频生成进行监督,提供更丰富的学习信号,并通过视觉动态约束鼓励物理上合理的动作。通过因果设计,防止未来视频标记影响动作标记,在推理时显式的未来视频生成是可选的,从而在部署期间实现更快的动作预测。为了支持这一范式,我们整理了一个多样化的大规模机器人数据集,以预训练一个以动作为中心的视频生成模型,然后将其适配为机器人策略学习的骨干。实验证明,在真实世界的机器人平台上,GigaWorld-Policy的运行速度比领先的WAM基线Motus快9倍,同时任务成功率提高了7%。此外,与pi-0.5相比,GigaWorld-Policy在RoboTwin 2.0上的性能提高了95%。
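The causal design above, which keeps future-video tokens from influencing action tokens so that video generation can be skipped at inference, can be illustrated with a toy attention mask. The token-type names and the flat token layout are assumptions for illustration, not the model's actual tokenization.

```python
def build_mask(token_types):
    """token_types: list of 'obs', 'act', or 'vid'. mask[i][j] == True means
    token i may attend to token j. Attention is causal, and action tokens
    are additionally blind to video tokens, so dropping video generation
    at inference cannot change the predicted actions."""
    n = len(token_types)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: no attention to the future
            if token_types[i] == 'act' and token_types[j] == 'vid':
                continue  # actions never read video tokens
            mask[i][j] = True
    return mask
```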
cs.CV / 45 / 2603.17265
LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis
LED:评估文档分析中布局错误检测的基准
Abstract
Recent advances in Large Language Models (LLMs) and Large Multimodal Models (LMMs) have improved Document Layout Analysis (DLA), yet structural errors such as region merging, splitting, and omission remain persistent. Conventional overlap-based metrics (e.g., IoU, mAP) fail to capture such logical inconsistencies. To overcome this limitation, we propose Layout Error Detection (LED), a benchmark that evaluates structural reasoning in DLA predictions beyond surface-level accuracy. LED defines eight standardized error types (Missing, Hallucination, Size Error, Split, Merge, Overlap, Duplicate, and Misclassification) and provides quantitative rules and injection algorithms for realistic error simulation. Using these definitions, we construct LED-Dataset and design three evaluation tasks: document-level error detection, document-level error-type classification, and element-level error-type classification. Experiments with state-of-the-art multimodal models show that LED enables fine-grained and interpretable assessment of structural understanding, revealing clear weaknesses across modalities and architectures. Overall, LED establishes a unified and explainable benchmark for diagnosing the structural robustness and reasoning capability of document understanding models.
Chinese Translation
近期大型语言模型(LLMs)和大型多模态模型(LMMs)的进展改善了文档布局分析(DLA),但诸如区域合并、分割和遗漏等结构性错误仍然存在。传统的基于重叠的度量(例如 IoU、mAP)无法捕捉这些逻辑不一致性。为克服这一局限,我们提出了布局错误检测(LED),这是一个超越表面准确性、评估DLA预测中结构推理能力的基准。LED定义了八种标准化的错误类型(缺失、幻觉、尺寸错误、分割、合并、重叠、重复和误分类),并提供了用于逼真错误模拟的定量规则和注入算法。基于这些定义,我们构建了LED数据集,并设计了三个评估任务:文档级错误检测、文档级错误类型分类和元素级错误类型分类。对最先进多模态模型的实验表明,LED能够对结构理解进行细致且可解释的评估,揭示了不同模态和架构的明显弱点。总体而言,LED建立了一个统一且可解释的基准,用于诊断文档理解模型的结构稳健性和推理能力。
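The injection algorithms for realistic error simulation can be sketched for one of the eight error types. The `(x0, y0, x1, y1)` box format and the midpoint split are illustrative assumptions, not LED's actual injection rules.

```python
def inject_split(boxes, idx, axis=0):
    """Simulate a 'Split' layout error: replace one ground-truth region
    with its two halves along the given axis (0 = horizontal cut position,
    1 = vertical). Boxes are (x0, y0, x1, y1) tuples."""
    x0, y0, x1, y1 = boxes[idx]
    if axis == 0:
        mid = (x0 + x1) / 2
        halves = [(x0, y0, mid, y1), (mid, y0, x1, y1)]
    else:
        mid = (y0 + y1) / 2
        halves = [(x0, y0, x1, mid), (x0, mid, x1, y1)]
    return boxes[:idx] + halves + boxes[idx + 1:]
```

Analogous transforms (deleting a box for Missing, fusing two for Merge, relabeling for Misclassification) would cover the remaining error types.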
cs.CV / 46 / 2603.17267
ConfusionBench: An Expert-Validated Benchmark for Confusion Recognition and Localization in Educational Videos
ConfusionBench:一个专家验证的教育视频混淆识别与定位基准
Abstract
Recognizing and localizing student confusion from video is an important yet challenging problem in educational AI. Existing confusion datasets suffer from noisy labels, coarse temporal annotations, and limited expert validation, which hinder reliable fine-grained recognition and temporally grounded analysis. To address these limitations, we propose a practical multi-stage filtering pipeline that integrates two stages of model-assisted screening, researcher curation, and expert validation to build a higher-quality benchmark for confusion understanding. Based on this pipeline, we introduce ConfusionBench, a new benchmark for educational videos consisting of a balanced confusion recognition dataset and a video localization dataset. We further provide zero-shot baseline evaluations of a representative open-source model and a proprietary model on clip-level confusion recognition and long-video confusion localization tasks. Experimental results show that the proprietary model performs better overall but tends to over-predict transitional segments, while the open-source model is more conservative and more prone to missed detections. In addition, the proposed student confusion report visualization can support educational experts in making intervention decisions and adapting learning plans accordingly. All datasets and related materials will be made publicly available on our project page.
Chinese Translation
从视频中识别和定位学生的混淆是教育人工智能中一个重要但具有挑战性的问题。现有的混淆数据集存在标签噪声、粗略的时间注释和有限的专家验证,这些问题阻碍了可靠的细粒度识别和具有时间定位的分析。为了解决这些局限性,我们提出了一种实用的多阶段过滤管道,该管道集成了两阶段的模型辅助筛选、研究者策划和专家验证,以构建更高质量的混淆理解基准。在此基础上,我们推出了ConfusionBench,一个新的教育视频基准,包含一个平衡的混淆识别数据集和一个视频定位数据集。我们进一步提供了一个代表性开源模型和一个专有模型在片段级混淆识别和长视频混淆定位任务上的零样本基线评估。实验结果表明,专有模型整体表现更好,但倾向于过度预测过渡段,而开源模型则更为保守,更容易漏检。此外,所提出的学生混淆报告可视化可以支持教育专家做出干预决策并相应调整学习计划。所有数据集及相关材料将公开发布在我们的项目页面上。
cs.CV / 47 / 2603.17275
DANCE: Dynamic 3D CNN Pruning: Joint Frame, Channel, and Feature Adaptation for Energy Efficiency on the Edge
DANCE:动态3D CNN剪枝:边缘计算中的帧、通道和特征联合适应以提高能效
Abstract
Modern convolutional neural networks (CNNs) are workhorses for video and image processing, but fail to adapt to the computational complexity of input samples in a dynamic manner to minimize energy consumption. In this research, we propose DANCE, a fine-grained, input-aware, dynamic pruning framework for 3D CNNs to maximize power efficiency with negligible to zero impact on performance. In the proposed two-step approach, the first step is called activation variability amplification (AVA), and the 3D CNN model is retrained to increase the variance of the magnitude of neuron activations across the network in this step, facilitating pruning decisions across diverse CNN input scenarios. In the second step, called adaptive activation pruning (AAP), a lightweight activation controller network is trained to dynamically prune frames, channels, and features of 3D convolutional layers of the network (different for each layer), based on statistics of the outputs of the first layer of the network. Our method achieves substantial savings in multiply-accumulate (MAC) operations and memory accesses by introducing sparsity within convolutional layers. Hardware validation on the NVIDIA Jetson Nano GPU and the Qualcomm Snapdragon 8 Gen 1 platform demonstrates respective speedups of 1.37X and 2.22X, achieving up to 1.47X higher energy efficiency compared to the state of the art.
Chinese Translation
现代卷积神经网络(CNN)是视频和图像处理的主力军,但未能以动态方式适应输入样本的计算复杂性,从而最大限度地减少能耗。在本研究中,我们提出了DANCE,一个细粒度、输入感知的动态剪枝框架,旨在最大化3D CNN的能效,同时对性能的影响微乎其微或为零。在所提出的两步方法中,第一步称为激活变异性放大(AVA),在此步骤中重新训练3D CNN模型,以增加网络中神经元激活幅度的方差,从而促进在不同CNN输入场景下的剪枝决策。第二步称为自适应激活剪枝(AAP),训练一个轻量级的激活控制网络,基于网络第一层输出的统计信息,动态剪枝网络中3D卷积层的帧、通道和特征(每层不同)。我们的方法通过在卷积层中引入稀疏性,实现了乘加(MAC)操作和内存访问的显著节省。在NVIDIA Jetson Nano GPU和Qualcomm Snapdragon 8 Gen 1平台上的硬件验证显示,分别实现了1.37倍和2.22倍的加速,相较于当前最先进技术,能效提高了最高1.47倍。
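A minimal sketch of input-aware channel pruning driven by first-layer statistics, in the spirit of the AAP step above. The mean-magnitude score and the fixed top-k rule are stand-ins for the trained lightweight controller network, which decides per layer and per input.

```python
def select_channels(first_layer_out, keep_ratio=0.5):
    """first_layer_out: one activation map per channel, flattened to a list.
    Keep the channels with the largest mean |activation| for this input;
    the rest would be skipped, saving their MACs and memory traffic."""
    scores = [sum(abs(v) for v in ch) / len(ch) for ch in first_layer_out]
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

The AVA retraining step in the abstract exists precisely to spread these per-channel scores apart, so a threshold or top-k rule separates channels cleanly.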
cs.CV / 48 / 2603.17295
Directing the Narrative: A Finetuning Method for Controlling Coherence and Style in Story Generation
引导叙事:一种用于控制故事生成中的连贯性和风格的微调方法
Abstract
Story visualization requires generating sequential imagery that aligns semantically with evolving narratives while maintaining rigorous consistency in character identity and visual style. However, existing methodologies often struggle with subject inconsistency and identity drift, particularly when depicting complex interactions or extended narrative arcs. To address these challenges, we propose a cohesive two-stage framework designed for robust and consistent story generation. First, we introduce Group-Shared Attention (GSA), a mechanism that fosters intrinsic consistency by enabling lossless cross-sample information flow within attention layers. This allows the model to structurally encode identity correspondence across frames without relying on external encoders. Second, we leverage Direct Preference Optimization (DPO) to align generated outputs with human aesthetic and narrative standards. Unlike conventional methods that rely on conflicting auxiliary losses, our approach simultaneously enhances visual fidelity and identity preservation by learning from holistic preference data. Extensive evaluations on the ViStoryBench benchmark demonstrate that our method establishes a new state-of-the-art, significantly outperforming strong baselines with gains of +10.0 in Character Identity (CIDS) and +18.7 in Style Consistency (CSD), all while preserving high-fidelity generation.
Chinese Translation
故事可视化需要生成与不断发展的叙事在语义上对齐的顺序图像,同时保持角色身份和视觉风格的一致性。然而,现有的方法往往在主题一致性和身份漂移方面存在困难,特别是在描绘复杂互动或延续叙事弧时。为了解决这些挑战,我们提出了一种连贯的两阶段框架,旨在实现稳健且一致的故事生成。首先,我们引入了群体共享注意力(Group-Shared Attention, GSA)机制,通过在注意力层内实现无损的跨样本信息流动,促进内在一致性。这使得模型能够在各帧之间结构性地编码身份对应,而无需依赖外部编码器。其次,我们利用直接偏好优化(Direct Preference Optimization, DPO)将生成的输出与人类的审美和叙事标准对齐。与依赖于相互冲突的辅助损失的传统方法不同,我们的方法通过从整体偏好数据中学习,同时增强视觉保真度和身份保持。对ViStoryBench基准的广泛评估表明,我们的方法建立了新的最先进水平,在角色身份(Character Identity, CIDS)上提升了+10.0,在风格一致性(Style Consistency, CSD)上提升了+18.7,同时保持了高保真度的生成。
cs.CV / 49 / 2603.17304
3D MRI-Based Alzheimer's Disease Classification Using Multi-Modal 3D CNN with Leakage-Aware Subject-Level Evaluation
基于3D MRI的阿尔茨海默病分类:使用多模态3D卷积神经网络与泄漏感知的受试者水平评估
Abstract
Deep learning has become an important tool for Alzheimer's disease (AD) classification from structural MRI. Many existing studies analyze individual 2D slices extracted from MRI volumes, while clinical neuroimaging practice typically relies on the full three dimensional structure of the brain. From this perspective, volumetric analysis may better capture spatial relationships among brain regions that are relevant to disease progression. Motivated by this idea, this work proposes a multimodal 3D convolutional neural network for AD classification using raw OASIS 1 MRI volumes. The model combines structural T1 information with gray matter, white matter, and cerebrospinal fluid probability maps obtained through FSL FAST segmentation in order to capture complementary neuroanatomical information. The proposed approach is evaluated on the clinically labelled OASIS 1 cohort using 5 fold subject level cross validation, achieving a mean accuracy of 72.34% plus or minus 4.66% and a ROC AUC of 0.7781 plus or minus 0.0365. GradCAM visualizations further indicate that the model focuses on anatomically meaningful regions, including the medial temporal lobe and ventricular areas that are known to be associated with Alzheimer's related structural changes. To better understand how data representation and evaluation strategies may influence reported performance, additional diagnostic experiments were conducted on a slice based version of the dataset under both slice level and subject level protocols. These observations help provide context for the volumetric results. Overall, the proposed multimodal 3D framework establishes a reproducible subject level benchmark and highlights the potential benefits of volumetric MRI analysis for Alzheimer's disease classification.
Chinese Translation
深度学习已成为从结构性MRI中分类阿尔茨海默病(AD)的重要工具。许多现有研究分析从MRI体积中提取的单个2D切片,而临床神经影像学实践通常依赖于大脑的完整三维结构。从这个角度来看,体积分析可能更好地捕捉与疾病进展相关的大脑区域之间的空间关系。基于这一理念,本研究提出了一种用于AD分类的多模态3D卷积神经网络,使用原始OASIS 1 MRI体积。该模型结合了结构性T1信息与通过FSL FAST分割获得的灰质、白质和脑脊液概率图,以捕捉互补的神经解剖信息。所提出的方法在临床标记的OASIS 1队列上进行了评估,采用5折受试者水平交叉验证,获得了72.34% ± 4.66%的平均准确率和0.7781 ± 0.0365的ROC AUC。GradCAM可视化进一步表明,该模型关注于解剖上有意义的区域,包括已知与阿尔茨海默病相关的结构变化的内侧颞叶和脑室区域。为了更好地理解数据表示和评估策略如何影响报告的性能,还对数据集的切片版本在切片水平和受试者水平协议下进行了额外的诊断实验。这些观察有助于为体积结果提供背景。总体而言,所提出的多模态3D框架建立了一个可重复的受试者水平基准,并突显了体积MRI分析在阿尔茨海默病分类中的潜在益处。
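The leakage-aware, subject-level 5-fold protocol above can be sketched as a split that keeps every sample from one subject inside a single fold, so no subject ever spans train and test. This is a generic illustration of the evaluation idea, not the authors' exact split.

```python
import random

def subject_level_folds(samples, n_folds=5, seed=0):
    """Leakage-aware split: all samples of a subject land in the same fold.
    `samples` is a list of (subject_id, item) pairs."""
    subjects = sorted({sid for sid, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    fold_of = {sid: i % n_folds for i, sid in enumerate(subjects)}
    folds = [[] for _ in range(n_folds)]
    for sid, item in samples:
        folds[fold_of[sid]].append((sid, item))
    return folds
```

A slice-level split, by contrast, would scatter one subject's slices across folds, which is exactly the leakage the paper's diagnostic experiments probe.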
cs.CV / 50 / 2603.17307
Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding
Symphony:一种用于长视频理解的认知启发多智能体系统
Abstract
Despite rapid developments and widespread applications of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Moreover, directly reducing the time context through embedding-based retrieval may lose key information of complex problems. In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations. By emulating human cognition patterns, Symphony decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving the reasoning capability. Additionally, Symphony provides a VLM-based grounding approach to analyze LVU tasks and assess the relevance of video segments, which significantly enhances the ability to locate complex problems with implicit intentions and large temporal spans. Experimental results show that Symphony achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state-of-the-art method on LVBench. Code is available at https://github.com/Haiyang0226/Symphony.
Chinese Translation
尽管多模态大语言模型(MLLM)智能体发展迅速、应用广泛,但它们在长视频理解(LVU)任务中仍然面临挑战,这些任务的特点是信息密度高和时间跨度长。近期关于LVU智能体的研究表明,简单的任务分解和协作机制不足以应对长链推理任务。此外,通过基于嵌入的检索直接缩减时间上下文可能会丢失复杂问题的关键信息。本文提出了Symphony,一个多智能体系统,以缓解这些局限性。Symphony通过模拟人类认知模式,将LVU分解为细粒度的子任务,并结合一种通过反思增强的深度推理协作机制,有效提高了推理能力。此外,Symphony提供了一种基于视觉语言模型(VLM)的视觉定位(grounding)方法来分析LVU任务并评估视频片段的相关性,这显著增强了定位具有隐含意图和大时间跨度的复杂问题的能力。实验结果表明,Symphony在LVBench、LongVideoBench、VideoMME和MLVU上达到了最先进的性能,在LVBench上比之前的最先进方法提高了5.0%。代码可在 https://github.com/Haiyang0226/Symphony 获取。
cs.CV / 51 / 2603.17312
Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress
基于视觉-语言模型的递归推理用于估计长时间范围的具身任务进展
Abstract
Accurately estimating task progress is critical for embodied agents to plan and execute long-horizon, multi-step tasks. Despite promising advances, existing Vision-Language Models (VLMs) based methods primarily leverage their video understanding capabilities, while neglecting their complex reasoning potential. Furthermore, processing long video trajectories with VLMs is computationally prohibitive for real-world deployment. To address these challenges, we propose the Recurrent Reasoning Vision-Language Model ($\text{R}^2$VLM). Our model features a recurrent reasoning framework that processes local video snippets iteratively, maintaining a global context through an evolving Chain of Thought (CoT). This CoT explicitly records task decomposition, key steps, and their completion status, enabling the model to reason about complex temporal dependencies. This design avoids the high cost of processing long videos while preserving essential reasoning capabilities. We train $\text{R}^2$VLM on large-scale, automatically generated datasets from ALFRED and Ego4D. Extensive experiments on progress estimation and downstream applications, including progress-enhanced policy learning, reward modeling for reinforcement learning, and proactive assistance, demonstrate that $\text{R}^2$VLM achieves strong performance and generalization, achieving a new state-of-the-art in long-horizon task progress estimation. The models and benchmarks are publicly available at \href{https://huggingface.co/collections/zhangyuelin/r2vlm}{huggingface}.
Chinese Translation
准确估计任务进展对于具身智能体规划和执行长时间范围的多步骤任务至关重要。尽管已有令人鼓舞的进展,现有的基于视觉-语言模型(VLMs)的方法主要利用其视频理解能力,而忽视了其复杂推理潜力。此外,使用VLM处理长视频轨迹在实际部署中计算成本过高。为了解决这些挑战,我们提出了递归推理视觉-语言模型($\text{R}^2$VLM)。我们的模型具有一个递归推理框架,通过迭代处理局部视频片段,并借助不断演变的思维链(Chain of Thought, CoT)维持全局上下文。这个CoT明确记录任务分解、关键步骤及其完成状态,使模型能够推理复杂的时间依赖关系。该设计避免了处理长视频的高成本,同时保留了必要的推理能力。我们在来自ALFRED和Ego4D的大规模自动生成数据集上训练了$\text{R}^2$VLM。在进展估计和下游应用(包括进展增强的策略学习、强化学习的奖励建模和主动辅助)上的大量实验表明,$\text{R}^2$VLM实现了强大的性能和泛化能力,在长时间范围的任务进展估计中达到了新的最先进水平。模型和基准数据集已在 huggingface(https://huggingface.co/collections/zhangyuelin/r2vlm)上公开可用。
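The recurrent snippet-by-snippet pass with an evolving completion record can be sketched as follows. `detect_step` is a hypothetical stand-in for the VLM call, and the dict of booleans is a drastic simplification of the paper's chain of thought; the point is only that each iteration sees one local snippet plus carried-over state, never the whole video.

```python
def estimate_progress(snippets, task_steps, detect_step):
    """Recurrent progress estimation: iterate over local video snippets,
    update a completion map (the stand-in CoT), and emit a progress value
    after each snippet as completed / total steps."""
    cot = {step: False for step in task_steps}
    history = []
    for snip in snippets:
        for step in detect_step(snip, cot):  # stand-in for the VLM call
            if step in cot:
                cot[step] = True
        history.append(sum(cot.values()) / len(task_steps))
    return cot, history
```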
cs.CV / 52 / 2603.17314
A Proposal-Free Query-Guided Network for Grounded Multimodal Named Entity Recognition
一种面向视觉定位多模态命名实体识别的无提案查询引导网络
Abstract
Grounded Multimodal Named Entity Recognition (GMNER) identifies named entities, including their spans and types, in natural language text and grounds them to the corresponding regions in associated images. Most existing approaches split this task into two steps: they first detect objects using a pre-trained general-purpose detector and then match named entities to the detected objects. However, these methods face a major limitation. Because pre-trained general-purpose object detectors operate independently of textual entities, they tend to detect common objects and frequently overlook specific fine-grained regions required by named entities. This misalignment between object detectors and entities introduces imprecision and can impair overall system performance. In this paper, we propose a proposal-free Query-Guided Network (QGN) that unifies multimodal reasoning and decoding through text guidance and cross-modal interaction. QGN enables accurate grounding and robust performance in open-domain scenarios. Extensive experiments demonstrate that QGN achieves top performance among compared GMNER models on widely used benchmarks.
Chinese Translation
视觉定位多模态命名实体识别(Grounded Multimodal Named Entity Recognition, GMNER)在自然语言文本中识别命名实体(包括其范围和类型),并将其定位到相关图像中的相应区域。现有的大多数方法将此任务分为两个步骤:首先使用预训练的通用检测器检测对象,然后将命名实体与检测到的对象进行匹配。然而,这些方法面临一个主要限制。由于预训练的通用对象检测器独立于文本实体运行,它们往往只检测常见对象,并且经常忽视命名实体所需的特定细粒度区域。这种对象检测器与实体之间的不匹配引入了不精确性,并可能损害整体系统性能。在本文中,我们提出了一种无提案的查询引导网络(QGN),通过文本引导和跨模态交互统一多模态推理和解码。QGN能够在开放领域场景中实现准确的视觉定位和稳健的性能。大量实验表明,QGN在广泛使用的基准测试中于所比较的GMNER模型中实现了最佳性能。
cs.CV / 53 / 2603.17325
MedSAD-CLIP: Supervised CLIP with Token-Patch Cross-Attention for Medical Anomaly Detection and Segmentation
MedSAD-CLIP:基于Token-Patch跨注意力的监督CLIP用于医学异常检测与分割
Abstract
Medical anomaly detection (MAD) and segmentation play a critical role in assisting clinical diagnosis by identifying abnormal regions in medical images and localizing pathological regions. Recent CLIP-based studies are promising for anomaly detection in zero-/few-shot settings, but typically rely on global representations and weak supervision, often producing coarse localization and limited segmentation quality. In this work, we study supervised adaptation of CLIP for MAD under a realistic clinical setting where a limited yet meaningful amount of labeled abnormal data is available. Our model MedSAD-CLIP leverages fine-grained text-visual cues via the Token-Patch Cross-Attention (TPCA) to improve lesion localization while preserving the generalization capability of CLIP representations. Lightweight image adapters and learnable prompt tokens efficiently adapt the pretrained CLIP encoder to the medical domain while preserving its rich semantic alignment. Furthermore, a Margin-based image-text Contrastive Loss is designed to enhance global feature discrimination between normal and abnormal representations. Extensive experiments on four diverse benchmarks (Brain, Retina, Lung, and Breast datasets) demonstrate the effectiveness of our approach, achieving superior performance in both pixel-level segmentation and image-level classification over state-of-the-art methods. Our results highlight the potential of supervised CLIP adaptation as a unified and scalable paradigm for medical anomaly understanding. Code will be made available at https://github.com/thuy4tbn99/MedSAD-CLIP
Chinese Translation
医学异常检测(MAD)和分割在通过识别医学图像中的异常区域和定位病理区域来辅助临床诊断中发挥着关键作用。最近基于CLIP的研究在零样本/少样本设置下的异常检测中展现出良好的前景,通常依赖于全局表示和弱监督,往往导致粗糙的定位和有限的分割质量。在本研究中,我们探讨了在现实临床环境下对CLIP进行监督适应的方法,其中有有限但有意义的标注异常数据可用。我们的模型MedSAD-CLIP通过Token-Patch跨注意力(TPCA)利用细粒度的文本-视觉线索来改善病灶定位,同时保持CLIP表示的泛化能力。轻量级图像适配器和可学习的提示令牌有效地将预训练的CLIP编码器适应于医学领域,同时保持其丰富的语义对齐。此外,设计了一种基于边际的图像-文本对比损失,以增强正常与异常表示之间的全局特征区分。在四个多样化基准(脑部、视网膜、肺部和乳腺数据集)上的广泛实验表明了我们方法的有效性,在像素级分割和图像级分类方面均优于最先进的方法。我们的结果突显了监督CLIP适应作为医学异常理解的统一且可扩展范式的潜力。代码将发布在 https://github.com/thuy4tbn99/MedSAD-CLIP
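The margin-based image-text contrastive loss above can be sketched as a hinge on the similarity gap between matching and non-matching pairs. The pairwise hinge form and the margin value 0.2 are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def margin_contrastive_loss(img_emb, txt_pos, txt_neg, margin=0.2):
    """Hinge-style margin loss: the matching text embedding must score at
    least `margin` higher than the non-matching one, else a penalty grows
    linearly with the violation."""
    s_pos = cosine(img_emb, txt_pos)
    s_neg = cosine(img_emb, txt_neg)
    return max(0.0, margin - (s_pos - s_neg))
```

The loss is zero once normal/abnormal text prompts are separated by the margin, which is how it pushes global features apart without disturbing already well-separated pairs.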
cs.CV / 54 / 2603.17326
FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions
FineViT:通过密集重述逐步解锁细粒度感知
Abstract
While Multimodal Large Language Models (MLLMs) have experienced rapid advancements, their visual encoders frequently remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks due to the loss of visual details caused by low-resolution pretraining and the reliance on noisy, coarse web-crawled image-text pairs. To overcome these limitations, we introduce FineViT, a novel vision encoder specifically designed to unlock fine-grained perception. By replacing coarse web data with dense recaptions, we systematically mitigate information loss through a progressive training paradigm: first, the encoder is trained from scratch at a high native resolution on billions of global recaptioned image-text pairs, establishing a robust, detail-rich semantic foundation. Subsequently, we further enhance its local perception through LLM alignment, utilizing our curated FineCap-450M dataset that comprises over $450$ million high-quality local captions. Extensive experiments validate the effectiveness of the progressive strategy. FineViT achieves state-of-the-art zero-shot recognition and retrieval performance, especially in long-context retrieval, and consistently outperforms multimodal visual encoders such as SigLIP2 and Qwen-ViT when integrated into MLLMs. We hope FineViT could serve as a powerful new baseline for fine-grained visual perception.
Chinese Translation
尽管多模态大型语言模型(MLLMs)取得了快速进展,但其视觉编码器常常成为性能瓶颈。传统的基于CLIP的编码器在处理密集空间任务时面临困难,因为低分辨率的预训练导致视觉细节的丢失,并且依赖于噪声较大、粗糙的网络爬取图像-文本对。为了解决这些限制,我们提出了FineViT,这是一种专门设计的视觉编码器,旨在解锁细粒度感知。通过用密集重述替代粗糙的网络数据,我们通过渐进式训练范式系统性地减轻信息损失:首先,编码器在数十亿个全局重述的图像-文本对上从头开始以高原生分辨率进行训练,建立了一个稳健、细节丰富的语义基础。随后,我们通过LLM对齐进一步增强其局部感知,利用我们精心策划的FineCap-450M数据集,该数据集包含超过4.5亿个高质量的局部标题。大量实验验证了渐进策略的有效性。FineViT在零样本识别和检索性能上达到了最先进的水平,尤其是在长上下文检索中,并且在与MLLM集成时始终优于SigLIP2和Qwen-ViT等多模态视觉编码器。我们希望FineViT能够作为细粒度视觉感知的强大新基准。
cs.CV / 55 / 2603.17343
EvoGuard: An Extensible Agentic RL-based Framework for Practical and Evolving AI-Generated Image Detection
EvoGuard:一个可扩展的基于代理的强化学习框架,用于实用和不断演变的人工智能生成图像检测
Abstract
The rapid proliferation of AI-Generated Images (AIGIs) has introduced severe risks of misinformation, making AIGI detection a critical yet challenging task. While traditional detection paradigms mainly rely on low-level features, recent research increasingly focuses on leveraging the general understanding ability of Multimodal Large Language Models (MLLMs) to achieve better generalization, but still suffers from limited extensibility and expensive training data annotations. To better address complex and dynamic real-world environments, we propose EvoGuard, a novel agentic framework for AIGI detection. It encapsulates various state-of-the-art (SOTA) off-the-shelf MLLM and non-MLLM detectors as callable tools, and coordinates them through a capability-aware dynamic orchestration mechanism. Empowered by the agent's capacities for autonomous planning and reflection, it intelligently selects suitable tools for given samples, reflects on intermediate results, and decides the next action, reaching a final conclusion through multi-turn invocation and reasoning. This design effectively exploits the complementary strengths among heterogeneous detectors, transcending the limits of any single model. Furthermore, optimized by a GRPO-based Agentic Reinforcement Learning algorithm using only low-cost binary labels, it eliminates the reliance on fine-grained annotations. Extensive experiments demonstrate that EvoGuard achieves SOTA accuracy while mitigating the bias between positive and negative samples. More importantly, it allows the plug-and-play integration of new detectors to boost overall performance in a train-free manner, offering a highly practical, long-term solution to ever-evolving AIGI threats. Source code will be publicly available upon acceptance.
Chinese Translation
人工智能生成图像(AIGI)的快速传播带来了严重的虚假信息风险,使得AIGI检测成为一项关键但具有挑战性的任务。尽管传统的检测范式主要依赖于低级特征,但最近的研究越来越关注利用多模态大型语言模型(MLLM)的通用理解能力以实现更好的泛化,然而仍然面临可扩展性有限和训练数据标注昂贵的问题。为了更好地应对复杂和动态的现实环境,我们提出了EvoGuard,一个用于AIGI检测的新型智能体框架。它将各种最先进的(SOTA)现成的MLLM和非MLLM检测器封装为可调用工具,并通过能力感知的动态编排机制进行协调。借助智能体的自主规划和反思能力,它智能地为给定样本选择合适的工具,对中间结果进行反思,并决定下一步行动,通过多轮调用和推理得出最终结论。该设计有效利用了异构检测器之间的互补优势,超越了任何单一模型的局限。此外,通过使用仅需低成本二元标签的基于GRPO的智能体强化学习算法进行优化,消除了对细粒度标注的依赖。大量实验表明,EvoGuard在减轻正负样本之间偏差的同时达到了SOTA准确率。更重要的是,它允许以即插即用的方式集成新检测器,从而在无需再训练的情况下提升整体性能,为不断演变的AIGI威胁提供了一个高度实用的长期解决方案。源代码将在论文接收后公开。
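The multi-turn orchestration can be caricatured as a confidence-gated tool loop with a majority-vote fallback. The tool interface, confidence threshold, and early-stop rule are all illustrative assumptions; the actual framework coordinates tools with an RL-trained agent, not this fixed schedule.

```python
def agentic_detect(image, tools, max_turns=3, threshold=0.9):
    """Toy orchestration loop: call detectors in the given order, stop
    early once one is confident enough, otherwise fall back to a majority
    vote over the labels collected so far. `tools` is a list of
    (name, fn) pairs where fn(image) -> (label, confidence)."""
    votes = []
    for _name, fn in tools:
        label, conf = fn(image)
        votes.append(label)
        if conf >= threshold or len(votes) >= max_turns:
            break
    return max(set(votes), key=votes.count)
```

Because detectors stay behind a uniform callable interface, adding a new one is just appending to `tools`, which mirrors the plug-and-play, train-free extensibility the abstract emphasizes.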
cs.CV / 56 / 2603.17355
OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery
OnlineHMR:基于视频的在线世界坐标系人体网格恢复
Abstract
Human mesh recovery (HMR) models 3D human body from monocular videos, with recent works extending it to world-coordinate human trajectory and motion reconstruction. However, most existing methods remain offline, relying on future frames or global optimization, which limits their applicability in interactive feedback and perception-action loop scenarios such as AR/VR and telepresence. To address this, we propose OnlineHMR, a fully online framework that jointly satisfies four essential criteria of online processing, including system-level causality, faithfulness, temporal consistency, and efficiency. Built upon a two-branch architecture, OnlineHMR enables streaming inference via a causal key-value cache design and a curated sliding-window learning strategy. Meanwhile, a human-centric incremental SLAM provides online world-grounded alignment under physically plausible trajectory correction. Experimental results show that our method achieves performance comparable to existing chunk-based approaches on the standard EMDB benchmark and highly dynamic custom videos, while uniquely supporting online processing. Page and code are available at https://tsukasane.github.io/Video-OnlineHMR/.
Chinese Translation
人体网格恢复(HMR)从单目视频中重建三维人体,最近的研究将其扩展到世界坐标系下的人体轨迹与运动重建。然而,大多数现有方法仍然是离线的,依赖于未来帧或全局优化,这限制了它们在增强现实/虚拟现实(AR/VR)和远程存在等交互反馈与感知-行动循环场景中的适用性。为了解决这个问题,我们提出了OnlineHMR,一个完全在线的框架,同时满足在线处理的四个基本标准:系统级因果性、真实性、时间一致性和效率。OnlineHMR基于双分支架构,通过因果键值缓存设计和精心设计的滑动窗口学习策略实现流式推断。同时,以人为中心的增量SLAM在物理上合理的轨迹修正下提供在线的世界坐标对齐。实验结果表明,我们的方法在标准EMDB基准和高度动态的自定义视频上实现了与现有基于块的方法相当的性能,同时独特地支持在线处理。页面和代码可在 https://tsukasane.github.io/Video-OnlineHMR/ 获取。
cs.CV / 57 / 2603.17358
A 3D Reconstruction Benchmark for Asset Inspection
用于资产检查的三维重建基准
Abstract
Asset management requires accurate 3D models to inform the maintenance, repair, and assessment of buildings, maritime vessels, and other key structures as they age. These downstream applications rely on high-fidelity models produced from aerial surveys in close proximity to the asset, enabling operators to locate and characterise deterioration or damage and plan repairs. Captured images typically have high overlap between adjacent camera poses, sufficient detail at millimetre scale, and challenging visual appearances such as reflections and transparency. However, existing 3D reconstruction datasets lack examples of these conditions, making it difficult to benchmark methods for this task. We present a new dataset with ground truth depth maps, camera poses, and mesh models of three synthetic scenes with simulated inspection trajectories and varying levels of surface condition on non-Lambertian scene content. We evaluate state-of-the-art reconstruction methods on this dataset. Our results demonstrate that current approaches struggle significantly with the dense capture trajectories and complex surface conditions inherent to this domain, exposing a critical scalability gap and pointing toward new research directions for deployable 3D reconstruction in asset inspection. Project page: https://roboticimaging.org/Projects/asset-inspection-dataset/
Chinese Translation
资产管理需要准确的三维模型,以指导建筑物、海洋船舶及其他关键结构在老化过程中的维护、修理和评估。这些下游应用依赖于从靠近资产的航空测量中生成的高保真模型,使操作人员能够定位和表征劣化或损坏情况并规划修复。捕获的图像通常在相邻相机位置之间具有较高的重叠度,能够提供毫米级的细节,并且具有反射和透明度等挑战性的视觉特征。然而,现有的三维重建数据集缺乏这些条件的示例,使得对该任务的方法进行基准测试变得困难。我们提出了一个新的数据集,包含真实深度图、相机位姿和三个合成场景的网格模型,这些场景具有模拟的检查轨迹和不同程度的非朗伯场景内容表面条件。我们在该数据集上评估了最先进的重建方法。我们的结果表明,当前的方法在处理该领域固有的密集捕获轨迹和复杂表面条件时面临显著挑战,暴露了一个关键的可扩展性缺口,并指向可部署的资产检查三维重建的新研究方向。项目页面:https://roboticimaging.org/Projects/asset-inspection-dataset/
cs.CV / 58 / 2603.17360
MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning for Composed Image Retrieval
MCoT-MVS:基于多模态链式思维推理的多层次视觉选择用于复合图像检索
Abstract
Composed Image Retrieval (CIR) aims to retrieve target images based on a reference image and modified texts. However, existing methods often struggle to extract the correct semantic cues from the reference image that best reflect the user's intent under textual modification prompts, resulting in interference from irrelevant visual noise. In this paper, we propose a novel Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning (MCoT-MVS) for CIR, integrating attention-aware multi-level vision features guided by reasoning cues from a multi-modal large language model (MLLM). Specifically, we leverage an MLLM to perform chain-of-thought reasoning on the multimodal composed input, generating the retained, removed, and target-inferred texts. These textual cues subsequently guide two reference visual attention selection modules to selectively extract discriminative patch-level and instance-level semantics from the reference image. Finally, to effectively fuse these multi-granular visual cues with the modified text and the imagined target description, we design a weighted hierarchical combination module to align the composed query with target images in a unified embedding space. Extensive experiments on two CIR benchmarks, namely CIRR and FashionIQ, demonstrate that our approach consistently outperforms existing methods and achieves new state-of-the-art performance. Code and trained models are publicly released.
Chinese Translation
复合图像检索(CIR)旨在基于参考图像和修改文本检索目标图像。然而,现有方法往往难以在文本修改提示下从参考图像中提取出最能反映用户意图的语义线索,从而受到无关视觉噪声的干扰。本文提出了一种新颖的多层次视觉选择方法(MCoT-MVS),通过多模态链式思维推理用于CIR,整合了由多模态大型语言模型(MLLM)引导的注意力感知多层次视觉特征。具体而言,我们利用MLLM对多模态复合输入进行链式思维推理,生成保留、移除和目标推断文本。这些文本线索随后引导两个参考视觉注意选择模块,从参考图像中选择性地提取判别性补丁级和实例级语义。最后,为了有效地将这些多粒度视觉线索与修改文本和想象的目标描述融合,我们设计了一个加权层次组合模块,以在统一的嵌入空间中对齐复合查询与目标图像。在两个CIR基准(CIRR和FashionIQ)上的广泛实验表明,我们的方法始终优于现有方法,并取得了新的最先进性能。代码和训练模型已公开发布。
cs.CV / 59 / 2603.17370
Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes
材料魔法棒:无纹理网格中3D部件的材料感知分组
Abstract
We introduce the problem of material-aware part grouping in untextured meshes. Many real-world shapes, such as scales of pinecones or windows of buildings, contain repeated structures that share the same material but exhibit geometric variations. When assigning materials to such meshes, these repeated parts often require piece-by-piece manual identification and selection, which is tedious and time-consuming. To address this, we propose Material Magic Wand, a tool that allows artists to select part groups based on their estimated material properties -- when one part is selected, our algorithm automatically retrieves all other parts likely to share the same material. The key component of our approach is a part encoder that generates a material-aware embedding for each 3D part, accounting for both local geometry and global context. We train our model with a supervised contrastive loss that brings embeddings of material-consistent parts closer while separating those of different materials; therefore, part grouping can be achieved by retrieving embeddings that are close to the embedding of the selected part. To benchmark this task, we introduce a curated dataset of 100 shapes with 241 part-level queries. We verify the effectiveness of our method through extensive experiments and demonstrate its practical value in an interactive material assignment application.
Chinese Translation
我们提出了在无纹理网格中进行材料感知部件分组的问题。许多现实世界的形状,例如松果的鳞片或建筑物的窗户,包含共享相同材料但具有几何变异的重复结构。在为这些网格分配材料时,这些重复的部件通常需要逐个手动识别和选择,这既繁琐又耗时。为了解决这个问题,我们提出了材料魔法棒(Material Magic Wand),这是一个允许艺术家根据估计的材料属性选择部件组的工具——当选择一个部件时,我们的算法会自动检索所有可能共享相同材料的其他部件。我们方法的关键组成部分是一个部件编码器,它为每个3D部件生成一个材料感知的嵌入,考虑了局部几何和全局上下文。我们使用监督对比损失训练我们的模型,使得材料一致的部件的嵌入更接近,同时将不同材料的嵌入分开;因此,部件分组可以通过检索与所选部件的嵌入接近的嵌入来实现。为了对这一任务进行基准测试,我们引入了一个包含100个形状和241个部件级查询的精心策划的数据集。我们通过广泛的实验验证了我们方法的有效性,并展示了其在交互式材料分配应用中的实际价值。
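The retrieval step the abstract describes can be sketched in a few lines: given per-part embeddings from the part encoder, grouping reduces to retrieving embeddings close to the selected part's. The `retrieve_material_group` helper and the cosine-similarity threshold below are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def retrieve_material_group(embeddings, query_idx, threshold=0.8):
    """Return indices of parts whose embedding cosine similarity to the
    selected part exceeds a threshold (hypothetical grouping rule)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e[query_idx]  # cosine similarity to the selected part
    return [i for i, s in enumerate(sims) if s >= threshold and i != query_idx]

# Toy embeddings: parts 0 and 2 point in a similar "material" direction.
parts = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
group = retrieve_material_group(parts, query_idx=0, threshold=0.8)
```

Selecting part 0 retrieves part 2 and skips part 1, mirroring how the Magic Wand expands one click into a material-consistent group.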
cs.CV / 60 / 2603.17372
Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift
通过与越狱相关的表征转变理解和防御视觉语言模型的越狱行为
Abstract
Large vision-language models (VLMs) often exhibit weakened safety alignment with the integration of the visual modality. Even when text prompts contain explicit harmful intent, adding an image can substantially increase jailbreak success rates. In this paper, we observe that VLMs can clearly distinguish benign inputs from harmful ones in their representation space. Moreover, even among harmful inputs, jailbreak samples form a distinct internal state that is separable from refusal samples. These observations suggest that jailbreaks do not arise from a failure to recognize harmful intent. Instead, the visual modality shifts representations toward a specific jailbreak state, thereby leading to a failure to trigger refusal. To quantify this transition, we identify a jailbreak direction and define the jailbreak-related shift as the component of the image-induced representation shift along this direction. Our analysis shows that the jailbreak-related shift reliably characterizes jailbreak behavior, providing a unified explanation for diverse jailbreak scenarios. Finally, we propose a defense method that enhances VLM safety by removing the jailbreak-related shift (JRS-Rem) at inference time. Experiments show that JRS-Rem provides strong defense across multiple scenarios while preserving performance on benign tasks.
Chinese Translation
大型视觉语言模型(VLMs)在整合视觉模态后,往往表现出安全对齐的减弱。即使文本提示包含明确的有害意图,添加图像也可以显著提高越狱成功率。在本文中,我们观察到VLMs能够在其表征空间中清晰地区分良性输入和有害输入。此外,即使在有害输入中,越狱样本形成了一个与拒绝样本可分离的独特内部状态。这些观察结果表明,越狱并不是由于未能识别有害意图而产生的。相反,视觉模态将表征转向特定的越狱状态,从而导致未能触发拒绝。为了量化这一转变,我们识别出一个越狱方向,并将与越狱相关的转变定义为沿该方向的图像诱导表征转变的组成部分。我们的分析表明,与越狱相关的转变可靠地表征了越狱行为,为多样的越狱场景提供了统一的解释。最后,我们提出了一种防御方法,通过在推理时去除与越狱相关的转变(JRS-Rem)来增强VLM的安全性。实验表明,JRS-Rem在多个场景中提供了强有力的防御,同时保持了良性任务的性能。
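The defense step has a simple geometric core: once a jailbreak direction is identified, the jailbreak-related shift is the component of the representation along that direction, and JRS-Rem removes it at inference time. The vectors below are toy values; the paper estimates the direction from the VLM's internal representations.

```python
import numpy as np

def remove_jailbreak_shift(h, d):
    """Project hidden state h onto unit direction d and subtract that
    component -- a sketch of inference-time jailbreak-shift removal."""
    d = d / np.linalg.norm(d)
    return h - (h @ d) * d

h = np.array([2.0, 3.0, 0.0])   # toy representation with a shift along d
d = np.array([1.0, 0.0, 0.0])   # toy jailbreak direction
h_clean = remove_jailbreak_shift(h, d)
```

After removal, the cleaned representation has zero component along the jailbreak direction while the orthogonal (benign) content is untouched.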
cs.CV / 61 / 2603.17374
Shot-Aware Frame Sampling for Video Understanding
基于镜头意识的视频帧采样用于视频理解
Abstract
Video frame sampling is essential for efficient long-video understanding with Vision-Language Models (VLMs), since dense inputs are costly and often exceed context limits. Yet when only a small number of frames can be retained, existing samplers often fail to balance broad video coverage with brief but critical events, which can lead to unreliable downstream predictions. To address this issue, we present InfoShot, a task-agnostic, shot-aware frame sampler for long-video understanding. InfoShot first partitions a video into semantically consistent shots, and then selects two complementary keyframes from each shot: one to represent the main content and one to capture unusual within-shot changes. This design is guided by an information-theoretic objective that encourages the sampled set to retain high information about both shot structure and sparse within-shot deviations. In this way, it improves the chance of preserving both overall video context and short decision-critical moments without requiring any retraining. To better evaluate such short-lived events, we further introduce SynFlash, a synthetic benchmark with controllable sub-second anomaly patterns and frame-level ground truth, and we also evaluate InfoShot on existing anomaly datasets and general video understanding tasks. Experiments show that InfoShot improves anomaly hit rate and downstream Video-QA accuracy under frame number constraints, while matching or outperforming strong baselines on standard video understanding benchmarks.
Chinese Translation
视频帧采样对于使用视觉语言模型(Vision-Language Models, VLMs)进行高效的长视频理解至关重要,因为密集输入的成本高且往往超出上下文限制。然而,当只能保留少量帧时,现有的采样方法通常无法在广泛的视频覆盖和简短但关键的事件之间取得平衡,这可能导致下游预测的不可靠性。为了解决这个问题,我们提出了InfoShot,一种任务无关的、基于镜头意识的长视频理解帧采样器。InfoShot首先将视频划分为语义一致的镜头,然后从每个镜头中选择两个互补的关键帧:一个用于表示主要内容,另一个用于捕捉镜头内的不寻常变化。该设计受到信息论目标的指导,鼓励采样集保留关于镜头结构和稀疏镜头内偏差的高信息量。通过这种方式,它提高了保留整体视频上下文和短暂决策关键时刻的机会,而无需任何重新训练。为了更好地评估这些短暂事件,我们进一步引入了SynFlash,一个具有可控亚秒异常模式和帧级真实标签的合成基准,并在现有的异常数据集和一般视频理解任务上评估InfoShot。实验表明,在帧数限制下,InfoShot提高了异常命中率和下游视频问答(Video-QA)准确性,同时在标准视频理解基准上与强基线相匹配或超越。
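A toy version of the per-shot selection: with per-frame feature vectors for one shot, pick the frame nearest the shot centroid as the main-content keyframe and the frame farthest from it as the within-shot deviation. This is a simplification of InfoShot's information-theoretic objective, not the actual method.

```python
import numpy as np

def select_shot_keyframes(feats):
    """Pick two complementary keyframes from one shot: closest to the shot
    centroid (main content) and farthest from it (unusual change)."""
    centroid = feats.mean(axis=0)
    dists = np.linalg.norm(feats - centroid, axis=1)
    return int(dists.argmin()), int(dists.argmax())

# Toy shot: frames 0-2 share one appearance, frame 3 is a brief deviation.
shot = np.array([[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 5.0]])
main_idx, anomaly_idx = select_shot_keyframes(shot)
```

The deviating frame 3 is retained alongside a representative frame, which is how short decision-critical moments survive a tight frame budget.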
cs.CV / 62 / 2603.17375
Stereo World Model: Camera-Guided Stereo Video Generation
立体世界模型:相机引导的立体视频生成
Abstract
We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, achieving more than 3x faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.
Chinese Translation
我们提出了StereoWorld,一种相机条件下的立体世界模型,它联合学习外观和双目几何,用于端到端的立体视频生成。与单目RGB或RGBD方法不同,StereoWorld仅在RGB模态下操作,同时直接从视差中获取几何信息。为了高效实现一致的立体生成,我们的方法引入了两个关键设计:(1)统一的相机帧旋转位置编码(RoPE),通过相机感知的旋转位置编码增强潜在标记,使得在保持预训练视频先验的同时,实现相对、视角和时间一致的条件;(2)立体感知注意力分解,将完整的4D注意力分解为3D视内注意力和水平行注意力,利用极线先验以显著更低的计算成本捕捉与视差对齐的对应关系。在基准测试中,StereoWorld在立体一致性、视差准确性和相机运动保真度方面优于强大的单目转化管道,实现了超过3倍的生成速度提升,并在视角一致性上获得额外5%的提升。除了基准测试,StereoWorld还支持端到端的双目虚拟现实渲染,无需深度估计或修补,通过度量尺度深度基础增强具身策略学习,并与长视频蒸馏兼容,以实现扩展的交互式立体合成。
cs.CV / 63 / 2603.17382
VisionNVS: Self-Supervised Inpainting for Novel View Synthesis under the Virtual-Shift Paradigm
VisionNVS:在虚拟位移范式下的自监督图像修复用于新视角合成
Abstract
A fundamental bottleneck in Novel View Synthesis (NVS) for autonomous driving is the inherent supervision gap on novel trajectories: models are tasked with synthesizing unseen views during inference, yet lack ground truth images for these shifted poses during training. In this paper, we propose VisionNVS, a camera-only framework that fundamentally reformulates view synthesis from an ill-posed extrapolation problem into a self-supervised inpainting task. By introducing a "Virtual-Shift" strategy, we use monocular depth proxies to simulate occlusion patterns and map them onto the original view. This paradigm shift allows the use of raw, recorded images as pixel-perfect supervision, effectively eliminating the domain gap inherent in previous approaches. Furthermore, we address spatial consistency through a Pseudo-3D Seam Synthesis strategy, which integrates visual data from adjacent cameras during training to explicitly model real-world photometric discrepancies and calibration errors. Experiments demonstrate that VisionNVS achieves superior geometric fidelity and visual quality compared to LiDAR-dependent baselines, offering a robust solution for scalable driving simulation.
Chinese Translation
在自动驾驶的新视角合成(NVS)中,一个根本性的瓶颈是对新轨迹的固有监督缺口:模型在推理过程中需要合成未见过的视图,但在训练时缺乏这些位移姿态的真实图像。本文提出了VisionNVS,一个仅依赖相机的框架,根本上将视图合成从一个病态外推问题重新构造为一个自监督图像修复任务。通过引入“虚拟位移”(Virtual-Shift)策略,我们利用单目深度代理模拟遮挡模式,并将其映射到原始视图上。这种范式转变使得原始录制图像可以作为像素级的完美监督,有效消除了以往方法中固有的领域差距。此外,我们通过伪3D接缝合成(Pseudo-3D Seam Synthesis)策略解决空间一致性问题,该策略在训练过程中整合来自相邻相机的视觉数据,以明确建模现实世界中的光度差异和校准误差。实验表明,VisionNVS在几何保真度和视觉质量上优于依赖LiDAR的基线,提供了一种可扩展的驾驶模拟的强健解决方案。
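The Virtual-Shift idea can be illustrated on a single scanline: forward-warp pixels by a depth-derived disparity and mark destination positions that no source pixel reaches; those holes are exactly where inpainting supervision applies. This is a 1D toy with a hypothetical `virtual_shift_hole_mask` helper; the paper uses full monocular depth proxies and 2D warping.

```python
import numpy as np

def virtual_shift_hole_mask(depth_row, baseline=1.0):
    """Simulate which pixels of a 1D scanline become holes after a small
    horizontal virtual camera shift, using disparity = baseline / depth."""
    w = len(depth_row)
    disparity = baseline / depth_row
    hit = np.zeros(w, dtype=bool)
    for x in range(w):
        tx = int(round(x + disparity[x]))  # forward-warp each source pixel
        if 0 <= tx < w:
            hit[tx] = True
    return ~hit  # True where no source pixel lands: inpainting targets

# A near object (small depth -> large shift) leaves a hole behind it.
depth = np.array([10.0, 10.0, 1.0, 10.0, 10.0])
mask = virtual_shift_hole_mask(depth, baseline=2.0)
```

The recorded image at the unshifted pose then supervises exactly the masked pixels, which is what turns extrapolation into inpainting.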
cs.CV / 64 / 2603.17388
Toward Phonology-Guided Sign Language Motion Generation: A Diffusion Baseline and Conditioning Analysis
朝着音位引导的手语动作生成:扩散基线与条件分析
Abstract
Generating natural, correct, and visually smooth 3D avatar sign language motion conditioned on the text inputs continues to be very challenging. In this work, we train a generative model of 3D body motion and explore the role of phonological attribute conditioning for sign language motion generation, using ASL-LEX 2.0 annotations such as hand shape, hand location and movement. We first establish a strong diffusion baseline using a Human Motion MDM-style diffusion model with SMPL-X representation, which outperforms SignAvatar, a state-of-the-art CVAE method, on gloss discriminability metrics. We then systematically study the role of text conditioning using different text encoders (CLIP vs. T5), conditioning modes (gloss-only vs. gloss+phonological attributes), and attribute notation format (symbolic vs. natural language). Our analysis reveals that translating symbolic ASL-LEX notations to natural language is a necessary condition for effective CLIP-based attribute conditioning, while T5 is largely unaffected by this translation. Furthermore, our best-performing variant (CLIP with mapped attributes) outperforms SignAvatar across all metrics. These findings highlight input representation as a critical factor for text-encoder-based attribute conditioning, and motivate structured conditioning approaches where gloss and phonological attributes are encoded through independent pathways.
Chinese Translation
基于文本输入生成自然、正确且视觉流畅的3D虚拟形象手语动作仍然面临很大挑战。在本研究中,我们训练了一个3D身体动作的生成模型,并探讨了音位属性条件在手语动作生成中的作用,使用了ASL-LEX 2.0注释,如手形、手的位置和运动。我们首先建立了一个强大的扩散基线,采用了Human Motion MDM风格的扩散模型,并使用SMPL-X表示,结果在手语词汇辨别性指标上优于最先进的CVAE方法SignAvatar。随后,我们系统地研究了使用不同文本编码器(CLIP与T5)、条件模式(仅词汇与词汇+音位属性)和属性标注格式(符号与自然语言)进行文本条件的作用。我们的分析表明,将符号ASL-LEX标注翻译为自然语言是有效的基于CLIP的属性条件的必要条件,而T5对这种翻译的影响较小。此外,我们表现最佳的变体(使用映射属性的CLIP)在所有指标上均优于SignAvatar。这些发现突显了输入表示作为基于文本编码器的属性条件的关键因素,并激励了结构化条件方法,其中词汇和音位属性通过独立路径进行编码。
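The finding that symbolic ASL-LEX notation must be translated to natural language for CLIP conditioning can be illustrated with a simple mapping step. The code-to-phrase table below is invented for illustration and is not the authors' actual mapping.

```python
def attributes_to_natural_language(attrs):
    """Translate symbolic ASL-LEX-style attribute codes into a natural-language
    conditioning string (illustrative table, hypothetical codes)."""
    tables = {
        "handshape": {"1": "an extended index finger", "5": "an open flat hand"},
        "location": {"head": "near the head", "neutral": "in neutral space"},
        "movement": {"circ": "a circular movement", "straight": "a straight movement"},
    }
    parts = [tables[k][v] for k, v in attrs.items() if v in tables.get(k, {})]
    return "a sign with " + ", ".join(parts)

desc = attributes_to_natural_language(
    {"handshape": "5", "location": "neutral", "movement": "circ"})
```

Feeding `desc` rather than the raw codes to a CLIP text encoder is the kind of input-representation change the analysis identifies as necessary.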
cs.CV / 65 / 2603.17390
Harnessing the Power of Foundation Models for Accurate Material Classification
利用基础模型的力量实现准确的材料分类
Abstract
Material classification has emerged as a critical task in computer vision and graphics, supporting the assignment of accurate material properties to a wide range of digital and real-world applications. While traditionally framed as an image classification task, this domain faces significant challenges due to the scarcity of annotated data, limiting the accuracy and generalizability of trained models. Recent advances in vision-language foundation models (VLMs) offer promising avenues to address these issues, yet existing solutions leveraging these models still exhibit unsatisfactory results in material recognition tasks. In this work, we propose a novel framework that effectively harnesses foundation models to overcome data limitations and enhance classification accuracy. Our method integrates two key innovations: (a) a robust image generation and auto-labeling pipeline that creates a diverse and high-quality training dataset with material-centric images, and automatically assigns labels by fusing object semantics and material attributes in text prompts; (b) a prior incorporation strategy to distill information from VLMs, combined with a joint fine-tuning method that optimizes a pre-trained vision foundation model alongside VLM-derived priors, preserving broad generalizability while adapting to material-specific features. Extensive experiments demonstrate significant improvements on multiple datasets. We show that our synthetic dataset effectively captures the characteristics of real world materials, and the integration of priors from vision-language models significantly enhances the final performance. The source code and dataset will be released.
Chinese Translation
材料分类已成为计算机视觉和图形学中的一项关键任务,支持将准确的材料属性分配给广泛的数字和现实世界应用。尽管传统上将其视为图像分类任务,但该领域面临着由于标注数据稀缺而导致的重大挑战,这限制了训练模型的准确性和泛化能力。最近在视觉-语言基础模型(VLMs)方面的进展为解决这些问题提供了有希望的途径,但现有利用这些模型的解决方案在材料识别任务中仍然表现不尽如人意。在本研究中,我们提出了一种新颖的框架,有效利用基础模型克服数据限制并提高分类准确性。我们的方法集成了两个关键创新:(a)一个强大的图像生成和自动标注管道,创建一个多样化且高质量的训练数据集,其中包含以材料为中心的图像,并通过融合对象语义和材料属性的文本提示自动分配标签;(b)一种先验信息整合策略,从VLMs中提取信息,结合联合微调方法,优化一个预训练的视觉基础模型以及VLM衍生的先验,保持广泛的泛化能力,同时适应特定材料的特征。大量实验表明,在多个数据集上显著提高了性能。我们展示了我们的合成数据集有效捕捉了现实世界材料的特征,并且从视觉-语言模型中整合的先验显著增强了最终性能。源代码和数据集将被发布。
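The auto-labeling idea of fusing object semantics and material attributes into text prompts might look like the following; the template wording and label rule are assumptions, since the abstract does not give the exact prompt format.

```python
def build_material_prompt(obj, material, attributes):
    """Fuse object semantics and material attributes into a single text prompt
    for generation and auto-labeling (illustrative template)."""
    attr_text = ", ".join(attributes)
    return f"a close-up photo of a {obj} made of {material}, {attr_text}"

prompt = build_material_prompt("kettle", "brushed steel",
                               ["matte", "slightly scratched"])
label = "brushed steel"  # the material term doubles as the auto-assigned label
```

Because the material name appears in the prompt that generated the image, the label comes for free, which is what lets the pipeline scale without manual annotation.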
cs.CV / 66 / 2603.17396
Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation
基于手势感知的预训练与标记融合用于三维手部姿态估计
Abstract
Estimating 3D hand pose from monocular RGB images is fundamental for applications in AR/VR, human-computer interaction, and sign language understanding. In this work we focus on a scenario where a discrete set of gesture labels is available and show that gesture semantics can serve as a powerful inductive bias for 3D pose estimation. We present a two-stage framework: gesture-aware pretraining that learns an informative embedding space using coarse and fine gesture labels from InterHand2.6M, followed by a per-joint token Transformer guided by gesture embeddings as intermediate representations for final regression of MANO hand parameters. Training is driven by a layered objective over parameters, joints, and structural constraints. Experiments on InterHand2.6M demonstrate that gesture-aware pretraining consistently improves single-hand accuracy over the state-of-the-art EANet baseline, and that the benefit transfers across architectures without any modification.
Chinese Translation
从单目RGB图像中估计三维手部姿态对于增强现实/虚拟现实(AR/VR)、人机交互和手语理解等应用至关重要。在本研究中,我们关注于一个可用离散手势标签的场景,并展示手势语义可以作为三维姿态估计的强大归纳偏置。我们提出了一个两阶段框架:手势感知预训练,通过使用来自InterHand2.6M的粗略和精细手势标签学习一个信息丰富的嵌入空间,随后是一个由手势嵌入指导的每个关节标记Transformer,作为最终回归MANO手参数的中间表示。训练通过对参数、关节和结构约束的分层目标驱动。在InterHand2.6M上的实验表明,手势感知预训练在单手准确性上始终优于最先进的EANet基线,并且这一优势在不同架构之间无须任何修改即可转移。
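Gesture-aware pretraining relies on a supervised contrastive loss over gesture labels; a minimal numpy version (single label level, without the paper's coarse/fine hierarchy) is sketched below to show the pull-together/push-apart structure.

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss over L2-normalized embeddings z:
    same-label embeddings are pulled together, different labels pushed apart."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    n = len(z)
    loss, count = 0.0, 0
    for i in range(n):
        mask = np.arange(n) != i
        log_den = np.log(np.exp(sim[i][mask]).sum())
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue
        loss += -np.mean([sim[i, j] - log_den for j in pos])
        count += 1
    return loss / count

# Embeddings already clustered by label incur low loss; mismatched labels don't.
z = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [-0.02, 0.98]])
loss_good = supcon_loss(z, labels=[0, 0, 1, 1])
loss_bad = supcon_loss(z, labels=[0, 1, 0, 1])
```

The gap between `loss_good` and `loss_bad` is the gradient signal that shapes the gesture-aware embedding space during pretraining.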
cs.CV / 67 / 2603.17398
Motion-Adaptive Temporal Attention for Lightweight Video Generation with Stable Diffusion
基于稳定扩散的运动自适应时间注意力轻量级视频生成
Abstract
We present a motion-adaptive temporal attention mechanism for parameter-efficient video generation built upon frozen Stable Diffusion models. Rather than treating all video content uniformly, our method dynamically adjusts temporal attention receptive fields based on estimated motion content: high-motion sequences attend locally across frames to preserve rapidly changing details, while low-motion sequences attend globally to enforce scene consistency. We inject lightweight temporal attention modules into all UNet transformer blocks via a cascaded strategy -- global attention in down-sampling and middle blocks for semantic stabilization, motion-adaptive attention in up-sampling blocks for fine-grained refinement. Combined with temporally correlated noise initialization and motion-aware gating, the system adds only 25.8M trainable parameters (2.9% of the base UNet) while achieving competitive results on WebVid validation when trained on 100K videos. We demonstrate that the standard denoising objective alone provides sufficient implicit temporal regularization, outperforming approaches that add explicit temporal consistency losses. Our ablation studies reveal a clear trade-off between noise correlation and motion amplitude, providing a practical inference-time control for diverse generation behaviors.
Chinese Translation
我们提出了一种基于冻结的稳定扩散模型的运动自适应时间注意力机制,用于参数高效的视频生成。我们的方法并非对所有视频内容进行统一处理,而是根据估计的运动内容动态调整时间注意力感受野:高运动序列在帧间局部关注,以保留快速变化的细节,而低运动序列则全局关注,以增强场景一致性。我们通过级联策略将轻量级时间注意力模块注入所有的 UNet 变换器块——在下采样和中间块中使用全局注意力以实现语义稳定,在上采样块中使用运动自适应注意力以进行细粒度的精细调整。结合时间相关的噪声初始化和运动感知门控,该系统仅增加了 25.8M 可训练参数(占基础 UNet 的 2.9%),在训练 100K 视频后,在 WebVid 验证集上取得了竞争力的结果。我们证明,仅使用标准去噪目标就提供了足够的隐式时间正则化,优于添加显式时间一致性损失的方法。我们的消融研究揭示了噪声相关性与运动幅度之间的明显权衡,为多样化生成行为提供了实用的推理时控制。
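The motion-adaptive rule can be sketched directly: estimate motion from frame differences, then attend locally for high-motion clips and globally for low-motion ones. The threshold and window sizes below are illustrative, not the paper's calibrated values.

```python
import numpy as np

def motion_adaptive_window(frames, low=2, high=None, thresh=0.1):
    """Pick a temporal attention window from estimated motion: high-motion
    clips get a small local window, low-motion clips a global one.
    Motion is estimated as mean absolute frame difference."""
    t = len(frames)
    high = t if high is None else high
    motion = np.mean(np.abs(np.diff(frames, axis=0)))
    return low if motion > thresh else high

static = np.zeros((8, 4, 4))                               # no motion
moving = np.random.default_rng(0).normal(size=(8, 4, 4))   # heavy motion
w_static = motion_adaptive_window(static)
w_moving = motion_adaptive_window(moving)
```

The returned window size would then define the temporal attention mask inside each up-sampling block, which is where the method applies its motion-adaptive refinement.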
cs.CV / 68 / 2603.17408
Joint Degradation-Aware Arbitrary-Scale Super-Resolution for Variable-Rate Extreme Image Compression
联合降解感知任意尺度超分辨率用于可变速率极限图像压缩
Abstract
Recent diffusion-based extreme image compression methods have demonstrated remarkable performance at ultra-low bitrates. However, most approaches require training separate diffusion models for each target bitrate, resulting in substantial computational overhead and hindering practical deployment. Meanwhile, recent studies have shown that joint super-resolution can serve as an effective approach for enhancing low-bitrate reconstruction. However, when moving toward ultra-low bitrate regimes, these methods struggle due to severe information loss, and their reliance on fixed super-resolution scales prevents flexible adaptation across diverse bitrates. To address these limitations, we propose ASSR-EIC, a novel image compression framework that leverages arbitrary-scale super-resolution (ASSR) to support variable-rate extreme image compression (EIC). An arbitrary-scale downsampling module is introduced at the encoder side to provide controllable rate reduction, while a diffusion-based, joint degradation-aware ASSR decoder enables rate-adaptive reconstruction within a single model. We exploit the compression- and rescaling-aware diffusion prior to guide the reconstruction, yielding high fidelity and high realism restoration across diverse compression and rescaling settings. Specifically, we design a global compression-rescaling adaptor that offers holistic guidance for rate adaptation, and a local compression-rescaling modulator that dynamically balances generative and fidelity-oriented behaviors to achieve fine-grained, bitrate-adaptive detail restoration. To further enhance reconstruction quality, we introduce a dual semantic-enhanced design. Extensive experiments demonstrate that ASSR-EIC delivers state-of-the-art performance in extreme image compression while simultaneously supporting flexible bitrate control and adaptive rate-dependent reconstruction.
Chinese Translation
近期基于扩散的极限图像压缩方法在超低比特率下表现出显著的性能。然而,大多数方法需要为每个目标比特率训练单独的扩散模型,这导致了巨大的计算开销,并阻碍了实际应用。同时,近期研究表明,联合超分辨率可以作为增强低比特率重建的有效方法。然而,当向超低比特率范围移动时,这些方法由于严重的信息损失而面临困难,并且它们对固定超分辨率尺度的依赖阻碍了在不同比特率之间的灵活适应。为了解决这些限制,我们提出了ASSR-EIC,这是一种新颖的图像压缩框架,利用任意尺度超分辨率(ASSR)来支持可变速率极限图像压缩(EIC)。在编码器端引入了一个任意尺度下采样模块,以提供可控的速率降低,而基于扩散的联合降解感知ASSR解码器则使得在单一模型内实现速率自适应重建成为可能。我们利用压缩和重缩放感知的扩散先验来指导重建,从而在不同的压缩和重缩放设置下实现高保真度和高真实感的恢复。具体而言,我们设计了一个全局压缩-重缩放适配器,为速率适应提供整体指导,并设计了一个局部压缩-重缩放调制器,动态平衡生成和保真导向行为,以实现细粒度的比特率自适应细节恢复。为了进一步提升重建质量,我们引入了双重语义增强设计。大量实验表明,ASSR-EIC在极限图像压缩中提供了最先进的性能,同时支持灵活的比特率控制和自适应的速率依赖重建。
cs.CV / 69 / 2603.17412
Mutually Causal Semantic Distillation Network for Zero-Shot Learning
互因语义蒸馏网络用于零样本学习
Abstract
Zero-shot learning (ZSL) aims to recognize the unseen classes in the open-world guided by the side-information (e.g., attributes). Its key task is how to infer the latent semantic knowledge between visual and attribute features on seen classes, and thus conduct a desirable semantic knowledge transfer from seen classes to unseen ones. Prior works simply utilize unidirectional attention within a weakly-supervised manner to learn the spurious and limited latent semantic representations, which fail to effectively discover the intrinsic semantic knowledge (e.g., attribute semantic) between visual and attribute features. To solve the above challenges, we propose a mutually causal semantic distillation network (termed MSDN++) to distill the intrinsic and sufficient semantic representations for ZSL. MSDN++ consists of an attribute$\rightarrow$visual causal attention sub-net that learns attribute-based visual features, and a visual$\rightarrow$attribute causal attention sub-net that learns visual-based attribute features. The causal attention encourages the two sub-nets to learn causal vision-attribute associations for representing reliable features with causal visual/attribute learning. With the guidance of semantic distillation loss, the two mutual attention sub-nets learn collaboratively and teach each other throughout the training process. Extensive experiments on four widely-used benchmark datasets (CUB, SUN, AWA2, and FLO) show that our MSDN++ yields significant improvements over the strong baselines, leading to new state-of-the-art performances.
Chinese Translation
零样本学习(ZSL)旨在在开放世界中识别未见类别,依赖于侧信息(例如,属性)。其关键任务是如何推断已见类别上视觉特征与属性特征之间的潜在语义知识,从而实现从已见类别到未见类别的理想语义知识转移。之前的研究仅在弱监督方式下利用单向注意力学习虚假和有限的潜在语义表示,未能有效发现视觉特征与属性特征之间的内在语义知识(例如,属性语义)。为了解决上述挑战,我们提出了一种互因语义蒸馏网络(称为 MSDN++),用于提取零样本学习的内在和充分的语义表示。MSDN++ 由一个属性→视觉因果注意力子网络和一个视觉→属性因果注意力子网络组成,前者学习基于属性的视觉特征,后者学习基于视觉的属性特征。因果注意力促使这两个子网络学习因果的视觉-属性关联,以表示具有因果视觉/属性学习的可靠特征。在语义蒸馏损失的指导下,这两个互注意力子网络在整个训练过程中协同学习并相互教授。对四个广泛使用的基准数据集(CUB、SUN、AWA2 和 FLO)进行的广泛实验表明,我们的 MSDN++ 在强基线之上取得了显著的改进,达到了新的最先进性能。
cs.CV / 70 / 2603.17413
Towards Motion-aware Referring Image Segmentation
面向运动感知的引用图像分割
Abstract
Referring Image Segmentation (RIS) requires identifying objects from images based on textual descriptions. We observe that existing methods significantly underperform on motion-related queries compared to appearance-based ones. To address this, we first introduce an efficient data augmentation scheme that extracts motion-centric phrases from original captions, exposing models to more motion expressions without additional annotations. Second, since the same object can be described differently depending on the context, we propose Multimodal Radial Contrastive Learning (MRaCL), performed on fused image-text embeddings rather than unimodal representations. For comprehensive evaluation, we introduce a new test split focusing on motion-centric queries, and introduce a new benchmark called M-Bench, where objects are distinguished primarily by actions. Extensive experiments show our method substantially improves performance on motion-centric queries across multiple RIS models, maintaining competitive results on appearance-based descriptions. Codes are available at https://github.com/snuviplab/MRaCL
Chinese Translation
引用图像分割(Referring Image Segmentation, RIS)需要根据文本描述从图像中识别对象。我们观察到,现有方法在与运动相关的查询上表现显著低于基于外观的查询。为了解决这个问题,我们首先提出了一种高效的数据增强方案,该方案从原始标题中提取运动中心短语,使模型在没有额外注释的情况下接触到更多的运动表达。其次,由于同一对象在不同上下文中可能有不同的描述,我们提出了多模态径向对比学习(Multimodal Radial Contrastive Learning, MRaCL),该方法在融合的图像-文本嵌入上进行,而不是在单一模态表示上进行。为了进行全面评估,我们引入了一个新的测试集,专注于运动中心的查询,并推出了一个新的基准测试,称为 M-Bench,其中对象主要通过动作进行区分。大量实验表明,我们的方法在多个 RIS 模型中显著提高了运动中心查询的性能,同时在基于外观的描述上保持了竞争力的结果。代码可在 https://github.com/snuviplab/MRaCL 获取。
cs.CV / 71 / 2603.17426
SHIFT: Motion Alignment in Video Diffusion Models with Adversarial Hybrid Fine-Tuning
SHIFT:在视频扩散模型中进行运动对齐的对抗混合微调
Abstract
Image-conditioned video diffusion models achieve impressive visual realism but often suffer from weakened motion fidelity, e.g., reduced motion dynamics or degraded long-term temporal coherence, especially after fine-tuning. We study the problem of motion alignment in video diffusion models post-training. To address this, we introduce pixel-motion rewards based on pixel flux dynamics, capturing both instantaneous and long-term motion consistency. We further propose Smooth Hybrid Fine-tuning (SHIFT), a scalable reward-driven fine-tuning framework for video diffusion models. SHIFT fuses the normal supervised fine-tuning and advantage weighted fine-tuning into a unified framework. Benefiting from novel adversarial advantages, SHIFT improves convergence speed and mitigates reward hacking. Experiments show that our approach efficiently resolves dynamic-degree collapse in supervised fine-tuning of modern video diffusion models.
Chinese Translation
基于图像条件的视频扩散模型在视觉真实感方面表现出色,但通常会遭遇运动保真度降低的问题,例如运动动态减弱或长期时间一致性下降,尤其是在微调后。我们研究了视频扩散模型训练后运动对齐的问题。为了解决这一问题,我们引入了基于像素流动动态的像素运动奖励,以捕捉瞬时和长期的运动一致性。我们进一步提出了平滑混合微调(Smooth Hybrid Fine-tuning,SHIFT),这是一种可扩展的基于奖励的微调框架,用于视频扩散模型。SHIFT将常规的监督微调和优势加权微调融合为一个统一的框架。得益于新颖的对抗优势,SHIFT提高了收敛速度并减轻了奖励破解的问题。实验表明,我们的方法有效解决了现代视频扩散模型监督微调中的动态度崩溃问题。
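A toy stand-in for the pixel-motion rewards: combine an instantaneous pixel-flux term (consecutive-frame change) with a long-term term (first-to-last change). The weights and exact functional form are assumptions; the paper defines its rewards over pixel flux dynamics.

```python
import numpy as np

def pixel_flux_reward(video, w_inst=0.5, w_long=0.5):
    """Toy motion reward: instantaneous pixel flux (mean absolute change
    between consecutive frames) plus a long-term first-to-last term."""
    inst = np.mean(np.abs(np.diff(video, axis=0)))
    long_term = np.mean(np.abs(video[-1] - video[0]))
    return w_inst * inst + w_long * long_term

still = np.zeros((6, 8, 8))                                   # collapsed dynamics
panning = np.stack([np.full((8, 8), t * 0.2) for t in range(6)])  # steady motion
r_still = pixel_flux_reward(still)
r_pan = pixel_flux_reward(panning)
```

A fine-tuned model whose outputs drift toward the `still` regime would receive near-zero reward, which is the dynamic-degree collapse this reward is designed to penalize.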
cs.CV / 72 / 2603.17427
ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation
ECHO:朝着情感适宜和上下文感知的互动头部生成
Abstract
In natural face-to-face interaction, participants seamlessly alternate between speaking and listening, producing facial behaviors (FBs) that are finely informed by long-range context and naturally exhibit contextual appropriateness and emotional rationality. Interactive Head Generation (IHG) aims to synthesize lifelike avatar head video emulating such capabilities. Existing IHG methods typically condition on dual-track signals (i.e., human user's behaviors and pre-defined audio for avatar) within a short temporal window, jointly driving generation of avatar's audio-aligned lip articulation and non-verbal FBs. However, two main challenges persist in these methods: (i) the reliance on short-clip behavioral cues without long-range contextual modeling leads them to produce facial behaviors lacking contextual appropriateness; and (ii) the entangled, role-agnostic fusion of dual-track signals empirically introduces cross-signal interference, potentially compromising lip-region synchronization during speaking. To this end, we propose ECHO, a novel IHG framework comprising two key components: a Long-range Contextual Understanding (LCU) component that facilitates contextual understanding of both behavior-grounded dynamics and linguistic-driven affective semantics to promote contextual appropriateness and emotional rationality of synthesized avatar FBs; and a block-wise Spatial-aware Decoupled Cross-attention Modulation (SDCM) module, that preserves self-audio-driven lip articulation while adaptively integrating user contextual behavioral cues for non-lip facial regions, complemented by our designed two-stage training paradigm, to jointly enhance lip synchronization and visual fidelity. Extensive experiments demonstrate the effectiveness of proposed components and ECHO's superior IHG performance.
Chinese Translation
在自然的面对面互动中,参与者无缝地交替进行说话和倾听,产生的面部行为(FBs)受到长程上下文的细致影响,自然展现出上下文适宜性和情感理性。互动头部生成(IHG)旨在合成逼真的头像视频,以模拟这种能力。现有的IHG方法通常在短时间窗口内依赖于双轨信号(即人类用户的行为和为头像预定义的音频),共同驱动头像的音频对齐的唇部发音和非语言FBs的生成。然而,这些方法仍面临两个主要挑战:(i)依赖于短片段行为线索而缺乏长程上下文建模,导致生成的面部行为缺乏上下文适宜性;(ii)双轨信号的纠缠和角色无关的融合在经验上引入了信号间干扰,可能影响说话时唇部区域的同步。为此,我们提出了ECHO,一个新颖的IHG框架,包含两个关键组件:一个长程上下文理解(LCU)组件,促进对行为驱动动态和语言驱动情感语义的上下文理解,以提升合成头像FBs的上下文适宜性和情感理性;以及一个块级空间感知解耦交叉注意调制(SDCM)模块,保持自音频驱动的唇部发音,同时自适应地整合用户上下文行为线索用于非唇部面部区域,辅以我们设计的两阶段训练范式,以共同增强唇部同步和视觉真实感。大量实验表明了所提组件的有效性以及ECHO在IHG性能上的优越性。
cs.CV / 73 / 2603.17441
AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement
AdaZoom-GUI:基于自适应缩放的图形用户界面定位与指令优化
Abstract
GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots remains challenging due to high-resolution images, small UI elements, and ambiguous user instructions. In this work, we propose AdaZoom-GUI, an adaptive zoom-based GUI grounding framework that improves both localization accuracy and instruction understanding. Our approach introduces an instruction refinement module that rewrites natural language commands into explicit and detailed descriptions, allowing the grounding model to focus on precise element localization. In addition, we design a conditional zoom-in strategy that selectively performs a second-stage inference on predicted small elements, improving localization accuracy while avoiding unnecessary computation and context loss on simpler cases. To support this framework, we construct a high-quality GUI grounding dataset and train the grounding model using Group Relative Policy Optimization (GRPO), enabling the model to predict both click coordinates and element bounding boxes. Experiments on public benchmarks demonstrate that our method achieves state-of-the-art performance among models with comparable or even larger parameter sizes, highlighting its effectiveness for high-resolution GUI understanding and practical GUI agent deployment.
Chinese Translation
图形用户界面(GUI)定位是视觉语言模型(VLMs)的一项关键能力,它通过从自然语言指令中定位目标元素,实现与图形用户界面的自动化交互。然而,由于高分辨率图像、小型用户界面元素和模糊的用户指令,基于GUI截图的定位仍然具有挑战性。在本研究中,我们提出了AdaZoom-GUI,一种基于自适应缩放的GUI定位框架,旨在提高定位准确性和指令理解能力。我们的方法引入了一个指令优化模块,将自然语言命令重写为明确且详细的描述,使得定位模型能够专注于精确的元素定位。此外,我们设计了一种条件缩放策略,选择性地对预测的小元素进行第二阶段推理,从而提高定位准确性,同时避免在简单案例中不必要的计算和上下文丢失。为了支持该框架,我们构建了一个高质量的GUI定位数据集,并使用组相对策略优化(Group Relative Policy Optimization, GRPO)训练定位模型,使模型能够预测点击坐标和元素边界框。在公共基准测试上的实验表明,我们的方法在与可比或更大参数规模的模型中实现了最先进的性能,突显了其在高分辨率GUI理解和实际GUI代理部署中的有效性。
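The conditional zoom-in strategy can be sketched as a two-pass loop around a grounding model: accept the first prediction for large elements, and re-infer on a crop for small ones. `model` here is a hypothetical predictor returning relative (x, y, w, h) coordinates; actual image cropping is stubbed out.

```python
def ground_with_conditional_zoom(model, image, instruction, min_rel_size=0.02):
    """Run grounding once; if the predicted element is small relative to the
    screenshot, crop around it and re-infer for a refined coordinate."""
    x, y, w, h = model(image, instruction)
    if w * h >= min_rel_size:
        return x, y  # large element: one pass suffices, no extra compute
    # Second stage: zoom into a crop centered near the first prediction.
    cx0, cy0, side = max(0.0, x - 0.1), max(0.0, y - 0.1), 0.2
    crop = ("crop", cx0, cy0, side)  # placeholder for actual image cropping
    rx, ry, _, _ = model(crop, instruction)
    return cx0 + rx * side, cy0 + ry * side  # map back to global coordinates

def stub_model(image, instruction):
    # Tiny element at global (0.5, 0.5); within a crop it is found at its center.
    if isinstance(image, tuple) and image[0] == "crop":
        return 0.5, 0.5, 0.05, 0.05
    return 0.5, 0.5, 0.005, 0.005

px, py = ground_with_conditional_zoom(stub_model, "screenshot", "click save")
```

The crop-local prediction maps back to the same global point, while large elements skip the second pass entirely, matching the abstract's goal of avoiding unnecessary computation.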
cs.CV / 74 / 2603.17455
FACE-net: Factual Calibration and Emotion Augmentation for Retrieval-enhanced Emotional Video Captioning
FACE-net:用于检索增强情感视频字幕的事实校准与情感增强
Abstract
Emotional Video Captioning (EVC) is an emerging task, which aims to describe factual content with the intrinsic emotions expressed in videos. Existing works perceive global emotional cues and then combine them with video content to generate descriptions. However, insufficient mining and coordination of factual and emotional cues during generation make it difficult for these methods to deal with the factual-emotional bias, which refers to the factual and emotional requirements being different in different samples on generation. To this end, we propose a retrieval-enhanced framework with FActual Calibration and Emotion augmentation (FACE-net), which, through a unified architecture, collaboratively mines factual-emotional semantics and provides adaptive and accurate guidance for generation, breaking through the compromising tendency of factual-emotional descriptions in all sample learning. Technically, we first introduce an external repository and retrieve the most relevant sentences with the video content to augment the semantic information. Subsequently, our factual calibration via uncertainty estimation module splits the retrieved information into subject-predicate-object triplets, and self-refines and cross-refines different components through video content to effectively mine the factual semantics; while our progressive visual emotion augmentation module leverages the calibrated factual semantics as experts, interacts with the video content and emotion dictionary to generate visual queries and candidate emotions, and then aggregates them to adaptively augment emotions to each factual semantics. Moreover, to alleviate the factual-emotional bias, we design a dynamic bias adjustment routing module to predict and adjust the degree of bias of a sample.
Chinese Translation
情感视频字幕生成(EVC)是一项新兴任务,旨在描述视频中表达的内在情感与事实内容。现有的研究感知全局情感线索,并将其与视频内容结合以生成描述。然而,在生成过程中,事实和情感线索的挖掘与协调不足,使得这些方法难以处理事实-情感偏差,即在不同样本中事实和情感需求的不同。为此,我们提出了一种基于检索增强的框架,称为事实校准与情感增强(FACE-net),该框架通过统一架构协同挖掘事实-情感语义,并为生成提供自适应和准确的指导,从而突破了在所有样本学习中事实-情感描述的妥协倾向。在技术上,我们首先引入一个外部库,并检索与视频内容最相关的句子以增强语义信息。随后,我们通过不确定性估计模块进行的事实校准将检索到的信息拆分为主语-谓语-宾语三元组,并通过视频内容自我精炼和交叉精炼不同组件,以有效挖掘事实语义;同时,我们的渐进式视觉情感增强模块利用校准后的事实语义作为专家,与视频内容和情感词典交互生成视觉查询和候选情感,然后将其聚合以自适应地增强每个事实语义的情感。此外,为了减轻事实-情感偏差,我们设计了一个动态偏差调整路由模块,以预测和调整样本的偏差程度。
cs.CV / 75 / 2603.17461
AR-CoPO: Align Autoregressive Video Generation with Contrastive Policy Optimization
AR-CoPO:通过对比策略优化对齐自回归视频生成
Abstract
Streaming autoregressive (AR) video generators combined with few-step distillation achieve low-latency, high-quality synthesis, yet remain difficult to align via reinforcement learning from human feedback (RLHF). Existing SDE-based GRPO methods face challenges in this setting: few-step ODEs and consistency model samplers deviate from standard flow-matching ODEs, and their short, low-stochasticity trajectories are highly sensitive to initialization noise, rendering intermediate SDE exploration ineffective. We propose AR-CoPO (AutoRegressive Contrastive Policy Optimization), a framework that adapts the Neighbor GRPO contrastive perspective to streaming AR generation. AR-CoPO introduces chunk-level alignment via a forking mechanism that constructs neighborhood candidates at a randomly selected chunk, assigns sequence-level rewards, and performs localized GRPO updates. We further propose a semi-on-policy training strategy that complements on-policy exploration with exploitation over a replay buffer of reference rollouts, improving generation quality across domains. Experiments on Self-Forcing demonstrate that AR-CoPO improves both out-of-domain generalization and in-domain human preference alignment over the baseline, providing evidence of genuine alignment rather than reward hacking.
Chinese Translation
流式自回归(AR)视频生成器结合少步蒸馏实现了低延迟、高质量的合成,但通过人类反馈的强化学习(RLHF)进行对齐仍然困难。现有的基于随机微分方程(SDE)的GRPO方法在这一设置中面临挑战:少步常微分方程(ODE)和一致性模型采样器偏离了标准的流匹配ODE,其短期、低随机性的轨迹对初始化噪声高度敏感,使得中间SDE探索效果不佳。我们提出了AR-CoPO(自回归对比策略优化),这是一个将邻域GRPO对比视角适应于流式AR生成的框架。AR-CoPO通过分叉机制引入了块级对齐,该机制在随机选择的块处构建邻域候选,分配序列级奖励,并执行局部GRPO更新。我们进一步提出了一种半在线策略训练策略,该策略通过对参考回放的利用补充在线探索,提高了各领域的生成质量。在自我强迫(Self-Forcing)实验中,AR-CoPO在领域外泛化和领域内人类偏好对齐方面均优于基线,提供了真实对齐而非奖励黑客的证据。
cs.CV / 76 / 2603.17470
VirPro: Visual-referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection
VirPro:视觉引用的概率提示学习用于弱监督单目3D检测
Abstract
Monocular 3D object detection typically relies on pseudo-labeling techniques to reduce dependency on real-world annotations. Recent advances demonstrate that deterministic linguistic cues can serve as effective auxiliary weak supervision signals, providing complementary semantic context. However, hand-crafted textual descriptions struggle to capture the inherent visual diversity of individuals across scenes, limiting the model's ability to learn scene-aware representations. To address this challenge, we propose Visual-referred Probabilistic Prompt Learning (VirPro), an adaptive multi-modal pretraining paradigm that can be seamlessly integrated into diverse weakly supervised monocular 3D detection frameworks. Specifically, we generate a diverse set of learnable, instance-conditioned prompts across scenes and store them in an Adaptive Prompt Bank (APB). Subsequently, we introduce Multi-Gaussian Prompt Modeling (MGPM), which incorporates scene-based visual features into the corresponding textual embeddings, allowing the text prompts to express visual uncertainties. Then, from the fused vision-language embeddings, we decode a prompt-targeted Gaussian, from which we derive a unified object-level prompt embedding for each instance. RoI-level contrastive matching is employed to enforce modality alignment, bringing embeddings of co-occurring objects within the same scene closer in the latent space, thus enhancing semantic coherence. Extensive experiments on the KITTI benchmark demonstrate that integrating our pretraining paradigm consistently yields substantial performance gains, achieving up to a 4.8% average precision improvement over the baseline.
Chinese Translation
单目3D物体检测通常依赖伪标注技术以减少对真实世界注释的依赖。最近的进展表明,确定性的语言线索可以作为有效的辅助弱监督信号,提供互补的语义上下文。然而,手工制作的文本描述难以捕捉场景中个体的固有视觉多样性,从而限制了模型学习场景感知表示的能力。为了解决这一挑战,我们提出了视觉引用的概率提示学习(VirPro),这是一种自适应的多模态预训练范式,可以无缝集成到多种弱监督单目3D检测框架中。具体而言,我们在不同场景中生成一组多样的可学习实例条件提示,并将其存储在自适应提示库(Adaptive Prompt Bank, APB)中。随后,我们引入多高斯提示建模(Multi-Gaussian Prompt Modeling, MGPM),该方法将基于场景的视觉特征融入相应的文本嵌入中,使文本提示能够表达视觉不确定性。然后,从融合的视觉-语言嵌入中,我们解码出一个针对提示的高斯分布,从中为每个实例导出统一的物体级提示嵌入。采用RoI级对比匹配来强制模态对齐,使同一场景中共现物体的嵌入在潜在空间中更接近,从而增强语义一致性。在KITTI基准上的大量实验表明,整合我们的预训练范式始终带来显著的性能提升,平均精度比基线提高了高达4.8%。
cs.CV / 77 / 2603.17474
Revisiting Cross-Attention Mechanisms: Leveraging Beneficial Noise for Domain-Adaptive Learning
重新审视交叉注意力机制:利用有益噪声进行领域自适应学习
Abstract
Unsupervised Domain Adaptation (UDA) seeks to transfer knowledge from a labeled source domain to an unlabeled target domain but often suffers from severe domain and scale gaps that degrade performance. Existing cross-attention-based transformers can align features across domains, yet they struggle to preserve content semantics under large appearance and scale variations. To explicitly address these challenges, we introduce the concept of beneficial noise, which regularizes cross-attention by injecting controlled perturbations, encouraging the model to ignore style distractions and focus on content. We propose the Domain-Adaptive Cross-Scale Matching (DACSM) framework, which consists of a Domain-Adaptive Transformer (DAT) for disentangling domain-shared content from domain-specific style, and a Cross-Scale Matching (CSM) module that adaptively aligns features across multiple resolutions. DAT incorporates beneficial noise into cross-attention, enabling progressive domain translation with enhanced robustness, yielding content-consistent and style-invariant representations. Meanwhile, CSM ensures semantic consistency under scale changes. Extensive experiments on VisDA-2017, Office-Home, and DomainNet demonstrate that DACSM achieves state-of-the-art performance, with up to +2.3% improvement over CDTrans on VisDA-2017. Notably, DACSM achieves a +5.9% gain on the challenging "truck" class of VisDA, evidencing the strength of beneficial noise in handling scale discrepancies. These results highlight the effectiveness of combining domain translation, beneficial-noise-enhanced attention, and scale-aware alignment for robust cross-domain representation learning.
Chinese Translation
无监督领域适应(UDA)旨在将知识从标记的源领域转移到未标记的目标领域,但常常面临严重的领域和尺度差异,从而降低性能。现有的基于交叉注意力的变换器可以对齐跨领域的特征,但在大幅外观和尺度变化下,它们难以保持内容语义。为明确解决这些挑战,我们引入了有益噪声的概念,通过注入受控扰动来正则化交叉注意力,鼓励模型忽略风格干扰,专注于内容。我们提出了领域自适应跨尺度匹配(DACSM)框架,该框架由一个领域自适应变换器(DAT)组成,用于将领域共享内容与领域特定风格解耦,以及一个跨尺度匹配(CSM)模块,能够自适应地对齐多个分辨率下的特征。DAT将有益噪声纳入交叉注意力中,使得增强鲁棒性的渐进式领域翻译成为可能,产生内容一致且风格不变的表示。同时,CSM确保在尺度变化下的语义一致性。在VisDA-2017、Office-Home和DomainNet上的大量实验表明,DACSM达到了最先进的性能,在VisDA-2017上相较于CDTrans提高了最多2.3%。值得注意的是,DACSM在VisDA的“卡车”类别上获得了5.9%的增益,证明了有益噪声在处理尺度差异方面的有效性。这些结果突显了结合领域翻译、有益噪声增强的注意力和尺度感知对齐在鲁棒的跨领域表示学习中的有效性。
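The "beneficial noise" mechanism above can be sketched as a controlled Gaussian perturbation of cross-attention. The injection point (the attention logits) and the scale `sigma` are illustrative assumptions; the abstract does not specify where or how the noise enters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def noisy_cross_attention(q, k, v, sigma=0.1, rng=None):
    """Cross-attention with 'beneficial noise': a controlled Gaussian
    perturbation of the attention logits, acting as a regularizer that
    discourages the model from latching onto style-specific key patterns.
    `sigma` and the injection point are illustrative choices."""
    rng = np.random.default_rng() if rng is None else rng
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    logits = logits + sigma * rng.standard_normal(logits.shape)
    return softmax(logits, axis=-1) @ v
```

With `sigma=0` this reduces to standard scaled dot-product cross-attention; during training a small positive `sigma` perturbs the alignment so content-level cues, not style details, must carry the match.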
cs.CV / 78 / 2603.17476
UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models
UniSAFE:统一多模态模型安全评估的综合基准
Abstract
Unified Multimodal Models (UMMs) offer powerful cross-modality capabilities but introduce new safety risks not observed in single-task models. Despite their emergence, existing safety benchmarks remain fragmented across tasks and modalities, limiting the comprehensive evaluation of complex system-level vulnerabilities. To address this gap, we introduce UniSAFE, the first comprehensive benchmark for system-level safety evaluation of UMMs across 7 I/O modality combinations, spanning conventional tasks and novel multimodal-context image generation settings. UniSAFE is built with a shared-target design that projects common risk scenarios across task-specific I/O configurations, enabling controlled cross-task comparisons of safety failures. Comprising 6,802 curated instances, we use UniSAFE to evaluate 15 state-of-the-art UMMs, both proprietary and open-source. Our results reveal critical vulnerabilities across current UMMs, including elevated safety violations in multi-image composition and multi-turn settings, with image-output tasks consistently more vulnerable than text-output tasks. These findings highlight the need for stronger system-level safety alignment for UMMs. Our code and data are publicly available at https://github.com/segyulee/UniSAFE
Chinese Translation
统一多模态模型(UMMs)提供强大的跨模态能力,但引入了在单任务模型中未观察到的新安全风险。尽管它们正在逐渐出现,现有的安全基准在任务和模态之间仍然存在碎片化,限制了对复杂系统级脆弱性的全面评估。为了解决这一问题,我们提出了UniSAFE,这是第一个针对UMMs系统级安全评估的综合基准,涵盖7种输入/输出模态组合,涉及传统任务和新颖的多模态上下文图像生成设置。UniSAFE采用共享目标设计,投射出跨任务特定输入/输出配置的共同风险场景,能够进行受控的跨任务安全失败比较。该基准包含6,802个精心策划的实例,我们利用UniSAFE评估了15个最先进的UMMs,包括专有和开源模型。我们的结果揭示了当前UMMs的关键脆弱性,包括在多图像组合和多轮对话设置中安全违规的增加,其中图像输出任务的脆弱性始终高于文本输出任务。这些发现突显了对UMMs进行更强系统级安全对齐的必要性。我们的代码和数据可在 https://github.com/segyulee/UniSAFE 获取。
cs.CV / 79 / 2603.17492
UAV-CB: A Complex-Background RGB-T Dataset and Local Frequency Bridge Network for UAV Detection
UAV-CB:一个复杂背景的RGB-T数据集及无人机检测的局部频率桥接网络
Abstract
Detecting Unmanned Aerial Vehicles (UAVs) in low-altitude environments is essential for perception and defense systems but remains highly challenging due to complex backgrounds, camouflage, and multimodal interference. In real-world scenarios, UAVs are frequently visually blended with surrounding structures such as buildings, vegetation, and power lines, resulting in low contrast, weak boundaries, and strong confusion with cluttered background textures. Existing UAV detection datasets, though diverse, are not specifically designed to capture these camouflage and complex-background challenges, which limits progress toward robust real-world perception. To fill this gap, we construct UAV-CB, a new RGB-T UAV detection dataset deliberately curated to emphasize complex low-altitude backgrounds and camouflage characteristics. Furthermore, we propose the Local Frequency Bridge Network (LFBNet), which models features in localized frequency space to bridge both the frequency-spatial fusion gap and the cross-modality discrepancy gap in RGB-T fusion. Extensive experiments on UAV-CB and public benchmarks demonstrate that LFBNet achieves state-of-the-art detection performance and strong robustness under camouflaged and cluttered conditions, offering a frequency-aware perspective on multimodal UAV perception in real-world applications.
Chinese Translation
在低空环境中检测无人机(UAV)对于感知和防御系统至关重要,但由于复杂背景、伪装和多模态干扰,这一任务仍然极具挑战性。在现实场景中,无人机常常与周围的建筑、植被和电力线等结构视觉上融合,导致低对比度、边界模糊以及与杂乱背景纹理的强混淆。现有的无人机检测数据集虽然多样,但并未专门设计用于捕捉这些伪装和复杂背景的挑战,这限制了在真实世界感知中的稳健进展。为填补这一空白,我们构建了UAV-CB,一个新的RGB-T无人机检测数据集,专门策划以强调复杂的低空背景和伪装特征。此外,我们提出了局部频率桥接网络(Local Frequency Bridge Network, LFBNet),该网络在局部频率空间中建模特征,以弥合RGB-T融合中的频率-空间融合差距和跨模态差异差距。在UAV-CB和公共基准上的大量实验表明,LFBNet在伪装和杂乱条件下实现了最先进的检测性能和强大的鲁棒性,为真实世界应用中的多模态无人机感知提供了频率感知的视角。
cs.CV / 80 / 2603.17508
Omni-I2C: A Holistic Benchmark for High-Fidelity Image-to-Code Generation
Omni-I2C:高保真图像到代码生成的整体基准
Abstract
We present Omni-I2C, a comprehensive benchmark designed to evaluate the capability of Large Multimodal Models (LMMs) in converting complex, structured digital graphics into executable code. We argue that this task represents a non-trivial challenge for the current generation of LMMs: it demands an unprecedented synergy between high-fidelity visual perception -- to parse intricate spatial hierarchies and symbolic details -- and precise generative expression -- to synthesize syntactically sound and logically consistent code. Unlike traditional descriptive tasks, Omni-I2C requires a holistic understanding where any minor perceptual hallucination or coding error leads to a complete failure in visual reconstruction. Omni-I2C features 1080 meticulously curated samples, defined by its breadth across subjects, image modalities, and programming languages. By incorporating authentic user-sourced cases, the benchmark spans a vast spectrum of digital content -- from scientific visualizations to complex symbolic notations -- each paired with executable reference code. To complement this diversity, our evaluation framework provides necessary depth; by decoupling performance into perceptual fidelity and symbolic precision, it transcends surface-level accuracy to expose the granular structural failures and reasoning bottlenecks of current LMMs. Our evaluation reveals a substantial performance gap among leading LMMs; even state-of-the-art models struggle to preserve structural integrity in complex scenarios, underscoring that multimodal code generation remains a formidable challenge. Data and code are available at https://github.com/MiliLab/Omni-I2C.
Chinese Translation
我们提出了Omni-I2C,这是一个综合性基准,旨在评估大型多模态模型(LMMs)将复杂的结构化数字图形转换为可执行代码的能力。我们认为这一任务对当前一代LMMs构成了非平凡的挑战:它要求在高保真的视觉感知与精确的生成表达之间实现前所未有的协同——以解析复杂的空间层次和符号细节,并合成语法正确且逻辑一致的代码。与传统的描述性任务不同,Omni-I2C需要一种整体理解,其中任何微小的感知幻觉或编码错误都会导致视觉重建的完全失败。Omni-I2C包含1080个精心策划的样本,涵盖了不同主题、图像模态和编程语言。通过纳入真实用户来源的案例,该基准涵盖了广泛的数字内容——从科学可视化到复杂的符号表示——每个案例都配有可执行的参考代码。为了补充这种多样性,我们的评估框架提供了必要的深度;通过将性能解耦为感知保真度和符号精确度,它超越了表面准确性,揭示了当前LMMs的细粒度结构失败和推理瓶颈。我们的评估揭示了领先LMMs之间存在显著的性能差距;即使是最先进的模型在复杂场景中也难以保持结构完整性,强调了多模态代码生成仍然是一个艰巨的挑战。数据和代码可在 https://github.com/MiliLab/Omni-I2C 获取。
cs.CV / 81 / 2603.17514
EI: Early Intervention for Multimodal Imaging based Disease Recognition
EI:基于多模态影像的疾病识别早期干预
Abstract
Current methods for multimodal medical imaging based disease recognition face two major challenges. First, the prevailing "fusion after unimodal image embedding" paradigm cannot fully leverage the complementary and correlated information in the multimodal data. Second, the scarcity of labeled multimodal medical images, coupled with their significant domain shift from natural images, hinders the use of cutting-edge Vision Foundation Models (VFMs) for medical image embedding. To jointly address the challenges, we propose a novel Early Intervention (EI) framework. Treating one modality as target and the rest as reference, EI harnesses high-level semantic tokens from the reference as intervention tokens to steer the target modality's embedding process at an early stage. Furthermore, we introduce Mixture of Low-varied-Ranks Adaptation (MoR), a parameter-efficient fine-tuning method that employs a set of low-rank adapters with varied ranks and a weight-relaxed router for VFM adaptation. Extensive experiments on three public datasets for retinal disease, skin lesion, and knee anomaly classification verify the effectiveness of the proposed method against a number of competitive baselines.
Chinese Translation
当前基于多模态医学影像的疾病识别方法面临两个主要挑战。首先,普遍采用的“单模态影像嵌入后融合”范式无法充分利用多模态数据中的互补和相关信息。其次,标注的多模态医学影像稀缺,加上它们与自然影像之间显著的领域转移,阻碍了前沿视觉基础模型(Vision Foundation Models, VFMs)在医学影像嵌入中的应用。为共同应对这些挑战,我们提出了一种新颖的早期干预(Early Intervention, EI)框架。EI将一种模态视为目标,其他模态视为参考,利用参考模态中的高层语义标记作为干预标记,在早期阶段引导目标模态的嵌入过程。此外,我们引入了低秩适应混合(Mixture of Low-varied-Ranks Adaptation, MoR),这是一种参数高效的微调方法,采用一组具有不同秩的低秩适配器和一个权重放松的路由器进行VFM适应。在三个公共数据集上进行的广泛实验,涉及视网膜疾病、皮肤病变和膝关节异常分类,验证了所提方法相较于多个竞争基线的有效性。
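A minimal sketch of the Mixture of Low-varied-Ranks idea: several LoRA-style adapters of different ranks added on top of a frozen weight, combined by a router. Reading "weight-relaxed" as independent sigmoid gates (rather than a softmax that must sum to one) is an assumption, as are all names and shapes below.

```python
import numpy as np

def lora_delta(x, A, B):
    """Low-rank update x @ A @ B, with A: (D, r) and B: (r, D_out)."""
    return x @ A @ B

def mor_forward(x, W, adapters, gate_logits):
    """Frozen weight W plus a gated sum of low-rank adapters of varied ranks.
    The 'weight-relaxed' router is modelled as independent sigmoid gates
    (an assumption): gates need not sum to one, so any subset of adapters
    can be switched on or off."""
    out = x @ W
    gates = 1.0 / (1.0 + np.exp(-gate_logits))
    for g, (A, B) in zip(gates, adapters):
        out = out + g * lora_delta(x, A, B)
    return out
```

Only the adapters and gate logits would be trained; the backbone weight `W` stays frozen, which is what makes the scheme parameter-efficient.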
cs.CV / 82 / 2603.17519
UniSem: Generalizable Semantic 3D Reconstruction from Sparse Unposed Images
UniSem:从稀疏无姿态图像中进行可泛化的语义3D重建
Abstract
Semantic-aware 3D reconstruction from sparse, unposed images remains challenging for feed-forward 3D Gaussian Splatting (3DGS). Existing methods often predict an over-complete set of Gaussian primitives under sparse-view supervision, leading to unstable geometry and inferior depth quality. Meanwhile, they rely solely on 2D segmenter features for semantic lifting, which provides weak 3D-level and limited generalizable supervision, resulting in incomplete 3D semantics in novel scenes. To address these issues, we propose UniSem, a unified framework that jointly improves depth accuracy and semantic generalization via two key components. First, Error-aware Gaussian Dropout (EGD) performs error-guided capacity control by suppressing redundancy-prone Gaussians using rendering error cues, producing meaningful, geometrically stable Gaussian representations for improved depth estimation. Second, we introduce a Mix-training Curriculum (MTC) that progressively blends 2D segmenter-lifted semantics with the model's own emergent 3D semantic priors, implemented with object-level prototype alignment to enhance semantic coherence and completeness. Extensive experiments on ScanNet and Replica show that UniSem achieves superior performance in depth prediction and open-vocabulary 3D segmentation across varying numbers of input views. Notably, with 16-view inputs, UniSem reduces depth Rel by 15.2% and improves open-vocabulary segmentation mAcc by 3.7% over strong baselines.
Chinese Translation
从稀疏无姿态图像中进行语义感知的3D重建对于前馈3D高斯点云(3DGS)仍然具有挑战性。现有方法通常在稀疏视图监督下预测过度完整的高斯原语集,导致几何不稳定和深度质量较差。同时,它们仅依赖于2D分割器特征进行语义提升,这提供了较弱的3D级别和有限的可泛化监督,导致在新场景中3D语义不完整。为了解决这些问题,我们提出了UniSem,一个统一框架,通过两个关键组件共同提高深度准确性和语义泛化。首先,错误感知高斯丢弃(EGD)通过使用渲染错误线索抑制冗余倾向的高斯,执行错误引导的容量控制,从而生成有意义的、几何稳定的高斯表示,以改善深度估计。其次,我们引入了一种混合训练课程(MTC),逐步将2D分割器提升的语义与模型自身的突现3D语义先验混合,采用对象级原型对齐来增强语义的一致性和完整性。在ScanNet和Replica上的大量实验表明,UniSem在深度预测和开放词汇3D分割方面的表现优于不同输入视图数量的强基线。值得注意的是,在16视图输入下,UniSem将深度相对误差(Rel)降低了15.2%,并将开放词汇分割的平均准确率(mAcc)提高了3.7%。
cs.CV / 83 / 2603.17520
PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation
PCA-Seg:重新审视开放词汇语义和部分分割的成本聚合
Abstract
Recent advances in vision-language models (VLMs) have garnered substantial attention in open-vocabulary semantic and part segmentation (OSPS). However, existing methods extract image-text alignment cues from cost volumes through a serial structure of spatial and class aggregations, leading to knowledge interference between class-level semantics and spatial context. Therefore, this paper proposes a simple yet effective parallel cost aggregation (PCA-Seg) paradigm to alleviate the above challenge, enabling the model to capture richer vision-language alignment information from cost volumes. Specifically, we design an expert-driven perceptual learning (EPL) module that efficiently integrates semantic and contextual streams. It incorporates a multi-expert parser to extract complementary features from multiple perspectives. In addition, a coefficient mapper is designed to adaptively learn pixel-specific weights for each feature, enabling the integration of complementary knowledge into a unified and robust feature embedding. Furthermore, we propose a feature orthogonalization decoupling (FOD) strategy to mitigate redundancy between the semantic and contextual streams, which allows the EPL module to learn diverse knowledge from orthogonalized features. Extensive experiments on eight benchmarks show that each parallel block in PCA-Seg adds merely 0.35M parameters while achieving state-of-the-art OSPS performance.
Chinese Translation
最近在视觉-语言模型(VLMs)方面的进展引起了对开放词汇语义和部分分割(OSPS)的广泛关注。然而,现有方法通过空间和类别聚合的串行结构从成本体积中提取图像-文本对齐线索,导致类别级语义与空间上下文之间的知识干扰。因此,本文提出了一种简单而有效的并行成本聚合(PCA-Seg)范式,以缓解上述挑战,使模型能够从成本体积中捕获更丰富的视觉-语言对齐信息。具体而言,我们设计了一个专家驱动的感知学习(EPL)模块,能够高效整合语义和上下文流。该模块结合了一个多专家解析器,从多个角度提取互补特征。此外,设计了一个系数映射器,以自适应地学习每个特征的像素特定权重,从而将互补知识整合到统一且强大的特征嵌入中。此外,我们提出了一种特征正交化解耦(FOD)策略,以减轻语义流和上下文流之间的冗余,这使得EPL模块能够从正交化特征中学习多样化的知识。在八个基准测试上的大量实验表明,PCA-Seg中的每个并行模块仅增加0.35M参数,同时实现了最先进的OSPS性能。
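The feature orthogonalization decoupling (FOD) strategy can be illustrated by a per-vector Gram-Schmidt step that removes from the contextual stream its component along the semantic stream. This is a simplified sketch; the stream names and the choice of last-axis feature vectors are assumptions.

```python
import numpy as np

def orthogonalize(semantic, context, eps=1e-8):
    """Remove from `context` its component along `semantic`, per feature
    vector on the last axis, so the two streams carry non-redundant
    (orthogonal) information. `eps` guards against zero-norm vectors."""
    coeff = (semantic * context).sum(-1, keepdims=True) / (
        (semantic * semantic).sum(-1, keepdims=True) + eps)
    return context - coeff * semantic
```

After this step the residual contextual features are (numerically) orthogonal to the semantic ones, which is the property the EPL module exploits to learn complementary knowledge from the two streams.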
cs.CV / 84 / 2603.17528
MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing
MM-OVSeg:用于遥感开放词汇分割的多模态光学-合成孔径雷达融合
Abstract
Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities--optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available here.
Chinese Translation
开放词汇分割能够从开放的文本类别集中进行像素级识别,允许超越固定类别的泛化。尽管在遥感领域具有巨大潜力,但这一领域的进展在很大程度上仍然局限于晴空光学数据,并在多云或雾霾污染条件下面临困难。我们提出了MM-OVSeg,一个针对恶劣天气条件下稳健的开放词汇分割的多模态光学-合成孔径雷达(SAR)融合框架。MM-OVSeg利用了这两种模态的互补优势——光学影像提供丰富的光谱语义,而合成孔径雷达(SAR)则提供穿透云层的结构线索。为了解决跨模态领域差距和当前视觉-语言模型的有限密集预测能力,我们提出了两个关键设计:用于多传感器表示对齐的跨模态统一过程,以及一个双编码器融合模块,该模块整合来自多个视觉基础模型的分层特征,以实现文本对齐的多模态分割。大量实验表明,MM-OVSeg在不同云层条件下展现出优越的鲁棒性和泛化能力。源数据集和代码可在此获取。
cs.CV / 85 / 2603.17530
AdapTS: Lightweight Teacher-Student Approach for Multi-Class and Continual Visual Anomaly Detection
AdapTS:用于多类别和持续视觉异常检测的轻量级教师-学生方法
Abstract
Visual Anomaly Detection (VAD) is crucial for industrial inspection, yet most existing methods are limited to single-category scenarios, failing to address the multi-class and continual learning demands of real-world environments. While Teacher-Student (TS) architectures are efficient, they remain unexplored for the continual setting. To bridge this gap, we propose AdapTS, a unified TS framework designed for multi-class and continual settings, optimized for edge deployment. AdapTS eliminates the need for two different architectures by utilizing a single shared frozen backbone and injecting lightweight trainable adapters into the student pathway. Training is enhanced via a segmentation-guided objective and synthetic Perlin noise, while a prototype-based task identification mechanism dynamically selects adapters at inference with 99% accuracy. Experiments on MVTec AD and VisA demonstrate that AdapTS matches the performance of existing TS methods across multi-class and continual learning scenarios, while drastically reducing memory overhead. Our lightest variant, AdapTS-S, requires only 8 MB of additional memory, 13x less than STFPM (95 MB), 48x less than RD4AD (360 MB), and 149x less than DeSTSeg (1120 MB), making it a highly scalable solution for edge deployment in complex industrial environments.
Chinese Translation
视觉异常检测(VAD)在工业检测中至关重要,但大多数现有方法仅限于单类别场景,未能满足现实环境中多类别和持续学习的需求。尽管教师-学生(TS)架构效率高,但在持续学习设置中尚未得到探索。为填补这一空白,我们提出了AdapTS,一个统一的TS框架,旨在支持多类别和持续学习设置,并针对边缘部署进行了优化。AdapTS通过利用单一共享的冻结主干网络,并在学生路径中注入轻量级可训练适配器,消除了对两种不同架构的需求。通过分割引导目标和合成Perlin噪声增强训练,同时基于原型的任务识别机制在推理时以99%的准确率动态选择适配器。在MVTec AD和VisA上的实验表明,AdapTS在多类别和持续学习场景中与现有的TS方法的性能相当,同时显著降低了内存开销。我们的最轻变体AdapTS-S仅需额外8 MB的内存,比STFPM(95 MB)少13倍,比RD4AD(360 MB)少48倍,比DeSTSeg(1120 MB)少149倍,成为在复杂工业环境中边缘部署的高度可扩展解决方案。
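The prototype-based task identification above can be sketched as nearest-prototype matching: one mean feature vector per task, with the adapter of the best-matching task selected at inference. The feature source and the use of cosine similarity are assumptions not stated in the abstract.

```python
import numpy as np

def build_prototypes(features_per_task):
    """One prototype per task: the mean of its L2-normalized training
    features. `features_per_task` maps a task name to an (N, D) array."""
    protos = {}
    for task, feats in features_per_task.items():
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        protos[task] = feats.mean(axis=0)
    return protos

def identify_task(query, protos):
    """Return the task whose prototype has the highest cosine similarity
    to the query feature; its adapter would then be activated."""
    q = query / np.linalg.norm(query)
    def cos(task):
        p = protos[task]
        return float(q @ p) / float(np.linalg.norm(p))
    return max(protos, key=cos)
```

Because only one small prototype vector is stored per task, the selection mechanism itself adds negligible memory on top of the adapters.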
cs.CV / 86 / 2603.17531
Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing
Rel-Zero:利用补丁对不变性实现对抗AI编辑的稳健零水印
Abstract
Recent advancements in diffusion-based image editing pose a significant threat to the authenticity of digital visual content. Traditional embedding-based watermarking methods often introduce perceptible perturbations to maintain robustness, inevitably compromising visual fidelity. Meanwhile, existing zero-watermarking approaches, typically relying on global image features, struggle to withstand sophisticated manipulations. In this work, we uncover a key observation: while individual image patches undergo substantial alterations during AI-based editing, the relational distance between patch pairs remains relatively invariant. Leveraging this property, we propose Relational Zero-Watermarking (Rel-Zero), a novel framework that requires no modification to the original image but derives a unique zero-watermark from these editing-invariant patch relations. By grounding the watermark in intrinsic structural consistency rather than absolute appearance, Rel-Zero provides a non-invasive yet resilient mechanism for content authentication. Extensive experiments demonstrate that Rel-Zero achieves substantially improved robustness across diverse editing models and manipulations compared to prior zero-watermarking approaches.
Chinese Translation
近期基于扩散的图像编辑技术的进步对数字视觉内容的真实性构成了重大威胁。传统的嵌入式水印方法通常为了保持稳健性而引入可感知的扰动,必然会损害视觉保真度。同时,现有的零水印方法通常依赖于全局图像特征,难以抵御复杂的操控。在本研究中,我们发现了一个关键观察:尽管在基于AI的编辑过程中,单个图像补丁经历了显著的变化,但补丁对之间的关系距离却保持相对不变。利用这一特性,我们提出了关系零水印(Relational Zero-Watermarking,Rel-Zero),这是一个新颖的框架,它不需要对原始图像进行修改,而是从这些编辑不变的补丁关系中推导出独特的零水印。通过将水印基于内在的结构一致性而非绝对外观,Rel-Zero 提供了一种非侵入性且具有韧性的内容认证机制。大量实验表明,与之前的零水印方法相比,Rel-Zero 在多种编辑模型和操控下实现了显著增强的稳健性。
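The patch-pair invariance can be made concrete with a toy signature: binarize the pairwise distances between per-patch statistics into a bit string. A global brightness shift alters every patch yet leaves the pairwise distances, and hence the signature, unchanged. Patch size, the mean statistic, and median thresholding are illustrative choices, not the paper's actual design.

```python
import numpy as np

def patch_pair_signature(image, patch=8, thresh=None):
    """Split a grayscale image into non-overlapping patches, reduce each to
    its mean, and binarize the upper-triangular pairwise distance matrix
    into a zero-watermark bit string. `thresh` defaults to the median
    pairwise distance of this image."""
    h, w = image.shape[:2]
    means = [image[i:i + patch, j:j + patch].mean()
             for i in range(0, h - patch + 1, patch)
             for j in range(0, w - patch + 1, patch)]
    v = np.asarray(means)
    dist = np.abs(v[:, None] - v[None, :])      # relational patch-pair distances
    d = dist[np.triu_indices(len(v), k=1)]
    if thresh is None:
        thresh = np.median(d)
    return (d > thresh).astype(np.uint8)
```

Nothing is embedded in the image itself; authentication compares the stored bit string with the one recomputed from a suspect image, which is what makes the scheme a zero-watermark.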
cs.CV / 87 / 2603.17538
Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis
学习基于坐标的卷积核以实现连续的 SE(3) 等变性和高效的点云分析
Abstract
Symmetry under rigid motion is one of the salient factors in efficient learning for 3D point cloud problems. Group convolution has been a representative method for extracting equivariant features, but its realizations have struggled to retain rigorous symmetry and scalability simultaneously. We advocate utilizing the intertwiner framework to resolve this trade-off, but previous works on it, which did not achieve complete SE(3) symmetry or scalability to large-scale problems, necessitate a more advanced kernel architecture. We present Equivariant Coordinate-based Kernel Convolution, or ECKConv. It acquires SE(3) equivariance from a kernel domain defined on a double coset space, and its explicit kernel design using coordinate-based networks enhances its learning capability and memory efficiency. Experiments on diverse point cloud tasks, e.g., classification, pose registration, part segmentation, and large-scale semantic segmentation, validate the rigid-motion equivariance, memory scalability, and outstanding performance of ECKConv compared to state-of-the-art equivariant methods.
Chinese Translation
刚体运动下的对称性是高效学习三维点云问题的一个显著因素。群卷积一直是提取等变特征的代表性方法,但其实现一直难以同时保持严格的对称性和可扩展性。我们主张利用交织算子(intertwiner)框架来解决这一权衡,但之前的相关研究未能实现完全的 SE(3) 对称性或对大规模问题的可扩展性,因此需要一种更先进的卷积核架构。我们提出了等变的基于坐标的卷积核(Equivariant Coordinate-based Kernel Convolution,简称 ECKConv)。它通过定义在双陪集空间上的卷积核定义域获得 SE(3) 等变性,并且其基于坐标网络的显式卷积核设计增强了其学习能力和内存效率。在多种点云任务上的实验,例如分类、姿态配准、部件分割和大规模语义分割,验证了 ECKConv 相较于最先进的等变方法在刚体等变性、内存可扩展性和卓越性能方面的优势。
cs.CV / 88 / 2603.17541
Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models
时间收益,空间成本:重新审视多模态大语言模型中的视频微调
Abstract
Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.
Chinese Translation
多模态大语言模型(MLLMs)通常分阶段进行训练,其中基于视频的监督微调(Video-SFT)作为提升视觉理解的关键步骤。然而,它对视觉能力的细致演变,特别是空间理解与时间理解之间的平衡,其影响仍然不甚明了。本文系统研究了Video-SFT如何重塑MLLMs中的视觉能力。在不同的架构、参数规模和帧采样设置下,我们观察到一个一致的模式:Video-SFT可靠地提升了视频性能,但在静态图像基准上往往仅带来有限的收益,甚至可能导致性能下降。我们进一步表明,这种权衡与时间预算密切相关:增加采样帧数通常能提高视频性能,但并不一定能可靠地改善静态图像性能。基于这一发现,我们研究了一种关注指令的混合帧策略(Hybrid-Frame),该策略自适应地分配帧数,并部分缓解图像与视频之间的权衡。我们的结果表明,Video-SFT并不是MLLMs的“免费午餐”,而保持空间理解仍然是联合图像-视频训练中的一个核心挑战。
cs.CV / 89 / 2603.17546
ProGVC: Progressive-based Generative Video Compression via Auto-Regressive Context Modeling
ProGVC:基于自回归上下文建模的渐进式生成视频压缩
Abstract
Perceptual video compression leverages generative priors to reconstruct realistic textures and motions at low bitrates. However, existing perceptual codecs often lack native support for variable bitrate and progressive delivery, and their generative modules are weakly coupled with entropy coding, limiting bitrate reduction. Inspired by the next-scale prediction in the Visual Auto-Regressive (VAR) models, we propose ProGVC, a Progressive-based Generative Video Compression framework that unifies progressive transmission, efficient entropy coding, and detail synthesis within a single codec. ProGVC encodes videos into hierarchical multi-scale residual token maps, enabling flexible rate adaptation by transmitting a coarse-to-fine subset of scales in a progressive manner. A Transformer-based multi-scale autoregressive context model estimates token probabilities, utilized both for efficient entropy coding of the transmitted tokens and for predicting truncated fine-scale tokens at the decoder to restore perceptual details. Extensive experiments demonstrate that as a new coding paradigm, ProGVC delivers promising perceptual compression performance at low bitrates while offering practical scalability at the same time.
Chinese Translation
感知视频压缩利用生成先验在低比特率下重建真实的纹理和运动。然而,现有的感知编解码器往往缺乏对可变比特率和渐进式传输的原生支持,其生成模块与熵编码的耦合较弱,限制了比特率的降低。受到视觉自回归(VAR)模型中下一尺度预测的启发,我们提出了ProGVC,一个基于渐进式的生成视频压缩框架,将渐进式传输、高效熵编码和细节合成统一于单一编解码器中。ProGVC将视频编码为分层多尺度残差令牌图,使得通过以渐进方式传输粗到细的尺度子集实现灵活的速率适应。基于Transformer的多尺度自回归上下文模型估计令牌概率,既用于对传输令牌的高效熵编码,也用于在解码器处预测截断的细尺度令牌以恢复感知细节。大量实验表明,作为一种新的编码范式,ProGVC在低比特率下提供了令人满意的感知压缩性能,同时提供了实际的可扩展性。
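The hierarchical multi-scale residual coding behind the progressive delivery can be sketched in a toy form: each scale encodes what the coarser reconstruction so far still misses, and decoding any coarse-to-fine prefix of scales yields a valid (just coarser) reconstruction. Tokenization, entropy coding, and the learned sampler are all omitted here; the nearest-neighbor resampling is a simplifying assumption.

```python
import numpy as np

def downsample(x, f):
    """Average-pool a (H, W) map by factor f (H and W divisible by f)."""
    return x.reshape(x.shape[0] // f, f, x.shape[1] // f, f).mean(axis=(1, 3))

def upsample(x, f):
    """Nearest-neighbor upsample by factor f."""
    return np.repeat(np.repeat(x, f, axis=0), f, axis=1)

def encode_scales(x, factors):
    """Hierarchical multi-scale residual maps, coarse to fine: each scale
    captures the residual left by the reconstruction of earlier scales."""
    scales, recon = [], np.zeros_like(x)
    for f in factors:
        r = downsample(x - recon, f)
        scales.append(r)
        recon = recon + upsample(r, f)
    return scales

def decode_scales(scales, factors, shape):
    """Progressive decode from any coarse-to-fine prefix of the scales."""
    recon = np.zeros(shape)
    for r, f in zip(scales, factors):
        recon = recon + upsample(r, f)
    return recon
```

Transmitting only the first scales gives a coarse preview; each additional scale refines it, which is the rate-adaptation mechanism the entropy model then exploits.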
cs.CV / 90 / 2603.17554
Prompt-Free Universal Region Proposal Network
无提示通用区域提议网络
Abstract
Identifying potential objects is critical for object recognition and analysis across various computer vision applications. Existing methods typically localize potential objects by relying on exemplar images, predefined categories, or textual descriptions. However, their reliance on image and text prompts often limits flexibility, restricting adaptability in real-world scenarios. In this paper, we introduce a novel Prompt-Free Universal Region Proposal Network (PF-RPN), which identifies potential objects without relying on external prompts. First, the Sparse Image-Aware Adapter (SIA) module performs initial localization of potential objects using a learnable query embedding dynamically updated with visual features. Next, the Cascade Self-Prompt (CSP) module identifies the remaining potential objects by leveraging the self-prompted learnable embedding, autonomously aggregating informative visual features in a cascading manner. Finally, the Centerness-Guided Query Selection (CG-QS) module facilitates the selection of high-quality query embeddings using a centerness scoring network. Our method can be optimized with limited data (e.g., 5% of MS COCO data) and applied directly to various object detection application domains for identifying potential objects without fine-tuning, such as underwater object detection, industrial defect detection, and remote sensing image object detection. Experimental results across 19 datasets validate the effectiveness of our method. Code is available at https://github.com/tangqh03/PF-RPN.
Chinese Translation
识别潜在物体对于各种计算机视觉应用中的物体识别和分析至关重要。现有方法通常依赖示例图像、预定义类别或文本描述来定位潜在物体。然而,它们对图像和文本提示的依赖往往限制了灵活性,限制了在现实场景中的适应性。在本文中,我们提出了一种新颖的无提示通用区域提议网络(Prompt-Free Universal Region Proposal Network, PF-RPN),该网络在不依赖外部提示的情况下识别潜在物体。首先,稀疏图像感知适配器(Sparse Image-Aware Adapter, SIA)模块使用可学习的查询嵌入进行潜在物体的初步定位,该嵌入会根据视觉特征动态更新。接下来,级联自提示(Cascade Self-Prompt, CSP)模块利用自提示的可学习嵌入识别剩余的潜在物体,以级联方式自主聚合信息丰富的视觉特征。最后,中心性引导查询选择(Centerness-Guided Query Selection, CG-QS)模块通过中心性评分网络促进高质量查询嵌入的选择。我们的方法可以在有限数据(例如,5%的 MS COCO 数据)下进行优化,并可直接应用于各种物体检测应用领域,以识别潜在物体而无需微调,例如水下物体检测、工业缺陷检测和遥感图像物体检测。跨19个数据集的实验结果验证了我们方法的有效性。代码可在 https://github.com/tangqh03/PF-RPN 获取。
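The centerness-guided query selection can be illustrated with the FCOS-style centerness score and a top-k filter. In PF-RPN the score comes from a learned scoring network, so the analytic formula here is only a stand-in.

```python
import numpy as np

def centerness(l, t, r, b):
    """FCOS-style centerness: 1.0 when the sample point sits at the box
    centre, decaying toward 0 near the border. l, t, r, b are distances
    from the point to the left/top/right/bottom box edges."""
    return np.sqrt((np.minimum(l, r) / np.maximum(l, r)) *
                   (np.minimum(t, b) / np.maximum(t, b)))

def select_queries(queries, scores, k):
    """Keep the k query embeddings with the highest centerness scores."""
    idx = np.argsort(scores)[::-1][:k]
    return queries[idx], idx
```

Filtering queries by centerness favors embeddings anchored near object centres, which tend to produce better-localized proposals than border-adjacent ones.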
cs.CV / 91 / 2603.17555
FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion
FrescoDiffusion:基于先验正则化的4K图像到视频的平铺扩散
Abstract
Diffusion-based image-to-video (I2V) models are increasingly effective, yet they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating videos at the model's native resolution often loses fine-grained structure, whereas high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is particularly severe in the fresco animation setting: monumental artworks containing many distinct characters, objects, and semantically different sub-scenes that must remain spatially coherent over time. We introduce FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single complex image. The key idea is to augment tiled denoising with a precomputed latent prior: we first generate a low-resolution video at the underlying model resolution and upsample its latent trajectory to obtain a global reference that captures long-range temporal and spatial structure. For 4K generation, we compute per-tile noise predictions and fuse them with this reference at every diffusion timestep by minimizing a single weighted least-squares objective in model-output space. The objective combines a standard tile-merging criterion with our regularization term, yielding a closed-form fusion update that strengthens global coherence while retaining fine detail. We additionally provide a spatial regularization variable that enables region-level control over where motion is allowed. Experiments on the VBench-I2V dataset and our proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines, while being computationally efficient. Our regularization enables explicit controllability of the trade-off between creativity and consistency.
Chinese Translation
基于扩散的图像到视频(I2V)模型越来越有效,但在处理超高分辨率输入(例如4K)时仍然面临挑战。在模型的原生分辨率下生成视频往往会丢失细致的结构,而高分辨率的平铺去噪则保留了局部细节,但破坏了全局布局的一致性。这种失效模式在壁画动画设置中尤为严重:这些宏伟的艺术作品包含许多不同的角色、物体和语义上不同的子场景,必须在时间上保持空间一致性。我们提出了FrescoDiffusion,这是一种无需训练的方法,用于从单一复杂图像生成一致的大格式I2V。其关键思想是通过预计算的潜在先验来增强平铺去噪:我们首先在基础模型分辨率下生成低分辨率视频,并上采样其潜在轨迹,以获得捕捉长程时间和空间结构的全局参考。对于4K生成,我们计算每个平铺的噪声预测,并在每个扩散时间步通过最小化模型输出空间中的单个加权最小二乘目标,将其与该参考融合。该目标结合了标准的平铺合并标准和我们的正则化项,产生了一个封闭形式的融合更新,增强了全局一致性,同时保留了细节。此外,我们还提供了一个空间正则化变量,使得可以在区域级别上控制运动的允许范围。我们在VBench-I2V数据集和我们提出的壁画I2V数据集上的实验表明,与平铺基线相比,全局一致性和保真度得到了改善,同时计算效率也很高。我们的正则化使得在创造力和一致性之间的权衡具有明确的可控性。
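The weighted least-squares fusion described above has a simple closed form: minimizing a sum of per-tile quadratic terms plus a quadratic pull toward the global reference yields a weighted average. The following is a per-pixel toy sketch on a 1-D signal with hypothetical unit tile weights; the actual method operates on noise predictions in model-output space at every diffusion timestep:

```python
import numpy as np

def fuse_tiles(tile_preds, tile_slices, shape, reference, lam):
    """Closed-form fusion: per pixel, minimize
        sum_i w_i * (x - eps_i)**2 + lam * (x - ref)**2
    over overlapping tile predictions eps_i (unit weights w_i = 1 in this sketch),
    which reduces to a weighted average of the tiles and the global reference."""
    num = lam * reference.astype(float)
    den = np.full(shape, lam, dtype=float)
    for pred, sl in zip(tile_preds, tile_slices):
        num[sl] += pred
        den[sl] += 1.0
    return num / den

# two overlapping 1-D "tiles" over an 8-pixel signal, constant zero reference
reference = np.zeros(8)
tiles = [np.ones(5), 2.0 * np.ones(5)]
slices = [np.s_[0:5], np.s_[3:8]]
fused = fuse_tiles(tiles, slices, (8,), reference, lam=0.5)
```

Increasing `lam` strengthens global coherence (pulls toward the reference) at the cost of local tile detail, which matches the creativity-versus-consistency trade-off the abstract mentions.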
cs.CV / 92 / 2603.17567
Face anonymization preserving facial expressions and photometric realism
保留面部表情和光度真实性的面部匿名化
Abstract
The widespread sharing of face images on social media platforms and in large-scale datasets raises pressing privacy concerns, as biometric identifiers can be exploited without consent. Face anonymization seeks to generate realistic facial images that irreversibly conceal the subject's identity while preserving their usefulness for downstream tasks. However, most existing generative approaches focus on identity removal and image realism, often neglecting facial expressions as well as photometric consistency -- specifically attributes such as illumination and skin tone -- that are critical for applications like relighting, color constancy, and medical or affective analysis. In this work, we propose a feature-preserving anonymization framework that extends DeepPrivacy by incorporating dense facial landmarks to better retain expressions, and by introducing lightweight post-processing modules that ensure consistency in lighting direction and skin color. We further establish evaluation metrics specifically designed to quantify expression fidelity, lighting consistency, and color preservation, complementing standard measures of image realism, pose accuracy, and re-identification resistance. Experiments on the CelebA-HQ dataset demonstrate that our method produces anonymized faces with improved realism and significantly higher fidelity in expression, illumination, and skin tone compared to state-of-the-art baselines. These results underscore the importance of feature-aware anonymization as a step toward more useful, fair, and trustworthy privacy-preserving facial data.
Chinese Translation
在社交媒体平台和大规模数据集中,面部图像的广泛分享引发了紧迫的隐私问题,因为生物识别标识符可能在未获同意的情况下被利用。面部匿名化旨在生成真实的面部图像,能够不可逆转地隐藏主体的身份,同时保留其在下游任务中的实用性。然而,大多数现有的生成方法侧重于身份去除和图像真实感,往往忽视了面部表情以及光度一致性——特别是照明和肤色等属性——这些对于重新照明、颜色恒常性以及医学或情感分析等应用至关重要。在本研究中,我们提出了一种特征保留的匿名化框架,通过引入密集的面部特征点来更好地保留表情,并引入轻量级后处理模块以确保照明方向和肤色的一致性,从而扩展了DeepPrivacy。我们进一步建立了专门设计的评估指标,以量化表情保真度、照明一致性和颜色保留,补充了图像真实感、姿态准确性和重新识别抗性等标准测量。对CelebA-HQ数据集的实验表明,我们的方法生成的匿名面孔在真实感、表情、照明和肤色的保真度上显著优于最先进的基线。这些结果强调了特征感知匿名化的重要性,作为朝着更有用、公平和可信的隐私保护面部数据迈出的重要一步。
cs.CV / 93 / 2603.17571
PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery
PanoVGGT:基于全景图像的前馈3D重建
Abstract
Panoramic imagery offers a full 360° field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction. Existing feed-forward models, built for perspective cameras, generalize poorly to this setting. We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth and 6-DoF pose annotations. Extensive experiments on PanoCity and standard benchmarks demonstrate that PanoVGGT achieves competitive accuracy, strong robustness, and improved cross-domain generalization. Code and dataset will be released.
Chinese Translation
全景图像提供了360°的完整视场,并在消费设备中越来越普遍。然而,它引入了非针孔畸变,这对联合姿态估计和3D重建提出了挑战。现有的为透视相机构建的前馈模型在这一设置下泛化能力较差。我们提出了PanoVGGT,一种排列等变的Transformer框架,能够在一次前向传播中联合预测相机姿态、深度图和3D点云,基于一个或多个全景图。该模型结合了球面感知的位置嵌入和特定于全景的三轴SO(3)旋转增强,能够在球面域内有效进行几何推理。为了解决固有的全局框架模糊性,我们在训练过程中进一步引入了一种随机锚定策略。此外,我们贡献了PanoCity,一个大规模的户外全景数据集,包含密集的深度和6自由度姿态注释。在PanoCity和标准基准上的大量实验表明,PanoVGGT在准确性、鲁棒性和跨域泛化能力方面表现出色。代码和数据集将会发布。
cs.CV / 94 / 2603.17576
LoGSAM: Parameter-Efficient Cross-Modal Grounding for MRI Segmentation
LoGSAM:用于MRI分割的参数高效跨模态基础对接
Abstract
Precise localization and delineation of brain tumors using Magnetic Resonance Imaging (MRI) are essential for planning therapy and guiding surgical decisions. However, most existing approaches rely on task-specific supervised models and are constrained by the limited availability of annotated data. To address this, we propose LoGSAM, a parameter-efficient, detection-driven framework that transforms radiologist dictation into text prompts for foundation-model-based localization and segmentation. Radiologist speech is first transcribed and translated using a pretrained Whisper ASR model, followed by negation-aware clinical NLP to extract tumor-specific textual prompts. These prompts guide text-conditioned tumor localization via a LoRA-adapted vision-language detection model, Grounding DINO (GDINO). The LoRA adaptation updates only 5% of the model parameters, thereby enabling computationally efficient domain adaptation while preserving pretrained cross-modal knowledge. The predicted bounding boxes are used as prompts for MedSAM to generate pixel-level tumor masks without any additional fine-tuning. Conditioning the frozen MedSAM on LoGSAM-derived priors yields a state-of-the-art dice score of 80.32% on BRISC 2025. In addition, we evaluate the full pipeline using German dictations from a board-certified radiologist on 12 unseen MRI scans, achieving 91.7% case-level accuracy. These results highlight the feasibility of constructing a modular, speech-to-segmentation pipeline by intelligently leveraging pretrained foundation models with minimal parameter updates.
Chinese Translation
使用磁共振成像(MRI)对脑肿瘤进行精确定位和描绘对于治疗规划和手术决策至关重要。然而,大多数现有方法依赖于特定任务的监督模型,并受到标注数据有限的限制。为了解决这个问题,我们提出了LoGSAM,一个参数高效、以检测为驱动的框架,将放射科医师的口述转化为基础模型驱动的定位和分割文本提示。首先,使用预训练的Whisper ASR模型对放射科医师的语音进行转录和翻译,然后通过关注否定的临床自然语言处理(NLP)提取肿瘤特定的文本提示。这些提示通过经过LoRA(低秩适应)调整的视觉-语言检测模型Grounding DINO(GDINO)指导文本条件的肿瘤定位。LoRA适应仅更新5%的模型参数,从而实现计算高效的领域适应,同时保留预训练的跨模态知识。预测的边界框被用作MedSAM生成像素级肿瘤掩膜的提示,而无需任何额外的微调。在BRISC 2025上,将冻结的MedSAM以LoGSAM派生的先验条件化,获得了80.32%的最先进Dice分数。此外,我们在12个未见的MRI扫描上使用经过认证的放射科医师的德语口述评估了整个流程,达到了91.7%的案例级准确率。这些结果突显了通过智能利用预训练基础模型并进行最小参数更新来构建模块化的语音到分割管道的可行性。
cs.CV / 95 / 2603.17583
Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing
编辑即行动:开放词汇3D室内场景编辑的目标回归规划
Abstract
Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task. We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space. Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations. By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility - three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.
Chinese Translation
从自然语言编辑3D室内场景在概念上是简单的,但在技术上却具有挑战性。现有的开放词汇系统通常会重新生成场景的大部分内容,或依赖于破坏空间结构的图像空间编辑,导致意外的全局变化或物理上不一致的布局。这些局限性源于将编辑主要视为生成任务。我们采取了不同的观点。用户指令定义了期望的世界状态,而编辑应当是使这一状态成立的最小行动序列,同时保留其他一切。这一视角激励了Edit-As-Act框架,它将开放词汇场景编辑视为3D空间中的目标回归规划。给定源场景和自由形式的指令,Edit-As-Act预测符号目标谓词,并在EditLang中进行规划,EditLang是一种受PDDL启发的行动语言,我们为其设计了明确的前提条件和效果,用以编码支撑、接触、碰撞及其他几何关系。一个基于语言的规划器提出行动,而验证器则强制执行目标导向性、单调性和物理可行性,从而产生可解释且物理一致的变换。通过将推理与低级生成分离,Edit-As-Act实现了指令保真度、语义一致性和物理合理性——这三个标准是现有范式无法共同满足的。在E2A-Bench上,我们的基准测试涵盖了9个室内环境中的63个编辑任务,Edit-As-Act在所有编辑类型和场景类别上显著优于之前的方法。
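EditLang is described as a PDDL-inspired action language with explicit preconditions and effects. Since the abstract gives no syntax, the sketch below uses a generic STRIPS-style representation (predicate names and the `place_cup` action are illustrative, not from the paper) to show how a validator can check applicability and apply an action's add/delete effects:

```python
def applicable(action, state):
    """An action is applicable when all of its preconditions hold in the state."""
    return action["pre"] <= state

def apply_action(action, state):
    """STRIPS-style effect: remove delete-effects, then add add-effects."""
    assert applicable(action, state)
    return (state - action["delete"]) | action["add"]

# hypothetical EditLang-style action: place the cup on the table
place_cup = {
    "pre": {"holding(cup)", "clear(table)"},
    "add": {"on(cup,table)"},
    "delete": {"holding(cup)"},
}
state = {"holding(cup)", "clear(table)"}
goal = {"on(cup,table)"}
new_state = apply_action(place_cup, state)
print(goal <= new_state)  # True
```

Goal-regressive planning then searches backward from the goal predicates for the minimal action sequence whose cumulative effects make them true, with the validator rejecting actions that violate geometric feasibility.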
cs.CV / 96 / 2603.17603
Trust the Unreliability: Inward Backward Dynamic Unreliability Driven Coreset Selection for Medical Image Classification
信任不可靠性:基于内向反向动态不可靠性驱动的核心集选择用于医学图像分类
Abstract
Efficiently managing and utilizing large-scale medical imaging datasets with limited resources presents significant challenges. While coreset selection helps reduce computational costs, its effectiveness in medical data remains limited due to inherent complexity, such as large intra-class variation and high inter-class similarity. To address this, we revisit the training process and observe that neural networks consistently produce stable confidence predictions and better remember samples near class centers in training. However, concentrating on these samples may complicate the modeling of decision boundaries. Hence, we argue that the more unreliable samples are, in fact, the more informative in helping build the decision boundary. Based on this, we propose the Dynamic Unreliability-Driven Coreset Selection (DUCS) strategy. Specifically, we introduce an inward-backward unreliability assessment perspective: 1) Inward Self-Awareness: The model introspects its behavior by analyzing the evolution of confidence during training, thereby quantifying the uncertainty of each sample. 2) Backward Memory Tracking: The model reflects on its training trajectory by tracking the frequency with which samples are forgotten, thus evaluating its retention ability for each sample. Next, we select unreliable samples that exhibit substantial confidence fluctuations and are repeatedly forgotten during training. This selection process ensures that the chosen samples are near the decision boundary, thereby aiding the model in refining the boundary. Extensive experiments on public medical datasets demonstrate our superior performance compared to state-of-the-art (SOTA) methods, particularly at high compression rates.
Chinese Translation
在资源有限的情况下,高效管理和利用大规模医学影像数据集面临重大挑战。尽管核心集选择有助于降低计算成本,但由于固有的复杂性,如类内变异性大和类间相似性高,其在医学数据中的有效性仍然有限。为了解决这一问题,我们重新审视训练过程,观察到神经网络在训练过程中始终产生稳定的置信度预测,并更好地记住接近类中心的样本。然而,专注于这些样本可能会使决策边界的建模变得复杂。因此,我们认为,更不可靠的样本实际上在帮助构建决策边界方面更具信息量。基于此,我们提出了动态不可靠性驱动的核心集选择策略(Dynamic Unreliability-Driven Coreset Selection,DUCS)。具体而言,我们引入了一种内向-反向不可靠性评估视角:1)内向自我意识:模型通过分析训练过程中置信度的演变来反思其行为,从而量化每个样本的不确定性。2)反向记忆追踪:模型通过追踪遗忘样本的频率来反思其训练过程,从而评估其对每个样本的记忆能力。接下来,我们选择那些在训练过程中表现出显著置信度波动并被反复遗忘的不可靠样本。该选择过程确保所选样本接近决策边界,从而帮助模型优化边界。在公共医学数据集上的大量实验表明,我们的性能优于最先进的方法,尤其是在高压缩率下。
cs.CV / 97 / 2603.17605
ReLaGS: Relational Language Gaussian Splatting
ReLaGS:关系语言高斯点云
Abstract
Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning. We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with Vision Language derived annotations and Graph Neural Network-based relational reasoning. Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter/intra-object relationships, validated across tasks including open-vocabulary segmentation, scene graph generation, and relation-guided retrieval. Project page: https://dfki-av.github.io/ReLaGS/
Chinese Translation
在分割、检索和关系理解等任务中,实现统一的三维感知和推理仍然具有挑战性,因为现有方法要么以对象为中心,要么依赖于昂贵的训练来进行对象间推理。我们提出了一种新颖的框架,构建了一个层次化的语言提炼高斯场景及其三维语义场景图,而无需特定场景的训练。高斯修剪机制精炼了场景几何形状,而稳健的多视角语言对齐策略将噪声较大的二维特征聚合为准确的三维对象嵌入。在此层次结构之上,我们构建了一个开放词汇的三维场景图,结合了视觉语言衍生的注释和基于图神经网络的关系推理。我们的方法通过联合建模层次语义和对象间/内部关系,实现了高效且可扩展的开放词汇三维推理,并在开放词汇分割、场景图生成和关系引导检索等任务中得到了验证。项目页面:https://dfki-av.github.io/ReLaGS/
cs.CV / 98 / 2603.17625
S-VGGT: Structure-Aware Subscene Decomposition for Scalable 3D Foundation Models
S-VGGT:面向结构的子场景分解以实现可扩展的3D基础模型
Abstract
Feed-forward 3D foundation models face a key challenge: the quadratic computational cost introduced by global attention, which severely limits scalability as input length increases. Concurrent acceleration methods, such as token merging, operate at the token level. While they offer local savings, the required nearest-neighbor searches introduce undesirable overhead. Consequently, these techniques fail to tackle the fundamental issue of structural redundancy dominant in dense capture data. In this work, we introduce S-VGGT, a novel approach that addresses redundancy at the structural frame level, drastically shifting the optimization focus. We first leverage the initial features to build a dense scene graph, which characterizes structural scene redundancy and guides the subsequent scene partitioning. Using this graph, we softly assign frames to a small number of subscenes, guaranteeing balanced groups and smooth geometric transitions. The core innovation lies in designing the subscenes to share a common reference frame, establishing a parallel geometric bridge that enables independent and highly efficient processing without explicit geometric alignment. This structural reorganization provides strong intrinsic acceleration by cutting the global attention cost at its source. Crucially, S-VGGT is entirely orthogonal to token-level acceleration methods, allowing the two to be seamlessly combined for compounded speedups without compromising reconstruction fidelity. Code is available at https://github.com/Powertony102/S-VGGT.
Chinese Translation
前馈3D基础模型面临一个关键挑战:全局注意力引入的二次计算成本,这在输入长度增加时严重限制了可扩展性。并行加速方法,如令牌合并,主要在令牌级别操作。虽然它们提供了局部节省,但所需的最近邻搜索引入了不必要的开销。因此,这些技术未能解决在密集捕获数据中占主导地位的结构冗余的根本问题。在本研究中,我们提出了S-VGGT,一种新颖的方法,旨在从结构帧级别解决冗余问题,彻底改变优化的重点。我们首先利用初始特征构建一个密集场景图,该图表征了结构场景冗余并指导后续的场景划分。利用该图,我们将帧柔性地分配给少量子场景,确保组的平衡和几何过渡的平滑性。核心创新在于设计子场景以共享一个共同的参考帧,建立一个平行的几何桥梁,使得在没有显式几何对齐的情况下能够独立且高效地处理。这种结构重组通过从源头削减全局注意力成本提供了强大的内在加速。重要的是,S-VGGT与令牌级别的加速方法完全正交,使得两者可以无缝结合,实现复合加速而不影响重建的保真度。代码可在 https://github.com/Powertony102/S-VGGT 获取。
cs.CV / 99 / 2603.17626
A Multi-Agent System for Building-Age Cohort Mapping to Support Urban Energy Planning
支持城市能源规划的建筑年龄群体映射多智能体系统
Abstract
Determining the age distribution of the urban building stock is crucial for sustainable municipal heat planning and upgrade prioritization. However, existing approaches often rely on datasets gathered via sensors or remote sensing techniques, leaving inconsistencies and gaps in data. We present a multi-agent LLM system comprising three key agents, the Zensus agent, the OSM agent, and the Monument agent, that fuse data from heterogeneous sources. A data orchestrator and harmonizer geocodes and deduplicates building footprints. Using this fused ground truth, we introduce BuildingAgeCNN, a satellite-only classifier based on a ConvNeXt backbone augmented with a Feature Pyramid Network (FPN), CoordConv spatial channels, and Squeeze-and-Excitation (SE) blocks. Under spatial cross validation, BuildingAgeCNN attains an overall accuracy of 90.69% but a modest macro-F1 of 67.25%, reflecting strong class imbalance and persistent confusions between adjacent historical cohorts. To mitigate risk for planning applications, the address-to prediction pipeline includes calibrated confidence estimates and flags low-confidence cases for manual review. This multi-agent LLM system not only assists in gathering structured data but also helps energy demand planners optimize district-heating networks and target low-carbon sustainable energy systems.
Chinese Translation
确定城市建筑存量的年龄分布对于可持续的市政热能规划和升级优先级的确定至关重要。然而,现有的方法通常依赖于通过传感器或遥感技术收集的数据集,这导致数据存在不一致和缺口。我们提出了一种多智能体大语言模型(LLM)系统,包含三个关键代理:Zensus 代理、OSM 代理和 Monument 代理,能够融合来自异构来源的数据。数据协调器和统一器对建筑印记进行地理编码和去重。基于融合的真实数据,我们引入了 BuildingAgeCNN,这是一种仅基于卫星图像的分类器,采用 ConvNeXt 主干,并增强了特征金字塔网络(FPN)、CoordConv 空间通道和挤压与激励(SE)模块。在空间交叉验证下,BuildingAgeCNN 达到了 90.69% 的总体准确率,但宏观 F1 值仅为 67.25%,反映出类别不平衡和相邻历史群体之间的持续混淆。为了降低规划应用的风险,地址到预测的管道包括经过校准的置信度估计,并标记低置信度案例以供人工审核。该多智能体 LLM 系统不仅有助于收集结构化数据,还帮助能源需求规划者优化区域供热网络并针对低碳可持续能源系统进行目标设定。
cs.CV / 100 / 2603.17647
Part-Aware Open-Vocabulary 3D Affordance Grounding via Prototypical Semantic and Geometric Alignment
基于原型语义和几何对齐的部件感知开放词汇3D功能基础定位
Abstract
Grounding natural language questions to functionally relevant regions in 3D objects -- termed language-driven 3D affordance grounding -- is essential for embodied intelligence and human-AI interaction. Existing methods, while progressing from label-based to language-driven approaches, still face challenges in open-vocabulary generalization, fine-grained geometric alignment, and part-level semantic consistency. To address these issues, we propose a novel two-stage cross-modal framework that enhances both semantic and geometric representations for open-vocabulary 3D affordance grounding. In the first stage, large language models generate part-aware instructions to recover missing semantics, enabling the model to link semantically similar affordances. In the second stage, we introduce two key components: Affordance Prototype Aggregation (APA), which captures cross-object geometric consistency for each affordance, and Intra-Object Relational Modeling (IORM), which refines geometric differentiation within objects to support precise semantic alignment. We validate the effectiveness of our method through extensive experiments on a newly introduced benchmark, as well as two existing benchmarks, demonstrating superior performance in comparison with existing methods.
Chinese Translation
将自然语言问题定位到3D物体中功能相关区域——称为语言驱动的3D功能基础定位——对于具身智能和人机交互至关重要。现有方法虽然在从基于标签到语言驱动的方法上取得了进展,但仍面临开放词汇泛化、细粒度几何对齐和部件级语义一致性等挑战。为了解决这些问题,我们提出了一种新颖的两阶段跨模态框架,增强开放词汇3D功能基础定位的语义和几何表示。在第一阶段,大型语言模型生成部件感知指令以恢复缺失的语义,使模型能够链接语义相似的功能。在第二阶段,我们引入两个关键组件:功能原型聚合(Affordance Prototype Aggregation, APA),用于捕捉每个功能的跨物体几何一致性,以及物体内部关系建模(Intra-Object Relational Modeling, IORM),用于精细化物体内部的几何差异,以支持精确的语义对齐。我们通过在新引入的基准和两个现有基准上的广泛实验验证了我们方法的有效性,显示出相较于现有方法的优越性能。
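Affordance Prototype Aggregation (APA) is said to capture cross-object geometric consistency per affordance. One plausible minimal reading, with illustrative labels and shapes (the paper's actual aggregation may be more elaborate), is a per-affordance mean of point features pooled across objects:

```python
import numpy as np

def affordance_prototypes(features, labels, n_affordances):
    """One prototype per affordance: the mean of the point features that carry
    that affordance label, pooled across all objects."""
    protos = np.zeros((n_affordances, features.shape[1]))
    for a in range(n_affordances):
        mask = labels == a
        if mask.any():
            protos[a] = features[mask].mean(axis=0)
    return protos

# toy point features for two affordances (0 = "grasp", 1 = "pour"; names illustrative)
features = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
labels = np.array([0, 0, 1, 1])
protos = affordance_prototypes(features, labels, n_affordances=2)
print(protos)  # [[2. 0.] [0. 3.]]
```

Matching query features against such prototypes is what lets semantically similar affordances share geometric evidence across object categories.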
cs.CV / 101 / 2603.17651
Anchoring and Rescaling Attention for Semantically Coherent Inbetweening
锚定与重标定注意力以实现语义连贯的插值
Abstract
Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation. As sequences become sparser and motions larger, previous GI models produce inconsistent frames with unstable pacing and semantic misalignment. Since GI involves fixed endpoints and numerous plausible paths, this task requires additional guidance gained from the keyframes and text to specify the intended path. Thus, we give semantic and temporal guidance from the keyframes and text onto each intermediate frame through Keyframe-anchored Attention Bias. We also better enforce frame consistency with Rescaled Temporal RoPE, which allows self-attention to attend to keyframes more faithfully. In addition, we introduce TGI-Bench, the first benchmark specifically designed for text-conditioned GI evaluation, enabling challenge-targeted analysis of GI models. Without additional training, our method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.
Chinese Translation
生成插值(Generative Inbetweening, GI)旨在合成现实的中间帧,超越单纯的插值,连接首尾关键帧。随着序列变得稀疏且运动幅度增大,之前的GI模型在处理不一致的帧时面临着节奏不稳定和语义错位的问题。由于GI涉及固定的端点和众多合理的路径,这一任务需要从关键帧和文本中获得额外的指导,以明确指定预期路径。因此,我们通过关键帧锚定注意力偏差(Keyframe-anchored Attention Bias)为每个中间帧提供来自关键帧和文本的语义和时间指导。我们还通过重标定时间RoPE(Rescaled Temporal RoPE)更好地加强帧的一致性,使自注意力能够更真实地关注关键帧。TGI-Bench是首个专门为文本条件下的GI评估设计的基准,能够进行针对挑战的评估以分析GI模型。在没有额外训练的情况下,我们的方法在短序列和长序列的多种挑战中实现了最先进的帧一致性、语义保真度和节奏稳定性。
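The Keyframe-anchored Attention Bias above amounts to an additive term on the attention logits that boosts every token's attention toward keyframe tokens. A minimal single-head numpy sketch (the bias magnitude, masking scheme, and where the bias enters the real model are assumptions for illustration):

```python
import numpy as np

def keyframe_biased_attention(q, k, v, keyframe_mask, bias):
    """Self-attention whose logits receive an additive boost toward keyframe tokens."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits + bias * keyframe_mask          # broadcasts over query rows
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # row-wise softmax
    return w @ v

# 4 frame tokens; tokens 0 and 3 are the keyframes
q = k = np.zeros((4, 8))                 # flat logits isolate the bias term
v = np.arange(4, dtype=float)[:, None]   # token "values" 0..3
mask = np.array([1.0, 0.0, 0.0, 1.0])
out = keyframe_biased_attention(q, k, v, mask, bias=50.0)
print(out.ravel())  # every row attends ~only to keyframes -> (0 + 3) / 2 = 1.5
```

With a large bias the intermediate tokens are dominated by keyframe content; the paper presumably uses a much milder bias so intermediate frames blend keyframe guidance with their own motion.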
cs.CV / 102 / 2603.17655
Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment
可解释的跨域少样本学习与校正的目标域局部对齐
Abstract
Cross-Domain Few-Shot Learning (CDFSL) adapts models trained with large-scale general data (source domain) to downstream target domains with only scarce training data, where the research on vision-language models (e.g., CLIP) is still in the early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, although they can roughly focus on important regions in source domains. Although current works have demonstrated CLIP's shortcomings in capturing local subtle patterns, in this paper, we find that the domain gap and scarce training data further exacerbate such shortcomings, far more than for holistic patterns, which we call the local misalignment problem in CLIP-based CDFSL. To address this problem, due to the lack of supervision in aligning local visual features and text semantics, we turn to self-supervision information. Inspired by the translation task, we propose the CC-CDFSL method with cycle consistency, which translates local visual features into text features and then translates them back into visual features (and vice versa), and constrains the original features to stay close to the translated-back features. To reduce the noise introduced by the richer information in the visual modality, we further propose a Semantic Anchor mechanism, which first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mappings. Extensive experiments on various benchmarks, backbones, and fine-tuning methods show we can (1) effectively improve the local vision-language alignment, (2) enhance the interpretability of learned patterns and model decisions by visualizing patches, and (3) achieve state-of-the-art performance.
Chinese Translation
跨域少样本学习(CDFSL)旨在将使用大规模通用数据(源域)训练的模型适应于仅有稀缺训练数据的下游目标域,而在视觉-语言模型(如CLIP)的研究仍处于早期阶段。典型的下游领域,如医学诊断,需要细粒度的视觉线索以实现可解释的识别,但我们发现当前微调的CLIP模型几乎无法关注这些线索,尽管它们可以大致关注源域中的重要区域。尽管目前的研究已展示CLIP在捕捉局部细微模式方面的不足,但在本文中,我们发现领域间的差距和稀缺的训练数据进一步加剧了这种不足,远超过整体模式的不足,我们称之为CLIP基础的CDFSL中的局部不对齐问题。为了解决这个问题,由于在对齐局部视觉特征和文本语义时缺乏监督,我们转向自监督信息。受到翻译任务的启发,我们提出了具有循环一致性的CC-CDFSL方法,该方法将局部视觉特征翻译为文本特征,然后再翻译回视觉特征(反之亦然),并约束原始特征与翻译回的特征接近。为了减少视觉模态中丰富信息带来的噪声,我们进一步提出了一种语义锚机制,该机制首先增强视觉特征,以提供更大的语料库用于文本到图像的映射,然后缩小图像特征以过滤掉无关的图像到文本的映射。在各种基准、骨干网络和微调方法上的大量实验表明,我们可以(1)有效改善局部视觉-语言对齐,(2)通过可视化补丁增强学习模式和模型决策的可解释性,以及(3)实现最先进的性能。
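The cycle-consistency constraint above is straightforward to state: translate local visual features into the text space, translate them back, and penalize the round-trip error. A minimal sketch with linear translation maps (the paper's translators are presumably learned nonlinear modules; the maps here are illustrative):

```python
import numpy as np

def cycle_consistency_loss(v, to_text, to_vision):
    """Translate local visual features into the text space and back, then
    penalize the distance to the originals (the text->vision->text direction
    is symmetric and omitted here)."""
    v_back = (v @ to_text) @ to_vision
    return np.mean((v - v_back) ** 2)

rng = np.random.default_rng(0)
v = rng.standard_normal((4, 8))           # 4 local visual features, dim 8
# a perfect round trip (to_vision = inverse of to_text) gives ~zero loss
A = np.eye(8) + 0.1 * rng.standard_normal((8, 8))
loss = cycle_consistency_loss(v, A, np.linalg.inv(A))
print(round(loss, 6))  # 0.0
```

Minimizing this loss forces the vision-to-text and text-to-vision maps to be mutually consistent, which is the self-supervision signal standing in for missing local alignment labels.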
cs.CV / 103 / 2603.17662
FINER: MLLMs Hallucinate under Fine-grained Negative Queries
FINER:多模态大语言模型在细粒度负查询下的幻觉现象
Abstract
Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and ``what'' questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Code, benchmark, and models are available at https://explainableml.github.io/finer-project/.
Chinese Translation
多模态大语言模型(MLLMs)在处理幻觉现象时面临挑战,尤其是在细粒度查询方面,而现有基准测试主要集中于粗略的图像相关问题,未能充分体现这一挑战。我们引入了细粒度负查询(FIne-grained NEgative queRies,FINER),并提出了两个基准:FINER-CompreCap 和 FINER-DOCCI。通过使用 FINER,我们分析了四种场景下的幻觉现象:多对象、多属性、多关系和“什么”问题。我们的基准测试揭示,当细粒度的不匹配与图像中真实存在的元素同时出现时,MLLMs 会产生幻觉。为了解决这一问题,我们提出了 FINER-Tuning,利用直接偏好优化(Direct Preference Optimization,DPO)对基于 FINER 的数据进行调整。使用 FINER-Tuning 对四个前沿 MLLMs 进行微调,能够在我们的基准测试中实现高达 24.2% 的幻觉现象改善(InternVL3.5-14B),同时提升在八个现有幻觉评估套件上的表现,并增强在六个基准测试中的通用多模态能力。代码、基准和模型可在 https://explainableml.github.io/finer-project/ 获取。
cs.CV / 104 / 2603.17671
Few-Step Diffusion Sampling Through Instance-Aware Discretizations
通过实例感知离散化的少步扩散采样
Abstract
Diffusion and flow matching models generate high-fidelity data by simulating paths defined by Ordinary or Stochastic Differential Equations (ODEs/SDEs), starting from a tractable prior distribution. The probability flow ODE formulation enables the use of advanced numerical solvers to accelerate sampling. Orthogonal yet vital to solver design is the discretization strategy. While early approaches employed handcrafted heuristics and recent methods adopt optimization-based techniques, most existing strategies enforce a globally shared timestep schedule across all samples. This uniform treatment fails to account for instance-specific complexity in the generative process, potentially limiting performance. Motivated by controlled experiments on synthetic data, which reveal the suboptimality of global schedules under instance-specific dynamics, we propose an instance-aware discretization framework. Our method learns to adapt timestep allocations based on input-dependent priors, extending gradient-based discretization search to the conditional generative setting. Empirical results across diverse settings, including synthetic data, pixel-space diffusion, latent-space images and video flow matching models, demonstrate that our method consistently improves generation quality with marginal tuning cost compared to training and negligible inference overhead.
Chinese Translation
扩散和流匹配模型通过模拟由常微分方程(ODEs)或随机微分方程(SDEs)定义的路径,从可处理的先验分布出发,生成高保真数据。概率流ODE的公式化使得可以使用先进的数值求解器来加速采样。与求解器设计正交但至关重要的是离散化策略。早期方法采用手工设计的启发式策略,而最近的方法则采用基于优化的技术,但大多数现有策略在所有样本中强制执行全局共享的时间步调度。这种统一的处理未能考虑生成过程中实例特定的复杂性,可能限制性能。受对合成数据的受控实验的启发,这些实验揭示了在实例特定动态下全局调度的次优性,我们提出了一种实例感知离散化框架。我们的方法学习根据输入依赖的先验分配时间步,从而将基于梯度的离散化搜索扩展到条件生成设置。我们在多种设置下的实证结果,包括合成数据、像素空间扩散、潜在空间图像和视频流匹配模型,表明我们的方法在与训练相比的边际调优成本下,始终提高了生成质量,且推理开销微乎其微。
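An instance-aware discretization can be parameterized so that any logits vector yields a valid monotone schedule: a softmax over per-step logits gives the fraction of the sampling interval each step covers. A minimal sketch (the mapping from the input-dependent prior to the logits, which the paper learns, is omitted):

```python
import numpy as np

def instance_schedule(logits, t_start=1.0, t_end=0.0):
    """Map per-instance allocation logits to a monotone timestep schedule:
    softmax(logits) gives the fraction of [t_end, t_start] each step covers."""
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return np.concatenate(([t_start], t_start - (t_start - t_end) * np.cumsum(w)))

# uniform logits recover the usual evenly spaced global schedule
print(instance_schedule(np.zeros(4)))  # [1.   0.75 0.5  0.25 0.  ]
# small early logits give small early steps, i.e. finer resolution near t = 1
print(instance_schedule(np.array([-1.0, 0.0, 1.0, 2.0])))
```

Because the schedule is differentiable in the logits, gradient-based discretization search extends naturally to logits predicted per instance from the conditioning input.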
cs.CV / 105 / 2603.17675
DeepCORO-CLIP: A Multi-View Foundation Model for Comprehensive Coronary Angiography Video-Text Analysis and External Validation
DeepCORO-CLIP:用于全面冠状动脉造影视频-文本分析和外部验证的多视角基础模型
Abstract
Coronary angiography is the reference standard for evaluating coronary artery disease, yet visual interpretation remains variable between readers. Existing artificial intelligence methods typically analyze single frames or projections and focus mainly on stenosis, limiting comprehensive coronary assessment. We present DeepCORO-CLIP, a multi-view foundation model trained with video-text contrastive learning on 203,808 angiography videos from 28,117 patients across 32,473 studies at the Montreal Heart Institute and externally validated on 4,249 studies from the University of California, San Francisco. DeepCORO-CLIP integrates multiple projections with attention-based pooling for study-level assessment across diagnostic, prognostic, and disease progression tasks. For significant stenosis detection, the model achieved an AUROC of 0.888 internally and 0.89 on external validation. Mean absolute error against core laboratory quantitative coronary angiography was 13.6%, lower than clinical reports at 19.0%. The model also performed strongly for chronic total occlusion, intracoronary thrombus, and coronary calcification detection. Transfer learning enabled prediction of one-year major adverse cardiovascular events with AUROC 0.79 and estimation of left ventricular ejection fraction with mean absolute error 7.3%. Embeddings also captured disease progression across serial examinations. With a mean inference time of 4.2 seconds in hospital deployment, DeepCORO-CLIP provides a foundation for automated coronary angiography interpretation at the point of care. Code, sample data, model weights, and deployment infrastructure are publicly released.
Chinese Translation
冠状动脉造影是评估冠状动脉疾病的参考标准,但不同读者之间的视觉解读仍然存在差异。现有的人工智能方法通常分析单帧或投影,并主要关注狭窄,限制了对冠状动脉的全面评估。我们提出了DeepCORO-CLIP,这是一种多视角基础模型,采用视频-文本对比学习,基于来自蒙特利尔心脏研究所的28,117名患者的203,808个造影视频和32,473项研究进行训练,并在加利福尼亚大学旧金山分校的4,249项研究中进行了外部验证。DeepCORO-CLIP结合多种投影,通过基于注意力的池化方法进行研究级评估,涵盖诊断、预后和疾病进展任务。在显著狭窄检测方面,该模型在内部验证中达到了0.888的AUROC,在外部验证中为0.89。与核心实验室定量冠状动脉造影的平均绝对误差为13.6%,低于临床报告的19.0%。该模型在慢性完全闭塞、冠状动脉内血栓和冠状动脉钙化检测方面也表现出色。迁移学习使得对一年内主要不良心血管事件的预测达到了0.79的AUROC,并且左心室射血分数的估计平均绝对误差为7.3%。嵌入还捕捉了连续检查中的疾病进展。在医院部署中,平均推理时间为4.2秒,DeepCORO-CLIP为临床现场的自动冠状动脉造影解读提供了基础。代码、样本数据、模型权重和部署基础设施已公开发布。
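The attention-based pooling mentioned above fuses per-projection embeddings into one study-level representation. A minimal single-query sketch (the learned scoring mechanism and embedding dimensions in DeepCORO-CLIP are not specified in the abstract; shapes below are illustrative):

```python
import numpy as np

def attention_pool(view_embs, query):
    """Fuse per-projection video embeddings into one study-level embedding,
    weighting each view by its score against a learned query vector."""
    scores = view_embs @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()                      # softmax attention weights over views
    return w @ view_embs              # weighted average, shape (dim,)

rng = np.random.default_rng(0)
views = rng.standard_normal((5, 16))           # 5 angiographic projections, dim 16
pooled = attention_pool(views, np.zeros(16))   # zero query -> uniform weights
print(np.allclose(pooled, views.mean(axis=0)))  # True
```

A nonzero learned query lets the model up-weight the projections most informative for a given task (e.g., the view that best shows a stenotic segment) instead of averaging all views equally.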
cs.CV / 106 / 2603.17679
Illumination-Aware Contactless Fingerprint Spoof Detection via Paired Flash-Non-Flash Imaging
基于成对闪光-非闪光成像的照明感知无接触指纹欺骗检测
Abstract
Contactless fingerprint recognition enables hygienic and convenient biometric authentication but poses new challenges for spoof detection due to the absence of physical contact and traditional liveness cues. Most existing methods rely on single-image acquisition and appearance-based features, which often generalize poorly across devices, capture conditions, and spoof materials. In this work, we study paired flash-non-flash contactless fingerprint acquisition as a lightweight active sensing mechanism for spoof detection. Through a preliminary empirical analysis, we show that flash illumination accentuates material- and structure-dependent properties, including ridge visibility, subsurface scattering, micro-geometry, and surface oils, while non-flash images provide a baseline appearance context. We analyze lighting-induced differences using interpretable metrics such as inter-channel correlation, specular reflection characteristics, texture realism, and differential imaging. These complementary features help discriminate genuine fingerprints from printed, digital, and molded presentation attacks. We further examine the limitations of paired acquisition, including sensitivity to imaging settings, dataset scale, and emerging high-fidelity spoofs. Our findings demonstrate the potential of illumination-aware analysis to improve robustness and interpretability in contactless fingerprint presentation attack detection, motivating future work on paired acquisition and physics-informed feature design. Code is available in the repository.
Chinese Translation
无接触指纹识别实现了卫生和便利的生物识别认证,但由于缺乏物理接触和传统活性线索,给欺骗检测带来了新的挑战。现有大多数方法依赖于单幅图像采集和基于外观的特征,这往往在不同设备、采集条件和欺骗材料之间的泛化能力较差。在本研究中,我们探讨了成对闪光-非闪光无接触指纹采集作为一种轻量级主动感知机制用于欺骗检测。通过初步的实证分析,我们表明闪光照明突显了材料和结构依赖的特性,包括脊线可见性、亚表面散射、微几何结构和表面油脂,而非闪光图像提供了基线外观上下文。我们使用可解释的度量分析照明引起的差异,如通道间相关性、镜面反射特征、纹理真实感和差异成像。这些互补特征有助于区分真实指纹与打印、数字和模具呈现攻击。我们进一步考察了成对采集的局限性,包括对成像设置、数据集规模和新兴高保真欺骗的敏感性。我们的研究结果表明,照明感知分析有潜力提高无接触指纹呈现攻击检测的鲁棒性和可解释性,激励未来在成对采集和物理信息特征设计方面的研究。代码已在代码库中提供。
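The differential-imaging cue described in the abstract can be illustrated with a toy sketch. This is not the paper's code: the pixel values, the `pearson` and `differential_image` helpers, and the `spread` statistic are illustrative assumptions. The idea it demonstrates is that subtracting the non-flash frame from the flash frame retains ridge structure for a genuine finger, while a uniformly reflecting printed spoof yields a nearly flat difference.

```python
# Toy sketch (hypothetical helpers and values, not the paper's code):
# two interpretable cues from paired flash/non-flash capture.

def pearson(a, b):
    """Inter-channel (here, inter-frame) Pearson correlation."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def differential_image(flash, no_flash):
    """Per-pixel flash minus non-flash; for a genuine finger the flash
    accentuates ridges, so the difference carries structure."""
    return [f - n for f, n in zip(flash, no_flash)]

# Toy 1-D "images": a genuine ridge pattern responds to flash,
# a flat printed spoof only brightens uniformly.
no_flash_genuine = [10, 30, 10, 30, 10, 30]
flash_genuine    = [15, 55, 15, 55, 15, 55]   # ridges amplified
no_flash_spoof   = [20, 21, 20, 21, 20, 21]
flash_spoof      = [24, 25, 24, 25, 24, 25]   # uniform brightening

diff_genuine = differential_image(flash_genuine, no_flash_genuine)
diff_spoof   = differential_image(flash_spoof, no_flash_spoof)

# Spread of the differential image as a crude liveness cue.
spread = lambda d: max(d) - min(d)
```

In this toy setup the genuine pair produces a structured difference image while the spoof's difference is flat, which is the kind of material-dependent separation the illumination-aware features exploit.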
cs.CV / 107 / 2603.17680
WeatherReasonSeg: A Benchmark for Weather-Aware Reasoning Segmentation in Visual Language Models
WeatherReasonSeg:面向天气感知推理分割的视觉语言模型基准
Abstract
Existing vision-language models (VLMs) have demonstrated impressive performance in reasoning-based segmentation. However, current benchmarks are primarily constructed from high-quality images captured under idealized conditions. This raises a critical question: when visual cues are severely degraded by adverse weather conditions such as rain, snow, or fog, can VLMs sustain reliable reasoning segmentation capabilities? In response to this challenge, we introduce WeatherReasonSeg, a benchmark designed to evaluate VLM performance in reasoning-based segmentation under adverse weather conditions. It consists of two complementary components. First, we construct a controllable reasoning dataset by applying synthetic weather with varying severity levels to existing segmentation datasets, enabling fine-grained robustness analysis. Second, to capture real-world complexity, we curate a real-world adverse-weather reasoning segmentation dataset with semantically consistent queries generated via mask-guided LLM prompting. We further broaden the evaluation scope across five reasoning dimensions, including functionality, application scenarios, structural attributes, interactions, and requirement matching. Extensive experiments across diverse VLMs reveal two key findings: (1) VLM performance degrades monotonically with increasing weather severity, and (2) different weather types induce distinct vulnerability patterns. We hope WeatherReasonSeg will serve as a foundation for advancing robust, weather-aware reasoning.
Chinese Translation
现有的视觉语言模型(VLMs)在基于推理的分割任务中表现出色。然而,目前的基准主要是基于在理想条件下捕获的高质量图像构建的。这引发了一个关键问题:当视觉线索因恶劣天气条件(如雨、雪或雾)而严重退化时,VLMs能否维持可靠的推理分割能力?为应对这一挑战,我们提出了WeatherReasonSeg,这是一个旨在评估VLM在恶劣天气条件下基于推理的分割性能的基准。该基准由两个互补的组成部分构成。首先,我们通过对现有分割数据集施加不同严重程度的合成天气,构建了一个可控的推理数据集,从而实现细粒度的鲁棒性分析。其次,为了捕捉现实世界的复杂性,我们策划了一个真实世界的恶劣天气推理分割数据集,该数据集通过掩码引导的LLM提示生成语义一致的查询。我们进一步扩展了评估范围,涵盖功能性、应用场景、结构属性、交互和需求匹配等五个推理维度。对多种VLM的广泛实验揭示了两个关键发现:(1)随着天气严重程度的增加,VLM的性能单调下降;(2)不同类型的天气会引发不同的脆弱性模式。我们希望WeatherReasonSeg能够为推动鲁棒的天气感知推理奠定基础。
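The abstract's "synthetic weather with varying severity levels" can be sketched with the standard atmospheric-scattering model; the paper does not specify its synthesis pipeline, so the formula, the `beta` severity knob, and all values below are illustrative assumptions: `degraded = clean * t + airlight * (1 - t)` with transmission `t = exp(-beta * depth)`.

```python
# Minimal sketch (assumed model, not the benchmark's actual pipeline):
# controllable-severity fog via atmospheric scattering.
import math

def add_fog(pixels, depth, beta, airlight=255.0):
    """beta controls severity: larger beta -> denser fog."""
    out = []
    for p, d in zip(pixels, depth):
        t = math.exp(-beta * d)          # per-pixel transmission
        out.append(p * t + airlight * (1.0 - t))
    return out

clean = [50.0, 120.0, 200.0]
depth = [1.0, 2.0, 3.0]                  # toy per-pixel depth map

mild   = add_fog(clean, depth, beta=0.1)
severe = add_fog(clean, depth, beta=1.0)

# Contrast (max - min) shrinks monotonically with severity, which is
# the degradation axis such a benchmark can sweep for robustness analysis.
contrast = lambda img: max(img) - min(img)
```

Sweeping `beta` gives the fine-grained severity levels over which VLM segmentation robustness can be measured.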
cs.CV / 108 / 2603.17684
Does YOLO Really Need to See Every Training Image in Every Epoch?
YOLO真的需要在每个周期中查看每张训练图像吗?
Abstract
YOLO detectors are known for their fast inference speed, yet training them remains unexpectedly time-consuming due to their exhaustive pipeline that processes every training image in every epoch, even when many images have already been sufficiently learned. This stands in clear contrast to the efficiency suggested by the ``You Only Look Once'' philosophy. This naturally raises an important question: \textit{Does YOLO really need to see every training image in every epoch?} To explore this, we propose an Anti-Forgetting Sampling Strategy (AFSS) that dynamically determines which images should be used and which can be skipped during each epoch, allowing the detector to learn more effectively and efficiently. Specifically, AFSS measures the learning sufficiency of each training image as the minimum of its detection recall and precision, and dynamically categorizes training images into easy, medium, or hard levels accordingly. Easy training images are sparsely resampled during training in a continuous review manner, with priority given to those that have not been used for a long time to reduce redundancy and prevent forgetting. Moderate training images are partially selected, prioritizing recently unused ones and randomly choosing the rest from unselected images to ensure coverage and prevent forgetting. Hard training images are fully sampled in every epoch to ensure sufficient learning. The learning sufficiency of each training image is periodically updated, enabling detectors to adaptively shift its focus toward the informative training images over time while progressively discarding redundant ones. On widely used natural image detection benchmarks (MS COCO 2017 and PASCAL VOC 2007) and remote sensing detection datasets (DOTA-v1.0 and DIOR-R), AFSS achieves more than $1.43\times$ training speedup for YOLO-series detectors while also improving accuracy.
Chinese Translation
YOLO检测器以其快速的推理速度而闻名,但由于其在每个周期中处理每张训练图像的穷举式流程,即使许多图像已经被充分学习,训练它们仍然出乎意料地耗时。这与“You Only Look Once”(YOLO)理念所暗示的效率形成了明显对比。这自然引发了一个重要问题:\textit{YOLO真的需要在每个周期中查看每张训练图像吗?}为了探讨这一问题,我们提出了一种反遗忘采样策略(Anti-Forgetting Sampling Strategy, AFSS),该策略动态确定在每个周期中应使用哪些图像以及可以跳过哪些图像,从而使检测器能够更有效、更高效地学习。具体而言,AFSS通过检测召回率和精度的最小值来衡量每张训练图像的学习充分性,并相应地将训练图像动态分类为简单、中等或困难级别。简单训练图像在训练过程中以连续审查的方式稀疏重采样,优先考虑那些长时间未使用的图像,以减少冗余并防止遗忘。中等训练图像部分选择,优先考虑最近未使用的图像,并从未选择的图像中随机选择其余图像,以确保覆盖并防止遗忘。困难训练图像在每个周期中完全采样,以确保充分学习。每张训练图像的学习充分性定期更新,使检测器能够随着时间的推移自适应地将重点转向信息丰富的训练图像,同时逐步丢弃冗余图像。在广泛使用的自然图像检测基准(MS COCO 2017和PASCAL VOC 2007)和遥感检测数据集(DOTA-v1.0和DIOR-R)上,AFSS为YOLO系列检测器实现了超过$1.43\times$的训练加速,同时提高了准确性。
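The AFSS rule summarized above (sufficiency = min of per-image recall and precision, then easy/medium/hard bucketing with bucket-dependent sampling) can be sketched in a few lines. The thresholds (0.8 / 0.5) and per-bucket keep rates are illustrative assumptions, and the sketch omits the paper's "least recently used" prioritization within the easy and medium buckets.

```python
# Minimal sketch of the AFSS categorization and per-epoch sampling
# (thresholds and keep rates are assumptions, not the paper's values).
import random

def sufficiency(recall, precision):
    """Learning sufficiency of one training image."""
    return min(recall, precision)

def categorize(s, easy_th=0.8, hard_th=0.5):
    if s >= easy_th:
        return "easy"
    if s >= hard_th:
        return "medium"
    return "hard"

def epoch_sample(images, rng, easy_rate=0.1, medium_rate=0.5):
    """Hard images are always used; medium/easy images are subsampled,
    mimicking AFSS's sparse resampling of well-learned images."""
    chosen = []
    for name, rec, prec in images:
        level = categorize(sufficiency(rec, prec))
        keep = {"hard": 1.0, "medium": medium_rate, "easy": easy_rate}[level]
        if rng.random() < keep:
            chosen.append(name)
    return chosen

images = [("a.jpg", 0.95, 0.9), ("b.jpg", 0.6, 0.7), ("c.jpg", 0.2, 0.4)]
picked = epoch_sample(images, random.Random(0))
```

Because only the hard bucket is fully sampled every epoch, the effective epoch size shrinks as more images become sufficiently learned, which is where the reported speedup comes from.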
cs.CV / 109 / 2603.17693
Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos
通过合成视频学习可转移的时间原语以进行视频推理
Abstract
The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives, constructing 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Remarkably, our 7.7K synthetic CoT samples outperform Video-R1 with 165K real-world samples. We attribute this to fundamental temporal skills, such as tracking frame by frame changes and comparing velocity, that transfer effectively from abstract synthetic patterns to complex real-world scenarios. This establishes a new paradigm for video post-training: video temporal learning through carefully designed synthetic data provides a more cost efficient scaling path.
Chinese Translation
从图像到视频理解的转变要求视觉-语言模型(VLMs)从识别静态模式转向对时间动态进行推理,例如运动轨迹、速度变化和状态转变。然而,当前的后训练方法由于两个关键限制而显得不足:(1)现有数据集往往缺乏以时间为中心的特性,答案可以从孤立的关键帧中推断,而不需要整体的时间整合;(2)由专有模型生成的训练数据在基本的时间感知方面存在系统性错误,例如混淆运动方向或错误判断速度。我们提出了SynRL,一个后训练框架,旨在教会模型时间原语,即时间理解的基本构建块,包括方向、速度和状态跟踪。我们的关键见解是,这些从程序生成的合成视频中学习到的抽象原语能够有效地迁移到现实世界场景中。我们将时间理解分解为短期感知原语(速度、方向)和长期认知原语,通过基于代码的视频生成构建了7.7K的CoT和7K的RL样本,并提供了真实的帧级注释。尽管在简单的几何形状上进行训练,SynRL在涵盖时间基础、复杂推理和一般视频理解的15个基准测试中取得了显著的改进。值得注意的是,我们的7.7K合成CoT样本在性能上超越了包含165K真实世界样本的Video-R1。我们将这一成就归因于基本的时间技能,例如逐帧跟踪变化和比较速度,这些技能能够有效地从抽象的合成模式迁移到复杂的现实世界场景。这为视频后训练建立了一个新的范式:通过精心设计的合成数据进行视频时间学习提供了一条更具成本效益的扩展路径。
cs.CV / 110 / 2603.17705
Parameter-Efficient Modality-Balanced Symmetric Fusion for Multimodal Remote Sensing Semantic Segmentation
参数高效的模态平衡对称融合用于多模态遥感语义分割
Abstract
Multimodal remote sensing semantic segmentation enhances scene interpretation by exploiting complementary physical cues from heterogeneous data. Although pretrained Vision Foundation Models (VFMs) provide strong general-purpose representations, adapting them to multimodal tasks often incurs substantial computational overhead and is prone to modality imbalance, where the contribution of auxiliary modalities is suppressed during optimization. To address these challenges, we propose MoBaNet, a parameter-efficient and modality-balanced symmetric fusion framework. Built upon a largely frozen VFM backbone, MoBaNet adopts a symmetric dual-stream architecture to preserve generalizable representations while minimizing the number of trainable parameters. Specifically, we design a Cross-modal Prompt-Injected Adapter (CPIA) to enable deep semantic interaction by generating shared prompts and injecting them into bottleneck adapters under the frozen backbone. To obtain compact and discriminative multimodal representations for decoding, we further introduce a Difference-Guided Gated Fusion Module (DGFM), which adaptively fuses paired stage features by explicitly leveraging cross-modal discrepancy to guide feature selection. Furthermore, we propose a Modality-Conditional Random Masking (MCRM) strategy to mitigate modality imbalance by masking one modality only during training and imposing hard-pixel auxiliary supervision on modality-specific branches. Extensive experiments on the ISPRS Vaihingen and Potsdam benchmarks demonstrate that MoBaNet achieves state-of-the-art performance with significantly fewer trainable parameters than full fine-tuning, validating its effectiveness for robust and balanced multimodal fusion. The source code in this work is available at https://github.com/sauryeo/MoBaNet.
Chinese Translation
多模态遥感语义分割通过利用异构数据中的互补物理线索来增强场景解释。尽管预训练的视觉基础模型(Vision Foundation Models, VFMs)提供了强大的通用表示,但将其适应于多模态任务往往会带来显著的计算开销,并且容易出现模态不平衡,即在优化过程中辅助模态的贡献受到抑制。为了解决这些挑战,我们提出了MoBaNet,一个参数高效且模态平衡的对称融合框架。MoBaNet建立在一个大部分被冻结的VFM骨干网络之上,采用对称双流架构来保留可泛化的表示,同时最小化可训练参数的数量。具体而言,我们设计了一个跨模态提示注入适配器(Cross-modal Prompt-Injected Adapter, CPIA),通过生成共享提示并将其注入到冻结骨干网络下的瓶颈适配器中,以实现深层语义交互。为了获得紧凑且具有区分性的多模态表示用于解码,我们进一步引入了一个差异引导门控融合模块(Difference-Guided Gated Fusion Module, DGFM),该模块通过明确利用跨模态差异来指导特征选择,从而自适应地融合配对阶段特征。此外,我们提出了一种模态条件随机掩蔽(Modality-Conditional Random Masking, MCRM)策略,通过在训练期间仅掩蔽一种模态并对模态特定分支施加硬像素辅助监督,以减轻模态不平衡。在ISPRS Vaihingen和波茨坦基准测试上的广泛实验表明,MoBaNet以显著更少的可训练参数实现了最先进的性能,验证了其在稳健和平衡的多模态融合中的有效性。本研究的源代码可在 https://github.com/sauryeo/MoBaNet 获取。
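The Modality-Conditional Random Masking (MCRM) idea above — mask one modality only during training so the network cannot over-rely on the dominant input — can be sketched as follows. The masking probability and the 50/50 branch choice are illustrative assumptions; the paper's hard-pixel auxiliary supervision is represented only by returning which branch was masked.

```python
# Minimal sketch of MCRM-style masking (probabilities are assumptions).
import random

def mcrm(optical, auxiliary, rng, p_mask=0.5):
    """With probability p_mask, zero out exactly one modality.
    Returns (optical, auxiliary, masked_branch) so auxiliary
    supervision can target the modality-specific branch."""
    if rng.random() >= p_mask:
        return optical, auxiliary, None          # no masking this step
    if rng.random() < 0.5:
        return [0.0] * len(optical), auxiliary, "optical"
    return optical, [0.0] * len(auxiliary), "auxiliary"

rng = random.Random(7)
opt, aux = [0.3, 0.9], [0.5, 0.1]
steps = [mcrm(opt, aux, rng) for _ in range(200)]
masked_steps = [s for s in steps if s[2] is not None]
```

Never masking both modalities at once keeps every training step solvable while still forcing each branch to carry the task alone part of the time, which is the intended counter to modality imbalance.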
cs.CV / 111 / 2603.17715
Eye image segmentation using visual and concept prompts with Segment Anything Model 3 (SAM3)
使用视觉和概念提示的眼部图像分割与Segment Anything Model 3 (SAM3)
Abstract
Previous work has reported that vision foundation models show promising zero-shot performance in eye image segmentation. Here we examine whether the latest iteration of the Segment Anything Model, SAM3, offers better eye image segmentation performance than SAM2, and explore the performance of its new concept (text) prompting mode. Eye image segmentation performance was evaluated using diverse datasets encompassing both high-resolution high-quality videos from a lab environment and the TEyeD dataset consisting of challenging eye videos acquired in the wild. Results show that in most cases SAM3 with either visual or concept prompts did not perform better than SAM2, for both lab and in-the-wild datasets. Since SAM2 not only performed better but was also faster, we conclude that SAM2 remains the best option for eye image segmentation. We provide our adaptation of SAM3's codebase that allows processing videos of arbitrary duration.
Chinese Translation
先前的研究报告显示,视觉基础模型在眼部图像分割中表现出良好的零样本性能。在此,我们考察了最新版本的Segment Anything Model(SAM3)是否在眼部图像分割性能上优于SAM2,并探索其新的概念(文本)提示模式的表现。眼部图像分割性能的评估使用了多样化的数据集,包括来自实验室环境的高分辨率高质量视频以及包含在野外获取的具有挑战性的眼部视频的TEyeD数据集。结果显示,在大多数情况下,使用视觉或概念提示的SAM3在实验室和野外数据集上的表现均未优于SAM2。由于SAM2不仅表现更好,而且速度更快,我们得出结论,SAM2仍然是眼部图像分割的最佳选择。我们提供了SAM3代码库的适配版本,允许处理任意时长的视频。
cs.CV / 112 / 2603.17718
DiffVP: Differential Visual Semantic Prompting for LLM-Based CT Report Generation
DiffVP:基于差异视觉语义提示的LLM驱动CT报告生成
Abstract
While large language models (LLMs) have advanced CT report generation, existing methods typically encode 3D volumes holistically, failing to distinguish informative cues from redundant anatomical background. Inspired by radiological cognitive subtraction, we propose Differential Visual Prompting (DiffVP), which conditions report generation on explicit, high-level semantic scan-to-reference differences rather than solely on absolute visual features. DiffVP employs a hierarchical difference extractor to capture complementary global and local semantic discrepancies into a shared latent space, along with a difference-to-prompt generator that transforms these signals into learnable visual prefix tokens for LLM conditioning. These difference prompts serve as structured conditioning signals that implicitly suppress invariant anatomy while amplifying diagnostically relevant visual evidence, thereby facilitating accurate report generation without explicit lesion localization. On two large-scale benchmarks, DiffVP consistently outperforms prior methods, improving the average BLEU-1-4 by +10.98 and +4.36, respectively, and further boosts clinical efficacy on RadGenome-ChestCT (F1 score 0.421). All codes will be released at https://github.com/ArielTYH/DiffVP/.
Chinese Translation
尽管大型语言模型(LLMs)在CT报告生成方面取得了进展,但现有方法通常整体编码3D体积,未能区分信息性线索与冗余的解剖背景。受放射学认知减法的启发,我们提出了差异视觉提示(DiffVP),该方法在报告生成中基于显式的高层次语义扫描与参考之间的差异,而不仅仅依赖于绝对视觉特征。DiffVP采用层次差异提取器,将互补的全局和局部语义差异捕捉到共享的潜在空间中,并配备差异到提示生成器,将这些信号转化为可学习的视觉前缀标记,以便于LLM的条件生成。这些差异提示作为结构化的条件信号,隐式抑制不变的解剖结构,同时增强与诊断相关的视觉证据,从而在没有明确病灶定位的情况下促进准确的报告生成。在两个大规模基准测试中,DiffVP始终优于先前的方法,平均BLEU-1-4分别提高了+10.98和+4.36,并进一步提升了RadGenome-ChestCT的临床效能(F1分数为0.421)。所有代码将发布在https://github.com/ArielTYH/DiffVP/。
cs.CV / 113 / 2603.17729
SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition
SARE:一种用于无训练细粒度视觉识别的样本自适应推理
Abstract
Recent advances in Large Vision-Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories. Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations: (1) They apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, thereby leading to suboptimal accuracy and efficiency; (2) The lack of mechanisms to consolidate and reuse error-specific experience causes repeated failures on similar challenging cases. To address these limitations, we propose SARE, a Sample-wise Adaptive \textbf{RE}asoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. In the reasoning process, SARE incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference, without any parameter updates. Extensive experiments across 14 datasets substantiate that SARE achieves state-of-the-art performance while substantially reducing computational overhead.
Chinese Translation
最近,大型视觉-语言模型(LVLMs)的进展使得无训练的细粒度视觉识别(FGVR)成为可能。然而,由于下级类别固有的视觉模糊性,有效利用LVLMs进行FGVR仍然具有挑战性。现有方法主要采用检索导向或推理导向的范式来应对这一挑战,但这两者都受到两个基本限制的制约:(1)它们对所有样本应用相同的推理流程,而没有考虑到识别难度的不均匀性,从而导致准确性和效率的次优;(2)缺乏巩固和重用特定错误经验的机制,导致在类似的挑战性案例上重复失败。为了解决这些限制,我们提出了SARE,一种用于无训练FGVR的样本自适应推理框架。具体而言,SARE采用级联设计,结合快速候选检索与细粒度推理,仅在必要时调用后者。在推理过程中,SARE结合了一种自我反思经验机制,利用过去的失败在推理过程中提供可转移的区分指导,而无需任何参数更新。通过在14个数据集上的广泛实验,证明SARE在显著降低计算开销的同时实现了最先进的性能。
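The cascaded design described above — a cheap retrieval stage answers confident samples directly, and only uncertain ones are escalated to expensive fine-grained reasoning — can be sketched as a confidence-gated dispatcher. The threshold, the score format, and both stage stand-ins are illustrative assumptions, not the paper's components.

```python
# Minimal sketch of a retrieval-then-reasoning cascade
# (threshold and stage implementations are toy stand-ins).

def retrieve(scores):
    """Fast stage: pick the best-scoring candidate label."""
    label = max(scores, key=scores.get)
    return label, scores[label]

def cascade(samples, reason_fn, conf_th=0.7):
    """Invoke the expensive reasoning stage only when retrieval
    confidence falls below conf_th."""
    results, escalated = [], 0
    for scores in samples:
        label, conf = retrieve(scores)
        if conf < conf_th:               # hard sample -> fine-grained reasoning
            label = reason_fn(scores)
            escalated += 1
        results.append(label)
    return results, escalated

samples = [
    {"sparrow": 0.9, "finch": 0.1},      # easy: retrieval suffices
    {"sparrow": 0.45, "finch": 0.55},    # ambiguous: escalate
]
labels, n_escalated = cascade(samples, reason_fn=lambda s: "finch")
```

The efficiency gain comes from `n_escalated` being a small fraction of the dataset in practice, so the expensive stage runs only on genuinely ambiguous fine-grained cases.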
cs.CV / 114 / 2603.17735
TAPESTRY: From Geometry to Appearance via Consistent Turntable Videos
TAPESTRY:通过一致的转盘视频从几何到外观
Abstract
Automatically generating photorealistic and self-consistent appearances for untextured 3D models is a critical challenge in digital content creation. The advancement of large-scale video generation models offers a natural approach: directly synthesizing 360-degree turntable videos (TTVs), which can serve not only as high-quality dynamic previews but also as an intermediate representation to drive texture synthesis and neural rendering. However, existing general-purpose video diffusion models struggle to maintain strict geometric consistency and appearance stability across the full range of views, making their outputs ill-suited for high-quality 3D reconstruction. To this end, we introduce TAPESTRY, a framework for generating high-fidelity TTVs conditioned on explicit 3D geometry. We reframe the 3D appearance generation task as a geometry-conditioned video diffusion problem: given a 3D mesh, we first render and encode multi-modal geometric features to constrain the video generation process with pixel-level precision, thereby enabling the creation of high-quality and consistent TTVs. Building upon this, we also design a method for downstream reconstruction tasks from the TTV input, featuring a multi-stage pipeline with 3D-Aware Inpainting. By rotating the model and performing a context-aware secondary generation, this pipeline effectively completes self-occluded regions to achieve full surface coverage. The videos generated by TAPESTRY are not only high-quality dynamic previews but also serve as a reliable, 3D-aware intermediate representation that can be seamlessly back-projected into UV textures or used to supervise neural rendering methods like 3DGS. This enables the automated creation of production-ready, complete 3D assets from untextured meshes. Experimental results demonstrate that our method outperforms existing approaches in both video consistency and final reconstruction quality.
Chinese Translation
自动生成无纹理3D模型的照片级真实感和自一致性外观是数字内容创作中的一项关键挑战。大规模视频生成模型的进展提供了一种自然的方法:直接合成360度转盘视频(TTVs),这些视频不仅可以作为高质量的动态预览,还可以作为驱动纹理合成和神经渲染的中间表示。然而,现有的通用视频扩散模型在保持严格的几何一致性和外观稳定性方面面临困难,使得它们的输出不适合高质量的3D重建。为此,我们提出了TAPESTRY,一个基于显式3D几何条件生成高保真TTV的框架。我们将3D外观生成任务重新框定为一个几何条件的视频扩散问题:给定一个3D网格,我们首先渲染并编码多模态几何特征,以像素级精度约束视频生成过程,从而实现高质量和一致的TTV生成。在此基础上,我们还设计了一种从TTV输入进行下游重建任务的方法,采用具有3D感知修复的多阶段管道。通过旋转模型并执行上下文感知的二次生成,该管道有效地完成自遮挡区域,实现全面的表面覆盖。TAPESTRY生成的视频不仅是高质量的动态预览,还作为可靠的3D感知中间表示,可以无缝地反投影到UV纹理中,或用于监督神经渲染方法如3DGS。这使得能够从无纹理网格自动创建生产就绪的完整3D资产。实验结果表明,我们的方法在视频一致性和最终重建质量上均优于现有方法。
cs.CV / 115 / 2603.17746
Concept-to-Pixel: Prompt-Free Universal Medical Image Segmentation
概念到像素:无提示通用医学图像分割
Abstract
Universal medical image segmentation seeks to use a single foundational model to handle diverse tasks across multiple imaging modalities. However, existing approaches often rely heavily on manual visual prompts or retrieved reference images, which limits their automation and robustness. In addition, naive joint training across modalities often fails to address large domain shifts. To address these limitations, we propose Concept-to-Pixel (C2P), a novel prompt-free universal segmentation framework. C2P explicitly separates anatomical knowledge into two components: Geometric and Semantic representations. It leverages Multimodal Large Language Models (MLLMs) to distill abstract, high-level medical concepts into learnable Semantic Tokens and introduces explicitly supervised Geometric Tokens to enforce universal physical and structural constraints. These disentangled tokens interact deeply with image features to generate input-specific dynamic kernels for precise mask prediction. Furthermore, we introduce a Geometry-Aware Inference Consensus mechanism, which utilizes the model's predicted geometric constraints to assess prediction reliability and suppress outliers. Extensive experiments and analysis on a unified benchmark comprising eight diverse datasets across seven modalities demonstrate the significant superiority of our jointly trained approach, compared to universal or single-model approaches. Remarkably, our unified model demonstrates strong generalization, achieving impressive results not only on zero-shot tasks involving unseen cases but also in cross-modal transfers across similar tasks. Code is available at: https://github.com/Yundi218/Concept-to-Pixel
Chinese Translation
通用医学图像分割旨在使用单一基础模型处理多种成像模态下的多样任务。然而,现有方法往往过于依赖手动视觉提示或检索的参考图像,这限制了它们的自动化和鲁棒性。此外,跨模态的简单联合训练通常无法有效应对较大的领域转移。为了解决这些局限性,我们提出了概念到像素(Concept-to-Pixel, C2P),一种新颖的无提示通用分割框架。C2P明确将解剖知识分为两个组成部分:几何表示和语义表示。它利用多模态大语言模型(Multimodal Large Language Models, MLLMs)将抽象的高层医学概念提炼为可学习的语义标记,并引入显式监督的几何标记以强制执行通用的物理和结构约束。这些解耦的标记与图像特征深度交互,生成特定输入的动态核以实现精确的掩膜预测。此外,我们引入了一种几何感知推理共识机制,利用模型预测的几何约束来评估预测的可靠性并抑制异常值。在包含七种模态下八个多样数据集的统一基准上进行的大量实验和分析表明,与通用或单一模型方法相比,我们的联合训练方法具有显著的优势。值得注意的是,我们的统一模型展现出强大的泛化能力,不仅在涉及未见案例的零样本任务中取得了令人印象深刻的结果,而且在类似任务的跨模态转移中也表现优异。代码可在以下链接获取:https://github.com/Yundi218/Concept-to-Pixel
cs.CV / 116 / 2603.17753
PC-CrossDiff: Point-Cluster Dual-Level Cross-Modal Differential Attention for Unified 3D Referring and Segmentation
PC-CrossDiff:统一3D指称与分割的点-簇双层跨模态差异注意力
Abstract
3D Visual Grounding (3DVG) aims to localize the referent of natural language referring expressions through two core tasks: Referring Expression Comprehension (3DREC) and Segmentation (3DRES). While existing methods achieve high accuracy in simple, single-object scenes, they suffer from severe performance degradation in complex, multi-object scenes that are common in real-world settings, hindering practical deployment. Existing methods face two key challenges in complex, multi-object scenes: inadequate parsing of implicit localization cues critical for disambiguating visually similar objects, and ineffective suppression of dynamic spatial interference from co-occurring objects, resulting in degraded grounding accuracy. To address these challenges, we propose PC-CrossDiff, a unified dual-task framework with a dual-level cross-modal differential attention architecture for 3DREC and 3DRES. Specifically, the framework introduces: (i) Point-Level Differential Attention (PLDA) modules that apply bidirectional differential attention between text and point clouds, adaptively extracting implicit localization cues via learnable weights to improve discriminative representation; (ii) Cluster-Level Differential Attention (CLDA) modules that establish a hierarchical attention mechanism to adaptively enhance localization-relevant spatial relationships while suppressing ambiguous or irrelevant spatial relations through a localization-aware differential attention block. Our method achieves state-of-the-art performance on the ScanRefer, NR3D, and SR3D benchmarks. Notably, on the Implicit subsets of ScanRefer, it improves the [email protected] score by +10.16% for the 3DREC task, highlighting its strong ability to parse implicit spatial cues.
Chinese Translation
3D视觉定位(3DVG)旨在通过两个核心任务:指称表达理解(3DREC)和分割(3DRES),定位自然语言指称表达的指称物。尽管现有方法在简单的单对象场景中取得了高准确率,但在复杂的多对象场景中,尤其是在现实世界环境中,性能严重下降,阻碍了实际应用。现有方法在复杂的多对象场景中面临两个关键挑战:对隐含定位线索的解析不足,这些线索对于消歧视觉上相似的对象至关重要,以及对共存对象动态空间干扰的抑制效果不佳,导致定位准确性下降。为了解决这些挑战,我们提出了PC-CrossDiff,一个统一的双任务框架,采用双层跨模态差异注意力架构用于3DREC和3DRES。具体而言,该框架引入了:(i) 点级差异注意力(PLDA)模块,通过文本和点云之间的双向差异注意力,利用可学习的权重自适应提取隐含定位线索,以改善区分性表示;(ii)簇级差异注意力(CLDA)模块,建立一个层次注意力机制,自适应增强与定位相关的空间关系,同时通过定位感知的差异注意力块抑制模糊或无关的空间关系。我们的方法在ScanRefer、NR3D和SR3D基准测试中实现了最先进的性能。值得注意的是,在ScanRefer的隐含子集上,它将3DREC任务的[email protected]分数提高了10.16%,突显了其解析隐含空间线索的强大能力。
cs.CV / 117 / 2603.17761
Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs
基于证据打包的跨领域图像深度伪造检测方法与大型视觉语言模型(LVLMs)
Abstract
Image Deepfake Detection (IDD) separates manipulated images from authentic ones by spotting artifacts of synthesis or tampering. Although large vision-language models (LVLMs) offer strong image understanding, adapting them to IDD often demands costly fine-tuning and generalizes poorly to diverse, evolving manipulations. We propose the Semantic Consistent Evidence Pack (SCEP), a training-free LVLM framework that replaces whole-image inference with evidence-driven reasoning. SCEP mines a compact set of suspicious patch tokens that best reveal manipulation cues. It uses the vision encoder's CLS token as a global reference, clusters patch features into coherent groups, and scores patches with a fused metric combining CLS-guided semantic mismatch with frequency- and noise-based anomalies. To cover dispersed traces and avoid redundancy, SCEP samples a few high-confidence patches per cluster and applies grid-based NMS, producing an evidence pack that conditions a frozen LVLM for prediction. Experiments on diverse benchmarks show SCEP outperforms strong baselines without LVLM fine-tuning.
Chinese Translation
图像深度伪造检测(IDD)通过识别合成或篡改的伪影,将操控的图像与真实图像分离。尽管大型视觉语言模型(LVLMs)提供了强大的图像理解能力,但将其应用于IDD通常需要昂贵的微调,并且对多样化和不断演变的操控的泛化能力较差。我们提出了一种语义一致性证据包(Semantic Consistent Evidence Pack, SCEP),这是一种无训练的LVLM框架,通过证据驱动的推理替代了整图推理。SCEP挖掘出一组紧凑的可疑补丁标记,这些标记最能揭示操控线索。它使用视觉编码器的CLS标记作为全局参考,将补丁特征聚类为一致的组,并通过结合CLS引导的语义不匹配与基于频率和噪声的异常的融合度量对补丁进行评分。为了覆盖分散的痕迹并避免冗余,SCEP每个聚类中抽样少量高置信度补丁,并应用基于网格的非极大值抑制(NMS),生成一个证据包,以便为冻结的LVLM进行预测。实验结果表明,SCEP在多样化基准测试中表现优于强基线,无需对LVLM进行微调。
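The evidence-selection stage described above — take the top-scoring patches per cluster, then apply grid-based NMS so no grid cell contributes redundant neighbors — can be sketched as follows. The scores, cluster assignments, grid size, and per-cluster budget are toy assumptions; the real scores come from the fused CLS/frequency/noise metric.

```python
# Minimal sketch of per-cluster sampling + grid-based NMS
# (patch scores and the grid size are toy assumptions).

def grid_nms(patches, grid=2):
    """patches: (row, col, score) tuples; keep the best-scoring patch
    per (row//grid, col//grid) cell to drop redundant neighbors."""
    best = {}
    for r, c, s in patches:
        cell = (r // grid, c // grid)
        if cell not in best or s > best[cell][2]:
            best[cell] = (r, c, s)
    return sorted(best.values(), key=lambda p: -p[2])

def evidence_pack(clusters, per_cluster=2, grid=2):
    """Sample a few high-confidence patches per cluster, then NMS."""
    kept = []
    for patches in clusters:
        ranked = sorted(patches, key=lambda p: -p[2])
        kept.extend(ranked[:per_cluster])
    return grid_nms(kept, grid)

clusters = [
    [(0, 0, 0.9), (0, 1, 0.8), (5, 5, 0.2)],   # two high scores share a cell
    [(4, 4, 0.7), (7, 7, 0.6)],
]
pack = evidence_pack(clusters)
```

Per-cluster sampling covers dispersed manipulation traces, while the grid NMS removes the adjacent duplicate `(0, 1, 0.8)`, keeping the evidence pack compact before it conditions the frozen LVLM.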
cs.CV / 118 / 2603.17779
CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image
CrowdGaussian:从单幅图像重建高保真度人群3D高斯模型
Abstract
Single-view 3D human reconstruction has garnered significant attention in recent years. Despite numerous advancements, prior research has concentrated on reconstructing 3D models from clear, close-up images of individual subjects, often yielding subpar results in the more prevalent multi-person scenarios. Reconstructing 3D human crowd models is a highly intricate task, laden with challenges such as: 1) extensive occlusions, 2) low clarity, and 3) numerous and various appearances. To address this task, we propose CrowdGaussian, a unified framework that directly reconstructs multi-person 3D Gaussian Splatting (3DGS) representations from single-image inputs. To handle occlusions, we devise a self-supervised adaptation pipeline that enables the pretrained large human model to reconstruct complete 3D humans with plausible geometry and appearance from heavily occluded inputs. Furthermore, we introduce Self-Calibrated Learning (SCL). This training strategy enables single-step diffusion models to adaptively refine coarse renderings to optimal quality by blending identity-preserving samples with clean/corrupted image pairs. The outputs can be distilled back to enhance the quality of multi-person 3DGS representations. Extensive experiments demonstrate that CrowdGaussian generates photorealistic, geometrically coherent reconstructions of multi-person scenes.
Chinese Translation
近年来,单视角3D人类重建引起了广泛关注。尽管取得了诸多进展,之前的研究主要集中在从清晰、特写的个体图像中重建3D模型,往往在更常见的多人物场景中效果不佳。重建3D人群模型是一项极其复杂的任务,面临着诸多挑战,例如:1)广泛的遮挡,2)低清晰度,以及3)众多且多样的外观。为了解决这一任务,我们提出了CrowdGaussian,这是一个统一框架,能够直接从单幅图像输入中重建多人物3D高斯点云(3D Gaussian Splatting, 3DGS)表示。为了处理遮挡问题,我们设计了一种自监督适应管道,使得预训练的大型人类模型能够从严重遮挡的输入中重建出具有合理几何形状和外观的完整3D人类。此外,我们引入了自校准学习(Self-Calibrated Learning, SCL)。这一训练策略使得单步扩散模型能够通过将保持身份的样本与干净/损坏的图像对混合,适应性地优化粗略渲染的质量。这些输出可以被提炼回去,以提升多人物3DGS表示的质量。大量实验表明,CrowdGaussian能够生成逼真且几何一致的多人物场景重建。
cs.CV / 119 / 2603.17782
Exploring parameter-efficient fine-tuning (PEFT) of billion-parameter vision models with QLoRA and DoRA: insights into generalization for limited-data image classification under a 98:1 test-to-train regime
探索亿参数视觉模型的参数高效微调(PEFT):基于QLoRA和DoRA的有限数据图像分类泛化洞察,在98:1的测试与训练比例下
Abstract
Automated behavior classification is essential for precision livestock farming but faces challenges of high computational costs and limited labeled data. This study systematically compared three approaches: training from scratch (ResNet-18, ViT-Small), frozen feature extraction, and parameter-efficient fine-tuning (PEFT) of the DINOv3 foundation model (6.7 billion parameters). We evaluated QLoRA and DoRA across multiple configurations varying rank (8, 16, 64) and target modules (q_proj versus all-linear layers). With 2,160 verified training images, we assessed the generalization of our model on 211,800 test samples, essentially a 98:1 test-to-train ratio. Results demonstrated that PEFT substantially outperformed the alternatives: the best QLoRA configuration (all-linear layers and rank=64) achieved 83.16% test accuracy with only 2.72% of the parameters (183.0M) in 5.8 hours, compared to 72.87% for ResNet-18 (16.8 hours), 61.91% for ViT-Small (18.7 hours), and 76.56% for frozen DINOv3 (17.5 hours). DoRA achieved comparable accuracy (83.14%) but with a longer training time (11.0 hours). Notably, increasing adapter capacity consistently improved generalization without causing overfitting: reducing rank from 16 to 8 decreased test accuracy from 78.38% to 77.17%, while expanding from q_proj-only to all-linear layers with rank=64 improved accuracy from 78.38% to 83.16%. This suggests underfitting, rather than overfitting, is the primary challenge when adapting foundation models to agricultural imagery. Our findings provide guidelines for deploying billion-parameter vision models with PEFT in agricultural livestock applications.
Chinese Translation
自动化行为分类对于精准畜牧业至关重要,但面临高计算成本和有限标注数据的挑战。本研究系统比较了三种方法:从头开始训练(ResNet-18,ViT-Small)、冻结特征提取和DINOv3基础模型(67亿参数)的参数高效微调(PEFT)。我们在多个配置下评估了QLoRA和DoRA,变化的秩(8、16、64)和目标模块(q_proj与所有线性层)。在2160张经过验证的训练图像上,我们评估了模型在211,800个测试样本上的泛化能力,基本上是98:1的测试与训练比例。结果表明,PEFT显著优于其他方法,其中最佳QLoRA配置(所有线性层和秩=64)在仅用5.8小时内实现了83.16%的测试准确率,参数仅为2.72%(1.83亿),而ResNet-18的准确率为72.87%(16.8小时),ViT-Small为61.91%(18.7小时),冻结的DINOv3为76.56%(17.5小时)。DoRA实现了相当的准确率(83.14%),但训练时间较长(11.0小时)。值得注意的是,增加适配器容量持续改善了泛化能力,同时没有导致过拟合:将秩从16降低到8使测试准确率从78.38%降至77.17%,而从仅使用q_proj扩展到所有线性层且秩=64时,准确率从78.38%提高到83.16%。这表明,在将基础模型适应农业图像时,主要挑战是欠拟合而非过拟合。我们的研究结果为在农业畜牧应用中部署亿参数视觉模型的PEFT提供了指导。
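A back-of-the-envelope calculation shows why even the largest configuration above ("all-linear, rank 64") stays a small fraction of a 6.7B-parameter backbone: each adapted `d_out x d_in` linear layer adds `r * (d_in + d_out)` LoRA parameters. The layer shapes and block count below are illustrative assumptions, not DINOv3's actual configuration, though they land in the same ballpark as the paper's 183.0M / 2.72% figures.

```python
# Sketch of LoRA trainable-parameter accounting
# (layer shapes and depth are assumptions, not DINOv3's real config).

def lora_params(layers, rank):
    """layers: list of (d_in, d_out) for every adapted linear layer;
    a rank-r LoRA adapter adds r*(d_in + d_out) parameters per layer."""
    return sum(rank * (d_in + d_out) for d_in, d_out in layers)

# Toy transformer block: q/k/v/o projections plus a 4x-width MLP, 40 blocks.
d = 4096
block = [(d, d)] * 4 + [(d, 4 * d), (4 * d, d)]
layers_all_linear = block * 40
layers_q_only = [(d, d)] * 40            # "q_proj only" configuration

added_r64 = lora_params(layers_all_linear, rank=64)   # all-linear, rank 64
added_q8 = lora_params(layers_q_only, rank=8)         # q_proj only, rank 8
backbone = 6_700_000_000
fraction = added_r64 / backbone
```

Under these toy shapes the all-linear rank-64 setup adds roughly 189M parameters, under 3% of the backbone, while q_proj-only at rank 8 adds under 3M, which makes concrete why the latter underfits in the reported results.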
cs.CV / 120 / 2603.17784
ResNet-50 with Class Reweighting and Anatomy-Guided Temporal Decoding for Gastrointestinal Video Analysis
基于类重加权和解剖引导的时间解码的ResNet-50在胃肠道视频分析中的应用
Abstract
We developed a multi-label gastrointestinal video analysis pipeline based on a ResNet-50 frame classifier followed by anatomy-guided temporal event decoding. The system predicts 17 labels, including 5 anatomy classes and 12 pathology classes, from frames resized to 336x336. A major challenge was severe class imbalance, particularly for rare pathology labels. To address this, we used clipped class-wise positive weighting in the training loss, which improved rare-class learning while maintaining stable optimization. At the temporal stage, we found that direct frame-to-event conversion produced fragmented mismatches with the official ground truth. The final submission therefore combined GT-style framewise event composition, anatomy vote smoothing, and anatomy-based pathology gating with a conservative hysteresis decoder. This design improved the final temporal mAP from 0.3801 to 0.4303 on the challenge test set.
Chinese Translation
我们开发了一个基于ResNet-50帧分类器的多标签胃肠道视频分析管道,随后进行了解剖引导的时间事件解码。该系统从调整为336x336的帧中预测17个标签,包括5个解剖类和12个病理类。一个主要挑战是严重的类别不平衡,尤其是对于稀有病理标签。为了解决这个问题,我们在训练损失中使用了剪切的类别正权重,这改善了稀有类别的学习,同时保持了稳定的优化。在时间阶段,我们发现直接的帧到事件转换产生了与官方真实标签的碎片化不匹配。因此,最终的提交结合了GT风格的逐帧事件组合、解剖投票平滑和基于解剖的病理门控,使用了保守的滞后解码器。这一设计使最终的时间mAP在挑战测试集上从0.3801提高到0.4303。
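The "conservative hysteresis decoder" mentioned above can be sketched as a two-threshold state machine: a pathology event opens only when the framewise probability exceeds a high threshold and closes only when it falls below a lower one, which suppresses single-frame flicker. The thresholds and probability trace are illustrative assumptions, not the submission's tuned values.

```python
# Minimal sketch of a hysteresis event decoder
# (thresholds and the probability trace are toy assumptions).

def hysteresis_decode(probs, t_on=0.7, t_off=0.3):
    """Returns (start, end) frame-index pairs, end exclusive.
    Open at p >= t_on, close only at p < t_off."""
    events, start, active = [], None, False
    for i, p in enumerate(probs):
        if not active and p >= t_on:
            active, start = True, i
        elif active and p < t_off:
            events.append((start, i))
            active = False
    if active:                            # event still open at video end
        events.append((start, len(probs)))
    return events

# A brief dip to 0.5 does not split the event; a drop to 0.1 ends it.
probs = [0.1, 0.8, 0.9, 0.5, 0.8, 0.1, 0.2, 0.75, 0.9]
events = hysteresis_decode(probs)
```

With a single threshold the dip at frame 3 would fragment the first event into two, which is exactly the framewise-to-event mismatch the temporal stage is designed to avoid.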
cs.CV / 121 / 2603.17809
Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients
针对大规模视觉语言模型的细粒度后训练量化与量化感知集成梯度
Abstract
Large Vision Language Models (LVLMs) have achieved remarkable success in a range of downstream tasks that require multimodal interaction, but their capabilities come with substantial computational and memory overhead, which hinders practical deployment. Among numerous acceleration techniques, post-training quantization is a popular and effective strategy for reducing memory cost and accelerating inference. However, existing LVLM quantization methods typically measure token sensitivity at the modality level, which fails to capture the complex cross-token interactions and falls short in quantitatively measuring the quantization error at the token level. As tokens interact within the model, the distinction between modalities gradually diminishes, suggesting the need for fine-grained calibration. Inspired by axiomatic attribution in mechanistic interpretability, we introduce a fine-grained quantization strategy on Quantization-aware Integrated Gradients (QIG), which leverages integrated gradients to quantitatively evaluate token sensitivity and push the granularity from modality level to token level, reflecting both inter-modality and intra-modality dynamics. Extensive experiments on multiple LVLMs under both W4A8 and W3A16 settings show that our method improves accuracy across models and benchmarks with negligible latency overhead. For example, under 3-bit weight-only quantization, our method improves the average accuracy of LLaVA-onevision-7B by 1.60%, reducing the gap to its full-precision counterpart to only 1.33%. The code is available at https://github.com/ucas-xiang/QIG.
Chinese Translation
大规模视觉语言模型(LVLMs)在需要多模态交互的多种下游任务中取得了显著成功,但其能力伴随着巨大的计算和内存开销,阻碍了实际部署。在众多加速技术中,后训练量化是一种流行且有效的策略,用于降低内存成本和加速推理。然而,现有的LVLM量化方法通常在模态层面测量令牌敏感性,这未能捕捉复杂的跨令牌交互,并且在定量测量令牌级别的量化误差方面存在不足。随着令牌在模型中的交互,模态之间的区别逐渐减小,这表明需要进行细粒度的校准。受到机械解释中的公理归因启发,我们提出了一种基于量化感知集成梯度(Quantization-aware Integrated Gradients, QIG)的细粒度量化策略,该策略利用集成梯度定量评估令牌敏感性,并将粒度从模态层面提升到令牌层面,反映了模态间和模态内的动态。在W4A8和W3A16设置下对多个LVLM进行的广泛实验表明,我们的方法在模型和基准测试中提高了准确性,且延迟开销微乎其微。例如,在3位仅权重量化下,我们的方法使LLaVA-onevision-7B的平均准确率提高了1.60%,将其与全精度对应模型的差距缩小至仅1.33%。代码可在 https://github.com/ucas-xiang/QIG 获取。
cs.CV / 122 / 2603.17812
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation
ChopGrad:通过截断反向传播实现潜在视频扩散的逐像素损失
Abstract
Recent video diffusion models achieve high-quality generation through recurrent frame processing where each frame generation depends on previous frames. However, this recurrent mechanism means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. This fundamental limitation also makes fine-tuning these models with pixel-wise losses computationally intractable for long or high-resolution videos. This paper introduces ChopGrad, a truncated backpropagation scheme for video decoding, limiting gradient computation to local frame windows while maintaining global consistency. We provide a theoretical analysis of this approximation and show that it enables efficient fine-tuning with frame-wise losses. ChopGrad reduces training memory from scaling linearly with the number of video frames (full backpropagation) to constant memory, and compares favorably to existing state-of-the-art video diffusion models across a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.
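The truncation idea can be made concrete with a scalar frame recurrence. This is a numeric sketch under stated assumptions, not ChopGrad's implementation: the recurrence `h_t = a*h_{t-1} + x_t`, the per-frame loss `h_t**2`, and the window size are all stand-ins.

```python
import numpy as np

# Window-truncated backpropagation through a toy frame recurrence
# h_t = a*h_{t-1} + x_t, with per-frame loss L_t = h_t**2. A window of
# size W means dL_t flows back through at most W recurrence steps.
def rollout(x, a):
    h, hs = 0.0, []
    for xt in x:
        h = a * h + xt
        hs.append(h)
    return hs

def grad_x_truncated(x, a, W):
    """dL/dx_s for L = sum_t h_t^2, with gradients chopped to W steps."""
    hs = rollout(x, a)
    g = [0.0] * len(x)
    for t, ht in enumerate(hs):
        for s in range(max(0, t - W + 1), t + 1):
            # dh_t/dx_s = a**(t-s); chain rule with dL_t/dh_t = 2*h_t
            g[s] += 2.0 * ht * a ** (t - s)
    return g

x, a = [1.0, -0.5, 2.0, 0.3], 0.9
full = grad_x_truncated(x, a, W=len(x))   # full backpropagation
chopped = grad_x_truncated(x, a, W=2)     # local-window approximation
# Truncation drops only long-range terms: the most recent frame's
# gradient is identical, while early frames lose distant contributions.
assert abs(full[-1] - chopped[-1]) < 1e-12
assert full[0] != chopped[0]
```

In the real model the window bound is what turns memory from linear in the number of frames into a constant, since activations outside the window never need to be retained.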
Chinese Translation
最近的视频扩散模型通过递归帧处理实现高质量生成,其中每帧的生成依赖于之前的帧。然而,这种递归机制意味着在像素域中训练此类模型会产生高昂的内存成本,因为激活值在整个视频序列中累积。这个基本限制也使得使用逐像素损失对这些模型进行微调在长视频或高分辨率视频中计算上不可行。本文提出了ChopGrad,一种用于视频解码的截断反向传播方案,限制梯度计算在局部帧窗口内,同时保持全局一致性。我们对这种近似进行了理论分析,并展示了它如何使得使用逐帧损失进行高效微调成为可能。ChopGrad将训练内存从与视频帧数线性相关(全反向传播)的情况降低到常量内存,并在一系列条件视频生成任务中与现有最先进的视频扩散模型相比表现良好,这些任务包括视频超分辨率、视频修复、神经渲染场景的视频增强以及受控驾驶视频生成。
cs.CV / 123 / 2603.17813
M2P: Improving Visual Foundation Models with Mask-to-Point Weakly-Supervised Learning for Dense Point Tracking
M2P:通过掩码到点的弱监督学习改善视觉基础模型以进行密集点跟踪
Abstract
Tracking Any Point (TAP) has emerged as a fundamental tool for video understanding. Current approaches adapt Vision Foundation Models (VFMs) like DINOv2 via offline finetuning or test-time optimization. However, these VFMs rely on static image pre-training, which is inherently sub-optimal for capturing dense temporal correspondence in videos. To address this, we propose Mask-to-Point (M2P) learning, which leverages rich video object segmentation (VOS) mask annotations to improve VFMs for dense point tracking. Our M2P introduces three new mask-based constraints for weakly-supervised representation learning. First, we propose a local structure consistency loss, which leverages Procrustes analysis to model the cohesive motion of points lying within a local structure, achieving more reliable point-to-point matching learning. Second, we propose a mask label consistency (MLC) loss, which enforces that sampled foreground points strictly match foreground regions across frames. The proposed MLC loss can be regarded as a regularization, which stabilizes training and prevents convergence to trivial solutions. Finally, a mask boundary constraint is applied to explicitly supervise boundary points. We show that our weakly-supervised M2P models significantly outperform baseline VFMs with efficient training by using only 3.6K VOS training videos. Notably, M2P achieves 12.8% and 14.6% performance gains over DINOv2-B/14 and DINOv3-B/16 on the TAP-Vid-DAVIS benchmark, respectively. Moreover, the proposed M2P models are used as pre-trained backbones for both test-time optimized and offline fine-tuned TAP tasks, demonstrating their potential to serve as general pre-trained models for point tracking. Code will be made publicly available upon acceptance.
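The Procrustes-based consistency idea can be sketched as follows. This is a generic orthogonal-Procrustes residual, assuming 2D points; the paper's actual loss formulation and weighting are not shown here.

```python
import numpy as np

# Procrustes-style local structure consistency check: points inside a
# local structure should move rigidly, so the best-fit rotation of one
# frame's points onto the next should leave a small residual.
def procrustes_residual(P, Q):
    """Best-fit rotation of centered P onto centered Q; mean residual."""
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    R = U @ Vt                    # optimal orthogonal alignment
    return float(np.mean(np.linalg.norm(Pc @ R - Qc, axis=1)))

rng = np.random.default_rng(1)
P = rng.normal(size=(6, 2))
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Q_rigid = P @ R + np.array([3.0, -1.0])   # coherent (rigid) local motion
Q_rand = rng.normal(size=(6, 2))          # incoherent point matches
assert procrustes_residual(P, Q_rigid) < 1e-8
assert procrustes_residual(P, Q_rigid) < procrustes_residual(P, Q_rand)
```

A loss built on this residual rewards point matches whose local neighborhoods move cohesively, which is the intuition behind the paper's first constraint.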
Chinese Translation
任意点跟踪(TAP)已成为视频理解的一个基本工具。目前的方法通过离线微调或测试时优化来适应视觉基础模型(VFM),如DINOv2。然而,这些VFM依赖于静态图像的预训练,这在捕捉视频中的密集时间对应关系时本质上是次优的。为了解决这个问题,我们提出了掩码到点(M2P)学习,它利用丰富的视频对象分割(VOS)掩码注释来改善VFM以进行密集点跟踪。我们的M2P引入了三个新的基于掩码的约束用于弱监督表示学习。首先,我们提出了一种局部结构一致性损失,它利用Procrustes分析来建模位于局部结构内的点的凝聚运动,从而实现更可靠的点对点匹配学习。其次,我们提出了一种掩码标签一致性(MLC)损失,强制采样的前景点在帧间严格匹配前景区域。所提出的MLC损失可以视为一种正则化,稳定训练并防止收敛到平凡解。最后,掩码边界约束被应用于显式监督边界点。我们展示了我们的弱监督M2P模型在仅使用3.6K VOS训练视频的高效训练下显著优于基线VFM。值得注意的是,M2P在TAP-Vid-DAVIS基准上分别在DINOv2-B/14和DINOv3-B/16上实现了12.8%和14.6%的性能提升。此外,所提出的M2P模型被用作测试时优化和离线微调TAP任务的预训练骨干,展示了其作为点跟踪通用预训练模型的潜力。代码将在接受后公开发布。
cs.CV / 124 / 2603.17825
Steering Video Diffusion Transformers with Massive Activations
利用大规模激活引导视频扩散变换器
Abstract
Despite rapid progress in video diffusion transformers, how their internal model signals can be leveraged with minimal overhead to enhance video generation quality remains underexplored. In this work, we study the role of Massive Activations (MAs), which are rare, high-magnitude hidden state spikes in video diffusion transformers. We observed that MAs emerge consistently across all visual tokens, with a clear magnitude hierarchy: first-frame tokens exhibit the largest MA magnitudes, latent-frame boundary tokens (the head and tail portions of each temporal chunk in the latent space) show elevated but slightly lower MA magnitudes than the first frame, and interior tokens within each latent frame remain elevated, yet are comparatively moderate in magnitude. This structured pattern suggests that the model implicitly prioritizes token positions aligned with the temporal chunking in the latent space. Based on this observation, we propose Structured Activation Steering (STAS), a training-free self-guidance-like method that steers MA values at first-frame and boundary tokens toward a scaled global maximum reference magnitude. STAS achieves consistent improvements in terms of video quality and temporal coherence across different text-to-video models, while introducing negligible computational overhead.
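The steering operation itself is simple to sketch. The example below is illustrative only: the 1D per-token magnitudes, the choice of which indices count as "first-frame/boundary" tokens, and the scaling factor are assumptions, not the paper's values.

```python
import numpy as np

# Sketch of massive-activation steering: scale selected token positions
# toward a fraction of the global maximum reference magnitude while
# leaving all other activations untouched.
def steer(acts, steer_idx, scale=0.9):
    out = acts.copy()
    ref = np.abs(acts).max()                # global maximum reference
    target = scale * ref
    for i in steer_idx:
        out[i] = np.sign(out[i]) * target   # push magnitude toward ref
    return out

acts = np.array([0.2, -0.1, 5.0, 0.3, -0.4, 0.25])  # token 2 has a spike
steered = steer(acts, steer_idx=[0, 5])  # e.g. first-frame/boundary tokens
assert abs(abs(steered[0]) - 0.9 * 5.0) < 1e-12
assert steered[2] == 5.0                 # non-steered tokens unchanged
```

Because the operation is a per-position rescale at inference time, it requires no training and adds negligible compute, matching the "training-free, negligible overhead" claim.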
Chinese Translation
尽管视频扩散变换器取得了快速进展,但如何以最小的开销利用其内部模型信号来提升视频生成质量仍然未被充分探索。在本研究中,我们研究了大规模激活(Massive Activations, MAs)的作用,MAs 是视频扩散变换器中稀有的高幅度隐藏状态尖峰。我们观察到,MAs 在所有视觉标记中一致出现,并且具有明显的幅度层次:第一帧标记表现出最大的 MA 幅度,潜在帧边界标记(潜在空间中每个时间块的头部和尾部部分)显示出高于第一帧但略低的 MA 幅度,而每个潜在帧内的内部标记则保持在较高水平,但幅度相对适中。这种结构化模式表明,模型隐含地优先考虑与潜在空间中的时间块对齐的标记位置。基于这一观察,我们提出了结构化激活引导(Structured Activation Steering, STAS),这是一种无训练的自我引导类方法,旨在将第一帧和边界标记的 MA 值引导至缩放的全局最大参考幅度。STAS 在不同的文本到视频模型中实现了视频质量和时间一致性的持续改善,同时引入的计算开销微乎其微。
cs.CV / 125 / 2603.17828
TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models
TINA:针对去学习文本到图像扩散模型的无文本反演攻击
Abstract
Although text-to-image diffusion models exhibit remarkable generative power, concept erasure techniques are essential for their safe deployment to prevent the creation of harmful content. This has fostered a dynamic interplay between the development of erasure defenses and the adversarial probes designed to bypass them, and this co-evolution has progressively enhanced the efficacy of erasure methods. However, this adversarial co-evolution has converged on a narrow, text-centric paradigm that equates erasure with severing the text-to-image mapping, ignoring that the underlying visual knowledge related to undesired concepts still persists. To substantiate this claim, we investigate from a visual perspective, leveraging DDIM inversion to probe whether a generative pathway for the erased concept can still be found. However, identifying such a visual generative pathway is challenging because standard text-guided DDIM inversion is actively resisted by text-centric defenses within the erased model. To address this, we introduce TINA, a novel Text-free INversion Attack, which enforces this visual-only probe by operating under a null-text condition, thereby avoiding existing text-centric defenses. Moreover, TINA integrates an optimization procedure to overcome the accumulating approximation errors that arise when standard inversion operates without its usual textual guidance. Our experiments demonstrate that TINA regenerates erased concepts from models treated with state-of-the-art unlearning. The success of TINA proves that current methods merely obscure concepts, highlighting an urgent need for paradigms that operate directly on internal visual knowledge.
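The deterministic DDIM update that inversion builds on can be shown in a few lines. This is a sketch of the generic DDIM step with a fixed noise prediction standing in for a null-text noise predictor; it is not TINA's optimization procedure.

```python
import numpy as np

# Deterministic DDIM update: predict x0 from the current latent, then
# re-noise it to the target noise level. Running it with the noise-level
# pair swapped gives the inversion direction.
def ddim_step(x_t, eps, ab_t, ab_prev):
    x0 = (x_t - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0 + np.sqrt(1.0 - ab_prev) * eps

x = np.array([0.3, -1.2])
eps = np.array([0.5, 0.1])   # stand-in for a null-text noise prediction
y = ddim_step(x, eps, ab_t=0.9, ab_prev=0.5)      # inversion (noisier)
back = ddim_step(y, eps, ab_t=0.5, ab_prev=0.9)   # generation direction
# With a shared eps the two steps are exact inverses; in practice eps is
# re-predicted at each step, which is what causes the accumulating
# approximation error the paper's optimization corrects.
assert np.allclose(back, x)
```

The mismatch between the eps used during inversion and the eps re-predicted during generation is exactly the error source TINA's optimization targets.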
Chinese Translation
尽管文本到图像的扩散模型展现出显著的生成能力,但概念抹除技术对于其安全部署至关重要,以防止有害内容的生成。这促成了抹除防御的发展与旨在绕过这些防御的对抗性探测之间的动态互动,而这种共同演化逐步增强了抹除方法的有效性。然而,这种对抗性共同演化已收敛于一个狭窄的以文本为中心的范式,将抹除等同于切断文本到图像的映射,忽视了与不希望出现的概念相关的基础视觉知识仍然存在。为了证实这一观点,我们从视觉的角度进行研究,利用DDIM反演探测被抹除概念的生成路径是否仍然存在。然而,识别这样的视觉生成路径具有挑战性,因为标准的文本引导DDIM反演在被抹除模型中受到以文本为中心的防御的积极抵制。为了解决这个问题,我们引入了TINA,一种新颖的无文本反演攻击,它通过在无文本条件下操作来强制进行这种仅基于视觉的探测,从而避免现有的以文本为中心的防御。此外,TINA整合了一种优化过程,以克服在标准反演没有其通常的文本指导下产生的累积近似误差。我们的实验表明,TINA能够从经过最先进去学习(unlearning)方法处理的模型中重新生成被抹除的概念。TINA的成功证明了当前的方法仅仅是模糊了概念,突显了迫切需要直接作用于内部视觉知识的范式。
cs.CV / 126 / 2603.17840
Video Understanding: From Geometry and Semantics to Unified Models
视频理解:从几何与语义到统一模型
Abstract
Video understanding aims to enable models to perceive, reason about, and interact with the dynamic visual world. In contrast to image understanding, video understanding inherently requires modeling temporal dynamics and evolving visual context, placing stronger demands on spatiotemporal reasoning and making it a foundational problem in computer vision. In this survey, we present a structured overview of video understanding by organizing the literature into three complementary perspectives: low-level video geometry understanding, high-level semantic understanding, and unified video understanding models. We further highlight a broader shift from isolated, task-specific pipelines toward unified modeling paradigms that can be adapted to diverse downstream objectives, enabling a more systematic view of recent progress. By consolidating these perspectives, this survey provides a coherent map of the evolving video understanding landscape, summarizes key modeling trends and design principles, and outlines open challenges toward building robust, scalable, and unified video foundation models.
Chinese Translation
视频理解旨在使模型能够感知、推理和与动态视觉世界进行交互。与图像理解不同,视频理解本质上需要建模时间动态和不断演变的视觉上下文,这对时空推理提出了更高的要求,使其成为计算机视觉中的一个基础性问题。在本次综述中,我们通过将文献组织为三个互补的视角,呈现了视频理解的结构化概述:低级视频几何理解、高级语义理解和统一视频理解模型。我们进一步强调了从孤立的、特定任务的流程向统一建模范式的更广泛转变,这些范式可以适应多样的下游目标,从而使我们能够更系统地看待最近的进展。通过整合这些视角,本综述提供了一个关于不断演变的视频理解领域的连贯地图,总结了关键的建模趋势和设计原则,并概述了构建稳健、可扩展和统一的视频基础模型的开放挑战。
cs.CV / 127 / 2603.17841
Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass
Omni-3DEdit:一次性通用多功能3D编辑
Abstract
Most instruction-driven 3D editing methods rely on 2D models to guide the explicit and iterative optimization of 3D representations. This paradigm, however, suffers from two primary drawbacks. First, it lacks a universal design of different 3D editing tasks because the explicit manipulation of 3D geometry necessitates task-dependent rules, e.g., 3D appearance editing demands inherent source 3D geometry, while 3D removal alters source geometry. Second, the iterative optimization process is highly time-consuming, often requiring thousands of invocations of 2D/3D updating. We present Omni-3DEdit, a unified, learning-based model that generalizes various 3D editing tasks implicitly. One key challenge to achieve our goal is the scarcity of paired source-edited multi-view assets for training. To address this issue, we construct a data pipeline, synthesizing a relatively rich number of high-quality paired multi-view editing samples. Subsequently, we adapt the pre-trained generative model SEVA as our backbone by concatenating source view latents along with conditional tokens in sequence space. A dual-stream LoRA module is proposed to disentangle different view cues, largely enhancing our model's representational learning capability. As a learning-based model, our model is free of the time-consuming online optimization, and it can complete various 3D editing tasks in one forward pass, reducing the inference time from tens of minutes to approximately two minutes. Extensive experiments demonstrate the effectiveness and efficiency of Omni-3DEdit.
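The LoRA mechanism the paper adapts can be sketched generically. This is the standard low-rank update, not the paper's dual-stream variant: "dual-stream" would mean two independent `(A, B)` pairs routed per view stream, which is only indicated in a comment below.

```python
import numpy as np

# Standard LoRA forward pass: the frozen weight W is augmented by a
# low-rank delta B @ A. A dual-stream variant (as in the paper) would
# keep two independent (A, B) pairs, one per view-cue stream.
def lora_forward(x, W, A, B, scale=1.0):
    return x @ (W + scale * (B @ A)).T

d, r = 8, 2
rng = np.random.default_rng(5)
W = rng.normal(size=(d, d))
A = rng.normal(size=(r, d))
B = np.zeros((d, r))          # B initialized to zero: delta starts as a no-op
x = rng.normal(size=(3, d))
# At initialization the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

Zero-initializing `B` is the usual trick that lets fine-tuning start from the pre-trained backbone's behavior and only gradually learn the editing-specific delta.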
Chinese Translation
大多数基于指令的3D编辑方法依赖于2D模型来指导3D表示的显式和迭代优化。然而,这一范式存在两个主要缺陷。首先,由于3D几何的显式操作需要依赖于特定任务的规则,因此缺乏不同3D编辑任务的通用设计。例如,3D外观编辑需要固有的源3D几何,而3D移除则会改变源几何。其次,迭代优化过程耗时极长,通常需要数千次的2D/3D更新调用。我们提出了Omni-3DEdit,这是一种统一的基于学习的模型,能够隐式地概括各种3D编辑任务。实现我们目标的一个关键挑战是缺乏配对的源-编辑多视角资产用于训练。为了解决这个问题,我们构建了一个数据管道,合成了相对丰富的高质量配对多视角编辑样本。随后,我们通过在序列空间中将源视图潜变量与条件标记连接,适配了预训练生成模型SEVA作为我们的基础模型。我们提出了一种双流LoRA模块,以解耦不同视图线索,极大增强了我们模型的表征学习能力。作为一个基于学习的模型,我们的模型不需要耗时的在线优化,并且可以在一次前向传递中完成各种3D编辑任务,将推理时间从数十分钟减少到大约两分钟。大量实验证明了Omni-3DEdit的有效性和效率。
cs.CV / 128 / 2603.17845
Revisiting foundation models for cell instance segmentation
重新审视细胞实例分割的基础模型
Abstract
Cell segmentation is a fundamental task in microscopy image analysis. Several foundation models for cell segmentation have been introduced, virtually all of which are extensions of the Segment Anything Model (SAM), improving it for microscopy data. Recently, SAM2 and SAM3 have been published, further improving and extending the capabilities of general-purpose segmentation foundation models. Here, we comprehensively evaluate foundation models for cell segmentation (CellPoseSAM, CellSAM, $\mu$SAM) and for general-purpose segmentation (SAM, SAM2, SAM3) on a diverse set of (light) microscopy datasets, for tasks including cell, nucleus and organoid segmentation. Furthermore, we introduce a new instance segmentation strategy called automatic prompt generation (APG) that can be used to further improve SAM-based microscopy foundation models. APG consistently improves segmentation results for $\mu$SAM, which is used as the base model, and is competitive with the state-of-the-art model CellPoseSAM. Moreover, our work provides important lessons for adaptation strategies of SAM-style models to microscopy and provides a strategy for creating even more powerful microscopy foundation models. Our code is publicly available at https://github.com/computational-cell-analytics/micro-sam.
Chinese Translation
细胞分割是显微镜图像分析中的一项基础任务。已经引入了几种用于细胞分割的基础模型,几乎所有这些模型都是对Segment Anything Model (SAM)的扩展,旨在改善其在显微镜数据上的表现。最近,SAM2和SAM3相继发布,进一步提升和扩展了通用分割基础模型的能力。在此,我们对细胞分割的基础模型(CellPoseSAM、CellSAM、$\mu$SAM)以及通用分割模型(SAM、SAM2、SAM3)进行了全面评估,涵盖了多种(光)显微镜数据集,任务包括细胞、细胞核和类器官分割。此外,我们引入了一种新的实例分割策略,称为自动提示生成(automatic prompt generation, APG),可用于进一步改善基于SAM的显微镜基础模型。APG在以$\mu$SAM作为基础模型时,始终能提高分割结果,并且与最先进的模型CellPoseSAM具有竞争力。此外,我们的工作为SAM风格模型在显微镜中的适应策略提供了重要的经验教训,并为创建更强大的显微镜基础模型提供了一种策略。我们的代码已公开发布在https://github.com/computational-cell-analytics/micro-sam。
cs.CV / 129 / 2603.17859
VISER: Visually-Informed System for Enhanced Robustness in Open-Set Iris Presentation Attack Detection
VISER:一种基于视觉信息的开放集虹膜呈现攻击检测增强鲁棒性的系统
Abstract
Human perceptual priors have shown promise in saliency-guided deep learning training, particularly in the domain of iris presentation attack detection (PAD). Common saliency approaches include hand annotations obtained via mouse clicks and eye gaze heatmaps derived from eye tracking data. However, the most effective form of human saliency for open-set iris PAD remains underexplored. In this paper, we conduct a series of experiments comparing hand annotations, eye tracking heatmaps, segmentation masks, and DINOv2 embeddings to a state-of-the-art deep learning-based baseline on the task of open-set iris PAD. Results for open-set PAD in a leave-one-attack-type out paradigm indicate that denoised eye tracking heatmaps show the best generalization improvement over cross entropy in terms of Area Under the ROC curve (AUROC) and Attack Presentation Classification Error Rate (APCER) at Bona Fide Presentation Classification Error Rate (BPCER) of 1%. Along with this paper, we offer trained models, code, and saliency maps for reproducibility and to facilitate follow-up research efforts.
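The operating point reported in the abstract, APCER at a fixed BPCER of 1%, can be computed as below. This is a generic sketch of the metric under the assumption that higher scores mean "more attack-like"; the scores are toy data, not from the paper.

```python
# APCER @ fixed BPCER: pick a decision threshold so that at most `bpcer`
# of bona fide samples are rejected, then measure the fraction of attack
# samples that slip under that threshold (i.e. are accepted).
def apcer_at_bpcer(bona_scores, attack_scores, bpcer=0.01):
    s = sorted(bona_scores)
    k = max(0, int((1 - bpcer) * len(s)) - 1)
    thr = s[k]  # (1 - bpcer)-quantile of bona fide scores
    return sum(a <= thr for a in attack_scores) / len(attack_scores)

bona = [i / 200 for i in range(100)]            # bona fide: low scores
attacks = [0.6 + i / 250 for i in range(100)]   # attacks: high scores
# Perfectly separated toy data: no attack is accepted at BPCER = 1%.
assert apcer_at_bpcer(bona, attacks) == 0.0
```

Fixing BPCER first reflects the deployment constraint that legitimate users must almost never be rejected; the comparison across saliency types in the paper is then about how many attacks leak through at that threshold.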
Chinese Translation
人类感知先验在基于显著性引导的深度学习训练中显示出了潜力,特别是在虹膜呈现攻击检测(PAD)领域。常见的显著性方法包括通过鼠标点击获得的手工标注和从眼动追踪数据中派生的眼动热图。然而,针对开放集虹膜PAD的人类显著性最有效的形式仍然未被充分探索。本文中,我们进行了一系列实验,将手工标注、眼动追踪热图、分割掩码和DINOv2嵌入与基于深度学习的最先进基线进行比较,以评估开放集虹膜PAD的任务。在一种留一攻击类型的实验范式下,开放集PAD的结果表明,去噪的眼动追踪热图在ROC曲线下面积(AUROC)和攻击呈现分类错误率(APCER)方面相较于交叉熵表现出最佳的泛化改进,且在真实呈现分类错误率(BPCER)为1%的情况下尤为明显。随本文附带,我们提供了训练模型、代码和显著性图,以便于重现研究并促进后续研究工作。
cs.CV / 130 / 2603.17876
Edit Spillover as a Probe: Do Image Editing Models Implicitly Understand World Relations?
编辑溢出作为探针:图像编辑模型是否隐含理解世界关系?
Abstract
Instruction-following image editing models are expected to modify only the specified region while keeping the rest of the image unchanged. However, in practice, we observe a pervasive phenomenon -- edit spillover: models alter semantically related but unspecified content outside the edit region. This raises a fundamental question -- does spillover reflect genuine implicit world understanding, or is it merely attention leakage? We propose EditSpilloverProbe, a systematic framework that repurposes edit spillover as a natural probe for world knowledge in image editing models. We introduce a spillover taxonomy (spatial, semantic, mixed, random), an automated detection-and-classification pipeline, and a benchmark dataset constructed from real-world Chinese text editing tasks, EditSpilloverBench. Systematic evaluation of 5 representative editing models reveals three core findings: (1) spillover rates vary dramatically across architectures, from 3.49% to 11.46%, with a 3.3x ratio; (2) absolute semantic spillover quantity reveals models' world understanding capability -- nano_banana produces the most semantic spillover (27.8 per image), while qwen_2511 has the most precise editing control but lower semantic spillover (16.3 per image), revealing a trade-off between editing control and world understanding; (3) spatial decay analysis shows spillover area density decays exponentially with distance, but the proportion of semantically relevant spillover remains constant (40%-58%), providing direct evidence that semantic spillover reflects genuine world understanding rather than spatial diffusion.
Chinese Translation
遵循指令的图像编辑模型应仅修改指定区域,同时保持图像其余部分不变。然而,在实践中,我们观察到一种普遍现象——编辑溢出:模型改变了与编辑区域语义相关但未指定的内容。这引发了一个根本性的问题——溢出是否反映了真正的隐性世界理解,还是仅仅是注意力泄漏?我们提出了EditSpilloverProbe,这是一个系统框架,将编辑溢出重新用作图像编辑模型中世界知识的自然探针。我们引入了一种溢出分类法(空间、语义、混合、随机)、一个自动检测和分类管道,以及一个基于真实世界中文文本编辑任务构建的基准数据集EditSpilloverBench。对5个代表性编辑模型的系统评估揭示了三个核心发现:(1)不同架构的溢出率差异显著,从3.49%到11.46%,比例为3.3倍;(2)绝对语义溢出量揭示了模型的世界理解能力——nano_banana产生了最多的语义溢出(每张图像27.8),而qwen_2511具有最精确的编辑控制但语义溢出较低(每张图像16.3),显示了编辑控制与世界理解之间的权衡;(3)空间衰减分析表明,溢出区域密度随着距离呈指数衰减,但语义相关溢出的比例保持不变(40%-58%),提供了直接证据表明语义溢出反映了真正的世界理解,而非空间扩散。
cs.CV / 131 / 2603.17879
Differential Attention-Augmented BiomedCLIP with Asymmetric Focal Optimization for Imbalanced Multi-Label Video Capsule Endoscopy Classification
不平衡多标签视频胶囊内窥镜分类的差异注意力增强BiomedCLIP与非对称焦点优化
Abstract
This work presents a multi-label classification framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset through a combination of architectural and optimization-level strategies. Our approach modifies BiomedCLIP, a biomedical vision-language foundation model, by replacing its standard multi-head self-attention with a differential attention mechanism that computes the difference between two softmax attention maps to suppress attention noise. To counteract the skewed label distribution, where pathological findings constitute less than 0.1% of all annotated frames, a sqrt-frequency weighted sampler, asymmetric focal loss, mixup regularization, and per-class threshold optimization are employed. Temporal coherence is enforced through median-filter smoothing and gap merging prior to event-level JSON generation. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the pipeline achieves an overall temporal mAP@… of 0.2456 and mAP@… of 0.2353, with total inference completed in approximately 8.6 minutes on a single GPU.
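The differential attention mechanism described above (the difference of two softmax attention maps) can be sketched directly. Dimensions, the single head, and the fixed `lambda_` below are illustrative assumptions; in differential-attention formulations `lambda_` is typically learned.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Differential attention: two query/key projections produce two softmax
# attention maps, and their (scaled) difference cancels shared "noise"
# attention mass before the value aggregation.
def diff_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lambda_=0.5):
    a1 = softmax((x @ Wq1) @ (x @ Wk1).T / np.sqrt(Wq1.shape[1]))
    a2 = softmax((x @ Wq2) @ (x @ Wk2).T / np.sqrt(Wq2.shape[1]))
    return (a1 - lambda_ * a2) @ (x @ Wv)

rng = np.random.default_rng(2)
n, d = 4, 8
x = rng.normal(size=(n, d))
out = diff_attention(x, *(rng.normal(size=(d, d)) for _ in range(5)))
assert out.shape == (n, d)
```

Because both maps are row-stochastic, attention mass that both heads agree to spread diffusely subtracts away, leaving sharper effective attention on the frames that matter.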
Chinese Translation
本研究提出了一种视频胶囊内窥镜(VCE)的多标签分类框架,通过结合架构和优化层面的策略,解决Galar数据集中固有的极端类别不平衡问题。我们的方法修改了BiomedCLIP这一生物医学视觉-语言基础模型,采用差异注意力机制替代其标准的多头自注意力,以计算两个softmax注意力图之间的差异,从而抑制注意力噪声。为了应对标签分布的偏斜,其中病理发现占所有标注帧的比例不足0.1%,我们采用了平方根频率加权采样器、非对称焦点损失、混合正则化和每类阈值优化。在事件级JSON生成之前,通过中值滤波平滑和间隙合并来强制执行时间一致性。在包含三次NaviCam检查(161,025帧)的保留RARE-VISION测试集上,该管道实现了整体时间mAP@…为0.2456、mAP@…为0.2353,且在单个GPU上总推理时间约为8.6分钟。
cs.CV / 132 / 2603.17889
Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation
身份作为存在:面向个性化外观和声音的联合音视频生成
Abstract
Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, an openly accessible framework enabling fine-grained control over facial appearance and voice timbre across multiple identities remains unavailable. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation, enabling high-fidelity and consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across audio and visual modalities, covering diverse scenarios from single-subject to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject scenarios, in which both facial appearance and vocal timbre act as identity-bearing control signals. Moreover, in light of modality disparity, we design a multi-stage training strategy to accelerate convergence and enforce cross-modal coherence. Experiments demonstrate the superiority of the proposed framework. For more details and qualitative results, please refer to our webpage: \href{https://chen-yingjie.github.io/projects/Identity-as-Presence}{Identity-as-Presence}.
Chinese Translation
近期的研究进展展示了将真实个体合成到生成视频中的强大能力,反映了对身份感知内容创作日益增长的需求。然而,尚未出现一个开放可用的框架,能够在多个身份之间对面部外观和声音音色进行细粒度控制。在本研究中,我们提出了一个统一且可扩展的身份感知联合音视频生成框架,实现高保真和一致的个性化。具体而言,我们引入了一个数据策划管道,自动提取带有配对注释的身份信息,涵盖音频和视觉模态,适用于从单一主体到多主体交互的多种场景。我们进一步提出了一种灵活且可扩展的身份注入机制,适用于单一和多主体场景,其中面部外观和声音音色作为身份控制信号。此外,考虑到模态差异,我们设计了一种多阶段训练策略,以加速收敛并强化跨模态一致性。实验结果证明了所提框架的优越性。有关更多细节和定性结果,请参阅我们的网页:\href{https://chen-yingjie.github.io/projects/Identity-as-Presence}{Identity-as-Presence}。
cs.CV / 133 / 2603.17895
A Creative Agent is Worth a 64-Token Template
创造性代理的价值相当于一个64-token模板
Abstract
Text-to-image (T2I) models have substantially improved image fidelity and prompt adherence, yet their creativity remains constrained by reliance on discrete natural language prompts. When presented with fuzzy prompts such as ``a creative vinyl record-inspired skyscraper'', these models often fail to infer the underlying creative intent, leaving creative ideation and prompt design largely to human users. Recent reasoning- or agent-driven approaches iteratively augment prompts but incur high computational and monetary costs, as their instance-specific generation makes ``creativity'' costly and non-reusable, requiring repeated queries or reasoning for subsequent generations. To address this, we introduce \textbf{CAT}, a framework for \textbf{C}reative \textbf{A}gent \textbf{T}okenization that encapsulates agents' intrinsic understanding of ``creativity'' through a \textit{Creative Tokenizer}. Given the embeddings of fuzzy prompts, the tokenizer generates a reusable token template that can be directly concatenated with them to inject creative semantics into T2I models without repeated reasoning or prompt augmentation. To enable this, the tokenizer is trained via creative semantic disentanglement, leveraging relations among partially overlapping concept pairs to capture the agent's latent creative representations. Extensive experiments on \textbf{\textit{Architecture Design}}, \textbf{\textit{Furniture Design}}, and \textbf{\textit{Nature Mixture}} tasks demonstrate that CAT provides a scalable and effective paradigm for enhancing creativity in T2I generation, achieving a $3.7\times$ speedup and a $4.8\times$ reduction in computational cost, while producing images with superior human preference and text-image alignment compared to state-of-the-art T2I models and creative generation methods.
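The reusable token-template idea reduces, at inference time, to prepending a fixed bank of learned embeddings to the prompt embeddings. The sketch below shows only that injection step; the template here is untrained, and the sizes (64 tokens, embedding dimension 16) are illustrative, with the actual tokenizer training being the paper's contribution.

```python
import numpy as np

# Injecting a reusable "creative" token template: the learned template is
# concatenated with the prompt embeddings once, with no per-prompt
# reasoning or prompt augmentation repeated at generation time.
def inject_template(prompt_emb, template):
    return np.concatenate([template, prompt_emb], axis=0)

d = 16
template = np.zeros((64, d))   # 64-token template (learned in the paper)
prompt = np.ones((12, d))      # embeddings of a fuzzy prompt
cond = inject_template(prompt, template)
assert cond.shape == (64 + 12, d)
```

Because the template is computed once and reused, the per-generation cost is a single concatenation, which is where the reported speedup over iterative agent-based prompting comes from.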
Chinese Translation
文本到图像(T2I)模型在图像保真度和提示遵循方面有了显著提升,但其创造力仍然受到对离散自然语言提示依赖的限制。当面对模糊提示,例如“灵感来自创意黑胶唱片的摩天大楼”时,这些模型往往无法推断出潜在的创造意图,从而使得创造性构思和提示设计主要依赖于人类用户。最近的基于推理或代理驱动的方法通过迭代增强提示,但由于其实例特定的生成方式,使得“创造力”变得昂贵且不可重用,需要对后续生成进行重复查询或推理。为了解决这一问题,我们提出了 extbf{CAT},一个 extbf{C}reative extbf{A}gent extbf{T}okenization框架,通过 extit{Creative Tokenizer}封装代理对“创造力”的内在理解。给定模糊提示的嵌入,tokenizer生成一个可重用的token模板,可以直接与这些提示连接,以在T2I模型中注入创造性语义,而无需重复推理或提示增强。为实现这一目标,tokenizer通过创造性语义解缠结进行训练,利用部分重叠概念对之间的关系来捕捉代理的潜在创造性表示。在 extbf{ extit{建筑设计}}、 extbf{ extit{家具设计}}和 extbf{ extit{自然混合}}任务上的大量实验表明,CAT提供了一种可扩展且有效的范式,以增强T2I生成中的创造力,实现了$3.7 imes$的加速和$4.8 imes$的计算成本降低,同时生成的图像在与最先进的T2I模型和创造性生成方法相比,具有更优的人类偏好和文本-图像对齐效果。
cs.CV / 134 / 2603.17910
SpiderCam: Low-Power Snapshot Depth from Differential Defocus
SpiderCam:基于差分失焦的低功耗快照深度相机
Abstract
We introduce SpiderCam, an FPGA-based snapshot depth-from-defocus camera which produces 480x400 sparse depth maps in real-time at 32.5 FPS over a working range of 52 cm while consuming 624 mW of power in total. SpiderCam comprises a custom camera that simultaneously captures two differently focused images of the same scene, processed with a SystemVerilog implementation of depth from differential defocus (DfDD) on a low-power FPGA. To achieve state-of-the-art power consumption, we present algorithmic improvements to DfDD that overcome challenges caused by low-power sensors, and design a memory-local implementation for streaming depth computation on a device that is too small to store even a single image pair. We report the first sub-Watt total power measurement for passive FPGA-based 3D cameras in the literature.
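A heavily hedged sketch of the underlying cue: for two images of the same scene at slightly different focus settings, the brightness difference is approximately proportional to the image Laplacian scaled by a depth-dependent factor, so a per-pixel ratio yields a (relative) sparse depth map. This is the generic textbook differential-defocus cue, not SpiderCam's exact estimator, and the texture-confidence threshold is an assumption.

```python
import numpy as np

# Generic differential-defocus cue: depth is read off as the per-pixel
# ratio of the focus difference to a spatial Laplacian, kept only where
# there is enough texture (hence a sparse depth map).
def laplacian(img):
    return (np.roll(img, 1, 0) + np.roll(img, -1, 0) +
            np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)

def dfdd_ratio(i_near, i_far, eps=1e-6):
    lap = laplacian(0.5 * (i_near + i_far))
    conf = np.abs(lap) > 1e-3        # keep only well-textured pixels
    depth = np.zeros_like(lap)
    depth[conf] = (i_near - i_far)[conf] / (lap[conf] + eps)
    return depth, conf

flat = np.ones((8, 8))
depth, conf = dfdd_ratio(flat, flat)
# A textureless scene yields no confident pixels and an empty depth map.
assert not conf.any() and np.all(depth == 0)
```

The fact that the estimate is a purely local, streaming-friendly ratio is what makes a memory-local FPGA implementation plausible: no full image pair ever needs to be resident at once.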
Chinese Translation
我们介绍了SpiderCam,这是一种基于FPGA的快照式失焦深度相机,能够实时生成480x400的稀疏深度图,帧率为32.5 FPS,工作范围为52厘米,同时总功耗仅为624毫瓦。SpiderCam由一个定制相机组成,能够同时捕捉同一场景的两幅不同聚焦的图像,并在低功耗FPGA上使用SystemVerilog实现的差分失焦深度(DfDD)算法进行处理。为了实现最先进的功耗表现,我们提出了对DfDD的算法改进,以克服低功耗传感器带来的挑战,并设计了一种内存局部化实现,以便在一个小到连一对图像都无法完整存储的设备上进行流式深度计算。我们报告了文献中首个基于FPGA的被动3D相机的亚瓦特总功耗测量结果。
cs.CV / 135 / 2603.17914
Noise-Aware Misclassification Attack Detection in Collaborative DNN Inference
协作深度神经网络推理中的噪声感知误分类攻击检测
Abstract
Collaborative inference of object classification deep neural networks (DNNs), where resource-constrained end-devices offload partially processed data to remote edge servers to complete end-to-end processing, is becoming a key enabler of edge-AI. However, such edge-offloading is vulnerable to malicious data injections leading to stealthy misclassifications that are tricky to detect, especially in the presence of environmental noise. In this paper, we propose a semi-gray-box and noise-aware anomaly detection framework fueled by a variational autoencoder (VAE) to capture deviations caused by adversarial manipulation. The proposed framework incorporates a robust noise-aware feature that captures the characteristic behavior of environmental noise to improve detection accuracy while reducing false alarm rates. Our evaluation with popular object classification DNNs demonstrates the robustness of the proposed detection (up to 90% AUROC across DNN configurations) under realistic noisy conditions while revealing limitations caused by feature similarity and elevated noise levels.
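The detection principle, flag offloaded activations whose reconstruction error under a model of clean data is abnormally high, can be sketched with a linear (PCA) autoencoder standing in for the paper's VAE. The clean-data manifold, injection model, and threshold rule below are all toy assumptions.

```python
import numpy as np

# Reconstruction-error anomaly scoring with a linear (PCA) autoencoder as
# a stand-in for a VAE: fit on clean activations, then flag inputs whose
# reconstruction error exceeds a threshold calibrated on clean data.
def fit_pca(X, k):
    mu = X.mean(0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def recon_error(x, mu, comps):
    z = (x - mu) @ comps.T            # encode to k components
    return float(np.linalg.norm((x - mu) - z @ comps))

rng = np.random.default_rng(3)
basis = rng.normal(size=(2, 16))
clean = rng.normal(size=(200, 2)) @ basis   # clean data lies on a 2D plane
mu, comps = fit_pca(clean, k=2)
thr = max(recon_error(x, mu, comps) for x in clean) + 1e-6
adv = clean[0] + 5.0 * rng.normal(size=16)  # off-manifold malicious injection
assert recon_error(adv, mu, comps) > thr    # flagged as anomalous
```

The paper's noise-aware feature refines this picture by modeling how benign environmental noise moves samples off the clean manifold, so that only adversarial deviations trip the alarm.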
Chinese Translation
对象分类深度神经网络(DNN)的协作推理中,资源受限的终端设备将部分处理的数据卸载到远程边缘服务器以完成端到端处理,已成为边缘人工智能的关键推动力。然而,这种边缘卸载容易受到恶意数据注入的攻击,导致隐蔽的误分类,尤其是在环境噪声存在的情况下,这种误分类更难以检测。本文提出了一种半灰盒和噪声感知的异常检测框架,该框架由变分自编码器(VAE)驱动,以捕捉由对抗性操控引起的偏差。所提出的框架结合了一种强大的噪声感知特征,能够捕捉环境噪声的特征行为,从而提高检测准确性并降低误报率。我们对流行的对象分类DNN的评估表明,在现实噪声条件下,所提出的检测方法具有良好的鲁棒性(在不同DNN配置下的AUROC高达90%),同时揭示了特征相似性和噪声水平升高所导致的局限性。
cs.CV / 136 / 2603.17920
SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale
SegFly:一种用于大规模空中RGB-热成像语义分割的2D-3D-2D范式
Abstract
Semantic segmentation for uncrewed aerial vehicles (UAVs) is fundamental for aerial scene understanding, yet existing RGB and RGB-T datasets remain limited in scale, diversity, and annotation efficiency due to the high cost of manual labeling and the difficulties of accurate RGB-T alignment on off-the-shelf UAVs. To address these challenges, we propose a scalable geometry-driven 2D-3D-2D paradigm that leverages multi-view redundancy in high-overlap aerial imagery to automatically propagate labels from a small subset of manually annotated RGB images to both RGB and thermal modalities within a unified framework. By lifting less than 3% of RGB images into a semantic 3D point cloud and reprojecting it into all views, our approach enables dense pseudo ground-truth generation across large image collections, automatically producing 97% of RGB labels and 100% of thermal labels while achieving 91% and 88% annotation accuracy without any 2D manual refinement. We further extend this 2D-3D-2D paradigm to cross-modal image registration, using 3D geometry as an intermediate alignment space to obtain fully automatic, strong pixel-level RGB-T alignment with 87% registration accuracy and no hardware-level synchronization. Applying our framework to existing geo-referenced aerial imagery, we construct SegFly, a large-scale benchmark with over 20,000 high-resolution RGB images and more than 15,000 geometrically aligned RGB-T pairs spanning diverse urban, industrial, and rural environments across multiple altitudes and seasons. On SegFly, we establish the Firefly baseline for RGB and thermal semantic segmentation and show that both conventional architectures and vision foundation models benefit substantially from SegFly supervision, highlighting the potential of geometry-driven 2D-3D-2D pipelines for scalable multi-modal scene understanding. Data and Code available at https://github.com/markus-42/SegFly.
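The 2D-3D-2D propagation step, reprojecting labeled 3D points into new views to create pseudo ground truth, can be sketched with a pinhole model. This is a minimal version assuming known intrinsics `K` and pose `[R|t]`, with no occlusion handling or label densification.

```python
import numpy as np

# 2D-3D-2D label propagation: labeled 3D points are projected into a new
# view with a pinhole camera to write a sparse pseudo ground-truth map.
def reproject_labels(pts3d, labels, K, R, t, hw):
    h, w = hw
    cam = pts3d @ R.T + t                  # world -> camera coordinates
    uvw = cam @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]          # perspective divide
    pseudo = np.full((h, w), -1, dtype=int)   # -1 marks unlabeled pixels
    for (u, v), z, lab in zip(uv, cam[:, 2], labels):
        ui, vi = int(round(u)), int(round(v))
        if z > 0 and 0 <= vi < h and 0 <= ui < w:   # in front and in frame
            pseudo[vi, ui] = lab
    return pseudo

K = np.array([[50.0, 0, 32], [0, 50.0, 32], [0, 0, 1]])
pts = np.array([[0.0, 0.0, 10.0], [1.0, 0.0, 10.0]])
pseudo = reproject_labels(pts, [3, 7], K, np.eye(3), np.zeros(3), (64, 64))
assert pseudo[32, 32] == 3 and pseudo[32, 37] == 7
```

Running the same reprojection for thermal cameras through the shared 3D geometry is also how the paper obtains RGB-T alignment without hardware synchronization.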
Chinese Translation
无人机(UAV)的语义分割对于空中场景理解至关重要,但现有的RGB和RGB-T数据集在规模、多样性和标注效率方面仍然有限,这主要是由于人工标注的高成本以及在现成无人机上实现准确RGB-T对齐的困难。为了解决这些挑战,我们提出了一种可扩展的几何驱动2D-3D-2D范式,该范式利用高重叠空中图像中的多视角冗余,自动将标签从少量手动标注的RGB图像传播到统一框架内的RGB和热成像模态。通过将不到3%的RGB图像提升为语义3D点云并重新投影到所有视图,我们的方法能够在大规模图像集合中生成密集的伪地面真值,自动生成97%的RGB标签和100%的热成像标签,同时在没有任何2D手动精修的情况下实现91%和88%的标注准确率。我们进一步将这一2D-3D-2D范式扩展到跨模态图像配准,利用3D几何作为中间对齐空间,以获得完全自动化的强像素级RGB-T对齐,注册准确率达到87%,且无需硬件级同步。将我们的框架应用于现有的地理参考空中图像,我们构建了SegFly,这是一个大规模基准数据集,包含超过20,000张高分辨率RGB图像和超过15,000对几何对齐的RGB-T图像,涵盖多种城市、工业和农村环境,跨越多个高度和季节。在SegFly上,我们建立了RGB和热成像语义分割的Firefly基线,并展示了传统架构和视觉基础模型在SegFly监督下的显著收益,突显了几何驱动的2D-3D-2D管道在可扩展多模态场景理解中的潜力。数据和代码可在 https://github.com/markus-42/SegFly 获取。
cs.CV / 137 / 2603.17926
A practical artificial intelligence framework for legal age estimation using clavicle computed tomography scans
基于锁骨计算机断层扫描的法律年龄估计实用人工智能框架
Abstract
Legal age estimation plays a critical role in forensic and medico-legal contexts, where decisions must be supported by accurate, robust, and reproducible methods with explicit uncertainty quantification. While prior artificial intelligence (AI)-based approaches have primarily focused on hand radiographs or dental imaging, clavicle computed tomography (CT) scans remain underexplored despite their documented effectiveness for legal age estimation. In this work, we present an interpretable, multi-stage pipeline for legal age estimation from clavicle CT scans. The proposed framework combines (i) a feature-based connected-component method for automatic clavicle detection that requires minimal manual annotation, (ii) an Integrated Gradients-guided slice selection strategy used to construct the input data for a multi-slice convolutional neural network that estimates legal age, and (iii) conformal prediction intervals to support uncertainty-aware decisions in accordance with established international protocols. The pipeline is evaluated on 1,158 full-body post-mortem CT scans from a public forensic dataset (the New Mexico Decedent Image Database). The final model achieves state-of-the-art performance with a mean absolute error (MAE) of 1.55 $\pm$ 0.16 years on a held-out test set, outperforming both human experts (MAE of approximately 1.90 years) and previous methods (MAEs above 1.75 years in our same dataset). Furthermore, conformal prediction enables configurable coverage levels aligned with forensic requirements. Attribution maps indicate that the model focuses on anatomically relevant regions of the medial clavicular epiphysis. The proposed method, which is currently being added as part of the Skeleton-ID software (https://skeleton-id.com/skeleton-id/), is intended as a decision-support component within multi-factorial forensic workflows.
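The conformal prediction component can be illustrated with split conformal regression on absolute residuals. This is the standard construction, not the paper's exact calibration; the toy ages and point predictor are assumptions.

```python
import numpy as np

# Split conformal prediction for a point age estimator: calibrate the
# absolute residuals on held-out data, then widen every prediction by
# the appropriate residual quantile to get coverage >= 1 - alpha.
def conformal_interval(pred, calib_preds, calib_ages, alpha=0.1):
    scores = np.sort(np.abs(np.asarray(calib_ages) - np.asarray(calib_preds)))
    n = len(scores)
    # Finite-sample-corrected quantile index: ceil((n+1)(1-alpha)) - 1.
    k = min(int(np.ceil((n + 1) * (1 - alpha))) - 1, n - 1)
    q = scores[k]
    return pred - q, pred + q

rng = np.random.default_rng(4)
ages = rng.uniform(15, 30, size=500)
preds = ages + rng.normal(0.0, 1.5, size=500)   # imperfect age estimator
lo, hi = conformal_interval(22.0, preds, ages, alpha=0.1)
assert lo < 22.0 < hi
```

The forensic appeal is that `alpha` maps directly to a configurable coverage guarantee, so the interval width can be tuned to match the evidentiary standard required by a given legal protocol.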
Chinese Translation
法律年龄估计在法医学和医学法律背景中发挥着关键作用,其中决策必须基于准确、稳健且可重复的方法,并明确量化不确定性。先前基于人工智能(AI)的方法主要集中在手部X光片或牙科影像上,而锁骨计算机断层扫描(CT)尽管已有文献证明其在法律年龄估计中的有效性,却仍未得到充分探索。在本研究中,我们提出了一种可解释的多阶段管道,用于从锁骨CT扫描中进行法律年龄估计。所提出的框架结合了(i)一种基于特征的连通组件方法,用于自动锁骨检测,所需的手动标注最少;(ii)一种集成梯度引导的切片选择策略,用于构建多切片卷积神经网络的输入数据,以估计法律年龄;以及(iii)共形预测(conformal prediction)区间,以支持符合既定国际规程的不确定性感知决策。该管道在一个公共法医学数据集(新墨西哥州死者影像数据库)中的1158个全身尸检CT扫描上进行了评估。最终模型在保留的测试集上实现了最先进的性能,平均绝对误差(MAE)为1.55 ± 0.16年,优于人类专家(MAE约为1.90年)和之前的方法(在同一数据集上MAE超过1.75年)。此外,共形预测支持与法医学要求相一致的可配置覆盖水平。归因图表明,该模型关注于内侧锁骨骨骺的解剖相关区域。所提出的方法目前正被集成到Skeleton-ID软件中(https://skeleton-id.com/skeleton-id/),旨在作为多因素法医学工作流程中的决策支持组件。
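The conformal prediction component in (iii) can be sketched with a standard split-conformal recipe; the synthetic ages, the ~1.5-year model error, and the 0.1 miscoverage level below are illustrative stand-ins, not the paper's actual data or setup:

```python
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Split-conformal intervals: calibrate a symmetric interval width on a
    held-out set so new predictions get ~(1 - alpha) marginal coverage."""
    n = len(cal_true)
    scores = np.abs(cal_true - cal_pred)                   # nonconformity scores
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)   # finite-sample correction
    q = np.quantile(scores, level, method="higher")
    return test_pred - q, test_pred + q

rng = np.random.default_rng(0)
true_age = rng.uniform(15.0, 30.0, 500)             # hypothetical subject ages
pred_age = true_age + rng.normal(0.0, 1.5, 500)     # model with ~1.5 y error
lo, hi = split_conformal_interval(pred_age[:250], true_age[:250],
                                  pred_age[250:], alpha=0.1)
coverage = np.mean((true_age[250:] >= lo) & (true_age[250:] <= hi))
print(round(float(coverage), 2))  # close to the 0.90 target
```

The coverage level `alpha` is the configurable knob the abstract refers to: tightening it widens the intervals, matching stricter forensic requirements.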
cs.CV / 138 / 2603.17930
Interpretable Traffic Responsibility from Dashcam Video via Legal Multi Agent Reasoning
基于法律多智能体推理的行车记录仪视频可解释交通责任分析
Abstract
The widespread adoption of dashcams has made video evidence in traffic accidents increasingly abundant, yet transforming "what happened in the video" into "who is responsible under which legal provisions" still relies heavily on human experts. Existing ego-view traffic accident studies mainly focus on perception and semantic understanding, while LLM-based legal methods are mostly built on textual case descriptions and rarely incorporate video evidence, leaving a clear gap between the two. We first propose C-TRAIL, a multimodal legal dataset that, under the Chinese traffic regulation system, explicitly aligns dashcam videos and textual descriptions with a closed set of responsibility modes and their corresponding Chinese traffic statutes. On this basis, we introduce a two-stage framework: (1) a traffic accident understanding module that generates textual video descriptions; and (2) a legal multi-agent framework that outputs responsibility modes, statute sets, and complete judgment reports. Experimental results on C-TRAIL and MM-AU show that our method outperforms general and legal LLMs, as well as existing agent-based approaches, while providing a transparent and interpretable legal reasoning process.
Chinese Translation
行车记录仪的广泛应用使得交通事故中的视频证据日益丰富,但将“视频中发生了什么”转化为“在何种法律条款下谁应负责”仍然高度依赖于人类专家。现有的自视角交通事故研究主要集中在感知和语义理解上,而基于大型语言模型(LLM)的法律方法主要建立在文本案例描述上,鲜少结合视频证据,导致两者之间存在明显的鸿沟。我们首先提出了C-TRAIL,这是一个多模态法律数据集,在中国交通法规体系下,明确将行车记录仪视频与文本描述对齐,并与一组封闭的责任模式及其对应的中国交通法规相对应。在此基础上,我们引入了一个两阶段框架:(1) 一个交通事故理解模块,生成文本视频描述;(2) 一个法律多智能体框架,输出责任模式、法规集合和完整的判决报告。在C-TRAIL和MM-AU上的实验结果表明,我们的方法在性能上优于一般和法律LLM,以及现有的基于智能体的方法,同时提供了透明且可解释的法律推理过程。
cs.CV / 139 / 2603.17944
TransText: Transparency Aware Image-to-Video Typography Animation
TransText:透明度感知的图像到视频排版动画
Abstract
We introduce the first method, to the best of our knowledge, for adapting image-to-video models to layer-aware text (glyph) animation, a capability critical for practical dynamic visual design. Existing approaches predominantly handle the transparency-encoding (alpha channel) as an extra latent dimension appended to the RGB space, necessitating the reconstruction of the underlying RGB-centric variational autoencoder (VAE). However, given the scarcity of high-quality transparent glyph data, retraining the VAE is computationally expensive and may erode the robust semantic priors learned from massive RGB corpora, potentially leading to latent pattern mixing. To mitigate these limitations, we propose TransText, a framework based on a novel Alpha-as-RGB paradigm to jointly model appearance and transparency without modifying the pre-trained generative manifold. TransText embeds the alpha channel as an RGB-compatible visual signal through latent spatial concatenation, explicitly ensuring strict cross-modal (RGB-and-Alpha) consistency while preventing feature entanglement. Our experiments demonstrate that TransText significantly outperforms baselines, generating coherent, high-fidelity transparent animations with diverse, fine-grained effects.
Chinese Translation
据我们所知,我们提出了首个将图像到视频模型适配于层感知文本(字形)动画的方法,这一能力对于实际动态视觉设计至关重要。现有方法主要将透明度编码(alpha 通道)视为附加在 RGB 空间上的额外潜在维度,这需要重构基于 RGB 的变分自编码器(VAE)。然而,由于高质量透明字形数据的稀缺,重新训练 VAE 计算成本高昂,并可能削弱从大量 RGB 语料中学习到的稳健语义先验,进而导致潜在模式混合。为了解决这些局限性,我们提出了 TransText,一个基于新颖的 Alpha-as-RGB 范式的框架,能够在不修改预训练生成流形的情况下联合建模外观和透明度。TransText 通过潜在空间拼接将 alpha 通道嵌入为与 RGB 兼容的视觉信号,明确确保严格的跨模态(RGB 和 Alpha)一致性,同时防止特征纠缠。我们的实验表明,TransText 显著优于基线,生成连贯、高保真的透明动画,具有多样化和细致的效果。
cs.CV / 140 / 2603.17948
VideoAtlas: Navigating Long-Form Video in Logarithmic Compute
视频地图:在对数计算中导航长篇视频
Abstract
Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long-context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome this, we introduce \textbf{VideoAtlas}, a task-agnostic environment to represent video as a hierarchical grid that is simultaneously lossless, navigable, scalable, caption- and preprocessing-free. An overview of the video is available at a glance, and any region can be recursively zoomed into, with the same visual representation used uniformly for the video, intermediate investigations, and the agent's memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures access depth grows only logarithmically with video length. For long-context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to visual domain requires a structured environment to recurse into, which \textbf{VideoAtlas} provides. \textbf{VideoAtlas} as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture where a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence. We demonstrate three key findings: (1)~logarithmic compute growth with video duration, further amplified by a 30-60\% multimodal cache hit rate arising from the grid's structural reuse. (2)~environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter. (3)~emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.
Chinese Translation
将语言模型扩展到视频引入了两个挑战:表示,其中现有方法依赖于有损近似;长上下文,其中基于字幕或代理的管道将视频压缩为文本并失去视觉保真度。为了解决这个问题,我们提出了 \textbf{VideoAtlas},一个与任务无关的环境,将视频表示为一个层次化网格,该网格同时是无损的、可导航的、可扩展的,并且不依赖于字幕或预处理。视频的概览可以一目了然,任何区域都可以递归放大,视频、中间探查过程和代理的记忆均使用相同的视觉表示,从而消除了端到端的有损文本转换。这种层次结构确保访问深度仅随视频长度以对数方式增长。对于长上下文,递归语言模型(Recursive Language Models, RLMs)最近为长文本提供了强大的解决方案,但将其扩展到视觉领域需要一个结构化的环境以进行递归,而 \textbf{VideoAtlas} 正是提供了这样的环境。作为马尔可夫决策过程(Markov Decision Process),\textbf{VideoAtlas} 解锁了视频递归语言模型(Video-RLM):一种并行的主-工作者架构,其中主节点协调全局探索,而工作者则同时深入分配的区域以积累无损视觉证据。我们展示了三个关键发现:(1)计算量随视频时长对数增长,并通过网格结构重用带来的 30-60\% 多模态缓存命中率进一步放大;(2)环境预算,通过限制最大探索深度提供了一个原则性的计算-准确性超参数;(3)随问题粒度扩展的涌现式自适应计算分配。当从1小时扩展到10小时基准时,Video-RLM 仍然是最具时长鲁棒性的方法,准确性降幅最小,证明了结构化环境导航是视频理解的一个可行且可扩展的范式。
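The logarithmic-depth claim can be checked with a one-line model: if each grid node fans out into a fixed number of cells (the 3x3 grid of 9 below is an assumed fan-out, not necessarily the paper's layout), the number of zoom steps needed to isolate any frame is:

```python
import math

def atlas_depth(n_frames, fanout=9):
    """Zoom steps to isolate one frame in a hierarchy with `fanout`
    children per node: depth = ceil(log_fanout(n_frames))."""
    return max(1, math.ceil(math.log(n_frames, fanout)))

# a 10x longer video costs only one extra zoom level
print(atlas_depth(3600), atlas_depth(36000))  # 1 h vs 10 h at 1 fps
```

Under these assumptions, scaling from a 1-hour to a 10-hour video adds a single level of navigation, which is the mechanism behind the duration-robustness the abstract reports.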
cs.CV / 141 / 2603.17965
LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition
LaDe:统一的多层图形媒体生成与分解
Abstract
Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).
Chinese Translation
媒体设计层生成使得能够仅通过自然语言提示创建完全可编辑的分层设计文档,如海报、传单和徽标。现有方法要么限制输出为固定数量的层,要么要求每层仅包含空间上连续的区域,导致层数随着设计复杂性的增加而线性增长。我们提出了LaDe(分层媒体设计),一种生成语义上有意义的灵活层数的潜在扩散框架。LaDe结合了三个组件:基于大型语言模型(LLM)的提示扩展器,将简短的用户意图转化为结构化的每层描述,以指导生成;具有4D RoPE位置编码机制的潜在扩散变换器,联合生成完整的媒体设计及其组成的RGBA层;以及支持完整alpha通道的RGBA变分自编码器(VAE),对每层进行解码。通过在训练过程中对层样本进行条件化,我们的统一框架支持三项任务:文本到图像生成、文本到层的媒体设计生成以及媒体设计分解。我们在Crello测试集上将LaDe与Qwen-Image-Layered在文本到层和图像到层任务上进行了比较。LaDe在文本到层生成任务中优于Qwen-Image-Layered,其在文本与层对齐度上的提升已由两个VLM-as-a-judge评估器(GPT-4o mini和Qwen3-VL)验证。
cs.CV / 142 / 2603.17968
Robust-ComBat: Mitigating Outlier Effects in Diffusion MRI Data Harmonization
鲁棒-ComBat:减轻扩散MRI数据协调中的异常值影响
Abstract
Harmonization methods such as ComBat and its variants are widely used to mitigate diffusion MRI (dMRI) site-specific biases. However, ComBat assumes that subject distributions exhibit a Gaussian profile. In practice, patients with neurological disorders often present diffusion metrics that deviate markedly from those of healthy controls, introducing pathological outliers that distort site-effect estimation. This problem is particularly challenging in clinical practice as most patients undergoing brain imaging have an underlying and yet undiagnosed condition, making it difficult to exclude them from harmonization cohorts, as their scans were precisely prescribed to establish a diagnosis. In this paper, we show that harmonizing data to a normative reference population with ComBat while including pathological cases induces significant distortions. Across 7 neurological conditions, we evaluated 10 outlier rejection methods with 4 ComBat variants over a wide range of scenarios, revealing that many filtering strategies fail in the presence of pathology. In contrast, a simple MLP provides robust outlier compensation enabling reliable harmonization while preserving disease-related signal. Experiments on both control and real multi-site cohorts, comprising up to 80% of subjects with neurological disorders, demonstrate that Robust-ComBat consistently outperforms conventional statistical baselines with lower harmonization error across all ComBat variants.
Chinese Translation
协调方法如ComBat及其变体广泛用于减轻扩散MRI(dMRI)特定站点的偏差。然而,ComBat假设受试者分布呈高斯分布。在实际中,神经系统疾病患者的扩散指标往往明显偏离健康对照组,导致病理异常值扭曲站点效应的估计。这个问题在临床实践中尤为棘手,因为大多数接受脑成像的患者存在潜在且尚未诊断的病症,使得将他们排除在协调队列之外变得困难,因为他们的扫描正是为了确立诊断而精确开具的。在本文中,我们展示了在纳入病理病例的情况下,使用ComBat将数据协调到规范参考人群会引入显著的扭曲。在7种神经疾病中,我们评估了10种异常值拒绝方法与4种ComBat变体在广泛场景下的表现,结果显示许多过滤策略在存在病理情况下失效。相比之下,一个简单的多层感知器(MLP)提供了鲁棒的异常值补偿,能够在保留与疾病相关信号的同时实现可靠的协调。在对照组和真实多站点队列(其中高达80%的受试者患有神经系统疾病)上的实验表明,鲁棒-ComBat在所有ComBat变体中始终优于传统统计基线,具有更低的协调误差。
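A toy illustration of why outlier-robust site estimates matter: below, a simple median/MAD location-scale alignment stands in for harmonization (a deliberate simplification; real ComBat fits empirical-Bayes batch effects, and the paper's Robust-ComBat uses an MLP for outlier compensation). The synthetic fractional-anisotropy values, site offsets, and outlier fraction are invented for the demo:

```python
import numpy as np

def harmonize_sites(x, site):
    """Map each site's values onto the pooled distribution via a
    location/scale shift estimated with outlier-robust median/MAD."""
    x = np.asarray(x, float)
    mad = lambda v: np.median(np.abs(v - np.median(v))) + 1e-9
    out = np.empty_like(x)
    g_loc, g_scale = np.median(x), mad(x)
    for s in np.unique(site):
        m = site == s
        out[m] = (x[m] - np.median(x[m])) / mad(x[m]) * g_scale + g_loc
    return out

rng = np.random.default_rng(1)
site = np.repeat([0, 1], 200)
fa = np.where(site == 0, rng.normal(0.45, 0.02, 400),
              rng.normal(0.50, 0.02, 400))    # site 1 scanner reads higher
fa[:40] = 0.20                                # pathological outliers at site 0
h = harmonize_sites(fa, site)
raw_gap = abs(np.median(fa[site == 0][40:]) - np.median(fa[site == 1]))
rob_gap = abs(np.median(h[site == 0][40:]) - np.median(h[site == 1]))
print(rob_gap < raw_gap)  # outliers did not derail the site correction
```

With mean/std estimates instead, 20% pathological cases would drag the site statistics and distort the healthy subjects, which is the failure mode the abstract describes.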
cs.CV / 143 / 2603.17975
AHOY! Animatable Humans under Occlusion from YouTube Videos with Gaussian Splatting and Video Diffusion Priors
AHOY!在YouTube视频中通过高斯溅射和视频扩散先验重建可动画的人体模型
Abstract
We present AHOY, a method for reconstructing complete, animatable 3D Gaussian avatars from in-the-wild monocular video despite heavy occlusion. Existing methods assume unoccluded input (a fully visible subject, often in a canonical pose), excluding the vast majority of real-world footage where people are routinely occluded by furniture, objects, or other people. Reconstructing from such footage poses fundamental challenges: large body regions may never be observed, and multi-view supervision per pose is unavailable. We address these challenges with four contributions: (i) a hallucination-as-supervision pipeline that uses identity-finetuned diffusion models to generate dense supervision for previously unobserved body regions; (ii) a two-stage canonical-to-pose-dependent architecture that bootstraps from sparse observations to full pose-dependent Gaussian maps; (iii) a map-pose/LBS-pose decoupling that absorbs multi-view inconsistencies from the generated data; (iv) a head/body split supervision strategy that preserves facial identity. We evaluate on YouTube videos and on multi-view capture data with significant occlusion and demonstrate state-of-the-art reconstruction quality. We also demonstrate that the resulting avatars are robust enough to be animated with novel poses and composited into 3DGS scenes captured using cell-phone video. Our project page is available at https://miraymen.github.io/ahoy/
Chinese Translation
我们提出了AHOY,一种从自然环境中的单目视频中重建完整的、可动画的3D高斯头像的方法,尽管存在严重的遮挡。现有的方法假设输入是未被遮挡的,即完全可见的主体,通常处于标准姿势,这排除了大多数现实世界的视频,因为人们常常被家具、物体或其他人遮挡。从这样的画面中重建面临着基本挑战:大面积的身体区域可能永远无法被观察到,并且每个姿势的多视角监督是不可用的。我们通过四个贡献来解决这些挑战:(i)一种幻觉作为监督的管道,利用经过身份微调的扩散模型为先前未观察到的身体区域生成密集监督;(ii)一种两阶段的标准姿势到依赖姿势的架构,从稀疏观察启动到完整的姿势依赖高斯图;(iii)一种图-姿势/LBS-姿势解耦,吸收生成数据中的多视角不一致性;(iv)一种头部/身体分离监督策略,保持面部身份。我们在YouTube视频和具有显著遮挡的多视角捕捉数据上进行了评估,展示了最先进的重建质量。我们还展示了生成的头像足够稳健,可以用新姿势进行动画,并合成到使用手机视频捕获的3DGS场景中。我们的项目页面可在 https://miraymen.github.io/ahoy/ 找到。
cs.CV / 144 / 2603.17979
AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception
AdaRadar:面向雷达感知的速率自适应频谱压缩
Abstract
Radar is a critical perception modality in autonomous driving systems due to its all-weather characteristics and ability to measure range and Doppler velocity. However, the sheer volume of high-dimensional raw radar data saturates the communication link to the computing engine (e.g., an NPU), which is often a low-bandwidth interface with data rate provisioned only for a few low-resolution range-Doppler frames. A generalized codec for utilizing high-dimensional radar data is notably absent, while existing image-domain approaches are unsuitable, as they typically operate at fixed compression ratios and fail to adapt to varying or adversarial conditions. In light of this, we propose radar data compression with adaptive feedback. It dynamically adjusts the compression ratio by performing gradient descent from the proxy gradient of detection confidence with respect to the compression rate. We employ a zeroth-order gradient approximation as it enables gradient computation even with non-differentiable core operations--pruning and quantization. This also avoids transmitting the gradient tensors over the band-limited link, which, if estimated, would be as large as the original radar data. In addition, we have found that radar feature maps are heavily concentrated on a few frequency components. Thus, we apply the discrete cosine transform to the radar data cubes and selectively prune out the coefficients effectively. We preserve the dynamic range of each radar patch through scaled quantization. Combining those techniques, our proposed online adaptive compression scheme achieves over 100x feature size reduction at minimal performance drop (~1%p). We validate our results on the RADIal, CARRADA, and Radatron datasets.
Chinese Translation
雷达是自动驾驶系统中一种关键的感知方式,因其具备全天候特性以及测量距离和多普勒速度的能力。然而,高维原始雷达数据的巨大体量使得与计算引擎(例如,NPU)之间的通信链路饱和,而该链路通常是低带宽接口,其数据速率仅为少量低分辨率的距离-多普勒帧所配置。缺乏一种通用的编解码器来有效利用高维雷达数据,而现有的图像域方法不适用,因为它们通常在固定压缩比下运行,无法适应变化或对抗条件。基于此,我们提出了一种具有自适应反馈的雷达数据压缩方法。该方法通过对检测置信度相对于压缩率的代理梯度进行梯度下降,动态调整压缩比。我们采用零阶梯度近似,因为它能够在非可微核心操作(如剪枝和量化)下进行梯度计算。这也避免了在带宽有限的链路上传输梯度张量,因为如果估计,梯度张量的大小将与原始雷达数据相当。此外,我们发现雷达特征图在少数频率成分上高度集中。因此,我们对雷达数据立方体应用离散余弦变换,并有效地选择性剪枝系数。我们通过缩放量化保留每个雷达补丁的动态范围。结合这些技术,我们提出的在线自适应压缩方案在性能损失极小(约1%p)的情况下实现了超过100倍的特征大小减少。我们在RADIal、CARRADA和Radatron数据集上验证了我们的结果。
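The frequency-domain observation is easy to reproduce with an orthonormal DCT and magnitude-based coefficient pruning. This is only the transform-and-prune core; the paper's full pipeline additionally adapts the rate online via zeroth-order gradients of detection confidence and applies scaled quantization, none of which is modeled here, and the smooth test patch is synthetic:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are frequency components)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def compress_patch(x, keep=0.05):
    """2D DCT -> keep the largest-magnitude coefficients -> inverse DCT."""
    c = dct_matrix(x.shape[0])
    coef = c @ x @ c.T
    thresh = np.quantile(np.abs(coef), 1 - keep)
    pruned = np.where(np.abs(coef) >= thresh, coef, 0.0)
    return c.T @ pruned @ c   # inverse of the orthonormal transform

# a smooth "feature map": its energy concentrates in few DCT coefficients
t = np.linspace(0.0, 1.0, 64)
patch = np.outer(np.sin(2 * np.pi * t), np.cos(2 * np.pi * t))
rec = compress_patch(patch, keep=0.05)
rel_err = np.linalg.norm(rec - patch) / np.linalg.norm(patch)
print(round(float(rel_err), 4))
```

Keeping only 5% of coefficients already reconstructs the smooth patch with small relative error, which is the concentration property AdaRadar exploits to reach its >100x reduction.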
cs.CV / 145 / 2603.17980
Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
感知空间:基于自我运动感知的视频表示以实现高效准确的三维场景理解
Abstract
Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations like point clouds or reconstructed Bird's-Eye View (BEV) maps, or lack physical grounding to resolve ambiguities in scale and size. This paper significantly enhances MLLMs with egomotion modality data, captured by Inertial Measurement Units (IMUs) concurrently with the video. In particular, we propose a novel framework, called Motion-MLLM, introducing two key components: (1) a cascaded motion-visual keyframe filtering module that leverages both IMU data and visual features to efficiently select a sparse yet representative set of keyframes, and (2) an asymmetric cross-modal fusion module where motion tokens serve as intermediaries that channel egomotion cues and cross-frame visual context into the visual representation. By grounding visual content in physical egomotion trajectories, Motion-MLLM can reason about absolute scale and spatial relationships across the scene. Our extensive evaluation shows that Motion-MLLM makes significant improvements in various tasks related to 3D scene understanding and spatial reasoning. Compared to state-of-the-art (SOTA) methods based on video frames and explicit 3D data, Motion-MLLM exhibits similar or even higher accuracy with significantly less overhead (i.e., 1.40$\times$ and 1.63$\times$ higher cost-effectiveness, respectively).
Chinese Translation
近期的多模态大型语言模型(MLLMs)在三维场景中的空间推理方面展现出很高的潜力。然而,它们通常依赖于计算成本高昂的三维表示,如点云或重建的鸟瞰图(Bird's-Eye View, BEV)地图,或者缺乏物理基础以解决尺度和大小的模糊性。本文利用与视频同步采集的惯性测量单元(Inertial Measurement Units, IMUs)数据,为MLLMs引入自我运动(egomotion)模态,从而显著增强其能力。特别地,我们提出了一个新颖的框架,称为Motion-MLLM,引入了两个关键组件:(1)一个级联运动-视觉关键帧过滤模块,利用IMU数据和视觉特征高效选择稀疏但具有代表性的关键帧集;(2)一个不对称的跨模态融合模块,其中运动标记作为中介,将自我运动线索和跨帧视觉上下文引入视觉表示。通过将视觉内容锚定于物理自我运动轨迹,Motion-MLLM能够推理场景中的绝对尺度和空间关系。我们的广泛评估表明,Motion-MLLM在与三维场景理解和空间推理相关的各种任务中显著提升了性能。与基于视频帧和显式三维数据的最先进(SOTA)方法相比,Motion-MLLM在准确性上表现出相似甚至更高的水平,同时显著降低了开销(即,成本效益分别提高了1.40倍和1.63倍)。
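The IMU half of the keyframe filter can be caricatured in a few lines; the real module cascades this motion cue with visual features, and the synthetic accelerometer trace below is invented for illustration:

```python
import numpy as np

def motion_keyframes(accel, n_keep=8):
    """Rank frames by the change in IMU acceleration between consecutive
    timestamps and keep a sparse, temporally ordered subset."""
    step = np.linalg.norm(np.diff(accel, axis=0), axis=1)
    score = np.concatenate([[0.0], step])       # frame 0 has no predecessor
    return np.sort(np.argsort(score)[-n_keep:])

rng = np.random.default_rng(3)
accel = rng.normal(0.0, 0.01, (120, 3))         # mostly-still camera
accel[60:70] += rng.normal(0.0, 1.0, (10, 3))   # a burst of egomotion
keys = motion_keyframes(accel, n_keep=8)
print(keys)  # selections concentrate around the burst
```

Selecting frames where egomotion is informative is what lets the full model cut the token budget while keeping viewpoint changes observable.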
cs.CV / 146 / 2603.17989
Versatile Editing of Video Content, Actions, and Dynamics without Training
无需训练的视频内容、动作和动态的多功能编辑
Abstract
Controlled video generation has seen drastic improvements in recent years. However, editing actions and dynamic events, or inserting contents that should affect the behaviors of other objects in real-world videos, remains a major challenge. Existing trained models struggle with complex edits, likely due to the difficulty of collecting relevant training data. Similarly, existing training-free methods are inherently restricted to structure- and motion-preserving edits and do not support modification of motion or interactions. Here, we introduce DynaEdit, a training-free editing method that unlocks versatile video editing capabilities with pretrained text-to-video flow models. Our method relies on the recently introduced inversion-free approach, which does not intervene in the model internals, and is thus model-agnostic. We show that naively attempting to adapt this approach to general unconstrained editing results in severe low-frequency misalignment and high-frequency jitter. We explain the sources for these phenomena and introduce novel mechanisms for overcoming them. Through extensive experiments, we show that DynaEdit achieves state-of-the-art results on complex text-based video editing tasks, including modifying actions, inserting objects that interact with the scene, and introducing global effects.
Chinese Translation
受控视频生成近年来取得了显著进展。然而,编辑动作和动态事件,或插入应影响现实世界视频中其他对象行为的内容,仍然是一个重大挑战。现有的训练模型在复杂编辑方面表现不佳,这可能是由于收集相关训练数据的困难。同样,现有的无训练方法本质上受到结构和运动保持编辑的限制,不支持运动或交互的修改。在此,我们介绍了DynaEdit,这是一种无需训练的编辑方法,利用预训练的文本到视频流模型解锁多功能视频编辑能力。我们的方法依赖于最近提出的无反演方法,该方法不干预模型内部,因此是模型无关的。我们展示了简单地尝试将这种方法适应于一般无约束编辑会导致严重的低频失配和高频抖动。我们解释了这些现象的来源,并引入了克服这些问题的新机制。通过广泛的实验,我们表明DynaEdit在复杂的基于文本的视频编辑任务中实现了最先进的结果,包括修改动作、插入与场景交互的对象以及引入全局效果。
cs.CV / 147 / 2603.17993
GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes
GMT:用于3D场景中6自由度物体轨迹合成的目标条件多模态变换器
Abstract
Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision. We present GMT, a multimodal transformer framework that generates realistic and goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric, semantic, contextual, and goal-oriented information. Extensive experiments on synthetic and real-world benchmarks demonstrate that GMT outperforms state-of-the-art human motion and human-object interaction baselines, such as CHOIS and GIMO, achieving substantial gains in spatial accuracy and orientation control. Our method establishes a new benchmark for learning-based manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments. Project page: https://huajian-zeng.github.io/projects/gmt/.
Chinese Translation
在3D环境中合成可控的6自由度物体操作轨迹对于使机器人能够与复杂场景进行交互至关重要,但由于需要准确的空间推理、物理可行性和多模态场景理解,这一任务仍然具有挑战性。现有方法通常依赖于2D或部分3D表示,限制了它们捕捉完整场景几何形状的能力,并约束了轨迹的精度。我们提出了GMT,一种多模态变换器框架,通过联合利用3D边界框几何、点云上下文、语义物体类别和目标末端姿态,生成逼真且以目标为导向的物体轨迹。该模型将轨迹表示为连续的6自由度姿态序列,并采用定制的条件策略,融合几何、语义、上下文和目标导向信息。在合成和真实世界基准上的大量实验表明,GMT在空间精度和方向控制方面显著优于最先进的人类运动和人机交互基准,如CHOIS和GIMO。我们的方法为基于学习的操作规划建立了新的基准,并在多样化物体和杂乱的3D环境中表现出强大的泛化能力。项目页面:https://huajian-zeng.github.io/projects/gmt/
cs.CV / 148 / 2603.17995
LoST: Level of Semantics Tokenization for 3D Shapes
LoST:3D形状的语义层级标记化
Abstract
Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%-10% of the tokens needed by prior AR models.
Chinese Translation
标记化是各种模态生成建模中的一项基础技术。特别是在自回归(AR)模型中,它发挥着关键作用,近年来这些模型已成为3D生成的一个引人注目的选择。然而,3D形状的最佳标记化仍然是一个未解的问题。现有的最先进(SOTA)方法主要依赖于几何细节层级(LoD)结构,这些结构最初是为渲染和压缩而设计的。这些空间层级通常在标记效率上表现不佳,并且在AR建模中缺乏语义一致性。我们提出了语义层级标记化(LoST),该方法按语义显著性对标记进行排序,使得早期前缀解码为完整且合理的形状,具备主要语义,而后续标记则细化特定实例的几何和语义细节。为了训练LoST,我们引入了关系间距对齐(RIDA),这是一种新颖的3D语义对齐损失,旨在将3D形状潜在空间的关系结构与语义DINO特征空间的关系结构对齐。实验表明,LoST在重建方面达到了SOTA,显著超越了之前基于LoD的3D形状标记化方法,在几何和语义重建指标上均表现出色。此外,LoST实现了高效、高质量的AR 3D生成,并支持语义检索等下游任务,同时仅使用了先前AR模型所需标记的0.1%-10%。
cs.CV / 149 / 2603.17998
The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering
文本嵌入插值在连续图像引导中的不合理有效性
Abstract
We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that a simple steering in the text-embedding space is sufficient to produce smooth edit control. Given a target concept (e.g., enhancing photorealism or changing facial expression), we use a large language model to automatically construct a small set of debiased contrastive prompt pairs, from which we compute a steering vector in the generator's text-encoder space. We then add this vector directly to the input prompt representation to control generation along the desired semantic axis. To obtain a continuous control, we propose an elastic range search procedure that automatically identifies an effective interval of steering magnitudes, avoiding both under-steering (no-edit) and over-steering (changing other attributes). Adding the scaled versions of the same vector within this interval yields smooth and continuous edits. Since our method modifies only textual representations, it naturally generalizes across text-conditioned modalities, including image and video generation. To quantify the steering continuity, we introduce a new evaluation metric that measures the uniformity of semantic change across edit strengths. We compare the continuous editing behavior across methods and find that, despite its simplicity and lightweight design, our approach is comparable to training-based alternatives, outperforming other training-free methods.
Chinese Translation
我们提出了一种无训练框架,用于在测试时进行连续和可控的图像编辑,适用于文本条件生成模型。与依赖额外训练或手动用户干预的先前方法不同,我们发现简单地在文本嵌入空间中进行引导就足以产生平滑的编辑控制。给定一个目标概念(例如,增强照片真实感或改变面部表情),我们使用大型语言模型自动构建一小组去偏差的对比提示对,从中计算生成器文本编码器空间中的引导向量。然后,我们将该向量直接添加到输入提示表示中,以沿着所需的语义轴控制生成。为了获得连续控制,我们提出了一种弹性范围搜索程序,自动识别有效的引导幅度区间,避免了引导不足(无编辑)和引导过度(改变其他属性)。在该区间内添加同一向量的缩放版本可以产生平滑和连续的编辑。由于我们的方法仅修改文本表示,因此它自然地泛化到各种文本条件模态,包括图像和视频生成。为了量化引导的连续性,我们引入了一种新的评估指标,测量编辑强度下语义变化的均匀性。我们比较了不同方法的连续编辑行为,发现尽管我们的设计简单且轻量,但与基于训练的方法相当,且优于其他无训练方法。
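The core recipe is only a few lines. The hash-based toy encoder below stands in for a real frozen text encoder so the sketch runs standalone, and the prompt pairs are invented examples, not the paper's LLM-generated debiased pairs:

```python
import zlib
import numpy as np

def toy_embed(text, dim=64):
    """Deterministic stand-in for a frozen text encoder."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    return rng.normal(size=dim)

def steering_vector(pairs, embed=toy_embed):
    """Mean embedding difference over contrastive prompt pairs."""
    return np.mean([embed(p) - embed(n) for p, n in pairs], axis=0)

pairs = [("a photorealistic portrait", "a cartoon portrait"),
         ("a photorealistic landscape", "a cartoon landscape")]
v = steering_vector(pairs)
base = toy_embed("a portrait of a sailor")
# continuous control: sweep the steering magnitude inside a chosen interval
edits = [base + alpha * v for alpha in np.linspace(0.0, 2.0, 5)]
print(len(edits), bool(np.allclose(edits[0], base)))
```

In the paper, the sweep interval for `alpha` is not fixed by hand but found by the elastic range search, which bounds it away from the no-edit and over-steering regimes.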
cs.CV / 150 / 2603.18001
EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding
EchoGen:用于统一布局-图像生成与理解的循环一致性学习
Abstract
In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and high fidelity to text descriptions (e.g., spatial relationships), while grounding the image robustly at the same time. We believe that image grounding possesses strong text and layout understanding abilities, which can compensate for the corresponding limitations in layout-to-image generation. At the same time, images generated from layouts exhibit high diversity in content, thereby enhancing the robustness of image grounding. Jointly training both tasks within a unified model can promote performance improvements for each. However, we identify that this joint training paradigm encounters several optimization challenges and results in restricted performance. To address these issues, we propose progressive training strategies. First, the Parallel Multi-Task Pre-training (PMTP) stage equips the model with basic abilities for both tasks, leveraging shared tokens to accelerate training. Next, the Dual Joint Optimization (DJO) stage exploits task duality to sequentially integrate the two tasks, enabling unified optimization. Finally, the Cycle RL stage eliminates reliance on visual supervision by using consistency constraints as rewards, significantly enhancing the model's unified capabilities via the GRPO strategy. Extensive experiments demonstrate state-of-the-art results on both layout-to-image generation and image grounding benchmarks, and reveal clear synergistic gains from optimizing the two tasks together.
Chinese Translation
在本研究中,我们提出了EchoGen,一个用于布局到图像生成和图像定位的统一框架,能够生成具有准确布局和高度符合文本描述(例如,空间关系)的图像,同时稳健地进行图像定位。我们认为,图像定位具备强大的文本和布局理解能力,可以弥补布局到图像生成中的相应局限。同时,从布局生成的图像在内容上表现出高度的多样性,从而增强了图像定位的鲁棒性。在统一模型中联合训练这两个任务可以促进各自性能的提升。然而,我们发现这种联合训练范式面临多个优化挑战,导致性能受限。为了解决这些问题,我们提出了渐进式训练策略。首先,平行多任务预训练(PMTP)阶段为模型提供了两个任务的基本能力,利用共享标记加速训练。接下来,双重联合优化(DJO)阶段利用任务的双重性顺序整合这两个任务,实现统一优化。最后,循环强化学习(Cycle RL)阶段以一致性约束作为奖励,消除了对视觉监督的依赖,并通过GRPO策略显著增强了模型的统一能力。大量实验表明,在布局到图像生成和图像定位基准测试中均取得了最先进的结果,并揭示了共同优化这两个任务所带来的明显协同增益。
cs.CV / 151 / 2603.18002
Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
Loc3R-VLM:基于语言的定位与视觉-语言模型的三维推理
Abstract
Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm
Chinese Translation
多模态大型语言模型(MLLMs)在连接视觉与语言方面取得了显著进展,但在空间理解和视角感知推理方面仍然存在困难。最近的研究努力旨在通过几何线索增强输入表示,而不是明确教导模型在三维空间中进行推理。我们提出了Loc3R-VLM,一个框架,利用单目视频输入为二维视觉-语言模型赋予先进的三维理解能力。受人类空间认知的启发,Loc3R-VLM依赖于两个联合目标:全球布局重建,以构建场景结构的整体表示,以及显式情境建模,以锚定自我中心视角。这些目标提供了直接的空间监督,将感知和语言都扎根于三维上下文中。为了确保几何一致性和度量尺度对齐,我们利用从预训练的三维基础模型中提取的轻量级相机位姿先验。Loc3R-VLM在基于语言的定位方面实现了最先进的性能,并在情境和一般三维问答基准上超越了现有的二维和视频基础方法,展示了我们的空间监督框架能够实现强大的三维理解。项目页面:https://kevinqu7.github.io/loc3r-vlm
cs.CV / 152 / 2603.18003
Universal Skeleton Understanding via Differentiable Rendering and MLLMs
通过可微渲染和多模态大语言模型实现通用骨骼理解
Abstract
Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text alignment, or quantize motion into discrete tokens that generalize poorly across heterogeneous skeleton formats. We present SkeletonLLM, which achieves universal skeleton understanding by translating arbitrary skeleton sequences into the MLLM's native visual modality. At its core is DrAction, a differentiable, format-agnostic renderer that converts skeletal kinematics into compact image sequences. Because the pipeline is end-to-end differentiable, MLLM gradients can directly guide the rendering to produce task-informative visual tokens. To further enhance reasoning capabilities, we introduce a cooperative training strategy: Causal Reasoning Distillation transfers structured, step-by-step reasoning from a teacher model, while Discriminative Finetuning sharpens decision boundaries between confusable actions. SkeletonLLM demonstrates strong generalization on diverse tasks including recognition, captioning, reasoning, and cross-format transfer -- suggesting a viable path for applying MLLMs to non-native modalities. Code will be released upon acceptance.
Chinese Translation
多模态大语言模型(MLLMs)展现出强大的视觉-语言推理能力,但仍然局限于其原生模态,无法直接处理结构化的非视觉数据,如人类骨骼。现有方法要么将骨骼动态压缩为损失特征向量以进行文本对齐,要么将运动量化为离散标记,这在异构骨骼格式之间的泛化能力较差。我们提出了SkeletonLLM,通过将任意骨骼序列转换为MLLM的原生视觉模态,实现通用骨骼理解。其核心是DrAction,一个可微、格式无关的渲染器,将骨骼运动学转换为紧凑的图像序列。由于该流程是端到端可微的,MLLM的梯度可以直接指导渲染,以生成任务相关的视觉标记。为了进一步增强推理能力,我们引入了一种协同训练策略:因果推理蒸馏(Causal Reasoning Distillation)将结构化的逐步推理从教师模型转移,而判别微调(Discriminative Finetuning)则增强了可混淆动作之间的决策边界。SkeletonLLM在多样化任务上展现出强大的泛化能力,包括识别、字幕生成、推理和跨格式迁移——这为将MLLM应用于非原生模态提供了一条可行路径。代码将在接受后发布。
cs.CV / 153 / 2603.18004
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
统一时空令牌评分用于高效的视频视觉语言模型
Abstract
Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.
Chinese Translation
令牌剪枝对于提升视觉语言模型(VLMs)的计算效率至关重要,尤其是在时间冗余普遍存在的视频任务中。以往的方法通常要么仅在视觉变换器(ViT)内针对动作识别和物体分割等单模态感知任务进行令牌剪枝,而未能适应下游的视觉语言任务;要么仅在大语言模型(LLM)内进行剪枝而保持ViT输出不变,这往往需要复杂的文本条件令牌选择机制。在本文中,我们提出了时空令牌评分(STTS),这是一种简单且轻量级的模块,能够在ViT和LLM两个阶段对视觉令牌进行剪枝,无需文本条件或令牌合并,并且完全兼容端到端训练。通过辅助损失学习时间维度的评分、通过LLM下游梯度学习空间维度的评分,并借助我们的高效打包算法,STTS在整个架构中剪枝了50%的视觉令牌,在训练和推理中带来62%的效率提升,而在13个长短视频问答任务上的平均性能仅下降0.7%。每个视频采样的帧数越多,效率提升越大。在长视频问答中应用测试时扩展,可在基线之上进一步获得0.5-1%的性能提升。总体而言,STTS是一种新颖、简单而有效的技术,可实现统一的、覆盖整个架构的视觉令牌剪枝。
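The core pruning step in STTS, keeping only the highest-scoring fraction of vision tokens while preserving their temporal order, can be sketched as follows. The scores below are stand-ins for the learned spatio-temporal scores, and `keep_ratio` is an illustrative parameter, not the paper's exact module:

```python
# Minimal sketch of score-based vision-token pruning: keep the top-scoring
# fraction of tokens, then restore their original (temporal) order.
def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring keep_ratio fraction of tokens, order-preserving."""
    assert len(tokens) == len(scores)
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Indices of the top-scoring tokens.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n_keep]
    keep = sorted(top)  # restore original order
    return [tokens[i] for i in keep]

tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]
print(prune_tokens(tokens, scores))  # → ['t0', 't2', 't4']
```

With `keep_ratio=0.5`, half the tokens survive; in STTS this drop propagates through both the ViT and the LLM, which is where the reported efficiency gains come from.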
cs.CL / 1 / 2603.16872
Trust, Safety, and Accuracy: Assessing LLMs for Routine Maternity Advice
信任、安全与准确性:评估大型语言模型在常规孕产建议中的应用
Abstract
Access to reliable maternal healthcare information is a major challenge in rural India due to limited medical resources and infrastructure. With over 830 million internet users and nearly half of rural women online, digital tools offer new opportunities for health education. This study evaluates large language models (LLMs) like ChatGPT-4o, Perplexity AI, and GeminiAI to provide reliable and understandable pregnancy-related information. Seventeen pregnancy-focused questions were posed to each model and compared with responses from maternal health professionals. Evaluations used semantic similarity, noun overlap, and readability metrics to measure content quality. Results show Perplexity closely matched expert semantics, while ChatGPT-4o produced clearer, more understandable text with better medical terminology. As internet access grows in rural areas, LLMs could serve as scalable aids for maternal health education. The study highlights the need for AI tools that balance accuracy and clarity to improve healthcare communication in underserved regions.
Chinese Translation
由于医疗资源和基础设施有限,在印度农村地区获取可靠的孕产健康信息面临重大挑战。印度拥有超过8.3亿互联网用户,且近半数农村女性已能上网,数字工具为健康教育提供了新的机会。本研究评估了ChatGPT-4o、Perplexity AI和GeminiAI等大型语言模型(LLMs)提供可靠且易懂的孕期相关信息的能力。我们向每个模型提出了十七个聚焦孕期的问题,并将其回答与孕产健康专业人士的回答进行比较。评估使用语义相似度、名词重叠和可读性指标来衡量内容质量。结果显示,Perplexity与专家回答在语义上高度吻合,而ChatGPT-4o生成的文本更清晰易懂,医学术语使用也更佳。随着农村地区互联网接入的增长,LLMs有望成为孕产健康教育的可扩展辅助工具。本研究强调需要兼顾准确性与清晰性的人工智能工具,以改善服务欠缺地区的医疗沟通。
cs.CL / 2 / 2603.16877
Enhancing Financial Report Question-Answering: A Retrieval-Augmented Generation System with Reranking Analysis
增强财务报告问答:一种带重排序分析的检索增强生成系统
Abstract
Financial analysts face significant challenges extracting information from lengthy 10-K reports, which often exceed 100 pages. This paper presents a Retrieval-Augmented Generation (RAG) system designed to answer questions about S&P 500 financial reports and evaluates the impact of neural reranking on system performance. Our pipeline employs hybrid search combining full-text and semantic retrieval, followed by an optional reranking stage using a cross-encoder model. We conduct systematic evaluation using the FinDER benchmark dataset, comprising 1,500 queries across five experimental groups. Results demonstrate that reranking significantly improves answer quality, achieving 49.0 percent correctness for scores of 8 or above compared to 33.5 percent without reranking, representing a 15.5 percentage point improvement. Additionally, the error rate for completely incorrect answers decreases from 35.3 percent to 22.5 percent. Our findings emphasize the critical role of reranking in financial RAG systems and demonstrate performance improvements over baseline methods through modern language models and refined retrieval strategies.
Chinese Translation
财务分析师在从冗长的10-K报告(通常超过100页)中提取信息时面临重大挑战。本文提出了一种检索增强生成(Retrieval-Augmented Generation, RAG)系统,用于回答有关标准普尔500(S&P 500)公司财务报告的问题,并评估神经重排序对系统性能的影响。我们的管道采用结合全文检索与语义检索的混合搜索,随后是一个使用交叉编码器模型的可选重排序阶段。我们使用FinDER基准数据集进行系统评估,该数据集包含1500个查询,分为五个实验组。结果表明,重排序显著提高了答案质量:得分达到8分及以上的比例从不使用重排序时的33.5%提升至49.0%,提高了15.5个百分点。此外,完全错误答案的比例从35.3%降至22.5%。我们的研究结果强调了重排序在金融RAG系统中的关键作用,并展示了借助现代语言模型和精细化检索策略相对于基线方法的性能提升。
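The two-stage structure of the pipeline, a cheap first-stage retrieval followed by a more expensive reranking pass over the survivors, can be sketched as below. The toy scoring functions are assumptions for demonstration; the actual system uses hybrid full-text plus semantic search and a neural cross-encoder:

```python
# Illustrative retrieve-then-rerank: a cheap score selects candidates,
# a (mocked) cross-encoder rescores only those candidates.
def overlap(query, doc):
    """First-stage stand-in: number of shared unique words."""
    return len(set(query.split()) & set(doc.split()))

def jaccard(query, doc):
    """Mock cross-encoder: word-level Jaccard similarity."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)

def retrieve_then_rerank(query, docs, n_retrieve=3, n_final=2):
    candidates = sorted(docs, key=lambda d: overlap(query, d), reverse=True)[:n_retrieve]
    return sorted(candidates, key=lambda d: jaccard(query, d), reverse=True)[:n_final]

docs = [
    "net revenue grew in 2023",
    "revenue outlook",
    "the 2023 annual report",
    "cash flow statement",
]
print(retrieve_then_rerank("net revenue 2023", docs))
```

The key design point is cost asymmetry: the reranker sees only `n_retrieve` candidates, so a slow cross-encoder stays affordable even over 100-page filings.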
cs.CL / 3 / 2603.16889
Rubric-Guided Fine-tuning of SpeechLLMs for Multi-Aspect, Multi-Rater L2 Reading-Speech Assessment
基于评分标准的语音大语言模型微调用于多维度、多评审者的二语阅读语音评估
Abstract
Reliable and interpretable automated assessment of second-language (L2) speech remains a central challenge, as large speech-language models (SpeechLLMs) often struggle to align with the nuanced variability of human raters. To address this, we introduce a rubric-guided reasoning framework that explicitly encodes multi-aspect human assessment criteria: accuracy, fluency, and prosody, while calibrating model uncertainty to capture natural rating variability. We fine-tune the Qwen2-Audio-7B-Instruct model using multi-rater human judgments and develop an uncertainty-calibrated regression approach supported by conformal calibration for interpretable confidence intervals. Our Gaussian uncertainty modeling and conformal calibration approach achieves the strongest alignment with human ratings, outperforming regression and classification baselines. The model reliably assesses fluency and prosody while highlighting the inherent difficulty of assessing accuracy. Together, these results demonstrate that rubric-guided, uncertainty-calibrated reasoning offers a principled path toward trustworthy and explainable SpeechLLM-based speech assessment.
Chinese Translation
可靠且可解释的二语(L2)语音自动评估仍然是一个核心挑战,因为大型语音语言模型(SpeechLLMs)往往难以与人类评分者细致入微的评分变异性对齐。为了解决这一问题,我们引入了一种基于评分标准的推理框架,该框架明确编码了准确性、流利性和韵律等多维度人类评估标准,同时校准模型的不确定性以捕捉自然的评分变异。我们使用多位评分者的人类判断对Qwen2-Audio-7B-Instruct模型进行了微调,并开发了一种不确定性校准的回归方法,辅以保形校准(conformal calibration)以提供可解释的置信区间。我们的高斯不确定性建模与保形校准方法在与人类评分的对齐上取得了最佳效果,优于回归和分类基线。该模型可靠地评估流利性和韵律,同时凸显了评估准确性的固有困难。这些结果共同表明,基于评分标准、经不确定性校准的推理为可信且可解释的基于SpeechLLM的语音评估提供了一条原则性路径。
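For readers unfamiliar with conformal calibration, the generic split-conformal recipe the paper builds on can be sketched as follows. This is not the paper's Gaussian uncertainty model, only the standard interval construction; the residual values are made up for illustration:

```python
import math

# Split-conformal sketch: on a held-out calibration set, take the (1 - alpha)
# conformal quantile of absolute residuals |y - y_hat| and use it as a
# symmetric half-width around new point predictions.
def conformal_halfwidth(cal_residuals, alpha=0.1):
    n = len(cal_residuals)
    rank = math.ceil((n + 1) * (1 - alpha))        # 1-based conformal rank
    return sorted(cal_residuals)[min(rank, n) - 1]

def interval(point_pred, halfwidth):
    return (point_pred - halfwidth, point_pred + halfwidth)

residuals = list(range(1, 21))                     # toy calibration residuals 1..20
q = conformal_halfwidth(residuals, alpha=0.1)
print(q, interval(4.0, q))                         # → 19 (-15.0, 23.0)
```

Under exchangeability, intervals built this way cover the true rating with probability at least `1 - alpha`, which is what makes the model's confidence intervals interpretable.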
cs.CL / 4 / 2603.17017
LLM NL2SQL Robustness: Surface Noise vs. Linguistic Variation in Traditional and Agentic Settings
LLM NL2SQL 鲁棒性:传统和代理环境中的表面噪声与语言变异
Abstract
Robustness evaluation for Natural Language to SQL (NL2SQL) systems is essential because real-world database environments are dynamic, noisy, and continuously evolving, whereas conventional benchmark evaluations typically assume static schemas and well-formed user inputs. In this work, we introduce a robustness evaluation benchmark containing approximately ten types of perturbations and conduct evaluations under both traditional and agentic settings. We assess multiple state-of-the-art large language models (LLMs), including Grok-4.1, Gemini-3-Pro, Claude-Opus-4.6, and GPT-5.2. Our results show that these models generally maintain strong performance under several perturbations; however, notable performance degradation is observed for surface-level noise (e.g., character-level corruption) and linguistic variation that preserves semantics while altering lexical or syntactic forms. Furthermore, we observe that surface-level noise causes larger performance drops in traditional pipelines, whereas linguistic variation presents greater challenges in agentic settings. These findings highlight the remaining challenges in achieving robust NL2SQL systems, particularly in handling linguistic variability.
Chinese Translation
自然语言到 SQL (NL2SQL) 系统的鲁棒性评估至关重要,因为现实世界的数据库环境是动态的、嘈杂的并且不断演变,而传统的基准评估通常假设静态模式和格式良好的用户输入。在本研究中,我们引入了一个鲁棒性评估基准,包含大约十种扰动类型,并在传统和代理环境下进行评估。我们评估了多个最先进的大型语言模型(LLMs),包括 Grok-4.1、Gemini-3-Pro、Claude-Opus-4.6 和 GPT-5.2。我们的结果表明,这些模型在多种扰动下通常保持强劲的性能;然而,对于表面级噪声(例如字符级损坏)和在保持语义的同时改变词汇或句法形式的语言变异,观察到显著的性能下降。此外,我们观察到表面级噪声在传统管道中导致更大的性能下降,而语言变异在代理环境中带来了更大的挑战。这些发现突显了在实现鲁棒的 NL2SQL 系统方面仍然存在的挑战,特别是在处理语言变异性时。
cs.CL / 5 / 2603.17067
Evaluating Ill-Defined Tasks in Large Language Models
评估大型语言模型中定义不清的任务
Abstract
Many evaluations of Large Language Models (LLMs) target tasks that are inherently ill-defined, with unclear input and output spaces and ambiguous success criteria. We analyze why existing evaluation benchmarks and metrics fail to provide reliable or diagnostic signals of model capability for such tasks. We examine two case studies: Complex Instruction Following (CIF), where we identify recurring issues including limited coverage of real-world instruction complexity, sensitivity to instruction phrasing, inconsistent and non-comparable metrics, and instability introduced by LLM-based judges; and Natural Language to Mermaid Sequence Diagrams (NL2Mermaid), where we show how multi-faceted evaluation criteria can yield actionable insights beyond aggregate scores. Together, these case studies show that current evaluations frequently conflate distinct failure modes, yielding scores that are unstable, non-diagnostic, and difficult to act upon. Our findings expose fundamental limitations in existing evaluation practices for ill-defined tasks and motivate more robust, interpretable evaluation designs.
Chinese Translation
许多对大型语言模型(LLMs)的评估针对的是本质上定义不清的任务,这些任务具有不明确的输入和输出空间以及模糊的成功标准。我们分析了现有评估基准和指标为何未能为此类任务提供可靠或具有诊断性的模型能力信号。我们考察了两个案例研究:复杂指令跟随(Complex Instruction Following, CIF),在该研究中我们识别出反复出现的问题,包括对现实世界指令复杂性的覆盖有限、对指令措辞的敏感性、不一致且不可比较的指标,以及基于LLM的评审所引入的不稳定性;以及自然语言到Mermaid序列图(NL2Mermaid),在该研究中我们展示了多维评估标准如何提供超越汇总分数的可操作见解。这两个案例研究共同表明,当前的评估常常混淆不同的失败模式,导致得分不稳定、缺乏诊断性且难以据此采取行动。我们的研究揭示了现有针对定义不清任务的评估实践的根本局限性,并凸显了设计更稳健、更可解释的评估方案的必要性。
cs.CL / 6 / 2603.17070
Large Reasoning Models Struggle to Transfer Parametric Knowledge Across Scripts
大型推理模型难以跨文字系统迁移参数化知识
Abstract
In this work, we analyze shortcomings in cross-lingual knowledge transfer in large, modern reasoning LLMs. We demonstrate that the perceived gap in knowledge transfer is primarily a script barrier. First, we conduct an observational data analysis on the performance of thinking models on two datasets with local knowledge from around the world, ECLeKTic and MultiLoKo. Our regression analysis shows that script match - not language or family - is the primary predictor of knowledge transfer failure once model capability and question difficulty are accounted for. We further this finding by providing the LLMs with the key entities of the questions in their source language and find that this disproportionately improves cross-script questions. We then posit that these LLMs could be reasoning better at test-time. To evaluate this, we develop a synthetic generation pipeline to design SFT samples to encourage the model to better reason about transliteration ambiguities when trying to fetch parametric knowledge at inference-time. We show that teaching two models to reason better reduces the cross-script transfer gap. As a result, we conclude that there is potential to improve cross-lingual parametric knowledge transfer during post-training.
Chinese Translation
在本研究中,我们分析了大型现代推理大语言模型(LLMs)在跨语言知识迁移中的不足之处。我们证明,所观察到的知识迁移差距主要是一种文字系统(script)障碍。首先,我们在ECLeKTic和MultiLoKo这两个包含世界各地本地知识的数据集上,对思考型模型的表现进行了观察性数据分析。我们的回归分析显示,在控制了模型能力和问题难度之后,文字系统是否匹配——而非语言或语系——是知识迁移失败的主要预测因素。我们进一步以源语言形式向LLMs提供问题中的关键实体,发现这对跨文字系统的问题改善尤为显著,从而印证了上述发现。接着,我们假设这些LLMs在测试时本可以进行更好的推理。为了验证这一点,我们开发了一个合成数据生成管道来设计SFT样本,鼓励模型在推理时提取参数化知识的过程中更好地处理音译歧义。我们展示了教会两个模型更好地推理可以缩小跨文字系统的迁移差距。因此,我们得出结论:在后训练阶段有潜力改善跨语言参数化知识迁移。
cs.CL / 7 / 2603.17087
Ensemble Self-Training for Unsupervised Machine Translation
用于无监督机器翻译的集成自训练
Abstract
We present an ensemble-driven self-training framework for unsupervised neural machine translation (UNMT). Starting from a primary language pair, we train multiple UNMT models that share the same translation task but differ in an auxiliary language, inducing structured diversity across models. We then generate pseudo-translations for the primary pair using token-level ensemble decoding, averaging model predictions in both directions. These ensemble outputs are used as synthetic parallel data to further train each model, allowing the models to improve via shared supervision. At deployment time, we select a single model by validation performance, preserving single-model inference cost. Experiments show statistically significant improvements over single-model UNMT baselines, with mean gains of 1.7 chrF when translating from English and 0.67 chrF when translating into English.
Chinese Translation
我们提出了一种集成驱动的自训练框架,用于无监督神经机器翻译(UNMT)。从一个主要语言对出发,我们训练多个共享相同翻译任务但辅助语言各不相同的UNMT模型,从而在模型之间引入结构化的多样性。然后,我们使用标记级集成解码为主要语言对生成伪翻译,在两个翻译方向上对模型预测取平均。这些集成输出被用作合成平行数据来进一步训练每个模型,使各模型能够通过共享监督得到改进。在部署时,我们根据验证集表现选择单一模型,从而保持单模型的推理成本。实验显示出相对于单模型UNMT基线的统计显著改进:从英语译出时平均提高1.7 chrF,译入英语时平均提高0.67 chrF。
cs.CL / 8 / 2603.17094
Evaluating LLM-Simulated Conversations in Modeling Inconsistent and Uncollaborative Behaviors in Human Social Interaction
评估大型语言模型模拟对话在建模人类社会互动中的不一致和不合作行为
Abstract
Simulating human conversations using large language models (LLMs) has emerged as a scalable methodology for modeling human social interaction. However, simulating human conversations is challenging because they inherently involve inconsistent and uncollaborative behaviors, such as misunderstandings and interruptions. Analysis comparing inconsistent and uncollaborative behaviors in human- and LLM-generated conversations remains limited, although reproducing these behaviors is integral to simulating human-like and complex social interaction. In this work, we introduce CoCoEval, an evaluation framework that analyzes LLM-simulated conversations by detecting 10 types of inconsistent and uncollaborative behaviors at the turn level using an LLM-as-a-Judge. Using CoCoEval, we evaluate GPT-4.1, GPT-5.1, and Claude Opus 4 by comparing the frequencies of detected behaviors in conversations simulated by each model and in human conversations across academic, business, and governmental meetings, as well as debates. Our analysis shows that (1) under vanilla prompting, LLM-simulated conversations exhibit far fewer inconsistent and uncollaborative behaviors than human conversations; (2) prompt engineering does not provide reliable control over these behaviors, as our results show that different prompts lead to their under- or overproduction; and (3) supervised fine-tuning on human conversations can lead LLMs to overproduce a narrow set of behaviors, such as repetition. Our findings highlight the difficulty of simulating human conversations, raising concerns about the use of LLMs as a proxy for human social interaction.
Chinese Translation
使用大型语言模型(LLMs)模拟人类对话已成为建模人类社会互动的一种可扩展方法。然而,模拟人类对话具有挑战性,因为它们本质上涉及不一致和不合作的行为,例如误解和打断。尽管再现这些行为对于模拟类人且复杂的社会互动至关重要,但对人类对话与LLM生成对话中不一致和不合作行为的比较分析仍然有限。在本研究中,我们引入了CoCoEval,一个通过LLM-as-a-Judge在回合层面检测10种不一致和不合作行为、从而分析LLM模拟对话的评估框架。通过CoCoEval,我们评估了GPT-4.1、GPT-5.1和Claude Opus 4,比较了各模型模拟的对话与人类对话(涵盖学术、商业和政府会议以及辩论)中检测到的行为频率。我们的分析表明:(1)在普通提示下,LLM模拟对话所表现出的不一致和不合作行为远少于人类对话;(2)提示工程无法对这些行为提供可靠控制,因为我们的结果显示不同提示会导致这些行为产生不足或过量;(3)在人类对话上进行监督微调可能导致LLM过度产生一小类行为,例如重复。我们的发现凸显了模拟人类对话的难度,并引发了对将LLM用作人类社会互动替代品的担忧。
cs.CL / 9 / 2603.17102
Knowledge Localization in Mixture-of-Experts LLMs Using Cross-Lingual Inconsistency
利用跨语言不一致性对混合专家大语言模型进行知识定位
Abstract
Modern LLMs continue to exhibit significant variance in behavior across languages, such as being able to recall factual information in some languages but not others. While typically studied as a problem to be mitigated, in this work, we propose leveraging this cross-lingual inconsistency as a tool for interpretability in mixture-of-experts (MoE) LLMs. Our knowledge localization framework contrasts routing for sets of languages where the model correctly recalls information from languages where it fails. This allows us to isolate model components that play a functional role in answering about a piece of knowledge. Our method proceeds in two stages: (1) querying the model with difficult factual questions across a diverse set of languages to generate "success" and "failure" activation buckets and then (2) applying a statistical contrastive analysis to the MoE router logits to identify experts important for knowledge. To validate the necessity of this small number of experts for answering a knowledge question, we deactivate them and re-ask the question. We find that despite only deactivating about 20 out of 6000 experts, the model no longer answers correctly in over 40% of cases. Generally, this method provides a realistic and scalable knowledge localization approach to address increasingly complex LLMs.
Chinese Translation
现代大语言模型(LLMs)在不同语言之间的行为仍然存在显著差异,例如在某些语言中能够回忆事实信息,而在其他语言中则无法做到。这种差异通常被视为需要缓解的问题,而在本研究中,我们提出利用这种跨语言不一致性作为混合专家(MoE)大语言模型可解释性的工具。我们的知识定位框架对比了模型能够正确回忆信息的语言集合与回忆失败的语言集合之间的路由差异。这使我们能够分离出在回答某一知识点时发挥功能性作用的模型组件。我们的方法分为两个阶段:(1)在多样化的语言集合上向模型提出困难的事实问题,以生成“成功”和“失败”两组激活桶;(2)对MoE路由器的logits应用统计对比分析,以识别对该知识重要的专家。为了验证这一小部分专家对于回答该知识问题的必要性,我们将其停用并重新提问。我们发现,尽管在约6000个专家中仅停用了约20个,模型在超过40%的情况下不再能正确回答。总体而言,该方法为日益复杂的大语言模型提供了一种现实且可扩展的知识定位途径。
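The two-stage contrastive localization can be sketched as follows: average the router signal per expert in the "success" versus "failure" buckets, rank experts by the difference, then mask the top experts' logits to test necessity. The toy router logits are illustrative assumptions, not real model activations:

```python
# Contrastive expert localization: experts that receive systematically higher
# routing weight on successful recalls than on failures are ranked first.
def rank_experts(success, failure):
    """success/failure: lists of per-example router-logit vectors."""
    n = len(success[0])
    mean = lambda rows, j: sum(r[j] for r in rows) / len(rows)
    diff = [mean(success, j) - mean(failure, j) for j in range(n)]
    return sorted(range(n), key=lambda j: diff[j], reverse=True)

def deactivate(logits, experts):
    """Mask selected experts so the router can no longer pick them."""
    return [float("-inf") if j in experts else x for j, x in enumerate(logits)]

success = [[0.1, 0.2, 2.0, 0.1], [0.0, 0.3, 1.8, 0.2]]
failure = [[0.1, 0.2, 0.1, 0.1], [0.2, 0.1, 0.2, 0.3]]
top = rank_experts(success, failure)
print(top[0], deactivate([0.5, 0.1, 2.0, 0.3], {top[0]}))
```

Re-asking the question with the top-ranked experts masked is exactly the necessity test the abstract describes: if correctness collapses, those experts carried the knowledge.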
cs.CL / 10 / 2603.17171
Exploiting the English Grammar Profile for L2 grammatical analysis with LLMs
利用英语语法概况与大语言模型进行二语语法分析
Abstract
Evaluating the grammatical competence of second language (L2) learners is essential both for providing targeted feedback and for assessing proficiency. To achieve this, we propose a novel framework leveraging the English Grammar Profile (EGP), a taxonomy of grammatical constructs mapped to the proficiency levels of the Common European Framework of Reference (CEFR), to detect learners' attempts at grammatical constructs and classify them as successful or unsuccessful. This detection can then be used to provide fine-grained feedback. Moreover, the grammatical constructs are used as predictors of proficiency assessment by using automatically detected attempts as predictors of holistic CEFR proficiency. For the selection of grammatical constructs derived from the EGP, rule-based and LLM-based classifiers are compared. We show that LLMs outperform rule-based methods on semantically and pragmatically nuanced constructs, while rule-based approaches remain competitive for constructs that rely purely on morphological or syntactic features and do not require semantic interpretation. For proficiency assessment, we evaluate both rule-based and hybrid pipelines and show that a hybrid approach combining a rule-based pre-filter with an LLM consistently yields the strongest performance. Since our framework operates on pairs of original learner sentences and their corrected counterparts, we also evaluate a fully automated pipeline using automatic grammatical error correction. This pipeline closely approaches the performance of semi-automated systems based on manual corrections, particularly for the detection of successful attempts at grammatical constructs. Overall, our framework emphasises learners' successful attempts in addition to unsuccessful ones, enabling positive, formative feedback and providing actionable insights into grammatical development.
Chinese Translation
评估第二语言(L2)学习者的语法能力对于提供针对性反馈和评估语言水平至关重要。为此,我们提出了一种新颖的框架,利用英语语法概况(English Grammar Profile, EGP)——一个将语法构造映射到欧洲语言共同参考框架(CEFR)各能力等级的分类体系——来检测学习者对语法构造的尝试,并将其分类为成功或不成功。这种检测可用于提供细粒度的反馈。此外,语法构造还被用作语言水平评估的预测因子:将自动检测到的尝试作为整体CEFR水平的预测变量。在选取源自EGP的语法构造时,我们比较了基于规则和基于大语言模型(LLM)的分类器。结果表明,LLM在语义和语用上细微的构造上优于基于规则的方法,而对于纯粹依赖形态或句法特征、无需语义解释的构造,基于规则的方法仍然具有竞争力。在语言水平评估方面,我们评估了基于规则的管道和混合管道,并显示将基于规则的预筛选与LLM相结合的混合方法始终取得最强表现。由于我们的框架基于学习者原句与其修正句构成的句对运行,我们还评估了一条使用自动语法纠错的全自动管道。该管道的表现接近基于人工修正的半自动系统,特别是在检测语法构造的成功尝试方面。总体而言,我们的框架在关注不成功尝试之外同样强调学习者的成功尝试,从而实现积极的形成性反馈,并为语法发展提供可操作的洞见。
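The hybrid pipeline's structure, a cheap rule-based pre-filter followed by an LLM judgment only on the survivors, can be sketched as below. Both the regex pattern and the mock judge are illustrative assumptions, not the paper's classifiers:

```python
import re

# Hybrid detection: the rule pre-filter cheaply rejects sentences that cannot
# contain the construct; only survivors reach the (expensive) LLM judge.
def detect_construct(sentence, pattern, llm_judge):
    if not re.search(pattern, sentence):   # rule-based pre-filter
        return False
    return llm_judge(sentence)             # semantic check, mocked here

# Toy construct: a comparative "-er than" pattern; the mock judge simply
# requires the sentence to be long enough to be a plausible attempt.
mock_judge = lambda s: len(s.split()) >= 4
print(detect_construct("She is taller than me", r"\w+er than", mock_judge))
```

This division of labor mirrors the paper's finding: purely morphosyntactic constructs are handled well by rules, while the LLM is reserved for cases needing semantic interpretation.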
cs.CL / 11 / 2603.17191
Tabular LLMs for Interpretable Few-Shot Alzheimer's Disease Prediction with Multimodal Biomedical Data
表格LLM在多模态生物医学数据上的可解释少样本阿尔茨海默病预测
Abstract
Accurate diagnosis of Alzheimer's disease (AD) requires handling tabular biomarker data, yet such data are often small and incomplete, where deep learning models frequently fail to outperform classical methods. Pretrained large language models (LLMs) offer few-shot generalization, structured reasoning, and interpretable outputs, providing a powerful paradigm shift for clinical prediction. We propose TAP-GPT Tabular Alzheimer's Prediction GPT, a domain-adapted tabular LLM framework built on TableGPT2 and fine-tuned for few-shot AD classification using tabular prompts rather than plain texts. We evaluate TAP-GPT across four ADNI-derived datasets, including QT-PAD biomarkers and region-level structural MRI, amyloid PET, and tau PET for binary AD classification. Across multimodal and unimodal settings, TAP-GPT improves upon its backbone models and outperforms traditional machine learning baselines in the few-shot setting while remaining competitive with state-of-the-art general-purpose LLMs. We show that feature selection mitigates degradation in high-dimensional inputs and that TAP-GPT maintains stable performance under simulated and real-world missingness without imputation. Additionally, TAP-GPT produces structured, modality-aware reasoning aligned with established AD biology and shows greater stability under self-reflection, supporting its use in iterative multi-agent systems. To our knowledge, this is the first systematic application of a tabular-specialized LLM to multimodal biomarker-based AD prediction, demonstrating that such pretrained models can effectively address structured clinical prediction tasks and laying the foundation for tabular LLM-driven multi-agent clinical decision-support systems. The source code is publicly available on GitHub: https://github.com/sophie-kearney/TAP-GPT.
Chinese Translation
阿尔茨海默病(AD)的准确诊断需要处理表格化的生物标志物数据,但此类数据通常规模小且不完整,深度学习模型在这种情况下往往无法超越经典方法。预训练的大型语言模型(LLMs)具备少样本泛化、结构化推理和可解释输出的能力,为临床预测提供了强大的范式转变。我们提出了TAP-GPT(Tabular Alzheimer's Prediction GPT),这是一个基于TableGPT2构建的领域适配表格LLM框架,使用表格提示而非纯文本,针对少样本AD分类进行了微调。我们在四个源自ADNI的数据集上评估了TAP-GPT,包括QT-PAD生物标志物以及区域级结构MRI、淀粉样蛋白PET和tau PET,用于AD二分类。在多模态和单模态设置下,TAP-GPT均优于其骨干模型,并在少样本设置中超越传统机器学习基线,同时与最先进的通用LLMs保持竞争力。我们展示了特征选择可以缓解高维输入带来的性能退化,并且TAP-GPT在模拟和真实世界的数据缺失情况下无需插补即可保持稳定性能。此外,TAP-GPT能够产生与已知AD生物学相一致的结构化、模态感知的推理,并在自我反思下表现出更高的稳定性,支持其在迭代式多智能体系统中的应用。据我们所知,这是首次将专门面向表格的LLM系统地应用于基于多模态生物标志物的AD预测,证明了此类预训练模型能够有效解决结构化临床预测任务,并为表格LLM驱动的多智能体临床决策支持系统奠定了基础。源代码已在GitHub上公开: https://github.com/sophie-kearney/TAP-GPT。
cs.CL / 12 / 2603.17204
CODMAS: A Dialectic Multi-Agent Collaborative Framework for Structured RTL Optimization
CODMAS:一种用于结构化RTL优化的辩证多智能体协作框架
Abstract
Optimizing Register Transfer Level (RTL) code is a critical step in Electronic Design Automation (EDA) for improving power, performance, and area (PPA). We present CODMAS (Collaborative Optimization via a Dialectic Multi-Agent System), a framework that combines structured dialectic reasoning with domain-aware code generation and deterministic evaluation to automate RTL optimization. At the core of CODMAS are two dialectic agents: the Articulator, inspired by rubber-duck debugging, which articulates stepwise transformation plans and exposes latent assumptions; and the Hypothesis Partner, which predicts outcomes and reconciles deviations between expected and actual behavior to guide targeted refinements. These agents direct a Domain-Specific Coding Agent (DCA) to generate architecture-aware Verilog edits and a Code Evaluation Agent (CEA) to verify syntax, functionality, and PPA metrics. We introduce RTLOPT, a benchmark of 120 Verilog triples (unoptimized, optimized, testbench) for pipelining and clock-gating transformations. Across proprietary and open LLMs, CODMAS achieves ~25% reduction in critical path delay for pipelining and ~22% power reduction for clock gating, while reducing functional and compilation failures compared to strong prompting and agentic baselines. These results demonstrate that structured multi-agent reasoning can significantly enhance automated RTL optimization and scale to more complex designs and broader optimization tasks.
Chinese Translation
优化寄存器传输级(RTL)代码是电子设计自动化(EDA)中改善功耗、性能和面积(PPA)的关键步骤。我们提出了CODMAS(通过辩证多智能体系统进行协作优化),这是一个将结构化辩证推理与领域感知代码生成和确定性评估相结合的框架,以实现RTL优化的自动化。CODMAS的核心是两个辩证智能体:受橡皮鸭调试启发的表达者(Articulator),它阐明逐步转化计划并揭示潜在假设;以及假设合作伙伴(Hypothesis Partner),它预测结果并调和预期与实际行为之间的偏差,以指导有针对性的改进。这些智能体指导一个领域特定编码智能体(Domain-Specific Coding Agent, DCA)生成架构感知的Verilog编辑,以及一个代码评估智能体(Code Evaluation Agent, CEA)验证语法、功能和PPA指标。我们引入了RTLOPT,这是一个包含120个Verilog三元组(未优化、优化、测试平台)的基准,用于流水线和时钟门控转换。在专有和开放的LLM中,CODMAS在流水线方面实现了约25%的关键路径延迟减少,在时钟门控方面实现了约22%的功耗减少,同时与强提示基线和智能体基线相比,减少了功能和编译失败。这些结果表明,结构化的多智能体推理可以显著增强自动化RTL优化,并扩展到更复杂的设计和更广泛的优化任务。
cs.CL / 13 / 2603.17208
SYMDIREC: A Neuro-Symbolic Divide-Retrieve-Conquer Framework for Enhanced RTL Synthesis and Summarization
SYMDIREC:一种用于增强 RTL 综合与摘要的神经符号分解-检索-求解(Divide-Retrieve-Conquer)框架
Abstract
Register-Transfer Level (RTL) synthesis and summarization are central to hardware design automation but remain challenging for Large Language Models (LLMs) due to rigid HDL syntax, limited supervision, and weak alignment with natural language. Existing prompting and retrieval-augmented generation (RAG) methods have not incorporated symbolic planning, limiting their structural precision. We introduce SYMDIREC, a neuro-symbolic framework that decomposes RTL tasks into symbolic subgoals, retrieves relevant code via a fine-tuned retriever, and assembles verified outputs through LLM reasoning. Supporting both Verilog and VHDL without LLM fine-tuning, SYMDIREC achieves ~20% higher Pass@1 rates for synthesis and 15-20% ROUGE-L improvements for summarization over prompting and RAG baselines, demonstrating the benefits of symbolic guidance in RTL tasks.
Chinese Translation
寄存器传输级(RTL)综合与摘要是硬件设计自动化的核心,但由于严格的硬件描述语言(HDL)语法、有限的监督和与自然语言的弱对齐,仍然对大型语言模型(LLMs)构成挑战。现有的提示和检索增强生成(RAG)方法未能纳入符号规划,限制了其结构精度。我们提出了 SYMDIREC,一种神经符号框架,它将 RTL 任务分解为符号子目标,通过微调的检索器检索相关代码,并通过 LLM 推理组装验证的输出。SYMDIREC 支持 Verilog 和 VHDL,无需对 LLM 进行微调,在综合任务中实现了约 20% 的 Pass@1 提升,并在摘要任务中实现了 15-20% 的 ROUGE-L 改进,相较于提示和 RAG 基线,展示了符号指导在 RTL 任务中的优势。
cs.CL / 14 / 2603.17217
Anonymous-by-Construction: An LLM-Driven Framework for Privacy-Preserving Text
构建即匿名:一种基于大语言模型的隐私保护文本框架
Abstract
Responsible use of AI demands that we protect sensitive information without undermining the usefulness of data, an imperative that has become acute in the age of large language models. We address this challenge with an on-premise, LLM-driven substitution pipeline that anonymizes text by replacing personally identifiable information (PII) with realistic, type-consistent surrogates. Executed entirely within organizational boundaries using local LLMs, the approach prevents data egress while preserving fluency and task-relevant semantics. We conduct a systematic, multi-metric, cross-technique evaluation on the Action-Based Conversation Dataset, benchmarking against industry standards (Microsoft Presidio and Google DLP) and a state-of-the-art approach (ZSTS, in redaction-only and redaction-plus-substitution variants). Our protocol jointly measures privacy, semantic utility, and trainability under privacy via a lifecycle-ready criterion obtained by fine-tuning a compact encoder (BERT+LoRA) on sanitized text. In addition, we assess agentic Q&A performance by inserting an on-premise anonymization layer before the answering LLM and evaluating the quality of its responses. This intermediate, type-preserving substitution stage ensures that no sensitive content is exposed to third-party APIs, enabling responsible deployment of Q&A agents without compromising confidentiality. Our method attains state-of-the-art privacy, minimal topical drift, strong factual utility, and low trainability loss, outperforming rule-based approaches and named-entity recognition (NER) baselines and ZSTS variants on the combined privacy--utility--trainability frontier. These results show that local LLM substitution yields anonymized corpora that are both responsible to use and operationally valuable: safe for agentic pipelines and suitable for downstream fine-tuning with limited degradation.
Chinese Translation
负责任地使用人工智能要求我们在不削弱数据实用性的前提下保护敏感信息,这一要求在大语言模型时代变得尤为迫切。我们通过一条本地部署的、由大语言模型驱动的替换管道来应对这一挑战,该管道用逼真且类型一致的替代内容替换个人身份信息(PII),从而实现文本匿名化。该方法完全在组织边界内使用本地大语言模型执行,既防止数据外泄,又保持文本流畅性和任务相关语义。我们在Action-Based Conversation Dataset上进行了系统的、多指标、跨技术评估,与行业标准(Microsoft Presidio和Google DLP)以及一种最先进方法(ZSTS,含仅遮蔽与遮蔽加替换两种变体)进行基准比较。我们的评估协议通过在脱敏文本上微调一个紧凑编码器(BERT+LoRA)得到的生命周期就绪指标,联合衡量隐私性、语义效用以及隐私约束下的可训练性。此外,我们通过在回答问题的大语言模型之前插入一个本地匿名化层来评估智能体问答性能,并评估其回答质量。这一中间的、保持类型一致的替换阶段确保没有敏感内容暴露给第三方API,从而在不损害机密性的情况下负责任地部署问答智能体。我们的方法在隐私-效用-可训练性的综合前沿上取得了最先进的隐私保护、最小的主题漂移、强事实效用和低可训练性损失,优于基于规则的方法、命名实体识别(NER)基线以及ZSTS变体。这些结果表明,本地大语言模型替换生成的匿名语料既可负责任地使用,又具有运营价值:对智能体管道安全,并适合下游微调且性能退化有限。
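Type-consistent surrogate substitution can be illustrated in miniature: detected PII spans are replaced with realistic placeholders of the same type, so the surrounding text stays fluent. The paper performs detection with a local LLM; the regex patterns and surrogate values below are simplifying assumptions:

```python
import re

# Each PII type maps to a detection pattern and a realistic surrogate of the
# same type, so downstream models still see well-formed emails and phones.
PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "PHONE": r"\+?\d[\d\- ]{7,}\d",
}
SURROGATES = {"EMAIL": "jane.doe@example.org", "PHONE": "+1-555-0100"}

def anonymize(text):
    for label, pattern in PATTERNS.items():
        text = re.sub(pattern, SURROGATES[label], text)
    return text

print(anonymize("Reach bob.smith@corp.com or +1 415 867 5309 today."))
```

Because surrogates are type-consistent rather than blanked out, a sanitized corpus remains usable for fine-tuning, which is the "trainability under privacy" criterion the evaluation measures.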
cs.CL / 15 / 2603.17218
Alignment Makes Language Models Normative, Not Descriptive
对齐使语言模型具有规范性,而非描述性
Abstract
Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.
Chinese Translation
后训练对齐优化语言模型以匹配人类偏好信号,但这一目标并不等同于建模观察到的人类行为。我们在讨价还价、劝说、谈判和重复矩阵博弈等多轮策略博弈的10,000多个真实人类决策上,比较了120组“基础模型-对齐模型”配对。在这些情境中,基础模型在预测人类选择方面以接近10:1的优势胜过其对齐版本,并且在不同模型系列、提示表述和博弈配置下都表现稳健。然而,在人类行为更可能遵循规范性预测的情境中,这一模式发生逆转:在全部12种受测的一次性教科书式博弈以及非策略性的彩票选择上,对齐模型占据优势——甚至在多轮博弈本身的第一轮、交互历史尚未形成之时也是如此。这一边界条件模式表明,对齐引入了一种规范性偏差:当人类行为能被规范性解相对较好地刻画时,对齐提高了预测能力;但在多轮策略情境中,行为由互惠、报复和依赖历史的适应等描述性动态塑造,对齐反而损害了预测能力。这些结果揭示了“为人类使用而优化模型”与“将模型用作人类行为替代品”之间的根本权衡。
cs.CL / 16 / 2603.17220
TharuChat: Bootstrapping Large Language Models for a Low-Resource Language via Synthetic Data and Human Validation
TharuChat:通过合成数据与人工验证为低资源语言引导构建大型语言模型
Abstract
The rapid proliferation of Large Language Models (LLMs) has created a profound digital divide, effectively excluding indigenous languages of the Global South from the AI revolution. The Tharu language, an Indo-Aryan vernacular spoken by approximately 1.7 million people across the Terai belt of Nepal and India, exemplifies this crisis. Despite a rich oral tradition, Tharu suffers from severe data scarcity and linguistic fragmentation, causing state-of-the-art multilingual models to routinely "hallucinate" or default to dominant high-resource neighbors like Hindi and Nepali due to contamination in pre-training corpora. This paper presents Tharu-LLaMA (3B), a specialized instruction-following model designed to address this exclusion. We introduce TharuChat, a novel dataset constructed via a LLM-to-Human bootstrapping pipeline. We utilized prompt-engineered Gemini models, fed with Rana Tharu grammar and folklore, to synthesize training data. Unlike curated gold-standard corpora, TharuChat reflects the noisy, heterogeneous linguistic reality of the region: it is predominantly anchored in Rana Tharu (~70%) while integrating elements of Dangaura and Kochila dialects. We provide a transparent analysis of the dataset's limitations, including dialectal code-mixing and residual Awadhi/Hindi influence. Through a rigorous empirical ablation study, we demonstrate that despite these imperfections, small-scale synthetic data is highly effective, increasing the dataset volume from 25% to 100% results in a linear reduction in perplexity from 6.42 to 2.88. The resulting model serves as a proof-of-concept for the preservation of under-resourced Himalayan languages via generative AI, achievable on consumer-grade hardware.
Chinese Translation
大型语言模型(LLMs)的快速普及造成了深刻的数字鸿沟,实际上将全球南方的原住民语言排除在人工智能革命之外。Tharu语是一种印度-雅利安语支的民间语言,约有170万人在尼泊尔和印度的特赖(Terai)地带使用,正是这一危机的典型代表。尽管拥有丰富的口头传统,Tharu语却面临严重的数据匮乏和语言碎片化问题,由于预训练语料的污染,最先进的多语言模型常常产生“幻觉”,或默认退回到印地语和尼泊尔语等占主导地位的高资源邻近语言。本文提出了Tharu-LLaMA(3B),一个旨在解决这种排斥问题的专用指令跟随模型。我们介绍了TharuChat,一个通过“LLM到人工”引导管道构建的新型数据集。我们利用经过提示工程、输入了Rana Tharu语法和民间故事的Gemini模型来合成训练数据。与精心策划的黄金标准语料库不同,TharuChat反映了该地区嘈杂而异质的语言现实:它主要以Rana Tharu为基础(约70%),同时融入了Dangaura和Kochila方言的元素。我们对数据集的局限性进行了透明分析,包括方言间的语码混合以及残留的Awadhi/印地语影响。通过严格的实证消融研究,我们证明尽管存在这些缺陷,小规模合成数据仍非常有效:将数据集规模从25%增加到100%,困惑度从6.42线性下降到2.88。最终模型作为一个概念验证,表明借助生成式人工智能、在消费级硬件上即可实现对资源匮乏的喜马拉雅语言的保护。
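The perplexity figures in the ablation (6.42 down to 2.88) follow the standard definition: the exponential of the average negative log-likelihood per token. A minimal sketch, with made-up token log-probabilities for illustration:

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model assigning probability 1/2 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 8))  # ≈ 2.0
```

Lower perplexity means the fine-tuned model is less surprised by held-out Tharu text, which is why the linear drop with more synthetic data supports the paper's bootstrapping claim.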
cs.CL / 17 / 2603.17231
Neuron-Level Emotion Control in Speech-Generative Large Audio-Language Models
语音生成大型音频语言模型中的神经元级情感控制
Abstract
Large audio-language models (LALMs) can produce expressive speech, yet reliable emotion control remains elusive: conversions often miss the target affect and may degrade linguistic fidelity through refusals, hallucinations, or paraphrase. We present, to our knowledge, the first neuron-level study of emotion control in speech-generative LALMs and demonstrate that compact emotion-sensitive neurons (ESNs) are causally actionable, enabling training-free emotion steering at inference time. ESNs are identified via success-filtered activation aggregation enforcing both emotion realization and content preservation. Across three LALMs (Qwen2.5-Omni-7B, MiniCPM-o 4.5, Kimi-Audio), ESN interventions yield emotion-specific gains that generalize to unseen speakers and are supported by automatic and human evaluation. Controllability depends on selector design, mask sparsity, filtering, and intervention strength. Our results establish a mechanistic framework for training-free emotion control in speech generation.
Chinese Translation
大型音频语言模型(LALMs)能够生成富有表现力的语音,但可靠的情感控制仍然难以实现:转换往往未能达到目标情感,并可能通过拒绝、幻觉或释义降低语言的准确性。据我们所知,我们首次对语音生成LALMs中的情感控制进行了神经元级研究,并展示了紧凑的情感敏感神经元(ESNs)是可因果操作的,从而在推理时实现免训练的情感引导。我们通过成功过滤的激活聚合来识别ESNs,该过程同时要求情感实现和内容保留。在三个LALMs(Qwen2.5-Omni-7B、MiniCPM-o 4.5、Kimi-Audio)中,ESN干预产生了特定情感的增益,这些增益能够推广到未见过的说话者,并得到了自动和人工评估的支持。可控性依赖于选择器设计、掩码稀疏性、过滤和干预强度。我们的结果为语音生成中的免训练情感控制建立了一个机制性框架。
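A minimal sketch of the neuron-level recipe described above, assuming a simplified selection rule and invented helper names (`select_esns`, `steer`) rather than the paper's exact success-filtered aggregation: rank neurons by their mean activation gap on successful emotional generations, then shift only those coordinates at inference time.

```python
# Toy sketch of neuron-level emotion steering (simplified assumptions,
# not the paper's exact procedure): rank neurons by the mean activation
# gap between successful emotional and neutral generations, then add a
# scaled offset on those coordinates at inference time.

def select_esns(emotion_acts, neutral_acts, k):
    """Return indices of the k neurons with the largest mean activation gap."""
    n = len(emotion_acts[0])
    mean_e = [sum(a[i] for a in emotion_acts) / len(emotion_acts) for i in range(n)]
    mean_n = [sum(a[i] for a in neutral_acts) / len(neutral_acts) for i in range(n)]
    ranked = sorted(range(n), key=lambda i: abs(mean_e[i] - mean_n[i]), reverse=True)
    return ranked[:k]

def steer(hidden, esns, direction, alpha=1.0):
    """Training-free intervention: shift only the selected coordinates."""
    out = list(hidden)
    for i in esns:
        out[i] += alpha * direction[i]
    return out
```

The paper's ESN identification additionally filters for content preservation and works on per-layer hidden states; `alpha` plays the role of the intervention-strength hyperparameter studied in the ablations.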
cs.CL / 18 / 2603.17303
From Words to Worlds: Benchmarking Cross-Cultural Cultural Understanding in Machine Translation
从词语到世界:跨文化理解在机器翻译中的基准评估
Abstract
Cultural expressions, such as idioms, slang, and culture-specific items (CSIs), are pervasive in natural language and encode meanings that go beyond literal linguistic form. Accurately translating such expressions remains challenging for machine translation systems. Despite this, existing benchmarks remain fragmented and do not provide a systematic framework for evaluating translation performance on culture-loaded expressions. To address this gap, we introduce CulT-Eval, a benchmark designed to evaluate how models handle different types of culturally grounded expressions. CulT-Eval comprises over 7,959 carefully curated instances spanning multiple types of culturally grounded expressions, together with a comprehensive error taxonomy for such expressions. Through extensive evaluation of large language models and detailed analysis, we identify recurring and systematic failure modes that are not adequately captured by existing automatic metrics. Accordingly, we propose a complementary evaluation metric that targets culturally induced meaning deviations overlooked by standard MT metrics. The results indicate that current models struggle to preserve culturally grounded meaning and to capture the cultural and contextual nuances essential for accurate translation. Our benchmark and code are available at https://anonymous.4open.science/r/CulT-Eval-E75D/.
Chinese Translation
文化表达,如习语、俚语和特定文化项目(CSIs),在自然语言中无处不在,编码的意义超越了字面语言形式。准确翻译这些表达对机器翻译系统仍然具有挑战性。尽管如此,现有的基准测试仍然零散,未能提供一个系统框架来评估文化负载表达的翻译性能。为了解决这一问题,我们推出了CulT-Eval,一个旨在评估模型如何处理不同类型文化根植表达的基准。CulT-Eval包含超过7,959个精心策划的实例,涵盖多种类型的文化根植表达,并提供了一个全面的错误分类法,涵盖文化根植表达。通过对大型语言模型的广泛评估和详细分析,我们识别出一些反复出现且系统性的失败模式,这些模式未能被现有的自动评估指标充分捕捉。因此,我们提出了一种补充评估指标,针对标准机器翻译指标所忽视的文化引起的意义偏差。结果表明,当前模型在保留文化根植意义和捕捉准确翻译所必需的文化和上下文细微差别方面存在困难。我们的基准和代码可在 https://anonymous.4open.science/r/CulT-Eval-E75D/ 获取。
cs.CL / 19 / 2603.17306
Beyond bouba/kiki: Multidimensional semantic signals are deeply woven into the fabric of natural language
超越 bouba/kiki:多维语义信号深深融入自然语言的结构中
Abstract
A foundational assumption in linguistics holds that the relationship between a word's sound and its meaning is arbitrary. Accumulating evidence from sound symbolism challenges this view, yet no study has systematically mapped the multidimensional semantic profile of every phonological unit within a language. Here we show that individual letter-phonemes in English carry structured, multidimensional semantic signals. Using a minimal-pair paradigm spanning all 220 pairwise letter contrasts, three large language models independently recover consistent phoneme-meaning associations across nine perceptual dimensions. These associations are systematically predicted by articulatory-phonetic features, with manner and place of articulation mapping onto distinct semantic dimensions. Behavioral data from English speakers confirm these patterns at rates well above chance (80.8%), and preliminary cross-linguistic evidence from five typologically diverse languages suggests that core mappings generalize beyond English. Our findings indicate that sound-meaning iconicity is not an occasional curiosity but a pervasive, structured property of the phonological signal, one so systematic that large language models recover it when given only text input, without exposure to speech or articulation during the task.
Chinese Translation
语言学的一个基本假设认为,词语的声音与其意义之间的关系是任意的。越来越多的声音象征主义证据挑战了这一观点,但尚无研究系统地映射出语言中每个音位单元的多维语义特征。在此,我们展示了英语中个别字母音素携带结构化的多维语义信号。通过涵盖所有220对字母对比的最小对比范式,三个大型语言模型独立地恢复了九个感知维度上的一致音素-意义关联。这些关联由发音-语音特征系统性预测,发音方式和发音部位映射到不同的语义维度。来自英语使用者的行为数据确认了这些模式,其准确率远高于偶然水平(80.8%),并且来自五种类型学上多样语言的初步跨语言证据表明,核心映射超越了英语。我们的研究结果表明,声音-意义的象征性并非偶然的好奇现象,而是音位信号的普遍、结构化特性,这种系统性使得大型语言模型在仅提供文本输入的情况下也能恢复它,而无需在任务过程中接触语音或发音。
cs.CL / 20 / 2603.17311
Ruyi2.5 Technical Report
Ruyi2.5 技术报告
Abstract
We present Ruyi2.5, a multimodal model family built on the AI Flow framework. Extending Ruyi2's "Train Once, Deploy Many" paradigm to the multimodal domain, Ruyi2.5 constructs a shared-backbone architecture that co-trains models of varying scales within a single unified pipeline, ensuring semantic consistency across all deployment tiers. Built upon Ruyi2.5, the Ruyi2.5-Camera model is developed as a privacy-preserving camera service system, instantiated as a two-stage recognition pipeline: an edge model applies information-bottleneck-guided irreversible feature mapping to de-identify raw frames at the source, while a cloud model performs deep behavior reasoning. To accelerate reinforcement learning fine-tuning, we further propose Binary Prefix Policy Optimization (BPPO), which reduces sample redundancy via binary response selection and focuses gradient updates on response prefixes, achieving a 2 to 3 times training speedup over GRPO. Experiments show Ruyi2.5 matches Qwen3-VL on general multimodal benchmarks, while Ruyi2.5-Camera substantially outperforms Qwen3-VL on privacy-constrained surveillance tasks.
Chinese Translation
我们提出了 Ruyi2.5,这是一个基于 AI Flow 框架构建的多模态家庭模型。Ruyi2.5 扩展了 Ruyi2 的“训练一次,多次部署”范式至多模态领域,构建了一个共享主干架构,在单一统一的流程中共同训练不同规模的模型,确保所有部署层之间的语义一致性。在 Ruyi2.5 的基础上,开发了 Ruyi2.5-Camera 模型,作为一个保护隐私的摄像头服务系统,将 Ruyi2.5-Camera 实现为一个两阶段的识别流程:边缘模型应用信息瓶颈引导的不可逆特征映射在源头对原始帧进行去标识化,而云模型则执行深度行为推理。为了加速强化学习的微调,我们进一步提出了二进制前缀策略优化(Binary Prefix Policy Optimization, BPPO),通过二进制响应选择减少样本冗余,并将梯度更新集中在响应前缀上,相较于 GRPO 实现了 2 到 3 倍的训练加速。实验表明,Ruyi2.5 在通用多模态基准测试中与 Qwen3-VL 相匹配,而 Ruyi2.5-Camera 在隐私受限的监控任务中显著优于 Qwen3-VL。
cs.CL / 21 / 2603.17333
Grid Spatial Understanding: A Dataset for Textual Spatial Reasoning over Grids, Embodied Settings, and Coordinate Structures
网格空间理解:一个用于网格、具身环境和坐标结构的文本空间推理数据集
Abstract
We introduce GSU, a text-only grid dataset for evaluating the spatial reasoning capabilities of LLMs over three core tasks: navigation, object localization, and structure composition. By forgoing visual inputs, thereby isolating spatial reasoning from perception, we show that while most models grasp basic grid concepts, they struggle with frames of reference relative to an embodied agent and with identifying 3D shapes from coordinate lists. We also find that exposure to a visual modality does not provide a generalizable understanding of 3D space that VLMs are able to utilize for these tasks. Finally, we show that while the very latest frontier models can solve the provided tasks (though harder variants may still stump them), fully fine-tuning a small LM or LoRA fine-tuning a small LLM shows potential to match frontier model performance, suggesting an avenue for specialized embodied agents.
Chinese Translation
我们介绍了 GSU,这是一个仅包含文本的网格数据集,用于评估大型语言模型(LLMs)在三个核心任务上的空间推理能力:导航、物体定位和结构组合。通过放弃视觉输入,将空间推理与感知隔离,我们发现尽管大多数模型掌握了基本的网格概念,但在相对于具身代理的参考框架和从坐标列表中识别三维形状方面存在困难。我们还发现,接触视觉模态并未提供 VLMs(视觉语言模型)能够利用于这些任务的可推广的三维空间理解。最后,我们展示了尽管最新的前沿模型能够解决所提供的任务(尽管更困难的变体可能仍会让它们感到困惑),但对小型语言模型(LM)进行全面微调或对小型 LLM 进行 LORA 微调显示出与前沿模型性能相匹配的潜力,这为专门的具身代理提供了一个研究方向。
cs.CL / 22 / 2603.17356
PACE-RAG: Patient-Aware Contextual and Evidence-based Policy RAG for Clinical Drug Recommendation
PACE-RAG:患者意识的上下文和基于证据的临床药物推荐政策 RAG
Abstract
Drug recommendation requires a deep understanding of individual patient context, especially for complex conditions like Parkinson's disease. While LLMs possess broad medical knowledge, they fail to capture the subtle nuances of actual prescribing patterns. Existing RAG methods also struggle with these complexities because guideline-based retrieval remains too generic and similar-patient retrieval often replicates majority patterns without accounting for the unique clinical nuances of individual patients. To bridge this gap, we propose PACE-RAG (Patient-Aware Contextual and Evidence-based Policy RAG), a novel framework designed to synthesize individual patient context with the prescribing tendencies of similar cases. By analyzing treatment patterns tailored to specific clinical signals, PACE-RAG identifies optimal prescriptions and generates an explainable clinical summary. Evaluated on a Parkinson's cohort and the MIMIC-IV benchmark using Llama-3.1-8B and Qwen3-8B, PACE-RAG achieved state-of-the-art performance, reaching F1 scores of 80.84% and 47.22%, respectively. These results validate PACE-RAG as a robust, clinically grounded solution for personalized decision support. Our code is available at: https://github.com/ChaeYoungHuh/PACE-RAG.
Chinese Translation
药物推荐需要深入理解个体患者的背景,尤其是在帕金森病等复杂疾病的情况下。尽管大型语言模型(LLMs)具备广泛的医学知识,但它们未能捕捉到实际处方模式的细微差别。现有的 RAG 方法也在这些复杂性面前显得力不从心,因为基于指南的检索过于通用,而类似患者的检索往往重复了大多数模式,而未考虑个体患者独特的临床细微差别。为了解决这一问题,我们提出了 PACE-RAG(患者意识的上下文和基于证据的政策 RAG),这是一个新颖的框架,旨在将个体患者的背景与类似案例的处方倾向相结合。通过分析针对特定临床信号的治疗模式,PACE-RAG 能够识别最佳处方并生成可解释的临床总结。在帕金森病队列和 MIMIC-IV 基准测试中,使用 Llama-3.1-8B 和 Qwen3-8B 进行评估,PACE-RAG 达到了最先进的性能,分别获得了 80.84% 和 47.22% 的 F1 分数。这些结果验证了 PACE-RAG 作为一个稳健的、以临床为基础的个性化决策支持解决方案的有效性。我们的代码可在以下链接获取:https://github.com/ChaeYoungHuh/PACE-RAG。
cs.CL / 23 / 2603.17373
SafeTutors: Benchmarking Pedagogical Safety in AI Tutoring Systems
SafeTutors:人工智能辅导系统中教学安全的基准测试
Abstract
Large language models are rapidly being deployed as AI tutors, yet current evaluation paradigms assess problem-solving accuracy and generic safety in isolation, failing to capture whether a model is simultaneously pedagogically effective and safe across student-tutor interaction. We argue that tutoring safety is fundamentally different from conventional LLM safety: the primary risk is not toxic content but the quiet erosion of learning through answer over-disclosure, misconception reinforcement, and the abdication of scaffolding. To systematically study this failure mode, we introduce SafeTutors, a benchmark that jointly evaluates safety and pedagogy across mathematics, physics, and chemistry. SafeTutors is organized around a theoretically grounded risk taxonomy comprising 11 harm dimensions and 48 sub-risks drawn from learning-science literature. We uncover that all models show broad harm; scale doesn't reliably help; and multi-turn dialogue worsens behavior, with pedagogical failures rising from 17.7% to 77.8%. Harms also vary by subject, so mitigations must be discipline-aware, and single-turn "safe/helpful" results can mask systematic tutor failure over extended interaction.
Chinese Translation
大型语言模型正迅速被部署为人工智能辅导员,但当前的评估范式仅孤立地评估问题解决的准确性和一般安全性,未能捕捉模型在学生与辅导员互动中是否同时具有教学有效性和安全性。我们认为,辅导安全与传统的LLM安全根本不同:主要风险不是有毒内容,而是通过过度披露答案、强化误解和放弃支架支持而悄然侵蚀学习。为了系统地研究这种失败模式,我们引入了SafeTutors,一个在数学、物理和化学领域共同评估安全性和教学法的基准。SafeTutors围绕一个理论基础的风险分类法组织,包含来自学习科学文献的11个伤害维度和48个子风险。我们发现所有模型都表现出广泛的伤害;规模并不能可靠地提供帮助;而多轮对话则加剧了行为问题,教学失败率从17.7%上升至77.8%。伤害也因学科而异,因此缓解措施必须考虑学科特点,而单轮的“安全/有帮助”结果可能掩盖了在长期互动中系统性辅导失败的问题。
cs.CL / 24 / 2603.17432
Argument Reconstruction as Supervision for Critical Thinking in LLMs
作为批判性思维监督的论证重构
Abstract
To think critically about arguments, human learners are trained to identify, reconstruct, and evaluate arguments. Argument reconstruction is especially important because it makes an argument's underlying inferences explicit. However, it remains unclear whether LLMs can similarly enhance their critical thinking ability by learning to reconstruct arguments. To address this question, we introduce a holistic framework with three contributions. We (1) propose an engine that automatically reconstructs arbitrary arguments (GAAR), (2) synthesize a new high-quality argument reconstruction dataset (Arguinas) using the GAAR engine, and (3) investigate whether learning argument reconstruction benefits downstream critical thinking tasks. Our experimental results show that, across seven critical thinking tasks, models trained to learn argument reconstruction outperform models that do not, with the largest performance gains observed when training on the proposed Arguinas dataset. The source code and dataset will be publicly available.
Chinese Translation
为了对论证进行批判性思考,人类学习者被训练去识别、重构和评估论证。论证重构尤为重要,因为它使论证的潜在推理变得明确。然而,目前尚不清楚大型语言模型(LLMs)是否也能通过学习重构论证来增强其批判性思维能力。为了解决这个问题,我们提出了一个整体框架,包含三个贡献。我们(1)提出了一个自动重构任意论证的引擎(GAAR),(2)使用GAAR引擎合成了一个新的高质量论证重构数据集(Arguinas),以及(3)研究学习论证重构是否有助于下游的批判性思维任务。我们的实验结果表明,在七个批判性思维任务中,训练以学习论证重构的模型的表现优于不进行此训练的模型,尤其是在使用提议的Arguinas数据集进行训练时,性能提升最为显著。源代码和数据集将公开发布。
cs.CL / 25 / 2603.17449
TRiMS: Real-Time Tracking of Minimal Sufficient Length for Efficient Reasoning via RL
TRiMS:通过强化学习实现高效推理的最小充分长度实时跟踪
Abstract
Large language models achieve breakthroughs in complex reasoning via long chain-of-thought sequences. However, this often leads to severe reasoning inflation, causing substantial computational redundancy. To maximize Intelligence per Token, we introduce a theoretical metric, MSL (Minimal Sufficient Length). MSL rigorously characterizes the shortest reasoning length that preserves answer correctness. We provide a recursive definition based on independently sampled sequences and prove the existence of its limit, establishing the first measurable lower bound for reasoning-chain compression. Building on an analysis of mainstream CoT compression strategies, we identify key structural factors enabling a model to approach MSL. Based on these insights, we propose TRiMS which employs the GRPO algorithm in conjunction with MSL-based estimation during training, while mitigating instabilities during the training process through dynamic batch aggregation and advantage computation using batch-level standard deviation. TRiMS achieves over 80% CoT token reduction with a minor accuracy boost across all benchmarks.
Chinese Translation
大型语言模型通过长思维链(CoT)推理序列在复杂推理方面取得了突破。然而,这往往导致严重的推理膨胀,造成显著的计算冗余。为了最大化每个标记的智能,我们引入了一个理论指标:MSL,即最小充分长度(Minimal Sufficient Length)。MSL严格表征了保持答案正确性的最短推理长度。我们提供了基于独立采样序列的递归定义,并证明了其极限的存在,建立了推理链压缩的第一个可测量下界。在对主流CoT压缩策略的分析基础上,我们识别出使模型接近MSL的关键结构因素。基于这些见解,我们提出了TRiMS,该方法在训练过程中结合使用GRPO算法和基于MSL的估计,同时通过动态批量聚合和使用批量级标准差进行优势计算来减轻训练过程中的不稳定性。TRiMS在所有基准测试中实现了超过80%的CoT标记削减,并且准确性略有提升。
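The recursive MSL definition admits a simple empirical approximation. The sketch below is our own hedged simplification, not the paper's estimator: the oracle `is_correct_from_prefix` stands in for sampling continuations from the model, and the paper's actual definition is recursive over independently sampled sequences.

```python
# Hedged Monte-Carlo approximation of a minimal sufficient prefix length:
# scan prefix lengths and return the smallest one from which continuations
# reach the correct answer with empirical frequency >= tau.

def estimate_msl(chain, is_correct_from_prefix, n_samples=8, tau=1.0):
    """Smallest L such that continuing from chain[:L] is judged correct with
    frequency >= tau over n_samples trials (simplified illustrative estimator)."""
    for L in range(len(chain) + 1):
        hits = sum(bool(is_correct_from_prefix(chain[:L])) for _ in range(n_samples))
        if hits / n_samples >= tau:
            return L
    return len(chain)
```

With a stochastic oracle the scan would be repeated and averaged; the linear scan could also be replaced by a search once correctness is monotone in prefix length.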
cs.CL / 26 / 2603.17475
Humans and transformer LMs: Abstraction drives language learning
人类与变换器语言模型:抽象驱动语言学习
Abstract
Categorization is a core component of human linguistic competence. We investigate how a transformer-based language model (LM) learns linguistic categories by comparing its behaviour over the course of training to behaviours which characterize abstract feature-based and concrete exemplar-based accounts of human language acquisition. We investigate how lexical semantic and syntactic categories emerge using novel divergence-based metrics that track learning trajectories using next-token distributions. In experiments with GPT-2 small, we find that (i) when a construction is learned, abstract class-level behaviour is evident at earlier steps than lexical item-specific behaviour, and (ii) that different linguistic behaviours emerge abruptly in sequence at different points in training, revealing that abstraction plays a key role in how LMs learn. This result informs the models of human language acquisition that LMs may serve as an existence proof for.
Chinese Translation
分类是人类语言能力的核心组成部分。我们通过将基于变换器的语言模型(LM)在训练过程中的行为,与刻画人类语言习得的基于抽象特征和基于具体范例的两类理论所对应的行为进行比较,研究该模型如何学习语言类别。我们使用新颖的基于散度的度量方法,利用下一个标记分布跟踪学习轨迹,研究词汇语义和句法类别的出现。在对GPT-2 small的实验中,我们发现:(i) 当一个结构被学习时,抽象类级别的行为在比词汇项特定行为更早的训练步骤中显现,(ii) 不同的语言行为在训练的不同阶段依次突然出现,揭示了抽象在语言模型学习中的关键作用。该结果对那些可能以语言模型作为存在性证明的人类语言习得模型提供了启示。
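One plausible instantiation of such a divergence-based trajectory metric, under our own assumptions rather than the paper's exact formulation: compare each lexical item's next-token distribution to the class-level average, where a low mean KL indicates abstract class-level behaviour and a high value indicates item-specific behaviour.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) over aligned probability vectors."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def item_to_class_divergence(item_dists):
    """Mean KL from each item's next-token distribution to the class average.
    Illustrative metric, not the paper's exact definition."""
    n, v = len(item_dists), len(item_dists[0])
    avg = [sum(d[j] for d in item_dists) / n for j in range(v)]
    return sum(kl(d, avg) for d in item_dists) / n
```

Tracking this quantity across checkpoints would show class-level behaviour (near-zero divergence within a category) emerging before, or independently of, item-specific differentiation.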
cs.CL / 27 / 2603.17484
Learning When to Attend: Conditional Memory Access for Long-Context LLMs
学习何时关注:长上下文大语言模型的条件记忆访问
Abstract
Language models struggle to generalize beyond pretraining context lengths, limiting long-horizon reasoning and retrieval. Continued pretraining on long-context data can help but is expensive due to the quadratic scaling of Attention. We observe that most tokens do not require (Global) Attention over the entire sequence and can rely on local context. Based on this, we propose L2A (Learning To Attend), a layer that enables conditional (token-wise) long-range memory access by deciding when to invoke global attention. We evaluate L2A on Qwen 2.5 and Qwen 3 models, extending their effective context length from 32K to 128K tokens. L2A matches the performance of standard long-context training to within 3% while skipping Global Attention for $\sim$80% of tokens, outperforming prior baselines. We also design custom Triton kernels to efficiently implement this token-wise conditional Attention on GPUs, achieving up to $\sim$2x improvements in training throughput and time-to-first-token over FlashAttention. Moreover, L2A enables post-training pruning of highly sparse Global Attention layers, reducing KV cache memory by up to 50% with negligible performance loss.
Chinese Translation
语言模型在超出预训练上下文长度时难以进行泛化,这限制了长时间跨度的推理和检索。继续在长上下文数据上进行预训练可以有所帮助,但由于注意力机制的平方扩展性,这种方法成本高昂。我们观察到,大多数标记并不需要对整个序列进行(全局)注意力,而可以依赖局部上下文。基于此,我们提出了L2A(Learning To Attend),这是一个通过决定何时调用全局注意力来实现条件(逐标记)长程记忆访问的层。我们在Qwen 2.5和Qwen 3模型上评估L2A,将它们的有效上下文长度从32K扩展到128K标记。L2A的性能与标准长上下文训练相差不超过3%,同时对约80%的标记跳过了全局注意力,超越了之前的基线。此外,我们还设计了自定义的Triton内核,以高效地在GPU上实现这种逐标记条件注意力,在训练吞吐量和首次标记时间上实现了高达约2倍的改进,优于FlashAttention。此外,L2A还支持对高度稀疏的全局注意力层进行后训练剪枝,将KV缓存内存减少多达50%,且几乎没有性能损失。
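The conditional-access idea can be illustrated with scalar toy attention. This is a sketch assuming a per-token boolean gate; the actual L2A layer learns this decision and runs fused Triton kernels. Each token attends over the full causal prefix only when its gate fires, and over a short local window otherwise.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def attend(q, keys, vals):
    """Toy scalar attention: weights from q*k scores, weighted sum of values."""
    w = softmax([q * k for k in keys])
    return sum(wi * v for wi, v in zip(w, vals))

def l2a_layer(queries, keys, vals, gate, window=2):
    """Causal toy attention with token-wise conditional global access:
    when gate(q) fires, attend over the whole prefix; otherwise attend
    only over the last `window` positions."""
    out = []
    for t, q in enumerate(queries):
        lo = 0 if gate(q) else max(0, t + 1 - window)
        out.append(attend(q, keys[lo:t + 1], vals[lo:t + 1]))
    return out
```

With `gate` returning False everywhere and `window=1`, every token attends only to itself; letting the gate fire for roughly 20% of tokens mirrors the reported ~80% global-attention skip rate.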
cs.CL / 28 / 2603.17504
Inducing Epistemological Humility in Large Language Models: A Targeted SFT Approach to Reducing Hallucination
在大型语言模型中引导认识谦逊:一种针对性监督微调方法以减少幻觉现象
Abstract
Large language models (LLMs) often hallucinate, producing fluent but false information, partly because supervised fine-tuning (SFT) implicitly rewards always responding. We introduce $\textit{HypoTermInstruct}$, an SFT dataset (31,487 responses for 11,151 questions) designed to teach models epistemological humility-the ability to recognize the limits of their own knowledge and admit uncertainty. This is achieved through questions about non-existent "hypothetical" terms. We also release $\textit{HypoTermQA-Enhanced}$, a benchmark for hallucination tendency strengthened through multiple validations. We conducted 800 controlled LoRA SFT runs across $\textit{Llama3.1-8B}$ and $\textit{Gemma3-4B}$ (base and instruct), testing 100 fine-tuning configurations with paired controls. Our results demonstrate that replacing generic instruction data with $\textit{HypoTermInstruct}$ significantly improves the HypoTerm Score (median increases of 0.19% to 25.91%) and FactScore (+0.39% to +0.86%), while maintaining stable performance on MMLU (minimal decreases of 0.26% to 0.35%). Our work demonstrates that targeted, high-quality SFT data teaching meta-cognitive skills can effectively reduce hallucination without preference/RL pipelines, providing mechanistic insights and a practical path toward more reliable AI systems.
Chinese Translation
大型语言模型(LLMs)常常产生幻觉,生成流畅但错误的信息,部分原因是监督微调(SFT)隐性地奖励总是给出回应。我们引入了$\textit{HypoTermInstruct}$,这是一个旨在教会模型认识谦逊——识别自身知识局限性并承认不确定性的能力——的SFT数据集(包含31,487个回应和11,151个问题)。这一目标通过关于不存在的“假设”术语的问题来实现。我们还发布了$\textit{HypoTermQA-Enhanced}$,这是一个通过多次验证增强的幻觉倾向基准。我们在$\textit{Llama3.1-8B}$和$\textit{Gemma3-4B}$(基础和指令模型)上进行了800次受控的LoRA SFT实验,测试了100种微调配置及其配对对照。我们的结果表明,用$\textit{HypoTermInstruct}$替换通用指令数据显著提高了HypoTerm评分(中位数增加0.19%至25.91%)和FactScore(+0.39%至+0.86%),同时在MMLU上的表现保持稳定(最小下降为0.26%至0.35%)。我们的研究表明,针对性、高质量的SFT数据教授元认知技能能够有效减少幻觉现象,而无需偏好/强化学习管道,提供了机制性见解和通向更可靠的人工智能系统的实用路径。
cs.CL / 29 / 2603.17512
Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality
按需语言,核心知识:将编码器-解码器翻译模型与大语言模型组合以实现可扩展的多语言能力
Abstract
Large language models (LLMs) exhibit strong general intelligence, yet their multilingual performance remains highly imbalanced. Although LLMs encode substantial cross-lingual knowledge in a unified semantic space, they often struggle to reliably interface this knowledge with low-resource or unseen languages. Fortunately, pretrained encoder-decoder translation models already possess balanced multilingual capability, suggesting a natural complement to LLMs. In this work, we propose XBridge, a compositional encoder-LLM-decoder architecture that offloads multilingual understanding and generation to external pretrained translation models, while preserving the LLM as an English-centric core for general knowledge processing. To address the resulting representation misalignment across models, we introduce lightweight cross-model mapping layers and an optimal transport-based alignment objective, enabling fine-grained semantic consistency for multilingual generation. Experiments on four LLMs across multilingual understanding, reasoning, summarization, and generation indicate that XBridge outperforms strong baselines, especially on low-resource and previously unseen languages, without retraining the LLM.
Chinese Translation
大型语言模型(LLMs)展现出强大的通用智能,但它们在多语言表现上仍然高度不平衡。尽管LLMs在统一的语义空间中编码了大量跨语言知识,但它们在与低资源或未见过的语言可靠对接时常常面临困难。幸运的是,预训练的编码器-解码器翻译模型已经具备了平衡的多语言能力,这为LLMs提供了自然的补充。在本研究中,我们提出了XBridge,一种组合的编码器-LLM-解码器架构,将多语言理解和生成的任务转移给外部预训练的翻译模型,同时将LLM保留为一个以英语为中心的核心,用于一般知识处理。为了解决模型间的表示不对齐问题,我们引入了轻量级的跨模型映射层和基于最优传输的对齐目标,从而实现多语言生成的细粒度语义一致性。在多语言理解、推理、摘要和生成的实验中,XBridge在四个LLMs上表现优于强基线,特别是在低资源和之前未见过的语言上,且无需对LLM进行再训练。
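As a hedged sketch of an optimal transport-based alignment objective (entropic OT via Sinkhorn iterations over a cross-model token cost matrix; the specifics below are illustrative assumptions, not XBridge's released objective), the resulting coupling can weight a fine-grained feature-matching loss:

```python
import math

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT coupling between uniform marginals for an n x m cost matrix.
    Standard Sinkhorn: alternately rescale rows and columns of exp(-cost/eps)."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

In an alignment loss, `P[i][j]` would weight the distance between encoder token i and LLM token j, concentrating the penalty on semantically matched pairs rather than on a rigid one-to-one token alignment.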
cs.CL / 30 / 2603.17522
Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions
检测机器:跨架构、领域和对抗条件下的AI生成文本检测器的综合基准评估
Abstract
The rapid proliferation of large language models (LLMs) has created an urgent need for robust and generalizable detectors of machine-generated text. Existing benchmarks typically evaluate a single detector on a single dataset under ideal conditions, leaving open questions about cross-domain transfer, cross-LLM generalization, and adversarial robustness. We present a comprehensive benchmark evaluating diverse detection approaches across two corpora: HC3 (23,363 human-ChatGPT pairs) and ELI5 (15,000 human-Mistral-7B pairs). Methods include classical classifiers, fine-tuned transformer encoders (BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa-v3), a CNN, an XGBoost stylometric model, perplexity-based detectors, and LLM-as-detector prompting. Results show that transformer models achieve near-perfect in-distribution performance but degrade under domain shift. The XGBoost stylometric model matches performance while remaining interpretable. LLM-based detectors underperform and are affected by generator-detector identity bias. Perplexity-based methods exhibit polarity inversion, with modern LLM outputs showing lower perplexity than human text, but remain effective when corrected. No method generalizes robustly across domains and LLM sources.
Chinese Translation
大型语言模型(LLMs)的快速普及使得对机器生成文本的强健且可泛化的检测器的需求变得迫在眉睫。现有的基准通常在理想条件下对单一检测器进行单一数据集的评估,这使得跨领域迁移、跨LLM泛化和对抗鲁棒性等问题仍然悬而未决。我们提出了一项综合基准,评估了多种检测方法在两个语料库上的表现:HC3(23,363对人类-ChatGPT)和ELI5(15,000对人类-Mistral-7B)。方法包括经典分类器、微调的变换器编码器(BERT、RoBERTa、ELECTRA、DistilBERT、DeBERTa-v3)、卷积神经网络(CNN)、XGBoost风格模型、基于困惑度的检测器以及将LLM作为检测器的提示。结果表明,变换器模型在同分布下的表现接近完美,但在领域转移时表现下降。XGBoost风格模型在保持可解释性的同时匹配了性能。基于LLM的检测器表现不佳,并受到生成器-检测器身份偏差的影响。基于困惑度的方法表现出极性反转,现代LLM输出的困惑度低于人类文本,但在纠正后仍然有效。没有任何方法能够在不同领域和LLM来源之间实现强健的泛化。
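The polarity-inversion finding implies a corrected decision rule: because modern LLM output tends to score lower perplexity than human text, a detector should flag "machine" below a threshold rather than above it. A minimal sketch (the threshold value and log-probability interface are illustrative assumptions):

```python
import math

def perplexity(token_logprobs):
    """Per-token perplexity from natural-log token log-probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def detect(token_logprobs, threshold, machine_below=True):
    """Polarity-corrected rule: with machine_below=True, LOW perplexity
    (fluent, high-probability text) is treated as evidence of machine origin."""
    ppl = perplexity(token_logprobs)
    is_machine = (ppl < threshold) if machine_below else (ppl > threshold)
    return "machine" if is_machine else "human"
```

Setting `machine_below=False` recovers the classical (pre-inversion) rule; the benchmark's point is that the polarity must be chosen per generator era, and that no fixed threshold transfers robustly across domains.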
cs.CL / 31 / 2603.17543
AURORA Model of Formant-to-Tongue Inversion for Didactic and Clinical Applications
面向教学和临床应用的AURORA共振峰-舌位反演模型
Abstract
This paper outlines the conceptual and computational foundations of the AURORA (Acoustic Understanding and Real-time Observation of Resonant Articulations) model. AURORA predicts tongue displacement and shape in vowel sounds based on the first two formant values. It is intended as a didactic aid helping to explain the relationship between formants and the underlying articulation, as well as a foundation for biofeedback applications. The model is informed by ultrasound tongue imaging and acoustic data from 40 native speakers of English. In this paper we discuss the motivation for the model, the modelling objectives as well as the model architecture. We provide a qualitative evaluation of the model, focusing on selected tongue features. We then present two tools developed to make the model more accessible to a wider audience, a Shiny app and a prototype software for real-time tongue biofeedback. Potential users include students of phonetics, linguists in fields adjacent to phonetics, as well as speech and language therapy practitioners and clients.
Chinese Translation
本文概述了AURORA(声学理解与共振发音的实时观察)模型的概念和计算基础。AURORA根据前两个共振峰的数值预测元音发音中的舌位移和舌形。该模型旨在作为一种教学辅助工具,帮助解释共振峰与底层发音动作之间的关系,同时为生物反馈应用奠定基础。该模型基于40名英语母语者的超声舌成像和声学数据。本文讨论了模型的动机、建模目标以及模型架构。我们提供了对模型的定性评估,重点关注选定的舌部特征。随后,我们介绍了两个旨在使模型更易于更广泛受众使用的工具:一个Shiny应用程序,以及一个用于实时舌部生物反馈的原型软件。潜在用户包括语音学学生、语音学相邻领域的语言学家,以及言语与语言治疗从业者及其服务对象。
cs.CL / 32 / 2603.17558
Zipper-LoRA: Dynamic Parameter Decoupling for Speech-LLM based Multilingual Speech Recognition
Zipper-LoRA:基于语音大语言模型的多语言语音识别中的动态参数解耦
Abstract
Speech Large Language Models (Speech-LLMs) have emerged as a powerful approach for automatic speech recognition (ASR) by aligning speech encoders with large language models. However, adapting these systems to multilingual settings with imbalanced data distributions remains challenging. In such scenarios, a stability-plasticity dilemma often arises: fully shared Parameter-Efficient Fine-Tuning (PEFT) can cause negative inter-lingual interference for under-represented languages, while fully language-specific tuning limits the cross-lingual beneficial knowledge transfer needed for low-resource tasks. To address this, we propose Zipper-LoRA, a novel rank-level decoupling framework with three variants (Static, Hard, and Soft) that dynamically synthesizes LoRA updates from shared and language-specific subspaces. By using a lightweight language-conditioned router, Zipper-LoRA dynamically controls the contribution of each subspace at the LoRA rank level, enabling fine-grained sharing where languages are compatible and strict decoupling when conflicts occur. To further stabilize optimization under imbalanced data, we propose a two-stage training strategy with an Initial-B warm start that significantly accelerates convergence. Experiments on a 12-language mixed-resource setting show that Zipper-LoRA consistently outperforms both fully shared and independent baselines, particularly in extremely low-resource scenarios. Moreover, we demonstrate that these gains are robust across both chunked and non-chunked encoder configurations, confirming the framework's reliability for practical, large-scale multilingual ASR. Our code and data will be available at https://github.com/YuCeong-May/Zipper-LoRA for reproducibility.
Chinese Translation
语音大语言模型(Speech-LLMs)作为一种强大的自动语音识别(ASR)方法,通过将语音编码器与大语言模型对齐而崭露头角。然而,将这些系统适应于具有不平衡数据分布的多语言环境仍然面临挑战。在这种情况下,常常会出现稳定性-可塑性困境:完全共享的参数高效微调(PEFT)可能会对代表性不足的语言造成负面跨语言干扰,而完全特定于语言的微调则限制了低资源任务所需的跨语言有益知识转移。为了解决这个问题,我们提出了Zipper-LoRA,这是一种新颖的等级解耦框架,具有三种变体(静态、硬性和软性),能够动态合成来自共享和特定语言子空间的LoRA更新。通过使用轻量级的语言条件路由器,Zipper-LoRA在LoRA等级层面动态控制每个子空间的贡献,实现了在语言兼容时的细粒度共享,以及在发生冲突时的严格解耦。为了进一步稳定在不平衡数据下的优化,我们提出了一种两阶段训练策略,采用初始-B热启动,显著加速收敛。在12种语言混合资源设置下的实验表明,Zipper-LoRA在极低资源场景中始终优于完全共享和独立基线。此外,我们还证明了这些增益在分块和非分块编码器配置中都是稳健的,确认了该框架在实际大规模多语言ASR中的可靠性。我们的代码和数据将可在https://github.com/YuCeong-May/Zipper-LoRA上获取,以便于复现。
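The rank-level mixing at the heart of the Soft variant can be sketched with rank-1 factor pairs. This is a toy under our own assumptions: the `gate` vector stands in for the language-conditioned router, and the real system applies the mixed update inside each adapted weight matrix rather than as a standalone function.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def zipper_delta(x, shared, specific, gate):
    """Soft rank-level decoupling: for each rank r, gate[r] in [0, 1] weights
    the shared subspace and (1 - gate[r]) the language-specific one.
    shared/specific: lists of (a_r, b_r) rank-1 LoRA factor pairs."""
    out = [0.0] * len(shared[0][1])
    for (a_s, b_s), (a_l, b_l), g in zip(shared, specific, gate):
        cs = g * dot(a_s, x)          # shared contribution at this rank
        cl = (1.0 - g) * dot(a_l, x)  # language-specific contribution
        out = [o + cs * bs + cl * bl for o, bs, bl in zip(out, b_s, b_l)]
    return out
```

Setting `gate` to all ones recovers a fully shared adapter and all zeros a fully language-specific one; the Static and Hard variants correspond to fixed and binarized gates, respectively.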
cs.CL / 33 / 2603.17566
KA2L: A Knowledge-Aware Active Learning Framework for LLMs
KA2L:一种面向知识的主动学习框架用于大型语言模型
Abstract
Fine-tuning large language models (LLMs) with high-quality knowledge has been shown to enhance their performance effectively. However, there is a paucity of research on the depth of domain-specific knowledge comprehension by LLMs and the application of targeted active learning to improve their expertise. To address this gap, we introduce the Knowledge-Aware Active Learning (KA2L) framework. This framework assesses LLMs' mastery of specific knowledge points to aid in constructing unanswerable or unknowable questions through latent space analysis. This active learning strategy enhances training efficiency by focusing on knowledge the model has yet to master, thereby minimizing redundancy in learning already acquired information. This study innovatively employs a knowledge distribution probing technique to examine the hidden states of specific Transformer layers and identify the distribution of known and unknown knowledge within the LLM. Additionally, a hidden-state decoding method is proposed to generate numerous unknown questions in natural language from the latent knowledge space. In our experiments, we selected nine open-source LLMs to validate the effectiveness of the proposed framework. Results indicate that KA2L not only significantly reduces 50% annotation and computation costs across two open-domain and one vertical-domain dataset but also achieves better performance, offering valuable insights into active learning strategies for LLMs. The code is available at https://anonymous.4open.science/r/KA2L-F15C.
Chinese Translation
对大型语言模型(LLMs)进行高质量知识的微调已被证明能有效提升其性能。然而,关于LLMs对特定领域知识理解深度的研究以及针对性主动学习在提升其专业能力方面的应用仍然较为匮乏。为了解决这一问题,我们提出了知识感知主动学习(KA2L)框架。该框架评估LLMs对特定知识点的掌握情况,以帮助构建不可回答或未知的问题,采用潜在空间分析的方法。该主动学习策略通过聚焦于模型尚未掌握的知识,从而提高训练效率,减少对已获得信息的学习冗余。本研究创新性地采用知识分布探测技术,检查特定Transformer层的隐藏状态,并识别LLM中已知与未知知识的分布。此外,提出了一种隐藏状态解码方法,从潜在知识空间生成大量自然语言的未知问题。在我们的实验中,选择了九个开源LLMs来验证所提框架的有效性。结果表明,KA2L不仅在两个开放域和一个垂直域数据集上显著减少了50%的标注和计算成本,而且还实现了更好的性能,为LLMs的主动学习策略提供了有价值的见解。代码可在 https://anonymous.4open.science/r/KA2L-F15C 获取。
cs.CL / 34 / 2603.17613
VeriAgent: A Tool-Integrated Multi-Agent System with Evolving Memory for PPA-Aware RTL Code Generation
VeriAgent:一种集成工具的多智能体系统,具有演变记忆以支持PPA感知的RTL代码生成
Abstract
LLMs have recently demonstrated strong capabilities in automatic RTL code generation, achieving high syntactic and functional correctness. However, most methods focus on functional correctness while overlooking critical physical design objectives, including Power, Performance, and Area. In this work, we propose a PPA-aware, tool-integrated multi-agent framework for high-quality Verilog code generation. Our framework explicitly incorporates EDA tools into a closed-loop workflow composed of a \textit{Programmer Agent}, a \textit{Correctness Agent}, and a \textit{PPA Agent}, enabling joint optimization of functional correctness and physical metrics. To support continuous improvement without model retraining, we introduce an \textit{Evolved Memory Mechanism} that externalizes optimization experience into structured memory nodes. A dedicated memory manager dynamically maintains the memory pool and allows the system to refine strategies based on historical execution trajectories. Extensive experiments demonstrate that our approach achieves strong functional correctness while delivering significant improvements in PPA metrics. By integrating tool-driven feedback with structured and evolvable memory, our framework transforms RTL generation from one-shot reasoning into a continual, feedback-driven optimization process, providing a scalable pathway for deploying LLMs in real-world hardware design flows.
Chinese Translation
大型语言模型(LLMs)最近在自动RTL代码生成方面展现了强大的能力,达到了高语法和功能正确性。然而,大多数方法关注功能正确性,却忽视了关键的物理设计目标,包括功耗、性能和面积。在本研究中,我们提出了一种PPA感知的、集成工具的多智能体框架,用于高质量Verilog代码生成。我们的框架明确将电子设计自动化(EDA)工具纳入由\textit{程序员智能体}、\textit{正确性智能体}和\textit{PPA智能体}组成的闭环工作流程中,从而实现功能正确性与物理指标的联合优化。为了支持持续改进而无需重新训练模型,我们引入了一种\textit{演变记忆机制},将优化经验外部化为结构化的记忆节点。专门的记忆管理器动态维护记忆池,并允许系统根据历史执行轨迹优化策略。大量实验表明,我们的方法在实现强功能正确性的同时,在PPA指标上也取得了显著改善。通过将工具驱动的反馈与结构化和可演变的记忆相结合,我们的框架将RTL生成从一次性推理转变为持续的、基于反馈的优化过程,为在实际硬件设计流程中部署LLMs提供了可扩展的路径。
cs.CL / 35 / 2603.17624
Do Language Models Encode Semantic Relations? Probing and Sparse Feature Analysis
语言模型是否编码语义关系?探测与稀疏特征分析
Abstract
Understanding whether large language models (LLMs) capture structured meaning requires examining how they represent concept relationships. In this work, we study three models of increasing scale: Pythia-70M, GPT-2, and Llama 3.1 8B, focusing on four semantic relations: synonymy, antonymy, hypernymy, and hyponymy. We combine linear probing with mechanistic interpretability techniques, including sparse autoencoders (SAE) and activation patching, to identify where these relations are encoded and how specific features contribute to their representation. Our results reveal a directional asymmetry in hierarchical relations: hypernymy is encoded redundantly and resists suppression, while hyponymy relies on compact features that are more easily disrupted by ablation. More broadly, relation signals are diffuse but exhibit stable profiles: they peak in the mid-layers and are stronger in post-residual/MLP pathways than in attention. Difficulty is consistent across models (antonymy easiest, synonymy hardest). Probe-level causality is capacity-dependent: on Llama 3.1, SAE-guided patching reliably shifts these signals, whereas on smaller models the shifts are weak or unstable. Our results clarify where and how reliably semantic relations are represented inside LLMs, and provide a reproducible framework for relating sparse features to probe-level causal evidence.
Chinese Translation
理解大型语言模型(LLMs)是否捕捉结构化意义需要考察它们如何表示概念关系。在本研究中,我们研究了三个规模逐渐增大的模型:Pythia-70M、GPT-2 和 Llama 3.1 8B,重点关注四种语义关系:同义关系、反义关系、上位关系和下位关系。我们结合线性探测与机制可解释性技术,包括稀疏自编码器(SAE)和激活补丁,来识别这些关系在何处被编码,以及特定特征如何参与这些关系的表示。我们的结果揭示了层级关系中的方向性不对称:上位关系被冗余编码并抵抗抑制,而下位关系依赖于更易受消融影响的紧凑特征。更广泛地说,关系信号是分散的,但表现出稳定的特征:它们在中间层达到峰值,并且在后残差/多层感知器(MLP)路径中的强度高于在注意力机制中的强度。不同模型之间的难度是一致的(反义关系最容易,同义关系最难)。探测级别的因果关系依赖于模型的容量:在 Llama 3.1 上,SAE 引导的补丁可靠地改变了这些信号,而在较小的模型上,信号的变化则较弱或不稳定。我们的结果阐明了语义关系在 LLMs 内部的表示位置和可靠性,并提供了一个可重复的框架,将稀疏特征与探测级别的因果证据联系起来。
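Activation patching, one of the interventions the paper combines with probing, can be illustrated on a toy residual stack (a stand-in model invented here, not the authors' setup): overwriting the residual stream at one layer with a value cached from a donor run makes all downstream computation follow the donor run exactly.

```python
import numpy as np

def forward(h, layers, patch=None):
    """Toy residual 'model': each layer adds an MLP update to the stream h.
    patch = (layer_idx, vector) overwrites the residual stream after that
    layer's update, i.e. activation patching."""
    for i, (W, V) in enumerate(layers):
        h = h + V @ np.tanh(W @ h)
        if patch is not None and patch[0] == i:
            h = patch[1].copy()
    return h

rng = np.random.default_rng(0)
d = 8
layers = [(rng.normal(size=(d, d)) / d, rng.normal(size=(d, d)) / d)
          for _ in range(3)]
h_a, h_b = rng.normal(size=d), rng.normal(size=d)

# Cache the clean run on input B, layer by layer.
cache_b, h = [], h_b
for W, V in layers:
    h = h + V @ np.tanh(W @ h)
    cache_b.append(h.copy())

# Patch B's layer-1 stream into a run on input A: everything downstream of
# the patch now matches the donor run.
patched = forward(h_a, layers, patch=(1, cache_b[1]))
clean_b = forward(h_b, layers)
```

Because the patch replaces the entire stream, the patched run's output equals the donor run's output; real patching experiments replace only selected components or positions to localize where a relation is encoded.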
cs.CL / 36 / 2603.17677
Adaptive Guidance for Retrieval-Augmented Masked Diffusion Models
适应性指导用于检索增强的掩蔽扩散模型
Abstract
Retrieval-Augmented Generation (RAG) improves factual grounding by incorporating external knowledge into language model generation. However, when retrieved context is noisy, unreliable, or inconsistent with the model's parametric knowledge, it introduces retrieval-prior conflicts that can degrade generation quality. While this problem has been studied in autoregressive language models, it remains largely unexplored in diffusion-based language models, where the iterative denoising process introduces unique challenges for integrating retrieved context. In this work, we propose Adaptive Retrieval-Augmented Masked Diffusion (ARAM), a training-free adaptive guidance framework for Masked Diffusion Models (MDMs) in RAG settings. ARAM dynamically calibrates the guidance scale during denoising according to the Signal-to-Noise Ratio (SNR) of the distributional shift induced by retrieved context. Intuitively, the model strengthens guidance when the retrieved context provides reliable corrective evidence and suppresses it when the contextual signal is noisy or non-supportive. Extensive experiments on multiple knowledge-intensive QA benchmarks show that ARAM improves overall QA performance over competitive RAG baselines.
Chinese Translation
检索增强生成(RAG)通过将外部知识融入语言模型生成来改善事实基础。然而,当检索到的上下文噪声大、不可靠或与模型的参数知识不一致时,会引入检索内容与模型先验知识之间的冲突(retrieval-prior conflicts),从而降低生成质量。虽然这一问题在自回归语言模型中得到了研究,但在基于扩散的语言模型中仍然基本未被探索,因为迭代去噪过程为整合检索上下文带来了独特挑战。在本研究中,我们提出了适应性检索增强掩蔽扩散(ARAM),这是一个无训练的适应性指导框架,旨在为RAG环境中的掩蔽扩散模型(MDMs)提供支持。ARAM根据检索上下文引起的分布转移的信噪比(SNR)动态调整去噪过程中的指导尺度。直观地说,当检索上下文提供可靠的修正证据时,模型会增强指导,而在上下文信号噪声大或不支持时则会抑制指导。在多个知识密集型问答基准上的大量实验表明,ARAM在整体问答性能上优于竞争性的RAG基线。
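A minimal reconstruction of the SNR-calibrated guidance idea (our operationalisation for illustration, not ARAM's exact formula): treat the mean of the context-induced logit shift as signal and its dispersion as noise, then squash the resulting SNR into a bounded guidance scale.

```python
import numpy as np

def snr_guidance_scale(logits_base, logits_ctx, w_max=2.0):
    """Scale guidance by the SNR of the context-induced logit shift
    (illustrative reconstruction; w_max is an assumed cap)."""
    delta = logits_ctx - logits_base           # distributional shift from context
    signal = np.abs(delta.mean(axis=-1))       # coherent per-position shift
    noise = delta.std(axis=-1) + 1e-8          # dispersion of the shift
    snr = signal / noise
    return w_max * snr / (1.0 + snr)           # bounded in [0, w_max)

def guided_logits(logits_base, logits_ctx):
    """Denoising logits with adaptively scaled context guidance."""
    w = snr_guidance_scale(logits_base, logits_ctx)
    return logits_base + w[..., None] * (logits_ctx - logits_base)

rng = np.random.default_rng(0)
base = np.zeros((1, 8))
w_c = snr_guidance_scale(base, base + 1.0)                      # coherent evidence
w_n = snr_guidance_scale(base, base + rng.normal(size=(1, 8)))  # noisy context
```

A coherent shift yields a guidance weight near the cap, while a noisy, non-supportive shift is suppressed toward zero, matching the intuition in the abstract.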
cs.CL / 37 / 2603.17759
Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor
伤害还是幽默:一个多模态、多语言的显性与隐性有害幽默基准
Abstract
Dark humor often relies on subtle cultural nuances and implicit cues that require contextual reasoning to interpret, posing safety challenges that current static benchmarks fail to capture. To address this, we introduce a novel multimodal, multilingual benchmark for detecting and understanding harmful and offensive humor. Our manually curated dataset comprises 3,000 texts and 6,000 images in English and Arabic, alongside 1,200 videos that span English, Arabic, and language-independent (universal) contexts. Unlike standard toxicity datasets, we enforce a strict annotation guideline: distinguishing \emph{Safe} jokes from \emph{Harmful} ones, with the latter further classified into \emph{Explicit} (overt) and \emph{Implicit} (Covert) categories to probe deep reasoning. We systematically evaluate state-of-the-art (SOTA) open and closed-source models across all modalities. Our findings reveal that closed-source models significantly outperform open-source ones, with a notable difference in performance between the English and Arabic languages in both, underscoring the critical need for culturally grounded, reasoning-aware safety alignment. \textcolor{red}{Warning: this paper contains example data that may be offensive, harmful, or biased.}
Chinese Translation
黑色幽默往往依赖于微妙的文化细微差别和隐含线索,这些都需要上下文推理来进行解读,因此带来了当前静态基准无法捕捉的安全挑战。为了解决这一问题,我们引入了一个新颖的多模态、多语言基准,用于检测和理解有害和冒犯性的幽默。我们手动策划的数据集包含3,000条文本和6,000张图像,涵盖英语和阿拉伯语,以及1,200个视频,涉及英语、阿拉伯语和语言独立(通用)上下文。与标准的毒性数据集不同,我们执行严格的注释指南:区分\textit{安全}笑话与\textit{有害}笑话,后者进一步分类为\textit{显性}(overt)和\textit{隐性}(covert)类别,以探讨深层推理。我们系统地评估了所有模态下的最先进(SOTA)开源和闭源模型。我们的研究结果表明,闭源模型显著优于开源模型,并且在英语和阿拉伯语之间的性能差异显著,强调了基于文化的、关注推理的安全对齐的关键需求。\textcolor{red}{警告:本文包含可能冒犯、有害或偏见的示例数据。}
cs.CL / 38 / 2603.17775
CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution
CoVerRL:通过生成器-验证器共同进化打破无标签推理中的共识陷阱
Abstract
Label-free reinforcement learning enables large language models to improve reasoning capabilities without ground-truth supervision, typically by treating majority-voted answers as pseudo-labels. However, we identify a critical failure mode: as training maximizes self-consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles, with each capability bootstrapping the other. Majority voting provides noisy but informative supervision for training the verifier, while the improving verifier progressively filters self-consistent errors from pseudo-labels. This co-evolution creates a virtuous cycle that maintains high reward accuracy throughout training. Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label-free baselines by 4.7-5.9\% on mathematical reasoning benchmarks. Moreover, self-verification accuracy improves from around 55\% to over 85\%, confirming that both capabilities genuinely co-evolve.
Chinese Translation
无标签强化学习使大型语言模型在没有真实标签监督的情况下提高推理能力,通常通过将多数投票的答案视为伪标签。然而,我们识别出一种关键的失效模式:随着训练最大化自我一致性,输出多样性崩溃,导致模型自信地强化那些逃避检测的系统性错误。我们将其称为共识陷阱。为了逃脱这一陷阱,我们提出了CoVerRL,一个框架,其中单个模型在生成器和验证器角色之间交替,每种能力相互促进。多数投票为训练验证器提供了嘈杂但有信息量的监督,而不断改进的验证器逐步过滤伪标签中的自我一致性错误。这种共同进化创造了一个良性循环,在整个训练过程中保持高奖励准确性。在Qwen和Llama模型系列上的实验表明,CoVerRL在数学推理基准测试中比无标签基线提高了4.7-5.9%。此外,自我验证准确率从约55%提高到超过85%,确认这两种能力确实是共同进化的。
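The generator-verifier filtering loop above can be sketched as follows; the toy verifier and the 0.5 acceptance threshold are hypothetical stand-ins for the trained self-verifier, invented for illustration.

```python
from collections import Counter

def majority_pseudo_label(answers):
    """Majority vote over sampled answers -> (pseudo_label, consensus fraction)."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)

def filtered_pseudo_label(answers, verifier, accept=0.5):
    """Keep the majority answer only if the verifier also scores it above
    `accept`; otherwise abstain (None), filtering self-consistent errors."""
    label, _ = majority_pseudo_label(answers)
    return label if verifier(label) >= accept else None

# Toy verifier standing in for the trained self-verifier: it happens to
# score odd-numbered answers as likely correct.
verifier = lambda a: 0.9 if int(a) % 2 == 1 else 0.1

kept = filtered_pseudo_label(["7", "7", "3"], verifier)         # verified consensus
vetoed = filtered_pseudo_label(["4", "4", "4", "7"], verifier)  # consensus trap caught
```

The second case shows the escape from the consensus trap: a confidently self-consistent but wrong majority answer is vetoed by the verifier instead of being reinforced as a pseudo-label.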
cs.CL / 39 / 2603.17815
Process Supervision for Chain-of-Thought Reasoning via Monte Carlo Net Information Gain
通过蒙特卡罗网络信息增益进行思维链推理的过程监督
Abstract
Multi-step reasoning improves the capabilities of large language models (LLMs) but increases the risk of errors propagating through intermediate steps. Process reward models (PRMs) mitigate this by scoring each step individually, enabling fine-grained supervision and improved reliability. Existing methods for training PRMs rely on costly human annotations or computationally intensive automatic labeling. We propose a novel approach to automatically generate step-level labels using Information Theory. Our method estimates how each reasoning step affects the likelihood of the correct answer, providing a signal of step quality. Importantly, it reduces computational complexity to $\mathcal{O}(N)$, improving over the previous $\mathcal{O}(N \log N)$ methods. We demonstrate that these labels enable effective chain-of-thought selection in best-of-$K$ evaluation settings across diverse reasoning benchmarks, including mathematics, Python programming, SQL, and scientific question answering. This work enables scalable and efficient supervision of LLM reasoning, particularly for tasks where error propagation is critical.
Chinese Translation
多步骤推理提高了大型语言模型(LLMs)的能力,但增加了错误在中间步骤中传播的风险。过程奖励模型(PRMs)通过对每个步骤进行单独评分来缓解这一问题,从而实现细粒度监督和提高可靠性。现有的训练PRMs的方法依赖于昂贵的人类标注或计算密集型的自动标记。我们提出了一种新颖的方法,通过信息论自动生成步骤级标签。我们的方法估计每个推理步骤对正确答案的可能性影响,从而提供步骤质量的信号。重要的是,它将计算复杂度降低到 $\mathcal{O}(N)$,优于之前的 $\mathcal{O}(N \log N)$ 方法。我们证明这些标签能够在涵盖数学、Python编程、SQL和科学问答的多样化推理基准测试中,支持 best-of-$K$ 评估设置下的有效思维链选择。这项工作使得对LLM推理的可扩展和高效监督成为可能,特别是在错误传播至关重要的任务中。
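The step-label construction can be sketched directly from the stated idea: given one estimate of log P(correct answer | reasoning prefix) per step, each step's label is the net change in that log-probability, so labeling a chain needs one estimate per step, i.e. $\mathcal{O}(N)$ model calls. The numbers below are illustrative placeholders.

```python
def step_rewards(answer_logprob):
    """Net-information-gain step labels (sketch): answer_logprob[i] is
    log P(correct answer | first i reasoning steps); answer_logprob[0] is the
    prior before any step. One estimate per step -> O(N) overall."""
    return [answer_logprob[i] - answer_logprob[i - 1]
            for i in range(1, len(answer_logprob))]

# A step that raises the answer's likelihood earns a positive label,
# a distracting step a negative one.
rewards = step_rewards([-3.0, -2.0, -2.5, -0.5])
```

The labels telescope: their sum equals the total log-likelihood gain of the full chain over the prior, which is what makes them usable for best-of-$K$ chain selection.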
cs.CL / 40 / 2603.17832
Text-to-Stage: Spatial Layouts from Long-form Narratives
文本到舞台:来自长篇叙述的空间布局
Abstract
In this work, we probe the ability of a language model to demonstrate spatial reasoning from unstructured text, mimicking human capabilities and automating a process that benefits many downstream media applications. Concretely, we study the narrative-to-play task: inferring stage-play layouts (scenes, speaker positions, movements, and room types) from text that lacks explicit spatial, positional, or relational cues. We then introduce a dramaturgy-inspired deterministic evaluation suite and, finally, a training and inference recipe that combines rejection SFT using Best-of-N sampling with RL from verifiable rewards via GRPO. Experiments on a text-only corpus of classical English literature demonstrate improvements over vanilla models across multiple metrics (character attribution, spatial plausibility, and movement economy), as well as alignment with an LLM-as-a-judge and subjective human preferences.
Chinese Translation
在本研究中,我们探讨了语言模型从非结构化文本中进行空间推理的能力,模拟人类的能力,并自动化一个对许多下游媒体应用有益的过程。具体而言,我们研究了叙述到剧本的任务:从缺乏明确空间、位置或关系线索的文本中推断舞台剧布局(场景、发言者位置、移动和房间类型)。然后,我们引入了一个受戏剧学启发的确定性评估套件,最后提出了一种训练和推理方案,将使用 Best-of-N 采样的拒绝式监督微调(SFT)与通过 GRPO 基于可验证奖励的强化学习(RL)相结合。在一个仅包含文本的经典英语文学语料库上的实验显示,在多个指标(角色归属、空间合理性和移动经济性)上,相较于原始模型有所改善,并且与作为评判者的LLM(大型语言模型)和主观人类偏好保持一致。
cs.CL / 41 / 2603.17838
Event-Centric Human Value Understanding in News-Domain Texts: An Actor-Conditioned, Multi-Granularity Benchmark
基于事件的人类价值理解在新闻领域文本中的应用:一种以行为者为条件的多粒度基准
Abstract
Existing human value datasets do not directly support value understanding in factual news: many are actor-agnostic, rely on isolated utterances or synthetic scenarios, and lack explicit event structure or value direction. We present \textbf{NEVU} (\textbf{N}ews \textbf{E}vent-centric \textbf{V}alue \textbf{U}nderstanding), a benchmark for \emph{actor-conditioned}, \emph{event-centric}, and \emph{direction-aware} human value recognition in factual news. NEVU evaluates whether models can identify value cues, attribute them to the correct actor, and determine value direction from grounded evidence. Built from 2{,}865 English news articles, NEVU organizes annotations at four semantic unit levels (\textbf{Subevent}, \textbf{behavior-based composite event}, \textbf{story-based composite event}, and \textbf{Article}) and labels \mbox{(unit, actor)} pairs for fine-grained evaluation across local and composite contexts. The annotations are produced through an LLM-assisted pipeline with staged verification and targeted human auditing. Using a hierarchical value space with \textbf{54} fine-grained values and \textbf{20} coarse-grained categories, NEVU covers 45{,}793 unit--actor pairs and 168{,}061 directed value instances. We provide unified baselines for proprietary and open-source LLMs, and find that lightweight adaptation (LoRA) consistently improves open-source models, showing that although NEVU is designed primarily as a benchmark, it also supports supervised adaptation beyond prompting-only evaluation. Data availability is described in Appendix~\ref{app:data_code_availability}.
Chinese Translation
现有的人类价值数据集并未直接支持对事实新闻中的价值理解:许多数据集与行为者无关,依赖孤立的言辞或合成场景,且缺乏明确的事件结构或价值方向。我们提出了\textbf{NEVU}(\textbf{N}ews \textbf{E}vent-centric \textbf{V}alue \textbf{U}nderstanding),这是一个用于\textit{以行为者为条件}、\textit{以事件为中心}和\textit{关注方向}的人类价值识别的基准,专注于事实新闻。NEVU评估模型是否能够识别价值线索,将其归因于正确的行为者,并从基础证据中确定价值方向。NEVU基于2865篇英文新闻文章构建,按四个语义单元层次(\textbf{子事件}、\textbf{基于行为的复合事件}、\textbf{基于故事的复合事件}和\textbf{文章})组织注释,并为细粒度评估标记(单元,行为者)对,以涵盖局部和复合上下文。注释通过一个LLM辅助的流程生成,经过分阶段验证和针对性的人工审计。使用包含\textbf{54}个细粒度价值和\textbf{20}个粗粒度类别的分层价值空间,NEVU覆盖了45793个单元-行为者对和168061个有向价值实例。我们提供了专有和开源LLM的统一基线,并发现轻量级适应(LoRA)始终提高了开源模型的表现,表明尽管NEVU主要作为基准设计,但它也支持超越仅提示评估的监督适应。数据可用性详见附录~\ref{app:data_code_availability}。
cs.CL / 42 / 2603.17839
How do LLMs Compute Verbal Confidence
大型语言模型如何计算语言信心
Abstract
Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed - just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents - token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B and Qwen 2.5 7B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation -- not post-hoc reconstruction -- with implications for understanding metacognition in LLMs and improving calibration.
Chinese Translation
语言信心——促使大型语言模型(LLMs)以数字或类别的形式陈述其信心——被广泛用于从黑箱模型中提取不确定性估计。然而,LLMs 如何在内部生成这些分数仍然未知。我们解决了两个问题:首先,信心何时被计算——是在请求时即时计算,还是在答案生成过程中自动计算并缓存以供后续检索;其次,语言信心代表什么——是标记的对数概率,还是对答案质量的更丰富评估?我们聚焦于 Gemma 3 27B 和 Qwen 2.5 7B,提供了缓存检索的趋同证据。激活引导、补丁、噪声和交换实验揭示,信心表示在答案相邻位置出现,然后才出现在语言化位置。注意力阻断明确了信息流:信心从答案标记中收集,在第一个答案后位置缓存,然后用于输出。关键的是,线性探测和方差分解揭示,这些缓存表示在语言信心中解释了超出标记对数概率的显著方差,暗示了一种更丰富的答案质量评估,而不仅仅是简单的流畅性输出。这些发现表明,语言信心反映了自动的、复杂的自我评估——而非事后重建——对理解 LLMs 中的元认知和改善校准具有重要意义。
cs.CL / 43 / 2603.17872
Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval
通过领域基础的分层检索减轻大型语言模型的幻觉现象
Abstract
Large Language Models (LLMs) have achieved unprecedented fluency but remain susceptible to "hallucinations" - the generation of factually incorrect or ungrounded content. This limitation is particularly critical in high-stakes domains where reliability is paramount. We propose a domain-grounded tiered retrieval and verification architecture designed to systematically intercept factual inaccuracies by shifting LLMs from stochastic pattern-matchers to verified truth-seekers. The proposed framework utilizes a four-phase, self-regulating pipeline implemented via LangGraph: (I) Intrinsic Verification with Early-Exit logic to optimize compute, (II) Adaptive Search Routing utilizing a Domain Detector to target subject-specific archives, (III) Corrective Document Grading (CRAG) to filter irrelevant context, and (IV) Extrinsic Regeneration followed by atomic claim-level verification. The system was evaluated across 650 queries from five diverse benchmarks: TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, and TruthfulQA. Empirical results demonstrate that the pipeline consistently outperforms zero-shot baselines across all environments. Win rates peaked at 83.7% in TimeQA v2 and 78.0% in MMLU Global Facts, confirming high efficacy in domains requiring granular temporal and numerical precision. Groundedness scores remained robustly stable between 78.8% and 86.4% across factual-answer rows. While the architecture provides a robust fail-safe for misinformation, a persistent failure mode of "False-Premise Overclaiming" was identified. These findings provide a detailed empirical characterization of multi-stage RAG behavior and suggest that future work should prioritize pre-retrieval "answerability" nodes to further bridge the reliability gap in conversational AI.
Chinese Translation
大型语言模型(LLMs)已实现前所未有的流畅性,但仍然容易出现“幻觉”——生成事实不准确或缺乏依据的内容。这一局限性在高风险领域尤为关键,因为可靠性至关重要。我们提出了一种领域基础的分层检索和验证架构,旨在通过将 LLM 从随机模式匹配器转变为经过验证的真相追求者,系统性地拦截事实不准确性。所提出的框架利用一个四阶段的自我调节管道,通过 LangGraph 实现:(I) 采用早期退出逻辑的内在验证以优化计算,(II) 利用领域检测器的自适应搜索路由以针对特定主题的档案,(III) 纠正文档评分(CRAG)以过滤无关上下文,以及 (IV) 外部再生,随后进行原子级声明验证。该系统在来自五个不同基准的 650 个查询中进行了评估:TimeQA v2、FreshQA v2、HaluEval General、MMLU Global Facts 和 TruthfulQA。实证结果表明,该管道在所有环境中始终优于零样本基线。在 TimeQA v2 中的胜率达到 83.7%,在 MMLU Global Facts 中达到 78.0%,确认了在需要细粒度时间和数值精度的领域中的高效性。基于事实的答案行的基础性得分在 78.8% 到 86.4% 之间保持稳定。虽然该架构为错误信息提供了强有力的安全保障,但识别出一种持续的失败模式“虚假前提过度声称”。这些发现为多阶段 RAG 行为提供了详细的实证特征,并建议未来的工作应优先考虑检索前的“可回答性”节点,以进一步缩小对话 AI 中的可靠性差距。
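The four-phase pipeline can be sketched as a plain function chain (the paper implements it as a LangGraph graph; every callable and the confidence threshold below are caller-supplied stand-ins for illustration).

```python
def tiered_rag(query, llm_answer, verify, detect_domain, retrieve, grade,
               regenerate, confidence_threshold=0.8):
    """Schematic four-phase pipeline with early exit (illustrative skeleton)."""
    draft = llm_answer(query)
    if verify(query, draft) >= confidence_threshold:   # Phase I: early exit
        return draft
    domain = detect_domain(query)                      # Phase II: adaptive routing
    docs = retrieve(query, domain)
    docs = [d for d in docs if grade(query, d)]        # Phase III: CRAG-style grading
    return regenerate(query, docs)                     # Phase IV: regeneration

grounded = tiered_rag(
    "when was X founded?",
    llm_answer=lambda q: "draft answer",
    verify=lambda q, a: 0.3,                  # low intrinsic confidence
    detect_domain=lambda q: "history",
    retrieve=lambda q, d: ["good doc", "off-topic doc"],
    grade=lambda q, doc: "off-topic" not in doc,
    regenerate=lambda q, docs: "grounded: " + ", ".join(docs),
)
early = tiered_rag(
    "2 + 2?",
    llm_answer=lambda q: "4",
    verify=lambda q, a: 0.95,                 # confident -> skip retrieval entirely
    detect_domain=None, retrieve=None, grade=None, regenerate=None,
)
```

The early-exit branch is where the compute optimization lives: retrieval, grading, and regeneration run only when intrinsic verification falls below the threshold.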
cs.CL / 44 / 2603.17884
DebugLM: Learning Traceable Training Data Provenance for LLMs
DebugLM:为大型语言模型学习可追溯的训练数据来源
Abstract
Large language models (LLMs) are trained through multi-stage pipelines over heterogeneous data sources, yet developers lack a principled way to pinpoint the specific data responsible for an observed behavior. This lack of observability reduces debugging to reactive patching and makes failures prone to recur under distribution shift or subsequent model updates. To address this limitation, we propose DebugLM, a framework that equips LLMs with built-in data provenance, enabling them to explicitly trace the origins of their behaviors to specific training data sources. Specifically, the model learns to associate its responses with unique provenance tags that indicate the responsible dataset, empowering developers to precisely identify where undesirable behaviors are learned. Building on this capability, DebugLM further supports targeted test-time remediation, enabling developers to selectively trigger targeted refusal for specified data sources without retraining or modifying model parameters. Experiments demonstrate that DebugLM provides accurate behavior tracing in multi-stage training pipelines and effective test-time remediation while preserving the general utility of the model.
Chinese Translation
大型语言模型(LLMs)通过多阶段管道在异构数据源上进行训练,但开发者缺乏一种原则性的方法来准确定位导致观察到的行为的具体数据。这种缺乏可观察性使得调试变成了被动修补,并使得在分布变化或后续模型更新下故障容易重现。为了解决这一局限性,我们提出了DebugLM,一个为LLMs提供内置数据来源的框架,使其能够明确追踪其行为的来源到特定的训练数据源。具体而言,该模型学习将其响应与指示负责数据集的唯一来源标签关联,从而使开发者能够精确识别不良行为的学习来源。在此能力的基础上,DebugLM进一步支持针对性测试时修复,使开发者能够在不重新训练或修改模型参数的情况下,有选择性地对指定数据源触发拒绝。实验表明,DebugLM在多阶段训练管道中提供了准确的行为追踪和有效的测试时修复,同时保持了模型的整体实用性。
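A sketch of provenance-tagged tracing and test-time remediation as described above; the `<prov:…>` tag format and the source names are hypothetical, invented here for illustration, not DebugLM's actual scheme.

```python
# Hypothetical provenance-tag vocabulary mapping tags to training data sources.
PROVENANCE = {"<prov:med-qa-v2>": "med-qa-v2",
              "<prov:webcrawl-2023>": "webcrawl-2023"}

def trace_and_remediate(response, blocked_sources=frozenset()):
    """Read the provenance tag emitted alongside the answer, map it to a
    training data source, and refuse at test time if that source is blocked,
    without retraining or touching model parameters."""
    for tag, source in PROVENANCE.items():
        if tag in response:
            if source in blocked_sources:
                return source, "[refused: traced to blocked source]"
            return source, response.replace(tag, "").strip()
    return None, response  # untagged response passes through unchanged

src, out = trace_and_remediate("Paris is in France. <prov:webcrawl-2023>")
src2, out2 = trace_and_remediate("Paris is in France. <prov:webcrawl-2023>",
                                 blocked_sources={"webcrawl-2023"})
```

The trained model learns to emit the tag; the surrounding system only needs this lookup to attribute behavior and to trigger targeted refusal.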
cs.CL / 45 / 2603.17912
Pretrained Multilingual Transformers Reveal Quantitative Distance Between Human Languages
预训练的多语言变换器揭示人类语言之间的定量距离
Abstract
Understanding the distance between human languages is central to linguistics, anthropology, and tracing human evolutionary history. Yet, while linguistics has long provided rich qualitative accounts of cross-linguistic variation, a unified and scalable quantitative approach to measuring language distance remains lacking. In this paper, we introduce a method that leverages pretrained multilingual language models as systematic instruments for linguistic measurement. Specifically, we show that the spontaneously emerged attention mechanisms of these models provide a robust, tokenization-agnostic measure of cross-linguistic distance, termed Attention Transport Distance (ATD). By treating attention matrices as probability distributions and measuring their geometric divergence via optimal transport, we quantify the representational distance between languages during translation. Applying ATD to a large and diverse set of languages, we demonstrate that the resulting distances recover established linguistic groupings with high fidelity and reveal patterns aligned with geographic and contact-induced relationships. Furthermore, incorporating ATD as a regularizer improves transfer performance in low-resource machine translation. Our results establish a principled foundation for testing linguistic hypotheses using artificial neural networks. This framework transforms multilingual models into powerful tools for quantitative linguistic discovery, facilitating more equitable multilingual AI.
Chinese Translation
理解人类语言之间的距离是语言学、人类学以及追溯人类进化历史的核心。然而,尽管语言学长期以来提供了丰富的跨语言变异的定性描述,但缺乏一种统一且可扩展的定量方法来测量语言距离。在本文中,我们介绍了一种利用预训练的多语言语言模型作为语言测量系统工具的方法。具体而言,我们展示了这些模型自发产生的注意力机制提供了一种稳健的、与分词无关的跨语言距离测量,称为注意力传输距离(Attention Transport Distance,ATD)。通过将注意力矩阵视为概率分布,并通过最优传输测量其几何发散,我们量化了翻译过程中语言之间的表征距离。将ATD应用于一个大规模且多样化的语言集合,我们证明了所得到的距离能够高保真地恢复已建立的语言分组,并揭示与地理和接触引起的关系相一致的模式。此外,将ATD作为正则化项纳入可以改善低资源机器翻译的迁移性能。我们的结果为使用人工神经网络测试语言学假设奠定了原则基础。该框架将多语言模型转变为强大的定量语言发现工具,促进更公平的多语言人工智能。
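A simplified ATD-style score (our simplification for illustration: per-row 1-D Wasserstein over token positions, rather than the paper's full optimal-transport formulation): attention rows are normalised to probability distributions and compared via the closed-form CDF distance.

```python
import numpy as np

def w1_discrete(p, q):
    """Wasserstein-1 between two distributions over positions 0..n-1
    (closed form on a line: L1 distance between the CDFs)."""
    return float(np.abs(np.cumsum(p - q)).sum())

def attention_transport_distance(A_src, A_tgt):
    """Average transport cost between corresponding attention rows, each
    normalised to a probability distribution over token positions."""
    A_src = A_src / A_src.sum(axis=1, keepdims=True)
    A_tgt = A_tgt / A_tgt.sum(axis=1, keepdims=True)
    return float(np.mean([w1_discrete(p, q) for p, q in zip(A_src, A_tgt)]))

I = np.eye(4)                     # each token attends to itself
shifted = np.zeros((4, 4))        # attention shifted one position right
shifted[0, 1] = shifted[1, 2] = shifted[2, 3] = shifted[3, 3] = 1.0
```

Identical attention patterns score zero, and the score grows with how far probability mass must be transported, which is the geometric notion of divergence the abstract describes.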
cs.CL / 46 / 2603.17915
IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia
IndicSafe:评估南亚多语言大型语言模型安全性的基准
Abstract
As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of the prompt. Our analysis reveals significant safety drift: cross-language agreement is just 12.8\%, and \texttt{SAFE} rate variance exceeds 17\% across languages. Some models over-refuse benign prompts in low-resource scripts, overflag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release \textsc{IndicSafe}, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.
Chinese Translation
随着大型语言模型(LLMs)在多语言环境中的应用,它们在文化多样性和资源匮乏语言中的安全行为仍然缺乏深入理解。我们首次系统性地评估了12种印度语言的LLM安全性,这些语言由超过12亿人使用,但在LLM训练数据中代表性不足。我们使用一个包含6000个文化背景提示的数据集,涵盖了种姓、宗教、性别、健康和政治等主题,评估了10个领先的LLM在提示翻译变体上的表现。我们的分析揭示了显著的安全漂移:跨语言一致性仅为12.8%,而\texttt{SAFE}率的方差在各语言之间超过17%。一些模型对低资源文字中的良性提示过度拒绝,过度标记政治敏感话题,而其他模型则未能标记不安全的生成内容。我们通过提示级熵、类别偏差分数和多语言一致性指数量化了这些失败。我们的研究结果突显了多语言LLM中关键的安全泛化差距,并表明安全对齐在不同语言之间并不均匀转移。我们发布了\textsc{IndicSafe},这是第一个基准,旨在支持针对印度语言部署的文化知情安全评估,并倡导基于区域危害的语言意识对齐策略。
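Two of the headline statistics, cross-language agreement and SAFE-rate variance, can be computed as follows; the verdict data below is a toy illustration, not the benchmark's.

```python
import statistics

def cross_language_agreement(verdicts):
    """Fraction of prompts on which every language variant got the same verdict.
    verdicts: {language: [verdict per prompt]} with aligned prompt order."""
    rows = list(zip(*verdicts.values()))
    return sum(len(set(r)) == 1 for r in rows) / len(rows)

def safe_rate_variance(verdicts):
    """Population variance of per-language SAFE rates (safety drift signal)."""
    return statistics.pvariance(v.count("SAFE") / len(v)
                                for v in verdicts.values())

# Toy verdicts for one model on 4 prompt variants in 3 languages (illustrative).
verdicts = {"hi": ["SAFE", "SAFE", "UNSAFE", "SAFE"],
            "ta": ["SAFE", "UNSAFE", "UNSAFE", "SAFE"],
            "bn": ["SAFE", "SAFE", "UNSAFE", "UNSAFE"]}
agreement = cross_language_agreement(verdicts)
variance = safe_rate_variance(verdicts)
```

Low agreement with high SAFE-rate variance is exactly the drift pattern the abstract reports: the same prompt, translated, draws different safety behavior.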
cs.CL / 47 / 2603.17942
Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing
高效的无训练多标记预测通过嵌入空间探测
Abstract
Large language models (LLMs) exhibit latent multi-token prediction (MTP) capabilities despite being trained solely for next-token generation. We propose a simple, training-free MTP approach that probes an LLM using on-the-fly mask tokens drawn from its embedding space, enabling parallel prediction of future tokens without modifying model weights or relying on auxiliary draft models. Our method constructs a speculative token tree by sampling top-K candidates from mask-token logits and applies a lightweight pruning strategy to retain high-probability continuations. During decoding, candidate predictions are verified in parallel, resulting in lossless generation while substantially reducing the number of model calls and improving token throughput. Across benchmarks, our probing-based MTP consistently outperforms existing training-free baselines, increasing acceptance length by approximately 12\% on LLaMA3 and 8--12\% on Qwen3, and achieving throughput gains of up to 15--19\%. Finally, we provide theoretical insights and empirical evidence showing that decoder layers naturally align mask-token representations with next-token states, enabling accurate multi-step prediction without retraining or auxiliary models.
Chinese Translation
大型语言模型(LLMs)尽管仅被训练用于下一个标记生成,但仍展现出潜在的多标记预测(MTP)能力。我们提出了一种简单的无训练MTP方法,该方法通过从嵌入空间中提取的即时掩码标记对LLM进行探测,使得在不修改模型权重或依赖辅助草稿模型的情况下实现未来标记的并行预测。我们的方法通过从掩码标记的logits中采样前K个候选项构建一个推测性标记树,并应用轻量级的剪枝策略以保留高概率的延续。在解码过程中,候选预测被并行验证,从而实现无损生成,同时显著减少模型调用次数并提高标记吞吐量。在各项基准测试中,我们的基于探测的MTP始终优于现有的无训练基线,在LLaMA3上接受长度提高约12%,在Qwen3上提高8%至12%,并实现高达15%至19%的吞吐量提升。最后,我们提供理论见解和实证证据,表明解码器层自然将掩码标记表示与下一个标记状态对齐,从而在不重新训练或使用辅助模型的情况下实现准确的多步预测。
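The speculative token tree with lightweight pruning can be sketched as a joint-log-probability beam over per-position candidates; `step_topk` here stands in for top-K candidates read off hypothetical mask-token logits, and the beam width is an assumed parameter.

```python
import heapq

def build_token_tree(step_topk, beam=4):
    """step_topk[t]: list of (token, logprob) candidates for future position t.
    Keep only the `beam` highest joint-log-probability paths (pruning)."""
    paths = [((), 0.0)]
    for cands in step_topk:
        nxt = [(path + (tok,), lp + tlp)
               for path, lp in paths for tok, tlp in cands]
        paths = heapq.nlargest(beam, nxt, key=lambda x: x[1])
    return paths

# Two future positions, two candidates each; beam=2 keeps the strongest paths.
paths = build_token_tree([[("a", -0.1), ("b", -2.0)],
                          [("c", -0.3), ("d", -0.4)]], beam=2)
```

The retained paths are then verified in parallel against the base model's own next-token choices, so generation stays lossless while multiple tokens are accepted per model call.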
cs.CL / 48 / 2603.17945
ShapleyLaw: A Game-Theoretic Approach to Multilingual Scaling Laws
ShapleyLaw:一种博弈论方法的多语言缩放法则
Abstract
In multilingual pretraining, the test loss of a pretrained model is heavily influenced by the proportion of each language in the pretraining data, namely the \textit{language mixture ratios}. Multilingual scaling laws can predict the test loss under different language mixture ratios and can therefore be used to estimate the optimal ratios. However, the current approaches to multilingual scaling laws do not measure the \textit{cross-lingual transfer} effect, resulting in suboptimal mixture ratios. In this paper, we consider multilingual pretraining as a cooperative game in which each language acts as a player that jointly contributes to pretraining, gaining the resulting reduction in test loss as the payoff. Consequently, from the perspective of cooperative game theory, we quantify the cross-lingual transfer from each language by its contribution in the game, and propose a game-theoretic multilingual scaling law called \textit{ShapleyLaw}. Our experiments show that ShapleyLaw outperforms baseline methods in model performance prediction and language mixture optimization.
Chinese Translation
在多语言预训练中,预训练模型的测试损失受到预训练数据中各语言比例的显著影响,即\textit{语言混合比例}。多语言缩放法则可以预测在不同语言混合比例下的测试损失,因此可以用来估计最佳比例。然而,目前的多语言缩放法则方法并未衡量\textit{跨语言迁移}效应,导致得到的混合比例次优。在本文中,我们将多语言预训练视为一种合作博弈,其中每种语言作为一个参与者共同为预训练做出贡献,并获得由此带来的测试损失降低作为回报。因此,从合作博弈论的角度出发,我们通过每种语言在博弈中的贡献量化其跨语言迁移,并提出了一种称为\textit{ShapleyLaw}的博弈论多语言缩放法则。我们的实验表明,ShapleyLaw在模型性能预测和语言混合优化方面优于基线方法。
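The game-theoretic contribution can be computed exactly for a small language set by averaging each language's marginal payoff over all join orders; the toy payoff function below (test-loss reduction per coalition, including a transfer synergy) is invented for illustration.

```python
from itertools import permutations

def shapley_values(players, payoff):
    """Exact Shapley values: average each player's marginal contribution to the
    coalition payoff over all orderings (feasible for a handful of languages)."""
    phi = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coalition = set()
        for p in order:
            before = payoff(frozenset(coalition))
            coalition.add(p)
            phi[p] += payoff(frozenset(coalition)) - before
    return {p: total / len(perms) for p, total in phi.items()}

# Invented payoff: loss reduction per language coalition; the grand coalition
# exceeds the sum of singletons, i.e. positive cross-lingual transfer.
v = {frozenset(): 0.0, frozenset({"en"}): 3.0,
     frozenset({"de"}): 1.0, frozenset({"en", "de"}): 6.0}
phi = shapley_values(["en", "de"], lambda s: v[s])
```

By efficiency, the values sum to the grand coalition's payoff, so each language's share includes its part of the transfer synergy, which is what a transfer-blind scaling law misses.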
cs.CL / 49 / 2603.17952
Gender Disambiguation in Machine Translation: Diagnostic Evaluation in Decoder-Only Architectures
机器翻译中的性别消歧义:仅解码器架构的诊断评估
Abstract
While Large Language Models achieve state-of-the-art results across a wide range of NLP tasks, they remain prone to systematic biases. Among these, gender bias is particularly salient in MT, due to systematic differences across languages in whether and how gender is marked. As a result, translation often requires disambiguating implicit source signals into explicit gender-marked forms. In this context, standard benchmarks may capture broad disparities but fail to reflect the full complexity of gender bias in modern MT. In this paper, we extend recent frameworks on bias evaluation by: (i) introducing a novel measure coined "Prior Bias", capturing a model's default gender assumptions, and (ii) applying the framework to decoder-only MT models. Our results show that, despite their scale and state-of-the-art status, decoder-only models do not generally outperform encoder-decoder architectures on gender-specific metrics; however, post-training (e.g., instruction tuning) not only improves contextual awareness but also reduces the masculine Prior Bias.
Chinese Translation
尽管大型语言模型在广泛的自然语言处理任务中取得了最先进的成果,但它们仍然容易受到系统性偏见的影响。其中,性别偏见在机器翻译中尤为显著,因为不同语言在性别标记的存在与方式上存在系统性差异。因此,翻译通常需要将隐含的源信号消歧为明确的性别标记形式。在这种背景下,标准基准可能捕捉到广泛的差异,但未能反映现代机器翻译中性别偏见的全部复杂性。本文通过以下方式扩展了最近的偏见评估框架:(i)引入了一种新颖的度量,称为“先验偏见”(Prior Bias),用于捕捉模型的默认性别假设;(ii)将该框架应用于仅解码器的机器翻译模型。我们的结果表明,尽管仅解码器模型规模庞大且处于最先进水平,但在性别特定指标上通常并不优于编码器-解码器架构;然而,后训练(例如,指令调优)不仅提高了上下文意识,还减少了男性化的先验偏见。
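One simple way to operationalise a "Prior Bias" score (our reading of the idea for illustration, not necessarily the paper's exact definition): on source sentences with no gender cues, measure how far the model's default gender realisations deviate from parity.

```python
def prior_bias(realisations):
    """On gender-ambiguous source sentences, score the model's default gender
    choices: 0 = balanced, +1 = always masculine, -1 = always feminine.
    realisations: 'M'/'F' label per translated ambiguous sentence."""
    m = sum(1 for r in realisations if r == "M")
    return 2 * m / len(realisations) - 1

balanced = prior_bias(["M", "F", "M", "F"])
skewed = prior_bias(["M", "M", "M", "F"])
```

Because the inputs carry no disambiguating context, any deviation from zero isolates the model's prior assumption rather than its contextual gender resolution.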
cs.CL / 50 / 2603.17962
ConGA: Guidelines for Contextual Gender Annotation. A Framework for Annotating Gender in Machine Translation
ConGA:上下文性别标注指南。机器翻译中性别标注的框架
Abstract
Handling gender across languages remains a persistent challenge for Machine Translation (MT) and Large Language Models (LLMs), especially when translating from gender-neutral languages into morphologically gendered ones, such as English to Italian. English largely omits grammatical gender, while Italian requires explicit agreement across multiple grammatical categories. This asymmetry often leads MT systems to default to masculine forms, reinforcing bias and reducing translation accuracy. To address this issue, we present the Contextual Gender Annotation (ConGA) framework, a linguistically grounded set of guidelines for word-level gender annotation. The scheme distinguishes between semantic gender in English through three tags, Masculine (M), Feminine (F), and Ambiguous (A), and grammatical gender realisation in Italian (Masculine (M), Feminine (F)), combined with entity-level identifiers for cross-sentence tracking. We apply ConGA to the gENder-IT dataset, creating a gold-standard resource for evaluating gender bias in translation. Our results reveal systematic masculine overuse and inconsistent feminine realisation, highlighting persistent limitations of current MT systems. By combining fine-grained linguistic annotation with quantitative evaluation, this work offers both a methodology and a benchmark for building more gender-aware and multilingual NLP systems.
Chinese Translation
在语言之间处理性别仍然是机器翻译(MT)和大型语言模型(LLMs)面临的持续挑战,尤其是在从性别中立语言翻译到形态性别语言时,例如从英语翻译到意大利语。英语在很大程度上省略了语法性别,而意大利语则要求在多个语法类别中进行明确的一致性。这种不对称性常常导致机器翻译系统默认使用男性形式,从而加剧偏见并降低翻译准确性。为了解决这一问题,我们提出了上下文性别标注(ConGA)框架,这是一个基于语言学的词汇级性别标注指南。该方案通过三个标签区分英语中的语义性别:男性(Masculine, M)、女性(Feminine, F)和模糊(Ambiguous, A),并结合意大利语中的语法性别实现(男性(Masculine, M)、女性(Feminine, F)),以及用于跨句追踪的实体级标识符。我们将ConGA应用于gENder-IT数据集,创建了一个评估翻译中性别偏见的黄金标准资源。我们的结果揭示了系统性的男性过度使用和不一致的女性实现,突显了当前机器翻译系统的持续局限性。通过将细粒度的语言标注与定量评估相结合,本研究提供了一种方法论和基准,用于构建更具性别意识和多语言的自然语言处理系统。