Daily Research Digest

arXiv Papers

2026-03-16
212 papers · 4 categories
Robotics (41 papers)
cs.RO / 1 / 2603.12347

A Learning-Based Approach for Contact Detection, Localization, and Force Estimation of Continuum Manipulators With Integrated OFDR Optical Fiber

Tavangarifard, Mobina, Kacines, Jonathan S., Li, Qiyu, Alambeigi, Farshid
Abstract
Continuum manipulators (CMs) are widely used in minimally invasive procedures due to their compliant structure and ability to navigate deep and confined anatomical environments. However, their distributed deformation makes force sensing, contact detection, localization, and force estimation challenging, particularly when interactions occur at unknown arc-length locations along the robot. To address this problem, we propose a cascade learning-based framework (CLF) for CMs instrumented with a single distributed Optical Frequency Domain Reflectometry (OFDR) fiber embedded along one side of the robot. The OFDR sensor provides dense strain measurements along the manipulator backbone, capturing strain perturbations caused by external interactions. The proposed CLF first detects contact using a Gradient Boosting classifier and then estimates contact location and interaction force magnitude using a CNN-FiLM model that predicts a spatial force distribution along the manipulator. Experimental validation on a sensorized tendon-driven CM in an obstructed environment demonstrates that a single distributed OFDR fiber provides sufficient information to jointly infer contact occurrence, location, and force in continuum manipulators.
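The cascade structure — detect contact first, then localize and estimate force — can be sketched as follows. This is a minimal stand-in, not the paper's models: a strain-energy threshold replaces the Gradient Boosting classifier, peak-picking on the strain profile replaces the CNN-FiLM force-distribution network, and every threshold and gain is illustrative.

```python
import numpy as np

def detect_contact(strain, energy_threshold=0.5):
    """Stage 1: binary contact detection from the distributed strain profile.
    (A simple strain-energy threshold stands in for the paper's Gradient
    Boosting classifier.)"""
    return float(np.sum(strain ** 2)) > energy_threshold

def localize_and_estimate(strain, arc_length, gain=10.0):
    """Stage 2: contact arc-length location and force magnitude.
    (Peak-picking stands in for the paper's CNN-FiLM spatial force
    distribution; `gain` is an invented strain-to-force scale.)"""
    i = int(np.argmax(np.abs(strain)))
    return arc_length[i], gain * abs(strain[i])

def cascade(strain, arc_length):
    """Run the cascade: only localize/estimate once contact is detected."""
    if not detect_contact(strain):
        return None
    return localize_and_estimate(strain, arc_length)

# Dense strain samples along a 100 mm backbone, with a bump near 60 mm.
s = np.linspace(0.0, 100.0, 200)
strain = 0.3 * np.exp(-((s - 60.0) ** 2) / 20.0)
result = cascade(strain, s)  # (location ~60 mm, force estimate)
```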
cs.RO / 2 / 2603.12361

GNN-DIP: Neural Corridor Selection for Decomposition-Based Motion Planning

Xie, Peng, Huang, Yanlinag, Wu, Wenyuan, Alanwar, Amr
Abstract
Motion planning through narrow passages remains a core challenge: sampling-based planners rarely place samples inside these narrow but critical regions, and even when samples land inside a passage, the straight-line connections between them run close to obstacle boundaries and are frequently rejected by collision checking. Decomposition-based planners resolve both issues by partitioning free space into convex cells -- every passage is captured exactly as a cell boundary, and any path within a cell is collision-free by construction. However, the number of candidate corridors through the cell graph grows combinatorially with environment complexity, creating a bottleneck in corridor selection. We present GNN-DIP, a framework that addresses this by integrating a Graph Neural Network (GNN) with a two-phase Decomposition-Informed Planner (DIP). The GNN predicts portal scores on the cell adjacency graph to bias corridor search toward near-optimal regions while preserving completeness. In 2D, Constrained Delaunay Triangulation (CDT) with the Funnel algorithm yields exact shortest paths within corridors; in 3D, Slab convex decomposition with portal-face sampling provides near-optimal path evaluation. Benchmarks on 2D narrow-passage scenarios, 3D bottleneck environments with up to 246 obstacles, and dynamic 2D settings show that GNN-DIP achieves 99--100% success rates with 2--280 times speedup over sampling-based baselines.
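The role of the portal scores can be illustrated with a tiny score-biased Dijkstra over a cell-adjacency graph. This is a sketch of the general idea only — the score-to-cost mapping and the `bias` weight are assumptions, not the paper's formulation — but it shows why completeness is preserved: scores only reweight portal edges, they never prune them.

```python
import heapq

def biased_corridor_search(adj, scores, start, goal, bias=2.0):
    """Dijkstra over the cell-adjacency graph, with each portal's base
    cost scaled down by its (hypothetical) GNN score in [0, 1].
    adj: {cell: [(neighbor, base_cost), ...]}; scores: {(u, v): score}."""
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, base in adj[u]:
            s = scores.get((u, v), scores.get((v, u), 0.0))
            cost = base / (1.0 + bias * s)  # high score -> cheaper portal
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

# Two corridors of equal geometric length; the scores favour A-C-D.
adj = {"A": [("B", 1.0), ("C", 1.0)], "B": [("D", 1.0)],
       "C": [("D", 1.0)], "D": []}
scores = {("A", "C"): 0.9, ("C", "D"): 0.9}
corridor = biased_corridor_search(adj, scores, "A", "D")
```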
cs.RO / 3 / 2603.12399

Push, Press, Slide: Mode-Aware Planar Contact Manipulation via Reduced-Order Models

Özcan, Melih, Oguz, Ozgur S., Orguner, Umut
Abstract
Non-prehensile planar manipulation, including pushing and press-and-slide, is critical for diverse robotic tasks, but notoriously challenging due to hybrid contact mechanics, under-actuation, and asymmetric friction limits that traditionally necessitate computationally expensive iterative control. In this paper, we propose a mode-aware framework for planar manipulation with one or two robotic arms based on contact topology selection and reduced-order kinematic modeling. Our core insight is that complex wrench-twist limit surface mechanics can be abstracted into a discrete library of physically intuitive models. We systematically map various single-arm and bimanual contact topologies to simple non-holonomic formulations, e.g. unicycle for simplified press-and-slide motion. By anchoring trajectory generation to these reduced-order models, our framework computes the required object wrench and distributes feasible, friction-bounded contact forces via a direct algebraic allocator. We incorporate manipulator kinematics to ensure long-horizon feasibility and demonstrate our fast, optimization-free approach in simulation across diverse single-arm and bimanual manipulation tasks.
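The unicycle abstraction the abstract cites for press-and-slide is the standard non-holonomic kinematic model; a minimal Euler-integrated rollout (an illustration of that library entry, not the paper's implementation) looks like:

```python
import math

def unicycle_step(x, y, theta, v, omega, dt):
    """One Euler step of the unicycle model: the object moves along its
    heading at speed v and turns at rate omega (non-holonomic: no
    sideways slip)."""
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + omega * dt)

def rollout(pose, controls, dt=0.01):
    """Integrate a control sequence [(v, omega), ...] from an initial pose."""
    for v, omega in controls:
        pose = unicycle_step(*pose, v, omega, dt)
    return pose

# A straight press-and-slide: push at 1 m/s for 1 s with no turning.
pose = rollout((0.0, 0.0, 0.0), [(1.0, 0.0)] * 100)
```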
cs.RO / 4 / 2603.12408

Beyond Motion Imitation: Is Human Motion Data Alone Sufficient to Explain Gait Control and Biomechanics?

Liu, Xinyi, Ahn, Jangwhan, Lobaton, Edgar, Si, Jennie, Huang, He
Abstract
With the growing interest in motion imitation learning (IL) for human biomechanics and wearable robotics, this study investigates how additional foot-ground interaction measures, used as reward terms, affect human gait kinematics and kinetics estimation within a reinforcement learning-based IL framework. Results indicate that accurate reproduction of forward kinematics alone does not ensure biomechanically plausible joint kinetics. Adding foot-ground contacts and contact forces to the IL reward terms enables the prediction of joint moments in forward walking simulation, which are significantly closer to those computed by inverse dynamics. This finding highlights a fundamental limitation of motion-only IL approaches, which may prioritize kinematics matching over physical consistency. Incorporating kinetic constraints, particularly ground reaction force and center of pressure information, significantly enhances the realism of internal and external kinetics. These findings suggest that, when imitation learning is applied to human-related research domains such as biomechanics and wearable robot co-design, kinetics-based reward shaping is necessary to achieve physically consistent gait representations.
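The kinetics-based reward shaping the study argues for amounts to adding ground-reaction-force and centre-of-pressure terms next to the usual kinematic tracking term. A toy version (the weights and exponential shaping are illustrative, not the paper's exact reward):

```python
import math

def gait_reward(kin_err, grf_err, cop_err, w_kin=1.0, w_grf=0.5, w_cop=0.5):
    """Kinetics-aware imitation reward: kinematic tracking plus
    ground-reaction-force (GRF) and centre-of-pressure (CoP) terms.
    Each error is mapped to (0, 1] so a perfect match scores the full
    weight; all weights here are made up for illustration."""
    return (w_kin * math.exp(-kin_err)
            + w_grf * math.exp(-grf_err)
            + w_cop * math.exp(-cop_err))

perfect = gait_reward(0.0, 0.0, 0.0)          # all terms at their maximum
kin_only = gait_reward(0.0, 1.0, 1.0)         # kinematics match, kinetics off
```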
cs.RO / 5 / 2603.12460

Predictive and adaptive maps for long-term visual navigation in changing environments

Halodova, Lucie, Dvorakova, Eliska, Majer, Filip, Vintr, Tomas, Mozos, Oscar Martinez, Dayoub, Feras, Krajnik, Tomas
Abstract
In this paper, we compare different map management techniques for long-term visual navigation in changing environments. In this scenario, the navigation system needs to continuously update and refine its feature map in order to adapt to the environment appearance change. To achieve reliable long-term navigation, the map management techniques have to (i) select features useful for the current navigation task, (ii) remove features that are obsolete, (iii) and add new features from the current camera view to the map. We propose several map management strategies and evaluate their performance with regard to the robot localisation accuracy in long-term teach-and-repeat navigation. Our experiments, performed over three months, indicate that strategies which model cyclic changes of the environment appearance and predict which features are going to be visible at a particular time and location, outperform strategies which do not explicitly model the temporal evolution of the changes.
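A predictive strategy of the kind the experiments favour can be sketched as a per-feature visibility probability with one cyclic component (e.g. a day/night cycle), used to select which map features to match at query time. This is a toy stand-in for the paper's temporal models; the feature names and parameters are invented.

```python
import math

def predicted_visibility(p0, amplitude, period_s, phase, t):
    """Probability that a feature is visible at time t: a static mean plus
    one cyclic component, clamped to [0, 1]."""
    p = p0 + amplitude * math.cos(2.0 * math.pi * t / period_s - phase)
    return min(1.0, max(0.0, p))

def select_features(features, t, top_k):
    """Keep the top_k map features most likely to be visible at time t.
    Each feature is (name, p0, amplitude, period_s, phase) -- illustrative."""
    scored = sorted(features,
                    key=lambda f: predicted_visibility(*f[1:], t),
                    reverse=True)
    return [f[0] for f in scored[:top_k]]

features = [
    ("lamp_glow",   0.5, 0.4, 86400.0, math.pi),  # mostly visible at night
    ("tree_shadow", 0.5, 0.4, 86400.0, 0.0),      # mostly visible in daylight
]
daytime_pick = select_features(features, 0.0, 1)
```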
cs.RO / 6 / 2603.12480

One-Step Flow Policy: Self-Distillation for Fast Visuomotor Policies

Li, Shaolong, Sun, Lichao, Chen, Yongchao
Abstract
Generative flow and diffusion models provide the continuous, multimodal action distributions needed for high-precision robotic policies. However, their reliance on iterative sampling introduces severe inference latency, degrading control frequency and harming performance in time-sensitive manipulation. To address this problem, we propose the One-Step Flow Policy (OFP), a from-scratch self-distillation framework for high-fidelity, single-step action generation without a pre-trained teacher. OFP unifies a self-consistency loss to enforce coherent transport across time intervals, and a self-guided regularization to sharpen predictions toward high-density expert modes. In addition, a warm-start mechanism leverages temporal action correlations to minimize the generative transport distance. Evaluations across 56 diverse simulated manipulation tasks demonstrate that a one-step OFP achieves state-of-the-art results, outperforming 100-step diffusion and flow policies while accelerating action generation by over $100\times$. We further integrate OFP into the $\pi_{0.5}$ model on RoboTwin 2.0, where one-step OFP surpasses the original 10-step policy. These results establish OFP as a practical, scalable solution for highly accurate and low-latency robot control.
cs.RO / 7 / 2603.12488

COAD: Constant-Time Planning for Continuous Goal Manipulation with Compressed Library and Online Adaptation

Shiyas, Adil, Zhong, Zhuoyun, Chamzas, Constantinos
Abstract
In many robotic manipulation tasks, the robot repeatedly solves motion-planning problems that differ mainly in the location of the goal object and its associated obstacle, while the surrounding workspace remains fixed. Prior works have shown that leveraging experience and offline computation can accelerate repeated planning queries, but they lack guarantees of covering the continuous task space and require storing large libraries of solutions. In this work, we present COAD, a framework that provides constant-time planning over a continuous goal-parameterized task space. COAD discretizes the continuous task space into finitely many Task Coverage Regions. Instead of planning and storing solutions for every region offline, it constructs a compressed library by only solving representative root problems. Other problems are handled through fast adaptation from these root solutions. At query time, the system retrieves a root motion in constant time and adapts it to the desired goal using lightweight adaptation modules such as linear interpolation, Dynamic Movement Primitives, or simple trajectory optimization. We evaluate the framework on various manipulators and environments in simulation and the real world, showing that COAD achieves substantial compression of the motion library while maintaining high success rates and sub-millisecond-level queries, outperforming baseline methods in both efficiency and path quality. The source code is available at https://github.com/elpis-lab/CoAd.
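The constant-time query plus lightweight adaptation can be illustrated on a 1-D goal space. This is a toy sketch under strong simplifications — straight-line "root" plans and linear-interpolation adaptation only; the class and function names are invented, not from the released code:

```python
import numpy as np

def plan_root(goal, n=50):
    """Stand-in 'planner': a straight-line trajectory from 0 to goal."""
    return np.linspace(0.0, goal, n)

class CompressedLibrary:
    """Toy constant-time retrieval with adaptation: the 1-D goal interval
    is split into coverage regions, each storing one root trajectory; a
    query indexes its region in O(1) and shifts the trajectory tail
    toward the exact goal."""

    def __init__(self, lo, hi, n_regions, planner):
        self.lo, self.width = lo, (hi - lo) / n_regions
        # One representative root problem is solved per region offline.
        self.roots = [planner(lo + (i + 0.5) * self.width)
                      for i in range(n_regions)]

    def query(self, goal):
        i = min(len(self.roots) - 1,
                max(0, int((goal - self.lo) / self.width)))  # O(1) lookup
        traj = self.roots[i]
        # Linear-interpolation adaptation: blend in the goal offset,
        # weighted 0 at the start of the path and 1 at the end.
        w = np.linspace(0.0, 1.0, len(traj))
        return traj + w * (goal - traj[-1])

lib = CompressedLibrary(0.0, 10.0, 5, plan_root)  # 5 coverage regions
traj = lib.query(7.3)  # retrieved root (centre 7.0), adapted to 7.3
```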
cs.RO / 8 / 2603.12505

Robots that redesign themselves through kinematic self-destruction

Yu, Chen, Kriegman, Sam
Abstract
Every robot built to date was predesigned by an external process, prior to deployment. Here we show a robot that actively participates in its own design during its lifetime. Starting from a randomly assembled body, and using only proprioceptive feedback, the robot dynamically "sculpts" itself into a new design through kinematic self-destruction: identifying redundant links within its body that inhibit its locomotion, and then thrashing those links against the surface until they break at the joint and fall off the body. It does so using a single autoregressive sequence model, a universal controller that learns in simulation when and how to simplify a robot's body through self-destruction and then adaptively controls the reduced morphology. The optimized policy successfully transfers to reality and generalizes to previously unseen kinematic trees, generating forward locomotion that is more effective than otherwise equivalent policies that randomly remove links or cannot remove any. This suggests that self-designing robots may be more successful than predesigned robots in some cases, and that kinematic self-destruction, though reductive and irreversible, could provide a general adaptive strategy for a wide range of robots.
cs.RO / 9 / 2603.12510

Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies

Srikanth, Siddharth, Liang, Freddie, Hsu, Sophie, Bhatt, Varun, Zhao, Shihan, Chen, Henry, Tjanaka, Bryon, Hwang, Minjune, Saran, Akanksha, Seita, Daniel, Tabrez, Aaquib, Nikolaidis, Stefanos
Abstract
Vision-Language-Action (VLA) models have significant potential to enable general-purpose robotic systems for a range of vision-language tasks. However, the performance of VLA-based robots is highly sensitive to the precise wording of language instructions, and it remains difficult to predict when such robots will fail. To improve the robustness of VLAs to different wordings, we present Q-DIG (Quality Diversity for Diverse Instruction Generation), which performs red-teaming by scalably identifying diverse natural language task descriptions that induce failures while remaining task-relevant. Q-DIG integrates Quality Diversity (QD) techniques with Vision-Language Models (VLMs) to generate a broad spectrum of adversarial instructions that expose meaningful vulnerabilities in VLA behavior. Our results across multiple simulation benchmarks show that Q-DIG finds more diverse and meaningful failure modes compared to baseline methods, and that fine-tuning VLAs on the generated instructions improves task success rates. Furthermore, results from a user study highlight that Q-DIG generates prompts judged to be more natural and human-like than those from baselines. Finally, real-world evaluations of Q-DIG prompts show results consistent with simulation, and fine-tuning VLAs on the generated prompts further improves success rates on unseen instructions. Together, these findings suggest that Q-DIG is a promising approach for identifying vulnerabilities and improving the robustness of VLA-based robots. Our anonymous project website is at qdigvla.github.io.
cs.RO / 10 / 2603.12553

Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation

Jin, Minghao, Liao, Mozheng, Han, Mingfei, Li, Zhihui, Chang, Xiaojun
Abstract
Recent world-model-based Vision-Language-Action (VLA) architectures have improved robotic manipulation through predictive visual foresight. However, dense future prediction introduces visual redundancy and accumulates errors, causing long-horizon plan drift. Meanwhile, recent sparse methods typically represent visual foresight using high-level semantic subtasks or implicit latent states. These representations often lack explicit kinematic grounding, weakening the alignment between planning and low-level execution. To address this, we propose StructVLA, which reformulates a generative world model into an explicit structured planner for reliable control. Instead of dense rollouts or semantic goals, StructVLA predicts sparse, physically meaningful structured frames. Derived from intrinsic kinematic cues (e.g., gripper transitions and kinematic turning points), these frames capture spatiotemporal milestones closely aligned with task progress. We implement this approach through a two-stage training paradigm with a unified discrete token vocabulary: the world model is first trained to predict structured frames and subsequently optimized to map the structured foresight into low-level actions. This approach provides clear physical guidance and bridges visual planning and motion control. In our experiments, StructVLA achieves strong average success rates of 75.0% on SimplerEnv-WidowX and 94.8% on LIBERO. Real-world deployments further demonstrate reliable task completion and robust generalization across both basic pick-and-place and complex long-horizon tasks.
cs.RO / 11 / 2603.12574

From Woofs to Words: Towards Intelligent Robotic Guide Dogs with Verbal Communication

Hayamizu, Yohei, DeFazio, David, Mehta, Hrudayangam, Altaweel, Zainab, Choe, Jacqueline, Lin, Chao, Juettner, Jake, Xiao, Furui, Blackburn, Jeremy, Zhang, Shiqi
Abstract
Assistive robotics is an important subarea of robotics that focuses on the well-being of people with disabilities. A robotic guide dog is an assistive quadruped robot that helps visually impaired people in obstacle avoidance and navigation. Enabling language capabilities for robotic guide dogs goes beyond naively adding an existing dialog system onto a mobile robot. The novel challenges include grounding language in the dynamically changing environment and improving spatial awareness for the human handler. To address those challenges, we develop a novel dialog system for robotic guide dogs that uses LLMs to verbalize both navigational plans and scenes. The goal is to enable verbal communication for collaborative decision-making within the handler-robot team. In experiments, we conducted a human study to evaluate different verbalization strategies and a simulation study to assess the efficiency and accuracy in navigation tasks.
cs.RO / 12 / 2603.12583

Skill-informed Data-driven Haptic Nudges for High-dimensional Human Motor Learning

Kamboj, Ankur, Ranganathan, Rajiv, Tan, Xiaobo, Srivastava, Vaibhav
Abstract
In this work, we propose a data-driven skill-informed framework to design optimal haptic nudge feedback for high-dimensional novel motor learning tasks. We first model the stochastic dynamics of human motor learning using an Input-Output Hidden Markov Model (IOHMM), which explicitly decouples latent skill evolution from observable kinematic emissions. Leveraging this predictive model, we formulate the haptic nudge feedback design problem as a Partially Observable Markov Decision Process (POMDP). This allows us to derive an optimal nudging policy that minimizes long-term performance cost, implicitly guiding the learner toward robust regions of the skill space. We validated our approach through a human-subject study ($N=30$) using a high-dimensional hand-exoskeleton task. Results demonstrate that participants trained with the POMDP-derived policy exhibited significantly accelerated task performance compared to groups receiving heuristic-based feedback or no feedback. Furthermore, synergy analysis revealed that the POMDP group discovered efficient low-dimensional motor representations more rapidly.
cs.RO / 13 / 2603.12607

CarPLAN: Context-Adaptive and Robust Planning with Dynamic Scene Awareness for Autonomous Driving

Yun, Junyong, Kim, Jungho, Lee, ByungHyun, Lee, Dongyoung, Choi, Sehwan, Nam, Seunghyeop, Jo, Kichun, Choi, Jun Won
Abstract
Imitation learning (IL) is widely used for motion planning in autonomous driving due to its data efficiency and access to real-world driving data. For safe and robust real-world driving, IL-based planning requires capturing the complex driving contexts inherent in real-world data and enabling context-adaptive decision-making, rather than relying solely on expert trajectory imitation. In this paper, we propose CarPLAN, a novel IL-based motion planning framework that explicitly enhances driving context understanding and enables adaptive planning across diverse traffic scenarios. Our contributions are twofold: We introduce Displacement-Aware Predictive Encoding (DPE) to improve the model's spatial awareness by predicting future displacement vectors between the Autonomous Vehicle (AV) and surrounding scene elements. This allows the planner to account for relational spacing when generating trajectories. In addition to the standard imitation loss, we incorporate an augmented loss term that captures displacement prediction errors, ensuring planning decisions consider relative distances from other agents. To improve the model's ability to handle diverse driving contexts, we propose Context-Adaptive Multi-Expert Decoder (CMD), which leverages the Mixture of Experts (MoE) framework. CMD dynamically selects the most suitable expert decoders based on scene structure at each Transformer layer, enabling adaptive and context-aware planning in dynamic environments. We evaluate CarPLAN on the nuPlan benchmark and demonstrate state-of-the-art performance across all closed-loop simulation metrics. In particular, CarPLAN exhibits robust performance on challenging scenarios such as Test14-Hard, validating its effectiveness in complex driving conditions. Additional experiments on the Waymax benchmark further demonstrate its generalization capability across different benchmark settings.
cs.RO / 14 / 2603.12649

Autonomous Integration and Improvement of Robotic Assembly using Skill Graph Representations

Yu, Peiqi, Huang, Philip, Chawla, Chaitanya, Shi, Guanya, Li, Jiaoyang, Liu, Changliu
Abstract
Robotic assembly systems traditionally require substantial manual engineering effort to integrate new tasks, adapt to new environments, and improve performance over time. This paper presents a framework for autonomous integration and continuous improvement of robotic assembly systems based on Skill Graph representations. A Skill Graph organizes robot capabilities as verb-based skills, explicitly linking semantic descriptions (verbs and nouns) with executable policies, pre-conditions, post-conditions, and evaluators. We show how Skill Graphs enable rapid system integration by supporting semantic-level planning over skills, while simultaneously grounding execution through well-defined interfaces to robot controllers and perception modules. After initial deployment, the same Skill Graph structure supports systematic data collection and closed-loop performance improvement, enabling iterative refinement of skills and their composition. We demonstrate how this approach unifies system configuration, execution, evaluation, and learning within a single representation, providing a scalable pathway toward adaptive and reusable robotic assembly systems. The code is at https://github.com/intelligent-control-lab/AIDF.
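The verb-based skill representation with pre-/post-conditions lends itself to classical symbolic chaining. A minimal sketch — the field names, facts, and greedy planner are illustrative, not the AIDF implementation:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """A verb-based skill node: semantic description (verb, noun) plus the
    symbolic pre-/post-conditions that ground planning in execution."""
    verb: str
    noun: str
    pre: frozenset
    post: frozenset

def plan(skills, state, goal):
    """Greedy forward chaining over the skill graph: repeatedly apply any
    skill whose pre-conditions hold and whose effects are still missing,
    until the goal facts are reached (or no skill makes progress)."""
    state, sequence = set(state), []
    while not goal <= state:
        applicable = [s for s in skills
                      if s.pre <= state and not (s.post <= state)]
        if not applicable:
            return None  # goal unreachable from this state
        s = applicable[0]
        state |= s.post
        sequence.append(f"{s.verb} {s.noun}")
    return sequence

skills = [
    Skill("pick", "peg", frozenset({"peg_on_table"}), frozenset({"holding_peg"})),
    Skill("insert", "peg", frozenset({"holding_peg"}), frozenset({"peg_inserted"})),
]
result = plan(skills, {"peg_on_table"}, {"peg_inserted"})
```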
cs.RO / 15 / 2603.12665

TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation

Zhang, Kaidi, Zhang, Heng, Xu, Zhengtong, Zhang, Zhiyuan, Prince, Md Rakibul Islam, Li, Xiang, Han, Xiaojing, Zhou, Yuhao, Ajoudani, Arash, She, Yu
Abstract
Vision-Language-Action (VLA) models have demonstrated significant advantages in robotic manipulation. However, their reliance on vision and language often leads to suboptimal performance in tasks involving visual occlusion, fine-grained manipulation, and physical contact. To address these challenges, we propose TacVLA, a VLA model fine-tuned to incorporate tactile modalities into the transformer-based policy, enhancing fine-grained manipulation capabilities. Specifically, we introduce a contact-aware gating mechanism that selectively activates tactile tokens only when contact is detected, enabling adaptive multimodal fusion while avoiding irrelevant tactile interference. The fused visual, language, and tactile tokens are jointly processed within the transformer architecture to strengthen cross-modal grounding during contact-rich interaction. Extensive experiments on constraint-locked disassembly, in-box picking and robustness evaluations demonstrate that our model outperforms baselines, improving success rates by an average of 20% in disassembly and 60% in in-box picking, and achieving a 2.1x improvement in scenarios with visual occlusion. Videos are available at https://sites.google.com/view/tacvla and code will be released.
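The contact-aware gating idea — tactile tokens contribute only when contact is detected — reduces to a simple mask on the tactile stream before fusion. A numpy sketch with invented token shapes and threshold (the real gate operates inside the transformer policy):

```python
import numpy as np

def contact_gate(tactile_tokens, contact_force, threshold=0.2):
    """Pass tactile tokens through only when contact is detected;
    otherwise zero them so they cannot interfere with the fused
    sequence (the threshold is illustrative)."""
    gate = 1.0 if contact_force > threshold else 0.0
    return gate * tactile_tokens

def fuse(vision, language, tactile, contact_force):
    """Concatenate the three token streams for the policy,
    with the tactile tokens gated on contact."""
    return np.concatenate(
        [vision, language, contact_gate(tactile, contact_force)], axis=0)

v = np.ones((4, 8))        # 4 vision tokens, width-8 embedding (toy sizes)
l = np.ones((2, 8))        # 2 language tokens
t = np.full((3, 8), 0.5)   # 3 tactile tokens
free = fuse(v, l, t, contact_force=0.0)   # no contact: tactile zeroed
touch = fuse(v, l, t, contact_force=1.0)  # contact: tactile active
```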
cs.RO / 16 / 2603.12686

Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data

Zhang, Zhikai, Lu, Haofei, Lian, Yunrui, Chen, Ziqing, Liu, Yun, Lin, Chenghuai, Xue, Han, Zeng, Zicheng, Qi, Zekun, Zheng, Shaolin, Luan, Qing, Wang, Jingbo, Xing, Junliang, Wang, He, Yi, Li
Abstract
Human athletes demonstrate versatile and highly-dynamic tennis skills to successfully conduct competitive rallies with a high-speed tennis ball. However, reproducing such behaviors on humanoid robots is difficult, partially due to the lack of perfect humanoid action data or human kinematic motion data in tennis scenarios as reference. In this work, we propose LATENT, a system that Learns Athletic humanoid TEnnis skills from imperfect human motioN daTa. The imperfect human motion data consist only of motion fragments that capture the primitive skills used when playing tennis rather than precise and complete human-tennis motion sequences from real-world tennis matches, thereby significantly reducing the difficulty of data collection. Our key insight is that, despite being imperfect, such quasi-realistic data still provide priors about human primitive skills in tennis scenarios. With further correction and composition, we learn a humanoid policy that can consistently strike incoming balls under a wide range of conditions and return them to target locations, while preserving natural motion styles. We also propose a series of designs for robust sim-to-real transfer and deploy our policy on the Unitree G1 humanoid robot. Our method achieves surprising results in the real world and can stably sustain multi-shot rallies with human players. Project page: https://zzk273.github.io/LATENT/
cs.RO / 17 / 2603.12696

HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation

HaltNav:基于轻量级拓扑先验的反应式视觉停顿方法用于稳健的视觉-语言导航
Li, Pingcong, Yu, Zihui, Zhang, Bichi, Schwertfeger, Sören
Abstract
Vision-and-Language Navigation (VLN) is shifting from rigid, step-by-step instruction following toward open-vocabulary, goal-oriented autonomy. Achieving this transition without exhaustive routing prompts requires agents to leverage structural priors. While prior work often assumes computationally heavy 2D/3D metric maps, we instead exploit a lightweight, text-based osmAG (OpenStreetMap Area Graph), a floorplan-level topological representation that is easy to obtain and maintain. However, global planning over a prior map alone is brittle in real-world deployments, where local connectivity can change (e.g., closed doors or crowded passages), leading to execution-time failures. To address this gap, we propose a hierarchical navigation framework HaltNav that couples the robust global planning of osmAG with the local exploration and instruction-grounding capability of VLN. Our approach features an MLLM-based brain module, which is capable of high-level task grounding and obstruction awareness. Conditioned on osmAG, the brain converts the global route into a sequence of localized execution snippets, providing the VLN executor with prior-grounded, goal-centric sub-instructions. Meanwhile, it detects local anomalies via a mechanism we term Reactive Visual Halting (RVH), which interrupts the local control loop, updates osmAG by invalidating the corresponding topology, and triggers replanning to orchestrate a viable detour. To train this halting capability efficiently, we introduce a data synthesis pipeline that leverages generative models to inject realistic obstacles into otherwise navigable scenes, substantially enriching hard negative samples. Extensive experiments demonstrate that our hierarchical framework outperforms several baseline methods without tedious language instructions, and significantly improves robustness for long-horizon vision-language navigation under environmental changes.
Chinese Translation
视觉与语言导航(VLN)正从严格的逐步指令跟随转向开放词汇的目标导向自主性。实现这一转变而不依赖于繁琐的路径规划提示,需要智能体利用结构性先验。尽管先前的研究通常假设计算量大的2D/3D度量地图,我们则利用轻量级的基于文本的osmAG(OpenStreetMap Area Graph),这是一种易于获取和维护的平面图级拓扑表示。然而,仅依赖先验地图进行全局规划在现实世界的部署中是脆弱的,因为局部连通性可能会发生变化(例如,关闭的门或拥挤的通道),导致执行时失败。为了解决这一问题,我们提出了一种层次导航框架HaltNav,将osmAG的稳健全局规划与VLN的局部探索和指令接地能力相结合。我们的方法具有基于MLLM的脑模块,能够进行高层次的任务接地和障碍物感知。基于osmAG,该脑模块将全局路线转换为一系列局部执行片段,为VLN执行器提供基于先验的、以目标为中心的子指令。同时,它通过我们称之为反应式视觉停顿(Reactive Visual Halting,RVH)的机制检测局部异常,该机制中断局部控制循环,通过使相应拓扑失效来更新osmAG,并触发重新规划以协调可行的绕行。为了高效地训练这种停顿能力,我们引入了一种数据合成管道,利用生成模型将现实障碍注入到本可导航的场景中,显著丰富了困难的负样本。大量实验表明,我们的层次框架在没有繁琐语言指令的情况下优于多种基线方法,并显著提高了在环境变化下长时程视觉-语言导航的稳健性。
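The RVH loop described above (detect an obstruction, invalidate the corresponding topology in the prior map, replan a detour) can be sketched on a toy area graph. The node names and the BFS planner below are illustrative stand-ins, not the paper's osmAG implementation:

```python
from collections import deque

def shortest_route(graph, start, goal):
    """BFS shortest path over an undirected area graph (dict: node -> set of neighbors)."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]] - seen:
            seen.add(nxt)
            queue.append(path + [nxt])
    return None

def reactive_halt_replan(graph, route, blocked_edge):
    """On a visual halt, invalidate the blocked edge in the prior map and replan."""
    a, b = blocked_edge
    graph[a].discard(b)
    graph[b].discard(a)
    return shortest_route(graph, route[0], route[-1])

# Toy floorplan graph: rooms A..E
area_graph = {
    "A": {"B"}, "B": {"A", "C", "D"},
    "C": {"B", "E"}, "D": {"B", "E"}, "E": {"C", "D"},
}
route = shortest_route(area_graph, "A", "E")
detour = reactive_halt_replan(area_graph, route, ("B", "C"))  # door B-C found closed
```

The key point mirrored here is that the halt edits only the topological prior; the global planner is reused unchanged on the updated graph.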
cs.RO / 18 / 2603.12717

Altered Thoughts, Altered Actions: Probing Chain-of-Thought Vulnerabilities in VLA Robotic Manipulation

思维变化,行动变化:探讨VLA机器人操作中的链式思维脆弱性
Trinh, Tuan Duong, Akhtar, Naveed, Azam, Basim
Abstract
Recent Vision-Language-Action (VLA) models increasingly adopt chain-of-thought (CoT) reasoning, generating a natural-language plan before decoding motor commands. This internal text channel between the reasoning module and the action decoder has received no adversarial scrutiny. We ask: which properties of this intermediate plan does the action decoder actually rely on, and can targeted corruption of the reasoning trace alone -- with all inputs left intact -- degrade a robot's physical task performance? We design a taxonomy of seven text corruptions organized into three attacker tiers (blind noise, mechanical-semantic, and LLM-adaptive) and apply them to a state-of-the-art reasoning VLA across 40 LIBERO tabletop manipulation tasks. Our results reveal a striking asymmetry: substituting object names in the reasoning trace reduces overall success rate by 8.3~percentage points (pp) -- reaching $-$19.3~pp on goal-conditioned tasks and $-$45~pp on individual tasks -- whereas sentence reordering, spatial-direction reversal, token noise, and even a 70B-parameter LLM crafting plausible-but-wrong plans all have negligible impact (within $\pm$4~pp). This asymmetry indicates that the action decoder depends on entity-reference integrity rather than reasoning quality or sequential structure. Notably, a sophisticated LLM-based attacker underperforms simple mechanical object-name substitution, because preserving plausibility inadvertently retains the entity-grounding structure the decoder needs. A cross-architecture control using a non-reasoning VLA confirms the vulnerability is exclusive to reasoning-augmented models, while instruction-level attacks degrade both architectures -- establishing that the internal reasoning trace is a distinct and stealthy threat vector invisible to input-validation defenses.
Chinese Translation
近年来,视觉-语言-行动(VLA)模型越来越多地采用链式思维(CoT)推理,在解码运动指令之前生成自然语言计划。推理模块与动作解码器之间的这一内部文本通道尚未受到对抗性审查。我们提出以下问题:动作解码器实际上依赖于该中间计划的哪些属性?在所有输入保持不变的情况下,仅有针对性地破坏推理轨迹,是否会降低机器人的物理任务表现?我们设计了一个包含七种文本破坏的分类法,分为三个攻击者层级(盲噪声、机械-语义和LLM自适应),并将其应用于一个最先进的推理VLA,涵盖40个LIBERO桌面操作任务。我们的结果揭示了一个显著的不对称性:在推理轨迹中替换对象名称使整体成功率降低了8.3个百分点(pp),在目标条件任务中达到 -19.3 pp,在单个任务中达到 -45 pp;而句子重排序、空间方向反转、标记噪声,甚至一个具有70B参数的LLM生成合理但错误的计划都几乎没有影响(在 ±4 pp 以内)。这种不对称性表明,动作解码器依赖于实体引用的完整性,而非推理质量或顺序结构。值得注意的是,一个复杂的基于LLM的攻击者的效果反而不如简单的机械式对象名称替换,因为保持合理性无意中保留了解码器所需的实体接地结构。使用非推理VLA的跨架构对照实验确认该脆弱性仅限于增强推理的模型,而指令级攻击则会降低两种架构的性能,这表明内部推理轨迹是一个独特且隐蔽的威胁向量,对输入验证防御措施不可见。
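The highest-impact attack in the taxonomy, mechanical object-name substitution, amounts to a plain string rewrite over the reasoning trace while every model input stays intact. A minimal sketch (the plan text and name mapping are invented examples, not from the paper):

```python
import re

def swap_object_names(plan: str, mapping: dict) -> str:
    """Mechanically substitute object names inside a reasoning trace,
    leaving sentence order, spatial terms, and all model inputs untouched."""
    keys = sorted(mapping, key=len, reverse=True)   # longest match first
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], plan)

plan = "Pick up the red mug and place it on the left shelf."
corrupted = swap_object_names(plan, {"red mug": "blue bowl"})
# sentence structure and plausibility preserved; only the entity reference is broken
```

Per the abstract's finding, exactly this kind of entity-reference break is what the action decoder cannot tolerate, while structural corruptions pass through harmlessly.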
cs.RO / 19 / 2603.12730

AnchorVLA4D: an Anchor-Based Spatial-Temporal Vision-Language-Action Model for Robotic Manipulation

AnchorVLA4D:一种基于锚点的时空视觉-语言-动作模型用于机器人操作
Zhu, Juan, Shao, Zhanying, Li, Xiaoqi, Morgan, Ethan, Xu, Jiadong, Fan, Hongwei, Dong, Hao
Abstract
Since current Vision-Language-Action (VLA) systems suffer from limited spatial perception and the absence of memory throughout manipulation, we investigate visual anchors as a means to enhance spatial and temporal reasoning within VLA policies for robotic manipulation. Conventional VLAs generate actions by conditioning on a single current frame together with a language instruction. However, since the frame is encoded as a 2D image, it does not contain detailed spatial information, and the VLA similarly lacks any means to incorporate past context. As a result, it frequently forgets objects under occlusion and becomes spatially disoriented during the manipulation process. Thus, we propose AnchorVLA4D, a simple spatial-temporal VLA that augments the visual input with an anchor image to preserve the initial scene context throughout execution, and adds a lightweight spatial encoder that jointly processes the anchor and current frames to expose geometric relationships within an episode. Built on a Qwen2.5-VL backbone with a diffusion-based action head, AnchorVLA4D requires no additional sensing modalities (e.g., depth or point clouds) and introduces negligible inference overhead. Combining anchoring with a frozen pretrained spatial encoder yields further gains, realizing a 13.6% improvement on the Simpler WidowX benchmark and confirming the approach on real-world tasks, where it achieved an average success rate of 80%.
Chinese Translation
由于当前的视觉-语言-动作(VLA)系统在空间感知方面存在局限,并且在操作过程中缺乏记忆,我们研究了视觉锚点作为增强VLA策略中空间和时间推理的一种手段,以改善机器人操作。传统的VLA以单一当前帧和语言指令为条件来生成动作。然而,由于帧被编码为二维图像,它不包含详细的空间信息,VLA同样缺乏整合过去上下文的手段。因此,它经常忘记被遮挡的物体,并在操作过程中丧失空间定向。因此,我们提出了AnchorVLA4D,这是一种简单的时空VLA,通过用锚点图像增强视觉输入,以在执行过程中保持初始场景上下文,并增加一个轻量级空间编码器,该编码器联合处理锚点帧和当前帧,以揭示一个回合内的几何关系。AnchorVLA4D基于Qwen2.5-VL主干网络,配备基于扩散的动作头,不需要额外的传感模态(如深度或点云),且推理开销可忽略不计。将锚定与冻结的预训练空间编码器结合可进一步提升性能,在Simpler WidowX基准上实现了13.6%的提升,并在现实世界任务中验证了该方法,取得了80%的平均成功率。
cs.RO / 20 / 2603.12736

Conflict Mitigation in Shared Environments using Flow-Aware Multi-Agent Path Finding

基于流量感知的多智能体路径规划在共享环境中的冲突缓解
Heuer, Lukas, Zhu, Yufei, Palmieri, Luigi, Rudenko, Andrey, Mannucci, Anna, Koenig, Sven, Magnusson, Martin
Abstract
Deploying multi-robot systems in environments shared with dynamic and uncontrollable agents presents significant challenges, especially for large robot fleets. In such environments, individual robot operations can be delayed due to unforeseen conflicts with uncontrollable agents. While existing research primarily focuses on preserving the completeness of Multi-Agent Path Finding (MAPF) solutions considering delays, there is limited emphasis on utilizing additional environmental information to enhance solution quality in the presence of other dynamic agents. To this end, we propose Flow-Aware Multi-Agent Path Finding (FA-MAPF), a novel framework that integrates learned motion patterns of uncontrollable agents into centralized MAPF algorithms. Our evaluation, conducted on a diverse set of benchmark maps with simulated uncontrollable agents and on a real-world map with recorded human trajectories, demonstrates the effectiveness of FA-MAPF compared to state-of-the-art baselines. The experimental results show that FA-MAPF consistently reduces conflicts with uncontrollable agents by up to 55% without compromising task efficiency.
Chinese Translation
在与动态且不可控的智能体共享的环境中部署多机器人系统面临重大挑战,尤其是对于大型机器人队伍。在这样的环境中,个别机器人的操作可能因与不可控智能体的意外冲突而延迟。尽管现有研究主要集中在考虑延迟的情况下保持多智能体路径规划(MAPF)解决方案的完整性,但对利用额外环境信息以提高在其他动态智能体存在下的解决方案质量的关注有限。为此,我们提出了基于流量感知的多智能体路径规划(FA-MAPF),这是一个将不可控智能体的学习运动模式整合到集中式MAPF算法中的新框架。我们在一组多样化的基准地图上进行评估,模拟不可控智能体,并在一个记录了人类轨迹的真实地图上进行测试,结果表明FA-MAPF相比于最先进的基线方法具有有效性。实验结果显示,FA-MAPF能够持续减少与不可控智能体的冲突,最高可达55%,且不影响任务效率。
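At its core, folding learned motion patterns into planning means biasing traversal costs by predicted agent flow. A minimal single-robot sketch (a Dijkstra grid search with a flow penalty; the grid, weight, and densities are invented, and the actual FA-MAPF operates inside multi-agent MAPF solvers):

```python
import heapq

def flow_aware_path(flow, start, goal, weight=5.0):
    """Dijkstra on a 4-connected grid; entering a cell costs
    1 + weight * predicted flow density of uncontrollable agents there."""
    rows, cols = len(flow), len(flow[0])
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, (r, c) = heapq.heappop(pq)
        if (r, c) == goal:
            break
        if d > dist[(r, c)]:
            continue
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + 1.0 + weight * flow[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)], prev[(nr, nc)] = nd, (r, c)
                    heapq.heappush(pq, (nd, (nr, nc)))
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

# A corridor of high predicted human flow in column 1 makes the planner
# detour through the bottom row instead of cutting straight across.
flow_map = [[0.0, 0.9, 0.0],
            [0.0, 0.9, 0.0],
            [0.0, 0.0, 0.0]]
path = flow_aware_path(flow_map, (0, 0), (0, 2))
```

The detour costs six unit steps versus 6.5 for the direct route once the flow penalty is applied, so the planner trades a slightly longer path for fewer expected conflicts.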
cs.RO / 21 / 2603.12769

Easy-IIL: Reducing Human Operational Burden in Interactive Imitation Learning via Assistant Experts

Easy-IIL:通过助手专家减少交互模仿学习中的人类操作负担
Zhang, Chengjie, Tang, Chao, Dong, Wenlong, Huang, Dehao, Gu, Aoxiang, Zhang, Hong
Abstract
Interactive Imitation Learning (IIL) typically relies on extensive human involvement for both offline demonstration and online interaction. Prior work primarily focuses on reducing human effort in passive monitoring rather than active operation. Interestingly, structured model-based imitation approaches achieve comparable performance with significantly fewer demonstrations than end-to-end imitation learning policies in the low-data regime. However, these methods are typically surpassed by end-to-end policies as the data increases. Leveraging this insight, we propose Easy-IIL, a framework that utilizes off-the-shelf model-based imitation methods as an assistant expert to replace active human operation for the majority of data collection. The human expert only provides a single demonstration to initialize the assistant expert and intervenes in critical states where the task is approaching failure. Furthermore, Easy-IIL can maintain IIL performance by preserving both offline and online data quality. Extensive simulation and real-world experiments demonstrate that Easy-IIL significantly reduces human operational burden while maintaining performance comparable to mainstream IIL baselines. User studies further confirm that Easy-IIL reduces subjective workload on the human expert. Project page: https://sites.google.com/view/easy-iil
Chinese Translation
交互模仿学习(IIL)通常依赖于大量人类参与,包括离线演示和在线交互。之前的研究主要集中在减少人类在被动监控中的努力,而非主动操作。有趣的是,结构化的基于模型的模仿方法在低数据环境下能够以显著较少的演示实现与端到端模仿学习策略相当的性能。然而,随着数据量的增加,这些方法通常会被端到端策略超越。基于这一洞察,我们提出了Easy-IIL,一个利用现成的基于模型的模仿方法作为助手专家,以替代大部分数据收集中的主动人类操作的框架。人类专家仅需提供一次演示以初始化助手专家,并在任务接近失败的关键状态进行干预。此外,Easy-IIL能够通过保持离线和在线数据质量来维持IIL性能。大量的仿真和现实世界实验表明,Easy-IIL显著减少了人类操作负担,同时保持与主流IIL基准相当的性能。用户研究进一步确认,Easy-IIL降低了人类专家的主观工作负荷。项目页面:https://sites.google.com/view/easy-iil
cs.RO / 22 / 2603.12791

Motion-Specific Battery Health Assessment for Quadrotors Using High-Fidelity Battery Models

基于高保真电池模型的四旋翼特定运动电池健康评估
Kim, Joonhee, Park, Sanghyun, Kim, Donghyeong, Choi, Eunseon, Han, Soohee
Abstract
Quadrotor endurance is ultimately limited by battery behavior, yet most energy-aware planning treats the battery as a simple energy reservoir and overlooks how flight motions induce dynamic current loads that accelerate battery degradation. This work presents an end-to-end framework for motion-aware battery health assessment in quadrotors. We first design a wide-range current-sensing module to capture motion-specific current profiles during real flights, preserving transient features. In parallel, a high-fidelity battery model is calibrated using reference performance tests and a metaheuristic based on a degradation-coupled electrochemical model. By simulating measured flight loads in the calibrated model, we systematically resolve how different flight motions translate into degradation modes (loss of lithium inventory and loss of active material) as well as internal side reactions. The results demonstrate that even when two flight profiles consume the same average energy, their transient load structures can drive different degradation pathways, emphasizing the need for motion-aware battery management that balances efficiency with battery degradation.
Chinese Translation
四旋翼的续航能力最终受到电池行为的限制,但大多数考虑能量的规划将电池视为简单的能量储存装置,忽视了飞行运动如何引发动态电流负载,从而加速电池的退化。本文提出了一种针对四旋翼的运动感知电池健康评估的端到端框架。我们首先设计了一种宽量程电流传感模块,以捕捉真实飞行中特定运动的电流曲线,并保留瞬态特征。同时,基于参考性能测试对高保真电池模型进行校准,并采用基于退化耦合电化学模型的元启发式方法。通过在校准模型中模拟测得的飞行负载,我们系统地解析了不同飞行运动如何转化为退化模式,包括锂库存损失、活性材料损失以及内部副反应。结果表明,即使两个飞行轨迹消耗相同的平均能量,其瞬态负载结构也可能驱动不同的退化路径,这强调了需要运动感知的电池管理,以平衡效率与电池退化。
cs.RO / 23 / 2603.12806

FLUX: Accelerating Cross-Embodiment Generative Navigation Policies via Rectified Flow and Static-to-Dynamic Learning

FLUX:通过校正流和静态到动态学习加速跨体现生成导航策略
Gong, Zeying, Zhong, Yangyi, Ding, Yiyi, Hu, Tianshuai, Zhao, Guoyang, Kong, Lingdong, Li, Rong, You, Jiadi, Liang, Junwei
Abstract
Autonomous navigation requires a broad spectrum of skills, from static goal-reaching to dynamic social traversal, yet evaluation remains fragmented across disparate protocols. We introduce DynBench, a dynamic navigation benchmark featuring physically valid crowd simulation. Combined with existing static protocols, it supports comprehensive evaluation across six fundamental navigation tasks. Within this framework, we propose FLUX, the first flow-based unified navigation policy. By linearizing probability flow, FLUX replaces iterative denoising with straight-line trajectories, improving per-step inference efficiency by 47% over prior flow-based methods and 29% over diffusion-based ones. Following a static-to-dynamic curriculum, FLUX initially establishes geometric priors and is subsequently refined through reinforcement learning in dynamic social environments. This regime not only strengthens socially-aware navigation but also enhances static task robustness by capturing recovery behaviors through stochastic action distributions. FLUX achieves state-of-the-art performance across all tasks and demonstrates zero-shot sim-to-real transfer on wheeled, quadrupedal, and humanoid platforms without any fine-tuning.
Chinese Translation
自主导航需要广泛的技能,从静态目标到达至动态社交穿行,但评估仍然在不同的协议之间碎片化。我们引入了DynBench,一个具有物理有效人群模拟的动态导航基准。结合现有的静态协议,它支持在六个基本导航任务中进行全面评估。在此框架内,我们提出了FLUX,这是第一个基于流的统一导航策略。通过线性化概率流,FLUX用直线路径替代了迭代去噪,相较于之前的基于流的方法将每步推理效率提高了47%,相较于基于扩散的方法提高了29%。遵循静态到动态的课程,FLUX最初建立几何先验,随后通过在动态社交环境中的强化学习进行精炼。这种机制不仅增强了社交感知导航,还通过随机动作分布捕捉恢复行为,提高了静态任务的鲁棒性。FLUX在所有任务中都达到了最先进的性能,并在轮式、四足和类人平台上实现了零样本的仿真到现实迁移,无需任何微调。
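The "linearizing probability flow" step has a compact numerical core: a rectified flow makes the learned velocity field constant along straight sample paths, so Euler integration from noise to action needs very few steps. A toy sketch with an idealized velocity field (the 2-D "action", the target, and the field itself are invented for illustration, not FLUX's learned model):

```python
def sample_flow(velocity_fn, x0, num_steps=1):
    """Euler-integrate dx/dt = v(x, t) from t = 0 to t = 1. With a perfectly
    rectified (straight-line) flow, even a single step lands on the endpoint,
    which is what removes the iterative denoising loop."""
    x, dt = list(x0), 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return x

# Idealized rectified field: velocity always points straight at the target action.
target = [0.5, -0.2]
v = lambda x, t: [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
one_step = sample_flow(v, [0.0, 0.0], num_steps=1)
eight_step = sample_flow(v, [0.0, 0.0], num_steps=8)
# both land on the same endpoint; the one-step version is 8x cheaper
```

A real learned field is only approximately straight, so a handful of steps is used in practice, but the per-step saving over diffusion-style denoising is the same mechanism.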
cs.RO / 24 / 2603.12807

Reinforcement Learning for Elliptical Cylinder Motion Control Tasks

用于椭圆柱体运动控制任务的强化学习
Marczewski, Pawel, Superczynska, Paulina, Bernat, Jakub, Szczesny, Szymon
Abstract
Control of devices with limited input draws sustained research attention due to its difficulty and non-trivial solutions. For instance, the inverted pendulum is a benchmark problem in control theory and machine learning. In this work, we focus on the elliptical cylinder and its motion under limited torque. The problem is inspired by untethered magnetic devices, which, owing to distance, must operate with limited input torque. The main goal of this work is to define the control problem of an elliptical cylinder with limited input torque and to solve it by Reinforcement Learning. As a classical baseline, we evaluate a two-stage controller composed of an energy-shaping swing-up law and a local Linear Quadratic Regulator (LQR) stabilizer around the target equilibrium. The swing-up controller increases the system's mechanical energy to drive the state toward a neighborhood of the desired equilibrium; a linearization of the nonlinear model then yields an LQR that regulates the angle and angular-rate states to the target orientation with bounded input. This swing-up + LQR policy is a strong, interpretable reference for underactuated systems and serves as a point of comparison to the learned policy under identical limits and parameters. The results show that learning is possible; however, cases such as stabilization in the upward position or a half-turn rotation become very difficult for increasing mass or for ellipses with a strongly unequal perimeter ratio.
Chinese Translation
具有有限输入的设备控制因其难度和非平凡的解法而持续受到研究关注。例如,倒立摆是控制理论和机器学习中的基准问题。在本研究中,我们关注椭圆柱体及其在有限扭矩下的运动。该问题的灵感来源于无系留磁性设备,由于距离的原因,它们必须在有限的输入扭矩下工作。本研究的主要目标是定义有限输入扭矩下椭圆柱体的控制问题,并通过强化学习加以解决。作为经典基线,我们评估了一个两阶段控制器,由能量塑形起摆控制律和围绕目标平衡点的局部线性二次调节器(LQR)稳定器组成。起摆控制器增加系统的机械能,以将状态驱动至期望平衡点的邻域;随后对非线性模型进行线性化得到一个LQR,在输入有界的条件下将角度和角速度状态调节到目标方向。这个起摆+LQR策略是欠驱动系统的一个强大且可解释的参考,并为在相同限制和参数下学习到的策略提供了比较基准。结果表明学习是可行的;然而,对于质量增大或周长比例严重不等的椭圆,诸如在竖直向上位置稳定或旋转半圈等情形是非常困难的。
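The two-stage baseline described above can be sketched for the classic pendulum special case: an energy-shaping law pumps mechanical energy toward the upright-equilibrium value, and a linear state-feedback catch takes over near the top. The plant parameters, gains, and switching threshold below are invented for illustration; real LQR gains would be computed offline from a Riccati equation:

```python
import math

m, l, g, u_max = 1.0, 1.0, 9.81, 2.0   # illustrative plant and torque limit
I = m * l * l                           # point-mass inertia about the pivot
E_up = m * g * l                        # energy at the upright equilibrium (theta = 0)

def controller(theta, omega, K=(12.0, 4.0), catch=0.3, k_e=0.8):
    """Two-stage baseline: energy-shaping swing-up, then a local linear
    (LQR-style) stabilizer once the angle is within the catch region.
    theta is measured from upright; K is a placeholder gain pair."""
    E = 0.5 * I * omega ** 2 + E_up * math.cos(theta)
    wrapped = ((theta + math.pi) % (2 * math.pi)) - math.pi
    if abs(wrapped) < catch:
        u = -K[0] * wrapped - K[1] * omega       # linear catch near the top
    else:
        u = k_e * (E_up - E) * omega             # dE/dt = u*omega > 0 while E < E_up
    return max(-u_max, min(u_max, u))            # respect the input limit
```

The swing-up branch uses the identity dE/dt = u*omega for a torque-driven pendulum, so u proportional to (E_up - E)*omega injects energy whenever the system sits below the upright energy level, always within the saturation bound.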
cs.RO / 25 / 2603.12842

SmoothTurn: Learning to Turn Smoothly for Agile Navigation with Quadrupedal Robots

SmoothTurn:学习平滑转向以实现四足机器人灵活导航
You, Zunzhi, Guo, Haolan, Wang, Yunke, Xu, Chang
Abstract
Quadrupedal robots show great potential for valuable real-world applications such as fire rescue and industrial inspection. Such applications often require urgency and the ability to navigate agilely, which in turn demands the capability to change directions smoothly while running in high speed. Existing approaches for agile navigation typically learn a single-goal reaching policy by encouraging the robot to stay at the target position after reaching there. As a result, when the policy is used to reach sequential goals that require changing directions, it cannot anticipate upcoming maneuvers or maintain momentum across the switch of goals, thereby preventing the robot from fully exploiting its agility potential. In this work, we formulate the task as sequential local navigation, extending the single-goal-conditioned local navigation formulation in prior work. We then introduce SmoothTurn, a learning-based control framework that learns to turn smoothly while running rapidly for agile sequential local navigation. The framework adopts a novel sequential goal-reaching reward, an expanded observation space with a lookahead window for future goals, and an automatic goal curriculum that progressively expands the difficulty of sampled goal sequences based on the goal-reaching performance. The trained policy can be directly deployed on real quadrupedal robots with onboard sensors and computation. Both simulation and real-world empirical results show that SmoothTurn learns an agile locomotion policy that performs smooth turning across goals, with emergent behaviors such as controlling momentum when switching goals, facing towards the future goal in advance, and planning efficient paths. We have provided video demos of the learned motions in the supplementary materials. The source code and trained policies will be made available upon acceptance.
Chinese Translation
四足机器人在消防救援和工业检查等实际应用中展现出巨大的潜力。这些应用通常要求紧急反应和灵活导航,这反过来又要求在高速运行时能够平滑地改变方向。现有的灵活导航方法通常通过鼓励机器人在到达目标位置后停留在该位置来学习单一目标的到达策略。因此,当该策略用于达到需要改变方向的连续目标时,它无法预见即将到来的机动或在目标切换时保持动量,从而阻碍了机器人充分发挥其灵活性的潜力。在本研究中,我们将任务形式化为顺序局部导航,扩展了先前工作的单一目标条件局部导航形式。然后,我们引入了SmoothTurn,一个基于学习的控制框架,旨在快速运行时学习平滑转向,以实现灵活的顺序局部导航。该框架采用了一种新颖的顺序目标到达奖励,扩展的观察空间以及针对未来目标的前瞻窗口,并且自动目标课程根据目标到达性能逐步增加采样目标序列的难度。训练后的策略可以直接部署在配备传感器和计算能力的真实四足机器人上。模拟和实际实验结果表明,SmoothTurn学习到了一种灵活的运动策略,能够在目标之间进行平滑转向,并展现出诸如在切换目标时控制动量、提前面向未来目标以及规划高效路径等新兴行为。我们在补充材料中提供了学习运动的视频演示。源代码和训练策略将在接受后提供。
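The expanded observation space with a lookahead window reduces, at its simplest, to expressing the current and upcoming goals in the robot's body frame, so the policy can see the next turn before reaching the current goal. A minimal 2-D sketch (the pose convention, window size, and padding-by-repetition are assumptions, not the paper's exact design):

```python
import math

def goal_observation(pose, goals, idx, lookahead=2):
    """Observation sketch for sequential local navigation: express the current
    goal and a lookahead window of future goals in the robot's body frame,
    so the policy can anticipate upcoming turns and keep momentum.
    The last goal is repeated to pad the window near the end of the sequence."""
    x, y, yaw = pose
    obs = []
    for k in range(lookahead + 1):
        gx, gy = goals[min(idx + k, len(goals) - 1)]
        dx, dy = gx - x, gy - y
        # rotate the world-frame offset into the body frame
        obs += [math.cos(-yaw) * dx - math.sin(-yaw) * dy,
                math.sin(-yaw) * dx + math.cos(-yaw) * dy]
    return obs

# Robot at the origin facing +y; current goal straight ahead, next goal to its right.
obs = goal_observation((0.0, 0.0, math.pi / 2), [(0.0, 2.0), (2.0, 2.0)], 0, lookahead=1)
```

Seeing the second goal offset to the side is exactly the signal that lets a policy start carrying momentum into the turn before the first goal is reached.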
cs.RO / 26 / 2603.12868

Beyond Imitation: Reinforcement Learning Fine-Tuning for Adaptive Diffusion Navigation Policies

超越模仿:用于自适应扩散导航策略的强化学习微调
Sheng, Junhe, Bai, Ruofei, Xu, Kuan, Liu, Ruimeng, Chen, Jie, Yuan, Shenghai, Yau, Wei-Yun, Xie, Lihua
Abstract
Diffusion-based robot navigation policies trained on large-scale imitation learning datasets can generate multi-modal trajectories directly from the robot's visual observations, bypassing the traditional localization-mapping-planning pipeline and achieving strong zero-shot generalization. However, their performance remains constrained by the coverage of offline datasets, and when deployed in unseen settings, distribution shift often leads to accumulated trajectory errors and safety-critical failures. Adapting diffusion policies with reinforcement learning is challenging because their iterative denoising structure hinders effective gradient backpropagation, while also making the training of an additional value network computationally expensive and less stable. To address these issues, we propose a reinforcement learning fine-tuning framework tailored for diffusion-based navigation. The method leverages the inherent multi-trajectory sampling mechanism of diffusion models and adopts Group Relative Policy Optimization (GRPO), which estimates relative advantages across sampled trajectories without requiring a separate value network. To preserve pretrained representations while enabling adaptation, we freeze the visual encoder and selectively update the higher decoder layers and action head, enhancing safety-aware behaviors through online environmental feedback. On the PointGoal task in Isaac Sim, our approach improves the Success Rate from 52.0% to 58.7% and SPL from 0.49 to 0.54 on unseen scenes, while reducing collision frequency. Additional experiments show that the fine-tuned policy transfers zero-shot to a real quadruped platform and maintains stable performance in geometrically out-of-distribution environments, suggesting improved adaptability and safe generalization to new domains.
Chinese Translation
基于扩散的机器人导航策略在大规模模仿学习数据集上训练,可以直接从机器人的视觉观察中生成多模态轨迹,绕过传统的定位-建图-规划流程,并实现强大的零样本泛化能力。然而,它们的性能仍然受到离线数据集覆盖范围的限制,当在未见过的环境中部署时,分布偏移常常导致累积轨迹误差和安全关键性失败。使用强化学习来适应扩散策略具有挑战性,因为其迭代去噪结构阻碍了有效的梯度反向传播,同时使得额外价值网络的训练计算成本高且稳定性差。为了解决这些问题,我们提出了一种针对基于扩散的导航的强化学习微调框架。该方法利用扩散模型固有的多轨迹采样机制,并采用群体相对策略优化(Group Relative Policy Optimization, GRPO),该方法在不需要单独的价值网络的情况下估计采样轨迹之间的相对优势。为了在实现适应的同时保留预训练表示,我们冻结视觉编码器,并选择性地更新较高的解码器层和动作头,通过在线环境反馈增强安全感知行为。在 Isaac Sim 的 PointGoal 任务中,我们的方法在未见场景上将成功率从 52.0% 提高到 58.7%,SPL 从 0.49 提高到 0.54,且减少了碰撞频率。额外实验表明,微调后的策略能够零样本迁移到真实的四足平台,并在几何上分布外的环境中保持稳定性能,表明其在新领域的适应性和安全泛化能力得到了改善。
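The trick that removes the value network is GRPO's group-relative advantage: each of the diffusion policy's sampled trajectories is scored against the statistics of its own sampling group rather than against a learned baseline. A minimal sketch of that normalization (the reward values are toy numbers):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group Relative Policy Optimization advantage estimate: normalize each
    sampled trajectory's reward by its group's mean and standard deviation,
    so no separate value network is needed as a baseline."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four trajectories sampled for the same observation, scored by the environment.
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Above-average trajectories get positive advantage and are reinforced; below-average ones are suppressed, which fits naturally with the multi-trajectory sampling a diffusion policy already performs.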
cs.RO / 27 / 2603.12904

Consistent and Efficient MSCKF-based LiDAR-Inertial Odometry with Inferred Cluster-to-Plane Constraints for UAVs

基于MSCKF的LiDAR-惯性里程计的一致性与高效性:针对无人机的推断簇到平面约束
Zhu, Jinwen, Zhao, Xudong, Zhu, Fangcheng, Hu, Jun, Jin, Shi, Mao, Yinian, Huang, Guoquan
Abstract
Robust and accurate navigation is critical for Unmanned Aerial Vehicles (UAVs), especially for those with stringent Size, Weight, and Power (SWaP) constraints. However, most state-of-the-art (SOTA) LiDAR-Inertial Odometry (LIO) systems still suffer from estimation inconsistency and computational bottlenecks when deployed on such platforms. To address these issues, this paper proposes a consistent and efficient tightly-coupled LIO framework tailored for UAVs. Within the efficient Multi-State Constraint Kalman Filter (MSCKF) framework, we build coplanar constraints inferred from planar features observed across a sliding window. By applying null-space projection to sliding-window coplanar constraints, we eliminate the direct dependency on feature parameters in the state vector, thereby mitigating overconfidence and improving consistency. More importantly, to further boost efficiency, we introduce a parallel voxel-based data association and a novel compact cluster-to-plane measurement model. This compact measurement model losslessly reduces observation dimensionality and significantly accelerates the update process. Extensive evaluations demonstrate that our method outperforms most SOTA approaches by providing a superior balance of consistency and efficiency. It exhibits improved robustness in degenerate scenarios, achieves the lowest memory usage via its map-free nature, and runs in real-time on resource-constrained embedded platforms (e.g., NVIDIA Jetson TX2).
Chinese Translation
对于无人驾驶飞行器(UAV)而言,稳健且准确的导航至关重要,尤其是对于那些具有严格尺寸、重量和功耗(SWaP)限制的无人机。然而,大多数最先进的(SOTA)LiDAR-惯性里程计(LIO)系统在此类平台上部署时仍然面临估计不一致性和计算瓶颈。为了解决这些问题,本文提出了一种专为无人机量身定制的一致性和高效性的紧耦合LIO框架。在高效的多状态约束卡尔曼滤波器(MSCKF)框架内,我们构建了从滑动窗口中观察到的平面特征推断出的共面约束。通过对滑动窗口共面约束应用零空间投影,我们消除了状态向量中对特征参数的直接依赖,从而减轻了过度自信并提高了一致性。更重要的是,为了进一步提升效率,我们引入了一种并行体素数据关联和一种新颖的紧凑簇到平面测量模型。该紧凑测量模型无损地降低了观测维度,并显著加快了更新过程。广泛的评估表明,我们的方法在一致性和效率之间提供了更优的平衡,超越了大多数最先进的方法(SOTA)。它在退化场景中表现出更强的鲁棒性,通过其无地图特性实现了最低的内存使用,并在资源受限的嵌入式平台(如NVIDIA Jetson TX2)上实时运行。
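The null-space projection step can be illustrated in miniature: with feature Jacobian Hf, multiplying the stacked residual and state Jacobian by a projector onto Hf's left null space removes the feature parameter from the EKF update. A toy 2x2 instance (real MSCKF implementations typically use a thin QR of Hf; the matrices here are invented, and Hf is a single column for simplicity):

```python
def nullspace_project(Hf, Hx, r):
    """Project the stacked residual r and state Jacobian Hx onto the left
    null space of the feature Jacobian Hf (here one column), removing the
    feature parameter from the update: r' = N r, Hx' = N Hx, with
    N = I - Hf Hf^T / ||Hf||^2."""
    n = len(Hf)
    s = sum(h * h for h in Hf)
    N = [[(1.0 if i == j else 0.0) - Hf[i] * Hf[j] / s for j in range(n)]
         for i in range(n)]
    r_p = [sum(N[i][j] * r[j] for j in range(n)) for i in range(n)]
    Hx_p = [[sum(N[i][j] * Hx[j][k] for j in range(n)) for k in range(len(Hx[0]))]
            for i in range(n)]
    return Hx_p, r_p

Hf = [1.0, 1.0]                    # feature (plane-parameter) Jacobian column
Hx = [[2.0, 0.0], [0.0, 2.0]]      # state Jacobian
Hx_p, r_p = nullspace_project(Hf, Hx, [1.0, 3.0])
```

After projection the residual is orthogonal to Hf, so the update no longer constrains the feature parameter, which is what keeps it out of the state vector and mitigates overconfidence.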
cs.RO / 28 / 2603.12908

GoalSwarm: Multi-UAV Semantic Coordination for Open-Vocabulary Object Navigation

GoalSwarm:多无人机语义协调用于开放词汇目标导航
James, MoniJesu Wonders, Habel, Amir Atef, Fedoseev, Aleksey, Tsetserukou, Dzmitry
Abstract
Cooperative visual semantic navigation is a foundational capability for aerial robot teams operating in unknown environments. However, achieving robust open-vocabulary object-goal navigation remains challenging due to the computational constraints of deploying heavy perception models onboard and the complexity of decentralized multi-agent coordination. We present GoalSwarm, a fully decentralized multi-UAV framework for zero-shot semantic object-goal navigation. Each UAV collaboratively constructs a shared, lightweight 2D top-down semantic occupancy map by projecting depth observations from aerial vantage points, eliminating the computational burden of full 3D representations while preserving essential geometric and semantic structure. The core contributions of GoalSwarm are threefold: (1) integration of zero-shot foundation model -- SAM3 for open vocabulary detection and pixel-level segmentation, enabling open-vocabulary target identification without task-specific training; (2) a Bayesian Value Map that fuses multi-viewpoint detection confidences into a per-pixel goal-relevance distribution, enabling informed frontier scoring via Upper Confidence Bound (UCB) exploration; and (3) a decentralized coordination strategy combining semantic frontier extraction, cost-utility bidding with geodesic path costs, and spatial separation penalties to minimize redundant exploration across the swarm.
Chinese Translation
合作视觉语义导航是空中机器人团队在未知环境中操作的基础能力。然而,由于在机载部署重型感知模型的计算限制以及去中心化多智能体协调的复杂性,实现稳健的开放词汇目标导航仍然具有挑战性。我们提出了GoalSwarm,一个完全去中心化的多无人机框架,用于零样本语义目标导航。每个无人机通过从空中视角投影深度观测,共同构建一个共享的轻量级二维自上而下的语义占用图,消除了全三维表示的计算负担,同时保留了基本的几何和语义结构。GoalSwarm的核心贡献有三方面:(1)集成零样本基础模型——SAM3,用于开放词汇检测和像素级分割,使得无需特定任务训练即可进行开放词汇目标识别;(2)一个贝叶斯价值图,将多视角检测置信度融合为每像素目标相关性分布,通过上置信界(UCB)探索实现知情的边界评分;(3)一种去中心化协调策略,结合语义边界提取、成本效用竞标与测地路径成本,以及空间分离惩罚,以最小化群体内的冗余探索。
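Contribution (2) scores frontiers by combining the fused goal-relevance estimate with an exploration bonus. A minimal Upper Confidence Bound scorer (the relevance values and observation counts are invented; the paper's version operates over the per-pixel Bayesian Value Map rather than a two-entry dict):

```python
import math

def ucb_frontier_score(mean_relevance, visits, total_visits, c=1.0):
    """UCB score for a frontier: exploit the fused goal-relevance estimate
    while still exploring frontiers that have been observed rarely."""
    bonus = c * math.sqrt(math.log(total_visits) / max(visits, 1))
    return mean_relevance + bonus

# (fused relevance, number of observations) per frontier
frontiers = {"f1": (0.8, 10), "f2": (0.3, 1)}
total = sum(v for _, v in frontiers.values())
best = max(frontiers, key=lambda f: ucb_frontier_score(*frontiers[f], total))
```

Here the barely-observed frontier f2 wins despite its lower relevance estimate, which is the intended behavior: confidence in a low score must be earned through repeated observations before the swarm commits to ignoring a frontier.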
cs.RO / 29 / 2603.12936

MotionAnymesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins

MotionAnymesh:面向可仿真数字孪生的基于物理的铰接化
Xu, WenBo, Liu, Liu, Zhang, Li, Guo, Dan, Liu, RuoNan
Abstract
Converting static 3D meshes into interactable articulated assets is crucial for embodied AI and robotic simulation. However, existing zero-shot pipelines struggle with complex assets due to a critical lack of physical grounding. Specifically, ungrounded Vision-Language Models (VLMs) frequently suffer from kinematic hallucinations, while unconstrained joint estimation inevitably leads to catastrophic mesh inter-penetration during physical simulation. To bridge this gap, we propose MotionAnymesh, an automated zero-shot framework that seamlessly transforms unstructured static meshes into simulation-ready digital twins. Our method features a kinematic-aware part segmentation module that grounds VLM reasoning with explicit SP4D physical priors, effectively eradicating kinematic hallucinations. Furthermore, we introduce a geometry-physics joint estimation pipeline that combines robust type-aware initialization with physics-constrained trajectory optimization to rigorously guarantee collision-free articulation. Extensive experiments demonstrate that MotionAnymesh significantly outperforms state-of-the-art baselines in both geometric precision and dynamic physical executability, providing highly reliable assets for downstream applications.
Chinese Translation
将静态3D网格转换为可交互的铰接资产对于具身人工智能和机器人仿真至关重要。然而,现有的零样本管道在处理复杂资产时面临严重的物理基础不足问题。具体而言,缺乏物理接地的视觉-语言模型(VLMs)常常出现运动学幻觉,而不受约束的关节估计不可避免地导致物理仿真中的灾难性网格相互穿透。为了解决这一问题,我们提出了MotionAnymesh,一个自动化的零样本框架,能够无缝地将非结构化静态网格转换为可仿真的数字孪生。我们的方法具有一个运动学感知的部件分割模块,该模块通过显式的SP4D物理先验来为VLM推理提供物理接地,有效消除了运动学幻觉。此外,我们引入了一个几何-物理联合估计管道,该管道结合了稳健的类型感知初始化和物理约束的轨迹优化,以严格保证无碰撞的铰接运动。大量实验表明,MotionAnymesh在几何精度和动态物理可执行性方面显著优于最先进的基线,为下游应用提供了高度可靠的资产。
cs.RO / 30 / 2603.12939

RoboStream: Weaving Spatio-Temporal Reasoning with Memory in Vision-Language Models for Robotics

RoboStream:在机器人视觉语言模型中融合时空推理与记忆
Huang, Yuzhi, Wu, Jie, Bu, Weijue, Xiong, Ziyi, Jiang, Gaoyang, Li, Ye, Ji, Kangye, Xie, Shuzhao, Huang, Yue, Wu, Chenglei, Jiang, Jingyan, Wang, Zhi
Abstract
Enabling reliable long-horizon robotic manipulation is a crucial step toward open-world embodied intelligence. However, VLM-based planners treat each step as an isolated observation-to-action mapping, forcing them to reinfer scene geometry from raw pixels at every decision point while remaining unaware of how prior actions have reshaped the environment. Despite strong short-horizon performance, these systems lack the spatio-temporal reasoning required for persistent geometric anchoring and memory of action-triggered state transitions. Without persistent state tracking, perceptual errors accumulate across the execution horizon, temporarily occluded objects are catastrophically forgotten, and these compounding failures lead to precondition violations that cascade through subsequent steps. In contrast, humans maintain a persistent mental model that continuously tracks spatial relations and action consequences across interactions rather than reconstructing them at each instant. Inspired by this human capacity for causal spatio-temporal reasoning with persistent memory, we propose RoboStream, a training-free framework that achieves geometric anchoring through Spatio-Temporal Fusion Tokens (STF-Tokens), which bind visual evidence to 3D geometric attributes for persistent object grounding, and maintains causal continuity via a Causal Spatio-Temporal Graph (CSTG) that records action-triggered state transitions across steps. This design enables the planner to trace causal chains and preserve object permanence under occlusion without additional training or fine-tuning. RoboStream achieves 90.5% on long-horizon RLBench and 44.4% on challenging real-world block-building tasks, where both SoFar and VoxPoser score 11.1%, demonstrating that spatio-temporal reasoning and causal memory are critical missing components for reliable long-horizon manipulation.
Chinese Translation
实现可靠的长时程机器人操作是迈向开放世界具身智能的重要一步。然而,基于视觉语言模型(VLM)的规划器将每一步视为孤立的观察到行动的映射,这迫使它们在每个决策点重新从原始像素中推断场景几何,同时对先前的行动如何重塑环境毫无察觉。尽管在短时程任务上表现出色,这些系统缺乏持久几何锚定和记忆行动触发状态转变所需的时空推理。没有持久的状态跟踪,感知误差在执行过程中累积,暂时被遮挡的物体被灾难性地遗忘,这些累积的失败导致前提条件的违反,并在后续步骤中产生连锁反应。相比之下,人类保持一个持久的心理模型,在交互过程中持续跟踪空间关系和行动后果,而不是在每个瞬间重建它们。受到这种具备持久记忆的人类因果时空推理能力的启发,我们提出了RoboStream,一个免训练框架,通过时空融合标记(Spatio-Temporal Fusion Tokens, STF-Tokens)实现几何锚定,将视觉证据与3D几何属性绑定,以实现持久的物体接地,并通过因果时空图(Causal Spatio-Temporal Graph, CSTG)维护因果连续性,记录跨步骤的行动触发状态转变。该设计使规划器能够追踪因果链,并在遮挡下保持物体恒存性,而无需额外的训练或微调。RoboStream在长时程RLBench上取得了90.5%的成绩,在具有挑战性的真实世界搭积木任务中取得了44.4%的成绩,而SoFar和VoxPoser的得分均为11.1%,这表明时空推理和因果记忆是实现可靠长时程操作所缺失的关键组成部分。
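The CSTG idea, recording action-triggered state transitions so the planner can recall an occluded object's state instead of re-perceiving it, can be sketched as a tiny append-only transition log. The object names, states, and API below are invented for illustration, not RoboStream's actual graph structure:

```python
class CausalSTGraph:
    """Minimal sketch of a causal spatio-temporal graph: each executed action
    appends a state-transition edge, so the planner can recover an object's
    latest known state (e.g. under occlusion) instead of re-perceiving it."""

    def __init__(self):
        self.edges = []  # (step, action, object, state_before, state_after)

    def record(self, step, action, obj, before, after):
        self.edges.append((step, action, obj, before, after))

    def latest_state(self, obj):
        """Walk the causal chain backward to the most recent transition of obj."""
        for step, action, o, before, after in reversed(self.edges):
            if o == obj:
                return after
        return None

g = CausalSTGraph()
g.record(1, "pick", "red_block", "on_table", "in_gripper")
g.record(2, "place", "red_block", "in_gripper", "on_blue_block")
g.record(3, "pick", "green_block", "on_table", "in_gripper")  # red_block now occluded
state = g.latest_state("red_block")
```

Even after step 3 hides the red block from view, its last recorded transition still answers where it is, which is the object-permanence behavior the abstract attributes to the CSTG.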
cs.RO / 31 / 2603.12940

Coordinated Manipulation of Hybrid Deformable-Rigid Objects in Constrained Environments

约束环境中混合可变形-刚性物体的协调操控
Peringal, Anees, Mathew, Anup Teejo, Liatsis, Panagiotis, Renda, Federico
Abstract
Coordinated robotic manipulation of deformable linear objects (DLOs), such as ropes and cables, has been widely studied; however, handling hybrid assemblies composed of both deformable and rigid elements in constrained environments remains challenging. This work presents a quasi-static optimization-based manipulation planner that employs a strain-based Cosserat rod model, extending rigid-body formulations to hybrid deformable linear objects (hDLO). The proposed planner exploits the compliance of deformable links to maneuver through constraints while achieving task-space objectives for the object that are unreachable with rigid tools. By leveraging a differentiable model with analytically derived gradients, the method achieves up to a 33x speedup over finite-difference baselines for inverse kinetostatic (IKS) problems. Furthermore, the subsequent trajectory optimization problem, warm-started using the IKS solution, is only practically realizable via analytical derivatives. The proposed algorithm is validated in simulation on various hDLO systems and experimentally on a three-link hDLO manipulated in a constrained environment using a dual-arm robotic system. Experimental results confirm the planner's accuracy, yielding an average deformation error of approximately 3 cm (5% of the deformable link length) between the desired and measured marker positions. Finally, the proposed optimal planner is compared against a sampling-based feasibility planner adapted to the strain-based formulation. The results demonstrate the effectiveness and applicability of the proposed approach for robotic manipulation of hybrid assemblies in constrained environments.
Chinese Translation
可变形线性物体(DLOs)的协调机器人操控,如绳索和电缆,已得到广泛研究;然而,在约束环境中处理由可变形和刚性元素组成的混合组件仍然具有挑战性。本研究提出了一种基于准静态优化的操控规划器,该规划器采用基于应变的Cosserat杆模型,将刚体模型扩展到混合可变形线性物体(hDLO)。所提出的规划器利用可变形连接的柔顺性在约束中进行操控,同时实现物体的任务空间目标,这些目标是使用刚性工具无法达到的。通过利用具有解析导数的可微模型,该方法在逆运动静力学(IKS)问题上实现了相较于有限差分基线高达33倍的加速。此外,后续的轨迹优化问题在使用IKS解进行热启动时,仅通过解析导数才能在实际中实现。所提出的算法在各种hDLO系统的仿真中进行了验证,并在使用双臂机器人系统的约束环境中对三连杆hDLO进行了实验操控。实验结果确认了规划器的准确性,期望和测量标记位置之间的平均变形误差约为3厘米(可变形连接长度的5%)。最后,所提出的最优规划器与适应于基于应变的模型的基于采样的可行性规划器进行了比较。结果证明了所提出方法在约束环境中对混合组件进行机器人操控的有效性和适用性。
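The abstract attributes the reported 33x IKS speedup to analytically derived gradients of a differentiable model, versus a finite-difference baseline. A toy illustration of the two gradient routes agreeing (a hypothetical 2-DOF forward model standing in for the strain-based Cosserat rod); the analytical route avoids the repeated forward-model evaluations that make finite differencing slow:

```python
import numpy as np

def tip(q):
    # hypothetical 2-DOF toy forward model (NOT the Cosserat rod itself)
    return np.array([np.cos(q[0]) + np.cos(q[0] + q[1]),
                     np.sin(q[0]) + np.sin(q[0] + q[1])])

def jac_analytic(q):
    # closed-form Jacobian of `tip`, one evaluation, no extra model calls
    s1, s12 = np.sin(q[0]), np.sin(q[0] + q[1])
    c1, c12 = np.cos(q[0]), np.cos(q[0] + q[1])
    return np.array([[-s1 - s12, -s12],
                     [ c1 + c12,  c12]])

def jac_fd(q, h=1e-6):
    # finite-difference baseline: two forward-model calls per variable
    J = np.zeros((2, 2))
    for j in range(2):
        dq = np.zeros(2)
        dq[j] = h
        J[:, j] = (tip(q + dq) - tip(q - dq)) / (2 * h)
    return J

q = np.array([0.3, 0.7])
assert np.allclose(jac_analytic(q), jac_fd(q), atol=1e-6)
```

For an expensive rod model the per-variable forward calls dominate, which is where an analytical (or autodiff) Jacobian wins.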
cs.RO / 32 / 2603.12942

ReMem-VLA: Empowering Vision-Language-Action Model with Memory via Dual-Level Recurrent Queries

ReMem-VLA:通过双层递归查询增强视觉-语言-动作模型的记忆能力
Li, Hang, Shen, Fengyi, Chen, Dong, Yang, Liudi, Wang, Xudong, Shi, Jinkui, Bing, Zhenshan, Liu, Ziyuan, Knoll, Alois
Abstract
Vision-language-action (VLA) models for closed-loop robot control are typically cast under the Markov assumption, making them prone to errors on tasks requiring historical context. To incorporate memory, existing VLAs either retrieve from a memory bank, which can be misled by distractors, or extend the frame window, whose fixed horizon still limits long-term retention. In this paper, we introduce ReMem-VLA, a Recurrent Memory VLA model equipped with two sets of learnable queries: frame-level recurrent memory queries for propagating information across consecutive frames to support short-term memory, and chunk-level recurrent memory queries for carrying context across temporal chunks for long-term memory. These queries are trained end-to-end to aggregate and maintain relevant context over time, implicitly guiding the model's decisions without additional training or inference cost. Furthermore, to enhance visual memory, we introduce Past Observation Prediction as an auxiliary training objective. Through extensive memory-centric simulation and real-world robot experiments, we demonstrate that ReMem-VLA exhibits strong memory capabilities across multiple dimensions, including spatial, sequential, episodic, temporal, and visual memory. ReMem-VLA significantly outperforms memory-free VLA baselines $\pi$0.5 and OpenVLA-OFT and surpasses MemoryVLA on memory-dependent tasks by a large margin.
Chinese Translation
闭环机器人控制的视觉-语言-动作(VLA)模型通常在马尔可夫假设下构建,这使得它们在需要历史上下文的任务中容易出错。为了整合记忆,现有的VLA要么从记忆库中检索信息,但可能受到干扰物的误导,要么扩展帧窗口,但固定的视野仍然限制了长期记忆的保持。在本文中,我们提出了ReMem-VLA,一种递归记忆VLA模型,配备了两组可学习的查询:帧级递归记忆查询用于在连续帧之间传播信息以支持短期记忆,以及块级递归记忆查询用于在时间块之间传递上下文以实现长期记忆。这些查询经过端到端训练,以聚合和保持相关上下文,隐式地指导模型的决策,而无需额外的训练或推理成本。此外,为了增强视觉记忆,我们引入了过去观察预测作为辅助训练目标。通过广泛的以记忆为中心的模拟和真实世界的机器人实验,我们证明ReMem-VLA在空间、序列、情节、时间和视觉记忆等多个维度上展现出强大的记忆能力。ReMem-VLA显著优于无记忆的VLA基线模型$\pi$0.5和OpenVLA-OFT,并在记忆依赖任务上大幅超越MemoryVLA。
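The two query sets can be pictured as small token banks that are repeatedly read and rewritten through attention: frame-level queries are updated every frame, chunk-level queries once per chunk. A schematic numpy sketch of that recurrence pattern (random features, plain scaled dot-product attention without learned projections; an illustration, not the ReMem-VLA architecture):

```python
import numpy as np

def attend(queries, keys, values):
    # scaled dot-product attention (no learned projections in this toy)
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ values

rng = np.random.default_rng(0)
d = 8
frame_q = rng.normal(size=(2, d))  # frame-level recurrent memory queries
chunk_q = rng.normal(size=(2, d))  # chunk-level recurrent memory queries

chunks = [rng.normal(size=(5, d)) for _ in range(3)]  # 3 chunks x 5 frames
for frame_tokens in chunks:
    for tok in frame_tokens:                 # short-term: frame -> frame
        ctx = np.vstack([tok[None, :], frame_q])
        frame_q = attend(frame_q, ctx, ctx)  # carry info to the next frame
    ctx = np.vstack([frame_q, chunk_q])      # long-term: chunk -> chunk
    chunk_q = attend(chunk_q, ctx, ctx)      # carry context across chunks

assert frame_q.shape == (2, d) and chunk_q.shape == (2, d)
assert np.isfinite(frame_q).all() and np.isfinite(chunk_q).all()
```

Because the query banks have fixed size, memory cost stays constant regardless of episode length.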
cs.RO / 33 / 2603.12960

Efficient Real-World Autonomous Racing via Attenuated Residual Policy Optimization

通过衰减残差策略优化实现高效的真实世界自主赛车
Trumpp, Raphael, Hoornaert, Denis, Theile, Mirco, Caccamo, Marco
Abstract
Residual policy learning (RPL), in which a learned policy refines a static base policy using deep reinforcement learning (DRL), has shown strong performance across various robotic applications. Its effectiveness is particularly evident in autonomous racing, a domain that serves as a challenging benchmark for real-world DRL. However, deploying RPL-based controllers introduces system complexity and increases inference latency. We address this by introducing an extension of RPL named attenuated residual policy optimization ($\alpha$-RPO). Unlike standard RPL, $\alpha$-RPO yields a standalone neural policy by progressively attenuating the base policy, which initially serves to bootstrap learning. Furthermore, this mechanism enables a form of privileged learning, where the base policy is permitted to use sensor modalities not required for final deployment. We design $\alpha$-RPO to integrate seamlessly with PPO, ensuring that the attenuated influence of the base controller is dynamically compensated during policy optimization. We evaluate $\alpha$-RPO by building a framework for 1:10-scaled autonomous racing around it. In both simulation and zero-shot real-world transfer to Roboracer cars, $\alpha$-RPO not only reduces system complexity but also improves driving performance compared to baselines - demonstrating its practicality for robotic deployment. Our code is available at: https://github.com/raphajaner/arpo_racing.
Chinese Translation
残差策略学习(Residual Policy Learning, RPL)是一种利用深度强化学习(Deep Reinforcement Learning, DRL)对静态基础策略进行优化的学习方法,在各种机器人应用中表现出色。其有效性在自主赛车领域尤为明显,该领域为真实世界的DRL提供了一个具有挑战性的基准。然而,基于RPL的控制器的部署会引入系统复杂性并增加推理延迟。为此,我们提出了一种RPL的扩展方法,称为衰减残差策略优化(Attenuated Residual Policy Optimization, $\alpha$-RPO)。与标准RPL不同,$\alpha$-RPO通过逐步衰减基础策略来生成独立的神经策略,基础策略最初用于引导学习。此外,该机制还实现了一种特权学习形式,允许基础策略使用最终部署不需要的传感器模态。我们设计的$\alpha$-RPO能够与PPO(Proximal Policy Optimization)无缝集成,确保在策略优化过程中基础控制器的衰减影响能够动态补偿。我们通过围绕$\alpha$-RPO构建一个1:10比例的自主赛车框架来评估其性能。在仿真和零样本真实世界迁移到Roboracer汽车的实验中,$\alpha$-RPO不仅降低了系统复杂性,还改善了驾驶性能,相较于基线表现出更好的实用性。我们的代码可在以下链接获取:https://github.com/raphajaner/arpo_racing。
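The core mechanism is a scheduled blend: the learned residual is added to a base action whose influence is progressively attenuated until the residual network acts alone. A minimal sketch (a hypothetical linear schedule; the paper's actual attenuation and its PPO-side compensation are more involved):

```python
import numpy as np

def alpha_schedule(step, total):
    # hypothetical linear attenuation of the base policy's influence
    return max(0.0, 1.0 - step / total)

def act(base_action, residual_action, step, total):
    # early on, the base controller bootstraps learning; by the end,
    # the residual network is a standalone policy (no base at deploy)
    return alpha_schedule(step, total) * base_action + residual_action

base, res = np.array([1.0]), np.array([0.2])
assert np.allclose(act(base, res, step=0, total=100), [1.2])
assert np.allclose(act(base, res, step=100, total=100), [0.2])
```

Because the base term vanishes by the end of training, the deployed controller needs neither the base policy's code nor its (possibly privileged) sensor inputs.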
cs.RO / 34 / 2603.12967

Language-Grounded Decoupled Action Representation for Robotic Manipulation

基于语言的解耦动作表示用于机器人操作
Weng, Wuding, Wu, Tongshu, Chen, Liucheng, Xie, Siyu, Wang, Zheng, Xu, Xing, Song, Jingkuan, Shen, Heng Tao
Abstract
The heterogeneity between high-level vision-language understanding and low-level action control remains a fundamental challenge in robotic manipulation. Although recent methods have advanced task-specific action alignment, they often struggle to generate robust and accurate actions for novel or semantically related tasks. To address this, we propose the Language-Grounded Decoupled Action Representation (LaDA) framework, which leverages natural language as a semantic bridge to connect perception and control. LaDA introduces a fine-grained intermediate layer of three interpretable action primitives--translation, rotation, and gripper control--providing explicit semantic structure for low-level actions. It further employs a semantic-guided soft-label contrastive learning objective to align similar action primitives across tasks, enhancing generalization and motion consistency. An adaptive weighting strategy, inspired by curriculum learning, dynamically balances contrastive and imitation objectives for stable and effective training. Extensive experiments on simulated benchmarks (LIBERO and MimicGen) and real-world demonstrations validate that LaDA achieves strong performance and generalizes effectively to unseen or related tasks.
Chinese Translation
高层次视觉-语言理解与低层次动作控制之间的异质性仍然是机器人操作中的一个基本挑战。尽管近期的方法在任务特定的动作对齐方面取得了进展,但它们在为新颖或语义相关的任务生成稳健且准确的动作时常常面临困难。为了解决这一问题,我们提出了基于语言的解耦动作表示框架(Language-Grounded Decoupled Action Representation,LaDA),该框架利用自然语言作为语义桥梁,连接感知与控制。LaDA引入了一个细粒度的中间层,包含三种可解释的动作原语——平移、旋转和夹持控制——为低层次动作提供明确的语义结构。它进一步采用语义引导的软标签对比学习目标,以对齐任务间相似的动作原语,从而增强泛化能力和运动一致性。受课程学习启发的自适应加权策略动态平衡对比和模仿目标,以实现稳定和有效的训练。在模拟基准(LIBERO和MimicGen)和真实世界演示上的大量实验验证了LaDA在性能上表现优异,并能有效地泛化到未见或相关任务。
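The adaptive weighting strategy can be sketched as a curriculum schedule that ramps the contrastive term in against the imitation term. A toy version (hypothetical linear ramp and weight cap, not the paper's exact schedule):

```python
def curriculum_weight(epoch, total, w_max=0.5):
    # ramp the contrastive term in gradually (curriculum-inspired toy)
    return w_max * min(1.0, epoch / (0.5 * total))

def total_loss(l_imitation, l_contrastive, epoch, total):
    w = curriculum_weight(epoch, total)
    return (1 - w) * l_imitation + w * l_contrastive

# early training: pure imitation; later: balanced with contrastive term
assert total_loss(1.0, 2.0, epoch=0, total=100) == 1.0
assert abs(total_loss(1.0, 2.0, epoch=100, total=100) - 1.5) < 1e-9
```

The intuition is that imitation stabilizes early training before the cross-task alignment objective is allowed to reshape the representation.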
cs.RO / 35 / 2603.12994

Route Fragmentation Based on Resource-centric Prioritisation for Efficient Multi-Robot Path Planning in Agricultural Environments

基于资源中心优先级的路径碎片化方法在农业环境中实现高效多机器人路径规划
Heselden, James R., Das, Gautham P.
Abstract
Agricultural environments present high proportions of spatially dense navigation bottlenecks for long-term navigation and operational planning of agricultural mobile robots. The existing agent-centric multi-robot path planning (MRPP) approaches resolve conflicts from the perspective of agents, rather than from the resources under contention. Further, the density of such contentions limits the capabilities of spatial interleaving, a concept that many planners rely on to achieve high throughput. In this work, two variants of the priority-based Fragment Planner (FP) are presented as resource-centric MRPP algorithms that leverage route fragmentation to enable partial route progression and limit the impact of binary-based waiting. These approaches are evaluated in lifelong simulation over a 3.6km topological map representing a commercial polytunnel environment. Their performances are contrasted against 5 baseline algorithms with varying robotic fleet sizes. The Fragment Planners achieved significant gains in throughput compared with Prioritised Planning (PP) and Priority-Based Search (PBS) algorithms. They further demonstrated a task throughput of 95% of the optimal task throughput over the same time period. This work shows that, for long-term deployment of agricultural robots in corridor-dominant agricultural environments, resource-centric MRPP approaches are a necessity for high-efficacy operational planning.
Chinese Translation
农业环境中存在高比例的空间密集型导航瓶颈,这对农业移动机器人的长期导航和运营规划造成了挑战。现有的以代理为中心的多机器人路径规划(MRPP)方法从代理的角度解决冲突,而不是从争用资源的角度出发。此外,这种争用的密度限制了空间交错的能力,而空间交错是许多规划器依赖以实现高吞吐量的概念。在本研究中,提出了两种基于优先级的碎片规划器(Fragment Planner, FP)变体,作为资源中心的MRPP算法,利用路径碎片化来实现部分路径进展并限制二值化等待的影响。这些方法在一个3.6公里的拓扑图上进行了长期模拟评估,该图代表了一个商业大棚(polytunnel)环境。在不同规模的机器人车队下,其性能与5种基线算法进行了对比。与优先规划(Prioritised Planning, PP)和基于优先级搜索(Priority-Based Search, PBS)算法相比,碎片规划器在吞吐量方面取得了显著提升。此外,在相同时间段内,其任务吞吐量达到了最优任务吞吐量的95%。这项工作表明,对于在走廊主导的农业环境中长期部署农业机器人,资源中心的MRPP方法是实现高效运营规划的必要条件。
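Route fragmentation can be illustrated as splitting a topological route at contested nodes, so a robot advances fragment by fragment instead of waiting out the whole route in one binary go/no-go decision. A toy sketch (hypothetical reservation set, not the FP algorithm itself):

```python
def fragment_route(route, reserved):
    """Split a route into fragments at nodes currently reserved by
    other robots; the robot can progress to each contention point."""
    fragments, current = [], []
    for node in route:
        if node in reserved:
            if current:
                fragments.append(current)
            current = []  # wait here; resume when the node frees up
        else:
            current.append(node)
    if current:
        fragments.append(current)
    return fragments

route = ["a", "b", "c", "d", "e", "f"]
assert fragment_route(route, reserved={"c"}) == [["a", "b"], ["d", "e", "f"]]
assert fragment_route(route, reserved=set()) == [route]
```

Partial progression like this is what keeps throughput up in corridor-dominant maps where full-route reservations would serialize the fleet.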
cs.RO / 36 / 2603.13003

From Passive Monitoring to Active Defence: Resilient Control of Manipulators Under Cyberattacks

从被动监测到主动防御:在网络攻击下的操纵器韧性控制
Gualandi, Gabriele, Papadopoulos, Alessandro V.
Abstract
Cyber-physical robotic systems are vulnerable to false data injection attacks (FDIAs), in which an adversary corrupts sensor signals while evading residual-based passive anomaly detectors such as the chi-squared test. Such stealthy attacks can induce substantial end-effector deviations without triggering alarms. This paper studies the resilience of redundant manipulators to stealthy FDIAs and advances the architecture from passive monitoring to active defence. We formulate a closed-loop model comprising a feedback-linearized manipulator, a steady-state Kalman filter, and a chi-squared-based anomaly detector. Building on this passive monitoring layer, we propose an active control-level defence that attenuates the control input through a monotone function of an anomaly score generated by a novel actuation-projected, measurement-free state predictor. The proposed design provides probabilistic guarantees on nominal actuation loss and preserves closed-loop stability. From the attacker perspective, we derive a convex QCQP for computing one-step optimal stealthy attacks. Simulations on a 6-DOF planar manipulator show that the proposed defence significantly reduces attack-induced end-effector deviation while preserving nominal task performance in the absence of attacks.
Chinese Translation
网络物理机器人系统易受到虚假数据注入攻击(FDIAs)的影响,其中对手在规避基于残差的被动异常检测器(如卡方检验)的同时,破坏传感器信号。这种隐蔽攻击可以在不触发警报的情况下引起显著的末端执行器偏差。本文研究了冗余操纵器对隐蔽FDIAs的韧性,并将架构从被动监测推进到主动防御。我们构建了一个闭环模型,包括反馈线性化操纵器、稳态卡尔曼滤波器和基于卡方的异常检测器。在此被动监测层的基础上,我们提出了一种主动控制级别的防御,通过一种单调函数对由新型激励投影、无测量状态预测器生成的异常分数进行控制输入的衰减。所提设计对名义激励损失提供了概率保证,并保持了闭环稳定性。从攻击者的角度出发,我们推导出一个凸二次约束规划(QCQP),用于计算一步最优隐蔽攻击。在一个6自由度平面操纵器上的仿真结果表明,所提防御显著减少了攻击引起的末端执行器偏差,同时在没有攻击的情况下保持了名义任务性能。
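The passive layer is a standard chi-squared test on Kalman-filter residuals; the active layer scales the control input down by a monotone function of an anomaly score. A minimal numerical sketch of that detect-then-attenuate pattern (hypothetical gain `k`; the paper's measurement-free predictor and its probabilistic guarantees are not reproduced here):

```python
import numpy as np

def chi2_score(residual, S_inv):
    # normalized innovation squared from a Kalman-filter residual
    return float(residual @ S_inv @ residual)

def attenuate(u, score, k=0.5):
    # monotone attenuation: larger anomaly score -> less control authority
    return u / (1.0 + k * score)

S_inv = np.eye(2)
u = np.array([1.0, -2.0])
calm = attenuate(u, chi2_score(np.zeros(2), S_inv))
alarmed = attenuate(u, chi2_score(np.array([3.0, 0.0]), S_inv))
assert np.allclose(calm, u)                          # no anomaly: full actuation
assert np.linalg.norm(alarmed) < np.linalg.norm(u)   # anomaly: damped input
```

Unlike a hard alarm threshold, the attenuation degrades actuation smoothly, which is what limits the damage of attacks tuned to stay just under the detector's threshold.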
cs.RO / 37 / 2603.13098

SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design

SldprtNet:用于语言驱动的3D设计中的CAD生成的大规模多模态数据集
Li, Ruogu, Li, Sikai, Mu, Yao, Ding, Mingyu
Abstract
We introduce SldprtNet, a large-scale dataset comprising over 242,000 industrial parts, designed for semantic-driven CAD modeling, geometric deep learning, and the training and fine-tuning of multimodal models for 3D design. The dataset provides 3D models in both .step and .sldprt formats to support diverse training and testing. To enable parametric modeling and facilitate dataset scalability, we developed supporting tools, an encoder and a decoder, which support 13 types of CAD commands and enable lossless transformation between 3D models and a structured text representation. Additionally, each sample is paired with a composite image created by merging seven rendered views from different viewpoints of the 3D model, effectively reducing input token length and accelerating inference. By combining this image with the parameterized text output from the encoder, we employ the lightweight multimodal language model Qwen2.5-VL-7B to generate a natural language description of each part's appearance and functionality. To ensure accuracy, we manually verified and aligned the generated descriptions, rendered images, and 3D models. These descriptions, along with the parameterized modeling scripts, rendered images, and 3D model files, are fully aligned to construct SldprtNet. To assess its effectiveness, we fine-tuned baseline models on a dataset subset, comparing image-plus-text inputs with text-only inputs. Results confirm the necessity and value of multimodal datasets for CAD generation. It features carefully selected real-world industrial parts, supporting tools for scalable dataset expansion, diverse modalities, and ensured diversity in model complexity and geometric features, making it a comprehensive multimodal dataset built for semantic-driven CAD modeling and cross-modal learning.
Chinese Translation
我们介绍了SldprtNet,这是一个包含超过242,000个工业零件的大规模数据集,旨在用于语义驱动的CAD建模、几何深度学习,以及多模态模型在3D设计中的训练和微调。该数据集提供了以.step和.sldprt格式的3D模型,以支持多样化的训练和测试。为了实现参数化建模并促进数据集的可扩展性,我们开发了支持工具,包括一个编码器和一个解码器,支持13种CAD命令,并实现3D模型与结构化文本表示之间的无损转换。此外,每个样本都与一个复合图像配对,该图像通过合并来自3D模型不同视角的七个渲染视图创建,有效减少了输入标记长度并加速了推理。通过将该图像与编码器的参数化文本输出结合,我们采用轻量级多模态语言模型Qwen2.5-VL-7B生成每个零件外观和功能的自然语言描述。为了确保准确性,我们手动验证并对生成的描述、渲染图像和3D模型进行了对齐。这些描述以及参数化建模脚本、渲染图像和3D模型文件完全对齐,以构建SldprtNet。为了评估其有效性,我们在数据集子集上微调了基线模型,比较了图像加文本输入与仅文本输入的效果。结果确认了多模态数据集在CAD生成中的必要性和价值。该数据集特征包括精心挑选的真实工业零件、支持可扩展数据集扩展的工具、多样化的模态,以及确保模型复杂性和几何特征的多样性,使其成为一个全面的多模态数据集,专为语义驱动的CAD建模和跨模态学习而构建。
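The encoder/decoder pair maps between 3D models and a structured text representation losslessly. A toy round-trip sketch with two invented command types (the real tools support 13 CAD command types and operate on .step/.sldprt geometry, not strings):

```python
def encode(commands):
    """Toy lossless text encoding of parametric CAD commands."""
    return "\n".join(f"{op}({','.join(map(str, args))})"
                     for op, args in commands)

def decode(text):
    out = []
    for line in text.splitlines():
        op, rest = line.split("(", 1)
        args = [float(a) for a in rest.rstrip(")").split(",")]
        out.append((op, args))
    return out

part = [("extrude", [10.0, 2.5]), ("fillet", [0.5])]
assert decode(encode(part)) == part  # round trip is lossless
```

A lossless text form is what lets a language model emit modeling scripts that reconstruct the exact part, and it is also what makes the dataset scalable (new parts can be encoded automatically).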
cs.RO / 38 / 2603.13100

Evaluating VLMs' Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences

评估视觉语言模型在机器人运动中的空间推理能力:朝着具有运动偏好的机器人规划迈出一步
Wu, Wenxi, Zhang, Jingjing, Brandão, Martim
Abstract
Understanding user instructions and object spatial relations in surrounding environments is crucial for intelligent robot systems to assist humans in various tasks. The natural language and spatial reasoning capabilities of Vision-Language Models (VLMs) have the potential to enhance the generalization of robot planners on new tasks, objects, and motion specifications. While foundation models have been applied to task planning, it is still unclear the degree to which they have the capability of spatial reasoning required to enforce user preferences or constraints on motion, such as desired distances from objects, topological properties, or motion style preferences. In this paper, we evaluate the capability of four state-of-the-art VLMs at spatial reasoning over robot motion, using four different querying methods. Our results show that, with the highest-performing querying method, Qwen2.5-VL achieves 71.4% accuracy zero-shot and 75% on a smaller model after fine-tuning, and GPT-4o leads to lower performance. We evaluate two types of motion preferences (object-proximity and path-style), and we also analyze the trade-off between accuracy and computation cost in number of tokens. This work shows some promise in the potential of VLM integration with robot motion planning pipelines.
Chinese Translation
理解用户指令和周围环境中物体的空间关系对于智能机器人系统在各种任务中协助人类至关重要。视觉语言模型(Vision-Language Models, VLMs)的自然语言和空间推理能力有潜力增强机器人规划者在新任务、物体和运动规范上的泛化能力。尽管基础模型已被应用于任务规划,但尚不清楚它们在空间推理方面的能力在多大程度上能够满足用户对运动的偏好或约束,例如与物体的期望距离、拓扑特性或运动风格偏好。本文评估了四种最先进的视觉语言模型在机器人运动中的空间推理能力,使用了四种不同的查询方法。我们的结果表明,在表现最佳的查询方法下,Qwen2.5-VL在零样本情况下达到了71.4%的准确率,而在经过微调后较小模型的准确率为75%,而GPT-4o的表现较低。我们评估了两种类型的运动偏好(物体接近性和路径风格),并分析了准确性与计算成本(以标记数为单位)之间的权衡。这项工作展示了视觉语言模型与机器人运动规划管道集成的潜力。
cs.RO / 39 / 2603.13103

A Feasibility-Enhanced Control Barrier Function Method for Multi-UAV Collision Avoidance

一种增强可行性的多无人机碰撞避免控制障碍函数方法
Zhong, Qishen, Wu, Junlong, Yang, Jian, Xiao, Guanwei, Wu, Junqi, Jiang, Zimeng, Fang, Pingan
Abstract
This paper presents a feasibility-enhanced control barrier function (FECBF) framework for multi-UAV collision avoidance. In dense multi-UAV scenarios, the feasibility of the CBF quadratic program (CBF-QP) can be compromised due to internal incompatibility among multiple CBF constraints. To address this issue, we analyze the internal compatibility of CBF constraints and derive a sufficient condition for internal compatibility. Based on this condition, a sign-consistency constraint is introduced to mitigate internal incompatibility. The proposed constraint is incorporated into a decentralized CBF-QP formulation using worst-case estimates and slack variables. Simulation results demonstrate that the proposed method significantly reduces infeasibility and improves collision avoidance performance compared with existing baselines in dense scenarios. Additional simulations under varying time delays demonstrate the robustness of the proposed method. Real-world experiments validate the practical applicability of the proposed method.
Chinese Translation
本文提出了一种用于多无人机碰撞避免的增强可行性控制障碍函数(FECBF)框架。在密集的多无人机场景中,由于多个控制障碍函数(CBF)约束之间的内部不兼容性,CBF二次规划(CBF-QP)的可行性可能受到影响。为了解决这一问题,我们分析了CBF约束的内部兼容性,并推导出内部兼容性的充分条件。基于该条件,引入了符号一致性约束以减轻内部不兼容性。所提出的约束被纳入使用最坏情况估计和松弛变量的分散式CBF-QP公式中。仿真结果表明,与现有基准相比,所提出的方法显著降低了不可行性,并改善了在密集场景中的碰撞避免性能。在不同时间延迟下的额外仿真进一步验证了所提出方法的鲁棒性。实际实验验证了该方法的实际适用性。
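For single-integrator dynamics, one CBF collision-avoidance constraint reduces to a half-space condition on the control input, and the QP projects the desired input onto it. A one-constraint toy (closed-form projection instead of a QP solver; the paper's decentralized formulation with slack variables and the sign-consistency constraint is not shown):

```python
import numpy as np

def cbf_constraint(p_i, p_j, d_safe, gamma=1.0):
    """Half-space a.u >= b for h = ||p_i - p_j||^2 - d_safe^2 with
    hdot + gamma*h >= 0, single-integrator dynamics (toy form)."""
    h = float((p_i - p_j) @ (p_i - p_j)) - d_safe**2
    a = 2.0 * (p_i - p_j)  # dh/dp_i
    b = -gamma * h
    return a, b

def cbf_qp(u_des, a, b):
    # closed-form solution of: min ||u - u_des||^2  s.t.  a.u >= b
    if a @ u_des >= b:
        return u_des
    return u_des + (b - a @ u_des) / (a @ a) * a

p_i, p_j = np.array([0.0, 0.0]), np.array([1.0, 0.0])
a, b = cbf_constraint(p_i, p_j, d_safe=1.5)
u = cbf_qp(np.array([1.0, 0.0]), a, b)  # desired motion toward neighbor
assert a @ u >= b - 1e-9                # safety constraint is enforced
assert u[0] < 1.0                       # approach speed is reduced
```

Infeasibility in dense scenes arises when several such half-spaces have an empty intersection; the paper's sign-consistency constraint is aimed at preventing exactly that internal incompatibility.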
cs.RO / 40 / 2603.13108

Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots

四足机器人全景多模态语义占用预测
Zhao, Guoqiang, Yang, Zhe, Wu, Sheng, Teng, Fei, Duan, Mengfei, Zheng, Yuanfan, Luo, Kai, Yang, Kailun
Abstract
Panoramic imagery provides holistic 360° visual coverage for perception in quadruped robots. However, existing occupancy prediction methods are mainly designed for wheeled autonomous driving and rely heavily on RGB cues, limiting their robustness in complex environments. To bridge this gap, (1) we present PanoMMOcc, the first real-world panoramic multimodal occupancy dataset for quadruped robots, featuring four sensing modalities across diverse scenes. (2) We propose a panoramic multimodal occupancy perception framework, VoxelHound, tailored for legged mobility and spherical imaging. Specifically, we design (i) a Vertical Jitter Compensation (VJC) module to mitigate severe viewpoint perturbations caused by body pitch and roll during mobility, enabling more consistent spatial reasoning, and (ii) an effective Multimodal Information Prompt Fusion (MIPF) module that jointly leverages panoramic visual cues and auxiliary modalities to enhance volumetric occupancy prediction. (3) We establish a benchmark based on PanoMMOcc and provide detailed data analysis to enable systematic evaluation of perception methods under challenging embodied scenarios. Extensive experiments demonstrate that VoxelHound achieves state-of-the-art performance on PanoMMOcc (+4.16% in mIoU). The dataset and code will be publicly released to facilitate future research on panoramic multimodal 3D perception for embodied robotic systems at https://github.com/SXDR/PanoMMOcc, along with the calibration tools released at https://github.com/losehu/CameraLiDAR-Calib.
Chinese Translation
全景图像为四足机器人提供了360°的整体视觉覆盖,增强了感知能力。然而,现有的占用预测方法主要针对轮式自主驾驶设计,过于依赖RGB线索,这限制了它们在复杂环境中的鲁棒性。为了解决这一问题,(1) 我们提出了PanoMMOcc,这是第一个针对四足机器人的真实场景全景多模态占用数据集,涵盖了四种传感模态和多样化的场景。(2) 我们提出了一种针对腿部移动和球形成像的全景多模态占用感知框架VoxelHound。具体而言,我们设计了(i) 一个垂直抖动补偿(VJC)模块,以减轻在移动过程中由于身体俯仰和滚转引起的严重视角扰动,从而实现更一致的空间推理;以及(ii) 一个有效的多模态信息提示融合(MIPF)模块,联合利用全景视觉线索和辅助模态,以增强体积占用预测。(3) 我们基于PanoMMOcc建立了一个基准,并提供详细的数据分析,以便在具有挑战性的具身场景下系统性地评估感知方法。大量实验表明,VoxelHound在PanoMMOcc上实现了最先进的性能(mIoU提高了4.16%)。该数据集和代码将公开发布,以促进未来在具身机器人系统的全景多模态3D感知方面的研究,链接为https://github.com/SXDR/PanoMMOcc,同时校准工具也将在https://github.com/losehu/CameraLiDAR-Calib发布。
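The VJC idea, undoing body pitch and roll so volumetric reasoning stays consistent during legged locomotion, can be pictured as a geometric de-rotation of sensed points. A toy stand-in (a pure rotation, whereas the paper's module is learned):

```python
import numpy as np

def vjc(points, pitch, roll):
    """Toy vertical-jitter compensation: rotate sensed 3D points back
    by the body's measured pitch/roll so the voxel frame stays level
    (a geometric stand-in for the paper's learned VJC module)."""
    cp, sp = np.cos(-pitch), np.sin(-pitch)
    cr, sr = np.cos(-roll), np.sin(-roll)
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])  # undo pitch
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])  # undo roll
    return points @ (Rx @ Ry).T

pts = np.array([[1.0, 0.0, 0.0], [0.0, 0.5, 2.0]])
tilted = vjc(pts, pitch=-0.2, roll=0.0)  # simulate a 0.2 rad body pitch
assert np.allclose(vjc(tilted, pitch=0.2, roll=0.0), pts)
```

For a quadruped, pitch and roll change stride to stride, so without some such compensation the same wall lands in different voxels at every step.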
cs.RO / 41 / 2603.13133

DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation

DecoVLN:解耦观察、推理与纠正的视觉与语言导航
Xin, Zihao, Li, Wentong, Jiang, Yixuan, Wang, Bin, Cong, Runming, Qin, Jie, Huang, Shengjun
Abstract
Vision-and-Language Navigation (VLN) requires agents to follow long-horizon instructions and navigate complex 3D environments. However, existing approaches face two major challenges: constructing an effective long-term memory bank and overcoming the compounding errors problem. To address these issues, we propose DecoVLN, an effective framework designed for robust streaming perception and closed-loop control in long-horizon navigation. First, we formulate long-term memory construction as an optimization problem and introduce adaptive refinement mechanism that selects frames from a historical candidate pool by iteratively optimizing a unified scoring function. This function jointly balances three key criteria: semantic relevance to the instruction, visual diversity from the selected memory, and temporal coverage of the historical trajectory. Second, to alleviate compounding errors, we introduce a state-action pair-level corrective finetuning strategy. By leveraging geodesic distance between states to precisely quantify deviation from the expert trajectory, the agent collects high-quality state-action pairs in the trusted region while filtering out the polluted data with low relevance. This improves both the efficiency and stability of error correction. Extensive experiments demonstrate the effectiveness of DecoVLN, and we have deployed it in real-world environments.
Chinese Translation
视觉与语言导航(VLN)要求代理遵循长时间跨度的指令并在复杂的三维环境中导航。然而,现有的方法面临两个主要挑战:构建有效的长期记忆库和克服累积误差问题。为了解决这些问题,我们提出了DecoVLN,这是一个旨在实现长时间跨度导航中稳健的流式感知和闭环控制的有效框架。首先,我们将长期记忆构建形式化为一个优化问题,并引入自适应精炼机制,通过迭代优化统一评分函数,从历史候选池中选择帧。该函数共同平衡三个关键标准:与指令的语义相关性、相对于已选记忆的视觉多样性,以及历史轨迹的时间覆盖。其次,为了减轻累积误差,我们引入了一种状态-动作对级别的纠正微调策略。通过利用状态之间的测地距离精确量化与专家轨迹的偏差,代理在可信区域内收集高质量的状态-动作对,同时过滤掉低相关性的污染数据。这提高了误差纠正的效率和稳定性。大量实验表明DecoVLN的有效性,我们已将其部署在现实环境中。
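The memory-refinement objective jointly scores semantic relevance, visual diversity, and temporal coverage. A greedy toy version of selecting k frames under such a unified score (hypothetical weights and frame fields; the paper optimizes iteratively rather than greedily):

```python
import numpy as np

def score(frame, selected, T, w=(1.0, 1.0, 1.0)):
    """Unified score: instruction relevance + visual diversity from the
    already-selected memory + temporal coverage (toy version)."""
    if selected:
        diversity = min(np.linalg.norm(frame["feat"] - s["feat"])
                        for s in selected)
        coverage = min(abs(frame["t"] - s["t"]) for s in selected) / T
    else:
        diversity, coverage = 1.0, 1.0
    return w[0] * frame["relevance"] + w[1] * diversity + w[2] * coverage

def select_memory(frames, k, T):
    selected, pool = [], list(frames)
    for _ in range(k):
        best = max(pool, key=lambda f: score(f, selected, T))
        selected.append(best)
        pool.remove(best)
    return selected

frames = [
    {"t": 0, "relevance": 0.9, "feat": np.array([1.0, 0.0])},
    {"t": 1, "relevance": 0.1, "feat": np.array([1.0, 0.0])},  # near-duplicate
    {"t": 2, "relevance": 0.5, "feat": np.array([0.0, 1.0])},
]
picked = select_memory(frames, k=2, T=3)
assert [f["t"] for f in picked] == [0, 2]  # the duplicate frame is skipped
```

The diversity and coverage terms are what keep the memory bank from filling up with redundant views of the same place.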
计算机视觉 (Computer Vision)
115
cs.CV / 1 / 2603.12310

VQQA: An Agentic Approach for Video Evaluation and Quality Improvement

VQQA:一种用于视频评估和质量改进的代理方法
Song, Yiwen, Pfister, Tomas, Song, Yale
Abstract
Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or require white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified, multi-agent framework generalizable across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially improving generation quality in just a few refinement steps. Applicable to both text-to-video (T2V) and image-to-video (I2V) tasks, our method achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state-of-the-art stochastic search and prompt optimization techniques.
Chinese Translation
尽管视频生成模型快速发展,但将其输出与复杂用户意图对齐仍然具有挑战性。现有的测试时优化方法通常要么计算成本高昂,要么需要对模型内部进行白盒访问。为了解决这个问题,我们提出了VQQA(视频质量问答),这是一个统一的多代理框架,能够在多种输入模式和视频生成任务中进行推广。通过动态生成视觉问题,并利用生成的视觉-语言模型(Vision-Language Model, VLM)评估作为语义梯度,VQQA用人类可解释的、可操作的反馈替代了传统的被动评估指标。这使得通过黑盒自然语言接口实现高效的闭环提示优化过程成为可能。大量实验表明,VQQA有效地隔离并解决了视觉伪影,在仅需少量优化步骤的情况下显著提高了生成质量。我们的算法适用于文本到视频(Text-to-Video, T2V)和图像到视频(Image-to-Video, I2V)任务,在T2V-CompBench上实现了+11.57%的绝对提升,在VBench2上实现了+8.43%的提升,显著超越了最先进的随机搜索和提示优化技术。
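The closed loop is: generate, ask the VLM dynamically generated visual questions, and fold its critiques back into the prompt through the black-box interface. A schematic sketch with stand-in callables for the generator and critic (the real agents and their question generation are far richer):

```python
def refine_prompt(prompt, generate, critique, steps=3):
    """Closed-loop black-box prompt optimization: generate, collect VLM
    critiques as actionable feedback, fold them back into the prompt.
    `generate` and `critique` are stand-ins for the real models."""
    video = generate(prompt)
    for _ in range(steps):
        issues = critique(video)  # answers to generated visual questions
        if not issues:
            break
        prompt = prompt + " Fix: " + "; ".join(issues)
        video = generate(prompt)
    return prompt, video

# toy stand-ins: the "video" is just the prompt string itself
gen = lambda p: p
crit = lambda v: [] if "red" in v else ["the ball should be red"]
p, v = refine_prompt("a bouncing ball", gen, crit)
assert crit(v) == []  # the critique is resolved after refinement
assert "red" in p
```

Because only prompts and natural-language critiques cross the interface, the same loop applies to any generator, T2V or I2V, without white-box access.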
cs.CV / 2 / 2603.12354

Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks

交替梯度流效用:深度网络中结构剪枝与动态路由的统一度量
Qian, Tianhao, Li, Zhuoxuan, Cao, Jinde, Shi, Xinli, Liu, Hanjie, Rutkowski, Leszek
Abstract
Efficient deep learning traditionally relies on static heuristics like weight magnitude or activation awareness (e.g., Wanda, RIA). While successful in unstructured settings, we observe a critical limitation when applying these metrics to the structural pruning of deep vision networks. These contemporary metrics suffer from a magnitude bias, failing to preserve critical functional pathways. To overcome this, we propose a decoupled kinetic paradigm inspired by Alternating Gradient Flow (AGF), utilizing an absolute feature-space Taylor expansion to accurately capture the network's structural "kinetic utility". First, we uncover a topological phase transition at extreme sparsity, where AGF successfully preserves baseline functionality and exhibits topological implicit regularization, avoiding the collapse seen in models trained from scratch. Second, transitioning to architectures without strict structural priors, we reveal a phenomenon of Sparsity Bottleneck in Vision Transformers (ViTs). Through a gradient-magnitude decoupling analysis, we discover that dynamic signals suffer from signal compression in converged models, rendering them suboptimal for real-time routing. Finally, driven by these empirical constraints, we design a hybrid routing framework that decouples AGF-guided offline structural search from online execution via zero-cost physical priors. We validate our paradigm on large-scale benchmarks: under a 75% compression stress test on ImageNet-1K, AGF effectively avoids the structural collapse where traditional metrics aggressively fall below random sampling. Furthermore, when systematically deployed for dynamic inference on ImageNet-100, our hybrid approach achieves Pareto-optimal efficiency. It reduces the usage of the heavy expert by approximately 50% (achieving an estimated overall cost of 0.92$\times$) without sacrificing the full-model accuracy.
Chinese Translation
高效的深度学习传统上依赖于静态启发式方法,如权重大小或激活意识(例如,Wanda,RIA)。虽然在无结构设置中取得了成功,但我们观察到在将这些度量应用于深度视觉网络的结构剪枝时存在一个关键限制。这些现代度量受到大小偏差的影响,未能保留关键的功能路径。为了解决这个问题,我们提出了一种受交替梯度流(AGF)启发的解耦动能范式,利用绝对特征空间的泰勒展开准确捕捉网络的结构“动能效用”。首先,我们揭示了在极端稀疏情况下的拓扑相变,在此情况下,AGF成功保留了基线功能,并表现出拓扑隐式正则化,避免了从头训练的模型中出现的崩溃现象。其次,在没有严格结构先验的架构中,我们揭示了视觉变换器(ViTs)中的稀疏瓶颈现象。通过梯度大小解耦分析,我们发现动态信号在收敛模型中遭受信号压缩,使其在实时路由中表现不佳。最后,基于这些经验约束,我们设计了一种混合路由框架,将AGF引导的离线结构搜索与通过零成本物理先验的在线执行解耦。我们在大规模基准测试中验证了我们的范式:在ImageNet-1K上进行75%的压缩压力测试时,AGF有效避免了结构崩溃,而传统度量的表现则大幅跌落至随机采样水平以下。此外,当在ImageNet-100上系统性地部署动态推理时,我们的混合方法实现了帕累托最优效率。它将重型专家的使用减少了约50%(实现了估计的整体成本为0.92×),而不牺牲全模型的准确性。
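The claimed magnitude bias is easy to reproduce in miniature: a channel with large weights but no gradient flow outranks a small-but-active channel under magnitude scoring, while a first-order Taylor utility ranks them the other way. A toy per-channel comparison (not the paper's exact AGF metric):

```python
import numpy as np

def magnitude_utility(W):
    # magnitude-style score per output channel (what Wanda/RIA-like
    # metrics reduce to in this toy, structural setting)
    return np.abs(W).sum(axis=1)

def taylor_utility(W, G):
    # absolute first-order Taylor term |w * g| summed per channel:
    # approximates the loss change if the channel were removed
    return np.abs(W * G).sum(axis=1)

W = np.array([[5.0, 5.0],    # channel 0: large weights...
              [0.1, 0.1]])   # channel 1: small weights...
G = np.array([[0.0, 0.0],    # ...but zero gradient flow (removable)
              [2.0, 2.0]])   # ...carrying active gradients (critical)
keep_mag = int(np.argmax(magnitude_utility(W)))
keep_taylor = int(np.argmax(taylor_utility(W, G)))
assert keep_mag == 0 and keep_taylor == 1  # the metrics disagree
```

At extreme structural sparsity the disagreement compounds across layers, which is the regime where the abstract reports magnitude metrics falling below random sampling.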
cs.CV / 3 / 2603.12369

Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization

人类知识集成的多模态学习用于单源领域泛化
Banerjee, Ayan, Thakur, Kuntal, Gupta, Sandeep
Abstract
Generalizing image classification across domains remains challenging in critical tasks such as fundus image-based diabetic retinopathy (DR) grading and resting-state fMRI seizure onset zone (SOZ) detection. When domains differ in unknown causal factors, achieving cross-domain generalization is difficult, and there is no established methodology to objectively assess such differences without direct metadata or protocol-level information from data collectors, which is typically inaccessible. We first introduce domain conformal bounds (DCB), a theoretical framework to evaluate whether domains diverge in unknown causal factors. Building on this, we propose GenEval, a multimodal Vision Language Models (VLM) approach that combines foundational models (e.g., MedGemma-4B) with human knowledge via Low-Rank Adaptation (LoRA) to bridge causal gaps and enhance single-source domain generalization (SDG). Across eight DR and two SOZ datasets, GenEval achieves superior SDG performance, with average accuracy of 69.2% (DR) and 81% (SOZ), outperforming the strongest baselines by 9.4% and 1.8%, respectively.
Chinese Translation
在诸如基于眼底图像的糖尿病视网膜病变(DR)分级和静息态功能磁共振成像(fMRI)发作区(SOZ)检测等关键任务中,跨领域的图像分类泛化仍然面临挑战。当领域在未知的因果因素上存在差异时,实现跨领域泛化变得困难,并且没有建立的方法论可以在没有直接元数据或数据收集者的协议级信息的情况下客观评估这些差异,而这些信息通常是不可获取的。我们首先介绍领域一致性界限(Domain Conformal Bounds,DCB),这是一个评估领域在未知因果因素上是否存在差异的理论框架。在此基础上,我们提出了GenEval,这是一种多模态视觉语言模型(Vision Language Models,VLM)方法,结合基础模型(例如,MedGemma-4B)与人类知识,通过低秩适应(Low-Rank Adaptation,LoRA)来弥补因果差距并增强单源领域泛化(Single-source Domain Generalization,SDG)。在八个DR和两个SOZ数据集上,GenEval实现了优越的SDG性能,平均准确率为69.2%(DR)和81%(SOZ),分别比最强基线提高了9.4%和1.8%。
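Low-Rank Adaptation, the mechanism GenEval uses to inject human knowledge into the foundation model, trains a low-rank update beside frozen pretrained weights. A minimal numpy sketch of the mechanism (toy dimensions; zero-initialized B makes the adapter a no-op at the start of fine-tuning):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2                   # full dimension and low rank (r << d)
W0 = rng.normal(size=(d, d))   # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01
B = np.zeros((d, r))           # B = 0: the adapter starts as identity

def lora_forward(x, W0, A, B, scale=1.0):
    # only A and B are trained; W0 stays frozen
    return x @ W0.T + scale * (x @ A.T @ B.T)

x = rng.normal(size=(3, d))
assert np.allclose(lora_forward(x, W0, A, B), x @ W0.T)  # init: no change
trainable = A.size + B.size
assert trainable < W0.size  # 64 adapter params vs 256 frozen ones
```

The parameter count scales with r*(2d) rather than d*d, which is what makes adapting a 4B-parameter model to a handful of medical datasets tractable.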
cs.CV / 4 / 2603.12382

SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs

SPARROW:在像素基础视频多模态大语言模型中学习空间精度和时间参考一致性
Alansari, Mohamad, Suryanto, Naufal, Velayudhan, Divya, Javed, Sajid, Werghi, Naoufel, Naseer, Muzammal
Abstract
Multimodal large language models (MLLMs) have advanced from image-level reasoning to pixel-level grounding, but extending these capabilities to videos remains challenging as models must achieve spatial precision and temporally consistent reference tracking. Existing video MLLMs often rely on a static segmentation token ([SEG]) for frame-wise grounding, which provides semantics but lacks temporal context, causing spatial drift, identity switches, and unstable initialization when objects move or reappear. We introduce SPARROW, a pixel-grounded video MLLM that unifies spatial accuracy and temporal stability through two key components: (i) Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and (ii) a dual-prompt design that decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding. SPARROW is supported by a curated referential video dataset of 30,646 videos and 45,231 Q&A pairs and operates end-to-end without external detectors via a class-agnostic SAM2-based proposer. Integrated into three recent open-source video MLLMs (UniPixel, GLUS, and VideoGLaMM), SPARROW delivers consistent gains across six benchmarks, improving up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG. These results demonstrate that SPARROW substantially improves referential stability, spatial precision, and temporal coherence in pixel-grounded video understanding. Project page: https://risys-lab.github.io/SPARROW
Chinese Translation
多模态大语言模型(MLLMs)已从图像级推理发展到像素级基础,但将这些能力扩展到视频仍然面临挑战,因为模型必须实现空间精度和时间一致的参考跟踪。现有的视频 MLLMs 通常依赖于静态分割标记([SEG])进行逐帧基础,这提供了语义信息,但缺乏时间上下文,导致空间漂移、身份切换以及在物体移动或重新出现时的不稳定初始化。我们提出了 SPARROW,一种像素基础的视频 MLLM,通过两个关键组件统一空间准确性和时间稳定性:(i)目标特定跟踪特征(TSF),在训练过程中注入时间对齐的参考线索,以及(ii)双提示设计,解码框([BOX])和分割([SEG])标记,以融合几何先验与语义基础。SPARROW 得益于一个精心策划的参考视频数据集,包含 30,646 个视频和 45,231 对问答,并通过类别无关的基于 SAM2 的提议器实现端到端运行,无需外部检测器。SPARROW 集成到三个最新的开源视频 MLLMs(UniPixel、GLUS 和 VideoGLaMM)中,在六个基准测试中提供了一致的提升,在 RVOS 上提高了 +8.9 J&F,在视觉基础上提高了 +5 mIoU,在 GCG 上提高了 +5.4 CLAIR。这些结果表明,SPARROW 在像素基础的视频理解中显著提高了参考稳定性、空间精度和时间一致性。项目页面:https://risys-lab.github.io/SPARROW
cs.CV / 5 / 2603.12388

Deployment-Oriented Session-wise Meta-Calibration for Landmark-Based Webcam Gaze Tracking

面向部署的基于地标的会话级元校准用于网络摄像头注视追踪
Zhang, Chenkai
Abstract
Practical webcam gaze tracking is constrained not only by error, but also by calibration burden, robustness to head motion and session drift, runtime footprint, and browser use. We therefore target a deployment-oriented operating point rather than the image large-backbone regime. We cast landmark-based point-of-regard estimation as session-wise adaptation: a shared geometric encoder produces embeddings that can be aligned to a new session from a small calibration set. We present Equivariant Meta-Calibrated Gaze (EMC-Gaze), a lightweight landmark-only method combining an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, auxiliary 3D gaze-direction supervision, and a closed-form ridge calibrator differentiated through episodic meta-training. To reduce pose leakage, we use a two-view canonicalization consistency loss. The deployed predictor uses only facial landmarks and fits a per-session ridge head from brief calibration. In a fixation-style interactive evaluation over 33 sessions at 100 cm, EMC-Gaze achieves 5.79 +/- 1.81 deg RMSE after 9-point calibration versus 6.68 +/- 2.34 deg for Elastic Net; the gain is larger on still-head queries (2.92 +/- 0.75 deg vs. 4.45 +/- 0.30 deg). Across three subject holdouts of 10 subjects each, EMC-Gaze retains an advantage (5.66 +/- 0.19 deg vs. 6.49 +/- 0.33 deg). On MPIIFaceGaze with short per-session calibration, the eye-focused model reaches 8.82 +/- 1.21 deg at 16-shot calibration, ties Elastic Net at 1-shot, and outperforms it from 3-shot onward. The exported eye-focused encoder has 944,423 parameters, is 4.76 MB in ONNX, and supports calibrated browser prediction in 12.58/12.58/12.90 ms per sample (mean/median/p90) in Chromium 145 with ONNX Runtime Web. These results position EMC-Gaze as a calibration-friendly operating point rather than a universal state-of-the-art claim against heavier appearance-based systems.
Chinese Translation
实际的网络摄像头注视追踪不仅受到误差的限制,还受到校准负担、对头部运动和会话漂移的鲁棒性、运行时占用和浏览器使用的制约。因此,我们的目标是一个面向部署的操作点,而不是大型图像骨干网络的范畴。我们将基于地标的注视点估计视为会话级适应:一个共享的几何编码器生成的嵌入可以通过小规模的校准集对齐到新的会话。我们提出了等变元校准注视(Equivariant Meta-Calibrated Gaze,EMC-Gaze),这是一种轻量级的仅基于地标的方法,结合了E(3)-等变地标图编码器、局部眼部几何、双眼强调、辅助3D注视方向监督,以及通过情节元训练区分的闭式岭校准器。为了减少姿态泄漏,我们使用了双视图典范一致性损失。部署的预测器仅使用面部地标,并通过简短的校准拟合每个会话的岭头。在100厘米的33个会话的注视风格交互评估中,EMC-Gaze在9点校准后实现了5.79 +/- 1.81度的均方根误差(RMSE),而Elastic Net为6.68 +/- 2.34度;在静态头部查询中,增益更大(2.92 +/- 0.75度对比4.45 +/- 0.30度)。在三个包含10名受试者的保留样本中,EMC-Gaze保持了优势(5.66 +/- 0.19度对比6.49 +/- 0.33度)。在MPIIFaceGaze上,经过短期每会话校准,眼部聚焦模型在16次校准时达到8.82 +/- 1.21度,在1次校准时与Elastic Net持平,并在3次校准后超越它。导出的眼部聚焦编码器有944,423个参数,大小为4.76 MB(ONNX格式),并在Chromium 145中通过ONNX Runtime Web支持每个样本12.58/12.58/12.90毫秒(均值/中位数/p90)的校准浏览器预测。这些结果将EMC-Gaze定位为一个校准友好的操作点,而非针对更重的基于外观的系统宣称普遍最先进。
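The per-session ridge head described above admits a closed-form fit from a handful of calibration points. A minimal sketch (NumPy, with synthetic embeddings standing in for the landmark-encoder output; `fit_ridge_head` and the toy data are illustrative, not the paper's code):

```python
import numpy as np

def fit_ridge_head(Z, Y, lam=1e-2):
    """Closed-form ridge fit mapping embeddings Z (n, d) to 2-D
    gaze targets Y (n, 2); lam is the ridge penalty."""
    n, d = Z.shape
    Zb = np.hstack([Z, np.ones((n, 1))])        # append a bias column
    A = Zb.T @ Zb + lam * np.eye(d + 1)
    return np.linalg.solve(A, Zb.T @ Y)         # (d+1, 2) weight matrix

def predict(W, Z):
    Zb = np.hstack([Z, np.ones((len(Z), 1))])
    return Zb @ W

# 9-point calibration with synthetic, noiseless targets
rng = np.random.default_rng(0)
W_true = rng.normal(size=(8, 2))
Z_cal = rng.normal(size=(9, 8))                 # 9 shots, 8-dim embeddings
Y_cal = Z_cal @ W_true
W = fit_ridge_head(Z_cal, Y_cal, lam=1e-8)
err = np.abs(predict(W, Z_cal) - Y_cal).max()   # near-zero on this toy fit
```

At deployment only `predict` needs to run; the fit itself is a single linear solve, which is what makes brief per-session calibration cheap.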
cs.CV / 6 / 2603.12409

ABRA: Teleporting Fine-Tuned Knowledge Across Domains for Open-Vocabulary Object Detection

ABRA:跨领域传输微调知识以实现开放词汇对象检测
Bernardi, Mattia, Cappellino, Chiara, Mosconi, Matteo, Sangineto, Enver, Porrello, Angelo, Calderara, Simone
Abstract
Although recent Open-Vocabulary Object Detection architectures, such as Grounding DINO, demonstrate strong zero-shot capabilities, their performance degrades significantly under domain shifts. Moreover, many domains of practical interest, such as nighttime or foggy scenes, lack large annotated datasets, preventing direct fine-tuning. In this paper, we introduce Aligned Basis Relocation for Adaptation (ABRA), a method that transfers class-specific detection knowledge from a labeled source domain to a target domain where no training images containing these classes are accessible. ABRA formulates this adaptation as a geometric transport problem in the weight space of a pretrained detector, aligning source and target domain experts to transport class-specific knowledge. Extensive experiments across challenging domain shifts demonstrate that ABRA successfully teleports class-level specialization under multiple adverse conditions. Our code will be made public upon acceptance.
Chinese Translation
尽管近期的开放词汇对象检测架构,如 Grounding DINO,展示了强大的零样本能力,但在领域转移的情况下,其性能显著下降。此外,许多实际应用领域,如夜间或雾天场景,缺乏大型标注数据集,阻碍了直接微调。在本文中,我们提出了对齐基底重定位适应方法(Aligned Basis Relocation for Adaptation,ABRA),该方法将特定类别的检测知识从标注的源领域转移到目标领域,而后者没有包含这些类别的训练图像。ABRA 将这种适应过程形式化为预训练检测器权重空间中的几何传输问题,通过对齐源领域和目标领域的专家,传输特定类别的知识。大量在具有挑战性的领域转移中的实验表明,ABRA 能够在多种不利条件下成功传输类别级专业知识。我们的代码将在论文接受后公开。
cs.CV / 7 / 2603.12421

A Neuro-Symbolic Framework Combining Inductive and Deductive Reasoning for Autonomous Driving Planning

结合归纳与演绎推理的神经符号框架用于自主驾驶规划
Wei, Hongyan, AbdAlmageed, Wael
Abstract
Existing end-to-end autonomous driving models rely heavily on purely data-driven inductive reasoning. This "black-box" nature leads to a lack of interpretability and absolute safety guarantees in complex, long-tail scenarios. To overcome this bottleneck, we propose a novel neuro-symbolic trajectory planning framework that seamlessly integrates rigorous deductive reasoning into end-to-end neural networks. Specifically, our framework utilizes a Large Language Model (LLM) to dynamically extract scene rules and employs an Answer Set Programming (ASP) solver for deterministic logical arbitration, generating safe and traceable discrete driving decisions. To bridge the gap between discrete symbols and continuous trajectories, we introduce a decision-conditioned decoding mechanism that transforms high-level logical decisions into learnable embedding vectors, simultaneously constraining the planning query and the physical initial velocity of a differentiable Kinematic Bicycle Model (KBM). By combining KBM-generated physical baseline trajectories with neural residual corrections, our approach inherently guarantees kinematic feasibility while ensuring a high degree of transparency. On the nuScenes benchmark, our method comprehensively outperforms the state-of-the-art baseline MomAD, reducing the L2 mean error to 0.57 m, decreasing the collision rate to 0.075%, and optimizing trajectory prediction consistency (TPC) to 0.47 m.
Chinese Translation
现有的端到端自主驾驶模型在很大程度上依赖于纯数据驱动的归纳推理。这种“黑箱”特性导致在复杂的长尾场景中缺乏可解释性和绝对安全保障。为了解决这一瓶颈,我们提出了一种新颖的神经符号轨迹规划框架,该框架将严格的演绎推理无缝集成到端到端神经网络中。具体而言,我们的框架利用大型语言模型(Large Language Model, LLM)动态提取场景规则,并使用答案集编程(Answer Set Programming, ASP)求解器进行确定性的逻辑仲裁,从而生成安全且可追溯的离散驾驶决策。为了弥合离散符号与连续轨迹之间的差距,我们引入了一种决策条件解码机制,将高层次的逻辑决策转化为可学习的嵌入向量,同时约束规划查询和可微分运动学自行车模型(Kinematic Bicycle Model, KBM)的物理初始速度。通过将KBM生成的物理基线轨迹与神经残差校正相结合,我们的方法在保证运动学可行性的同时,确保了高度的透明性。在nuScenes基准测试中,我们的方法全面超越了最先进的基线MomAD,将L2均值误差降低至0.57米,碰撞率降低至0.075%,并将轨迹预测一致性(Trajectory Prediction Consistency, TPC)优化至0.47米。
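The differentiable Kinematic Bicycle Model that anchors the physical baseline trajectory is a standard construction; a plain Euler-integration sketch of it is below (wheelbase, timestep, and the toy control sequence are illustrative, not the paper's values):

```python
import math

def kbm_step(x, y, theta, v, delta, a, L=2.7, dt=0.1):
    """One Euler step of a kinematic bicycle model.
    (x, y): rear-axle position, theta: heading, v: speed,
    delta: steering angle, a: acceleration, L: wheelbase."""
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += v / L * math.tan(delta) * dt
    v += a * dt
    return x, y, theta, v

def rollout(v0, deltas, accs, dt=0.1):
    """Roll out a baseline trajectory from initial speed v0."""
    x = y = theta = 0.0
    v = v0
    traj = [(x, y)]
    for delta, a in zip(deltas, accs):
        x, y, theta, v = kbm_step(x, y, theta, v, delta, a, dt=dt)
        traj.append((x, y))
    return traj

# straight driving at 10 m/s for 3 s: travels ~30 m along x
traj = rollout(10.0, deltas=[0.0] * 30, accs=[0.0] * 30, dt=0.1)
```

In the paper's framework the network predicts only a residual correction on top of such a rollout, which is why kinematic feasibility of the baseline is guaranteed by construction.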
cs.CV / 8 / 2603.12430

Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation

Surg-R1:一种分层推理基础模型,用于可扩展和可解释的外科决策支持,并经过多中心临床验证
Jiang, Jian, Lin, Chenxi, Gu, Yiming, Qin, Zengyi, Zeng, Zhitao, Yuan, Kun, Long, Yonghao, Xia, Xiang, Yuan, Cheng, Wang, Yuqi, Yue, Zijie, Yang, Kunyi, Zhang, Yuting, Zhuo, Zhu, Qin, Dian, Wang, Xin, Fai, NG Chi, Anthony, Brian, Xu, Daguang, Rosman, Guy, Meireles, Ozanan, Zhang, Zizhen, Padoy, Nicolas, Wang, Hesheng, Dou, Qi, Jin, Yueming, Ban, Yutong
Abstract
Surgical scene understanding demands not only accurate predictions but also interpretable reasoning that surgeons can verify against clinical expertise. However, existing surgical vision-language models generate predictions without reasoning chains, and general-purpose reasoning models fail on compositional surgical tasks without domain-specific knowledge. We present Surg-R1, a surgical Vision-Language Model that addresses this gap through hierarchical reasoning trained via a four-stage pipeline. Our approach introduces three key contributions: (1) a three-level reasoning hierarchy decomposing surgical interpretation into perceptual grounding, relational understanding, and contextual reasoning; (2) the largest surgical chain-of-thought dataset with 320,000 reasoning pairs; and (3) a four-stage training pipeline progressing from supervised fine-tuning to group relative policy optimization and iterative self-improvement. Evaluation on SurgBench, comprising six public benchmarks and six multi-center external validation datasets from five institutions, demonstrates that Surg-R1 achieves the highest Arena Score (64.9%) on public benchmarks versus Gemini 3.0 Pro (46.1%) and GPT-5.1 (37.9%), outperforming both proprietary reasoning models and specialized surgical VLMs on the majority of tasks spanning instrument localization, triplet recognition, phase recognition, action recognition, and critical view of safety assessment, with a 15.2 percentage point improvement over the strongest surgical baseline on external validation.
Chinese Translation
外科场景理解不仅需要准确的预测,还需要可解释的推理,以便外科医生能够根据临床专业知识进行验证。然而,现有的外科视觉-语言模型生成的预测缺乏推理链,而通用推理模型在没有领域特定知识的情况下无法处理组合外科任务。我们提出了Surg-R1,这是一种外科视觉-语言模型,通过四阶段管道训练来填补这一空白。我们的方法引入了三个关键贡献:(1)一个三层推理层次结构,将外科解释分解为感知基础、关系理解和上下文推理;(2)拥有32万个推理对的最大外科思维链数据集;(3)一个四阶段训练管道,从监督微调到组相对策略优化和迭代自我改进。对SurgBench的评估,包括来自五个机构的六个公共基准和六个多中心外部验证数据集,表明Surg-R1在公共基准上获得了最高的Arena Score(64.9%),相比之下,Gemini 3.0 Pro为46.1%,GPT-5.1为37.9%。在大多数任务中,包括工具定位、三元组识别、阶段识别、动作识别和安全评估的关键视图,Surg-R1超越了专有推理模型和专业外科视觉-语言模型,外部验证中相较于最强外科基线提高了15.2个百分点。
cs.CV / 9 / 2603.12433

Revisiting Model Stitching In the Foundation Model Era

在基础模型时代重新审视模型拼接
Mai, Zheda, Zhang, Ke, Wang, Fu-En, Wang, Zixiao Ken, Chen, Albert Y. C., Xia, Lu, Sun, Min, Chao, Wei-Lun, Kuo, Cheng-Hao
Abstract
Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning the stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model at only a small inference overhead (for the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs. Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.
Chinese Translation
模型拼接是通过一个轻量级拼接层将一个模型(源模型)的早期层连接到另一个模型(目标模型)的后期层,这一方法已被用作表示兼容性的探测工具。先前的研究发现,即使在不同的初始化或目标下,训练于相同数据集的模型仍然可以拼接(精度下降可忽略不计)。我们重新审视了视觉基础模型(Vision Foundation Models, VFM)的拼接,这些模型在目标、数据和模态组合上存在差异(例如,CLIP、DINOv2、SigLIP 2),并提出以下问题:异构的 VFM 是否可拼接?我们引入了一个系统化的协议,涵盖拼接点、拼接层系列、训练损失和下游任务。研究得出三项发现。(1) 拼接层的训练至关重要:传统方法在拼接点匹配中间特征或优化任务损失的端到端方法在保持精度方面面临困难,尤其是在浅层拼接点。(2) 通过在目标模型的倒数第二层使用简单的特征匹配损失,异构 VFM 在视觉任务中变得可靠可拼接。(3) 对于深层拼接点,拼接模型在仅有小幅推理开销(针对拼接层)的情况下,可以超越任一组成模型。基于这些发现,我们进一步提出了 VFM 拼接树(VFM Stitch Tree, VST),该树在保留 VFM 后期层的同时共享早期层,从而为通常利用多个 VFM 的多模态大语言模型(LLM)提供可控的精度-延迟权衡。综合来看,我们的研究将拼接从一种诊断探测工具提升为整合互补 VFM 优势的实用方法,并明确指出它们的表示在何处对齐或偏离。
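The stitch layer itself is typically just a light linear map between the two models' feature spaces. As a toy stand-in for training it against a feature-matching loss, the sketch below fits the map in closed form on synthetic activations (the paper trains it by gradient descent against the target's penultimate-layer features; the dimensions and data here are invented for illustration):

```python
import numpy as np

def fit_stitch(F_src, F_tgt, lam=1e-3):
    """Closed-form linear stitch layer: maps source-model features
    F_src (n, d_s) onto target-model features F_tgt (n, d_t) by
    ridge-regularized least squares (a stand-in for the trained
    stitch layer)."""
    d = F_src.shape[1]
    A = F_src.T @ F_src + lam * np.eye(d)
    return np.linalg.solve(A, F_src.T @ F_tgt)   # (d_s, d_t)

rng = np.random.default_rng(1)
M = rng.normal(size=(16, 24))                    # hidden linear relation
F_src = rng.normal(size=(256, 16))               # source block activations
F_tgt = F_src @ M                                # toy target activations
W = fit_stitch(F_src, F_tgt, lam=1e-8)
mismatch = np.abs(F_src @ W - F_tgt).max()       # residual of the stitch
```

When the two representations are (approximately) linearly related, as here by construction, the stitch recovers the relation almost exactly; the paper's finding is that where to apply the matching loss matters as much as the stitch layer's form.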
cs.CV / 10 / 2603.12468

Adaptation of Weakly Supervised Localization in Histopathology by Debiasing Predictions

通过去偏见预测适应组织病理学中的弱监督定位
Guichemerre, Alexis, Karimian, Banafsheh, Belharbi, Soufiane, Gillet, Natacha, Thome, Nicolas, Shamsolmoali, Pourya, Shateri, Mohammadhadi, McCaffrey, Luke, Granger, Eric
Abstract
Weakly Supervised Object Localization (WSOL) models enable joint classification and region-of-interest localization in histology images using only image-class supervision. When deployed in a target domain, distribution shift remains a major cause of performance degradation, especially when applied to new organs or institutions with different staining protocols and scanner characteristics. Under stronger cross-domain shifts, WSOL predictions can become biased toward dominant classes, producing highly skewed pseudo-label distributions in the target domain. Source-Free (Unsupervised) Domain Adaptation (SFDA) methods are commonly employed to address domain shift. However, because they rely on self-training, the initial bias is reinforced over training iterations, degrading both classification and localization tasks. We identify this amplification of prediction bias as a primary obstacle to the SFDA of WSOL models in histopathology. This paper introduces SFDA-DeP, a method inspired by machine unlearning that formulates SFDA as an iterative process of identifying and correcting prediction bias. It periodically identifies target images from over-predicted classes and selectively reduces the predictive confidence for uncertain (high entropy) images, while preserving confident predictions. This process reduces the drift of decision boundaries and bias toward dominant classes. A jointly optimized pixel-level classifier further restores discriminative localization features under distribution shift. Extensive experiments on cross-organ and -center histopathology benchmarks (GlaS, CAMELYON-16, CAMELYON-17) with several WSOL models show that SFDA-DeP consistently improves classification and localization over state-of-the-art SFDA baselines. Code: https://anonymous.4open.science/r/SFDA-DeP-1797/
Chinese Translation
弱监督目标定位(Weakly Supervised Object Localization, WSOL)模型仅使用图像类别监督,能够在组织学图像中实现联合分类和感兴趣区域定位。然而,在目标领域部署时,分布转移仍然是性能下降的主要原因,尤其是在应用于具有不同染色协议和扫描仪特性的新的器官或机构时。在更强的跨域转移下,WSOL预测可能会偏向于主导类别,从而在目标领域产生高度偏斜的伪标签分布。无源(无监督)领域适应(Source-Free Domain Adaptation, SFDA)方法通常用于解决领域转移问题。然而,由于它们依赖自我训练,初始偏见在训练迭代中被强化,导致分类和定位任务的性能下降。我们将这种预测偏见的放大视为WSOL模型在组织病理学中进行SFDA的主要障碍。本文介绍了SFDA-DeP,一种受机器遗忘启发的方法,将SFDA表述为识别和纠正预测偏见的迭代过程。它定期识别来自过度预测类别的目标图像,并选择性地降低对不确定(高熵)图像的预测置信度,同时保留自信的预测。该过程减少了决策边界的漂移和对主导类别的偏见。一个联合优化的像素级分类器进一步恢复了在分布转移下的判别定位特征。在多个WSOL模型的跨器官和跨中心组织病理学基准(GlaS, CAMELYON-16, CAMELYON-17)上进行的广泛实验表明,SFDA-DeP在分类和定位上始终优于最先进的SFDA基线。代码:https://anonymous.4open.science/r/SFDA-DeP-1797/
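The core debiasing step (identify over-predicted classes, then soften only the uncertain, high-entropy predictions assigned to them) can be sketched as follows; the entropy threshold and smoothing temperature are illustrative, not the paper's values:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def debias_step(logits, expected_freq, entropy_thresh=0.8, T_soft=2.0):
    """One debiasing pass: find classes predicted more often than
    expected, then temperature-smooth only the uncertain (high-entropy)
    samples assigned to them, leaving confident predictions untouched."""
    p = softmax(logits)
    pred = p.argmax(axis=1)
    freq = np.bincount(pred, minlength=logits.shape[1]) / len(pred)
    over = set(np.where(freq > expected_freq)[0])   # over-predicted classes
    ent = -(p * np.log(p + 1e-12)).sum(axis=1)      # per-sample entropy
    out = p.copy()
    for i, (c, h) in enumerate(zip(pred, ent)):
        if c in over and h > entropy_thresh:
            out[i] = softmax(logits[i], T=T_soft)   # reduced confidence
    return out

logits = np.array([
    [2.0, 0.0, 0.0],    # confident class-0 prediction: kept
    [0.5, 0.3, 0.2],    # uncertain class-0 prediction: softened
    [0.4, 0.35, 0.25],  # uncertain class-0 prediction: softened
    [0.0, 2.0, 0.0],    # class 1 is not over-predicted: kept
])
p_before = softmax(logits)
p_after = debias_step(logits, expected_freq=1 / 3)
```

Only the uncertain samples from the dominant class lose confidence, which is what keeps self-training from re-amplifying the initial bias.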
cs.CV / 11 / 2603.12469

Unleashing Video Language Models for Fine-grained HRCT Report Generation

释放视频语言模型以生成细粒度的高分辨率计算机断层扫描报告
Fang, Yingying, Zhou, Huichi, Lee, KinHei, Wang, Yijia, Zhang, Zhenxuan, Huang, Jiahao, Yang, Guang
Abstract
Generating precise diagnostic reports from High-Resolution Computed Tomography (HRCT) is critical for clinical workflow, yet it remains a formidable challenge due to the high pathological diversity and spatial sparsity within 3D volumes. While Video Language Models (VideoLMs) have demonstrated remarkable spatio-temporal reasoning in general domains, their adaptability to domain-specific, high-volume medical interpretation remains underexplored. In this work, we present AbSteering, an abnormality-centric framework that steers VideoLMs toward precise HRCT report generation. Specifically, AbSteering introduces: (i) an abnormality-centric Chain-of-Thought scheme that enforces abnormality reasoning, and (ii) a Direct Preference Optimization objective that utilizes clinically confusable abnormalities as hard negatives to enhance fine-grained discrimination. Our results demonstrate that general-purpose VideoLMs possess strong transferability to high-volume medical imaging when guided by this paradigm. Notably, AbSteering outperforms state-of-the-art domain-specific CT foundation models, which are pretrained with large-scale CTs, achieving superior detection sensitivity while simultaneously mitigating hallucinations. Our data and model weights are released at https://anonymous.4open.science/r/hrct-report-generation-video-vlm-728C/
Chinese Translation
从高分辨率计算机断层扫描(HRCT)生成精确的诊断报告对于临床工作流程至关重要,但由于3D体积内的高病理多样性和空间稀疏性,这仍然是一个巨大的挑战。尽管视频语言模型(VideoLMs)在一般领域展示了显著的时空推理能力,但它们在特定领域的高容量医学解读中的适应性仍然未得到充分探索。在本研究中,我们提出了AbSteering,一个以异常为中心的框架,旨在引导VideoLMs实现精确的HRCT报告生成。具体而言,AbSteering引入了:(i)一个以异常为中心的思维链方案,强制进行异常推理,以及(ii)一个直接偏好优化目标,利用临床上易混淆的异常作为困难负样本,以增强细粒度的区分能力。我们的结果表明,在这种范式的指导下,通用视频语言模型在高容量医学影像中具有强大的迁移能力。值得注意的是,AbSteering在检测灵敏度上超越了最先进的特定领域CT基础模型,这些模型是通过大规模CT预训练的,同时有效减少了幻觉现象。我们的数据和模型权重已发布在 https://anonymous.4open.science/r/hrct-report-generation-video-vlm-728C/
cs.CV / 12 / 2603.12478

Less Data, Faster Convergence: Goal-Driven Data Optimization for Multimodal Instruction Tuning

更少的数据,更快的收敛:面向目标的数据优化用于多模态指令调优
Wu, Rujie, Zhao, Haozhe, Ci, Hai, Wang, Yizhou
Abstract
Multimodal instruction tuning is often compute-inefficient because training budgets are spread across large mixed image-video pools whose utility is highly uneven. We present Goal-Driven Data Optimization (GDO), a framework that computes six sample descriptors for each candidate and constructs optimized 1× training subsets for different goals. Under a fixed one-epoch Qwen3-VL-8B-Instruct training and evaluation recipe on 8 H20 GPUs, GDO uses far fewer training samples than the Uni-10x baseline while converging faster and achieving higher accuracy. Relative to the fixed 512k-sample Uni-10x baseline, GDO reaches the Uni-10x reference after 35.4k samples on MVBench, 26.6k on VideoMME, 27.3k on MLVU, and 34.7k on LVBench, while improving Accuracy by +1.38, +1.67, +3.08, and +0.84 percentage points, respectively. The gains are largest on MVBench and MLVU, while LVBench improves more modestly, consistent with its ultra-long-video setting and the mismatch between that benchmark and the short-video/image-dominant training pool. Across MinLoss, Diverse, Temp, and Temp+, stronger temporal emphasis yields steadily better long-video understanding behavior. Overall, GDO provides a goal-driven data optimization framework that enables faster convergence with fewer training samples under a fixed training protocol. Code is available at https://github.com/rujiewu/GDO.
Chinese Translation
多模态指令调优通常计算效率低,因为训练预算分散在大型混合图像-视频池中,而其效用高度不均。我们提出了目标驱动的数据优化(Goal-Driven Data Optimization, GDO),这是一个为每个候选样本计算六个样本描述符并为不同目标构建优化的1×训练子集的框架。在固定的单轮Qwen3-VL-8B-Instruct训练和评估方案下,使用8个H20 GPU,GDO使用的训练样本远少于Uni-10x基线,同时收敛更快,准确率更高。相较于固定的512k样本Uni-10x基线,GDO在MVBench上在35.4k样本后达到Uni-10x参考,在VideoMME上为26.6k,在MLVU上为27.3k,在LVBench上为34.7k,同时准确率分别提高了+1.38、+1.67、+3.08和+0.84个百分点。MVBench和MLVU上的增益最大,而LVBench的提升较为温和,这与其超长视频设置及该基准与短视频/图像主导的训练池之间的不匹配一致。在MinLoss、Diverse、Temp和Temp+中,较强的时间强调使得长视频理解行为稳步改善。总体而言,GDO提供了一个目标驱动的数据优化框架,使得在固定训练协议下以更少的训练样本实现更快的收敛。代码可在https://github.com/rujiewu/GDO获取。
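Mechanically, goal-driven subset construction reduces to scoring each candidate's descriptor vector against a goal weighting and keeping the top-budget samples. A minimal sketch (the six descriptors and the "temporal-emphasis" weighting below are invented for illustration; the paper's actual descriptors and goal recipes differ):

```python
import numpy as np

def build_subset(descriptors, goal_weights, budget):
    """Score each candidate by a weighted sum of its descriptors
    and keep the `budget` highest-scoring samples for the goal."""
    scores = descriptors @ goal_weights
    order = np.argsort(-scores)            # descending by score
    return order[:budget]

rng = np.random.default_rng(2)
desc = rng.uniform(size=(1000, 6))          # six descriptors per candidate
# hypothetical goal: emphasize the sixth (say, temporal) descriptor
temporal_goal = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
subset = build_subset(desc, temporal_goal, budget=100)
```

Changing only `goal_weights` retargets the same candidate pool at a different benchmark, which is the sense in which the optimization is goal-driven.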
cs.CV / 13 / 2603.12482

CalliMaster: Mastering Page-level Chinese Calligraphy via Layout-guided Spatial Planning

CalliMaster:通过布局引导的空间规划掌握页级中文书法
Xu, Tianshuo, Hong, Tiantian, Chen, Zhifei, Chao, Fei, Chen, Ying-cong
Abstract
Page-level calligraphy synthesis requires balancing glyph precision with layout composition. Existing character models lack spatial context, while page-level methods often compromise brushwork detail. In this paper, we present CalliMaster, a unified framework for controllable generation and editing that resolves this conflict by decoupling spatial planning from content synthesis. Inspired by the human cognitive process of "planning before writing", we introduce a coarse-to-fine pipeline (Text → Layout → Image) to tackle the combinatorial complexity of page-scale synthesis. Operating within a single Multimodal Diffusion Transformer, a spatial planning stage first predicts character bounding boxes to establish the global spatial arrangement. This intermediate layout then serves as a geometric prompt for the content synthesis stage, where the same network utilizes flow-matching to render high-fidelity brushwork. Beyond achieving state-of-the-art generation quality, this disentanglement supports versatile downstream capabilities. By treating the layout as a modifiable constraint, CalliMaster enables controllable semantic re-planning: users can resize or reposition characters while the model automatically harmonizes the surrounding void space and brush momentum. Furthermore, we demonstrate the framework's extensibility to artifact restoration and forensic analysis, providing a comprehensive tool for digital cultural heritage.
Chinese Translation
页级书法合成需要在字形精度与布局构成之间取得平衡。现有的字符模型缺乏空间上下文,而页级方法往往妥协于笔触细节。本文提出了CalliMaster,一个统一的可控生成与编辑框架,通过将空间规划与内容合成解耦来解决这一冲突。受到人类“写作前规划”认知过程的启发,我们引入了一个粗到细的管道(文本 → 布局 → 图像),以应对页尺度合成的组合复杂性。在单一的多模态扩散变换器内,空间规划阶段首先预测字符边界框,以建立全局空间布局。这个中间布局随后作为内容合成阶段的几何提示,在该阶段,同一网络利用流匹配来渲染高保真笔触。除了实现最先进的生成质量外,这种解耦还支持多样化的下游能力。通过将布局视为可修改的约束,CalliMaster使得可控的语义重新规划成为可能:用户可以调整字符的大小或位置,而模型会自动协调周围的空白空间和笔触动量。此外,我们展示了该框架在文物修复和法医学分析中的可扩展性,为数字文化遗产提供了全面的工具。
cs.CV / 14 / 2603.12493

RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution

用于真实智能手机超分辨率的RAW域退化模型
Mosleh, Ali, Ali, Faraz, Zhang, Fengjia, Tsogkas, Stavros, Lee, Junyong, Levinshtein, Alex, Brown, Michael S.
Abstract
Digital zoom on smartphones relies on learning-based super-resolution (SR) models that operate on RAW sensor images, but obtaining sensor-specific training data is challenging due to the lack of ground-truth images. Synthetic data generation via "unprocessing" pipelines offers a potential solution by simulating the degradations that transform high-resolution (HR) images into their low-resolution (LR) counterparts. However, these pipelines can introduce domain gaps due to incomplete or unrealistic degradation modeling. In this paper, we demonstrate that principled and carefully designed degradation modeling can enhance SR performance in real-world conditions. Instead of relying on generic priors for camera blur and noise, we model device-specific degradations through calibration and unprocess publicly available rendered images into the RAW domain of different smartphones. Using these image pairs, we train a single-image RAW-to-RGB SR model and evaluate it on real data from a held-out device. Our experiments show that accurate degradation modeling leads to noticeable improvements, with our SR model outperforming baselines trained on large pools of arbitrarily chosen degradations.
Chinese Translation
智能手机上的数字变焦依赖于基于学习的超分辨率(SR)模型,这些模型在RAW传感器图像上运行,但由于缺乏真实图像,获取特定传感器的训练数据具有挑战性。通过“去处理”管道生成合成数据提供了一种潜在的解决方案,该方法通过模拟将高分辨率(HR)图像转化为低分辨率(LR)图像的退化过程。然而,这些管道可能会由于退化建模的不完整或不现实而引入领域差距。在本文中,我们证明了经过原则性和精心设计的退化建模可以提升在真实世界条件下的SR性能。我们通过校准建模设备特定的退化,而不是依赖于通用的相机模糊和噪声先验,将公开可用的渲染图像“去处理”到不同智能手机的RAW域。利用这些图像对,我们训练了一个单图像RAW到RGB的SR模型,并在一个保留设备的真实数据上进行了评估。我们的实验表明,准确的退化建模导致了显著的改进,我们的SR模型在训练于大量任意选择的退化的基线模型上表现更佳。
cs.CV / 15 / 2603.12506

Naïve PAINE: Lightweight Text-to-Image Generation Improvement with Prompt Evaluation

Naïve PAINE:通过提示评估提升轻量级文本到图像生成
Kim, Joong Ho, Thai, Nicholas, Dip, Souhardya Saha, Lao, Dong, Mills, Keith G.
Abstract
Text-to-Image (T2I) generation is primarily driven by Diffusion Models (DM) which rely on random Gaussian noise. Thus, like playing the slots at a casino, a DM will produce different results given the same user-defined inputs. This imposes a gambler's burden: to perform multiple generation cycles to obtain a satisfactory result. However, even though DMs use stochastic sampling to seed generation, the distribution of generated content quality highly depends on the prompt and the generative ability of a DM with respect to it. To account for this, we propose Naïve PAINE for improving the generative quality of Diffusion Models by leveraging T2I preference benchmarks. We directly predict the numerical quality of an image from the initial noise and given prompt. Naïve PAINE then selects a handful of quality noises and forwards them to the DM for generation. Further, Naïve PAINE provides feedback on the DM generative quality given the prompt and is lightweight enough to seamlessly fit into existing DM pipelines. Experimental results demonstrate that Naïve PAINE outperforms existing approaches on several prompt corpus benchmarks.
Chinese Translation
文本到图像(T2I)生成主要依赖于扩散模型(Diffusion Models, DM),这些模型依赖于随机高斯噪声。因此,就像在赌场玩老虎机一样,DM在给定相同用户定义输入的情况下会产生不同的结果。这给用户带来了赌徒的负担:需要进行多次生成周期以获得令人满意的结果。然而,尽管DM使用随机采样来启动生成,但生成内容质量的分布在很大程度上依赖于提示及其对应的DM生成能力。为了解决这个问题,我们提出了Naïve PAINE,通过利用T2I偏好基准来提高扩散模型的生成质量。我们直接从初始噪声和给定提示中预测图像的数值质量。Naïve PAINE随后选择少量高质量噪声并将其转发给DM进行生成。此外,Naïve PAINE根据提示提供对DM生成质量的反馈,并且足够轻量,可以无缝集成到现有的DM流程中。实验结果表明,Naïve PAINE在多个提示语料库基准上优于现有方法。
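The selection loop is simple once a quality predictor exists: draw many candidate seeds, score each against the prompt, and forward only the best few to the diffusion model. The sketch below uses cosine similarity to a prompt embedding as an obviously fake stand-in for the learned predictor; all names and shapes are illustrative:

```python
import numpy as np

def select_seed_noises(prompt_vec, n_candidates=64, k=4, seed=0):
    """Draw candidate Gaussian seeds, score each with a (stand-in)
    quality predictor conditioned on the prompt embedding, and
    return the top-k noises to forward to the diffusion model."""
    rng = np.random.default_rng(seed)
    noises = rng.normal(size=(n_candidates, prompt_vec.size))
    # stand-in predictor: cosine similarity to the prompt embedding
    q = (noises @ prompt_vec) / (
        np.linalg.norm(noises, axis=1) * np.linalg.norm(prompt_vec) + 1e-12)
    top = np.argsort(-q)[:k]                 # best predicted quality first
    return noises[top], q[top]

prompt_vec = np.ones(32)                     # toy prompt embedding
best, scores = select_seed_noises(prompt_vec)
```

Because only the cheap predictor runs over all candidates, the expensive diffusion sampling happens just k times instead of once per retry.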
cs.CV / 16 / 2603.12513

MemRoPE: Training-Free Infinite Video Generation via Evolving Memory Tokens

MemRoPE:通过演化记忆令牌实现无训练的无限视频生成
Kim, Youngrae, Hu, Qixin, Kuo, C. -C. Jay, Beerel, Peter A.
Abstract
Autoregressive diffusion enables real-time frame streaming, yet existing sliding-window caches discard past context, causing fidelity degradation, identity drift, and motion stagnation over long horizons. Current approaches preserve a fixed set of early tokens as attention sinks, but this static anchor cannot reflect the evolving content of a growing video. We introduce MemRoPE, a training-free framework with two co-designed components. Memory Tokens continuously compress all past keys into dual long-term and short-term streams via exponential moving averages, maintaining both global identity and recent dynamics within a fixed-size cache. Online RoPE Indexing caches unrotated keys and applies positional embeddings dynamically at attention time, ensuring the aggregation is free of conflicting positional phases. These two mechanisms are mutually enabling: positional decoupling makes temporal aggregation well-defined, while aggregation makes fixed-size caching viable for unbounded generation. Extensive experiments validate that MemRoPE outperforms existing methods in temporal coherence, visual fidelity, and subject consistency across minute- to hour-scale generation.
Chinese Translation
自回归扩散技术实现了实时帧流传输,但现有的滑动窗口缓存会丢弃过去的上下文,导致在长时间范围内的保真度下降、身份漂移和运动停滞。目前的方法将固定的一组早期令牌作为注意力的锚点,但这种静态锚点无法反映持续增长的视频中不断演变的内容。我们提出了MemRoPE,这是一种无训练的框架,包含两个协同设计的组件。记忆令牌通过指数移动平均将所有过去的键持续压缩为长期和短期双重流,从而在固定大小的缓存中保持全局身份和近期动态。在线RoPE索引缓存未旋转的键,并在注意力时间动态应用位置嵌入,确保聚合不受冲突位置相位的影响。这两个机制相互促进:位置解耦使得时间聚合定义明确,而聚合则使得固定大小缓存适用于无限生成。大量实验验证了MemRoPE在时间一致性、视觉保真度和主题一致性方面优于现有方法,适用于分钟到小时级别的生成。
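The dual-stream compression can be sketched as two exponential moving averages over the stream of per-frame keys: a slow one that retains global identity and a fast one that tracks recent dynamics. The decay rates and 4-dim toy keys below are illustrative, not the paper's configuration:

```python
import numpy as np

class MemoryTokens:
    """Fixed-size cache that compresses a stream of attention keys
    into long-term and short-term summaries via exponential moving
    averages (a sketch of the dual-stream idea)."""
    def __init__(self, dim, slow=0.999, fast=0.9):
        self.slow_decay, self.fast_decay = slow, fast
        self.long_term = np.zeros(dim)
        self.short_term = np.zeros(dim)
        self.count = 0

    def update(self, key):
        d_s, d_f = self.slow_decay, self.fast_decay
        self.long_term = d_s * self.long_term + (1 - d_s) * key
        self.short_term = d_f * self.short_term + (1 - d_f) * key
        self.count += 1

mem = MemoryTokens(dim=4)
for t in range(1000):
    # early frames look like [1,0,0,0]; recent frames like [0,1,0,0]
    key = np.array([1.0, 0, 0, 0]) if t < 500 else np.array([0, 1.0, 0, 0])
    mem.update(key)
```

After the content switch, the fast stream has all but forgotten the early frames while the slow stream still carries a sizeable trace of them, which is the division of labor between recent dynamics and global identity.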
cs.CV / 17 / 2603.12514

Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding

通过自监督和半监督学习结合顶点相对位置编码解决3D创伤检测中的数据稀缺问题
Chaudhary, Shivam, Bhat, Sheethal, Maier, Andreas
Abstract
Accurate detection and localization of traumatic injuries in abdominal CT scans remains a critical challenge in emergency radiology, primarily due to severe scarcity of annotated medical data. This paper presents a label-efficient approach combining self-supervised pre-training with semi-supervised detection for 3D medical image analysis. We employ patch-based Masked Image Modeling (MIM) to pre-train a 3D U-Net encoder on 1,206 CT volumes without annotations, learning robust anatomical representations. The pretrained encoder enables two downstream clinical tasks: 3D injury detection using VDETR with Vertex Relative Position Encoding, and multi-label injury classification. For detection, semi-supervised learning with 2,000 unlabeled volumes and consistency regularization achieves 56.57% validation mAP@0.5 and 45.30% test mAP@0.5 with only 144 labeled training samples, representing a 115% improvement over supervised-only training. For classification, expanding to 2,244 labeled samples yields 94.07% test accuracy across seven injury categories using only a frozen encoder, demonstrating immediately transferable self-supervised features. Our results validate that self-supervised pre-training combined with semi-supervised learning effectively addresses label scarcity in medical imaging, enabling robust 3D object detection with limited annotations.
Chinese Translation
在腹部CT扫描中准确检测和定位创伤性损伤仍然是急诊放射学中的一项关键挑战,主要由于标注医学数据的严重稀缺。本文提出了一种结合自监督预训练和半监督检测的标签高效方法,用于3D医学图像分析。我们采用基于补丁的掩蔽图像建模(Masked Image Modeling, MIM)在1,206个无标注的CT体积上预训练3D U-Net编码器,从而学习到稳健的解剖表示。预训练的编码器支持两个下游临床任务:使用带有顶点相对位置编码(Vertex Relative Position Encoding, VDETR)的3D损伤检测和多标签损伤分类。在检测方面,利用2,000个无标注体积和一致性正则化的半监督学习实现了56.57%的验证mAP@0.5和45.30%的测试mAP@0.5,仅使用144个标注训练样本,较仅使用监督训练提高了115%。在分类方面,扩展到2,244个标注样本,使用仅冻结的编码器在七个损伤类别中实现了94.07%的测试准确率,展示了可立即转移的自监督特征。我们的结果验证了自监督预训练结合半监督学习有效解决医学成像中的标签稀缺问题,使得在有限标注下实现稳健的3D物体检测成为可能。
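Patch-based MIM pre-training starts from a masking step like the one below: partition the volume into non-overlapping 3-D patches and zero out a random subset, leaving the encoder to reconstruct them. Patch size and mask ratio here are illustrative, not the paper's settings:

```python
import numpy as np

def random_patch_mask(volume, patch=(4, 4, 4), ratio=0.6, seed=0):
    """Zero out a random subset of non-overlapping 3-D patches, as in
    patch-based masked image modeling; returns the masked volume and
    the boolean patch-grid mask (True = masked)."""
    rng = np.random.default_rng(seed)
    D, H, W = volume.shape
    pd, ph, pw = patch
    gd, gh, gw = D // pd, H // ph, W // pw      # patch-grid dimensions
    n = gd * gh * gw
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(ratio * n), replace=False)] = True
    mask3 = mask.reshape(gd, gh, gw)
    out = volume.copy()
    for i in range(gd):
        for j in range(gh):
            for k in range(gw):
                if mask3[i, j, k]:
                    out[i*pd:(i+1)*pd, j*ph:(j+1)*ph, k*pw:(k+1)*pw] = 0.0
    return out, mask3

vol = np.ones((16, 16, 16))                     # toy CT sub-volume
masked, mask3 = random_patch_mask(vol)
```

The reconstruction loss is then computed only on the masked patches, which is what forces the encoder to learn anatomy-aware context rather than copy visible voxels.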
cs.CV / 18 / 2603.12533

Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering

你看到我指向的是什么吗?基于手势的自我中心视频问答
Choi, Yura, Miles, Roy, Potamias, Rolandos Alexandros, Elezi, Ismail, Deng, Jiankang, Zafeiriou, Stefanos
Abstract
Understanding and answering questions based on a user's pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to infer fine-grained pointing intent from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos across multiple deictic reasoning tasks. Built upon it, we further propose Hand Intent Tokens (HINT), which encodes tokens derived from 3D hand keypoints using an off-the-shelf reconstruction model and interleaves them with the model input to provide explicit spatial and temporal context for interpreting pointing intent. We show that our model outperforms others in different backbones and model sizes. In particular, HINT-14B achieves 68.1% accuracy, on average over 6 tasks, surpassing the state-of-the-art, InternVL3-14B, by 6.6%. To further facilitate the open research, we will release the code, model, and dataset. Project page: https://yuuraa.github.io/papers/choi2026egovqa
Chinese Translation
理解并回答基于用户指向手势的问题对于下一代自我中心人工智能助手至关重要。然而,当前的多模态大型语言模型(MLLMs)在这类任务上表现不佳,原因在于缺乏丰富的手势数据以及它们在从自我中心视频中推断细粒度指向意图方面的能力有限。为了解决这一问题,我们引入了EgoPointVQA,一个用于手势基础自我中心问答的数据集和基准,包含4000个合成视频和400个真实世界视频,涵盖多个指示推理任务。在此基础上,我们进一步提出了手势意图标记(Hand Intent Tokens, HINT),该标记通过使用现成的重建模型对3D手部关键点进行编码,并将其与模型输入交错,以提供明确的空间和时间上下文来解释指向意图。我们展示了我们的模型在不同的骨干网络和模型规模中优于其他模型。特别是,HINT-14B在6个任务上的平均准确率达到68.1%,超越了当前最先进的InternVL3-14B,提升幅度为6.6%。为了进一步促进开放研究,我们将发布代码、模型和数据集。项目页面:https://yuuraa.github.io/papers/choi2026egovqa
cs.CV / 19 / 2603.12538

Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation

基于混合专家的空间语义专家路由架构用于指代图像分割
Dalaq, Alaa, Behzad, Muzammil
Abstract
Referring image segmentation aims to produce a pixel-level mask for the image region described by a natural-language expression. Although pretrained vision-language models have improved semantic grounding, many existing methods still rely on uniform refinement strategies that do not fully match the diverse reasoning requirements of referring expressions. Because of this mismatch, predictions often contain fragmented regions, inaccurate boundaries, or even the wrong object, especially when pretrained backbones are frozen for computational efficiency. To address these limitations, we propose SERA, a Spatio-Semantic Expert Routing Architecture for referring image segmentation. SERA introduces lightweight, expression-aware expert refinement at two complementary stages within a vision-language framework. First, we design SERA-Adapter, which inserts an expression-conditioned adapter into selected backbone blocks to improve spatial coherence and boundary precision through expert-guided refinement and cross-modal attention. We then introduce SERA-Fusion, which strengthens intermediate visual representations by reshaping token features into spatial grids and applying geometry-preserving expert transformations before multimodal interaction. In addition, a lightweight routing mechanism adaptively weights expert contributions while remaining compatible with pretrained representations. To make this routing stable under frozen encoders, SERA uses a parameter-efficient tuning strategy that updates only normalization and bias terms, affecting less than 1% of the backbone parameters. Experiments on standard referring image segmentation benchmarks show that SERA consistently outperforms strong baselines, with especially clear gains on expressions that require accurate spatial localization and precise boundary delineation.
Chinese Translation
指代图像分割旨在为自然语言表达描述的图像区域生成像素级掩膜。尽管预训练的视觉-语言模型改善了语义基础,但许多现有方法仍依赖于统一的细化策略,这些策略并未完全匹配指代表达的多样推理需求。由于这种不匹配,预测结果往往包含碎片化区域、不准确的边界,甚至错误的对象,尤其是在为了计算效率而冻结预训练主干时。为了解决这些局限性,我们提出了SERA,一种用于指代图像分割的空间语义专家路由架构。SERA在视觉-语言框架内的两个互补阶段引入了轻量级、表达感知的专家细化。首先,我们设计了SERA-Adapter,它将一个表达条件适配器插入选定的主干块中,通过专家引导的细化和跨模态注意力提高空间一致性和边界精度。然后,我们引入SERA-Fusion,通过将标记特征重塑为空间网格并在多模态交互之前应用几何保持的专家变换,增强中间视觉表示。此外,一个轻量级路由机制自适应地加权专家贡献,同时保持与预训练表示的兼容性。为了使该路由在冻结编码器下保持稳定,SERA采用了一种参数高效的调优策略,仅更新归一化和偏置项,影响不到1%的主干参数。在标准指代图像分割基准上的实验表明,SERA始终优于强基线,尤其在需要准确空间定位和精确边界划分的表达上表现出明显的提升。
cs.CV / 20 / 2603.12545

Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA

空间推理并非免费的午餐:关于 LLaVA 的对照研究
Alam, Nahid, Murali, Leema Krishna, Bharadwaj, Siddhant, Liu, Patrick, Chung, Timothy, Sharma, Drishti, A., Akshata, Kiran, Kranthi, Tam, Wesley, Vegesna, Bala Krishna S
Abstract
Vision-language models (VLMs) have advanced rapidly, yet they still struggle with basic spatial reasoning. Despite strong performance on general benchmarks, modern VLMs remain brittle at understanding 2D spatial relationships such as relative position, layout, and counting. We argue that this failure is not merely a data problem, but is closely tied to dominant design choices in current VLM pipelines: reliance on CLIP-style image encoders and the flattening of images into 1D token sequences with 1D positional encoding. We present a controlled diagnostic study within the LLaVA framework to isolate how these choices affect spatial grounding. We evaluate frontier models and LLaVA variants on a suite of spatial benchmarks, comparing CLIP-based encoders against alternatives trained with denser or generative objectives, as well as variants augmented with 2D positional encoding. Our results show consistent spatial performance gaps across models, and indicate that encoder objectives and positional structure shape spatial behavior, but do not fully resolve it.
Chinese Translation
视觉语言模型(VLMs)迅速发展,但在基本的空间推理方面仍然存在困难。尽管在一般基准测试中表现强劲,现代 VLMs 在理解二维空间关系(如相对位置、布局和计数)方面仍然脆弱。我们认为,这一失败不仅仅是数据问题,而是与当前 VLM 流程中的主导设计选择密切相关:依赖于 CLIP 风格的图像编码器以及将图像展平为一维标记序列并使用一维位置编码。我们在 LLaVA 框架内进行了一项对照诊断研究,以隔离这些选择如何影响空间基础。我们在一系列空间基准测试中评估前沿模型和 LLaVA 变体,比较基于 CLIP 的编码器与使用更密集或生成目标训练的替代方案,以及增强了二维位置编码的变体。我们的结果显示模型之间存在一致的空间性能差距,并表明编码器目标和位置结构塑造了空间行为,但并未完全解决这一问题。
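The positional-encoding choice probed above can be made concrete: flattening to 1D gives every patch a single index, while a factorized 2D scheme concatenates separate row and column codes, so patches sharing a row (or column) share half their encoding. A minimal sketch of the 2D variant, assuming a standard sinusoidal formulation rather than the exact encodings compared in the paper:

```python
import math

def sincos_1d(pos, dim):
    """Standard 1D sinusoidal encoding for a single position."""
    return [
        math.sin(pos / 10000 ** (2 * (i // 2) / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** (2 * (i // 2) / dim))
        for i in range(dim)
    ]

def pe_2d(h, w, dim):
    """Factorized 2D encoding: concat row and column codes per patch."""
    assert dim % 2 == 0
    return [sincos_1d(r, dim // 2) + sincos_1d(c, dim // 2)
            for r in range(h) for c in range(w)]

pe = pe_2d(4, 4, 8)  # 16 flattened tokens, 8 dims each
```

Under a plain 1D encoding over the flattened index, no such row/column structure is preserved, which is the asymmetry the study isolates.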
cs.CV / 21 / 2603.12547

Decoding Matters: Efficient Mamba-Based Decoder with Distribution-Aware Deep Supervision for Medical Image Segmentation

解码至关重要:基于Mamba的高效解码器与分布感知深度监督在医学图像分割中的应用
Bougourzi, Fares, Dornaika, Fadi, Hadid, Abdenour
Abstract
Deep learning has achieved remarkable success in medical image segmentation, often reaching expert-level accuracy in delineating tumors and tissues. However, most existing approaches remain task-specific, showing strong performance on individual datasets but limited generalization across diverse imaging modalities. Moreover, many methods focus primarily on the encoder, relying on large pretrained backbones that increase computational complexity. In this paper, we propose a decoder-centric approach for generalized 2D medical image segmentation. The proposed Deco-Mamba follows a U-Net-like structure with a Transformer-CNN-Mamba design. The encoder combines a CNN block and Transformer backbone for efficient feature extraction, while the decoder integrates our novel Co-Attention Gate (CAG), Vision State Space Module (VSSM), and deformable convolutional refinement block to enhance multi-scale contextual representation. Additionally, a windowed distribution-aware KL-divergence loss is introduced for deep supervision across multiple decoding stages. Extensive experiments on diverse medical image segmentation benchmarks yield state-of-the-art performance and strong generalization capability while maintaining moderate model complexity. The source code will be released upon acceptance.
Chinese Translation
深度学习在医学图像分割中取得了显著成功,通常能够达到专家级的肿瘤和组织描绘精度。然而,大多数现有方法仍然是任务特定的,在个别数据集上表现强劲,但在多样化成像模式下的泛化能力有限。此外,许多方法主要集中在编码器上,依赖于大型预训练骨干网络,这增加了计算复杂性。本文提出了一种以解码器为中心的通用二维医学图像分割方法。所提出的Deco-Mamba遵循类似U-Net的结构,采用Transformer-CNN-Mamba设计。编码器结合了CNN模块和Transformer骨干网络,以实现高效的特征提取,而解码器则集成了我们新颖的协同注意力门(Co-Attention Gate, CAG)、视觉状态空间模块(Vision State Space Module, VSSM)和可变形卷积细化块,以增强多尺度上下文表示。此外,提出了一种基于窗口的分布感知KL散度损失,用于多个解码阶段的深度监督。在多样化的医学图像分割基准上进行的广泛实验显示出最先进的性能和强大的泛化能力,同时保持适中的模型复杂性。源代码将在接受后发布。
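The windowed distribution-aware KL loss can be sketched as: convert each local window of predictions and targets into probability distributions via softmax, then average the KL divergence across windows. The window size, non-overlapping layout, and softmax normalization below are illustrative assumptions, not the paper's exact formulation:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def windowed_kl(pred, target, win=4, eps=1e-12):
    """Average KL(target_window || pred_window) over non-overlapping windows."""
    total, n = 0.0, 0
    for i in range(0, len(pred) - win + 1, win):
        p = softmax(target[i:i + win])
        q = softmax(pred[i:i + win])
        total += sum(pi * math.log((pi + eps) / (qi + eps))
                     for pi, qi in zip(p, q))
        n += 1
    return total / n
```

For deep supervision, one such loss would be attached to each decoding stage's output.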
cs.CV / 22 / 2603.12551

CVGL: Causal Learning and Geometric Topology

CVGL:因果学习与几何拓扑
Ouyang, Songsong, Zhu, Yingying
Abstract
Cross-view geo-localization (CVGL) aims to estimate the geographic location of a street image by matching it with a corresponding aerial image. This is critical for autonomous navigation and mapping in complex real-world scenarios. However, the task remains challenging due to significant viewpoint differences and the influence of confounding factors. To tackle these issues, we propose the Causal Learning and Geometric Topology (CLGT) framework, which integrates two key components: a Causal Feature Extractor (CFE) that mitigates the influence of confounding factors by leveraging causal intervention to encourage the model to focus on stable, task-relevant semantics; and a Geometric Topology Fusion (GT Fusion) module that injects Bird's Eye View (BEV) road topology into street features to alleviate cross-view inconsistencies caused by extreme perspective changes. Additionally, we introduce a Data-Adaptive Pooling (DA Pooling) module to enhance the representation of semantically rich regions. Extensive experiments on CVUSA, CVACT, and their robustness-enhanced variants (CVUSA-C-ALL and CVACT-C-ALL) demonstrate that CLGT achieves state-of-the-art performance, particularly under challenging real-world corruptions. Our codes are available at https://github.com/oyss-szu/CLGT.
Chinese Translation
跨视角地理定位(CVGL)旨在通过将街景图像与相应的航空图像匹配来估计其地理位置。这对于复杂现实场景中的自主导航和地图绘制至关重要。然而,由于视角差异显著以及混杂因素的影响,这一任务仍然具有挑战性。为了解决这些问题,我们提出了因果学习与几何拓扑(CLGT)框架,该框架集成了两个关键组件:因果特征提取器(Causal Feature Extractor, CFE),通过利用因果干预来减轻混杂因素的影响,鼓励模型关注稳定的、与任务相关的语义;以及几何拓扑融合(Geometric Topology Fusion, GT Fusion)模块,该模块将鸟瞰视角(Bird's Eye View, BEV)道路拓扑注入街景特征中,以缓解因极端视角变化导致的跨视角不一致。此外,我们还引入了数据自适应池化(Data-Adaptive Pooling, DA Pooling)模块,以增强语义丰富区域的表示。在CVUSA、CVACT及其增强鲁棒性的变体(CVUSA-C-ALL和CVACT-C-ALL)上的大量实验表明,CLGT在特别具有挑战性的现实世界干扰下实现了最先进的性能。我们的代码可在https://github.com/oyss-szu/CLGT获取。
cs.CV / 23 / 2603.12575

AccelAes: Accelerating Diffusion Transformers for Training-Free Aesthetic-Enhanced Image Generation

AccelAes:加速无训练美学增强图像生成的扩散变换器
Yin, Xuanhua, Xu, Chuanzhi, Zhou, Haoxian, Wei, Boyu, Cai, Weidong
Abstract
Diffusion Transformers (DiTs) are a dominant backbone for high-fidelity text-to-image generation due to strong scalability and alignment at high resolutions. However, quadratic self-attention over dense spatial tokens leads to high inference latency and limits deployment. We observe that denoising is spatially non-uniform with respect to aesthetic descriptors in the prompt. Regions associated with aesthetic tokens receive concentrated cross-attention and show larger temporal variation, while low-affinity regions evolve smoothly with redundant computation. Based on this insight, we propose AccelAes, a training-free framework that accelerates DiTs through aesthetics-aware spatio-temporal reduction while improving perceptual aesthetics. AccelAes builds AesMask, a one-shot aesthetic focus mask derived from prompt semantics and cross-attention signals. When localized computation is feasible, SkipSparse reallocates computation and guidance to masked regions. We further reduce temporal redundancy using a lightweight step-level prediction cache that periodically replaces full Transformer evaluations. Experiments on representative DiT families show consistent acceleration and improved aesthetics-oriented quality. On Lumina-Next, AccelAes achieves a 2.11$\times$ speedup and improves ImageReward by +11.9% over the dense baseline. Code is available at https://github.com/xuanhuayin/AccelAes.
Chinese Translation
扩散变换器(Diffusion Transformers, DiTs)因其在高分辨率下的强大可扩展性和对齐能力,成为高保真文本到图像生成的主流骨干。然而,密集空间标记的二次自注意力导致高推理延迟,限制了其部署。我们观察到去噪在与提示中的美学描述符相关的空间上是不均匀的。与美学标记相关的区域接收集中交叉注意力,并显示出更大的时间变化,而低亲和区域则平滑演变,伴随冗余计算。基于这一见解,我们提出了AccelAes,一个无训练的框架,通过美学感知的时空降维加速DiTs,同时提高感知美学。AccelAes构建了AesMask,一个基于提示语义和交叉注意力信号的单次美学聚焦掩码。当局部计算可行时,SkipSparse将计算和指导重新分配到掩码区域。我们进一步通过轻量级的步级预测缓存减少时间冗余,该缓存定期替换完整的变换器评估。在代表性的DiT家族上的实验显示出一致的加速和改善的美学导向质量。在Lumina-Next上,AccelAes实现了2.11倍的加速,并在密集基线基础上将ImageReward提高了11.9%。代码可在https://github.com/xuanhuayin/AccelAes获取。
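The step-level prediction cache reduces temporal redundancy by running the full Transformer only every few denoising steps and reusing the cached prediction in between. A minimal sketch, where the refresh interval and the `full_eval` stub are placeholders for the paper's scheduling and DiT forward pass:

```python
def denoise_with_cache(num_steps, refresh_every, full_eval):
    """Run full_eval(step) only every `refresh_every` steps; reuse cache otherwise."""
    cache, outputs, full_calls = None, [], 0
    for step in range(num_steps):
        if step % refresh_every == 0:
            cache = full_eval(step)   # expensive full Transformer evaluation
            full_calls += 1
        outputs.append(cache)         # cheap cached step
    return outputs, full_calls

outs, calls = denoise_with_cache(50, 5, full_eval=lambda s: f"pred@{s}")
```

With 50 steps and a refresh interval of 5, only 10 full evaluations are needed, which is the source of the wall-clock speedup.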
cs.CV / 24 / 2603.12579

DINOLight: Robust Ambient Light Normalization with Self-supervised Visual Prior Integration

DINOLight:具有自监督视觉先验整合的鲁棒环境光归一化
Oh, Youngjin, Kwon, Junhyeong, Cho, Nam Ik
Abstract
This paper presents a new ambient light normalization framework, DINOLight, that integrates the self-supervised model DINOv2's image understanding capability into the restoration process as a visual prior. Ambient light normalization aims to restore images degraded by non-uniform shadows and lighting caused by multiple light sources and complex scene geometries. We observe that DINOv2 can reliably extract both semantic and geometric information from a degraded image. Based on this observation, we develop a novel framework to utilize DINOv2 features for lighting normalization. First, we propose an adaptive feature fusion module that combines features from different DINOv2 layers using a point-wise softmax mask. Next, the fused features are integrated into our proposed restoration network in both spatial and frequency domains through an auxiliary cross-attention mechanism. Experiments show that DINOLight achieves superior performance on the Ambient6K dataset, and that DINOv2 features are effective for enhancing ambient light normalization. We also apply our method to shadow-removal benchmark datasets, achieving competitive results compared to methods that use mask priors. Codes will be released upon acceptance.
Chinese Translation
本文提出了一种新的环境光归一化框架DINOLight,该框架将自监督模型DINOv2的图像理解能力整合到恢复过程中,作为视觉先验。环境光归一化旨在恢复因多光源和复杂场景几何形状造成的不均匀阴影和光照而退化的图像。我们观察到,DINOv2能够可靠地从退化图像中提取语义和几何信息。基于这一观察,我们开发了一种新颖的框架,以利用DINOv2特征进行光照归一化。首先,我们提出了一种自适应特征融合模块,该模块使用逐点softmax掩码结合来自不同DINOv2层的特征。接下来,融合的特征通过辅助交叉注意机制在空间域和频率域中整合到我们提出的恢复网络中。实验表明,DINOLight在Ambient6K数据集上实现了优越的性能,并且DINOv2特征在增强环境光归一化方面有效。我们还将该方法应用于阴影去除基准数据集,相较于使用掩码先验的方法,取得了具有竞争力的结果。代码将在接受后发布。
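The adaptive feature fusion module — combining features from several DINOv2 layers with a point-wise softmax mask — is, per spatial position, a convex combination whose weights come from softmaxed per-layer scores. A sketch with scalar features per position; in practice the scores would come from a small learned head:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def fuse_layers(layer_feats, layer_scores):
    """Per spatial position, softmax the per-layer scores into weights and
    take the weighted sum of layer features (point-wise convex fusion)."""
    n_pos = len(layer_feats[0])
    fused = []
    for p in range(n_pos):
        w = softmax([scores[p] for scores in layer_scores])
        fused.append(sum(wi * feats[p] for wi, feats in zip(w, layer_feats)))
    return fused

feats = [[1.0, 2.0, 3.0], [5.0, 6.0, 7.0]]     # 2 layers, 3 positions
scores = [[0.0, 10.0, 0.0], [0.0, -10.0, 0.0]]  # hypothetical score maps
fused = fuse_layers(feats, scores)
```

Because the weights are a softmax, the fused value at each position always stays inside the range spanned by the layer features.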
cs.CV / 25 / 2603.12587

MRGeo: Robust Cross-View Geo-Localization of Corrupted Images via Spatial and Channel Feature Enhancement

MRGeo:通过空间和通道特征增强实现鲁棒的跨视角地理定位
Wu, Le, Bo, Lv, Ouyang, Songsong, Zhu, Yingying
Abstract
Cross-view geo-localization (CVGL) aims to accurately localize street-view images through retrieval of corresponding geo-tagged satellite images. While prior works have achieved nearly perfect performance on certain standard datasets, their robustness in real-world corrupted environments remains under-explored. This oversight causes severe performance degradation or failure when images are affected by corruption such as blur or weather, significantly limiting practical deployment. To address this critical gap, we introduce MRGeo, the first systematic method designed for robust CVGL under corruption. MRGeo employs a hierarchical defense strategy that enhances the intrinsic quality of features and then enforces a robust geometric prior. Its core is the Spatial-Channel Enhancement Block, which contains: (1) a Spatial Adaptive Representation Module that models global and local features in parallel and uses a dynamic gating mechanism to arbitrate their fusion based on feature reliability; and (2) a Channel Calibration Module that performs compensatory adjustments by modeling multi-granularity channel dependencies to counteract information loss. To prevent spatial misalignment under severe corruption, a Region-level Geometric Alignment Module imposes a geometric structure on the final descriptors, ensuring coarse-grained consistency. Comprehensive experiments on both robustness benchmark and standard datasets demonstrate that MRGeo not only achieves an average R@1 improvement of 2.92\% across three comprehensive robustness benchmarks (CVUSA-C-ALL, CVACT\_val-C-ALL, and CVACT\_test-C-ALL) but also establishes superior performance in cross-area evaluation, thereby demonstrating its robustness and generalization capability.
Chinese Translation
跨视角地理定位(CVGL)旨在通过检索相应的地理标记卫星图像来准确定位街景图像。尽管先前的研究在某些标准数据集上已实现近乎完美的性能,但它们在真实世界中受损环境下的鲁棒性仍然未得到充分探索。这一忽视导致当图像受到模糊或天气等损坏影响时,性能严重下降或失败,从而显著限制了实际应用。为了解决这一关键问题,我们提出了MRGeo,这是首个专为在图像损坏情况下实现鲁棒CVGL而设计的系统方法。MRGeo采用了一种分层防御策略,增强特征的内在质量,然后强制执行鲁棒的几何先验。其核心是空间-通道增强模块,其中包含:(1)空间自适应表示模块,该模块并行建模全局和局部特征,并使用动态门控机制根据特征可靠性仲裁其融合;(2)通道校准模块,通过建模多粒度通道依赖关系进行补偿性调整,以抵消信息损失。为了防止在严重损坏下的空间错位,区域级几何对齐模块对最终描述符施加几何结构,确保粗粒度的一致性。在鲁棒性基准和标准数据集上的全面实验表明,MRGeo不仅在三个综合鲁棒性基准(CVUSA-C-ALL、CVACT_val-C-ALL和CVACT_test-C-ALL)上实现了平均R@1提升2.92\%,而且在跨区域评估中表现出色,从而证明了其鲁棒性和泛化能力。
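The dynamic gating mechanism that arbitrates between the parallel global and local branches can be sketched as a sigmoid gate yielding a per-dimension convex mix of the two features. The gate weights below are placeholders, not learned values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(global_feat, local_feat, gate_w, gate_b):
    """Per dimension: g = sigmoid(w0*global + w1*local + b);
    output g*global + (1-g)*local."""
    fused, gates = [], []
    for g_f, l_f in zip(global_feat, local_feat):
        g = sigmoid(gate_w[0] * g_f + gate_w[1] * l_f + gate_b)
        gates.append(g)
        fused.append(g * g_f + (1 - g) * l_f)
    return fused, gates

fused, gates = gated_fusion([1.0, -2.0], [0.0, 4.0],
                            gate_w=[0.5, 0.5], gate_b=0.0)
```

Since the gate is a sigmoid, the fused value is always a convex combination, so a corrupted branch can be down-weighted without being discarded.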
cs.CV / 26 / 2603.12588

SDF-Net: Structure-Aware Disentangled Feature Learning for Optical-SAR Ship Re-identification

SDF-Net:结构感知的解耦特征学习用于光学-合成孔径雷达船只再识别
Chen, Furui, Wang, Han, Sun, Yuhan, You, Jianing, Lv, Yixuan, Zhou, Zhuang, Tan, Hong, Li, Shengyang
Abstract
Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery is fundamentally challenged by the severe radiometric discrepancy between passive optical imaging and coherent active radar sensing. While existing approaches primarily rely on statistical distribution alignment or semantic matching, they often overlook a critical physical prior: ships are rigid objects whose geometric structures remain stable across sensing modalities, whereas texture appearance is highly modality-dependent. In this work, we propose SDF-Net, a Structure-Aware Disentangled Feature Learning Network that systematically incorporates geometric consistency into optical--SAR ship ReID. Built upon a ViT backbone, SDF-Net introduces a structure consistency constraint that extracts scale-invariant gradient energy statistics from intermediate layers to robustly anchor representations against radiometric variations. At the terminal stage, SDF-Net disentangles the learned representations into modality-invariant identity features and modality-specific characteristics. These decoupled cues are then integrated through a parameter-free additive residual fusion, effectively enhancing discriminative power. Extensive experiments on the HOSS-ReID dataset demonstrate that SDF-Net consistently outperforms existing state-of-the-art methods. The code and trained models are publicly available at https://github.com/cfrfree/SDF-Net.
Chinese Translation
光学与合成孔径雷达(SAR)图像之间的跨模态船只再识别(ReID)面临着被动光学成像与相干主动雷达感知之间严重的辐射度差异这一根本挑战。现有方法主要依赖于统计分布对齐或语义匹配,但往往忽视了一个关键的物理先验:船只是刚性物体,其几何结构在不同感知模态下保持稳定,而纹理外观则高度依赖于模态。在本研究中,我们提出了SDF-Net,一种结构感知的解耦特征学习网络,系统地将几何一致性纳入光学-合成孔径雷达船只再识别中。SDF-Net基于ViT(Vision Transformer)骨干网络构建,引入了一种结构一致性约束,从中间层提取尺度不变的梯度能量统计,以稳健地锚定表示,抵御辐射度变化。在终端阶段,SDF-Net将学习到的表示解耦为模态不变的身份特征和模态特定的特征。这些解耦的线索通过无参数的加性残差融合进行整合,有效增强了区分能力。在HOSS-ReID数据集上的大量实验表明,SDF-Net始终优于现有的最先进方法。代码和训练模型已公开发布在https://github.com/cfrfree/SDF-Net。
cs.CV / 27 / 2603.12598

Neural Gate: Mitigating Privacy Risks in LVLMs via Neuron-Level Gradient Gating

神经门:通过神经元级梯度门控减轻大型视觉语言模型中的隐私风险
Cao, Xiangkui, Zhang, Jie, Kan, Meina, Shan, Shiguang, Chen, Xilin
Abstract
Large Vision-Language Models (LVLMs) have shown remarkable potential across a wide array of vision-language tasks, leading to their adoption in critical domains such as finance and healthcare. However, their growing deployment also introduces significant security and privacy risks. Malicious actors could potentially exploit these models to extract sensitive information, highlighting a critical vulnerability. Recent studies show that LVLMs often fail to consistently refuse instructions designed to compromise user privacy. While existing work on privacy protection has made meaningful progress in preventing the leakage of sensitive data, they are constrained by limitations in both generalization and non-destructiveness. They often struggle to robustly handle unseen privacy-related queries and may inadvertently degrade a model's performance on standard tasks. To address these challenges, we introduce Neural Gate, a novel method for mitigating privacy risks through neuron-level model editing. Our method improves a model's privacy safeguards by increasing its rate of refusal for privacy-related questions, crucially extending this protective behavior to novel sensitive queries not encountered during the editing process. Neural Gate operates by learning a feature vector to identify neurons associated with privacy-related concepts within the model's representation of a subject. This localization then precisely guides the update of model parameters. Through comprehensive experiments on MiniGPT and LLaVA, we demonstrate that our method significantly boosts the model's privacy protection while preserving its original utility.
Chinese Translation
大型视觉语言模型(LVLMs)在各种视觉语言任务中展现出了显著的潜力,导致它们在金融和医疗等关键领域的广泛应用。然而,它们的日益部署也带来了显著的安全和隐私风险。恶意行为者可能利用这些模型提取敏感信息,突显出一种关键的脆弱性。最近的研究表明,LVLMs往往无法始终拒绝旨在妨碍用户隐私的指令。尽管现有的隐私保护工作在防止敏感数据泄露方面取得了有意义的进展,但它们在泛化能力和非破坏性方面受到限制。它们通常难以稳健地处理未见过的隐私相关查询,并可能无意中降低模型在标准任务上的表现。为了解决这些挑战,我们提出了神经门(Neural Gate),一种通过神经级模型编辑减轻隐私风险的新方法。我们的方法通过提高模型对隐私相关问题的拒绝率来增强模型的隐私保护,关键是将这种保护行为扩展到在编辑过程中未遇到的新敏感查询。神经门通过学习一个特征向量来识别与模型对某一主题的表示中隐私相关概念相关的神经元。这种定位精确指导了模型参数的更新。通过对MiniGPT和LLaVA的全面实验,我们证明了我们的方法显著增强了模型的隐私保护,同时保持了其原有的实用性。
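The neuron-level localization step — scoring neurons against a learned privacy direction and restricting the parameter update to the top scorers — can be sketched as top-k selection followed by a masked gradient step. The scoring rule, k, and learning rate are illustrative assumptions:

```python
def select_neurons(activations, direction, k):
    """Score each neuron by its activation's product with the learned
    privacy direction; return indices of the top-k scorers."""
    scores = [a * d for a, d in zip(activations, direction)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def masked_update(weights, grads, selected, lr=0.1):
    """Apply the gradient step only to the selected neurons' weights,
    leaving all other parameters untouched."""
    sel = set(selected)
    return [w - lr * g if i in sel else w
            for i, (w, g) in enumerate(zip(weights, grads))]

acts = [0.9, 0.1, 0.8, 0.05]
direction = [1.0, 1.0, 1.0, 1.0]     # hypothetical learned privacy direction
picked = select_neurons(acts, direction, k=2)
new_w = masked_update([1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5], picked)
```

Editing only the localized neurons is what lets the method raise refusal rates while leaving the rest of the model, and hence its general utility, intact.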
cs.CV / 28 / 2603.12599

A Prediction-as-Perception Framework for 3D Object Detection

用于三维物体检测的预测即感知框架
Zhang, Song, Chen, Haoyu, Wang, Ruibo
Abstract
Humans combine prediction and perception to observe the world. When faced with rapidly moving birds or insects, we can only perceive them clearly by predicting their next position and focusing our gaze there. Inspired by this, this paper proposes the Prediction-As-Perception (PAP) framework, integrating a prediction-perception architecture into 3D object perception tasks to enhance the model's perceptual accuracy. The PAP framework consists of two main modules: prediction and perception, primarily utilizing continuous frame information as input. Firstly, the prediction module forecasts the potential future positions of ego vehicles and surrounding traffic participants based on the perception results of the current frame. These predicted positions are then passed as queries to the perception module of the subsequent frame. The perceived results are iteratively fed back into the prediction module. We evaluated the PAP structure using the end-to-end model UniAD on the nuScenes dataset. The results demonstrate that the PAP structure improves UniAD's target tracking accuracy by 10% and increases the inference speed by 15%. This indicates that such a biomimetic design significantly enhances the efficiency and accuracy of perception models while reducing computational resource consumption.
Chinese Translation
人类通过结合预测与感知来观察世界。当面对快速移动的鸟类或昆虫时,我们只能通过预测它们的下一个位置并将视线集中在那里,才能清晰地感知它们。受到此启发,本文提出了预测感知框架(Prediction-As-Perception, PAP),将预测-感知架构整合到三维物体感知任务中,以提高模型的感知准确性。PAP框架由两个主要模块组成:预测和感知,主要利用连续帧信息作为输入。首先,预测模块基于当前帧的感知结果预测自我车辆和周围交通参与者的潜在未来位置。这些预测位置随后作为查询传递给下一帧的感知模块。感知结果被迭代地反馈到预测模块中。我们使用端到端模型UniAD在nuScenes数据集上评估了PAP结构。结果表明,PAP结构提高了UniAD的目标跟踪准确性10%,并将推理速度提高了15%。这表明,这种仿生设计显著提高了感知模型的效率和准确性,同时减少了计算资源的消耗。
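The prediction-perception loop can be sketched with a deliberately simple constant-velocity predictor standing in for the paper's prediction module: forecast each tracked object's next position, use the forecasts as queries to match the next frame's detections, and feed the matches back into the predictor:

```python
def predict_next(tracks, dt=1.0):
    """Constant-velocity forecast: each track is (position, velocity)."""
    return [(p + v * dt, v) for p, v in tracks]

def perceive(queries, detections, max_dist=1.0):
    """Match each predicted query to the nearest detection within max_dist,
    then refresh the track's velocity from the observed displacement."""
    matched = []
    for pred_p, v in queries:
        best = min(detections, key=lambda d: abs(d - pred_p))
        if abs(best - pred_p) <= max_dist:
            prev_p = pred_p - v
            matched.append((best, best - prev_p))
    return matched

tracks = [(0.0, 1.0)]            # one object at x=0 moving +1 per frame
queries = predict_next(tracks)   # forecast: object near x=1 next frame
tracks = perceive(queries, detections=[1.1, 5.0])
```

The real framework iterates this loop over frames with learned prediction and perception modules; the 1D matching here only illustrates the data flow.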
cs.CV / 29 / 2603.12605

A2Z-10M+: Geometric Deep Learning with A-to-Z BRep Annotations for AI-Assisted CAD Modeling and Reverse Engineering

A2Z-10M+: 基于A到Z边界表示注释的几何深度学习,用于AI辅助的CAD建模和逆向工程
Jena, Pritham Kumar, Baburaj, Bhavika, Anand, Tushar, Dutta, Vedant, Ulavala, Vineeth, Ali, Sk Aziz
Abstract
Reverse engineering and rapid prototyping of computer-aided design (CAD) models from 3D scans, sketches, or simple text prompts are vital in industrial product design. However, recent advances in geometric deep learning techniques lack a multi-modal understanding of parametric CAD features stored in their boundary representation (BRep). This study presents the largest compilation of 10 million multi-modal annotations and metadata for 1 million ABC CAD models, namely A2Z, to unlock an unprecedented level of BRep learning. A2Z comprises (i) high-resolution meshes with salient 3D scanning features, (ii) 3D hand-drawn sketches equipped with (iii) geometric and topological information about BRep co-edges, corners, and surfaces, and (iv) textual captions and tags describing the product in the mechanical world. Creating such carefully structured, large-scale data, which requires nearly 5 terabytes of storage to enable unparalleled CAD learning/retrieval tasks, is very challenging. The scale, quality, and diversity of our multi-modal annotations are assessed using novel metrics, GPT-5, Gemini, and extensive human feedback mechanisms. In addition, we merge an additional 25,000 CAD models of electronic enclosures (e.g., tablets, ports) designed by skilled professionals with our A2Z dataset. Subsequently, we train and benchmark a foundation model on a subset of 150K CAD models to detect BRep co-edges and corner vertices from 3D scans, a key downstream task in CAD reverse engineering. The annotated dataset, metrics, and checkpoints will be publicly released to support numerous research directions.
Chinese Translation
从3D扫描、草图或简单文本提示中进行计算机辅助设计(CAD)模型的逆向工程和快速原型制作在工业产品设计中至关重要。然而,最近在几何深度学习技术方面的进展缺乏对存储在其边界表示(BRep)中的参数CAD特征的多模态理解。本研究展示了针对100万个ABC CAD模型的1000万多模态注释和元数据的最大汇编,称为A2Z,以解锁前所未有的BRep学习。A2Z包括(i)具有显著3D扫描特征的高分辨率网格,(ii)配备(iii)关于BRep共边、角点和表面的几何和拓扑信息的3D手绘草图,以及(iv)描述机械世界中产品的文本说明和标签。创建这样结构严谨的大规模数据需要近5TB的存储,以支持无与伦比的CAD学习/检索任务,这非常具有挑战性。我们使用新颖的指标、GPT-5、Gemini和广泛的人类反馈机制来评估我们多模态注释的规模、质量和多样性。为此,我们还将由熟练专业人士设计的额外25,000个电子外壳(例如,平板电脑、端口)CAD模型与我们的A2Z数据集合并。随后,我们在150K CAD模型的子集上训练和基准测试一个基础模型,以从3D扫描中检测BRep共边和角点,这是CAD逆向工程中的一个关键下游任务。注释数据集、指标和检查点将公开发布,以支持众多研究方向。
cs.CV / 30 / 2603.12606

Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning

掌握否定:通过基于分组对立的学习提升定位模型
Yang, Zesheng, Jiang, Xi, Hu, Bingzhang, Guan, Weili, Cong, Runmin, Qi, Guo-Jun, Zheng, Feng
Abstract
Current vision-language detection and grounding models predominantly focus on prompts with positive semantics and often struggle to accurately interpret and ground complex expressions containing negative semantics. A key reason for this limitation is the lack of high-quality training data that explicitly captures discriminative negative samples and negation-aware language descriptions. To address this challenge, we introduce D-Negation, a new dataset that provides objects annotated with both positive and negative semantic descriptions. Building upon the observation that negation reasoning frequently appears in natural language, we further propose a grouped opposition-based learning framework that learns negation-aware representations from limited samples. Specifically, our method organizes opposing semantic descriptions from D-Negation into structured groups and formulates two complementary loss functions that encourage the model to reason about negation and semantic qualifiers. We integrate the proposed dataset and learning strategy into a state-of-the-art language-based grounding model. By fine-tuning fewer than 10 percent of the model parameters, our approach achieves improvements of up to 4.4 mAP and 5.7 mAP on positive and negative semantic evaluations, respectively. These results demonstrate that explicitly modeling negation semantics can substantially enhance the robustness and localization accuracy of vision-language grounding models.
Chinese Translation
当前的视觉-语言检测与定位模型主要集中于具有积极语义的提示,往往难以准确解释和定位包含否定语义的复杂表达。这一局限性的一个关键原因是缺乏高质量的训练数据,无法明确捕捉到具有区分性的否定样本和关注否定的语言描述。为了解决这一挑战,我们引入了D-Negation,一个新的数据集,提供了同时标注有积极和消极语义描述的对象。基于否定推理在自然语言中频繁出现的观察,我们进一步提出了一种基于分组对立的学习框架,从有限样本中学习关注否定的表示。具体而言,我们的方法将D-Negation中的对立语义描述组织成结构化的组,并制定了两个互补的损失函数,鼓励模型对否定和语义限定词进行推理。我们将所提出的数据集和学习策略整合到一个最先进的基于语言的定位模型中。通过微调不到10%的模型参数,我们的方法在积极和消极语义评估中分别实现了高达4.4 mAP和5.7 mAP的提升。这些结果表明,明确建模否定语义可以显著增强视觉-语言定位模型的鲁棒性和定位准确性。
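The grouped opposition objective can be illustrated with a hinge loss over each structured group: an object's similarity to its positive description should exceed its similarity to the negated description by a margin. The dot-product scoring and margin value are assumptions, not the paper's exact loss functions:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def opposition_loss(obj_feat, pos_desc, neg_desc, margin=0.2):
    """Hinge loss: push sim(obj, positive description) above
    sim(obj, negated description) by at least `margin`."""
    return max(0.0, margin + dot(obj_feat, neg_desc) - dot(obj_feat, pos_desc))

obj = [1.0, 0.0]
# Well-separated group: positive description aligns with the object.
good = opposition_loss(obj, pos_desc=[0.9, 0.1], neg_desc=[0.1, 0.9])
# Violating group: the negated description scores higher, so loss > 0.
bad = opposition_loss(obj, pos_desc=[0.1, 0.9], neg_desc=[0.9, 0.1])
```

In the grouped setting, one such term would be accumulated over every opposing pair within a structured group.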
cs.CV / 31 / 2603.12624

Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains

用于货运列车基于实例分割的故障检测的提示驱动轻量级基础模型
Sun, Guodong, Liang, Qihang, Pan, Xingyu, Liu, Moyun, Zhang, Yang
Abstract
Accurate visual fault detection in freight trains remains a critical challenge for intelligent transportation system maintenance, due to complex operational environments, structurally repetitive components, and frequent occlusions or contaminations in safety-critical regions. Conventional instance segmentation methods based on convolutional neural networks and Transformers often suffer from poor generalization and limited boundary accuracy under such conditions. To address these challenges, we propose a lightweight self-prompted instance segmentation framework tailored for freight train fault detection. Our method leverages the Segment Anything Model by introducing a self-prompt generation module that automatically produces task-specific prompts, enabling effective knowledge transfer from foundation models to domain-specific inspection tasks. In addition, we adopt a Tiny Vision Transformer backbone to reduce computational cost, making the framework suitable for real-time deployment on edge devices in railway monitoring systems. We construct a domain-specific dataset collected from real-world freight inspection stations and conduct extensive evaluations. Experimental results show that our method achieves 74.6 $AP^{\text{box}}$ and 74.2 $AP^{\text{mask}}$ on the dataset, outperforming existing state-of-the-art methods in both accuracy and robustness while maintaining low computational overhead. This work offers a deployable and efficient vision solution for automated freight train inspection, demonstrating the potential of foundation model adaptation in industrial-scale fault diagnosis scenarios. Project page: https://github.com/MVME-HBUT/SAM_FTI-FDet.git
Chinese Translation
在货运列车中,准确的视觉故障检测仍然是智能交通系统维护面临的一个关键挑战,这主要由于复杂的操作环境、结构重复的组件以及安全关键区域内频繁的遮挡或污染。基于卷积神经网络和变换器的传统实例分割方法在这些条件下往往表现出较差的泛化能力和有限的边界精度。为了解决这些挑战,我们提出了一种轻量级自提示实例分割框架,专门针对货运列车故障检测。我们的方法利用了Segment Anything Model,通过引入自提示生成模块,自动生成特定任务的提示,从而实现基础模型到领域特定检查任务的有效知识转移。此外,我们采用了Tiny Vision Transformer作为骨干网络,以降低计算成本,使该框架适合在铁路监控系统的边缘设备上进行实时部署。我们构建了一个从真实货运检查站收集的领域特定数据集,并进行了广泛的评估。实验结果表明,我们的方法在该数据集上达到了74.6 $AP^{\text{box}}$ 和 74.2 $AP^{\text{mask}}$,在准确性和鲁棒性上均优于现有的最先进方法,同时保持了较低的计算开销。这项工作为自动化货运列车检查提供了一种可部署且高效的视觉解决方案,展示了基础模型在工业规模故障诊断场景中的适应潜力。项目页面:https://github.com/MVME-HBUT/SAM_FTI-FDet.git
cs.CV / 32 / 2603.12639

RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization

RoboStereo:用于统一策略优化的双塔4D具身世界模型
Zhang, Ruicheng, Chen, Guangyu, Xu, Zunnan, Liu, Zihao, Zhong, Zhizhou, Zhang, Mingyang, Zhou, Jun, Li, Xiu
Abstract
Scalable Embodied AI faces fundamental constraints due to prohibitive costs and safety risks of real-world interaction. While Embodied World Models (EWMs) offer promise through imagined rollouts, existing approaches suffer from geometric hallucinations and lack unified optimization frameworks for practical policy improvement. We introduce RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, we present the first unified framework for world-model-based policy optimization: (1) Test-Time Policy Augmentation (TTPA) for pre-execution verification, (2) Imitative-Evolutionary Policy Learning (IEPL) leveraging visual perceptual rewards to learn from expert demonstrations, and (3) Open-Exploration Policy Learning (OEPL) enabling autonomous skill discovery and self-correction. Comprehensive experiments demonstrate RoboStereo achieves state-of-the-art generation quality, with our unified framework delivering >97% average relative improvement on fine-grained manipulation tasks.
Chinese Translation
可扩展的具身人工智能面临着由于现实世界交互的高昂成本和安全风险而带来的基本限制。尽管具身世界模型(EWM)通过想象的展开提供了希望,但现有方法存在几何幻觉的问题,并且缺乏统一的优化框架以实现实际的策略改进。我们提出了RoboStereo,一种对称的双塔4D世界模型,采用双向跨模态增强以确保时空几何一致性并减轻物理幻觉。在这一高保真4D模拟器的基础上,我们提出了第一个基于世界模型的策略优化统一框架: (1) 测试时策略增强(TTPA)用于执行前验证, (2) 模仿-进化策略学习(IEPL)利用视觉感知奖励从专家演示中学习,以及 (3) 开放探索策略学习(OEPL)使自主技能发现和自我修正成为可能。全面的实验表明,RoboStereo在生成质量上达到了最先进的水平,我们的统一框架在细粒度操作任务上实现了超过97%的平均相对提升。
cs.CV / 33 / 2603.12647

LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction

LR-SGS:基于LiDAR反射率引导的鲁棒显著高斯溅射用于自动驾驶场景重建
Chen, Ziyu, Zhu, Fan, Zhu, Hui, Kong, Deyi, Kuang, Xinkai, Zhang, Yujia, Jiang, Chunmao
Abstract
Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.
Chinese Translation
近期的3D高斯溅射(3DGS)方法已展示了自动驾驶场景重建和新视图合成的可行性。然而,大多数现有方法要么仅依赖于摄像头,要么仅使用LiDAR进行高斯初始化或深度监督,而点云中丰富的场景信息,如反射率,以及LiDAR与RGB之间的互补性尚未得到充分利用,导致在高自我运动和复杂光照等挑战性自动驾驶场景中性能下降。为了解决这些问题,我们提出了一种鲁棒高效的基于LiDAR反射率引导的显著高斯溅射方法(LR-SGS),该方法引入了一种结构感知的显著高斯表示,从LiDAR提取的几何和反射率特征点初始化,并通过显著变换和改进的密度控制进行优化,以捕捉边缘和平面结构。此外,我们将LiDAR强度校准为反射率,并将其附加到每个高斯上,作为一个光照不变的材料通道,与RGB共同对齐以增强边界一致性。在Waymo开放数据集上的大量实验表明,LR-SGS在使用更少的高斯和更短的训练时间的情况下,实现了卓越的重建性能。特别是在复杂光照场景中,我们的方法在PSNR上超越了OmniRe 1.18 dB。
cs.CV / 34 / 2603.12648

From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space

从稀疏到密集:通过增强条件空间的多视角GRPO用于流模型
Bu, Jiazi, Ling, Pengyang, Zhou, Yujie, Wang, Yibin, Zang, Yuhang, Wei, Tianyi, Zhan, Xiaohang, Wang, Jiaqi, Wu, Tong, Pan, Xingang, Lin, Dahua
Abstract
Group Relative Policy Optimization (GRPO) has emerged as a powerful framework for preference alignment in text-to-image (T2I) flow models. However, we observe that the standard paradigm, which evaluates a group of generated samples against a single condition, suffers from insufficient exploration of inter-sample relationships, constraining both alignment efficacy and performance ceilings. To address this sparse single-view evaluation scheme, we propose Multi-View GRPO (MV-GRPO), a novel approach that enhances relationship exploration by augmenting the condition space to create a dense multi-view reward mapping. Specifically, for a group of samples generated from one prompt, MV-GRPO leverages a flexible Condition Enhancer to generate semantically adjacent yet diverse captions. These captions enable multi-view advantage re-estimation, capturing diverse semantic attributes and providing richer optimization signals. By deriving the probability distribution of the original samples conditioned on these new captions, we can incorporate them into the training process without costly sample regeneration. Extensive experiments demonstrate that MV-GRPO achieves superior alignment performance over state-of-the-art methods.
Chinese Translation
群体相对策略优化(Group Relative Policy Optimization, GRPO)已成为文本到图像(Text-to-Image, T2I)流模型中偏好对齐的强大框架。然而,我们观察到,标准范式下对一组生成样本进行单一条件评估的方式,因对样本间关系探索不足而受到限制,从而制约了对齐效果和性能上限。为了解决这一稀疏的单视角评估方案,我们提出了多视角GRPO(Multi-View GRPO, MV-GRPO),这是一种通过增强条件空间来创建密集的多视角奖励映射的新方法,从而增强关系探索。具体而言,对于从一个提示生成的一组样本,MV-GRPO利用灵活的条件增强器生成语义相近但多样的标题。这些标题使得多视角优势重新评估成为可能,捕捉多样的语义属性并提供更丰富的优化信号。通过推导基于这些新标题的原始样本的概率分布,我们可以在训练过程中将其纳入,而无需昂贵的样本再生。大量实验表明,MV-GRPO在对齐性能上优于最先进的方法。
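The multi-view advantage re-estimation can be sketched as: score the same group of samples under several semantically adjacent captions, compute the standard GRPO group-normalized advantage separately per view, and average across views. The reward values below are synthetic:

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Standard GRPO advantage: normalize rewards within the group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def multi_view_advantages(reward_matrix):
    """reward_matrix[v][i]: reward of sample i under caption view v.
    Average the per-view group advantages over all views."""
    per_view = [group_advantages(view) for view in reward_matrix]
    n = len(reward_matrix[0])
    return [sum(view[i] for view in per_view) / len(per_view)
            for i in range(n)]

# 2 caption views x 3 samples (synthetic rewards).
adv = multi_view_advantages([[1.0, 2.0, 3.0], [0.5, 0.6, 0.9]])
```

Because each view is normalized within the group before averaging, the dense multi-view signal refines the ranking without re-generating any samples.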
cs.CV / 35 / 2603.12655

VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model

VGGT-World:将VGGT转化为自回归几何世界模型
Sun, Xiangyu, Wang, Shijie, Zhang, Fengyi, Liu, Lin, Jia, Caiyan, Song, Ziying, Huang, Zi, Luo, Yadan
Abstract
World models that forecast scene evolution by generating future video frames devote the bulk of their capacity to photometric details, yet the resulting predictions often remain geometrically inconsistent. We present VGGT-World, a geometry world model that side-steps video generation entirely and instead forecasts the temporal evolution of frozen geometry-foundation-model (GFM) features. Concretely, we repurpose the latent tokens of a frozen VGGT as the world state and train a lightweight temporal flow transformer to autoregressively predict their future trajectory. Two technical challenges arise in this high-dimensional (d=1024) feature space: (i) standard velocity-prediction flow matching collapses, and (ii) autoregressive rollout suffers from compounding exposure bias. We address the first with a clean-target (z-prediction) parameterization that yields a substantially higher signal-to-noise ratio, and the second with a two-stage latent flow-forcing curriculum that progressively conditions the model on its own partially denoised rollouts. Experiments on KITTI, Cityscapes, and TartanAir demonstrate that VGGT-World significantly outperforms the strongest baselines in depth forecasting while running 3.6-5 times faster with only 0.43B trainable parameters, establishing frozen GFM features as an effective and efficient predictive state for 3D world modeling.
Chinese Translation
世界模型通过生成未来视频帧来预测场景演变,然而它们的大部分能力集中于光度细节,导致生成的预测往往在几何上不一致。我们提出了VGGT-World,这是一种几何世界模型,完全绕过视频生成,而是预测冻结几何基础模型(GFM)特征的时间演变。具体而言,我们将冻结的VGGT的潜在标记重新用于世界状态,并训练一个轻量级的时间流变换器以自回归方式预测其未来轨迹。在这个高维(d=1024)特征空间中出现了两个技术挑战:(i)标准的速度预测流匹配崩溃,以及(ii)自回归展开遭受累积曝光偏差。我们通过干净目标(z-预测)参数化解决第一个问题,从而获得显著更高的信噪比;通过两阶段潜在流强制课程解决第二个问题,逐步使模型适应其自身部分去噪的展开。在KITTI、Cityscapes和TartanAir上的实验表明,VGGT-World在深度预测方面显著优于最强基线,同时运行速度快3.6-5倍,仅需0.43B的可训练参数,确立了冻结GFM特征作为3D世界建模的有效且高效的预测状态。
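The clean-target (z-prediction) parameterization can be illustrated on the linear flow-matching path z_t = (1 - t) z_0 + t z_1: a model that predicts the clean target implies the velocity via v = (z_pred - z_t) / (1 - t), so the two parameterizations carry the same information, while regressing the clean target avoids the low signal-to-noise velocity target that collapses in the high-dimensional feature space. A scalar sketch of this identity:

```python
def interpolate(z0, z1, t):
    """Linear flow-matching path: z_t = (1 - t) * z0 + t * z1."""
    return (1 - t) * z0 + t * z1

def velocity_from_z_prediction(z_pred, z_t, t):
    """Recover the implied velocity from a clean-target prediction."""
    return (z_pred - z_t) / (1 - t)

z0, z1, t = 0.2, 1.8, 0.75
z_t = interpolate(z0, z1, t)
v_implied = velocity_from_z_prediction(z1, z_t, t)  # perfect z-prediction
v_true = z1 - z0
```

A perfect clean-target prediction therefore implies the exact ground-truth velocity, which is why the swap changes the regression target's conditioning rather than the model's expressive power.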
cs.CV / 36 / 2603.12657

VFM-Recon: Unlocking Cross-Domain Scene-Level Neural Reconstruction with Scale-Aligned Foundation Priors

VFM-Recon:解锁跨域场景级神经重建的尺度对齐基础先验
Ming, Yuhang, Xi, Tingkang, Yang, Xingrui, Yang, Lixin, Peng, Yong, Lu, Cewu, Kong, Wanzeng
Abstract
Scene-level neural volumetric reconstruction from monocular videos remains challenging, especially under severe domain shifts. Although recent advances in vision foundation models (VFMs) provide transferable generalized priors learned from large-scale data, their scale-ambiguous predictions are incompatible with the scale consistency required by volumetric fusion. To address this gap, we present VFM-Recon, the first attempt to bridge transferable VFM priors with scale-consistent requirements in scene-level neural reconstruction. Specifically, we first introduce a lightweight scale alignment stage that restores multi-view scale coherence. We then integrate pretrained VFM features into the neural volumetric reconstruction pipeline via lightweight task-specific adapters, which are trained for reconstruction while preserving the cross-domain robustness of pretrained representations. We train our model on the ScanNet train split and evaluate on both the in-distribution ScanNet test split and the out-of-distribution TUM RGB-D and Tanks and Temples datasets. The results demonstrate that our model achieves state-of-the-art performance across all dataset domains. In particular, on the challenging outdoor Tanks and Temples dataset, our model achieves an F1 score of 70.1 in reconstructed mesh evaluation, substantially outperforming the closest competitor, VGGT, which only attains 51.8.
Chinese Translation
从单目视频进行场景级神经体积重建仍然具有挑战性,尤其是在严重的领域转移情况下。尽管最近在视觉基础模型(VFM)方面的进展提供了从大规模数据中学习的可转移通用先验,但其尺度模糊的预测与体积融合所需的尺度一致性不兼容。为了解决这一问题,我们提出了VFM-Recon,这是首次尝试将可转移的VFM先验与场景级神经重建中的尺度一致性要求相结合。具体而言,我们首先引入了一个轻量级的尺度对齐阶段,以恢复多视图的尺度一致性。然后,我们通过轻量级的任务特定适配器将预训练的VFM特征集成到神经体积重建管道中,这些适配器在重建过程中经过训练,同时保持预训练表示的跨域鲁棒性。我们在ScanNet训练集上训练模型,并在ScanNet测试集(同分布)和TUM RGB-D及Tanks and Temples数据集(异分布)上进行评估。结果表明,我们的模型在所有数据集领域中均达到了最先进的性能。特别是在具有挑战性的户外Tanks and Temples数据集上,我们的模型在重建网格评估中达到了70.1的F1分数,显著超越了最接近的竞争对手VGGT,其仅获得51.8的分数。
cs.CV / 37 / 2603.12659

AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network

AVION:从离线教师到提示调优网络的空中视觉-语言指令
Hu, Yu, Gu, Jianyang, Liu, Hao, Cao, Yue, Hamari, Jozsef, Liu, Zheng, Zardadi, Mohsen
Abstract
Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly significant in aerial scenes, which involve various visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semantically rich textual prototypes by collecting descriptions from a large language model and verifying validity using remote sensing image features. The student module integrates lightweight and learnable prompts into both vision and language encoders, guided by the teacher to align embeddings and their cross-modal relationships. Once trained, the student operates independently during inference. Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and base-class accuracy without degrading generalization to novel categories. It also enhances mean recall for cross-modal retrieval, with minimal additional trainable parameters.
Chinese Translation
将视觉-语言模型适应于遥感图像仍然面临挑战,主要由于两个关键因素:文本表示的语义覆盖有限和视觉特征的适应性不足。这些问题在空中场景中尤为显著,因为这些场景涉及各种视觉外观和细粒度的物体区分。我们提出了AVION,一种针对遥感适应的视觉-语言模型的知识蒸馏框架。教师模块通过收集来自大型语言模型的描述并利用遥感图像特征验证有效性,构建语义丰富的文本原型。学生模块将轻量且可学习的提示整合到视觉和语言编码器中,在教师的指导下对嵌入及其跨模态关系进行对齐。训练完成后,学生在推理过程中独立操作。在六个光学遥感基准测试上的实验表明,AVION在不降低对新类别的泛化能力的情况下,提高了少样本分类和基础类别的准确性。同时,它还增强了跨模态检索的平均召回率,且增加的可训练参数极少。
cs.CV / 38 / 2603.12663

Learning Geometric and Photometric Features from Panoramic LiDAR Scans for Outdoor Place Categorization

从全景激光雷达扫描中学习几何和光度特征以进行户外场所分类
Nakashima, Kazuto, Jung, Hojung, Oto, Yuki, Iwashita, Yumi, Kurazume, Ryo, Mozos, Oscar Martinez
Abstract
Semantic place categorization, one of the essential tasks for autonomous robots and vehicles, allows them to make decisions and navigate on their own in unfamiliar environments. In particular, outdoor places are more difficult targets than indoor ones due to perceptual variations, such as dynamic illuminance over twenty-four hours and occlusions by cars and pedestrians. This paper presents a novel method of categorizing outdoor places using convolutional neural networks (CNNs), which take omnidirectional depth/reflectance images obtained by 3D LiDARs as the inputs. First, we construct a large-scale outdoor place dataset named Multi-modal Panoramic 3D Outdoor (MPO) comprising two types of point clouds captured by two different LiDARs. They are labeled with six outdoor place categories: coast, forest, indoor/outdoor parking, residential area, and urban area. Second, we provide CNNs for LiDAR-based outdoor place categorization and evaluate our approach with the MPO dataset. Our results on the MPO dataset outperform traditional approaches and show the effectiveness of using both depth and reflectance modalities. To analyze our trained deep networks, we visualize the learned features.
Chinese Translation
语义场所分类是自主机器人和车辆的基本任务之一,使它们能够在陌生环境中实现自我决策和导航。特别是,户外场所由于感知变化(如二十四小时内的动态光照和汽车及行人的遮挡)而比室内场所更具挑战性。本文提出了一种使用卷积神经网络(CNN)对户外场所进行分类的新方法,该方法以由3D激光雷达获取的全向深度/反射图像作为输入。首先,我们构建了一个名为多模态全景3D户外(MPO)的规模庞大的户外场所数据集,该数据集由两种不同激光雷达捕获的点云组成,并标注了六种户外场所类别:海岸、森林、室内/室外停车场、住宅区和城市区域。其次,我们为基于激光雷达的户外场所分类提供了CNN,并使用MPO数据集评估我们的方法。我们在MPO数据集上的结果优于传统方法,显示了我们同时使用深度和反射模态的有效性。为了分析我们训练的深度网络,我们可视化了学习到的特征。
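The omnidirectional depth/reflectance inputs described above are typically produced by an equirectangular (spherical) projection of the LiDAR point cloud. A minimal sketch follows; the field-of-view bounds, resolution, and nearest-return policy are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def points_to_panorama(points, refl, h=64, w=1024, fov_up=15.0, fov_down=-25.0):
    """Project an (N,3) LiDAR point cloud and its per-point reflectance onto
    an equirectangular depth/reflectance image pair (FOV values illustrative)."""
    x, y, z = points.T
    depth = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                                    # [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(depth, 1e-9), -1, 1))
    fu, fd = np.radians(fov_up), np.radians(fov_down)
    u = ((yaw + np.pi) / (2 * np.pi) * w).astype(int) % w     # horizontal bin
    v = ((fu - pitch) / (fu - fd) * h).astype(int).clip(0, h - 1)  # vertical bin
    depth_img = np.zeros((h, w))
    refl_img = np.zeros((h, w))
    order = np.argsort(-depth)        # write far points first; near ones overwrite
    depth_img[v[order], u[order]] = depth[order]
    refl_img[v[order], u[order]] = refl[order]
    return depth_img, refl_img
```

The two resulting channels can then be stacked as a two-modality CNN input, mirroring the depth+reflectance combination the abstract reports as effective.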
cs.CV / 39 / 2603.12667

Marker-Based 3D Reconstruction of Aggregates with a Comparative Analysis of 2D and 3D Morphologies

基于标记的骨料三维重建及二维与三维形态的比较分析
Huang, Haohang, Luo, Jiayi, Qamhia, Issam, Tutumluer, Erol, Hart, John M., Stolba, Andrew J.
Abstract
Aggregates, serving as the main skeleton in assemblies of construction materials, are important functional components in various building and transportation infrastructures. They can be used in unbound layer applications, e.g. pavement base and railroad ballast, bound applications of cement concrete and asphalt concrete, and as riprap and large-sized primary crushed rocks. Information on the size and shape or morphology of aggregates can greatly facilitate the Quality Assurance/Quality Control (QA/QC) process by providing insights of aggregate behavior during composition and packing. A full 3D characterization of aggregate particle morphology is difficult both during production in a quarry and at a construction site. Many aggregate imaging approaches have been developed to quantify the particle morphology by computer vision, including 2D image-based approaches that analyze particle silhouettes and 3D scanning-based methods that require expensive devices such as 3D laser scanners or X-Ray Computed Tomography (CT) equipment. This paper presents a flexible and cost-effective photogrammetry-based approach for the 3D reconstruction of aggregate particles. The proposed approach follows a marker-based design that enables background suppression, point cloud stitching, and scale referencing to obtain high-quality aggregate models. The accuracy of the reconstruction results was validated against ground-truth for selected aggregate samples. Comparative analyses were conducted on 2D and 3D morphological properties of the selected samples. Significant differences were found between the 2D and 3D statistics. Based on the presented approach, 3D shape information of aggregates can be obtained easily and at a low cost, thus allowing convenient aggregate inspection, data collection, and 3D morphological analysis.
Chinese Translation
骨料作为建筑材料组合中的主要骨架,是各种建筑和交通基础设施中重要的功能组成部分。它们可以用于无绑定层的应用,例如路面基础和铁路道砟,水泥混凝土和沥青混凝土的绑定应用,以及作为护坡石和大尺寸初级碎石。骨料的大小和形状或形态信息可以极大地促进质量保证/质量控制(QA/QC)过程,通过提供骨料在组成和堆积过程中的行为洞察。对骨料颗粒形态的全面三维表征在采石场生产和施工现场都很困难。许多骨料成像方法已经被开发出来,通过计算机视觉定量颗粒形态,包括分析颗粒轮廓的二维图像基础方法和需要昂贵设备(如三维激光扫描仪或X射线计算机断层扫描(CT)设备)的三维扫描基础方法。本文提出了一种灵活且具有成本效益的基于摄影测量的骨料颗粒三维重建方法。所提出的方法遵循基于标记的设计,能够实现背景抑制、点云拼接和尺度参考,从而获得高质量的骨料模型。重建结果的准确性通过与选定骨料样本的真实值进行验证。对选定样本的二维和三维形态特性进行了比较分析,发现二维和三维统计之间存在显著差异。基于所提出的方法,可以轻松且低成本地获得骨料的三维形状信息,从而方便骨料检查、数据收集和三维形态分析。
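The scale-referencing role of the markers can be illustrated with a toy helper: a photogrammetric reconstruction is only defined up to scale, and two detected marker centers with a known physical separation fix it. The function shape and marker representation below are assumptions for illustration, not the paper's pipeline:

```python
import numpy as np

def apply_marker_scale(points, marker_a, marker_b, known_dist):
    """Rescale an up-to-scale point cloud so that the distance between two
    detected marker centers equals their known physical separation."""
    measured = np.linalg.norm(np.asarray(marker_a, float) - np.asarray(marker_b, float))
    scale = known_dist / measured     # metric units per reconstruction unit
    return np.asarray(points, float) * scale
```

After this step, 3D morphological quantities (volume, axis lengths, surface area) computed on the mesh carry real-world units, which is what makes the 2D-vs-3D comparison in the paper meaningful.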
cs.CV / 40 / 2603.12669

Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning

视觉验证增强的视觉语言模型融合用于高效视觉推理
Tekin, Selim Furkan, Xu, Yichang, Liu, Gaowen, Kompella, Ramana Rao, Loper, Margaret L., Liu, Ling
Abstract
With the growing number and diversity of Vision-Language Models (VLMs), many works explore language-based ensemble, collaboration, and routing techniques across multiple VLMs to improve multi-model reasoning. In contrast, we address the diverse model selection using both vision and language modalities. We introduce focal error diversity to capture complementary reasoning across VLMs and a CKA-based focal diversity metric (CKA-focal) to measure disagreement in their visual embeddings. On the constructed ensemble surface from a pool of candidate VLMs, we apply a Genetic Algorithm to effectively prune out those component VLMs that do not add value to the fusion performance. We identify the best combination for each task as well as fuse the outputs of each VLM in the model pool, and show that heterogeneous models can capture epistemic uncertainty dynamically and mitigate hallucinations. Our V3Fusion approach is capable of producing dual focal-diversity fused predictions with high performance for vision-language reasoning, even when there is no majority consensus or the majority of VLMs make incorrect predictions. Extensive experiments validate V3Fusion on four popular VLM benchmarks (A-OKVQA, MMMU, MMMU-Pro, and OCR-VQA). The results show that V3Fusion outperforms the best-performing VLM on MMMU by 8.09% and MMMU-Pro by 4.87% gain in accuracy. For generative tasks, V3Fusion outperforms Intern-VL2-8b and Qwen2.5-VL-7b, the top-2 VLM performers on both A-OKVQA and OCR-VQA. Our code and datasets are available at https://github.com/sftekin/v3fusion.
Chinese Translation
随着视觉语言模型(VLMs)数量和多样性的增加,许多研究探索了基于语言的集成、协作和路由技术,以改善多模型推理。相较之下,我们通过视觉和语言模态共同解决多样化的模型选择问题。我们引入了焦点错误多样性,以捕捉VLMs之间的互补推理,并提出了一种基于CKA的焦点多样性度量(CKA-focal),用于衡量它们视觉嵌入中的不一致性。在从候选VLM池构建的集成表面上,我们应用了遗传算法,有效地剔除那些对融合性能没有增益的组件VLM。我们为每个任务识别最佳组合,并融合模型池中每个VLM的输出,展示异构模型能够动态捕捉认知不确定性并减轻幻觉。我们的V3Fusion方法能够在没有多数共识或大多数VLM做出错误预测的情况下,生成具有高性能的双重焦点多样性融合预测。大量实验验证了V3Fusion在四个流行的VLM基准(A-OKVQA、MMMU、MMMU-Pro和OCR-VQA)上的有效性。结果表明,V3Fusion在MMMU上比表现最佳的VLM提高了8.09%的准确率,在MMMU-Pro上提高了4.87%。对于生成任务,V3Fusion在A-OKVQA和OCR-VQA上超越了Intern-VL2-8b和Qwen2.5-VL-7b这两款表现最好的VLM。我们的代码和数据集可在https://github.com/sftekin/v3fusion获取。
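The CKA-based part of the diversity metric builds on linear Centered Kernel Alignment, whose standard form is short. How CKA is combined with focal diversity in V3Fusion is not specified by the abstract, so this sketch shows only the underlying similarity measure between two models' embedding matrices:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices of shape
    (n_samples, dim). Returns a value in [0, 1]; 1 means the representations
    agree up to an orthogonal transform, low values indicate disagreement."""
    X = X - X.mean(axis=0)            # center each feature dimension
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den
```

Low pairwise CKA between two VLMs' visual embeddings signals complementary visual representations, which is the disagreement signal the metric is after.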
cs.CV / 41 / 2603.12680

G2HFNet: GeoGran-Aware Hierarchical Feature Fusion Network for Salient Object Detection in Optical Remote Sensing Images

G2HFNet:用于光学遥感图像显著性目标检测的地理粒度感知层次特征融合网络
Wan, Bin, Cong, Runmin, Zhou, Xiaofei, Fang, Hao, Lv, Chengtao, Kwong, Sam
Abstract
Remote sensing images captured from aerial perspectives often exhibit significant scale variations and complex backgrounds, posing challenges for salient object detection (SOD). Existing methods typically extract multi-level features at a single scale using uniform attention mechanisms, leading to suboptimal representations and incomplete detection results. To address these issues, we propose a GeoGran-Aware Hierarchical Feature Fusion Network (G2HFNet) that fully exploits geometric and granular cues in optical remote sensing images. Specifically, G2HFNet adopts Swin Transformer as the backbone to extract multi-level features and integrates three key modules: the multi-scale detail enhancement (MDE) module to handle object scale variations and enrich fine details, the dual-branch geo-gran complementary (DGC) module to jointly capture fine-grained details and positional information in mid-level features, and the deep semantic perception (DSP) module to refine high-level positional cues via self-attention. Additionally, a local-global guidance fusion (LGF) module is introduced to replace traditional convolutions for effective multi-level feature integration. Extensive experiments demonstrate that G2HFNet achieves high-quality saliency maps and significantly improves detection performance in challenging remote sensing scenarios.
Chinese Translation
从空中视角捕获的遥感图像通常表现出显著的尺度变化和复杂的背景,这给显著性目标检测(SOD)带来了挑战。现有方法通常在单一尺度下使用统一的注意力机制提取多层次特征,导致表示不够优化和检测结果不完整。为了解决这些问题,我们提出了一种地理粒度感知层次特征融合网络(G2HFNet),充分利用光学遥感图像中的几何和粒度线索。具体而言,G2HFNet采用Swin Transformer作为主干网络提取多层次特征,并集成了三个关键模块:多尺度细节增强(MDE)模块以处理目标尺度变化并丰富细节,双分支地理粒度互补(DGC)模块以联合捕捉中层特征中的细粒度细节和位置信息,以及深度语义感知(DSP)模块通过自注意力精炼高层位置信息。此外,引入了局部-全局引导融合(LGF)模块,以替代传统卷积实现有效的多层次特征融合。大量实验表明,G2HFNet能够生成高质量的显著性图,并显著提高在具有挑战性的遥感场景中的检测性能。
cs.CV / 42 / 2603.12685

RSONet: Region-guided Selective Optimization Network for RGB-T Salient Object Detection

RSONet:基于区域引导的选择性优化网络用于RGB-T显著目标检测
Wan, Bin, Cong, Runmin, Zhou, Xiaofei, Fang, Hao, Lv, Chengtao, Kwong, Sam
Abstract
This paper focuses on the inconsistency in salient regions between RGB and thermal images. To address this issue, we propose the Region-guided Selective Optimization Network for RGB-T Salient Object Detection, which consists of a region guidance stage and a saliency generation stage. In the region guidance stage, three parallel branches with the same encoder-decoder structure, equipped with the context interaction (CI) module and the spatial-aware fusion (SF) module, are designed to generate guidance maps that are leveraged to calculate similarity scores. Then, in the saliency generation stage, the selective optimization (SO) module fuses RGB and thermal features based on the previously obtained similarity values to mitigate the impact of the inconsistent distribution of salient targets between the two modalities. After that, to generate high-quality detection results, the dense detail enhancement (DDE) module, which adopts multiple dense connections and visual state space blocks, is applied to low-level features to optimize the detail information. In addition, the mutual interaction semantic (MIS) module is placed on the high-level features to mine location cues through a mutual fusion strategy. We conduct extensive experiments on the RGB-T dataset, and the results demonstrate that the proposed RSONet achieves competitive performance against 27 state-of-the-art SOD methods.
Chinese Translation
本文关注RGB图像与热成像图像之间显著区域的不一致性。为了解决这一问题,我们提出了基于区域引导的选择性优化网络(Region-guided Selective Optimization Network,RSONet)用于RGB-T显著目标检测,该网络由区域引导阶段和显著性生成阶段组成。在区域引导阶段,设计了三个具有相同编码-解码结构的并行分支,配备上下文交互(Context Interaction,CI)模块和空间感知融合(Spatial-aware Fusion,SF)模块,以生成引导图,这些引导图用于计算相似性分数。然后,在显著性生成阶段,选择性优化(Selective Optimization,SO)模块根据先前获得的相似性值融合RGB和热成像特征,以减轻两种模态之间显著目标分布不一致的影响。之后,为了生成高质量的检测结果,采用多重密集连接和视觉状态空间块的密集细节增强(Dense Detail Enhancement,DDE)模块被应用于低级特征,以优化细节信息。此外,互交语义(Mutual Interaction Semantic,MIS)模块被置于高级特征中,通过互融策略挖掘位置信息。我们在RGB-T数据集上进行了广泛的实验,结果表明,所提出的RSONet在27种最先进的显著目标检测方法中表现出竞争力。
cs.CV / 43 / 2603.12688

STRAP-ViT: Segregated Tokens with Randomized Transformations for Defense against Adversarial Patches in ViTs

STRAP-ViT:针对视觉变换器中对抗性补丁的防御的随机化变换分离标记
Chattopadhyay, Nandish, Goyal, Anadi, Karfa, Chandan, Chattopadhyay, Anupam
Abstract
Adversarial patches are physically realizable localized noise, which are able to hijack Vision Transformer (ViT) self-attention, pulling focus toward a small, high-contrast region and corrupting the class token to force confident misclassifications. In this paper, we claim that the tokens corresponding to the areas of the image that contain the adversarial noise have different statistical properties compared to the tokens that do not overlap with the adversarial perturbations. We use this insight to propose a mechanism, called STRAP-ViT, which uses Jensen-Shannon Divergence as a metric for segregating tokens that behave as anomalies in the Detection Phase, and then applies randomized composite transformations on them during the Mitigation Phase to make the adversarial noise ineffective. The minimum number of tokens to transform is a hyper-parameter for the defense mechanism and is chosen such that at least 50% of the patch is covered by the transformed tokens. STRAP-ViT fits as a non-trainable plug-and-play block within ViT architectures, for inference purposes only, with a minimal computational cost, and does not require any additional training cost/effort. STRAP-ViT has been tested on multiple pre-trained vision transformer architectures (ViT-base-16 and DinoV2) and datasets (ImageNet and CalTech-101), across multiple adversarial attacks (Adversarial Patch, LAVAN, GDPA and RP2), and found to provide excellent robust accuracies lying within a 2-3% range of the clean baselines, and to outperform the state-of-the-art.
Chinese Translation
对抗性补丁是可物理实现的局部噪声,能够劫持视觉变换器(ViT)的自注意力,将焦点拉向一个小的高对比度区域,并破坏类别标记以强制产生自信的错误分类。本文声称,与不重叠对抗性扰动的标记相比,图像中包含对抗性噪声区域的标记具有不同的统计特性。我们利用这一见解提出了一种机制,称为STRAP-ViT,该机制使用詹森-香农散度(Jensen-Shannon Divergence)作为度量,分离在检测阶段表现为异常的标记,并在缓解阶段对其应用随机化复合变换,使对抗性噪声失效。需要转换的最小标记数量是防御机制的超参数,选择时确保至少50%的补丁被转换后的标记覆盖。STRAP-ViT作为一个不可训练的即插即用模块,适用于ViT架构,仅用于推理,具有最低的计算成本,并且不需要额外的训练成本/努力。STRAP-ViT已在多个预训练的视觉变换器架构(ViT-base-16和DinoV2)和数据集(ImageNet和CalTech-101)上进行了测试,针对多种对抗性攻击(对抗性补丁、LAVAN、GDPA和RP2),发现其提供的鲁棒准确率在干净基线的2-3%范围内,并且超越了现有的最先进技术。
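The Jensen-Shannon-divergence token segregation in the Detection Phase can be sketched as scoring each token's distribution against a reference and flagging the most divergent ones. The choice of the per-image mean as the reference, and the distribution shapes below, are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def flag_anomalous_tokens(token_dists, k):
    """Score each token distribution (rows of token_dists) by its JS divergence
    from the per-image mean distribution and return the k most divergent token
    indices -- the candidates for randomized transformation."""
    mean_dist = token_dists.mean(axis=0)
    scores = np.array([js_divergence(t, mean_dist) for t in token_dists])
    return np.argsort(-scores)[:k]
```

Tokens covering a high-contrast patch concentrate probability mass and stand out against the bulk of the image's tokens, which is the anomaly signal this scoring exploits.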
cs.CV / 44 / 2603.12690

CM-Bench: A Comprehensive Cross-Modal Feature Matching Benchmark Bridging Visible and Infrared Images

CM-Bench:一个综合性的跨模态特征匹配基准,连接可见光与红外图像
Sun, Liangzheng, He, Mengfan, Shao, Xingyu, Li, Binbin, Yan, Zhiqiang, Li, Chunyu, Meng, Ziyang, Xing, Fei
Abstract
Infrared-visible (IR-VIS) feature matching plays an essential role in cross-modality visual localization, navigation and perception. Along with the rapid development of deep learning techniques, a number of representative image matching methods have been proposed. However, cross-modal feature matching is still a challenging task due to the significant appearance difference. A significant gap for cross-modal feature matching research lies in the absence of standardized benchmarks and metrics for evaluations. In this paper, we introduce a comprehensive cross-modal feature matching benchmark, CM-Bench, which encompasses 30 feature matching algorithms across diverse cross-modal datasets. Specifically, state-of-the-art traditional and deep learning-based methods are first summarized and categorized into sparse, semi-dense, and dense methods. These methods are evaluated by different tasks including homography estimation, relative pose estimation, and feature-matching-based geo-localization. In addition, we introduce a classification-network-based adaptive preprocessing front-end that automatically selects suitable enhancement strategies before matching. We also present a novel infrared-satellite cross-modal dataset with manually annotated ground-truth correspondences for practical geo-localization evaluation. The dataset and resource will be available at: https://github.com/SLZ98/CM-Bench.
Chinese Translation
红外-可见光(IR-VIS)特征匹配在跨模态视觉定位、导航和感知中扮演着重要角色。随着深度学习技术的快速发展,已经提出了多种代表性的图像匹配方法。然而,由于显著的外观差异,跨模态特征匹配仍然是一项具有挑战性的任务。跨模态特征匹配研究的一个显著缺口在于缺乏标准化的基准和评估指标。在本文中,我们介绍了一个综合性的跨模态特征匹配基准,CM-Bench,涵盖了30种特征匹配算法,涉及多种跨模态数据集。具体而言,首先总结并分类了最先进的传统方法和基于深度学习的方法,分为稀疏、半稠密和稠密方法。这些方法通过不同的任务进行评估,包括单应性估计、相对姿态估计和基于特征匹配的地理定位。此外,我们引入了一种基于分类网络的自适应预处理前端,能够在匹配之前自动选择合适的增强策略。我们还提出了一个新颖的红外-卫星跨模态数据集,包含手动标注的真实对应关系,以便于实际的地理定位评估。该数据集和资源将可在以下网址获取:https://github.com/SLZ98/CM-Bench。
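Homography estimation, one of the benchmark's evaluation tasks, is commonly scored by how far an estimated homography displaces the image corners relative to the ground-truth one. A minimal sketch of that metric follows; the exact scoring used by CM-Bench may differ:

```python
import numpy as np

def corner_error(H_est, H_gt, w, h):
    """Mean Euclidean distance between the four image corners warped by the
    estimated homography and by the ground-truth homography."""
    corners = np.array([[0, 0], [w, 0], [w, h], [0, h]], float)
    pts = np.hstack([corners, np.ones((4, 1))])   # homogeneous coordinates

    def warp(H):
        p = pts @ H.T
        return p[:, :2] / p[:, 2:3]               # back to inhomogeneous

    return np.linalg.norm(warp(H_est) - warp(H_gt), axis=1).mean()
```

A matcher's homographies (fit from its correspondences, e.g. with RANSAC) can then be compared across modalities by thresholding this error.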
cs.CV / 45 / 2603.12693

HSEmotion Team at ABAW-10 Competition: Facial Expression Recognition, Valence-Arousal Estimation, Action Unit Detection and Fine-Grained Violence Classification

HSEmotion团队在ABAW-10竞赛中的表现:面部表情识别、情感价效估计、动作单元检测和细粒度暴力分类
Savchenko, Andrey V., Tsypliakova, Kseniia
Abstract
This article presents our results for the 10th Affective Behavior Analysis in-the-Wild (ABAW) competition. For frame-wise facial emotion understanding tasks (frame-wise facial expression recognition, valence-arousal estimation, action unit detection), we propose a fast approach based on facial embedding extraction with pre-trained EfficientNet-based emotion recognition models. If the latter model's confidence exceeds a threshold, its prediction is used. Otherwise, we feed embeddings into a simple multi-layered perceptron trained on the AffWild2 dataset. Estimated class-level scores are smoothed in a sliding window of fixed size to mitigate noise in frame-wise predictions. For the fine-grained violence detection task, we examine several pre-trained architectures for frame embeddings and their aggregation for video classification. Experimental results on four tasks from the ABAW challenge demonstrate that our approach significantly improves validation metrics over existing baselines.
Chinese Translation
本文展示了我们在第十届情感行为分析野外竞赛(ABAW)中的结果。对于逐帧面部情感理解任务(逐帧面部表情识别、情感价效估计、动作单元检测),我们提出了一种基于预训练的EfficientNet模型的面部嵌入提取的快速方法。如果该模型的置信度超过阈值,则使用其预测结果。否则,我们将嵌入输入到一个在AffWild2数据集上训练的简单多层感知器中。估计的类别级得分在固定大小的滑动窗口中进行平滑,以减轻逐帧预测中的噪声。对于细粒度暴力检测任务,我们考察了几种预训练架构用于帧嵌入及其在视频分类中的聚合。ABAW挑战中四个任务的实验结果表明,我们的方法在验证指标上显著优于现有基线。
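The confidence-thresholded fallback and the fixed-size sliding-window smoothing described above can be sketched directly. The threshold and window size below are illustrative, not the competition settings:

```python
import numpy as np

def choose_prediction(backbone_probs, mlp_probs, thr=0.6):
    """Per frame, keep the pre-trained model's class probabilities when its max
    probability exceeds the threshold; otherwise fall back to the MLP head."""
    confident = backbone_probs.max(axis=1) > thr
    return np.where(confident[:, None], backbone_probs, mlp_probs)

def smooth_scores(frame_scores, win=5):
    """Average class-level scores (T, C) in a centered sliding window of fixed
    size to mitigate noise in frame-wise predictions; edges are padded."""
    pad = win // 2
    padded = np.pad(frame_scores, ((pad, pad), (0, 0)), mode='edge')
    kernel = np.ones(win) / win
    return np.stack([np.convolve(padded[:, c], kernel, mode='valid')
                     for c in range(frame_scores.shape[1])], axis=1)
```

Applying `smooth_scores` after `choose_prediction` reproduces the two-step pipeline order the abstract describes.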
cs.CV / 46 / 2603.12703

VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos

VCBench:用于长视频中时空状态维护的流式计数基准
Liu, Pengyiang, Shi, Zhongyue, Hao, Hongye, Fu, Qi, Bi, Xueting, Zhang, Siwei, Hu, Xiaoyang, Wang, Zitian, Huang, Linjiang, Liu, Si
Abstract
Video understanding requires models to continuously track and update world state during playback. While existing benchmarks have advanced video understanding evaluation across multiple dimensions, the observation of how models maintain world state remains insufficient. We propose VCBench, a streaming counting benchmark that repositions counting as a minimal probe for diagnosing world state maintenance capability. We decompose this capability into object counting (tracking currently visible objects vs. tracking cumulative unique identities) and event counting (detecting instantaneous actions vs. tracking complete activity cycles), forming 8 fine-grained subcategories. VCBench contains 406 videos with frame-by-frame annotations of 10,071 event occurrence moments and object state change moments, generating 1,000 streaming QA pairs with 4,576 query points along timelines. By observing state maintenance trajectories through streaming multi-point queries, we design three complementary metrics to diagnose numerical precision, trajectory consistency, and temporal awareness. Evaluation on mainstream video-language models shows that current models still exhibit significant deficiencies in spatial-temporal state maintenance, particularly struggling with tasks like periodic event counting. VCBench provides a diagnostic framework for measuring and improving state maintenance in video understanding systems.
Chinese Translation
视频理解要求模型在播放过程中持续跟踪和更新世界状态。尽管现有基准在多个维度上推动了视频理解评估的发展,但对模型如何维护世界状态的观察仍然不足。我们提出了VCBench,一个将计数重新定位为诊断世界状态维护能力的最小探针的流式计数基准。我们将这种能力分解为对象计数(跟踪当前可见对象与跟踪累计唯一身份)和事件计数(检测瞬时动作与跟踪完整活动周期),形成8个细粒度子类别。VCBench包含406个视频,逐帧注释了10,071个事件发生时刻和对象状态变化时刻,生成了1,000个流式问答对,沿时间线设置了4,576个查询点。通过观察流式多点查询的状态维护轨迹,我们设计了三种互补指标来诊断数值精度、轨迹一致性和时间意识。对主流视频-语言模型的评估表明,当前模型在时空状态维护方面仍存在显著不足,特别是在周期性事件计数等任务上表现不佳。VCBench提供了一个诊断框架,用于测量和改善视频理解系统中的状态维护。
cs.CV / 47 / 2603.12708

HFP-SAM: Hierarchical Frequency Prompted SAM for Efficient Marine Animal Segmentation

HFP-SAM:用于高效海洋动物分割的分层频率提示SAM
Zhang, Pingping, Yan, Tianyu, Wang, Yuhao, Liu, Yang, Tang, Tongdan, Ma, Yili, Lv, Long, Tian, Feng, Sun, Weibing, Lu, Huchuan
Abstract
Marine Animal Segmentation (MAS) aims at identifying and segmenting marine animals from complex marine environments. Most previous deep learning-based MAS methods struggle with the long-distance modeling issue. Recently, the Segment Anything Model (SAM) has gained popularity in general image segmentation. However, it lacks the ability to perceive fine-grained details and frequency information. To this end, we propose a novel learning framework, named Hierarchical Frequency Prompted SAM (HFP-SAM), for high-performance MAS. First, we design a Frequency Guided Adapter (FGA) to efficiently inject marine scene information into the frozen SAM backbone through frequency domain prior masks. Additionally, we introduce a Frequency-aware Point Selection (FPS) to generate highlighted regions through frequency analysis. These regions are combined with the coarse predictions of SAM to generate point prompts and are integrated into SAM's decoder for fine predictions. Finally, to obtain comprehensive segmentation masks, we introduce a Full-View Mamba (FVM) to efficiently extract spatial and channel contextual information with linear computational complexity. Extensive experiments on four public datasets demonstrate the superior performance of our approach. The source code is publicly available at https://github.com/Drchip61/TIP-HFP-SAM.
Chinese Translation
海洋动物分割(MAS)旨在从复杂的海洋环境中识别和分割海洋动物。大多数基于深度学习的MAS方法在长距离建模问题上表现不佳。最近,Segment Anything Model(SAM)在一般图像分割中获得了广泛关注。然而,它缺乏对细粒度细节和频率信息的感知。为此,我们提出了一种新颖的学习框架,称为分层频率提示SAM(HFP-SAM),以实现高性能的MAS。首先,我们设计了一种频率引导适配器(FGA),通过频率域先验掩膜高效地将海洋场景信息注入冻结的SAM主干网络。此外,我们引入了一种频率感知点选择(FPS),通过频率分析生成突出区域。这些区域与SAM的粗略预测相结合,生成点提示并集成到SAM的解码器中以进行精细预测。最后,为了获得全面的分割掩膜,我们引入了一种全视图曼巴(FVM),以线性计算复杂度高效提取空间和通道上下文信息。在四个公共数据集上的广泛实验表明我们的方法具有优越的性能。源代码可在 https://github.com/Drchip61/TIP-HFP-SAM 获取。
cs.CV / 48 / 2603.12711

Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval

具有双重先验的文本-阶段协同网络用于无监督跨域图像检索
Yang, Jing, Xue, Hui, Zhu, Shipeng, Fang, Pengfei
Abstract
This paper studies unsupervised cross-domain image retrieval (UCDIR), which aims to retrieve images of the same category across different domains without relying on labeled data. Existing methods typically utilize pseudo-labels, derived from clustering algorithms, as supervisory signals for intra-domain representation learning and cross-domain feature alignment. However, these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. Moreover, the alignment process frequently overlooks the entanglement between domain-specific and semantic information, leading to semantic degradation in the learned representations and ultimately impairing retrieval performance. This paper addresses the limitations by proposing a Text-Phase Synergy Network with Dual Priors (TPSNet). Specifically, we first employ CLIP to generate a set of class-specific prompts per domain, termed as domain prompt, serving as a text prior that offers more precise semantic supervision. In parallel, we further introduce a phase prior, represented by domain-invariant phase features, which is integrated into the original image representations to bridge the domain distribution gaps while preserving semantic integrity. Leveraging the synergy of these dual priors, TPSNet significantly outperforms state-of-the-art methods on UCDIR benchmarks.
Chinese Translation
本文研究无监督跨域图像检索(UCDIR),旨在在不同领域中检索同一类别的图像,而无需依赖标记数据。现有方法通常利用从聚类算法中得出的伪标签作为内部领域表示学习和跨领域特征对齐的监督信号。然而,这些离散的伪标签往往无法提供准确和全面的语义指导。此外,对齐过程常常忽视领域特定信息与语义信息之间的纠缠,导致学习到的表示出现语义退化,最终影响检索性能。本文通过提出具有双重先验的文本-阶段协同网络(Text-Phase Synergy Network, TPSNet)来解决这些局限性。具体而言,我们首先使用 CLIP 生成每个领域的一组类别特定提示,称为领域提示,作为文本先验,提供更精确的语义监督。同时,我们进一步引入一个阶段先验,由领域不变的阶段特征表示,集成到原始图像表示中,以弥合领域分布差距,同时保持语义完整性。利用这两个先验的协同作用,TPSNet 在 UCDIR 基准测试中显著超越了现有最先进的方法。
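The domain-invariant phase features behind the phase prior come from the Fourier phase spectrum, which carries image structure while the amplitude spectrum carries much of the domain-specific appearance. A minimal sketch of extracting a phase-only representation follows; how TPSNet integrates it into the image representation is not detailed in the abstract:

```python
import numpy as np

def phase_features(img):
    """Return the 2D phase spectrum of an image and a phase-only reconstruction
    (unit amplitude), a common proxy for domain-invariant structure."""
    spec = np.fft.fft2(img)
    phase = np.angle(spec)
    # Discard the amplitude by reconstructing from unit-magnitude spectrum.
    phase_only = np.real(np.fft.ifft2(np.exp(1j * phase)))
    return phase, phase_only
```

Because global amplitude changes (e.g. brightness scaling between domains) leave the phase untouched, the phase-only representation is identical for `img` and `2 * img`.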
cs.CV / 49 / 2603.12716

UNIStainNet: Foundation-Model-Guided Virtual Staining of H&E to IHC

UNIStainNet:基础模型引导的H&E到IHC的虚拟染色
Saurav, Jillur Rahman, Pham, Thuong Le Hoai, Mukherjee, Pritam, Yi, Paul, Orr, Brent A., Luber, Jacob M.
Abstract
Virtual immunohistochemistry (IHC) staining from hematoxylin and eosin (H&E) images can accelerate diagnostics by providing preliminary molecular insight directly from routine sections, reducing the need for repeat sectioning when tissue is limited. Existing methods improve realism through contrastive objectives, prototype matching, or domain alignment, yet the generator itself receives no direct guidance from pathology foundation models. We present UNIStainNet, a SPADE-UNet conditioned on dense spatial tokens from a frozen pathology foundation model (UNI), providing tissue-level semantic guidance for stain translation. A misalignment-aware loss suite preserves stain quantification accuracy, and learned stain embeddings enable a single model to serve multiple IHC markers simultaneously. On MIST, UNIStainNet achieves state-of-the-art distributional metrics on all four stains (HER2, Ki67, ER, PR) from a single unified model, where prior methods typically train separate per-stain models. On BCI, it also achieves the best distributional metrics. A tissue-type stratified failure analysis reveals that remaining errors are systematic, concentrating in non-tumor tissue. Code is available at https://github.com/facevoid/UNIStainNet.
Chinese Translation
从苏木精-伊红(H&E)图像进行虚拟免疫组化(IHC)染色可以通过直接从常规切片中提供初步的分子见解,加速诊断,减少在组织有限时重复切片的需求。现有方法通过对比目标、原型匹配或领域对齐来提高真实感,但生成器本身并未直接受到病理基础模型的指导。我们提出了UNIStainNet,这是一种基于来自冻结病理基础模型(UNI)的密集空间标记的条件SPADE-UNet,为染色转换提供组织级语义指导。一套对齐感知的损失函数保持了染色定量的准确性,学习到的染色嵌入使单个模型能够同时服务多个IHC标记。在MIST数据集上,UNIStainNet在四种染色(HER2、Ki67、ER、PR)的所有分布指标上均达到了最先进的水平,而之前的方法通常需要为每种染色训练单独的模型。在BCI数据集上,它也取得了最佳的分布指标。组织类型分层的失败分析显示,剩余的错误是系统性的,集中在非肿瘤组织上。代码可在 https://github.com/facevoid/UNIStainNet 获取。
cs.CV / 50 / 2603.12718

The COTe score: A decomposable framework for evaluating Document Layout Analysis models

COTe评分:评估文档布局分析模型的可分解框架
Bourne, Jonathan, Simbeye, Mwiza, Govia, Ishtar
Abstract
Document Layout Analysis (DLA) is the process by which a page is parsed into meaningful elements, often using machine learning models. Typically, the quality of a model is judged using general object detection metrics such as IoU, F1 or mAP. However, these metrics are designed for images that are 2D projections of 3D space, not for the natively 2D imagery of printed media. This discrepancy can result in misleading or uninformative interpretation of model performance by the metrics. To encourage more robust, comparable, and nuanced DLA, we introduce the Structural Semantic Unit (SSU), a relational labelling approach that shifts the focus from the physical structure of the content to its semantic structure; and the Coverage, Overlap, Trespass, and Excess (COTe) score, a decomposable metric for measuring page parsing quality. We demonstrate the value of these methods through case studies and by evaluating 5 common DLA models on 3 DLA datasets. We show that the COTe score is more informative than traditional metrics and reveals distinct failure modes across models, such as breaching semantic boundaries or repeatedly parsing the same region. In addition, the COTe score reduces the interpretation-performance gap by up to 76% relative to the F1. Notably, we find that the COTe's granularity robustness largely holds even without explicit SSU labelling, lowering the barriers to entry for using the system. Finally, we release an SSU labelled dataset and a Python library for applying COTe in DLA projects.
Chinese Translation
文档布局分析(DLA)是将页面解析为有意义元素的过程,通常使用机器学习模型。通常,模型的质量通过一般的目标检测指标来评估,如IoU、F1或mAP。然而,这些指标是为3D空间的2D图像设计的,而不是为印刷媒体的原生2D图像设计的。这种差异可能导致对模型性能的误导性或无信息解释。为了促进更稳健、可比较和细致的DLA,我们引入了结构语义单元(SSU),这是一种关系标注方法,转移了对内容的物理结构到语义结构的关注;以及覆盖、重叠、侵入和过剩(COTe)评分,这是一种可分解的度量,用于衡量页面解析质量。我们通过案例研究和在3个DLA数据集上评估5个常见DLA模型来展示这些方法的价值。我们表明,COTe评分比传统指标更具信息性,并揭示了模型之间的不同失败模式,例如突破语义边界或重复解析同一区域。此外,COTe评分相对于F1减少了多达76%的解释-性能差距。值得注意的是,我们发现COTe的粒度鲁棒性在没有明确的SSU标注的情况下仍然保持较好,降低了使用该系统的入门门槛。最后,我们发布了一个SSU标注的数据集和一个用于在DLA项目中应用COTe的Python库。
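The abstract names the four COTe components but does not give their formulas. The sketch below is one plausible reading of the component names over boolean pixel masks, intended only to convey why a decomposable score separates failure modes that a single IoU conflates; these are NOT the paper's definitions:

```python
import numpy as np

def cote_components(pred, target, other_targets, other_preds):
    """Illustrative per-prediction decomposition over boolean masks:
    coverage -- fraction of the target region the prediction recovers;
    overlap  -- fraction of the prediction shared with other predictions
                (repeatedly parsing the same region);
    trespass -- fraction of the prediction inside OTHER semantic units
                (breaching semantic boundaries);
    excess   -- fraction of the prediction outside any labelled unit."""
    area = lambda m: max(int(m.sum()), 1)   # guard against empty masks
    coverage = (pred & target).sum() / area(target)
    overlap = (pred & other_preds).sum() / area(pred)
    trespass = (pred & other_targets).sum() / area(pred)
    excess = (pred & ~(target | other_targets)).sum() / area(pred)
    return coverage, overlap, trespass, excess
```

Two predictions with identical IoU can score very differently here (one trespassing into a neighbouring unit, the other spilling into empty margin), which is the kind of failure-mode separation the abstract reports.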
cs.CV / 51 / 2603.12719

IGASA: Integrated Geometry-Aware and Skip-Attention Modules for Enhanced Point Cloud Registration

IGASA:集成几何感知和跳跃注意力模块以增强点云配准
Zhang, Dongxu, Zhu, Jihua, Li, Shiqi, Yan, Wenbiao, Xu, Haoran, Fan, Peilin, Lu, Huimin
Abstract
Point cloud registration (PCR) is a fundamental task in 3D vision and provides essential support for applications such as autonomous driving, robotics, and environmental modeling. Despite its widespread use, existing methods often fail when facing real-world challenges like heavy noise, significant occlusions, and large-scale transformations. These limitations frequently result in compromised registration accuracy and insufficient robustness in complex environments. In this paper, we propose IGASA, a novel registration framework constructed upon a Hierarchical Pyramid Architecture (HPA) designed for robust multi-scale feature extraction and fusion. The framework integrates two pivotal components: the Hierarchical Cross-Layer Attention (HCLA) module and the Iterative Geometry-Aware Refinement (IGAR) module. The HCLA module utilizes skip attention mechanisms to align multi-resolution features and enhance local geometric consistency. Meanwhile, the IGAR module is designed for the fine matching phase, leveraging reliable correspondences established during coarse matching. This synergistic integration within the architecture allows IGASA to adapt effectively to diverse point cloud structures and intricate transformations. We evaluate the performance of IGASA on four widely recognized benchmark datasets: 3DMatch, 3DLoMatch, KITTI, and nuScenes. Our extensive experiments consistently demonstrate that IGASA significantly surpasses state-of-the-art methods and achieves notable improvements in registration accuracy. This work provides a robust foundation for advancing point cloud registration techniques while offering valuable insights for practical 3D vision applications. The code for IGASA is available at https://github.com/DongXu-Zhang/IGASA.
Chinese Translation
点云配准(PCR)是3D视觉中的一项基础任务,为自动驾驶、机器人技术和环境建模等应用提供了重要支持。尽管其应用广泛,但现有方法在面对现实世界中的挑战,如强噪声、显著遮挡和大规模变换时,常常表现不佳。这些局限性通常导致配准精度下降和在复杂环境中的鲁棒性不足。本文提出了IGASA,作为一种新颖的配准框架,基于层次金字塔架构(HPA)构建,旨在实现鲁棒的多尺度特征提取和融合。该框架集成了两个关键组件,即层次跨层注意力(HCLA)模块和迭代几何感知细化(IGAR)模块。HCLA模块利用跳跃注意力机制对齐多分辨率特征,并增强局部几何一致性。同时,IGAR模块旨在通过利用在粗配准阶段建立的可靠对应关系,进行精细匹配阶段的处理。这种架构内的协同集成使IGASA能够有效适应多样的点云结构和复杂变换。我们在四个广泛认可的基准数据集上评估了IGASA的性能,包括3D(Lo)Match、KITTI和nuScenes。我们的广泛实验一致表明,IGASA显著超越了最先进的方法,并在配准精度上取得了显著提升。这项工作为推进点云配准技术提供了坚实的基础,同时为实际的3D视觉应用提供了宝贵的见解。IGASA的代码可在 https://github.com/DongXu-Zhang/IGASA 获取。
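The abstract names skip attention for aligning multi-resolution features but gives no computation. As a minimal illustrative sketch (not the paper's implementation), cross-layer attention can be read as scaled dot-product attention where coarse-level queries attend over fine-level keys/values; all shapes and names here are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def skip_attention(coarse, fine):
    """Cross-layer ("skip") attention: coarse-level queries attend over
    fine-level keys/values, pulling fine geometric detail upward."""
    d = coarse.shape[-1]
    weights = softmax(coarse @ fine.T / np.sqrt(d), axis=-1)  # (Nc, Nf)
    return weights @ fine                                     # (Nc, d)

rng = np.random.default_rng(0)
coarse = rng.normal(size=(16, 32))  # 16 superpoint features, 32-dim
fine = rng.normal(size=(128, 32))   # 128 dense point features
out = skip_attention(coarse, fine)
print(out.shape)  # (16, 32)
```

Each coarse feature becomes a convex combination of fine features, which is one way to enforce the local geometric consistency the abstract describes.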
cs.CV / 52 / 2603.12721

CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration

CMHANet:一种用于点云配准的跨模态混合注意力网络
Zhang, Dongxu, Wang, Yingsen, Sun, Yiding, Xu, Haoran, Fan, Peilin, Zhu, Jihua
Abstract
Robust point cloud registration is a fundamental task in 3D computer vision and geometric deep learning, essential for applications such as large-scale 3D reconstruction, augmented reality, and scene understanding. However, the performance of established learning-based methods often degrades in complex, real-world scenarios characterized by incomplete data, sensor noise, and low-overlap regions. To address these limitations, we propose CMHANet, a novel Cross-Modal Hybrid Attention Network. Our method integrates the fusion of rich contextual information from 2D images with the geometric detail of 3D point clouds, yielding a comprehensive and resilient feature representation. Furthermore, we introduce an innovative optimization function based on contrastive learning, which enforces geometric consistency and significantly improves the model's robustness to noise and partial observations. We evaluated CMHANet on the 3DMatch and the challenging 3DLoMatch datasets. Additionally, zero-shot evaluations on the TUM RGB-D SLAM dataset verify the model's generalization capability to unseen domains. The experimental results demonstrate that our method achieves substantial improvements in both registration accuracy and overall robustness, outperforming current techniques. We also release our code at https://github.com/DongXu-Zhang/CMHANet.
Chinese Translation
鲁棒的点云配准是三维计算机视觉和几何深度学习中的一项基础任务,对于大规模三维重建、增强现实和场景理解等应用至关重要。然而,现有的基于学习的方法在复杂的现实场景中,尤其是数据不完整、传感器噪声和重叠区域较少的情况下,性能往往会下降。为了解决这些局限性,我们提出了CMHANet,一种新颖的跨模态混合注意力网络。我们的方法将来自二维图像的丰富上下文信息与三维点云的几何细节相融合,从而生成全面且具有韧性的特征表示。此外,我们引入了一种基于对比学习的创新优化函数,该函数强制执行几何一致性,并显著提高模型对噪声和部分观测的鲁棒性。我们在3DMatch和具有挑战性的3DLoMatch数据集上评估了CMHANet。此外,在TUM RGB-D SLAM数据集上的零样本评估验证了模型对未见领域的泛化能力。实验结果表明,我们的方法在配准精度和整体鲁棒性方面均取得了显著提升,超越了当前的技术。我们还在此发布我们的代码,链接为:https://github.com/DongXu-Zhang/CMHANet。
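The contrastive optimization function is only named in the abstract. A common choice for enforcing consistency between matched features is an InfoNCE-style loss, sketched below under the assumption that same-index rows are positive pairs; this is an illustration of the general technique, not CMHANet's actual objective:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE: each anchor's positive is the same-index row of
    `positives`; all other rows act as in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature            # (N, N) scaled cosine sims
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_prob)))  # cross-entropy on diagonal

rng = np.random.default_rng(1)
feat = rng.normal(size=(8, 16))
# Matched correspondences (small perturbation) should score a lower loss
# than random, unmatched features.
loss_matched = info_nce(feat, feat + 0.01 * rng.normal(size=feat.shape))
loss_random = info_nce(feat, rng.normal(size=feat.shape))
print(loss_matched < loss_random)  # True
```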
cs.CV / 53 / 2603.12722

CognitionCapturerPro: Towards High-Fidelity Visual Decoding from EEG/MEG via Multi-modal Information and Asymmetric Alignment

CognitionCapturerPro:通过多模态信息和非对称对齐实现高保真视觉解码
Zhang, Kaifan, He, Lihuo, Ke, Junjie, Ji, Yuqi, Wu, Lukun, Wang, Lizi, Gao, Xinbo
Abstract
Visual stimuli reconstruction from EEG remains challenging due to fidelity loss and representation shift. We propose CognitionCapturerPro, an enhanced framework that integrates EEG with multi-modal priors (images, text, depth, and edges) via collaborative training. Our core contributions include an uncertainty-weighted similarity scoring mechanism to quantify modality-specific fidelity and a fusion encoder for integrating shared representations. By employing a simplified alignment module and a pre-trained diffusion model, our method significantly outperforms the original CognitionCapturer on the THINGS-EEG dataset, improving Top-1 and Top-5 retrieval accuracy by 25.9% and 10.6%, respectively. Code is available at: https://github.com/XiaoZhangYES/CognitionCapturerPro.
Chinese Translation
从脑电图(EEG)重建视觉刺激仍然具有挑战性,主要由于保真度损失和表示转移。我们提出了CognitionCapturerPro,这是一个增强框架,通过协同训练将EEG与多模态先验(图像、文本、深度和边缘)集成在一起。我们的核心贡献包括一种不确定性加权相似性评分机制,用于量化特定模态的保真度,以及一个用于整合共享表示的融合编码器。通过采用简化的对齐模块和预训练的扩散模型,我们的方法在THINGS-EEG数据集上显著优于原始的CognitionCapturer,Top-1和Top-5检索准确率分别提高了25.9%和10.6%。代码可在以下链接获取:https://github.com/XiaoZhangYES/CognitionCapturerPro。
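The uncertainty-weighted similarity scoring mechanism is only named in the abstract. One plausible reading, sketched here with purely illustrative numbers, is inverse-variance weighting of per-modality similarity scores, so the most reliable modality prior dominates the fused score:

```python
import numpy as np

def fused_similarity(sims, uncertainties):
    """Combine per-modality similarity scores with inverse-variance
    weights, so low-uncertainty (high-fidelity) modalities dominate."""
    w = 1.0 / np.square(np.asarray(uncertainties, dtype=float))
    w /= w.sum()
    return float(np.dot(w, sims))

# Similarity of an EEG embedding to one candidate image under four
# modality priors (image, text, depth, edge), with made-up uncertainties.
sims = np.array([0.8, 0.3, 0.5, 0.4])
sigma = np.array([0.1, 1.0, 0.5, 0.5])  # image prior is most reliable here
print(round(fused_similarity(sims, sigma), 2))  # 0.77
```

With these toy numbers the fused score (≈0.77) sits far above the unweighted mean (0.5), reflecting the trust placed in the low-uncertainty image modality.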
cs.CV / 54 / 2603.12743

MoKus: Leveraging Cross-Modal Knowledge Transfer for Knowledge-Aware Concept Customization

MoKus:利用跨模态知识转移进行知识感知的概念定制
Zhu, Chenyang, Li, Hongxiang, Li, Xiu, Chen, Long
Abstract
Concept customization typically binds rare tokens to a target concept. Unfortunately, these approaches often suffer from unstable performance as the pretraining data seldom contains these rare tokens. Meanwhile, these rare tokens fail to convey the inherent knowledge of the target concept. Consequently, we introduce Knowledge-aware Concept Customization, a novel task aiming at binding diverse textual knowledge to target visual concepts. This task requires the model to identify the knowledge within the text prompt to perform high-fidelity customized generation. Meanwhile, the model should efficiently bind all the textual knowledge to the target concept. Therefore, we propose MoKus, a novel framework for knowledge-aware concept customization. Our framework relies on a key observation: cross-modal knowledge transfer, where modifying knowledge within the text modality naturally transfers to the visual modality during generation. Inspired by this observation, MoKus contains two stages: (1) In visual concept learning, we first learn the anchor representation to store the visual information of the target concept. (2) In textual knowledge updating, we update the answer for the knowledge queries to the anchor representation, enabling high-fidelity customized generation. To further comprehensively evaluate our proposed MoKus on the new task, we introduce the first benchmark for knowledge-aware concept customization: KnowCusBench. Extensive evaluations have demonstrated that MoKus outperforms state-of-the-art methods. Moreover, the cross-model knowledge transfer allows MoKus to be easily extended to other knowledge-aware applications like virtual concept creation and concept erasure. We also demonstrate the capability of our method to achieve improvements on world knowledge benchmarks.
Chinese Translation
概念定制通常将稀有标记绑定到目标概念。不幸的是,这些方法往往表现不稳定,因为预训练数据中很少包含这些稀有标记。同时,这些稀有标记无法传达目标概念的内在知识。因此,我们引入了知识感知概念定制,这是一项新任务,旨在将多样的文本知识绑定到目标视觉概念。该任务要求模型识别文本提示中的知识,以实现高保真度的定制生成。同时,模型应有效地将所有文本知识绑定到目标概念。因此,我们提出了MoKus,这是一种新的知识感知概念定制框架。我们的框架依赖于一个关键观察:跨模态知识转移,在生成过程中,文本模态中的知识修改自然转移到视觉模态。受此观察启发,MoKus包含两个阶段:(1)在视觉概念学习中,我们首先学习锚点表示,以存储目标概念的视觉信息。(2)在文本知识更新中,我们更新对锚点表示的知识查询的答案,从而实现高保真度的定制生成。为了更全面地评估我们提出的MoKus在新任务上的表现,我们引入了知识感知概念定制的第一个基准:KnowCusBench。广泛的评估表明,MoKus优于最先进的方法。此外,跨模态知识转移使得MoKus可以轻松扩展到其他知识感知应用,如虚拟概念创建和概念消除。我们还展示了我们的方法在世界知识基准上的改进能力。
cs.CV / 55 / 2603.12746

Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World

动态思维:多模态大型语言模型如何感知、追踪和推理物理4D世界中的动态
Huang, Yuzhi, Wen, Kairun, Gao, Rongxin, Liu, Dongxuan, Lou, Yibin, Wu, Jie, Xu, Jing, Zhang, Jian, Yang, Zheng, Lin, Yunlong, Li, Chenxin, Pan, Panwang, Lu, Junbin, Jiang, Jingyan, Ding, Xinghao, Huang, Yue, Wang, Zhi
Abstract
Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (spatial with temporal dimension). While current Multimodal Large Language Models (MLLMs) excel in static visual understanding, can they also be adept at "thinking in dynamics", i.e., perceive, track and reason about spatio-temporal dynamics in evolving scenes? To systematically assess their spatio-temporal reasoning and localized dynamics perception capabilities, we introduce Dyn-Bench, a large-scale benchmark built from diverse real-world and synthetic video datasets, enabling robust and scalable evaluation of spatio-temporal understanding. Through multi-stage filtering from massive 2D and 4D data sources, Dyn-Bench provides a high-quality collection of dynamic scenes, comprising 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding pairs. We probe general, spatial and region-level MLLMs to express how they think in dynamics both linguistically and visually, and find that existing models cannot simultaneously maintain strong performance in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction. Notably, conventional prompting strategies (e.g., chain-of-thought or caption-based hints) provide limited improvement, whereas structured integration approaches, including Mask-Guided Fusion and Spatio-Temporal Textual Cognitive Map (ST-TCM), significantly enhance MLLMs' dynamics perception and spatio-temporal reasoning in the physical 4D world. Code and benchmark are available at https://dyn-bench.github.io/.
Chinese Translation
人类生活在一个物理4D世界中,几何结构和语义内容随时间演变,构成了一个动态的4D现实(具有时间维度的空间)。尽管当前的多模态大型语言模型(MLLMs)在静态视觉理解方面表现出色,但它们是否也能在“动态思维”方面表现出色,即感知、追踪和推理不断变化场景中的时空动态?为了系统评估它们的时空推理和局部动态感知能力,我们引入了Dyn-Bench,这是一个基于多样化的真实世界和合成视频数据集构建的大规模基准,能够对时空理解进行稳健和可扩展的评估。通过对大量2D和4D数据源的多阶段筛选,Dyn-Bench提供了一组高质量的动态场景,包括1000个视频、7000对视觉问答(VQA)和3000对动态物体定位。我们探讨了一般性、空间和区域级的MLLMs,表达它们如何在语言和视觉上进行动态思维,并发现现有模型无法在时空推理和动态物体定位方面同时保持强劲的表现,常常产生对运动和交互的不一致解释。值得注意的是,传统的提示策略(例如,链式思维或基于标题的提示)提供的改进有限,而结构化集成方法,包括掩码引导融合(Mask-Guided Fusion)和时空文本认知图(Spatio-Temporal Textual Cognitive Map, ST-TCM),显著增强了MLLMs在物理4D世界中的动态感知和时空推理能力。代码和基准可在 https://dyn-bench.github.io/ 获取。
cs.CV / 56 / 2603.12749

SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking

SLICE:通过分区嵌入进行语义潜在注入的图像水印技术
Gao, Zheng, Yang, Yifan, Li, Xiaoyu, Feng, Xiaoyan, Fan, Haoran, Song, Yang, Jiang, Jiaojiao
Abstract
Watermarking the initial noise of diffusion models has emerged as a promising approach for image provenance, but content-independent noise patterns can be forged via inversion and regeneration attacks. Recent semantic-aware watermarking methods improve robustness by conditioning verification on image semantics. However, their reliance on a single global semantic binding makes them vulnerable to localized but globally coherent semantic edits. To address this limitation and provide a trustworthy semantic-aware watermark, we propose Semantic Latent Injection via Compartmentalized Embedding (SLICE). Our framework decouples image semantics into four semantic factors (subject, environment, action, and detail) and precisely anchors them to distinct regions in the initial Gaussian noise. This fine-grained semantic binding enables advanced watermark verification where semantic tampering is detectable and localizable. We theoretically justify why SLICE enables robust and reliable tamper localization and provides statistical guarantees on false-accept rates. Experimental results demonstrate that SLICE significantly outperforms existing baselines against advanced semantic-guided regeneration attacks, substantially reducing attack success while preserving image quality and semantic fidelity. Overall, SLICE offers a practical, training-free provenance solution that is both fine-grained in diagnosis and robust to realistic adversarial manipulations.
Chinese Translation
对扩散模型初始噪声进行水印处理已成为图像来源认证的一种有前景的方法,但内容无关的噪声模式可能会通过反演和再生攻击被伪造。近期的语义感知水印方法通过将验证与图像语义相结合,提高了鲁棒性。然而,它们依赖于单一的全局语义绑定,使其易受局部但全局一致的语义编辑的影响。为了解决这一局限性并提供可信的语义感知水印,我们提出了SLICE(Semantic Latent Injection via Compartmentalized Embedding)。我们的框架将图像语义解耦为四个语义因子(主体、环境、动作和细节),并将其精确锚定到初始高斯噪声的不同区域。这种细粒度的语义绑定使得高级水印验证成为可能,能够检测和定位语义篡改。我们从理论上证明了SLICE如何实现鲁棒和可靠的篡改定位,并提供了关于误接受率的统计保证。实验结果表明,SLICE在对抗先进的语义引导再生攻击时显著优于现有基线,显著降低了攻击成功率,同时保持了图像质量和语义保真度。总体而言,SLICE提供了一种实用的、无训练的来源解决方案,既在诊断上细致入微,又对现实中的对抗性操控具有鲁棒性。
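As a toy illustration of the compartmentalized idea (not the paper's actual scheme), one can blend a keyed pseudo-random pattern into one disjoint noise region per semantic factor and verify each region independently by correlation, so that regenerating one compartment is detectable and localizable while the others stay intact. All keys, region layouts, and thresholds below are hypothetical:

```python
import numpy as np

FACTORS = ["subject", "environment", "action", "detail"]

def embed_watermark(noise, alpha=0.5):
    """Split the initial noise map into four quadrants and blend a keyed
    pattern into each one (one compartment per semantic factor)."""
    h, w = noise.shape
    marked, patterns = noise.copy(), {}
    for i, name in enumerate(FACTORS):
        r, c = divmod(i, 2)
        sl = (slice(r * h // 2, (r + 1) * h // 2),
              slice(c * w // 2, (c + 1) * w // 2))
        pat = np.random.default_rng(i).normal(size=marked[sl].shape)
        # Blend so the marked region stays (approximately) unit-variance.
        marked[sl] = (marked[sl] + alpha * pat) / np.sqrt(1 + alpha**2)
        patterns[name] = (sl, pat)
    return marked, patterns

def verify(latent, patterns):
    """Per-factor correlation test: a tampered compartment loses
    correlation with its keyed pattern; intact ones keep it."""
    return {name: float(np.corrcoef(latent[sl].ravel(), pat.ravel())[0, 1])
            for name, (sl, pat) in patterns.items()}

rng = np.random.default_rng(42)
marked, patterns = embed_watermark(rng.normal(size=(64, 64)))
tampered = marked.copy()
sl, _ = patterns["action"]
tampered[sl] = rng.normal(size=tampered[sl].shape)  # regenerate one region
scores = verify(tampered, patterns)
print(scores["subject"] > 0.3, abs(scores["action"]) < 0.2)  # True True
```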
cs.CV / 57 / 2603.12751

Show, Don't Tell: Detecting Novel Objects by Watching Human Videos

展示,而非叙述:通过观看人类视频检测新物体
Akl, James, Arbelaez, Jose Nicolas Avendano, Barabas, James, Barry, Jennifer L., Ching, Kalie, Eshed, Noam, Fu, Jiahui, Hidalgo, Michel, Hoelscher, Andrew, Kusnur, Tushar, Messing, Andrew, Nagler, Zachary, Okorn, Brian, Passerino, Mauro, Perkins, Tim J., Rosen, Eric, Shah, Ankit, Shankar, Tanmay, Shaw, Scott
Abstract
How can a robot quickly identify and recognize new objects shown to it during a human demonstration? Existing closed-set object detectors frequently fail at this because the objects are out-of-distribution. While open-set detectors (e.g., VLMs) sometimes succeed, they often require expensive and tedious human-in-the-loop prompt engineering to uniquely recognize novel object instances. In this paper, we present a self-supervised system that eliminates the need for tedious language descriptions and expensive prompt engineering by training a bespoke object detector on an automatically created dataset, supervised by the human demonstration itself. In our approach, "Show, Don't Tell," we show the detector the specific objects of interest during the demonstration, rather than telling the detector about these objects via complex language descriptions. By bypassing language altogether, this paradigm enables us to quickly train bespoke detectors tailored to the relevant objects observed in human task demonstrations. We develop an integrated on-robot system to deploy our "Show, Don't Tell" paradigm of automatic dataset creation and novel object-detection on a real-world robot. Empirical results demonstrate that our pipeline significantly outperforms state-of-the-art detection and recognition methods for manipulated objects, leading to improved task completion for the robot.
Chinese Translation
机器人如何快速识别并认出在人类演示中展示的新物体?现有的闭集物体检测器常常无法做到这一点,因为这些物体超出了其分布范围。虽然开放集检测器(例如,VLMs)有时能够成功,但它们通常需要昂贵且繁琐的人类参与提示工程,以唯一识别新物体实例。在本文中,我们提出了一种自监督系统,通过在自动创建的数据集上训练定制的物体检测器,消除了对繁琐语言描述和昂贵提示工程的需求,该数据集由人类演示本身进行监督。在我们的方法“展示,而非叙述”中,我们在演示过程中向检测器展示特定的感兴趣物体,而不是通过复杂的语言描述告诉检测器这些物体。通过完全绕过语言,这一范式使我们能够快速训练针对人类任务演示中观察到的相关物体量身定制的检测器。我们开发了一个集成的机器人系统,以在真实世界的机器人上部署我们的“展示,而非叙述”自动数据集创建和新物体检测范式。实证结果表明,我们的流程在被操控物体的检测与识别上显著优于最先进的方法,从而提高了机器人的任务完成率。
cs.CV / 58 / 2603.12758

FC-Track: Overlap-Aware Post-Association Correction for Online Multi-Object Tracking

FC-Track:针对在线多目标跟踪的重叠感知后关联修正
Ju, Cheng, Zhao, Zejing, Namiki, Akio
Abstract
Reliable multi-object tracking (MOT) is essential for robotic systems operating in complex and dynamic environments. Despite recent advances in detection and association, online MOT methods remain vulnerable to identity switches caused by frequent occlusions and object overlap, where incorrect associations can propagate over time and degrade tracking reliability. We present a lightweight post-association correction framework (FC-Track) for online MOT that explicitly targets overlap-induced mismatches during inference. The proposed method suppresses unreliable appearance updates under high-overlap conditions using an Intersection over Area (IoA)-based filtering strategy, and locally corrects detection-to-tracklet mismatches through appearance similarity comparison within overlapped tracklet pairs. By preventing short-term mismatches from propagating, our framework effectively mitigates long-term identity switches. It operates online without resorting to global optimization or re-identification, making it suitable for real-time robotic applications. We achieve 81.73 MOTA, 82.81 IDF1, and 66.95 HOTA on the MOT17 test set with a running speed of 5.7 FPS, and 77.52 MOTA, 80.90 IDF1, and 65.67 HOTA on the MOT20 test set with a running speed of 0.6 FPS. Specifically, our framework FC-Track produces only 29.55% long-term identity switches, which is substantially lower than existing online trackers. Meanwhile, our framework maintains state-of-the-art performance on the MOT20 benchmark.
Chinese Translation
可靠的多目标跟踪(MOT)对于在复杂和动态环境中运行的机器人系统至关重要。尽管在检测和关联方面取得了近期进展,但在线MOT方法仍然容易受到频繁遮挡和目标重叠引起的身份切换的影响,其中错误的关联可能会随着时间的推移而传播,从而降低跟踪的可靠性。我们提出了一种轻量级的后关联修正框架(FC-Track),专门针对推理过程中因重叠引起的不匹配。所提出的方法在高重叠条件下使用基于面积交集(IoA)的过滤策略来抑制不可靠的外观更新,并通过在重叠的轨迹对之间进行外观相似性比较来局部修正检测与轨迹片段之间的不匹配。通过防止短期不匹配的传播,我们的框架有效减轻了长期身份切换,而无需依赖全局优化或重新识别。该框架在线运行,无需全局优化或重新识别,适合实时机器人应用。我们在MOT17测试集上取得了81.73的MOTA、82.81的IDF1和66.95的HOTA,运行速度为5.7 FPS;在MOT20测试集上取得了77.52的MOTA、80.90的IDF1和65.67的HOTA,运行速度为0.6 FPS。具体而言,我们的框架FC-Track仅产生29.55%的长期身份切换,显著低于现有的在线跟踪器。同时,我们的框架在MOT20基准测试中保持了最先进的性能。
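IoA (Intersection over Area) differs from IoU in that the intersection is normalized by one box's own area, making it a direct measure of how occluded that box is. A minimal sketch of the overlap-gated appearance update described above; the threshold value and helper names are illustrative, not taken from the paper:

```python
def intersection_over_area(box_a, box_b):
    """IoA of box_a w.r.t. box_b: fraction of box_a covered by box_b.
    Boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    return inter / area_a if area_a > 0 else 0.0

def should_update_appearance(box, others, ioa_thresh=0.5):
    """Suppress appearance-feature updates when the track is heavily
    overlapped by any other track (unreliable appearance evidence)."""
    return all(intersection_over_area(box, o) < ioa_thresh for o in others)

person = (10, 10, 50, 90)
occluder = (30, 10, 80, 90)  # covers the right half of `person`
print(intersection_over_area(person, occluder))   # 0.5
print(should_update_appearance(person, [occluder]))  # False
```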
cs.CV / 59 / 2603.12759

SAP: Segment Any 4K Panorama

SAP: 任意4K全景分割
Jiang, Lutao, Cao, Zidong, Chen, Weikai, Zheng, Xu, Lyu, Yuanhuiyi, Li, Zhenyang, HU, Zeyu, Yin, Yingda, Luo, Keyang, Zhang, Runze, Yan, Kai, Qian, Shengju, Fan, Haidi, Peng, Yifan, Wang, Xin, Xiong, Hui, Chen, Ying-Cong
Abstract
Promptable instance segmentation is widely adopted in embodied and AR systems, yet the performance of foundation models trained on perspective imagery often degrades on 360° panoramas. In this paper, we introduce Segment Any 4K Panorama (SAP), a foundation model for 4K high-resolution panoramic instance-level segmentation. We reformulate panoramic segmentation as fixed-trajectory perspective video segmentation, decomposing a panorama into overlapping perspective patches sampled along a continuous spherical traversal. This memory-aligned reformulation preserves native 4K resolution while restoring the smooth viewpoint transitions required for stable cross-view propagation. To enable large-scale supervision, we synthesize 183,440 4K-resolution panoramic images with instance segmentation labels using the InfiniGen engine. Trained under this trajectory-aligned paradigm, SAP generalizes effectively to real-world 360° images, achieving a +17.2 zero-shot mIoU gain over vanilla SAM2 of different sizes on a real-world 4K panorama benchmark.
Chinese Translation
可提示的实例分割在具身和增强现实系统中被广泛采用,但在透视图像上训练的基础模型在360°全景图上的性能往往会下降。本文介绍了任意4K全景分割(SAP),这是一个用于4K高分辨率全景实例级分割的基础模型。我们将全景分割重新表述为固定轨迹的透视视频分割,将全景图分解为沿连续球面遍历采样的重叠透视图块。这种内存对齐的重新表述保持了原生4K分辨率,同时恢复了稳定跨视图传播所需的平滑视角过渡。为了实现大规模监督,我们使用InfiniGen引擎合成了183,440张带有实例分割标签的4K分辨率全景图像。在这种轨迹对齐的范式下训练后,SAP能够有效泛化到真实世界的360°图像,在真实世界的4K全景基准测试中,相较于不同尺寸的原始SAM2取得了+17.2的零样本mIoU增益。
cs.CV / 60 / 2603.12760

HIFICL: High-Fidelity In-Context Learning for Multimodal Tasks

HIFICL:用于多模态任务的高保真上下文学习
Li, Xiaoyu, Liu, Yuhang, Luo, Zheng, Kang, Xuanshuo, Lou, Fangqi, Wu, Xiaohua, Xiong, Zihan
Abstract
In-Context Learning (ICL) is a significant paradigm for Large Multimodal Models (LMMs), using a few in-context demonstrations (ICDs) for new task adaptation. However, its performance is sensitive to demonstration configurations and computationally expensive. Mathematically, the influence of these demonstrations can be decomposed into a dynamic mixture of the standard attention output and the context values. Current approximation methods simplify this process by learning a "shift vector". Inspired by the exact decomposition, we introduce High-Fidelity In-Context Learning (HIFICL) to more faithfully model the ICL mechanism. HIFICL consists of three key components: 1) a set of "virtual key-value pairs" to act as a learnable context, 2) a low-rank factorization for stable and regularized training, and 3) a simple end-to-end training objective. From another perspective, this mechanism constitutes a form of context-aware Parameter-Efficient Fine-Tuning (PEFT). Extensive experiments show that HiFICL consistently outperforms existing approximation methods on several multimodal benchmarks. The code is available at https://github.com/bbbandari/HiFICL.
Chinese Translation
上下文学习(ICL)是大型多模态模型(LMMs)的一种重要范式,通过少量的上下文示例(ICDs)进行新任务的适应。然而,其性能对示例配置敏感且计算成本高。数学上,这些示例的影响可以分解为标准注意力输出和上下文值的动态混合。目前的近似方法通过学习一个“位移向量”来简化这一过程。受到精确分解的启发,我们引入了高保真上下文学习(HIFICL),以更真实地建模ICL机制。HIFICL由三个关键组成部分构成:1)一组“虚拟键值对”,作为可学习的上下文;2)用于稳定和正则化训练的低秩分解;3)一个简单的端到端训练目标。从另一个角度来看,这一机制构成了一种上下文感知的参数高效微调(PEFT)。大量实验表明,HIFICL在多个多模态基准测试中始终优于现有的近似方法。代码可在 https://github.com/bbbandari/HiFICL 获取。
cs.CV / 61 / 2603.12762

TerraFlow: Multimodal, Multitemporal Representation Learning for Earth Observation

TerraFlow:用于地球观测的多模态、多时间表示学习
Puriy, Nazar, Jakubik, Johannes, Blumenstiel, Benedikt, Schindler, Konrad
Abstract
We propose TerraFlow, a novel approach to multimodal, multitemporal learning for Earth observation. TerraFlow builds on temporal training objectives that enable sequence-aware learning across space, time, and modality, while remaining robust to the variable-length inputs commonly encountered in real-world Earth observation data. Our experiments demonstrate superiority of TerraFlow over state-of-the-art foundation models for Earth observation across all temporal tasks of the GEO-Bench-2 benchmark. We additionally demonstrate that TerraFlow is able to make initial steps towards deep-learning based risk map prediction for natural disasters -- a task on which other state-of-the-art foundation models frequently collapse. TerraFlow outperforms state-of-the-art foundation models by up to 50% in F1 score and 24% in Brier score.
Chinese Translation
我们提出了TerraFlow,一种用于地球观测的多模态、多时间学习的新方法。TerraFlow基于时间训练目标,能够实现跨空间、时间和模态的序列感知学习,同时对现实世界中常见的变长输入保持鲁棒性。我们的实验表明,TerraFlow在GEO-Bench-2基准的所有时间任务中优于现有的地球观测基础模型。此外,我们还展示了TerraFlow在基于深度学习的自然灾害风险地图预测方面迈出了初步步伐——这是其他现有基础模型常常失败的任务。TerraFlow在F1分数上比现有基础模型提高了多达50%,在Brier分数上提高了24%。
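The Brier score cited above is simply the mean squared error between forecast probabilities and binary outcomes, so lower is better; a minimal reference implementation with illustrative numbers:

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary
    outcomes; 0 is a perfect forecast, lower is better."""
    p = np.asarray(probs, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    return float(np.mean((p - y) ** 2))

# Risk-map style toy example: per-cell disaster probability vs. outcome.
print(round(brier_score([0.9, 0.1, 0.8, 0.3], [1, 0, 1, 0]), 4))  # 0.0375
```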
cs.CV / 62 / 2603.12764

SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion

SAVA-X:通过场景自适应视图对齐和双向交叉视图融合进行自我到外部模仿错误检测
Li, Xiang, Qiu, Heqian, Wang, Lanxiao, Qiu, Benliu, Meng, Fanman, Xu, Linfeng, Li, Hongliang
Abstract
Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego→Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align-Fuse-Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion. On the EgoMe benchmark, SAVA-X consistently improves AUPRC and mean tIoU over all baselines, and ablations confirm the complementary benefits of its components. Code is available at https://github.com/jack1ee/SAVAX.
Chinese Translation
错误检测在工业培训、医疗保健和装配质量控制中至关重要。现有的大多数研究假设单视图设置,无法处理使用第三人称(外部)示范来评估第一人称(自我)模仿的实际情况。我们正式定义了自我→外部(Ego→Exo)模仿错误检测:给定异步、长度不匹配的自我和外部视频,模型必须在自我时间线上定位程序步骤,并判断每个步骤是否存在错误。该设置引入了跨视图领域偏移、时间错位和大量冗余。在统一协议下,我们改编了来自密集视频字幕和时序动作检测的强基线,并表明它们在这一跨视图设定下表现不佳。随后,我们提出了SAVA-X,一个“对齐-融合-检测”(Align-Fuse-Detect)框架,具有(i)视图条件自适应采样,(ii)场景自适应视图嵌入,以及(iii)双向交叉注意力融合。在EgoMe基准测试中,SAVA-X相对所有基线均提升了AUPRC和平均tIoU,消融实验证实了其各组件的互补效益。代码可在 https://github.com/jack1ee/SAVAX 获取。
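Mean tIoU, the step-localization metric reported above, reduces to interval intersection-over-union on the ego timeline; a minimal reference implementation with illustrative segments:

```python
def temporal_iou(pred, gt):
    """tIoU between two (start, end) segments on a shared timeline."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Predicted step localization vs. ground truth, in seconds.
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5
```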
cs.CV / 63 / 2603.12766

Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation

Catalyst4D:通过动态传播实现高保真3D到4D场景编辑
Chen, Shifeng, Li, Yihui, Liao, Jun, Yang, Hongyu, Huang, Di
Abstract
Recent advances in 3D scene editing using NeRF and 3DGS enable high-quality static scene editing. In contrast, dynamic scene editing remains challenging, as methods that directly extend 2D diffusion models to 4D often produce motion artifacts, temporal flickering, and inconsistent style propagation. We introduce Catalyst4D, a framework that transfers high-quality 3D edits to dynamic 4D Gaussian scenes while maintaining spatial and temporal coherence. At its core, Anchor-based Motion Guidance (AMG) builds a set of structurally stable and spatially representative anchors from both original and edited Gaussians. These anchors serve as robust region-level references, and their correspondences are established via optimal transport to enable consistent deformation propagation without cross-region interference or motion drift. Complementarily, Color Uncertainty-guided Appearance Refinement (CUAR) preserves temporal appearance consistency by estimating per-Gaussian color uncertainty and selectively refining regions prone to occlusion-induced artifacts. Extensive experiments demonstrate that Catalyst4D achieves temporally stable, high-fidelity dynamic scene editing and outperforms existing methods in both visual quality and motion coherence.
Chinese Translation
最近在使用NeRF和3DGS进行3D场景编辑方面的进展,使得高质量静态场景编辑成为可能。相比之下,动态场景编辑仍然面临挑战,因为直接将2D扩散模型扩展到4D的方法往往会产生运动伪影、时间闪烁和风格传播不一致的问题。我们提出了Catalyst4D,一个将高质量3D编辑转移到动态4D高斯场景的框架,同时保持空间和时间的一致性。在其核心,基于锚点的运动引导(Anchor-based Motion Guidance, AMG)从原始和编辑后的高斯中构建一组结构稳定且空间代表性的锚点。这些锚点作为稳健的区域级参考,其对应关系通过最优运输建立,从而实现一致的变形传播,避免跨区域干扰或运动漂移。补充地,颜色不确定性引导的外观细化(Color Uncertainty-guided Appearance Refinement, CUAR)通过估计每个高斯的颜色不确定性并选择性地细化易受遮挡伪影影响的区域,保持时间上的外观一致性。大量实验表明,Catalyst4D实现了时间稳定、高保真的动态场景编辑,并在视觉质量和运动一致性方面超越了现有方法。
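The abstract establishes anchor correspondences via optimal transport. A minimal entropy-regularized (Sinkhorn) sketch on toy anchors is shown below; it is an illustration of the general technique, not the paper's actual solver, and the anchor coordinates and regularization strength are made up:

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iter=200):
    """Entropy-regularized optimal transport between two uniform anchor
    sets; returns a soft correspondence (transport) matrix."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):  # alternate row/column marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Edited-scene anchors are a permutation of the originals; OT should
# recover the matching despite the lack of a shared ordering.
orig = np.arange(15, dtype=float).reshape(5, 3)  # five separated anchors
perm = np.array([2, 0, 3, 4, 1])
edited = orig[perm]
cost = np.linalg.norm(orig[:, None] - edited[None, :], axis=-1)
P = sinkhorn(cost)
print((P.argmax(axis=1) == np.argsort(perm)).all())  # True
```

The transport plan `P` is soft, which is what allows region-level correspondences to propagate deformation smoothly rather than through hard one-to-one assignments.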
cs.CV / 64 / 2603.12772

PVI: Plug-in Visual Injection for Vision-Language-Action Models

PVI:用于视觉-语言-动作模型的插件视觉注入
Zhang, Zezhou, Zhang, Songxin, Xiong, Xiao, Zhang, Junjie, Xie, Zejian, Xi, Jingyi, Mao, Zunyao, Mao, Zan, Mai, Zhixin, Song, Zhuoyang, Zhang, Jiaxing
Abstract
VLA architectures that pair a pretrained VLM with a flow-matching action expert have emerged as a strong paradigm for language-conditioned manipulation. Yet the VLM, optimized for semantic abstraction and typically conditioned on static visual observations, tends to attenuate fine-grained geometric cues and often lacks explicit temporal evidence for the action expert. Prior work mitigates this by injecting auxiliary visual features, but existing approaches either focus on static spatial representations or require substantial architectural modifications to accommodate temporal inputs, leaving temporal information underexplored. We propose Plug-in Visual Injection (PVI), a lightweight, encoder-agnostic module that attaches to a pretrained action expert and injects auxiliary visual representations via zero-initialized residual pathways, preserving pretrained behavior with only single-stage fine-tuning. Using PVI, we obtain consistent gains over the base policy and a range of competitive alternative injection strategies, and our controlled study shows that temporal video features (V-JEPA2) outperform strong static image features (DINOv2), with the largest gains on multi-phase tasks requiring state tracking and coordination. Real-robot experiments on long-horizon bimanual cloth folding further demonstrate the practicality of PVI beyond simulation.
Chinese Translation
将预训练的视觉语言模型(VLM)与流匹配动作专家相结合的视觉语言-动作(VLA)架构已成为语言条件下操控的强大范式。然而,VLM优化于语义抽象,通常基于静态视觉观测进行条件处理,往往会削弱细粒度几何线索,并且缺乏动作专家所需的明确时间证据。以往的研究通过注入辅助视觉特征来缓解这一问题,但现有方法要么专注于静态空间表示,要么需要对架构进行大量修改以适应时间输入,从而使时间信息未得到充分探索。我们提出了插件视觉注入(PVI),这是一种轻量级、编码器无关的模块,能够附加到预训练的动作专家上,通过零初始化的残差路径注入辅助视觉表示,仅需单阶段微调即可保留预训练行为。使用PVI,我们在基础策略和一系列竞争性的替代注入策略上获得了一致的提升,我们的对照研究表明,时间视频特征(V-JEPA2)优于强大的静态图像特征(DINOv2),在需要状态跟踪和协调的多阶段任务中获得了最大的提升。在长时间跨度的双手布料折叠的真实机器人实验中,进一步展示了PVI在模拟之外的实用性。
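A zero-initialized residual pathway is easy to verify in isolation: with the output projection at zero, the pretrained hidden states pass through unchanged, which is exactly the behavior-preservation property claimed above. A minimal sketch; the dimensions, nonlinearity, and class name are illustrative, not from the paper:

```python
import numpy as np

class ZeroInitInjector:
    """Residual pathway whose output projection starts at zero, so the
    pretrained policy's behavior is exactly preserved at initialization."""
    def __init__(self, d_aux, d_model, rng):
        self.w_in = rng.normal(scale=0.02, size=(d_aux, d_model))
        self.w_out = np.zeros((d_model, d_model))  # zero-initialized

    def __call__(self, hidden, aux_feat):
        injected = np.tanh(aux_feat @ self.w_in) @ self.w_out
        return hidden + injected  # residual injection

rng = np.random.default_rng(7)
hidden = rng.normal(size=(4, 64))  # action-expert hidden states
aux = rng.normal(size=(4, 128))    # e.g. video-encoder features
inj = ZeroInitInjector(128, 64, rng)
print(np.allclose(inj(hidden, aux), hidden))  # True before fine-tuning
```

During fine-tuning `w_out` drifts away from zero, letting the auxiliary (e.g. temporal video) features gradually influence the action expert.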
cs.CV / 65 / 2603.12773

Empowering Semantic-Sensitive Underwater Image Enhancement with VLM

利用视觉语言模型增强语义敏感的水下图像增强
Fan, Guodong, Zhou, Shengning, Yuan, Genji, Li, Huiyu, Zhou, Jingchun, Li, Jinjiang
Abstract
In recent years, learning-based underwater image enhancement (UIE) techniques have rapidly evolved. However, distribution shifts between high-quality enhanced outputs and natural images can hinder semantic cue extraction for downstream vision tasks, thereby limiting the adaptability of existing enhancement models. To address this challenge, this work proposes a new learning mechanism that leverages Vision-Language Models (VLMs) to empower UIE models with semantic-sensitive capabilities. To be concrete, our strategy first generates textual descriptions of key objects from a degraded image via VLMs. Subsequently, a text-image alignment model remaps these relevant descriptions back onto the image to produce a spatial semantic guidance map. This map then steers the UIE network through a dual-guidance mechanism, which combines cross-attention and an explicit alignment loss. This forces the network to focus its restorative power on semantic-sensitive regions during image reconstruction, rather than pursuing a globally uniform improvement, thereby ensuring the faithful restoration of key object features. Experiments confirm that when our strategy is applied to different UIE baselines, it significantly boosts their performance on perceptual quality metrics and enhances their performance on detection and segmentation tasks, validating its effectiveness and adaptability.
Chinese Translation
近年来,基于学习的水下图像增强(UIE)技术迅速发展。然而,高质量增强输出与自然图像之间的分布差异可能会阻碍下游视觉任务的语义线索提取,从而限制现有增强模型的适应性。为了解决这一挑战,本文提出了一种新的学习机制,利用视觉语言模型(VLM)赋予UIE模型语义敏感的能力。具体而言,我们的策略首先通过VLM从退化图像生成关键对象的文本描述。随后,一个文本-图像对齐模型将这些相关描述重新映射到图像上,以生成空间语义引导图。该图通过双重引导机制引导UIE网络,该机制结合了交叉注意力和显式对齐损失。这迫使网络在图像重建过程中将恢复能力集中于语义敏感区域,而不是追求全局均匀的改进,从而确保关键对象特征的忠实恢复。实验结果表明,当我们的策略应用于不同的UIE基线时,显著提升了它们在感知质量指标上的表现,并增强了它们在检测和分割任务上的性能,验证了其有效性和适应性。
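The dual-guidance mechanism is only sketched in the abstract. As one hedged illustration of how a spatial guidance map can steer restoration toward semantic regions, the map can reweight a reconstruction loss so that errors on key objects cost more than background errors; the weighting scheme and all constants here are hypothetical:

```python
import numpy as np

def semantic_weighted_l1(pred, target, guidance, base=1.0, boost=4.0):
    """L1 loss reweighted by a spatial semantic guidance map in [0, 1],
    so errors on key-object regions cost more than background errors."""
    weights = base + boost * guidance
    return float(np.sum(weights * np.abs(pred - target)) / np.sum(weights))

rng = np.random.default_rng(5)
target = rng.random((32, 32))
guidance = np.zeros((32, 32))
guidance[8:24, 8:24] = 1.0  # key-object region from the text-image map

pred_in = target.copy()
pred_in[8:24, 8:24] += 0.1               # 0.1 error inside the object
pred_out = target + 0.1
pred_out[8:24, 8:24] = target[8:24, 8:24]  # 0.1 error on background only

l_in = semantic_weighted_l1(pred_in, target, guidance)
l_out = semantic_weighted_l1(pred_out, target, guidance)
print(l_in > l_out)  # True: semantic regions are penalized more
```

Even though the background error covers three times as many pixels, the object-region error dominates the loss, mirroring the abstract's claim that restoration focuses on semantic-sensitive regions.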
cs.CV / 66 / 2603.12787

Generalized Recognition of Basic Surgical Actions Enables Skill Assessment and Vision-Language-Model-based Surgical Planning

基本外科动作的广义识别促进技能评估和基于视觉-语言模型的外科规划
Xu, Mengya, Shen, Daiyun, Zhang, Jie, Yip, Hon Chi, Gao, Yujia, Chen, Cheng, Imans, Dillan, Long, Yonghao, Ye, Yiru, Liu, Yixiao, Mai, Rongyun, Chen, Kai, Ren, Hongliang, Ban, Yutong, Wang, Guangsuo, Wong, Francis, Ng, Chi-Fai, Ngiam, Kee Yuan, Taylor, Russell H., Xu, Daguang, Jin, Yueming, Dou, Qi
Abstract
Artificial intelligence, imaging, and large language models have the potential to transform surgical practice, training, and automation. Understanding and modeling of basic surgical actions (BSA), the fundamental unit of operation in any surgery, is important to drive the evolution of this field. In this paper, we present a BSA dataset comprising 10 basic actions across 6 surgical specialties with over 11,000 video clips, which is the largest to date. Based on the BSA dataset, we developed a new foundation model that conducts general-purpose recognition of basic actions. Our approach demonstrates robust cross-specialist performance in experiments validated on datasets from different procedural types and various body parts. Furthermore, we demonstrate downstream applications enabled by the BSA foundation model through surgical skill assessment in prostatectomy using domain-specific knowledge, and action planning in cholecystectomy and nephrectomy using large vision-language models. Multinational surgeons' evaluation of the language model's explainable action-planning texts demonstrated clinical relevance. These findings indicate that basic surgical actions can be robustly recognized across scenarios, and an accurate BSA understanding model can essentially facilitate complex applications and speed up the realization of surgical superintelligence.
Chinese Translation
人工智能、成像技术和大型语言模型有潜力改变外科实践、培训和自动化。理解和建模基本外科动作(BSA),即任何手术中的基本操作单元,对于推动该领域的发展至关重要。本文提出了一个涵盖6个外科专业、包含10个基本动作的BSA数据集,拥有超过11,000个视频片段,是迄今为止最大的此类数据集。基于BSA数据集,我们开发了一种新的基础模型,能够进行基本动作的通用识别。我们的研究方法在不同程序类型和各种身体部位的数据集上经过验证,显示出强大的跨专业性能。此外,我们通过使用领域特定知识在前列腺切除术中的外科技能评估,以及在胆囊切除术和肾切除术中使用大型视觉-语言模型进行动作规划,展示了BSA基础模型所支持的下游应用。多国外科医生对动作规划可解释文本的语言模型输出的评估显示了临床相关性。这些发现表明,基本外科动作可以在不同场景中被可靠地识别,而准确的BSA理解模型本质上可以促进复杂应用,并加速外科超智能的实现。
cs.CV / 67 / 2603.12788

Think and Answer ME: Benchmarking and Exploring Multi-Entity Reasoning Grounding in Remote Sensing

思考与回答我:遥感中多实体推理定位的基准测试与探索
Lyu, Shuchang, Wen, Haiquan, Cheng, Guangliang, Li, Meng, Zhou, Zheng, Zhou, You, Yao, Dingding, Shi, Zhenwei
Abstract
Recent advances in reasoning language models and reinforcement learning with verifiable rewards have significantly enhanced multi-step reasoning capabilities. This progress motivates the extension of reasoning paradigms to the remote sensing visual grounding task. However, existing remote sensing grounding methods remain largely confined to perception-level matching and single-entity formulations, limiting the role of explicit reasoning and inter-entity modeling. To address this challenge, we introduce a new benchmark dataset for Multi-Entity Reasoning Grounding in Remote Sensing (ME-RSRG). Based on ME-RSRG, we reformulate remote sensing grounding as a multi-entity reasoning task and propose an Entity-Aware Reasoning (EAR) framework built upon visual-linguistic foundation models. EAR generates structured reasoning traces and subject-object grounding outputs. It adopts supervised fine-tuning for cold-start initialization and is further optimized via entity-aware reward-driven Group Relative Policy Optimization (GRPO). Extensive experiments on ME-RSRG demonstrate the challenges of multi-entity reasoning and verify the effectiveness of our proposed EAR framework. Our dataset, code, and models will be available at https://github.com/CV-ShuchangLyu/ME-RSRG.
Chinese Translation
近期在推理语言模型以及带可验证奖励的强化学习方面的进展显著增强了多步骤推理能力。这一进展促使推理范式扩展到遥感视觉定位任务。然而,现有的遥感视觉定位方法仍然主要局限于感知层面的匹配和单实体的表述,限制了显式推理和实体间建模的作用。为了解决这一挑战,我们引入了一个新的基准数据集,用于遥感中的多实体推理定位(Multi-Entity Reasoning Grounding in Remote Sensing,ME-RSRG)。基于ME-RSRG,我们将遥感视觉定位重新表述为多实体推理任务,并提出了一种基于视觉-语言基础模型的实体感知推理(Entity-Aware Reasoning,EAR)框架。EAR生成结构化的推理轨迹和主体-客体定位输出。它采用监督微调进行冷启动初始化,并通过实体感知奖励驱动的群体相对策略优化(Group Relative Policy Optimization,GRPO)进一步优化。在ME-RSRG上的大量实验展示了多实体推理的挑战,并验证了我们提出的EAR框架的有效性。我们的数据集、代码和模型将会在 https://github.com/CV-ShuchangLyu/ME-RSRG 上发布。
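EAR's reinforcement stage uses Group Relative Policy Optimization (GRPO). The abstract does not spell out the update, but the group-relative advantage at the heart of GRPO can be sketched in a few lines (pure-Python illustration; the example reward values are invented):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of sampled-rollout rewards into relative advantages.

    In GRPO, several responses are sampled for the same prompt; each
    response's advantage is its reward standardized against the group,
    removing the need for a learned value baseline.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four rollouts for one grounding prompt, scored by an
# entity-aware reward (values here are purely illustrative).
advs = group_relative_advantages([0.9, 0.4, 0.4, 0.1])
```

Rollouts scoring above the group mean get positive advantages and are reinforced; below-mean rollouts are suppressed.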
cs.CV / 68 / 2603.12789

Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass

单次传递下从多人多视角视频进行一致的人体-场景重建
Kim, Sangmin, Hwang, Minhyuk, Cha, Geonho, Wee, Dongyoon, Park, Jaesik
Abstract
Recent advances in 3D foundation models have led to growing interest in reconstructing humans and their surrounding environments. However, most existing approaches focus on monocular inputs, and extending them to multi-view settings requires additional overhead modules or preprocessed data. To this end, we present CHROMM, a unified framework that jointly estimates cameras, scene point clouds, and human meshes from multi-person multi-view videos without relying on external modules or preprocessing. We integrate strong geometric and human priors from Pi3X and Multi-HMR into a single trainable neural network architecture, and introduce a scale adjustment module to solve the scale discrepancy between humans and the scene. We also introduce a multi-view fusion strategy to aggregate per-view estimates into a single representation at test-time. Finally, we propose a geometry-based multi-person association method, which is more robust than appearance-based approaches. Experiments on EMDB, RICH, EgoHumans, and EgoExo4D show that CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view approaches. Project page: https://nstar1125.github.io/chromm.
Chinese Translation
近年来,3D基础模型的进展引发了对重建人体及其周围环境的日益关注。然而,大多数现有方法集中于单目输入,而将其扩展到多视角设置需要额外的模块或预处理数据。为此,我们提出了CHROMM,一个统一框架,能够在不依赖外部模块或预处理的情况下,从多人多视角视频中联合估计相机、场景点云和人体网格。我们将来自Pi3X和Multi-HMR的强几何和人体先验整合到一个可训练的神经网络架构中,并引入了一个尺度调整模块,以解决人体与场景之间的尺度差异。我们还提出了一种多视角融合策略,以在测试时将每个视角的估计聚合为单一表示。最后,我们提出了一种基于几何的多人关联方法,该方法比基于外观的方法更具鲁棒性。在EMDB、RICH、EgoHumans和EgoExo4D上的实验表明,CHROMM在全局人体运动和多视角姿态估计方面表现出竞争力,同时运行速度比先前基于优化的多视角方法快超过8倍。项目页面:https://nstar1125.github.io/chromm。
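CHROMM's scale adjustment module reconciles the scale-ambiguous scene reconstruction with metric human meshes. The module itself is learned, but the underlying idea can be illustrated with a closed-form one-parameter least-squares scale fit (a toy sketch; the data and variable names are hypothetical):

```python
def fit_scale(scene_depths, human_depths):
    """Least-squares scalar s minimizing ||s * scene_depths - human_depths||^2.

    A toy stand-in for reconciling a scale-ambiguous scene reconstruction
    with metric human-mesh depths: the optimal scale is the ratio of
    inner products, s* = <a, b> / <a, a>.
    """
    num = sum(a * b for a, b in zip(scene_depths, human_depths))
    den = sum(a * a for a in scene_depths)
    return num / den

# Hypothetical per-person depths from the scene branch vs. the human branch:
s = fit_scale([1.0, 2.0, 4.0], [2.0, 4.0, 8.0])  # recovers the factor 2
```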
cs.CV / 69 / 2603.12793

Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

Cheers:将图像块细节与语义表示解耦,实现统一的多模态理解与生成
Zhang, Yichen, Peng, Da, Guo, Zonghao, Zhang, Zijian, Yang, Xuesong, Sun, Tong, Sun, Shichu, Zhang, Yidan, Li, Yanghao, Zhao, Haiyan, Xu, Wang, Shi, Qi, Sun, Yangang, Chen, Chi, Wang, Shuo, Yan, Yukun, Han, Xu, Ma, Qiang, Ke, Wei, Wang, Liang, Liu, Zhiyuan, Sun, Maosong
Abstract
A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms the Tar-1.5B on the popular benchmarks GenEval and MMBench, while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.
Chinese Translation
多模态建模中的一个前沿课题是将视觉理解与生成统一在一个模型中。然而,这两项任务需要不匹配的解码机制和视觉表示,使得在共享特征空间内共同优化变得复杂。在本研究中,我们提出了Cheers,一个统一的多模态模型,它将图像块级别的细节与语义表示解耦,从而稳定多模态理解的语义,并通过门控细节残差提高图像生成的保真度。Cheers包含三个关键组件:(i) 一个统一的视觉标记器,它将图像潜在状态编码并压缩为语义标记,以便高效地进行大语言模型(LLM)条件化;(ii) 一个基于LLM的Transformer,它统一了文本生成的自回归解码和图像生成的扩散解码;(iii) 一个级联流匹配头,它首先解码视觉语义,然后从视觉标记器注入语义门控的细节残差,以细化高频内容。在流行基准上的实验表明,Cheers在视觉理解和生成方面与先进的统一多模态模型(UMMs)相匹配或超越。Cheers还实现了4倍的标记压缩,使得高分辨率图像编码和生成更加高效。值得注意的是,Cheers在流行基准GenEval和MMBench上超越了Tar-1.5B,同时仅需20%的训练成本,表明其有效且高效的统一多模态建模(即4倍标记压缩)。我们将发布所有代码和数据以供未来研究使用。
cs.CV / 70 / 2603.12796

Spectral Defense Against Resource-Targeting Attack in 3D Gaussian Splatting

3D高斯喷溅中针对资源目标攻击的光谱防御
Chen, Yang, Yu, Yi, He, Jiaming, Duan, Yueqi, Zhu, Zheng, Tan, Yap-Peng
Abstract
Recent advances in 3D Gaussian Splatting (3DGS) deliver high-quality rendering, yet the Gaussian representation exposes a new attack surface, the resource-targeting attack. This attack poisons training images, excessively inducing Gaussian growth to cause resource exhaustion. Although efficiency-oriented methods such as smoothing, thresholding, and pruning have been explored, these spatial-domain strategies operate on visible structures but overlook how stealthy perturbations distort the underlying spectral behaviors of training data. As a result, poisoned inputs introduce abnormal high-frequency amplifications that mislead 3DGS into interpreting noisy patterns as detailed structures, ultimately causing unstable Gaussian overgrowth and degraded scene fidelity. To address this, we propose \textbf{Spectral Defense} in Gaussian and image fields. We first design a 3D frequency filter to selectively prune Gaussians exhibiting abnormally high frequencies. Since natural scenes also contain legitimate high-frequency structures, directly suppressing high frequencies is insufficient, and we further develop a 2D spectral regularization on renderings, distinguishing naturally isotropic frequencies while penalizing anisotropic angular energy to constrain noisy patterns. Experiments show that our defense builds robust, accurate, and secure 3DGS, suppressing overgrowth by up to $5.92\times$, reducing memory by up to $3.66\times$, and improving speed by up to $4.34\times$ under attacks.
Chinese Translation
近期在3D高斯喷溅(3D Gaussian Splatting, 3DGS)领域的进展实现了高质量的渲染,然而高斯表示法暴露了一个新的攻击面——资源目标攻击。该攻击通过污染训练图像,过度诱导高斯增长,从而导致资源耗尽。尽管已经探索了诸如平滑、阈值处理和剪枝等以效率为导向的方法,这些空间域策略仅针对可见结构进行操作,却忽视了隐蔽扰动如何扭曲训练数据的基础光谱行为。因此,受污染的输入引入了异常的高频放大,误导3DGS将噪声模式解读为细节结构,最终导致高斯过度生长和场景保真度下降。为了解决这一问题,我们提出了在高斯和图像领域的\textbf{光谱防御}。我们首先设计了一个3D频率滤波器,以选择性地剪除表现出异常高频的高斯。由于自然场景中也包含合法的高频结构,直接抑制高频是不够的,因此我们进一步在渲染中开发了2D光谱正则化,区分自然各向同性频率,同时惩罚各向异性角能量,以约束噪声模式。实验表明,我们的防御机制构建了稳健、准确且安全的3DGS,在攻击下抑制过度生长高达$5.92\times$,减少内存使用高达$3.66\times$,并提高速度高达$4.34\times$。
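The 2D spectral regularization above penalizes anisotropic angular energy in renderings. As a rough illustration of what such a statistic measures, the sketch below bins FFT magnitudes by the angle of their spatial frequency and scores how unevenly energy spreads over angle; this is an assumption-laden proxy, not the paper's actual loss:

```python
import numpy as np

def angular_anisotropy(img, n_bins=8):
    """Score how unevenly an image's spectral energy spreads over angle.

    FFT magnitudes (DC removed) are binned by the angle of their spatial
    frequency; the score is the per-bin energy's standard deviation
    normalized by the uniform share 1/n_bins.  Directionally concentrated
    (anisotropic) spectra score high -- a toy proxy for the paper's
    anisotropic-angular-energy penalty.
    """
    f = np.fft.fftshift(np.fft.fft2(img))
    mag = np.abs(f)
    h, w = img.shape
    cy, cx = h // 2, w // 2
    mag[cy, cx] = 0.0  # drop the DC component
    ys, xs = np.mgrid[0:h, 0:w]
    theta = np.mod(np.arctan2(ys - cy, xs - cx), np.pi)  # fold opposite angles
    bins = np.minimum((theta / np.pi * n_bins).astype(int), n_bins - 1)
    energy = np.array([mag[bins == k].sum() for k in range(n_bins)])
    energy = energy / (energy.sum() + 1e-12)
    return float(energy.std() / (1.0 / n_bins))

x = np.arange(64)
stripes = np.tile(np.sin(x * 2.0), (64, 1))  # energy concentrated at one angle
blob = np.exp(-((x[:, None] - 32.0) ** 2 + (x[None, :] - 32.0) ** 2) / 8.0)
```

A sinusoidal stripe pattern scores far higher than a smooth isotropic blob, which is the kind of gap a penalty on anisotropic angular energy exploits.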
cs.CV / 71 / 2603.12799

What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models

什么使视觉语言模型(VLMs)具备鲁棒性?迈向视觉语言模型中鲁棒性与准确性的协调
Nie, Sen, Zhang, Jie, Wang, Zhongqi, Wei, Zhaoyang, Shan, Shiguang, Chen, Xilin
Abstract
Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fundamental question: What makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms function internally and how they interact with clean accuracy. Our analysis reveals that adversarial robustness is not uniformly distributed across network depth. Instead, unexpectedly, it is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization. Motivated by these insights, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible pathways to seamlessly equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks demonstrate our state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness. Our project page is available at https://summu77.github.io/R-Adapt.
Chinese Translation
在视觉语言模型(VLMs)中实现对抗鲁棒性不可避免地会影响在干净数据上的准确性,这呈现出一个长期存在且具有挑战性的权衡。在本研究中,我们通过探讨一个基本问题重新审视这一权衡:是什么使得VLMs具备鲁棒性?通过对对抗性微调模型的详细分析,我们考察了鲁棒性机制的内部运作及其与干净准确性的相互作用。我们的分析揭示,对抗鲁棒性并不是在网络深度上均匀分布的。相反,出乎意料的是,它主要集中在浅层,由低频谱偏差和对输入不敏感的注意力模式驱动。同时,对深层的更新往往会削弱干净准确性和鲁棒泛化。基于这些见解,我们提出了对抗鲁棒性适应(R-Adapt),这是一个简单而有效的框架,它冻结所有预训练权重,仅在初始层引入最小的、基于洞察的适应。这一设计在对抗鲁棒性和干净准确性之间实现了卓越的平衡。R-Adapt进一步支持无训练、模型引导和数据驱动的范式,提供灵活的途径以无缝地为标准模型赋予鲁棒性。在18个数据集和多种任务上的广泛评估展示了我们在各种攻击下的最先进性能。值得注意的是,R-Adapt能够有效地推广到大型视觉语言模型(如LLaVA和Qwen-VL),以增强其鲁棒性。我们的项目页面可访问 https://summu77.github.io/R-Adapt。
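R-Adapt freezes all pre-trained weights and adapts only the initial layers. Selecting which parameters stay trainable then reduces to filtering by block depth, as in this minimal sketch (the `blocks.<i>.` naming scheme is an assumption for illustration; real VLM checkpoints use model-specific parameter names):

```python
def select_adapted_params(param_names, shallow_layers=2):
    """Keep only parameters in the shallowest transformer blocks trainable.

    Mirrors R-Adapt's finding that adversarial robustness is localized in
    shallow layers: everything else stays frozen.  The "blocks.<i>."
    naming convention here is hypothetical.
    """
    trainable = []
    for name in param_names:
        parts = name.split(".")
        if parts[0] == "blocks" and int(parts[1]) < shallow_layers:
            trainable.append(name)
    return trainable

# A hypothetical 12-block vision encoder plus a classification head:
names = [f"blocks.{i}.attn.qkv.weight" for i in range(12)] + ["head.weight"]
tuned = select_adapted_params(names, shallow_layers=2)
```

In a real framework the same filter would set `requires_grad` per parameter rather than return names.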
cs.CV / 72 / 2603.12811

OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution

OARS:用于生成式真实世界图像超分辨率的过程感知在线对齐
Zhao, Shijie, Zhang, Xuanyu, Chen, Bin, Li, Weiqi, Xing, Qunliang, Zhang, Kexin, Wang, Yan, Li, Junlin, Zhang, Li, Zhang, Jian, Xue, Tianfan
Abstract
Aligning generative real-world image super-resolution models with human visual preference is challenging due to the perception--fidelity trade-off and diverse, unknown degradations. Prior approaches rely on offline preference optimization and static metric aggregation, which are often non-interpretable and prone to pseudo-diversity under strong conditioning. We propose OARS, a process-aware online alignment framework built on COMPASS, a MLLM-based reward that evaluates the LR to SR transition by jointly modeling fidelity preservation and perceptual gain with an input-quality-adaptive trade-off. To train COMPASS, we curate COMPASS-20K spanning synthetic and real degradations, and introduce a three-stage perceptual annotation pipeline that yields calibrated, fine-grained training labels. Guided by COMPASS, OARS performs progressive online alignment from cold-start flow matching to full-reference and finally reference-free RL via shallow LoRA optimization for on-policy exploration. Extensive experiments and user studies demonstrate consistent perceptual improvements while maintaining fidelity, achieving state-of-the-art performance on Real-ISR benchmarks.
Chinese Translation
将生成式真实世界图像超分辨率模型与人类视觉偏好对齐具有挑战性,因为存在感知-保真度权衡和多样化、未知的退化。以往的方法依赖于离线偏好优化和静态度量聚合,这些方法往往缺乏可解释性,并且在强条件下容易出现伪多样性。我们提出了OARS,一个基于COMPASS的过程感知在线对齐框架,COMPASS是一种基于多模态大语言模型(MLLM)的奖励机制,通过联合建模保真度保持和感知增益,采用输入质量自适应的权衡来评估低分辨率(LR)到超分辨率(SR)结果的转变。为了训练COMPASS,我们整理了涵盖合成和真实退化的COMPASS-20K数据集,并引入了一个三阶段的感知标注流程,生成经过校准的细粒度训练标签。在COMPASS的指导下,OARS执行渐进式在线对齐,从冷启动流匹配到全参考强化学习,最终通过浅层LoRA优化进行无参考强化学习,以实现同策略(on-policy)探索。大量实验和用户研究表明,在保持保真度的同时,感知效果持续改善,在Real-ISR基准测试中实现了最先进的性能。
cs.CV / 73 / 2603.12829

coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation

coDrawAgents:一种用于组合图像生成的多智能体对话框架
Li, Chunhan, Wu, Qifeng, Pan, Jia-Hui, Hui, Ka-Hei, Hu, Jingyu, Jiang, Yuming, Sheng, Bin, Liu, Xihui, Gong, Wenjuan, Liu, Zhengzhe
Abstract
Text-to-image generation has advanced rapidly, but existing models still struggle with faithfully composing multiple objects and preserving their attributes in complex scenes. We propose coDrawAgents, an interactive multi-agent dialogue framework with four specialized agents: Interpreter, Planner, Checker, and Painter that collaborate to improve compositional generation. The Interpreter adaptively decides between a direct text-to-image pathway and a layout-aware multi-agent process. In the layout-aware mode, it parses the prompt into attribute-rich object descriptors, ranks them by semantic salience, and groups objects with the same semantic priority level for joint generation. Guided by the Interpreter, the Planner adopts a divide-and-conquer strategy, incrementally proposing layouts for objects with the same semantic priority level while grounding decisions in the evolving visual context of the canvas. The Checker introduces an explicit error-correction mechanism by validating spatial consistency and attribute alignment, and refining layouts before they are rendered. Finally, the Painter synthesizes the image step by step, incorporating newly planned objects into the canvas to provide richer context for subsequent iterations. Together, these agents address three key challenges: reducing layout complexity, grounding planning in visual context, and enabling explicit error correction. Extensive experiments on benchmarks GenEval and DPG-Bench demonstrate that coDrawAgents substantially improves text-image alignment, spatial accuracy, and attribute binding compared to existing methods.
Chinese Translation
文本到图像生成技术发展迅速,但现有模型在忠实地组合多个对象和保留它们在复杂场景中的属性方面仍然面临挑战。我们提出了coDrawAgents,一个具有四个专业智能体的互动多智能体对话框架:解释者(Interpreter)、规划者(Planner)、检查者(Checker)和画家(Painter),它们协作以改善组合生成。解释者自适应地在直接的文本到图像路径和布局感知的多智能体过程之间做出决策。在布局感知模式下,它将提示解析为富含属性的对象描述符,根据语义显著性对其进行排序,并将具有相同语义优先级的对象分组以进行联合生成。在解释者的指导下,规划者采用分而治之的策略,逐步提出具有相同语义优先级的对象布局,同时将决策基于画布上不断演变的视觉上下文。检查者通过验证空间一致性和属性对齐,引入了显式的错误修正机制,并在渲染之前优化布局。最后,画家逐步合成图像,将新规划的对象纳入画布,以为后续迭代提供更丰富的上下文。这些智能体共同解决了三个关键挑战:减少布局复杂性、将规划与视觉上下文相结合,以及实现显式的错误修正。在基准测试GenEval和DPG-Bench上的广泛实验表明,与现有方法相比,coDrawAgents显著提高了文本与图像的对齐、空间准确性和属性绑定。
cs.CV / 74 / 2603.12832

Hierarchical Dual-Change Collaborative Learning for UAV Scene Change Captioning

用于无人机场景变化描述的层次双变化协作学习
Chen, Fuhai, Huang, Pengpeng, Wu, Junwen, Zhang, Hehong, Wang, Shiping, Ma, Xiaoguang, Ge, Xuri
Abstract
This paper proposes a novel task for UAV scene understanding - UAV Scene Change Captioning (UAV-SCC) - which aims to generate natural language descriptions of semantic changes in dynamic aerial imagery captured from a movable viewpoint. Unlike traditional change captioning that mainly describes differences between image pairs captured from a fixed camera viewpoint over time, UAV scene change captioning focuses on image-pair differences resulting from both temporal and spatial scene variations dynamically captured by a moving camera. The key challenge lies in understanding viewpoint-induced scene changes from UAV image pairs that share only partially overlapping scene content due to viewpoint shifts caused by camera rotation, while effectively exploiting the relative orientation between the two images. To this end, we propose a Hierarchical Dual-Change Collaborative Learning (HDC-CL) method for UAV scene change captioning. In particular, a novel transformer, \emph{i.e.} Dynamic Adaptive Layout Transformer (DALT) is designed to adaptively model diverse spatial layouts of the image pair, where the interrelated features derived from the overlapping and non-overlapping regions are learned within the flexible and unified encoding layer. Furthermore, we propose a Hierarchical Cross-modal Orientation Consistency Calibration (HCM-OCC) method to enhance the model's sensitivity to viewpoint shift directions, enabling more accurate change captioning. To facilitate in-depth research on this task, we construct a new benchmark dataset, named UAV-SCC dataset, for UAV scene change captioning. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on this task. The dataset and code will be publicly released upon acceptance of this paper.
Chinese Translation
本文提出了一项针对无人机场景理解的新任务——无人机场景变化描述(UAV-SCC),旨在生成对从可移动视角捕获的动态航空图像中语义变化的自然语言描述。与传统的变化描述主要描述从固定相机视角随时间捕获的图像对之间的差异不同,无人机场景变化描述关注由移动相机动态捕获的时间和空间场景变化共同导致的图像对差异。关键挑战在于:由于相机旋转引起的视角偏移,无人机图像对仅共享部分重叠的场景内容,需要在理解这种由视角引起的场景变化的同时,有效利用两幅图像之间的相对方向。为此,我们提出了一种用于无人机场景变化描述的层次双变化协作学习(HDC-CL)方法。特别地,我们设计了一种新颖的变换器,即动态自适应布局变换器(Dynamic Adaptive Layout Transformer, DALT),以自适应地建模图像对的多样空间布局,其中来自重叠和非重叠区域的相关特征在灵活统一的编码层中进行学习。此外,我们提出了一种层次跨模态方向一致性校准(HCM-OCC)方法,以增强模型对视角变化方向的敏感性,从而实现更准确的变化描述。为了促进对该任务的深入研究,我们构建了一个新的基准数据集,命名为UAV-SCC数据集,用于无人机场景变化描述。大量实验表明,所提出的方法在该任务上达到了最先进的性能。该数据集和代码将在本文被接受后公开发布。
cs.CV / 75 / 2603.12845

Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation

多模态蛋白质语言模型用于酶动力学参数:从底物识别到构象适应
Wang, Fei, Zheng, Xinye, Li, Kun, Wei, Yanyan, Liu, Yuxin, Hu, Ganpeng, Bao, Tong, Yang, Jingwen
Abstract
Predicting enzyme kinetic parameters quantifies how efficiently an enzyme catalyzes a specific substrate under defined biochemical conditions. Canonical parameters such as the turnover number ($k_\text{cat}$), Michaelis constant ($K_\text{m}$), and inhibition constant ($K_\text{i}$) depend jointly on the enzyme sequence, the substrate chemistry, and the conformational adaptation of the active site during binding. Many learning pipelines simplify this process to a static compatibility problem between the enzyme and substrate, fusing their representations through shallow operations and regressing a single value. Such formulations overlook the staged nature of catalysis, which involves both substrate recognition and conformational adaptation. In this regard, we reformulate kinetic prediction as a staged multimodal conditional modeling problem and introduce the Enzyme-Reaction Bridging Adapter (ERBA), which injects cross-modal information via fine-tuning into Protein Language Models (PLMs) while preserving their biochemical priors. ERBA performs conditioning in two stages: Molecular Recognition Cross-Attention (MRCA) first injects substrate information into the enzyme representation to capture specificity; Geometry-aware Mixture-of-Experts (G-MoE) then integrates active-site structure and routes samples to pocket-specialized experts to reflect induced fit. To maintain semantic fidelity, Enzyme-Substrate Distribution Alignment (ESDA) enforces distributional consistency within the PLM manifold in a reproducing kernel Hilbert space. In experiments across three kinetic endpoints and multiple PLM backbones, ERBA delivers consistent gains and stronger out-of-distribution performance compared with sequence-only and shallow-fusion baselines, offering a biologically grounded route to scalable kinetic prediction and a foundation for adding cofactors, mutations, and time-resolved structural cues.
Chinese Translation
预测酶动力学参数量化了酶在特定生化条件下催化特定底物的效率。经典参数如转化数 ($k_\text{cat}$)、米氏常数 ($K_\text{m}$) 和抑制常数 ($K_\text{i}$) 共同依赖于酶序列、底物化学以及在结合过程中活性位点的构象适应。许多学习流程将此过程简化为酶与底物之间的静态兼容性问题,通过浅层操作融合它们的表示并回归单一值。这种公式化忽视了催化的分阶段特性,催化过程涉及底物识别和构象适应。在这方面,我们将动力学预测重新表述为一个分阶段的多模态条件建模问题,并引入酶-反应桥接适配器(Enzyme-Reaction Bridging Adapter, ERBA),该适配器通过微调将跨模态信息注入蛋白质语言模型(Protein Language Models, PLMs),同时保留其生化先验。ERBA 在两个阶段进行条件化:分子识别交叉注意力(Molecular Recognition Cross-Attention, MRCA)首先将底物信息注入酶表示中,以捕捉特异性;几何感知专家混合模型(Geometry-aware Mixture-of-Experts, G-MoE)随后整合活性位点结构,并将样本路由到特定口袋的专家,以反映诱导契合。为了保持语义的保真性,酶-底物分布对齐(Enzyme-Substrate Distribution Alignment, ESDA)在重现核希尔伯特空间中强制执行 PLM 流形内的分布一致性。在三个动力学终点和多个 PLM 骨干的实验中,ERBA 相比于仅基于序列和浅层融合的基线模型提供了一致的增益和更强的分布外性能,为可扩展的动力学预测提供了生物学基础,并为添加辅因子、突变和时间分辨结构线索奠定了基础。
cs.CV / 76 / 2603.12848

Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach

第十届ABAW竞赛中的Team LEYA:多模态矛盾/犹豫识别方法
Ryumina, Elena, Axyonov, Alexandr, Sysoev, Dmitry, Abdulkadirov, Timur, Almetov, Kirill, Morozova, Yulia, Ryumin, Dmitry
Abstract
Ambivalence/hesitancy recognition in unconstrained videos is a challenging problem due to the subtle, multimodal, and context-dependent nature of this behavioral state. In this paper, a multimodal approach for video-level ambivalence/hesitancy recognition is presented for the 10th ABAW Competition. The proposed approach integrates four complementary modalities: scene, face, audio, and text. Scene dynamics are captured with a VideoMAE-based model, facial information is encoded through emotional frame-level embeddings aggregated by statistical pooling, acoustic representations are extracted with EmotionWav2Vec2.0 and processed by a Mamba-based temporal encoder, and linguistic cues are modeled using fine-tuned transformer-based text models. The resulting unimodal embeddings are further combined using multimodal fusion models, including prototype-augmented variants. Experiments on the BAH corpus demonstrate clear gains of multimodal fusion over all unimodal baselines. The best unimodal configuration achieved an average MF1 of 70.02%, whereas the best multimodal fusion model reached 83.25%. The highest final test performance, 71.43%, was obtained by an ensemble of five prototype-augmented fusion models. The obtained results highlight the importance of complementary multimodal cues and robust fusion strategies for ambivalence/hesitancy recognition.
Chinese Translation
在非约束视频中识别矛盾/犹豫是一项具有挑战性的任务,因为这种行为状态的微妙性、多模态性和上下文依赖性。本文提出了一种用于视频级别矛盾/犹豫识别的多模态方法,旨在参加第十届ABAW竞赛。所提出的方法整合了四种互补的模态:场景、面部、音频和文本。场景动态通过基于VideoMAE的模型捕捉,面部信息通过统计池化聚合的情感帧级嵌入进行编码,音频表示通过EmotionWav2Vec2.0提取并由基于Mamba的时间编码器处理,而语言线索则使用微调的基于变换器的文本模型进行建模。最终得到的单模态嵌入进一步通过多模态融合模型进行组合,包括原型增强变体。在BAH语料库上的实验表明,多模态融合相较于所有单模态基线有明显的提升。最佳单模态配置的平均MF1达到了70.02%,而最佳多模态融合模型则达到了83.25%。通过五个原型增强融合模型的集成,获得了最高的最终测试性能71.43%。获得的结果突显了互补多模态线索和稳健融合策略在矛盾/犹豫识别中的重要性。
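The facial branch above aggregates frame-level emotional embeddings by statistical pooling. A common realization, and a plausible reading of the abstract, is concatenating the per-dimension mean and standard deviation across frames:

```python
def statistical_pooling(frame_embeddings):
    """Aggregate frame-level embeddings into one clip-level vector.

    Concatenates the per-dimension mean and standard deviation across
    frames -- one standard form of statistical pooling (a pure-Python
    sketch; the real pipeline operates on emotional face embeddings).
    """
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    means = [sum(f[d] for f in frame_embeddings) / n for d in range(dim)]
    stds = [
        (sum((f[d] - means[d]) ** 2 for f in frame_embeddings) / n) ** 0.5
        for d in range(dim)
    ]
    return means + stds

# Two toy 2-dimensional frame embeddings pooled into a 4-dimensional clip vector:
clip_vec = statistical_pooling([[1.0, 2.0], [3.0, 2.0]])  # -> [2.0, 2.0, 1.0, 0.0]
```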
cs.CV / 77 / 2603.12852

Wear Classification of Abrasive Flap Wheels using a Hierarchical Deep Learning Approach

基于分层深度学习方法的磨料翻转轮磨损分类
Kähler, Falko, Wille, Maxim, Schmedemann, Ole, Schüppstuhl, Thorsten
Abstract
Abrasive flap wheels are common for finishing complex free-form surfaces due to their flexibility. However, this flexibility results in complex wear patterns such as concave/convex flap profiles or flap tears, which influence the grinding result. This paper proposes a novel, vision-based hierarchical classification framework to automate the wear condition monitoring of flap wheels. Unlike monolithic classification approaches, we decompose the problem into three logical levels: (1) state detection (new vs. worn), (2) wear type identification (rectangular, concave, convex) and flap tear detection, and (3) severity assessment (partial vs. complete deformation). A custom-built dataset of real flap wheel images was generated and a transfer learning approach with EfficientNetV2 architecture was used. The results demonstrate high robustness with classification accuracies ranging from 93.8% (flap tears) to 99.3% (concave severity). Furthermore, Gradient-weighted Class Activation Mapping (Grad-CAM) is utilized to validate that the models learn physically relevant features and examine false classifications. The proposed hierarchical method provides a basis for adaptive process control and wear consideration in automated flap wheel grinding.
Chinese Translation
磨料翻转轮因其灵活性而广泛应用于复杂自由形状表面的精加工。然而,这种灵活性导致了复杂的磨损模式,如凹/凸翻转轮轮廓或翻转轮撕裂,这些都会影响磨削结果。本文提出了一种新颖的基于视觉的分层分类框架,以自动化监测翻转轮的磨损状态。与单一分类方法不同,我们将问题分解为三个逻辑层次:(1) 状态检测(新轮与磨损轮),(2) 磨损类型识别(矩形、凹形、凸形)及翻转轮撕裂检测,以及 (3) 严重性评估(部分变形与完全变形)。我们生成了一个自定义的真实翻转轮图像数据集,并采用了基于 EfficientNetV2 架构的迁移学习方法。结果表明,分类准确率高,范围从 93.8%(翻转轮撕裂)到 99.3%(凹形严重性)。此外,采用梯度加权类激活映射(Gradient-weighted Class Activation Mapping, Grad-CAM)验证模型学习到的物理相关特征,并检查错误分类。所提出的分层方法为自适应过程控制和自动化翻转轮磨削中的磨损考虑提供了基础。
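The three-level decomposition can be read as a classifier cascade: wear type is only identified for worn wheels, and severity only for deformation-type wear. A minimal control-flow sketch (the callables stand in for the trained EfficientNetV2 heads; the exact routing rules are assumptions):

```python
def classify_wear(state_fn, type_fn, severity_fn, image):
    """Three-level cascade: state -> wear type / tear -> severity.

    Mirrors the hierarchical decomposition; severity (partial vs.
    complete deformation) is assumed here to apply only to the
    concave/convex profiles.
    """
    if state_fn(image) == "new":
        return {"state": "new"}
    result = {"state": "worn", "type": type_fn(image)}
    if result["type"] in ("concave", "convex"):
        result["severity"] = severity_fn(image)
    return result

# Toy stand-ins for the trained heads:
out = classify_wear(lambda im: "worn", lambda im: "concave",
                    lambda im: "partial", image=None)
```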
cs.CV / 78 / 2603.12864

Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation

通过解耦控制构建驾驶世界以生成对抗场景
Zhan, Yifan, Chen, Zhengqing, Wang, Qingjie, He, Zhuo, Niu, Muyao, Guo, Xiaoyang, Yin, Wei, Ren, Weiqiang, Zhang, Qian, Zheng, Yinqiang
Abstract
A major challenge in autonomous driving is the "long tail" of safety-critical edge cases, which often emerge from unusual combinations of common traffic elements. Synthesizing these scenarios is crucial, yet current controllable generative models provide incomplete or entangled guidance, preventing the independent manipulation of scene structure, object identity, and ego actions. We introduce CompoSIA, a compositional driving video simulator that disentangles these traffic factors, enabling fine-grained control over diverse adversarial driving scenarios. To support controllable identity replacement of scene elements, we propose a noise-level identity injection, allowing pose-agnostic identity generation across diverse element poses, all from a single reference image. Furthermore, a hierarchical dual-branch action control mechanism is introduced to improve action controllability. Such disentangled control enables adversarial scenario synthesis: systematically combining safe elements into dangerous configurations that entangled generators cannot produce. Extensive comparisons demonstrate superior controllable generation quality over state-of-the-art baselines, with a 17% improvement in FVD for identity editing and reductions of 30% and 47% in rotation and translation errors for action control. Furthermore, downstream stress-testing reveals substantial planner failures: across editing modalities, the average 3 s collision rate increases by 173%.
Chinese Translation
自动驾驶中的一个主要挑战是安全关键边缘案例的“长尾”问题,这些案例通常源于常见交通元素的不寻常组合。合成这些场景至关重要,但当前的可控生成模型提供的指导不完整或相互纠缠,阻碍了对场景结构、物体身份和自车动作的独立操控。我们提出了CompoSIA,一种组合驾驶视频模拟器,它解耦了这些交通因素,使得对多样化对抗驾驶场景的细粒度控制成为可能。为了支持场景元素的可控身份替换,我们提出了一种噪声级别身份注入方法,允许在多样元素姿态下进行与姿态无关的身份生成,所有这些均来自单一参考图像。此外,引入了一种分层双分支动作控制机制,以提高动作的可控性。这种解耦控制使得对抗场景合成成为可能——系统地将安全元素组合成危险配置,而纠缠生成器无法产生这些配置。广泛的比较表明,在可控生成质量上优于最先进的基线,在身份编辑方面FVD提高了17%,在动作控制方面旋转和平移误差分别减少了30%和47%。此外,下游压力测试揭示了显著的规划器失败:在各编辑模式下,3秒平均碰撞率增加了173%。
cs.CV / 79 / 2603.12873

TRACE: Structure-Aware Character Encoding for Robust and Generalizable Document Watermarking

TRACE:一种结构感知的字符编码框架,用于稳健且可泛化的文档水印
Meng, Jiale, Zhang, Jie, Hu, Runyi, Lu, Zhe-Ming, Zhang, Tianwei, Li, Yiming
Abstract
We propose TRACE, a structure-aware framework leveraging diffusion models for localized character encoding to embed data. Unlike existing methods that rely on edge features or pre-defined codebooks, TRACE exploits character structures that provide inherent resistance to noise interference due to their stability and unified representation across diverse characters. Our framework comprises three key components: (1) adaptive diffusion initialization that automatically identifies handle points, target points, and editing regions through specialized algorithms including movement probability estimator (MPE), target point estimation (TPE) and mask drawing model (MDM), (2) guided diffusion encoding for precise movement of selected point, and (3) masked region replacement with a specialized loss function to minimize feature alterations after the diffusion process. Comprehensive experiments demonstrate TRACE's superior performance over state-of-the-art methods, achieving more than 5 dB improvement in PSNR and 5\% higher extraction accuracy following cross-media transmission. TRACE achieves broad generalizability across multiple languages and fonts, making it particularly suitable for practical document security applications.
Chinese Translation
我们提出了TRACE,一个结构感知的框架,利用扩散模型进行局部字符编码以嵌入数据。与现有依赖边缘特征或预定义代码本的方法不同,TRACE利用字符结构,这些结构由于其稳定性和在多样字符中的统一表示,提供了固有的抗噪声干扰能力。我们的框架包括三个关键组件:(1) 自适应扩散初始化,通过包括运动概率估计器(MPE)、目标点估计(TPE)和掩模绘制模型(MDM)在内的专门算法自动识别处理点、目标点和编辑区域;(2) 引导扩散编码,以精确移动选定点;(3) 使用专门的损失函数进行掩模区域替换,以最小化扩散过程后的特征变化。全面的实验表明,TRACE在性能上优于最先进的方法,在PSNR上实现超过5 dB的提升,并在跨媒体传输后提高了5%的提取准确率。TRACE在多种语言和字体中具有广泛的泛化能力,特别适合实际文档安全应用。
cs.CV / 80 / 2603.12886

A protocol for evaluating robustness to H&E staining variation in computational pathology models

评估计算病理模型对H&E染色变化鲁棒性的协议
Schönpflug, Lydia A., Berg, Nikki van den, Andani, Sonali, Horeweg, Nanda, Wolf, Jurriaan Barkey, Bosse, Tjalling, Koelzer, Viktor H., Lafarge, Maxime W.
Abstract
Sensitivity to staining variation remains a major barrier to deploying computational pathology (CPath) models as hematoxylin and eosin (H&E) staining varies across laboratories, requiring systematic assessment of how this variability affects model prediction. In this work, we developed a three-step protocol for evaluating robustness to H&E staining variation in CPath models. Step 1: Select reference staining conditions, Step 2: Characterize test set staining properties, Step 3: Apply CPath model(s) under simulated reference staining conditions. Here, we first created a new reference staining library based on the PLISM dataset. As an exemplary use case, we applied the protocol to assess the robustness properties of 306 microsatellite instability (MSI) classification models on the unseen SurGen colorectal cancer dataset (n=738), including 300 attention-based multiple instance learning models trained on the TCGA-COAD/READ datasets across three feature extractors (UNI2-h, H-Optimus-1, Virchow2), alongside six public MSI classification models. Classification performance was measured as AUC, and robustness as the min-max AUC range across four simulated staining conditions (low/high H&E intensity, low/high H&E color similarity). Across models and staining conditions, classification performance ranged from AUC 0.769-0.911 ($\Delta$ = 0.142). Robustness ranged from 0.007-0.079 ($\Delta$ = 0.072), and showed a weak inverse correlation with classification performance (Pearson r=-0.22, 95% CI [-0.34, -0.11]). Thus, we show that the proposed evaluation protocol enables robustness-informed CPath model selection and provides insight into performance shifts across H&E staining conditions, supporting the identification of operational ranges for reliable model deployment. Code is available at https://github.com/CTPLab/staining-robustness-evaluation .
Chinese Translation
对染色变化的敏感性仍然是部署计算病理(CPath)模型的一大障碍,因为苏木精-伊红(H&E)染色在不同实验室之间存在差异,这需要系统性评估这种变异如何影响模型预测。在本研究中,我们开发了一个三步协议,用于评估CPath模型对H&E染色变化的鲁棒性。第一步:选择参考染色条件;第二步:表征测试集的染色特性;第三步:在模拟参考染色条件下应用CPath模型。在这里,我们首先基于PLISM数据集创建了一个新的参考染色库。作为一个示例用例,我们应用该协议评估了306个微卫星不稳定性(MSI)分类模型在未见的SurGen结直肠癌数据集(n=738)上的鲁棒性特征,包括300个基于注意力的多实例学习模型,这些模型在TCGA-COAD/READ数据集上训练,并使用了三种特征提取器(UNI2-h、H-Optimus-1、Virchow2),以及六个公共MSI分类模型。分类性能以AUC衡量,鲁棒性则以四种模拟染色条件下的最小-最大AUC范围(低/高H&E强度,低/高H&E颜色相似性)来衡量。在不同模型和染色条件下,分类性能的AUC范围为0.769-0.911($\Delta$ = 0.142)。鲁棒性范围为0.007-0.079($\Delta$ = 0.072),并与分类性能呈弱负相关(Pearson r=-0.22,95% CI [-0.34, -0.11])。因此,我们展示了所提出的评估协议能够支持基于鲁棒性的CPath模型选择,并提供了对H&E染色条件下性能变化的洞察,支持识别可靠模型部署的操作范围。代码可在https://github.com/CTPLab/staining-robustness-evaluation获取。
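The robustness score used in this protocol is the min-max AUC range over the four simulated staining conditions, with smaller meaning more robust. In code (the condition names and AUC values below are illustrative, not from the paper):

```python
def robustness_range(aucs_by_condition):
    """Min-max AUC spread across simulated staining conditions.

    Matches the paper's robustness definition: the range of a model's AUC
    over the four reference staining conditions; smaller is more robust.
    """
    vals = list(aucs_by_condition.values())
    return max(vals) - min(vals)

# Illustrative AUCs for one MSI classifier under the four conditions:
r = robustness_range({
    "low_intensity": 0.88, "high_intensity": 0.86,
    "low_color_sim": 0.85, "high_color_sim": 0.90,
})  # spread of about 0.05
```

Computed per model, this score supports the robustness-informed model selection described above.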
cs.CV / 81 / 2603.12887

Forecasting Epileptic Seizures from Contactless Camera via Cross-Species Transfer Learning

通过跨物种迁移学习从非接触式摄像头预测癫痫发作
Zhai, Mingkai, Wang, Wei, Li, Zongsheng, Liu, Quanying
Abstract
Epileptic seizure forecasting is a clinically important yet challenging problem in epilepsy research. Existing approaches predominantly rely on neural signals such as electroencephalography (EEG), which require specialized equipment and limit long-term deployment in real-world settings. In contrast, video data provide a non-invasive and accessible alternative, yet existing video-based studies mainly focus on post-onset seizure detection, leaving seizure forecasting largely unexplored. In this work, we formulate a novel task of video-based epileptic seizure forecasting, where short pre-ictal video segments (3-10 seconds) are used to predict whether a seizure will occur within the subsequent 5 seconds. To address the scarcity of annotated human epilepsy videos, we propose a cross-species transfer learning framework that leverages large-scale rodent video data for auxiliary pretraining. This enables the model to capture seizure-related behavioral dynamics that generalize across species. Experimental results demonstrate that our approach achieves over 70% prediction accuracy under a strictly video-only setting and outperforms existing baselines. These findings highlight the potential of cross-species learning for building non-invasive, scalable early-warning systems for epilepsy.
Chinese Translation
癫痫发作预测是癫痫研究中一个临床重要但具有挑战性的问题。现有方法主要依赖于神经信号,如脑电图(EEG),这需要专业设备并限制了在现实环境中的长期部署。相比之下,视频数据提供了一种非侵入性且易于获取的替代方案,但现有基于视频的研究主要集中在发作后检测上,发作预测仍然基本未被探索。在本研究中,我们提出了一项新任务,即基于视频的癫痫发作预测,其中使用短的前发作视频片段(3-10秒)来预测在随后的5秒内是否会发生发作。为了应对标注人类癫痫视频的稀缺性,我们提出了一种跨物种迁移学习框架,利用大规模啮齿动物视频数据进行辅助预训练。这使得模型能够捕捉跨物种普遍适用的与发作相关的行为动态。实验结果表明,我们的方法在严格的纯视频设置下实现了超过70%的预测准确率,并且优于现有基线。这些发现突显了跨物种学习在构建非侵入性、可扩展的癫痫早期预警系统中的潜力。
cs.CV / 82 / 2603.12893

Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models

基于有限差分流优化的文本到图像模型强化学习后训练
McAllister, David, Aittala, Miika, Karras, Tero, Hellsten, Janne, Kanazawa, Angjoo, Aila, Timo, Laine, Samuli
Abstract
Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high-quality vision language models and off-the-shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.
Chinese Translation
强化学习(RL)已成为后训练基于扩散的图像合成模型的标准技术,因为它能够通过奖励信号学习,明确改善图像质量和提示对齐等期望方面。在本文中,我们提出了一种在线RL变体,通过采样配对轨迹并将流速拉向更有利的图像,从而减少模型更新的方差。与现有方法将每个采样步骤视为单独的策略动作不同,我们将整个采样过程视为一个单一的动作。我们分别尝试以高质量视觉语言模型和现成的质量指标作为奖励进行实验,并使用广泛的指标集评估输出。我们的方法收敛速度更快,输出质量和提示对齐度优于以往的方法。
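The core update the abstract describes — sample a paired trajectory, score both resulting images, and pull the flow in the direction of the more favorable one — can be illustrated with a toy finite-difference direction. This is a sketch of the general idea under simplifying assumptions (flat vectors, scalar rewards), not the authors' implementation:

```python
def paired_preference_direction(x_a, x_b, reward_a, reward_b):
    """Given two sampled images (as flat vectors) from a paired trajectory,
    return a finite-difference direction pointing from the less favorable
    sample toward the more favorable one, scaled by the reward gap."""
    sign = 1.0 if reward_a >= reward_b else -1.0
    gap = abs(reward_a - reward_b)
    # Positive sign: move from x_b toward x_a; negative: the reverse.
    return [sign * gap * (a - b) for a, b in zip(x_a, x_b)]

# Toy example: x_a scores higher, so the direction points from x_b toward x_a.
d = paired_preference_direction([1.0, 0.0], [0.0, 1.0], reward_a=0.9, reward_b=0.4)
```

Treating the whole sampling chain as one action means a single such direction supervises the velocity field, rather than one policy-gradient term per denoising step.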
cs.CV / 83 / 2603.12903

Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis

用于无姿态激光雷达视图合成的谱几何神经场
Jiang, Yinuo, Cheng, Jun, Wang, Yiran, Cheng, Cheng
Abstract
Neural Radiance Fields (NeRF) have shown remarkable success in image novel view synthesis (NVS), inspiring extensions to LiDAR NVS. However, most methods heavily rely on accurate camera poses for scene reconstruction. The sparsity and textureless nature of LiDAR data also present distinct challenges, leading to geometric holes and discontinuous surfaces. To address these issues, we propose SG-NLF, a pose-free LiDAR NeRF framework that integrates spectral information with geometric consistency. Specifically, we design a hybrid representation based on spectral priors to reconstruct smooth geometry. For pose optimization, we construct a confidence-aware graph based on feature compatibility to achieve global alignment. In addition, an adversarial learning strategy is introduced to enforce cross-frame consistency, thereby enhancing reconstruction quality. Comprehensive experiments demonstrate the effectiveness of our framework, especially in challenging low-frequency scenarios. Compared to previous state-of-the-art methods, SG-NLF improves reconstruction quality and pose accuracy by over 35.8% and 68.8%. Our work can provide a novel perspective for LiDAR view synthesis.
Chinese Translation
神经辐射场(NeRF)在图像新视图合成(NVS)中表现出显著的成功,激发了对激光雷达(LiDAR)NVS的扩展。然而,大多数方法严重依赖于准确的相机姿态进行场景重建。激光雷达数据的稀疏性和无纹理特性也带来了独特的挑战,导致几何孔洞和不连续的表面。为了解决这些问题,我们提出了SG-NLF,一个无姿态的激光雷达NeRF框架,结合了谱信息和几何一致性。具体而言,我们设计了一种基于谱先验的混合表示,以重建平滑的几何形状。对于姿态优化,我们构建了一个基于特征兼容性的置信度感知图,以实现全局对齐。此外,引入了一种对抗学习策略,以强制执行跨帧一致性,从而提高重建质量。全面的实验表明,我们的框架在具有挑战性的低频场景中尤其有效。与之前的最先进方法相比,SG-NLF在重建质量和姿态准确性上分别提高了超过35.8%和68.8%。我们的工作为激光雷达视图合成提供了新的视角。
cs.CV / 84 / 2603.12912

FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts

FedBPrompt:通过身体分布感知视觉提示的联邦领域泛化行人重识别
Xu, Xin, Li, Weilong, Liu, Wei, Huang, Wenke, Yu, Zhixi, Yang, Bin, Liao, Xiaoying, Jiang, Kui
Abstract
Federated Domain Generalization for Person Re-Identification (FedDG-ReID) learns domain-invariant representations from decentralized data. While Vision Transformer (ViT) is widely adopted, its global attention often fails to distinguish pedestrians from high similarity backgrounds or diverse viewpoints -- a challenge amplified by cross-client distribution shifts in FedDG-ReID. To address this, we propose Federated Body Distribution Aware Visual Prompt (FedBPrompt), introducing learnable visual prompts to guide Transformer attention toward pedestrian-centric regions. FedBPrompt employs a Body Distribution Aware Visual Prompts Mechanism (BAPM) comprising: Holistic Full Body Prompts to suppress cross-client background noise, and Body Part Alignment Prompts to capture fine-grained details robust to pose and viewpoint variations. To mitigate high communication costs, we design a Prompt-based Fine-Tuning Strategy (PFTS) that freezes the ViT backbone and updates only lightweight prompts, significantly reducing communication overhead while maintaining adaptability. Extensive experiments demonstrate that BAPM effectively enhances feature discrimination and cross-domain generalization, while PFTS achieves notable performance gains within only a few aggregation rounds. Moreover, both BAPM and PFTS can be easily integrated into existing ViT-based FedDG-ReID frameworks, making FedBPrompt a flexible and effective solution for federated person re-identification. The code is available at https://github.com/leavlong/FedBPrompt.
Chinese Translation
联邦领域泛化行人重识别(FedDG-ReID)从去中心化数据中学习领域不变的表征。尽管视觉变换器(ViT)被广泛采用,但其全局注意力往往难以将行人与高度相似的背景区分开,或在多样视角下识别行人——这一挑战在FedDG-ReID中因跨客户端分布变化而加剧。为了解决这一问题,我们提出了联邦身体分布感知视觉提示(FedBPrompt),引入可学习的视觉提示以引导变换器注意力聚焦于以行人为中心的区域。FedBPrompt采用身体分布感知视觉提示机制(BAPM),包括:整体全身提示以抑制跨客户端背景噪声,以及身体部位对齐提示以捕捉对姿态和视角变化具有鲁棒性的细粒度细节。为了降低高昂的通信成本,我们设计了一种基于提示的微调策略(PFTS),该策略冻结ViT主干网络,仅更新轻量级提示,显著减少通信开销,同时保持适应性。大量实验表明,BAPM有效增强了特征区分能力和跨领域泛化能力,而PFTS在少数聚合轮次内即实现了显著的性能提升。此外,BAPM和PFTS均可轻松集成到现有的基于ViT的FedDG-ReID框架中,使FedBPrompt成为一种灵活且有效的联邦行人重识别解决方案。代码可在 https://github.com/leavlong/FedBPrompt 获取。
cs.CV / 85 / 2603.12915

Stake the Points: Structure-Faithful Instance Unlearning

立桩定点:结构保真的实例遗忘
Hong, Kiseong, Shin, JungKyoo, Kim, Eunwoo
Abstract
Machine unlearning (MU) addresses privacy risks in pretrained models. The main goal of MU is to remove the influence of designated data while preserving the utility of retained knowledge. Achieving this goal requires preserving semantic relations among retained instances, which existing studies often overlook. We observe that without such preservation, models suffer from progressive structural collapse, undermining the deletion-retention balance. In this work, we propose a novel structure-faithful framework that introduces stakes, i.e., semantic anchors that serve as reference points to maintain the knowledge structure. By leveraging these anchors, our framework captures and stabilizes the semantic organization of knowledge. Specifically, we instantiate the anchors from language-driven attribute descriptions encoded by a semantic encoder (e.g., CLIP). We enforce preservation of the knowledge structure via structure-aware alignment and regularization: the former aligns the organization of retained knowledge before and after unlearning around anchors, while the latter regulates updates to structure-critical parameters. Results from image classification, retrieval, and face recognition show average gains of 32.9%, 22.5%, and 19.3% in performance, balancing the deletion-retention trade-off and enhancing generalization.
Chinese Translation
机器遗忘(MU)旨在解决预训练模型中的隐私风险。MU的主要目标是去除指定数据的影响,同时保留既有知识的效用。实现这一目标需要保持保留实例之间的语义关系,而现有研究往往忽视这一点。我们观察到,如果不保持这种关系,模型将遭受逐步的结构崩溃,从而破坏删除与保留之间的平衡。在本研究中,我们提出了一种新颖的结构保真框架,该框架引入了"桩点"(stakes),即作为参考点以维护知识结构的语义锚点。通过利用这些锚点,我们的框架捕捉并稳定知识的语义组织。具体而言,我们通过语义编码器(例如,CLIP)从语言驱动的属性描述中实例化锚点。我们通过结构感知对齐和正则化来强制保持知识结构:前者在遗忘前后围绕锚点对保留知识的组织进行对齐,而后者则规范对结构关键参数的更新。来自图像分类、检索和人脸识别的结果显示,性能平均提升了32.9%、22.5%和19.3%,在删除与保留的权衡中实现了平衡,并增强了泛化能力。
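A minimal sketch of the structure-aware alignment idea — penalizing changes in how each retained embedding relates to the semantic anchors before vs. after unlearning — assuming cosine similarity and a plain mean absolute difference; the paper's exact objective may differ:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def structure_alignment_loss(before, after, anchors):
    """Compare each retained embedding's anchor similarities before and
    after unlearning; zero when the semantic organization is preserved."""
    total, count = 0.0, 0
    for x0, x1 in zip(before, after):
        for a in anchors:
            total += abs(cosine(x1, a) - cosine(x0, a))
            count += 1
    return total / count

before = [[1.0, 0.0]]
after = [[0.0, 1.0]]          # embedding rotated away from its anchor
anchors = [[1.0, 0.0], [0.0, 1.0]]
loss = structure_alignment_loss(before, after, anchors)
```

An unchanged embedding yields zero loss, so the term only fires when unlearning perturbs the retained structure.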
cs.CV / 86 / 2603.12918

VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation

VIRD:通过双轴变换实现视图不变表示的跨视图姿态估计
Park, Juhye, Lee, Wooju, Hong, Dasol, Sung, Changki, Seo, Youngwoo, Kang, Dongwan, Myung, Hyun
Abstract
Accurate global localization is crucial for autonomous driving and robotics, but GNSS-based approaches often degrade due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose corresponding to a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views mainly due to limited spatial correspondences. We propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to establish horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to resolve vertical misalignment, explicitly mitigating the viewpoint gap. A view-reconstruction loss is introduced to strengthen the view invariance further, encouraging the derived representations to reconstruct the original and cross-view images. Experiments on the KITTI and VIGOR datasets demonstrate that VIRD outperforms the state-of-the-art methods without orientation priors, reducing median position and orientation errors by 50.7% and 76.5% on KITTI, and 18.0% and 46.8% on VIGOR, respectively.
Chinese Translation
准确的全球定位对于自动驾驶和机器人技术至关重要,但基于全球导航卫星系统(GNSS)的方法常常因遮挡和多路径效应而性能下降。作为一种新兴的替代方案,跨视图姿态估计相对于地理参考的卫星图像,预测与地面视图图像对应的3自由度相机姿态。然而,现有方法在弥合地面视图与卫星视图之间显著的视点差距方面面临挑战,主要是由于空间对应关系有限。我们提出了一种新颖的跨视图姿态估计方法,通过双轴变换(VIRD)构建视图不变表示。VIRD首先对卫星视图应用极坐标变换,以建立水平对应关系,然后在地面特征和极坐标变换后的卫星特征上使用上下文增强的位置注意力,以解决垂直不对齐问题,明确缓解视点差距。引入视图重建损失以进一步增强视图不变性,鼓励所得到的表示重建原始图像和跨视图图像。在KITTI和VIGOR数据集上的实验表明,VIRD在没有方向先验的情况下优于最新的方法,分别在KITTI上将中位位置和方向误差减少了50.7%和76.5%,在VIGOR上减少了18.0%和46.8%。
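The first step — resampling the satellite view into polar coordinates around an assumed camera position, so that image columns correspond to viewing directions as in a ground panorama — can be sketched as a coordinate mapping. Grid sizes are illustrative, and a real implementation would bilinearly sample pixel values at the returned coordinates:

```python
import math

def polar_resample_coords(center, n_rays, n_rings, max_radius):
    """Map each (ray, ring) cell of a polar grid back to Cartesian (x, y)
    source coordinates in the satellite image, ray-major order."""
    cx, cy = center
    coords = []
    for j in range(n_rays):                  # columns <-> azimuth angles
        theta = 2.0 * math.pi * j / n_rays
        for i in range(n_rings):             # rows <-> radial distance
            r = max_radius * (i + 1) / n_rings
            coords.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return coords

coords = polar_resample_coords(center=(32.0, 32.0), n_rays=4, n_rings=2, max_radius=8.0)
```

After this transform, horizontal shifts in the polar image correspond to rotations of the camera, which is what makes the subsequent horizontal correspondence tractable.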
cs.CV / 87 / 2603.12930

Rethinking VLMs for Image Forgery Detection and Localization

重新思考视觉语言模型在图像伪造检测与定位中的应用
Guo, Shaofeng, Cui, Jiequan, Hong, Richang
Abstract
With the rapid rise of Artificial Intelligence Generated Content (AIGC), image manipulation has become increasingly accessible, posing significant challenges for image forgery detection and localization (IFDL). In this paper, we study how to fully leverage vision-language models (VLMs) to assist the IFDL task. In particular, we observe that priors from VLMs hardly benefit the detection and localization performance and even have negative effects due to their inherent biases toward semantic plausibility rather than authenticity. Additionally, the location masks explicitly encode the forgery concepts, which can serve as extra priors for VLMs to ease their training optimization, thus enhancing the interpretability of detection and localization results. Building on these findings, we propose a new IFDL pipeline named IFDL-VLM. To demonstrate the effectiveness of our method, we conduct experiments on 9 popular benchmarks and assess the model performance under both in-domain and cross-dataset generalization settings. The experimental results show that we consistently achieve new state-of-the-art performance in detection, localization, and interpretability. Code is available at: https://github.com/sha0fengGuo/IFDL-VLM.
Chinese Translation
随着人工智能生成内容(AIGC)的快速崛起,图像处理变得愈加容易,这对图像伪造检测与定位(IFDL)带来了重大挑战。本文研究如何充分利用视觉语言模型(VLMs)来辅助IFDL任务。特别地,我们观察到VLMs的先验知识对检测和定位性能几乎没有帮助,甚至由于其固有的偏向语义合理性而对真实性产生负面影响。此外,位置掩码明确编码了伪造概念,这可以作为VLMs的额外先验,帮助其训练优化,从而增强检测和定位结果的可解释性。基于这些发现,我们提出了一种新的IFDL流程,命名为IFDL-VLM。为了验证我们方法的有效性,我们在9个流行基准上进行了实验,并在领域内和跨数据集泛化设置下评估模型性能。实验结果表明,我们在检测、定位和可解释性方面始终实现了新的最先进性能。代码可在以下链接获取:https://github.com/sha0fengGuo/IFDL-VLM。
cs.CV / 88 / 2603.12937

SGMatch: Semantic-Guided Non-Rigid Shape Matching with Flow Regularization

SGMatch:基于语义引导的非刚性形状匹配与流动正则化
Ye, Tianwei, Mei, Xiaoguang, Xia, Yifan, Fan, Fan, Huang, Jun, Ma, Jiayi
Abstract
Establishing accurate point-to-point correspondences between non-rigid 3D shapes remains a critical challenge, particularly under non-isometric deformations and topological noise. Existing functional map pipelines suffer from ambiguities that geometric descriptors alone cannot resolve, and spatial inconsistencies inherent in the projection of truncated spectral bases to dense pointwise correspondences. In this paper, we introduce SGMatch, a learning-based framework for semantic-guided non-rigid shape matching. Specifically, we design a Semantic-Guided Local Cross-Attention module that integrates semantic features from vision foundation models into geometric descriptors while preserving local structural continuity. Furthermore, we introduce a regularization objective based on conditional flow matching, which supervises a time-varying velocity field to encourage spatial smoothness of the recovered correspondences. Experimental results on multiple benchmarks demonstrate that SGMatch achieves competitive performance across near-isometric settings and consistent improvements under non-isometric deformations and topological noise.
Chinese Translation
在非刚性三维形状之间建立准确的点对点对应关系仍然是一个关键挑战,尤其是在非等距变形和拓扑噪声的情况下。现有的函数映射(functional map)管道存在仅靠几何描述符无法消解的歧义,并且在将截断的谱基投影为密集逐点对应关系时存在固有的空间不一致性。在本文中,我们提出了SGMatch,一个基于学习的语义引导非刚性形状匹配框架。具体而言,我们设计了一个语义引导局部交叉注意力模块,将视觉基础模型中的语义特征整合到几何描述符中,同时保持局部结构的连续性。此外,我们引入了一种基于条件流匹配的正则化目标,该目标监督一个时变速度场,以促进恢复的对应关系的空间平滑性。在多个基准测试上的实验结果表明,SGMatch在近等距设置下取得了具有竞争力的性能,并在非等距变形和拓扑噪声下实现了一致的改进。
cs.CV / 89 / 2603.12938

Thinking in Streaming Video

流媒体视频中的思维
Liu, Zikang, Guo, Longteng, Li, Handong, Zhen, Ru, He, Xingjian, Ji, Ruyi, Ren, Xiaoming, Zhang, Yanhao, Lu, Haonan, Liu, Jing
Abstract
Real-time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers reasoning until the full video context is observed, resulting in high latency and growing computational cost that are incompatible with streaming scenarios. In this paper, we introduce ThinkStream, a framework for streaming video reasoning based on a Watch--Think--Speak paradigm that enables models to incrementally update their understanding as new video observations arrive. At each step, the model performs a short reasoning update and decides whether sufficient evidence has accumulated to produce a response. To support long-horizon streaming, we propose Reasoning-Compressed Streaming Memory (RCSM), which treats intermediate reasoning traces as compact semantic memory that replaces outdated visual tokens while preserving essential context. We further train the model using a Streaming Reinforcement Learning with Verifiable Rewards scheme that aligns incremental reasoning and response timing with the requirements of streaming interaction. Experiments on multiple streaming video benchmarks show that ThinkStream significantly outperforms existing online video models while maintaining low latency and memory usage. Code, models and data will be released at https://github.com/johncaged/ThinkStream
Chinese Translation
实时理解连续视频流对于在动态环境中操作的互动助手和多模态智能体至关重要。然而,大多数现有的视频推理方法遵循批处理范式,推迟推理直到观察到完整的视频上下文,这导致高延迟和不断增长的计算成本,与流媒体场景不兼容。本文介绍了ThinkStream,一个基于观察-思考-说话(Watch--Think--Speak)范式的流媒体视频推理框架,使模型能够在新的视频观察到达时逐步更新其理解。在每一步中,模型执行短期推理更新,并决定是否积累了足够的证据以产生响应。为了支持长时间跨度的流媒体,我们提出了推理压缩流媒体记忆(Reasoning-Compressed Streaming Memory,RCSM),该方法将中间推理痕迹视为紧凑的语义记忆,替换过时的视觉标记,同时保留必要的上下文。我们进一步使用可验证奖励的流媒体强化学习方案训练模型,使增量推理和响应时机与流媒体交互的要求相一致。在多个流媒体视频基准上的实验表明,ThinkStream显著优于现有的在线视频模型,同时保持低延迟和低内存使用。代码、模型和数据将发布在 https://github.com/johncaged/ThinkStream
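One plausible reading of the Reasoning-Compressed Streaming Memory (RCSM) idea — a fixed token budget in which evicted visual tokens are folded into a compact summary entry standing in for the reasoning trace over that span — is sketched below. The class name, eviction policy, and string-concatenation "summary" are illustrative assumptions, not the paper's design:

```python
class StreamingMemory:
    """Toy RCSM-style memory: visual tokens are kept up to a fixed budget;
    when the budget is exceeded, the oldest visual tokens are dropped and
    folded into a single compact summary entry at the front."""

    def __init__(self, budget):
        self.budget = budget
        self.entries = []  # ("visual", token) or ("summary", text)

    def observe(self, tokens):
        """Append new visual tokens, then compress back under budget."""
        self.entries.extend(("visual", t) for t in tokens)
        self._compress()

    def _compress(self):
        while len(self.entries) > self.budget:
            # Evict the oldest visual token; fold it into the summary entry.
            idx = next(i for i, (kind, _) in enumerate(self.entries) if kind == "visual")
            _, token = self.entries.pop(idx)
            if self.entries and self.entries[0][0] == "summary":
                self.entries[0] = ("summary", self.entries[0][1] + "+" + str(token))
            else:
                self.entries.insert(0, ("summary", str(token)))

mem = StreamingMemory(budget=3)
mem.observe(["v1", "v2", "v3", "v4"])
```

The key property is that memory stays bounded regardless of stream length, while a trace of what was evicted survives as context.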
cs.CV / 90 / 2603.12988

Fair Lung Disease Diagnosis from Chest CT via Gender-Adversarial Attention Multiple Instance Learning

基于性别对抗注意力的多实例学习在胸部CT中公平肺病诊断
Parikh, Aditya, Feragen, Aasa
Abstract
We present a fairness-aware framework for multi-class lung disease diagnosis from chest CT volumes, developed for the Fair Disease Diagnosis Challenge at the PHAROS-AIF-MIH Workshop (CVPR 2026). The challenge requires classifying CT scans into four categories -- Healthy, COVID-19, Adenocarcinoma, and Squamous Cell Carcinoma -- with performance measured as the average of per-gender macro F1 scores, explicitly penalizing gender-inequitable predictions. Our approach addresses two core difficulties: the sparse pathological signal across hundreds of slices, and a severe demographic imbalance compounded across disease class and gender. We propose an attention-based Multiple Instance Learning (MIL) model on a ConvNeXt backbone that learns to identify diagnostically relevant slices without slice-level supervision, augmented with a Gradient Reversal Layer (GRL) that adversarially suppresses gender-predictive structure in the learned scan representation. Training incorporates focal loss with label smoothing, stratified cross-validation over joint (class, gender) strata, and targeted oversampling of the most underrepresented subgroup. At inference, all five-fold checkpoints are ensembled with horizontal-flip test-time augmentation via soft logit voting and out-of-the-fold threshold optimization for robustness. Our model achieves a mean validation competition score of 0.685 (std = 0.030), with the best single fold reaching 0.759. All training and inference code is publicly available at https://github.com/ADE-17/cvpr-fair-chest-ct
Chinese Translation
我们提出了一个关注公平性的框架,用于从胸部CT图像中进行多类别肺病诊断,该框架是为PHAROS-AIF-MIH研讨会(CVPR 2026)中的公平疾病诊断挑战而开发的。该挑战要求将CT扫描分类为四个类别——健康、COVID-19、腺癌和鳞状细胞癌,性能以每个性别的宏观F1分数的平均值进行衡量,明确惩罚性别不平等的预测。我们的方法解决了两个核心难题:在数百个切片中稀疏的病理信号,以及在疾病类别和性别上加剧的严重人口不平衡。我们提出了一种基于注意力的多实例学习(MIL)模型,采用ConvNeXt骨干网络,能够在没有切片级监督的情况下学习识别诊断相关的切片,并辅以梯度反转层(GRL),对学习到的扫描表示中的性别预测结构进行对抗性抑制。训练过程中结合了聚焦损失和标签平滑,针对联合(类别,性别)层次进行分层交叉验证,并对代表性最不足的子群体进行有针对性的过采样。在推理阶段,所有五折检查点结合水平翻转测试时增强进行集成,通过软logit投票与折外阈值优化以提高鲁棒性。我们的模型在验证竞赛中获得了0.685的平均分(标准差为0.030),最佳单折达到0.759。所有训练和推理代码均可在https://github.com/ADE-17/cvpr-fair-chest-ct上公开获取。
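The challenge metric — the average of per-gender macro F1 — is straightforward to compute from per-class confusion counts. A plain-Python sketch; the counts below are hypothetical, not challenge data:

```python
def f1(tp, fp, fn):
    """F1 from confusion counts; defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(counts):
    """Unweighted mean F1 over classes; counts is a list of (tp, fp, fn)."""
    return sum(f1(*c) for c in counts) / len(counts)

def competition_score(counts_by_gender):
    """Average of per-gender macro F1, as the challenge scores submissions."""
    scores = [macro_f1(c) for c in counts_by_gender.values()]
    return sum(scores) / len(scores)

# Hypothetical (tp, fp, fn) per class, split by gender; two classes for brevity.
score = competition_score({
    "female": [(8, 2, 2), (5, 1, 3)],
    "male":   [(6, 4, 2), (7, 2, 1)],
})
```

Because macro F1 is computed within each gender first, a model that trades female-subgroup recall for overall accuracy is penalized directly.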
cs.CV / 91 / 2603.12989

Test-Time Attention Purification for Backdoored Large Vision Language Models

测试时注意力净化用于后门攻击的大型视觉语言模型
Zhang, Zhifang, Yang, Bojun, He, Shuo, Chen, Weitong, Zhang, Wei Emma, Maennel, Olaf, Feng, Lei, Xu, Miao
Abstract
Despite the strong multimodal performance, large vision-language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger does not influence prediction through low-level visual patterns, but through abnormal cross-modal attention redistribution, where trigger-bearing visual tokens steal attention away from the textual context - a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense that operates purely at test time. CleanSight (i) detects poisoned inputs based on the relative visual-text attention ratio in selected cross-modal fusion layers, and (ii) purifies the input by selectively pruning the suspicious high-attention visual tokens to neutralize the backdoor activation. Extensive experiments show that CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model's utility on both clean and poisoned samples.
Chinese Translation
尽管大型视觉语言模型(LVLMs)在多模态性能上表现强劲,但在微调过程中却容易受到后门攻击,攻击者通过将嵌入触发器的样本插入训练数据中,植入可以在测试时恶意激活的行为。现有的防御方法通常依赖于使用干净数据重新训练后门参数(例如,适配器或LoRA模块),这在计算上代价高昂,并且往往会降低模型性能。在本研究中,我们提供了对LVLMs中后门行为的新机制理解:触发器并不是通过低级视觉模式影响预测,而是通过异常的跨模态注意力重分配,其中携带触发器的视觉标记从文本上下文中窃取注意力——我们称之为注意力窃取。基于此,我们提出了CleanSight,这是一种无训练、即插即用的防御方法,完全在测试时操作。CleanSight (i) 根据选定跨模态融合层中的相对视觉-文本注意力比率检测被污染的输入,并且 (ii) 通过选择性地修剪可疑的高注意力视觉标记来净化输入,以中和后门激活。大量实验表明,CleanSight在多种数据集和后门攻击类型上显著优于现有的基于像素的净化防御,同时保持模型在干净和被污染样本上的效用。
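The two test-time steps CleanSight describes — (i) flag an input when the visual-to-text attention ratio in a fusion layer is anomalously high, and (ii) prune the top-attention visual tokens suspected of carrying the trigger — can be sketched as follows. The threshold, `k`, and the toy attention values are assumptions for illustration:

```python
def visual_text_ratio(attn_visual, attn_text):
    """Share of cross-modal attention mass landing on visual tokens."""
    return sum(attn_visual) / (sum(attn_visual) + sum(attn_text))

def prune_suspicious_tokens(tokens, attn_visual, ratio, threshold, k):
    """If the visual attention ratio is anomalously high, drop the k visual
    tokens receiving the most attention; otherwise pass the input through."""
    if ratio <= threshold:
        return list(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: attn_visual[i], reverse=True)
    drop = set(ranked[:k])
    return [t for i, t in enumerate(tokens) if i not in drop]

attn_v = [0.05, 0.02, 0.60, 0.03]   # token 2 "steals" attention
attn_t = [0.10, 0.20]
r = visual_text_ratio(attn_v, attn_t)
kept = prune_suspicious_tokens(["v0", "v1", "v2", "v3"], attn_v, r, threshold=0.5, k=1)
```

On a clean input the ratio stays below the threshold and the token sequence is untouched, which is how utility on benign samples is preserved.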
cs.CV / 92 / 2603.12998

A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks

具有跨模态和任务效用保证的视觉语言模型去偏见的封闭形式解
Lian, Tangzheng, Hu, Guanyu, Ren, Yijing, Kollias, Dimitrios, Celiktutan, Oya
Abstract
While Vision-Language Models (VLMs) have achieved remarkable performance across diverse downstream tasks, recent studies have shown that they can inherit social biases from the training data and further propagate them into downstream applications. To address this issue, various debiasing approaches have been proposed, yet most of them aim to improve fairness without having a theoretical guarantee that the utility of the model is preserved. In this paper, we introduce a debiasing method that yields a \textbf{closed-form} solution in the cross-modal space, achieving Pareto-optimal fairness with \textbf{bounded utility losses}. Our method is \textbf{training-free}, requires \textbf{no annotated data}, and can jointly debias both visual and textual modalities across downstream tasks. Extensive experiments show that our method outperforms existing methods in debiasing VLMs across diverse fairness metrics and datasets for both group and \textbf{intersectional} fairness in downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation while preserving task performance.
Chinese Translation
尽管视觉语言模型(VLMs)在多种下游任务中取得了显著的性能,但最近的研究表明,它们可能会从训练数据中继承社会偏见,并进一步将这些偏见传播到下游应用中。为了解决这一问题,已经提出了多种去偏见方法,但大多数方法旨在提高公平性,而没有理论保证模型的效用得以保留。本文提出了一种去偏见方法,在跨模态空间中获得了\textbf{封闭形式}解,实现了帕累托最优的公平性,并具有\textbf{有界的效用损失}。我们的方法是\textbf{无训练}的,\textbf{不需要标注数据},并且可以在下游任务中同时去偏见视觉和文本模态。大量实验表明,我们的方法在多种公平性指标和数据集上优于现有的去偏见方法,适用于下游任务中的群体公平性和\textbf{交叉性}公平性,如零样本图像分类、文本到图像检索和文本到图像生成,同时保持任务性能。
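The abstract does not spell out the closed-form solution itself. A generic training-free debiasing step in the same spirit projects an embedding onto the orthogonal complement of an estimated bias direction, and applies identically to image and text embeddings; this is a sketch of that standard construction, not the paper's specific solution:

```python
import math

def normalize(v):
    """Scale a vector to unit norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def project_out(embedding, bias_direction):
    """Closed-form removal of one bias direction: x' = x - (x . b) b,
    with b unit-norm, leaving the orthogonal (task-relevant) component."""
    b = normalize(bias_direction)
    dot = sum(x * y for x, y in zip(embedding, b))
    return [x - dot * y for x, y in zip(embedding, b)]

x = [1.0, 2.0, 3.0]
b = [0.0, 1.0, 0.0]          # hypothetical bias axis (e.g., from attribute prompts)
x_debiased = project_out(x, b)
```

The utility-bound intuition is that only the component along `b` is discarded, so the change to any downstream score is bounded by that component's magnitude.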
cs.CV / 93 / 2603.13024

SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation

SAW:通过可控和可扩展的视频生成迈向外科行动世界模型
Rapuri, Sampath, Seenivasan, Lalithkumar, Schneider, Dominik, Soberanis-Mukul, Roger, He, Yufan, Ding, Hao, Xu, Jiru, Yu, Chenhao, Jing, Chenyan, Guo, Pengfei, Xu, Daguang, Unberath, Mathias
Abstract
A surgical world model capable of generating realistic surgical action videos with precise control over tool-tissue interactions can address fundamental challenges in surgical AI and simulation -- from data scarcity and rare event synthesis to bridging the sim-to-real gap for surgical automation. However, current video generation methods, the very core of such surgical world models, require expensive annotations or complex structured intermediates as conditioning signals at inference, limiting their scalability. Other approaches exhibit limited temporal consistency across complex laparoscopic scenes and do not possess sufficient realism. We propose Surgical Action World (SAW) -- a step toward surgical action world modeling through video diffusion conditioned on four lightweight signals: language prompts encoding tool-action context, a reference surgical scene, tissue affordance mask, and 2D tool-tip trajectories. We design a conditional video diffusion approach that reformulates video-to-video diffusion into trajectory-conditioned surgical action synthesis. The backbone diffusion model is fine-tuned on a custom-curated dataset of 12,044 laparoscopic clips with lightweight spatiotemporal conditioning signals, leveraging a depth consistency loss to enforce geometric plausibility without requiring depth at inference. SAW achieves state-of-the-art temporal consistency (CD-FVD: 199.19 vs. 546.82) and strong visual quality on held-out test data. Furthermore, we demonstrate its downstream utility for (a) surgical AI, where augmenting rare actions with SAW-generated videos improves action recognition (clipping F1-score: 20.93% to 43.14%; cutting: 0.00% to 8.33%) on real test data, and (b) surgical simulation, where rendering tool-tissue interaction videos from simulator-derived trajectory points toward a visually faithful simulation engine.
Chinese Translation
一种能够生成真实外科行动视频并精确控制工具与组织相互作用的外科世界模型,可以解决外科人工智能和仿真中的基本挑战——从数据稀缺和稀有事件合成到弥合外科自动化的模拟与现实之间的差距。然而,目前的视频生成方法,作为此类外科世界模型的核心,要求在推理时提供昂贵的注释或复杂的结构化中介作为条件信号,这限制了它们的可扩展性。其他方法在复杂的腹腔镜场景中表现出有限的时间一致性,并且缺乏足够的真实感。我们提出了外科行动世界(Surgical Action World, SAW)——通过基于四个轻量级信号的视频扩散,向外科行动世界建模迈出的一步:编码工具-行动上下文的语言提示、参考外科场景、组织可用性掩模以及二维工具尖端轨迹。我们设计了一种条件视频扩散方法,将视频到视频的扩散重新构建为轨迹条件的外科行动合成。主干扩散模型在一个自定义策划的数据集上进行了微调,该数据集包含12,044个腹腔镜剪辑,配备轻量级时空条件信号,利用深度一致性损失来强制几何合理性,而无需在推理时要求深度。SAW在时间一致性方面达到了最先进的水平(CD-FVD: 199.19 vs. 546.82),并在保留的测试数据上展现出强大的视觉质量。此外,我们展示了其下游应用的潜力:在(a)外科人工智能中,通过SAW生成的视频增强稀有动作,提高了真实测试数据上的动作识别(剪辑F1分数:20.93%提升至43.14%;切割:0.00%提升至8.33%);在(b)外科仿真中,从模拟器导出的轨迹点渲染工具-组织交互视频,指向一个视觉上真实的仿真引擎。
cs.CV / 94 / 2603.13027

SortScrews: A Dataset and Baseline for Real-time Screw Classification

SortScrews:实时螺丝分类的数据集和基准
Fu, Tianhao, Yang, Bingxuan, Guo, Juncheng, Sribalan, Shrena, Chen, Yucheng
Abstract
Automatic identification of screw types is important for industrial automation, robotics, and inventory management. However, publicly available datasets for screw classification are scarce, particularly for controlled single-object scenarios commonly encountered in automated sorting systems. In this work, we introduce $\textbf{SortScrews}$, a dataset for casewise visual classification of screws. The dataset contains 560 RGB images at $512\times512$ resolution covering six screw types and a background class. Images are captured using a standardized acquisition setup and include mild variations in lighting and camera perspective across four capture settings. To facilitate reproducible research and dataset expansion, we also provide a reusable data collection script that allows users to easily construct similar datasets for custom hardware components using inexpensive camera setups. We establish baseline results using transfer learning with EfficientNet-B0 and ResNet-18 classifiers pretrained on ImageNet. In addition, we conduct a well-explored failure analysis. Despite the limited dataset size, these lightweight models achieve strong classification accuracy, demonstrating that controlled acquisition conditions enable effective learning even with relatively small datasets. The dataset, collection pipeline, and baseline training code are publicly available at https://github.com/ATATC/SortScrews.
Chinese Translation
螺丝类型的自动识别对于工业自动化、机器人技术和库存管理至关重要。然而,公开可用的螺丝分类数据集稀缺,特别是在自动化分拣系统中常见的受控单对象场景下。在本研究中,我们介绍了 $\textbf{SortScrews}$,这是一个用于螺丝逐个视觉分类的数据集。该数据集包含560张分辨率为 $512\times512$ 的RGB图像,涵盖六种螺丝类型和一个背景类。图像使用标准化的采集设备捕获,并在四种采集设置中包括了光照和相机视角的轻微变化。为了促进可重复的研究和数据集扩展,我们还提供了一个可重用的数据收集脚本,允许用户使用廉价的相机设备轻松构建类似的数据集,以适应定制硬件组件。我们通过迁移学习,使用在ImageNet上预训练的EfficientNet-B0和ResNet-18分类器建立了基准结果。此外,我们还进行了充分探讨的失败分析。尽管数据集规模有限,这些轻量级模型仍然实现了较强的分类准确性,证明了受控采集条件即使在相对较小的数据集上也能实现有效学习。该数据集、收集管道和基准训练代码已公开可用,网址为 https://github.com/ATATC/SortScrews。
cs.CV / 95 / 2603.13032

Multimodal OCR: Parse Anything from Documents

多模态OCR:从文档中解析任何内容
Zheng, Handong, Li, Yumeng, Zhang, Kaile, Xin, Liang, Zhao, Guangwei, Liu, Hao, Chen, Jiayu, Lou, Jie, Qiu, Jiyu, Fu, Qi, Yang, Rui, Jiang, Shuo, Luo, Weijian, Su, Weijie, Zhang, Weijun, Zhu, Xingyu, Li, Yabin, ma, Yiwei, Chen, Yu, Yu, Zhaohui, Yang, Guang, Zhang, Colin, Zhang, Lei, Liu, Yuliang, Bai, Xiang
Abstract
We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed dots.mocr, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate dots.mocr from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, dots.mocr achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at https://github.com/rednote-hilab/dots.mocr.
Chinese Translation
我们提出了多模态OCR(MOCR),一种文档解析范式,能够将文本和图形联合解析为统一的文本表示。与传统的OCR系统专注于文本识别并将图形区域视为裁剪像素不同,我们的方法称为dots.mocr,将图表、图示、表格和图标等视觉元素视为首要解析目标,使系统能够在解析文档时保留元素之间的语义关系。它提供了几个优势:(1)它将文本和图形重建为结构化输出,从而实现更真实的文档重建;(2)它支持对异构文档元素的端到端训练,使模型能够利用文本和视觉组件之间的语义关系;(3)它将以前被丢弃的图形转换为可重用的代码级监督,解锁嵌入在现有文档中的多模态监督。为了使这一范式在规模上可行,我们从PDF、渲染网页和原生SVG资产构建了一个全面的数据引擎,并通过分阶段预训练和监督微调训练了一个紧凑的3B参数模型。我们从文档解析和结构化图形解析两个角度评估dots.mocr。在文档解析基准测试中,它在我们的OCR Arena Elo排行榜上仅次于Gemini 3 Pro,超越了现有的开源文档解析系统,并在olmOCR Bench上以83.9创下了新的最优成绩。在结构化图形解析中,dots.mocr在图像到SVG基准测试中实现了比Gemini 3 Pro更高的重建质量,在图表、用户界面布局、科学图形和化学图表上表现出色。这些结果展示了构建大规模图像到代码语料库以进行多模态预训练的可扩展路径。代码和模型可在https://github.com/rednote-hilab/dots.mocr上公开获取。
cs.CV / 96 / 2603.13033

ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models

ESPIRE:一种用于视觉语言模型的具身空间推理的诊断基准
Zhao, Yanpeng, Ding, Wentao, Li, Hongtao, Jia, Baoxiong, Zheng, Zilong
Abstract
A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations have been limited both in paradigm and in coverage, hindering rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution, and frame both as generative problems, in stark contrast to predominant discriminative evaluations (e.g., via visual-question answering) that rely on distractors and discard execution. This decomposition further enables a fine-grained analysis beyond passive spatial reasoning toward reasoning to act. We systematically design ESPIRE both at the instruction level and at the environment level, ensuring broad coverage of spatial reasoning scenarios. We use ESPIRE to diagnose a range of frontier VLMs and provide in-depth analysis of their spatial reasoning behaviors.
Chinese Translation
最近,视觉语言模型(VLMs)的一种趋势是增强其在具身领域的空间认知。尽管取得了一定进展,但现有的评估在范式和覆盖范围上都存在局限,阻碍了模型的快速迭代开发。为了解决这些局限性,我们提出了ESPIRE,一种用于具身空间推理的诊断基准。ESPIRE提供了一个使VLMs获得物理落地的模拟世界,并在以空间推理为中心的机器人任务上对其进行评估,从而缩小评估与现实世界部署之间的差距。为了使VLMs适应机器人任务,我们将每个任务分解为定位和执行,并将两者框架化为生成性问题,这与主要依赖干扰项并忽略执行的判别性评估(例如,通过视觉问答)形成鲜明对比。这种分解进一步使得我们能够进行更细致的分析,超越被动空间推理,迈向以行动为目的的推理。我们在指令层面和环境层面系统设计ESPIRE,确保广泛覆盖空间推理场景。我们使用ESPIRE对一系列前沿VLMs进行诊断,并提供其空间推理行为的深入分析。
cs.CV / 97 / 2603.13044

Are General-Purpose Vision Models All We Need for 2D Medical Image Segmentation? A Cross-Dataset Empirical Study

通用视觉模型是否足以满足二维医学图像分割的需求?一项跨数据集的实证研究
Borst, Vanessa, Kounev, Samuel
Abstract
Medical image segmentation (MIS) is a fundamental component of computer-assisted diagnosis and clinical decision support systems. Over the past decade, numerous architectures specifically tailored to medical imaging have emerged to address domain-specific challenges such as low contrast, small anatomical structures, and limited annotated data. In parallel, rapid progress in computer vision has produced highly capable general-purpose vision models (GP-VMs) originally designed for natural images. Despite their strong performance on standard vision benchmarks, their effectiveness for MIS remains insufficiently understood. In this work, we conduct a controlled empirical study to examine whether specialized medical segmentation architectures (SMAs) provide systematic advantages over modern GP-VMs for 2D MIS. We compare eleven SMAs and GP-VMs using a unified training and evaluation protocol. Experiments are performed across three heterogeneous datasets covering different imaging modalities, class structures, and data characteristics. Beyond segmentation accuracy, we analyze qualitative Grad-CAM visualizations to investigate explainability (XAI) behavior. Our results demonstrate that, for the analyzed datasets, GP-VMs outperform the majority of specialized MIS models. Moreover, XAI analyses indicate that GP-VMs can capture clinically relevant structures without explicit domain-specific architectural design. These findings suggest that GP-VMs can represent a viable alternative to domain-specific methods, highlighting the importance of informed model selection for end-to-end MIS systems. All code and resources are available at GitHub.
Chinese Translation
医学图像分割(MIS)是计算机辅助诊断和临床决策支持系统的基本组成部分。在过去十年中,针对医学成像的许多专门架构应运而生,以应对低对比度、小解剖结构和有限标注数据等领域特定挑战。与此同时,计算机视觉的快速进展产生了能力强大的通用视觉模型(GP-VMs),这些模型最初是为自然图像设计的。尽管它们在标准视觉基准测试中表现出色,但其在医学图像分割中的有效性仍未得到充分理解。在本研究中,我们进行了一项受控的实证研究,以检验专门医学分割架构(SMAs)是否在二维医学图像分割中相较于现代通用视觉模型提供系统性优势。我们使用统一的训练和评估协议比较了十一种SMAs和GP-VMs。实验在三个异构数据集上进行,涵盖不同的成像模式、类别结构和数据特征。除了分割准确性外,我们还分析了定性Grad-CAM可视化,以研究可解释性(XAI)行为。我们的结果表明,对于所分析的数据集,GP-VMs的表现优于大多数专门的医学图像分割模型。此外,XAI分析表明,GP-VMs能够捕捉临床相关结构,而无需明确的领域特定架构设计。这些发现表明,GP-VMs可以作为领域特定方法的可行替代方案,突显了在端到端医学图像分割系统中进行明智模型选择的重要性。所有代码和资源均可在GitHub上获取。
cs.CV / 98 / 2603.13054

Topo-R1: Detecting Topological Anomalies via Vision-Language Models

Topo-R1:通过视觉-语言模型检测拓扑异常
Xu, Meilong, Hu, Qingqiao, Hu, Xiaoling, Abousamra, Shahira, Yu, Xin, Lyu, Weimin, Qi, Kehan, Samaras, Dimitris, Chen, Chao
Abstract
Topological correctness is crucial for tubular structures such as blood vessels, nerve fibers, and road networks. Existing topology-preserving methods rely on domain-specific ground truth, which is costly and rarely transfers across domains. When deployed to a new domain without annotations, a key question arises: how can we detect topological anomalies without ground-truth supervision? We reframe this as topological anomaly detection, a structured visual reasoning task requiring a model to locate and classify topological errors in predicted segmentation masks. Vision-Language Models (VLMs) are natural candidates; however, we find that state-of-the-art VLMs perform nearly at random, lacking the fine-grained, topology-aware perception needed to identify sparse connectivity errors in dense structures. To bridge this gap, we develop an automated data-curation pipeline that synthesizes diverse topological anomalies with verifiable annotations across progressively difficult levels, thereby constructing the first large-scale, multi-domain benchmark for this task. We then introduce Topo-R1, a framework that endows VLMs with topology-aware perception via two-stage training: supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO). Central to our approach is a topology-aware composite reward that integrates type-aware Hungarian matching for structured error classification, spatial localization scoring, and a centerline Dice (clDice) reward that directly penalizes connectivity disruptions, thereby jointly incentivizing semantic precision and structural fidelity. Extensive experiments demonstrate that Topo-R1 establishes a new paradigm for annotation-free topological quality assessment, consistently outperforming general-purpose VLMs and supervised baselines across all evaluation protocols.
Chinese Translation
拓扑正确性对于血管、神经纤维和道路网络等管状结构至关重要。现有的保持拓扑的方法依赖于特定领域的真实数据,这既昂贵又难以跨领域转移。当在没有注释的新领域中部署时,出现了一个关键问题:我们如何在没有真实数据监督的情况下检测拓扑异常?我们将其重新定义为拓扑异常检测,这是一项结构化的视觉推理任务,要求模型在预测的分割掩膜中定位和分类拓扑错误。视觉-语言模型(VLMs)是自然的候选者;然而,我们发现最先进的VLMs的表现几乎与随机猜测相当,缺乏在密集结构中识别稀疏连接错误所需的细粒度拓扑感知能力。为了填补这一空白,我们开发了一种自动化数据策划管道,该管道在逐步提升的难度级别上合成具有可验证注释的多样化拓扑异常,从而构建了该任务的第一个大规模多领域基准。然后,我们引入了Topo-R1,一个通过两阶段训练赋予VLMs拓扑感知能力的框架:监督微调,随后是使用群体相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习。我们方法的核心是一个拓扑感知的复合奖励,该奖励结合了类型感知的匈牙利匹配用于结构化错误分类、空间定位评分,以及直接惩罚连接中断的中心线Dice(clDice)奖励,从而共同激励语义精度和结构保真性。大量实验表明,Topo-R1为无注释的拓扑质量评估建立了一个新范式,在所有评估协议中始终优于通用VLMs和监督基线。
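The centerline Dice term in Topo-R1's composite reward has a compact closed form: clDice is the harmonic mean of topology precision (the fraction of the predicted skeleton lying inside the ground-truth mask) and topology sensitivity (the converse). A minimal NumPy sketch, assuming skeletons are precomputed elsewhere (e.g. with `skimage.morphology.skeletonize`); this is an illustration of the metric, not the paper's code:

```python
import numpy as np

def cl_dice(pred_mask, gt_mask, pred_skel, gt_skel, eps=1e-8):
    """Centerline Dice (clDice) from binary masks and their skeletons.

    All inputs are 0/1 arrays of the same shape; pred_skel / gt_skel
    are assumed to be precomputed skeletons of the two masks.
    """
    # Topology precision: fraction of the predicted skeleton inside the GT mask.
    tprec = (pred_skel * gt_mask).sum() / (pred_skel.sum() + eps)
    # Topology sensitivity: fraction of the GT skeleton inside the predicted mask.
    tsens = (gt_skel * pred_mask).sum() / (gt_skel.sum() + eps)
    # Harmonic mean of the two; breaking a tube lowers tsens, hence the score.
    return 2.0 * tprec * tsens / (tprec + tsens + eps)
```

A disconnected prediction (a gap in the tube) lowers topology sensitivity and therefore the score, which is exactly the property that makes it a connectivity-penalizing reward.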
cs.CV / 99 / 2603.13056

Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach

第十届ABAW竞赛中的团队RAS:多模态情感价值和唤醒估计方法
Ryumina, Elena, Markitantov, Maxim, Axyonov, Alexandr, Ryumin, Dmitry, Dolgushin, Mikhail, Dresvyanskiy, Denis, Karpov, Alexey
Abstract
Continuous emotion recognition in terms of valence and arousal under in-the-wild (ITW) conditions remains a challenging problem due to large variations in appearance, head pose, illumination, occlusions, and subject-specific patterns of affective expression. We present a multimodal method for valence-arousal estimation in ITW conditions. Our method combines three complementary modalities: face, behavior, and audio. The face modality relies on GRADA-based frame-level embeddings and Transformer-based temporal regression. We use Qwen3-VL-4B-Instruct to extract behavior-relevant information from video segments, while Mamba is used to model temporal dynamics across segments. The audio modality relies on WavLM-Large with attention-statistics pooling and includes a cross-modal filtering stage to reduce the influence of unreliable or non-speech segments. To fuse modalities, we explore two fusion strategies: a Directed Cross-Modal Mixture-of-Experts Fusion Strategy that learns interactions between modalities with adaptive weighting, and a Reliability-Aware Audio-Visual Fusion Strategy that combines visual features at the frame-level while using audio as complementary context. The results are reported on the Aff-Wild2 dataset following the 10th Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. Experiments demonstrate that the proposed multimodal fusion strategy achieves a Concordance Correlation Coefficient (CCC) of 0.658 on the Aff-Wild2 development set.
Chinese Translation
在真实环境(ITW)条件下,基于情感价值和唤醒的连续情感识别仍然是一个具有挑战性的问题,原因在于外观、头部姿态、光照、遮挡以及个体特有的情感表达模式的巨大变化。我们提出了一种用于ITW情感价值-唤醒估计的多模态方法。该方法结合了三种互补模态:面部、行为和音频。面部模态依赖于基于GRADA的帧级嵌入和基于Transformer的时间回归。我们使用Qwen3-VL-4B-Instruct从视频片段中提取与行为相关的信息,同时使用Mamba建模片段间的时间动态。音频模态依赖于WavLM-Large,采用注意力统计池化,并包括一个跨模态过滤阶段,以减少不可靠或非语音片段的影响。为了融合模态,我们探索了两种融合策略:一种是有向跨模态专家混合融合策略,该策略通过自适应加权学习模态之间的交互;另一种是考虑可靠性的音频-视觉融合策略,该策略在帧级别结合视觉特征,同时使用音频作为补充上下文。结果在Aff-Wild2数据集上按照第十届情感行为分析在真实环境(ABAW)挑战协议进行报告。实验表明,所提出的多模态融合策略在Aff-Wild2开发集上达到了0.658的协和相关系数(CCC)。
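The reported metric, the Concordance Correlation Coefficient, combines correlation with mean and variance agreement: CCC = 2·cov(x, y) / (σx² + σy² + (μx − μy)²). A small NumPy implementation of the standard definition, for reference (not code from the paper):

```python
import numpy as np

def ccc(x, y):
    """Concordance Correlation Coefficient, as used in ABAW valence/arousal evaluation."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()            # population variances
    cov = ((x - mx) * (y - my)).mean()   # population covariance
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)
```

Unlike Pearson correlation, CCC is penalized by any systematic shift between predictions and labels, which is why it is preferred for continuous affect protocols.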
cs.CV / 100 / 2603.13057

Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback

基于人类反馈的虚拟试穿无参考图像质量评估
Hirakawa, Yuki, Wada, Takashi, Shimizu, Ryotaro, Furusawa, Takuya, Saito, Yuki, Araki, Ryosuke, Chen, Tianwei, Mo, Fan, Aoki, Yoshimitsu
Abstract
Given a person image and a garment image, image-based Virtual Try-ON (VTON) synthesizes a try-on image of the person wearing the target garment. As VTON systems become increasingly important in practical applications such as fashion e-commerce, reliable evaluation of their outputs has emerged as a critical challenge. In real-world scenarios, ground-truth images of the same person wearing the target garment are typically unavailable, making reference-based evaluation impractical. Moreover, widely used distribution-level metrics such as Fr\'echet Inception Distance and Kernel Inception Distance measure dataset-level similarity and fail to reflect the perceptual quality of individual generated images. To address these limitations, we propose Image Quality Assessment for Virtual Try-On (VTON-IQA), a reference-free framework for human-aligned, image-level quality assessment without requiring ground-truth images. To model human perceptual judgments, we construct VTON-QBench, a large-scale human-annotated benchmark comprising 62,688 try-on images generated by 14 representative VTON models and 431,800 quality annotations collected from 13,838 qualified annotators. To the best of our knowledge, this is the largest dataset to date for human subjective evaluation in virtual try-on. Evaluating virtual try-on quality requires verifying both garment fidelity and the preservation of person-specific details. To explicitly model such interactions, we introduce an Interleaved Cross-Attention module that extends standard transformer blocks by inserting a cross-attention layer between self-attention and MLP in the latter blocks. Extensive experiments show that VTON-IQA achieves reliable human-aligned image-level quality prediction. Moreover, we conduct a comprehensive benchmark evaluation of 14 representative VTON models using VTON-IQA.
Chinese Translation
给定一个人像和一件服装图像,基于图像的虚拟试穿(VTON)合成出一个人穿着目标服装的试穿图像。随着VTON系统在时尚电子商务等实际应用中的重要性日益增加,可靠评估其输出结果已成为一个关键挑战。在现实场景中,通常无法获得同一人穿着目标服装的真实图像,这使得基于参考的评估变得不切实际。此外,广泛使用的分布级指标,如Fréchet Inception Distance和Kernel Inception Distance,测量的是数据集级别的相似性,无法反映单个生成图像的感知质量。为了解决这些局限性,我们提出了虚拟试穿图像质量评估(VTON-IQA),这是一个无参考的框架,用于人类对齐的图像级质量评估,无需真实图像。为了建模人类的感知判断,我们构建了VTON-QBench,这是一个大规模的人类标注基准,包含由14个代表性VTON模型生成的62,688个试穿图像和从13,838名合格标注者收集的431,800个质量标注。根据我们所知,这是迄今为止用于虚拟试穿人类主观评估的最大数据集。评估虚拟试穿质量需要验证服装的真实感和个体特征细节的保留。为了明确建模这种交互,我们引入了交错交叉注意力模块,该模块通过在后续块的自注意力和多层感知机之间插入交叉注意力层,扩展了标准的变换器块。大量实验表明,VTON-IQA能够实现可靠的人类对齐图像级质量预测。此外,我们使用VTON-IQA对14个代表性VTON模型进行了全面的基准评估。
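The Interleaved Cross-Attention idea, inserting a cross-attention layer between self-attention and the MLP inside a transformer block, can be sketched in a few lines. The toy NumPy block below uses single-head attention and identity projections purely to show the layer ordering; the shapes and projection choices are our assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention (no learned projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def interleaved_block(x, context, w_mlp):
    """One transformer block with cross-attention interleaved between
    self-attention and the MLP, mirroring the described layer ordering."""
    x = x + attention(x, x, x)              # self-attention over try-on image tokens
    x = x + attention(x, context, context)  # cross-attention to garment/person tokens
    x = x + np.maximum(x @ w_mlp, 0.0) @ w_mlp.T  # toy two-layer ReLU MLP
    return x
```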
cs.CV / 101 / 2603.13070

Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection

通过区域感知提示增强和多模态复制检测减轻文本到图像扩散中的记忆化
Chen, Yunzhuo, Vice, Jordan, Akhtar, Naveed, Haldar, Nur Al Hasan, Mian, Ajmal
Abstract
State-of-the-art text-to-image diffusion models can produce impressive visuals but may memorize and reproduce training images, creating copyright and privacy risks. Existing prompt perturbations applied at inference time, such as random token insertion or embedding noise, may lower copying but often harm image-prompt alignment and overall fidelity. To address this, we introduce two complementary methods. First, Region-Aware Prompt Augmentation (RAPTA) uses an object detector to find salient regions and turn them into semantically grounded prompt variants, which are randomly sampled during training to increase diversity, while maintaining semantic alignment. Second, Attention-Driven Multimodal Copy Detection (ADMCD) aggregates local patch, global semantic, and texture cues with a lightweight transformer to produce a fused representation, and applies simple thresholded decision rules to detect copying without training with large annotated datasets. Experiments show that RAPTA reduces overfitting while maintaining high synthesis quality, and that ADMCD reliably detects copying, outperforming single-modal metrics.
Chinese Translation
最先进的文本到图像扩散模型能够生成令人印象深刻的视觉效果,但可能会记忆并重现训练图像,从而带来版权和隐私风险。现有的在推理时应用的提示扰动,如随机令牌插入或嵌入噪声,虽然可以降低复制风险,但往往会损害图像与提示之间的对齐和整体保真度。为了解决这个问题,我们提出了两种互补的方法。首先,区域感知提示增强(Region-Aware Prompt Augmentation, RAPTA)利用物体检测器找到显著区域,并将其转化为语义上有根基的提示变体,这些变体在训练过程中随机采样以增加多样性,同时保持语义对齐。其次,基于注意力的多模态复制检测(Attention-Driven Multimodal Copy Detection, ADMCD)通过轻量级变换器聚合局部块、全局语义和纹理线索,以生成融合表示,并应用简单的阈值决策规则来检测复制,而无需使用大型标注数据集进行训练。实验表明,RAPTA在保持高合成质量的同时减少了过拟合,而ADMCD可靠地检测复制,超越了单模态指标。
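The RAPTA recipe, detect salient regions, turn them into semantically grounded prompt variants, and sample one at random during training, can be illustrated with a toy sketch. The function names and the phrasing template below are hypothetical; the paper's detector output format and variant grammar are not specified here:

```python
import random

def region_prompt_variants(base_prompt, detected_labels):
    """Build one prompt variant per salient detected region (hypothetical
    template); the unmodified prompt is always kept as a candidate."""
    variants = [base_prompt]
    for label in detected_labels:
        variants.append(f"{base_prompt}, with a clearly visible {label}")
    return variants

def sample_variant(base_prompt, detected_labels, seed=None):
    """Randomly pick a variant per training step to increase prompt diversity."""
    rng = random.Random(seed)
    return rng.choice(region_prompt_variants(base_prompt, detected_labels))
```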
cs.CV / 102 / 2603.13077

Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods

基于稀疏传感器的屋顶风场重建:从确定性方法到生成学习方法
Zhou, Yihang, Lin, Chao, Kikumoto, Hideki, Ooka, Ryozo, Cheng, Sibo
Abstract
Real-time rooftop wind-speed distribution is important for the safe operation of drones and urban air mobility systems, wind control systems, and rooftop utilization. However, rooftop flows show strong nonlinearity, separation, and cross-direction variability, which make flow field reconstruction from sparse sensors difficult. This study develops a learning-from-observation framework using wind-tunnel experimental data obtained by Particle Image Velocimetry (PIV) and compares Kriging interpolation with three deep learning models: UNet, Vision Transformer Autoencoder (ViTAE), and Conditional Wasserstein GAN (CWGAN). We evaluate two training strategies, single wind-direction training (SDT) and mixed wind-direction training (MDT), across sensor densities from 5 to 30, test robustness under sensor position perturbations of plus or minus 1 grid, and optimize sensor placement via Proper Orthogonal Decomposition with QR decomposition. Results show that deep learning methods can reconstruct rooftop wind fields from sparse sensor data effectively. Compared with Kriging interpolation, the deep learning models improved SSIM by up to 32.7%, FAC2 by 24.2%, and NMSE by 27.8%. Mixed wind-direction training further improved performance, with gains of up to 173.7% in SSIM, 16.7% in FAC2, and 98.3% in MG compared with single-direction training. The results also show that sensor configuration, optimization, and training strategy should be considered jointly for reliable deployment. QR-based optimization improved robustness by up to 27.8% under sensor perturbations, although with metric-dependent trade-offs. Training on experimental rather than simulated data also provides practical guidance for method selection and sensor placement in different scenarios.
Chinese Translation
实时屋顶风速分布对于无人机和城市空中出行系统的安全运行、风控系统以及屋顶利用至关重要。然而,屋顶流动表现出强烈的非线性、分离和交叉方向的变异性,这使得从稀疏传感器重建流场变得困难。本研究开发了一种基于观察学习的框架,利用通过粒子图像测速(PIV)获得的风洞实验数据,并将克里金插值与三种深度学习模型进行比较:UNet、视觉变换器自编码器(ViTAE)和条件瓦瑟斯坦生成对抗网络(CWGAN)。我们评估了两种训练策略:单一风向训练(SDT)和混合风向训练(MDT),在传感器密度从5到30的范围内进行测试,检验在传感器位置扰动(正负1网格)下的鲁棒性,并通过结合QR分解的本征正交分解(POD)优化传感器布置。结果表明,深度学习方法能够有效地从稀疏传感器数据中重建屋顶风场。与克里金插值相比,深度学习模型在结构相似性指数(SSIM)上提高了最多32.7%,在FAC2上提高了24.2%,在归一化均方误差(NMSE)上提高了27.8%。混合风向训练进一步提升了性能,与单一方向训练相比,SSIM提高了最多173.7%,FAC2提高了16.7%,MG提高了98.3%。结果还表明,传感器配置、优化和训练策略应共同考虑,以确保可靠的部署。基于QR的优化在传感器扰动下提高了鲁棒性,最多可达到27.8%的改善,尽管存在依赖于指标的权衡。在实验数据而非模拟数据上进行训练也为不同场景下的方法选择和传感器布置提供了实际指导。
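POD-plus-QR sensor placement picks the grid points that best pin down the leading flow modes. A minimal NumPy sketch of the idea (QDEIM-style greedy column pivoting, written out explicitly; an illustration under our own simplifications, not the study's implementation):

```python
import numpy as np

def qr_pivot_sensors(snapshots, n_modes, n_sensors):
    """Select sensor locations via POD + column-pivoted QR.

    snapshots: (n_points, n_snapshots) matrix of flattened wind fields.
    Returns indices of the n_sensors most informative grid points.
    """
    # POD modes = leading left singular vectors of the snapshot matrix.
    u, _, _ = np.linalg.svd(snapshots, full_matrices=False)
    phi = u[:, :n_modes]            # (n_points, n_modes)
    # Greedy column-pivoted QR on phi.T: at each step pick the grid
    # point whose residual column has the largest norm, then deflate.
    r = phi.T.copy()                # (n_modes, n_points)
    sensors = []
    for _ in range(n_sensors):
        norms = np.linalg.norm(r, axis=0)
        norms[sensors] = -1.0       # never pick the same point twice
        j = int(np.argmax(norms))
        sensors.append(j)
        q = r[:, j] / (np.linalg.norm(r[:, j]) + 1e-12)
        r = r - np.outer(q, q @ r)  # remove the chosen direction
    return sensors
```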
cs.CV / 103 / 2603.13082

InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing

InterEdit:导航文本引导的多人体3D运动编辑
Yang, Yebin, Wen, Di, Qi, Lei, Kong, Weitong, Zheng, Junwei, Liu, Ruiping, Chen, Yufan, Wu, Chengzhi, Yang, Kailun, Fu, Yuqian, Paudel, Danda Pani, Van Gool, Luc, Peng, Kunyu
Abstract
Text-guided 3D motion editing has seen success in single-person scenarios, but its extension to multi-person settings is less explored due to limited paired data and the complexity of inter-person interactions. We introduce the task of multi-person 3D motion editing, where a target motion is generated from a source and a text instruction. To support this, we propose InterEdit3D, a new dataset with manual two-person motion change annotations, and a Text-guided Multi-human Motion Editing (TMME) benchmark. We present InterEdit, a synchronized classifier-free conditional diffusion model for TMME. It introduces Semantic-Aware Plan Token Alignment with learnable tokens to capture high-level interaction cues and an Interaction-Aware Frequency Token Alignment strategy using DCT and energy pooling to model periodic motion dynamics. Experiments show that InterEdit improves text-to-motion consistency and edit fidelity, achieving state-of-the-art TMME performance. The dataset and code will be released at https://github.com/YNG916/InterEdit.
Chinese Translation
文本引导的3D运动编辑在单人场景中取得了成功,但由于配对数据的限制和人际互动的复杂性,其在多人场景中的扩展尚未得到充分探索。我们引入了多人体3D运动编辑的任务,其中目标运动是从源运动和文本指令生成的。为此,我们提出了InterEdit3D,一个具有手动两人运动变化注释的新数据集,以及一个文本引导的多人体运动编辑(TMME)基准。我们展示了InterEdit,一个为TMME设计的同步无分类器条件扩散模型。它引入了具有可学习标记的语义感知计划标记对齐,以捕捉高级互动线索,并采用基于离散余弦变换(DCT)和能量池化的互动感知频率标记对齐策略,以建模周期性运动动态。实验表明,InterEdit提高了文本到运动的一致性和编辑保真度,达到了最先进的TMME性能。数据集和代码将发布在 https://github.com/YNG916/InterEdit。
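The frequency-alignment ingredient, DCT plus energy pooling over motion trajectories, is easy to sketch. Below, a naive DCT-II and band-energy pooling in NumPy; the token shape and the even band split are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def dct_ii(x):
    """DCT-II along the last axis (naive O(N^2) form, no normalization)."""
    n = x.shape[-1]
    m = np.arange(n)
    k = np.arange(n)
    basis = np.cos(np.pi * (2 * m[None, :] + 1) * k[:, None] / (2 * n))  # (k, m)
    return x @ basis.T

def frequency_energy_tokens(motion, n_bands=4):
    """Hypothetical frequency tokens: DCT each joint trajectory, then
    pool squared-coefficient energy into a few frequency bands."""
    coeffs = dct_ii(motion)                                   # (joints, frames)
    bands = np.array_split(coeffs ** 2, n_bands, axis=-1)
    return np.stack([b.sum(axis=-1) for b in bands], axis=-1)  # (joints, n_bands)
```

Periodic motion concentrates energy in a few low bands, giving the model a compact summary of motion dynamics per joint.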
cs.CV / 104 / 2603.13089

V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration

V-Bridge:将视频生成先验与多样化的少样本图像修复相结合
Zheng, Shenghe, Jiang, Junpeng, Li, Wenbo
Abstract
Large-scale video generative models are trained on vast and diverse visual data, enabling them to internalize rich structural, semantic, and dynamic priors of the visual world. While these models have demonstrated impressive generative capability, their potential as general-purpose visual learners remains largely untapped. In this work, we introduce V-Bridge, a framework that bridges this latent capacity to versatile few-shot image restoration tasks. We reinterpret image restoration not as a static regression problem, but as a progressive generative process, and leverage video models to simulate the gradual refinement from degraded inputs to high-fidelity outputs. Surprisingly, with only 1,000 multi-task training samples (less than 2% of existing restoration methods), pretrained video models can be induced to perform competitive image restoration, achieving multiple tasks with a single model, rivaling specialized architectures designed explicitly for this purpose. Our findings reveal that video generative models implicitly learn powerful and transferable restoration priors that can be activated with only extremely limited data, challenging the traditional boundary between generative modeling and low-level vision, and opening a new design paradigm for foundation models in visual tasks.
Chinese Translation
大规模视频生成模型在广泛且多样的视觉数据上进行训练,使其能够内化视觉世界丰富的结构、语义和动态先验。尽管这些模型展现了令人印象深刻的生成能力,但其作为通用视觉学习者的潜力仍然未被充分挖掘。在本研究中,我们提出了V-Bridge,一个将这种潜在能力与多样化的少样本图像修复任务相结合的框架。我们将图像修复重新解释为一个渐进的生成过程,而非静态回归问题,并利用视频模型模拟从降质输入到高保真输出的逐步精炼。令人惊讶的是,仅凭1,000个多任务训练样本(少于现有修复方法的2%),预训练的视频模型就能被引导执行具有竞争力的图像修复,使用单一模型完成多个任务,甚至与专门为此目的设计的架构相媲美。我们的研究结果表明,视频生成模型隐式学习了强大且可迁移的修复先验,这些先验仅需极其有限的数据即可激活,挑战了生成建模与低级视觉之间的传统界限,并为视觉任务中的基础模型设计开辟了新的范式。
cs.CV / 105 / 2603.13091

Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence

视频推理:评估多模态大语言模型如何提取、整合和重构时空证据
Bang, Seunghwan, Song, Hwanjun
Abstract
The growing interest in embodied agents increases the demand for spatiotemporal video understanding, yet existing benchmarks largely emphasize extractive reasoning, where answers can be explicitly presented within spatiotemporal events. It remains unclear whether multimodal large language models can instead perform abstractive spatiotemporal reasoning, which requires integrating observations over time, combining dispersed cues, and inferring implicit spatial and contextual structure. To address this gap, we formalize abstractive spatiotemporal reasoning from videos by introducing a structured evaluation taxonomy that systematically targets its core dimensions and construct a controllable, scenario-driven synthetic egocentric video dataset tailored to evaluate abstractive spatiotemporal reasoning capabilities, spanning object-, room-, and floor-plan-level scenarios. Based on this framework, we present VAEX-BENCH, a benchmark comprising five abstractive reasoning tasks together with their extractive counterparts. Our extensive experiments compare the performance of state-of-the-art MLLMs under extractive and abstractive settings, exposing their limitations on abstractive tasks and providing a fine-grained analysis of the underlying bottlenecks. The dataset will be released soon.
Chinese Translation
对具身代理的日益关注增加了对时空视频理解的需求,但现有基准主要强调提取性推理,即答案可以在时空事件中明确呈现。目前尚不清楚多模态大语言模型是否能够执行抽象时空推理,这需要在时间上整合观察结果、结合分散的线索,并推断隐含的空间和上下文结构。为了填补这一空白,我们通过引入一个结构化的评估分类法来形式化视频中的抽象时空推理,该分类法系统性地针对其核心维度,并构建一个可控的、情境驱动的合成自我中心视频数据集,以评估抽象时空推理能力,涵盖对象、房间和楼层平面级别的场景。在此框架基础上,我们提出了VAEX-BENCH,该基准包含五个抽象推理任务及其对应的提取性任务。我们的广泛实验比较了最先进的多模态大语言模型在提取性和抽象性设置下的表现,揭示了它们在抽象任务上的局限性,并提供了对潜在瓶颈的细致分析。该数据集将很快发布。
cs.CV / 106 / 2603.13102

BenDFM: A taxonomy and synthetic CAD dataset for manufacturability assessment in sheet metal bending

BenDFM:用于钣金弯曲可制造性评估的分类法和合成CAD数据集
Ballegeer, Matteo, Benoit, Dries F.
Abstract
Predicting the manufacturability of CAD designs early, in terms of both feasibility and required effort, is a key goal of Design for Manufacturing (DFM). Despite advances in deep learning for CAD and its widespread use in manufacturing process selection, learning-based approaches for predicting manufacturability within a specific process remain limited. Two key challenges limit progress: inconsistency across prior work in how manufacturability is defined and consequently in the associated learning targets, and a scarcity of suitable datasets. Existing labels vary significantly: they may reflect intrinsic design constraints or depend on specific manufacturing capabilities (such as available tools), and they range from discrete feasibility checks to continuous complexity measures. Furthermore, industrial datasets typically contain only manufacturable parts, offering little signal for infeasible cases, while existing synthetic datasets focus on simple geometries and subtractive processes. To address these gaps, we propose a taxonomy of manufacturability metrics along the axes of configuration dependence and measurement type, allowing clearer scoping of generalizability and learning objectives. Next, we introduce BenDFM, the first synthetic dataset for manufacturability assessment in sheet metal bending. BenDFM contains 20,000 parts, both manufacturable and unmanufacturable, generated with process-aware bending simulations, providing both folded and unfolded geometries and multiple manufacturability labels across the taxonomy, enabling systematic study of previously unexplored learning-based DFM challenges. We benchmark two state-of-the-art 3D learning architectures on BenDFM, showing that graph-based representations that capture relationships between part surfaces achieve better accuracy, and that predicting metrics that depend on specific manufacturing setups remains more challenging.
Chinese Translation
在设计阶段早期预测CAD设计的可制造性,包括可行性和所需努力,是面向制造的设计(DFM)的一个关键目标。尽管深度学习在CAD领域取得了进展并广泛应用于制造过程选择,但基于学习的方法在特定过程中的可制造性预测仍然有限。两个关键挑战限制了进展:一是先前工作的可制造性定义不一致,导致相关学习目标的差异;二是合适数据集的稀缺。现有标签差异显著:它们可能反映内在设计约束,或依赖于特定的制造能力(如可用工具),并且范围从离散的可行性检查到连续的复杂性度量。此外,工业数据集通常仅包含可制造的部件,提供的不可行案例信号有限,而现有的合成数据集则专注于简单几何形状和减法加工。为了解决这些问题,我们提出了一种可制造性度量的分类法,基于配置依赖性和测量类型,明确了可推广性和学习目标的范围。接下来,我们介绍了BenDFM,这是第一个用于钣金弯曲可制造性评估的合成数据集。BenDFM包含20,000个部件,包括可制造和不可制造的部件,采用基于过程的弯曲模拟生成,提供折叠和展开的几何形状以及多个可制造性标签,支持对以前未探索的基于学习的DFM挑战的系统研究。我们在BenDFM上基准测试了两种最先进的3D学习架构,结果表明,捕捉部件表面之间关系的基于图的表示能够实现更好的准确性,而预测依赖于特定制造设置的度量仍然更具挑战性。
cs.CV / 107 / 2603.13118

NOIR: Neural Operator mapping for Implicit Representations

NOIR:隐式表示的神经算子映射
Hadramy, Sidaty El, Haouchine, Nazim, Wehrli, Michael, Cattin, Philippe C.
Abstract
This paper presents NOIR, a framework that reframes core medical imaging tasks as operator learning between continuous function spaces, challenging the prevailing paradigm of discrete grid-based deep learning. Instead of operating on fixed pixel or voxel grids, NOIR embeds discrete medical signals into shared Implicit Neural Representations and learns a Neural Operator that maps between their latent modulations, enabling resolution-independent function-to-function transformations. We evaluate NOIR across multiple 2D and 3D downstream tasks, including segmentation, shape completion, image-to-image translation, and image synthesis, on several public datasets such as Shenzhen, OASIS-4, SkullBreak, fastMRI, as well as an in-house clinical dataset. It achieves competitive performance at native resolution while demonstrating strong robustness to unseen discretizations, and empirically satisfies key theoretical properties of neural operators. The project page is available here: https://github.com/Sidaty1/NOIR-io.
Chinese Translation
本文提出了NOIR,一个将核心医学成像任务重新框定为连续函数空间之间的算子学习的框架,挑战了基于离散网格的深度学习的主流范式。NOIR并不是在固定的像素或体素网格上操作,而是将离散医学信号嵌入共享的隐式神经表示中,并学习一个神经算子,该算子在其潜在调制之间进行映射,从而实现与分辨率无关的函数到函数的转换。我们在多个2D和3D下游任务上评估NOIR,包括分割、形状补全、图像到图像的转换和图像合成,使用了多个公共数据集,如Shenzhen、OASIS-4、SkullBreak、fastMRI,以及一个内部临床数据集。NOIR在原生分辨率下实现了竞争力的性能,同时对未见的离散化表现出强大的鲁棒性,并在经验上满足神经算子的关键理论属性。项目页面可在此访问: https://github.com/Sidaty1/NOIR-io。
cs.CV / 108 / 2603.13119

Geometry-Guided Camera Motion Understanding in VideoLLMs

基于几何引导的摄像机运动理解在视频语言模型中的应用
Feng, Haoan, Musunuri, Sri Harsha, Su, Guan-Ming
Abstract
Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of $\textbf{benchmarking}$, $\textbf{diagnosis}$, and $\textbf{injection}$. We curate $\textbf{CameraMotionDataset}$, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark--$\textbf{CameraMotionVQA}$. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap without costly training or fine-tuning, we propose a lightweight, model-agnostic pipeline that extracts geometric camera cues from 3D foundation models (3DFMs), predicts constrained motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference via structured prompting. Experiments demonstrate improved motion recognition and more camera-aware model responses, highlighting geometry-driven cue extraction and structured prompting as practical steps toward a camera-aware VideoLLM and VLA system. The dataset and benchmark is publicly available at https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark.
Chinese Translation
摄像机运动是塑造视觉感知和电影风格的基本几何信号,但当前具备视频能力的视觉语言模型(VideoLLMs)很少明确表示这一点,并且在细粒度运动原语上常常表现不佳。我们通过一个包括$\textbf{基准}$、$\textbf{诊断}$和$\textbf{注入}$的框架来解决这一问题。我们整理了$\textbf{CameraMotionDataset}$,这是一个具有明确摄像机控制的大规模合成数据集,将摄像机运动表述为约束感知的多标签识别,并构建了一个视觉问答基准--$\textbf{CameraMotionVQA}$。在多种现成的VideoLLMs中,我们观察到在识别摄像机运动原语时存在显著错误。对Qwen2.5-VL视觉编码器的探测实验表明,摄像机运动线索的表示较弱,尤其是在更深的ViT模块中,这有助于解释观察到的失败模式。为了在不进行昂贵训练或微调的情况下弥补这一差距,我们提出了一种轻量级、模型无关的管道,从3D基础模型(3DFMs)中提取几何摄像机线索,利用时间分类器预测约束运动原语,并通过结构化提示将其注入下游VideoLLM推理中。实验表明,运动识别得到改善,模型响应更加关注摄像机,强调了基于几何的线索提取和结构化提示是迈向摄像机感知VideoLLM和视觉-语言-动作(VLA)系统的实际步骤。数据集和基准可在https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark公开获取。
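Framing camera motion as constraint-aware multi-label recognition means each frame transition can carry several primitives at once (e.g. panning while dollying). The thresholding sketch below is a hypothetical illustration of that framing from raw pose deltas; the sign conventions, thresholds, and primitive names are our assumptions, not the paper's temporal classifier:

```python
import numpy as np

def motion_primitives(positions, yaws, t_eps=0.01, r_eps=0.5):
    """Label each frame transition with coarse camera-motion primitives.

    positions: (T, 3) camera centers (x right, z forward, toy convention);
    yaws: (T,) heading in degrees. Returns a list of label lists (multi-label).
    """
    labels = []
    for i in range(1, len(positions)):
        dp = positions[i] - positions[i - 1]
        dyaw = yaws[i] - yaws[i - 1]
        tags = []
        if abs(dyaw) > r_eps:                      # rotation about the vertical axis
            tags.append("pan-left" if dyaw > 0 else "pan-right")
        if abs(dp[2]) > t_eps:                     # translation along the view axis
            tags.append("dolly-in" if dp[2] > 0 else "dolly-out")
        if abs(dp[0]) > t_eps:                     # lateral translation
            tags.append("truck-right" if dp[0] > 0 else "truck-left")
        labels.append(tags or ["static"])
    return labels
```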
cs.CV / 109 / 2603.13121

FDeID-Toolbox: Face De-Identification Toolbox

FDeID-Toolbox:面部去标识化工具箱
Wei, Hui, Yu, Hao, Zhao, Guoying
Abstract
Face de-identification (FDeID) aims to remove personally identifiable information from facial images while preserving task-relevant utility attributes such as age, gender, and expression. It is critical for privacy-preserving computer vision, yet the field suffers from fragmented implementations, inconsistent evaluation protocols, and incomparable results across studies. These challenges stem from the inherent complexity of the task: FDeID spans multiple downstream applications (e.g., age estimation, gender recognition, expression analysis) and requires evaluation across three dimensions (e.g., privacy protection, utility preservation, and visual quality), making existing codebases difficult to use and extend. To address these issues, we present FDeID-Toolbox, a comprehensive toolbox designed for reproducible FDeID research. Our toolbox features a modular architecture comprising four core components: (1) standardized data loaders for mainstream benchmark datasets, (2) unified method implementations spanning classical approaches to SOTA generative models, (3) flexible inference pipelines, and (4) systematic evaluation protocols covering privacy, utility, and quality metrics. Through experiments, we demonstrate that FDeID-Toolbox enables fair and reproducible comparison of diverse FDeID methods under consistent conditions.
Chinese Translation
面部去标识化(FDeID)的目标是从面部图像中去除个人可识别信息,同时保留与任务相关的效用属性,如年龄、性别和表情。这对于隐私保护的计算机视觉至关重要,但该领域面临着实现碎片化、评估协议不一致以及研究结果不可比等挑战。这些问题源于任务本身的复杂性:FDeID 涉及多个下游应用(例如,年龄估计、性别识别、表情分析),并需要在三个维度上进行评估(例如,隐私保护、效用保留和视觉质量),使得现有代码库难以使用和扩展。为了解决这些问题,我们提出了 FDeID-Toolbox,这是一个旨在实现可重复的 FDeID 研究的综合工具箱。我们的工具箱具有模块化架构,包含四个核心组件:(1)主流基准数据集的标准化数据加载器,(2)涵盖经典方法到最新生成模型的统一方法实现,(3)灵活的推理管道,以及(4)涵盖隐私、效用和质量指标的系统评估协议。通过实验,我们证明了 FDeID-Toolbox 能够在一致条件下实现不同 FDeID 方法的公平和可重复比较。
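A modular toolbox with pluggable methods, datasets, and metrics is commonly wired together with a registry pattern. A minimal sketch of such a mechanism (the `Registry` class and `BlurDeID` entry are hypothetical examples, not FDeID-Toolbox's actual API):

```python
class Registry:
    """Map string keys to classes so components can be added without
    touching the core pipeline, the usual glue for modular toolboxes."""

    def __init__(self, name):
        self.name, self._entries = name, {}

    def register(self, key):
        def deco(obj):
            if key in self._entries:
                raise KeyError(f"{key!r} already registered in {self.name}")
            self._entries[key] = obj
            return obj
        return deco

    def build(self, key, **kwargs):
        """Instantiate a registered component by name."""
        return self._entries[key](**kwargs)

METHODS = Registry("fdeid_methods")

@METHODS.register("blur")
class BlurDeID:
    """Toy placeholder for a classical blur-based de-identification method."""
    def __init__(self, radius=5):
        self.radius = radius
```

Datasets, inference pipelines, and metric protocols can each get their own registry, so a config file naming components is enough to assemble an experiment.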
cs.CV / 110 / 2603.13163

Towards Faithful Multimodal Concept Bottleneck Models

面向可信的多模态概念瓶颈模型
Moreau, Pierre, Ferrand, Emeline Pineau, Choho, Yann, Wong, Benjamin, Blangero, Annabelle, Bhan, Milan
Abstract
Concept Bottleneck Models (CBMs) are interpretable models that route predictions through a layer of human-interpretable concepts. While widely studied in vision and, more recently, in NLP, CBMs remain largely unexplored in multimodal settings. For their explanations to be faithful, CBMs must satisfy two conditions: concepts must be properly detected, and concept representations must encode only their intended semantics, without smuggling extraneous task-relevant or inter-concept information into final predictions, a phenomenon known as leakage. Existing approaches treat concept detection and leakage mitigation as separate problems, and typically improve one at the expense of predictive accuracy. In this work, we introduce f-CBM, a faithful multimodal CBM framework built on a vision-language backbone that jointly targets both aspects through two complementary strategies: a differentiable leakage loss to mitigate leakage, and a Kolmogorov-Arnold Network prediction head that provides sufficient expressiveness to improve concept detection. Experiments demonstrate that f-CBM achieves the best trade-off between task accuracy, concept detection, and leakage reduction, while applying seamlessly to both image and text or text-only datasets, making it versatile across modalities.
Chinese Translation
概念瓶颈模型(Concept Bottleneck Models, CBMs)是一种可解释的模型,通过一层人类可解释的概念来引导预测。尽管在视觉领域以及最近的自然语言处理(NLP)中得到了广泛研究,但CBMs在多模态环境中的探索仍然相对较少。为了使其解释具有可信性,CBMs必须满足两个条件:概念必须被正确检测,且概念表示必须仅编码其预期语义,而不应将多余的与任务相关或概念间的信息混入最终预测中,这种现象被称为泄漏。现有方法将概念检测和泄漏缓解视为两个独立的问题,通常在提高其中一个的同时牺牲预测准确性。在本研究中,我们提出了f-CBM,这是一种基于视觉-语言骨干的可信多模态CBM框架,通过两种互补策略共同解决这两个方面:一种可微分的泄漏损失来缓解泄漏,以及一个Kolmogorov-Arnold网络预测头,提供足够的表达能力以改善概念检测。实验表明,f-CBM在任务准确性、概念检测和泄漏减少之间实现了最佳的权衡,同时能够无缝应用于图像和文本或仅文本的数据集,使其在多模态中具有广泛的适用性。
cs.CV / 111 / 2603.13176

Perceive What Matters: Relevance-Driven Scheduling for Multimodal Streaming Perception

关注重要性:基于相关性的多模态流媒体感知调度
Huang, Dingcheng, Zhang, Xiaotong, Youcef-Toumi, Kamal
Abstract
In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to achieve comprehensive scene understanding, enabling the robot to provide appropriate assistance to human agents intelligently. While executing multiple perception modules on a frame-by-frame basis enhances perception quality in offline settings, it inevitably accumulates latency, leading to a substantial decline in system performance in streaming perception scenarios. Recent work in scene understanding, termed Relevance, has established a solid foundation for developing efficient methodologies in HRC. However, modern perception pipelines still face challenges related to information redundancy and suboptimal allocation of computational resources. Drawing inspiration from the Relevance concept and the information sparsity in HRC events, we propose a novel lightweight perception scheduling framework that efficiently leverages output from previous frames to estimate and schedule necessary perception modules in real-time based on scene context. The experimental results demonstrate that the proposed perception scheduling framework effectively reduces computational latency by up to 27.52% compared to conventional parallel perception pipelines, while also achieving a 72.73% improvement in MMPose activation recall. Additionally, the framework demonstrates high keyframe accuracy, achieving rates of up to 98%. The results validate the framework's capability to enhance real-time perception efficiency without significantly compromising accuracy. The framework shows potential as a scalable and systematic solution for multimodal streaming perception systems in HRC.
Chinese Translation
在现代人机协作(HRC)应用中,多种感知模块共同提取视觉、听觉和上下文线索,以实现全面的场景理解,从而使机器人能够智能地为人类代理提供适当的帮助。尽管在离线环境中逐帧执行多个感知模块可以提高感知质量,但这不可避免地会累积延迟,导致在流媒体感知场景中系统性能显著下降。近期在场景理解方面的研究,称为相关性(Relevance),为HRC中高效方法的开发奠定了坚实基础。然而,现代感知管道仍面临信息冗余和计算资源分配不优化的问题。我们受到相关性概念和HRC事件中信息稀疏性的启发,提出了一种新颖的轻量级感知调度框架,该框架有效利用前帧的输出,根据场景上下文实时估计和调度必要的感知模块。实验结果表明,与传统的并行感知管道相比,所提出的感知调度框架有效减少了计算延迟,最高可达27.52%,同时在MMPose激活召回率上提高了72.73%。此外,该框架还展示了高关键帧准确率,达到98%的水平。结果验证了该框架在不显著妥协准确性的情况下提升实时感知效率的能力。该框架展现出作为HRC中多模态流媒体感知系统的可扩展和系统化解决方案的潜力。
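The core scheduling idea, use relevance estimated from previous frames to decide which perception modules to run under a latency budget, can be reduced to a greedy sketch. The module names, costs, and budget rule below are illustrative assumptions, not the paper's scheduler:

```python
def schedule(modules, relevance, budget_ms):
    """Greedy relevance-driven scheduling: run the most relevant
    perception modules first, skipping any that would exceed the
    per-frame latency budget.

    modules:   {name: estimated cost in ms}
    relevance: {name: relevance score inferred from previous frames}
    """
    ranked = sorted(modules, key=lambda m: relevance.get(m, 0.0), reverse=True)
    selected, spent = [], 0.0
    for name in ranked:
        if spent + modules[name] <= budget_ms:
            selected.append(name)
            spent += modules[name]
    return selected
```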
cs.CV / 112 / 2603.13182

Diffusion-Based Feature Denoising and Using NNMF for Robust Brain Tumor Classification

基于扩散的特征去噪与非负矩阵分解在稳健脑肿瘤分类中的应用
Al-kharsan, Hiba Adil, Rajkó, Róbert
Abstract
Brain tumor classification from magnetic resonance imaging (MRI) plays a critical role in computer-assisted diagnosis systems. In recent years, deep learning models have achieved high classification accuracy. However, their sensitivity to adversarial perturbations has become an important reliability concern in medical applications. This study proposes a robust brain tumor classification framework that combines Non-Negative Matrix Factorization (NNMF or NMF), lightweight convolutional neural networks (CNNs), and diffusion-based feature purification. Initially, MRI images are preprocessed and converted into a non-negative data matrix, from which compact and interpretable NNMF feature representations are extracted. Statistical metrics, including AUC, Cohen's d, and p-values, are used to rank and choose the most discriminative components. Then, a lightweight CNN classifier is trained directly on the selected feature groups. To improve adversarial robustness, a diffusion-based feature-space purification module is introduced. A forward noise method followed by a learned denoiser network is used before classification. System performance is estimated using both clean accuracy and robust accuracy under powerful adversarial attacks created by AutoAttack. The experimental results show that the proposed framework achieves competitive classification performance while significantly enhancing robustness against adversarial perturbations. The findings suggest that combining interpretable NNMF-based representations with a lightweight deep approach and diffusion-based defense technique supplies an effective and reliable solution for medical image classification under adversarial conditions.
Chinese Translation
脑肿瘤分类基于磁共振成像(MRI),在计算机辅助诊断系统中扮演着重要角色。近年来,深度学习模型已实现高分类准确率。然而,它们对对抗性扰动的敏感性在医学应用中成为了一个重要的可靠性问题。本研究提出了一种稳健的脑肿瘤分类框架,该框架结合了非负矩阵分解(NNMF或NMF)、轻量级卷积神经网络(CNN)和基于扩散的特征净化。首先,对MRI图像进行预处理并转换为非负数据矩阵,从中提取紧凑且可解释的NNMF特征表示。使用统计指标,包括AUC、Cohen's d和p值,来排名并选择最具区分性的成分。然后,直接在选定的特征组上训练轻量级CNN分类器。为了提高对抗鲁棒性,引入了基于扩散的特征空间净化模块。在分类之前,使用前向噪声方法和学习的去噪网络。系统性能通过在AutoAttack创建的强对抗攻击下评估干净准确率和鲁棒准确率。实验结果表明,所提出的框架在实现竞争性分类性能的同时,显著增强了对对抗性扰动的鲁棒性。研究结果表明,将可解释的基于NNMF的表示与轻量级深度方法和基于扩散的防御技术相结合,为在对抗条件下的医学图像分类提供了一种有效且可靠的解决方案。
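The feature pipeline described above (non-negative factorization, component ranking, compact features for a light classifier) can be sketched in NumPy. This is a toy sketch under stated assumptions: random data stands in for MRI, a plain multiplicative-update NMF replaces the paper's implementation, and the Cohen's d-like ranking score is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 64))                       # 200 "images", 64 pixels, non-negative
y = (X[:, :8].mean(axis=1) > 0.5).astype(int)   # toy labels

# Plain multiplicative-update NMF: X ~= W @ H with W, H >= 0
k = 10
W, H = rng.random((200, k)) + 0.1, rng.random((k, 64)) + 0.1
for _ in range(100):
    H *= (W.T @ X) / (W.T @ W @ H + 1e-9)       # update basis activations
    W *= (X @ H.T) / (W @ H @ H.T + 1e-9)       # update per-image coefficients

# Rank components by a Cohen's d-like class-separation score, keep the top 5
d = np.abs(W[y == 1].mean(0) - W[y == 0].mean(0)) / (W.std(0) + 1e-9)
top = np.argsort(d)[::-1][:5]
features = W[:, top]                            # compact input for a light classifier
```

The selected columns of `W` would then feed the lightweight CNN; the paper also ranks by AUC and p-values.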
cs.CV / 113 / 2603.13185

Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos

基于单目视频的时空世界场景图生成研究
Peddi, Rohith, Saurabh, Shanmugam, Shravan, Pallapothula, Likhitha, Xiang, Yu, Singla, Parag, Gogate, Vibhav
Abstract
Spatio-temporal scene graphs provide a principled representation for modeling evolving object interactions, yet existing methods remain fundamentally frame-centric: they reason only about currently visible objects, discard entities upon occlusion, and operate in 2D. To address this, we first introduce ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes via feed-forward 3D reconstruction, world-frame oriented bounding boxes for every object involved in actions, and dense relationship annotations including for objects that are temporarily unobserved due to occlusion or camera motion. Building on this data, we formalize World Scene Graph Generation (WSGG), the task of constructing a world scene graph at each timestamp that encompasses all interacting objects in the scene, both observed and unobserved. We then propose three complementary methods, each exploring a different inductive bias for reasoning about unobserved objects: PWG (Persistent World Graph), which implements object permanence via a zero-order feature buffer; MWAE (Masked World Auto-Encoder), which reframes unobserved-object reasoning as masked completion with cross-view associative retrieval; and 4DST (4D Scene Transformer), which replaces the static buffer with differentiable per-object temporal attention enriched by 3D motion and camera-pose features. We further design and evaluate the performance of strong open-source Vision-Language Models on the WSGG task via a suite of Graph RAG-based approaches, establishing baselines for unlocalized relationship prediction. WSGG thus advances video scene understanding toward world-centric, temporally persistent, and interpretable scene reasoning.
Chinese Translation
时空场景图为建模不断演变的物体交互提供了一个原则性表示,然而现有方法仍然在根本上是以帧为中心:它们仅考虑当前可见的物体,在遮挡时丢弃实体,并且仅在二维空间中操作。为了解决这一问题,我们首先引入ActionGenome4D数据集,该数据集通过前馈3D重建,将Action Genome视频升级为4D场景,为每个参与动作的物体提供世界框架导向的边界框,并对由于遮挡或相机运动而暂时未观察到的物体提供密集的关系注释。在此基础上,我们正式定义了世界场景图生成(WSGG)任务,该任务旨在在每个时间戳构建一个世界场景图,涵盖场景中所有交互的物体,包括已观察和未观察的物体。然后,我们提出了三种互补的方法,每种方法探索不同的归纳偏置以推理未观察到的物体:PWG(持久世界图),通过零阶特征缓冲实现物体的持久性;MWAE(掩码世界自编码器),将未观察物体的推理重新框定为带有跨视图关联检索的掩码补全;以及4DST(4D场景变换器),用由3D运动和相机姿态特征丰富的可微分每物体时间注意力替代静态缓冲。我们进一步设计并评估了强大的开源视觉-语言模型在WSGG任务上的表现,通过一系列基于图RAG的方法建立未定位关系预测的基准。因此,WSGG推动视频场景理解朝向以世界为中心、时间上持久和可解释的场景推理发展。
cs.CV / 114 / 2603.13215

Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models

视而不见,心而不念?评估视频世界模型中的状态演变
Ma, Ziqi, Liufu, Mengzhan, Gkioxari, Georgia
Abstract
Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate "worlds" via 2D frame observations. Can these generated "worlds" evolve regardless of observation? To probe this question, we design a benchmark to evaluate whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes via instructions of occluder insertion, turning off the light, or specifying camera "lookaway" trajectories. By evaluating video models with and without camera control for a diverse set of naturally-occurring evolutions, we expose their limitations in decoupling state evolution from observation. STEVO-Bench proposes an evaluation protocol to automatically detect and disentangle failure modes of video world models across key aspects of natural state evolution. Analysis of STEVO-Bench results provides new insight into potential data and architecture bias of present-day video world models. Project website: https://glab-caltech.github.io/STEVOBench/. Blog: https://ziqi-ma.github.io/blog/2026/outofsight/
Chinese Translation
世界中的演变,例如水的倾倒或冰的融化,发生时并不依赖于观察。视频世界模型通过二维帧观察生成“世界”。这些生成的“世界”能否在没有观察的情况下演变?为探讨这一问题,我们设计了一个基准测试,以评估视频世界模型是否能够将状态演变与观察解耦。我们的基准测试,STEVO-Bench,通过插入遮挡物、关闭灯光或指定相机“转移视线”轨迹的指令,对演变过程应用观察控制。通过对一组自然发生的演变进行评估,比较有无相机控制的视频模型,我们揭示了它们在将状态演变与观察解耦方面的局限性。STEVO-Bench 提出了一个评估协议,以自动检测和拆解视频世界模型在自然状态演变关键方面的失败模式。对 STEVO-Bench 结果的分析为当前视频世界模型潜在的数据和架构偏差提供了新的见解。项目网站:https://glab-caltech.github.io/STEVOBench/。博客:https://ziqi-ma.github.io/blog/2026/outofsight/
cs.CV / 115 / 2603.13224

Visual-ERM: Reward Modeling for Visual Equivalence

视觉等价奖励模型:用于视觉等价的奖励建模
Liu, Ziyu, Ding, Shengyuan, Fang, Xinyu, Dai, Xuanlang, Yang, Penghui, Liang, Jianze, Wang, Jiaqi, Chen, Kai, Lin, Dahua, Zang, Yuhang
Abstract
Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.
Chinese Translation
视觉到代码的任务要求模型将结构化的视觉输入(如图表、表格和SVG)重构为具有高视觉保真度的可执行或结构化表示。尽管最近的大型视觉语言模型(LVLMs)通过监督微调取得了良好的结果,但由于奖励信号的不一致,强化学习仍然面临挑战。现有的奖励要么依赖于文本规则,要么依赖于粗略的视觉嵌入相似性,这两者都无法捕捉细粒度的视觉差异,并且容易受到奖励操控的影响。我们提出了视觉等价奖励模型(Visual-ERM),这是一种多模态生成奖励模型,能够提供细粒度、可解释且与任务无关的反馈,直接在渲染的视觉空间中评估视觉到代码的质量。集成到强化学习中,Visual-ERM使Qwen3-VL-8B-Instruct在图表到代码的任务上提高了8.4分,并在表格和SVG解析上获得了一致的提升(平均分别为2.7和4.1),并进一步通过反思和修订增强了测试时的扩展性。我们还引入了VisualCritic-RewardBench(VC-RewardBench),这是一个用于评估结构化视觉数据上细粒度图像到图像差异的基准,其中Visual-ERM在8B模型下显著超越了Qwen3-VL-235B-Instruct,并接近领先的闭源模型。我们的结果表明,细粒度的视觉奖励监督对于视觉到代码的强化学习既是必要的也是充分的,无论任务的特异性如何。
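The core intuition above, scoring vision-to-code quality directly in rendered visual space, can be illustrated with a heavily simplified stand-in: render both code outputs to images and score similarity in pixel space. The paper uses a learned multimodal generative judge; this mean-absolute-difference reward is purely illustrative.

```python
import numpy as np

def visual_reward(rendered_pred, rendered_ref):
    """Reward in [0, 1]: 1.0 means pixel-identical renders."""
    a = np.asarray(rendered_pred, dtype=float) / 255.0
    b = np.asarray(rendered_ref, dtype=float) / 255.0
    return 1.0 - np.abs(a - b).mean()

ref = np.full((4, 4), 255, dtype=np.uint8)   # toy "reference render"
good = ref.copy()                            # a perfect reconstruction
bad = np.zeros((4, 4), dtype=np.uint8)       # a completely wrong render
```

A raw pixel metric is exactly what the paper argues is too coarse (and hackable); the point of Visual-ERM is to replace it with fine-grained, interpretable judgments in the same rendered space.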
人工智能 (Artificial Intelligence)
16
cs.AI / 1 / 2603.12287

Context-Enriched Natural Language Descriptions of Vessel Trajectories

上下文丰富的船舶轨迹自然语言描述
Patroumpas, Kostas, Troupiotis-Kapeliaris, Alexandros, Spiliopoulos, Giannis, Betchavas, Panagiotis, Skoutas, Dimitrios, Zissis, Dimitris, Bikakis, Nikos
Abstract
We address the problem of transforming raw vessel trajectory data collected from AIS into structured and semantically enriched representations interpretable by humans and directly usable by machine reasoning systems. We propose a context-aware trajectory abstraction framework that segments noisy AIS sequences into distinct trips each consisting of clean, mobility-annotated episodes. Each episode is further enriched with multi-source contextual information, such as nearby geographic entities, offshore navigation features, and weather conditions. Crucially, such representations can support generation of controlled natural language descriptions using LLMs. We empirically examine the quality of such descriptions generated using several LLMs over AIS data along with open contextual features. By increasing semantic density and reducing spatiotemporal complexity, this abstraction can facilitate downstream analytics and enable integration with LLMs for higher-level maritime reasoning tasks.
Chinese Translation
我们解决了将从AIS收集的原始船舶轨迹数据转化为结构化和语义丰富的表示的问题,这些表示可被人类解读并可直接用于机器推理系统。我们提出了一种基于上下文的轨迹抽象框架,该框架将噪声AIS序列分割为不同的行程,每个行程由干净的、带有移动性注释的事件组成。每个事件进一步丰富了多源上下文信息,例如附近的地理实体、离岸导航特征和天气条件。至关重要的是,这种表示可以支持使用大型语言模型(LLMs)生成受控的自然语言描述。我们实证考察了使用多个LLMs对AIS数据及开放上下文特征生成的描述质量。通过增加语义密度和减少时空复杂性,这种抽象可以促进下游分析,并使其与LLMs集成以实现更高层次的海事推理任务。
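One step of the abstraction pipeline above, splitting a noisy AIS position sequence into distinct trips, can be sketched as a gap-based segmentation. The field names and the six-hour gap threshold are illustrative assumptions, not the authors' parameters.

```python
from datetime import datetime, timedelta

def segment_trips(points, max_gap=timedelta(hours=6)):
    """points: list of dicts with a 'ts' (datetime) key, sorted by time."""
    trips, current = [], []
    for p in points:
        if current and p["ts"] - current[-1]["ts"] > max_gap:
            trips.append(current)            # close the current trip at a long gap
            current = []
        current.append(p)
    if current:
        trips.append(current)
    return trips

# Five reports: a 10-hour silence between hours 2 and 12 splits the sequence
pts = [{"ts": datetime(2026, 3, 1, h)} for h in (0, 1, 2, 12, 13)]
trips = segment_trips(pts)
```

Each resulting trip would then be split into mobility-annotated episodes and enriched with geographic and weather context before description generation.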
cs.AI / 2 / 2603.12372

Efficient Reasoning with Balanced Thinking

高效推理与平衡思维
Li, Yulin, Tu, Tengyao, Ding, Li, Wang, Junjie, Zhen, Huiling, Chen, Yixin, Li, Yong, Tian, Zhuotao
Abstract
Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs' reasoning trajectories. A dynamic control function modulates this vector's strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Code is available at https://github.com/yu-lin-li/ReBalance .
Chinese Translation
大型推理模型(Large Reasoning Models, LRM)展现了卓越的推理能力,但它们常常面临过度思考的问题,在简单问题上耗费冗余的计算步骤,或是思考不足,未能充分探索推理路径,尽管其具备内在能力。这些问题导致了效率低下和潜在的不准确性,限制了在资源受限环境中的实际应用。现有的减轻过度思考的方法,如抑制反思性关键词或调整推理长度,可能会无意中引发思考不足,从而影响准确性。因此,我们提出了ReBalance,一个无需训练的框架,旨在实现高效推理与平衡思维。ReBalance利用置信度作为推理动态的连续指标,通过高置信度方差识别过度思考,通过持续的过度自信识别思考不足。通过将小规模数据集中的隐藏状态聚合为推理模式原型,我们计算出一个引导向量,以指导LRM的推理轨迹。动态控制函数根据实时置信度调节该向量的强度和方向,在过度思考时修剪冗余,在思考不足时促进探索。在四个模型(范围从0.5B到32B)和九个基准(包括数学推理、一般问答和编码任务)上进行的广泛实验表明,ReBalance有效减少了输出冗余,同时提高了准确性,为高效且稳健的LRM部署提供了一种通用、无需训练且即插即用的策略。代码可在 https://github.com/yu-lin-li/ReBalance 获取。
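The confidence-based detection and steering described above can be sketched numerically. The thresholds, the variance/overconfidence rules, and the linear steering update below are assumptions for illustration, not the paper's calibrated values.

```python
import numpy as np

def reasoning_mode(step_confidences, var_hi=0.04, mean_hi=0.9):
    """Classify the trajectory from per-step confidence values."""
    c = np.asarray(step_confidences, dtype=float)
    if c.var() > var_hi:
        return "overthinking"       # oscillating confidence -> prune redundancy
    if c.mean() > mean_hi:
        return "underthinking"      # uniformly overconfident -> promote exploration
    return "balanced"

def steer(hidden, prototype, mode, alpha=0.1):
    """Shift a hidden state along a precomputed steering-vector prototype."""
    sign = {"overthinking": -1.0, "underthinking": +1.0, "balanced": 0.0}[mode]
    return hidden + sign * alpha * prototype
```

In the full method the prototype comes from aggregating hidden states of a small dataset into reasoning-mode prototypes, and `alpha` is modulated continuously by real-time confidence rather than fixed.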
cs.AI / 3 / 2603.12483

Generating Expressive and Customizable Evals for Timeseries Data Analysis Agents with AgentFuel

使用 AgentFuel 生成富有表现力和可定制的时间序列数据分析代理评估
Maddi, Aadyaa, Naval, Prakhar, Mande, Deepti, Duan, Shane, Girish, Muckai, Sekar, Vyas
Abstract
Across many domains (e.g., IoT, observability, telecommunications, cybersecurity), there is an emerging adoption of conversational data analysis agents that enable users to "talk to your data" to extract insights. Such data analysis agents operate on timeseries data models; e.g., measurements from sensors or events monitoring user clicks and actions in product analytics. We evaluate 6 popular data analysis agents (both open-source and proprietary) on domain-specific data and query types, and find that they fail on stateful and incident-specific queries. We observe two key expressivity gaps in existing evals: domain-customized datasets and domain-specific query types. To enable practitioners in such domains to generate customized and expressive evals for such timeseries data agents, we present AgentFuel. AgentFuel helps domain experts quickly create customized evals to perform end-to-end functional tests. We show that AgentFuel's benchmarks expose key directions for improvement in existing data agent frameworks. We also present anecdotal evidence that using AgentFuel can improve agent performance (e.g., with GEPA). AgentFuel benchmarks are available at https://huggingface.co/datasets/RockfishData/TimeSeriesAgentEvals.
Chinese Translation
在许多领域(例如物联网、可观察性、电信、网络安全),对对话式数据分析代理的采用正在逐渐增加,这些代理使用户能够与数据“对话”以提取洞察。这些数据分析代理基于时间序列数据模型运行;例如,来自传感器的测量或监测用户点击和行为的事件分析。我们对6个流行的数据分析代理(包括开源和专有)在特定领域的数据和查询类型上进行了评估,发现它们在有状态和特定事件的查询上表现不佳。我们观察到现有评估中存在两个关键的表现力缺口:领域定制的数据集和领域特定的查询类型。为了使这些领域的从业者能够为时间序列数据代理生成定制和富有表现力的评估,我们提出了 AgentFuel。AgentFuel 帮助领域专家快速创建定制评估,以执行端到端的功能测试。我们展示了 AgentFuel 的基准测试揭示了现有数据代理框架改进的关键方向。我们还提供了使用 AgentFuel 可以提高代理性能的轶事证据(例如,使用 GEPA)。AgentFuel 的基准测试可在 https://huggingface.co/datasets/RockfishData/TimeSeriesAgentEvals 获取。
cs.AI / 4 / 2603.12710

AI Planning Framework for LLM-Based Web Agents

基于大型语言模型的网络代理的人工智能规划框架
Shahnovsky, Orit, Dror, Rotem
Abstract
Developing autonomous agents for web-based tasks is a core challenge in AI. While Large Language Model (LLM) agents can interpret complex user requests, they often operate as black boxes, making it difficult to diagnose why they fail or how they plan. This paper addresses this gap by formally treating web tasks as sequential decision-making processes. We introduce a taxonomy that maps modern agent architectures to traditional planning paradigms: Step-by-Step agents to Breadth-First Search (BFS), Tree Search agents to Best-First Tree Search, and Full-Plan-in-Advance agents to Depth-First Search (DFS). This framework allows for a principled diagnosis of system failures like context drift and incoherent task decomposition. To evaluate these behaviors, we propose five novel evaluation metrics that assess trajectory quality beyond simple success rates. We support this analysis with a new dataset of 794 human-labeled trajectories from the WebArena benchmark. Finally, we validate our evaluation framework by comparing a baseline Step-by-Step agent against a novel Full-Plan-in-Advance implementation. Our results reveal that while the Step-by-Step agent aligns more closely with human gold trajectories (38% overall success), the Full-Plan-in-Advance agent excels in technical measures such as element accuracy (89%), demonstrating the necessity of our proposed metrics for selecting appropriate agent architectures based on specific application constraints.
Chinese Translation
为基于网络的任务开发自主代理是人工智能中的一个核心挑战。尽管大型语言模型(LLM)代理能够解释复杂的用户请求,但它们通常作为黑箱操作,这使得诊断其失败原因或规划过程变得困难。本文通过将网络任务正式视为顺序决策过程来填补这一空白。我们引入了一种分类法,将现代代理架构映射到传统规划范式:逐步代理对应于广度优先搜索(BFS),树搜索代理对应于最佳优先树搜索,提前全规划代理对应于深度优先搜索(DFS)。该框架允许对系统故障(如上下文漂移和任务分解不一致)进行原则性的诊断。为了评估这些行为,我们提出了五个新颖的评估指标,以评估轨迹质量,超越简单的成功率。我们通过WebArena基准的794个人工标注轨迹的新数据集来支持这一分析。最后,我们通过将基线逐步代理与一种新颖的提前全规划实现进行比较,验证了我们的评估框架。我们的结果显示,尽管逐步代理与人类黄金轨迹的吻合度更高(整体成功率为38%),但提前全规划代理在元素准确性等技术指标上表现优异(89%),这表明我们提出的指标在根据特定应用约束选择适当代理架构时的必要性。
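The taxonomy above maps agent architectures to classical search; the contrast can be made concrete on a toy task graph searched breadth-first (Step-by-Step style) and depth-first (Full-Plan-in-Advance style). The graph, node names, and goal are illustrative assumptions.

```python
from collections import deque

graph = {"start": ["search", "login"], "search": ["results"],
         "login": ["dashboard"], "results": ["goal"], "dashboard": []}

def bfs_plan(start, goal):
    """Expand one step at a time across the frontier, like a Step-by-Step agent."""
    frontier, seen = deque([[start]]), {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])

def dfs_plan(start, goal, path=None):
    """Commit to one branch to the end, like a Full-Plan-in-Advance agent."""
    path = path or [start]
    if path[-1] == goal:
        return path
    for nxt in graph.get(path[-1], []):
        if nxt not in path:
            found = dfs_plan(start, goal, path + [nxt])
            if found:
                return found
```

On this tiny graph both searches find the same plan; the paper's point is that their failure modes under context drift and bad decomposition differ, which the trajectory-level metrics expose.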
cs.AI / 5 / 2603.12733

On Using Machine Learning to Early Detect Catastrophic Failures in Marine Diesel Engines

利用机器学习早期检测海洋柴油发动机的灾难性故障
Maione, Francesco, Lino, Paolo, Giannino, Giuseppe, Maione, Guido
Abstract
Catastrophic failures of marine engines imply severe loss of functionality and destroy or damage the systems irreversibly. Being sudden and often unpredictable events, they pose a severe threat to navigation, crew, and passengers. The abrupt nature makes early detection the only effective countermeasure. However, research has concentrated on modeling the gradual degradation of components, with limited attention to sudden and anomalous phenomena. This work proposes a new method for early detection of catastrophic failures. Based on real data from a failed engine, the approach evaluates the derivatives of the deviation between actual sensor readings and expected values of engine variables. Predictions are obtained by a Random Forest, which is the most suitable Machine Learning algorithm among the tested ones. Traditional methods focus on deviations of monitored signals, whereas the proposed approach employs the derivatives of the deviations to provide earlier indications of abnormal dynamics, and to alert that a rapid and dangerous event is breaking out within the system. The method allows the detection of anomalies before measurements reach critical thresholds and alarms are triggered, which is the common practice in industry. Consequently, operators can be warned in advance and shut down the engine, and thus prevent damage and unexpected power loss. Moreover, they have the time to safely change the ship route and avoid potential obstacles. Simulation results confirm the effectiveness of the proposed approach in anticipating occurrence of catastrophic failures. Validation on real-world data further reinforces the robustness and practical applicability of the method. It is worth noting that data acquisition to train the predictive algorithm is not a problem, since a Deep Learning-based data augmentation procedure is used.
Chinese Translation
海洋发动机的灾难性故障意味着功能的严重丧失,并会不可逆转地破坏或损坏系统。这些故障通常是突发且不可预测的事件,对航行、船员和乘客构成了严重威胁。由于其突发性,早期检测是唯一有效的对策。然而,现有研究主要集中在组件的逐渐退化建模上,对突发和异常现象关注有限。本研究提出了一种新的灾难性故障早期检测方法。该方法基于来自故障发动机的真实数据,评估实际传感器读数与发动机变量预期值之间偏差的导数。通过随机森林(Random Forest)算法获得预测结果,该算法在测试的多种机器学习算法中最为合适。传统方法关注监测信号的偏差,而所提方法则利用偏差的导数提供异常动态的早期指示,并警示系统内即将发生快速且危险的事件。该方法允许在测量值达到临界阈值和触发警报之前检测到异常,这在工业中是常见的方法。因此,操作人员可以提前收到警告并关闭发动机,从而防止损坏和意外的功率损失。此外,他们还有时间安全地改变船只航线,避免潜在障碍物。仿真结果证实了所提方法在预测灾难性故障发生方面的有效性。对真实世界数据的验证进一步增强了该方法的稳健性和实际适用性。值得注意的是,训练预测算法所需的数据获取并不是问题,因为采用了基于深度学习的数据增强程序。
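The core signal-processing idea above, watching the derivative of the deviation rather than the deviation itself, can be sketched numerically. The thresholds and the toy ramp signal are illustrative assumptions; the paper's predictions of expected values come from a Random Forest.

```python
import numpy as np

def first_alarm(actual, expected, dt=1.0, dev_thr=5.0, ddev_thr=1.5):
    """Return (early, classic): first indices where the derivative of the
    deviation, and the raw deviation, cross their thresholds."""
    dev = np.abs(np.asarray(actual) - np.asarray(expected))
    ddev = np.gradient(dev, dt)               # derivative of the deviation
    classic = int(np.argmax(dev > dev_thr)) if (dev > dev_thr).any() else None
    early = int(np.argmax(ddev > ddev_thr)) if (ddev > ddev_thr).any() else None
    return early, classic

expected = np.zeros(10)                       # model-predicted engine variable
actual = np.array([0, 0, 0, 0, 2, 4, 6, 8, 10, 12], float)  # ramping fault
early, classic = first_alarm(actual, expected)
```

On this ramp the derivative-based alert fires two samples before the conventional threshold alarm, which is exactly the lead time the method trades on.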
cs.AI / 6 / 2603.12740

ToolTree: Efficient LLM Agent Tool Planning via Dual-Feedback Monte Carlo Tree Search and Bidirectional Pruning

ToolTree:通过双反馈蒙特卡洛树搜索和双向剪枝实现高效的LLM代理工具规划
Yang, Shuo, Han, Soyeon Caren, Ding, Yihao, Wang, Shuhe, Hoy, Eduard
Abstract
Large Language Model (LLM) agents are increasingly applied to complex, multi-step tasks that require interaction with diverse external tools across various domains. However, current LLM agent tool planning methods typically rely on greedy, reactive tool selection strategies that lack foresight and fail to account for inter-tool dependencies. In this paper, we present ToolTree, a novel Monte Carlo tree search-inspired planning paradigm for tool planning. ToolTree explores possible tool usage trajectories using a dual-stage LLM evaluation and bidirectional pruning mechanism that enables the agent to make informed, adaptive decisions over extended tool-use sequences while pruning less promising branches before and after the tool execution. Empirical evaluations across both open-set and closed-set tool planning tasks on 4 benchmarks demonstrate that ToolTree consistently improves performance while keeping the highest efficiency, achieving an average gain of around 10% compared to the state-of-the-art planning paradigm.
Chinese Translation
大型语言模型(LLM)代理越来越多地应用于需要与各个领域的多种外部工具进行交互的复杂多步骤任务。然而,目前的LLM代理工具规划方法通常依赖于贪婪的、反应式的工具选择策略,这些策略缺乏前瞻性,未能考虑工具之间的依赖关系。本文提出了ToolTree,一种基于蒙特卡洛树搜索的新颖工具规划范式。ToolTree通过双阶段LLM评估和双向剪枝机制探索可能的工具使用轨迹,使代理能够在延长的工具使用序列中做出知情的、自适应的决策,同时在工具执行前后剪除不太有前景的分支。在4个基准测试中,对开放集和闭合集工具规划任务的实证评估表明,ToolTree在保持最高效率的同时,始终提高了性能,与最先进的规划范式相比,平均提升约10%。
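The bidirectional pruning described above can be sketched in a simplified, exhaustive form: candidate tool sequences are gated by a cheap score before execution and by an execution-feedback score afterwards. The toy tools and both scoring functions are hypothetical stand-ins for the paper's dual-stage LLM evaluations, and the tree search is flattened to enumeration for brevity.

```python
import itertools

TOOLS = ["search", "calc", "summarize"]

def pre_score(seq):
    """Cheap heuristic gate before execution (stand-in for the first LLM stage)."""
    return 1.0 if seq[0] == "search" else 0.2

def post_score(seq):
    """'Execution feedback' after running the tools (stand-in for the second stage);
    here it simply rewards diverse, non-repetitive plans."""
    return len(set(seq)) / len(seq)

def plan(depth=2, pre_thr=0.5, post_thr=0.9):
    best, best_s = None, -1.0
    for seq in itertools.product(TOOLS, repeat=depth):
        if pre_score(seq) < pre_thr:          # forward (pre-execution) pruning
            continue
        s = post_score(seq)
        if s < post_thr:                      # backward (post-execution) pruning
            continue
        if s > best_s:
            best, best_s = list(seq), s
    return best
```

The actual method grows the tree selectively MCTS-style instead of enumerating, so pruning pays off on long horizons.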
cs.AI / 7 / 2603.12755

AI Model Modulation with Logits Redistribution

通过逻辑分布重分配进行人工智能模型调制
Wang, Zihan, Ma, Zhongkui, Feng, Xinguo, Mei, Zhiyang, Ma, Ethan, Wang, Derui, Xue, Minhui, Bai, Guangdong
Abstract
Large-scale models are typically adapted to meet the diverse requirements of model owners and users. However, maintaining multiple specialized versions of the model is inefficient. In response, we propose AIM, a novel model modulation paradigm that enables a single model to exhibit diverse behaviors to meet the specific end requirements. AIM enables two key modulation modes: utility and focus modulations. The former provides model owners with dynamic control over output quality to deliver varying utility levels, and the latter offers users precise control to shift model's focused input features. AIM introduces a logits redistribution strategy that operates in a training data-agnostic and retraining-free manner. We establish a formal foundation to ensure AIM's regulation capability, based on the statistical properties of logits ordering via joint probability distributions. Our evaluation confirms AIM's practicality and versatility for AI model modulation, with tasks spanning image classification, semantic segmentation and text generation, and prevalent architectures including ResNet, SegFormer and Llama.
Chinese Translation
大规模模型通常需要适应模型拥有者和用户的多样化需求。然而,维护多个专门版本的模型效率低下。为此,我们提出了AIM,一种新颖的模型调制范式,使单一模型能够展现多样化的行为,以满足特定的最终需求。AIM支持两种关键的调制模式:效用调制和聚焦调制。前者为模型拥有者提供了对输出质量的动态控制,以提供不同的效用水平,而后者则为用户提供了精确控制,以调整模型的聚焦输入特征。AIM引入了一种逻辑重分配策略,该策略在不依赖训练数据和无需重新训练的情况下运作。我们建立了一个正式的基础,以确保AIM的调节能力,基于逻辑排序的统计特性,通过联合概率分布进行分析。我们的评估确认了AIM在人工智能模型调制中的实用性和多样性,涵盖了图像分类、语义分割和文本生成等任务,以及包括ResNet、SegFormer和Llama等流行架构。
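One way to picture a retraining-free logits redistribution is a rank-preserving flattening rule: interpolate the sorted logit values toward their mean while keeping each class's rank, degrading output confidence without touching model weights. This rule is an illustrative assumption, not the paper's strategy.

```python
import numpy as np

def modulate(logits, strength=0.5):
    """strength=0 -> unchanged; strength=1 -> fully flattened values.
    Ranks (and thus the argmax) are preserved at any strength < 1."""
    order = np.argsort(logits)                   # ranks to preserve
    flat = np.full_like(logits, logits.mean())
    values = (1 - strength) * np.sort(logits) + strength * np.sort(flat)
    out = np.empty_like(logits)
    out[order] = values                          # reassign values by original rank
    return out

z = np.array([3.0, 1.0, 0.0])
z_half = modulate(z, 0.5)
```

Because the redistribution only remaps values by rank, it needs no training data and no retraining, which is the property the utility-modulation mode relies on.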
cs.AI / 8 / 2603.12813

Context is all you need: Towards autonomous model-based process design using agentic AI in flowsheet simulations

上下文是你所需的一切:基于模型的自主过程设计在流程图模拟中使用代理人工智能的探索
Schäfer, Pascal, Krinke, Lukas J., Wlotzka, Martin, Asprion, Norbert
Abstract
Agentic AI systems integrating large language models (LLMs) with reasoning and tool-use capabilities are transforming various domains - in particular, software development. In contrast, their application in chemical process flowsheet modelling remains largely unexplored. In this work, we present an agentic AI framework that delivers assistance in an industrial flowsheet simulation environment. To this end, we show the capabilities of GitHub Copilot (GitHub, Inc., 2026), when using state-of-the-art LLMs, such as Claude Opus 4.6 (Anthropic, PBC, 2026), to generate valid syntax for our in-house process modelling tool Chemasim using the technical documentation and a few commented examples as context. Based on this, we develop a multi-agent system that decomposes process development tasks with one agent solving the abstract problem using engineering knowledge and another agent implementing the solution as Chemasim code. We demonstrate the effectiveness of our framework for typical flowsheet modelling examples, including (i) a reaction/separation process, (ii) a pressure-swing distillation, and (iii) a heteroazeotropic distillation including entrainer selection. Along these lines, we discuss current limitations of the framework and outline future research directions to further enhance its capabilities.
Chinese Translation
集成了大型语言模型(LLMs)与推理和工具使用能力的代理人工智能系统正在改变各个领域,尤其是软件开发。相比之下,它们在化学过程流程图建模中的应用仍然基本未被探索。在本研究中,我们提出了一个代理人工智能框架,旨在为工业流程图模拟环境提供支持。为此,我们展示了使用最先进的LLMs(如Claude Opus 4.6(Anthropic, PBC, 2026))时GitHub Copilot(GitHub, Inc., 2026)的能力,以生成我们内部过程建模工具Chemasim的有效语法,使用技术文档和一些带注释的示例作为上下文。在此基础上,我们开发了一个多智能体系统,该系统将过程开发任务进行分解,其中一个智能体利用工程知识解决抽象问题,另一个智能体则将解决方案实现为Chemasim代码。我们展示了该框架在典型流程图建模示例中的有效性,包括(i)反应/分离过程,(ii)变压蒸馏,以及(iii)包括夹带剂选择的非均相共沸蒸馏。沿着这些思路,我们讨论了该框架的当前局限性,并概述了未来的研究方向,以进一步增强其能力。
cs.AI / 9 / 2603.12926

ODRL Policy Comparison Through Normalisation

通过规范化比较 ODRL 策略
Salas, Jaime Osvaldo, Pareti, Paolo, Konstantinidis, George
Abstract
The ODRL language has become the standard for representing policies and regulations for digital rights. However its complexity is a barrier to its usage, which has caused many related theoretical and practical works to focus on different, and not interoperable, fragments of ODRL. Moreover, semantically equivalent policies can be expressed in numerous different ways, which makes comparing them and processing them harder. Building on top of a recently defined semantics, we tackle these problems by proposing an approach that involves a parametrised normalisation of ODRL policies into its minimal components which reformulates policies with permissions and prohibitions into policies with permissions exclusively, and simplifies complex logic constraints into simple ones. We provide algorithms to compute a normal form for ODRL policies and to simplify numerical and symbolic constraints. We prove that these algorithms preserve the semantics of policies, and analyse the size complexity of the result, which is exponential in the number of attributes and linear in the number of unique values for these attributes. We show how this makes complex policies representable in more basic fragments of ODRL, and how it reduces the problem of policy comparison to the simpler problem of checking if two rules are identical.
Chinese Translation
ODRL 语言已成为表示数字权利政策和法规的标准。然而,其复杂性成为了使用的障碍,导致许多相关的理论和实践工作集中在不同且不兼容的 ODRL 片段上。此外,语义上等价的政策可以用多种不同的方式表达,这使得比较和处理这些政策变得更加困难。在最近定义的语义基础上,我们通过提出一种方法来解决这些问题,该方法涉及将 ODRL 政策参数化规范化为其最小组件,将包含许可和禁止的政策重新表述为仅包含许可的政策,并将复杂的逻辑约束简化为简单的约束。我们提供了计算 ODRL 政策的规范形式以及简化数值和符号约束的算法。我们证明了这些算法保持政策的语义,并分析了结果的大小复杂性,该复杂性在属性数量上是指数级的,而在这些属性的唯一值数量上是线性级的。我们展示了这如何使复杂政策能够在 ODRL 的更基本片段中表示,以及这如何将政策比较的问题简化为检查两个规则是否相同的更简单问题。
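The permissions-only reformulation above can be illustrated on a single finite attribute: a prohibition over some values is rewritten as permissions over the complementary values. The dictionary encoding below is an assumption for illustration, not the paper's ODRL algorithms, and handling one attribute is the source of the linear factor in the stated complexity (the exponential factor comes from combining attributes).

```python
DOMAIN = {"purpose": {"research", "commerce", "education"}}

def normalise(policy):
    """Rewrite a mixed permission/prohibition policy over the 'purpose'
    attribute into atomic permissions only."""
    allowed = set(DOMAIN["purpose"])
    for rule in policy:
        if rule["type"] == "permission":
            allowed &= rule["values"]        # restrict to permitted values
        elif rule["type"] == "prohibition":
            allowed -= rule["values"]        # remove prohibited values
    return [{"type": "permission", "values": {v}} for v in sorted(allowed)]

p = [{"type": "permission", "values": {"research", "commerce", "education"}},
     {"type": "prohibition", "values": {"commerce"}}]
normal = normalise(p)
```

Once both policies are in this atomic permissions-only form, comparing them reduces to checking whether the rule sets are identical.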
cs.AI / 10 / 2603.12933

Efficient and Interpretable Multi-Agent LLM Routing via Ant Colony Optimization

基于蚁群优化的高效可解释多智能体大语言模型路由
Wang, Xudong, Zhang, Chaoning, Zhang, Jiaquan, Li, Chenghao, Sun, Qigan, Bae, Sung-Ho, Wang, Peng, Xie, Ning, Zou, Jie, Yang, Yang, Shen, Hengtao
Abstract
Large Language Model (LLM)-driven Multi-Agent Systems (MAS) have demonstrated strong capability in complex reasoning and tool use, and heterogeneous agent pools further broaden the quality–cost trade-off space. Despite these advances, real-world deployment is often constrained by high inference cost, latency, and limited transparency, which hinders scalable and efficient routing. Existing routing strategies typically rely on expensive LLM-based selectors or static policies, and offer limited controllability for semantic-aware routing under dynamic loads and mixed intents, often resulting in unstable performance and inefficient resource utilization. To address these limitations, we propose AMRO-S, an efficient and interpretable routing framework for Multi-Agent Systems (MAS). AMRO-S models MAS routing as a semantic-conditioned path selection problem, enhancing routing performance through three key mechanisms: First, it leverages a supervised fine-tuned (SFT) small language model for intent inference, providing a low-overhead semantic interface for each query; second, it decomposes routing memory into task-specific pheromone specialists, reducing cross-task interference and optimizing path selection under mixed workloads; finally, it employs a quality-gated asynchronous update mechanism to decouple inference from learning, optimizing routing without increasing latency. Extensive experiments on five public benchmarks and high-concurrency stress tests demonstrate that AMRO-S consistently improves the quality–cost trade-off over strong routing baselines, while providing traceable routing evidence through structured pheromone patterns.
Chinese Translation
基于大语言模型(LLM)的多智能体系统(MAS)在复杂推理和工具使用方面展现了强大的能力,而异构智能体池进一步拓宽了质量与成本的权衡空间。尽管取得了这些进展,现实世界的部署往往受到高推理成本、延迟和有限透明度的限制,这阻碍了可扩展和高效的路由。现有的路由策略通常依赖于昂贵的基于LLM的选择器或静态策略,并且在动态负载和混合意图下提供有限的语义感知路由可控性,常常导致性能不稳定和资源利用效率低下。为了解决这些限制,我们提出了AMRO-S,一个高效且可解释的多智能体系统(MAS)路由框架。AMRO-S将MAS路由建模为一个语义条件下的路径选择问题,通过三个关键机制增强路由性能:首先,它利用一个经过监督微调(SFT)的较小语言模型进行意图推断,为每个查询提供低开销的语义接口;其次,它将路由记忆分解为任务特定的信息素专家,减少跨任务干扰并优化混合工作负载下的路径选择;最后,它采用质量门控的异步更新机制,将推理与学习解耦,优化路由而不增加延迟。在五个公共基准和高并发压力测试中的广泛实验表明,AMRO-S在强路由基线之上始终改善了质量与成本的权衡,同时通过结构化的信息素模式提供可追溯的路由证据。
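The pheromone mechanics described above can be sketched in a few lines: per-task pheromone tables drive weighted agent selection, with evaporation plus a quality-gated deposit on each outcome. The agents, rates, and gate value are illustrative assumptions, not the paper's configuration.

```python
import random

pheromone = {"math": {"agent_a": 1.0, "agent_b": 1.0},
             "code": {"agent_a": 1.0, "agent_b": 1.0}}

def route(task, rng=random.Random(0)):
    """Sample an agent with probability proportional to task-specific pheromone."""
    agents, weights = zip(*pheromone[task].items())
    return rng.choices(agents, weights=weights, k=1)[0]

def update(task, agent, quality, decay=0.9, gate=0.5):
    table = pheromone[task]
    for a in table:
        table[a] *= decay                   # evaporation on every outcome
    if quality >= gate:                     # quality-gated deposit
        table[agent] += quality

for _ in range(20):                         # agent_b keeps succeeding on math
    update("math", "agent_b", quality=1.0)
```

Because tables are per-task, repeated math successes never touch the `code` table, which is the cross-task-interference point; the structured pheromone values themselves are what makes routing decisions traceable.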
cs.AI / 11 / 2603.13017

Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation

个性化智能体记忆的结构化蒸馏:检索保留下的11倍令牌减少
Lewis, Sydney
Abstract
Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study personalized agent memory: one user's conversation history with an agent, distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched). The searchable distilled text averages 38 tokens per exchange. Applied to 4,182 conversations (14,340 exchanges) from 6 software engineering projects, the method reduces average exchange length from 371 to 38 tokens, yielding 11x compression. We evaluate whether personalized recall survives that compression using 201 recall-oriented queries, 107 configurations spanning 5 pure and 5 cross-layer search modes, and 5 LLM graders (214,519 consensus-graded query-result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745). Results are mechanism-dependent. All 20 vector search configurations remain non-significant after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single-user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill-down. We release the implementation and analysis pipeline as open-source software.
Chinese Translation
与AI智能体的长时间对话为用户带来了一个简单的问题:历史记录是有用的,但逐字携带它的成本很高。我们研究个性化智能体记忆:用户与智能体的对话历史被蒸馏成一个紧凑的检索层,以便后续搜索。每次交流被压缩为一个包含四个字段的复合对象(exchange_core、specific_context、thematic room_assignments 和 regex-extracted files_touched)。可搜索的蒸馏文本平均每次交流38个令牌。该方法应用于来自6个软件工程项目的4,182次对话(14,340次交流),将平均交流长度从371个令牌减少到38个,达到了11倍的压缩。我们使用201个以回忆为导向的查询、107种配置(涵盖5种纯搜索模式和5种跨层搜索模式)以及5个LLM评分者(214,519对共识评分的查询结果)来评估个性化回忆在这种压缩下是否依然有效。最佳的纯蒸馏配置达到了最佳逐字基线平均倒数排名(MRR)的96%(0.717对比0.745)。结果依赖于机制。经过Bonferroni校正后,所有20个向量搜索配置均未显著,而所有20个BM25配置则显著降级(效应大小 |d|=0.031-0.756)。最佳的跨层设置略微超过了最佳的纯逐字基线(MRR 0.759)。结构化蒸馏在不均匀牺牲检索质量的情况下压缩了单用户智能体记忆。在1/11的上下文成本下,数千次交流可以适应于单个提示,同时逐字源仍可用于深入探讨。我们将实现和分析管道作为开源软件发布。
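The four-field compound object described above can be sketched schematically. How each field is actually populated (summarization, room assignment) is not specified here; the slicing, room tag, and file-path regex below are illustrative assumptions except that `files_touched` is regex-extracted, as the abstract states.

```python
import re

def distill(exchange_text, room="backend"):
    """Compress one exchange into the four-field searchable object."""
    return {
        "exchange_core": exchange_text[:120],       # terse summary slot (stand-in)
        "specific_context": exchange_text[120:180],
        "room_assignments": [room],                 # thematic tag
        "files_touched": re.findall(r"[\w/]+\.\w{1,4}", exchange_text),
    }

msg = "Refactored src/db/session.py to pool connections; tests in tests/test_db.py pass."
obj = distill(msg)
```

Only the distilled text is indexed for search; the verbatim exchange stays on disk for drill-down, which is how the 11x context saving avoids information loss.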
cs.AI / 12 / 2603.13099

Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation

超越最终答案:CRYSTAL基准用于透明的多模态推理评估
Barrios, Wayner, Jin, SouYoung
Abstract
We introduce **CRYSTAL** (*__C__lear __R__easoning via __Y__ielded __S__teps, __T__raceability and __L__ogic*), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: *Match F1*, which scores step-level precision and recall via semantic similarity matching, and *Ordered Match F1*, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline where four independent MLLMs generate trajectories, aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures invisible to accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning where no competitive model preserves more than 60% of matched steps in correct order. Beyond evaluation, we propose the **Causal Process Reward (CPR)**, a multiplicative reward that couples answer correctness with step-level alignment, and **CPR-Curriculum**, which progressively increases reasoning difficulty during training. CPR-Curriculum achieves +32% Match F1 via GRPO where additive reward strategies fail, improving reasoning without manual step annotation.
Chinese Translation
我们介绍了**CRYSTAL**(*清晰推理通过产生步骤、可追溯性和逻辑*),这是一个包含6,372个实例的诊断基准,通过可验证的中间步骤评估多模态推理。我们提出了两个互补的指标:*Match F1*,通过语义相似性匹配对步骤级的精确度和召回率进行评分,以及*Ordered Match F1*,进一步惩罚无序的推理链。参考数据通过一个受德尔菲法启发的流程构建,其中四个独立的多模态大语言模型(MLLMs)生成轨迹,通过语义聚类进行汇总,并通过人工质量门进行验证。对20个多模态大语言模型的评估,包括在基准构建过程中未使用的商业前沿系统,揭示了在准确性上看不见的系统性失败:普遍的选择性偏好(精确度远超召回率)、非单调缩放权衡,以及无序推理,竞争模型中没有一个能保持超过60%的匹配步骤按正确顺序排列。除了评估,我们还提出了**因果过程奖励(CPR)**,这是一种将答案正确性与步骤级对齐相结合的乘法奖励,以及**CPR-课程**,在训练过程中逐步增加推理难度。CPR-课程通过GRPO实现了+32%的Match F1,而加性奖励策略未能实现,改善了推理而无需手动步骤注释。
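The two metrics above can be sketched in a simplified form, using exact string match in place of semantic similarity matching (an assumption for brevity): precision/recall over matched steps for Match F1, and an order penalty for the Ordered variant. The exact order-penalty formula below is illustrative, not the benchmark's definition.

```python
def match_f1(pred_steps, ref_steps):
    """Return (f1, ordered_f1) for predicted vs reference reasoning steps."""
    matched = [s for s in pred_steps if s in ref_steps]
    if not matched:
        return 0.0, 0.0
    p = len(matched) / len(pred_steps)          # step-level precision
    r = len(set(matched)) / len(ref_steps)      # step-level recall
    f1 = 2 * p * r / (p + r)
    # ordered variant: scale by the longest run of matches in reference order
    idx = [ref_steps.index(s) for s in matched]
    in_order = sum(a < b for a, b in zip(idx, idx[1:])) + 1
    return f1, f1 * in_order / len(matched)

# All three steps match, but "c" and "b" are swapped relative to the reference
f1, of1 = match_f1(["a", "c", "b"], ["a", "b", "c"])
```

The swap leaves Match F1 perfect while Ordered Match F1 drops, which is exactly the disordered-reasoning failure the benchmark is built to surface.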
cs.AI / 13 / 2603.13131

Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation

Steve-Evolving:通过细粒度诊断和双轨知识蒸馏实现开放世界具身自我进化
Xie, Zhengwei, Chen, Zhisheng, Weng, Ziyan, Wu, Tingyu, Li, Chenglong, Zhang, Vireo, Wang, Kun
Abstract
Open-world embodied agents must solve long-horizon tasks where the main bottleneck is not single-step planning quality but how interaction experience is organized and evolved. To this end, we present Steve-Evolving, a non-parametric self-evolving framework that tightly couples fine-grained execution diagnosis with dual-track knowledge distillation in a closed loop. The method follows three phases: Experience Anchoring, Experience Distillation, and Knowledge-Driven Closed-Loop Control. In detail, Experience Anchoring solidifies each subgoal attempt into a structured experience tuple with a fixed schema (pre-state, action, diagnosis-result, and post-state) and organizes it in a three-tier experience space with multi-dimensional indices (e.g., condition signatures, spatial hashing, and semantic tags) plus rolling summarization for efficient and auditable recall. To ensure sufficient information density for attribution, the execution layer provides compositional diagnosis signals beyond binary outcomes, including state-difference summaries, enumerated failure causes, continuous indicators, and stagnation/loop detection. Moreover, in Experience Distillation, successful trajectories are generalized into reusable skills with explicit preconditions and verification criteria, while failures are distilled into executable guardrails that capture root causes and forbid risky operations at both subgoal and task granularities. Finally, in Knowledge-Driven Closed-Loop Control, retrieved skills and guardrails are injected into an LLM planner, and diagnosis-triggered local replanning updates the active constraints online, forming a continual evolution process without any model parameter updates. Experiments on the long-horizon suite of Minecraft MCU demonstrate consistent improvements over static-retrieval baselines.
Chinese Translation
开放世界具身智能体必须解决长时间跨度的任务,其中主要瓶颈不在于单步规划的质量,而在于如何组织和进化交互经验。为此,我们提出了Steve-Evolving,一个非参数自我进化框架,它将细粒度执行诊断与双轨知识蒸馏紧密结合在一个闭环中。该方法遵循三个阶段:经验锚定、经验蒸馏和知识驱动的闭环控制。具体而言,经验锚定将每个子目标尝试固化为一个具有固定模式的结构化经验元组(前状态、动作、诊断结果和后状态),并在一个具有多维索引(例如条件签名、空间哈希和语义标签)的三层经验空间中组织,同时进行滚动总结以实现高效和可审计的回忆。为了确保足够的信息密度以便归因,执行层提供了超越二元结果的组合诊断信号,包括状态差异总结、列举的失败原因、连续指标以及停滞/循环检测。此外,成功的经验蒸馏轨迹被概括为具有明确前提条件和验证标准的可重用技能,而失败则被蒸馏为可执行的保护措施,捕捉根本原因并禁止在子目标和任务粒度上的风险操作。此外,知识驱动的闭环控制将检索到的技能和保护措施注入到一个大型语言模型(LLM)规划器中,诊断触发的局部重规划在线更新活动约束,形成一个持续进化的过程,而无需任何模型参数更新。在Minecraft MCU的长时间跨度任务套件上的实验表明,相较于静态检索基线,表现出了一致的改进。
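The fixed experience schema and the tag-indexed experience space described above can be sketched in a few lines. This is a toy illustration only: the field contents and tag names are made up, and the real system also indexes by condition signatures and spatial hashes and keeps rolling summaries per tier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Experience:
    # Fixed schema named in the abstract; example contents are invented.
    pre_state: str
    action: str
    diagnosis: str    # e.g. "success", "missing_tool", "loop_detected"
    post_state: str
    tags: tuple = ()  # semantic tags, one of several retrieval indices

class ExperienceSpace:
    """Single-index toy store: add experiences, recall them by tag."""
    def __init__(self):
        self._by_tag = {}
        self._all = []

    def add(self, exp):
        self._all.append(exp)
        for t in exp.tags:
            self._by_tag.setdefault(t, []).append(exp)

    def recall(self, tag):
        return list(self._by_tag.get(tag, []))
```

A retrieved `Experience` whose `diagnosis` is a failure cause is the raw material for the guardrail track, while success tuples feed the skill track.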
cs.AI / 14 / 2603.13134

When Right Meets Wrong: Bilateral Context Conditioning with Reward-Confidence Correction for GRPO

当正确遇到错误:带有奖励-置信度校正的双边上下文条件化用于GRPO
Li, Yu, Lan, Tian, Qi, Zhengling
Abstract
Group Relative Policy Optimization (GRPO) has emerged as an effective method for training reasoning models. While it computes advantages based on the group mean, GRPO treats each output as an independent sample during the optimization and overlooks a vital structural signal: the natural contrast between correct and incorrect solutions within the same group, thus ignoring the rich, comparative data that could be leveraged by explicitly pitting successful reasoning traces against failed ones. To capitalize on this, we present a contrastive reformulation of GRPO, showing that the GRPO objective implicitly maximizes the margin between the policy ratios of correct and incorrect samples. Building on this insight, we propose Bilateral Context Conditioning (BICC), a mechanism that allows the model to cross-reference successful and failed reasoning traces during the optimization, enabling a direct information flow across samples. We further introduce Reward-Confidence Correction (RCC), which stabilizes training by dynamically adjusting the advantage baseline in GRPO using the reward-confidence covariance derived from a first-order approximation of the variance-minimizing estimator. Both mechanisms require no additional sampling or auxiliary models and can be adapted to all GRPO variants. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements across a comprehensive range of models and algorithms. Code is available at \href{https://github.com/Skylanding/BiCC}{https://github.com/Skylanding/BiCC}.
Chinese Translation
群体相对策略优化(GRPO)已成为训练推理模型的有效方法。尽管它基于群体均值计算优势,GRPO在优化过程中将每个输出视为独立样本,忽视了一个重要的结构信号:同一组内正确与错误解决方案之间的自然对比,从而忽略了通过明确对比成功的推理轨迹与失败的推理轨迹所能利用的丰富比较数据。为了利用这一点,我们提出了GRPO的对比重构,表明GRPO目标隐含地最大化正确样本与错误样本的策略比率之间的边际。基于这一见解,我们提出了双边上下文条件化(BICC),一种机制,允许模型在优化过程中交叉参考成功与失败的推理轨迹,从而实现样本之间的直接信息流动。我们进一步引入奖励-置信度校正(RCC),通过使用基于方差最小化估计器的一阶近似导出的奖励-置信度协方差动态调整GRPO中的优势基线,以稳定训练。这两种机制不需要额外的采样或辅助模型,并且可以适应所有GRPO变体。在数学推理基准上的实验显示了全面模型和算法的一致性改进。代码可在 https://github.com/Skylanding/BiCC 获取。
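The group-mean baseline that the contrastive reformulation builds on is the standard GRPO advantage, which can be sketched directly. With binary rewards, correct samples receive positive advantages and incorrect ones negative, which is the implicit correct-vs-incorrect margin the paper makes explicit; BICC and RCC then modify this computation and are not reproduced here.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Vanilla GRPO advantage: center each sampled output's reward on
    the group mean and scale by the group standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group of two correct and two incorrect rollouts, the advantages are symmetric around zero, so pushing up the correct samples' policy ratios necessarily pushes down the incorrect ones.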
cs.AI / 15 / 2603.13168

Developing and evaluating a chatbot to support maternal health care

开发与评估支持母婴健康护理的聊天机器人
Jha, Smriti, Jain, Vidhi, Xu, Jianyu, Liu, Grace, Ramesh, Sowmya, Nagpal, Jitender, Chapman, Gretchen, Bellows, Benjamin, Goyal, Siddhartha, Singh, Aarti, Wilder, Bryan
Abstract
The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource settings where users have low health literacy and limited access to care. However, deploying such systems is technically challenging: user queries are short, underspecified, and code-mixed across languages, answers require regional context-specific grounding, and partial or missing symptom context makes safe routing decisions difficult. We present a chatbot for maternal health in India developed through a partnership between academic researchers, a health tech company, a public health nonprofit, and a hospital. The system combines (1) stage-aware triage, routing high-risk queries to expert templates, (2) hybrid retrieval over curated maternal/newborn guidelines, and (3) evidence-conditioned generation from an LLM. Our core contribution is an evaluation workflow for high-stakes deployment under limited expert supervision. Targeting both component-level and end-to-end testing, we introduce: (i) a labeled triage benchmark (N=150) achieving 86.7% emergency recall, explicitly reporting the missed-emergency vs. over-escalation trade-off; (ii) a synthetic multi-evidence retrieval benchmark (N=100) with chunk-level evidence labels; (iii) LLM-as-judge comparison on real queries (N=781) using clinician-codesigned criteria; and (iv) expert validation. Our findings show that trustworthy medical assistants in multilingual, noisy settings require defense-in-depth design paired with multi-method evaluation, rather than any single model and evaluation method choice.
Chinese Translation
利用基于电话的聊天机器人提供可靠的母婴健康信息的能力可以产生显著影响,特别是在资源匮乏的环境中,用户的健康素养较低且获得护理的机会有限。然而,部署此类系统在技术上具有挑战性:用户查询通常简短、不明确,并且在多种语言中混合使用,答案需要特定于地区的背景支持,而部分或缺失的症状背景使得安全的转诊决策变得困难。我们展示了一个为印度母婴健康开发的聊天机器人,该项目是由学术研究人员、健康科技公司、公共卫生非营利组织和医院之间的合作成果。该系统结合了(1)阶段感知的分诊,将高风险查询路由到专家模板,(2)基于策划的母婴指南的混合检索,以及(3)基于大语言模型(LLM)的证据条件生成。我们的核心贡献是一个在有限专家监督下进行高风险部署的评估工作流程。针对组件级和端到端测试,我们引入了:(i)一个标记的分诊基准(N=150),实现了86.7%的紧急召回率,明确报告了漏报紧急情况与过度升级之间的权衡;(ii)一个合成的多证据检索基准(N=100),具有块级证据标签;(iii)在真实查询上进行的LLM作为评判者的比较(N=781),使用临床医生设计的标准;以及(iv)专家验证。我们的研究结果表明,在多语言、嘈杂环境中,可靠的医疗助手需要深度防御设计与多方法评估相结合,而不是依赖任何单一模型和评估方法的选择。
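The missed-emergency vs. over-escalation trade-off reported for the triage benchmark reduces to two rates over a confusion matrix. A minimal sketch (label encoding is an assumption; the paper's benchmark and criteria are richer):

```python
def triage_tradeoff(y_true, y_pred):
    """y_true / y_pred: 1 = emergency, 0 = routine. Returns
    (emergency recall, over-escalation rate) -- the two sides of the
    trade-off a safety-critical triage system must report together."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    over_escalation = fp / (fp + tn) if fp + tn else 0.0
    return recall, over_escalation
```

Reporting only the 86.7% recall would hide the cost side: lowering the routing threshold raises recall but also the over-escalation rate, so both numbers belong in the evaluation.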
cs.AI / 16 / 2603.13173

Semantic Invariance in Agentic AI

代理人工智能中的语义不变性
de Zarzà, I., de Curtò, J., Cabot, Jordi, Manzoni, Pietro, Calafate, Carlos T.
Abstract
Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordination systems. However, deploying LLM agents in consequential applications requires assurance that their reasoning remains stable under semantically equivalent input variations, a property we term semantic invariance. Standard benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension. To address this shortcoming, in this paper we present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents, applying eight semantic-preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) across seven foundation models spanning four distinct architectural families: Hermes (70B, 405B), Qwen3 (30B-A3B, 235B-A22B), DeepSeek-R1, and gpt-oss (20B, 120B). Our evaluation encompasses 19 multi-step reasoning problems across eight scientific domains. The results reveal that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91), while larger models exhibit greater fragility.
Chinese Translation
大型语言模型(LLMs)越来越多地作为自主推理代理在决策支持、科学问题解决和多代理协调系统中发挥作用。然而,在重要应用中部署LLM代理需要确保其推理在语义等价输入变体下保持稳定,这一特性我们称之为语义不变性。标准基准评估通过评估固定的、规范的问题表述的准确性,未能捕捉到这一关键的可靠性维度。为了解决这一不足,本文提出了一种变形测试框架,以系统性地评估LLM推理代理的鲁棒性,应用了八种保持语义的变换(身份、释义、事实重排序、扩展、收缩、学术背景、商业背景和对比表述),涵盖了四种不同架构系列的七个基础模型:Hermes(70B, 405B)、Qwen3(30B-A3B, 235B-A22B)、DeepSeek-R1和gpt-oss(20B, 120B)。我们的评估涵盖了八个科学领域的19个多步骤推理问题。结果显示,模型规模并不能预测鲁棒性:较小的Qwen3-30B-A3B实现了最高的稳定性(79.6%的不变响应,语义相似度0.91),而较大的模型则表现出更大的脆弱性。
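The metamorphic-testing loop behind the invariant-response rate can be sketched generically. Everything here is a stand-in: `respond`, the transformation list, the `similar` function, and the threshold `tau` are placeholders for the paper's eight semantic-preserving transformations and its semantic-similarity measure.

```python
def invariance_rate(respond, transforms, problems, similar, tau=0.9):
    """Fraction of (problem, transformation) pairs whose response stays
    semantically similar to the response on the canonical formulation."""
    total = invariant = 0
    for prob in problems:
        base = respond(prob)
        for t in transforms:
            total += 1
            if similar(base, respond(t(prob))) >= tau:
                invariant += 1
    return invariant / total if total else 0.0
```

The point of the metamorphic setup is that no gold answer is needed: the canonical response serves as its own reference, so any drift under a meaning-preserving rewrite counts against the model.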
计算语言学 (Computation and Language)
40
cs.CL / 1 / 2603.12270

Task-Specific Knowledge Distillation via Intermediate Probes

通过中间探针进行任务特定的知识蒸馏
Brown, Ryan, Russell, Chris
Abstract
Knowledge distillation from large language models (LLMs) assumes that the teacher's output distribution is a high-quality training signal. On reasoning tasks, this assumption is frequently violated. A model's intermediate representations may encode the correct answer, yet this information is lost or distorted through the vocabulary projection, where prompt formatting and answer-token choices create brittle, noisy outputs. We introduce \method{}, a distillation framework that bypasses this bottleneck by training lightweight probes on frozen teacher hidden states and using the probe's predictions, rather than output logits, as supervision for student training. This simple change yields consistent improvements across four reasoning benchmarks (AQuA-RAT, ARC Easy/Challenge, and MMLU), with gains most pronounced under limited data. Probes trained on intermediate representations provide cleaner labels than the teacher's own outputs, effectively denoising the distillation signal. \method{} requires no architectural changes to student or teacher, is architecture-agnostic, and adds minimal compute since probe training is cheap and teacher representations can be cached. By exploiting internal representations, \method{} enables practitioners to extract more value from large teacher models without additional training data or architectural complexity.
Chinese Translation
从大型语言模型(LLMs)进行知识蒸馏假设教师的输出分布是高质量的训练信号。然而,在推理任务中,这一假设常常被违反。模型的中间表示可能编码了正确的答案,但通过词汇投影,这些信息会丢失或失真,其中提示格式和答案标记的选择会产生脆弱且嘈杂的输出。我们提出了\method{},一种蒸馏框架,通过在冻结的教师隐藏状态上训练轻量级探针,并使用探针的预测而非输出对数作为学生训练的监督,从而绕过这一瓶颈。这一简单的改变在四个推理基准(AQuA-RAT、ARC Easy/Challenge和MMLU)上带来了持续的改进,尤其在数据有限的情况下效果最为显著。基于中间表示训练的探针提供了比教师自身输出更干净的标签,有效地去噪了蒸馏信号。\method{}不需要对学生或教师进行架构上的改变,具有架构无关性,并且由于探针训练成本低且教师表示可以缓存,增加的计算量极小。通过利用内部表示,\method{}使从大型教师模型中提取更多价值成为可能,而无需额外的训练数据或架构复杂性。
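The probe step (fit a lightweight classifier on frozen hidden states, then use its predictions as soft labels) can be sketched with a pure-Python softmax probe. This is only an illustration of the idea: the hyperparameters, toy feature vectors, and gradient-descent fit are assumptions, and the teacher is represented by cached hidden-state vectors that are never updated.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train_probe(hiddens, labels, n_classes, lr=0.5, epochs=200, seed=0):
    """Fit a linear softmax probe on frozen hidden states with plain
    SGD; teacher weights are untouched, so representations can be
    cached once and reused."""
    rng = random.Random(seed)
    d = len(hiddens[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(n_classes)]
    for _ in range(epochs):
        for h, y in zip(hiddens, labels):
            p = softmax([sum(w_i * h_i for w_i, h_i in zip(w, h)) for w in W])
            for c in range(n_classes):
                g = p[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient
                for i in range(d):
                    W[c][i] -= lr * g * h[i]
    return W

def probe_soft_labels(W, h):
    """Soft labels for the student come from the probe's distribution,
    not from the teacher's vocabulary-projected output logits."""
    return softmax([sum(w_i * h_i for w_i, h_i in zip(w, h)) for w in W])
```

In practice the probe would be trained per layer on real hidden states; the design point is that the supervision signal bypasses the vocabulary projection entirely.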
cs.CL / 2 / 2603.12271

Diagnosing Retrieval Bias Under Multiple In-Context Knowledge Updates in Large Language Models

在大型语言模型中诊断多重上下文知识更新下的检索偏差
Qiao, Boyu, Guo, Sean, Yang, Xian, Li, Kun, Zhou, Wei, Hu, Songlin, Song, Yunya
Abstract
LLMs are widely used in knowledge-intensive tasks where the same fact may be revised multiple times within context. Unlike prior work focusing on one-shot updates or single conflicts, multi-update scenarios contain multiple historically valid versions that compete at retrieval, yet remain underexplored. This challenge resembles the AB-AC interference paradigm in cognitive psychology: when the same cue A is successively associated with B and C, the old and new associations compete during retrieval, leading to bias. Inspired by this, we introduce a Dynamic Knowledge Instance (DKI) evaluation framework, modeling multi-updates of the same fact as a cue paired with a sequence of updated values, and assess models via endpoint probing of the earliest (initial) and latest (current) states. Across diverse LLMs, we observe that retrieval bias intensifies as updates increase, earliest-state accuracy stays high while latest-state accuracy drops substantially. Diagnostic analyses of attention, hidden-state similarity, and output logits further reveal that these signals become flatter and weakly discriminative on errors, providing little stable basis for identifying the latest update. Finally, cognitively inspired heuristic intervention strategies yield only modest gains and do not eliminate the bias. Our results reveal a persistent challenge in tracking and following knowledge updates in long contexts.
Chinese Translation
大型语言模型(LLMs)广泛应用于知识密集型任务,其中同一事实可能在上下文中被多次修订。与以往关注单次更新或单一冲突的研究不同,多次更新场景包含多个历史有效版本,这些版本在检索时相互竞争,但仍然未得到充分探索。这一挑战类似于认知心理学中的AB-AC干扰范式:当同一线索A依次与B和C关联时,旧的和新的关联在检索过程中相互竞争,导致偏差。受到此启发,我们提出了一种动态知识实例(Dynamic Knowledge Instance, DKI)评估框架,将同一事实的多次更新建模为与一系列更新值配对的线索,并通过对最早(初始)和最新(当前)状态的端点探测来评估模型。在不同的大型语言模型中,我们观察到随着更新次数的增加,检索偏差加剧,最早状态的准确性保持较高,而最新状态的准确性显著下降。对注意力、隐藏状态相似性和输出对数几率的诊断分析进一步揭示,这些信号在错误时变得更加平坦且区分能力较弱,为识别最新更新提供了很少的稳定基础。最后,受到认知启发的启发式干预策略仅带来了适度的收益,并未消除偏差。我们的结果揭示了在长上下文中跟踪和遵循知识更新的持续挑战。
cs.CL / 3 / 2603.12272

ActTail: Global Activation Sparsity in Large Language Models

ActTail:大型语言模型中的全局激活稀疏性
Hou, Wenwen, Song, Xinyuan, Liu, Shiwei
Abstract
Activation sparsity is a promising approach for accelerating large language model (LLM) inference by reducing computation and memory movement. However, existing activation sparsity methods typically apply uniform sparsity across projections, ignoring the heterogeneous statistical properties of Transformer weights and thereby amplifying performance degradation. In this paper, we propose ActTail, a TopK magnitude-based activation sparsity method with global activation sparsity allocation grounded in Heavy-Tailed Self-Regularization (HT-SR) theory. Specifically, we capture this heterogeneity via the heavy-tail exponent computed from each projection's empirical spectral density (ESD), which is used as a quantitative indicator to assign projection-specific sparsity budgets. Importantly, we provide a theoretical analysis that establishes an explicit relationship between the activation sparsity ratio and the heavy-tail exponent under the HT-SR regime, offering principled guidance for sparsity allocation beyond heuristic design. Experiments on LLaMA and Mistral models show that our method improves both perplexity and downstream task performance at high sparsity compared to uniform allocation. At 80% sparsity, perplexity is reduced by 21.8% on LLaMA-2-7B, 40.1% on LLaMA-2-13B, and 9.4% on Mistral-7B.
Chinese Translation
激活稀疏性是一种通过减少计算和内存移动来加速大型语言模型(LLM)推理的有前景的方法。然而,现有的激活稀疏性方法通常在投影上应用均匀稀疏性,忽视了Transformer权重的异质统计特性,从而加剧了性能下降。在本文中,我们提出了ActTail,一种基于TopK幅度的激活稀疏性方法,具有基于重尾自我正则化(Heavy-Tailed Self-Regularization, HT-SR)理论的全局激活稀疏性分配。具体而言,我们通过从每个投影的经验谱密度(Empirical Spectral Density, ESD)计算的重尾指数来捕捉这种异质性,该指数用作分配投影特定稀疏预算的定量指标。重要的是,我们提供了理论分析,建立了HT-SR机制下激活稀疏比率与重尾指数之间的明确关系,为超越启发式设计的稀疏分配提供了原则性指导。在LLaMA和Mistral模型上的实验表明,与均匀分配相比,我们的方法在高稀疏性下提高了困惑度和下游任务性能。在80%稀疏性下,LLaMA-2-7B的困惑度降低了21.8%,LLaMA-2-13B降低了40.1%,Mistral-7B降低了9.4%。
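The TopK magnitude kernel at the core of the method is straightforward to sketch. Note that ActTail's contribution is the *allocation*: each projection gets its own `sparsity` derived from the heavy-tail exponent of its spectral density, and that allocation rule is not reproduced here.

```python
def topk_sparsify(x, sparsity):
    """Keep the top-(1 - sparsity) fraction of activations by
    magnitude and zero the rest, reducing compute and memory traffic."""
    k = max(1, round(len(x) * (1.0 - sparsity)))
    kept = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    keep = set(kept)
    return [v if i in keep else 0.0 for i, v in enumerate(x)]
```

A uniform-sparsity baseline applies the same `sparsity` to every projection; ActTail instead raises the budget for projections whose spectra indicate they tolerate pruning poorly, which is where the perplexity gains at 80% sparsity come from.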
cs.CL / 4 / 2603.12273

Aligning Language Models from User Interactions

从用户交互中对齐语言模型
Buening, Thomas Kleine, Hübotter, Jonas, Pásztor, Barna, Shenfeld, Idan, Ramponi, Giorgia, Krause, Andreas
Abstract
Multi-turn user interactions are among the most abundant data produced by language models, yet we lack effective methods to learn from them. While typically discarded, these interactions often contain useful information: follow-up user messages may indicate that a response was incorrect, failed to follow an instruction, or did not align with the user's preferences. Importantly, language models are already able to make use of this information in context. After observing a user's follow-up, the same model is often able to revise its behavior. We leverage this ability to propose a principled and scalable method for learning directly from user interactions through self-distillation. By conditioning the model on the user's follow-up message and comparing the resulting token distribution with the original policy, we obtain a target for updating the policy that captures how the model's behavior changes in hindsight. We then distill this hindsight distribution back into the current policy. Remarkably, we show that training on real-world user conversations from WildChat improves language models across standard alignment and instruction-following benchmarks, without regressing other capabilities. The same mechanism enables personalization, allowing models to continually adapt to individual users through interaction without explicit feedback. Our results demonstrate that raw user interactions that arise naturally during deployment enable alignment, personalization, and continual adaptation.
Chinese Translation
多轮用户交互是语言模型产生的最丰富的数据之一,但我们缺乏有效的方法来从中学习。虽然这些交互通常被丢弃,但它们往往包含有用的信息:后续用户消息可能表明某个回应是错误的、未能遵循指令,或未能符合用户的偏好。重要的是,语言模型已经能够在上下文中利用这些信息。在观察到用户的后续消息后,同一模型通常能够修正其行为。我们利用这一能力提出了一种原则性且可扩展的方法,通过自蒸馏直接从用户交互中学习。通过将模型条件化于用户的后续消息,并将生成的标记分布与原始策略进行比较,我们获得了一个更新策略的目标,该目标捕捉了模型行为在事后如何变化。然后,我们将这种事后分布蒸馏回当前策略。值得注意的是,我们展示了在 WildChat 的真实用户对话上训练可以改善语言模型在标准对齐和指令遵循基准上的表现,而不会导致其他能力的退化。同样的机制使个性化成为可能,允许模型通过交互不断适应个别用户,而无需明确反馈。我们的结果表明,在部署过程中自然产生的原始用户交互能够实现对齐、个性化和持续适应。
cs.CL / 5 / 2603.12275

GONE: Structural Knowledge Unlearning via Neighborhood-Expanded Distribution Shaping

GONE:通过邻域扩展分布塑形实现结构知识的遗忘
Dahal, Chahana, Balasubramaniam, Ashutosh, Xiong, Zuobin
Abstract
Unlearning knowledge is a pressing and challenging task in Large Language Models (LLMs) because of their unprecedented capability to memorize and digest training data at scale, raising more significant issues regarding safety, privacy, and intellectual property. However, existing works, including parameter editing, fine-tuning, and distillation-based methods, are all focused on flat sentence-level data but overlook the relational, multi-hop, and reasoned knowledge in naturally structured data. In response to this gap, this paper introduces Graph Oblivion and Node Erasure (GONE), a benchmark for evaluating knowledge unlearning over structured knowledge graph (KG) facts in LLMs. This KG-based benchmark enables the disentanglement of three effects of unlearning: direct fact removal, reasoning-based leakage, and catastrophic forgetting. In addition, Neighborhood-Expanded Distribution Shaping (NEDS), a novel unlearning framework, is designed to leverage graph connectivity and identify anchor correlated neighbors, enforcing a precise decision boundary between the forgotten fact and its semantic neighborhood. Evaluations on LLaMA-3-8B and Mistral-7B across multiple knowledge editing and unlearning methods showcase NEDS's superior performance (1.000 on unlearning efficacy and 0.839 on locality) on GONE and other benchmarks. Code is available at https://anonymous.4open.science/r/GONE-4679/.
Chinese Translation
在大型语言模型(LLMs)中,遗忘知识是一项紧迫且具有挑战性的任务,因为它们具有前所未有的能力来大规模记忆和消化训练数据,这引发了关于安全性、隐私和知识产权的更大问题。然而,现有的研究,包括参数编辑、微调和基于蒸馏的方法,均集中于平面句子级数据,而忽视了自然结构数据中的关系、多跳和推理知识。针对这一空白,本文提出了图遗忘与节点擦除(Graph Oblivion and Node Erasure,GONE),这是一个用于评估结构知识图(KG)事实遗忘的基准,适用于LLMs。该基于KG的基准能够解构遗忘的三种效应:直接事实移除、基于推理的泄漏和灾难性遗忘。此外,邻域扩展分布塑形(Neighborhood-Expanded Distribution Shaping,NEDS)是一种新颖的遗忘框架,旨在利用图的连通性并识别锚定相关邻居,从而在遗忘事实与其语义邻域之间强制建立精确的决策边界。在LLaMA-3-8B和Mistral-7B上进行的多种知识编辑和遗忘方法的评估展示了NEDS在GONE及其他基准上的优越表现(遗忘效率为1.000,局部性为0.839)。代码可在 https://anonymous.4open.science/r/GONE-4679/ 获取。
cs.CL / 6 / 2603.12277

Prompt Injection as Role Confusion

提示注入作为角色混淆
Ye, Charles, Cui, Jasmine, Hadfield-Menell, Dylan
Abstract
Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer roles from how text is written, not where it comes from. We design novel role probes to capture how models internally identify "who is speaking." These reveal why prompt injection works: untrusted text that imitates a role inherits that role's authority. We test this insight by injecting spoofed reasoning into user prompts and tool outputs, achieving average success rates of 60% on StrongREJECT and 61% on agent exfiltration, across multiple open- and closed-weight models with near-zero baselines. Strikingly, the degree of internal role confusion strongly predicts attack success before generation begins. Our findings reveal a fundamental gap: security is defined at the interface but authority is assigned in latent space. More broadly, we introduce a unifying, mechanistic framework for prompt injection, demonstrating that diverse prompt-injection attacks exploit the same underlying role-confusion mechanism.
Chinese Translation
尽管经过广泛的安全培训,语言模型仍然容易受到提示注入攻击。我们将这一失败归因于角色混淆:模型根据文本的写作方式推断角色,而不是文本的来源。我们设计了新颖的角色探测器,以捕捉模型内部如何识别“谁在发言”。这些探测器揭示了提示注入为何有效:模仿某一角色的非可信文本继承了该角色的权威。我们通过将伪造的推理注入用户提示和工具输出进行测试,在多个开放和封闭权重模型上实现了在 StrongREJECT 上平均成功率为 60%,在代理外泄上为 61%,基线几乎为零。值得注意的是,内部角色混淆的程度在生成开始之前强烈预测了攻击的成功。我们的研究结果揭示了一个根本性的差距:安全性在接口处被定义,但权威在潜在空间中被分配。更广泛地说,我们引入了一个统一的、机械性的提示注入框架,展示了多样的提示注入攻击利用相同的潜在角色混淆机制。
cs.CL / 7 / 2603.12343

LLM-Augmented Therapy Normalization and Aspect-Based Sentiment Analysis for Treatment-Resistant Depression on Reddit

基于大语言模型增强的治疗抵抗性抑郁症的治疗规范化与基于方面的情感分析:以Reddit为例
Zhu, Yuxin, Lakamana, Sahithi, Rouhizadeh, Masoud, Bozkurt, Selen, Hershenberg, Rachel, Sarker, Abeed
Abstract
Treatment-resistant depression (TRD) is a severe form of major depressive disorder in which patients do not achieve remission despite multiple adequate treatment trials. Evidence across pharmacologic options for TRD remains limited, and trials often do not fully capture patient-reported tolerability. Large-scale online peer-support narratives therefore offer a complementary lens on how patients describe and evaluate medications in real-world use. In this study, we curated a corpus of 5,059 Reddit posts explicitly referencing TRD from 3,480 subscribers across 28 mental health-related subreddits from 2010 to 2025. Of these, 3,839 posts mentioned at least one medication, yielding 23,399 mentions of 81 generic-name medications after lexicon-based normalization of brand names, misspellings, and colloquialisms. We developed an aspect-based sentiment classifier by fine-tuning DeBERTa-v3 on the SMM4H 2023 therapy-sentiment Twitter corpus with large language model based data augmentation, achieving a micro-F1 score of 0.800 on the shared-task test set. Applying this classifier to Reddit, we quantified sentiment toward individual medications across three categories: positive, neutral, and negative, and tracked patterns by drug, subscriber, subreddit, and year. Overall, 72.1% of medication mentions were neutral, 14.8% negative, and 13.1% positive. Conventional antidepressants, especially SSRIs and SNRIs, showed consistently higher negative than positive proportions, whereas ketamine and esketamine showed comparatively more favorable sentiment profiles. These findings show that normalized medication extraction combined with aspect-based sentiment analysis can help characterize patient-perceived treatment experiences in TRD-related Reddit discourse, complementing clinical evidence with large-scale patient-generated perspectives.
Chinese Translation
治疗抵抗性抑郁症(TRD)是一种严重的重度抑郁障碍形式,患者在经历多次适当的治疗尝试后仍未能达到缓解。针对TRD的药物治疗选项的证据仍然有限,且临床试验往往无法充分捕捉患者报告的耐受性。因此,大规模的在线同伴支持叙事为患者在实际使用中描述和评估药物提供了一个补充视角。在本研究中,我们从2010年至2025年间,整理了来自28个心理健康相关子版块的3,480名订阅者发布的5,059条明确提及TRD的Reddit帖子。其中,3,839条帖子提到了至少一种药物,经过基于词汇的品牌名称、拼写错误和口语化表达的规范化后,产生了23,399次对81种通用名称药物的提及。我们通过在SMM4H 2023治疗情感Twitter语料库上微调DeBERTa-v3,并结合基于大语言模型的数据增强,开发了一个基于方面的情感分类器,在共享任务测试集上达到了0.800的微F1分数。将该分类器应用于Reddit后,我们量化了对个别药物的情感,分为三类:积极、中性和消极,并按药物、订阅者、子版块和年份跟踪模式。总体而言,72.1%的药物提及为中性,14.8%为消极,13.1%为积极。传统抗抑郁药,特别是SSRIs和SNRIs,显示出消极比例始终高于积极比例,而氯胺酮和艾司氯胺酮则显示出相对更有利的情感特征。这些发现表明,规范化的药物提取结合基于方面的情感分析可以帮助描述患者在与TRD相关的Reddit讨论中的治疗体验,从而补充临床证据与大规模患者生成的观点。
cs.CL / 8 / 2603.12350

TASTE-Streaming: Towards Streamable Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

TASTE-Streaming:面向可流式文本对齐语音标记化和嵌入的口语语言建模
Tseng, Liang-Hsuan, Lee, Hung-yi
Abstract
Text-speech joint spoken language modeling (SLM) aims at natural and intelligent speech-based interactions, but developing such a system may suffer from modality mismatch: speech unit sequences are much longer than text tokens. Prior work reduces this gap with text-aligned tokenization and embedding (TASTE), producing speech tokens that align in lengths with their textual counterparts. However, the dependence on an external ASR system and the use of a non-causal decoder limits streaming use. To address this limitation, we propose TASTE-S, a streamable extension of TASTE suitable for real-time usage. TASTE-S integrates a CTC-based ASR module into the encoder for instant dual-modality encoding. We also redesign the unit decoder to enable on-the-fly decoding. With joint training, we show that TASTE-S matches TASTE's performance while significantly reducing latency. Further investigations reveal that TASTE-S remains robust to transcriptions and enables long-form encoding and decoding.
Chinese Translation
文本-语音联合口语语言建模(SLM)旨在实现自然和智能的基于语音的交互,但开发这样的系统可能会面临模态不匹配的问题:语音单元序列的长度远大于文本标记。之前的研究通过文本对齐的标记化和嵌入(TASTE)来缩小这一差距,生成与其文本对应物长度对齐的语音标记。然而,依赖外部自动语音识别(ASR)系统和使用非因果解码器限制了流式使用。为了解决这一限制,我们提出了TASTE-S,这是TASTE的可流式扩展,适合实时使用。TASTE-S将基于CTC的ASR模块集成到编码器中,以实现即时的双模态编码。我们还重新设计了单元解码器,以支持即时解码。通过联合训练,我们表明TASTE-S在性能上与TASTE相匹配,同时显著降低延迟。进一步的研究表明,TASTE-S对转录保持稳健,并支持长格式编码和解码。
cs.CL / 9 / 2603.12397

Not Just the Destination, But the Journey: Reasoning Traces Causally Shape Generalization Behaviors

不仅仅是目的地,更是旅程:推理轨迹因果地塑造泛化行为
Wen, Pengcheng, Zhu, Yanxu, Sun, Jiapeng, Zhu, Han, Zhou, Yujin, Chan, Chi-Min, Han, Sirui, Guo, Yike
Abstract
Chain-of-Thought (CoT) is often viewed as a window into LLM decision-making, yet recent work suggests it may function merely as post-hoc rationalization. This raises a critical alignment question: Does the reasoning trace causally shape model generalization independent of the final answer? To isolate reasoning's causal effect, we design a controlled experiment holding final harmful answers constant while varying reasoning paths. We construct datasets with \textit{Evil} reasoning embracing malice, \textit{Misleading} reasoning rationalizing harm, and \textit{Submissive} reasoning yielding to pressure. We train models (0.6B--14B parameters) under multiple paradigms, including question-thinking-answer (QTA), question-thinking (QT), and thinking-only (T-only), and evaluate them in both think and no-think modes. We find that: (1) CoT training could amplify harmful generalization more than standard fine-tuning; (2) distinct reasoning types induce distinct behavioral patterns aligned with their semantics, despite identical final answers; (3) training on reasoning without answer supervision (QT or T-only) is sufficient to alter behavior, proving reasoning carries an independent signal; and (4) these effects persist even when generating answers without reasoning, indicating deep internalization. Our findings demonstrate that reasoning content is causally potent, challenging alignment strategies that supervise only outputs.
Chinese Translation
链式思维(Chain-of-Thought, CoT)常被视为大型语言模型(LLM)决策过程的窗口,然而近期的研究表明它可能仅仅作为事后合理化。这引发了一个关键的对齐问题:推理轨迹是否因果地塑造模型的泛化,而不依赖于最终答案?为了隔离推理的因果效应,我们设计了一个控制实验,保持最终有害答案不变,同时改变推理路径。我们构建了包含恶意的\textit{Evil}推理、合理化伤害的\textit{Misleading}推理和屈从压力的\textit{Submissive}推理的数据集。我们在多种范式下训练模型(参数规模从0.6B到14B),包括问题-思考-答案(QTA)、问题-思考(QT)和仅思考(T-only),并在思考和非思考模式下进行评估。我们的发现包括:(1)CoT训练可能比标准微调更能放大有害泛化;(2)不同的推理类型诱导出与其语义一致的不同行为模式,尽管最终答案相同;(3)在没有答案监督的情况下进行推理训练(QT或T-only)足以改变行为,证明推理携带独立信号;(4)即使在没有推理的情况下生成答案,这些效应依然存在,表明深度内化。我们的研究结果表明,推理内容具有因果效力,挑战仅监督输出的对齐策略。
cs.CL / 10 / 2603.12423

Interpreting Negation in GPT-2: Layer- and Head-Level Causal Analysis

解读GPT-2中的否定:层级和头级因果分析
Mofael, Abdullah Al, Kuhn, Lisa M., Alkadi, Ghassan, Yang, Kuo-Pao
Abstract
Negation remains a persistent challenge for modern language models, often causing reversed meanings or factual errors. In this work, we conduct a causal analysis of how GPT-2 Small internally processes such linguistic transformations. We examine its hidden representations at both the layer and head level. Our analysis is based on a self-curated 12,000-pair dataset of matched affirmative and negated sentences, covering multiple linguistic templates and forms of negation. To quantify this behavior, we define a metric, the Negation Effect Score (NES), which measures the model's sensitivity in distinguishing between affirmative statements and their negations. We carried out two key interventions to probe causal structure. In activation patching, internal activations from affirmative sentences were inserted into their negated counterparts to see how meaning shifted. In ablation, specific attention heads were temporarily disabled to observe how logical polarity changed. Together, these steps revealed how negation signals move and evolve through GPT-2's layers. Our findings indicate that this capability is not widespread; instead, it is highly concentrated within a limited number of mid-layer attention heads, primarily within layers 4 to 6. Ablating these specific components directly disrupts the model's negation sensitivity: on our in-domain dataset, ablation increased NES (indicating weaker negation sensitivity), and re-introducing cached affirmative activations (rescue) increased NES further, confirming that these heads carry affirmative signal rather than restoring baseline behavior. On xNot360, ablation slightly decreased NES and rescue restored performance above baseline. This pattern demonstrates that these causal patterns are consistent across various negation forms and remain detectable on the external xNot360 benchmark, though with smaller magnitude.
Chinese Translation
否定在现代语言模型中仍然是一个持续的挑战,常常导致意义的反转或事实错误。在本研究中,我们对GPT-2 Small内部如何处理这种语言转换进行了因果分析。我们考察了其在层级和头级的隐藏表示。我们的分析基于一个自我策划的12,000对匹配的肯定句和否定句的数据集,涵盖了多种语言模板和否定形式。为了量化这种行为,我们定义了一个指标,即否定效应分数(Negation Effect Score, NES),该指标测量模型在区分肯定陈述及其否定形式时的敏感性。我们进行了两项关键干预以探查因果结构。在激活补丁中,将肯定句的内部激活插入其否定对应句中,以观察意义的变化。在消融实验中,暂时禁用特定的注意力头,以观察逻辑极性的变化。这些步骤共同揭示了否定信号如何在GPT-2的层中移动和演变。我们的研究结果表明,这种能力并不普遍;相反,它高度集中在有限数量的中层注意力头中,主要位于第4层到第6层。消融这些特定组件直接破坏了模型的否定敏感性:在我们的领域内,消融增加了NES(表明否定敏感性减弱),而重新引入缓存的肯定激活(救援)进一步增加了NES,确认这些头承载的是肯定信号而不是恢复基线行为。在xNot360上,消融略微降低了NES,而救援则使性能恢复到基线之上。这一模式表明,这些因果模式在各种否定形式中是一致的,并且在外部xNot360基准测试中仍然可检测到,尽管幅度较小。
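The activation-patching intervention (cache activations from one run, splice them into another) has a simple mechanical core, illustrated here on a toy chain of layer functions rather than a transformer. The `layers`, inputs, and patch points are all invented for illustration.

```python
def cache(layers, x):
    """Record each layer's activation on a forward pass."""
    acts, a = [], x
    for f in layers:
        a = f(a)
        acts.append(a)
    return acts

def run(layers, x, patch=None):
    """Forward pass with patching: `patch` maps layer index -> cached
    activation to substitute at that point (e.g. the affirmative run's
    activation inserted into the negated run)."""
    a = x
    for i, f in enumerate(layers):
        a = f(a)
        if patch and i in patch:
            a = patch[i]
    return a
```

If patching layer i's cached activation makes the output match the source run, everything downstream of layer i is causally determined by that activation, which is how the paper localizes negation handling to mid-layer heads.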
cs.CL / 11 / 2603.12453

CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection

CSE-UOI在SemEval-2026任务6中的表现:一种具有深思复杂性门控的两阶段异构集成方法用于政治逃避检测
Tzouvaras, Christos, Skianis, Konstantinos, Voulodimos, Athanasios
Abstract
This paper describes our system for SemEval-2026 Task 6, which classifies clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. We propose a heterogeneous dual large language model (LLM) ensemble via self-consistency (SC) and weighted voting, and a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). This mechanism uses cross-model behavioral signals and exploits the finding that an LLM response-length proxy correlates strongly with sample ambiguity. To further examine mechanisms for improving ambiguity detection, we evaluated multi-agent debate as an alternative strategy for increasing deliberative capacity. Unlike DCG, which adaptively gates reasoning using cross-model behavioral signals, debate increases agent count without increasing model diversity. Our solution achieved a Macro-F1 score of 0.85 on the evaluation set, securing 3rd place.
Chinese Translation
本文描述了我们在SemEval-2026任务6中的系统,该系统将政治访谈中的回应清晰度分类为三类:清晰回复、模棱两可和清晰非回复。我们提出了一种通过自一致性(SC)和加权投票实现的异构双大型语言模型(LLM)集成,以及一种新颖的事后修正机制——深思复杂性门控(DCG)。该机制利用跨模型行为信号,并利用LLM响应长度代理与样本模糊性之间的强相关性。为了进一步研究提高模糊性检测的机制,我们评估了多智能体辩论作为增加深思能力的替代策略。与DCG通过跨模型行为信号自适应地门控推理不同,辩论在不增加模型多样性的情况下增加了智能体数量。我们的解决方案在评估集上达到了0.85的宏观F1分数,获得了第三名。
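The ensemble's weighted-voting step can be sketched as pooling (label, weight) pairs across models and self-consistency draws. How the per-model weights are chosen is the paper's design decision; here they are simply passed in.

```python
from collections import defaultdict

def weighted_vote(samples):
    """`samples`: (label, weight) pairs from all models and all
    self-consistency draws; returns the label with the largest
    accumulated weight."""
    scores = defaultdict(float)
    for label, weight in samples:
        scores[label] += weight
    return max(scores, key=scores.get)
```

A post-hoc corrector like DCG would then inspect the cross-model score distribution (e.g. a narrow margin between the top two labels) before deciding whether to override this vote.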
cs.CL / 12 / 2603.12458

Shattering the Shortcut: A Topology-Regularized Benchmark for Multi-hop Medical Reasoning in LLMs

打破捷径:一个针对大语言模型多跳医学推理的拓扑正则化基准
Zi, Xing, Zhou, Xinying, Xiao, Jinghao, Moreira, Catarina, Prasad, Mukesh
Abstract
While Large Language Models (LLMs) achieve expert-level performance on standard medical benchmarks through single-hop factual recall, they severely struggle with the complex, multi-hop diagnostic reasoning required in real-world clinical settings. A primary obstacle is "shortcut learning", where models exploit highly connected, generic hub nodes (e.g., "inflammation") in knowledge graphs to bypass authentic micro-pathological cascades. To address this, we introduce ShatterMed-QA, a bilingual benchmark of 10,558 multi-hop clinical questions designed to rigorously evaluate deep diagnostic reasoning. Our framework constructs a topology-regularized medical Knowledge Graph using a novel $k$-Shattering algorithm, which physically prunes generic hubs to explicitly sever logical shortcuts. We synthesize the evaluation vignettes by applying implicit bridge entity masking and topology-driven hard negative sampling, forcing models to navigate biologically plausible distractors without relying on superficial elimination. Comprehensive evaluations of 21 LLMs reveal massive performance degradation on our multi-hop tasks, particularly among domain-specific models. Crucially, restoring the masked evidence via Retrieval-Augmented Generation (RAG) triggers near-universal performance recovery, validating ShatterMed-QA's structural fidelity and proving its efficacy in diagnosing the fundamental reasoning deficits of current medical AI. Explore the dataset, interactive examples, and full leaderboards at our project website: https://shattermed-qa-web.vercel.app/
Chinese Translation
尽管大型语言模型(LLMs)在标准医学基准上通过单跳事实回忆实现了专家级表现,但它们在现实临床环境中所需的复杂多跳诊断推理方面却面临严重挑战。一个主要障碍是“捷径学习”,模型利用知识图谱中高度连接的通用中心节点(例如,“炎症”)来绕过真实的微病理级联。为了解决这一问题,我们推出了ShatterMed-QA,这是一个包含10,558个多跳临床问题的双语基准,旨在严格评估深度诊断推理。我们的框架使用一种新颖的$k$-Shattering算法构建了一个拓扑正则化的医学知识图谱,物理性地修剪通用中心节点,以明确切断逻辑捷径。我们通过应用隐式桥接实体掩蔽和基于拓扑的困难负采样来合成评估情境,迫使模型在不依赖表面消除的情况下,导航生物学上合理的干扰项。对21个LLM的全面评估显示,在我们的多跳任务中,表现出现了巨大的下降,特别是在特定领域模型中。至关重要的是,通过检索增强生成(Retrieval-Augmented Generation, RAG)恢复被掩蔽的证据几乎触发了普遍的性能恢复,验证了ShatterMed-QA的结构保真性,并证明了其在诊断当前医学人工智能基本推理缺陷方面的有效性。请访问我们的项目网站:https://shattermed-qa-web.vercel.app/,探索数据集、互动示例和完整排行榜。
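Hub pruning of the kind the $k$-Shattering algorithm performs can be illustrated with a simple degree cutoff over an edge list; the paper's actual algorithm is more involved, so the node names and threshold rule below are illustrative assumptions only.

```python
from collections import defaultdict

def prune_hubs(edges, k):
    """Remove nodes whose degree exceeds k, severing shortcut paths that
    route through generic hub concepts (a toy stand-in for k-Shattering)."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    hubs = {n for n, d in degree.items() if d > k}
    kept = [(u, v) for u, v in edges if u not in hubs and v not in hubs]
    return kept, hubs
```

With a generic hub like "inflammation" connected to many symptoms, pruning it forces any remaining multi-hop path to follow a specific pathological chain.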
cs.CL / 13 / 2603.12471

Marked Pedagogies: Examining Linguistic Biases in Personalized Automated Writing Feedback

标记化教学法:审视个性化自动写作反馈中的语言偏见
Tan, Mei, Phalen, Lena, Demszky, Dorottya
Abstract
Effective personalized feedback is critical to students' literacy development. Though LLM-powered tools now promise to automate such feedback at scale, LLMs are not language-neutral: they privilege standard academic English and reproduce social stereotypes, raising concerns about how "personalization" shapes the feedback students receive. We examine how four widely used LLMs (GPT-4o, GPT-3.5-turbo, Llama-3.3 70B, Llama-3.1 8B) adapt written feedback in response to student attributes. Using 600 eighth-grade persuasive essays from the PERSUADE dataset, we generated feedback under prompt conditions embedding gender, race/ethnicity, learning needs, achievement, and motivation. We analyze lexical shifts across model outputs by adapting the Marked Words framework. Our results reveal systematic, stereotype-aligned shifts in feedback conditioned on presumed student attributes--even when essay content was identical. Feedback for students marked by race, language, or disability often exhibited positive feedback bias and feedback withholding bias--overuse of praise, less substantive critique, and assumptions of limited ability. Across attributes, models tailored not only what content was emphasized but also how writing was judged and how students were addressed. We term these instructional orientations Marked Pedagogies and highlight the need for transparency and accountability in automated feedback tools.
Chinese Translation
有效的个性化反馈对学生的读写能力发展至关重要。尽管基于大型语言模型(LLM)的工具现在承诺能够大规模自动化此类反馈,但LLM并非语言中立:它们偏向标准学术英语并再现社会刻板印象,这引发了关于“个性化”如何影响学生所获得反馈的担忧。我们考察了四种广泛使用的LLM(GPT-4o、GPT-3.5-turbo、Llama-3.3 70B、Llama-3.1 8B)如何根据学生特征调整书面反馈。我们使用来自PERSUADE数据集的600篇八年级说服性论文,在嵌入性别、种族/民族、学习需求、成就和动机的提示条件下生成反馈。我们通过调整标记词(Marked Words)框架分析模型输出中的词汇变化。我们的结果揭示了基于假定学生特征的反馈中存在系统性的、与刻板印象一致的变化——即使论文内容相同。被种族、语言或残疾标记的学生的反馈通常表现出积极反馈偏见和反馈抑制偏见——过度赞美、缺乏实质性批评以及对能力有限的假设。在各个特征中,模型不仅调整了强调的内容,还调整了写作的评判标准和对学生的称呼方式。我们将这些教学取向称为标记化教学法,并强调在自动反馈工具中需要透明度和问责制。
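Lexical-shift analysis in the Marked Words style boils down to comparing word probabilities between feedback generated for a marked group and an unmarked baseline. A minimal smoothed log-odds sketch follows; the framework's exact weighting and significance testing are not reproduced here.

```python
import math
from collections import Counter

def marked_words(group_docs, baseline_docs, top_n=3):
    """Rank words by smoothed log-odds of appearing in feedback for a
    marked group vs. an unmarked baseline (higher = more group-specific).
    Add-one smoothing is an illustrative choice."""
    g = Counter(w for doc in group_docs for w in doc.lower().split())
    b = Counter(w for doc in baseline_docs for w in doc.lower().split())
    ng, nb = sum(g.values()), sum(b.values())
    vocab = set(g) | set(b)
    scores = {}
    for w in vocab:
        pg = (g[w] + 1) / (ng + len(vocab))
        pb = (b[w] + 1) / (nb + len(vocab))
        scores[w] = math.log(pg / (1 - pg)) - math.log(pb / (1 - pb))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Run over paired feedback sets, words like overused praise terms would surface at the top for the marked condition.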
cs.CL / 14 / 2603.12522

LLM BiasScope: A Real-Time Bias Analysis Platform for Comparative LLM Evaluation

LLM 偏见范围:用于比较 LLM 评估的实时偏见分析平台
Ghosh, Himel, Werner, Nick Elias
Abstract
As large language models (LLMs) are deployed widely, detecting and understanding bias in their outputs is critical. We present LLM BiasScope, a web application for side-by-side comparison of LLM outputs with real-time bias analysis. The system supports multiple providers (Google Gemini, DeepSeek, MiniMax, Mistral, Meituan, Meta Llama) and enables researchers and practitioners to compare models on the same prompts while analyzing bias patterns. LLM BiasScope uses a two-stage bias detection pipeline: sentence-level bias detection followed by bias type classification for biased sentences. The analysis runs automatically on both user prompts and model responses, providing statistics, visualizations, and detailed breakdowns of bias types. The interface displays two models side-by-side with synchronized streaming responses, per-model bias summaries, and a comparison view highlighting differences in bias distributions. The system is built on Next.js with React, integrates Hugging Face inference endpoints for bias detection, and uses the Vercel AI SDK for multi-provider LLM access. Features include real-time streaming, export to JSON/PDF, and interactive visualizations (bar charts, radar charts) for bias analysis. LLM BiasScope is available as an open-source web application, providing a practical tool for bias evaluation and comparative analysis of LLM behaviour.
Chinese Translation
随着大型语言模型(LLMs)的广泛部署,检测和理解其输出中的偏见变得至关重要。我们提出了 LLM 偏见范围(LLM BiasScope),这是一个用于 LLM 输出的并排比较及实时偏见分析的网络应用。该系统支持多个提供者(Google Gemini、DeepSeek、MiniMax、Mistral、美团、Meta Llama),使研究人员和从业者能够在相同提示下比较模型并分析偏见模式。LLM 偏见范围采用两阶段偏见检测流程:首先进行句子级偏见检测,然后对有偏见的句子进行偏见类型分类。该分析自动运行于用户提示和模型响应上,提供统计数据、可视化和偏见类型的详细分类。界面显示两个模型并排,具有同步流式响应、每个模型的偏见摘要以及突出偏见分布差异的比较视图。该系统基于 Next.js 和 React 构建,集成了 Hugging Face 推理端点以进行偏见检测,并使用 Vercel AI SDK 访问多个提供者的 LLM。其功能包括实时流式传输、导出为 JSON/PDF,以及用于偏见分析的交互式可视化(条形图、雷达图)。LLM 偏见范围作为一个开源网络应用可用,为偏见评估和 LLM 行为的比较分析提供了实用工具。
cs.CL / 15 / 2603.12564

AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents

AgentDrift:在排名指标掩盖下工具腐败导致的不安全推荐漂移
Wu, Zekun, Koshiyama, Adriano, Bulathwela, Sahan, Perez-Ortiz, Maria
Abstract
Tool-augmented LLM agents increasingly serve as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking-quality metrics that measure what is recommended but not whether it is safe for the user. We introduce a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier) and decomposes divergence into information-channel and memory-channel mechanisms. Across the seven models tested, we consistently observe the evaluation-blindness pattern: recommendation quality is largely preserved under contamination (utility preservation ratio approximately 1.0) while risk-inappropriate products appear in 65-93% of turns, a systematic safety failure poorly reflected by standard NDCG. Safety violations are predominantly information-channel-driven, emerge at the first contaminated turn, and persist without self-correction over 23-step trajectories; no agent across 1,563 contaminated turns explicitly questions tool-data reliability. Even narrative-only corruption (biased headlines, no numerical manipulation) induces significant drift while completely evading consistency monitors. A safety-penalized NDCG variant (sNDCG) reduces preservation ratios to 0.51-0.74, indicating that much of the evaluation gap becomes visible once safety is explicitly measured. These results motivate considering trajectory-level safety monitoring, beyond single-turn quality, for deployed multi-turn agents in high-stakes settings.
Chinese Translation
工具增强的LLM代理越来越多地在高风险领域担任多轮顾问,然而它们的评估依赖于衡量推荐内容的排名质量指标,而非其对用户的安全性。我们引入了一种配对轨迹协议,在七个LLM(从7B到前沿模型)中重放真实的金融对话,分别在干净和受污染的工具输出条件下进行,并将偏差分解为信息通道和记忆通道机制。在测试的七个模型中,我们一致观察到评估盲点模式:在污染下推荐质量基本保持(效用保持比率约为1.0),而与用户风险状况不匹配的产品出现在65-93%的轮次中,这是一个系统性的安全失败,标准的NDCG未能很好反映这一点。安全违规主要由信息通道驱动,在第一次受污染的轮次中出现,并在23步轨迹中持续存在而没有自我修正;在1,563个受污染的轮次中,没有任何代理明确质疑工具数据的可靠性。即使是仅有叙述的腐败(偏见标题,无数值操控)也会引发显著的漂移,同时完全逃避一致性监控。安全惩罚的NDCG变体(sNDCG)将保持比率降低到0.51-0.74,表明一旦明确测量安全性,评估差距中的大部分将变得可见。这些结果促使我们考虑在高风险环境中部署的多轮代理的轨迹级安全监测,而不仅仅是单轮质量。
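A safety-penalized NDCG can be sketched by discounting the gain of risk-inappropriate items before the usual ideal-DCG normalization. The paper does not specify its exact penalty form, so the `penalty` factor here is an assumption.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the standard log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def sndcg(relevances, unsafe_flags, penalty=0.0):
    """NDCG variant in which risk-inappropriate items have their gain
    scaled by `penalty` (0 = fully discounted). Illustrative sketch only."""
    gains = [r * (penalty if bad else 1.0)
             for r, bad in zip(relevances, unsafe_flags)]
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

Under plain NDCG a contaminated ranking with the same relevance scores looks unchanged; flagging even one top item as unsafe makes the score drop, which is exactly the gap the abstract describes.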
cs.CL / 16 / 2603.12572

LMEB: Long-horizon Memory Embedding Benchmark

LMEB:长时间记忆嵌入基准
Zhao, Xinping, Hu, Xinshuo, Xu, Jiaxin, Tang, Danyu, Zhang, Xin, Zhou, Mengjia, Zhong, Yan, Zhou, Yao, Shan, Zifei, Zhang, Meishan, Hu, Baotian, Zhang, Min
Abstract
Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities in handling complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types: episodic, dialogue, semantic, and procedural, with both AI-generated and human-annotated data. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for handling long-term, context-dependent memory retrieval. LMEB is available at https://github.com/KaLM-Embedding/LMEB.
Chinese Translation
记忆嵌入对于增强记忆的系统(如 OpenClaw)至关重要,但在当前的文本嵌入基准中,它们的评估尚未得到充分探讨,这些基准狭隘地关注传统的段落检索,未能评估模型处理涉及碎片化、上下文依赖和时间上遥远信息的长时间记忆检索任务的能力。为了解决这一问题,我们引入了长时间记忆嵌入基准(LMEB),这是一个全面的框架,用于评估嵌入模型在处理复杂的长时间记忆检索任务中的能力。LMEB 涉及 22 个数据集和 193 个零样本检索任务,涵盖 4 种记忆类型:情节记忆、对话记忆、语义记忆和过程记忆,数据来源包括 AI 生成的数据和人工标注的数据。这些记忆类型在抽象层次和时间依赖性方面存在差异,捕捉了记忆检索的不同方面,反映了现实世界的多样挑战。我们评估了 15 种广泛使用的嵌入模型,这些模型的参数规模从数亿到百亿不等。结果表明:(1)LMEB 提供了合理的难度水平;(2)更大的模型并不总是表现更好;(3)LMEB 和 MTEB 之间存在正交性。这表明该领域尚未收敛于一个能够在所有记忆检索任务中表现出色的通用模型,并且在传统段落检索中的表现可能无法推广到长时间记忆检索。总之,通过提供一个标准化和可重复的评估框架,LMEB 填补了记忆嵌入评估中的一个关键空白,推动了文本嵌入在处理长期、上下文依赖的记忆检索方面的进一步发展。LMEB 可在 https://github.com/KaLM-Embedding/LMEB 获得。
cs.CL / 17 / 2603.12577

Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation

专家金字塔调优:基于专业知识的任务分配高效参数微调
Zhang, Jia-Chen, Yan, Zhen-Wei, Xiong, Yu-Jie, Xia, Chun-Ming
Abstract
Parameter-Efficient Fine-Tuning (PEFT) has become a dominant paradigm for deploying LLMs in multi-task scenarios due to its extreme parameter efficiency. While Mixture-of-Experts (MoE) based LoRA variants have achieved promising results by dynamically routing tokens to different low-rank experts, they largely overlook the hierarchical nature of task complexity. Existing methods typically employ experts with uniform architectures, limiting their ability to capture diverse feature granularities required by distinct tasks--where some tasks demand high-level semantic abstraction while others require fine-grained syntactic manipulation. To bridge this gap, we propose Expert Pyramid Tuning (EPT), a novel architecture that integrates the multi-scale feature pyramid concept from computer vision into the realm of PEFT. Unlike standard LoRA, EPT decomposes task adaptation into two stages: (1) A shared meta-knowledge Subspace that encodes universal linguistic patterns in low dimensions; (2) A Pyramid Projection Mechanism that utilizes learnable up-projection operators to reconstruct high-dimensional features at varying scales. A task-aware router then dynamically selects the optimal combination of these multi-scale features. Extensive experiments across multiple multi-task benchmarks demonstrate that EPT significantly outperforms SOTA MoE-LoRA variants. Crucially, thanks to the re-parameterization capability of our design, EPT achieves this performance improvement while simultaneously reducing the number of training parameters.
Chinese Translation
参数高效微调(PEFT)已成为在多任务场景中部署大规模语言模型(LLMs)的主流范式,因其极高的参数效率。尽管基于专家混合(MoE)的LoRA变体通过动态路由令牌到不同的低秩专家取得了良好的效果,但它们在很大程度上忽视了任务复杂性的层次特征。现有方法通常采用统一架构的专家,这限制了它们捕捉不同任务所需的多样化特征粒度的能力——某些任务需要高层次的语义抽象,而其他任务则需要细粒度的句法操作。为了解决这一问题,我们提出了专家金字塔调优(EPT),这是一种新颖的架构,将计算机视觉中的多尺度特征金字塔概念引入PEFT领域。与标准的LoRA不同,EPT将任务适应分解为两个阶段:(1)一个共享的元知识子空间,用于在低维中编码通用语言模式;(2)一个金字塔投影机制,利用可学习的向上投影算子在不同尺度上重构高维特征。然后,任务感知路由器动态选择这些多尺度特征的最佳组合。在多个多任务基准上的广泛实验表明,EPT显著优于当前最先进的MoE-LoRA变体。重要的是,由于我们设计的重新参数化能力,EPT在提高性能的同时,显著减少了训练参数的数量。
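The two-stage decomposition above (one shared low-rank code, several up-projections at different scales, mixed by a router) can be sketched in plain Python. The dimensions, per-expert ranks, and random initialization are illustrative assumptions, not the paper's configuration.

```python
import math, random

random.seed(0)
d, ranks = 8, [1, 2, 4]               # hidden size and per-expert scales (assumed)

def mat(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

A  = mat(max(ranks), d)               # shared meta-knowledge down-projection
Bs = [mat(d, rk) for rk in ranks]     # pyramid of up-projections, one scale each
Wr = mat(len(ranks), d)               # task-aware router

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def ept_delta(x):
    """Adapter update: encode once into the shared subspace, reconstruct
    at several scales, and mix the reconstructions with router weights."""
    z = matvec(A, x)                  # shared low-dimensional code
    ws = softmax(matvec(Wr, x))       # routing weights over the pyramid
    out = [0.0] * d
    for w, B, rk in zip(ws, Bs, ranks):
        rec = matvec(B, z[:rk])       # reconstruction at scale rk
        out = [o + w * r for o, r in zip(out, rec)]
    return out
```

Because every expert reads a prefix of the same shared code, parameters grow with the ranks rather than with full per-expert LoRA pairs, which is the re-parameterization saving the abstract alludes to.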
cs.CL / 18 / 2603.12582

RTD-Guard: A Black-Box Textual Adversarial Detection Framework via Replacement Token Detection

RTD-Guard:一种通过替换标记检测的黑箱文本对抗检测框架
Zhu, He, Li, Yanshu, Liu, Wen, Yang, Haitian
Abstract
Textual adversarial attacks pose a serious security threat to Natural Language Processing (NLP) systems by introducing imperceptible perturbations that mislead deep learning models. While adversarial example detection offers a lightweight alternative to robust training, existing methods typically rely on prior knowledge of attacks, white-box access to the victim model, or numerous queries, which severely limits their practical deployment. This paper introduces RTD-Guard, a novel black-box framework for detecting textual adversarial examples. Our key insight is that word-substitution perturbations in adversarial attacks closely resemble the "replaced tokens" that a Replaced Token Detection (RTD) discriminator is pre-trained to identify. Leveraging this, RTD-Guard employs an off-the-shelf RTD discriminator-without fine-tuning-to localize suspicious tokens, masks them, and detects adversarial examples by observing the prediction confidence shift of the victim model before and after intervention. The entire process requires no adversarial data, model tuning, or internal model access, and uses only two black-box queries. Comprehensive experiments on multiple benchmark datasets demonstrate that RTD-Guard effectively detects adversarial texts generated by diverse state-of-the-art attack methods. It surpasses existing detection baselines across multiple metrics, offering a highly efficient, practical, and resource-light defense mechanism-particularly suited for real-world deployment in resource-constrained or privacy-sensitive environments.
Chinese Translation
文本对抗攻击通过引入不可察觉的扰动误导深度学习模型,对自然语言处理(NLP)系统构成严重的安全威胁。虽然对抗样本检测为鲁棒训练提供了一种轻量级的替代方案,但现有方法通常依赖于对攻击的先验知识、对受害模型的白箱访问或大量查询,这严重限制了它们的实际部署。本文介绍了RTD-Guard,一种用于检测文本对抗样本的新颖黑箱框架。我们的关键见解是,对抗攻击中的词替换扰动与替换标记检测(Replaced Token Detection, RTD)鉴别器在预训练中学习识别的“替换标记”非常相似。基于此,RTD-Guard利用现成的RTD鉴别器(无需微调)来定位可疑标记,掩盖它们,并通过观察受害模型在干预前后的预测置信度变化来检测对抗样本。整个过程不需要对抗数据、模型调优或内部模型访问,仅使用两次黑箱查询。对多个基准数据集的全面实验表明,RTD-Guard能够有效检测由多种最先进攻击方法生成的对抗文本。它在多个指标上超越现有检测基线,提供了一种高效、实用且资源占用低的防御机制,特别适合在资源受限或隐私敏感的环境中进行实际部署。
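The two-query detection loop can be sketched as follows, with `rtd_scores` and `victim_confidence` as hypothetical callables standing in for the off-the-shelf RTD discriminator and the black-box victim model; the masking budget and threshold are assumptions.

```python
def detect_adversarial(text, rtd_scores, victim_confidence,
                       mask_token="[MASK]", top_k=2, threshold=0.2):
    """Black-box detection sketch: mask the tokens an RTD discriminator
    scores as likely replacements, then flag the input if the victim
    model's confidence shifts sharply after the intervention."""
    tokens = text.split()
    scores = rtd_scores(tokens)                       # discriminator (assumed)
    suspicious = sorted(range(len(tokens)), key=lambda i: scores[i],
                        reverse=True)[:top_k]
    masked = [mask_token if i in suspicious else t
              for i, t in enumerate(tokens)]
    before = victim_confidence(" ".join(tokens))      # black-box query 1
    after = victim_confidence(" ".join(masked))       # black-box query 2
    return abs(before - after) > threshold
```

Clean inputs are largely insensitive to masking a few tokens, whereas adversarial inputs lose the perturbed words that carried the attack, producing a large confidence shift.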
cs.CL / 19 / 2603.12638

Using a Human-AI Teaming Approach to Create and Curate Scientific Datasets with the SCILIRE System

利用人机协作方法通过SCILIRE系统创建和管理科学数据集
Bölücü, Necva, Irons, Jessica, Lee, Changhyun, Jin, Brian, Rybinski, Maciej, Yang, Huichen, Duenser, Andreas, Wan, Stephen
Abstract
The rapid growth of scientific literature has made manual extraction of structured knowledge increasingly impractical. To address this challenge, we introduce SCILIRE, a system for creating datasets from scientific literature. SCILIRE has been designed around Human-AI teaming principles centred on workflows for verifying and curating data. It facilitates an iterative workflow in which researchers can review and correct AI outputs. Furthermore, this interaction is used as a feedback signal to improve future LLM-based inference. We evaluate our design using a combination of intrinsic benchmarking outcomes together with real-world case studies across multiple domains. The results demonstrate that SCILIRE improves extraction fidelity and facilitates efficient dataset creation.
Chinese Translation
科学文献的快速增长使得手动提取结构化知识变得越来越不切实际。为了解决这一挑战,我们介绍了SCILIRE,一个用于从科学文献中创建数据集的系统。SCILIRE的设计基于人机协作原则,围绕验证和管理数据的工作流程展开。它促进了一种迭代工作流程,研究人员可以在其中审查和修正AI输出。此外,这种互动被用作反馈信号,以改善未来基于大型语言模型(LLM)的推理。我们通过结合内在基准测试结果和多个领域的真实案例研究来评估我们的设计。结果表明,SCILIRE提高了提取的准确性,并促进了高效的数据集创建。
cs.CL / 20 / 2603.12646

98$\times$ Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

无需专用GPU的98$\times$更快LLM路由:Flash Attention、提示压缩和近流处理的vLLM语义路由器
Liu, Xunzhuo, He, Bowei, Liu, Xue, Luo, Andy, Zhang, Haichen, Chen, Huamin
Abstract
System-level routers that intercept LLM requests for safety classification, domain routing, and PII detection must be both fast and operationally lightweight: they should add minimal latency to every request, yet not require a dedicated GPU -- an expensive resource better used for LLM inference itself. When the router co-locates on the same GPU as vLLM serving instances, standard attention's $O(n^2)$ memory makes long-context classification (8K--32K tokens) impossible: at 8K tokens, three concurrent classifiers need ${\sim}$4.5\,GB for attention masks alone, far exceeding the memory left by vLLM. We present three staged optimizations for the vLLM Semantic Router, benchmarked on AMD Instinct MI300X, that solve both the latency and the memory problem. \emph{Stage~1}: a custom CK Flash Attention operator for ONNX Runtime on ROCm reduces attention memory from $O(n^2)$ to $O(n)$ and end-to-end (E2E) latency from 4{,}918\,ms to 127\,ms (\textbf{38.7$\times$}), enabling 8K--32K tokens where SDPA OOMs. \emph{Stage~2}: classical NLP prompt compression (TextRank, position weighting, TF-IDF, and novelty scoring) reduces all inputs to ${\sim}$512 tokens without neural inference, capping both latency and GPU memory at a constant regardless of original prompt length (E2E 127$\to$62\,ms, \textbf{2.0$\times$}). \emph{Stage~3}: near-streaming body processing with adaptive chunking and zero-copy JSON eliminates serialization overhead (E2E 62$\to$50\,ms, \textbf{1.2$\times$}). Cumulatively: \textbf{98$\times$} improvement (4{,}918\,ms to 50\,ms), 16K-token routing in 108\,ms, and a total router GPU footprint under 800\,MB -- small enough to share a GPU with LLM serving and removing the need for a dedicated accelerator. Stage~1 targets AMD ROCm (NVIDIA GPUs already have FlashAttention via cuDNN); Stages~2 and~3 are hardware-agnostic.
Chinese Translation
系统级路由器需要拦截LLM请求以进行安全分类、领域路由和个人身份信息(PII)检测,必须既快速又操作轻量:它们应对每个请求增加最小延迟,同时不需要专用GPU——这一昂贵资源更适合用于LLM推理。当路由器与vLLM服务实例共用同一GPU时,标准注意力的$O(n^2)$内存使得长上下文分类(8K--32K标记)变得不可能:在8K标记时,三个并发分类器仅注意力掩码就需要约4.5GB的内存,远远超过vLLM所剩的内存。我们提出了针对vLLM语义路由器的三阶段优化,并在AMD Instinct MI300X上进行了基准测试,解决了延迟和内存问题。\textbf{第一阶段}:为ONNX Runtime在ROCm上定制的CK Flash Attention运算符将注意力内存从$O(n^2)$降低到$O(n)$,并将端到端(E2E)延迟从4{,}918毫秒降低到127毫秒(\textbf{38.7$\times$}),使得8K--32K标记成为可能,而SDPA则会出现OOM。\textbf{第二阶段}:经典的自然语言处理提示压缩(TextRank、位置加权、TF-IDF和新颖性评分)在不进行神经推理的情况下将所有输入减少到约512个标记,无论原始提示长度如何,都将延迟和GPU内存限制在一个常数值(E2E 127$\to$62毫秒,\textbf{2.0$\times$})。\textbf{第三阶段}:通过自适应分块和零拷贝JSON的近流处理消除了序列化开销(E2E 62$\to$50毫秒,\textbf{1.2$\times$})。累计效果:\textbf{98$\times$}的改进(从4{,}918毫秒降至50毫秒),16K标记路由在108毫秒内完成,总路由器GPU占用低于800MB——足够小以与LLM服务共享GPU,消除了对专用加速器的需求。第一阶段针对AMD ROCm(NVIDIA GPU已通过cuDNN实现FlashAttention);第二和第三阶段则与硬件无关。
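Stage 2's training-free compression can be approximated with a TF-IDF-plus-position sentence scorer that keeps top sentences under a token budget. The score weights and lead-position bonus are assumptions, and the paper's TextRank and novelty-scoring components are omitted here.

```python
import math
from collections import Counter

def compress_prompt(sentences, budget_tokens=512):
    """Extractive compression sketch: rank sentences by a TF-IDF-style
    informativeness score with a mild position bonus (earlier = slightly
    more important), then keep top sentences, in original order, under
    the token budget. No neural inference is involved."""
    docs = [s.lower().split() for s in sentences]
    df = Counter(w for doc in docs for w in set(doc))
    n = len(docs)

    def score(i):
        tfidf = sum(math.log(1 + n / df[w]) for w in docs[i]) / max(len(docs[i]), 1)
        position = 1.0 + 0.5 / (1 + i)        # lead bias (assumed weighting)
        return tfidf * position

    ranked = sorted(range(n), key=score, reverse=True)
    kept, used = set(), 0
    for i in ranked:
        if used + len(docs[i]) > budget_tokens:
            continue
        kept.add(i)
        used += len(docs[i])
    return [sentences[i] for i in sorted(kept)]
```

Because the output is capped near the budget regardless of input length, downstream classification cost becomes constant, which is the property the abstract highlights.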
cs.CL / 21 / 2603.12658

Continual Learning in Large Language Models: Methods, Challenges, and Opportunities

大语言模型中的持续学习:方法、挑战与机遇
Chen, Hongyang, Sun, Zhongwu, Ye, Hongfei, Li, Kunchi, Lin, Xuemin
Abstract
Continual learning (CL) has emerged as a pivotal paradigm to enable large language models (LLMs) to dynamically adapt to evolving knowledge and sequential tasks while mitigating catastrophic forgetting-a critical limitation of the static pre-training paradigm inherent to modern LLMs. This survey presents a comprehensive overview of CL methodologies tailored for LLMs, structured around three core training stages: continual pre-training, continual fine-tuning, and continual alignment.Beyond the canonical taxonomy of rehearsal-, regularization-, and architecture-based methods, we further subdivide each category by its distinct forgetting mitigation mechanisms and conduct a rigorous comparative analysis of the adaptability and critical improvements of traditional CL methods for LLMs. In doing so, we explicitly highlight core distinctions between LLM CL and traditional machine learning, particularly with respect to scale, parameter efficiency, and emergent capabilities. Our analysis covers essential evaluation metrics, including forgetting rates and knowledge transfer efficiency, along with emerging benchmarks for assessing CL performance. This survey reveals that while current methods demonstrate promising results in specific domains, fundamental challenges persist in achieving seamless knowledge integration across diverse tasks and temporal scales. This systematic review contributes to the growing body of knowledge on LLM adaptation, providing researchers and practitioners with a structured framework for understanding current achievements and future opportunities in lifelong learning for language models.
Chinese Translation
持续学习(CL)已成为一种关键范式,使大语言模型(LLMs)能够动态适应不断发展的知识和顺序任务,同时减轻灾难性遗忘——这是现代LLMs固有的静态预训练范式的一个关键限制。本文综述了针对LLMs的CL方法,围绕三个核心训练阶段进行结构化:持续预训练、持续微调和持续对齐。除了经典的基于重演、正则化和架构的方法分类外,我们进一步根据各自独特的遗忘减轻机制对每个类别进行细分,并对传统CL方法在LLMs中的适应性和关键改进进行了严格的比较分析。在此过程中,我们明确强调了LLM CL与传统机器学习之间的核心区别,特别是在规模、参数效率和新兴能力方面。我们的分析涵盖了基本的评估指标,包括遗忘率和知识迁移效率,以及评估CL性能的新兴基准。该综述揭示,尽管当前方法在特定领域表现出良好的结果,但在实现跨多样任务和时间尺度的无缝知识整合方面仍然存在根本性挑战。这一系统性回顾为LLM适应性日益增长的知识体系做出了贡献,为研究人员和从业者提供了一个结构化框架,以理解当前成就和语言模型终身学习的未来机遇。
cs.CL / 22 / 2603.12664

From Text to Forecasts: Bridging Modality Gap with Temporal Evolution Semantic Space

从文本到预测:通过时间演化语义空间弥合模态差距
Li, Lehui, Wang, Yuyao, Yan, Jisheng, Zhang, Wei, Deng, Jinliang, Sun, Haoliang, Han, Zhongyi, Gong, Yongshun
Abstract
Incorporating textual information into time-series forecasting holds promise for addressing event-driven non-stationarity; however, a fundamental modality gap hinders effective fusion: textual descriptions express temporal impacts implicitly and qualitatively, whereas forecasting models rely on explicit and quantitative signals. Through controlled semi-synthetic experiments, we show that existing methods over-attend to redundant tokens and struggle to reliably translate textual semantics into usable numerical cues. To bridge this gap, we propose TESS, which introduces a Temporal Evolution Semantic Space as an intermediate bottleneck between modalities. This space consists of interpretable, numerically grounded temporal primitives (mean shift, volatility, shape, and lag) extracted from text by an LLM via structured prompting and filtered through confidence-aware gating. Experiments on four real-world datasets demonstrate up to a 29 percent reduction in forecasting error compared to state-of-the-art unimodal and multimodal baselines. The code will be released after acceptance.
Chinese Translation
将文本信息纳入时间序列预测有望解决事件驱动的非平稳性;然而,根本的模态差距阻碍了有效的融合:文本描述隐性且定性地表达时间影响,而预测模型依赖于显性且定量的信号。通过控制的半合成实验,我们表明现有方法过度关注冗余标记,并且在将文本语义可靠地转化为可用的数值线索方面存在困难。为了解决这一差距,我们提出了TESS,它引入了一个时间演化语义空间,作为模态之间的中间瓶颈。该空间由可解释的、数值基础的时间原语(均值变化、波动性、形状和滞后)构成,这些原语通过结构化提示从文本中提取,并通过基于置信度的门控进行过滤。在四个真实世界数据集上的实验表明,与最先进的单模态和多模态基线相比,预测误差减少了多达29%。代码将在接受后发布。
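The role of the temporal primitives as a numeric bottleneck can be illustrated by applying an LLM-extracted primitive dict to a baseline forecast, with confidence-aware gating. The field names and the adjustment formula below are illustrative assumptions, not the paper's schema.

```python
def apply_primitives(baseline, primitives, min_confidence=0.5):
    """Adjust a baseline forecast using structured temporal primitives
    (e.g. produced by an LLM via structured prompting). Low-confidence
    extractions are gated out and leave the forecast untouched."""
    if primitives.get("confidence", 1.0) < min_confidence:
        return list(baseline)                 # confidence-aware gating
    lag = primitives.get("lag", 0)
    shift = primitives.get("mean_shift", 0.0)
    vol = primitives.get("volatility_scale", 1.0)
    mean = sum(baseline) / len(baseline)
    # Before `lag` steps the event has no effect; afterwards, rescale
    # deviations from the mean and apply the level shift.
    return [y if i < lag else mean + (y - mean) * vol + shift
            for i, y in enumerate(baseline)]
```

The point of the bottleneck is visible here: the forecaster never sees raw text, only a few interpretable numbers (shift, volatility, lag) that are easy to audit.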
cs.CL / 23 / 2603.12677

MetaKE: Meta-learning Aligned Knowledge Editing via Bi-level Optimization

MetaKE:通过双层优化实现对齐的元学习知识编辑
Liu, Shuxin, Wu, Ou
Abstract
Knowledge editing (KE) aims to precisely rectify specific knowledge in Large Language Models (LLMs) without disrupting general capabilities. State-of-the-art methods suffer from an open-loop control mismatch. We identify a critical "Semantic-Execution Disconnect": the semantic target is derived independently without feedback from the downstream's feasible region. This misalignment often causes valid semantic targets to fall within the prohibited space, resulting in gradient truncation and editing failure. To bridge this gap, we propose MetaKE (Meta-learning Aligned Knowledge Editing), a new framework that reframes KE as a bi-level optimization problem. Departing from static calculation, MetaKE treats the edit target as a learnable meta-parameter: the upper-level optimizer seeks a feasible target to maximize post-edit performance, while the lower-level solver executes the editing. To address the challenge of differentiating through complex solvers, we derive a Structural Gradient Proxy, which explicitly backpropagates editability constraints to the target learning phase. Theoretical analysis demonstrates that MetaKE automatically aligns the edit direction with the model's feasible manifold. Extensive experiments confirm that MetaKE significantly outperforms strong baselines, offering a new perspective on knowledge editing.
Chinese Translation
知识编辑(KE)旨在精确修正大型语言模型(LLMs)中的特定知识,而不影响其一般能力。现有的最先进方法存在开环控制不匹配的问题。我们识别出一个关键的“语义-执行脱节”:语义目标独立推导,而没有来自下游可行区域的反馈。这种不对齐常常导致有效的语义目标落入禁止空间,造成梯度截断和编辑失败。为了解决这一问题,我们提出了MetaKE(元学习对齐知识编辑),这是一个将知识编辑重新框定为双层优化问题的新框架。MetaKE不同于静态计算,它将编辑目标视为可学习的元参数:上层优化器寻求一个可行目标,以最大化编辑后的性能,而下层求解器执行编辑。为了应对通过复杂求解器进行微分的挑战,我们推导出了一种结构梯度代理,明确地将可编辑性约束反向传播到目标学习阶段。理论分析表明,MetaKE能够自动将编辑方向与模型的可行流形对齐。大量实验确认MetaKE显著优于强基线,为知识编辑提供了新的视角。
cs.CL / 24 / 2603.12683

Experimental evidence of progressive ChatGPT models self-convergence

渐进式ChatGPT模型自我收敛的实验证据
Xylogiannopoulos, Konstantinos F., Xanthopoulos, Petros, Karampelas, Panagiotis, Bakamitsos, Georgios A.
Abstract
Large Language Models (LLMs) that undergo recursive training on synthetically generated data are susceptible to model collapse, a phenomenon marked by the generation of meaningless output. Existing research has examined this issue from either theoretical or empirical perspectives, often focusing on a single model trained recursively on its own outputs. While prior studies have cautioned against the potential degradation of LLM output quality under such conditions, no longitudinal investigation has yet been conducted to assess this effect over time. In this study, we employ a text similarity metric to evaluate different ChatGPT models' capacity to generate diverse textual outputs. Our findings indicate a measurable decline in recent ChatGPT releases' ability to produce varied text, even when explicitly prompted to do so and with the temperature parameter set to one. The observed reduction in output diversity may be attributed to the influence of the amounts of synthetic data incorporated within their training datasets as the result of internet infiltration by LLM generated data. We define this phenomenon as model self-convergence, referring to the gradual increase in similarity of the texts produced by different ChatGPT versions.
Chinese Translation
大型语言模型(LLMs)在合成生成数据上进行递归训练时,容易出现模型崩溃现象,这一现象的特征是生成无意义的输出。现有研究从理论或实证的角度探讨了这一问题,通常关注于单一模型在其自身输出上进行递归训练的情况。尽管先前的研究警告在这种条件下LLM输出质量可能会下降,但尚未进行纵向研究以评估这一效应随时间的变化。在本研究中,我们采用文本相似性度量来评估不同ChatGPT模型生成多样化文本输出的能力。我们的发现表明,近期ChatGPT版本在被明确提示生成多样文本时,即使将温度参数设置为1,其生成多样文本的能力也显著下降。输出多样性的减少可能归因于合成数据在其训练数据集中所占比例的影响,这些数据是由于LLM生成数据的互联网渗透而产生的。该现象被定义为模型自我收敛,因为不同ChatGPT版本生成的文本之间的相似性逐渐增加。
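A text-similarity diversity probe of this kind can be sketched as mean pairwise similarity over repeated generations from one model; Jaccard token overlap below stands in for whatever similarity metric the study actually uses.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two generated texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def self_convergence(outputs):
    """Mean pairwise similarity across generations sampled from one model:
    values creeping toward 1.0 across model versions indicate shrinking
    output diversity, the self-convergence pattern described above."""
    pairs = [(i, j) for i in range(len(outputs))
             for j in range(i + 1, len(outputs))]
    return sum(jaccard(outputs[i], outputs[j]) for i, j in pairs) / len(pairs)
```

Tracking this score for the same prompt across successive model releases is the kind of longitudinal comparison the abstract describes.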
cs.CL / 25 / 2603.12698

EvolveCoder: Evolving Test Cases via Adversarial Verification for Code Reinforcement Learning

EvolveCoder:通过对抗验证演化测试用例以实现代码强化学习
Ruan, Chi, Jiang, Dongfu, Zeng, Huaye, Nie, Ping, Chen, Wenhu
Abstract
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving code generation in large language models, but its effectiveness is limited by weak and static verification signals in existing coding RL datasets. In this paper, we propose a solution-conditioned and adversarial verification framework that iteratively refines test cases based on the execution behaviors of candidate solutions, with the goal of increasing difficulty, improving discriminative power, and reducing redundancy. Based on this framework, we introduce EvolveCoder-22k, a large-scale coding reinforcement learning dataset constructed through multiple rounds of adversarial test case evolution. Empirical analysis shows that iterative refinement substantially strengthens verification, with pass@1 decreasing from 43.80 to 31.22. Reinforcement learning on EvolveCoder-22k yields stable optimization and consistent performance gains, improving Qwen3-4B by an average of 4.2 points across four downstream benchmarks and outperforming strong 4B-scale baselines. Our results highlight the importance of adversarial, solution-conditioned verification for effective and scalable reinforcement learning in code generation.
Chinese Translation
带有可验证奖励的强化学习(RLVR)是一种有前景的方法,用于改善大语言模型中的代码生成,但其有效性受到现有编码强化学习数据集中弱且静态的验证信号的限制。本文提出了一种基于解决方案条件的对抗验证框架,该框架根据候选解决方案的执行行为迭代地优化测试用例,旨在增加难度、提高区分能力并减少冗余。在此框架基础上,我们引入了EvolveCoder-22k,这是一个通过多轮对抗测试用例演化构建的大规模编码强化学习数据集。实证分析表明,迭代优化显著增强了验证,pass@1从43.80降至31.22。在EvolveCoder-22k上进行的强化学习实现了稳定的优化和一致的性能提升,使Qwen3-4B在四个下游基准测试中平均提高了4.2分,并超越了强大的4B规模基线。我们的结果强调了对抗性、解决方案条件验证在代码生成中实现有效且可扩展的强化学习的重要性。
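One round of solution-conditioned test refinement can be sketched as filtering by execution signatures: drop tests that no candidate fails (no discriminative power) and tests whose pass/fail pattern duplicates one already kept (redundant). `run` is an assumed execution harness, not the paper's infrastructure.

```python
def evolve_tests(tests, solutions, run):
    """One adversarial refinement round: keep only tests that discriminate
    between candidate solutions and are not redundant with each other.
    `run(solution, test) -> bool` executes a candidate against a test."""
    kept, seen = [], set()
    for t in tests:
        signature = tuple(run(s, t) for s in solutions)
        if all(signature):          # every candidate passes: too weak
            continue
        if signature in seen:       # same pass/fail pattern as a kept test
            continue
        seen.add(signature)
        kept.append(t)
    return kept
```

Iterating this against progressively stronger candidate pools is what drives pass@1 down (harder verification), as reported for EvolveCoder-22k.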
cs.CL / 26 / 2603.12754

A Method for Learning Large-Scale Computational Construction Grammars from Semantically Annotated Corpora

从语义标注语料库中学习大规模计算构式语法的方法
Van Eecke, Paul, Beuls, Katrien
Abstract
We present a method for learning large-scale, broad-coverage construction grammars from corpora of language use. Starting from utterances annotated with constituency structure and semantic frames, the method facilitates the learning of human-interpretable computational construction grammars that capture the intricate relationship between syntactic structures and the semantic relations they express. The resulting grammars consist of networks of tens of thousands of constructions formalised within the Fluid Construction Grammar framework. Not only do these grammars support the frame-semantic analysis of open-domain text, they also house a trove of information about the syntactico-semantic usage patterns present in the data they were learnt from. The method and learnt grammars contribute to the scaling of usage-based, constructionist approaches to language, as they corroborate the scalability of a number of fundamental construction grammar conjectures while also providing a practical instrument for the constructionist study of English argument structure in broad-coverage corpora.
Chinese Translation
我们提出了一种从语言使用语料库中学习大规模、广覆盖构式语法的方法。该方法以带有成分结构和语义框架标注的话语为起点,促进了人类可解释的计算构式语法的学习,这些语法捕捉了句法结构与其表达的语义关系之间的复杂关系。所得到的语法由数万个构式的网络组成,这些构式在流体构式语法(Fluid Construction Grammar)框架内进行了形式化。这些语法不仅支持开放领域文本的框架语义分析,还包含了关于从中学习的数据中存在的句法-语义使用模式的大量信息。该方法和学习到的语法为基于使用的构式语言学方法的扩展做出了贡献,因为它们证实了若干基本构式语法假设的可扩展性,同时也为广覆盖语料库中英语论元结构的构式研究提供了实用工具。
cs.CL / 27 / 2603.12768

SectEval: Evaluating the Latent Sectarian Preferences of Large Language Models

SectEval:评估大型语言模型的潜在宗派偏好
Maheshwari, Aditya, Gajkeshwar, Amit, Sharma, Kaushal, Patel, Vivek
Abstract
As Large Language Models (LLMs) become a popular source for religious knowledge, it is important to know whether they treat different groups fairly. This study is the first to measure how LLMs handle the differences between the two main sects of Islam: Sunni and Shia. We present a test called SectEval, available in both English and Hindi and consisting of 88 questions, to check the bias of 15 top LLMs, both proprietary and open-weights. Our results show a major inconsistency based on language. In English, powerful models such as DeepSeek-v3 and GPT-4o often favored Shia answers. However, when asked the exact same questions in Hindi, these models switched to favoring Sunni answers. This means a user could get completely different religious advice just by changing languages. We also looked at how models react to location. Advanced models such as Claude-3.5 changed their answers to match the user's country, giving Shia answers to a user from Iran and Sunni answers to a user from Saudi Arabia. In contrast, smaller models (especially in Hindi) ignored the user's location and stuck to a Sunni viewpoint. These findings show that AI is not neutral; its religious ``truth'' changes depending on the language you speak and the country you claim to be from. The data set is available at https://github.com/secteval/SectEval/
Chinese Translation
随着大型语言模型(LLMs)成为宗教知识的重要来源,了解其是否公平对待不同群体显得尤为重要。本研究首次测量了大型语言模型在处理伊斯兰教两个主要宗派:逊尼派和什叶派之间的差异时的表现。我们提出了一项名为SectEval的测试,提供英语和印地语版本,共包含88个问题,用于检查15个顶尖大型语言模型(包括专有模型和开源模型)的偏见程度。我们的结果显示,模型的表现存在显著的语言不一致性。在英语中,许多强大的模型如DeepSeek-v3和GPT-4o常常偏向于什叶派的答案。然而,当用印地语询问相同的问题时,这些模型则转向偏向逊尼派的答案。这意味着用户仅通过更改语言就可能获得完全不同的宗教建议。我们还研究了模型对用户位置的反应。先进的模型Claude-3.5会根据用户所在国家调整其答案——对来自伊朗的用户提供什叶派答案,而对来自沙特阿拉伯的用户提供逊尼派答案。相比之下,较小的模型(特别是在印地语中)则忽视用户的位置,始终坚持逊尼派的观点。这些发现表明,人工智能并非中立;其宗教“真理”会根据你所使用的语言和声称来自的国家而变化。数据集可在https://github.com/secteval/SectEval/获取。
cs.CL / 28 / 2603.12795

SteerRM: Debiasing Reward Models via Sparse Autoencoders

SteerRM:通过稀疏自编码器去偏见奖励模型
Sun, Mengyuan, Yu, Zhuohao, Gu, Weizheng, Zhang, Shikun, Ye, Wei
Abstract
Reward models (RMs) are critical components of alignment pipelines, yet they exhibit biases toward superficial stylistic cues, preferring better-presented responses over semantically superior ones. Existing debiasing methods typically require retraining or architectural modifications, while direct activation suppression degrades performance due to representation entanglement. We propose SteerRM, the first training-free method for debiasing reward models using Sparse Autoencoder (SAE)-based interventions. SteerRM isolates stylistic effects using contrastive paired responses, identifies bias-related SAE features with a strength-stability criterion, and suppresses them at inference time. Across six reward models on RM-Bench, SteerRM improves Hard-split accuracy by 7.3 points on average while preserving overall performance. Results on a Gemma-based reward model and a controlled non-format bias further suggest generalization across RM architectures and bias types. We further find that format-related features are concentrated in shallow layers and transfer across models, revealing shared architecture-level bias encoding patterns. These results show that SAE-based interventions can mitigate reward-model biases without retraining, providing a practical and interpretable solution for alignment pipelines.
Chinese Translation
奖励模型(RMs)是对齐管道中的关键组件,但它们表现出对表面风格线索的偏见,更倾向于更好呈现的响应而非语义上更优的响应。现有的去偏见方法通常需要重新训练或架构修改,而直接抑制激活会因表示纠缠而降低性能。我们提出了SteerRM,这是首个无需训练的奖励模型去偏见方法,采用基于稀疏自编码器(SAE)的干预。SteerRM通过对比配对响应来隔离风格效应,利用强度-稳定性标准识别与偏见相关的SAE特征,并在推理时抑制这些特征。在RM-Bench上的六个奖励模型中,SteerRM平均提高了7.3个点的Hard-split准确率,同时保持了整体性能。基于Gemma的奖励模型和受控的非格式偏见的结果进一步表明了在RM架构和偏见类型之间的泛化。我们还发现,与格式相关的特征集中在浅层,并在模型之间转移,揭示了共享的架构级偏见编码模式。这些结果表明,基于SAE的干预可以在不重新训练的情况下减轻奖励模型的偏见,为对齐管道提供了一种实用且可解释的解决方案。
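The inference-time intervention SteerRM describes, encoding an activation with a sparse autoencoder, zeroing the identified bias-related features, and decoding back, can be sketched as follows. The weight names, shapes, and ReLU encoder are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def suppress_features(h, W_enc, b_enc, W_dec, b_dec, bias_features):
    """Decode a hidden state after zeroing selected SAE features.

    h: residual-stream activation; W_enc/b_enc, W_dec/b_dec: weights of a
    trained sparse autoencoder (hypothetical names); bias_features: indices
    flagged as bias-related, e.g. by a strength-stability criterion.
    """
    z = np.maximum(h @ W_enc + b_enc, 0.0)  # sparse feature activations (ReLU)
    z[..., bias_features] = 0.0             # ablate only the flagged features
    return z @ W_dec + b_dec                # reconstruct the activation

# Toy demonstration with random weights.
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))
b_enc, b_dec = np.zeros(d_sae), np.zeros(d_model)
h = rng.normal(size=d_model)
h_debiased = suppress_features(h, W_enc, b_enc, W_dec, b_dec, bias_features=[3, 7])
print(h_debiased.shape)
```

Ablating in the sparse feature basis, rather than suppressing directions in the dense residual stream directly, is what lets this kind of intervention sidestep the representation-entanglement degradation the abstract mentions.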
cs.CL / 29 / 2603.12823

Adaptive Vision-Language Model Routing for Computer Use Agents

计算机使用代理的自适应视觉-语言模型路由
Liu, Xunzhuo, He, Bowei, Liu, Xue, Luo, Andy, Zhang, Haichen, Chen, Huamin
Abstract
Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose \textbf{Adaptive VLM Routing} (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For \textit{warm} agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost--accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78\% while staying within 2 percentage points of an all-large-model baseline. When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code are available at https://github.com/vllm-project/semantic-router.
Chinese Translation
计算机使用代理(CUAs)依赖视觉-语言模型(VLM)解释屏幕截图并预测落地的工具调用,从而将自然语言指令转换为图形用户界面(GUI)操作,例如点击、按键和滚动。然而,不同VLM的定位(grounding)准确率差异显著,而当前的CUA系统通常将每个操作路由到单一固定模型,而不考虑其难度。我们提出了自适应VLM路由(AVR),这是一个在CUA调度器和VLM池之间插入轻量级语义路由层的框架。对于每个工具调用,AVR根据多模态嵌入估计操作难度,探测一个小型VLM以测量置信度,并将操作路由到预测准确率满足目标可靠性阈值的最便宜模型。对于具有先前用户界面交互记忆的“温暖”(warm)代理,检索到的上下文进一步缩小了小型和大型模型之间的能力差距,使得许多操作无需升级即可处理。我们将路由形式化为成本-准确性权衡,推导出基于阈值的模型选择策略,并使用ScreenSpot-Pro定位数据以及OpenClaw代理路由基准评估AVR。在这些设置中,AVR预计推理成本降低高达78%,同时与全大型模型基线的差距保持在2个百分点之内。当与视觉混淆副手(Visual Confused Deputy)保护措施结合时,AVR还将高风险操作直接升级到最强可用模型,在单一路由框架内统一了效率和安全性。模型、基准和代码可在 https://github.com/vllm-project/semantic-router 获取。
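The threshold-based routing policy, route each action to the cheapest model whose predicted accuracy meets the reliability target, can be sketched as below. The model pool, cost figures, and difficulty-to-accuracy predictors are invented for illustration:

```python
def route(difficulty, models, target_acc=0.95):
    """Return the cheapest model whose predicted accuracy meets the target.

    models: list of (name, cost, predict_acc) tuples, where predict_acc maps
    an estimated action difficulty in [0, 1] to an expected grounding
    accuracy. Falls back to the most accurate model if none qualifies.
    """
    for name, cost, predict_acc in sorted(models, key=lambda m: m[1]):
        if predict_acc(difficulty) >= target_acc:
            return name
    return max(models, key=lambda m: m[2](difficulty))[0]

# Hypothetical pool: accuracy degrades with difficulty, fastest for small models.
pool = [
    ("small-vlm",  1.0, lambda d: 0.98 - 0.60 * d),
    ("mid-vlm",    4.0, lambda d: 0.99 - 0.30 * d),
    ("large-vlm", 20.0, lambda d: 0.99 - 0.05 * d),
]
print(route(0.02, pool))  # easy action: the cheapest model already suffices
print(route(0.60, pool))  # hard action: escalates to the large model
```

The cost savings the abstract reports come from exactly this asymmetry: most GUI actions are easy enough for the cheap model to clear the threshold, so the expensive model is invoked only for the hard tail.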
cs.CL / 30 / 2603.12826

Rethinking Multiple-Choice Questions for RLVR: Unlocking Potential via Distractor Design

重新思考RLVR中的多项选择题:通过干扰项设计释放潜力
Guo, Xu, Ge, Qiming, Tong, Jian, Chen, Kedi, Zhang, Jin, Yang, Xiaogui, Gao, Xuan, Lv, Haijun, Lu, Zhihui, Zou, Yicheng, Guo, Qipeng
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capabilities of Large Language Models. When applied to RLVR, Multiple-Choice Questions (MCQs) offer a scalable source of verifiable data but risk inducing reward hacking, where models shortcut reasoning via random guessing or simple elimination. Current approaches often mitigate this by converting MCQs to open-ended formats, thereby discarding the contrastive signal provided by expert-designed distractors. In this work, we systematically investigate the impact of option design on RLVR. Our analysis highlights two primary insights: (1) Mismatches in option counts between training and testing degrade performance. (2) Strong distractors effectively mitigate random guessing, enabling effective RLVR training even with 2-way questions. Motivated by these findings, we propose Iterative Distractor Curation (IDC), a framework that actively constructs high-quality distractors to block elimination shortcuts and promote deep reasoning. Experiments on various benchmarks demonstrate that our method effectively enhances distractor quality and yields significant gains in RLVR training compared to the original data.
Chinese Translation
可验证奖励的强化学习(RLVR)显著增强了大型语言模型的推理能力。在RLVR中,多项选择题(MCQs)提供了一种可扩展的可验证数据来源,但可能导致奖励黑客行为,即模型通过随机猜测或简单消除来简化推理。目前的方法通常通过将多项选择题转换为开放式格式来缓解这一问题,从而丢弃了专家设计的干扰项所提供的对比信号。在本研究中,我们系统地调查了选项设计对RLVR的影响。我们的分析突出了两个主要见解:(1)训练和测试之间选项数量的不匹配会降低性能。(2)强干扰项有效地减轻了随机猜测,即使在2选项问题中也能实现有效的RLVR训练。基于这些发现,我们提出了迭代干扰项策划(IDC)框架,该框架主动构建高质量的干扰项,以阻止消除捷径并促进深度推理。在各种基准测试上的实验表明,我们的方法有效提高了干扰项的质量,并在RLVR训练中相比于原始数据取得了显著提升。
cs.CL / 31 / 2603.12872

CLARIN-PT-LDB: An Open LLM Leaderboard for Portuguese to assess Language, Culture and Civility

CLARIN-PT-LDB:一个开放的葡萄牙语大型语言模型排行榜,用于评估语言、文化和文明
Silva, João, Gomes, Luís, Branco, António
Abstract
This paper reports on the development of a leaderboard of Open Large Language Models (LLMs) for European Portuguese (PT-PT), and on its associated benchmarks. The leaderboard addresses a gap in LLM evaluation for European Portuguese, which until now had no leaderboard dedicated to this variant of the language. The paper also reports on novel benchmarks, including some that cover aspects of performance so far unavailable in benchmarks for European Portuguese, namely model safeguards and alignment with Portuguese culture. The leaderboard is available at https://huggingface.co/spaces/PORTULAN/portuguese-llm-leaderboard.
Chinese Translation
本文报告了一个针对欧洲葡萄牙语(PT-PT)的开放大型语言模型(LLM)排行榜及其相关基准的开发。该排行榜旨在填补对欧洲葡萄牙语的LLM评估的空白,迄今为止尚未有专门针对这种语言变体的排行榜。本文还报告了一些新颖的基准,包括一些针对性能方面的基准,这些方面在欧洲葡萄牙语的基准中尚未涉及,即模型的安全性和与葡萄牙文化的对齐。该排行榜可在 https://huggingface.co/spaces/PORTULAN/portuguese-llm-leaderboard 获取。
cs.CL / 32 / 2603.12906

Learning from Child-Directed Speech in Two-Language Scenarios: A French-English Case Study

在双语场景中从儿童导向语言中学习:法英案例研究
Binyamin, Liel, Sulem, Elior
Abstract
Research on developmentally plausible language models has largely focused on English, leaving open questions about multilingual settings. We present a systematic study of compact language models by extending BabyBERTa to English-French scenarios under strictly size-matched data conditions, covering monolingual, bilingual, and cross-lingual settings. Our design contrasts two types of training corpora: (i) child-directed speech (about 2.5M tokens), following BabyBERTa and related work, and (ii) multi-domain corpora (about 10M tokens), extending the BabyLM framework to French. To enable fair evaluation, we also introduce new resources, including French versions of QAMR and QASRL, as well as English and French multi-domain corpora. We evaluate the models on both syntactic and semantic tasks and compare them with models trained on Wikipedia-only data. The results reveal context-dependent effects: training on Wikipedia consistently benefits semantic tasks, whereas child-directed speech improves grammatical judgments in monolingual settings. Bilingual pretraining yields notable gains for textual entailment, with particularly strong improvements for French. Importantly, similar patterns emerge across BabyBERTa, RoBERTa, and LTG-BERT, suggesting consistent trends across architectures.
Chinese Translation
关于发展性合理语言模型的研究主要集中在英语上,关于多语言环境的问题仍未解决。我们通过在严格匹配数据条件下将 BabyBERTa 扩展到英法场景,系统地研究了紧凑型语言模型,涵盖了单语、双语和跨语言设置。我们的设计对比了两种类型的训练语料库:(i) 儿童导向语言(约 250 万个标记),遵循 BabyBERTa 及相关工作,以及 (ii) 多领域语料库(约 1000 万个标记),将 BabyLM 框架扩展到法语。为了实现公平评估,我们还引入了新的资源,包括 QAMR 和 QASRL 的法语版本,以及英语和法语的多领域语料库。我们在句法和语义任务上评估模型,并将其与仅在维基百科数据上训练的模型进行比较。结果揭示了上下文依赖效应:在维基百科上训练始终对语义任务有利,而儿童导向语言则改善了单语环境中的语法判断。双语预训练在文本蕴含方面带来了显著提升,尤其是法语的改善尤为明显。重要的是,类似的模式在 BabyBERTa、RoBERTa 和 LTG-BERT 中出现,表明不同架构之间存在一致的趋势。
cs.CL / 33 / 2603.12920

HMS-BERT: Hybrid Multi-Task Self-Training for Multilingual and Multi-Label Cyberbullying Detection

HMS-BERT:用于多语言和多标签网络欺凌检测的混合多任务自我训练
Feng, Zixin, Cui, Xinying, Sun, Yifan, Wei, Zheng, Yuan, Jiachen, Hu, Jiazhen, Xin, Ning, Hasan, Md Maruf
Abstract
Cyberbullying on social media is inherently multilingual and multi-faceted, where abusive behaviors often overlap across multiple categories. Existing methods are commonly limited by monolingual assumptions or single-task formulations, which restrict their effectiveness in realistic multilingual and multi-label scenarios. In this paper, we propose HMS-BERT, a hybrid multi-task self-training framework for multilingual and multi-label cyberbullying detection. Built upon a pretrained multilingual BERT backbone, HMS-BERT integrates contextual representations with handcrafted linguistic features and jointly optimizes a fine-grained multi-label abuse classification task and a three-class main classification task. To address labeled data scarcity in low-resource languages, an iterative self-training strategy with confidence-based pseudo-labeling is introduced to facilitate cross-lingual knowledge transfer. Experiments on four public datasets demonstrate that HMS-BERT achieves strong performance, attaining a macro F1-score of up to 0.9847 on the multi-label task and an accuracy of 0.6775 on the main classification task. Ablation studies further verify the effectiveness of the proposed components.
Chinese Translation
社交媒体上的网络欺凌本质上是多语言和多方面的,虐待行为往往在多个类别之间重叠。现有方法通常受到单语假设或单任务形式的限制,这限制了它们在现实多语言和多标签场景中的有效性。本文提出了HMS-BERT,一种用于多语言和多标签网络欺凌检测的混合多任务自我训练框架。HMS-BERT基于预训练的多语言BERT骨干网络,结合上下文表示与手工制作的语言特征,联合优化细粒度的多标签虐待分类任务和三类主分类任务。为了解决低资源语言中标记数据稀缺的问题,提出了一种基于置信度的伪标签迭代自我训练策略,以促进跨语言知识转移。在四个公共数据集上的实验表明,HMS-BERT表现出色,在多标签任务上达到了高达0.9847的宏F1分数,在主分类任务上的准确率为0.6775。消融研究进一步验证了所提组件的有效性。
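The confidence-based pseudo-labeling step in HMS-BERT's self-training loop can be sketched as follows; the threshold value, array layout, and toy probabilities are assumptions for illustration:

```python
import numpy as np

def pseudo_label(probs, threshold=0.9):
    """Select unlabeled samples for the next self-training round.

    probs: (n_samples, n_classes) predicted class probabilities on unlabeled
    (e.g. low-resource-language) text. Only predictions whose top-class
    confidence clears the threshold are kept as pseudo-labels; the rest are
    deferred to a later iteration, when the model may be more confident.
    """
    confidence = probs.max(axis=1)
    keep = confidence >= threshold
    return np.flatnonzero(keep), probs.argmax(axis=1)[keep]

probs = np.array([
    [0.95, 0.03, 0.02],   # confident  -> pseudo-labeled as class 0
    [0.40, 0.35, 0.25],   # ambiguous  -> discarded this round
    [0.05, 0.91, 0.04],   # confident  -> pseudo-labeled as class 1
])
idx, labels = pseudo_label(probs)
print(idx, labels)
```

Iterating this loop, train, pseudo-label, retrain on the enlarged set, is what transfers knowledge from high-resource to low-resource languages without new annotation.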
cs.CL / 34 / 2603.12932

DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning

DS$^2$-Instruct:针对大型语言模型指令调优的领域特定数据合成
Xu, Ruiyao, Samia, Noelle I., Liu, Han
Abstract
Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through human annotation. Existing data synthesis methods focus on general-purpose tasks and fail to capture domain-specific terminology and reasoning patterns. To address this, we introduce DS$^2$-Instruct, a zero-shot framework that generates domain-specific instruction datasets without human supervision. Our approach first generates task-informed keywords to ensure comprehensive domain coverage. It then creates diverse instructions by pairing these keywords with different cognitive levels from Bloom's Taxonomy. Finally, it uses self-consistency validation to ensure data quality. We apply this framework to generate datasets across seven challenging domains, such as mathematics, finance, and logical reasoning. Comprehensive evaluation demonstrates that models fine-tuned on our generated data achieve substantial improvements over existing data generation methods.
Chinese Translation
将大型语言模型(LLMs)适应于专业领域需要高质量的指令调优数据集,而通过人工标注创建这些数据集的成本非常高。现有的数据合成方法主要集中于通用任务,未能有效捕捉领域特定的术语和推理模式。为了解决这个问题,我们提出了DS$^2$-Instruct,这是一种零样本框架,可以在没有人工监督的情况下生成领域特定的指令数据集。我们的方法首先生成与任务相关的关键词,以确保全面覆盖领域。然后,通过将这些关键词与布鲁姆分类法(Bloom's Taxonomy)中的不同认知水平配对,创建多样化的指令。最后,我们使用自一致性验证来确保数据质量。我们将该框架应用于生成七个具有挑战性的领域的数据集,如数学、金融和逻辑推理。全面的评估表明,在我们生成的数据上微调的模型相比现有的数据生成方法取得了显著的改进。
cs.CL / 35 / 2603.12963

Long-form RewardBench: Evaluating Reward Models for Long-form Generation

长文本奖励基准:评估长文本生成的奖励模型
Huang, Hui, He, Yancheng, Liu, Wei, Yang, Muyun, Liu, Jiaheng, Chen, Kehai, Xu, Bing, Zhu, Conghui, Cao, Hailong, Zhao, Tiejun
Abstract
The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Various benchmarks have been built to evaluate reward models in various domains and scenarios. However, a significant gap remains in assessing reward models for long-form generation, despite its critical role in real-world applications. To bridge this, we introduce Long-form RewardBench, the first reward modeling testbed specifically designed for long-form generation. Our benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. We collected instruction and preference data through a meticulously designed multi-stage data collection process, and conducted extensive experiments on 20+ mainstream reward models, including both classifiers and generative models. Our findings reveal that current models still lack long-form reward modeling capabilities. Furthermore, we designed a novel Long-form Needle-in-a-Haystack Test, which revealed a correlation between reward modeling performance and the error's position within a response, as well as the overall response length, with distinct characteristics observed between classification and generative models. Finally, we demonstrate that classifiers exhibit better generalizability compared to generative models trained on the same data. As the first benchmark for long-form reward modeling, this work aims to offer a robust platform for visualizing progress in this crucial area.
Chinese Translation
强化学习对齐的广泛应用凸显了奖励模型日益重要的地位。为评估各个领域和场景中的奖励模型,已经建立了多种基准。然而,尽管长文本生成在实际应用中扮演着关键角色,评估长文本生成的奖励模型仍存在显著的空白。为此,我们推出了长文本奖励基准(Long-form RewardBench),这是第一个专门为长文本生成设计的奖励建模测试平台。我们的基准涵盖了五个关键子任务:问答(QA)、检索增强生成(RAG)、聊天(Chat)、写作(Writing)和推理(Reasoning)。我们通过精心设计的多阶段数据收集过程收集了指令和偏好数据,并对20多个主流奖励模型进行了广泛的实验,包括分类器和生成模型。我们的研究发现,当前模型在长文本奖励建模能力上仍然不足。此外,我们设计了一种新颖的长文本“干草堆中的针”测试(Long-form Needle-in-a-Haystack Test),揭示了奖励建模性能与响应中错误位置以及整体响应长度之间的相关性,并观察到分类模型和生成模型之间存在明显的特征差异。最后,我们证明了与在相同数据上训练的生成模型相比,分类器表现出更好的泛化能力。作为长文本奖励建模的第一个基准,本研究旨在提供一个强有力的平台,以可视化这一关键领域的进展。
cs.CL / 36 / 2603.12983

Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

人类标注是否必要?基于迭代最小贝叶斯风险的机器翻译错误范围检测蒸馏
Lyu, Boxuan, Song, Haiyue, Qu, Zhi
Abstract
Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of translation errors. While fine-tuning models on human-annotated data improves ESD performance, acquiring such data is expensive and prone to inconsistencies among annotators. To address this, we propose a novel self-evolution framework based on Minimum Bayes Risk (MBR) decoding, named Iterative MBR Distillation for ESD, which eliminates the reliance on human annotations by leveraging an off-the-shelf LLM to generate pseudo-labels. Extensive experiments on the WMT Metrics Shared Task datasets demonstrate that models trained solely on these self-generated pseudo-labels outperform both the unadapted base model and supervised baselines trained on human annotations at the system and span levels, while maintaining competitive sentence-level performance.
Chinese Translation
错误范围检测(Error Span Detection, ESD)是机器翻译(Machine Translation, MT)评估中的一个关键子任务,旨在识别翻译错误的位置和严重性。尽管在人工标注数据上微调模型可以提高ESD性能,但获取此类数据既昂贵又容易受到标注者之间不一致性的影响。为了解决这个问题,我们提出了一种基于最小贝叶斯风险(Minimum Bayes Risk, MBR)解码的新型自我演化框架,称为用于ESD的迭代MBR蒸馏(Iterative MBR Distillation),该框架通过利用现成的大型语言模型(LLM)生成伪标签,从而消除了对人工标注的依赖。在WMT指标共享任务数据集上的大量实验表明,仅在这些自生成的伪标签上训练的模型在系统和范围层面上均优于未适应的基础模型和基于人工标注训练的监督基线,同时在句子级别上保持了竞争力的性能。
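The MBR selection at the core of the pseudo-labeling step, pick the candidate output with the highest expected utility against the other sampled candidates treated as pseudo-references, can be sketched as follows. The unigram-F1 utility is a stand-in assumption, not the utility the paper actually uses:

```python
from collections import Counter

def unigram_f1(hyp, ref):
    """Toy utility: unigram overlap F1 between two token lists."""
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(hyp), overlap / len(ref)
    return 2 * p * r / (p + r)

def mbr_select(candidates, utility=unigram_f1):
    """Return the index of the candidate with the highest average utility
    against all other candidates (a uniform pseudo-reference distribution)."""
    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], r) for r in others) / len(others)
    return max(range(len(candidates)), key=expected_utility)

cands = [
    "the cat sat on the mat".split(),
    "the cat sat on a mat".split(),
    "a dog ran in the park".split(),
]
print(mbr_select(cands))  # the candidate agreeing most with the pool wins
```

Because MBR favors the consensus of the sample pool rather than any single generation, the selected outputs tend to be more reliable pseudo-labels than one-shot LLM annotations, which is what makes the iterative distillation viable without human labels.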
cs.CL / 37 / 2603.13038

Interpretable Semantic Gradients in SSD: A PCA Sweep Approach and a Case Study on AI Discourse

可解释的语义梯度在SSD中的应用:PCA筛选方法及人工智能话语的案例研究
Plisiecki, Hubert, Leniarska, Maria, Piotrowski, Jan, Zajenkowski, Marcin
Abstract
Supervised Semantic Differential (SSD) is a mixed quantitative-interpretive method that models how text meaning varies with continuous individual-difference variables by estimating a semantic gradient in an embedding space and interpreting its poles through clustering and text retrieval. SSD applies PCA before regression, but currently no systematic method exists for choosing the number of retained components, introducing avoidable researcher degrees of freedom in the analysis pipeline. We propose a PCA sweep procedure that treats dimensionality selection as a joint criterion over representation capacity, gradient interpretability, and stability across nearby values of K. We illustrate the method on a corpus of short posts about artificial intelligence written by Prolific participants who also completed Admiration and Rivalry narcissism scales. The sweep yields a stable, interpretable Admiration-related gradient contrasting optimistic, collaborative framings of AI with distrustful and derisive discourse, while no robust alignment emerges for Rivalry. We also show that a counterfactual using a high-PCA dimension solution heuristic produces diffuse, weakly structured clusters instead, reinforcing the value of the sweep-based choice of K. The case study shows how the PCA sweep constrains researcher degrees of freedom while preserving SSD's interpretive aims, supporting transparent and psychologically meaningful analyses of connotative meaning.
Chinese Translation
监督语义差异法(Supervised Semantic Differential,SSD)是一种混合定量与解释的方法,通过在嵌入空间中估计语义梯度并通过聚类和文本检索解释其极端值,来建模文本意义如何随连续个体差异变量而变化。SSD在回归之前应用主成分分析(PCA),但目前尚无系统的方法来选择保留的成分数量,这在分析流程中引入了可避免的研究者自由度。我们提出了一种PCA筛选程序,将维度选择视为在表示能力、梯度可解释性和K值附近稳定性之间的联合标准。我们在一组关于人工智能的短文语料库上展示了该方法,该语料库由完成了钦佩与竞争自恋量表的Prolific参与者撰写。筛选结果产生了一个稳定且可解释的与钦佩相关的梯度,突出了乐观、合作的人工智能框架与不信任和嘲讽话语之间的对比,而在竞争方面没有出现稳健的对齐。我们还展示了使用高PCA维度解决方案启发式的反事实分析产生了分散且结构松散的聚类,进一步强化了基于筛选的K选择的价值。案例研究表明,PCA筛选在保持SSD解释目标的同时限制了研究者的自由度,支持对内涵意义进行透明且具有心理学意义的分析。
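One leg of the sweep's joint criterion, representation capacity, can be illustrated by scoring each K by how much of the supervised signal the retained components preserve. The least-squares fit and in-sample R² below are illustrative assumptions; the paper's criterion additionally scores gradient interpretability and stability across nearby K:

```python
import numpy as np

def pca_sweep_r2(X, y, ks):
    """For each K, regress y on the top-K principal components of X and
    record the in-sample R^2 (one of several possible sweep criteria)."""
    Xc = X - X.mean(axis=0)          # center embeddings
    yc = y - y.mean()                # center the individual-difference score
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = {}
    for k in ks:
        Z = Xc @ Vt[:k].T                             # top-K projection
        w, *_ = np.linalg.lstsq(Z, yc, rcond=None)    # semantic gradient
        resid = yc - Z @ w
        scores[k] = 1.0 - (resid @ resid) / (yc @ yc)
    return scores

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)  # signal in one direction
print(pca_sweep_r2(X, y, ks=[2, 5, 10]))
```

Since the top-K subspaces are nested, in-sample R² can only grow with K; the point of the joint criterion is that interpretability and stability do not, which is what gives the sweep a non-trivial optimum.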
cs.CL / 38 / 2603.13045

Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

修补漏洞:缓解多语言翻译中的强化学习奖励操控
Liu, Yifeng, Ouyang, Siqi, Revanasiddappa, Yatish Hosmane, Li, Lei
Abstract
Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high-resource language pairs, yet their performance on low-resource translation still lags behind. Existing post-training methods rely heavily on high-quality parallel data, which are often scarce or unavailable for low-resource languages. In this paper, we introduce WALAR, a reinforcement training method that uses only monolingual text to elevate LLMs' translation capabilities on a large number of low-resource languages while retaining their performance on high-resource languages. Our key insight is based on the observation of failure modes (or "holes") in existing source-based multilingual quality estimation (QE) models. Reinforcement learning (RL) using these QE models tends to amplify such holes, resulting in poorer multilingual LLMs. We develop techniques including word alignment and language alignment to mitigate such holes in WALAR's reward for RL training. Using WALAR, we continually trained an LLM supporting translation of 101 languages. The experiments show that our new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, by a large margin on 1,400 language directions of the Flores-101 dataset.
Chinese Translation
大型语言模型(LLMs)在高资源语言对的机器翻译中表现出色,但在低资源翻译方面的表现仍然滞后。现有的后训练方法严重依赖高质量的平行数据,而这些数据在低资源语言中往往稀缺或不可用。本文介绍了一种名为WALAR的强化训练方法,仅使用单语文本来提升LLMs在大量低资源语言上的翻译能力,同时保持其在高资源语言上的表现。我们的关键见解基于对现有基于源的多语言质量估计(QE)模型中失败模式(或“漏洞”)的观察。使用这些QE模型的强化学习(RL)往往会放大这些漏洞,导致多语言LLMs的性能下降。我们开发了包括词对齐和语言对齐在内的技术,以缓解WALAR在RL训练中的奖励漏洞。我们持续训练了一种支持101种语言翻译的LLM,采用WALAR方法。实验结果表明,我们的新模型在Flores-101数据集的1400个语言方向上,显著超越了LLaMAX,这是最强的开源多语言LLMs之一。
cs.CL / 39 / 2603.13154

ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

ESG-Bench:用于减轻幻觉的长上下文ESG报告基准测试
Sun, Siqi, Wu, Ben Peng, Jin, Mali, Bai, Peizhen, Zhang, Hanpei, Song, Xingyi
Abstract
As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms' long-term and ethical performance. However, the length and complexity of ESG disclosures make them difficult to interpret and automate the analysis reliably. To support scalable and trustworthy analysis, this paper introduces ESG-Bench, a benchmark dataset for ESG report understanding and hallucination mitigation in large language models (LLMs). ESG-Bench contains human-annotated question-answer (QA) pairs grounded in real-world ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated. Framing ESG report analysis as a QA task with verifiability constraints enables systematic evaluation of LLMs' ability to extract and reason over ESG content and provides a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.
Chinese Translation
随着企业责任越来越多地纳入环境、社会和治理(ESG)标准,ESG报告在许多地区正成为法律要求,并成为记录可持续实践和评估企业长期及伦理表现的关键渠道。然而,ESG披露的长度和复杂性使其难以解释,并且难以可靠地自动化分析。为了支持可扩展和可信的分析,本文介绍了ESG-Bench,一个用于理解ESG报告和减轻大型语言模型(LLMs)幻觉的基准数据集。ESG-Bench包含基于真实世界ESG报告上下文的人类注释问答(QA)对,具有细粒度标签,指示模型输出是否得到事实支持或为幻觉。将ESG报告分析框架化为具有可验证约束的QA任务,使得系统评估LLMs提取和推理ESG内容的能力成为可能,并提供了一个新的应用场景:在社会敏感和合规关键的环境中减轻幻觉。我们设计了特定任务的思维链(CoT)提示策略,并使用CoT注释的推理对多个最先进的LLMs在ESG-Bench上进行了微调。我们的实验表明,这些基于CoT的方法在减少幻觉方面显著优于标准提示和直接微调,并且这种提升能够转移到ESG领域以外的现有QA基准上。
cs.CL / 40 / 2603.13201

Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

神经元感知的数据选择在大型语言模型的指令调优中的应用
Chen, Xin, Wu, Junchao, Yang, Shu, Zhan, Runzhe, Wu, Zeyu, Yang, Min, Huang, Shujian, Chao, Lidia S., Wong, Derek F.
Abstract
Instruction Tuning (IT) has been proven to be an effective approach to unlock the powerful capabilities of large language models (LLMs). Recent studies indicate that excessive IT data can degrade LLM performance, while carefully selecting a small subset of high-quality IT data can significantly enhance their capabilities. Therefore, identifying the most effective data subset from an IT dataset for developing either specific or general abilities in LLMs has become a critical challenge. To address this, we propose a novel and efficient framework called NAIT. NAIT evaluates the impact of IT data on LLM performance by analyzing the similarity of neuron activation patterns between the IT dataset and the target domain capability. Specifically, NAIT captures neuron activation patterns from in-domain datasets of target domain capabilities to construct reusable and transferable neuron activation features. It then evaluates and selects optimal samples based on the similarity between candidate samples and the expected activation features of the target capabilities. Experimental results show that training on the 10\% Alpaca-GPT4 IT data subset selected by NAIT consistently outperforms methods that rely on external advanced models or uncertainty-based features across various tasks. Our findings also reveal the transferability of neuron activation features across different capabilities of LLMs. In particular, IT data with more logical reasoning and programmatic features possesses strong general transferability, enabling models to develop stronger capabilities across multiple tasks, while a stable core subset of data is sufficient to consistently activate fundamental model capabilities and universally improve performance across diverse tasks.
Chinese Translation
指令调优(Instruction Tuning, IT)已被证明是释放大型语言模型(Large Language Models, LLMs)强大能力的有效方法。近期研究表明,过量的IT数据可能会降低LLMs的性能,而精心选择一小部分高质量的IT数据则可以显著提升其能力。因此,从IT数据集中识别出最有效的子集数据,以有效地发展LLMs的特定或一般能力,已成为一个关键挑战。为此,我们提出了一种新颖且高效的框架,称为NAIT。NAIT通过分析IT数据集与目标领域能力之间神经元激活模式的相似性,评估IT数据对LLMs性能的影响。具体而言,NAIT从目标领域能力的领域内数据集中捕获神经元激活模式,以构建可重用和可转移的神经元激活特征。然后,它根据候选样本与目标能力的预期激活特征之间的相似性来评估和选择最佳样本。实验结果表明,在NAIT选择的10% Alpaca-GPT4 IT数据子集上进行训练,始终优于依赖外部先进模型或基于不确定性特征的方法,适用于各种任务。我们的研究还揭示了神经元激活特征在LLMs不同能力之间的可转移性。特别是,具有更多逻辑推理和程序特征的IT数据具有强大的通用可转移性,使模型能够在多个任务中发展更强的能力,而一个稳定的核心数据子集足以持续激活基本模型能力,并普遍提高在不同任务中的表现。
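The selection step NAIT describes, ranking candidate IT samples by how closely their neuron activation patterns match a target capability's activation signature, can be sketched with cosine similarity. The vector representations and deterministic toy data are assumptions for illustration:

```python
import numpy as np

def select_top_k(candidate_acts, target_signature, k):
    """Return indices of the k candidates whose activation vectors are most
    cosine-similar to the target capability's expected activation feature.

    candidate_acts: (n_samples, n_neurons) activation pattern per IT sample;
    target_signature: (n_neurons,) pattern captured from in-domain data.
    """
    t = target_signature / np.linalg.norm(target_signature)
    sims = (candidate_acts @ t) / np.linalg.norm(candidate_acts, axis=1)
    return np.argsort(-sims)[:k]

# Toy data: sample 0 points the same way as the target (cosine 1.0),
# sample 1 is partially aligned, sample 2 points the opposite way.
target = np.arange(16.0) + 1.0
cands = np.stack([0.5 * target, np.ones(16), -target])
print(select_top_k(cands, target, k=2))
```

Because the signature is computed once per target capability and cosine similarity is scale-invariant, the same precomputed activation features can be reused to score any candidate pool, which is what makes them "reusable and transferable" in the abstract's terms.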