Daily Research Digest

arXiv Papers

2026-03-09
277 Papers · 4 Categories · 277 Translated
机器人学 (Robotics) · 59 papers
cs.RO / 1 / 2603.05530

ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation

ProFocus:视觉与语言导航中的主动感知与聚焦推理
Xue, Wei, Li, Mingcheng, Wu, Xuecheng, Tang, Jingqun, Yang, Dingkang, Zhang, Lihua
Abstract
Vision-and-Language Navigation (VLN) requires agents to accurately perceive complex visual environments and reason over navigation instructions and histories. However, existing methods passively process redundant visual inputs and treat all historical contexts indiscriminately, resulting in inefficient perception and unfocused reasoning. To address these challenges, we propose ProFocus, a training-free progressive framework that unifies Proactive Perception and Focused Reasoning through collaboration between large language models (LLMs) and vision-language models (VLMs). For proactive perception, ProFocus transforms panoramic observations into structured ego-centric semantic maps, enabling the orchestration agent to identify missing visual information needed for reliable decision-making, and to generate targeted visual queries with corresponding focus regions that guide the perception agent to acquire the required observations. For focused reasoning, we propose Branch-Diverse Monte Carlo Tree Search (BD-MCTS) to identify top-$k$ high-value waypoints from extensive historical candidates. The decision agent focuses reasoning on the historical contexts associated with these waypoints, rather than considering all historical waypoints equally. Extensive experiments validate the effectiveness of ProFocus, achieving state-of-the-art performance among zero-shot methods on R2R and REVERIE benchmarks.
Chinese Translation
视觉与语言导航(VLN)要求智能体准确感知复杂的视觉环境,并对导航指令和历史进行推理。然而,现有方法被动地处理冗余的视觉输入,并对所有历史上下文一视同仁,导致感知效率低下和推理不集中。为了解决这些挑战,我们提出了ProFocus,一个无需训练的渐进式框架,通过大型语言模型(LLMs)与视觉语言模型(VLMs)之间的协作统一了主动感知和聚焦推理。在主动感知方面,ProFocus将全景观察转换为结构化的自我中心语义地图,使得协调智能体能够识别可靠决策所需的缺失视觉信息,并生成针对性的视觉查询及相应的聚焦区域,引导感知智能体获取所需的观察。在聚焦推理方面,我们提出了分支多样的蒙特卡洛树搜索(BD-MCTS),以从大量历史候选中识别出前$k$个高价值的路径点。决策智能体将推理集中在与这些路径点相关的历史上下文上,而不是对所有历史路径点一视同仁。大量实验验证了ProFocus的有效性,在R2R和REVERIE基准测试中实现了零样本方法中的最先进性能。
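As a concrete illustration of the top-$k$ selection idea, here is a minimal Python sketch of diversity-aware waypoint scoring in the spirit of BD-MCTS. The scoring terms (mean node value, a UCB exploration bonus, and a bonus for not-yet-selected branches) and all weights are illustrative assumptions, not the paper's algorithm.

```python
import math

def select_topk_waypoints(candidates, k=3, c_ucb=1.4, c_div=0.5):
    """candidates: list of dicts with 'value', 'visits', and 'branch' keys."""
    total_visits = sum(w["visits"] for w in candidates) or 1
    selected, used_branches = [], set()
    for _ in range(min(k, len(candidates))):
        remaining = [w for w in candidates if w not in selected]
        def score(w):
            exploit = w["value"] / max(w["visits"], 1)                  # mean value
            explore = c_ucb * math.sqrt(math.log(total_visits) / max(w["visits"], 1))
            bonus = c_div if w["branch"] not in used_branches else 0.0  # branch diversity
            return exploit + explore + bonus
        best = max(remaining, key=score)
        selected.append(best)
        used_branches.add(best["branch"])
    return selected
```

Per the abstract, the decision agent would then reason only over the historical contexts attached to these selected waypoints.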
cs.RO / 2 / 2603.05546

Digital-Twin Losses for Lane-Compliant Trajectory Prediction at Urban Intersections

城市交叉口车道合规轨迹预测的数字孪生损失
Chao, Kuo-Yi, Haß, Erik Leo, Gegg, Melina, Zhang, Jiajie, Raßhofer, Ralph, Knoll, Alois Christian
Abstract
Accurate and safety-conscious trajectory prediction is a key technology for intelligent transportation systems, especially in V2X-enabled urban environments with complex multi-agent interactions. In this paper, we present a digital twin-driven V2X trajectory prediction pipeline that jointly leverages cooperative perception from vehicles and infrastructure to forecast multi-agent motion at signalized intersections. The proposed model combines a Bi-LSTM-based generator with a structured training objective consisting of a standard mean squared error (MSE) loss and a novel twin loss. The twin loss encodes infrastructure constraints, collision avoidance, diversity across predicted modes, and rule-based priors derived from the digital twin. While the MSE term ensures point-wise accuracy, the twin loss penalizes traffic rule violations, predicted collisions, and mode collapse, guiding the model toward scene-consistent and safety-compliant predictions. We train and evaluate our approach on real-world V2X data sent from the intersection to the vehicle and collected in urban corridors. In addition to standard trajectory metrics (ADE, FDE), we introduce ITS-relevant safety indicators, including infrastructure and rule violation rates. Experimental results demonstrate that the proposed training scheme significantly reduces critical violations while maintaining comparable prediction accuracy and real-time performance, highlighting the potential of digital twin-driven multi-loss learning for V2X-enabled intelligent transportation systems.
Chinese Translation
准确且注重安全的轨迹预测是智能交通系统的一项关键技术,尤其是在具有复杂多智能体交互的V2X(车联网)城市环境中。本文提出了一种基于数字孪生的V2X轨迹预测管道,该管道联合利用车辆和基础设施的协同感知,以预测信号交叉口的多智能体运动。所提出的模型结合了基于双向长短期记忆网络(Bi-LSTM)的生成器和一个结构化的训练目标,该目标由标准均方误差(MSE)损失和一种新颖的孪生损失组成。孪生损失编码了基础设施约束、碰撞避免、预测模式的多样性以及源自数字孪生的基于规则的先验知识。虽然MSE项确保了逐点的准确性,但孪生损失则惩罚交通规则的违反、预测碰撞和模式崩溃,引导模型朝向场景一致且符合安全要求的预测。我们在从交叉口发送到车辆、并在城市走廊中收集的真实世界V2X数据上训练和评估了我们的方法。除了标准的轨迹指标(ADE, FDE)外,我们还引入了与智能交通系统相关的安全指标,包括基础设施和规则违反率。实验结果表明,所提出的训练方案显著减少了关键违规行为,同时保持了可比的预测准确性和实时性能,突显了基于数字孪生的多损失学习在V2X智能交通系统中的潜力。
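To make the structured objective concrete, here is a hedged PyTorch sketch of an MSE-plus-twin-loss composite of the kind the abstract describes; the penalty inputs (rule-violation, collision-risk, and mode-spread scores) and the weights are placeholders, not the authors' formulation.

```python
import torch

def twin_loss(pred, target, rule_violation, collision_risk, mode_spread,
              w_rule=1.0, w_coll=1.0, w_div=0.1):
    """Composite objective: point-wise accuracy plus digital-twin-derived penalties."""
    mse = torch.mean((pred - target) ** 2)     # standard MSE term for accuracy
    # Penalize traffic-rule violations and predicted collisions; encourage
    # diversity across predicted modes to counter mode collapse.
    return mse + w_rule * rule_violation + w_coll * collision_risk - w_div * mode_spread
```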
cs.RO / 3 / 2603.05552

TEGA: A Tactile-Enhanced Grasping Assistant for Assistive Robotics via Sensor Fusion and Closed-Loop Haptic Feedback

TEGA:一种通过传感器融合和闭环触觉反馈增强的助力机器人抓取助手
You, Hengxu, Zhou, Tianyu, Xu, Fang, Smith, Kaleb, Du, Eric Jing
Abstract
Recent advances in teleoperation have enabled sophisticated manipulation of dexterous robotic hands, with most systems concentrating on guiding finger positions to achieve desired grasp configurations. However, while accurate finger positioning is essential, it often overlooks the equally critical task of grasp force modulation, vital for handling objects of diverse hardness, texture, and shape. This limitation poses a significant challenge for users, especially individuals with upper limb disabilities who lack natural tactile feedback and rely on indirect cues to infer appropriate force levels. To address this gap, we present the Tactile-Enhanced Grasping Assistant (TEGA), a closed-loop assistive teleoperation framework that fuses EMG-based intent2force inference with visuotactile sensing mapped into real-time vibrotactile feedback via a wearable haptic vest, enabling intuitive, proportional force adjustment during manipulation. The vest's real-time tactile feedback allows users to dynamically refine grasp force as they manipulate objects. User studies confirm that the system substantially improves grasp stability and task success, underscoring its potential for assistive robotic applications.
Chinese Translation
近年来,远程操作技术的进步使得灵巧机器人手的复杂操控成为可能,大多数系统集中于引导手指位置以实现期望的抓取配置。然而,尽管准确的手指定位至关重要,但往往忽视了同样重要的抓取力调节任务,这对于处理不同硬度、纹理和形状的物体至关重要。这一局限性对用户构成了重大挑战,尤其是那些缺乏自然触觉反馈的上肢残疾人士,他们依赖间接线索来推断适当的施力水平。为了解决这一问题,我们提出了触觉增强抓取助手(TEGA),这是一种闭环助力远程操作框架,融合了基于肌电图(EMG)的意图到力(intent2force)推断与视觉触觉感知,并实时映射到通过可穿戴触觉背心提供的振动触觉反馈,从而在操控过程中实现直观的、比例的力调节。可穿戴触觉背心提供实时触觉反馈,使用户能够在操作过程中动态调整抓取力。用户研究证实,该系统显著提高了抓取稳定性和任务成功率,突显了其在助力机器人应用中的潜力。
cs.RO / 4 / 2603.05574

PRISM: Personalized Refinement of Imitation Skills for Manipulation via Human Instructions

PRISM:通过人类指令个性化精炼模仿技能以进行操控
Boix-Granell, Arnau, San-Miguel-Tello, Alberto, Dalmau-Moreno, Magí, García, Néstor
Abstract
This paper presents PRISM: an instruction-conditioned refinement method for imitation policies in robotic manipulation. The approach bridges Imitation Learning (IL) and Reinforcement Learning (RL) frameworks into a seamless pipeline, such that an imitation policy for a broad generic task, generated from a set of user-guided demonstrations, can be refined through reinforcement to produce new, unseen fine-grained behaviours. The refinement process follows the Eureka paradigm, where reward functions for RL are iteratively generated from an initial natural-language task description. The presented approach builds on this mechanism to adapt a refined IL policy for a generic task to new goal configurations and newly introduced constraints, adding human feedback corrections on intermediate rollouts, which enables policy reusability and therefore data efficiency. Results for a pick-and-place task in a simulated scenario show that the proposed method outperforms policies trained without human feedback, improving robustness at deployment and reducing computational burden.
Chinese Translation
本文提出了PRISM:一种用于机器人操控中模仿策略的基于指令的精炼方法。该方法将模仿学习(Imitation Learning, IL)和强化学习(Reinforcement Learning, RL)框架无缝结合,使得从一组用户指导的示范中生成的广泛通用任务的模仿策略,能够通过强化学习进行精炼,从而生成新的、未见过的细粒度行为。精炼过程遵循Eureka范式,其中强化学习的奖励函数是从初始自然语言任务描述中迭代生成的。所提出的方法在此机制之上构建,旨在将通用任务的精炼IL策略适应于新的目标配置和新引入的约束,并通过在中间推演轨迹(rollouts)上添加人类反馈修正,实现策略的可重用性,从而提高数据效率。在模拟场景中的拾取与放置任务的结果表明,所提方法优于没有人类反馈的策略,提高了部署时的鲁棒性并减少了计算负担。
cs.RO / 5 / 2603.05576

Task Parameter Extrapolation via Learning Inverse Tasks from Forward Demonstrations

通过从正向示范学习逆任务进行任务参数外推
Bahar, Serdar, Dogangun, Fatih, Saveriano, Matteo, Nagai, Yukie, Ugur, Emre
Abstract
Generalizing skill policies to novel conditions remains a key challenge in robot learning. Imitation learning methods, while data-efficient, are largely confined to the training region and consistently fail on input data outside it, leading to unpredictable policy failures. Alternatively, transfer learning approaches offer trajectory-generation methods robust to changes in both environment and task, but they remain data-hungry and lack accuracy in zero-shot generalization. We address these challenges by framing the problem in the context of task inversion learning and proposing a novel joint learning approach to achieve accurate and efficient knowledge transfer. Our method constructs a common representation of the forward and inverse tasks, and leverages auxiliary forward demonstrations from novel configurations to successfully execute the corresponding inverse tasks, without any direct supervision. We show the extrapolation capabilities of our framework via ablation studies and experiments in simulated and real-world environments that require complex manipulation skills with a diverse set of objects and tools, where we outperform diffusion-based alternatives.
Chinese Translation
将技能策略推广到新条件仍然是机器人学习中的一个关键挑战。模仿学习方法虽然数据效率高,但主要局限于训练区域,并且在该区域外的输入数据上始终失败,导致不可预测的策略失败。另一方面,迁移学习方法提供了对环境或任务变化具有鲁棒性的轨迹生成方法,但它们仍然需要大量数据,并且在零样本泛化方面缺乏准确性。我们通过将问题框定在任务反演学习的背景下,提出了一种新颖的联合学习方法,以实现准确和高效的知识转移。我们的方法构建了正向任务和逆任务的共同表示,并利用来自新配置的辅助正向示范,成功执行相应的逆任务,而无需任何直接监督。我们通过消融研究和在需要复杂操作技能的多样化对象和工具的模拟和真实环境中的实验展示了我们框架的外推能力,并超越了基于扩散的替代方法。
cs.RO / 6 / 2603.05621

RACAS: Controlling Diverse Robots With a Single Agentic System

RACAS:通过单一智能体系统控制多样化机器人
Ashley, Dylan R., Przepióra, Jan, Chen, Yimeng, Abualsaud, Ali, Yesmagambet, Nurzhan, Park, Shinkyu, Feron, Eric, Schmidhuber, Jürgen
Abstract
Many robotic platforms expose an API through which external software can command their actuators and read their sensors. However, transitioning from these low-level interfaces to high-level autonomous behaviour requires a complicated pipeline, whose components demand distinct areas of expertise. Existing approaches to bridging this gap either require retraining for every new embodiment or have only been validated across structurally similar platforms. We introduce RACAS (Robot-Agnostic Control via Agentic Systems), a cooperative agentic architecture in which three LLM/VLM-based modules (Monitors, a Controller, and a Memory Curator) communicate exclusively through natural language to provide closed-loop robot control. RACAS requires only a natural language description of the robot, a definition of available actions, and a task specification; no source code, model weights, or reward functions need to be modified to move between platforms. We evaluate RACAS on several tasks using a wheeled ground robot, a recently published novel multi-jointed robotic limb, and an underwater vehicle. RACAS consistently solved all assigned tasks across these radically different platforms, demonstrating the potential of agentic AI to substantially reduce the barrier to prototyping robotic solutions.
Chinese Translation
许多机器人平台通过API向外部软件暴露其执行器的控制和传感器的读取功能。然而,从这些低级接口过渡到高级自主行为需要一个复杂的流程,其组件需要不同领域的专业知识。现有的弥合这一差距的方法要么要求对每个新实体进行重新训练,要么仅在结构相似的平台上经过验证。我们提出了RACAS(Robot-Agnostic Control via Agentic Systems),这是一种协作式智能体架构,其中三个基于大型语言模型/视觉语言模型的模块(监控器、控制器和记忆策展人)仅通过自然语言进行通信,以提供闭环机器人控制。RACAS仅需要对机器人的自然语言描述、可用动作的定义和任务规范;在不同平台之间切换时,无需修改源代码、模型权重或奖励函数。我们在多个任务上评估了RACAS,使用了一种轮式地面机器人、一种最近发布的新型多关节机器人肢体和一种水下航行器。RACAS在这些截然不同的平台上始终成功解决了所有分配的任务,展示了智能体AI在显著降低机器人解决方案原型开发门槛方面的潜力。
cs.RO / 7 / 2603.05638

Control Lyapunov Functions for Underactuated Soft Robots

欠驱动软机器人的控制李雅普诺夫函数
Pham, Huy, Patterson, Zach J.
Abstract
Soft and soft-rigid hybrid robots are inherently underactuated and operate under tight actuator limits, making task-space control with stability guarantees challenging. Common nonlinear strategies for soft robots (e.g., those based on PD control) often rely on the assumption of full actuation with no actuator limits. This paper presents a general control framework for task-space regulation and tracking of underactuated soft robots under bounded inputs. The method enforces a rapidly exponentially stabilizing control Lyapunov function as a convex inequality constraint while simultaneously satisfying underactuated full-body dynamics and actuator bounds. We validate the approach in simulation on several platforms spanning increasing degrees of underactuation: a simple two-link tendon-driven "finger", a trimmed helicoid manipulator, and a highly underactuated spiral robot. We compare against a number of baseline methods from the literature. Results show improved task-space accuracy and consistent Lyapunov convergence under input limits, achieving superior set-point and trajectory-tracking performance.
Chinese Translation
软机器人和软-刚性混合机器人本质上是欠驱动的,并在严格的执行器限制下运行,这使得在任务空间中进行稳定性保证的控制变得具有挑战性。常见的软机器人非线性策略(例如基于PD控制的方法)通常依赖于完全驱动且没有执行器限制的假设。本文提出了一种通用控制框架,用于在有界输入下对欠驱动软机器人进行任务空间的调节和跟踪。该方法将快速指数稳定的控制李雅普诺夫函数作为凸不等式约束,同时满足欠驱动全身动力学和执行器限制。我们在多个平台上进行仿真验证,涵盖了不同程度的欠驱动:一个简单的双连杆腱驱动“手指”、一个修剪的螺旋操纵器,以及一个高度欠驱动的螺旋机器人。我们与文献中的多种基线方法进行了比较。结果表明,在输入限制下,任务空间的准确性和李雅普诺夫收敛性得到了改善,实现了优越的设定点和轨迹跟踪性能。
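For reference, the generic control-Lyapunov-function quadratic program that such a framework builds on has the following standard form (a sketch of the common template, not the paper's exact formulation). For control-affine dynamics, $\dot V(x,u) = L_f V(x) + L_g V(x)\,u$ is affine in $u$, so the decrease condition is a convex inequality and the actuator limits enter as box constraints:

```latex
\begin{aligned}
u^{*} \;=\; \arg\min_{u}\;& \tfrac{1}{2}\,\lVert u \rVert^{2} \\
\text{s.t.}\;\;& L_f V(x) + L_g V(x)\,u \;\le\; -\tfrac{c}{\varepsilon}\,V(x)
  && \text{(rapid exponential decrease)} \\
& u_{\min} \;\le\; u \;\le\; u_{\max} && \text{(actuator bounds)}
\end{aligned}
```

The abstract's contribution lies in enforcing this kind of constraint jointly with the underactuated full-body dynamics, which the generic template above does not capture.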
cs.RO / 8 / 2603.05641

RFM-HRI : A Multimodal Dataset of Medical Robot Failure, User Reaction and Recovery Preferences for Item Retrieval Tasks

RFM-HRI:一个关于物品检索任务中医疗机器人故障、用户反应及恢复偏好的多模态数据集
Batra, Yashika, Pioldi, Giuliano, Ekpo, Promise, Sayatqyzy, Arman, Maruur, Purnjay, Otieno, Shalom, Ching, Kevin, Taylor, Angelique
Abstract
While robots deployed in real-world environments inevitably experience interaction failures, understanding how users respond through verbal and non-verbal behaviors remains under-explored in human-robot interaction (HRI). This gap is particularly significant in healthcare-inspired settings, where interaction failures can directly affect task performance and user trust. We present the Robot Failures in Medical HRI (RFM-HRI) Dataset, a multimodal dataset capturing dyadic interactions between humans and robots embodied in crash carts, where communication failures are systematically induced during item retrieval tasks. Through Wizard-of-Oz studies with 41 participants across laboratory and hospital settings, we recorded responses to four failure types (speech, timing, comprehension, and search) derived from three years of crash-cart robot interaction data. The dataset contains 214 interaction samples including facial action units, head pose, speech transcripts, and post-interaction self-reports. Our analysis shows that failures significantly degrade affective valence and reduce perceived control compared to successful interactions. Failures are strongly associated with confusion, annoyance, and frustration, while successful interactions are characterized by surprise, relief, and confidence in task completion. Emotional responses also evolve across repeated failures, with confusion decreasing and frustration increasing over time. This work contributes (1) a publicly available multimodal dataset (RFM-HRI), (2) analysis of user responses to different failure types and preferred recovery strategies, and (3) a crash-cart retrieval scenario enabling systematic comparison of recovery strategies with implications for safety-critical failure recovery. Our findings provide a foundation for failure detection and recovery methods in embodied HRI.
Chinese Translation
在现实环境中部署的机器人不可避免地会经历交互故障,但用户通过言语和非言语行为做出的反应在人机交互(HRI)领域仍未得到充分探索。这一差距在医疗保健相关的环境中尤为显著,因为交互故障可能直接影响任务表现和用户信任。我们提出了医疗HRI中的机器人故障数据集(RFM-HRI),这是一个多模态数据集,捕捉了人与以急救车为载体的机器人之间的二元(dyadic)交互,其中在物品检索任务中系统性地诱发了沟通故障。通过在实验室和医院环境中对41名参与者进行的Wizard-of-Oz研究,我们记录了他们对四种故障类型(言语、时机、理解和搜索)的反应,这些故障类型源自三年的急救车机器人交互数据。该数据集包含214个交互样本,包括面部动作单元、头部姿态、语音转录和交互后自我报告。我们的分析表明,与成功的交互相比,故障显著降低了情感效价并减少了感知控制。故障与困惑、恼怒和挫败感有很强的关联,而成功的交互则以惊讶、宽慰和对任务完成的信心为特征。情感反应在重复故障中也会演变,困惑感逐渐减少而挫败感则随时间增加。这项工作贡献了(1)一个公开可用的多模态数据集(RFM-HRI),(2)对用户针对不同故障类型的反应及其偏好的恢复策略的分析,以及(3)一个急救车检索场景,使恢复策略的系统比较成为可能,并对安全关键的故障恢复具有重要意义。我们的发现为具身HRI中的故障检测和恢复方法提供了基础。
cs.RO / 9 / 2603.05642

Relational Semantic Reasoning on 3D Scene Graphs for Open World Interactive Object Search

基于关系语义推理的3D场景图在开放世界互动物体搜索中的应用
Mahdi, Imen, Cassinelli, Matteo, Despinoy, Fabien, Welschehold, Tim, Valada, Abhinav
Abstract
Open-world interactive object search in household environments requires understanding semantic relationships between objects and their surrounding context to guide exploration efficiently. Prior methods either rely on vision-language embeddings similarity, which does not reliably capture task-relevant relational semantics, or large language models (LLMs), which are too slow and costly for real-time deployment. We introduce SCOUT: Scene Graph-Based Exploration with Learned Utility for Open-World Interactive Object Search, a novel method that searches directly over 3D scene graphs by assigning utility scores to rooms, frontiers, and objects using relational exploration heuristics such as room-object containment and object-object co-occurrence. To make this practical without sacrificing open-vocabulary generalization, we propose an offline procedural distillation framework that extracts structured relational knowledge from LLMs into lightweight models for on-robot inference. Furthermore, we present SymSearch, a scalable symbolic benchmark for evaluating semantic reasoning in interactive object search tasks. Extensive evaluations across symbolic and simulation environments show that SCOUT outperforms embedding similarity-based methods and matches LLM-level performance while remaining computationally efficient. Finally, real-world experiments demonstrate effective transfer to physical environments, enabling open-world interactive object search under realistic sensing and navigation constraints.
Chinese Translation
在家庭环境中,开放世界的互动物体搜索需要理解物体与其周围环境之间的语义关系,以有效指导探索。以往的方法要么依赖于视觉-语言嵌入的相似性,这无法可靠地捕捉与任务相关的关系语义,要么依赖于大型语言模型(LLMs),后者在实时部署中速度过慢且成本过高。我们提出了SCOUT(基于场景图、利用学习效用进行开放世界互动物体搜索的探索方法),该方法通过使用关系探索启发式(如房间-物体包含关系和物体-物体共现)直接在3D场景图上进行搜索,为房间、边界和物体分配效用分数。为了在不牺牲开放词汇泛化的情况下使其具有实用性,我们提出了一种离线程序化蒸馏框架,将LLMs中的结构化关系知识蒸馏到轻量级模型中,以便在机器人上进行推理。此外,我们还提出了SymSearch,一个可扩展的符号基准,用于评估互动物体搜索任务中的语义推理。在符号和仿真环境中的广泛评估显示,SCOUT优于基于嵌入相似性的方法,并在保持计算效率的同时达到与LLM相当的性能。最后,现实世界的实验表明,该方法能够有效地迁移到物理环境中,在现实的感知和导航约束下实现开放世界的互动物体搜索。
cs.RO / 10 / 2603.05670

TransMASK: Masked State Representation through Learned Transformation

TransMASK:通过学习变换实现的掩码状态表示
Parekh, Sagar, Culbertson, Preston, Losey, Dylan P.
Abstract
Humans train robots to complete tasks in one environment, and expect robots to perform those same tasks in new environments. As humans, we know which aspects of the environment (i.e., the state) are relevant to the task. But there are also things that do not matter; e.g., the color of the table or the presence of clutter in the background. Ideally, the robot's policy learns to ignore these irrelevant state components. Achieving this invariance improves generalization: the robot knows not to factor irrelevant variables into its control decisions, making the policy more robust to environment changes. In this paper we therefore propose a self-supervised method to learn a mask which, when multiplied by the observed state, transforms that state into a latent representation that is biased towards relevant elements. Our method -- which we call TransMASK -- can be combined with a variety of imitation learning frameworks (such as diffusion policies) without any additional labels or alterations to the loss function. To achieve this, we recognize that the learned policy updates to better match the human's true policy. This true policy only depends on the relevant parts of the state; hence, as the gradients pass back through the learned policy and our proposed mask, they increase the value for elements that cause the robot to better imitate the human. We can therefore train TransMASK at the same time as we learn the policy. By normalizing the magnitude of each row in TransMASK, we force the mask to align with the Jacobian of the expert policy: columns that correspond to relevant states have large magnitudes, while columns for irrelevant states approach zero magnitude. We compare our approach to other methods that extract relevant states for downstream imitation learning. See our project website: https://collab.me.vt.edu/TransMASK/
Chinese Translation
人类训练机器人在一个环境中完成任务,并期望机器人能够在新环境中执行相同的任务。作为人类,我们知道环境中的哪些方面(即状态)与任务相关。但也有一些因素并不重要;例如,桌子的颜色或背景中的杂物。理想情况下,机器人的策略应学习忽略这些无关的状态组件。实现这种不变性可以提高泛化能力:机器人知道不应将无关变量纳入其控制决策,从而使策略对环境变化更具鲁棒性。因此,在本文中,我们提出了一种自监督方法来学习一个掩码,当该掩码与观察到的状态相乘时,可以将该状态转化为一个偏向于相关元素的潜在表示。我们的方法——称为TransMASK——可以与多种模仿学习框架(如扩散策略)结合使用,而无需任何额外的标签或对损失函数的修改。为此,我们认识到学习到的策略会更新以更好地匹配人类的真实策略。该真实策略仅依赖于状态的相关部分;因此,当梯度通过学习到的策略和我们提出的掩码反向传播时,它们会增加使机器人更好地模仿人类的元素的值。因此,我们可以在学习策略的同时训练TransMASK。通过归一化TransMASK中每一行的幅度,我们强制掩码与专家策略的雅可比矩阵对齐:对应于相关状态的列具有较大的幅度,而对应于无关状态的列的幅度接近零。我们将我们的方法与其他提取相关状态以进行下游模仿学习的方法进行了比较。请参见我们的项目网站:https://collab.me.vt.edu/TransMASK/
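The row-normalization mechanism described above fits in a few lines. The following PyTorch sketch is an illustration under assumed shapes and initialization (the module name and details are invented), not the authors' released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateMask(nn.Module):
    """Learned mask applied to the observed state. Normalizing each row forces a
    trade-off across columns: columns tied to relevant state elements keep large
    magnitude, while columns for irrelevant elements shrink toward zero."""
    def __init__(self, state_dim, latent_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(latent_dim, state_dim))

    def forward(self, state):                   # state: (batch, state_dim)
        mask = F.normalize(self.weight, dim=1)  # unit-norm rows
        return state @ mask.t()                 # latent representation for the policy
```

Per the abstract, no extra loss term or label is needed: the mask trains end-to-end from the gradients of the imitation objective itself.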
cs.RO / 11 / 2603.05687

Contact-Grounded Policy: Dexterous Visuotactile Policy with Generative Contact Grounding

接触基础策略:具有生成接触基础的灵巧视觉触觉策略
Xu, Zhengtong, Wang, Yeping, Abbatematteo, Ben, Preechayasomboon, Jom, Chan, Sonny, Colonnese, Nick, Memar, Amirhossein H.
Abstract
Contact-Grounded Policy (CGP) enables fine-grained, contact-rich dexterous manipulation by grounding multi-point contacts through predicting the actual robot state and tactile feedback, and by using a learned contact-consistency mapping to convert these predictions into controller-executable targets for a compliance controller. CGP supports both dense tactile arrays and vision-based tactile sensors mounted on the hand. We collect demonstrations via teleoperation in both simulation and on a physical robot, and evaluate CGP across multiple dexterous manipulation tasks.
Chinese Translation
接触基础策略(Contact-Grounded Policy, CGP)通过预测机器人实际状态和触觉反馈来落实(ground)多点接触,并利用学习到的接触一致性映射将这些预测转换为柔顺控制器(compliance controller)可执行的目标,从而实现细粒度、接触丰富的灵巧操作。CGP同时支持密集触觉阵列和安装在手部的基于视觉的触觉传感器。我们通过在仿真环境和物理机器人上进行遥操作收集演示,并在多个灵巧操作任务中评估CGP的表现。
cs.RO / 12 / 2603.05716

Introducing the transitional autonomous vehicle lane-changing dataset: Empirical Experiments

引入过渡性自主车辆变道数据集:实证实验
Sharma, Abhinav, He, Zijun, Chen, Danjue
Abstract
Transitional autonomous vehicles (tAVs), which operate beyond SAE Level 1-2 automation but short of full autonomy, are increasingly sharing the road with human-driven vehicles (HDVs). As these systems interact during complex maneuvers such as lane changes, new patterns may emerge with implications for traffic stability and safety. Assessing these dynamics, particularly during mandatory lane changes, requires high-resolution trajectory data, yet datasets capturing tAV lane-changing behavior are scarce. This study introduces the North Carolina Transitional Autonomous Vehicle Lane-Changing (NC-tALC) Dataset, a high-fidelity trajectory dataset designed to characterize tAV interactions during lane-changing maneuvers. The dataset includes two controlled experimental series. In the first, tAV lane-changing experiments, a tAV executes lane changes in the presence of adaptive cruise control (ACC) equipped target vehicles, enabling analysis of lane-changing execution. In the second, tAV responding experiments, two tAVs act as followers and respond to cut-in maneuvers initiated by another tAV, enabling analysis of follower response dynamics. The dataset contains 152 trials (72 lane-changing and 80 responding trials) sampled at 20 Hz with centimeter-level RTK-GPS accuracy. The NC-tALC dataset provides a rigorous empirical foundation for evaluating tAV decision-making and interaction dynamics in controlled mandatory lane-changing scenarios.
Chinese Translation
过渡性自主车辆(tAVs)在超越SAE 1-2级自动化但尚未实现完全自主的情况下,越来越多地与人类驾驶的车辆(HDVs)共享道路。当这些系统在复杂的操作中(如变道)进行交互时,可能会出现新的模式,这对交通稳定性和安全性产生影响。评估这些动态,特别是在强制变道期间,需要高分辨率的轨迹数据,但捕捉tAV变道行为的数据集相对稀缺。本研究引入了北卡罗来纳州过渡性自主车辆变道(NC-tALC)数据集,这是一个高保真轨迹数据集,旨在表征tAV在变道操作中的交互。该数据集包括两组受控实验系列。在第一组tAV变道实验中,一辆tAV在配备自适应巡航控制(ACC)的目标车辆存在下执行变道,从而能够分析变道执行情况。在第二组tAV响应实验中,两辆tAV作为跟随者,响应另一辆tAV发起的插入操作,从而能够分析跟随者的响应动态。该数据集包含152个试验(72个变道试验和80个响应试验),以20 Hz的频率采样,具有厘米级的RTK-GPS精度。NC-tALC数据集为评估tAV在受控强制变道场景中的决策和交互动态提供了严格的实证基础。
cs.RO / 13 / 2603.05748

Environment-Aware Path Generation for Robotic Additive Manufacturing of Structures

面向环境的结构机器人增材制造路径生成
Rabiei, Mahsa, Moini, Reza
Abstract
Robotic Additive Manufacturing (AM) has emerged as a scalable and customizable construction method in the last decade. However, current AM design methods rely on a pre-conceived (a priori) toolpath for the structure, often developed via offline slicing software. Moreover, considering the dynamic construction environments involving obstacles in terrestrial and extraterrestrial settings, there is a need for online path generation methods. Here, an environment-aware path generation framework (PGF) is proposed for the first time, in which structures are designed in an online fashion by utilizing four path planning (PP) algorithms (two search-based and two sampling-based). To evaluate the performance of the proposed PGF in different obstacle arrangements (periodic, random) for two types of structures (closed and open), structural (path roughness, turns, offset, Root Mean Square Error (RMSE), deviation) and computational (run time) performance metrics are developed. The most challenging environments (i.e., dense, with a high number of obstacles) are considered to saturate the feasibility limits of the PP algorithms. The capability of each of the four path planners used in the PGF to find a feasible path is assessed. Finally, the effectiveness of the proposed structural performance metrics is evaluated individually and comparatively, and the metrics most essential for evaluating the toolpath of the resulting structures are prescribed. Consequently, the most promising path planners in challenging environments are identified for robotic additive manufacturing applications.
Chinese Translation
机器人增材制造(AM)在过去十年中已成为一种可扩展和可定制的建筑方法。然而,目前的增材制造设计方法依赖于预先设定的(A priori)结构工具路径,这些路径通常通过离线切片软件开发。此外,考虑到涉及地面和外星环境中的障碍物的动态施工环境,迫切需要在线路径生成方法。在此,首次提出了一种面向环境的路径生成框架(PGF),该框架通过利用四种路径规划(PP)算法(两种基于搜索的和两种基于采样的)以在线方式设计结构。为了评估所提出的PGF在不同障碍物排列(周期性、随机)下对两种类型结构(封闭和开放)的性能,开发了结构(路径粗糙度、转弯、偏移、均方根误差(RMSE)、偏差)和计算(运行时间)性能指标。考虑了最具挑战性的环境(即障碍物数量密集)以饱和路径规划算法的可行性极限。评估了PGF中使用的四种路径规划器在找到可行路径方面的能力。最后,分别和比较地评估了所提出的结构性能指标的有效性,并规定了评估所生成结构工具路径所需的最重要指标。因此,识别出在挑战性环境中最有前景的路径规划器,以便于机器人增材制造应用。
cs.RO / 14 / 2603.05751

Vision-Language System using Open-Source LLMs for Gestures in Medical Interpreter Robots

基于开源大型语言模型的医疗翻译机器人手势视觉-语言系统
Ngo, Thanh-Tung, Murphy, Emma, Ross, Robert J.
Abstract
Effective communication is vital in healthcare, especially across language barriers, where non-verbal cues and gestures are critical. This paper presents a privacy-preserving vision-language framework for medical interpreter robots that detects specific speech acts (consent and instruction) and generates corresponding robotic gestures. Built on locally deployed open-source models, the system utilizes a Large Language Model (LLM) with few-shot prompting for intent detection. We also introduce a novel dataset of clinical conversations annotated for speech acts and paired with gesture clips. Our identification module achieved 0.90 accuracy, 0.93 weighted precision, and a 0.91 weighted F1-Score. Our approach significantly improves computational efficiency and, in user studies, outperforms the speech-gesture generation baseline in human-likeness while maintaining comparable appropriateness.
Chinese Translation
有效的沟通在医疗保健中至关重要,尤其是在语言障碍的情况下,非语言线索和手势显得尤为重要。本文提出了一种保护隐私的视觉-语言框架,用于医疗翻译机器人,该框架能够检测特定的言语行为(同意和指令)并生成相应的机器人手势。该系统基于本地部署的开源模型,利用大型语言模型(Large Language Model, LLM)通过少量示例提示进行意图检测。我们还引入了一个新的临床对话数据集,标注了言语行为并配对了手势片段。我们的识别模块达到了0.90的准确率、0.93的加权精确率和0.91的加权F1分数。我们的方法显著提高了计算效率,并且在用户研究中,在人类相似性方面超越了语音-手势生成基线,同时保持了可比的适当性。
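As an illustration of the few-shot intent-detection step, here is a minimal prompt-construction sketch in Python. The example utterances are invented placeholders; the two labels follow the abstract's speech acts (consent and instruction), and no particular LLM client is assumed.

```python
# Invented example utterances for illustration only.
FEW_SHOT_PROMPT = """Classify the clinician's utterance as CONSENT or INSTRUCTION.

Utterance: "Is it okay if I check your blood pressure now?"
Speech act: CONSENT

Utterance: "Please raise your left arm."
Speech act: INSTRUCTION

Utterance: "{utterance}"
Speech act:"""

def build_prompt(utterance: str) -> str:
    return FEW_SHOT_PROMPT.format(utterance=utterance)
```

The resulting prompt would be sent to the locally deployed open-source LLM, and the predicted speech act mapped to a corresponding gesture clip.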
cs.RO / 15 / 2603.05754

Safe-Night VLA: Seeing the Unseen via Thermal-Perceptive Vision-Language-Action Models for Safety-Critical Manipulation

Safe-Night VLA:通过热感知视觉-语言-动作模型看见不可见之物,实现安全关键操作
Yu, Dian, Zhou, Qingchuan, Huang, Bingkun, Khadiv, Majid, Yang, Zewen
Abstract
Current Vision-Language-Action (VLA) models rely primarily on RGB perception, preventing them from capturing modalities such as thermal signals that are imperceptible to conventional visual sensors. Moreover, end-to-end generative policies lack explicit safety constraints, making them fragile when encountering obstacles and novel scenarios outside the training distribution. To address these limitations, we propose Safe-Night VLA, a multimodal manipulation framework that enables robots to see the unseen while enforcing rigorous safety constraints for thermal-aware manipulation in unstructured environments. Specifically, Safe-Night VLA integrates long-wave infrared thermal perception into a pre-trained vision-language backbone, enabling semantic reasoning grounded in thermodynamic properties. To ensure safe execution under out-of-distribution conditions, we incorporate a safety filter via control barrier functions, which provide deterministic workspace constraint enforcement during policy execution. We validate our framework through real-world experiments on a Franka manipulator, introducing a novel evaluation paradigm featuring temperature-conditioned manipulation, subsurface target localization, and reflection disambiguation, while maintaining constrained execution at inference time. Results demonstrate that Safe-Night VLA outperforms RGB-only baselines and provide empirical evidence that foundation models can effectively leverage non-visible physical modalities for robust manipulation.
Chinese Translation
当前的视觉-语言-行动(VLA)模型主要依赖于RGB感知,无法捕捉到传统视觉传感器无法感知的热信号等模态。此外,端到端生成策略缺乏明确的安全约束,使其在遇到障碍物和训练分布之外的新场景时显得脆弱。为了解决这些局限性,我们提出了安全夜间VLA(Safe-Night VLA),这是一个多模态操作框架,使机器人能够观察隐形物体,同时在非结构化环境中对热感知操作施加严格的安全约束。具体而言,安全夜间VLA将长波红外热感知集成到预训练的视觉-语言骨干网络中,使其能够基于热力学特性进行语义推理。为了确保在分布外条件下的安全执行,我们通过控制障碍函数引入了安全过滤器,在策略执行过程中提供确定性的工作空间约束执行。我们通过在Franka操纵器上的真实世界实验验证了我们的框架,提出了一种新的评估范式,涉及温度条件下的操作、地下目标定位和反射消歧,同时在推理时保持约束执行。结果表明,安全夜间VLA优于仅使用RGB的基线,并提供了基础模型能够有效利用不可见物理模态进行稳健操作的实证证据。
cs.RO / 16 / 2603.05757

EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation

EmboAlign:将视频生成与组合约束对齐以实现零样本操作
Zhang, Gehao, Ni, Zhenyang, Mohapatra, Payal, Liu, Han, Zhang, Ruohan, Zhu, Qi
Abstract
Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation. However, VGMs often produce physically implausible rollouts, and converting their pixel-space motion into robot actions through geometric retargeting further introduces cumulative errors from imperfect depth estimation and keypoint tracking. To address these challenges, we present EmboAlign, a data-free framework that aligns VGM outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution. Given a language instruction, EmboAlign uses a VLM to automatically extract a set of compositional constraints capturing task-specific requirements, which are then applied at two stages: (1) constraint-guided rollout selection, which scores and filters a batch of VGM rollouts to retain the most physically plausible candidate, and (2) constraint-based trajectory optimization, which uses the selected rollout as initialization and refines the robot trajectory under the same constraint set to correct retargeting errors. We evaluate EmboAlign on six real-robot manipulation tasks requiring precise, constraint-sensitive execution, improving the overall success rate by 43.3 percentage points over the strongest baseline without any task-specific training data.
Chinese Translation
在大规模互联网数据上预训练的视频生成模型(VGM)能够生成时间上连贯的推演视频,捕捉丰富的物体动态,为零样本机器人操作提供了一个引人注目的基础。然而,VGM常常生成物理上不合理的推演,并且通过几何重定向将其像素空间运动转换为机器人动作时,进一步引入了来自不完美深度估计和关键点跟踪的累积误差。为了解决这些挑战,我们提出了EmboAlign,一个无数据框架,在推理时将VGM输出与由视觉-语言模型(VLM)生成的组合约束对齐。关键的见解是,VLM提供了一种与VGM互补的能力:结构化的空间推理,可以识别对操作执行的成功和安全至关重要的物理约束。给定语言指令,EmboAlign使用VLM自动提取一组捕捉任务特定要求的组合约束,然后在两个阶段应用这些约束:(1)约束引导的推演选择,通过对一批VGM推演进行评分和过滤,以保留物理上最合理的候选;(2)基于约束的轨迹优化,使用选定的推演作为初始化,并在相同的约束集下优化机器人轨迹,以纠正重定向误差。我们在六个需要精确且对约束敏感执行的真实机器人操作任务上评估了EmboAlign,在没有任何任务特定训练数据的情况下,整体成功率比最强基线提高了43.3个百分点。
cs.RO / 17 / 2603.05767

Multi-Robot Trajectory Planning via Constrained Bayesian Optimization and Local Cost Map Learning with STL-Based Conflict Resolution

基于约束贝叶斯优化和局部成本地图学习的多机器人轨迹规划与基于信号时序逻辑(STL)的冲突解决
Raxit, Sourav, Newaz, Abdullah Al Redwan, Fuentes, Jose, Padrao, Paulo, Cavalcanti, Ana, Bobadilla, Leonardo
Abstract
We address multi-robot motion planning under Signal Temporal Logic (STL) specifications with kinodynamic constraints. Exact approaches face scalability bottlenecks and limited adaptability, while conventional sampling-based methods require excessive samples to construct optimal trajectories. We propose a two-stage framework integrating sampling-based online learning with formal STL reasoning. At the single-robot level, our constrained Bayesian Optimization-based Tree search (cBOT) planner uses a Gaussian process as a surrogate model to learn local cost maps and feasibility constraints, generating shorter collision-free trajectories with fewer samples. At the multi-robot level, our STL-enhanced Kinodynamic Conflict-Based Search (STL-KCBS) algorithm incorporates STL monitoring into conflict detection and resolution, ensuring specification satisfaction while maintaining scalability and probabilistic completeness. Benchmarking demonstrates improved trajectory efficiency and safety over existing methods. Real-world experiments with autonomous surface vehicles validate robustness and practical applicability in uncertain environments. The STLcBOT Planner will be released as an open-source package, and videos of real-world and simulated experiments are available at https://stlbot.github.io/.
Chinese Translation
我们针对具有运动动力学(kinodynamic)约束的信号时序逻辑(STL)规范下的多机器人运动规划问题进行了研究。精确方法面临可扩展性瓶颈且适应性有限,而传统的基于采样的方法需要过多的样本来构建最优轨迹。我们提出了一个两阶段框架,将基于采样的在线学习与形式化的STL推理相结合。在单机器人层面,我们的基于约束贝叶斯优化的树搜索(cBOT)规划器使用高斯过程作为代理模型来学习局部成本地图和可行性约束,从而以更少的样本生成更短的无碰撞轨迹。在多机器人层面,我们的STL增强的运动动力学基于冲突的搜索(STL-KCBS)算法将STL监控纳入冲突检测与解决中,在保持可扩展性和概率完备性的同时确保规范的满足。基准测试表明,相较于现有方法,轨迹效率和安全性均有所改善。在自主水面航行器上的真实实验验证了其在不确定环境中的鲁棒性和实际适用性。STLcBOT规划器将作为开源包发布,真实世界和仿真实验的视频可在 https://stlbot.github.io/ 获取。
cs.RO / 18 / 2603.05783

Task-Level Decisions to Gait Level Control: A Hierarchical Policy Approach for Quadruped Navigation

任务级决策到步态控制:一种用于四足导航的层次化策略方法
Li, Sijia, Wang, Haoyu, Yuan, Shenghai, Yang, Yizhuo, Nguyen, Thien-Minh
Abstract
Real-world quadruped navigation is constrained by a scale mismatch between high-level navigation decisions and low-level gait execution, as well as by instabilities under out-of-distribution environmental changes. Such variations challenge sim-to-real transfer and can trigger falls when policies lack explicit interfaces for adaptation. In this paper, we present a hierarchical policy architecture for quadrupedal navigation, termed Task-level Decision to Gait Control (TDGC). A low-level policy, trained with reinforcement learning in simulation, delivers gait-conditioned locomotion and maps task requirements to a compact set of controllable behavior parameters, enabling robust mode generation and smooth switching. A high-level policy makes task-centric decisions from sparse semantic or geometric terrain cues and translates them into low-level targets, forming a traceable decision pipeline without dense maps or high-resolution terrain reconstruction. Different from end-to-end approaches, our architecture provides explicit interfaces for deployment-time tuning, fault diagnosis, and policy refinement. We introduce a structured curriculum with performance-driven progression that expands environmental difficulty and disturbance ranges. Experiments show higher task success rates on mixed terrains and out-of-distribution tests.
Chinese Translation
现实世界中的四足导航受到高层导航决策与低层步态执行之间尺度不匹配的限制,以及在分布外环境变化下的不稳定性影响。这些变化对模拟到现实的转移构成挑战,并可能在策略缺乏明确适应接口时导致跌倒。在本文中,我们提出了一种用于四足导航的层次化策略架构,称为任务级决策到步态控制(Task-level Decision to Gait Control, TDGC)。一个低层策略通过在模拟环境中进行强化学习训练,提供步态条件的运动,并将任务需求映射到一组紧凑的可控行为参数,从而实现稳健的模式生成和平滑切换。高层策略从稀疏的语义或几何地形线索中做出以任务为中心的决策,并将其转化为低层目标,形成一个可追踪的决策流程,无需密集地图或高分辨率地形重建。与端到端方法不同,我们的架构提供了明确的接口,用于部署时的调优、故障诊断和策略优化。我们引入了一种结构化的课程,采用基于性能的进展,扩展环境的难度和干扰范围。实验表明,在混合地形和分布外测试中,任务成功率更高。
cs.RO / 19 / 2603.05804

CDF-Glove: A Cable-Driven Force Feedback Glove for Dexterous Teleoperation

CDF-Glove:一种用于灵巧遥操作的电缆驱动力反馈手套
Liang, Huayue, Li, Ruochong, Yang, Yaodong, Zeng, Long, Chen, Yuanpei, Wang, Xueqian
Abstract
High-quality teleoperated demonstrations are a primary bottleneck for imitation learning (IL) in dexterous manipulation. Haptic feedback provides operators with real-time contact information, enabling real-time finger posture adjustments and thereby improving demonstration quality; existing dexterous teleoperation platforms, however, typically omit haptic feedback and remain bulky and expensive. We introduce CDF-Glove, a lightweight and low-cost cable-driven force-feedback glove. The real-time state is available for 20 finger degrees of freedom (DoF), of which 16 are directly sensed and 4 are passively coupled (inferred from kinematic constraints). We develop a kinematic model and control stack for the glove, and validate them across multiple robotic hands with diverse kinematics and DoF. The CDF-Glove achieves distal joint repeatability of 0.4 degrees and delivers about 200 ms force-feedback latency, yielding a 4x improvement in task success rate relative to no-feedback teleoperation. We collect two bimanual teleoperation datasets, on which we train and evaluate Diffusion Policy baselines. Compared to kinesthetic teaching, the policies trained on our teleoperated demonstrations increase the average success rate by 55% and reduce the mean completion time by approximately 15.2 seconds (a 47.2% relative reduction). Notably, the CDF-Glove costs approximately US$230. The code and designs are released as open source at https://cdfglove.github.io/.
Chinese Translation
高质量的遥操作演示是灵巧操作中模仿学习(IL)的主要瓶颈。然而,触觉反馈为操作员提供实时接触信息,使得实时手指姿态调整成为可能,从而提高演示质量。现有的灵巧遥操作平台通常省略触觉反馈,并且体积庞大且昂贵。我们介绍了CDF-Glove,这是一种轻量且低成本的电缆驱动力反馈手套。该手套实时状态可用于20个手指自由度(DoF),其中16个是直接感知的,4个是被动耦合的(根据运动学约束推断)。我们为手套开发了运动学模型和控制系统,并在多种具有不同运动学和自由度的机器人手上进行了验证。CDF-Glove实现了0.4度的远端关节重复性,并提供约200毫秒的力反馈延迟,相较于无反馈遥操作,任务成功率提高了4倍。我们收集了两个双手遥操作数据集,并在其上训练和评估扩散策略基线。与动觉教学相比,在我们的遥操作演示中训练的策略将平均成功率提高了55%,并将平均完成时间减少了约15.2秒(相对减少47.2%)。特别是,CDF-Glove的成本约为230美元。代码和设计已作为开源发布,网址为https://cdfglove.github.io/。
cs.RO / 20 / 2603.05815

Hierarchical Latent Action Model

层次化潜在动作模型
Kim, Hanjung, Pinto, Lerrel, Kim, Seon Joo
Abstract
Latent Action Models (LAMs) enable learning from actionless data for applications ranging from robotic control to interactive world models. However, existing LAMs typically focus on short-horizon frame transitions and capture low-level motion while overlooking longer-term temporal structure. In contrast, actionless videos often contain temporally extended and high-level skills. We present HiLAM, a hierarchical latent action model that discovers latent skills by modeling long-term temporal information. To capture these dependencies across long horizons, we utilize a pretrained LAM as a low-level extractor. This architecture aggregates latent action sequences, which contain the underlying dynamic patterns of the video, into high-level latent skills. Our experiments demonstrate that HiLAM improves over the baseline and exhibits robust dynamic skill discovery.
Chinese Translation
潜在动作模型(Latent Action Models, LAMs)能够从无动作数据中学习,应用于从机器人控制到互动世界模型等多个领域。然而,现有的LAMs通常专注于短期帧转换,捕捉低级运动,而忽视了长期的时间结构。相比之下,无动作视频通常包含时间上延展且高层次的技能。我们提出了HiLAM,一种层次化潜在动作模型,通过建模长期时间信息来发现潜在技能。为了捕捉这些跨越长时间范围的依赖关系,我们利用一个预训练的LAM作为低级提取器。该架构将包含视频底层动态模式的潜在动作序列聚合为高层次的潜在技能。我们的实验表明,HiLAM优于基线方法,并展现出稳健的动态技能发现能力。
cs.RO / 21 / 2603.05830

OpenHEART: Opening Heterogeneous Articulated Objects with a Legged Manipulator

OpenHEART:使用腿式机械手打开异构关节物体
Lim, Seonghyeon, Lee, Hyeonwoo, Lee, Seunghyun, Nahrendra, I Made Aswin, Myung, Hyun
Abstract
Legged manipulators offer high mobility and versatile manipulation. However, robust interaction with heterogeneous articulated objects, such as doors, drawers, and cabinets, remains challenging because of the diverse articulation types of the objects and the complex dynamics of the legged robot. Existing reinforcement learning (RL)-based approaches often rely on high-dimensional sensory inputs, leading to sample inefficiency. In this paper, we propose a robust and sample-efficient framework for opening heterogeneous articulated objects with a legged manipulator. In particular, we propose Sampling-based Abstracted Feature Extraction (SAFE), which encodes handle and panel geometry into a compact low-dimensional representation, improving cross-domain generalization. Additionally, Articulation Information Estimator (ArtIEst) is introduced to adaptively mix proprioception with exteroception to estimate opening direction and range of motion for each object. The proposed framework was deployed to manipulate various heterogeneous articulated objects in simulation and real-world robot systems. Videos can be found on the project website: https://openheart-icra.github.io/OpenHEART/
Chinese Translation
腿式机械手具有高机动性和多功能操作能力。然而,由于物体的关节类型多样以及腿式机器人复杂的动态特性,与异构关节物体(如门、抽屉和橱柜)进行稳健交互仍然具有挑战性。现有的基于强化学习(RL)的方法通常依赖于高维感知输入,导致样本效率低下。本文提出了一种稳健且样本高效的框架,用于使用腿式机械手打开异构关节物体。特别地,我们提出了基于采样的抽象特征提取(SAFE),该方法将把手和面板几何形状编码为紧凑的低维表示,从而改善跨域泛化。此外,引入了关节信息估计器(ArtIEst),以自适应地混合本体感知与外部感知,估计每个物体的打开方向和运动范围。所提出的框架已在仿真和实际机器人系统中部署,以操作各种异构关节物体。项目网站上可以找到相关视频:https://openheart-icra.github.io/OpenHEART/
cs.RO / 22 / 2603.05837

Terrain characterization and locomotion adaptation in a small-scale lizard-inspired robot

小规模蜥蜴启发机器人中的地形特征化与运动适应
Andrews, Duncan, Zimmerman, Landon, Martin, Evan, DiGennaro, Joe, Chong, Baxi
Abstract
Unlike their large-scale counterparts, small-scale robots are largely confined to laboratory environments and are rarely deployed in real-world settings. As robot size decreases, robot-terrain interactions fundamentally change; however, there remains a lack of systematic understanding of what sensory information small-scale robots should acquire and how they should respond when traversing complex natural terrains. To address these challenges, we develop a Small-scale, Intelligent, Lizard-inspired, Adaptive Robot (SILA Bot) capable of adapting to diverse substrates. We use granular media of varying depths as a controlled yet representative terrain paradigm. We show that the optimal body movement pattern (ranging from standing-wave bending that assists limb retraction on flat ground to traveling-wave undulation that generates thrust in deep granular media) can be parameterized and approximated as a linear function of granular depth. Furthermore, proprioceptive signals, such as joint torque, provide sufficient information to estimate granular depth via a K-Nearest Neighbors classifier, achieving 95% accuracy. Leveraging these relationships, we design a simple linear feedback controller that modulates body phase and substantially improves locomotion performance on terrains with unknown depth. Together, these results establish a principled framework for perception and control in small-scale locomotion and enable effective terrain-adaptive locomotion while maintaining low computational complexity.
Chinese Translation
与大型机器人不同,小规模机器人主要局限于实验室环境,鲜少在真实世界中部署。随着机器人尺寸的减小,机器人与地形的交互发生了根本变化;然而,目前对小规模机器人应获取何种感知信息以及在穿越复杂自然地形时应如何响应仍缺乏系统性的理解。为了解决这些挑战,我们开发了一种小规模智能蜥蜴启发自适应机器人(SILA Bot),能够适应多种基质。我们使用不同深度的颗粒介质作为一种受控但具有代表性的地形范例。我们展示了最佳的身体运动模式(从在平坦地面上辅助肢体收缩的驻波弯曲到在深颗粒介质中产生推力的行波波动)可以被参数化并近似为颗粒深度的线性函数。此外,关节扭矩等本体感觉信号提供了足够的信息,通过K-近邻分类器估计颗粒深度,准确率达到95%。利用这些关系,我们设计了一种简单的线性反馈控制器,调节身体相位,显著提高了在未知深度地形上的运动性能。这些结果共同建立了一个小规模运动感知与控制的原则性框架,使得在保持低计算复杂度的同时实现有效的地形适应性运动。
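The proprioceptive depth-estimation step maps directly onto an off-the-shelf classifier. A hedged scikit-learn sketch with invented feature dimensions and placeholder data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.random((200, 8))        # joint-torque features per gait cycle (placeholder)
y_train = rng.integers(0, 4, 200)     # depth classes, e.g. flat ... deep (placeholder)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
depth_class = clf.predict(rng.random((1, 8)))   # estimate depth for a new gait cycle
```

Once depth is estimated, the abstract's finding that the optimal body-wave pattern is approximately linear in granular depth justifies the simple linear feedback controller on body phase.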
cs.RO / 23 / 2603.05842

Expert Knowledge-driven Reinforcement Learning for Autonomous Racing via Trajectory Guidance and Dynamics Constraints

基于专家知识的强化学习在自主赛车中的应用:轨迹引导与动态约束
Leng, Bo, Zhang, Weiqi, Li, Zhuoren, Xiong, Lu, Jin, Guizhe, Yu, Ran, Lv, Chen
Abstract
Reinforcement learning has demonstrated significant potential in the field of autonomous driving. However, it suffers from defects such as training instability and unsafe action outputs when faced with autonomous racing environments characterized by high dynamics and strong nonlinearities. To this end, this paper proposes a trajectory guidance and dynamics constraints Reinforcement Learning (TraD-RL) method for autonomous racing. The key features of this method are as follows: 1) leveraging the prior expert racing line to construct an augmented state representation and facilitate reward shaping, thereby integrating domain knowledge to stabilize early-stage policy learning; 2) embedding explicit vehicle dynamic priors into a safe operating envelope formulated via control barrier functions to enable safety-constrained learning; and 3) adopting a multi-stage curriculum learning strategy that shifts from expert-guided learning to autonomous exploration, allowing the learned policy to surpass expert-level performance. The proposed method is evaluated in a high-fidelity simulation environment modeled after the Tempelhof Airport Street Circuit. Experimental results demonstrate that TraD-RL effectively improves both lap speed and driving stability of the autonomous racing vehicle, achieving a synergistic optimization of racing performance and safety.
Chinese Translation
强化学习在自主驾驶领域展现出了显著的潜力。然而,在面对具有高动态性和强非线性的自主赛车环境时,它存在训练不稳定和不安全的动作输出等缺陷。为此,本文提出了一种轨迹引导与动态约束的强化学习方法(TraD-RL)用于自主赛车。该方法的关键特点如下:1)利用先验的专家赛车线构建增强状态表示并促进奖励塑造,从而整合领域知识以稳定早期策略学习;2)将明确的车辆动力学先验嵌入通过控制障碍函数构造的安全运行包络中,以实现安全约束学习;3)采用多阶段课程学习策略,从专家引导学习转向自主探索,使得学习到的策略超越专家水平的表现。所提方法在以Tempelhof机场街道赛道为模型的高保真仿真环境中进行了评估。实验结果表明,TraD-RL有效提高了自主赛车的圈速和驾驶稳定性,实现了赛车性能与安全性的协同优化。
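Feature (1), reward shaping around the expert racing line, might look like the following numpy sketch; the progress and deviation terms and their weights are illustrative assumptions, not the paper's reward design.

```python
import numpy as np

def shaped_reward(pos, vel, racing_line, w_prog=1.0, w_dev=0.5):
    """pos, vel: (2,) arrays; racing_line: (N, 2) array of expert line points."""
    i = np.argmin(np.linalg.norm(racing_line - pos, axis=1))  # nearest line point
    deviation = np.linalg.norm(racing_line[i] - pos)          # lateral-error proxy
    tangent = racing_line[(i + 1) % len(racing_line)] - racing_line[i]
    tangent /= np.linalg.norm(tangent) + 1e-8
    progress = float(vel @ tangent)                           # speed along the line
    return w_prog * progress - w_dev * deviation
```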
cs.RO / 24 / 2603.05861

DexEMG: Towards Dexterous Teleoperation System via EMG2Pose Generalization

DexEMG:通过 EMG2Pose 泛化实现灵巧遥操作系统
Zhao, Qianyou, Li, Wenqiao, Wang, Chiyu, Zhang, Kaifeng
Abstract
High-fidelity teleoperation of dexterous robotic hands is essential for bringing robots into unstructured domestic environments. However, existing teleoperation systems often face a trade-off between performance and portability: vision-based capture systems are constrained by costs and line-of-sight requirements, while mechanical exoskeletons are bulky and physically restrictive. In this paper, we present DexEMG, a lightweight and cost-effective teleoperation system leveraging surface electromyography (sEMG) to bridge the gap between human intent and robotic execution. We first collect a synchronized dataset of sEMG signals and hand poses via a MoCap glove to train EMG2Pose, a neural network capable of continuously predicting hand kinematics directly from muscle activity. To ensure seamless control, we develop a robust hand retargeting algorithm that maps the predicted poses onto a multi-fingered dexterous hand in real-time. Experimental results demonstrate that DexEMG achieves high precision in diverse teleoperation tasks. Notably, our system exhibits strong generalization capabilities across novel objects and complex environments without the need for intensive individual-specific recalibration. This work offers a scalable and intuitive interface for both general-purpose robotic manipulation and assistive technologies.
Chinese Translation
高保真度的灵巧机器人手遥操作对于将机器人引入非结构化的家庭环境至关重要。然而,现有的遥操作系统往往面临性能与便携性之间的权衡:基于视觉的捕捉系统受到成本和视线要求的限制,而机械外骨骼则笨重且在物理上具有限制性。本文提出了 DexEMG,一个轻量且具有成本效益的遥操作系统,利用表面肌电图(sEMG)来弥合人类意图与机器人执行之间的差距。我们首先通过运动捕捉手套收集同步的 sEMG 信号和手部姿态数据,以训练 EMG2Pose,一个能够直接从肌肉活动中连续预测手部运动学的神经网络。为了确保无缝控制,我们开发了一种强大的手部重定向算法,能够实时将预测的姿态映射到多指灵巧手上。实验结果表明,DexEMG 在多样化的遥操作任务中实现了高精度。值得注意的是,我们的系统在新颖物体和复杂环境中展现出强大的泛化能力,无需进行密集的个体特定重新校准。这项工作为通用机器人操作和辅助技术提供了可扩展且直观的接口。
cs.RO / 25 / 2603.05868

AnyCamVLA: Zero-Shot Camera Adaptation for Viewpoint Robust Vision-Language-Action Models

AnyCamVLA:面向视角鲁棒的视觉-语言-动作模型的零样本相机适应
Heo, Hyeongjun, Woo, Seungyeon, Kim, Sang Min, Kim, Junho, Lee, Junho, Lee, Yonghyeon, Kim, Young Min
Abstract
Despite remarkable progress in Vision-Language-Action models (VLAs) for robot manipulation, these large pre-trained models require fine-tuning to be deployed in specific environments. These fine-tuned models are highly sensitive to camera viewpoint changes that frequently occur in unstructured environments. In this paper, we propose a zero-shot camera adaptation framework without additional demonstration data, policy fine-tuning, or architectural modification. Our key idea is to virtually adjust test-time camera observations to match the training camera configuration in real-time. For that, we use a recent feed-forward novel view synthesis model which outputs high-quality target view images, handling both extrinsic and intrinsic parameters. This plug-and-play approach preserves the pre-trained capabilities of VLAs and applies to any RGB-based policy. Through extensive experiments on the LIBERO benchmark, our method consistently outperforms baselines that use data augmentation for policy fine-tuning or additional 3D-aware features for visual input. We further validate that our approach constantly enhances viewpoint robustness in real-world robotic manipulation scenarios, including settings with varying camera extrinsics, intrinsics, and freely moving handheld cameras.
Chinese Translation
尽管机器人操作的视觉-语言-动作模型(VLA)取得了显著进展,但这些大型预训练模型在特定环境中部署时仍需进行微调。这些微调后的模型对非结构化环境中频繁发生的相机视角变化非常敏感。本文提出了一种零样本相机适应框架,无需额外的演示数据、策略微调或架构修改。我们的关键思路是在测试时实时虚拟调整相机观察,以匹配训练时的相机配置。为此,我们使用了一种最新的前馈新视图合成模型,该模型输出高质量的目标视图图像,并同时处理相机外参和内参。这种即插即用的方法保留了VLA的预训练能力,并适用于任何基于RGB的策略。通过在LIBERO基准上的大量实验,我们的方法始终优于使用数据增强进行策略微调或为视觉输入添加3D感知特征的基线。我们进一步验证了我们的方法在现实世界的机器人操作场景中持续增强视角鲁棒性,包括相机外参、内参各不相同以及自由移动手持相机的设置。
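The plug-and-play idea reduces to a thin wrapper around the policy. A schematic Python sketch, in which every name is a stand-in: `render_novel_view` abstracts the feed-forward novel-view-synthesis model, and the camera objects are assumed to carry extrinsics and intrinsics.

```python
def adapt_observation(obs_rgb, test_cam, train_cam, render_novel_view):
    """Re-render the test-time image as if captured by the training camera."""
    return render_novel_view(obs_rgb, src_cam=test_cam, tgt_cam=train_cam)

def viewpoint_robust_step(policy, obs_rgb, test_cam, train_cam, render_novel_view):
    adapted = adapt_observation(obs_rgb, test_cam, train_cam, render_novel_view)
    return policy(adapted)   # the fine-tuned VLA never sees the shifted viewpoint
```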
cs.RO / 26 / 2603.05902

Improved hopping control on slopes for small robots using spring mass modeling

基于弹簧质量模型的小型机器人在斜坡上的改进跳跃控制
Roberts, Heston, Sarker, Pronoy, Islam, Sm Ashikul, Kim, Min Gyu
Abstract
Hopping robots often lose balance on slopes because the tilted ground creates unwanted rotation at landing. This work analyzes that effect using a simple spring-mass model and identifies how slope-induced impulses destabilize the robot. To address this, we introduce two straightforward fixes: adjusting the body's touchdown angle based on the slope, and applying a small corrective torque before takeoff. Together, these steps effectively cancel the unwanted rotation caused by inclined terrain, allowing the robot to land smoothly and maintain stable hopping even on steep slopes. Moreover, the proposed method remains simple enough to implement on low-cost robotic platforms without requiring complex sensing or computation. By combining this analytical model with minimal control actions, this approach provides a practical path toward reliable hopping on uneven terrain. The simulation results confirm that even small slope-aware adjustments can dramatically improve landing stability, making the technique suitable for future autonomous field robots that must navigate natural environments such as hills, rubble, and irregular outdoor landscapes.
Chinese Translation
跳跃机器人在斜坡上常常失去平衡,因为倾斜的地面在着陆时会产生不必要的旋转。本研究使用简单的弹簧质量模型分析了这一效应,并识别了斜坡引起的冲击如何使机器人失稳。为了解决这个问题,我们提出了两个简单的修正措施:根据斜坡调整机器人的着陆角度,并在起跳前施加小的修正扭矩。这些步骤有效地消除了由倾斜地形引起的不必要旋转,使机器人能够平稳着陆并在陡坡上保持稳定的跳跃。此外,所提出的方法足够简单,能够在低成本的机器人平台上实施,而无需复杂的传感或计算。通过将这一分析模型与最小控制动作相结合,该方法为在不平坦地形上实现可靠跳跃提供了一条实用的路径。模拟结果确认,即使是小的斜坡感知调整也能显著提高着陆稳定性,使该技术适用于未来必须在自然环境中导航的自主场地机器人,例如丘陵、碎石和不规则的户外景观。
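The two fixes can be written as one-line control laws. A minimal sketch under the stated spring-mass assumptions, with illustrative gains:

```python
def slope_aware_touchdown(alpha_nominal, slope_angle, k_td=1.0):
    """Tilt the leg touchdown angle with the local slope to counteract
    the slope-induced landing rotation."""
    return alpha_nominal + k_td * slope_angle

def pre_takeoff_torque(body_pitch_rate, k_tau=0.2):
    """Small corrective body torque applied just before takeoff."""
    return -k_tau * body_pitch_rate
```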
cs.RO / 27 / 2603.05916

Iterative Convex Optimization with Control Barrier Functions for Obstacle Avoidance among Polytopes

基于控制障碍函数的迭代凸优化在多面体障碍物避让中的应用
Liu, Shuo, Huang, Zhe, Belta, Calin A.
Abstract
Obstacle avoidance of polytopic obstacles by polytopic robots is a challenging problem in optimization-based control and trajectory planning. Many existing methods rely on smooth geometric approximations, such as hyperspheres or ellipsoids, which allow differentiable distance expressions but distort the true geometry and restrict the feasible set. Other approaches integrate exact polytope distances into nonlinear model predictive control (MPC), resulting in nonconvex programs that limit real-time performance. In this paper, we construct linear discrete-time control barrier function (DCBF) constraints by deriving supporting hyperplanes from exact closest-point computations between convex polytopes. We then propose a novel iterative convex MPC-DCBF framework, where local linearization of system dynamics and robot geometry ensures convexity of the finite-horizon optimization at each iteration. The resulting formulation reduces computational complexity and enables fast online implementation for safety-critical control and trajectory planning of general nonlinear dynamics. The framework extends to multi-robot and three-dimensional environments. Numerical experiments demonstrate collision-free navigation in cluttered maze scenarios with millisecond-level solve times.
Chinese Translation
多面体机器人在多面体障碍物中的避障问题是基于优化的控制和轨迹规划中的一个挑战性问题。许多现有方法依赖于光滑的几何近似,例如超球体或椭球体,这些方法允许可微分的距离表达,但扭曲了真实几何形状并限制了可行集。其他方法将精确的多面体距离集成到非线性模型预测控制(MPC)中,导致非凸程序,从而限制了实时性能。本文通过从凸多面体之间的精确最近点计算中推导支持超平面,构建线性离散时间控制障碍函数(DCBF)约束。然后,我们提出了一种新颖的迭代凸MPC-DCBF框架,其中系统动态和机器人几何的局部线性化确保了每次迭代的有限时间优化的凸性。所得到的公式减少了计算复杂性,并使得安全关键控制和一般非线性动态的轨迹规划能够快速在线实现。该框架扩展到多机器人和三维环境。数值实验表明,在拥挤的迷宫场景中实现了无碰撞导航,求解时间达到毫秒级。
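For reference, the standard discrete-time CBF condition that such linear constraints instantiate is the following generic form (the paper additionally derives $h$ from supporting hyperplanes between the robot and obstacle polytopes):

```latex
h(x_{k+1}) \;\ge\; (1-\gamma)\, h(x_k), \qquad 0 < \gamma \le 1
```

With the dynamics and robot geometry locally linearized, $h$ becomes affine in the decision variables, so the inequality stays linear and each MPC iteration remains convex.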
cs.RO / 28 / 2603.05935

Swooper: Learning High-Speed Aerial Grasping With a Simple Gripper

Swooper:使用简单夹持器学习高速空中抓取
Huang, Ziken, Niu, Xinze, Chai, Bowen, Jin, Renbiao, Zou, Danping
Abstract
High-speed aerial grasping presents significant challenges due to the high demands on precise, responsive flight control and coordinated gripper manipulation. In this work, we propose Swooper, a deep reinforcement learning (DRL) based approach that achieves both precise flight control and active gripper control using a single lightweight neural network policy. Training such a policy directly via DRL is nontrivial due to the complexity of coordinating flight and grasping. To address this, we adopt a two-stage learning strategy: we first pre-train a flight control policy, and then fine-tune it to acquire grasping skills. With the carefully designed reward functions and training framework, the entire training process completes in under 60 minutes on a standard desktop with an Nvidia RTX 3060 GPU. To validate the trained policy in the real world, we develop a lightweight quadrotor grasping platform equipped with a simple off-the-shelf gripper, and deploy the policy in a zero-shot manner on the onboard Raspberry Pi 4B computer, where each inference takes only about 1.0 ms. In 25 real-world trials, our policy achieves an 84% grasp success rate and grasping speeds of up to 1.5 m/s without any fine-tuning. This matches the robustness and agility of state-of-the-art classical systems with sophisticated grippers, highlighting the capability of DRL for learning a robust control policy that seamlessly integrates high-speed flight and grasping. See the supplementary video for more results: https://zikenhuang.github.io/Swooper/.
Chinese Translation
高速空中抓取面临重大挑战,因为它对精确、响应迅速的飞行控制和协调的夹持器操作提出了很高的要求。在本研究中,我们提出了Swooper,这是一种基于深度强化学习(DRL)的方法,能够通过单个轻量级神经网络策略实现精确的飞行控制和主动的夹持器控制。通过DRL直接训练这样的策略并非易事,因为协调飞行和抓取的复杂性。为了解决这个问题,我们采用了两阶段学习策略:首先预训练飞行控制策略,然后对其进行微调以获得抓取技能。通过精心设计的奖励函数和训练框架,整个训练过程在配备Nvidia RTX 3060 GPU的标准桌面上完成,耗时不到60分钟。为了在现实世界中验证训练好的策略,我们开发了一个轻量级四旋翼抓取平台,配备了简单的现成夹持器,并在机载Raspberry Pi 4B计算机上以零样本方式部署该策略,每次推理仅需约1.0毫秒。在25次现实世界试验中,我们的策略实现了84%的抓取成功率和高达1.5米/秒的抓取速度,无需任何微调。这与配备复杂夹持器的最先进经典系统的鲁棒性和灵活性相匹配,突显了DRL在学习能够无缝集成高速飞行和抓取的鲁棒控制策略方面的能力。补充视频可供查看更多结果。视频链接:https://zikenhuang.github.io/Swooper/
cs.RO / 29 / 2603.05944

How to Model Your Crazyflie Brushless

如何为你的 Crazyflie Brushless 建模
Gräfe, Alexander, Scherer, Christoph, Hönig, Wolfgang, Trimpe, Sebastian
Abstract
The Crazyflie quadcopter is widely recognized as a leading platform for nano-quadcopter research. In early 2025, the Crazyflie Brushless was introduced, featuring brushless motors that provide around 50% more thrust compared to the brushed motors of its predecessor, the Crazyflie 2.1. This advancement has opened new opportunities for research in agile nano-quadcopter control. To support researchers utilizing this new platform, this work presents a dynamics model of the Crazyflie Brushless and identifies its key parameters. Through simulations and hardware analyses, we assess the accuracy of our model. We furthermore demonstrate its suitability for reinforcement learning applications by training an end-to-end neural network position controller and learning a backflip controller capable of executing two complete rotations with a vertical movement of just 1.8 meters. This showcases the model's ability to facilitate the learning of controllers and acrobatic maneuvers that successfully transfer from simulation to hardware. Utilizing this application, we investigate the impact of domain randomization on control performance, offering valuable insights into bridging the sim-to-real gap with the presented model. We have open-sourced the entire project, enabling users of the Crazyflie Brushless to swiftly implement and test their own controllers on an accurate simulation platform.
Chinese Translation
Crazyflie 四旋翼无人机被广泛认为是纳米四旋翼研究的领先平台。2025年初,Crazyflie Brushless 问世,其搭载的无刷电机相比前代 Crazyflie 2.1 的有刷电机提供了约50%的推力提升。这一进展为敏捷纳米四旋翼控制研究开辟了新的机会。为了支持利用这一新平台的研究人员,本研究提出了 Crazyflie Brushless 的动力学模型,并辨识了其关键参数。通过仿真和硬件分析,我们评估了模型的准确性。此外,我们通过训练一个端到端的神经网络位置控制器,以及学习一个能够在仅1.8米的垂直位移内完成两次完整旋转的后空翻控制器,展示了该模型在强化学习应用中的适用性。这表明该模型能够支持控制器和特技动作的学习,并成功实现从仿真到硬件的迁移。基于这一应用,我们研究了领域随机化对控制性能的影响,为利用所提模型缩小仿真与现实之间的差距提供了有价值的见解。我们已将整个项目开源,使 Crazyflie Brushless 的用户能够快速在准确的仿真平台上实现和测试自己的控制器。
cs.RO / 30 / 2603.05965

PROBE: Probabilistic Occupancy BEV Encoding with Analytical Translation Robustness for 3D Place Recognition

PROBE:具有分析性平移鲁棒性的概率占用BEV编码用于3D地点识别
Lee, Jinseop, Lee, Byoungho, Yoo, Gichul
Abstract
We present PROBE (PRobabilistic Occupancy BEV Encoding), a learning-free LiDAR place recognition descriptor that models each BEV cell's occupancy as a Bernoulli random variable. Rather than relying on discrete point-cloud perturbations, PROBE analytically marginalizes over continuous Cartesian translations via the polar Jacobian, yielding a distance-adaptive angular uncertainty $\sigma_\theta = \sigma_t / r$ in $\mathcal{O}(R \times S)$ time. The primary parameter $\sigma_t$ represents the expected translational uncertainty in meters, a sensor-independent physical quantity allowing cross-sensor generalization without per-dataset tuning. Pairwise similarity combines a Bernoulli-KL Jaccard with exponential uncertainty gating and FFT-based height cosine similarity for rotation alignment. Evaluated on four datasets spanning four diverse LiDAR types, PROBE achieves the highest accuracy among handcrafted descriptors in multi-session evaluation and competitive single-session performance against both handcrafted and supervised baselines. The source code and supplementary materials are available at https://sites.google.com/view/probe-pr.
Chinese Translation
我们提出了PROBE(概率占用BEV编码),这是一种无学习的激光雷达地点识别描述符,它将每个BEV单元的占用建模为伯努利随机变量。PROBE并不依赖于离散点云扰动,而是通过极坐标雅可比(polar Jacobian)对连续笛卡尔平移进行解析边际化,从而在$\mathcal{O}(R \times S)$时间内产生距离自适应的角度不确定性$\sigma_\theta = \sigma_t / r$。主要参数$\sigma_t$表示以米为单位的期望平移不确定性,这是一个与传感器无关的物理量,允许在不同传感器之间进行泛化,而无需针对每个数据集进行调优。成对相似性结合了伯努利-KL雅卡尔(Bernoulli-KL Jaccard)与指数不确定性门控以及基于快速傅里叶变换(FFT)的高度余弦相似性以实现旋转对齐。在涵盖四种不同激光雷达类型的四个数据集上进行评估,PROBE在多会话评估中在手工设计描述符中实现了最高的准确性,并在单会话性能上与手工设计和监督基线相比具有竞争力。源代码和补充材料可在https://sites.google.com/view/probe-pr获取。
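The distance-adaptive rule $\sigma_\theta = \sigma_t / r$ is concrete enough to sketch. Below is a minimal numpy illustration of the idea: blur each polar ring of an occupancy grid by an angular Gaussian whose width shrinks with range, then compare two grids with a Bernoulli KL term. The grid layout, the smoothing-by-convolution scheme, and all names are illustrative assumptions; the paper derives a closed-form $\mathcal{O}(R \times S)$ marginalization rather than this FFT-based stand-in.

```python
import numpy as np

# Minimal sketch of PROBE-style distance-adaptive angular uncertainty:
# a translational uncertainty sigma_t (metres) maps to a per-ring angular
# std sigma_theta = sigma_t / r, so near rings get blurred more in angle
# than far ones. Grid shape and smoothing scheme are illustrative only.

def smooth_occupancy(p_occ, sigma_t, ring_radii, sector_width):
    """Blur each ring of a polar occupancy grid (rings x sectors) along the
    angular axis with a circular Gaussian of std sigma_t / r (radians)."""
    R, S = p_occ.shape
    out = np.empty_like(p_occ)
    angles = (np.arange(S) - S // 2) * sector_width
    for i in range(R):
        sigma_theta = sigma_t / max(ring_radii[i], 1e-6)   # the paper's rule
        kernel = np.exp(-0.5 * (angles / sigma_theta) ** 2)
        kernel /= kernel.sum()
        # circular convolution via FFT: sectors wrap around 2*pi
        out[i] = np.real(np.fft.ifft(np.fft.fft(p_occ[i]) *
                                     np.fft.fft(np.fft.ifftshift(kernel))))
    return np.clip(out, 1e-6, 1 - 1e-6)

def bernoulli_kl(p, q):
    """Elementwise KL( Bernoulli(p) || Bernoulli(q) )."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# toy usage: compare two noisy (rings x sectors) occupancy grids
rng = np.random.default_rng(0)
radii = np.linspace(1.0, 50.0, 32)                 # ring centres in metres
a = np.clip(rng.random((32, 360)), 1e-3, 1 - 1e-3)
b = np.clip(a + 0.05 * rng.standard_normal(a.shape), 1e-3, 1 - 1e-3)
sw = np.deg2rad(1.0)                               # 1-degree sectors
ka = smooth_occupancy(a, 2.0, radii, sw)
kb = smooth_occupancy(b, 2.0, radii, sw)
print("mean Bernoulli-KL:", float(bernoulli_kl(ka, kb).mean()))
```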
cs.RO / 31 / 2603.05976

Proprioceptive Shape Estimation of Tensegrity Manipulators Using Energy Minimisation

基于能量最小化的张拉整体(tensegrity)操控器本体感知形状估计
Bhat, Tufail Ahmad, Ikemoto, Shuhei
Abstract
Shape estimation is fundamental for controlling continuously bending tensegrity manipulators, yet achieving it remains a challenge. Although using exteroceptive sensors makes the implementation straightforward, it is costly and limited to specific environments. Proprioceptive approaches, by contrast, do not suffer from these limitations. Several methods have been proposed so far; however, to our knowledge, there are no proven examples on large-scale tensegrity structures used as manipulators. This paper demonstrates that shape estimation of the entire tensegrity manipulator can be achieved using only the inclination angle of each strut relative to gravity. Inclination angle information is intrinsic sensory data that can be obtained simply by attaching an inertial measurement unit (IMU) to each strut. Experiments conducted on a five-layer tensegrity manipulator with 20 struts and a total length of 1160 mm demonstrate that the proposed method can estimate the shape from arbitrary initial conditions with an accuracy of 2.1% of the total manipulator length under static conditions, and maintains stable shape estimation under external disturbances.
Chinese Translation
形状估计对于控制连续弯曲的张拉整体操控器至关重要,但实现这一目标仍然面临挑战。尽管使用外部感知传感器使得实现过程相对简单,但其成本高昂且仅限于特定环境。相比之下,本体感知方法则不受这些限制。迄今为止,已经提出了几种方法;然而,据我们所知,目前尚无在大规模张拉整体结构操控器上得到验证的实例。本文展示了仅通过获取每根支杆相对于重力的倾斜角信息即可实现整个张拉整体操控器的形状估计。倾斜角信息是内在的感知数据,可以通过将惯性测量单元(IMU)简单地附加到每根支杆上获得。在一台具有20根支杆、总长度为1160毫米的五层张拉整体操控器上进行的实验表明,所提出的方法能够在任意初始条件和静态条件下,以操控器总长度2.1%的精度估计形状,并在外部干扰下保持稳定的形状估计。
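The intrinsic signal the method relies on, each strut's inclination relative to gravity from an onboard IMU, is easy to illustrate. A minimal sketch follows, assuming quasi-static conditions (the accelerometer reading is dominated by gravity) and a strut axis fixed in the sensor frame; frame conventions here are assumptions, not the paper's.

```python
import numpy as np

# Minimal sketch: strut inclination relative to gravity from a quasi-static
# accelerometer reading. At rest the accelerometer measures the specific
# force -g, which points "up" when expressed in the sensor frame, so the
# tilt of the strut axis away from vertical is a normalized dot product.

def inclination_from_accel(accel_body, strut_axis_body=(0.0, 0.0, 1.0)):
    """Tilt angle (rad) of the strut axis away from vertical, given a
    quasi-static accelerometer reading in the strut (sensor) frame."""
    up = np.asarray(accel_body, float)
    up = up / np.linalg.norm(up)                 # unit "up" in sensor frame
    axis = np.asarray(strut_axis_body, float)
    axis = axis / np.linalg.norm(axis)
    return np.arccos(np.clip(np.dot(axis, up), -1.0, 1.0))

# strut tilted 30 degrees from vertical; reading magnitude ~9.81 m/s^2
accel = np.array([0.0, 9.81 * np.sin(np.deg2rad(30)), 9.81 * np.cos(np.deg2rad(30))])
print(np.rad2deg(inclination_from_accel(accel)))  # ~30.0
```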
cs.RO / 32 / 2603.05982

HarvestFlex: Strawberry Harvesting via Vision-Language-Action Policy Adaptation in the Wild

HarvestFlex:通过视觉-语言-动作策略适应实现野外草莓采摘
Zhao, Ziyang, Wang, Shuheng, Miao, Zhonghua, Xiong, Ya
Abstract
This work presents the first study on transferring vision-language-action (VLA) policies to real greenhouse tabletop strawberry harvesting, a long-horizon, unstructured task challenged by occlusion and specular reflections. We built an end-to-end closed-loop system on the HarvestFlex platform using three-view RGB sensing (two fixed scene views plus a wrist-mounted view) and intentionally avoided depth clouds and explicit geometric calibration. We collected 3.71 h of VR-teleoperated demonstrations (227 episodes) and fine-tuned pi_0, pi_0.5, and WALL-OSS with both full fine-tuning and LoRA. Under a unified 50-trial real-greenhouse protocol with metrics spanning completion, speed, and damage, pi_0.5 with full fine-tuning achieved a success rate of 74.0% at 32.6 s per pick with a damage rate of 4.1%. Asynchronous inference-control decoupling further improved performance over synchronous deployment. Results showed non-trivial closed-loop picking with fewer than four hours of real data, while remaining limited by close-range observability loss and contact-dynamics mismatch. A demonstration video is available at: https://youtu.be/bN8ZowZKPMI.
Chinese Translation
本研究首次探讨了将视觉-语言-动作(VLA)策略迁移到真实温室桌面草莓采摘的应用,这是一项长时程、非结构化的任务,面临遮挡和镜面反射的挑战。我们在HarvestFlex平台上构建了一个端到端的闭环系统,使用三视图RGB传感(两个固定场景视图加一个手腕安装视图),并有意避免使用深度点云和显式几何标定。我们收集了3.71小时的VR遥操作演示(227个回合),并以全量微调和LoRA两种方式对pi_0、pi_0.5和WALL-OSS进行了微调。在统一的50次真实温室试验协议下,采用涵盖完成度、速度和损坏率的指标,经过全量微调的pi_0.5实现了74.0%的成功率、32.6秒/次的采摘时间和4.1%的损坏率。相较于同步部署,异步的推理-控制解耦进一步提升了性能。结果表明,在不到四小时的真实数据下实现了非平凡的闭环采摘,但仍受到近距离可观测性丧失和接触动力学不匹配的限制。演示视频可在以下链接观看:https://youtu.be/bN8ZowZKPMI。
cs.RO / 33 / 2603.05992

MagRobot: An Open Simulator for Magnetically Navigated Robots

MagRobot:一种开放的磁导航机器人模拟器
Wang, Heng, Song, Haoyu, Zheng, Jiatao, Han, Yuxiang, Wang, Kunli
Abstract
Magnetic navigation systems, including magnetic tracking systems and magnetic actuation systems, have shown great potential for occlusion-free localization and remote control of intracorporeal medical devices and robots in minimally invasive medicine, such as capsule endoscopy and cardiovascular intervention. However, the design of magnetically navigated robots remains heavily reliant on experimental prototyping, which is time-consuming and costly. Furthermore, there is a lack of a consistent experimental environment to compare and benchmark the hardware and algorithms across different magnetic navigation systems. To address these challenges, we propose the first universal open-source simulation platform to facilitate research, design and benchmarking of magnetically navigated robots. Our simulator features an intuitive graphical user interface that enables the user to efficiently design, visualize, and analyze magnetic navigation systems for both rigid and soft robots. The proposed simulator is versatile, which can simulate both magnetic actuation and magnetic tracking tasks in diverse medical applications that involve deformable anatomies. The proposed simulator provides an open development environment, where the user can load third-party anatomical models and customize both hardware and algorithms of magnetic navigation systems. The fidelity of the simulator is validated using both phantom and ex vivo experiments of magnetic navigation of a continuum robot and a capsule robot with diverse magnetic actuation setups. Three use cases of the simulator, i.e., bronchoscopy, endovascular intervention, and gastrointestinal endoscopy, are implemented to demonstrate the functionality of the simulator. It is shown that the configuration and algorithms of magnetic navigation systems can be flexibly designed and optimized for better performance using the simulator.
Chinese Translation
磁导航系统,包括磁跟踪系统和磁驱动系统,在微创医学(如胶囊内窥镜和心血管介入)中对体内医疗设备及机器人的无遮挡定位和远程控制方面展现了巨大的潜力。然而,磁导航机器人的设计仍然严重依赖于实验原型,这既耗时又昂贵。此外,缺乏一致的实验环境来比较和基准测试不同磁导航系统的硬件和算法。为了解决这些挑战,我们提出了第一个通用的开源模拟平台,以促进磁导航机器人的研究、设计和基准测试。我们的模拟器具有直观的图形用户界面,使用户能够高效地设计、可视化和分析刚性和柔性机器人的磁导航系统。所提出的模拟器具有多功能性,可以模拟涉及可变形解剖结构的多种医疗应用中的磁驱动和磁跟踪任务。该模拟器提供了一个开放的开发环境,用户可以加载第三方解剖模型,并自定义磁导航系统的硬件和算法。通过对连续体机器人和胶囊机器人在多种磁驱动配置下进行体模(phantom)和离体(ex vivo)实验,验证了模拟器的保真度。我们实现了三个应用案例,即支气管镜检查、血管内介入和胃肠内窥镜检查,以展示模拟器的功能。结果表明,利用该模拟器可以灵活地设计和优化磁导航系统的配置与算法,以获得更好的性能。
cs.RO / 34 / 2603.05993

Moving Through Clutter: Scaling Data Collection and Benchmarking for 3D Scene-Aware Humanoid Locomotion via Virtual Reality

穿越杂物:通过虚拟现实扩展3D场景感知类人运动的数据收集与基准测试
Wang, Beichen, Lu, Yuanjie, Wang, Linji, Yu, Liuchuan, Xiao, Xuesu
Abstract
Recent advances in humanoid locomotion have enabled dynamic behaviors such as dancing, martial arts, and parkour, yet these capabilities are predominantly demonstrated in open, flat, and obstacle-free settings. In contrast, real-world environments such as homes, offices, and public spaces are densely cluttered, three-dimensional, and geometrically constrained, requiring scene-aware whole-body coordination, precise balance control, and reasoning over spatial constraints imposed by furniture and household objects. However, humanoid locomotion in cluttered 3D environments remains underexplored, and no public dataset systematically couples full-body human locomotion with the scene geometry that shapes it. To address this gap, we present Moving Through Clutter (MTC), an open-source Virtual Reality (VR) based data collection and evaluation framework for scene-aware humanoid locomotion in cluttered environments. Our system procedurally generates scenes with controllable clutter levels and captures embodiment-consistent, whole-body human motion through immersive VR navigation, which is then automatically retargeted to a humanoid robot model. We further introduce benchmarks that quantify environment clutter level and locomotion performance, including stability and collision safety. Using this framework, we compile a dataset of 348 trajectories across 145 diverse 3D cluttered scenes. The dataset provides a foundation for studying geometry-induced adaptation in humanoid locomotion and developing scene-aware planning and control methods.
Chinese Translation
近年来,类人运动的进展使得舞蹈、武术和跑酷等动态行为成为可能,但这些能力主要在开放、平坦且无障碍的环境中展示。相比之下,现实世界的环境如家庭、办公室和公共空间则是密集杂乱、三维且具有几何约束的,要求场景感知的全身协调、精确的平衡控制以及对家具和家居物品所施加的空间约束的推理。然而,在杂乱的3D环境中进行类人运动仍然未被充分探索,且没有公共数据集系统地将全身人类运动与塑造其运动的场景几何结合起来。为了解决这一空白,我们提出了"穿越杂物"(Moving Through Clutter, MTC),这是一个基于虚拟现实(Virtual Reality, VR)的开源数据收集与评估框架,旨在支持杂乱环境中的场景感知类人运动。我们的系统程序化地生成具有可控杂乱程度的场景,并通过沉浸式VR导航捕捉与具身一致(embodiment-consistent)的全身人类运动,这些运动随后被自动重定向到类人机器人模型。我们还引入了量化环境杂乱程度和运动表现的基准,包括稳定性和碰撞安全性。利用这一框架,我们编制了一个包含348条轨迹、涵盖145个多样3D杂乱场景的数据集。该数据集为研究几何引起的类人运动适应性以及开发场景感知的规划和控制方法提供了基础。
cs.RO / 35 / 2603.05995

TADPO: Reinforcement Learning Goes Off-road

TADPO:强化学习走向越野
Wu, Zhouchonghao, Song, Raymond, Mundheda, Vedant, Navarro-Serment, Luis E., Schoenborn, Christof, Schneider, Jeff
Abstract
Off-road autonomous driving poses significant challenges such as navigating unmapped, variable terrain with uncertain and diverse dynamics. Addressing these challenges requires effective long-horizon planning and adaptable control. Reinforcement Learning (RL) offers a promising solution by learning control policies directly from interaction. However, because off-road driving is a long-horizon task with low-signal rewards, standard RL methods are challenging to apply in this setting. We introduce TADPO, a novel policy gradient formulation that extends Proximal Policy Optimization (PPO), leveraging off-policy trajectories for teacher guidance and on-policy trajectories for student exploration. Building on this, we develop a vision-based, end-to-end RL system for high-speed off-road driving, capable of navigating extreme slopes and obstacle-rich terrain. We demonstrate our performance in simulation and, importantly, zero-shot sim-to-real transfer on a full-scale off-road vehicle. To our knowledge, this work represents the first deployment of RL-based policies on a full-scale off-road platform.
Chinese Translation
越野自动驾驶面临着重大挑战,例如在无地图、变化多端且动力学不确定、多样的地形中导航。解决这些挑战需要有效的长时程规划和适应性控制。强化学习(Reinforcement Learning, RL)通过直接从交互中学习控制策略,提供了一种有前景的解决方案。然而,由于越野驾驶是一项长时程任务且奖励信号微弱,标准的RL方法在这一环境中难以应用。我们提出了TADPO,这是一种新颖的策略梯度公式,扩展了近端策略优化(Proximal Policy Optimization, PPO),利用离策略轨迹进行教师指导,并利用在策略轨迹进行学生探索。在此基础上,我们开发了一个基于视觉的端到端RL系统,用于高速越野驾驶,能够在极端坡度和障碍物丰富的地形中导航。我们在仿真环境中展示了性能,并且重要的是,在全尺寸越野车辆上实现了零样本(zero-shot)的仿真到现实迁移。据我们所知,这项工作代表了基于RL的策略在全尺寸越野平台上的首次部署。
cs.RO / 36 / 2603.06001

Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration

通过无训练注意力重校准恢复VLA模型中的语言接地
Zhang, Ninghao, Zhu, Bin, Zhou, Shijie, Chen, Jingjing
Abstract
Vision-Language-Action (VLA) models enable robots to perform manipulation tasks directly from natural language instructions and are increasingly viewed as a foundation for generalist robotic policies. However, their reliability under Out-of-Distribution (OOD) instructions remains underexplored. In this paper, we reveal a critical failure mode in which VLA policies continue executing visually plausible actions even when the language instruction contradicts the scene. We refer to this phenomenon as linguistic blindness, where VLA policies prioritize visual priors over instruction semantics during action generation. To systematically analyze this issue, we introduce ICBench, a diagnostic benchmark constructed from the LIBERO dataset that probes language-action coupling by injecting controlled OOD instruction contradictions while keeping the visual environment unchanged. Evaluations on three representative VLA architectures, including Pi0, Pi0.5 and OpenVLA OFT, show that these models frequently succeed at tasks despite logically impossible instructions, revealing a strong visual bias in action generation. To mitigate this issue, we propose Instruction-Guided Attention Recalibration (IGAR), a train-free inference-time mechanism that rebalances attention distributions to restore the influence of language instructions. IGAR operates without retraining or architectural modification and can be directly applied to existing VLA models. Experiments across 30 LIBERO tasks demonstrate that IGAR substantially reduces erroneous execution under OOD contradictory instructions while preserving baseline task performance. We additionally validate the approach on a real Franka robotic arm, where IGAR effectively prevents manipulation triggered by inconsistent instructions.
Chinese Translation
视觉-语言-动作(VLA)模型使机器人能够直接根据自然语言指令执行操作任务,并越来越被视为通用机器人策略的基础。然而,它们在面对分布外(OOD)指令时的可靠性仍然未得到充分探索。在本文中,我们揭示了一种关键的失败模式,即当语言指令与场景相矛盾时,VLA策略仍然继续执行视觉上合理的动作。我们将这一现象称为语言盲目性,在此情况下,VLA策略在动作生成过程中优先考虑视觉先验而非指令语义。为了系统地分析这个问题,我们引入了ICBench,这是一个基于LIBERO数据集构建的诊断基准,通过注入受控的OOD指令矛盾,同时保持视觉环境不变,来探测语言与动作的耦合。对三种代表性的VLA架构(包括Pi0、Pi0.5和OpenVLA OFT)的评估显示,这些模型在面对逻辑上不可能的指令时,仍然经常成功完成任务,揭示了动作生成中的强视觉偏见。为了解决这个问题,我们提出了指令引导的注意力重校准(IGAR),这是一种无训练的推理时机制,能够重新平衡注意力分布,以恢复语言指令的影响。IGAR在不进行再训练或架构修改的情况下运行,并可以直接应用于现有的VLA模型。在30个LIBERO任务上的实验表明,IGAR显著减少了在OOD矛盾指令下的错误执行,同时保持了基线任务性能。此外,我们还在真实的Franka机器人手臂上验证了该方法,IGAR有效地防止了因不一致指令而触发的操作。
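The recalibration idea, rebalancing attention mass toward instruction tokens at inference time with no retraining, can be sketched in a few lines. The multiplicative boost and the renormalization below are an illustrative guess at such a mechanism, not IGAR's published rule.

```python
import numpy as np

# Minimal sketch of instruction-guided attention recalibration: upweight
# attention on language-instruction tokens at inference time, then
# renormalize. The boost factor `gamma` and the token partition are
# illustrative assumptions.

def recalibrate_attention(attn, text_mask, gamma=2.0):
    """attn: (heads, queries, keys) softmax attention weights.
    text_mask: (keys,) boolean, True for instruction tokens.
    Returns reweighted attention that still sums to 1 over keys."""
    boost = np.where(text_mask, gamma, 1.0)       # (keys,)
    w = attn * boost                               # scale language columns
    return w / w.sum(axis=-1, keepdims=True)       # renormalize per query

# toy usage: 4 heads, 2 queries, 10 keys of which the last 3 are text tokens
rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 2, 10))
attn = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
mask = np.zeros(10, dtype=bool)
mask[-3:] = True
out = recalibrate_attention(attn, mask, gamma=2.0)
print(out.sum(-1))                                       # rows still sum to 1
print(attn[..., mask].sum(-1).mean(), out[..., mask].sum(-1).mean())
```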
cs.RO / 37 / 2603.06058

RODEO: RObotic DEcentralized Organization

RODEO:机器人去中心化组织
Groshev, Milan, Ferrer, Eduardo Castelló
Abstract
Robots are improving their autonomy with minimal human supervision. However, auditable actions, transparent decision processes, and new human-robot interaction models are still unmet requirements for achieving extended robot autonomy. To tackle these challenges, we propose RODEO (RObotic DEcentralized Organization), a blockchain-based framework that integrates trust and accountability mechanisms for robots. This paper formalizes Decentralized Autonomous Organizations (DAOs) for service robots. First, it provides a ROS-ETH bridge between the DAO and the robots. Second, it offers templates that enable organizations (e.g., companies, universities) to integrate service robots into their operations. Third, it provides proof-verification mechanisms that allow robot actions to be auditable. In our experimental setup, a mobile robot was deployed as a trash collector in a lab scenario. The robot collects trash and uses a smart bin to sort and dispose of it correctly. Then, the robot submits a proof of the successful operation and is compensated in DAO tokens. Finally, the robot re-invests the acquired funds to purchase battery charging services. Data collected in a three-day experiment show that the robot doubled its income and reinvested the funds to extend its operating time. The proof validation times of approximately one minute ensured verifiable task execution, while the accumulated robot income successfully funded up to 88 hours of future autonomous operation. The results of this research give insights into how robots and organizations can coordinate tasks and payments with auditable execution proofs and on-chain settlement.
Chinese Translation
机器人正在以最少的人类监督提高其自主性。然而,可审计的行为、透明的决策过程以及新的人机交互模型仍是实现更高机器人自主性尚未满足的要求。为了解决这些挑战,我们提出了RODEO(机器人去中心化组织),这是一个基于区块链的框架,集成了机器人信任和问责机制。本文形式化了服务机器人的去中心化自治组织(DAO)。首先,它提供了DAO与机器人之间的ROS-ETH桥接。其次,它提供了模板,使组织(例如公司、大学)能够将服务机器人整合到其运营中。第三,它提供了证明验证机制,使机器人行为可审计。在我们的实验设置中,一台移动机器人被部署为实验室场景中的垃圾收集器。该机器人收集垃圾,并使用智能垃圾桶进行分类和正确处理。然后,机器人提交成功操作的证明,并以DAO代币获得补偿。最后,机器人将获得的资金再投资于购买电池充电服务。在为期三天的实验中收集的数据表明,机器人收入翻倍,并将再投资资金用于延长其运行时间。大约一分钟的证明验证时间确保了可验证的任务执行,而累积的机器人收入成功资助了未来多达88小时的自主运行。该研究的结果为机器人和组织如何借助可审计的执行证明和链上结算来协调任务与支付提供了见解。
cs.RO / 38 / 2603.06073

Lifelong Embodied Navigation Learning

终身具身导航学习
Wang, Xudong, Dong, Jiahua, Liu, Baichen, Lyu, Qi, Liu, Lianqing, Han, Zhi
Abstract
Embodied navigation agents powered by large language models have shown strong performance on individual tasks but struggle to continually acquire new navigation skills, suffering from catastrophic forgetting. We formalize this challenge as lifelong embodied navigation learning (LENL), where an agent is required to adapt to a sequence of navigation tasks spanning multiple scenes and diverse user instruction styles, while retaining previously learned knowledge. To tackle this problem, we propose Uni-Walker, a lifelong embodied navigation framework that decouples navigation knowledge into task-shared and task-specific components with Decoder Extension LoRA (DE-LoRA). To learn the shared knowledge, we design a knowledge inheritance strategy and an experts co-activation strategy to facilitate shared knowledge transfer and refinement across multiple navigation tasks. To learn the specific knowledge, we propose an expert subspace orthogonality constraint together with a navigation-specific chain-of-thought reasoning mechanism to capture specific knowledge and enhance instruction-style understanding. Extensive experiments demonstrate the superiority of Uni-Walker for building universal navigation agents with lifelong learning.
Chinese Translation
由大型语言模型驱动的具身导航代理在单一任务上表现出色,但在持续获取新导航技能方面却面临困难,容易出现灾难性遗忘。我们将这一挑战形式化为终身具身导航学习(Lifelong Embodied Navigation Learning, LENL),要求代理适应跨多个场景和多样化用户指令风格的一系列导航任务,同时保留先前学习的知识。为了解决这个问题,我们提出了Uni-Walker,一个终身具身导航框架,该框架将导航知识解耦为任务共享和任务特定的组件,并使用解码器扩展LoRA(Decoder Extension LoRA, DE-LoRA)。为了学习共享知识,我们设计了一种知识继承策略和专家共同激活策略,以促进跨多个导航任务的共享知识转移和精炼。为了学习特定知识,我们提出了专家子空间正交约束和导航特定的思维链推理机制,以捕捉特定知识并增强对指令风格的理解。大量实验表明,Uni-Walker在构建具有终身学习能力的通用导航代理方面具有优越性。
cs.RO / 39 / 2603.06084

Multimodal Behavior Tree Generation: A Small Vision-Language Model for Robot Task Planning

多模态行为树生成:用于机器人任务规划的小型视觉语言模型
Battistini, Cristiano, Izzo, Riccardo Andrea, Bardaro, Gianluca, Matteucci, Matteo
Abstract
Large and small language models have been widely used for robotic task planning. At the same time, vision-language models (VLMs) have successfully tackled problems such as image captioning, scene understanding, and visual question answering. In this work, we combine these two approaches by deploying a compact, open-source multimodal model to generate behavior trees for robotic task planning. The main obstacle to achieving this goal is the lack of an existing dataset that links visual observations and instructions to executable behavior trees. We propose a method to construct such a dataset starting from existing robotic episodes (i.e., Open X-Embodiment), in which a large model serves as a teacher in a multi-stage generation pipeline. We use this dataset to fine-tune VLMs ranging from 500M to 4B parameters via parameter-efficient fine-tuning (PEFT). The generated behavior trees, compatible with the BehaviorTree.CPP library, are evaluated both offline, using structural and lexical metrics, and online through the execution of household tasks in a state-of-the-art embodied simulator. Our results demonstrate that our fine-tuned 4B-parameter VLM approaches the performance of state-of-the-art closed-source models, achieving an 87% success rate while requiring only a fraction of the computational resources.
Chinese Translation
大型和小型语言模型已广泛应用于机器人任务规划。同时,视觉语言模型(VLMs)成功解决了图像描述、场景理解和视觉问答等问题。在本研究中,我们通过部署一个紧凑的开源多模态模型,将这两种方法结合起来,以生成用于机器人任务规划的行为树。实现这一目标的主要障碍是缺乏现有数据集,将视觉观察和指令与可执行行为树相链接。我们提出了一种构建此类数据集的方法,从现有的机器人情节(即 Open X-Embodiment)出发,其中一个大型模型作为多阶段生成管道中的教师。我们使用该数据集通过参数高效微调(PEFT)对范围从 500M 到 4B 参数的 VLMs 进行微调。生成的行为树与 BehaviorTree.CPP 库兼容,分别通过结构性和词汇性指标进行离线评估,并通过在最先进的具身模拟器中执行家庭任务进行在线评估。我们的结果表明,经过微调的 4B 参数 VLM 的性能接近最先进的闭源模型,成功率达到 87%,同时仅需少量计算资源。
cs.RO / 40 / 2603.06121

Sticky-Glance: Robust Intent Recognition for Human Robot Collaboration via Single-Glance

Sticky-Glance:通过单次注视实现人机协作的稳健意图识别
Lai, Yuzhi, Yuan, Shenghai, Li, Peizheng, Zell, Andreas
Abstract
Gaze is a valuable means of communication for impaired people with extremely limited motor capabilities. However, robust gaze-based intent recognition in multi-object environments is challenging due to gaze noise, micro-saccades, viewpoint changes, and dynamic objects. To address this, we propose an object-centric gaze grounding framework that stabilizes intent through a sticky-glance algorithm, jointly modeling geometric distance and direction trends. The inferred intent remains anchored to the object even under short glances with as few as 3 gaze samples, achieving a tracking rate of 0.94 for dynamic targets and a selection accuracy of 0.98 for static targets. We further introduce a continuous shared-control and multi-modal interaction paradigm, enabling high-readiness control and human-in-the-loop feedback, thereby reducing task duration by nearly 10%. Experiments across dynamic tracking, multi-perspective alignment, a baseline comparison, user studies, and ablation studies demonstrate improved robustness, efficiency, and reduced workload compared to representative baselines.
Chinese Translation
注视对于运动能力极为有限的残障人士而言是一种宝贵的沟通方式。然而,由于注视噪声、微跳视(micro-saccades)、视角变化和动态物体的存在,在多物体环境中实现稳健的基于注视的意图识别面临挑战。为此,我们提出了一种以物体为中心的注视定位框架,通过粘性注视(sticky-glance)算法稳定意图,联合建模几何距离和方向趋势。即使在仅有3个注视样本的短暂注视下,推断出的意图仍然与物体保持锚定,实现了动态目标0.94的跟踪率和静态目标0.98的选择准确率。我们进一步引入了一种连续共享控制和多模态交互范式,能够实现高准备度控制和人在环反馈,从而将任务持续时间减少近10%。在动态跟踪、多视角对齐、基线比较、用户研究和消融研究中的实验表明,与代表性基线相比,系统的稳健性、效率和工作负荷均有所改善。
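A minimal sketch of the "sticky" selection logic described in the abstract: score objects jointly by geometric distance and gaze direction trend, and only switch targets when a competitor beats the current one by a hysteresis margin. The weights, the margin, and the 2D image-plane setting are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of a sticky-glance target selector on a 2D plane: each
# object is scored by (a) distance from the latest gaze point and (b) how
# well the gaze drift direction points toward it; the current target is
# kept unless a competitor wins by a hysteresis margin ("stickiness").

def sticky_select(gaze, objects, current=None, w_dist=1.0, w_trend=0.5, margin=0.2):
    """gaze: (T, 2) recent gaze samples (T >= 2); objects: (N, 2) centres.
    Returns the index of the selected object."""
    gaze, objects = np.asarray(gaze, float), np.asarray(objects, float)
    d = np.linalg.norm(objects - gaze[-1], axis=1)     # geometric distance
    trend = gaze[-1] - gaze[0]                          # direction trend
    to_obj = objects - gaze[-1]
    cos = (to_obj @ trend) / (
        np.linalg.norm(to_obj, axis=1) * (np.linalg.norm(trend) + 1e-9) + 1e-9)
    score = -w_dist * d + w_trend * cos                 # higher = better
    best = int(np.argmax(score))
    if current is not None and score[best] < score[current] + margin:
        return current                                  # stay anchored
    return best

# toy usage: three fixation samples drifting toward object 1
gaze = [(0.0, 0.0), (0.2, 0.1), (0.4, 0.2)]
objs = [(-1.0, 0.0), (1.2, 0.6), (0.0, 2.0)]
print(sticky_select(gaze, objs))              # -> 1
print(sticky_select(gaze, objs, current=0))   # -> 1: beats the margin, so switch
```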
cs.RO / 41 / 2603.06130

A Hazard-Informed Data Pipeline for Robotics Physical Safety

面向机器人物理安全的危害信息驱动数据管道
Odinokov, Alexei, Yavorskiy, Rostislav
Abstract
This report presents a structured Robotics Physical Safety Framework based on explicit asset declaration, systematic vulnerability enumeration, and hazard-driven synthetic data generation. The approach bridges classical risk engineering with modern machine learning pipelines, enabling safety envelope learning grounded in a formalized hazard ontology. The key contribution of this framework is the alignment between classical safety engineering, digital twin simulation, synthetic data generation, and machine learning model training.
Chinese Translation
本报告提出了一种结构化的机器人物理安全框架,该框架基于明确的资产声明、系统的脆弱性枚举和以风险为驱动的合成数据生成。该方法将经典的风险工程与现代机器学习管道相结合,实现了基于形式化风险本体的安全边界学习。该框架的关键贡献在于经典安全工程、数字双胞胎仿真、合成数据生成和机器学习模型训练之间的对齐。
cs.RO / 42 / 2603.06163

Dual-Agent Multiple-Model Reinforcement Learning for Event-Triggered Human-Robot Co-Adaptation in Decoupled Task Spaces

双代理多模型强化学习用于解耦任务空间中的事件触发人机共适应
Li, Yaqi, Han, Zhengqi, Liu, Huifang, Su, Steven W.
Abstract
This paper presents a shared-control rehabilitation policy for a custom 6-degree-of-freedom (6-DoF) upper-limb robot that decomposes complex reaching tasks into decoupled spatial axes. The patient governs the primary reaching direction using binary commands, while the robot autonomously manages orthogonal corrective motions. Because traditional fixed-frequency control often induces trajectory oscillations due to variable inverse-kinematics execution times, an event-driven progression strategy is proposed. This architecture triggers subsequent control actions only when the end-effector enters an admission sphere centred on the immediate target waypoint, and was validated in a semi-virtual setup linking a physical pressure sensor to a MuJoCo simulation. To optimise human--robot co-adaptation safely and efficiently, this study introduces Dual Agent Multiple Model Reinforcement Learning (DAMMRL). This framework discretises decision characteristics: the human agent selects the admission sphere radius to reflect their inherent speed--accuracy trade-off, while the robot agent dynamically adjusts its 3D Cartesian step magnitudes to complement the user's cognitive state. Trained in simulation and deployed across mixed environments, this event-triggered DAMMRL approach effectively suppresses waypoint chatter, balances spatial precision with temporal efficiency, and significantly improves success rates in object acquisition tasks.
Chinese Translation
本文提出了一种共享控制的康复策略,针对定制的6自由度(6-DoF)上肢机器人,将复杂的到达任务分解为解耦的空间轴。患者通过二进制命令控制主要的到达方向,而机器人则自主管理正交的修正运动。由于传统的固定频率控制常常因逆运动学执行时间的变化而导致轨迹振荡,因此提出了一种事件驱动的进程策略。该架构仅在末端执行器进入围绕即时目标航点的接纳球时触发后续控制动作,并在将物理压力传感器与MuJoCo仿真连接的半虚拟设置中进行了验证。为了安全有效地优化人机共适应,本研究引入了双代理多模型强化学习(DAMMRL)。该框架离散化了决策特征:人类代理选择接纳球半径以反映其固有的速度-准确性权衡,而机器人代理则动态调整其3D笛卡尔步幅,以补充用户的认知状态。在仿真中训练并在混合环境中部署,这种事件触发的DAMMRL方法有效抑制了航点抖振,平衡了空间精度与时间效率,并显著提高了物体获取任务的成功率。
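The event-driven progression strategy is simple to sketch: instead of advancing the reference at a fixed clock rate, the next waypoint is issued only once the end-effector enters an admission sphere around the current one. In DAMMRL the sphere radius is chosen by the human agent; below it is just a parameter.

```python
import numpy as np

# Minimal sketch of event-driven waypoint progression: the next control
# target is issued only when the end-effector enters an "admission sphere"
# of radius r around the current waypoint, rather than at a fixed rate.

def progress_waypoints(ee_positions, waypoints, radius):
    """Replay end-effector positions; return the waypoint index that is the
    active target at each step under the admission-sphere trigger."""
    idx, active = 0, []
    for p in np.asarray(ee_positions, float):
        if idx < len(waypoints) - 1 and np.linalg.norm(p - waypoints[idx]) <= radius:
            idx += 1                   # event: entered sphere -> next target
        active.append(idx)
    return active

wps = np.array([[0.0, 0, 0], [0.1, 0, 0], [0.2, 0, 0]])
traj = np.array([[0.0, 0, 0], [0.05, 0, 0], [0.11, 0, 0], [0.19, 0, 0]])
print(progress_waypoints(traj, wps, radius=0.02))   # -> [1, 1, 2, 2]
```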
cs.RO / 43 / 2603.06190

DreamToNav: Generalizable Navigation for Robots via Generative Video Planning

DreamToNav:通过生成视频规划实现机器人的通用导航
Serpiva, Valerii, Sam, Jeffrin, Simon, Chidera, Amjad, Hajira, Zhura, Iana, Lykov, Artem, Tsetserukou, Dzmitry
Abstract
We present DreamToNav, a novel autonomous robot framework that uses generative video models to enable intuitive, human-in-the-loop control. Instead of relying on rigid waypoint navigation, users provide natural language prompts (e.g. "Follow the person carefully"), which the system translates into executable motion. Our pipeline first employs Qwen 2.5-VL-7B-Instruct to refine vague user instructions into precise visual descriptions. These descriptions condition NVIDIA Cosmos 2.5, a state-of-the-art video foundation model, to synthesize a physically consistent video sequence of the robot performing the task. From this synthetic video, we extract a valid kinematic path using visual pose estimation, robot detection and trajectory recovery. By treating video generation as a planning engine, DreamToNav allows robots to visually "dream" complex behaviors before executing them, providing a unified framework for obstacle avoidance and goal-directed navigation without task-specific engineering. We evaluate the approach on both a wheeled mobile robot and a quadruped robot in indoor navigation tasks. DreamToNav achieves a success rate of 76.7%, with final goal errors typically within 0.05-0.10 m and trajectory tracking errors below 0.15 m. These results demonstrate that trajectories extracted from generative video predictions can be reliably executed on physical robots across different locomotion platforms.
Chinese Translation
我们提出了DreamToNav,一个新颖的自主机器人框架,利用生成视频模型实现直观的人在环控制。用户无需依赖固定的航点导航,而是提供自然语言提示(例如"仔细跟随那个人"),系统将其转化为可执行的运动。我们的流程首先使用Qwen 2.5-VL-7B-Instruct将模糊的用户指令细化为精确的视觉描述。这些描述作为条件输入给NVIDIA Cosmos 2.5(一个最先进的视频基础模型),以合成机器人执行任务的物理一致的视频序列。我们利用视觉姿态估计、机器人检测和轨迹恢复,从合成视频中提取有效的运动学路径。通过将视频生成视为规划引擎,DreamToNav使机器人能够在执行复杂行为之前以视觉方式"梦见"这些行为,提供了一个统一的框架用于避障和目标导向导航,而无需特定任务的工程设计。我们在室内导航任务中对轮式移动机器人和四足机器人进行了评估。DreamToNav的成功率达到76.7%,最终目标误差通常在0.05-0.10米之间,轨迹跟踪误差低于0.15米。这些结果表明,从生成视频预测中提取的轨迹可以在不同运动平台的物理机器人上可靠地执行。
cs.RO / 44 / 2603.06205

KISS-IMU: Self-supervised Inertial Odometry with Motion-balanced Learning and Uncertainty-aware Inference

KISS-IMU:具有运动平衡学习和不确定性感知推理的自监督惯性里程计
Choi, Jiwon, Kim, Hogyun, Yang, Geonmo, Lee, Juhui, Cho, Younggun
Abstract
Inertial measurement units (IMUs), which provide high-frequency linear acceleration and angular velocity measurements, serve as fundamental sensing modalities in robotic systems. Recent advances in deep neural networks have led to remarkable progress in inertial odometry. However, the heavy reliance on ground truth data during training fundamentally limits scalability and generalization to unseen and diverse environments. We propose KISS-IMU, a novel self-supervised inertial odometry framework that eliminates ground truth dependency by leveraging simple LiDAR-based ICP registration and pose graph optimization as a supervisory signal. Our approach embodies two key principles: keeping the IMU stable through motion-aware balanced training and keeping the IMU strong through uncertainty-driven adaptive weighting during inference. To evaluate performance across diverse motion patterns and scenarios, we conducted comprehensive experiments on various real-world platforms, including quadruped robots. Importantly, we train only the IMU network in a self-supervised manner, with LiDAR serving solely as a lightweight supervisory signal rather than requiring additional learnable processes. This design enables the framework to ensure robustness without relying on joint multi-modal learning or ground truth supervision. The supplementary materials are available at https://sparolab.github.io/research/kiss_imu.
Chinese Translation
惯性测量单元(IMU)提供高频率的线性加速度和角速度测量,是机器人系统中的基本传感方式。近年来,深度神经网络的进展使得惯性里程计取得了显著的进展。然而,在训练过程中对真实数据的高度依赖,根本上限制了其在未见和多样化环境中的可扩展性和泛化能力。我们提出了KISS-IMU,这是一种新颖的自监督惯性里程计框架,通过利用基于LiDAR的简单ICP配准和姿态图优化作为监督信号,消除了对真实数据的依赖。我们的方法体现了两个关键原则:通过运动感知的平衡训练保持IMU的稳定性,以及通过不确定性驱动的自适应加权在推理过程中保持IMU的强度。为了评估在多样化运动模式和场景下的性能,我们在包括四足机器人在内的各种真实平台上进行了全面实验。重要的是,我们仅以自监督的方式训练IMU网络,LiDAR仅作为轻量级的监督信号,而不需要额外的可学习过程。这一设计使得框架能够确保鲁棒性,而无需依赖联合多模态学习或真实数据监督。补充材料可在 https://sparolab.github.io/research/kiss_imu 获取。
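The "keep the IMU strong" principle, uncertainty-driven adaptive weighting at inference, can be illustrated with plain inverse-variance fusion: the learned IMU displacement dominates exactly when its predicted variance is small. The Gaussian assumption and the fusion partner below are illustrative, not KISS-IMU's actual pipeline.

```python
import numpy as np

# Minimal sketch of uncertainty-driven adaptive weighting: fuse a learned
# IMU displacement estimate with another odometry source by precision
# (inverse-variance) weighting, so the IMU's influence scales with its own
# predicted confidence.

def fuse(mu_imu, var_imu, mu_other, var_other):
    """Inverse-variance fusion of two independent Gaussian estimates."""
    w_imu, w_other = 1.0 / var_imu, 1.0 / var_other
    mu = (w_imu * mu_imu + w_other * mu_other) / (w_imu + w_other)
    var = 1.0 / (w_imu + w_other)
    return mu, var

# a confident IMU network (small variance) pulls the fused estimate its way
print(fuse(np.array([1.00, 0.0]), 0.01, np.array([1.20, 0.1]), 0.10))
```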
cs.RO / 45 / 2603.06218

Few-Shot Neural Differentiable Simulator: Real-to-Sim Rigid-Contact Modeling

少量样本神经可微模拟器:从真实到模拟的刚性接触建模
Huang, Zhenhao, Luo, Siyuan, Zhou, Bingyang, Zeng, Ziqiu, Pho, Jason, Shi, Fan
Abstract
Accurate physics simulation is essential for robotic learning and control, yet analytical simulators often fail to capture complex contact dynamics, while learning-based simulators typically require large amounts of costly real-world data. To bridge this gap, we propose a few-shot real-to-sim approach that combines the physical consistency of analytical formulations with the representational capacity of graph neural network (GNN)-based models. Using only a small amount of real-world data, our method calibrates analytical simulators to generate large-scale synthetic datasets that capture diverse contact interactions. On this foundation, we introduce a mesh-based GNN that implicitly models rigid-body forward dynamics and derive surrogate gradients for collision detection, achieving full differentiability. Experimental results demonstrate that our approach enables learning-based simulators to outperform differentiable baselines in replicating real-world trajectories. In addition, the differentiable design supports gradient-based optimization, which we validate through simulation-based policy learning in multi-object interaction scenarios. Extensive experiments show that our framework not only improves simulation fidelity with minimal supervision but also increases the efficiency of policy learning. Taken together, these findings suggest that differentiable simulation with few-shot real-world grounding provides a powerful direction for advancing future robotic manipulation and control.
Chinese Translation
准确的物理模拟对于机器人学习和控制至关重要,但分析模拟器往往无法捕捉复杂的接触动态,而基于学习的模拟器通常需要大量昂贵的真实世界数据。为了弥补这一差距,我们提出了一种少量样本的从真实到模拟的方法,该方法结合了分析公式的物理一致性和基于图神经网络(GNN)模型的表现能力。仅使用少量真实世界数据,我们的方法校准分析模拟器,以生成捕捉多样接触交互的大规模合成数据集。在此基础上,我们引入了一种基于网格的GNN,隐式建模刚体的前向动态,并推导出用于碰撞检测的替代梯度,实现了完全可微性。实验结果表明,我们的方法使基于学习的模拟器在复制真实世界轨迹方面超越了可微基线。此外,可微设计支持基于梯度的优化,我们通过在多物体交互场景中的基于模拟的策略学习进行了验证。大量实验表明,我们的框架不仅在最小监督下提高了模拟的保真度,还提高了策略学习的效率。综合来看,这些发现表明,具有少量真实世界基础的可微模拟为推动未来机器人操作和控制提供了强有力的方向。
cs.RO / 46 / 2603.06266

Towards Robotic Lake Maintenance: Integrating SONAR and Satellite Data to Assist Human Operators

朝向机器人湖泊维护:整合声纳和卫星数据以辅助人类操作员
Elsayed, Ahmed H., Manss, Christoph, El-Mihoub, Tarek A., Lejman, Andrej, Stahl, Frederic
Abstract
Artificial Water Bodies (AWBs) are human-made systems that require continuous monitoring due to their artificial biological processes. These systems demand regular maintenance to manage their ecosystems effectively. As a result of these artificial conditions, underwater vegetation can grow rapidly and must be harvested to preserve the ecological balance. This paper proposes a two-step approach to support targeted weed harvesting for the maintenance of artificial lakes. The first step, the initial detection of Submerged Aquatic Vegetation (SAV), also referred to in this paper as areas of interest, is performed using satellite-derived indices, specifically the Aquatic Plants and Algae (APA) index, which highlights submerged vegetation in water bodies. Subsequently, an Unmanned Surface Vehicle (USV) equipped with multibeam SOund NAvigation and Ranging (SONAR) performs high-resolution bathymetric mapping to locate and quantify aquatic vegetation precisely. This two-stage approach offers an effective human-robot collaboration, where satellite data guides the USV missions and boat skippers leverage detailed SONAR maps for targeted harvesting. This setup narrows the search space and reduces the manual workload of human operators, making the harvesting process less labour-intensive. Preliminary results demonstrate the feasibility of integrating satellite imagery and underwater acoustic sensing to improve vegetation management in artificial lakes.
Chinese Translation
人工水体(AWBs)是人造系统,由于其人工生物过程,需要持续监测。这些系统要求定期维护,以有效管理其生态系统。由于这些人工条件,水下植被可能迅速生长,必须进行采收以维持生态平衡。本文提出了一种两步法,以支持针对性的杂草采收,维护人工湖泊。第一步是通过卫星衍生指数初步检测沉水植被(SAV,本文中也称为关注区域),具体采用水生植物和藻类(APA)指数,该指数突出显示水体中的沉水植被。随后,配备多波束声纳(SONAR)的无人水面艇(USV)进行高分辨率的水深测绘,以精确定位和量化水生植被。这种两阶段的方法提供了有效的人机协作:卫星数据指导USV任务,而船只驾驶员利用详细的声纳地图进行针对性采收。该设置缩小了搜索空间,减少了人类操作员的手动工作量,降低了采收过程的劳动强度。初步结果表明,整合卫星影像和水下声学传感以改善人工湖泊的植被管理是可行的。
cs.RO / 47 / 2603.06280

SuperSuit: An Isomorphic Bimodal Interface for Scalable Mobile Manipulation

SuperSuit:用于可扩展移动操控的同构双模接口
Chen, Tongqing, Wu, Hang, Wang, Jiasen, Li, Xiaotao, Jin, Zhu, Fang, Lu
Abstract
High-quality, long-horizon demonstrations are essential for embodied AI, yet acquiring such data for tightly coupled wheeled mobile manipulators remains a fundamental bottleneck. Unlike fixed-base systems, mobile manipulators require continuous coordination between $SE(2)$ locomotion and precise manipulation, exposing limitations in existing teleoperation and wearable interfaces. We present \textbf{SuperSuit}, a bimodal data acquisition framework that supports both robot-in-the-loop teleoperation and active demonstration under a shared kinematic interface. Both modalities produce structurally identical joint-space trajectories, enabling direct data mixing without modifying downstream policies. For locomotion, SuperSuit maps natural human stepping to continuous planar base velocities, eliminating discrete command switches. For manipulation, it employs a strictly isomorphic wearable arm in both modes, while policy training is formulated in a shift-invariant delta-joint representation to mitigate calibration offsets and structural compliance without inverse kinematics. Real-world experiments on long-horizon mobile manipulation tasks show 2.6$\times$ higher demonstration throughput in active mode compared to a teleoperation baseline, comparable policy performance when substituting teleoperation data with active demonstrations at fixed dataset size, and monotonic performance improvement as active data volume increases. These results indicate that consistent kinematic representations across collection modalities enable scalable data acquisition for long-horizon mobile manipulation.
Chinese Translation
高质量、长时间跨度的演示对于具身人工智能至关重要,但为紧密耦合的轮式移动操控器获取此类数据仍然是一个基本瓶颈。与固定基座系统不同,移动操控器需要在 $SE(2)$ 运动和精确操控之间进行持续协调,这暴露了现有远程操作和可穿戴界面的局限性。我们提出了 SuperSuit,一种双模数据采集框架,支持在共享运动学接口下的机器人在环(robot-in-the-loop)远程操作和主动演示。这两种模式产生结构上相同的关节空间轨迹,使得在不修改下游策略的情况下可以直接混合数据。在运动方面,SuperSuit 将自然的人类步态映射为连续的平面基座速度,消除了离散命令切换。在操控方面,它在两种模式下采用严格同构的可穿戴臂,同时策略训练在一个平移不变的增量关节表示中进行,以减轻校准偏差和结构顺应性的影响,而无需逆运动学。在长时间跨度的移动操控任务上的实际实验表明,主动模式下的演示吞吐量比远程操作基线高出 2.6 倍;在数据集规模固定的情况下,用主动演示替代远程操作数据可获得相当的策略性能;并且随着主动数据量的增加,性能单调提升。这些结果表明,不同采集模式下的一致运动学表示能够实现面向长时间跨度移动操控的可扩展数据采集。
cs.RO / 48 / 2603.06343

Open-Source Based and ETSI Compliant Cooperative, Connected, and Automated Mini-Cars

基于开源和符合ETSI标准的合作、连接与自动化迷你汽车
Farina, Lorenzo, Gavioli, Federico, Iandolo, Salvatore, Moretti, Francesco, Perrone, Giuseppe, Piccoli, Matteo, Raviglione, Francesco, Rapelli, Marco, Solida, Antonio, Burgio, Paolo, Grazia, Carlo Augusto, Bazzi, Alessandro
Abstract
The automotive sector is following a revolutionary path from vehicles controlled by humans to vehicles that will be fully automated, fully connected, and ultimately fully cooperative. Along this road, new cooperative algorithms and protocols will be designed and field tested, which represents a great challenge in terms of costs. In this context, in particular, moving from simulations to practical experiments requires huge investments that are not always affordable and may become a barrier in some cases. To solve this issue and provide the community with an intermediate step, we here propose the use of 1:10 scaled cooperative, autonomous, and connected mini-cars. The mini-car is equipped with a Jetson Orin board running the open Robot Operating System 2 (ROS2), sensors for autonomous operations, and a Raspberry Pi board for connectivity running the open-source Open Stack for Car (OScar). A key aspect of the proposal is the use of OScar, which implements a full ETSI cooperative-intelligent transport systems (C-ITS) compliant stack. The feasibility and potential of the proposed platform are demonstrated here through the implementation of a case study in which the Day-1 intersection collision warning (ICW) application is implemented and validated.
Chinese Translation
汽车行业正经历从人类控制的车辆到完全自动化、完全连接和最终完全协作的车辆的革命性转变。在这一过程中,将设计和现场测试新的协作算法和协议,这在成本方面是巨大的挑战。在这种背景下,特别是从仿真转向实际实验需要巨额投资,这并不总是可承受的,并且在某些情况下可能成为障碍。为了解决这个问题,并为社区提供一个中间步骤,我们在此提出使用1:10比例的协作、自动和连接迷你汽车。该迷你汽车配备了运行开放机器人操作系统2(Robot Operating System 2, ROS2)的Jetson Orin开发板、用于自主操作的传感器,以及用于连接的Raspberry Pi开发板,后者运行开源软件栈 Open Stack for Car(OScar)。该提案的一个关键方面是使用OScar,它实现了完整的符合ETSI合作智能交通系统(Cooperative-Intelligent Transport Systems, C-ITS)标准的协议栈。通过实施一个案例研究,在其中实现并验证了 Day-1 交叉口碰撞预警(Intersection Collision Warning, ICW)应用,展示了所提议平台的可行性和潜力。
cs.RO / 49 / 2603.06356

Safe Consensus of Cooperative Manipulation with Hierarchical Event-Triggered Control Barrier Functions

基于分层事件触发控制障碍函数的协作操控安全共识
Zhuang, Simiao, Huang, Bingkun, Yang, Zewen
Abstract
Cooperative transport and manipulation of heavy or bulky payloads by multiple manipulators requires coordinated formation tracking, while simultaneously enforcing strict safety constraints in varying environments with limited communication and real-time computation budgets. This paper presents a distributed control framework that achieves consensus coordination with safety guarantees via hierarchical event-triggered control barrier functions (CBFs). We first develop a consensus-based protocol that relies solely on local neighbor information to enforce both translational and rotational consistency in task space. Building on this coordination layer, we propose a three-level hierarchical event-triggered safety architecture with CBFs, which is integrated with a risk-aware leader selection and smooth switching strategy to reduce online computation. The proposed approach is validated through real-world hardware experiments using two Franka manipulators operating with static obstacles, as well as comprehensive simulations demonstrating scalable multi-arm cooperation with dynamic obstacles. Results demonstrate higher precision cooperation under strict safety constraints, achieving substantially reduced computational cost and communication frequency compared to baseline methods.
Chinese Translation
多台操控器对重型或大体积负载的协作运输和操控需要协调的队形跟踪,同时在有限的通信和实时计算预算下,在不同环境中强制执行严格的安全约束。本文提出了一种分布式控制框架,通过分层事件触发控制障碍函数(CBFs)实现具有安全保证的共识协调。我们首先开发了一种基于共识的协议,仅依赖于局部邻居信息,以在任务空间中强制执行平移和旋转的一致性。在此协调层的基础上,我们提出了一种三层分层事件触发安全架构,结合了风险感知的领导者选择和平滑切换策略,以减少在线计算。通过使用两台Franka操控器在静态障碍物下进行的实际硬件实验,以及展示动态障碍物下可扩展多臂协作的全面仿真,验证了所提方法。结果表明,在严格的安全约束下实现了更高精度的协作,相较于基线方法显著降低了计算成本和通信频率。
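The safety-filter core of any CBF scheme is worth a sketch. For a single-integrator robot and one barrier constraint, the quadratic program min ||u - u_des||^2 s.t. grad h(x) . u >= -alpha h(x) has a closed-form projection. This is the textbook CBF filter, not the paper's hierarchical event-triggered variant.

```python
import numpy as np

# Minimal sketch of a control-barrier-function safety filter for a
# single-integrator robot x_dot = u with barrier h(x) = |x - x_obs|^2 - d^2.
# The filter solves  min |u - u_des|^2  s.t.  grad_h(x) . u >= -alpha * h(x),
# which for a single affine constraint reduces to the projection below.

def cbf_filter(x, u_des, x_obs, d, alpha=1.0):
    h = np.sum((x - x_obs) ** 2) - d ** 2        # barrier value
    a = 2.0 * (x - x_obs)                        # gradient of h
    b = -alpha * h                               # constraint: a . u >= b
    slack = b - a @ u_des
    if slack <= 0.0:                             # nominal input already safe
        return u_des
    return u_des + (slack / (a @ a)) * a         # minimal correction

x = np.array([1.0, 0.0])                         # robot position
u_des = np.array([-1.0, 0.0])                    # heading straight at obstacle
print(cbf_filter(x, u_des, x_obs=np.zeros(2), d=0.5))   # -> [-0.375, 0.]
```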
cs.RO / 50 / 2603.06450

Data Analogies Enable Efficient Cross-Embodiment Transfer

数据类比促进高效的跨具身迁移
Yang, Jonathan, Finn, Chelsea, Sadigh, Dorsa
Abstract
Generalist robot policies are trained on demonstrations collected across a wide variety of robots, scenes, and viewpoints. Yet it remains unclear how to best organize and scale such heterogeneous data so that it genuinely improves performance in a given target setting. In this work, we ask: what form of demonstration data is most useful for enabling transfer across robot set-ups? We conduct controlled experiments that vary end-effector morphology, robot platform appearance, and camera perspective, and compare the effects of simply scaling the number of demonstrations against systematically broadening the diversity in different ways. Our simulated experiments show that while perceptual shifts such as viewpoint benefit most from broad diversity, morphology shifts benefit far less from unstructured diversity and instead see the largest gains from data analogies, i.e. paired demonstrations that align scenes, tasks, and/or trajectories across different embodiments. Informed by the simulation results, we improve real-world cross-embodiment transfer success by an average of 22.5% over large-scale, unpaired datasets by changing only the composition of the data.
Chinese Translation
通用机器人策略是在各种机器人、场景和视角下收集的演示数据上训练的。然而,如何最佳地组织和扩展这些异构数据,以真正提高在特定目标环境中的表现,仍然不清楚。在本研究中,我们提出了一个问题:哪种形式的演示数据最有助于实现机器人设置之间的迁移?我们进行了一系列受控实验,改变末端执行器的形态、机器人平台的外观和相机视角,并比较了简单增加演示数量与以不同方式系统性扩展多样性的效果。我们的仿真实验表明,尽管视角等感知变化最受益于广泛的多样性,但形态变化从无结构多样性中受益较小,而是通过数据类比(即对齐不同具身之间场景、任务和/或轨迹的配对演示)获得最大的提升。根据仿真结果,我们仅通过改变数据的组成,将现实世界中的跨具身迁移成功率相较于大规模无配对数据集平均提高了22.5%。
cs.RO / 51 / 2603.06480

History-Conditioned Spatio-Temporal Visual Token Pruning for Efficient Vision-Language Navigation

历史条件下的时空视觉标记剪枝用于高效的视觉-语言导航
Wang, Qitong, Liang, Yijun, Li, Ming, Zhou, Tianyi, Rasmussen, Christopher
Abstract
Vision-Language Navigation (VLN) enables robots to follow natural-language instructions in visually grounded environments, serving as a key capability for embodied robotic systems. Recent Vision-Language-Action (VLA) models have demonstrated strong navigation performance, but their high computational cost introduces latency that limits real-time deployment. We propose a training-free spatio-temporal vision token pruning framework tailored to VLA-based VLN. We apply spatial token selection to the current view, alongside spatio-temporal compression for historical memories, enabling efficient long-horizon inference while reducing redundant computation. Leveraging attention-based token importance and query-guided spatio-temporal filtering, the proposed approach preserves navigation-relevant information without retraining or modifying pretrained models, allowing plug-and-play integration into existing VLA systems. Through experiments on standard VLN benchmarks, we confirm that our method significantly outperforms existing pruning strategies. It successfully preserves superior navigation accuracy under extreme pruning scenarios, all while maintaining the highly competitive inference efficiency. Real-world deployment on a Unitree Go2 quadruped robot further validates reliable and low-latency instruction-following navigation under practical robotic constraints. We hope this work helps bridge the gap between large-scale multimodal modeling and efficient, real-time embodied deployment in robotic navigation systems.
Chinese Translation
视觉-语言导航(VLN)使机器人能够在视觉基础环境中遵循自然语言指令,这是具身机器人系统的一项关键能力。近期的视觉-语言-动作(VLA)模型展示了强大的导航性能,但其高计算成本引入的延迟限制了实时部署。我们提出了一种针对基于VLA的VLN的无训练时空视觉标记剪枝框架。我们对当前视图应用空间标记选择,同时对历史记忆进行时空压缩,从而在减少冗余计算的同时实现高效的长时程推理。通过利用基于注意力的标记重要性和查询引导的时空过滤,所提出的方法在不重新训练或修改预训练模型的情况下,保留了与导航相关的信息,支持以即插即用的方式集成到现有的VLA系统中。通过在标准VLN基准上的实验,我们确认我们的方法显著优于现有的剪枝策略。在极端剪枝场景下,它成功保持了卓越的导航准确性,同时维持了极具竞争力的推理效率。在Unitree Go2四足机器人上的实际部署进一步验证了在实际机器人约束下可靠且低延迟的指令跟随导航。我们希望这项工作能帮助弥合大规模多模态建模与机器人导航系统中高效、实时具身部署之间的差距。
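Attention-based token importance plus a top-k keep rule is the standard skeleton of training-free visual token pruning; a minimal sketch follows, using mean instruction-to-visual attention as the score. The paper's spatio-temporal, history-conditioned criterion is richer than this.

```python
import numpy as np

# Minimal sketch of training-free visual token pruning: rank current-view
# tokens by the attention they receive from instruction (query) tokens and
# keep only the top-k. Mean cross-attention as the importance score is a
# common heuristic, not necessarily the paper's exact criterion.

def prune_tokens(vis_tokens, attn_text_to_vis, keep_ratio=0.25):
    """vis_tokens: (N, D) visual token embeddings.
    attn_text_to_vis: (T, N) attention from T instruction tokens to N
    visual tokens. Returns (kept_tokens, kept_indices)."""
    importance = attn_text_to_vis.mean(axis=0)          # (N,)
    k = max(1, int(keep_ratio * len(importance)))
    keep = np.sort(np.argsort(importance)[-k:])         # top-k, original order
    return vis_tokens[keep], keep

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))                 # e.g. 14x14 patch tokens
attn = rng.random((12, 196))
attn /= attn.sum(-1, keepdims=True)
kept, idx = prune_tokens(tokens, attn, keep_ratio=0.25)
print(kept.shape, idx[:5])                              # (49, 64) ...
```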
cs.RO / 52 / 2603.06494

Control Barrier Corridors: From Safety Functions to Safe Sets

控制障碍走廊:从安全功能到安全集合
Arslan, Ömür, Atanasov, Nikolay
Abstract
Safe autonomy is a critical requirement and a key enabler for robots to operate safely in unstructured complex environments. Control barrier functions and safe motion corridors are two widely used but technically distinct safety methods, functional and geometric, respectively, for safe motion planning and control. Control barrier functions are applied to the safety filtering of control inputs to limit the decay rate of system safety, whereas safe motion corridors are geometrically constructed to define a local safe zone around the system state for use in motion optimization and reference-governor design. This paper introduces a new notion of control barrier corridors, which unifies these two approaches by converting control barrier functions into local safe goal regions for reference goal selection in feedback control systems. We show, with examples on fully actuated systems, kinematic unicycles, and linear output regulation systems, that individual state safety can be extended locally over control barrier corridors for convex barrier functions, provided the control convergence rate matches the barrier decay rate, highlighting a trade-off between safety and reactiveness. Such safe control barrier corridors enable safely reachable persistent goal selection over continuously changing barrier corridors during system motion, which we demonstrate for verifiably safe and persistent path following in autonomous exploration of unknown environments.
Chinese Translation
安全自主是机器人在非结构化复杂环境中安全操作的关键要求和重要推动力。控制障碍函数和安全运动走廊是两种广泛使用但在技术上有所不同的安全方法,分别为功能性和几何性,旨在实现安全的运动规划和控制。控制障碍函数用于对控制输入进行安全过滤,以限制系统安全性的衰减速率,而安全运动走廊则通过几何构造来定义系统状态周围的局部安全区域,以便在运动优化和参考调节器(reference governor)设计中使用。本文引入了一种新的控制障碍走廊概念,通过将控制障碍函数转换为反馈控制系统中参考目标选择的局部安全目标区域,从而统一了这两种方法。我们通过对全驱动系统、运动学独轮车(unicycle)模型和线性输出调节系统的示例展示了,对于凸障碍函数,只要控制收敛速率与障碍衰减速率匹配,个体状态安全可以在控制障碍走廊上局部扩展,突显了安全性与反应性之间的权衡。这种安全控制障碍走廊使得在系统运动过程中,能够在不断变化的障碍走廊上安全地选择可达的持久目标,我们在自主探索未知环境中的可验证安全和持久路径跟踪中进行了演示。
cs.RO / 53 / 2603.06497

A Unified Low-Dimensional Design Embedding for Joint Optimization of Shape, Material, and Actuation in Soft Robots

一种统一的低维设计嵌入用于软机器人形状、材料和驱动的联合优化
Candiello, Vittorio, Mekkattu, Manuel, Michelis, Mike Y., Katzschmann, Robert K.
Abstract
Soft robots achieve functionality through tight coupling among geometry, material composition, and actuation. As a result, effective design optimization requires these three aspects to be considered jointly rather than in isolation. This coupling is computationally challenging: nonlinear large-deformation mechanics increase simulation cost, while contact, collision handling, and non-smooth state transitions limit the applicability of standard gradient-based approaches. We introduce a smooth, low-dimensional design embedding for soft robots that unifies shape morphing, multi-material distribution, and actuation within a single structured parameter space. Shape variation is modeled through continuous deformation maps of a reference geometry, while material properties are encoded as spatial fields. Both are constructed from shared basis functions. This representation enables expressive co-design while drastically reducing the dimensionality of the search space. In our experiments, we show that design expressiveness increases with the number of basis functions, unlike comparable neural network encodings whose representational capacity does not scale predictably with parameter count. We further show that joint co-optimization of shape, material, and actuation using our unified embedding consistently outperforms sequential strategies. All experiments are performed independently of the underlying simulator, confirming compatibility with black-box simulation pipelines. Across multiple dynamic tasks, the proposed embedding surpasses neural network and voxel-based baseline parameterizations while using significantly fewer design parameters. Together, these findings demonstrate that structuring the design space itself enables efficient co-design of soft robots.
Chinese Translation
软机器人通过几何形状、材料组成和驱动之间的紧密耦合实现功能。因此,有效的设计优化需要将这三个方面共同考虑,而不是孤立地进行。这种耦合在计算上具有挑战性:非线性大变形力学增加了仿真成本,而接触、碰撞处理和非光滑状态转变限制了标准基于梯度的方法的适用性。我们提出了一种平滑的低维设计嵌入,用于软机器人,将形状变形、多材料分布和驱动统一在一个结构化的参数空间内。形状变化通过参考几何体的连续变形映射进行建模,而材料属性则被编码为空间场。这两者都是由共享基函数构建的。这种表示方式使得表达性共同设计成为可能,同时大幅降低了搜索空间的维度。在我们的实验中,我们展示了设计的表达性随着基函数数量的增加而增加,这与可比的神经网络编码不同,后者的表示能力并不随着参数数量的增加而可预测地扩展。我们进一步表明,使用我们统一嵌入的形状、材料和驱动的联合共同优化始终优于顺序策略。所有实验均独立于底层仿真器进行,确认与黑箱仿真管道的兼容性。在多个动态任务中,所提出的嵌入在使用显著更少的设计参数的情况下超越了神经网络和基于体素的基线参数化。这些发现共同表明,结构化设计空间本身能够有效促进软机器人的共同设计。
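The unifying trick, one set of shared basis functions parameterizing both a shape deformation map and a material field, is easy to sketch. Gaussian RBFs on a 2D reference geometry are an illustrative choice here; the paper's basis and fields are its own.

```python
import numpy as np

# Minimal sketch of a shared-basis design embedding: the same radial basis
# functions parameterize both a displacement map (shape) and a stiffness
# field (material) over a reference geometry, so the whole design is a
# small coefficient vector. Gaussian RBFs and the centre layout are
# illustrative assumptions.

def rbf_features(points, centres, lengthscale=0.3):
    """Phi[i, j] = exp(-|p_i - c_j|^2 / (2 l^2)), points (N,2), centres (M,2)."""
    d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
ref = rng.random((500, 2))                      # reference geometry samples
centres = np.stack(np.meshgrid(np.linspace(0, 1, 4),
                               np.linspace(0, 1, 4)), -1).reshape(-1, 2)
Phi = rbf_features(ref, centres)                # (500, 16) shared basis

w_shape = 0.05 * rng.standard_normal((16, 2))   # design vector: shape part
w_mat = rng.standard_normal(16)                 # design vector: material part

deformed = ref + Phi @ w_shape                  # continuous deformation map
stiffness = np.exp(Phi @ w_mat)                 # positive material field
print(deformed.shape, bool(stiffness.min() > 0))
```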
cs.RO / 54 / 2603.06501

CFEAR-Teach-and-Repeat: Fast and Accurate Radar-only Localization

CFEAR-Teach-and-Repeat:快速且准确的仅雷达定位
Hilger, Maximilian, Adolfsson, Daniel, Becker, Ralf, Andreasson, Henrik, Lilienthal, Achim J.
Abstract
Reliable localization in prior maps is essential for autonomous navigation, particularly under adverse weather, where optical sensors may fail. We present CFEAR-TR, a teach-and-repeat localization pipeline using a single spinning radar, which is designed for easily deployable, lightweight, and robust navigation in adverse conditions. Our method localizes by jointly aligning live scans to both stored scans from the teach mapping pass, and to a sliding window of recent live keyframes. This ensures accurate and robust pose estimation across different seasons and weather phenomena. Radar scans are represented using a sparse set of oriented surface points, computed from Doppler-compensated measurements. The map is stored in a pose graph that is traversed during localization. Experiments on the held-out test sequences from the Boreas dataset show that CFEAR-TR can localize with an accuracy as low as 0.117 m and 0.096°, corresponding to improvements of up to 63% over the previous state of the art, while running efficiently at 29 Hz. These results substantially narrow the gap to lidar-level localization, particularly in heading estimation. We make the C++ implementation of our work available to the community.
Chinese Translation
在先前地图中可靠的定位对于自主导航至关重要,尤其是在恶劣天气条件下,光学传感器可能失效。我们提出了CFEAR-TR,一种使用单个旋转雷达的示教-重复(teach-and-repeat)定位管道,旨在实现易于部署、轻量且在恶劣条件下稳健的导航。我们的方法通过将实时扫描与来自示教建图过程的存储扫描以及最近实时关键帧的滑动窗口进行联合对齐来实现定位。这确保了在不同季节和天气现象下的准确和稳健的姿态估计。雷达扫描使用一组稀疏的定向表面点表示,这些点是从多普勒补偿测量中计算得出的。地图存储在一个在定位过程中遍历的姿态图中。在Boreas数据集的保留测试序列上的实验表明,CFEAR-TR的定位精度可低至0.117米和0.096°,相较于之前的最先进技术提高了多达63%,同时以29 Hz的速度高效运行。这些结果显著缩小了与激光雷达级别定位之间的差距,特别是在航向估计方面。我们向社区开放了这项工作的C++实现。
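At the core of any teach-and-repeat pipeline is rigid registration of a live scan against a stored one. Below is a minimal 2D sketch using the closed-form SVD (Kabsch) solution on matched point pairs; CFEAR-TR actually aligns sparse oriented surface points with point-to-line style residuals, so this plain-point version is only a stand-in.

```python
import numpy as np

# Minimal sketch of one rigid 2D scan-alignment step: given matched point
# pairs between a live scan and a stored keyframe scan, recover R and t in
# closed form via SVD (Kabsch).

def rigid_align_2d(src, dst):
    """Return R (2x2), t (2,) minimizing sum |R @ src_i + t - dst_i|^2."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:                 # guard against reflections
        Vt[-1] *= -1
        R = (U @ Vt).T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

theta = np.deg2rad(10)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
src = np.random.default_rng(0).random((50, 2))
dst = src @ R_true.T + np.array([0.3, -0.1])
R, t = rigid_align_2d(src, dst)
print(np.rad2deg(np.arctan2(R[1, 0], R[0, 0])), t)   # ~10 deg, ~[0.3, -0.1]
```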
cs.RO / 55 / 2603.06512

SG-DOR: Learning Scene Graphs with Direction-Conditioned Occlusion Reasoning for Pepper Plants

SG-DOR:基于方向条件遮挡推理的场景图学习用于辣椒植物
Menon, Rohit, Mueller-Goldingen, Niklas, Pan, Sicong, Chenchani, Gokul Krishna, Bennewitz, Maren
Abstract
Robotic harvesting in dense crop canopies requires effective interventions that depend not only on geometry, but also on explicit, direction-conditioned relations identifying which organs obstruct a target fruit. We present SG-DOR (Scene Graphs with Direction-Conditioned Occlusion Reasoning), a relational framework that, given instance-segmented organ point clouds, infers a scene graph encoding physical attachments and direction-conditioned occlusion. We introduce an occlusion ranking task for retrieving and ranking candidate leaves for a target fruit and approach direction, and propose a direction-aware graph neural architecture with per-fruit leaf-set attention and union-level aggregation. Experiments on a multi-plant synthetic pepper dataset show improved occlusion prediction (F1=0.73, NDCG@3=0.85) and attachment inference (edge F1=0.83) over strong ablations, yielding a structured relational signal for downstream intervention planning.
Chinese Translation
在密集作物冠层中的机器人采摘需要有效的干预措施,这不仅依赖于几何形状,还依赖于明确的、以方向为条件的关系,以识别哪些器官会遮挡目标果实。我们提出了SG-DOR(基于方向条件遮挡推理的场景图),这是一个关系框架,给定实例分割的器官点云,推断出编码物理附着和方向条件遮挡的场景图。我们引入了一项遮挡排序任务,针对给定的目标果实和接近方向检索并排序候选叶片,并提出了一种方向感知的图神经网络架构,具有每个果实的叶片集注意力和联合级聚合。在一个多植物合成辣椒数据集上的实验表明,相较于强消融基线,遮挡预测(F1=0.73,NDCG@3=0.85)和附着推断(边缘F1=0.83)均有所改善,为下游干预规划提供了结构化的关系信号。
cs.RO / 56 / 2603.06525

Underactuated multimodal jumping robot for extraterrestrial exploration

用于外星探索的欠驱动多模态跳跃机器人
Wagner, Neil R., Yim, Justin K.
Abstract
We present a rolling and jumping underactuated monopedal robot designed to explore multimodal locomotion on low-gravity bodies. It uses only two reaction wheels to control its spatial orientation with two controllers: a balancing controller which can aim the robot's jump direction on the ground, and an aerial reorientation controller which can aim the robot's leg for landing after flight. We demonstrate rolling, targeted jumping and landing, and self-righting using only three actuators total, keeping system size to 0.33m and 1.25kg. Simple switching between locomotion modes enables the system to deal with differing landscapes and environmental conditions.
Chinese Translation
我们提出了一种滚动和跳跃的欠驱动单足机器人,旨在探索低重力天体上的多模态运动。该机器人仅使用两个反作用轮并通过两个控制器来控制其空间姿态:一个平衡控制器可以在地面上瞄准机器人的跳跃方向,另一个空中重定向控制器可以在飞行后调整机器人的腿部朝向以实现着陆。我们展示了总共仅使用三个执行器实现的滚动、定向跳跃与着陆以及自我翻正,系统尺寸保持在0.33米、质量1.25千克。运动模式之间的简单切换使得该系统能够应对不同的地形和环境条件。
cs.RO / 57 / 2603.06538

Unified Learning of Temporal Task Structure and Action Timing for Bimanual Robot Manipulation

双手机器人操作的时间任务结构与动作时序的统一学习
Dreher, Christian, Dormanns, Patrick, Meixner, Andre, Asfour, Tamim
Abstract
Temporal task structure is fundamental for bimanual manipulation: a robot must not only know that one action precedes or overlaps another, but also when each action should occur and how long it should take. While symbolic temporal relations enable high-level reasoning about task structure and alternative execution sequences, concrete timing parameters are equally essential for coordinating two hands at the execution level. Existing approaches address these two levels in isolation, leaving a gap between high-level task planning and low-level movement synchronization. This work presents an approach for learning both symbolic and subsymbolic temporal task constraints from human demonstrations and deriving executable, temporally parametrized plans for bimanual manipulation. Our contributions are (i) a 3-dimensional representation of timings between two actions with methods based on multivariate Gaussian Mixture Models to represent temporal relationships between actions on a subsymbolic level, (ii) a method based on the Davis-Putnam-Logemann-Loveland (DPLL) algorithm that finds and ranks all contradiction-free assignments of Allen relations to action pairs, representing different modes of a task, and (iii) an optimization-based planning system that combines the identified symbolic and subsymbolic temporal task constraints to derive temporally parametrized plans for robot execution. We evaluate our approach on several datasets, demonstrating that our method generates temporally parametrized plans closer to human demonstrations than the most characteristic demonstration baseline.
Chinese Translation
时间任务结构对于双手操作至关重要:机器人不仅需要知道一个动作在另一个动作之前发生或与之重叠,还需要知道每个动作何时发生以及需要多长时间。虽然符号时间关系使得对任务结构和替代执行序列进行高层次推理成为可能,但具体的时间参数对于在执行层面协调双手同样至关重要。现有方法分别处理这两个层面,导致高层次任务规划与低层次运动同步之间存在差距。本研究提出了一种从人类示范中学习符号和亚符号时间任务约束的方法,并推导出可执行的、时间参数化的双手操作计划。我们的贡献包括:(i) 一种基于多元高斯混合模型的两动作之间时序的三维表示方法,用于在亚符号层面表示动作之间的时间关系;(ii) 一种基于Davis-Putnam-Logemann-Loveland(DPLL)算法的方法,寻找并排序所有将Allen关系无矛盾地分配给动作对的方案,以表示任务的不同模式;(iii) 一种基于优化的规划系统,结合识别出的符号和亚符号时间任务约束,推导出用于机器人执行的时间参数化计划。我们在多个数据集上评估了我们的方法,结果表明,我们的方法生成的时间参数化计划比最具代表性的示范基线更接近人类示范。
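The symbolic half of the representation is the Allen relation between two timed actions; classifying it from interval endpoints is mechanical. A sketch follows, with an epsilon tolerance for "equal" endpoints (an assumption); the paper layers learned GMM timing distributions on top of such relations.

```python
# Minimal sketch: classify the Allen relation between two action intervals
# (s1, e1) and (s2, e2). The eps tolerance for "equal" endpoints is an
# illustrative assumption.

def allen_relation(s1, e1, s2, e2, eps=1e-6):
    def eq(a, b):
        return abs(a - b) < eps
    if eq(s1, s2) and eq(e1, e2): return "equals"
    if e1 < s2 - eps:             return "before"
    if e2 < s1 - eps:             return "after"
    if eq(e1, s2):                return "meets"
    if eq(e2, s1):                return "met-by"
    if eq(s1, s2):                return "starts" if e1 < e2 else "started-by"
    if eq(e1, e2):                return "finishes" if s1 > s2 else "finished-by"
    if s1 > s2 and e1 < e2:       return "during"
    if s2 > s1 and e2 < e1:       return "contains"
    return "overlaps" if s1 < s2 else "overlapped-by"

# left hand grasps while the right hand is still reaching:
print(allen_relation(0.0, 2.0, 1.5, 3.0))   # overlaps
print(allen_relation(0.0, 1.0, 1.0, 2.0))   # meets
```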
cs.RO / 58 / 2603.06548

Uncertainty-Aware Adaptive Dynamics For Underwater Vehicle-Manipulator Robots

面向水下车辆-操纵器机器人的不确定性感知自适应动力学
Morgan, Edward, Dadson, Nenyi K, Barbalata, Corina
Abstract
Accurate and adaptive dynamic models are critical for underwater vehicle-manipulator systems where hydrodynamic effects induce time-varying parameters. This paper introduces a novel uncertainty-aware adaptive dynamics model framework that remains linear in lumped vehicle and manipulator parameters, and embeds convex physical consistency constraints during online estimation. Moving horizon estimation is used to stack horizon regressors, enforce realizable inertia, damping, friction, and hydrostatics, and quantify uncertainty from parameter evolution. Experiments on a BlueROV2 Heavy with a 4-DOF manipulator demonstrate rapid convergence and calibrated predictions. Manipulator fits achieve R² = 0.88 to 0.98 with slopes near unity, while vehicle surge, heave, and roll are reproduced with good fidelity under stronger coupling and noise. Median solver time is approximately 0.023 s per update, confirming online feasibility. A comparison against a fixed parameter model shows consistent reductions in MAE and RMSE across degrees of freedom. Results indicate physically plausible parameters and confidence intervals with near 100% coverage, enabling reliable feedforward control and simulation in underwater environments.
Chinese Translation
准确和自适应的动态模型对于水下车辆-操纵器系统至关重要,因为水动力效应会引起时间变化的参数。本文提出了一种新颖的不确定性意识自适应动力学模型框架,该框架在集成的车辆和操纵器参数中保持线性,并在在线估计过程中嵌入凸物理一致性约束。采用移动视界估计方法堆叠视界回归器,强制满足可实现的惯性、阻尼、摩擦和静水力约束,并量化参数演变带来的不确定性。在配备4自由度操纵器的BlueROV2 Heavy上进行的实验表明,模型快速收敛并提供了经过校准的预测。操纵器拟合的R²值达到0.88至0.98,斜率接近于1,而车辆的纵荡(surge)、垂荡(heave)和横摇(roll)在更强的耦合和噪声下也能良好再现。每次更新的中位数求解时间约为0.023秒,确认了在线可行性。与固定参数模型的比较显示,在各个自由度上MAE和RMSE均有一致的降低。结果表明,参数在物理上是合理的,并且置信区间的覆盖率接近100%,从而实现了在水下环境中可靠的前馈控制和仿真。
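The estimation step lends itself to a compact sketch: with dynamics linear in the lumped parameters, a stacked horizon regressor plus convex consistency constraints is a small constrained least-squares problem. The matrices below are random stand-ins, and non-negativity is only a placeholder for the paper's realizability constraints.

```python
# Minimal sketch, assuming A @ theta ≈ b for a stacked horizon regressor A,
# with a convex stand-in constraint (theta >= 0) for physical consistency.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n_rows, n_params = 120, 6                 # horizon-stacked regressor size
A = rng.normal(size=(n_rows, n_params))
theta_true = np.abs(rng.normal(size=n_params))
b = A @ theta_true + 0.01 * rng.normal(size=n_rows)

theta = cp.Variable(n_params)
objective = cp.Minimize(cp.sum_squares(A @ theta - b))
constraints = [theta >= 0]                # placeholder for realizable inertia/damping
cp.Problem(objective, constraints).solve()
print("estimated parameters:", np.round(theta.value, 3))
```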
cs.RO / 59 / 2603.06573

Fly360: Omnidirectional Obstacle Avoidance within Drone View

Fly360:无人机视角下的全向障碍物避让
Zhang, Xiangkai, Zhang, Dizhe, Cao, WenZhuo, Wan, Zhaoliang, Niu, Yingjie, Qi, Lu, Yang, Xu, Liu, Zhiyong
Abstract
Obstacle avoidance in unmanned aerial vehicles (UAVs), as a fundamental capability, has gained increasing attention with the growing focus on spatial intelligence. However, current obstacle-avoidance methods mainly depend on limited field-of-view sensors and are ill-suited for UAV scenarios that require full-spatial awareness when the movement direction differs from the UAV's heading. This limitation motivates us to explore omnidirectional obstacle avoidance for panoramic drones with full-view perception. We first study an underexplored problem setting in which a UAV must generate collision-free motion in environments with obstacles from arbitrary directions, and then construct a benchmark that consists of three representative flight tasks. Building on this setting, we propose Fly360, a two-stage perception-decision pipeline with a fixed random-yaw training strategy. At the perception stage, panoramic RGB observations are converted into depth maps as a robust intermediate representation. The policy network is lightweight and maps these depth inputs to body-frame velocity commands. Extensive simulation and real-world experiments demonstrate that Fly360 achieves stable omnidirectional obstacle avoidance and outperforms forward-view baselines across all tasks. Our model is available at https://zxkai.github.io/fly360/
Chinese Translation
障碍物避让是无人机(UAV)的一项基本能力,随着对空间智能的日益关注,该领域受到了越来越多的关注。然而,目前的障碍物避让方法主要依赖于有限视野的传感器,难以适应无人机在运动方向与其航向不一致时所需的全空间感知。这一局限性促使我们探索具有全视角感知的全向障碍物避让方法,适用于全景无人机。我们首先研究一个较少探索的问题设置,即无人机必须在来自任意方向的障碍物环境中生成无碰撞的运动,然后构建一个包含三个代表性飞行任务的基准。在这样的设置基础上,我们提出了Fly360,一个具有固定随机偏航训练策略的两阶段感知-决策管道。在感知阶段,输入全景RGB观测并将其转换为深度图,作为一种稳健的中间表示。策略网络则较为轻量,用于将深度输入映射为机体坐标系下的速度指令。大量的仿真和现实世界实验表明,Fly360实现了稳定的全向障碍物避让,并在所有任务中优于前视基线。我们的模型可在 https://zxkai.github.io/fly360/ 获取。
计算机视觉 (Computer Vision)
151
cs.CV / 1 / 2603.05537

Edges Are All You Need: Robust Gait Recognition via Label-Free Structure

边缘即一切:通过无标签结构实现稳健的步态识别
Zhang, Chao, Zheng, Zhuang, Li, Ruixin, Mei, Zhanyong
Abstract
Gait recognition is a non-intrusive biometric technique for security applications, yet existing studies are dominated by silhouette- and parsing-based representations. Silhouettes are sparse and miss internal structural details, limiting discriminability. Parsing enriches silhouettes with part-level structures, but relies heavily on upstream human parsers (e.g., label granularity and boundary precision), leading to unstable performance across datasets and sometimes even inferior results to silhouettes. We revisit gait representations from a structural perspective and describe a design space defined by edge density and supervision form: silhouettes use sparse boundary edges with weak single-label supervision, while parsing uses denser cues with strong semantic priors. In this space, we identify an underexplored paradigm: dense part-level structure without explicit semantic labels, and introduce SKETCH as a new visual modality for gait recognition. Sketch extracts high-frequency structural cues (e.g., limb articulations and self-occlusion contours) directly from RGB images via edge-based detectors in a label-free manner. We further show that label-guided parsing and label-free sketch are semantically decoupled and structurally complementary. Based on this, we propose SKETCHGAIT, a hierarchically disentangled multi-modal framework with two independent streams for modality-specific learning and a lightweight early-stage fusion branch to capture structural complementarity. Extensive experiments on SUSTech1K and CCPG validate the proposed modality and framework: SketchGait achieves 92.9% Rank-1 on SUSTech1K and 93.1% mean Rank-1 on CCPG.
Chinese Translation
步态识别是一种非侵入式生物特征技术,广泛应用于安全领域,但现有研究主要集中在基于轮廓和人体解析的表示上。轮廓稀疏,缺乏内部结构细节,限制了可区分性。解析通过部分级结构丰富了轮廓,但严重依赖上游的人体解析器(例如,其标签粒度和边界精度),导致在不同数据集上的性能不稳定,有时甚至不如轮廓。我们从结构的角度重新审视步态表示,并描述一个由边缘密度和监督形式定义的设计空间:轮廓使用稀疏的边界边缘和弱单标签监督,而解析使用更密集的线索和强语义先验。在这个空间中,我们识别出一个未被充分探索的范式:没有显式语义标签的密集部分级结构,并引入SKETCH作为步态识别的新视觉模态。Sketch通过基于边缘的检测器以无标签的方式直接从RGB图像中提取高频结构线索(例如,肢体关节和自遮挡轮廓)。我们进一步表明,标签引导的解析和无标签的Sketch在语义上是解耦的,并在结构上是互补的。基于此,我们提出SKETCHGAIT,一个分层解耦的多模态框架,具有两个独立的流用于模态特定学习,以及一个轻量级的早期融合分支以捕捉结构互补性。在SUSTech1K和CCPG上的大量实验验证了所提模态和框架:SketchGait在SUSTech1K上达到了92.9%的Rank-1,在CCPG上达到了93.1%的平均Rank-1。
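The label-free extraction of the proposed sketch modality can be pictured with any off-the-shelf edge detector; the snippet below uses Canny purely as an illustrative stand-in (the abstract does not commit to this specific detector), with a hypothetical frame path.

```python
# Minimal sketch: derive a dense, label-free structural "sketch" from RGB.
import cv2

frame = cv2.imread("walking_frame.png")                  # hypothetical gait frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, threshold1=100, threshold2=200)  # high-frequency structure
cv2.imwrite("walking_frame_sketch.png", edges)           # no parser, no labels
```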
cs.CV / 2 / 2603.05591

Thinking with Spatial Code for Physical-World Video Reasoning

基于空间编码的物理世界视频推理
Chen, Jieneng, Ma, Wenxin, Yuan, Ruisheng, Zhang, Yunzhi, Wu, Jiajun, Yuille, Alan
Abstract
We introduce Thinking with Spatial Code, a framework that transforms RGB video into explicit, temporally coherent 3D representations for physical-world visual question answering. We highlight the empirical finding that our proposed spatial encoder can parse videos into structured spatial code with explicit 3D oriented bounding boxes and semantic labels, enabling large language models (LLMs) to reason directly over explicit spatial variables. Specifically, we propose a spatial encoder that encodes image and geometric features by unifying 6D object parsing and tracking backbones with geometric prediction, and we further finetune LLMs with reinforcement learning using a spatial rubric reward that encourages perspective-aware, geometrically grounded inference. As a result, our model outperforms proprietary vision-language models on VSI-Bench, setting a new state-of-the-art. Code is available at https://github.com/Beckschen/spatialcode.
Chinese Translation
我们提出了基于空间编码的思维框架,该框架将RGB视频转化为明确的、时间一致的3D表示,以用于物理世界的视觉问答。我们强调了一个实证发现,即我们提出的空间编码器能够将视频解析为结构化的空间编码,具有明确的3D定向边界框和语义标签,从而使大型语言模型(LLMs)能够直接在明确的空间变量上进行推理。具体而言,我们提出了一种空间编码器,通过将6D物体解析和跟踪主干与几何预测相结合来编码图像和几何特征,并进一步使用空间评分奖励对LLMs进行强化学习微调,该奖励鼓励基于视角的、几何基础的推理。因此,我们的模型在VSI-Bench上超越了专有的视觉-语言模型,创造了新的最先进水平。代码可在 https://github.com/Beckschen/spatialcode 获取。
cs.CV / 3 / 2603.05604

From Decoupled to Coupled: Robustness Verification for Learning-based Keypoint Detection with Joint Specifications

从解耦到耦合:基于学习的关键点检测的鲁棒性验证与联合规范
Luo, Xusheng, Liu, Changliu
Abstract
Keypoint detection underpins many vision tasks, including pose estimation, viewpoint recovery, and 3D reconstruction, yet modern neural models remain vulnerable to small input perturbations. Despite its importance, formal robustness verification for keypoint detectors is largely unexplored due to high-dimensional inputs and continuous coordinate outputs. We propose the first coupled robustness verification framework for heatmap-based keypoint detectors that bounds the joint deviation across all keypoints, capturing their interdependencies and downstream task requirements. Unlike prior decoupled, classification-style approaches that verify each keypoint independently and yield conservative guarantees, our method verifies collective behavior. We formulate verification as a falsification problem using a mixed-integer linear program (MILP) that combines reachable heatmap sets with a polytope encoding joint deviation constraints. Infeasibility certifies robustness, while feasibility provides counterexamples, and we prove the method is sound: if it certifies the model as robust, then the keypoint detection model is guaranteed to be robust. Experiments show that our coupled approach achieves high verified rates and remains effective under strict error thresholds where decoupled methods fail.
Chinese Translation
关键点检测是许多视觉任务的基础,包括姿态估计、视角恢复和三维重建,但现代神经模型仍然容易受到小幅输入扰动的影响。尽管其重要性不言而喻,针对关键点检测器的正式鲁棒性验证仍然未得到充分研究,原因在于高维输入和连续坐标输出。我们提出了第一个针对基于热图的关键点检测器的耦合鲁棒性验证框架,该框架限制了所有关键点的联合偏差,捕捉了它们之间的相互依赖性和下游任务的需求。与之前的解耦分类风格方法不同,后者独立验证每个关键点并提供保守的保证,我们的方法验证了集体行为。我们将验证问题表述为一个证伪(falsification)问题,使用混合整数线性规划(MILP),该规划结合了可达热图集与编码联合偏差约束的多面体。不可行性可证明鲁棒性,而可行性则提供反例,我们证明该方法是可靠的:如果它证明模型是鲁棒的,那么关键点检测模型就保证是鲁棒的。实验表明,我们的耦合方法在严格的误差阈值下仍然有效,并且实现了高验证率,而解耦方法在此情况下失效。
cs.CV / 4 / 2603.05607

DreamCAD: Scaling Multi-modal CAD Generation using Differentiable Parametric Surfaces

DreamCAD:利用可微分参数曲面扩展多模态计算机辅助设计生成
Khan, Mohammad Sadil, Usama, Muhammad, Potamias, Rolandos Alexandros, Stricker, Didier, Afzal, Muhammad Zeshan, Deng, Jiankang, Elezi, Ismail
Abstract
Computer-Aided Design (CAD) relies on structured and editable geometric representations, yet existing generative methods are constrained by small annotated datasets with explicit design histories or boundary representation (BRep) labels. Meanwhile, millions of unannotated 3D meshes remain untapped, limiting progress in scalable CAD generation. To address this, we propose DreamCAD, a multi-modal generative framework that directly produces editable BReps from point-level supervision, without CAD-specific annotations. DreamCAD represents each BRep as a set of parametric patches (e.g., Bézier surfaces) and uses a differentiable tessellation method to generate meshes. This enables large-scale training on 3D datasets while reconstructing connected and editable surfaces. Furthermore, we introduce CADCap-1M, the largest CAD captioning dataset to date, with 1M+ descriptions generated using GPT-5 for advancing text-to-CAD research. DreamCAD achieves state-of-the-art performance on ABC and Objaverse benchmarks across text, image, and point modalities, improving geometric fidelity and achieving over 75% user preference. Code and dataset will be publicly available.
Chinese Translation
计算机辅助设计(CAD)依赖于结构化和可编辑的几何表示,然而现有的生成方法受到小规模带有明确设计历史或边界表示(BRep)标签的标注数据集的限制。同时,数百万个未标注的3D网格尚未被利用,限制了可扩展CAD生成的进展。为了解决这个问题,我们提出了DreamCAD,一个多模态生成框架,能够直接从点级监督生成可编辑的BRep,而无需特定于CAD的标注。DreamCAD将每个BRep表示为一组参数补丁(例如,Bézier曲面),并使用可微分的镶嵌方法生成网格。这使得在3D数据集上进行大规模训练成为可能,同时重建连接且可编辑的表面。此外,我们推出了CADCap-1M,这是迄今为止最大的CAD标注数据集,包含超过100万条使用GPT-5生成的描述,以推动文本到CAD的研究。DreamCAD在文本、图像和点模态的ABC和Objaverse基准测试中实现了最先进的性能,提高了几何保真度,并超过了75%的用户偏好。代码和数据集将公开发布。
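For intuition on the parametric patches DreamCAD generates, here is a plain NumPy evaluation of a bicubic Bézier surface via Bernstein polynomials. The paper's differentiable tessellation would build on evaluations like this, but the code below is only the textbook parameterization, not the authors' implementation.

```python
# Minimal sketch: evaluate a bicubic Bézier patch at (u, v) in [0, 1]^2.
import numpy as np
from math import comb

def bezier_surface(control, u, v):
    """control: (4, 4, 3) grid of 3D control points."""
    n = control.shape[0] - 1
    bu = np.array([comb(n, i) * u**i * (1 - u)**(n - i) for i in range(n + 1)])
    bv = np.array([comb(n, j) * v**j * (1 - v)**(n - j) for j in range(n + 1)])
    return np.einsum("i,j,ijk->k", bu, bv, control)  # Bernstein-weighted sum

control = np.random.rand(4, 4, 3)                    # stand-in control net
print(bezier_surface(control, 0.5, 0.5))             # a point on the patch
```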
cs.CV / 5 / 2603.05622

Adversarial Batch Representation Augmentation for Batch Correction in High-Content Cellular Screening

用于高内涵细胞筛选中批次校正的对抗性批次表示增强
Tong, Lei, Yao, Xujing, Corrigan, Adam, Chen, Long, Kumar, Navin Rathna, Hallbrook, Kerry, Orme, Jonathan, Wang, Yinhai, Zhou, Huiyu
Abstract
High-Content Screening routinely generates massive volumes of cell painting images for phenotypic profiling. However, technical variations across experimental executions inevitably induce biological batch (bio-batch) effects. These cause covariate shifts and degrade the generalization of deep learning models on unseen data. Existing batch correction methods typically rely on additional prior knowledge (e.g., treatment or cell culture information) or struggle to generalize to unseen bio-batches. In this work, we frame bio-batch mitigation as a Domain Generalization (DG) problem and propose Adversarial Batch Representation Augmentation (ABRA). ABRA explicitly models batch-wise statistical fluctuations by parameterizing feature statistics as structured uncertainties. Through a min-max optimization framework, it actively synthesizes worst-case bio-batch perturbations in the representation space, guided by a strict angular geometric margin to preserve fine-grained class discriminability. To prevent representation collapse during this adversarial exploration, we introduce a synergistic distribution alignment objective. Extensive evaluations on the large-scale RxRx1 and RxRx1-WILDS benchmarks demonstrate that ABRA establishes a new state-of-the-art for siRNA perturbation classification.
Chinese Translation
高内涵筛选(High-Content Screening)常规生成大量细胞绘画(Cell Painting)图像用于表型分析。然而,实验执行中的技术变异不可避免地引入生物批次(bio-batch)效应。这些效应导致协变量偏移,并降低深度学习模型在未见数据上的泛化能力。现有的批次校正方法通常依赖于额外的先验知识(例如,处理或细胞培养信息),或者难以对未见的生物批次进行泛化。在本研究中,我们将缓解生物批次效应的问题视为一个领域泛化(Domain Generalization, DG)问题,并提出对抗性批次表示增强(Adversarial Batch Representation Augmentation, ABRA)。ABRA通过将特征统计参数化为结构化的不确定性,明确建模各批次的统计波动。通过最小-最大优化框架,它在表示空间中主动合成最坏情况的生物批次扰动,并通过严格的角几何边界来指导,以保持细粒度类别的可区分性。为了防止在这种对抗性探索过程中表示崩溃,我们引入了一种协同分布对齐目标。在大规模RxRx1和RxRx1-WILDS基准上的广泛评估表明,ABRA在siRNA扰动分类方面建立了新的最先进水平。
cs.CV / 6 / 2603.05623

Post Fusion Bird's Eye View Feature Stabilization for Robust Multimodal 3D Detection

后融合鸟瞰视图特征稳定化用于鲁棒的多模态3D检测
Dong, Trung Tien, Thakkar, Dev, Sargolzaei, Arman, Lin, Xiaomin
Abstract
Camera-LiDAR fusion is widely used in autonomous driving to enable accurate 3D object detection. However, bird's-eye view (BEV) fusion detectors can degrade significantly under domain shift and sensor failures, limiting reliability in real-world deployment. Existing robustness approaches often require modifying the fusion architecture or retraining specialized models, making them difficult to integrate into already deployed systems. We propose a Post Fusion Stabilizer (PFS), a lightweight module that operates on intermediate BEV representations of existing detectors and produces a refined feature map for the original detection head. The design stabilizes feature statistics under domain shift, suppresses spatial regions affected by sensor degradation, and adaptively restores weakened cues through residual correction. Designed as a near-identity transformation, PFS preserves performance while improving robustness under diverse camera and LiDAR corruptions. Evaluations on the nuScenes benchmark demonstrate that PFS achieves state-of-the-art results in several failure modes, notably improving camera dropout robustness by +1.2% and low-light performance by +4.4% mAP while maintaining a lightweight footprint of only 3.3 M parameters.
Chinese Translation
相机与激光雷达的融合在自动驾驶中被广泛应用,以实现准确的3D物体检测。然而,鸟瞰视图(BEV)融合检测器在领域转移和传感器故障下可能显著降级,限制了其在实际部署中的可靠性。现有的鲁棒性方法通常需要修改融合架构或重新训练专门的模型,这使得它们难以集成到已经部署的系统中。我们提出了一种后融合稳定器(Post Fusion Stabilizer, PFS),这是一种轻量级模块,作用于现有检测器的中间BEV表示,并为原始检测头生成精炼的特征图。该设计在领域转移下稳定特征统计,抑制受传感器降级影响的空间区域,并通过残差校正自适应地恢复减弱的线索。PFS被设计为近似恒等变换,能够在改善鲁棒性的同时保持性能,适应多种相机和激光雷达的干扰。在nuScenes基准上的评估表明,PFS在多个故障模式下实现了最先进的结果,显著提高了相机失效情况下的鲁棒性(+1.2%)和低光照下的性能(+4.4% mAP),同时保持了仅3.3M参数的轻量级规模。
cs.CV / 7 / 2603.05629

Rethinking Concept Bottleneck Models: From Pitfalls to Solutions

重新思考概念瓶颈模型:从陷阱到解决方案
Tapli, Merve, Bouniot, Quentin, Stammer, Wolfgang, Akata, Zeynep, Akbas, Emre
Abstract
Concept Bottleneck Models (CBMs) ground predictions in human-understandable concepts but face fundamental limitations: the absence of a metric to pre-evaluate concept relevance, the "linearity problem" causing recent CBMs to bypass the concept bottleneck entirely, an accuracy gap compared to opaque models, and finally the lack of systematic study on the impact of different visual backbones and VLMs. We introduce CBM-Suite, a methodological framework that systematically addresses these challenges. First, we propose an entropy-based metric to quantify the intrinsic suitability of a concept set for a given dataset. Second, we resolve the linearity problem by inserting a non-linear layer between concept activations and the classifier, which ensures that model accuracy faithfully reflects concept relevance. Third, we narrow the accuracy gap by leveraging a distillation loss guided by a linear teacher probe. Finally, we provide comprehensive analyses on how different vision encoders, vision-language models, and concept sets interact to influence accuracy and interpretability in CBMs. Extensive evaluations show that CBM-Suite yields more accurate models and provides insights for improving concept-based interpretability.
Chinese Translation
概念瓶颈模型(CBMs)将预测建立在易于人类理解的概念之上,但面临着根本性的局限性:缺乏预评估概念相关性的度量标准、导致近期CBMs完全绕过概念瓶颈的“线性问题”、与不透明模型相比的准确性差距,以及对不同视觉骨干网络和视觉语言模型(VLMs)影响的系统性研究的缺乏。我们提出了CBM-Suite,一个方法论框架,系统地解决这些挑战。首先,我们提出了一种基于熵的度量标准,以量化给定数据集的概念集的内在适用性。其次,我们通过在概念激活和分类器之间插入非线性层来解决线性问题,从而确保模型的准确性真实反映概念的相关性。第三,我们通过利用由线性教师探针引导的蒸馏损失来缩小准确性差距。最后,我们提供了全面的分析,探讨不同视觉编码器、视觉语言模型和概念集如何相互作用以影响CBMs中的准确性和可解释性。广泛的评估表明,CBM-Suite产生了更准确的模型,并为改善基于概念的可解释性提供了见解。
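The proposed fix for the linearity problem reduces, in the simplest reading, to placing a non-linearity between concept activations and the classifier. The toy PyTorch head below illustrates that wiring; the dimensions and the choice of ReLU are our assumptions, not the paper's exact design.

```python
# Minimal sketch: a concept-bottleneck head with a non-linear layer between
# concept activations and the final classifier.
import torch
import torch.nn as nn

n_concepts, n_classes = 128, 200
concept_head = nn.Linear(768, n_concepts)   # backbone feature -> concept logits
classifier = nn.Sequential(                 # non-linearity breaks the bypass
    nn.ReLU(),
    nn.Linear(n_concepts, n_classes),
)

features = torch.randn(8, 768)              # stand-in backbone features
concepts = concept_head(features)           # concept-supervised activations
logits = classifier(concepts)
print(logits.shape)                         # torch.Size([8, 200])
```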
cs.CV / 8 / 2603.05630

Making Reconstruction FID Predictive of Diffusion Generation FID

使重建FID能够预测扩散生成FID
Xu, Tongda, He, Mingwei, Abu-Hussein, Shady, Hernandez-Lobato, Jose Miguel, Zhang, Haotian, Zhao, Kai, Zhou, Chao, Zhang, Ya-Qin, Wang, Yan
Abstract
It is well known that the reconstruction FID (rFID) of a VAE is poorly correlated with the generation FID (gFID) of a latent diffusion model. We propose interpolated FID (iFID), a simple variant of rFID that exhibits a strong correlation with gFID. Specifically, for each element in the dataset, we retrieve its nearest neighbor (NN) in the latent space and interpolate their latent representations. We then decode the interpolated latent and compute the FID between the decoded samples and the original dataset. Additionally, we refine the claim that rFID correlates poorly with gFID by showing that rFID correlates with sample quality in the diffusion refinement phase, whereas iFID correlates with sample quality in the diffusion navigation phase. Furthermore, we provide an explanation for why iFID correlates well with gFID, and why reconstruction metrics are negatively correlated with gFID, by connecting to results on diffusion generalization and hallucination. Empirically, iFID is the first metric to demonstrate a strong correlation with diffusion gFID, achieving Pearson linear and Spearman rank correlations of approximately 0.85. The source code is provided at https://github.com/tongdaxu/Making-rFID-Predictive-of-Diffusion-gFID.
Chinese Translation
众所周知,变分自编码器(VAE)的重建FID(rFID)与潜在扩散模型的生成FID(gFID)相关性较差。我们提出了一种重建FID的简单变体——插值FID(iFID),它与生成FID表现出较强的相关性。具体而言,对于数据集中每个元素,我们在潜在空间中检索其最近邻(NN),并对它们的潜在表示进行插值。然后,我们解码插值后的潜在表示,并计算解码样本与原始数据集之间的FID。此外,我们通过展示rFID与扩散精炼阶段样本质量的相关性,进一步完善了rFID与gFID相关性差的论断,而iFID则与扩散导航阶段的样本质量相关。此外,我们还解释了为何iFID与gFID相关性良好,以及为何重建指标与gFID呈负相关,联系到扩散泛化和幻觉的结果。从实证上看,iFID是第一个显示出与扩散gFID强相关性的指标,其皮尔逊线性相关系数和斯皮尔曼等级相关系数约为0.85。源代码可在 https://github.com/tongdaxu/Making-rFID-Predictive-of-Diffusion-gFID 获取。
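The iFID recipe is simple enough to sketch directly: retrieve each latent's nearest neighbor, interpolate, then decode and score. The snippet below shows the retrieval-and-interpolation step with stand-in latents; the interpolation weight of 0.5 and the stubbed decode/FID calls are assumptions, not the paper's exact settings.

```python
# Minimal sketch of the iFID computation pipeline.
import numpy as np
from sklearn.neighbors import NearestNeighbors

latents = np.random.randn(1000, 16)                  # stand-in VAE latents
nn_index = NearestNeighbors(n_neighbors=2).fit(latents)
_, idx = nn_index.kneighbors(latents)                # idx[:, 0] is the point itself
neighbors = latents[idx[:, 1]]                       # true nearest neighbor

alpha = 0.5                                          # interpolation weight (assumed)
interpolated = alpha * latents + (1 - alpha) * neighbors
# decoded = vae.decode(interpolated)                 # stubbed: decode...
# iFID = fid(decoded, original_dataset)              # ...then score against originals
```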
cs.CV / 9 / 2603.05659

When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On

当评分标准失效时:在虚拟试穿的无参考强化学习后训练中以错误枚举作为奖励
Ikezogwo, Wisdom, Seyfioglu, Mehmet Saygin, Krishna, Ranjay, Bouyarmane, Karim
Abstract
Reinforcement learning with verifiable rewards (RLVR) and Rubrics as Rewards (RaR) have driven strong gains in domains with clear correctness signals and even in subjective domains by synthesizing evaluation criteria from ideal reference answers. But many real-world tasks admit multiple valid outputs and lack the single ideal answer that rubric generation depends on. We identify this reference-free setting as a gap in current post-training methods and propose Implicit Error Counting (IEC) to fill it. Instead of checking what a response gets right against a rubric, IEC enumerates what it gets wrong, applying severity-weighted scores across task-relevant axes and converting them into calibrated per-aspect rewards. We show that naïve explicit enumeration is too noisy for stable optimization, and that two design choices, implicit score emission and group calibration, are necessary to make error counting a reliable reward. As a case study, we validate IEC on virtual try-on (VTO), a domain that is simultaneously too constrained for holistic scoring and too permissive for rubric-based evaluation: subtle garment errors are unacceptable, yet many output variations are correct. We introduce Cascaded Error Counting (CEC) as an evaluation metric, which tracks human preferences well (60% top-1 vs. 30% others), and curate Mismatch-DressCode (MDressBench), a benchmark with maximal attribute mismatch to stress-test reward designs. On MDressBench, IEC outperforms RaR across all metrics (CEC: 5.31 vs. 5.60 on flat references; 5.20 vs. 5.53 on non-flat). On VITON-HD and DressCode, IEC matches or surpasses six baselines on 6 of 8 perceptual metrics. These results suggest that when ideal answers are unavailable, counting errors provides a stronger signal than constructing rubrics.
Chinese Translation
具有可验证奖励的强化学习(RLVR)和将评分标准作为奖励(RaR)在具有明确正确性信号的领域以及通过从理想参考答案中综合评估标准的主观领域中取得了显著进展。然而,许多现实世界任务允许多种有效输出,并且缺乏评分生成所依赖的单一理想答案。我们将这种无参考设置视为当前后训练方法中的一个缺口,并提出隐式错误计数(IEC)来填补这一空白。IEC并不是检查响应在评分标准下的正确性,而是枚举其错误之处,应用基于严重程度加权的分数,跨任务相关维度进行评估,并将其转换为经过校准的每个方面奖励。我们展示了简单的显式枚举对于稳定优化而言噪声过大,而两个设计选择:隐式分数输出和组校准是使错误计数成为可靠奖励所必需的。作为案例研究,我们在虚拟试穿(VTO)领域验证了IEC,该领域在整体评分上过于受限,而在基于评分标准的评估上又过于宽松:微妙的服装错误是不可接受的,但许多输出变体是正确的。我们引入了级联错误计数(CEC)作为评估指标,能够很好地跟踪人类偏好(首选率60%,其他方法为30%),并策划了最大属性不匹配的基准数据集Mismatch-DressCode(MDressBench)以对奖励设计进行压力测试。在MDressBench上,IEC在所有指标上均优于RaR(CEC:平面参考上为5.31 vs. 5.60;非平面上为5.20 vs. 5.53)。在VITON-HD和DressCode上,IEC在8个感知指标中的6个上与六个基线相匹配或超越。这些结果表明,当理想答案不可用时,计数错误提供的信号比构建评分标准更强。
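Schematically, severity-weighted error counting with group calibration might look like the snippet below; the severity weights, aspect axes, and standardization rule are invented for illustration, and the paper's actual design (implicit score emission via a judge model) differs.

```python
# Minimal sketch: turn per-aspect error counts into calibrated rewards.
import numpy as np

severity_weights = np.array([1.0, 3.0, 9.0])           # minor / major / critical (assumed)
# counts per candidate, aspect, severity: shape (n_candidates, n_aspects, 3)
counts = np.random.poisson(lam=1.0, size=(4, 3, 3))

penalty = (counts * severity_weights).sum(axis=-1)     # severity-weighted, per aspect
# group calibration: standardize each aspect across the candidate group
reward = -(penalty - penalty.mean(axis=0)) / (penalty.std(axis=0) + 1e-8)
print(reward)                                          # higher = fewer weighted errors
```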
cs.CV / 10 / 2603.05663

Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding

保持证据链:用于无训练令牌剪枝的视频时间定位的语义证据分配
Li, Jiaqi, Zheng, Shuntian, Shen, Yixian, Huang, Jia-Hong, Lu, Xiaoman, Ni, Minzhe, Guan, Yu
Abstract
Video Temporal Grounding (VTG) localizes the temporal boundaries of a query-relevant moment in long, untrimmed videos, making video-language-model (VLM) pipelines prohibitively expensive. While recent training-free visual token pruning has shown success in video question answering, naively applying these objectives to VTG often causes drastic degradation, as VTG crucially depends on boundary-sensitive evidence and cross-frame reasoning chains. We therefore identify two VTG-specific pruning principles: Evidence Retention (ER), which keeps query-critical patches, especially around event boundaries, and Connectivity Strength (CS), which preserves token-level cross-frame connectivity for long-range evidence aggregation. Building on these insights, we propose SemVID, a training-free pruning framework that constructs a compact yet coherent token subset with complementary semantic roles. SemVID first allocates per-frame token budgets by balancing query relevance and inter-frame variation to avoid over-pruned segments, and then selects three types of tokens: object tokens for diverse query-critical evidence, motion tokens to capture meaningful transitions and serve as cross-frame relays, and a small set of context tokens for scene continuity. Extensive experiments on VTG benchmarks show that SemVID achieves a strong accuracy-efficiency trade-off, retaining up to 95.4% mIoU with only 12.5% visual tokens and delivering up to a 5.8x prefill speedup, consistently outperforming prior methods under the same budgets.
Chinese Translation
视频时间定位(VTG)在长时间未剪辑的视频中定位与查询相关的时刻的时间边界,这使得视频语言模型(VLM)管道的成本极高。尽管最近的无训练视觉令牌剪枝在视频问答中取得了成功,但将这些目标简单地应用于VTG往往会导致显著的性能下降,因为VTG关键依赖于对边界敏感的证据和跨帧推理链。因此,我们确定了两个VTG特定的剪枝原则:证据保留(Evidence Retention, ER),即保留特别是在事件边界附近的查询关键补丁;连接强度(Connectivity Strength, CS),即保留令牌级别的跨帧连接性以便进行长距离证据聚合。在这些见解的基础上,我们提出了SemVID,一个无训练剪枝框架,构建一个紧凑而连贯的令牌子集,具有互补的语义角色。SemVID首先通过平衡查询相关性和帧间变化来分配每帧的令牌预算,以避免过度剪枝的片段,然后选择三种类型的令牌:用于多样化查询关键证据的对象令牌、用于捕捉有意义的过渡并作为跨帧中继的运动令牌,以及一小组用于场景连续性的上下文令牌。在VTG基准上的大量实验表明,SemVID在准确性和效率之间实现了良好的平衡,在仅使用12.5%的视觉令牌的情况下,保留了高达95.4%的mIoU,并提供了高达5.8倍的预填充加速,始终在相同预算下超越了先前的方法。
cs.CV / 11 / 2603.05686

OWL: A Novel Approach to Machine Perception During Motion

OWL:一种新颖的运动中机器感知方法
Raviv, Daniel, Yepes, Juan D.
Abstract
We introduce a perception-related function, OWL, designed to address the complex challenges of 3D perception during motion. It derives its values directly from two fundamental visual motion cues, with one set of cue values per point per time instant. During motion, two visual motion cues relative to a fixation point emerge: 1) perceived local visual looming of points near the fixation point, and 2) perceived rotation of the rigid object relative to the fixation point. It also expresses the relation between two well-known physical quantities, the relative instantaneous directional range and directional translation in 3D between the camera and any visible 3D point, without explicitly requiring their measurement or prior knowledge of their individual values. OWL offers a unified, analytical time-based approach that enhances and simplifies key perception capabilities, including scaled 3D mapping and camera heading. Simulations demonstrate that OWL achieves geometric constancy of 3D objects over time and enables scaled 3D scene reconstruction from visual motion cues alone. By leveraging direct measurements from raw visual motion image sequences, OWL values can be obtained without prior knowledge of stationary environments, moving objects, or camera motion. This approach employs minimalistic, pixel-based, parallel computations, providing an alternative real-time representation for 3D points in relative motion. OWL bridges the gap between theoretical concepts and practical applications in robotics and autonomous navigation and may unlock new possibilities for real-time decision-making and interaction, potentially serving as a building block for next-generation autonomous systems. This paper offers an alternative perspective on machine perception, with implications that may extend to natural perception and contribute to a better understanding of behavioral psychology and neural functionality.
Chinese Translation
我们提出了一种与感知相关的函数OWL,旨在解决运动中三维感知的复杂挑战。该函数的值直接来源于两个基本的视觉运动线索,每个时间点每个点都有一组线索值。在运动过程中,围绕固定点出现了两个视觉运动线索:1)固定点附近点的局部视觉逼近感知,和2)相对于固定点的刚体物体的旋转感知。它还表达了两个著名物理量之间的关系,即相对瞬时方向范围和相机与任何可见三维点之间的方向位移,而不需要明确测量或事先了解它们的各自值。OWL提供了一种统一的、基于时间的分析方法,增强和简化了关键的感知能力,包括缩放的三维映射和相机朝向。模拟结果表明,OWL在时间上实现了三维物体的几何恒常性,并能够仅通过视觉运动线索实现缩放的三维场景重建。通过利用原始视觉运动图像序列的直接测量,OWL的值可以在没有对静态环境、运动物体或相机运动的先验知识的情况下获得。这种方法采用最小化的、基于像素的并行计算,为相对运动中的三维点提供了一种替代的实时表示。OWL弥合了理论概念与机器人技术和自主导航实际应用之间的差距,并可能为实时决策和交互开辟新的可能性,潜在地作为下一代自主系统的构建模块。本文提供了一种关于机器感知的替代视角,其影响可能扩展到自然感知,并有助于更好地理解行为心理学和神经功能。
cs.CV / 12 / 2603.05697

MultiHaystack: Benchmarking Multimodal Retrieval and Reasoning over 40K Images, Videos, and Documents

MultiHaystack:对 40,000 张图像、视频和文档进行多模态检索与推理的基准测试
Xu, Dannong, Yang, Zhongyu, Chen, Jun, Yuan, Yingfang, Hu, Ming, Sun, Lei, Van Gool, Luc, Paudel, Danda Pani, Feng, Chun-Mei
Abstract
Multimodal large language models (MLLMs) achieve strong performance on benchmarks that evaluate text, image, or video understanding separately. However, these settings do not assess a critical real-world requirement, which involves retrieving relevant evidence from large, heterogeneous multimodal corpora prior to reasoning. Most existing benchmarks restrict retrieval to small, single-modality candidate sets, substantially simplifying the search space and overstating end-to-end reliability. To address this gap, we introduce MultiHaystack, the first benchmark designed to evaluate both retrieval and reasoning under large-scale, cross-modal conditions. MultiHaystack comprises over 46,000 multimodal retrieval candidates across documents, images, and videos, along with 747 open yet verifiable questions. Each question is grounded in a unique validated evidence item within the retrieval pool, requiring evidence localization across modalities and fine-grained reasoning. In our study, we find that models perform competitively when provided with the corresponding evidence, but their performance drops sharply when required to retrieve that evidence from the full corpus. Additionally, even the strongest retriever, E5-V, achieves only 40.8% Recall@1, while state-of-the-art MLLMs such as GPT-5 experience a significant drop in reasoning accuracy from 80.86% when provided with the corresponding evidence to 51.4% under top-5 retrieval. These results indicate that multimodal retrieval over heterogeneous pools remains a primary bottleneck for MLLMs, positioning MultiHaystack as a valuable testbed that highlights underlying limitations obscured by small-scale evaluations and promotes retrieval-centric advances in multimodal systems.
Chinese Translation
多模态大型语言模型(MLLMs)在评估文本、图像或视频理解的基准测试中表现出色。然而,这些设置并未评估一个关键的现实世界需求,即在推理之前从大型异构多模态语料库中检索相关证据。现有的大多数基准测试将检索限制在小规模的单一模态候选集,显著简化了搜索空间,并夸大了端到端的可靠性。为了解决这一问题,我们引入了 MultiHaystack,这是第一个旨在评估大规模跨模态条件下检索与推理的基准。MultiHaystack 包含超过 46,000 个跨文档、图像和视频的多模态检索候选项,以及 747 个开放但可验证的问题。每个问题都基于检索池中的一个独特的验证证据项,需要跨模态的证据定位和细致的推理。在我们的研究中,我们发现当提供相应证据时,模型的表现具有竞争力,但当需要从完整语料库中检索该证据时,其表现急剧下降。此外,即使是最强的检索器 E5-V,其 Recall@1 也仅为 40.8%,而像 GPT-5 这样的最先进的 MLLMs 在提供相应证据时推理准确率为 80.86%,而在前 5 次检索下则降至 51.4%。这些结果表明,在异构池中进行多模态检索仍然是 MLLMs 的主要瓶颈,MultiHaystack 作为一个有价值的测试平台,突显了小规模评估所掩盖的潜在局限性,并促进了多模态系统中以检索为中心的进展。
cs.CV / 13 / 2603.05708

Interpretable Perception and Reasoning for Audiovisual Geolocation

可解释的视听地理定位感知与推理
Su, Yiyang, Liu, Xiaoming
Abstract
While recent advances in Multimodal Large Language Models (MLLMs) have improved image-based localization, precise global geolocation remains a formidable challenge due to the inherent ambiguity of visual landscapes and the largely untapped potential of auditory cues. In this paper, we introduce Audiovisual Geolocation, a framework designed to resolve geographic ambiguity through interpretable perception and reasoning. We present AVG, a high-quality global-scale video benchmark for geolocation, comprising 20,000 curated clips across 1,000 distinct locations. To address the complexity of audiovisual geolocation, we propose a three-stage framework: (1) a Perception stage that utilizes a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantically grounded "acoustic atoms"; (2) a Multimodal Reasoning stage that employs an MLLM finetuned via Group Relative Policy Optimization (GRPO) to synthesize these atoms with visual features; and (3) a Precision Prediction stage using Riemannian Flow Matching on the $S^2$ manifold. Our experiments demonstrate that our framework significantly outperforms unimodal baselines. These results indicate that interpretable perception of the soundscape provides a critical, orthogonal signal that, when coupled with multimodal reasoning, enables high-precision global localization.
Chinese Translation
尽管近年来多模态大型语言模型(MLLMs)的进展改善了基于图像的定位,但由于视觉景观固有的模糊性,以及听觉线索的潜力在很大程度上尚未被利用,精确的全球地理定位仍然是一个艰巨的挑战。本文提出了一种视听地理定位框架,旨在通过可解释的感知和推理来解决地理模糊性。我们介绍了AVG,这是一个高质量的全球视频基准,包含20,000个经过精心策划的剪辑,覆盖1,000个不同地点。为了应对视听地理定位的复杂性,我们提出了一个三阶段框架:(1) 感知阶段,利用混合自回归稀疏自编码器将嘈杂音频分解为语义基础的“声学原子”;(2) 多模态推理阶段,采用通过群体相对策略优化(GRPO)微调的MLLM,将这些原子与视觉特征合成;(3) 精确预测阶段,使用在$S^2$流形上的黎曼流匹配。我们的实验表明,该框架显著优于单模态基线。这些结果表明,声音景观的可解释感知提供了一个关键的、正交的信号,当与多模态推理结合时,可以实现高精度的全球定位。
cs.CV / 14 / 2603.05711

Any to Full: Prompting Depth Anything for Depth Completion in One Stage

从任意到完整:单阶段深度完成中的深度提示适应
Zhou, Zhiyuan, Liu, Ruofeng, Liu, Taichi, Zuo, Weijian, Wang, Shanshan, Hong, Zhiqing, Zhang, Desheng
Abstract
Accurate, dense depth estimation is crucial for robotic perception, but commodity sensors often yield sparse or incomplete measurements due to hardware limitations. Existing RGBD-fused depth completion methods learn priors jointly conditioned on the training RGB distribution and specific depth patterns, limiting domain generalization and robustness to various depth patterns. Recent efforts leverage monocular depth estimation (MDE) models to introduce domain-general geometric priors, but current two-stage integration strategies relying on explicit relative-to-metric alignment incur additional computation and introduce structured distortions. To this end, we present Any2Full, a one-stage, domain-general, and pattern-agnostic framework that reformulates completion as a scale-prompting adaptation of a pretrained MDE model. To address varying depth sparsity levels and irregular spatial distributions, we design a Scale-Aware Prompt Encoder. It distills scale cues from sparse inputs into unified scale prompts, guiding the MDE model toward globally scale-consistent predictions while preserving its geometric priors. Extensive experiments demonstrate that Any2Full achieves superior robustness and efficiency. It outperforms OMNI-DC by 32.2% in average AbsREL and delivers a 1.4$\times$ speedup over PriorDA with the same MDE backbone, establishing a new paradigm for universal depth completion. Code and checkpoints are available at https://github.com/zhiyuandaily/Any2Full.
Chinese Translation
准确、密集的深度估计对于机器人感知至关重要,但由于硬件限制,普通传感器往往会产生稀疏或不完整的测量结果。现有的RGBD融合深度完成方法联合学习基于训练RGB分布和特定深度模式的先验知识,这限制了领域泛化能力和对各种深度模式的鲁棒性。最近的研究利用单目深度估计(MDE)模型引入领域通用的几何先验,但当前依赖于显式相对度量对齐的两阶段集成策略会增加额外的计算负担并引入结构性失真。为此,我们提出了Any2Full,这是一个单阶段、领域通用且模式无关的框架,将深度完成重新表述为对预训练MDE模型的尺度提示适应。为了应对不同的深度稀疏水平和不规则的空间分布,我们设计了一个尺度感知提示编码器。它从稀疏输入中提取尺度线索,生成统一的尺度提示,引导MDE模型朝向全局尺度一致的预测,同时保留其几何先验。大量实验表明,Any2Full在鲁棒性和效率上表现优越。它在平均绝对相对误差(AbsREL)上比OMNI-DC提高了32.2%,并在相同的MDE骨干网络上实现了1.4倍的速度提升,为通用深度完成建立了新的范式。代码和检查点可在https://github.com/zhiyuandaily/Any2Full获取。
cs.CV / 15 / 2603.05729

Unlocking ImageNet's Multi-Object Nature: Automated Large-Scale Multilabel Annotation

解锁 ImageNet 的多对象特性:自动化大规模多标签注释
Chen, Junyu, Harun, Md Yousuf, Kanan, Christopher
Abstract
The original ImageNet benchmark enforces a single-label assumption, despite many images depicting multiple objects. This leads to label noise and limits the richness of the learning signal. Multi-label annotations more accurately reflect real-world visual scenes, where multiple objects co-occur and contribute to semantic understanding, enabling models to learn richer and more robust representations. While prior efforts (e.g., ReaL, ImageNetv2) have improved the validation set, there has not yet been a scalable, high-quality multi-label annotation for the training set. To this end, we present an automated pipeline to convert the ImageNet training set into a multi-label dataset, without human annotations. Using self-supervised Vision Transformers, we perform unsupervised object discovery, select regions aligned with original labels to train a lightweight classifier, and apply it to all regions to generate coherent multi-label annotations across the dataset. Our labels show strong alignment with human judgment in qualitative evaluations and consistently improve performance across quantitative benchmarks. Compared to the traditional single-label scheme, models trained with our multi-label supervision achieve consistently better in-domain accuracy across architectures (up to +2.0 top-1 accuracy on ReaL and +1.5 on ImageNet-V2) and exhibit stronger transferability to downstream tasks (up to +4.2 and +2.3 mAP on COCO and VOC, respectively). These results underscore the importance of accurate multi-label annotations for enhancing both classification performance and representation learning. Project code and the generated multi-label annotations are available at https://github.com/jchen175/MultiLabel-ImageNet.
Chinese Translation
原始的 ImageNet 基准测试强加了单标签假设,尽管许多图像描绘了多个对象。这导致了标签噪声,并限制了学习信号的丰富性。多标签注释更准确地反映了现实世界的视觉场景,其中多个对象共存并有助于语义理解,使模型能够学习更丰富和更强健的表示。尽管之前的努力(例如 ReaL、ImageNetv2)改善了验证集,但尚未对训练集进行可扩展的高质量多标签注释。为此,我们提出了一种自动化流程,将 ImageNet 训练集转换为多标签数据集,而无需人工注释。通过自监督视觉变换器(Vision Transformers),我们进行无监督对象发现,选择与原始标签对齐的区域来训练轻量级分类器,并将其应用于所有区域,以生成跨数据集一致的多标签注释。我们的标签在定性评估中与人类判断表现出强烈的一致性,并在定量基准测试中持续提高性能。与传统的单标签方案相比,使用我们多标签监督训练的模型在各个架构上实现了一致更好的领域内准确性(在 ReaL 上提高了最多 +2.0 的 top-1 准确率,在 ImageNet-V2 上提高了 +1.5),并在下游任务中表现出更强的可迁移性(在 COCO 和 VOC 上分别提高了 +4.2 和 +2.3 的 mAP)。这些结果强调了准确的多标签注释在提升分类性能和表示学习中的重要性。项目代码和生成的多标签注释可在 https://github.com/jchen175/MultiLabel-ImageNet 获得。
cs.CV / 16 / 2603.05732

From Phase Grounding to Intelligent Surgical Narratives

从阶段基础到智能外科叙事
Peterson, Ethan, Zhan, Huixin
Abstract
Video surgery timelines are an important part of tool-assisted surgeries, as they allow surgeons to quickly focus on key parts of the procedure. Current methods involve the surgeon filling out a post-operation (OP) report, which is often vague, or manually annotating the surgical videos, which is highly time-consuming. Our proposed method sits between these two extremes: we aim to automatically create a surgical timeline and narrative directly from the surgical video. To achieve this, we employ a CLIP-based multi-modal framework that aligns surgical video frames with textual gesture descriptions. Specifically, we use the CLIP visual encoder to extract representations from surgical video frames and the text encoder to embed the corresponding gesture sentences into a shared embedding space. We then fine-tune the model to improve the alignment between video gestures and textual tokens. Once trained, the model predicts gestures and phases for video frames, enabling the construction of a structured surgical timeline. This approach leverages pretrained multi-modal representations to bridge visual gestures and textual narratives, reducing the need for manual video review and annotation by surgeons.
Chinese Translation
视频外科时间线是工具辅助手术的重要组成部分,因为它们使外科医生能够快速聚焦于手术的关键部分。目前的方法涉及外科医生填写术后(OP)报告,这通常比较模糊,或者手动注释外科视频,这非常耗时。我们提出的方法位于这两者之间:我们旨在直接从外科视频中自动创建手术时间线和叙事。为此,我们采用基于CLIP的多模态框架,将外科视频帧与文本手势描述进行对齐。具体而言,我们使用CLIP视觉编码器从外科视频帧中提取表示,并使用文本编码器将相应的手势句子嵌入到共享的嵌入空间中。然后,我们微调模型以改善视频手势与文本标记之间的对齐。一旦训练完成,模型就可以预测视频帧的手势和阶段,从而构建结构化的外科时间线。这种方法利用预训练的多模态表示来桥接视觉手势和文本叙事,减少了外科医生对视频的手动审查和注释的需求。
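The core scoring mechanism, before fine-tuning, is standard CLIP image-text similarity; the sketch below uses the Hugging Face transformers CLIP implementation with hypothetical gesture descriptions and a hypothetical frame path, only to show the shared-embedding scoring that would drive such a timeline.

```python
# Minimal sketch: score a surgical frame against candidate gesture sentences.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

gestures = ["grasping the needle", "suturing the tissue", "cutting the thread"]
frame = Image.open("surgical_frame.png")              # hypothetical video frame

inputs = processor(text=gestures, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)      # frame-gesture similarity
print(dict(zip(gestures, probs[0].tolist())))
```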
cs.CV / 17 / 2603.05758

Full Dynamic Range Sky-Modelling For Image Based Lighting

基于图像照明的全动态范围天空建模
Maquignaz, Ian J.
Abstract
Accurate environment maps are a key component of modelling real-world outdoor scenes. They enable captivating visual arts, immersive virtual reality and a wide range of scientific and engineering applications. To alleviate the burden of physical capture, physical simulation, and volumetric rendering, sky-models have been proposed as fast, flexible, and cost-saving alternatives. In recent years, sky-models have been extended through deep learning to be more comprehensive and inclusive of cloud formations, but recent work has demonstrated these models fall short in faithfully recreating accurate and photorealistic natural skies. Particularly at higher resolutions, DNN sky-models struggle to accurately model the 14EV+ class-imbalanced solar region, resulting in poor visual quality and scenes illuminated with skewed light transmission, shadows and tones. In this work, we propose Icarus, an all-weather sky-model capable of learning the exposure range of Full Dynamic Range (FDR) physically captured outdoor imagery. Our model allows conditional generation of environment maps with intuitive user-positioning of solar and cloud formations, and extends on current state-of-the-art to enable user-controlled texturing of atmospheric formations. Through our evaluation, we demonstrate Icarus is interchangeable with FDR physically captured outdoor imagery or parametric sky-models, and illuminates scenes with unprecedented accuracy, photorealism, lighting directionality (shadows), and tones in Image Based Lighting (IBL).
Chinese Translation
准确的环境贴图是建模现实世界户外场景的关键组成部分。它们使得引人入胜的视觉艺术、沉浸式虚拟现实以及广泛的科学和工程应用成为可能。为了减轻物理捕捉、物理模拟和体积渲染的负担,天空模型被提出作为快速、灵活且节省成本的替代方案。近年来,天空模型通过深度学习得到了扩展,以更全面地涵盖云层形态,但最近的研究表明,这些模型在忠实再现准确和照片级真实的自然天空方面存在不足。特别是在更高的分辨率下,深度神经网络(DNN)天空模型难以准确建模14EV+类不平衡的太阳区域,导致视觉质量差,场景照明出现失真光传输、阴影和色调。在本研究中,我们提出了Icarus,一个能够学习全动态范围(FDR)物理捕捉户外图像曝光范围的全天气天空模型。我们的模型允许根据用户对太阳和云层形态的直观定位进行条件生成环境贴图,并在当前最先进的技术基础上扩展,支持用户控制大气形态的纹理化。通过我们的评估,我们证明Icarus可以与FDR物理捕捉的户外图像或参数化天空模型互换,并以前所未有的准确性、照片级真实感、光照方向性(阴影)和色调照亮场景,适用于基于图像的照明(IBL)。
cs.CV / 18 / 2603.05769

Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers

层级实例绑定用于文本到图像扩散变换器中的区域和遮挡控制
Chen, Ruidong, Bai, Yancheng, Zhang, Xuanpu, Zeng, Jianhao, Wang, Lanjun, Song, Dan, Sun, Lei, Chu, Xiangxiang, Liu, Anan
Abstract
Region-instructed layout control in text-to-image generation is highly practical, yet existing methods suffer from limitations: (i) training-based approaches inherit data bias and often degrade image quality, and (ii) current techniques struggle with occlusion order, limiting real-world usability. To address these issues, we propose LayerBind. By modeling regional generation as distinct layers and binding them during the generation, our method enables precise regional and occlusion controllability. Our motivation stems from the observation that spatial layout and occlusion are established at a very early denoising stage, suggesting that rearranging the early latent structure is sufficient to modify the final output. Building on this, we structure the scheme into two phases: instance initialization and subsequent semantic nursing. (1) First, leveraging the contextual sharing mechanism in multimodal joint attention, Layer-wise Instance Initialization creates per-instance branches that attend to their own regions while anchoring to the shared background. At a designated early step, these branches are fused according to the layer order to form a unified latent with a pre-established layout. (2) Then, Layer-wise Semantic Nursing reinforces regional details and maintains the occlusion order via a layer-wise attention enhancement. Specifically, a sequential layered attention path operates alongside the standard global path, with updates composited under a layer-transparency scheduler. LayerBind is training-free and plug-and-play, serving as a regional and occlusion controller across Diffusion Transformers. Beyond generation, it natively supports editable workflows, allowing for flexible modifications like changing instances or rearranging visible orders. Both qualitative and quantitative results demonstrate LayerBind's effectiveness, highlighting its strong potential for creative applications.
Chinese Translation
在文本到图像生成中,区域指令布局控制具有很高的实用性,但现有方法存在一些局限性:(i)基于训练的方法继承了数据偏差,往往导致图像质量下降;(ii)当前技术在遮挡顺序上存在困难,限制了其在现实世界中的可用性。为了解决这些问题,我们提出了LayerBind。通过将区域生成建模为不同的层并在生成过程中进行绑定,我们的方法实现了精确的区域和遮挡可控性。我们的动机源于观察到空间布局和遮挡在去噪的早期阶段就已建立,这表明重新排列早期潜在结构足以修改最终输出。在此基础上,我们将该方案结构化为两个阶段:实例初始化和后续的语义护理。(1)首先,利用多模态联合注意力中的上下文共享机制,层级实例初始化创建每个实例的分支,这些分支关注自身区域,同时锚定于共享背景。在指定的早期步骤,这些分支根据层级顺序融合,以形成具有预先建立布局的统一潜在表示。(2)然后,层级语义护理通过层级注意力增强来强化区域细节并维持遮挡顺序。具体而言,一个顺序层级注意力路径与标准全局路径并行操作,更新在层透明度调度器下进行组合。LayerBind 是无训练且即插即用的,作为扩散变换器中的区域和遮挡控制器。除了生成之外,它还原生支持可编辑工作流程,允许灵活修改,如更改实例或重新排列可见顺序。定性和定量结果均表明LayerBind的有效性,突显了其在创意应用中的强大潜力。
cs.CV / 19 / 2603.05781

Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval

视觉词汇与 BM25 相遇:用于图像检索的稀疏自编码器视觉词汇评分
Han, Donghoon, Park, Eunhwan, Seo, Seunghyeon
Abstract
Dense image retrieval is accurate but offers limited interpretability and attribution, and it can be compute-intensive at scale. We present \textbf{BM25-V}, which applies Okapi BM25 scoring to sparse visual-word activations from a Sparse Auto-Encoder (SAE) on Vision Transformer patch features. Across a large gallery, visual-word document frequencies are highly imbalanced and follow a Zipfian-like distribution, making BM25's inverse document frequency (IDF) weighting well suited for suppressing ubiquitous, low-information words and emphasizing rare, discriminative ones. BM25-V retrieves high-recall candidates via sparse inverted-index operations and serves as an efficient first-stage retriever for dense reranking. Across seven benchmarks, BM25-V achieves Recall@200 $\geq$ 0.993, enabling a two-stage pipeline that reranks only $K{=}200$ candidates per query and recovers near-dense accuracy within $0.2$\% on average. An SAE trained once on ImageNet-1K transfers zero-shot to seven fine-grained benchmarks without fine-tuning, and BM25-V retrieval decisions are attributable to specific visual words with quantified IDF contributions.
Chinese Translation
密集图像检索虽然准确,但在可解释性和归因方面有限,并且在大规模应用时计算开销较大。我们提出了BM25-V,该方法将 Okapi BM25 评分应用于来自稀疏自编码器(Sparse Auto-Encoder, SAE)在视觉变换器(Vision Transformer)补丁特征上的稀疏视觉词激活。在一个大型图库中,视觉词文档频率高度不平衡,并遵循类似 Zipf 分布的特征,这使得 BM25 的逆文档频率(IDF)加权非常适合抑制普遍存在的低信息词汇,并强调稀有的、具有区分性的词汇。BM25-V 通过稀疏倒排索引操作检索高召回候选项,并作为密集重排序的高效第一阶段检索器。在七个基准测试中,BM25-V 实现了 Recall@200 $\geq$ 0.993,使得仅对每个查询重排序 $K{=}200$ 个候选项的两阶段管道成为可能,并在平均情况下恢复近密集的准确性,误差在 $0.2\%$ 内。一次在 ImageNet-1K 上训练的 SAE 可以零样本迁移到七个细粒度基准测试,而无需微调,BM25-V 的检索决策可以归因于特定的视觉词汇,并量化其 IDF 贡献。
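Since BM25-V reuses the textbook Okapi BM25 formula over visual-word counts, the scoring can be reproduced in a few lines; everything below (vocabulary size, activation statistics, hyperparameters k1 and b) is a toy stand-in for the paper's SAE outputs.

```python
# Minimal sketch: Okapi BM25 over sparse visual-word counts per image.
import numpy as np

def bm25_scores(query_words, doc_term_freqs, doc_lens, df, n_docs, k1=1.5, b=0.75):
    """doc_term_freqs: (n_docs, vocab) counts of visual words per image."""
    avg_len = doc_lens.mean()
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)  # ubiquitous words get low weight
    tf = doc_term_freqs[:, query_words]
    norm = k1 * (1 - b + b * doc_lens[:, None] / avg_len)
    return (idf[query_words] * tf * (k1 + 1) / (tf + norm)).sum(axis=1)

tf = np.random.poisson(0.05, size=(10_000, 4096))         # sparse toy activations
scores = bm25_scores(np.array([7, 42, 99]), tf, tf.sum(1), (tf > 0).sum(0), 10_000)
print(scores.argsort()[::-1][:5])                         # top-5 candidate images
```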
cs.CV / 20 / 2603.05787

Spectral Probing of Feature Upsamplers in 2D-to-3D Scene Reconstruction

二维到三维场景重建中特征上采样器的光谱探测
Xiao, Ling, Xiu, Yuliang, Chen, Yue, Wang, Guoming, Yamasaki, Toshihiko
Abstract
A typical 2D-to-3D pipeline takes multi-view images as input, where a Vision Foundation Model (VFM) extracts features that are spatially upsampled to dense representations for 3D reconstruction. If dense features across views preserve geometric consistency, differentiable rendering can recover an accurate 3D representation, making the feature upsampler a critical component. Recent learnable upsampling methods mainly aim to enhance spatial details, such as sharper geometry or richer textures, yet their impact on 3D awareness remains underexplored. To address this gap, we introduce a spectral diagnostic framework with six complementary metrics that characterize amplitude redistribution, structural spectral alignment, and directional stability. Across classical interpolation and learnable upsampling methods on CLIP and DINO backbones, we observe three key findings. First, structural spectral consistency (SSC/CSC) is the strongest predictor of NVS quality, whereas High-Frequency Spectral Slope Drift (HFSS) often correlates negatively with reconstruction performance, indicating that emphasizing high-frequency details alone does not necessarily improve 3D reconstruction. Second, geometry and texture respond to different spectral properties: Angular Energy Consistency (ADC) correlates more strongly with geometry-related metrics, while SSC/CSC influence texture fidelity slightly more than geometric accuracy. Third, although learnable upsamplers often produce sharper spatial features, they rarely outperform classical interpolation in reconstruction quality, and their effectiveness depends on the reconstruction model. Overall, our results indicate that reconstruction quality is more closely related to preserving spectral structure than to enhancing spatial detail, highlighting spectral consistency as an important principle for designing upsampling strategies in 2D-to-3D pipelines.
Chinese Translation
典型的二维到三维管道以多视图图像为输入,其中视觉基础模型(Vision Foundation Model, VFM)提取特征并将其空间上采样为用于三维重建的密集表示。如果跨视图的密集特征保持几何一致性,可微分渲染能够恢复准确的三维表示,这使得特征上采样器成为关键组件。近期的可学习上采样方法主要旨在增强空间细节,例如更清晰的几何形状或更丰富的纹理,但它们对三维感知的影响仍未得到充分探索。为了解决这一空白,我们引入了一个光谱诊断框架,包含六个互补指标,用于表征幅度重分布、结构光谱对齐和方向稳定性。在对CLIP和DINO骨干网络上的经典插值和可学习上采样方法进行分析时,我们观察到三个关键发现。首先,结构光谱一致性(Structural Spectral Consistency, SSC/CSC)是NVS质量的最强预测因子,而高频光谱斜率漂移(High-Frequency Spectral Slope Drift, HFSS)通常与重建性能呈负相关,这表明单纯强调高频细节并不一定能改善三维重建。其次,几何形状和纹理对不同的光谱特性有不同的响应:角能量一致性(Angular Energy Consistency, ADC)与几何相关指标的相关性更强,而SSC/CSC对纹理保真度的影响略大于几何准确性。第三,尽管可学习的上采样器通常产生更清晰的空间特征,但在重建质量上,它们很少超过经典插值方法,其有效性依赖于重建模型。总体而言,我们的结果表明,重建质量与保持光谱结构的关系更为密切,而非增强空间细节,强调了光谱一致性在设计二维到三维管道中的上采样策略时的重要性。
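To make the spectral diagnostics concrete, here is one plausible reconstruction of a slope-style metric: radially average the 2D power spectrum of a feature map and fit a log-log slope over the upper frequency band. The binning and band choice are our assumptions, not the paper's exact definitions of HFSS or the other five metrics.

```python
# Minimal sketch: radially averaged power spectrum and its log-log slope.
import numpy as np

def radial_spectral_slope(feat2d, band=(0.5, 1.0)):
    power = np.abs(np.fft.fftshift(np.fft.fft2(feat2d))) ** 2
    h, w = power.shape
    yy, xx = np.indices(power.shape)
    r = np.hypot(yy - h / 2, xx - w / 2).astype(int)
    counts = np.maximum(np.bincount(r.ravel()), 1)
    profile = np.bincount(r.ravel(), power.ravel()) / counts  # radial average
    freqs = np.arange(1, len(profile))                        # skip the DC bin
    lo, hi = int(band[0] * freqs[-1]), int(band[1] * freqs[-1])
    slope, _ = np.polyfit(np.log(freqs[lo:hi]),
                          np.log(profile[1:][lo:hi] + 1e-12), 1)
    return slope  # drift of this slope across upsamplers ~ an HFSS-style quantity

print(radial_spectral_slope(np.random.randn(64, 64)))
```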
cs.CV / 21 / 2603.05807

EventGeM: Global-to-Local Feature Matching for Event-Based Visual Place Recognition

EventGeM:基于事件的视觉位置识别中的全局到局部特征匹配
Hines, Adam D., Nair, Gokul B., Marticorena, Nicolás, Milford, Michael, Fischer, Tobias
Abstract
Dynamic vision sensors, also known as event cameras, are rapidly rising in popularity for robotic and computer vision tasks due to their sparse activation and high temporal resolution. Event cameras have been used in robotic navigation and localization tasks where accurate positioning needs to occur on small and frequent time scales, or when energy concerns are paramount. In this work, we present EventGeM, a state-of-the-art global-to-local feature fusion pipeline for event-based Visual Place Recognition. We use a pre-trained vision transformer (ViT-S/16) backbone to obtain global feature patch embeddings from event histogram images for initial match predictions. Local feature keypoints are then detected using a pre-trained MaxViT backbone for 2D-homography-based re-ranking with RANSAC. For additional re-ranking refinement, we subsequently use a pre-trained vision foundation model for depth estimation to compare structural similarity between references and queries. Our work performs state-of-the-art localization when compared to the best currently available event-based place recognition method across several benchmark datasets and lighting conditions, all whilst being fully capable of running in real time when deployed across a variety of compute architectures. We demonstrate the capability of EventGeM in a real-world deployment on a robotic platform for online localization using event streams directly from an event camera. Project page: https://eventgemvpr.github.io/
Chinese Translation
动态视觉传感器,也称为事件相机,由于其稀疏激活和高时间分辨率,在机器人和计算机视觉任务中迅速受到欢迎。事件相机已被应用于机器人导航和定位任务,这些任务需要在小且频繁的时间尺度上进行精确定位,或在能耗问题至关重要时使用。在本研究中,我们提出了EventGeM,一个用于基于事件的视觉位置识别的最先进的全局到局部特征融合管道。我们使用预训练的视觉变换器(ViT-S/16)作为主干网络,从事件直方图图像中获取全局特征块嵌入,用于初始匹配预测。然后,使用预训练的MaxViT主干网络检测局部特征关键点,以便通过RANSAC进行基于2D单应性的重新排序。为了进一步的重新排序精细化,我们随后使用了预训练的视觉基础模型进行深度估计,以比较参考和查询之间的结构相似性。与当前最佳的基于事件的位置识别方法相比,我们的工作在多个基准数据集和光照条件下实现了最先进的定位性能,并且在各种计算架构上部署时完全能够实时运行。我们展示了EventGeM在机器人平台上的实际部署能力,利用来自事件相机的事件流进行在线定位。项目页面:https://eventgemvpr.github.io/
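The homography-based re-ranking stage follows a familiar pattern that can be sketched with OpenCV: match keypoints, fit a RANSAC homography, and score by inlier count. ORB features below are a stand-in for the paper's MaxViT keypoints, and the 5.0 px reprojection threshold is an assumption.

```python
# Minimal sketch: geometric re-ranking score via RANSAC-homography inliers.
import cv2
import numpy as np

def inlier_score(img_query, img_ref):
    orb = cv2.ORB_create()
    kq, dq = orb.detectAndCompute(img_query, None)
    kr, dr = orb.detectAndCompute(img_ref, None)
    if dq is None or dr is None:
        return 0
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(dq, dr)
    if len(matches) < 4:
        return 0
    src = np.float32([kq[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kr[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return 0 if mask is None else int(mask.sum())  # inliers = re-ranking score

# rerank a retrieved shortlist by geometric consistency with the query:
# scores = [inlier_score(query_img, ref) for ref in shortlist]
```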
cs.CV / 22 / 2603.05811

Training-free Latent Inter-Frame Pruning with Attention Recovery

无训练的潜在帧间剪枝与注意力恢复
Menn, Dennis, Yang, Yuedong, Wang, Bokun, Wei, Xiwen, Munir, Mustafa, Liang, Feng, Marculescu, Radu, Xu, Chenfeng, Marculescu, Diana
Abstract
Current video generation models suffer from high computational latency, making real-time applications prohibitively costly. In this paper, we address this limitation by exploiting the temporal redundancy inherent in video latent patches. To this end, we propose the Latent Inter-frame Pruning with Attention Recovery (LIPAR) framework, which detects and skips recomputing duplicated latent patches. Additionally, we introduce a novel Attention Recovery mechanism that approximates the attention values of pruned tokens, thereby removing visual artifacts arising from naively applying the pruning method. Empirically, our method increases video editing throughput by $1.45\times$, on average achieving 12.2 FPS on an NVIDIA A6000 compared to the baseline 8.4 FPS. The proposed method does not compromise generation quality and can be seamlessly integrated with the model without additional training. Our approach effectively bridges the gap between traditional compression algorithms and modern generative pipelines.
Chinese Translation
当前的视频生成模型面临着高计算延迟的问题,使得实时应用变得成本高昂。本文通过利用视频潜在补丁中固有的时间冗余来解决这一限制。为此,我们提出了潜在帧间剪枝与注意力恢复(Latent Inter-frame Pruning with Attention Recovery, LIPAR)框架,该框架能够检测并跳过重复的潜在补丁的重新计算。此外,我们引入了一种新颖的注意力恢复机制,该机制近似剪枝令牌的注意力值,从而消除因简单应用剪枝方法而产生的视觉伪影。实证结果表明,我们的方法将视频编辑吞吐量平均提高至原来的1.45倍,在NVIDIA A6000上实现了12.2 FPS,相较于基线的8.4 FPS。所提方法不影响生成质量,并且可以无缝集成到模型中,无需额外训练。我们的方法有效地弥合了传统压缩算法与现代生成管道之间的差距。
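The detection of duplicated latent patches can be pictured as a per-patch similarity test against the previous frame, as in the sketch below; the cosine-similarity criterion and the 0.95 threshold are illustrative assumptions rather than LIPAR's exact rule.

```python
# Minimal sketch: skip recomputation for latent patches that barely changed.
import torch
import torch.nn.functional as F

prev = torch.randn(1024, 64)                  # per-patch latents, frame t-1
curr = prev + 0.01 * torch.randn_like(prev)   # frame t, mostly unchanged
curr[:100] = torch.randn(100, 64)             # a few genuinely new patches

sim = F.cosine_similarity(curr, prev, dim=-1)
recompute = sim < 0.95                        # only these patches are recomputed;
                                              # the rest reuse cached results
print(f"recompute {recompute.sum().item()} of {len(recompute)} patches")
```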
cs.CV / 23 / 2603.05812

Margin and Consistency Supervision for Calibrated and Robust Vision Models

用于校准和鲁棒视觉模型的边际与一致性监督
Khazem, Salim
Abstract
Deep vision classifiers often achieve high accuracy while remaining poorly calibrated and fragile under small distribution shifts. We present Margin and Consistency Supervision (MaCS), a simple, architecture-agnostic regularization framework that jointly enforces logit-space separation and local prediction stability. MaCS augments cross-entropy with (i) a hinge-squared margin penalty that enforces a target logit gap between the correct class and the strongest competitor, and (ii) a consistency regularizer that minimizes the KL divergence between predictions on clean inputs and mildly perturbed views. We provide a unifying theoretical analysis showing that increasing the classification margin while reducing local sensitivity (formalized via a Lipschitz-type stability proxy) yields improved generalization guarantees and a provable robustness-radius bound that scales with the margin-to-sensitivity ratio. Across several image classification benchmarks and several backbones spanning CNNs and Vision Transformers, MaCS consistently improves calibration (lower ECE and NLL) and robustness to common corruptions while preserving or improving top-1 accuracy. Our approach requires no additional data, no architectural changes, and negligible inference overhead, making it an effective drop-in replacement for standard training objectives.
Chinese Translation
深度视觉分类器通常在高准确率的同时,在小的分布变化下表现出较差的校准性和脆弱性。我们提出了边际与一致性监督(Margin and Consistency Supervision,MaCS),这是一个简单的、与架构无关的正则化框架,旨在共同强化对数空间的分离性和局部预测的稳定性。MaCS通过(i)施加一个铰链平方边际惩罚,强制正确类别与最强竞争者之间保持目标对数间隔,以及(ii)一个一致性正则化项,最小化干净输入与轻微扰动视图之间预测的KL散度,从而增强交叉熵。我们提供了一个统一的理论分析,表明在通过Lipschitz类型的稳定性代理形式化的局部敏感性降低的同时,增加分类边际可以提高泛化保证,并且可证明的鲁棒半径界限与边际-敏感性比率成比例增长。在多个图像分类基准测试和涵盖卷积神经网络(CNN)及视觉变换器(Vision Transformers)的多个骨干网络上,MaCS始终改善了校准性(降低了ECE和NLL)和对常见干扰的鲁棒性,同时保持或提高了top-1准确率。我们的方法不需要额外的数据,不需要架构更改,并且推理开销微乎其微,使其成为标准训练目标的有效替代方案。
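The MaCS objective as described maps almost directly to code: cross-entropy plus a hinge-squared penalty on the logit gap plus a KL consistency term. The PyTorch sketch below fixes arbitrary loss weights and a target margin of 2.0; treat those, and the exact KL direction, as assumptions.

```python
# Minimal sketch of a MaCS-style training objective.
import torch
import torch.nn.functional as F

def macs_loss(logits_clean, logits_perturbed, targets, margin=2.0, lam=0.1, mu=0.5):
    ce = F.cross_entropy(logits_clean, targets)

    # hinge-squared margin: correct logit minus strongest competitor
    correct = logits_clean.gather(1, targets[:, None]).squeeze(1)
    masked = logits_clean.scatter(1, targets[:, None], float("-inf"))
    runner_up = masked.max(dim=1).values
    margin_pen = F.relu(margin - (correct - runner_up)).pow(2).mean()

    # consistency: KL(clean || perturbed) between predictive distributions
    kl = F.kl_div(F.log_softmax(logits_perturbed, dim=1),
                  F.softmax(logits_clean, dim=1), reduction="batchmean")
    return ce + lam * margin_pen + mu * kl

loss = macs_loss(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))
```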
cs.CV / 24 / 2603.05844

Remote Sensing Image Classification Using Deep Ensemble Learning

基于深度集成学习的遥感图像分类
Islam, Niful, Ahmed, Md. Rayhan, Fahad, Nur Mohammad, Islam, Salekul, Islam, A. K. M. Muzahidul, Mukta, Saddam, Shatabda, Swakkhar
Abstract
Remote sensing imagery plays a crucial role in many applications and requires accurate computerized classification techniques. Reliable classification is essential for transforming raw imagery into structured and usable information. Convolutional Neural Networks (CNNs), the most widely used models for image classification, excel at local feature extraction but struggle to capture global contextual information. Vision Transformers (ViTs) address this limitation through self attention mechanisms that model long-range dependencies. Integrating CNNs and ViTs, therefore, leads to better performance than standalone architectures. However, the use of additional CNN and ViT components does not lead to further performance improvement and instead introduces a bottleneck caused by redundant feature representations. In this research, we propose a fusion model that combines the strengths of CNNs and ViTs for remote sensing image classification. To overcome the performance bottleneck, the proposed approach trains four independent fusion models that integrate CNN and ViT backbones and combine their outputs at the final prediction stage through ensembling. The proposed method achieves accuracy rates of 98.10 percent, 94.46 percent, and 95.45 percent on the UC Merced, RSSCN7, and MSRSI datasets, respectively. These results outperform competing architectures and highlight the effectiveness of the proposed solution, particularly due to its efficient use of computational resources during training.
Chinese Translation
遥感图像在许多应用中发挥着至关重要的作用,需要准确的计算机分类技术。可靠的分类对于将原始图像转化为结构化和可用的信息至关重要。尽管卷积神经网络(CNN)通常用于图像分类,但它们在局部特征提取方面表现优异,但在捕捉全局上下文信息方面存在困难。视觉变换器(ViT)通过自注意力机制解决了这一限制,能够建模长距离依赖关系。因此,集成CNN和ViT的模型在性能上优于单一架构。然而,额外的CNN和ViT组件的使用并未进一步提升性能,反而引入了由冗余特征表示造成的瓶颈。在本研究中,我们提出了一种融合模型,结合了CNN和ViT在遥感图像分类中的优势。为了解决性能瓶颈,所提方法训练了四个独立的融合模型,这些模型集成了CNN和ViT的主干,并在最终预测阶段通过集成结合它们的输出。所提方法在UC Merced、RSSCN7和MSRSI数据集上分别达到了98.10%、94.46%和95.45%的准确率。这些结果超越了竞争架构,突显了所提解决方案的有效性,特别是在训练过程中高效利用计算资源的表现。
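The final ensembling step admits a one-function sketch: run the four fusion models and combine their outputs at prediction time. Averaging softmax probabilities, as below, is one standard choice; the abstract does not specify the exact combination rule, so treat it as an assumption.

```python
# Minimal sketch: average the predictive distributions of an ensemble.
import torch

@torch.no_grad()
def ensemble_predict(models, images):
    # each model is a trained CNN+ViT fusion network returning class logits
    probs = [m(images).softmax(dim=-1) for m in models]    # each: (B, n_classes)
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)   # averaged decision
```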
cs.CV / 25 / 2603.05845

Cog2Gen3D: Sculpturing 3D Semantic-Geometric Cognition for 3D Generation

Cog2Gen3D:为3D生成雕塑3D语义-几何认知
Wang, Haonan, Zhou, Hanyu, Liu, Haoyue, Gu, Tao, Yan, Luxin
Abstract
Generative models have achieved success in producing semantically plausible 2D images, but 3D generation remains challenging due to the absence of spatial geometry constraints. Typically, existing methods utilize geometric features as conditions to enhance spatial awareness. However, these methods can only model relative relationships and are prone to scale inconsistency of absolute geometry. Thus, we argue that semantic information and absolute geometry empower 3D cognition, thereby enabling controllable 3D generation for the physical world. In this work, we propose Cog2Gen3D, a 3D cognition-guided diffusion framework for 3D generation. Our model is guided by three key designs: 1) Cognitive Feature Embeddings. We encode different modalities into semantic and geometric representations and further extract logical representations. 2) 3D Latent Cognition Graph. We structure different representations into dual-stream semantic-geometric graphs and fuse them via common-based cross-attention to obtain a 3D cognition graph. 3) Cognition-Guided Latent Diffusion. We leverage the fused 3D cognition graph as the condition to guide the latent diffusion process for 3D Gaussian generation. Under this unified framework, the 3D cognition graph ensures the physical plausibility and structural rationality of 3D generation. Moreover, we construct a validation subset based on the Marble World Labs. Extensive experiments demonstrate that our Cog2Gen3D significantly outperforms existing methods in both semantic fidelity and geometric plausibility.
Chinese Translation
生成模型在生成语义上合理的2D图像方面取得了成功,但由于缺乏空间几何约束,3D生成仍然面临挑战。通常,现有方法利用几何特征作为条件来增强空间意识。然而,这些方法只能建模相对关系,并且容易导致绝对几何的尺度不一致。因此,我们认为语义信息和绝对几何赋予了3D认知能力,从而实现对物理世界的可控3D生成。在本研究中,我们提出了Cog2Gen3D,一种基于3D认知引导的扩散框架用于3D生成。我们的模型由三个关键设计引导:1)认知特征嵌入。我们将不同模态编码为语义和几何表示,并进一步提取逻辑表示。2)3D潜在认知图。我们将不同的表示结构化为双流语义-几何图,并通过基于共同的交叉注意力进行融合,以获得3D认知图。3)认知引导的潜在扩散。我们利用融合的3D认知图作为条件,引导潜在扩散过程以实现3D高斯生成。在这个统一框架下,3D认知图确保了3D生成的物理合理性和结构合理性。此外,我们基于Marble World Labs构建了一个验证子集。大量实验表明,我们的Cog2Gen3D在语义保真度和几何合理性方面显著优于现有方法。
cs.CV / 26 / 2603.05851

VS3R: Robust Full-frame Video Stabilization via Deep 3D Reconstruction

VS3R:通过深度3D重建实现稳健的全帧视频稳定
Zhu, Muhua, Jin, Xinhao, Zhang, Yu, Xue, Yifei, Ji, Tie, Lao, Yizhen
Abstract
Video stabilization aims to mitigate camera shake but faces a fundamental trade-off between geometric robustness and full-frame consistency. While 2D methods suffer from aggressive cropping, 3D techniques are often undermined by fragile optimization pipelines that fail under extreme motion. To bridge this gap, we propose VS3R, a framework that synergizes feed-forward 3D reconstruction with generative video diffusion. Our pipeline jointly estimates camera parameters, depth, and masks to ensure all-scenario reliability, and introduces a Hybrid Stabilized Rendering module that fuses semantic and geometric cues for dynamic consistency. Finally, a Dual-Stream Video Diffusion Model restores disoccluded regions and rectifies artifacts by combining structural guidance with semantic anchors. Collectively, VS3R achieves high-fidelity, full-frame stabilization across diverse camera models and significantly outperforms state-of-the-art methods in robustness and visual quality.
Chinese Translation
视频稳定旨在减轻相机抖动,但面临几何稳健性与全帧一致性之间的基本权衡。虽然2D方法由于过度裁剪而受到限制,3D技术常常受到脆弱优化流程的影响,在极端运动下无法有效工作。为了解决这一问题,我们提出了VS3R,一个将前馈3D重建与生成视频扩散相结合的框架。我们的流程共同估计相机参数、深度和掩码,以确保在各种场景下的可靠性,并引入了一个混合稳定渲染模块,该模块融合了语义和几何线索以实现动态一致性。最后,双流视频扩散模型通过将结构指导与语义锚点相结合,恢复了被遮挡区域并修正了伪影。总体而言,VS3R在不同相机模型上实现了高保真度的全帧稳定,并在稳健性和视觉质量方面显著超越了现有最先进的方法。
cs.CV / 27 / 2603.05867

TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis

肿瘤链:用于可追溯临床肿瘤分析的交错多模态思维链推理
Li, Sijing, Qiu, Zhongwei, Liu, Jiang, Zhang, Wenqiao, Lin, Tianwei, Xie, Yihan, An, Jianxiang, Yun, Boxiang, Yang, Chenglin, Xiao, Jun, Guo, Guangyu, Yao, Jiawen, Liu, Wei, Gao, Yuan, Yan, Ke, Cao, Weiwei, Zheng, Zhilin, Mok, Tony C. W., Cao, Kai, Shi, Yu, Zhang, Jiuyu, Zhou, Jian, Ooi, Beng Chin, Xia, Yingda, Zhang, Ling
Abstract
Accurate tumor analysis is central to clinical radiology and precision oncology, where early detection, reliable lesion characterization, and pathology-level risk assessment guide diagnosis and treatment planning. Chain-of-Thought (CoT) reasoning is particularly important in this setting because it enables step-by-step interpretation from imaging findings to clinical impressions and pathology conclusions, improving traceability and reducing diagnostic errors. Here, we target the clinical tumor analysis task and build a large-scale benchmark that operationalizes a multimodal reasoning pipeline, spanning findings, impressions, and pathology predictions. We curate TumorCoT, a large-scale dataset of 1.5M CoT-labeled VQA instructions paired with 3D CT scans, with step-aligned rationales and cross-modal alignments along the trajectory from findings to impression to pathology, enabling evaluation of both answer accuracy and reasoning consistency. We further propose TumorChain, a multimodal interleaved reasoning framework that tightly couples 3D imaging encoders, clinical text understanding, and organ-level vision-language alignment. Through cross-modal alignment and iterative interleaved causal reasoning, TumorChain grounds visual evidence, aggregates conclusions, and issues pathology predictions after multiple rounds of self-refinement, improving traceability and reducing hallucination risk. Experiments show consistent improvements over strong baselines in lesion detection, impression generation, and pathology classification, and demonstrate strong generalization on the DeepTumorVQA benchmark. These results highlight the potential of multimodal reasoning for reliable and interpretable tumor analysis in clinical practice. Detailed information about our project can be found on our project homepage at https://github.com/ZJU4HealthCare/TumorChain.
Chinese Translation
准确的肿瘤分析是临床放射学和精准肿瘤学的核心,其中早期检测、可靠的病变特征描述和病理级风险评估指导着诊断和治疗规划。在这一背景下,思维链(Chain-of-Thought, CoT)推理尤为重要,因为它能够从影像发现到临床印象再到病理结论进行逐步解释,提高可追溯性并减少诊断错误。在此,我们针对临床肿瘤分析任务构建了一个大规模基准,实施了一个跨越发现、印象和病理预测的多模态推理管道。我们整理了TumorCoT,这是一个包含150万条CoT标注的视觉问答(VQA)指令与3D CT扫描配对的大规模数据集,提供了与发现、印象和病理之间轨迹对齐的逐步推理,能够评估答案的准确性和推理的一致性。我们进一步提出了TumorChain,一个紧密结合3D影像编码器、临床文本理解和器官级视觉-语言对齐的多模态交错推理框架。通过跨模态对齐和迭代交错因果推理,TumorChain将视觉证据与结论相结合,并在多轮自我精炼后发出病理预测,提高可追溯性并降低幻觉风险。实验结果显示,在病变检测、印象生成和病理分类方面相较于强基线有持续的改善,并在DeepTumorVQA基准上展现出强大的泛化能力。这些结果突显了多模态推理在临床实践中实现可靠和可解释的肿瘤分析的潜力。有关我们项目的详细信息,请访问我们的项目主页:https://github.com/ZJU4HealthCare/TumorChain。
cs.CV / 28 / 2603.05869

PatchCue: Enhancing Vision-Language Model Reasoning with Patch-Based Visual Cues

PatchCue:通过基于补丁的视觉线索增强视觉-语言模型推理
Qi, Yukun, Fu, Pei, Li, Hang, Liu, Yuhan, Jiang, Chao, Qin, Bin, Luo, Zhenbo, Luan, Jian
Abstract
Vision-Language Models (VLMs) have achieved remarkable progress on a wide range of challenging multimodal understanding and reasoning tasks. However, existing reasoning paradigms, such as the classical Chain-of-Thought (CoT), rely solely on textual information and often underutilize important visual cues. While prior work has incorporated pixel-level visual cues, these representations require precise spatial localization, introducing additional learning complexity. To address this, we propose PatchCue, a novel patch-based visual cue paradigm designed to significantly enhance the visual reasoning capabilities of VLMs. By partitioning images into patches and representing cues at the patch level, PatchCue aligns better with human perceptual habits and leverages the patch-tokenized input of modern VLMs. We train VLMs using a two-stage approach: cold-start supervised fine-tuning to output patch-level cues, followed by reinforcement learning with a process-supervised cue reward that guides intermediate visual reasoning steps. Extensive experiments on multiple VLMs and diverse benchmarks, including general visual question answering, complex reasoning, and document understanding, demonstrate that PatchCue consistently improves overall model performance. Our results show that patch-level cues outperform both pixel-level bounding boxes and point-based cues, providing a more effective and cognitively aligned visual reasoning paradigm.
Chinese Translation
视觉-语言模型(VLMs)在多种具有挑战性的多模态理解和推理任务上取得了显著进展。然而,现有的推理范式,如经典的思维链(Chain-of-Thought, CoT),仅依赖文本信息,往往未能充分利用重要的视觉线索。尽管先前的研究已将像素级视觉线索纳入考虑,但这些表示需要精确的空间定位,增加了学习的复杂性。为了解决这一问题,我们提出了PatchCue,一种新颖的基于补丁的视觉线索范式,旨在显著增强VLMs的视觉推理能力。通过将图像划分为补丁并在补丁级别表示线索,PatchCue更好地与人类的感知习惯相一致,并利用现代VLMs的补丁标记输入。我们采用两阶段的方法训练VLMs:首先进行冷启动的监督微调以输出补丁级线索,其次通过强化学习与过程监督的线索奖励指导中间视觉推理步骤。在多个VLMs和多样化基准(包括一般视觉问答、复杂推理和文档理解)上的广泛实验表明,PatchCue始终提高了整体模型性能。我们的结果显示,补丁级线索优于像素级边界框和基于点的线索,提供了一种更有效且符合认知的视觉推理范式。
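The patch-level cue representation lends itself to a small sketch. The helpers below, with a 14-pixel patch and a 224x224 input assumed purely for illustration, partition an image into a patch grid and map a flat patch index (a cue) back to its pixel region; the actual tokenization follows the backbone VLM and may differ:

import torch

def image_to_patches(img, patch=14):
    # img: (C, H, W) with H and W divisible by the patch size.
    C, H, W = img.shape
    gh, gw = H // patch, W // patch
    x = img.reshape(C, gh, patch, gw, patch)
    x = x.permute(1, 3, 0, 2, 4)              # (gh, gw, C, patch, patch)
    return x.reshape(gh * gw, C, patch, patch)

def patch_cue_to_region(idx, patch=14, grid_w=16):
    # Map a patch-level cue (a flat patch index) back to its pixel box.
    r, c = divmod(idx, grid_w)
    return (r * patch, c * patch, (r + 1) * patch, (c + 1) * patch)

img = torch.randn(3, 224, 224)
patches = image_to_patches(img)               # 256 patches of shape 3x14x14
box = patch_cue_to_region(37)                 # pixel region for cue patch 37

Because a cue is just a patch index, it needs no precise spatial localization, which is the advantage the abstract claims over pixel-level boxes and points.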
cs.CV / 29 / 2603.05873

Shifting Adaptation from Weight Space to Memory Space: A Memory-Augmented Agent for Medical Image Segmentation

从权重空间到记忆空间的适应转变:一种用于医学图像分割的记忆增强代理
Chen, Bowen, Gao, Qiaohui, Wan, Shaowen, Sun, Shanhui, Liu, Wei, Li, Xiang, Liu, Tianming, Zhao, Lin
Abstract
Medical image segmentation is fundamental to clinical workflows, yet models trained on a single dataset often fail to generalize across institutions, scanners, or patient populations. While vision foundation models have shown great promise in addressing this challenge, their deployment typically requires task-specific fine-tuning, which introduces substantial communication overhead in federated learning and prevents continuous knowledge evolution during deployment. In this work, we propose a memory-augmented segmentation agent (MemSeg-Agent) that shifts adaptation from weight space to memory space, enabling few-shot learning, federated supervised learning, and test-time adaptation within a unified architecture. MemSeg-Agent conditions a fixed backbone with lightweight static, few-shot, and test-time working memories, which are dynamically composed by an agentic controller. In federated settings, we update compact memory units instead of model parameters, substantially reducing communication overhead. Experiments on four public datasets demonstrate strong performance and robustness to domain shift: Static memory alone matches or surpasses strong supervised baselines with high parameter efficiency, and test-time working memory further improves in-domain and cross-domain performance without fine-tuning. Overall, MemSeg-Agent introduces a new paradigm for scalable and adaptive medical image segmentation in the era of agentic AI.
Chinese Translation
医学图像分割是临床工作流程的基础,但在单一数据集上训练的模型往往无法在不同机构、扫描仪或患者群体中实现泛化。尽管视觉基础模型在解决这一挑战方面显示出巨大潜力,但其部署通常需要特定任务的微调,这在联邦学习中引入了大量通信开销,并阻碍了在部署期间的持续知识演化。在本研究中,我们提出了一种记忆增强的分割代理(MemSeg-Agent),将适应性从权重空间转移到记忆空间,从而在统一架构内实现少样本学习、联邦监督学习和测试时适应。MemSeg-Agent通过轻量级的静态、少样本和测试时工作记忆来调节固定的主干网络,这些记忆由代理控制器动态组合。在联邦设置中,我们更新紧凑的记忆单元而不是模型参数,从而显著减少通信开销。在四个公共数据集上的实验表明,MemSeg-Agent在领域转移方面表现出强大的性能和鲁棒性:仅使用静态记忆就能匹配或超越强大的监督基线,并且具有高参数效率,而测试时工作记忆在不进行微调的情况下进一步提高了领域内和跨领域的性能。总体而言,MemSeg-Agent为在智能代理时代实现可扩展和自适应的医学图像分割引入了一种新范式。
cs.CV / 30 / 2603.05876

Systematic Evaluation of Novel View Synthesis for Video Place Recognition

视频地点识别中新视角合成的系统评估
Mahmud, Muhammad Zawad, Islam, Samiha, Lyons, Damian
Abstract
The generation of synthetic novel views has the potential to positively impact robot navigation in several ways. In image-based navigation, a novel overhead view generated from a scene taken by a ground robot could be used to guide an aerial robot to that location. In Video Place Recognition (VPR), novel views of ground locations from the air can be added that enable a UAV to identify places seen by the ground robot, and similarly, overhead views can be used to generate novel ground views. This paper presents a systematic evaluation of synthetic novel views in VPR using five public VPR image databases and seven typical image similarity methods. We show that for small synthetic additions, novel views improve VPR recognition statistics. We find that for larger additions, the magnitude of viewpoint change is less important than the number of views added and the type of imagery in the dataset.
Chinese Translation
合成新视角的生成在多个方面有潜力积极影响机器人导航。在基于图像的导航中,从地面机器人拍摄的场景生成的新上方视角可以用于引导空中机器人到达该位置。在视频地点识别(Video Place Recognition, VPR)中,可以添加从空中拍摄的地面位置的新视角,使无人机(UAV)能够识别地面机器人所看到的地方,类似地,上方视角也可以用于生成新的地面视角。本文对VPR中的合成新视角进行了系统评估,使用了五个公共VPR图像数据库和七种典型的图像相似性方法。我们展示了对于小规模的合成添加,新视角改善了VPR识别统计数据。我们发现,对于较大规模的添加,视角变化的幅度不如添加的视角数量和数据集中图像类型重要。
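The descriptor-matching core of such an evaluation is easy to sketch. Below, random vectors stand in for real place descriptors, and synthetic novel views are simply appended to the database with the place labels they depict; the seven similarity methods in the paper are, of course, more varied than plain cosine similarity:

import numpy as np

def best_match(query, database):
    # Cosine similarity between one query descriptor and all database entries.
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q
    return int(np.argmax(sims)), float(np.max(sims))

rng = np.random.default_rng(0)
database = rng.normal(size=(120, 512))        # 100 real + 20 synthetic views
labels = np.concatenate([np.arange(100), rng.integers(0, 100, 20)])
idx, score = best_match(rng.normal(size=512), database)
predicted_place = labels[idx]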
cs.CV / 31 / 2603.05882

CylinderSplat: 3D Gaussian Splatting with Cylindrical Triplanes for Panoramic Novel View Synthesis

CylinderSplat:基于圆柱三平面的3D高斯点云用于全景新视角合成
Wang, Qiwei, Ze, Xianghui, Yu, Jingyi, Shi, Yujiao
Abstract
Feed-forward 3D Gaussian Splatting (3DGS) has shown great promise for real-time novel view synthesis, but its application to panoramic imagery remains challenging. Existing methods often rely on multi-view cost volumes for geometric refinement, which struggle to resolve occlusions in sparse-view scenarios. Furthermore, standard volumetric representations like Cartesian Triplanes poorly capture the inherent geometry of $360^\circ$ scenes, leading to distortion and aliasing. In this work, we introduce CylinderSplat, a feed-forward framework for panoramic 3DGS that addresses these limitations. The core of our method is a new cylindrical Triplane representation, which is better aligned with panoramic data and with real-world structures adhering to the Manhattan-world assumption. We use a dual-branch architecture: a pixel-based branch reconstructs well-observed regions, while a volume-based branch leverages the cylindrical Triplane to complete occluded or sparsely viewed areas. Our framework is designed to flexibly handle a variable number of input views, from single to multiple panoramas. Extensive experiments demonstrate that CylinderSplat achieves state-of-the-art results in both single-view and multi-view panoramic novel view synthesis, outperforming previous methods in both reconstruction quality and geometric accuracy.
Chinese Translation
前馈式3D高斯点云(3DGS)在实时新视角合成中展现了巨大的潜力,但其在全景图像中的应用仍然面临挑战。现有方法通常依赖于多视图代价体积进行几何精细化,但在稀疏视图场景中难以解决遮挡问题。此外,标准的体积表示如笛卡尔三平面在捕捉$360^\circ$场景的固有几何方面表现不佳,导致失真和混叠。在本研究中,我们提出了CylinderSplat,一种针对全景3DGS的前馈框架,旨在解决这些局限性。我们方法的核心是新的圆柱三平面表示,它与全景数据和遵循曼哈顿世界假设的真实世界结构更为一致。我们采用双分支架构:基于像素的分支重建观察良好的区域,而基于体积的分支利用圆柱三平面来补全被遮挡或稀疏视图的区域。我们的框架旨在灵活处理可变数量的输入视图,从单个到多个全景。大量实验表明,CylinderSplat在单视图和多视图全景新视角合成中均达到了最先进的结果,在重建质量和几何准确性方面均优于之前的方法。
cs.CV / 32 / 2603.05888

PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction

PixARMesh:自回归网格原生单视图场景重建
Zhang, Xiang, Yoo, Sohyun, Wu, Hongrui, Li, Chuan, Xie, Jianwen, Tu, Zhuowen
Abstract
We introduce PixARMesh, a method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative models, we augment a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream containing context, pose, and mesh, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes ready for downstream applications.
Chinese Translation
我们提出了PixARMesh,一种从单个RGB图像自回归重建完整3D室内场景网格的方法。与依赖隐式有符号距离场和事后布局优化的先前方法不同,PixARMesh在一个统一模型中共同预测物体布局和几何形状,在一次前向传递中生成连贯且适合艺术创作的网格。基于最近在网格生成模型方面的进展,我们通过交叉注意力将点云编码器与像素对齐的图像特征和全局场景上下文相结合,使得从单个图像进行准确的空间推理成为可能。场景通过包含上下文、姿态和网格的统一标记流自回归生成,产生紧凑的网格和高保真几何。对合成和真实世界数据集的实验表明,PixARMesh在实现最先进的重建质量的同时,生成轻量级、高质量的网格,适合下游应用。
cs.CV / 33 / 2603.05898

InnoAds-Composer: Efficient Condition Composition for E-Commerce Poster Generation

InnoAds-Composer:用于电子商务海报生成的高效条件组合
Qin, Yuxin, Cao, Ke, Liu, Haowei, Ma, Ao, Li, Fengheng, Zhu, Honghe, Zhang, Zheng, Ling, Run, Feng, Wei, He, Xuanhua, Zhang, Zhanjie, Guo, Zhen, Bian, Haoyi, Lv, Jingjing, Shen, Junjie, Law, Ching
Abstract
E-commerce product poster generation aims to automatically synthesize a single image that effectively conveys product information by presenting a subject, text, and a designed style. Recent diffusion models with fine-grained and efficient controllability have advanced product poster synthesis, yet they typically rely on multi-stage pipelines, and simultaneous control over subject, text, and style remains underexplored. Such naive multi-stage pipelines also show three issues: poor subject fidelity, inaccurate text, and inconsistent style. To address these issues, we propose InnoAds-Composer, a single-stage framework that enables efficient tri-conditional control over subject, glyph, and style tokens. To alleviate the quadratic overhead introduced by naive tri-conditional token concatenation, we perform importance analysis over layers and timesteps and route each condition only to the most responsive positions, thereby shortening the active token sequence. Besides, to improve the accuracy of Chinese text rendering, we design a Text Feature Enhancement Module (TFEM) that integrates features from both glyph images and glyph crops. To support training and evaluation, we also construct a high-quality e-commerce product poster dataset and benchmark, which is the first dataset that jointly contains subject, text, and style conditions. Extensive experiments demonstrate that InnoAds-Composer significantly outperforms existing product poster methods without obviously increasing inference latency.
Chinese Translation
电子商务产品海报生成旨在自动合成一幅有效传达产品信息的单一图像,通过展示主题、文本和设计风格。近期的扩散模型在细粒度和高效可控性方面推动了产品海报合成的发展,但它们通常依赖于多阶段管道,并且对主题、文本和风格的同时控制仍然未得到充分探索。这种简单的多阶段管道还存在三个问题:主题保真度差、文本不准确和风格不一致。为了解决这些问题,我们提出了InnoAds-Composer,一个单阶段框架,能够对主题、字形和风格进行高效的三条件控制标记。为了减轻简单三条件标记连接所带来的二次开销,我们对层和时间步进行了重要性分析,仅将每个条件路由到最响应的位置,从而缩短了活动标记序列。此外,为了提高中文文本渲染的准确性,我们设计了一个文本特征增强模块(Text Feature Enhancement Module, TFEM),该模块整合了字形图像和字形裁剪的特征。为了支持训练和评估,我们还构建了一个高质量的电子商务产品海报数据集和基准,这是第一个同时包含主题、文本和风格条件的数据集。大量实验表明,InnoAds-Composer显著优于现有的产品海报方法,而没有明显增加推理延迟。
cs.CV / 34 / 2603.05899

Mitigating Bias in Concept Bottleneck Models for Fair and Interpretable Image Classification

减轻概念瓶颈模型中的偏见以实现公平和可解释的图像分类
Tong, Schrasing, Salaun, Antoine, Yuan, Vincent, Adeyeri, Annabel, Kagal, Lalana
Abstract
Ensuring fairness in image classification prevents models from perpetuating and amplifying bias. Concept bottleneck models (CBMs) map images to high-level, human-interpretable concepts before making predictions via a sparse, one-layer classifier. This structure enhances interpretability and, in theory, supports fairness by masking sensitive attribute proxies such as facial features. However, CBM concepts have been known to leak information unrelated to concept semantics and early results reveal only marginal reductions in gender bias on datasets like ImSitu. We propose three bias mitigation techniques to improve fairness in CBMs: 1. Decreasing information leakage using a top-k concept filter, 2. Removing biased concepts, and 3. Adversarial debiasing. Our results outperform prior work in terms of fairness-performance tradeoffs, indicating that our debiased CBM provides a significant step towards fair and interpretable image classification.
Chinese Translation
确保图像分类的公平性可以防止模型延续和放大偏见。概念瓶颈模型(Concept Bottleneck Models, CBMs)在通过稀疏的一层分类器进行预测之前,将图像映射到高层次的人类可解释概念。这种结构增强了可解释性,并在理论上通过掩盖敏感属性代理(如面部特征)来支持公平性。然而,CBM概念已知会泄露与概念语义无关的信息,早期结果显示在ImSitu等数据集上性别偏见的减少仅为边际性。我们提出了三种减轻偏见的技术以改善CBMs中的公平性:1. 使用top-k概念过滤器减少信息泄露,2. 移除有偏见的概念,3. 对抗性去偏见。我们的结果在公平性与性能的权衡方面优于以往的研究,表明我们的去偏见CBM为实现公平和可解释的图像分类迈出了重要一步。
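The first mitigation, the top-k concept filter, is straightforward to sketch: only the k strongest concept activations per image reach the sparse classifier, limiting the capacity through which leaked, non-semantic information can flow. The value of k and the 312-dimensional concept space (the CUB attribute count) are illustrative assumptions:

import torch

def topk_concept_filter(concept_logits, k=32):
    # Keep the k strongest concept activations per image; zero the rest.
    vals, idx = concept_logits.topk(k, dim=-1)
    filtered = torch.zeros_like(concept_logits)
    return filtered.scatter(-1, idx, vals)

def remove_biased_concepts(concept_logits, biased_idx):
    # Second mitigation: zero out concepts identified as bias proxies.
    out = concept_logits.clone()
    out[:, biased_idx] = 0.0
    return out

concepts = torch.randn(4, 312)
filtered = remove_biased_concepts(topk_concept_filter(concepts), [5, 17])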
cs.CV / 35 / 2603.05905

CollabOD: Collaborative Multi-Backbone with Cross-scale Vision for UAV Small Object Detection

CollabOD:具有跨尺度视觉的协作多骨干无人机小物体检测
Bai, Xuecheng, Wang, Yuxiang, Xu, Chuanzhi, Hu, Boyu, Han, Kang, Pan, Ruijie, Niu, Xiaowei, Guan, Xiaotian, Fu, Liqiang, Ye, Pengfei
Abstract
Small object detection in unmanned aerial vehicle (UAV) imagery is challenging, mainly due to scale variation, structural detail degradation, and limited computational resources. In high-altitude scenarios, fine-grained features are further weakened during hierarchical downsampling and cross-scale fusion, resulting in unstable localization and reduced robustness. To address this issue, we propose CollabOD, a lightweight collaborative detection framework that explicitly preserves structural details and aligns heterogeneous feature streams before multi-scale fusion. The framework integrates Structural Detail Preservation, Cross-Path Feature Alignment, and Localization-Aware Lightweight Design strategies. From the perspectives of image processing, channel structure, and lightweight design, it optimizes the architecture of conventional UAV perception models. The proposed design enhances representation stability while maintaining efficient inference. A unified detail-aware detection head further improves regression robustness without introducing additional deployment overhead. The code is available at: https://github.com/Bai-Xuecheng/CollabOD.
Chinese Translation
在无人机(UAV)图像中,小物体检测面临挑战,主要由于尺度变化、结构细节退化和计算资源有限。在高空场景中,细粒度特征在层次下采样和跨尺度融合过程中进一步减弱,导致定位不稳定和鲁棒性降低。为了解决这一问题,我们提出了CollabOD,一种轻量级协作检测框架,明确保留结构细节并在多尺度融合之前对异构特征流进行对齐。该框架集成了结构细节保留、跨路径特征对齐和定位感知轻量设计策略。从图像处理、通道结构和轻量设计的角度优化了传统无人机感知模型的架构。所提出的设计在保持高效推理的同时增强了表示的稳定性。统一的细节感知检测头进一步提高了回归的鲁棒性,而不引入额外的部署开销。代码可在以下链接获取:https://github.com/Bai-Xuecheng/CollabOD。
cs.CV / 36 / 2603.05906

Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D

超越几何:沉浸式2D到3D的艺术差异合成
Chen, Ping, Chen, Zezhou, Zhang, Xingpeng, Qian, Yanlin, Hu, Huan, Liu, Xiang, Wang, Zipeng, Wang, Xin, Liu, Zhaoxiang, Wang, Kai, Lian, Shiguo
Abstract
Current 2D-to-3D conversion methods achieve geometric accuracy but are artistically deficient, failing to replicate the immersive and emotionally resonant experience of professional 3D cinema. This is because geometric reconstruction paradigms mistake deliberate artistic intent, such as strategic zero-plane shifts for pop-out effects and local depth sculpting, for data noise or ambiguity. This paper argues for a new paradigm: Artistic Disparity Synthesis, shifting the goal from physically accurate disparity estimation to artistically coherent disparity synthesis. We propose Art3D, a preliminary framework exploring this paradigm. Art3D uses a dual-path architecture to decouple global depth parameters (macro-intent) from local artistic effects (visual brushstrokes) and learns from professional 3D film data via indirect supervision. We also introduce a preliminary evaluation method to quantify cinematic alignment. Experiments show our approach demonstrates potential in replicating key local out-of-screen effects and aligning with the global depth styles of cinematic 3D content, laying the groundwork for a new class of artistically-driven conversion tools.
Chinese Translation
当前的2D到3D转换方法虽然在几何精度上表现良好,但在艺术表现上却显得不足,未能复制专业3D电影所带来的沉浸式和情感共鸣的体验。这是因为几何重建范式将有意的艺术意图(例如用于突显效果的战略性零平面移动和局部深度雕刻)误认为数据噪声或模糊性。本文提出了一种新的范式:艺术差异合成,将目标从物理上准确的差异估计转变为艺术上连贯的差异合成。我们提出了Art3D,一个探索这一范式的初步框架。Art3D采用双路径架构,将全局深度参数(宏观意图)与局部艺术效果(视觉笔触)解耦,并通过间接监督从专业3D电影数据中学习。我们还引入了一种初步评估方法来量化电影对齐效果。实验表明,我们的方法在复制关键的局部出屏效果和与电影3D内容的全局深度风格对齐方面展示了潜力,为新一类以艺术驱动的转换工具奠定了基础。
cs.CV / 37 / 2603.05908

Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image

Pano3DComposer:基于单一全景图像的前馈组合3D场景生成
Qiu, Zidian, Wu, Ancong
Abstract
Current compositional image-to-3D scene generation approaches construct 3D scenes by time-consuming iterative layout optimization or inflexible joint object-layout generation. Moreover, most methods rely on limited field-of-view perspective images, hindering the creation of complete 360-degree environments. To address these limitations, we design Pano3DComposer, an efficient feed-forward framework for panoramic images. To decouple object generation from layout estimation, we propose a plug-and-play Object-World Transformation Predictor. This module converts the 3D objects generated by off-the-shelf image-to-3D models from local to world coordinates. To achieve this, we adapt the VGGT architecture to Alignment-VGGT by using target object crop, multi-view object renderings and camera parameters to predict the transformation. The predictor is trained using pseudo-geometric supervision to address the shape discrepancy between generated and ground-truth objects. For input images from unseen domains, we further introduce a Coarse-to-Fine (C2F) alignment mechanism for Pano3DComposer that iteratively refines geometric consistency with feedback of scene rendering. Our method achieves superior geometric accuracy for image/text-to-3D tasks on synthetic and real-world datasets. It can generate a high-fidelity 3D scene in approximately 20 seconds on an RTX 4090 GPU. Project page: https://qiuzidian.github.io/pano3dcomposer-page/.
Chinese Translation
当前的图像到3D场景生成方法通过耗时的迭代布局优化或不灵活的联合物体布局生成来构建3D场景。此外,大多数方法依赖于有限视场的透视图像,这限制了完整360度环境的创建。为了解决这些局限性,我们设计了Pano3DComposer,这是一个高效的全景图像前馈框架。为了将物体生成与布局估计解耦,我们提出了一种即插即用的物体-世界转换预测器(Object-World Transformation Predictor)。该模块将由现成的图像到3D模型生成的3D物体从局部坐标转换为世界坐标。为此,我们通过使用目标物体裁剪、多视角物体渲染和相机参数,将VGGT架构调整为Alignment-VGGT,以预测转换。该预测器使用伪几何监督进行训练,以解决生成物体与真实物体之间的形状差异。对于来自未见领域的输入图像,我们进一步为Pano3DComposer引入了一种粗到细(Coarse-to-Fine, C2F)对齐机制,通过场景渲染的反馈迭代地优化几何一致性。我们的方法在合成和真实世界数据集上实现了图像/文本到3D任务的优越几何准确性。在RTX 4090 GPU上,它可以在大约20秒内生成高保真度的3D场景。项目页面:https://qiuzidian.github.io/pano3dcomposer-page/
cs.CV / 38 / 2603.05911

CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning

CORE-Seg:基于推理驱动的复杂病变分割方法通过强化学习
Xie, Yuxin, Chen, Yuming, Yang, Yishan, Zhou, Yi, Zhou, Tao, Zhao, Zhen, Liu, Jiacheng, Fu, Huazhu
Abstract
Medical image segmentation is undergoing a paradigm shift from conventional visual pattern matching to cognitive reasoning analysis. Although Multimodal Large Language Models (MLLMs) have shown promise in integrating linguistic and visual knowledge, significant gaps remain: existing general MLLMs possess broad common sense but lack the specialized visual reasoning required for complex lesions, whereas traditional segmentation models excel at pixel-level segmentation but lack logical interpretability. In this paper, we introduce ComLesion-14K, the first diverse Chain-of-Thought (CoT) benchmark for reasoning-driven complex lesion segmentation. To accomplish this task, we propose CORE-Seg, an end-to-end framework integrating reasoning with segmentation through a Semantic-Guided Prompt Adapter. We design a progressive training strategy from SFT to GRPO, equipped with an adaptive dual-granularity reward mechanism to mitigate reward sparsity. Our method achieves state-of-the-art results with a mean Dice of 37.06% (14.89% higher than the second-best baseline), while reducing the failure rate to 18.42%. Project Page: https://xyxl024.github.io/CORE-Seg.github.io/
Chinese Translation
医学图像分割正经历从传统的视觉模式匹配到认知推理分析的范式转变。尽管多模态大型语言模型(Multimodal Large Language Models, MLLMs)在整合语言与视觉知识方面展现出潜力,但仍存在显著差距:现有的通用MLLMs具备广泛的常识,但缺乏处理复杂病变所需的专业视觉推理能力,而传统分割模型在像素级分割方面表现出色,但缺乏逻辑可解释性。本文介绍了ComLesion-14K,这是首个多样化的推理驱动复杂病变分割的链式思维(Chain-of-Thought, CoT)基准。为完成这一任务,我们提出了CORE-Seg,一个通过语义引导提示适配器将推理与分割相结合的端到端框架。我们设计了一种从SFT到GRPO的渐进式训练策略,并配备自适应双粒度奖励机制,以减轻奖励稀疏性。我们的方法在平均Dice系数上达到了37.06%(比第二好的基线高出14.89%),同时将失败率降低至18.42%。项目页面:https://xyxl024.github.io/CORE-Seg.github.io/
cs.CV / 39 / 2603.05921

BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models via Instruction-Response Deviation

BlackMirror:基于指令-响应偏差的文本到图像模型黑箱后门检测
Li, Feiran, Xu, Qianqian, Bao, Shilong, Yang, Zhiyong, Zhao, Xilin, Cao, Xiaochun, Huang, Qingming
Abstract
This paper investigates the challenging task of detecting backdoored text-to-image models under black-box settings and introduces a novel detection framework, BlackMirror. Existing approaches typically rely on analyzing image-level similarity, under the assumption that backdoor-triggered generations exhibit strong consistency across samples. However, they struggle to generalize to recently emerging backdoor attacks, where backdoored generations can appear visually diverse. BlackMirror is motivated by an observation: across backdoor attacks, only partial semantic patterns within the generated image are steadily manipulated, while the rest of the content remains diverse or benign. Accordingly, BlackMirror consists of two components: MirrorMatch, which aligns visual patterns with the corresponding instructions to detect semantic deviations; and MirrorVerify, which evaluates the stability of these deviations across varied prompts to distinguish true backdoor behavior from benign responses. BlackMirror is a general, training-free framework that can be deployed as a plug-and-play module in Model-as-a-Service (MaaS) applications. Comprehensive experiments demonstrate that BlackMirror achieves accurate detection across a wide range of attacks. Code is available at https://github.com/Ferry-Li/BlackMirror.
Chinese Translation
本文研究了在黑箱环境下检测带后门的文本到图像模型的挑战性任务,并提出了一种新颖的检测框架BlackMirror。现有方法通常依赖于分析图像级相似性,假设后门触发的生成在样本之间表现出强一致性。然而,它们在应对最近出现的后门攻击时面临困难,因为带后门的生成可能在视觉上表现出多样性。BlackMirror的设计灵感来源于一个观察:在后门攻击中,生成图像中的部分语义模式被持续操控,而其余内容则保持多样或良性。因此,BlackMirror由两个组件组成:MirrorMatch,它将视觉模式与相应的指令对齐以检测语义偏差;以及MirrorVerify,它评估这些偏差在不同提示下的稳定性,以区分真实的后门行为和良性响应。BlackMirror是一个通用的、无训练的框架,可以作为即插即用模块部署在模型即服务(MaaS)应用中。全面的实验表明,BlackMirror在广泛的攻击中实现了准确的检测。代码可在 https://github.com/Ferry-Li/BlackMirror 获取。
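Heavily simplified, the instruction-response deviation signal might look like the sketch below: score each generation by one minus the cosine similarity between an instruction embedding and the embedding of the image generated for it, then check whether that deviation is stable across varied prompts. The embeddings here are random stand-ins, and the actual MirrorMatch/MirrorVerify scoring is more involved:

import torch
import torch.nn.functional as F

def deviation_stability(image_embeds, instruction_embeds):
    # image_embeds[i] embeds the image generated for instruction i;
    # instruction_embeds[i] embeds instruction i. A high mean deviation that
    # persists (low variance) across prompts is the suspicious pattern.
    img = F.normalize(image_embeds, dim=-1)
    ins = F.normalize(instruction_embeds, dim=-1)
    deviation = 1.0 - (img * ins).sum(dim=-1)
    return deviation.mean().item(), deviation.var().item()

mean_dev, var_dev = deviation_stability(torch.randn(16, 512), torch.randn(16, 512))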
cs.CV / 40 / 2603.05925

RAC: Rectified Flow Auto Coder

RAC:整流流自动编码器
Fang, Sen, Feng, Yalin, Zhang, Yanxin, Metaxas, Dimitris N.
Abstract
In this paper, we propose a Rectified Flow Auto Coder (RAC) inspired by Rectified Flow to replace the traditional VAE: 1. It achieves multi-step decoding by applying the decoder to flow timesteps. Its decoding path is straight and correctable, enabling step-by-step refinement. 2. The model inherently supports bidirectional inference, where the decoder serves as the encoder through time reversal (hence Coder rather than encoder or decoder), reducing parameter count by nearly 41%. 3. This generative decoding method improves generation quality since the model can correct latent variables along the path, partially addressing the reconstruction--generation gap. Experiments show that RAC surpasses SOTA VAEs in both reconstruction and generation with approximately 70% lower computational cost.
Chinese Translation
在本文中,我们提出了一种受整流流启发的整流流自动编码器(RAC),以替代传统的变分自编码器(VAE):1. 通过将解码器应用于流时间步,实现多步解码。其解码路径直观且可纠正,能够逐步优化。2. 该模型本质上支持双向推理,其中解码器通过时间反转作为编码器(因此称为Coder而非编码器或解码器),将参数数量减少近41%。3. 这种生成解码方法提高了生成质量,因为模型可以沿路径纠正潜在变量,部分解决了重建与生成之间的差距。实验表明,RAC在重建和生成方面均超越了最新的变分自编码器(SOTA VAE),且计算成本降低了约70%。
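The time-reversal idea behind the shared Coder admits a toy sketch: one velocity field is integrated forward to decode and backward to encode. The Euler scheme, step count, and placeholder field below are illustrative assumptions, not the RAC network itself:

import torch

def integrate(velocity, x, t0, t1, steps=8):
    # Euler integration of dx/dt = velocity(x, t); a straight rectified path
    # lets coarse steps remain accurate, and each step can correct the last.
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x

velocity = lambda x, t: -x                      # toy field for illustration
z = torch.randn(2, 16)
sample = integrate(velocity, z, 0.0, 1.0)       # decode: latent -> data
z_back = integrate(velocity, sample, 1.0, 0.0)  # encode: same weights, reversed time

Reusing one network for both directions is what yields the reported ~41% parameter reduction relative to a separate encoder-decoder pair.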
cs.CV / 41 / 2603.05926

Towards Driver Behavior Understanding: Weakly-Supervised Risk Perception in Driving Scenes

迈向驾驶行为理解:驾驶场景中的弱监督风险感知
Agarwal, Nakul, Chen, Yi-Ting, Dariush, Behzad
Abstract
Achieving zero-collision mobility remains a key objective for intelligent vehicle systems, which requires understanding driver risk perception: a complex cognitive process shaped by the driver's voluntary response to external stimuli and by the attentiveness of surrounding road users toward the ego-vehicle. To support progress in this area, we introduce RAID (Risk Assessment In Driving scenes), a large-scale dataset specifically curated for research on driver risk perception and contextual risk assessment. RAID comprises 4,691 annotated video clips, covering diverse traffic scenarios with labels for the driver's intended maneuver, road topology, risk situations (e.g., crossing pedestrians), driver responses, and pedestrian attentiveness. Leveraging RAID, we propose a weakly supervised risk object identification framework that models the relationship between the driver's intended maneuver and responses to identify potential risk sources. Additionally, we analyze the role of pedestrian attention in estimating risk and demonstrate the value of the proposed dataset. Experimental evaluations demonstrate that our method achieves 20.6% and 23.1% performance gains over prior state-of-the-art approaches on the RAID and HDDS datasets, respectively.
Chinese Translation
实现零碰撞出行仍然是智能车辆系统的一个关键目标,这需要理解驾驶员的风险感知——这是一个复杂的认知过程,受驾驶员对外部刺激的自愿反应以及周围道路使用者对自我车辆的关注程度的影响。为了支持这一领域的进展,我们引入了RAID(驾驶场景中的风险评估)——一个专门为驾驶员风险感知和情境风险评估研究而策划的大规模数据集。RAID包含4,691个标注的视频片段,涵盖了多样的交通场景,并标注了驾驶员的意图动作、道路拓扑、风险情境(例如,过马路的行人)、驾驶员反应和行人注意力。利用RAID,我们提出了一种弱监督风险目标识别框架,该框架建模驾驶员意图动作与反应之间的关系,以识别潜在的风险源。此外,我们分析了行人注意力在风险估计中的作用,并展示了所提数据集的价值。实验评估表明,我们的方法在RAID和HDDS数据集上分别比之前的最先进方法提高了20.6%和23.1%的性能。
cs.CV / 42 / 2603.05929

Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation

超越静态框架:用于人体姿态估计的时间聚合与恢复视觉变换器
Fang, Hongwei, Cai, Jiahang, Wang, Xun, Yang, Wenwu
Abstract
Vision Transformers (ViTs) have recently achieved state-of-the-art performance in 2D human pose estimation due to their strong global modeling capability. However, existing ViT-based pose estimators are designed for static images and process each frame independently, thereby ignoring the temporal coherence that exists in video sequences. This limitation often results in unstable predictions, especially in challenging scenes involving motion blur, occlusion, or defocus. In this paper, we propose TAR-ViTPose, a novel Temporal Aggregate-and-Restore Vision Transformer tailored for video-based 2D human pose estimation. TAR-ViTPose enhances static ViT representations by aggregating temporal cues across frames in a plug-and-play manner, leading to more robust and accurate pose estimation. To effectively aggregate joint-specific features that are temporally aligned across frames, we introduce a joint-centric temporal aggregation (JTA) that assigns each joint a learnable query token to selectively attend to its corresponding regions from neighboring frames. Furthermore, we develop a global restoring attention (GRA) to restore the aggregated temporal features back into the token sequence of the current frame, enriching its pose representation while fully preserving global context for precise keypoint localization. Extensive experiments demonstrate that TAR-ViTPose substantially improves upon the single-frame baseline ViTPose, achieving a +2.3 mAP gain on the PoseTrack2017 benchmark. Moreover, our approach outperforms existing state-of-the-art video-based methods, while also achieving a noticeably higher runtime frame rate in real-world applications. Project page: https://github.com/zgspose/TARViTPose.
Chinese Translation
视觉变换器(Vision Transformers, ViTs)由于其强大的全局建模能力,最近在二维人体姿态估计中取得了最先进的性能。然而,现有的基于ViT的姿态估计器是为静态图像设计的,独立处理每一帧,从而忽视了视频序列中存在的时间一致性。这一局限性常常导致不稳定的预测,特别是在涉及运动模糊、遮挡或失焦的挑战性场景中。在本文中,我们提出了TAR-ViTPose,一种新颖的时间聚合与恢复视觉变换器,专为基于视频的二维人体姿态估计而设计。TAR-ViTPose通过以即插即用的方式聚合跨帧的时间线索,增强了静态ViT表示,从而实现更稳健和准确的姿态估计。为了有效聚合在时间上对齐的关节特征,我们引入了一种以关节为中心的时间聚合(Joint-Centric Temporal Aggregation, JTA),为每个关节分配一个可学习的查询标记,以选择性地关注来自相邻帧的对应区域。此外,我们开发了一种全局恢复注意力(Global Restoring Attention, GRA),将聚合的时间特征恢复回当前帧的标记序列,丰富其姿态表示,同时充分保留全局上下文以实现精确的关键点定位。大量实验表明,TAR-ViTPose在PoseTrack2017基准上显著优于单帧基线ViTPose,获得了+2.3 mAP的提升。此外,我们的方法在现有最先进的基于视频的方法中表现出色,同时在实际应用中实现了明显更高的运行帧率。项目页面:https://github.com/zgspose/TARViTPose。
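The joint-centric temporal aggregation (JTA) can be sketched with standard cross-attention: one learnable query per joint attends over tokens from neighboring frames. The 17-joint COCO layout and all dimensions are illustrative, and the real module is paired with the global restoring attention described above:

import torch
import torch.nn as nn

class JointTemporalAggregation(nn.Module):
    # One learnable query per joint attends over neighboring-frame tokens.
    def __init__(self, num_joints=17, dim=256, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_joints, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, neighbor_tokens):
        # neighbor_tokens: (B, N, dim) tokens gathered from adjacent frames.
        B = neighbor_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        out, _ = self.attn(q, neighbor_tokens, neighbor_tokens)
        return out                              # (B, num_joints, dim)

jta = JointTemporalAggregation()
aggregated = jta(torch.randn(2, 392, 256))      # e.g., two frames of 196 tokens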
cs.CV / 43 / 2603.05932

FTSplat: Feed-forward Triangle Splatting Network

FTSplat:前馈三角形溅射网络
Xiong, Jinlin, Li, Can, Shen, Jiawei, Qi, Zhigang, Sun, Lei, Zhao, Dongyang
Abstract
High-fidelity three-dimensional (3D) reconstruction is essential for robotics and simulation. While Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) achieve impressive rendering quality, their reliance on time-consuming per-scene optimization limits real-time deployment. Emerging feed-forward Gaussian splatting methods improve efficiency but often lack explicit, manifold geometry required for direct simulation. To address these limitations, we propose a feed-forward framework for triangle primitive generation that directly predicts continuous triangle surfaces from calibrated multi-view images. Our method produces simulation-ready models in a single forward pass, obviating the need for per-scene optimization or post-processing. We introduce a pixel-aligned triangle generation module and incorporate relative 3D point cloud supervision to enhance geometric learning stability and consistency. Experiments demonstrate that our method achieves efficient reconstruction while maintaining seamless compatibility with standard graphics and robotic simulators.
Chinese Translation
高保真三维(3D)重建对于机器人技术和仿真至关重要。尽管神经辐射场(NeRF)和三维高斯溅射(3DGS)在渲染质量上取得了令人印象深刻的成果,但它们对耗时的每场景优化的依赖限制了实时部署。新兴的前馈高斯溅射方法提高了效率,但通常缺乏直接仿真所需的明确流形几何。为了解决这些局限性,我们提出了一种前馈框架,用于三角形原语生成,该框架直接从标定的多视图图像中预测连续的三角形表面。我们的方法在一次前向传递中生成适合仿真的模型,消除了对每场景优化或后处理的需求。我们引入了一个像素对齐的三角形生成模块,并结合相对三维点云监督,以增强几何学习的稳定性和一致性。实验表明,我们的方法在高效重建的同时,保持了与标准图形和机器人仿真器的无缝兼容性。
cs.CV / 44 / 2603.05936

OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving

OD-RASE:基于本体的自主驾驶风险评估与安全增强
Shimomura, Kota, Nambata, Masaki, Ishikawa, Atsuya, Mimura, Ryota, Kawabuchi, Takayuki, Yamashita, Takayoshi, Inoue, Koki
Abstract
Although autonomous driving systems demonstrate high perception performance, they still face limitations when handling rare situations or complex road structures. Because such road infrastructures are designed for human drivers, safety improvements are typically introduced only after accidents occur. This reactive approach poses a significant challenge for autonomous systems, which require proactive risk mitigation. To address this issue, we propose OD-RASE, a framework for enhancing the safety of autonomous driving systems by detecting road structures that cause traffic accidents and connecting these findings to infrastructure development. First, we formalize an ontology based on specialized domain knowledge of road traffic systems. In parallel, we generate infrastructure improvement proposals using a large-scale visual language model (LVLM) and use ontology-driven data filtering to enhance their reliability. This process automatically annotates improvement proposals on pre-accident road images, leading to the construction of a new dataset. Furthermore, we introduce the Baseline approach (OD-RASE model), which leverages LVLM and a diffusion model to produce both infrastructure improvement proposals and generated images of the improved road environment. Our experiments demonstrate that ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and the corresponding improvement plans. We believe that this work contributes to the overall safety of traffic environments and marks an important step toward the broader adoption of autonomous driving systems.
Chinese Translation
尽管自主驾驶系统表现出高水平的感知性能,但在处理稀有情况或复杂道路结构时仍面临局限。这些道路基础设施是为人类驾驶员设计的,安全改进通常是在事故发生后才引入。这种反应式的方法对自主系统构成了重大挑战,因为它们需要主动的风险缓解。为了解决这一问题,我们提出了OD-RASE,一个通过检测导致交通事故的道路结构并将这些发现与基础设施发展相连接来增强自主驾驶系统安全性的框架。首先,我们基于道路交通系统的专业领域知识形式化一个本体。同时,我们使用大规模视觉语言模型(LVLM)生成基础设施改进建议,并利用基于本体的数据过滤来增强其可靠性。这个过程自动在事故前的道路图像上标注改进建议,从而构建一个新的数据集。此外,我们介绍了基线方法(OD-RASE模型),该方法利用LVLM和扩散模型生成基础设施改进建议以及改进后的道路环境图像。我们的实验表明,基于本体的数据过滤能够高度准确地预测导致事故的道路结构及相应的改进计划。我们相信,这项工作有助于提高交通环境的整体安全性,并标志着自主驾驶系统更广泛应用的重要一步。
cs.CV / 45 / 2603.05937

Facial Expression Recognition Using Residual Masking Network

基于残差掩蔽网络的面部表情识别
Pham, Luan, Vu, The Huynh, Tran, Tuan Anh
Abstract
Automatic facial expression recognition (FER) has gained much attention due to its applications in human-computer interaction. Among the approaches to improving FER, this paper focuses on deep architectures with an attention mechanism. We propose a novel masking idea to boost the performance of CNNs on the facial expression task. It uses a segmentation network to refine feature maps, enabling the network to focus on relevant information to make correct decisions. In experiments, we combine the ubiquitous Deep Residual Network and a Unet-like architecture to produce a Residual Masking Network. The proposed method holds state-of-the-art (SOTA) accuracy on the well-known FER2013 and private VEMO datasets. The source code is available at https://github.com/phamquiluan/ResidualMaskingNetwork.
Chinese Translation
自动面部表情识别(FER)因其在人机交互中的应用而受到广泛关注。在改善FER任务的方法中,本文重点研究了具有注意力机制的深度架构。我们提出了一种新颖的掩蔽思想,以提升卷积神经网络(CNN)在面部表情任务中的性能。该方法利用分割网络来细化特征图,使网络能够专注于相关信息,从而做出正确的决策。在实验中,我们结合了广泛使用的深度残差网络(Deep Residual Network)和类似U-Net的架构,构建了残差掩蔽网络(Residual Masking Network)。所提出的方法在著名的FER2013和私有的VEMO数据集上达到了最先进的(SOTA)准确率。源代码可在https://github.com/phamquiluan/ResidualMaskingNetwork获取。
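The masking idea reduces to refining a feature map with a learned soft mask that is added back residually. A minimal sketch, with a single convolution standing in for the Unet-like segmentation branch (the actual masking block is deeper):

import torch
import torch.nn as nn

class ResidualMasking(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.mask_branch = nn.Sequential(       # stand-in for the Unet-like branch
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feat):
        mask = self.mask_branch(feat)           # soft mask in (0, 1)
        return feat * (1 + mask)                # residual: feat + feat * mask

block = ResidualMasking(64)
refined = block(torch.randn(1, 64, 56, 56))

The residual form feat + feat * mask lets the mask amplify relevant regions without ever being able to erase the original features entirely.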
cs.CV / 46 / 2603.05940

SLER-IR: Spherical Layer-wise Expert Routing for All-in-One Image Restoration

SLER-IR:用于一体化图像修复的球面层级专家路由
Peng, Shurui, Lin, Xin, Luo, Shi, Ou, Jincen, Zhang, Dizhe, Qi, Lu, Nguyen, Truong, Ren, Chao
Abstract
Image restoration under diverse degradations remains challenging for unified all-in-one frameworks due to feature interference and insufficient expert specialization. We propose SLER-IR, a spherical layer-wise expert routing framework that dynamically activates specialized experts across network layers. To ensure reliable routing, we introduce a Spherical Uniform Degradation Embedding with contrastive learning, which maps degradation representations onto a hypersphere to eliminate geometry bias in linear embedding spaces. In addition, a Global-Local Granularity Fusion (GLGF) module integrates global semantics and local degradation cues to address spatially non-uniform degradations and the train-test granularity gap. Experiments on three-task and five-task benchmarks demonstrate that SLER-IR achieves consistent improvements over state-of-the-art methods in both PSNR and SSIM. Code and models will be publicly released.
Chinese Translation
在多种退化条件下的图像修复对于统一的一体化框架仍然具有挑战性,这主要是由于特征干扰和专家专业化不足。我们提出了SLER-IR,一个球面层级专家路由框架,能够动态激活网络层中的专业专家。为了确保可靠的路由,我们引入了一种球面均匀退化嵌入(Spherical Uniform Degradation Embedding)与对比学习相结合的方法,该方法将退化表示映射到超球面上,以消除线性嵌入空间中的几何偏差。此外,全球-局部粒度融合(Global-Local Granularity Fusion, GLGF)模块整合了全局语义和局部退化线索,以应对空间上不均匀的退化和训练-测试粒度差距。在三个任务和五个任务的基准测试中的实验表明,SLER-IR在PSNR和SSIM方面均实现了对最先进方法的一致性提升。代码和模型将公开发布。
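The spherical routing step can be sketched compactly: degradation features are L2-normalized onto the unit hypersphere and routed to the expert whose key is closest in cosine similarity. The argmax routing rule and all dimensions are assumptions; the paper's embedding is trained with contrastive learning, and its router may gate differently:

import torch
import torch.nn.functional as F

def route_to_expert(feat, expert_keys):
    # Normalizing both sides removes the norm (geometry) bias that a linear
    # embedding space would retain, leaving a purely angular comparison.
    z = F.normalize(feat, dim=-1)
    keys = F.normalize(expert_keys, dim=-1)
    sims = z @ keys.t()                         # (batch, num_experts)
    return sims.argmax(dim=-1), sims

feat = torch.randn(4, 256)                      # per-layer degradation features
keys = torch.randn(8, 256)                      # 8 experts at this layer
expert_id, sims = route_to_expert(feat, keys)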
cs.CV / 47 / 2603.05942

Adaptive Radial Projection on Fourier Magnitude Spectrum for Document Image Skew Estimation

基于傅里叶幅度谱的自适应径向投影文档图像倾斜估计
Pham, Luan, Hoang, Phu Hao, Mai, Xuan Toan, Tran, Tuan Anh
Abstract
Skew estimation is one of the vital tasks in document processing systems, especially for scanned document images, because its performance directly impacts subsequent steps. Over the years, a large body of research has focused on this challenging problem amid the rise of the digitization age. In this research, we first propose a novel skew estimation method that extracts the dominant skew angle of a given document image by applying an Adaptive Radial Projection on the 2D Discrete Fourier Magnitude spectrum. Second, we introduce a high-quality skew estimation dataset, DISE-2021, to assess the performance of different estimators. Finally, we provide comprehensive analyses that focus on multiple improvement aspects of Fourier-based methods. Our results show that the proposed method is robust, reliable, and outperforms all compared methods. The source code is available at https://github.com/phamquiluan/jdeskew.
Chinese Translation
倾斜估计是文档处理系统中的一项重要任务,尤其对于扫描文档图像而言,因为其性能直接影响后续步骤。多年来,随着数字化时代的到来,众多研究集中于这一具有挑战性的问题。在本研究中,我们首先提出了一种新颖的倾斜估计方法,通过对二维离散傅里叶幅度谱应用自适应径向投影,提取给定文档图像的主倾斜角度。其次,我们引入了一个高质量的倾斜估计数据集DISE-2021,以评估不同估计器的性能。最后,我们提供了全面的分析,重点关注基于傅里叶方法的多个改进方面。我们的结果表明,所提出的方法具有鲁棒性和可靠性,并且优于所有比较方法。源代码可在 https://github.com/phamquiluan/jdeskew 获取。
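The Fourier core of the method can be sketched with a fixed angular grid (the paper's projection is adaptive, so treat this as a simplified variant). Text lines produce a dominant ridge in the centered magnitude spectrum; the candidate angle whose radial line accumulates the most energy gives the skew estimate:

import numpy as np

def estimate_skew(gray, angle_res=0.1, max_angle=15.0):
    # Radial projection over the centered 2D Fourier magnitude spectrum.
    F = np.fft.fftshift(np.abs(np.fft.fft2(gray)))
    h, w = F.shape
    cy, cx = h // 2, w // 2
    radii = np.arange(5, min(cy, cx))           # skip the DC-dominated center
    angles = np.arange(-max_angle, max_angle + angle_res, angle_res)
    best_angle, best_energy = 0.0, -1.0
    for a in angles:
        theta = np.deg2rad(90.0 + a)            # text ridge is near-vertical
        ys = (cy + radii * np.sin(theta)).astype(int)
        xs = (cx + radii * np.cos(theta)).astype(int)
        energy = F[ys, xs].sum()
        if energy > best_energy:
            best_angle, best_energy = a, energy
    return best_angle

angle = estimate_skew(np.random.rand(256, 256))  # placeholder document image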
cs.CV / 48 / 2603.05947

LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Generative Real-World Super-Resolution

LucidNFT:基于LR的多奖励偏好优化用于生成真实世界超分辨率
Fei, Song, Ye, Tian, Chen, Sixiang, Xing, Zhaohu, Lai, Jianyu, Zhu, Lei
Abstract
Generative real-world image super-resolution (Real-ISR) can synthesize visually convincing details from severely degraded low-resolution (LR) inputs, yet its stochastic sampling makes a critical failure mode hard to avoid: outputs may look sharp but be unfaithful to the LR evidence (semantic and structural hallucination), while such LR-anchored faithfulness is difficult to assess without HR ground truth. Preference-based reinforcement learning (RL) is a natural fit because each LR input yields a rollout group of candidates to compare. However, effective alignment in Real-ISR is hindered by (i) the lack of a degradation-robust LR-referenced faithfulness signal, and (ii) a rollout-group optimization bottleneck where naive multi-reward scalarization followed by normalization compresses objective-wise contrasts, causing advantage collapse and weakening the reward-weighted updates in DiffusionNFT-style forward fine-tuning. Moreover, (iii) limited coverage of real degradations restricts rollout diversity and preference signal quality. We propose LucidNFT, a multi-reward RL framework for flow-matching Real-ISR. LucidNFT introduces LucidConsistency, a degradation-robust semantic evaluator that makes LR-anchored faithfulness measurable and optimizable; a decoupled advantage normalization strategy that preserves objective-wise contrasts within each LR-conditioned rollout group before fusion, preventing advantage collapse; and LucidLR, a large-scale collection of real-world degraded images to support robust RL fine-tuning. Experiments show that LucidNFT consistently improves strong flow-based Real-ISR baselines, achieving better perceptual-faithfulness trade-offs with stable optimization dynamics across diverse real-world scenarios.
Chinese Translation
生成真实世界图像超分辨率(Real-ISR)能够从严重退化的低分辨率(LR)输入中合成视觉上令人信服的细节,然而其随机采样使得一种关键的失败模式难以避免:输出可能看起来清晰,但与LR证据不一致(语义和结构幻觉),而这种基于LR的忠实度在没有高分辨率(HR)真实值的情况下难以评估。基于偏好的强化学习(RL)是一种自然的选择,因为每个LR输入产生一组候选输出进行比较。然而,Real-ISR中的有效对齐受到以下因素的阻碍:(i)缺乏一种对退化具有鲁棒性的LR参考忠实度信号,以及(ii)一种滚动组优化瓶颈,其中简单的多奖励标量化后进行归一化压缩了目标间的对比度,导致优势崩溃并削弱了在DiffusionNFT风格的前向微调中的奖励加权更新。此外,(iii)真实退化的覆盖范围有限,限制了滚动的多样性和偏好信号的质量。我们提出了LucidNFT,一个用于流匹配Real-ISR的多奖励RL框架。LucidNFT引入了LucidConsistency,一种对退化具有鲁棒性的语义评估器,使得基于LR的忠实度可测量和可优化;一种解耦的优势归一化策略,在融合之前保留每个LR条件滚动组内的目标间对比度,防止优势崩溃;以及LucidLR,一个大规模的真实世界退化图像集合,以支持鲁棒的RL微调。实验表明,LucidNFT在强大的基于流的Real-ISR基线中始终表现出改善,在各种真实世界场景中实现了更好的感知忠实度权衡和稳定的优化动态。
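Among the three components, the decoupled advantage normalization is the easiest to write down: normalize each objective separately within the rollout group, then fuse, so that per-objective contrasts survive. The z-score normalization and weighted fusion below are assumptions about the unspecified details:

import torch

def decoupled_advantages(rewards, weights):
    # rewards: (G, K) = K objective rewards for a rollout group of G candidates.
    # Naive scalarize-then-normalize can collapse contrasts when one objective
    # dominates the group variance; normalizing per objective avoids that.
    adv = (rewards - rewards.mean(0)) / (rewards.std(0) + 1e-6)   # (G, K)
    return adv @ weights                                          # (G,)

rewards = torch.randn(8, 3)                     # e.g., fidelity, consistency, aesthetics
weights = torch.tensor([0.5, 0.3, 0.2])
advantages = decoupled_advantages(rewards, weights)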
cs.CV / 49 / 2603.05950

Energy-Driven Adaptive Visual Token Pruning for Efficient Vision-Language Models

基于能量驱动的自适应视觉令牌修剪用于高效的视觉-语言模型
He, Jialuo, Chen, Huangxun
Abstract
Visual token reduction is critical for accelerating Vision-Language Models (VLMs), yet most existing approaches rely on a fixed budget shared across all inputs, overlooking the substantial variation in image information density. We propose E-AdaPrune, an energy-driven adaptive pruning framework that determines the token budget from the singular value spectrum of the visual feature space. By preserving a certain proportion of spectral energy, our method allocates more tokens to information-dense scenes while aggressively compressing redundant ones, without introducing additional learnable parameters. We evaluate E-AdaPrune on nine benchmarks and three VLM backbones: LLaVA-1.5-7B, LLaVA-1.5-13B, and LLaVA-NeXT-8B. Under matched average token budgets, E-AdaPrune consistently yields an average improvement of up to 0.6%, including a significant +5.1% relative boost on the MMVet reasoning task. Using randomized singular value decomposition, the additional latency is limited to 8 ms per image.
Chinese Translation
视觉令牌的减少对于加速视觉-语言模型(VLMs)至关重要,但大多数现有方法依赖于在所有输入中共享的固定预算,忽视了图像信息密度的显著变化。我们提出了E-AdaPrune,这是一种基于能量驱动的自适应修剪框架,它通过视觉特征空间的奇异值谱来确定令牌预算。通过保留一定比例的谱能量,我们的方法为信息密集的场景分配更多的令牌,同时对冗余场景进行强有力的压缩,而不引入额外的可学习参数。我们在九个基准测试和三个VLM骨干网络(LLaVA-1.5-7B、LLaVA-1.5-13B和LLaVA-NeXT-8B)上评估了E-AdaPrune。在匹配的平均令牌预算下,E-AdaPrune始终实现了高达0.6%的平均提升,包括在MMVet推理任务上显著的+5.1%的相对提升。使用随机奇异值分解,额外的延迟限制在每张图像8毫秒以内。
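The energy-driven budget itself is a few lines: the budget is the number of singular values needed to retain a target share of spectral energy, after which the highest-scoring tokens are kept. Plain SVD and token-norm scoring are used below for brevity, whereas the paper uses randomized SVD and its own token scores:

import torch

def energy_token_budget(tokens, energy_ratio=0.9, min_keep=16):
    # tokens: (N, D) visual tokens for one image.
    s = torch.linalg.svdvals(tokens)
    energy = s.pow(2)
    cum = energy.cumsum(0) / energy.sum()
    k = int((cum < energy_ratio).sum().item()) + 1
    return max(k, min_keep)

def prune_tokens(tokens, scores, budget):
    keep = scores.topk(min(budget, tokens.shape[0])).indices.sort().values
    return tokens[keep]                         # keep original token order

tokens = torch.randn(576, 1024)                 # e.g., LLaVA-1.5 visual tokens
budget = energy_token_budget(tokens)            # larger for information-dense images
pruned = prune_tokens(tokens, tokens.norm(dim=-1), budget)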
cs.CV / 50 / 2603.05952

Unify the Views: View-Consistent Prototype Learning for Few-Shot Segmentation

统一视角:视图一致的原型学习用于少样本分割
Liu, Hongli, Wang, Yu, Zhao, Shengjie
Abstract
Few-shot segmentation (FSS) has gained significant attention for its ability to generalize to novel classes with limited supervision, yet remains challenged by structural misalignment and cross-view inconsistency under large appearance or viewpoint variations. This paper tackles these challenges by introducing VINE (View-Informed NEtwork), a unified framework that jointly models structural consistency and foreground discrimination to refine class-specific prototypes. Specifically, VINE introduces a spatial-view graph on backbone features, where the spatial graph captures local geometric topology and the view graph connects features from different perspectives to propagate view-invariant structural semantics. To further alleviate foreground ambiguity, we derive a discriminative prior from the support-query feature discrepancy to capture category-specific contrast, which reweights SAM features by emphasizing salient regions and recalibrates backbone activations for improved structural focus. The foreground-enhanced SAM features and structurally enriched ResNet features are progressively integrated through masked cross-attention, yielding class-consistent prototypes used as adaptive prompts for the SAM decoder to generate accurate masks. Extensive experiments on multiple FSS benchmarks validate the effectiveness and robustness of VINE, particularly under challenging scenarios with viewpoint shifts and complex structures. The code is available at https://github.com/HongliLiu1/VINE-main.
Chinese Translation
少样本分割(FSS)因其在有限监督下对新类别的泛化能力而受到广泛关注,但在大外观或视角变化下,仍面临结构不对齐和视角间不一致的挑战。本文通过引入VINE(视角信息网络),提出了一种统一框架,联合建模结构一致性和前景区分,以优化类别特定的原型。具体而言,VINE在主干特征上引入了空间视图图,其中空间图捕捉局部几何拓扑,而视图图则连接来自不同视角的特征,以传播视角不变的结构语义。为了进一步缓解前景模糊性,我们从支持-查询特征差异中推导出一种区分性先验,以捕捉类别特定的对比,这通过强调显著区域重新加权SAM特征,并重新校准主干激活以提高结构聚焦。前景增强的SAM特征和结构丰富的ResNet特征通过掩码交叉注意力逐步整合,生成用于SAM解码器的类别一致原型,以生成准确的掩码。在多个FSS基准上的广泛实验验证了VINE的有效性和鲁棒性,特别是在视角变化和复杂结构的挑战场景下。代码可在https://github.com/HongliLiu1/VINE-main获取。
cs.CV / 51 / 2603.05959

OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer

OVGGT:O(1) 常量成本流式视觉几何变换器
Lu, Si-Yu, Chen, Po-Ting, Hsu, Hui-Che, Jhong, Sin-Ye, Cheng, Wen-Huang, Chen, Yung-Yao
Abstract
Reconstructing 3D geometry from streaming video requires continuous inference under bounded resources. Recent geometric foundation models achieve impressive reconstruction quality through all-to-all attention, yet their quadratic cost confines them to short, offline sequences. Causal-attention variants such as StreamVGGT enable single-pass streaming but accumulate an ever-growing KV cache, exhausting GPU memory within hundreds of frames and precluding the long-horizon deployment that motivates streaming inference in the first place. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching, which leverages FFN residual magnitudes to compress the KV cache while remaining fully compatible with FlashAttention, with Dynamic Anchor Protection, which shields coordinate-critical tokens from eviction to suppress geometric drift over extended trajectories. Extensive experiments on indoor, outdoor, and ultra-long sequence benchmarks demonstrate that OVGGT processes arbitrarily long videos within a constant VRAM envelope while achieving state-of-the-art 3D geometric accuracy.
Chinese Translation
从流式视频重建三维几何体需要在有限资源下进行持续推理。近期的几何基础模型通过全对全注意力实现了令人印象深刻的重建质量,然而其二次成本限制了它们在短的离线序列中的应用。因果注意力变体如 StreamVGGT 使得单次流式推理成为可能,但会累积不断增长的键值(KV)缓存,导致在数百帧内耗尽 GPU 内存,从而阻碍了长时间部署,这正是流式推理的初衷。我们提出了 OVGGT,这是一种无训练框架,无论序列长度如何,都将内存和计算限制在固定预算内。我们的方法结合了自选择缓存(Self-Selective Caching),利用前馈网络(FFN)残差幅度压缩 KV 缓存,同时与 FlashAttention 完全兼容,以及动态锚点保护(Dynamic Anchor Protection),保护坐标关键令牌不被驱逐,以抑制在长时间轨迹上的几何漂移。在室内、室外和超长序列基准上的广泛实验表明,OVGGT 能够在一个恒定的 VRAM 限制内处理任意长度的视频,同时实现最先进的三维几何精度。
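The two cache policies combine into a simple eviction rule, sketched below: anchor positions are exempt from eviction, and the remaining slots go to the tokens with the largest importance scores (FFN residual magnitudes in the paper). The shapes and toy scores are illustrative:

import torch

def evict_kv(keys, values, scores, budget, protected):
    # Compress one layer's KV cache to at most `budget` entries.
    n = keys.shape[0]
    mask = torch.zeros(n, dtype=torch.bool)
    mask[torch.as_tensor(protected)] = True     # coordinate-critical anchors stay
    free = (~mask).nonzero(as_tuple=True)[0]
    n_fill = max(budget - int(mask.sum()), 0)
    if n_fill > 0 and free.numel() > 0:
        top = scores[free].topk(min(n_fill, free.numel())).indices
        mask[free[top]] = True
    keep = mask.nonzero(as_tuple=True)[0]
    return keys[keep], values[keep]

K, V = torch.randn(500, 64), torch.randn(500, 64)
scores = torch.rand(500)                        # stand-in for FFN residual norms
K2, V2 = evict_kv(K, V, scores, budget=128, protected=[0, 1, 2, 3])

Because the budget is fixed, memory and per-step attention cost stay constant no matter how many frames have streamed past.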
cs.CV / 52 / 2603.05962

Exploring Open-Vocabulary Object Recognition in Images using CLIP

基于CLIP的开放词汇图像对象识别探索
Chen, Wei Yu, Dai, Ying
Abstract
To address the limitations of existing open-vocabulary object recognition methods, specifically high system complexity, substantial training costs, and limited generalization, this paper proposes a novel Open-Vocabulary Object Recognition (OVOR) framework based on a streamlined two-stage strategy: object segmentation followed by recognition. The framework eliminates the need for complex retraining and labor-intensive annotation. After cropping object regions, we generate object-level image embeddings alongside category-level text embeddings using CLIP, which facilitates arbitrary vocabularies. To reduce reliance on CLIP and enhance encoding flexibility, we further introduce a CNN/MLP-based method that extracts convolutional neural network (CNN) feature maps and utilizes a multilayer perceptron (MLP) to align visual features with text embeddings. These embeddings are concatenated and processed via Singular Value Decomposition (SVD) to construct a shared representation space. Finally, recognition is performed through embedding similarity matching. Experiments on COCO, Pascal VOC, and ADE20K demonstrate that training-free, CLIP-based encoding without SVD achieves the highest average AP, outperforming current state-of-the-art methods. Simultaneously, the results highlight the potential of CNN/MLP-based image encoding for OVOR.
Chinese Translation
为了解决现有开放词汇对象识别方法的局限性,特别是系统复杂性高、训练成本高和泛化能力有限,本文提出了一种基于简化两阶段策略的新型开放词汇对象识别(OVOR)框架:对象分割后进行识别。该框架消除了复杂的再训练和劳动密集型标注的需求。在裁剪对象区域后,我们使用CLIP生成对象级图像嵌入和类别级文本嵌入,从而支持任意词汇。为了减少对CLIP的依赖并增强编码灵活性,我们进一步引入了一种基于CNN/MLP的方法,该方法提取卷积神经网络(CNN)特征图,并利用多层感知器(MLP)将视觉特征与文本嵌入对齐。这些嵌入被连接并通过奇异值分解(SVD)处理,以构建共享表示空间。最后,通过嵌入相似性匹配进行识别。在COCO、Pascal VOC和ADE20K上的实验表明,无需训练的基于CLIP的编码方法在不使用SVD的情况下实现了最高的平均AP,超越了当前的最先进方法。同时,结果突显了基于CNN/MLP的图像编码在OVOR中的潜力。
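The training-free recognition stage reduces to similarity matching in a shared embedding space. A minimal sketch with random stand-ins (a real run would segment the image, crop each object region, and encode crops and category prompts with CLIP):

import torch
import torch.nn.functional as F

def recognize(region_embeds, text_embeds, vocab):
    # Cosine similarity between object-level image embeddings and
    # category-level text embeddings; the vocabulary is arbitrary.
    r = F.normalize(region_embeds, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    sims = r @ t.t()                            # (num_regions, num_categories)
    return [vocab[i] for i in sims.argmax(dim=-1).tolist()], sims

vocab = ["cat", "dog", "bicycle"]
region_embeds = torch.randn(5, 512)             # stand-ins for CLIP image embeddings
text_embeds = torch.randn(3, 512)               # stand-ins for CLIP text embeddings
labels, sims = recognize(region_embeds, text_embeds, vocab)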
cs.CV / 53 / 2603.05963

Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models

骨架到图像编码:通过视觉预训练模型实现骨架表示学习
Yang, Siyuan, Liu, Jun, Cheng, Hao, Wang, Chong, Lu, Shijian, Kjellstrom, Hedvig, Lin, Weisi, Kot, Alex C.
Abstract
Recent advances in large-scale pretrained vision models have demonstrated impressive capabilities across a wide range of downstream tasks, including cross-modal and multi-modal scenarios. However, their direct application to 3D human skeleton data remains challenging due to fundamental differences in data format. Moreover, the scarcity of large-scale skeleton datasets and the need to incorporate skeleton data into multi-modal action recognition without introducing additional model branches present significant research opportunities. To address these challenges, we introduce Skeleton-to-Image Encoding (S2I), a novel representation that transforms skeleton sequences into image-like data by partitioning and arranging joints based on body-part semantics and resizing to standardized image dimensions. This encoding enables, for the first time, the use of powerful vision-pretrained models for self-supervised skeleton representation learning, effectively transferring rich visual-domain knowledge to skeleton analysis. While existing skeleton methods often design models tailored to specific, homogeneous skeleton formats, they overlook the structural heterogeneity that naturally arises from diverse data sources. In contrast, our S2I representation offers a unified image-like format that naturally accommodates heterogeneous skeleton data. Extensive experiments on NTU-60, NTU-120, and PKU-MMD demonstrate the effectiveness and generalizability of our method for self-supervised skeleton representation learning, including under challenging cross-format evaluation settings.
Chinese Translation
近期大规模预训练视觉模型的进展在多种下游任务中展现了令人印象深刻的能力,包括跨模态和多模态场景。然而,由于数据格式的根本差异,这些模型在3D人类骨架数据上的直接应用仍然面临挑战。此外,大规模骨架数据集的稀缺以及将骨架数据融入多模态动作识别而不引入额外模型分支的需求,提供了重要的研究机会。为了解决这些挑战,我们提出了骨架到图像编码(Skeleton-to-Image Encoding, S2I),这是一种新颖的表示方法,通过基于身体部位语义对关节进行分区和排列,并调整为标准化的图像尺寸,将骨架序列转换为类似图像的数据。这种编码首次使得强大的视觉预训练模型能够用于自监督骨架表示学习,有效地将丰富的视觉领域知识转移到骨架分析中。虽然现有的骨架方法通常设计针对特定、同质骨架格式的模型,但它们忽视了来自多样化数据源的结构异质性。相比之下,我们的S2I表示提供了一种统一的类似图像格式,自然适应异质骨架数据。在NTU-60、NTU-120和PKU-MMD上的大量实验展示了我们的方法在自监督骨架表示学习中的有效性和普适性,包括在具有挑战性的跨格式评估设置下。
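The S2I encoding can be sketched in a few lines: joints, grouped by body part, index one image axis; frames index the other; coordinates fill the channels; and the result is resized to a standard input resolution. The nearest-neighbor resize and identity part order below are simplifying assumptions:

import numpy as np

def skeleton_to_image(seq, part_order, size=(224, 224)):
    # seq: (T, J, 3) joint coordinates over T frames.
    img = seq[:, part_order, :].transpose(1, 0, 2)    # (J, T, 3)
    lo, hi = img.min(), img.max()
    img = (img - lo) / (hi - lo + 1e-6)               # normalize to [0, 1]
    ridx = np.arange(size[0]) * img.shape[0] // size[0]
    cidx = np.arange(size[1]) * img.shape[1] // size[1]
    return img[ridx][:, cidx]                         # nearest-neighbor resize

seq = np.random.rand(64, 25, 3)                 # 64 frames, 25 joints as in NTU RGB+D
pseudo_img = skeleton_to_image(seq, part_order=np.arange(25))   # (224, 224, 3)

Since the output is just an image tensor, any vision-pretrained backbone consumes it unchanged, which is what makes heterogeneous skeleton formats interchangeable.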
cs.CV / 54 / 2603.05964

CR-QAT: Curriculum Relational Quantization-Aware Training for Open-Vocabulary Object Detection

CR-QAT:课程关系量化感知训练用于开放词汇物体检测
Park, Jinyeong, Kim, Donghwa, Kang, Brent ByungHoon, Baek, Hyeongboo, Kim, Jibum
Abstract
Open-vocabulary object detection (OVOD) enables novel category detection via vision-language alignment, but massive model sizes hinder deployment on resource-constrained devices. While quantization offers practical compression, we reveal that naive extreme low-bit (e.g., 4-bit) quantization severely degrades fine-grained vision-language alignment and distorts inter-region relational structures. To address this, we propose curriculum relational quantization-aware training (CR-QAT), an integrated framework combining stage-by-stage optimization with relational knowledge distillation. Within CR-QAT, curriculum QAT (CQAT) mitigates error accumulation by partitioning the model for progressive quantization, ensuring stable optimization via error isolation. Concurrently, text-centric relational KD (TRKD) is applied to task-relevant modules. By constructing text-anchored pairwise similarity matrices, TRKD comprehensively transfers the teacher's multi-dimensional relational knowledge. Experiments on LVIS and COCO zero-shot benchmarks demonstrate that CR-QAT consistently outperforms existing QAT baselines under aggressive low-bit settings, achieving relative AP improvements of up to 38.9% and 40.9%, respectively.
Chinese Translation
开放词汇物体检测(OVOD)通过视觉-语言对齐实现新类别检测,但庞大的模型规模阻碍了在资源受限设备上的部署。尽管量化提供了实用的压缩方案,但我们发现简单的极低位(例如,4位)量化会严重降低细粒度的视觉-语言对齐,并扭曲区域间的关系结构。为了解决这个问题,我们提出了课程关系量化感知训练(CR-QAT),这是一个将阶段性优化与关系知识蒸馏相结合的综合框架。在CR-QAT中,课程量化感知训练(CQAT)通过对模型进行分区以实现渐进量化,从而减轻误差累积,确保通过误差隔离实现稳定优化。同时,文本中心的关系知识蒸馏(TRKD)被应用于与任务相关的模块。通过构建文本锚定的成对相似性矩阵,TRKD全面传递教师的多维关系知识。在LVIS和COCO零样本基准上的实验表明,CR-QAT在激进的低位设置下始终优于现有的量化感知训练基线,分别实现了高达38.9%和40.9%的相对平均精度提升。
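The text-anchored relational target is easy to write down: build region-to-text similarity matrices for teacher and student and match them row-wise. The KL divergence over softened similarities below is a common distillation objective assumed for illustration; the paper's exact loss is not spelled out in the abstract:

import torch
import torch.nn.functional as F

def text_anchored_relation(region_feats, text_feats):
    # Pairwise similarity matrix between region features and text anchors.
    r = F.normalize(region_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    return r @ t.t()                            # (num_regions, num_texts)

def relational_kd_loss(student_rel, teacher_rel, tau=1.0):
    p = F.log_softmax(student_rel / tau, dim=-1)
    q = F.softmax(teacher_rel / tau, dim=-1)
    return F.kl_div(p, q, reduction="batchmean") * tau * tau

student = text_anchored_relation(torch.randn(10, 256), torch.randn(20, 256))
teacher = text_anchored_relation(torch.randn(10, 256), torch.randn(20, 256))
loss = relational_kd_loss(student, teacher)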
cs.CV / 55 / 2603.05969

Imagine How To Change: Explicit Procedure Modeling for Change Captioning

想象如何改变:用于变化描述的显式过程建模
Sun, Jiayang, Guo, Zixin, Cao, Min, Zhu, Guibo, Laaksonen, Jorma
Abstract
Change captioning generates descriptions that explicitly describe the differences between two visually similar images. Existing methods operate on static image pairs, thus ignoring the rich temporal dynamics of the change procedure, which is the key to understanding not only what has changed but also how it occurs. We introduce ProCap, a novel framework that reformulates change modeling from static image comparison to dynamic procedure modeling. ProCap features a two-stage design: the first stage trains a procedure encoder to learn the change procedure from a sparse set of keyframes. These keyframes are obtained by automatically generating intermediate frames to make the implicit procedural dynamics explicit and then sampling them to mitigate redundancy. The encoder then learns to capture the latent dynamics of these keyframes via a caption-conditioned, masked reconstruction task. The second stage integrates this trained encoder within an encoder-decoder model for captioning. Instead of relying on explicit frames from the previous stage, a process that incurs computational overhead and sensitivity to visual noise, we introduce learnable procedure queries to prompt the encoder to infer the latent procedure representation, which the decoder then translates into text. The entire model is then trained end-to-end with a captioning loss, ensuring the encoder's output is both temporally coherent and captioning-aligned. Experiments on three datasets demonstrate the effectiveness of ProCap. Code and pre-trained models are available at https://github.com/BlueberryOreo/ProCap
Chinese Translation
变化描述生成明确描述两幅视觉相似图像之间差异的文本。现有方法基于静态图像对进行操作,因此忽略了变化过程的丰富时间动态,而这些动态是理解变化不仅发生了什么,而且如何发生的关键。我们提出了ProCap,一个新颖的框架,将变化建模从静态图像比较重新构建为动态过程建模。ProCap具有两阶段设计:第一阶段训练一个过程编码器,从稀疏的关键帧集中学习变化过程。这些关键帧通过自动生成中间帧来获得,以使隐式的过程动态显式化,然后对其进行采样以减轻冗余。接着,编码器通过一个基于描述条件的掩蔽重建任务学习捕捉这些关键帧的潜在动态。第二阶段将这个训练好的编码器集成到一个编码器-解码器模型中进行描述生成。我们引入可学习的过程查询,以提示编码器推断潜在的过程表示,而不是依赖于前一阶段的显式帧——这一过程会带来计算开销和对视觉噪声的敏感性,解码器随后将其转换为文本。整个模型通过描述损失进行端到端训练,确保编码器的输出在时间上连贯且与描述对齐。在三个数据集上的实验表明了ProCap的有效性。代码和预训练模型可在 https://github.com/BlueberryOreo/ProCap 获取。
cs.CV / 56 / 2603.05970

Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions

打破平滑运动假设:复杂和不利条件下的无人机多目标跟踪基准
Ye, Jingtao, Zhang, Kexin, Ma, Xunchi, Li, Yuehan, Zhu, Guangming, Shen, Peiyi, Jiang, Linhua, Zhang, Xiangdong, Zhang, Liang
Abstract
The rapid movements and agile maneuvers of unmanned aerial vehicles (UAVs) induce significant observational challenges for multi-object tracking (MOT). However, existing UAV-perspective MOT benchmarks often lack these complexities, featuring predominantly predictable camera dynamics and linear motion patterns. To address this gap, we introduce DynUAV, a new benchmark for dynamic UAV-perspective MOT, characterized by intense ego-motion and the resulting complex apparent trajectories. The benchmark comprises 42 video sequences with over 1.7 million bounding box annotations, covering vehicles, pedestrians, and specialized industrial categories such as excavators, bulldozers and cranes. Compared to existing benchmarks, DynUAV introduces substantial challenges arising from ego-motion, including drastic scale changes and viewpoint changes, as well as motion blur. Comprehensive evaluations of state-of-the-art trackers on DynUAV reveal their limitations, particularly in managing the intertwined challenges of detection and association under such dynamic conditions, thereby establishing DynUAV as a rigorous benchmark. We anticipate that DynUAV will serve as a demanding testbed to spur progress in real-world UAV-perspective MOT, and we will make all resources available at link.
Chinese Translation
无人机(UAV)的快速移动和灵活机动给多目标跟踪(MOT)带来了显著的观察挑战。然而,现有的无人机视角MOT基准往往缺乏这些复杂性,主要呈现可预测的相机动态和线性运动模式。为了解决这一问题,我们引入了DynUAV,这是一个新的动态无人机视角MOT基准,其特点是强烈的自我运动及由此产生的复杂表观轨迹。该基准包含42个视频序列,超过170万个边界框注释,涵盖了车辆、行人以及挖掘机、推土机和起重机等专业工业类别。与现有基准相比,DynUAV引入了由于自我运动而产生的重大挑战,包括剧烈的尺度变化和视角变化,以及运动模糊。在DynUAV上对最先进的跟踪器进行的全面评估揭示了它们的局限性,特别是在动态条件下管理检测和关联交织挑战方面,从而确立了DynUAV作为一个严格的基准。我们预计DynUAV将作为一个严苛的测试平台,促进现实世界无人机视角MOT的进展,我们将提供所有资源的链接。
cs.CV / 57 / 2603.05971

Towards High-resolution and Disentangled Reference-based Sketch Colorization

面向高分辨率和解耦参考基础的素描上色
Yan, Dingkun, Wang, Xinrui, Wang, Ru, Li, Zhuoru, Yu, Jinze, Iwasawa, Yusuke, Matsuo, Yutaka, Guo, Jiaxian
Abstract
Sketch colorization is a critical task for automating and assisting in the creation of animations and digital illustrations. Previous research identified the primary difficulty as the distribution shift between semantically aligned training data and highly diverse test data, and focused on mitigating the artifacts caused by the distribution shift instead of fundamentally resolving the problem. In this paper, we present a framework that directly minimizes the distribution shift, thereby achieving superior quality, resolution, and controllability of colorization. We propose a dual-branch framework to explicitly model the data distributions of the training process and inference process with a semantic-aligned branch and a semantic-misaligned branch, respectively. A Gram Regularization Loss is applied across the feature maps of both branches, effectively enforcing cross-domain distribution coherence and stability. Furthermore, we adopt an anime-specific Tagger Network to extract fine-grained attributions from reference images and modulate SDXL's conditional encoders to ensure precise control, and a plugin module to enhance texture transfer. Quantitative and qualitative comparisons, alongside user studies, confirm that our method effectively overcomes the distribution shift challenge, establishing State-of-the-Art performance across both quality and controllability metrics. Ablation study reveals the influence of each component.
Chinese Translation
素描上色是自动化和辅助动画及数字插图创作的关键任务。以往的研究将主要困难归结为语义对齐训练数据与高度多样化测试数据之间的分布偏移,并集中于减轻由分布偏移引起的伪影,而不是从根本上解决这个问题。本文提出了一种框架,直接最小化分布偏移,从而实现更优质的上色效果、更高的分辨率和更好的可控性。我们提出了一种双分支框架,分别通过语义对齐分支和语义不对齐分支显式建模训练过程和推理过程的数据分布。对两个分支的特征图应用Gram正则化损失,有效地强制执行跨域分布的一致性和稳定性。此外,我们采用了一种特定于动漫的Tagger网络,从参考图像中提取细粒度属性,并调节SDXL的条件编码器以确保精确控制,同时引入插件模块以增强纹理转移。定量和定性比较以及用户研究证实我们的方法有效克服了分布偏移的挑战,在质量和可控性指标上建立了最先进的性能。消融研究揭示了各个组件的影响。
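The Gram Regularization Loss mentioned above is a standard construction; a short PyTorch sketch under assumed shapes (per-layer feature maps from the two branches) looks like this:

import torch

def gram_matrix(feat):
    # feat: (B, C, H, W) feature map -> (B, C, C) normalized Gram matrix.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def gram_regularization_loss(feats_aligned, feats_misaligned):
    # Penalize the Gram-matrix gap between the semantic-aligned and
    # semantic-misaligned branches, layer by layer.
    loss = 0.0
    for fa, fm in zip(feats_aligned, feats_misaligned):
        loss = loss + torch.mean((gram_matrix(fa) - gram_matrix(fm)) ** 2)
    return loss / len(feats_aligned)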
cs.CV / 58 / 2603.05997

MM-ISTS: Cooperating Irregularly Sampled Time Series Forecasting with Multimodal Vision-Text LLMs

MM-ISTS:利用多模态视觉-文本大语言模型进行不规则采样时间序列预测
Lei, Zhi, Liu, Chenxi, Miao, Hao, Qiu, Wanghui, Yang, Bin, Guo, Chenjuan
Abstract
Irregularly sampled time series (ISTS) are widespread in real-world scenarios, exhibiting asynchronous observations on uneven time intervals across variables. Existing ISTS forecasting methods often solely utilize historical observations to predict future ones while falling short in learning contextual semantics and fine-grained temporal patterns. To address these problems, we present MM-ISTS, a multimodal framework augmented by vision-text large language models that bridges temporal, visual, and textual modalities, facilitating ISTS forecasting. MM-ISTS encompasses a novel two-stage encoding mechanism. In particular, a cross-modal vision-text encoding module is proposed to automatically generate informative visual images and textual data, enabling the capture of intricate temporal patterns and comprehensive contextual understanding, in collaboration with multimodal LLMs (MLLMs). In parallel, ISTS encoding extracts complementary yet enriched temporal features from historical ISTS observations, including multi-view embedding fusion and a temporal-variable encoder. Further, we propose an adaptive query-based feature extractor to compress the learned tokens of MLLMs, filtering out small-scale useful knowledge, which in turn reduces computational costs. In addition, a multimodal alignment module with modality-aware gating is designed to alleviate the modality gap across ISTS, images, and text. Extensive experiments on real data offer insight into the effectiveness of the proposed solutions.
Chinese Translation
不规则采样时间序列(ISTS)在现实场景中广泛存在,表现为在不同变量之间的不均匀时间间隔内的异步观测。现有的ISTS预测方法通常仅利用历史观测来预测未来观测,而在学习上下文语义和细粒度时间模式方面存在不足。为了解决这些问题,我们实现了MM-ISTS,这是一种由视觉-文本大语言模型增强的多模态框架,能够连接时间、视觉和文本模态,促进ISTS预测。MM-ISTS包含一种新颖的两阶段编码机制。特别地,我们提出了一种跨模态视觉-文本编码模块,能够自动生成信息丰富的视觉图像和文本数据,从而与多模态大语言模型(MLLMs)协作,捕捉复杂的时间模式和全面的上下文理解。同时,ISTS编码从历史ISTS观测中提取互补且丰富的时间特征,包括多视角嵌入融合和时间变量编码器。此外,我们提出了一种自适应查询基础的特征提取器,以压缩MLLMs学习到的标记,过滤出小规模的有用知识,从而降低计算成本。此外,设计了一种具有模态感知门控的多模态对齐模块,以减轻ISTS、图像和文本之间的模态差距。在真实数据上的广泛实验提供了对所提解决方案有效性的深入见解。
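The adaptive query-based extractor can be pictured as learnable queries cross-attending into the long MLLM token sequence; a hedged sketch (dimensions, head count, and module layout are assumptions, not the paper's exact design):

import torch
import torch.nn as nn

class QueryCompressor(nn.Module):
    # Compress a long MLLM token sequence into a few learned query tokens.
    def __init__(self, dim=768, num_queries=16, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):  # tokens: (B, N, dim) with N large
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)  # (B, num_queries, dim)
        return self.norm(out)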
cs.CV / 59 / 2603.05999

RePer-360: Releasing Perspective Priors for 360$^\circ$ Depth Estimation via Self-Modulation

RePer-360:通过自我调制释放360$^\circ$深度估计的视角先验
Guan, Cheng, Lin, Chunyu, Shen, Zhijie, Zhang, Junsong, Wang, Jiyuan
Abstract
Recent depth foundation models trained on perspective imagery achieve strong performance, yet generalize poorly to 360$^\circ$ images due to the substantial geometric discrepancy between perspective and panoramic domains. Moreover, fully fine-tuning these models typically requires large amounts of panoramic data. To address this issue, we propose RePer-360, a distortion-aware self-modulation framework for monocular panoramic depth estimation that adapts depth foundation models while preserving powerful pretrained perspective priors. Specifically, we design a lightweight geometry-aligned guidance module to derive a modulation signal from two complementary projections (i.e., ERP and CP) and use it to guide the model toward the panoramic domain without overwriting its pretrained perspective knowledge. We further introduce a Self-Conditioned AdaLN-Zero mechanism that produces pixel-wise scaling factors to reduce the feature distribution gap between the perspective and panoramic domains. In addition, a cubemap-domain consistency loss further improves training stability and cross-projection alignment. By shifting the focus from complementary-projection fusion to panoramic domain adaptation under preserved pretrained perspective priors, RePer-360 surpasses standard fine-tuning methods while using only 1\% of the training data. Under the same in-domain training setting, it further achieves an approximately 20\% improvement in RMSE. Code will be released upon acceptance.
Chinese Translation
最近在透视图像上训练的深度基础模型表现出色,但由于透视域与全景域之间存在显著的几何差异,这些模型在360$^\circ$图像上的泛化能力较差。此外,完全微调这些模型通常需要大量的全景数据。为了解决这一问题,我们提出了RePer-360,一个考虑失真的自我调制框架,用于单目全景深度估计,该框架在保留强大的预训练透视先验的同时,适应深度基础模型。具体而言,我们设计了一个轻量级的几何对齐引导模块,从两个互补投影(即ERP和CP)中推导出调制信号,并利用该信号引导模型向全景域迁移,而不覆盖其预训练的透视知识。我们进一步引入了一种自条件AdaLN-Zero机制,生成逐像素的缩放因子,以减少透视域与全景域之间的特征分布差距。此外,立方体映射域一致性损失进一步提高了训练的稳定性和跨投影对齐。通过将重点从互补投影融合转向在保留预训练透视先验下的全景域适应,RePer-360在仅使用1%训练数据的情况下超越了标准微调方法。在相同的领域内训练设置下,它在均方根误差(RMSE)上进一步实现了约20%的提升。代码将在接受后发布。
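AdaLN-Zero-style modulation, the general mechanism behind the Self-Conditioned variant above, zero-initializes the projection that produces scale and shift so the adapted model starts out as the identity; a generic sketch (channel sizes and the GroupNorm choice are assumptions):

import torch
import torch.nn as nn

class AdaLNZero2d(nn.Module):
    def __init__(self, channels, cond_channels):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels, affine=False)
        self.to_scale_shift = nn.Conv2d(cond_channels, 2 * channels, 1)
        # Zero init: modulation is a no-op at the start of training, so the
        # pretrained priors are not overwritten early on.
        nn.init.zeros_(self.to_scale_shift.weight)
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, x, cond):  # x: (B,C,H,W), cond: (B,Cc,H,W) guidance signal
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=1)
        return self.norm(x) * (1 + scale) + shift  # pixel-wise scaling factors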
cs.CV / 60 / 2603.06002

Demystifying KAN for Vision Tasks: The RepKAN Approach

揭示用于视觉任务的 KAN:RepKAN 方法
Cheon, Minjong
Abstract
Remote sensing image classification is essential for Earth observation, yet standard CNNs and Transformers often function as uninterpretable black-boxes. We propose RepKAN, a novel architecture that integrates the structural efficiency of CNNs with the non-linear representational power of KANs. By utilizing a dual-path design -- Spatial Linear and Spectral Non-linear -- RepKAN enables the autonomous discovery of class-specific spectral fingerprints and physical interaction manifolds. Experimental results on the EuroSAT and NWPU-RESISC45 datasets demonstrate that RepKAN provides explicit physically interpretable reasoning while outperforming state-of-the-art models. These findings indicate that RepKAN holds significant potential to serve as the backbone for future interpretable visual foundation models.
Chinese Translation
遥感图像分类对于地球观测至关重要,但标准卷积神经网络(CNN)和变换器(Transformers)往往作为不可解释的黑箱运作。我们提出了 RepKAN,一种新颖的架构,它将 CNN 的结构效率与 KAN 的非线性表示能力相结合。通过利用双路径设计——空间线性和光谱非线性——RepKAN 实现了类特定光谱指纹和物理交互流形的自主发现。在 EuroSAT 和 NWPU-RESISC45 数据集上的实验结果表明,RepKAN 提供了明确的物理可解释推理,同时超越了最先进的模型。这些发现表明,RepKAN 在未来可解释视觉基础模型中具有重要的潜力。
cs.CV / 61 / 2603.06014

EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation

EffectMaker:统一推理与生成以实现定制化视觉效果创作
Yang, Shiyuan, Li, Ruihuang, Tao, Jiale, Shao, Shuai, Lu, Qinglin, Liao, Jing
Abstract
Visual effects (VFX) are essential for enhancing the expressiveness and creativity of video content, yet producing high-quality effects typically requires expert knowledge and costly production pipelines. Existing AIGC systems face significant challenges in VFX generation due to the scarcity of effect-specific data and the inherent difficulty of modeling supernatural or stylized effects. Moreover, these approaches often require per-effect fine-tuning, which severely limits their scalability and generalization to novel VFX. In this work, we present EffectMaker, a unified reasoning-generation framework that enables reference-based VFX customization. EffectMaker employs a multimodal large language model to interpret high-level effect semantics and reason about how they should adapt to a target subject, while a diffusion transformer leverages in-context learning to capture fine-grained visual cues from reference videos. These two components form a semantic-visual dual-path guidance mechanism that enables accurate, controllable, and effect-consistent synthesis without per-effect fine-tuning. Furthermore, we construct EffectData, the largest high-quality synthetic dataset containing 130k videos across 3k VFX categories, to improve generalization and scalability. Experiments show that EffectMaker achieves superior visual quality and effect consistency over state-of-the-art baselines, offering a scalable and flexible paradigm for customized VFX generation. Project page: https://effectmaker.github.io
Chinese Translation
视觉效果(VFX)对于增强视频内容的表现力和创意至关重要,但制作高质量的效果通常需要专业知识和昂贵的制作流程。现有的人工智能生成内容(AIGC)系统在VFX生成方面面临重大挑战,原因在于效果特定数据的稀缺以及建模超自然或风格化效果的固有困难。此外,这些方法通常需要针对每种效果进行微调,这严重限制了它们在新颖VFX上的可扩展性和泛化能力。在本研究中,我们提出了EffectMaker,一个统一的推理-生成框架,能够实现基于参考的VFX定制。EffectMaker采用多模态大语言模型来解释高层次的效果语义,并推理它们如何适应目标对象,同时扩散变换器利用上下文学习从参考视频中捕捉细粒度的视觉线索。这两个组件形成了一种语义-视觉双路径引导机制,使得在不进行每种效果微调的情况下,实现准确、可控且效果一致的合成。此外,我们构建了EffectData,这是包含130,000个视频和3,000个VFX类别的最大高质量合成数据集,以提高泛化能力和可扩展性。实验表明,EffectMaker在视觉质量和效果一致性方面优于最先进的基线,提供了一种可扩展且灵活的定制VFX生成范式。项目页面:https://effectmaker.github.io
cs.CV / 62 / 2603.06022

MOSIV: Multi-Object System Identification from Videos

MOSIV:基于视频的多物体系统识别
Liu, Chunjiang, Wang, Xiaoyuan, Lin, Qingran, Xiao, Albert, Chen, Haoyu, Wen, Shizheng, Zhang, Hao, Qi, Lu, Yang, Ming-Hsuan, Jeni, Laszlo A., Xu, Min, Zhao, Yizhou
Abstract
We introduce the challenging problem of multi-object system identification from videos, for which prior methods are ill-suited due to their focus on single-object scenes or discrete material classification with a fixed set of material prototypes. To address this, we propose MOSIV, a new framework that directly optimizes for continuous, per-object material parameters using a differentiable simulator guided by geometric objectives derived from video. We also present a new synthetic benchmark with contact-rich, multi-object interactions to facilitate evaluation. On this benchmark, MOSIV substantially improves grounding accuracy and long-horizon simulation fidelity over adapted baselines, establishing it as a strong baseline for this new task. Our analysis shows that object-level fine-grained supervision and geometry-aligned objectives are critical for stable optimization in these complex, multi-object settings. The source code and dataset will be released.
Chinese Translation
我们提出了一个具有挑战性的多物体系统识别问题,该问题在视频中进行,之前的方法由于专注于单物体场景或使用固定材料原型进行离散材料分类而不适用。为了解决这个问题,我们提出了MOSIV,一个新的框架,它通过一个可微分的模拟器直接优化每个物体的连续材料参数,该模拟器由从视频中推导出的几何目标引导。我们还提出了一个新的合成基准,包含丰富接触的多物体交互,以便于评估。在这个基准上,MOSIV显著提高了基础准确性和长时间模拟的保真度,确立了其作为这一新任务的强基准。我们的分析表明,物体级的细粒度监督和几何对齐目标对于在这些复杂的多物体环境中实现稳定优化至关重要。源代码和数据集将会发布。
cs.CV / 63 / 2603.06032

StruVis: Enhancing Reasoning-based Text-to-Image Generation via Thinking with Structured Vision

StruVis:通过结构化视觉思维增强基于推理的文本到图像生成
Lyu, Yuanhuiyi, Lei, Kaiyu, Weng, Ziqiao, Zheng, Xu, Jiang, Lutao, Li, Teng, Li, Yangfu, Huang, Ziyuan, Zhang, Linfeng, Hu, Xuming
Abstract
Reasoning-based text-to-image (T2I) generation requires models to interpret complex prompts accurately. Existing reasoning frameworks can be broadly categorized into two types: (1) Text-Only Reasoning, which is computationally efficient but lacks access to visual context, often resulting in the omission of critical spatial and visual elements; and (2) Text-Image Interleaved Reasoning, which leverages a T2I generator to provide visual references during the reasoning process. While this approach enhances visual grounding, it incurs substantial computational costs and constrains the reasoning capacity of MLLMs to the representational limitations of the generator. To this end, we propose StruVis, a novel framework that enhances T2I generation through Thinking with Structured Vision. Instead of relying on intermediate image generation, StruVis employs text-based structured visual representations as intermediate reasoning states, thereby enabling the MLLM to effectively "perceive" visual structure within a purely text-based reasoning process. Powered by this, the reasoning potential for T2I generation of the MLLM is unlocked through structured-vision-guided reasoning. Additionally, as a generator-agnostic reasoning framework, our proposed StruVis can be seamlessly integrated with diverse T2I generators and efficiently enhance their performance in reasoning-based T2I generation. Extensive experiments demonstrate that StruVis achieves significant performance improvements on reasoning-based T2I benchmarks, e.g., a 4.61% gain on T2I-ReasonBench and a 4% gain on WISE.
Chinese Translation
基于推理的文本到图像(T2I)生成要求模型准确解读复杂的提示。现有的推理框架大致可分为两类:(1)仅文本推理(Text-Only Reasoning),该方法计算效率高,但缺乏视觉上下文,常常导致关键空间和视觉元素的遗漏;(2)文本-图像交错推理(Text-Image Interleaved Reasoning),该方法利用T2I生成器在推理过程中提供视觉参考。尽管这种方法增强了视觉基础,但它带来了巨大的计算成本,并限制了多模态大语言模型(MLLMs)的推理能力,使其受限于生成器的表征限制。为此,我们提出了StruVis,一个通过结构化视觉思维增强T2I生成的新框架。StruVis不依赖于中间图像生成,而是采用基于文本的结构化视觉表征作为中间推理状态,从而使MLLM能够在纯文本推理过程中有效“感知”视觉结构。在此基础上,MLLM的T2I生成推理潜力通过结构化视觉引导的推理得以释放。此外,作为一个与生成器无关的推理框架,我们提出的StruVis可以与多种T2I生成器无缝集成,并有效提升其在基于推理的T2I生成中的性能。大量实验表明,StruVis在基于推理的T2I基准测试中取得了显著的性能提升,例如在T2I-ReasonBench上提高了4.61%,在WISE上提高了4%。
cs.CV / 64 / 2603.06034

Occlusion-Aware SORT: Observing Occlusion for Robust Multi-Object Tracking

考虑遮挡的SORT:观察遮挡以实现稳健的多目标跟踪
Li, Chunjiang, Ma, Jianbo, Shen, Li, Chen, Yanru, Chen, Liangyin
Abstract
Multi-object tracking (MOT) involves analyzing object trajectories and counting the number of objects in video sequences. However, 2D MOT faces challenges due to positional cost confusion arising from partial occlusion. To address this issue, we present the novel Occlusion-Aware SORT (OA-SORT) framework, a plug-and-play and training-free framework that includes the Occlusion-Aware Module (OAM), the Occlusion-Aware Offset (OAO), and the Bias-Aware Momentum (BAM). Specifically, OAM analyzes the occlusion status of objects, where a Gaussian Map (GM) is introduced to reduce background influence. In contrast, OAO and BAM leverage the OAM-described occlusion status to mitigate cost confusion and suppress estimation instability. Comprehensive evaluations on the DanceTrack, SportsMOT, and MOT17 datasets demonstrate the importance of occlusion handling in MOT. On the DanceTrack test set, OA-SORT achieves 63.1% and 64.2% in HOTA and IDF1, respectively. Furthermore, integrating the Occlusion-Aware framework into four additional trackers improves HOTA and IDF1 by an average of 2.08% and 3.05%, demonstrating the reusability of the occlusion awareness.
Chinese Translation
多目标跟踪(MOT)涉及分析物体轨迹和计算视频序列中的物体数量。然而,由于部分遮挡导致的位置成本混淆,2D MOT面临挑战。为了解决这一问题,我们提出了新颖的考虑遮挡的SORT(Occlusion-Aware SORT,OA-SORT)框架,这是一种即插即用且无需训练的框架,包括考虑遮挡模块(Occlusion-Aware Module,OAM)、考虑遮挡偏移(Occlusion-Aware Offset,OAO)和偏差感知动量(Bias-Aware Momentum,BAM)。具体而言,OAM分析物体的遮挡状态,引入高斯图(Gaussian Map,GM)以减少背景影响。相对而言,OAO和BAM利用OAM描述的遮挡状态来减轻成本混淆并抑制估计不稳定性。在DanceTrack、SportsMOT和MOT17数据集上的全面评估表明,处理遮挡在MOT中的重要性。在DanceTrack测试集上,OA-SORT在HOTA和IDF1中分别达到了63.1%和64.2%。此外,将考虑遮挡的框架集成到四个额外的跟踪器中,HOTA和IDF1平均提高了2.08%和3.05%,证明了遮挡感知的可重用性。
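A toy reading of the occlusion-aware idea: flag detections heavily overlapped by higher-confidence boxes, then lean the association cost for those detections away from position and toward appearance. This is an interpretation of the general mechanism only, not the paper's OAM/OAO/BAM.

import numpy as np

def iou(a, b):  # boxes as (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def occlusion_status(boxes, scores, thresh=0.3):
    # A detection counts as occluded if a higher-scoring box overlaps it.
    occ = np.zeros(len(boxes), dtype=bool)
    for i in range(len(boxes)):
        for j in range(len(boxes)):
            if i != j and scores[j] > scores[i] and iou(boxes[i], boxes[j]) > thresh:
                occ[i] = True
    return occ

def occlusion_aware_cost(pos_cost, app_cost, occluded, w=0.7):
    # For occluded detections, trust appearance more than position.
    cost = pos_cost.copy()
    cost[:, occluded] = (1 - w) * pos_cost[:, occluded] + w * app_cost[:, occluded]
    return cost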
cs.CV / 65 / 2603.06036

Ensemble Learning with Sparse Hypercolumns

稀疏超列的集成学习
Dietlmeier, Julia, Ganepola, Vayangi, Adegboro, Oluwabukola G., Maniparambil, Mayug, Mazo, Claudia, O'Connor, Noel E.
Abstract
Directly inspired by findings in biological vision, high-dimensional hypercolumns are feature vectors built by concatenating multi-scale activations of convolutional neural networks for a single image pixel location. Together with powerful classifiers, they can be used for image segmentation, i.e., pixel classification. However, in practice, only a few works are dedicated to the use of hypercolumns. One reason is the computational complexity of processing concatenated dense hypercolumns that grows linearly with the size $N$ of the training set. In this work, we address this challenge by applying stratified subsampling to the VGG16 based hypercolumns. Furthermore, we investigate the performance of ensemble learning on sparse hypercolumns. Our experiments on a brain tumor dataset show that stacking and voting ensembles deliver competitive performance, but in the extreme low-shot case of $N \leq 20$, a simple Logistic Regression classifier is the most effective method. For 10% stratified subsampling rate, our best average Dice score is 0.66 for $N=20$. This is a statistically significant improvement of 24.53% over the standard multi-scale UNet baseline ($p$-value = $[3.07e-11]$, Wilcoxon signed-rank test), which is less effective due to overfitting.
Chinese Translation
直接受到生物视觉研究结果的启发,高维超列是通过连接卷积神经网络在单个图像像素位置的多尺度激活而构建的特征向量。结合强大的分类器,它们可以用于图像分割,即像素分类。然而,在实践中,专门使用超列的工作非常少。一个原因是处理连接的稠密超列的计算复杂性随着训练集大小 $N$ 的增加而线性增长。在本研究中,我们通过对基于 VGG16 的超列应用分层子采样来解决这一挑战。此外,我们还研究了集成学习在稀疏超列上的表现。我们在脑肿瘤数据集上的实验表明,堆叠和投票集成方法提供了具有竞争力的性能,但在极低样本情况下($N \leq 20$),简单的逻辑回归分类器是最有效的方法。在 10% 的分层子采样率下,我们的最佳平均 Dice 分数为 0.66($N=20$)。这比标准的多尺度 UNet 基线有统计学上显著的改进,提升幅度为 24.53%($p$-值 = $[3.07e-11]$,Wilcoxon 符号秩检验),后者由于过拟合而效果较差。
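Hypercolumn construction and stratified subsampling are easy to sketch with torchvision's VGG16 (the tap indices are the final ReLU of each conv stage; the 10% per-class rate mirrors the paper's setting, everything else is illustrative):

import torch
import torchvision
import torch.nn.functional as F

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
taps = {3, 8, 15, 22, 29}  # last ReLU of each VGG16 conv stage

@torch.no_grad()
def hypercolumns(img):  # img: (1, 3, H, W), ImageNet-normalized
    h, w = img.shape[-2:]
    feats, x = [], img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in taps:
            feats.append(F.interpolate(x, size=(h, w), mode="bilinear",
                                       align_corners=False))
    # (H*W, D): one concatenated multi-scale feature vector per pixel.
    return torch.cat(feats, dim=1)[0].permute(1, 2, 0).reshape(h * w, -1)

def stratified_subsample(columns, labels, rate=0.1, seed=0):
    # columns: (num_pixels, D), labels: (num_pixels,).
    # Keep `rate` of the pixels of each class, preserving class balance.
    g = torch.Generator().manual_seed(seed)
    keep = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        perm = idx[torch.randperm(len(idx), generator=g)]
        keep.append(perm[: max(1, int(rate * len(idx)))])
    keep = torch.cat(keep)
    return columns[keep], labels[keep]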
cs.CV / 66 / 2603.06038

FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography

FontUse:一种基于数据的方法用于风格和使用场景条件下的图像内排版
Xin, Xia, Endo, Yuki, Kanamori, Yoshihiro
Abstract
Recent text-to-image models can generate high-quality images from natural-language prompts, yet controlling typography remains challenging: requested typographic appearance is often ignored or only weakly followed. We address this limitation with a data-centric approach that trains image generation models using targeted supervision derived from a structured annotation pipeline specialized for typography. Our pipeline constructs a large-scale typography-focused dataset, FontUse, consisting of about 70K images annotated with user-friendly prompts, text-region locations, and OCR-recognized strings. The annotations are automatically produced using segmentation models and multimodal large language models (MLLMs). The prompts explicitly combine font styles (e.g., serif, script, elegant) and use cases (e.g., wedding invitations, coffee-shop menus), enabling intuitive specification even for novice users. Fine-tuning existing generators with these annotations allows them to consistently interpret style and use-case conditions as textual prompts without architectural modification. For evaluation, we introduce a Long-CLIP-based metric that measures alignment between generated typography and requested attributes. Experiments across diverse prompts and layouts show that models trained with our pipeline produce text renderings more consistent with prompts than competitive baselines. The source code for our annotation pipeline is available at https://github.com/xiaxinz/FontUSE.
Chinese Translation
近期的文本到图像模型能够从自然语言提示生成高质量图像,但控制排版仍然具有挑战性:请求的排版外观常常被忽视或仅被弱弱遵循。我们通过一种基于数据的方法来解决这一限制,该方法使用专门针对排版的结构化注释管道所衍生的有针对性的监督来训练图像生成模型。我们的管道构建了一个大规模以排版为重点的数据集FontUse,包含约70K幅图像,并附有用户友好的提示、文本区域位置和OCR识别字符串。这些注释是通过使用分割模型和多模态大型语言模型(MLLMs)自动生成的。提示明确结合了字体风格(例如,衬线体、手写体、优雅体)和使用场景(例如,婚礼邀请函、咖啡店菜单),使得即使是新手用户也能直观地进行规范。使用这些注释对现有生成器进行微调,使其能够在不修改架构的情况下始终如一地将风格和使用场景条件解释为文本提示。为了评估,我们引入了一种基于Long-CLIP的度量,衡量生成的排版与请求属性之间的对齐程度。在多样化的提示和布局下的实验表明,使用我们管道训练的模型生成的文本渲染与提示更为一致,优于竞争基线。我们的注释管道的源代码可在https://github.com/xiaxinz/FontUSE获取。
cs.CV / 67 / 2603.06043

Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models

通过理解进行生成:基于理解驱动的统一多模态模型内在奖励机制
Pan, Jiadong, Li, Liang, Peng, Yuxin, Tang, Yu-Ming, Wang, Shuohuan, Sun, Yu, Wu, Hua, Huang, Qingming, Wang, Haifeng
Abstract
Recently, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, demonstrating strong potential for complex text-to-image (T2I) tasks. Despite their theoretical promise, a persistent capability gap exists: UMMs typically exhibit superior visual understanding but comparatively weaker generative capabilities. This discrepancy arises largely from the intrinsic decoupling between the understanding and generation processes. While a UMM can accurately interpret fine-grained visual details, it often struggles to produce semantically coherent images from complex textual prompts. To address this challenge, we explore UMMs' internal understanding capability to enhance generation quality. We propose a token-level intrinsic text-image alignment reward mechanism, GvU, enabling the UMM to act simultaneously as teacher and student: it evaluates its own outputs using the understanding branch to guide the generations accordingly. Building upon this, we design a self-supervised reinforcement learning framework, allowing UMMs to iteratively improve their generation quality through understanding-based intrinsic reward signals--without reliance on external supervision. Experimental results show that our method substantially boosts UMMs' generation, which in turn strengthens their fine-grained visual understanding, narrowing the capability gap between UMMs' visual understanding and generation.
Chinese Translation
近年来,统一多模态模型(UMMs)在整合视觉理解与生成方面取得了显著进展,展现出在复杂文本到图像(T2I)任务中的强大潜力。尽管其理论前景良好,但仍存在持续的能力差距:UMMs通常在视觉理解方面表现优越,但生成能力相对较弱。这一差距主要源于理解与生成过程之间的内在解耦。虽然UMM能够准确解读细致的视觉细节,但在从复杂的文本提示中生成语义连贯的图像时常常面临困难。为了解决这一挑战,我们探索UMMs内部的理解能力,以提升生成质量。我们提出了一种基于标记级的内在文本-图像对齐奖励机制GvU,使UMM能够同时充当教师和学生:它利用理解分支评估自身输出,以指导生成过程。基于此,我们设计了一个自监督强化学习框架,使UMMs能够通过基于理解的内在奖励信号迭代提升生成质量,而无需依赖外部监督。实验结果表明,我们的方法显著提升了UMMs的生成能力,从而增强了其细致的视觉理解,缩小了UMMs在视觉理解与生成之间的能力差距。
cs.CV / 68 / 2603.06048

GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection

GenHOI:朝着对象一致的手-物体交互,采用时间平衡和空间选择性对象注入
Huang, Xuan, Xiang, Mochu, Shen, Zhelun, Wu, Jinbo, Wu, Chenming, Zhao, Chen, Wang, Kaisiyuan, Zhou, Hang, Liu, Shanshan, Feng, Haocheng, He, Wei, Wang, Jingdong
Abstract
Hand-Object Interaction (HOI) remains a core challenge in digital human video synthesis, where models must generate physically plausible contact and preserve object identity across frames. Although recent HOI reenactment approaches have achieved progress, they are typically trained and evaluated in-domain and fail to generalize to complex, in-the-wild scenarios. In contrast, all-in-one video editing models exhibit broader robustness but still struggle with HOI-specific issues such as inconsistent object appearance. In this paper, we present GenHOI, a lightweight augmentation to pretrained video generation models that injects reference-object information in a temporally balanced and spatially selective manner. For temporal balancing, we propose Head-Sliding RoPE, which assigns head-specific temporal offsets to reference tokens, distributing their influence evenly across frames and mitigating the temporal decay of 3D RoPE to improve long-range object consistency. For spatial selectivity, we design a two-level spatial attention gate that concentrates object-conditioned attention on HOI regions and adaptively scales its strength, preserving background realism while enhancing interaction fidelity. Extensive qualitative and quantitative evaluations on unseen, in-the-wild scenes demonstrate that GenHOI significantly outperforms state-of-the-art HOI reenactment and all-in-one video editing methods. Project page: https://xuanhuang0.github.io/GenHOI/
Chinese Translation
手-物体交互(HOI)仍然是数字人类视频合成中的核心挑战,模型必须生成物理上合理的接触并在帧之间保持物体身份。尽管近期的HOI重演方法取得了一定进展,但它们通常是在特定领域内进行训练和评估,无法在复杂的真实场景中进行泛化。相比之下,一体化视频编辑模型表现出更广泛的鲁棒性,但仍然面临HOI特有的问题,例如物体外观不一致。本文提出了GenHOI,一种轻量级的预训练视频生成模型增强方法,以时间平衡和空间选择性的方式注入参考物体信息。为了实现时间平衡,我们提出了Head-Sliding RoPE,它为参考标记分配特定于头部的时间偏移,将其影响均匀分布在各帧之间,并减轻3D RoPE的时间衰减,以改善长距离物体一致性。为了实现空间选择性,我们设计了一个两级空间注意力门,集中物体条件的注意力于HOI区域,并自适应地调整其强度,保持背景的真实感,同时增强交互的保真度。在未见过的真实场景上进行的大量定性和定量评估表明,GenHOI显著优于最先进的HOI重演和一体化视频编辑方法。项目页面:https://xuanhuang0.github.io/GenHOI/
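Head-Sliding RoPE, as described, gives each attention head a different temporal offset for reference tokens; a minimal rotary-embedding sketch under assumed shapes (the plain linspace offset schedule is a guess at the paper's assignment):

import torch

def rope_rotate(x, pos, base=10000.0):
    # x: (heads, n, d) with d even; pos: (heads, n) per-head positions.
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)  # (d/2,)
    ang = pos[..., None] * freqs                                 # (heads, n, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)

heads, n_ref, d, num_frames = 8, 64, 32, 49
ref_q = torch.randn(heads, n_ref, d)        # reference-token queries (toy)
# All reference tokens share one nominal time, but each head sees them at a
# different temporal offset, spreading their influence across the clip.
offsets = torch.linspace(0, num_frames - 1, heads)
pos = torch.zeros(heads, n_ref) + offsets[:, None]
ref_q_rot = rope_rotate(ref_q, pos)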
cs.CV / 69 / 2603.06049

Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models

狭窄策略中的魔鬼:释放驾驶VLA模型的探索潜力
Chen, Canyu, Yang, Yuguang, Tan, Zhewen, Wang, Yizhi, Zhan, Ruiyi, Liu, Haiyan, Mao, Xuanyao, Bao, Jason, Tang, Xinyue, Yang, Linlin, Sun, Bingchuan, Wang, Yan, Zhang, Baochang
Abstract
We identify a fundamental Narrow Policy limitation undermining the performance of autonomous VLA models, where driving Imitation Learning (IL) tends to collapse exploration and limit the potential of subsequent Reinforcement Learning (RL) stages, which often saturate prematurely due to insufficient feedback diversity. We therefore propose Curious-VLA, a framework that alleviates the exploit-explore dilemma through a two-stage design. During IL, we introduce a Feasible Trajectory Expansion (FTE) strategy to generate multiple physically valid trajectories and a step-wise normalized trajectory representation to adapt this diverse data. In the RL stage, we present Adaptive Diversity-Aware Sampling (ADAS) that prioritizes high-diversity samples and introduce Spanning Driving Reward (SDR) with a focal style weighting to amplify the reward's value span for improving sensitivity to driving quality. On the Navsim benchmark, Curious-VLA achieves SoTA results (PDMS 90.3, EPDMS 85.4) and a Best-of-N PDMS of 94.8, demonstrating its effectiveness in unlocking the exploratory potential of VLA models. Code: https://github.com/Mashiroln/curious_vla.git.
Chinese Translation
我们识别出一种基本的狭窄策略限制,削弱了自主VLA模型的性能,其中驾驶模仿学习(Imitation Learning, IL)往往导致探索的崩溃,并限制后续强化学习(Reinforcement Learning, RL)阶段的潜力,这通常由于反馈多样性不足而过早饱和。因此,我们提出了Curious-VLA,一个通过两阶段设计缓解开发-探索困境的框架。在IL阶段,我们引入了一种可行轨迹扩展(Feasible Trajectory Expansion, FTE)策略,以生成多个物理有效的轨迹,并采用逐步归一化的轨迹表示来适应这些多样化的数据。在RL阶段,我们提出了自适应多样性感知采样(Adaptive Diversity-Aware Sampling, ADAS),优先考虑高多样性的样本,并引入了跨越驾驶奖励(Spanning Driving Reward, SDR),通过聚焦风格加权来扩大奖励的价值范围,以提高对驾驶质量的敏感性。在Navsim基准测试中,Curious-VLA达到了最先进的结果(PDMS 90.3,EPDMS 85.4)和最佳的N个PDMS为94.8,展示了其在解锁VLA模型探索潜力方面的有效性。代码链接:https://github.com/Mashiroln/curious_vla.git。
cs.CV / 70 / 2603.06054

Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving

探究轻量级视觉-语言模型中的视觉概念以支持自动驾驶
Theodoridis, Nikos, Mohandas, Reenu, Sistu, Ganesh, Scanlan, Anthony, Eising, Ciarán, Brophy, Tim
Abstract
The use of Vision-Language Models (VLMs) in automated driving applications is becoming increasingly common, with the aim of leveraging their reasoning and generalisation capabilities to handle long tail scenarios. However, these models often fail on simple visual questions that are highly relevant to automated driving, and the reasons behind these failures remain poorly understood. In this work, we examine the intermediate activations of VLMs and assess the extent to which specific visual concepts are linearly encoded, with the goal of identifying bottlenecks in the flow of visual information. Specifically, we create counterfactual image sets that differ only in a targeted visual concept and then train linear probes to distinguish between them using the activations of four state-of-the-art (SOTA) VLMs. Our results show that concepts such as the presence of an object or agent in a scene are explicitly and linearly encoded, whereas other spatial visual concepts, such as the orientation of an object or agent, are only implicitly encoded by the spatial structure retained by the vision encoder. In parallel, we observe that in certain cases, even when a concept is linearly encoded in the model's activations, the model still fails to answer correctly. This leads us to identify two failure modes. The first is perceptual failure, where the visual information required to answer a question is not linearly encoded in the model's activations. The second is cognitive failure, where the visual information is present but the model fails to align it correctly with language semantics. Finally, we show that increasing the distance of the object in question quickly degrades the linear separability of the corresponding visual concept. Overall, our findings improve our understanding of failure cases in VLMs on simple visual tasks that are highly relevant to automated driving.
Chinese Translation
视觉-语言模型(VLMs)在自动驾驶应用中的使用日益普遍,旨在利用其推理和泛化能力来处理长尾场景。然而,这些模型在与自动驾驶高度相关的简单视觉问题上往往表现不佳,而导致这些失败的原因仍然不甚明了。在本研究中,我们考察了VLMs的中间激活,并评估特定视觉概念的线性编码程度,旨在识别视觉信息流中的瓶颈。具体而言,我们创建了仅在目标视觉概念上有所不同的反事实图像集,并利用四个最先进(SOTA)的VLMs的激活训练线性探针以区分它们。我们的结果表明,场景中物体或主体的存在等概念是明确且线性编码的,而其他空间视觉概念,如物体或主体的方向,仅通过视觉编码器保留的空间结构隐含编码。同时,我们观察到在某些情况下,即使概念在线性编码中存在,模型仍然无法正确回答。这使我们识别出两种失败模式。第一种是感知失败,即回答问题所需的视觉信息未在线性编码中体现。第二种是认知失败,即视觉信息存在,但模型未能将其正确与语言语义对齐。最后,我们展示了增加相关物体的距离会迅速降低相应视觉概念的线性可分性。总体而言,我们的研究结果提高了对VLMs在与自动驾驶高度相关的简单视觉任务中失败案例的理解。
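Training such a probe is a few lines with scikit-learn; given activations dumped from one layer for the two counterfactual image sets (the .npy filenames are hypothetical placeholders), near-chance held-out accuracy indicates the concept is not linearly encoded:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

acts = np.load("activations.npy")   # (N, D) hidden activations at one layer
y = np.load("labels.npy")           # 0/1: which counterfactual set
Xtr, Xte, ytr, yte = train_test_split(acts, y, test_size=0.25,
                                      random_state=0, stratify=y)
probe = LogisticRegression(max_iter=2000).fit(Xtr, ytr)
print("probe accuracy:", probe.score(Xte, yte))  # ~0.5 => not linearly encoded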
cs.CV / 71 / 2603.06057

TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation

TempoSyncDiff:用于低延迟音频驱动的说话人生成的蒸馏时间一致性扩散
Mazumdar, Soumya, Rakesh, Vineet Kumar
Abstract
Diffusion models have recently advanced photorealistic human synthesis, although practical talking-head generation (THG) remains constrained by high inference latency, temporal instability such as flicker and identity drift, and imperfect audio-visual alignment under challenging speech conditions. This paper introduces TempoSyncDiff, a reference-conditioned latent diffusion framework that explores few-step inference for efficient audio-driven talking-head generation. The approach adopts a teacher-student distillation formulation in which a diffusion teacher trained with a standard noise prediction objective guides a lightweight student denoiser capable of operating with significantly fewer inference steps to improve generation stability. The framework incorporates identity anchoring and temporal regularization designed to mitigate identity drift and frame-to-frame flicker during synthesis, while viseme-based audio conditioning provides coarse lip motion control. Experiments on the LRS3 dataset report denoising-stage component-level metrics relative to VAE reconstructions and preliminary latency characterization, including CPU-only and edge computing measurements and feasibility estimates for edge deployment. The results suggest that distilled diffusion models can retain much of the reconstruction behaviour of a stronger teacher while enabling substantially lower latency inference. The study is positioned as an initial step toward practical diffusion-based talking-head generation under constrained computational settings. GitHub: https://mazumdarsoumya.github.io/TempoSyncDiff
Chinese Translation
扩散模型最近在逼真的人类合成方面取得了进展,尽管实际的说话人生成(THG)仍受到高推理延迟、时间不稳定性(如闪烁和身份漂移)以及在困难语音条件下音频与视觉的对齐不完善的限制。本文介绍了TempoSyncDiff,一种参考条件的潜在扩散框架,探索了高效的音频驱动说话人生成的少步推理方法。该方法采用了教师-学生蒸馏的形式,其中使用标准噪声预测目标训练的扩散教师指导一个轻量级的学生去噪器,该去噪器能够以显著更少的推理步骤运行,从而提高生成的稳定性。该框架结合了身份锚定和时间正则化,旨在减轻合成过程中的身份漂移和逐帧闪烁,同时基于视觉音素的音频调节提供粗略的唇部运动控制。在LRS3数据集上的实验报告了相对于变分自编码器(VAE)重建的去噪阶段组件级指标和初步延迟特征,包括仅使用CPU的测量和边缘计算的可行性估计。结果表明,蒸馏的扩散模型可以保留更强教师的许多重建特性,同时实现显著更低的延迟推理。本研究被视为在受限计算环境下实现基于扩散的实际说话人生成的初步步骤。GitHub: https://mazumdarsoumya.github.io/TempoSyncDiff
cs.CV / 72 / 2603.06061

Transforming Omnidirectional RGB-LiDAR data into 3D Gaussian Splatting

将全向RGB-LiDAR数据转化为3D高斯点云
Bae, Semin, Lim, Hansol, Choi, Jongseong Brad
Abstract
The demand for large-scale digital twins is rapidly growing in robotics and autonomous driving. However, constructing these environments with 3D Gaussian Splatting (3DGS) usually requires expensive, purpose-built data collection. Meanwhile, deployed platforms routinely collect extensive omnidirectional RGB and LiDAR logs, but a significant portion of these sensor data is directly discarded or severely underutilized due to transmission constraints and the lack of a scalable reuse pipeline. In this paper, we present an omnidirectional RGB-LiDAR reuse pipeline that transforms these archived logs into robust initialization assets for 3DGS. Direct conversion of such raw logs introduces practical bottlenecks: inherent non-linear distortion leads to unreliable Structure-from-Motion (SfM) tracking, and dense, unorganized LiDAR clouds cause computational overhead during 3DGS optimization. To overcome these challenges, our pipeline strategically integrates an ERP-to-cubemap conversion module for deterministic spatial anchoring, alongside PRISM, a color-stratified downsampling strategy. By bridging these multi-modal inputs via Fast Point Feature Histograms (FPFH) based global registration and Iterative Closest Point (ICP), our pipeline successfully repurposes a considerable fraction of discarded data into usable SfM geometry. Furthermore, our LiDAR-reinforced initialization consistently enhances the final 3DGS rendering fidelity in structurally complex scenes compared to vision-only baselines. Ultimately, this work provides a deterministic workflow for creating simulation-grade digital twins from standard archived sensor logs.
Chinese Translation
在机器人技术和自动驾驶领域,对大规模数字双胞胎的需求正在迅速增长。然而,使用3D高斯点云(3DGS)构建这些环境通常需要昂贵的专用数据收集。同时,已部署的平台常常收集大量全向RGB和LiDAR日志,但由于传输限制和缺乏可扩展的重用流程,这些传感器数据中有相当一部分被直接丢弃或严格低效利用。本文提出了一种全向RGB-LiDAR重用流程,将这些存档日志转化为3DGS的强健初始化资产。直接转换此类原始日志会引入实际瓶颈:固有的非线性失真导致运动恢复结构(SfM)跟踪不可靠,而密集且无组织的LiDAR点云在3DGS优化过程中造成计算开销。为了克服这些挑战,我们的流程战略性地集成了一个ERP到立方体贴图的转换模块,以实现确定性的空间锚定,并采用PRISM——一种颜色分层下采样策略。通过基于快速点特征直方图(FPFH)的全局配准和迭代最近点(ICP)方法,我们的流程成功地将相当一部分被丢弃的数据重新利用为可用的SfM几何体。此外,与仅基于视觉的基线相比,我们的LiDAR增强初始化在结构复杂的场景中始终提高了最终3DGS渲染的保真度。最终,这项工作提供了一种确定性的工作流程,用于从标准存档传感器日志创建模拟级数字双胞胎。
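The FPFH-plus-ICP bridging step maps directly onto Open3D's registration pipeline; a sketch with illustrative voxel sizes and RANSAC settings (not the paper's tuned parameters):

import open3d as o3d

def preprocess(pcd, voxel):
    down = pcd.voxel_down_sample(voxel)
    down.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=2 * voxel, max_nn=30))
    fpfh = o3d.pipelines.registration.compute_fpfh_feature(
        down, o3d.geometry.KDTreeSearchParamHybrid(radius=5 * voxel, max_nn=100))
    return down, fpfh

def register(source, target, voxel=0.2):
    s, s_f = preprocess(source, voxel)
    t, t_f = preprocess(target, voxel)
    # Coarse global alignment from FPFH correspondences (RANSAC).
    coarse = o3d.pipelines.registration.registration_ransac_based_on_feature_matching(
        s, t, s_f, t_f, True, voxel * 1.5,
        o3d.pipelines.registration.TransformationEstimationPointToPoint(False),
        3, [], o3d.pipelines.registration.RANSACConvergenceCriteria(100000, 0.999))
    # Fine refinement with point-to-point ICP.
    fine = o3d.pipelines.registration.registration_icp(
        s, t, voxel * 0.5, coarse.transformation,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return fine.transformation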
cs.CV / 73 / 2603.06071

Text-Driven Emotionally Continuous Talking Face Generation

基于文本驱动的情感连续对话人脸生成
Yang, Hao, Zhao, Yanyan, Zheng, Tian, Zhang, Hongbo, Wang, Bichen, Wu, Di, Fu, Xing, Zhi, Xuda, Huang, Yongbo, He, Hao
Abstract
Talking Face Generation (TFG) strives to create realistic and emotionally expressive digital faces. While previous TFG works have mastered the creation of naturalistic facial movements, they typically express a fixed target emotion in synthetic videos and lack the ability to exhibit continuously changing and natural expressions like humans do when conveying information. To synthesize realistic videos, we propose a novel task called Emotionally Continuous Talking Face Generation (EC-TFG), which takes a text segment and an emotion description with varying emotions as driving data, aiming to generate a video where the person speaks the text while reflecting the emotional changes within the description. Alongside this, we introduce a customized model, i.e., Temporal-Intensive Emotion Modulated Talking Face Generation (TIE-TFG), which innovatively manages dynamic emotional variations by employing Temporal-Intensive Emotion Fluctuation Modeling, allowing it to provide emotion variation sequences corresponding to the input text to drive continuous facial expression changes in synthesized videos. Extensive evaluations demonstrate our method's exceptional ability to produce smooth emotion transitions and uphold high-quality visuals and motion authenticity across diverse emotional states.
Chinese Translation
对话人脸生成(Talking Face Generation, TFG)旨在创建逼真且富有情感表现力的数字人脸。虽然以往的TFG研究已经掌握了自然面部动作的生成,但它们通常在合成视频中只表达固定的目标情感,缺乏像人类在传递信息时那样展现持续变化和自然表情的能力。为了合成真实的视频,我们提出了一项新任务,称为情感连续对话人脸生成(Emotionally Continuous Talking Face Generation, EC-TFG),该任务以文本片段和具有变化情感的情感描述作为驱动数据,旨在生成一个视频,其中人物在说出文本的同时反映情感描述中的情感变化。同时,我们引入了一种定制模型,即时间密集型情感调制对话人脸生成(Temporal-Intensive Emotion Modulated Talking Face Generation, TIE-TFG),该模型通过采用时间密集型情感波动建模创新性地管理动态情感变化,使其能够提供与输入文本相对应的情感变化序列,以驱动合成视频中的连续面部表情变化。广泛的评估表明我们的方法在生成平滑的情感过渡和保持高质量视觉效果及运动真实性方面表现出色,适用于多种情感状态。
cs.CV / 74 / 2603.06081

Lyapunov Probes for Hallucination Detection in Large Foundation Models

用于大规模基础模型幻觉检测的李雅普诺夫探针
Luan, Bozhi, Li, Gen, Qin, Yalan, Guo, Jifeng, Zhou, Yun, Wu, Faguo, Zheng, Hongwei, Wu, Wenjun, Fan, Zhaoxin
Abstract
We address hallucination detection in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) by framing the problem through the lens of dynamical systems stability theory. Rather than treating hallucination as a straightforward classification task, we conceptualize (M)LLMs as dynamical systems, where factual knowledge is represented by stable equilibrium points within the representation space. Our main insight is that hallucinations tend to arise at the boundaries of knowledge-transition regions separating stable and unstable zones. To capture this phenomenon, we propose Lyapunov Probes: lightweight networks trained with derivative-based stability constraints that enforce a monotonic decay in confidence under input perturbations. By performing systematic perturbation analysis and applying a two-stage training process, these probes reliably distinguish between stable factual regions and unstable, hallucination-prone regions. Experiments on diverse datasets and models demonstrate consistent improvements over existing baselines.
Chinese Translation
我们通过动态系统稳定性理论的视角来解决大语言模型(LLMs)和多模态大语言模型(MLLMs)中的幻觉检测问题。我们并不将幻觉视为一个简单的分类任务,而是将(M)LLMs 概念化为动态系统,其中事实知识在表示空间中由稳定的平衡点表示。我们的主要见解是,幻觉往往出现在分隔稳定区和不稳定区的知识转变区域的边界处。为了捕捉这一现象,我们提出了李雅普诺夫探针(Lyapunov Probes):一种轻量级网络,采用基于导数的稳定性约束进行训练,强制在输入扰动下信心单调衰减。通过进行系统的扰动分析并应用两阶段训练过程,这些探针能够可靠地区分稳定的事实区域和不稳定的、易产生幻觉的区域。在多样化的数据集和模型上的实验表明,相较于现有基线,性能有一致的提升。
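The derivative-based constraint can be written directly with autograd: penalize any positive slope of the probe's confidence along a perturbation direction, so confidence must decay monotonically away from the state. The probe architecture, direction sampling, and weighting below are assumptions, not the paper's exact recipe.

import torch
import torch.nn as nn
import torch.nn.functional as F

probe = nn.Sequential(nn.Linear(768, 128), nn.Tanh(), nn.Linear(128, 1))

def stability_penalty(h, direction):
    # d(confidence)/d(eps) at eps = 0 must be <= 0 along the perturbation.
    eps = torch.zeros(h.size(0), 1, requires_grad=True)
    conf = torch.sigmoid(probe(h + eps * direction))
    grad = torch.autograd.grad(conf.sum(), eps, create_graph=True)[0]
    return torch.relu(grad).mean()  # only positive slopes are penalized

h = torch.randn(32, 768)                      # hidden states (toy stand-in)
direction = F.normalize(torch.randn(32, 768), dim=-1)
labels = torch.randint(0, 2, (32, 1)).float()
loss = F.binary_cross_entropy_with_logits(probe(h), labels) \
       + 0.1 * stability_penalty(h, direction)
loss.backward()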
cs.CV / 75 / 2603.06090

DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model

DeepSight:通过深度驱动的多模态模型连接深度图和语言
Yang, Hao, Zhang, Hongbo, Zhao, Yanyan, Qin, Bing
Abstract
Multimodal large language models (MLLMs) have achieved impressive performance across various tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately interpret depth information inherent in visual data. In this work, we introduce DeepSight, the first dedicated depth MLLM designed to enhance three-dimensional scene understanding. Unlike conventional methods that align RGB image encodings with text, our approach takes advantage of the unique characteristics of depth images: single-channel grayscale images where the pixel values directly reflect depth cues to improve spatial reasoning. To address challenges associated with limited depth data and the inadequacy of simple channel replication, we construct a novel depth image-text pair dataset and a depth instruction dataset. Depth maps are generated from visual images using the GLPN model, and GPT-4 is employed to curate corresponding depth instructions, an approach validated by LLaVA. Additionally, we modify the ViT encoder in CLIP to incorporate local object information, thereby capturing the subtle continuous variations of depth more effectively. To evaluate the performance of our model, we develop a comprehensive depth question answer benchmark based on existing depth image datasets, which rigorously assesses understanding in typical depth map scenarios. Experimental results demonstrate that DeepSight significantly enhances depth perception and downstream task performance, marking a substantial step forward in multimodal three-dimensional understanding.
Chinese Translation
多模态大型语言模型(MLLMs)在图像描述和视觉问答(VQA)等各种任务中取得了令人瞩目的表现;然而,它们在准确解读视觉数据中固有的深度信息方面常常面临挑战。在本研究中,我们介绍了DeepSight,这是首个专门设计用于增强三维场景理解的深度MLLM。与传统方法将RGB图像编码与文本对齐不同,我们的方法利用了深度图像的独特特性:单通道灰度图像,其中像素值直接反映深度线索,从而改善空间推理。为了解决与有限深度数据和简单通道复制不足相关的挑战,我们构建了一个新颖的深度图像-文本对数据集和一个深度指令数据集。深度图是通过GLPN模型从视觉图像生成的,GPT-4被用来策划相应的深度指令,这一方法得到了LLaVA的验证。此外,我们修改了CLIP中的ViT编码器,以纳入局部对象信息,从而更有效地捕捉深度的细微连续变化。为了评估我们模型的性能,我们开发了一个基于现有深度图像数据集的全面深度问答基准,严格评估在典型深度图场景中的理解能力。实验结果表明,DeepSight显著增强了深度感知和下游任务的表现,标志着多模态三维理解的重大进展。
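The depth-map-generation step is reproducible with the public GLPN checkpoint on Hugging Face (vinvino02/glpn-nyu); the image filenames below are placeholders:

import torch
from PIL import Image
from transformers import GLPNImageProcessor, GLPNForDepthEstimation

processor = GLPNImageProcessor.from_pretrained("vinvino02/glpn-nyu")
model = GLPNForDepthEstimation.from_pretrained("vinvino02/glpn-nyu").eval()

image = Image.open("scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    depth = model(**inputs).predicted_depth          # (1, H', W')
depth = torch.nn.functional.interpolate(
    depth.unsqueeze(1), size=image.size[::-1], mode="bicubic",
    align_corners=False).squeeze()
# Single-channel grayscale depth image, the input format described above.
gray = (255 * depth / depth.max()).byte().numpy()
Image.fromarray(gray).save("scene_depth.png")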
cs.CV / 76 / 2603.06122

FedARKS: Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration for Person Re-identification

FedARKS:通过稳健和区分性知识选择与整合实现的联邦聚合用于行人重识别
Xu, Xin, Ma, Binchang, Yu, Zhixi, Liu, Wei
Abstract
The application of federated domain generalization in person re-identification (FedDG-ReID) aims to enhance the model's generalization ability in unseen domains while protecting client data privacy. However, existing mainstream methods typically rely on global feature representations and simple averaging operations for model aggregation, leading to two limitations in domain generalization: (1) Using only global features makes it difficult to capture subtle, domain-invariant local details (such as accessories or textures); (2) Uniform parameter averaging treats all clients as equivalent, ignoring their differences in robust feature extraction capabilities, thereby diluting the contributions of high quality clients. To address these issues, we propose a novel federated learning framework, Federated Aggregation via Robust and Discriminative Knowledge Selection and Integration (FedARKS), comprising two mechanisms: RK (Robust Knowledge) and KS (Knowledge Selection).
Chinese Translation
联邦领域泛化在行人重识别中的应用(FedDG-ReID)旨在增强模型在未见领域中的泛化能力,同时保护客户端数据隐私。然而,现有主流方法通常依赖于全局特征表示和简单的平均操作进行模型聚合,这导致了领域泛化的两个局限性:(1)仅使用全局特征使得捕捉微妙的、领域不变的局部细节(如配饰或纹理)变得困难;(2)均匀的参数平均将所有客户端视为等同,忽视了它们在稳健特征提取能力上的差异,从而稀释了高质量客户端的贡献。为了解决这些问题,我们提出了一种新颖的联邦学习框架,即通过稳健和区分性知识选择与整合实现的联邦聚合(FedARKS),该框架包含两个机制:RK(稳健知识)和KS(知识选择)。
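The contrast with uniform averaging is easiest to see in code: a FedAvg-style server that weights each client's parameters by a quality score rather than equally. How the scores are derived (the crux of FedARKS's RK and KS mechanisms) is left abstract here.

import torch

def weighted_aggregate(client_states, quality_scores):
    # client_states: list of state_dicts; quality_scores: one float per client.
    w = torch.tensor(quality_scores, dtype=torch.float32)
    w = w / w.sum()
    agg = {}
    for key in client_states[0]:
        stacked = torch.stack([s[key].float() for s in client_states])
        # Broadcast the per-client weight over the parameter's dimensions.
        agg[key] = (w.view(-1, *[1] * (stacked.dim() - 1)) * stacked).sum(0)
    return agg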
cs.CV / 77 / 2603.06136

Cross-Resolution Distribution Matching for Diffusion Distillation

跨分辨率分布匹配用于扩散蒸馏
Chen, Feiyang, Pan, Hongpeng, Xu, Haonan, Duan, Xinyu, Yang, Yang, Wang, Zhefeng
Abstract
Diffusion distillation is central to accelerating image and video generation, yet existing methods are fundamentally limited by the denoising process, where step reduction has largely saturated. Partial timestep low-resolution generation can further accelerate inference, but it suffers noticeable quality degradation due to cross-resolution distribution gaps. We propose Cross-Resolution Distribution Matching Distillation (RMD), a novel distillation framework that bridges cross-resolution distribution gaps for high-fidelity, few-step multi-resolution cascaded inference. Specifically, RMD divides the timestep intervals for each resolution using logarithmic signal-to-noise ratio (logSNR) curves, and introduces logSNR-based mapping to compensate for resolution-induced shifts. Distribution matching is conducted along resolution trajectories to reduce the gap between low-resolution generator distributions and the teacher's high-resolution distribution. In addition, a predicted-noise re-injection mechanism is incorporated during upsampling to stabilize training and improve synthesis quality. Quantitative and qualitative results show that RMD preserves high-fidelity generation while accelerating inference across various backbones. Notably, RMD achieves up to 33.4X speedup on SDXL and 25.6X on Wan2.1-14B, while preserving high visual fidelity.
Chinese Translation
扩散蒸馏是加速图像和视频生成的核心,但现有方法在去噪过程中受到根本限制,步骤减少已基本饱和。部分时间步低分辨率生成可以进一步加速推理,但由于跨分辨率分布差距,质量显著下降。我们提出了跨分辨率分布匹配蒸馏(Cross-Resolution Distribution Matching Distillation, RMD),这是一种新颖的蒸馏框架,旨在弥合跨分辨率分布差距,以实现高保真、少步骤的多分辨率级联推理。具体而言,RMD使用对数信噪比(logarithmic signal-to-noise ratio, logSNR)曲线划分每个分辨率的时间步间隔,并引入基于logSNR的映射来补偿由分辨率引起的偏移。沿着分辨率轨迹进行分布匹配,以减少低分辨率生成器分布与教师高分辨率分布之间的差距。此外,在上采样过程中引入了预测噪声再注入机制,以稳定训练并提高合成质量。定量和定性结果表明,RMD在加速推理的同时保持高保真生成,适用于各种基础模型。值得注意的是,RMD在SDXL上实现了最高33.4倍的加速,在Wan2.1-14B上实现了25.6倍的加速,同时保持高视觉保真度。
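The logSNR bookkeeping can be illustrated with the cosine schedule: match a high-resolution timestep to the low-resolution timestep whose logSNR equals it after a resolution-induced shift. The 2*log(scale) shift below follows the convention of resolution-shifted noise schedules and is only an assumption about RMD's exact mapping.

import numpy as np

def logsnr_cosine(t):
    # logSNR(t) for the cosine schedule, t in (0, 1).
    return -2.0 * np.log(np.tan(np.pi * t / 2.0))

def map_timestep(t_high, scale):
    target = logsnr_cosine(t_high) + 2.0 * np.log(scale)  # assumed shift
    grid = np.linspace(1e-4, 1.0 - 1e-4, 100_000)
    return grid[np.argmin(np.abs(logsnr_cosine(grid) - target))]

print(map_timestep(0.5, scale=2))  # ~0.295: an earlier, higher-SNR timestep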
cs.CV / 78 / 2603.06140

Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion

Place-it-R1:解锁多模态大语言模型在视频物体插入中的环境感知推理潜力
Gu, Bohai, Wu, Taiyi, Du, Dazhao, Liu, Jian, Yang, Shuai, Zhao, Xiaotong, Zhao, Alan, Guo, Song
Abstract
Modern video editing techniques have achieved high visual fidelity when inserting video objects. However, they focus on optimizing visual fidelity rather than physical causality, leading to edits that are physically inconsistent with their environment. In this work, we present Place-it-R1, an end-to-end framework for video object insertion that unlocks the environment-aware reasoning potential of Multimodal Large Language Models (MLLMs). Our framework leverages the Chain-of-Thought (CoT) reasoning of MLLMs to orchestrate video diffusion, following a Think-then-Place paradigm. To bridge cognitive reasoning and generative execution, we introduce three key innovations: First, MLLM performs physical scene understanding and interaction reasoning, generating environment-aware chain-of-thought tokens and inferring valid insertion regions to explicitly guide the diffusion toward physically plausible insertion. Then, we introduce MLLM-guided Spatial Direct Preference Optimization (DPO), where diffusion outputs are fed back to the MLLM for scoring, enabling visual naturalness. During inference, the MLLM iteratively triggers refinement cycles and elicits adaptive adjustments from the diffusion model, forming a closed-loop that progressively enhances editing quality. Furthermore, we provide two user-selectable modes: a plausibility-oriented flexible mode that permits environment modifications (e.g., generating support structures) to enhance physical plausibility, and a fidelity-oriented standard mode that preserves scene integrity for maximum fidelity, offering users explicit control over the plausibility-fidelity trade-off. Extensive experiments demonstrate Place-it-R1 achieves physically-coherent video object insertion compared with state-of-the-art solutions and commercial models.
Chinese Translation
现代视频编辑技术在插入视频物体时已实现高视觉保真度。然而,这些技术主要关注优化视觉保真度,而非物理因果关系,导致编辑结果在物理上与环境不一致。在本研究中,我们提出了Place-it-R1,一个端到端的视频物体插入框架,旨在解锁多模态大语言模型(MLLM)的环境感知推理潜力。我们的框架利用MLLM的思维链(Chain-of-Thought, CoT)推理来协调视频扩散,遵循“先思考再放置”(Think-then-Place)范式。为弥合认知推理与生成执行之间的差距,我们引入了三项关键创新:首先,MLLM执行物理场景理解和交互推理,生成环境感知的思维链令牌,并推断有效的插入区域,以明确指导扩散朝向物理上合理的插入。然后,我们引入了MLLM引导的空间直接偏好优化(Spatial Direct Preference Optimization, DPO),将扩散输出反馈给MLLM进行评分,从而实现视觉自然性。在推理过程中,MLLM迭代触发精细化循环,并从扩散模型中引发自适应调整,形成一个闭环,逐步提升编辑质量。此外,我们提供了两种用户可选择的模式:一种以合理性为导向的灵活模式,允许对环境进行修改(例如,生成支撑结构)以增强物理合理性;另一种以保真度为导向的标准模式,保持场景完整性以实现最大保真度,为用户提供对合理性与保真度权衡的明确控制。大量实验表明,与最先进的解决方案和商业模型相比,Place-it-R1实现了物理一致的视频物体插入。
cs.CV / 79 / 2603.06141

Spatial Colour Mixing Illusions as a Perception Stress Test for Vision-Language Models

空间颜色混合错觉作为视觉-语言模型的感知压力测试
Basoc, Nicoleta-Nina, Cosma, Adrian, Radoi, Emilian
Abstract
Vision-language models (VLMs) achieve strong benchmark results, yet can exhibit systematic perceptual weaknesses: structured, large changes to pixel values can cause confident yet nonsensical predictions, even when the underlying scene remains easily recognizable to humans. We study this gap using Spatial Colour Mixing, a programmatic family of colour distortions that overlays structured patterns (in both RGB and Ostwald colour systems) onto natural images. We introduce a framework of eight spatial colour mixing variants and evaluate nine VLMs across three model families on four datasets. Across models and datasets, accuracy degrades sharply with increasing distortion, and scaling the language model does not reliably mitigate the failure. In a human study with 61 participants on an animal recognition dataset, humans substantially outperform VLMs under the same distortions. Finally, we show that a simple human-inspired preprocessing step recovers a meaningful portion of performance for several distortion types, motivating perception-aware preprocessing and tool-use as practical strategies for improving VLM robustness.
Chinese Translation
视觉-语言模型(VLMs)在基准测试中取得了强劲的成绩,但仍可能表现出系统性的感知弱点:对像素值进行结构化、大幅度的变化可能导致模型产生自信但毫无意义的预测,即使底层场景对人类来说仍然易于识别。我们通过空间颜色混合这一程序化的颜色扭曲家族进行研究,该方法将结构化模式(在RGB和Ostwald颜色系统中)叠加到自然图像上。我们引入了八种空间颜色混合变体的框架,并在四个数据集上评估了九个VLM模型,涵盖三个模型家族。在不同模型和数据集上,随着扭曲程度的增加,准确性急剧下降,而扩大语言模型的规模并不能可靠地缓解这一失败。在一项涉及61名参与者的动物识别数据集的人类研究中,人类在相同的扭曲条件下的表现显著优于VLMs。最后,我们展示了一种简单的人类启发的预处理步骤能够恢复多种扭曲类型的性能,激励了感知意识的预处理和工具使用作为提高VLM鲁棒性的实用策略。
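A toy member of such a programmatic distortion family, overlaying pure RGB stripes at a chosen strength (the period and alpha values are arbitrary illustration parameters, not the benchmark's settings):

import numpy as np
from PIL import Image

def spatial_colour_mixing(path, period=32, alpha=0.5, out="mixed.png"):
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    h, w, _ = img.shape
    xx = np.tile(np.arange(w), (h, 1))
    stripes = (xx // period) % 3                            # 0/1/2 colour bands
    pattern = np.eye(3, dtype=np.float32)[stripes] * 255.0  # pure R/G/B stripes
    mixed = (1 - alpha) * img + alpha * pattern             # structured overlay
    Image.fromarray(mixed.clip(0, 255).astype(np.uint8)).save(out)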
cs.CV / 80 / 2603.06147

Longitudinal NSCLC Treatment Progression via Multimodal Generative Models

通过多模态生成模型进行纵向非小细胞肺癌治疗进展研究
Mantegna, Massimiliano, Ayllón, Elena Mulero, Caragliano, Alice Natalina, Di Feola, Francesco, Tacconi, Claudia, Fiore, Michele, Ippolito, Edy, Greco, Carlo, Ramella, Sara, Cattin, Philippe C., Soda, Paolo, Tortora, Matteo, Guarrasi, Valerio
Abstract
Predicting tumor evolution during radiotherapy is a clinically critical challenge, particularly when longitudinal changes are driven by both anatomy and treatment. In this work, we introduce a Virtual Treatment (VT) framework that formulates non-small cell lung cancer (NSCLC) progression as a dose-aware multimodal conditional image-to-image translation problem. Given a CT scan, baseline clinical variables, and a specified radiation dose increment, VT aims to synthesize plausible follow-up CT images reflecting treatment-induced anatomical changes. We evaluate the proposed framework on a longitudinal dataset of 222 stage III NSCLC patients, comprising 895 CT scans acquired during radiotherapy under irregular clinical schedules. The generative process is conditioned on delivered dose increments together with demographic and tumor-related clinical variables. Representative GAN-based and diffusion-based models are benchmarked across 2D and 2.5D configurations. Quantitative and qualitative results indicate that diffusion-based models benefit more consistently from multimodal, dose-aware conditioning and produce more stable and anatomically plausible tumor evolution trajectories than GAN-based baselines, supporting the potential of VT as a tool for in-silico treatment monitoring and adaptive radiotherapy research in NSCLC.
Chinese Translation
在放疗过程中预测肿瘤演变是一个临床上至关重要的挑战,尤其是当纵向变化受到解剖结构和治疗双重驱动时。在本研究中,我们提出了一个虚拟治疗(Virtual Treatment, VT)框架,将非小细胞肺癌(NSCLC)的进展形式化为一个基于剂量的多模态条件图像到图像的转换问题。给定一张CT扫描图像、基线临床变量和指定的放射剂量增量,VT旨在合成反映治疗引起的解剖变化的合理后续CT图像。我们在一个包含222例III期NSCLC患者的纵向数据集中评估了所提出的框架,该数据集包括在不规则临床日程下获取的895张放疗期间的CT扫描。生成过程以施加的剂量增量以及人口统计学和肿瘤相关的临床变量为条件。在2D和2.5D配置中,对代表性的基于生成对抗网络(GAN)和扩散模型进行了基准测试。定量和定性结果表明,基于扩散的模型在多模态、基于剂量的条件下更为一致地受益,并且比基于GAN的基线产生更稳定和解剖上合理的肿瘤演变轨迹,支持VT作为非小细胞肺癌中用于计算机辅助治疗监测和自适应放疗研究的工具的潜力。
cs.CV / 81 / 2603.06148

VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models

VLM-RobustBench:视觉语言模型鲁棒性综合基准
Saxena, Rohit, Suglia, Alessandro, Minervini, Pasquale
Abstract
Vision-language models (VLMs) achieve strong performance on standard, high-quality datasets, but we still do not fully understand how they perform under real-world image distortions. We present VLM-RobustBench, a benchmark spanning 49 augmentation types across noise, blur, weather, digital, and geometric perturbations, evaluated under graded severities (low/mid/high) and binary transforms, yielding 133 corrupted settings. We evaluate VLMs from four families (Qwen, InternVL, Molmo, Gemma) on two complementary benchmarks: MMBench (visually grounded) and MMMU-Pro (reasoning-oriented). Our results reveal that visual severity is a weak predictor of difficulty: low-severity spatial perturbations often degrade performance more than visually severe photometric corruptions. In particular, low-severity glass_blur reduces MMBench accuracy by about 8 pp on average across models, while the largest drops arise from resampling and geometric distortions (e.g., upsample, elastic_transform), reaching up to 34 pp. Overall, our findings suggest current VLMs are semantically strong but spatially fragile, motivating the definition of novel robustness evaluation protocols and training regimes that emphasize resampling and geometric invariances.
Chinese Translation
视觉语言模型(VLMs)在标准的高质量数据集上表现出色,但我们仍然未完全理解它们在现实世界图像失真下的表现。我们提出了VLM-RobustBench,这是一个涵盖49种增强类型的基准,涉及噪声、模糊、天气、数字和几何扰动,并在不同的严重程度(低/中/高)和二元变换下进行评估,产生了133种损坏设置。我们在两个互补的基准上评估了四个家族的VLM(Qwen、InternVL、Molmo、Gemma):MMBench(视觉基础)和MMMU-Pro(推理导向)。我们的结果显示,视觉严重性是难度的弱预测因子:低严重性的空间扰动往往比视觉上严重的光度损坏更能降低性能。特别是,低严重性的玻璃模糊(glass_blur)使得MMBench的准确率平均下降约8个百分点,而最大的下降则来自重采样和几何失真(例如,重采样、弹性变换),降幅可达34个百分点。总体而言,我们的研究结果表明,当前的VLM在语义上表现强劲,但在空间上却脆弱,这促使我们定义新的鲁棒性评估协议和训练机制,强调重采样和几何不变性。
cs.CV / 82 / 2603.06165

Reflective Flow Sampling Enhancement

反射流采样增强
Zhou, Zikai, Wang, Muyao, Shao, Shitong, Bai, Lichen, Xiong, Haoyi, Han, Bo, Xie, Zeke
Abstract
The growing demand for text-to-image generation has led to rapid advances in generative modeling. Recently, text-to-image diffusion models trained with flow matching algorithms, such as FLUX, have achieved remarkable progress and emerged as strong alternatives to conventional diffusion models. At the same time, inference-time enhancement strategies have been shown to improve the generation quality and text-prompt alignment of text-to-image diffusion models. However, these techniques are mainly applicable to conventional diffusion models and usually fail to perform well on flow models. To bridge this gap, we propose Reflective Flow Sampling (RF-Sampling), a theoretically-grounded and training-free inference enhancement framework explicitly designed for flow models, especially for the CFG-distilled variants (i.e., models distilled from CFG guidance techniques), like FLUX. Departing from heuristic interpretations, we provide a formal derivation proving that RF-Sampling implicitly performs gradient ascent on the text-image alignment score. By leveraging a linear combination of textual representations and integrating them with flow inversion, RF-Sampling allows the model to explore noise spaces that are more consistent with the input prompt. Extensive experiments across multiple benchmarks demonstrate that RF-Sampling consistently improves both generation quality and prompt alignment. Moreover, RF-Sampling is also the first inference enhancement method that can exhibit test-time scaling ability to some extent on FLUX.
Chinese Translation
对文本生成图像的日益增长的需求推动了生成建模的快速进展。最近,使用流匹配算法(如 FLUX)训练的文本到图像扩散模型取得了显著进展,并成为传统扩散模型的强有力替代方案。同时,推理时增强策略已被证明可以提高文本到图像扩散模型的生成质量和文本提示对齐。然而,这些技术主要适用于传统扩散模型,通常在流模型上表现不佳。为了解决这一问题,我们提出了反射流采样(RF-Sampling),这是一个理论基础扎实且无需训练的推理增强框架,专门为流模型设计,尤其是针对 CFG 蒸馏变体(即从 CFG 指导技术蒸馏而来的模型),如 FLUX。与启发式解释不同,我们提供了一个正式的推导,证明 RF-Sampling 隐式地对文本-图像对齐分数执行梯度上升。通过利用文本表示的线性组合并将其与流反演结合,RF-Sampling 使模型能够探索与输入提示更一致的噪声空间。跨多个基准的广泛实验表明,RF-Sampling 始终提高了生成质量和提示对齐。此外,RF-Sampling 也是第一个在 FLUX 上在某种程度上表现出测试时扩展能力的推理增强方法。
cs.CV / 83 / 2603.06166

FreeOcc: Training-free Panoptic Occupancy Prediction via Foundation Models

FreeOcc:基于基础模型的无训练全景占用预测
Caunes, Andrew, Chateau, Thierry, Fremont, Vincent
Abstract
Semantic and panoptic occupancy prediction for road scene analysis provides a dense 3D representation of the ego vehicle's surroundings. Current camera-only approaches typically rely on costly dense 3D supervision or require training models on data from the target domain, limiting deployment in unseen environments. We propose FreeOcc, a training-free pipeline that leverages pretrained foundation models to recover both semantics and geometry from multi-view images. FreeOcc extracts per-view panoptic priors with a promptable foundation segmentation model and prompt-to-taxonomy rules, and reconstructs metric 3D points with a reconstruction foundation model. Depth- and confidence-aware filtering lifts reliable labels into 3D, which are fused over time and voxelized with a deterministic refinement stack. For panoptic occupancy, instances are recovered by fitting and merging robust current-view 3D box candidates, enabling instance-aware occupancy without any learned 3D model. On Occ3D-nuScenes, FreeOcc achieves 16.9 mIoU and 16.5 RayIoU train-free, on par with state-of-the-art weakly supervised methods. When employed as a pseudo-label generation pipeline for training downstream models, it achieves 21.1 RayIoU, surpassing the previous state-of-the-art weakly supervised baseline. Furthermore, FreeOcc sets new baselines for both train-free and weakly supervised panoptic occupancy prediction, achieving 3.1 RayPQ and 3.9 RayPQ, respectively. These results highlight foundation-model-driven perception as a practical route to training-free 3D scene understanding.
Chinese Translation
语义和全景占用预测在道路场景分析中提供了自我车辆周围环境的密集3D表示。目前仅使用相机的方法通常依赖于昂贵的密集3D监督,或需要在目标领域的数据上训练模型,这限制了在未见环境中的部署。我们提出了FreeOcc,这是一种无训练的管道,利用预训练的基础模型从多视角图像中恢复语义和几何信息。FreeOcc通过可提示的基础分割模型和提示到分类规则提取每视角的全景先验,并使用重建基础模型重建度量3D点。深度和置信度感知过滤将可靠的标签提升到3D,并通过时间融合和使用确定性精炼堆栈进行体素化。在全景占用方面,通过拟合和合并稳健的当前视角3D框候选实例来恢复实例,使得无需任何学习的3D模型即可实现实例感知占用。在Occ3D-nuScenes数据集上,FreeOcc在无训练的情况下实现了16.9 mIoU和16.5 RayIoU,达到了最先进的弱监督方法的水平。当作为伪标签生成管道用于训练下游模型时,它实现了21.1 RayIoU,超越了之前的最先进弱监督基线。此外,FreeOcc为无训练和弱监督的全景占用预测设定了新的基线,分别达到了3.1 RayPQ和3.9 RayPQ。这些结果突显了以基础模型驱动的感知作为实现无训练3D场景理解的切实途径。
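The depth- and confidence-aware lifting step can be sketched as filtering followed by per-voxel voting; the majority-vote fusion below is an assumed stand-in for the paper's deterministic refinement stack:

```python
import numpy as np

def lift_labels(points, labels, conf, depth_conf, voxel=0.4, conf_min=0.5):
    """Lift per-pixel semantic labels into a sparse voxel grid.

    points:     (N, 3) metric 3D points from the reconstruction model
    labels:     (N,)   non-negative semantic ids from the 2D panoptic priors
    conf:       (N,)   segmentation confidence per point
    depth_conf: (N,)   reconstruction confidence per point
    Keeps only reliable points, then takes a majority vote per voxel."""
    keep = (conf >= conf_min) & (depth_conf >= conf_min)
    pts, lab = points[keep], labels[keep]
    keys = np.floor(pts / voxel).astype(np.int64)
    grid = {}
    for key, l in zip(map(tuple, keys), lab):
        grid.setdefault(key, []).append(l)
    return {k: np.bincount(v).argmax() for k, v in grid.items()}
```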
cs.CV / 84 / 2603.06167

A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement

一种用于乳腺超声分割的半监督框架,具有无训练伪标签生成和标签精炼
Li, Ruili, Ding, Jiayi, Li, Ruiyu, Jin, Yilun, Ge, Shiwen, Zeng, Yuwen, Zhang, Xiaoyong, Takaya, Eichi, Vrba, Jan, Homma, Noriyasu
Abstract
Semi-supervised learning (SSL) has emerged as a promising paradigm for breast ultrasound (BUS) image segmentation, but it often suffers from unstable pseudo labels under extremely limited annotations, leading to inaccurate supervision and degraded performance. Recent vision-language models (VLMs) provide a new opportunity for pseudo-label generation, yet their effectiveness on BUS images remains limited because domain-specific prompts are difficult to transfer. To address this issue, we propose a semi-supervised framework with training-free pseudo-label generation and label refinement. By leveraging simple appearance-based descriptions (e.g., dark oval), our method enables cross-domain structural transfer between natural and medical images, allowing VLMs to generate structurally consistent pseudo labels. These pseudo labels are used to warm up a static teacher that captures global structural priors of breast lesions. Combined with an exponential moving average teacher, we further introduce uncertainty entropy weighted fusion and adaptive uncertainty-guided reverse contrastive learning to improve boundary discrimination. Experiments on four BUS datasets demonstrate that our method achieves performance comparable to fully supervised models even with only 2.5% labeled data, significantly outperforming existing SSL approaches. Moreover, the proposed paradigm is readily extensible: for other imaging modalities or diseases, only a global appearance description is required to obtain reliable pseudo supervision, enabling scalable semi-supervised medical image segmentation under limited annotations.
Chinese Translation
半监督学习(SSL)作为乳腺超声(BUS)图像分割的有前景的范式,然而在极其有限的标注下,它常常面临不稳定的伪标签问题,导致监督不准确和性能下降。近期的视觉-语言模型(VLMs)为伪标签生成提供了新的机会,但由于领域特定提示难以迁移,其在BUS图像上的有效性仍然有限。为了解决这一问题,我们提出了一种具有无训练伪标签生成和标签精炼的半监督框架。通过利用简单的基于外观的描述(例如,深色椭圆),我们的方法实现了自然图像与医学图像之间的跨领域结构转移,使得VLMs能够生成结构一致的伪标签。这些伪标签用于预热一个静态教师模型,该模型捕捉乳腺病变的全局结构先验。结合指数移动平均教师,我们进一步引入不确定性熵加权融合和自适应不确定性引导的反向对比学习,以提高边界判别能力。在四个BUS数据集上的实验表明,我们的方法在仅使用2.5%标注数据的情况下,性能可与全监督模型相媲美,显著优于现有的SSL方法。此外,所提出的范式具有良好的扩展性:对于其他成像模态或疾病,只需一个全局外观描述即可获得可靠的伪监督,从而在有限标注下实现可扩展的半监督医学图像分割。
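A minimal sketch of the teacher machinery, assuming standard PyTorch modules: the EMA update is the textbook form, and inverse-entropy weighting is one plausible reading of the uncertainty entropy weighted fusion:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, m=0.99):
    """Exponential-moving-average teacher update."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)

def entropy_weighted_fusion(p_static, p_ema, eps=1e-8):
    """Fuse the static and EMA teachers' per-pixel probabilities,
    weighting each by its certainty (inverse entropy).
    p_static, p_ema: (B, C, H, W) softmax maps."""
    def certainty(p):
        ent = -(p * (p + eps).log()).sum(dim=1, keepdim=True)  # (B,1,H,W)
        return 1.0 / (ent + eps)
    w1, w2 = certainty(p_static), certainty(p_ema)
    return (w1 * p_static + w2 * p_ema) / (w1 + w2)
```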
cs.CV / 85 / 2603.06168

JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas

JOPP-3D:基于点云和全景图的联合开放词汇语义分割
Inuganti, Sandeep, Kanayama, Hideaki, Shimizu, Kanta, Chamseddine, Mahdi, Yokota, Soichiro, Stricker, Didier, Rambach, Jason
Abstract
Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed-label models. In this paper, we present JOPP-3D, an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. We convert RGB-D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision-language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford-2D-3D-s and ToF-360 datasets demonstrates the capability of JOPP-3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.
Chinese Translation
在3D点云和全景图等视觉模态之间进行语义分割仍然是一项具有挑战性的任务,主要由于标注数据的稀缺以及固定标签模型的适应性有限。本文提出了JOPP-3D,一个开放词汇语义分割框架,联合利用全景和点云数据,以实现基于语言的场景理解。我们将RGB-D全景图像转换为相应的切向透视图像和3D点云,然后利用这些模态提取和对齐基础的视觉-语言特征。这使得自然语言查询能够在两种输入模态上生成语义掩码。在Stanford-2D-3D-s和ToF-360数据集上的实验评估表明,JOPP-3D能够在全景和3D领域产生连贯且语义上有意义的分割。与当前最先进技术(SOTA)相比,我们提出的方法在开放和闭合词汇的2D和3D语义分割上取得了显著的提升。
cs.CV / 86 / 2603.06173

Optimizing 3D Diffusion Models for Medical Imaging via Multi-Scale Reward Learning

通过多尺度奖励学习优化医学成像的3D扩散模型
Tian, Yueying, Han, Xudong, Zhou, Meng, Aviles-Espinosa, Rodrigo, Young, Rupert, Birch, Philip
Abstract
Diffusion models have emerged as powerful tools for 3D medical image generation, yet bridging the gap between standard training objectives and clinical relevance remains a challenge. This paper presents a method to enhance 3D diffusion models using Reinforcement Learning (RL) with multi-scale feedback. We first pretrain a 3D diffusion model on MRI volumes to establish a robust generative prior. Subsequently, we fine-tune the model using Proximal Policy Optimization (PPO), guided by a novel reward system that integrates both 2D slice-wise assessments and 3D volumetric analysis. This combination allows the model to simultaneously optimize for local texture details and global structural coherence. We validate our framework on the BraTS 2019 and OASIS-1 datasets. Our results indicate that incorporating RL feedback effectively steers the generation process toward higher quality distributions. Quantitative analysis reveals significant improvements in Fréchet Inception Distance (FID) and, crucially, the synthetic data demonstrates enhanced utility in downstream tumor and disease classification tasks compared to non-optimized baselines.
Chinese Translation
扩散模型已成为3D医学图像生成的强大工具,但在标准训练目标与临床相关性之间架起桥梁仍然是一项挑战。本文提出了一种通过多尺度反馈的强化学习(Reinforcement Learning, RL)来增强3D扩散模型的方法。我们首先在MRI体积上对3D扩散模型进行预训练,以建立一个稳健的生成先验。随后,我们使用近端策略优化(Proximal Policy Optimization, PPO)对模型进行微调,并由一个新颖的奖励系统引导,该系统整合了2D切片评估和3D体积分析。这种组合使模型能够同时优化局部纹理细节和全局结构一致性。我们在BraTS 2019和OASIS-1数据集上验证了我们的框架。结果表明,纳入RL反馈有效地引导生成过程朝向更高质量的分布。定量分析显示Fréchet Inception Distance (FID)显著改善,重要的是,合成数据在下游肿瘤和疾病分类任务中相比于未优化基线表现出更高的实用性。
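The reward structure reduces to a weighted blend of slice-wise and volumetric scores; `score_2d`, `score_3d`, and the weight `alpha` below are stand-ins, not the paper's reward heads:

```python
def multi_scale_reward(volume, score_2d, score_3d, alpha=0.5):
    """Combine slice-wise 2D quality with volumetric 3D coherence.

    volume: (D, H, W) array of axial slices; score_2d(slice) -> float and
    score_3d(volume) -> float are placeholder reward heads. The combined
    scalar would be fed to PPO as the per-sample return."""
    slice_term = sum(score_2d(s) for s in volume) / len(volume)
    volume_term = score_3d(volume)
    return alpha * slice_term + (1 - alpha) * volume_term
```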
cs.CV / 87 / 2603.06178

Making Training-Free Diffusion Segmentors Scale with the Generative Power

让无训练扩散分割器随生成能力一同扩展
Meng, Benyuan, Xu, Qianqian, Wang, Zitai, Cao, Xiaochun, Huang, Longtao, Huang, Qingming
Abstract
As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to what are known as training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation. (ii) Even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Codes are at https://github.com/Darkbblue/goca.
Chinese Translation
作为强大的生成模型,文本到图像的扩散模型最近被探索用于判别任务。一系列研究集中于将预训练的扩散模型适应于语义分割,而无需进一步训练,从而形成了所谓的无训练扩散分割器。这些方法通常依赖于模型注意力层中的交叉注意力图,假设这些图能够捕捉图像像素与文本标记之间的语义关系。理想情况下,这类方法应该能从更强大的扩散模型中受益,即更强的生成能力应能带来更好的分割效果。然而,我们观察到现有方法往往无法相应地扩展。为了理解这个问题,我们识别出两个潜在的差距:(i)交叉注意力是在多个头和层之间计算的,但这些单独的注意力图与统一的全局表示之间存在差异。(ii)即使有全局图可用,由于不同文本标记之间的分数不平衡,它也并不能直接转化为准确的语义关联以进行分割。为了弥合这些差距,我们提出了两种技术:自动聚合和逐像素重标定,这两者结合使得无训练分割能够更好地利用生成能力。我们在标准的语义分割基准上评估了我们的方法,并进一步将其整合到生成技术中,展示了改进的性能和广泛的适用性。代码可在 https://github.com/Darkbblue/goca 获取。
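A toy version of the two proposed techniques; uniform head/layer averaging is a simplification of the learned auto aggregation, and a per-patch softmax over tokens is one concrete form of per-pixel rescaling:

```python
import torch

def aggregate_and_rescale(attn_maps):
    """attn_maps: (L, H, P, T) cross-attention over L layers, H heads,
    P image patches, T text tokens.

    Step 1 (aggregation): average heads and layers into one global map;
    the paper learns this automatically, uniform averaging is a stand-in.
    Step 2 (per-pixel rescaling): renormalise scores across tokens at
    every patch so no token dominates purely by score magnitude."""
    global_map = attn_maps.mean(dim=(0, 1))        # (P, T)
    return torch.softmax(global_map, dim=-1)       # per-patch over tokens
```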
cs.CV / 88 / 2603.06180

Contrastive-to-Self-Supervised: A Two-Stage Framework for Script Similarity Learning

从对比学习到自监督:一种用于文字相似性学习的两阶段框架
Roman, Claire, Meyer, Philippe
Abstract
Learning similarity metrics for glyphs and writing systems faces a fundamental challenge: while individual graphemes within invented alphabets can be reliably labeled, the historical relationships between different scripts remain uncertain and contested. We propose a two-stage framework that addresses this epistemological constraint. First, we train an encoder with contrastive loss on labeled invented alphabets, establishing a teacher model with robust discriminative features. Second, we extend to historically attested scripts through teacher-student distillation, where the student learns unsupervised representations guided by the teacher's knowledge but free to discover latent cross-script similarities. The asymmetric setup enables the student to learn deformation-invariant embeddings while inheriting discriminative structure from clean examples. Our approach bridges supervised contrastive learning and unsupervised discovery, enabling both hard boundaries between distinct systems and soft similarities reflecting potential historical influences. Experiments on diverse writing systems demonstrate effective few-shot glyph recognition and meaningful script clustering without requiring ground-truth evolutionary relationships.
Chinese Translation
学习字形和书写系统的相似性度量面临一个基本挑战:尽管在发明的字母表中,单个字形可以被可靠标注,但不同书写系统之间的历史关系仍然不确定且存在争议。我们提出了一种两阶段框架来解决这一认识论限制。首先,我们在标注的发明字母表上训练一个带有对比损失的编码器,建立一个具有强大判别特征的教师模型。其次,我们通过教师-学生蒸馏扩展到历史上有据可查的书写系统,在此过程中,学生在教师知识的指导下学习无监督表示,但可以自由发现潜在的跨文字相似性。这种不对称的设置使得学生能够学习对形变不变的嵌入,同时继承来自干净示例的判别结构。我们的方法架起了监督对比学习与无监督发现之间的桥梁,使得不同系统之间既有明确的边界,又有反映潜在历史影响的软相似性。对多种书写系统的实验表明,在不需要真实进化关系的情况下,能够有效实现少样本字形识别和有意义的文字聚类。
cs.CV / 89 / 2603.06181

Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots

迈向运动图灵测试:评估类人机器人中的人类相似性
Li, Mingzhe, Liu, Mengyin, Wu, Zekai, Lin, Xincheng, Zhang, Junsheng, Yan, Ming, Xie, Zengye, Zhang, Changwang, Wen, Chenglu, Xu, Lan, Shen, Siqi, Wang, Cheng
Abstract
Humanoid robots have achieved significant progress in motion generation and control, exhibiting movements that appear increasingly natural and human-like. Inspired by the Turing Test, we propose the Motion Turing Test, a framework that evaluates whether human observers can discriminate between humanoid robot and human poses using only kinematic information. To facilitate this evaluation, we present the Human-Humanoid Motion (HHMotion) dataset, which consists of 1,000 motion sequences spanning 15 action categories, performed by 11 humanoid models and 10 human subjects. All motion sequences are converted into SMPL-X representations to eliminate the influence of visual appearance. We recruited 30 annotators to rate the human-likeness of each pose on a 0-5 scale, resulting in over 500 hours of annotation. Analysis of the collected data reveals that humanoid motions still exhibit noticeable deviations from human movements, particularly in dynamic actions such as jumping, boxing, and running. Building on HHMotion, we formulate a human-likeness evaluation task that aims to automatically predict human-likeness scores from motion data. Despite recent progress in multimodal large language models, we find that they remain inadequate for assessing motion human-likeness. To address this, we propose a simple baseline model and demonstrate that it outperforms several contemporary LLM-based methods. The dataset, code, and benchmark will be publicly released to support future research in the community.
Chinese Translation
类人机器人在运动生成和控制方面取得了显著进展,展现出越来越自然和类人化的动作。受图灵测试的启发,我们提出了运动图灵测试(Motion Turing Test),这是一个评估人类观察者是否能够仅通过运动学信息区分类人机器人和人类姿态的框架。为了便于这一评估,我们呈现了人类-类人运动(Human-Humanoid Motion, HHMotion)数据集,该数据集包含1,000个运动序列,涵盖15个动作类别,由11个类人模型和10名人类受试者执行。所有运动序列均转换为SMPL-X表示,以消除视觉外观的影响。我们招募了30名注释员对每个姿态进行0-5分的类人性评分,累计超过500小时的注释。对收集数据的分析表明,类人运动仍然与人类运动存在明显偏差,特别是在跳跃、拳击和奔跑等动态动作中。在HHMotion的基础上,我们制定了一个类人性评估任务,旨在从运动数据中自动预测类人性评分。尽管多模态大型语言模型(LLM)近期取得了进展,但我们发现它们在评估运动类人性方面仍然不足。为此,我们提出了一个简单的基线模型,并证明其优于几种当代基于LLM的方法。该数据集、代码和基准将公开发布,以支持该领域未来的研究。
cs.CV / 90 / 2603.06186

SpaCRD: Multimodal Deep Fusion of Histology and Spatial Transcriptomics for Cancer Region Detection

SpaCRD:组织学与空间转录组学的多模态深度融合用于癌症区域检测
Xue, Shuailin, Wan, Jun, Zhang, Lihua, Min, Wenwen
Abstract
Accurate detection of cancer tissue regions (CTR) enables deeper analysis of the tumor microenvironment and offers crucial insights into treatment response. Traditional CTR detection methods, which typically rely on the rich cellular morphology in histology images, are susceptible to a high rate of false positives due to morphological similarities across different tissue regions. The groundbreaking advances in spatial transcriptomics (ST) provide detailed cellular phenotypes and spatial localization information, offering new opportunities for more accurate cancer region detection. However, current methods are unable to effectively integrate histology images with ST data, especially in the context of cross-sample and cross-platform/batch settings for accomplishing the CTR detection. To address this challenge, we propose SpaCRD, a transfer learning-based method that deeply integrates histology images and ST data to enable reliable CTR detection across diverse samples, platforms, and batches. Once trained on source data, SpaCRD can be readily generalized to accurately detect cancerous regions across samples from different platforms and batches. The core of SpaCRD is a category-regularized variational reconstruction-guided bidirectional cross-attention fusion network, which enables the model to adaptively capture latent co-expression patterns between histological features and gene expression from multiple perspectives. Extensive benchmark analysis on 23 matched histology-ST datasets spanning various disease types, platforms, and batches demonstrates that SpaCRD consistently outperforms existing eight state-of-the-art methods in CTR detection.
Chinese Translation
准确检测癌症组织区域(CTR)能够深入分析肿瘤微环境,并为治疗反应提供重要见解。传统的CTR检测方法通常依赖于组织学图像中的丰富细胞形态,但由于不同组织区域之间的形态相似性,这些方法容易产生较高的假阳性率。空间转录组学(ST)的突破性进展提供了详细的细胞表型和空间定位信息,为更准确的癌症区域检测提供了新的机会。然而,当前的方法无法有效地将组织学图像与ST数据整合,特别是在跨样本和跨平台/批次的背景下进行CTR检测。为了解决这一挑战,我们提出了SpaCRD,这是一种基于迁移学习的方法,深度整合组织学图像和ST数据,以实现跨多样本、平台和批次的可靠CTR检测。一旦在源数据上训练完成,SpaCRD可以很好地推广到不同平台和批次的样本中准确检测癌变区域。SpaCRD的核心是一个类别正则化的变分重构引导的双向交叉注意力融合网络,使模型能够自适应地捕捉组织学特征与多个视角下基因表达之间的潜在共表达模式。在涵盖多种疾病类型、平台和批次的23个匹配的组织学-ST数据集上进行的广泛基准分析表明,SpaCRD在CTR检测中始终优于现有的八种最先进的方法。
cs.CV / 91 / 2603.06200

Adaptive Language-Aware Image Reflection Removal Network

自适应语言感知图像反射去除网络
Fang, Siyan, Wang, Yuntao, Zhang, Jinpu, Li, Ziwen, Wang, Yuehuan
Abstract
Existing image reflection removal methods struggle to handle complex reflections. Accurate language descriptions can help the model understand the image content to remove complex reflections. However, due to blurred and distorted interferences in reflected images, machine-generated language descriptions of the image content are often inaccurate, which harms the performance of language-guided reflection removal. To address this, we propose the Adaptive Language-Aware Network (ALANet) to remove reflections even with inaccurate language inputs. Specifically, ALANet integrates both filtering and optimization strategies. The filtering strategy reduces the negative effects of language while preserving its benefits, whereas the optimization strategy enhances the alignment between language and visual features. ALANet also utilizes language cues to decouple specific layer content from feature maps, improving its ability to handle complex reflections. To evaluate the model's performance under complex reflections and varying levels of language accuracy, we introduce the Complex Reflection and Language Accuracy Variance (CRLAV) dataset. Experimental results demonstrate that ALANet surpasses state-of-the-art methods for image reflection removal. The code and dataset are available at https://github.com/fashyon/ALANet.
Chinese Translation
现有的图像反射去除方法在处理复杂反射时表现不佳。准确的语言描述可以帮助模型理解图像内容,从而去除复杂反射。然而,由于反射图像中存在模糊和扭曲的干扰,机器生成的图像内容语言描述往往不准确,这会影响基于语言的反射去除性能。为了解决这个问题,我们提出了自适应语言感知网络(Adaptive Language-Aware Network,ALANet),即使在语言输入不准确的情况下也能去除反射。具体而言,ALANet结合了过滤和优化策略。过滤策略减少了语言的负面影响,同时保留其优势,而优化策略增强了语言与视觉特征之间的对齐。ALANet还利用语言线索将特定层的内容从特征图中解耦,提高其处理复杂反射的能力。为了评估模型在复杂反射和不同语言准确性水平下的性能,我们引入了复杂反射与语言准确性方差(Complex Reflection and Language Accuracy Variance,CRLAV)数据集。实验结果表明,ALANet在图像反射去除方面超越了最先进的方法。代码和数据集可在 https://github.com/fashyon/ALANet 获取。
cs.CV / 92 / 2603.06201

Point-Supervised Skeleton-Based Human Action Segmentation

点监督的基于骨架的人类动作分割
Wang, Hongsong, Shen, Yiqin, Yan, Pengbo, Gui, Jie
Abstract
Skeleton-based temporal action segmentation is a fundamental yet challenging task, playing a crucial role in enabling intelligent systems to perceive and respond to human activities. While fully-supervised methods achieve satisfactory performance, they require costly frame-level annotations and are sensitive to ambiguous action boundaries. To address these issues, we introduce a point-supervised framework for skeleton-based action segmentation, where only a single frame per action segment is labeled. We leverage multimodal skeleton data, including joint, bone, and motion information, encoded via a pretrained unified model to extract rich feature representations. To generate reliable pseudo-labels, we propose a novel prototype similarity method and integrate it with two existing methods: energy function and constrained K-Medoids clustering. Multimodal pseudo-label integration is proposed to enhance the reliability of the pseudo-label and guide the model training. We establish new benchmarks on PKU-MMD (X-Sub and X-View), MCFS-22, and MCFS-130, and implement baselines for point-supervised skeleton-based human action segmentation. Extensive experiments show that our method achieves competitive performance, even surpassing some fully-supervised methods while significantly reducing annotation effort.
Chinese Translation
基于骨架的时间动作分割是一项基础而具有挑战性的任务,在使智能系统感知和响应人类活动中发挥着至关重要的作用。尽管全监督方法能够实现令人满意的性能,但它们需要昂贵的帧级标注,并且对模糊的动作边界敏感。为了解决这些问题,我们提出了一种点监督框架用于基于骨架的动作分割,其中每个动作段仅标注一个帧。我们利用多模态骨架数据,包括关节、骨骼和运动信息,通过预训练的统一模型进行编码,以提取丰富的特征表示。为了生成可靠的伪标签,我们提出了一种新颖的原型相似性方法,并将其与两种现有方法结合:能量函数和约束K-Medoids聚类。我们提出了多模态伪标签集成,以增强伪标签的可靠性并指导模型训练。我们在PKU-MMD(X-Sub和X-View)、MCFS-22和MCFS-130上建立了新的基准,并实现了基于点监督的骨架人类动作分割的基线。大量实验表明,我们的方法实现了具有竞争力的性能,甚至超过了一些全监督方法,同时显著减少了标注工作量。
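The prototype-similarity pseudo-labeling step can be sketched as nearest-prototype assignment under a confidence threshold; the threshold `tau` and the rejection label are assumptions:

```python
import torch
import torch.nn.functional as F

def prototype_pseudo_labels(feats, labeled_idx, labeled_cls, n_cls, tau=0.9):
    """feats: (T, D) per-frame skeleton features; labeled_idx/labeled_cls
    give the single annotated frame of each action segment.

    Builds one prototype per class from the point labels (assumes every
    class has at least one labeled frame), then assigns each frame to its
    nearest prototype by cosine similarity; frames below tau stay -1."""
    feats = F.normalize(feats, dim=1)
    protos = torch.stack([
        feats[labeled_idx[labeled_cls == c]].mean(dim=0) for c in range(n_cls)
    ])
    sim = feats @ F.normalize(protos, dim=1).t()     # (T, n_cls)
    conf, pseudo = sim.max(dim=1)
    pseudo[conf < tau] = -1                          # unreliable -> unlabeled
    return pseudo
```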
cs.CV / 93 / 2603.06210

VG3S: Visual Geometry Grounded Gaussian Splatting for Semantic Occupancy Prediction

VG3S:基于视觉几何的高斯点云在语义占用预测中的应用
Yan, Xiaoyang, Pei, Muleilan, Shen, Shaojie
Abstract
3D semantic occupancy prediction has become a crucial perception task for comprehensive scene understanding in autonomous driving. While recent advances have explored 3D Gaussian splatting for occupancy modeling to substantially reduce computational overhead, the generation of high-quality 3D Gaussians relies heavily on accurate geometric cues, which are often insufficient in purely vision-centric paradigms. To bridge this gap, we advocate for injecting the strong geometric grounding capability from Vision Foundation Models (VFMs) into occupancy prediction. In this regard, we introduce Visual Geometry Grounded Gaussian Splatting (VG3S), a novel framework that empowers Gaussian-based occupancy prediction with cross-view 3D geometric grounding. Specifically, to fully exploit the rich 3D geometric priors from a frozen VFM, we propose a plug-and-play hierarchical geometric feature adapter, which can effectively transform generic VFM tokens via feature aggregation, task-specific alignment, and multi-scale restructuring. Extensive experiments on the nuScenes occupancy benchmark demonstrate that VG3S achieves remarkable improvements of 12.6% in IoU and 7.5% in mIoU over the baseline. Furthermore, we show that VG3S generalizes seamlessly across diverse VFMs, consistently enhancing occupancy prediction accuracy and firmly underscoring the immense value of integrating priors derived from powerful, pre-trained geometry-grounded VFMs.
Chinese Translation
3D语义占用预测已成为自动驾驶中全面场景理解的重要感知任务。尽管近期的研究探索了3D高斯点云在占用建模中的应用,以显著降低计算开销,但高质量3D高斯的生成在很大程度上依赖于准确的几何线索,而这些线索在纯视觉中心的范式中往往不足。为了解决这一问题,我们提倡将视觉基础模型(Vision Foundation Models, VFMs)中的强几何基础能力注入到占用预测中。在这方面,我们提出了视觉几何基础高斯点云(Visual Geometry Grounded Gaussian Splatting, VG3S),这是一个新颖的框架,赋予基于高斯的占用预测以跨视角的3D几何基础。具体而言,为了充分利用来自冻结的VFM的丰富3D几何先验,我们提出了一种即插即用的层次几何特征适配器,该适配器能够通过特征聚合、任务特定对齐和多尺度重构有效地转换通用VFM标记。在nuScenes占用基准上的大量实验表明,VG3S在IoU上实现了12.6%的显著提升,在mIoU上实现了7.5%的提升,超越了基线。此外,我们还展示了VG3S在不同VFM之间的无缝泛化,持续提高了占用预测的准确性,并坚实地强调了整合来自强大、预训练的几何基础VFM的先验知识的巨大价值。
cs.CV / 94 / 2603.06213

Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events

直截了当:通过事件链实现无训练的多模态摘要
You, Xiaoxing, Huang, Qiang, Li, Lingyu, Chang, Xiaojun, Yu, Jun
Abstract
Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce **CoE**, a training-free MMS framework that performs structured reasoning through a **Chain-of-Events** guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, **CoE** localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive experiments on eight diverse datasets demonstrate that **CoE** consistently outperforms state-of-the-art video CoT baselines, achieving average gains of **+3.04 ROUGE**, **+9.51 CIDEr**, and **+1.88 BERTScore**, highlighting its robustness, interpretability, and cross-domain generalization. Our code is available at https://github.com/youxiaoxing/CoE.
Chinese Translation
多模态摘要(MMS)旨在通过理解和整合视频、转录文本和图像中的信息生成简明的文本摘要。然而,现有的方法仍面临三个主要挑战:(1)依赖于特定领域的监督,(2)隐式融合且跨模态落地薄弱,以及(3)缺乏事件转变的扁平时间建模。为了解决这些问题,我们提出了**CoE**,一个无训练的MMS框架,通过一个由层次事件图(HEG)引导的**事件链**进行结构化推理。HEG将文本语义编码为一个明确的事件层次结构,支撑跨模态基础和时间推理。在这一结构的指导下,**CoE**定位关键视觉线索,建模事件演变和因果转变,并通过轻量级的风格适应进行输出优化,以实现领域对齐。在八个不同的数据集上进行的广泛实验表明,**CoE**始终优于最先进的视频CoT基线,平均提升**+3.04 ROUGE**,**+9.51 CIDEr**和**+1.88 BERTScore**,突显了其鲁棒性、可解释性和跨领域泛化能力。我们的代码可在 https://github.com/youxiaoxing/CoE 获取。
cs.CV / 95 / 2603.06216

EntON: Eigenentropy-Optimized Neighborhood Densification in 3D Gaussian Splatting

EntON:基于特征熵优化的3D高斯点云邻域稠密化
Jäger, Miriam, Jutzi, Boris
Abstract
We present a novel Eigenentropy-optimized neighborhood densification strategy EntON in 3D Gaussian Splatting (3DGS) for geometrically accurate and high-quality rendered 3D reconstruction. While standard 3DGS produces Gaussians whose centers and surfaces are poorly aligned with the underlying object geometry, surface-focused reconstruction methods frequently sacrifice photometric accuracy. In contrast to the conventional densification strategy, which relies on the magnitude of the view-space position gradient, our approach introduces a geometry-aware strategy to guide adaptive splitting and pruning. Specifically, we compute the 3D shape feature Eigenentropy from the eigenvalues of the covariance matrix in the k-nearest neighborhood of each Gaussian center, which quantifies the local structural order. These Eigenentropy values are integrated into an alternating optimization framework: During the optimization process, the algorithm alternates between (i) standard gradient-based densification, which refines regions via view-space gradients, and (ii) Eigenentropy-aware densification, which preferentially densifies Gaussians in low-Eigenentropy (ordered, flat) neighborhoods to better capture fine geometric details on the object surface, and prunes those in high-Eigenentropy (disordered, spherical) regions. We provide quantitative and qualitative evaluations on two benchmark datasets: small-scale DTU dataset and large-scale TUM2TWIN dataset, covering man-made objects and urban scenes. Experiments demonstrate that our Eigenentropy-aware alternating densification strategy improves geometric accuracy by up to 33% and rendering quality by up to 7%, while reducing the number of Gaussians by up to 50% and training time by up to 23%. Overall, EntON achieves a favorable balance between geometric accuracy, rendering quality and efficiency by avoiding unnecessary scene expansion.
Chinese Translation
我们提出了一种新颖的基于特征熵优化的邻域稠密化策略EntON,用于3D高斯点云(3D Gaussian Splatting,3DGS),以实现几何准确且高质量的3D重建。标准的3DGS生成的高斯分布,其中心和表面与底层物体几何形状的对齐较差,而以表面为中心的重建方法常常牺牲光度准确性。与传统的稠密化策略依赖于视空间位置梯度的大小不同,我们的方法引入了一种几何感知策略来指导自适应的分裂和修剪。具体而言,我们从每个高斯中心的k近邻的协方差矩阵的特征值中计算3D形状特征特征熵(Eigenentropy),以量化局部结构的有序性。这些特征熵值被整合到一个交替优化框架中:在优化过程中,算法在(i)标准基于梯度的稠密化(通过视空间梯度细化区域)和(ii)基于特征熵的稠密化(优先在低特征熵(有序、平坦)邻域中稠密化高斯,以更好地捕捉物体表面的细微几何细节,并修剪高特征熵(无序、球形)区域中的高斯)之间交替进行。我们在两个基准数据集上提供了定量和定性评估:小规模的DTU数据集和大规模的TUM2TWIN数据集,涵盖人造物体和城市场景。实验表明,我们的基于特征熵的交替稠密化策略在几何准确性上提高了多达33%,在渲染质量上提高了多达7%,同时将高斯数量减少了多达50%,训练时间减少了多达23%。总体而言,EntON通过避免不必要的场景扩展,实现了几何准确性、渲染质量和效率之间的良好平衡。
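Eigenentropy itself has a standard closed form: the Shannon entropy of the normalised eigenvalues of the local covariance. A sketch over Gaussian centers, assuming scipy for the k-nearest-neighbour query:

```python
import numpy as np
from scipy.spatial import cKDTree

def eigenentropy(centers, k=16, eps=1e-12):
    """Per-point Eigenentropy from the covariance of the k nearest
    Gaussian centers. Low values indicate ordered, flat neighbourhoods
    (candidates for densification); high values indicate disordered,
    spherical ones (candidates for pruning)."""
    tree = cKDTree(centers)
    _, idx = tree.query(centers, k=k)
    ent = np.empty(len(centers))
    for i, nbrs in enumerate(idx):
        cov = np.cov(centers[nbrs].T)                  # 3x3 covariance
        lam = np.clip(np.linalg.eigvalsh(cov), eps, None)
        p = lam / lam.sum()                            # normalised eigenvalues
        ent[i] = -(p * np.log(p)).sum()                # Shannon entropy
    return ent
```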
cs.CV / 96 / 2603.06220

Word-Anchored Temporal Forgery Localization

基于词语锚定的时间伪造定位
Wang, Tianyi, Shao, Xi, Cheng, Harry, Wang, Yinglong, Kankanhalli, Mohan
Abstract
Current temporal forgery localization (TFL) approaches typically rely on temporal boundary regression or continuous frame-level anomaly detection paradigms to derive candidate forgery proposals. However, they suffer not only from feature granularity misalignment but also from costly computation. To address these issues, we propose word-anchored temporal forgery localization (WAFL), a novel paradigm that shifts the TFL task from temporal regression and continuous localization to discrete word-level binary classification. Specifically, we first analyze the essence of temporal forgeries and identify the minimum meaningful forgery units, word tokens, and then align data preprocessing with the natural linguistic boundaries of speech. To adapt powerful pre-trained foundation backbones for feature extraction, we introduce the forensic feature realignment (FFR) module, mapping representations from the pre-trained semantic space to a discriminative forensic manifold. This allows subsequent lightweight linear classifiers to efficiently perform binary classification and accomplish the TFL task. Furthermore, to overcome the extreme class imbalance inherent to forgery detection, we design the artifact-centric asymmetric (ACA) loss, which breaks the standard precision-recall trade-off by dynamically suppressing overwhelming authentic gradients while asymmetrically prioritizing subtle forensic artifacts. Extensive experiments demonstrate that WAFL significantly outperforms state-of-the-art approaches in localization performance under both in- and cross-dataset settings, while requiring substantially fewer learnable parameters and operating at high computational efficiency.
Chinese Translation
当前的时间伪造定位(TFL)方法通常依赖于时间边界回归或连续帧级异常检测范式来推导候选伪造提议。然而,这些方法不仅存在特征粒度不匹配的问题,还面临高昂的计算成本。为了解决这些问题,我们提出了基于词语锚定的时间伪造定位(WAFL),这是一种新颖的范式,将TFL任务从时间回归和连续定位转变为离散的词语级二分类。具体而言,我们首先分析时间伪造的本质,并识别出最小的有意义伪造单元——词语标记,然后将数据预处理与语音的自然语言边界对齐。为了适应强大的预训练基础模型进行特征提取,我们引入了取证特征重对齐(FFR)模块,将预训练语义空间中的表示映射到一个可判别的取证流形。这使得后续的轻量级线性分类器能够高效地执行二分类并完成TFL任务。此外,为了克服伪造检测中固有的极端类别不平衡问题,我们设计了以伪影为中心的非对称(ACA)损失,通过动态抑制压倒性的真实梯度,同时非对称地优先考虑细微的取证伪影,从而打破标准的精确率-召回权衡。大量实验表明,WAFL在定位性能上显著优于最先进的方法,无论是在同一数据集还是跨数据集设置下,同时需要的可学习参数显著更少,并且在高计算效率下运行。
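A hedged, focal-style stand-in for the artifact-centric asymmetric idea (the published ACA loss may differ): easy authentic samples are down-weighted aggressively while forged ones keep near-full gradient:

```python
import torch
import torch.nn.functional as F

def aca_style_loss(logits, targets, gamma_real=4.0, gamma_fake=0.0):
    """Asymmetric focal-style binary loss, illustrating the idea only.

    logits, targets: (N,) with targets in {0: real, 1: fake}. A large
    gamma on the real class suppresses the overwhelming authentic
    gradients; gamma ~ 0 on the fake class keeps forged tokens sharp."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)          # prob of true class
    gamma = torch.where(targets == 1,
                        torch.full_like(p, gamma_fake),
                        torch.full_like(p, gamma_real))
    focal = (1 - p_t) ** gamma
    bce = F.binary_cross_entropy_with_logits(
        logits, targets.float(), reduction="none")
    return (focal * bce).mean()
```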
cs.CV / 97 / 2603.06228

Low-latency Event-based Object Detection with Spatially-Sparse Linear Attention

基于空间稀疏线性注意力的低延迟事件目标检测
Hao, Haiqing, Sui, Zhipeng, Zou, Rong, Dai, Zijia, Zubić, Nikola, Scaramuzza, Davide, Wang, Wenhui
Abstract
Event cameras provide sequential visual data with spatial sparsity and high temporal resolution, making them attractive for low-latency object detection. Existing asynchronous event-based neural networks realize this low-latency advantage by updating predictions event-by-event, but still suffer from two bottlenecks: recurrent architectures are difficult to train efficiently on long sequences, and improving accuracy often increases per-event computation and latency. Linear attention is appealing in this setting because it supports parallel training and recurrent inference. However, standard linear attention updates a global state for every event, yielding a poor accuracy-efficiency trade-off, which is problematic for object detection, where fine-grained representations and thus states are preferred. The key challenge is therefore to introduce sparse state activation that exploits event sparsity while preserving efficient parallel training. We propose Spatially-Sparse Linear Attention (SSLA), which introduces a mixture-of-spaces state decomposition and a scatter-compute-gather training procedure, enabling state-level sparsity as well as training parallelism. Built on SSLA, we develop an end-to-end asynchronous linear attention model, SSLA-Det, for event-based object detection. On Gen1 and N-Caltech101, SSLA-Det achieves state-of-the-art accuracy among asynchronous methods, reaching 0.375 mAP and 0.515 mAP, respectively, while reducing per-event computation by more than 20 times compared to the strongest prior asynchronous baseline, demonstrating the potential of linear attention for low-latency event-based vision.
Chinese Translation
事件相机提供了具有空间稀疏性和高时间分辨率的序列视觉数据,使其在低延迟目标检测中具有吸引力。现有的异步基于事件的神经网络通过逐事件更新预测来实现这一低延迟优势,但仍然面临两个瓶颈:递归架构在长序列上难以高效训练,而提高准确性往往会增加每个事件的计算量和延迟。在这种情况下,线性注意力因其支持并行训练和递归推理而显得尤为吸引人。然而,标准线性注意力为每个事件更新全局状态,导致准确性与效率之间的权衡较差,这对目标检测来说是一个问题,因为目标检测更倾向于细粒度的表示和状态。因此,关键挑战在于引入稀疏状态激活,利用事件稀疏性,同时保持高效的并行训练。我们提出了空间稀疏线性注意力(Spatially-Sparse Linear Attention, SSLA),它引入了一种混合空间(mixture-of-spaces)状态分解和散布-计算-聚合(scatter-compute-gather)的训练流程,从而实现状态级稀疏性和训练并行性。在SSLA的基础上,我们开发了一种端到端的异步线性注意力模型SSLA-Det,用于基于事件的目标检测。在Gen1和N-Caltech101数据集上,SSLA-Det在异步方法中实现了最先进的准确性,分别达到了0.375 mAP和0.515 mAP,同时与最强的先前异步基线相比,减少了20倍以上的每事件计算,展示了线性注意力在低延迟基于事件的视觉中的潜力。
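The recurrent form of linear attention that makes per-event updates possible is standard; the sketch below shows the dense update, with the mixture-of-spaces sparsity noted in comments rather than implemented:

```python
import torch

def linear_attention_step(state, z, q, k, v, phi=torch.nn.functional.elu):
    """Recurrent linear attention for one incoming event.

    state: (D_k, D_v) running sum of phi(k) outer v; z: (D_k,) normaliser.
    The mixture-of-spaces idea would update only the state slice owned
    by the event's spatial region, making this rank-1 update sparse;
    the dense full-state version is shown for clarity."""
    fk = phi(k) + 1                                    # positive feature map
    state = state + torch.outer(fk, v)                 # rank-1 state update
    z = z + fk
    fq = phi(q) + 1
    out = (fq @ state) / (fq @ z + 1e-6)               # attention readout
    return out, state, z
```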
cs.CV / 98 / 2603.06231

TaPD: Temporal-adaptive Progressive Distillation for Observation-Adaptive Trajectory Forecasting in Autonomous Driving

TaPD:用于自主驾驶中观察自适应轨迹预测的时间自适应渐进蒸馏
Fan, Mingyu, Liu, Yi, Zhou, Hao, Qian, Deheng, Khan, Mohammad Haziq, Raetsch, Matthias
Abstract
Trajectory prediction is essential for autonomous driving, enabling vehicles to anticipate the motion of surrounding agents to support safe planning. However, most existing predictors assume fixed-length histories and suffer substantial performance degradation when observations are variable or extremely short in real-world settings (e.g., due to occlusion or a limited sensing range). We propose TaPD (Temporal-adaptive Progressive Distillation), a unified plug-and-play framework for observation-adaptive trajectory forecasting under variable history lengths. TaPD comprises two cooperative modules: an Observation-Adaptive Forecaster (OAF) for future prediction and a Temporal Backfilling Module (TBM) for explicit reconstruction of the past. OAF is built on progressive knowledge distillation (PKD), which transfers motion pattern knowledge from long-horizon "teachers" to short-horizon "students" via hierarchical feature regression, enabling short observations to recover richer motion context. We further introduce a cosine-annealed distillation weighting scheme to balance forecasting supervision and feature alignment, improving optimization stability and cross-length consistency. For extremely short histories where implicit alignment is insufficient, TBM backfills missing historical segments conditioned on scene evolution, producing context-rich trajectories that strengthen PKD and thereby improve OAF. We employ a decoupled pretrain-reconstruct-finetune protocol to preserve real-motion priors while adapting to backfilled inputs. Extensive experiments on Argoverse 1 and Argoverse 2 show that TaPD consistently outperforms strong baselines across all observation lengths, delivers especially large gains under very short inputs, and improves other predictors (e.g., HiVT) in a plug-and-play manner. Code will be available at https://github.com/zhouhao94/TaPD.
Chinese Translation
轨迹预测对自主驾驶至关重要,使车辆能够预测周围智能体的运动,以支持安全规划。然而,大多数现有的预测器假设固定长度的历史数据,并且在实际环境中(例如,由于遮挡或有限的感知范围)观察数据变化或极短时,性能会显著下降。我们提出了TaPD(时间自适应渐进蒸馏),这是一个统一的即插即用框架,用于在可变历史长度下进行观察自适应轨迹预测。TaPD由两个协作模块组成:观察自适应预测器(OAF)用于未来预测,时间回填模块(TBM)用于明确重建过去。OAF基于渐进知识蒸馏(PKD),通过分层特征回归将运动模式知识从长时间“教师”转移到短时间“学生”,使短观察能够恢复更丰富的运动上下文。我们进一步引入了一种余弦退火蒸馏加权方案,以平衡预测监督和特征对齐,提高优化稳定性和跨长度一致性。对于极短的历史数据,隐式对齐不足,TBM根据场景演变回填缺失的历史片段,生成丰富上下文的轨迹,从而增强PKD并改善OAF。我们采用解耦的预训练-重建-微调协议,以保持真实运动先验,同时适应回填输入。在Argoverse 1和Argoverse 2上的大量实验表明,TaPD在所有观察长度上始终优于强基线,在非常短的输入下尤其获得了显著提升,并以即插即用的方式改善了其他预测器(例如,HiVT)。代码将发布在 https://github.com/zhouhao94/TaPD。
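The cosine-annealed distillation weight is a one-liner; the direction of the schedule (alignment strong early, forecasting dominant late) is an assumption consistent with the abstract:

```python
import math

def distill_weight(step, total_steps, w_max=1.0, w_min=0.0):
    """Cosine-annealed weight on the distillation (feature-alignment)
    term: strong alignment early in training, decaying so the
    forecasting loss dominates late."""
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return w_min + (w_max - w_min) * cos

# total_loss = l_forecast + distill_weight(t, T) * l_align
```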
cs.CV / 99 / 2603.06250

Hierarchical Collaborative Fusion for 3D Instance-aware Referring Expression Segmentation

用于3D实例感知指称表达分割的层次协同融合
Zhou, Keshen, Chen, Runnan, Gong, Mingming, Liu, Tongliang
Abstract
Generalised 3D Referring Expression Segmentation (3D-GRES) localizes objects in 3D scenes based on natural language, even when descriptions match multiple or zero targets. Existing methods rely solely on sparse point clouds, lacking rich visual semantics for fine-grained descriptions. We propose HCF-RES, a multi-modal framework with two key innovations. First, Hierarchical Visual Semantic Decomposition leverages SAM instance masks to guide CLIP encoding at dual granularities -- pixel-level and instance-level features -- preserving object boundaries during 2D-to-3D projection. Second, Progressive Multi-level Fusion integrates representations through intra-modal collaboration, cross-modal adaptive weighting between 2D semantic and 3D geometric features, and language-guided refinement. HCF-RES achieves state-of-the-art results on both ScanRefer and Multi3DRefer.
Chinese Translation
广义3D指称表达分割(3D-GRES)基于自然语言定位3D场景中的物体,即使描述匹配多个或零个目标。现有方法仅依赖稀疏点云,缺乏丰富的视觉语义来支持细粒度描述。我们提出了HCF-RES,一个具有两个关键创新的多模态框架。首先,层次视觉语义分解利用SAM实例掩膜引导CLIP编码,在像素级和实例级特征的双重粒度下进行,保持在2D到3D投影过程中的物体边界。其次,渐进式多层融合通过模态内协作、2D语义与3D几何特征之间的跨模态自适应加权以及语言引导的细化来整合表示。HCF-RES在ScanRefer和Multi3DRefer数据集上取得了最先进的结果。
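The dual-granularity encoding can be sketched as dense CLIP features plus masked average pooling under SAM instance masks; tensor shapes are illustrative:

```python
import torch

def dual_granularity_features(pixel_feats, masks):
    """pixel_feats: (C, H, W) dense CLIP features; masks: (M, H, W)
    boolean SAM instance masks.

    Returns the pixel-level features unchanged plus one instance-level
    feature per mask (masked average pooling), so instance boundaries
    from SAM survive the later 2D-to-3D projection."""
    C = pixel_feats.shape[0]
    flat = pixel_feats.reshape(C, -1)                      # (C, H*W)
    m = masks.reshape(masks.shape[0], -1).float()          # (M, H*W)
    inst = (m @ flat.t()) / m.sum(dim=1, keepdim=True).clamp(min=1)
    return pixel_feats, inst                               # (C,H,W), (M,C)
```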
cs.CV / 100 / 2603.06254

NOVA: Next-step Open-Vocabulary Autoregression for 3D Multi-Object Tracking in Autonomous Driving

NOVA:用于自主驾驶中3D多目标跟踪的下一步开放词汇自回归
Luo, Kai, Wang, Xu, Fan, Rui, Yang, Kailun
Abstract
Generalizing across unknown targets is critical for open-world perception, yet existing 3D Multi-Object Tracking (3D MOT) pipelines remain limited by closed-set assumptions and "semantic-blind" heuristics. To address this, we propose Next-step Open-Vocabulary Autoregression (NOVA), an innovative paradigm that shifts 3D tracking from traditional fragmented distance-based matching toward generative spatio-temporal semantic modeling. NOVA reformulates 3D trajectories as structured spatio-temporal semantic sequences, enabling the simultaneous encoding of physical motion continuity and deep linguistic priors. By leveraging the autoregressive capabilities of Large Language Models (LLMs), we transform the tracking task into a principled process of next-step sequence completion. This mechanism allows the model to explicitly utilize the hierarchical structure of language space to resolve fine-grained semantic ambiguities and maintain identity consistency across complex long-range sequences through high-level commonsense reasoning. Extensive experiments on nuScenes, V2X-Seq-SPD, and KITTI demonstrate the superior performance of NOVA. Notably, on the nuScenes dataset, NOVA achieves an AMOTA of 22.41% for Novel categories, yielding a significant 20.21% absolute improvement over the baseline. These gains are realized through a compact 0.5B autoregressive model. Code will be available at https://github.com/xifen523/NOVA.
Chinese Translation
在开放世界感知中,跨未知目标的泛化至关重要,然而现有的3D多目标跟踪(3D MOT)流程仍受到封闭集假设和“语义盲”启发式方法的限制。为了解决这个问题,我们提出了下一步开放词汇自回归(NOVA),这一创新范式将3D跟踪从传统的碎片化距离匹配转向生成的时空语义建模。NOVA将3D轨迹重新表述为结构化的时空语义序列,使得物理运动连续性和深层语言先验的同时编码成为可能。通过利用大型语言模型(LLMs)的自回归能力,我们将跟踪任务转化为一个有原则的下一步序列补全过程。这一机制使模型能够明确利用语言空间的层次结构,以解决细粒度的语义模糊性,并通过高层次的常识推理在复杂的长距离序列中保持身份一致性。在nuScenes、V2X-Seq-SPD和KITTI上的大量实验表明NOVA的优越性能。值得注意的是,在nuScenes数据集上,NOVA在新类别上达到了22.41%的AMOTA,相较于基线实现了显著的20.21%的绝对提升。这些增益是通过一个紧凑的0.5B自回归模型实现的。代码将发布在https://github.com/xifen523/NOVA。
cs.CV / 101 / 2603.06256

GazeMoE: Perception of Gaze Target with Mixture-of-Experts

GazeMoE:基于专家混合模型的注视目标感知
Dai, Zhuangzhuang, Lu, Zhongxi, Zakka, Vincent G., Manso, Luis J., Calero, Jose M Alcaraz, Li, Chen
Abstract
Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues -- including eyes, head poses, gestures, and contextual features -- demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that our GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at https://huggingface.co/zdai257/GazeMoE
Chinese Translation
从可见图像中估计人类注视目标是机器人理解人类注意力的关键任务,但开发可泛化的神经架构和训练范式仍然具有挑战性。尽管近期在预训练视觉基础模型方面的进展为定位注视目标提供了有希望的途径,但多模态线索的整合——包括眼睛、头部姿态、手势和上下文特征——需要自适应和高效的解码机制。受大型视觉-语言模型中专家混合模型(Mixture-of-Experts, MoE)自适应领域专业知识的启发,我们提出了GazeMoE,这是一种新颖的端到端框架,通过MoE模块选择性地利用来自冻结基础模型的注视目标相关线索。为了应对注视目标分类中的类别不平衡(框内与框外)并增强鲁棒性,GazeMoE结合了类别平衡辅助损失以及包括区域特定裁剪和光度变换在内的战略数据增强。在基准数据集上进行的大量实验表明,我们的GazeMoE在具有挑战性的注视估计任务中实现了最先进的性能,超越了现有方法。代码和预训练模型已发布在 https://huggingface.co/zdai257/GazeMoE
cs.CV / 102 / 2603.06265

ODD-SEC: Onboard Drone Detection with a Spinning Event Camera

ODD-SEC:基于旋转事件相机的机载无人机检测
Dai, Kuan, Zhang, Hongxin, Zhong, Sheng, Zhou, Yi
Abstract
The rapid proliferation of drones requires balancing innovation with regulation. To address security and privacy concerns, techniques for drone detection have attracted significant attention. Passive solutions, such as frame camera-based systems, offer versatility and energy efficiency under typical conditions but are fundamentally constrained by their operational principles in scenarios involving fast-moving targets or adverse illumination. Inspired by biological vision, event cameras asynchronously detect per-pixel brightness changes, offering high dynamic range and microsecond-level responsiveness that make them uniquely suited for drone detection in conditions beyond the reach of conventional frame-based cameras. However, the design of most existing event-based solutions assumes a static camera, greatly limiting their applicability to moving carriers--such as quadrupedal robots or unmanned ground vehicles--during field operations. In this paper, we introduce a real-time drone detection system designed for deployment on moving carriers. The system utilizes a spinning event-based camera, providing a 360° horizontal field of view and enabling bearing estimation of detected drones. A key contribution is a novel image-like event representation that operates without motion compensation, coupled with a lightweight neural network architecture for efficient spatiotemporal learning. Implemented on an onboard Jetson Orin NX, the system can operate in real time. Outdoor experimental results validate reliable detection with a mean angular error below 2° under challenging conditions, underscoring its suitability for real-world surveillance applications. We will open-source our complete pipeline to support future research.
Chinese Translation
无人机的快速普及需要在创新与监管之间取得平衡。为了解决安全和隐私问题,无人机检测技术引起了广泛关注。基于帧相机的被动解决方案在典型条件下提供了多功能性和能效,但在涉及快速移动目标或不利光照的场景中,其操作原理根本上限制了其应用。受到生物视觉的启发,事件相机异步检测每个像素的亮度变化,提供高动态范围和微秒级响应能力,使其在超出传统帧相机能力的条件下特别适合无人机检测。然而,大多数现有基于事件的解决方案的设计假设相机是静态的,这在现场操作中大大限制了其在移动载体(如四足机器人或无人地面车辆)上的适用性。本文介绍了一种为移动载体部署而设计的实时无人机检测系统。该系统利用旋转事件相机,提供360度的水平视场,并能够估计检测到的无人机的方位。一个关键贡献是提出了一种新颖的类图像事件表示,该表示在没有运动补偿的情况下运行,并结合了一种轻量级神经网络架构以实现高效的时空学习。该系统在机载Jetson Orin NX上实现,能够实时运行。户外实验结果验证了在挑战性条件下可靠检测的有效性,平均角度误差低于2度,强调了其在现实世界监控应用中的适用性。我们将开源我们的完整流程,以支持未来的研究。
cs.CV / 103 / 2603.06270

HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models

HiPP-Prune:面向视觉-语言模型的层次化偏好条件结构化剪枝
Bai, Lincen, Tabia, Hedi, Santos-Rodriguez, Raul
Abstract
Pruning vision-language models (VLMs) for efficient deployment is challenging because compression can affect not only task utility but also visual grounding, often amplifying object hallucinations even at the same sparsity level. We present HiPP-Prune, a hierarchical preference-conditioned structured pruning framework that treats pruning as conditional resource allocation under multiple objectives. HiPP-Prune makes plan-level decisions: a single policy invocation outputs a global pruning blueprint by factorizing decisions into an overall sparsity budget and a layer-wise allocation, enabling queryable trade-offs via a user-specified preference vector. To account for VLM-specific failure modes, our policy state integrates a visual sensitivity signal derived from attention flow between vision tokens and language hidden states, discouraging over-pruning of vision-critical layers that facilitate cross-modal fusion. We optimize pruning plans with plan-level Group Relative Policy Optimization (GRPO) under a multi-objective return that combines task utility, hallucination robustness (POPE), compression, and a synaptic-flow-inspired stability proxy to reduce unproductive exploration in high-sparsity regimes. Experiments on LLaVA with POPE and ScienceQA demonstrate that HiPP-Prune discovers diverse non-dominated pruning plans and provides controllable robustness--utility trade-offs under matched sparsity budgets.
Chinese Translation
对视觉-语言模型(VLMs)进行剪枝以实现高效部署是一项具有挑战性的任务,因为压缩不仅会影响任务效用,还可能影响视觉基础,通常在相同稀疏度水平下放大对象幻觉。我们提出了HiPP-Prune,这是一种层次化偏好条件结构化剪枝框架,将剪枝视为在多个目标下的条件资源分配。HiPP-Prune进行计划级决策:单次策略调用输出一个全局剪枝蓝图,通过将决策分解为整体稀疏预算和层级分配,从而实现通过用户指定的偏好向量进行可查询的权衡。为了考虑VLM特有的失败模式,我们的策略状态整合了源自视觉标记与语言隐藏状态之间注意力流的视觉敏感性信号,抑制对促进跨模态融合的视觉关键层的过度剪枝。我们在多目标回报下优化剪枝计划,采用计划级的群体相对策略优化(GRPO),该回报结合了任务效用、幻觉鲁棒性(POPE)、压缩和受突触流启发的稳定性代理,以减少高稀疏度环境下无效探索。在LLaVA与POPE和ScienceQA上的实验表明,HiPP-Prune发现了多样化的非支配剪枝计划,并在匹配的稀疏预算下提供可控的鲁棒性-效用权衡。
cs.CV / 104 / 2603.06275

Spectral and Trajectory Regularization for Diffusion Transformer Super-Resolution

扩散变换器超分辨率的频谱与轨迹正则化
Wang, Jingkai, Tang, Yixin, Gong, Jue, Li, Jiatong, Li, Shu, Liu, Libo, Lan, Jianliang, Liu, Yutong, Zhang, Yulun
Abstract
Diffusion transformer (DiT) architectures show great potential for real-world image super-resolution (Real-ISR). However, their computationally expensive iterative sampling necessitates one-step distillation. Existing one-step distillation methods struggle with Real-ISR on DiT. They suffer from fundamental trajectory mismatch and generate severe grid-like periodic artifacts. To tackle these challenges, we propose StrSR, a novel one-step adversarial distillation framework featuring spectral and trajectory regularization. Specifically, we propose an asymmetric discriminative distillation architecture to bridge the trajectory gap. Additionally, we design a frequency distribution matching strategy to effectively suppress DiT-specific periodic artifacts caused by high-frequency spectral leakage. Extensive experiments demonstrate that StrSR achieves state-of-the-art performance in Real-ISR, across both quantitative metrics and visual perception. The code and models will be released at https://github.com/jkwang28/StrSR .
Chinese Translation
扩散变换器(Diffusion Transformer, DiT)架构在实际图像超分辨率(Real-ISR)方面展现出巨大的潜力。然而,它们计算成本高昂的迭代采样需要一步蒸馏。现有的一步蒸馏方法在DiT的Real-ISR任务中表现不佳,主要受到基本轨迹不匹配的影响,并产生严重的网格状周期性伪影。为了解决这些挑战,我们提出了StrSR,一种新颖的一步对抗蒸馏框架,具有频谱和轨迹正则化的特点。具体而言,我们提出了一种不对称的判别蒸馏架构,以弥合轨迹差距。此外,我们设计了一种频率分布匹配策略,有效抑制由高频频谱泄漏引起的DiT特有的周期性伪影。大量实验表明,StrSR在Real-ISR任务中在定量指标和视觉感知方面均实现了最先进的性能。代码和模型将发布在 https://github.com/jkwang28/StrSR 。
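One concrete way to realise frequency distribution matching (an assumed form, not necessarily the paper's) is an L1 loss between azimuthally averaged log power spectra, which directly targets high-frequency leakage:

```python
import torch
import torch.nn.functional as F

def spectral_matching_loss(sr, hr):
    """Match azimuthally averaged log power spectra of the one-step
    student output `sr` and a reference `hr`, both (B, C, H, W); this
    penalises the high-frequency leakage behind grid-like artifacts."""
    def radial_log_power(x):
        f = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
        power = (f.abs() ** 2).mean(dim=(0, 1))              # (H, W)
        H, W = power.shape
        yy, xx = torch.meshgrid(torch.arange(H, device=x.device),
                                torch.arange(W, device=x.device),
                                indexing="ij")
        r = (((yy - H / 2) ** 2 + (xx - W / 2) ** 2) ** 0.5).long().flatten()
        n_bins = int(r.max()) + 1
        prof = torch.zeros(n_bins, device=x.device).index_add_(
            0, r, power.flatten())                           # sum per radius
        cnt = torch.zeros(n_bins, device=x.device).index_add_(
            0, r, torch.ones_like(power.flatten()))          # count per radius
        return (prof / cnt.clamp(min=1)).clamp(min=1e-12).log()
    return F.l1_loss(radial_log_power(sr), radial_log_power(hr))
```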
cs.CV / 105 / 2603.06279

Can we Trust Unreliable Voxels? Exploring 3D Semantic Occupancy Prediction under Label Noise

我们能信任不可靠的体素吗?在标签噪声下探索3D语义占用预测
Li, Wenxin, Peng, Kunyu, Wen, Di, Zheng, Junwei, Wei, Jiale, Duan, Mengfei, Zhang, Yuheng, Fan, Rui, Yang, Kailun
Abstract
3D semantic occupancy prediction is a cornerstone of robotic perception, yet real-world voxel annotations are inherently corrupted by structural artifacts and dynamic trailing effects. This raises a critical but underexplored question: can autonomous systems safely rely on such unreliable occupancy supervision? To systematically investigate this issue, we establish OccNL, the first benchmark dedicated to 3D occupancy under occupancy-asymmetric and dynamic trailing noise. Our analysis reveals a fundamental domain gap: state-of-the-art 2D label noise learning strategies collapse catastrophically in sparse 3D voxel spaces, exposing a critical vulnerability in existing paradigms. To address this challenge, we propose DPR-Occ, a principled label noise-robust framework that constructs reliable supervision through dual-source partial label reasoning. By synergizing temporal model memory with representation-level structural affinity, DPR-Occ dynamically expands and prunes candidate label sets to preserve true semantics while suppressing noise propagation. Extensive experiments on SemanticKITTI demonstrate that DPR-Occ prevents geometric and semantic collapse under extreme corruption. Notably, even at 90% label noise, our method achieves significant performance gains (up to 2.57% mIoU and 13.91% IoU) over existing label noise learning baselines adapted to the 3D occupancy prediction task. By bridging label noise learning and 3D perception, OccNL and DPR-Occ provide a reliable foundation for safety-critical robotic perception in dynamic environments. The benchmark and source code will be made publicly available at https://github.com/mylwx/OccNL.
Chinese Translation
3D语义占用预测是机器人感知的基石,然而现实世界中的体素标注本质上受到结构性伪影和动态拖尾效应的影响。这引发了一个关键但未被充分探讨的问题:自主系统是否可以安全地依赖这种不可靠的占用监督?为了系统地研究这个问题,我们建立了OccNL,这是第一个专门针对占用不对称和动态拖尾噪声下的3D占用的基准。我们的分析揭示了一个基本的领域差距:最先进的2D标签噪声学习策略在稀疏的3D体素空间中崩溃,暴露了现有范式中的关键脆弱性。为了解决这一挑战,我们提出了DPR-Occ,一个原则性的标签噪声鲁棒框架,通过双源部分标签推理构建可靠的监督。通过将时间模型记忆与表示层级的结构亲和性相结合,DPR-Occ动态扩展和修剪候选标签集,以保留真实语义,同时抑制噪声传播。在SemanticKITTI上的广泛实验表明,DPR-Occ在极端标签污染下防止了几何和语义的崩溃。值得注意的是,即使在90%的标签噪声下,我们的方法在3D占用预测任务中相较于现有的标签噪声学习基线也取得了显著的性能提升(mIoU提高了2.57%,IoU提高了13.91%)。通过桥接标签噪声学习和3D感知,OccNL和DPR-Occ为动态环境中的安全关键机器人感知提供了可靠的基础。基准和源代码将在https://github.com/mylwx/OccNL上公开发布。
cs.CV / 106 / 2603.06281

Attribute Distribution Modeling and Semantic-Visual Alignment for Generative Zero-shot Learning

属性分布建模与语义-视觉对齐用于生成性零样本学习
Pu, Haojie, Li, Zhuoming, Gao, Yongbiao, Jia, Yuheng
Abstract
Generative zero-shot learning (ZSL) synthesizes features for unseen classes, leveraging semantic conditions to transfer knowledge from seen classes. However, it also introduces two intrinsic challenges: (1) class-level attributes fail to capture instance-specific visual appearances due to substantial intra-class variability, thus causing the class-instance gap; (2) the substantial mismatch between semantic and visual feature distributions, manifested in inter-class correlations, gives rise to the semantic-visual domain gap. To address these challenges, we propose an Attribute Distribution Modeling and Semantic-Visual Alignment (ADiVA) approach, jointly modeling attribute distributions and performing explicit semantic-visual alignment. Specifically, our ADiVA consists of two modules: an Attribute Distribution Modeling (ADM) module that learns a transferable attribute distribution for each class and samples instance-level attributes for unseen classes, and a Visual-Guided Alignment (VGA) module that refines semantic representations to better reflect visual structures. Experiments on three widely used benchmark datasets demonstrate that ADiVA significantly outperforms state-of-the-art methods (e.g., achieving gains of 4.7% and 6.1% on AWA2 and SUN, respectively). Moreover, our approach can serve as a plugin to enhance existing generative ZSL methods.
Chinese Translation
生成性零样本学习(ZSL)通过合成未见类别的特征,利用语义条件从已见类别转移知识。然而,这也带来了两个内在挑战:(1)类别级属性由于显著的类内变异性未能捕捉实例特定的视觉外观,从而导致类别与实例之间的差距;(2)语义与视觉特征分布之间的显著不匹配,表现为类间相关性,导致语义-视觉领域差距。为了解决这些挑战,我们提出了一种属性分布建模与语义-视觉对齐(ADiVA)的方法,联合建模属性分布并执行显式的语义-视觉对齐。具体而言,我们的ADiVA由两个模块组成:一个属性分布建模(ADM)模块,学习每个类别的可转移属性分布并为未见类别采样实例级属性;一个视觉引导对齐(VGA)模块,细化语义表示以更好地反映视觉结构。在三个广泛使用的基准数据集上的实验表明,ADiVA显著优于最先进的方法(例如,在AWA2和SUN上分别实现了4.7%和6.1%的提升)。此外,我们的方法可以作为插件来增强现有的生成性ZSL方法。
cs.CV / 107 / 2603.06289

FlowMotion: Training-Free Flow Guidance for Video Motion Transfer

FlowMotion:无训练流引导的视频运动转移
Wang, Zhen, Xu, Youcan, Xiao, Jun, Chen, Long
Abstract
Video motion transfer aims to generate a target video that inherits motion patterns from a source video while rendering new scenes. Existing training-free approaches focus on constructing motion guidance based on the intermediate outputs of pre-trained T2V models, which results in heavy computational overhead and limited flexibility. In this paper, we present FlowMotion, a novel training-free framework that enables efficient and flexible motion transfer by directly leveraging the predicted outputs of flow-based T2V models. Our key insight is that early latent predictions inherently encode rich temporal information. Motivated by this, we propose flow guidance, which extracts motion representations based on latent predictions to align motion patterns between source and generated videos. We further introduce a velocity regularization strategy to stabilize optimization and ensure smooth motion evolution. By operating purely on model predictions, FlowMotion achieves superior time and resource efficiency as well as competitive performance compared with state-of-the-art methods.
Chinese Translation
视频运动转移旨在生成一个目标视频,该视频继承源视频的运动模式,同时渲染新的场景。现有的无训练方法集中于基于预训练的T2V模型的中间输出构建运动引导,这导致了较大的计算开销和有限的灵活性。本文提出了FlowMotion,一种新颖的无训练框架,通过直接利用基于流的T2V模型的预测输出,实现高效且灵活的运动转移。我们的关键见解是,早期的潜在预测本质上编码了丰富的时间信息。基于此,我们提出了流引导(flow guidance),该方法基于潜在预测提取运动表示,以对齐源视频和生成视频之间的运动模式。我们进一步引入了一种速度正则化策略,以稳定优化并确保平滑的运动演变。通过纯粹依赖模型预测,FlowMotion在时间和资源效率方面实现了优越的表现,并且与最先进的方法相比,性能也具有竞争力。
cs.CV / 108 / 2603.06300

3D CBCT Artefact Removal Using Perpendicular Score-Based Diffusion Models

基于相互垂直的分数扩散模型的3D CBCT伪影去除
Schaub, Susanne, Bieder, Florentin, Oliveira, Matheus L., Wang, Yulan, Dagassan-Berndt, Dorothea, Bornstein, Michael M., Cattin, Philippe C.
Abstract
Cone-beam computed tomography (CBCT) is a widely used 3D imaging technique in dentistry, offering high-resolution images while minimising radiation exposure for patients. However, CBCT is highly susceptible to artefacts arising from high-density objects such as dental implants, which can compromise image quality and diagnostic accuracy. To reduce artefacts, implant inpainting in the sequence of projections plays a crucial role in many artefact reduction approaches. Recently, diffusion models have achieved state-of-the-art results in image generation and have widely been applied to image inpainting tasks. However, to our knowledge, existing diffusion-based methods for implant inpainting operate on independent 2D projections. This approach neglects the correlations among individual projections, resulting in inconsistencies in the reconstructed images. To address this, we propose a 3D dental implant inpainting approach based on perpendicular score-based diffusion models, each trained in two different planes and operating in the projection domain. The 3D distribution of the projection series is modelled by combining the two 2D score-based diffusion models in the sampling scheme. Our results demonstrate the method's effectiveness in producing high-quality, artefact-reduced 3D CBCT images, making it a promising solution for improving clinical imaging.
Chinese Translation
锥束计算机断层扫描(CBCT)是一种广泛应用于牙科的3D成像技术,能够提供高分辨率图像,同时最小化患者的辐射暴露。然而,CBCT对高密度物体(如牙科植体)引起的伪影非常敏感,这可能会影响图像质量和诊断准确性。为了减少伪影,植体填补在投影序列中的作用在许多伪影减少方法中至关重要。最近,扩散模型在图像生成方面取得了最先进的成果,并广泛应用于图像填补任务。然而,据我们所知,现有的基于扩散的植体填补方法仅在独立的2D投影上进行。这种方法忽视了各个投影之间的相关性,导致重建图像的不一致性。为了解决这个问题,我们提出了一种基于相互垂直的分数扩散模型的3D牙科植体填补方法,两个模型分别在不同的平面上训练,并在投影域中操作。通过在采样方案中结合这两个2D分数扩散模型,对投影序列的3D分布进行建模。我们的结果证明了该方法在生成高质量、减少伪影的3D CBCT图像方面的有效性,使其成为改善临床成像的有前景的解决方案。
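Combining the two plane-wise models at sampling time can be sketched as slice-wise scoring along perpendicular axes with equal averaging; the 0.5/0.5 mixing is an assumption, and `score_xy`/`score_xz` stand in for the two trained 2D diffusion models:

```python
import torch

def combined_score(vol, t, score_xy, score_xz):
    """Approximate a 3D score by averaging two 2D score models applied
    slice-wise along perpendicular planes of the projection volume.

    vol: (D, H, W); score_xy scores (H, W) slices stacked along D;
    score_xz scores (D, W) slices stacked along H."""
    s1 = torch.stack([score_xy(vol[d], t) for d in range(vol.shape[0])])
    s2 = torch.stack([score_xz(vol[:, h], t) for h in range(vol.shape[1])],
                     dim=1)
    return 0.5 * (s1 + s2)   # both have shape (D, H, W)
```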
cs.CV / 109 / 2603.06302

DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models

DEX-AR:一种用于自回归视觉-语言模型的动态可解释性方法
Bousselham, Walid, Boggust, Angie, Strobelt, Hendrik, Kuehne, Hilde
Abstract
As Vision-Language Models (VLMs) become increasingly sophisticated and widely used, it becomes more and more crucial to understand their decision-making process. Traditional explainability methods, designed for classification tasks, struggle with modern autoregressive VLMs due to their complex token-by-token generation process and intricate interactions between visual and textual modalities. We present DEX-AR (Dynamic Explainability for AutoRegressive models), a novel explainability method designed to address these challenges by generating both per-token and sequence-level 2D heatmaps highlighting image regions crucial for the model's textual responses. The proposed method interprets autoregressive VLMs, including the varying importance of layers and generated tokens, by computing layer-wise gradients with respect to attention maps during the token-by-token generation process. DEX-AR introduces two key innovations: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering approach that aggregates per-token explanations while distinguishing between visually-grounded and purely linguistic tokens. Our evaluation on ImageNet, VQAv2, and PascalVOC, shows a consistent improvement in both perturbation-based metrics, using a novel normalized perplexity measure, as well as segmentation-based metrics.
Chinese Translation
随着视觉-语言模型(VLMs)变得越来越复杂和广泛应用,理解它们的决策过程变得愈加重要。传统的可解释性方法主要针对分类任务,面对现代自回归VLMs时,由于其复杂的逐词生成过程以及视觉与文本模态之间的复杂交互,显得力不从心。我们提出了DEX-AR(自回归模型的动态可解释性),这是一种新颖的可解释性方法,旨在通过生成每个标记和序列级别的二维热图,突出对模型文本响应至关重要的图像区域,从而应对这些挑战。该方法在逐词生成过程中计算相对于注意力图的层级梯度,从而解释自回归VLMs,包括各层和各生成标记的不同重要性。DEX-AR引入了两个关键创新:一种动态头部过滤机制,用于识别专注于视觉信息的注意力头,以及一种序列级过滤方法,用于聚合每个标记的解释,同时区分基于视觉的标记和纯语言标记。我们在ImageNet、VQAv2和PascalVOC上的评估显示,在基于扰动的度量(使用一种新颖的归一化困惑度测量)以及基于分割的度量上均有一致的改善。
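The gradient-times-attention recipe can be sketched for a single generated token; `retain_grad` bookkeeping on the attention maps is assumed to happen in the forward pass, and head filtering is simplified to a plain head average:

```python
import torch

def relevance_from_attention(attn_maps, token_logit):
    """Relevance for one generated token: back-propagate its logit to
    every layer's attention map and accumulate gradient x attention,
    keeping only positive contributions.

    attn_maps: list of (heads, patches) tensors, one per layer, each
    registered with .retain_grad() during the forward pass."""
    token_logit.backward(retain_graph=True)
    heat = 0.0
    for a in attn_maps:
        contrib = (a.grad * a).clamp(min=0)   # gradient x attention
        heat = heat + contrib.mean(dim=0)     # average heads -> (patches,)
        a.grad = None                         # reset before the next token
    return heat                               # reshape to a 2D grid downstream
```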
cs.CV / 110 / 2603.06311

Latent Transfer Attack: Adversarial Examples via Generative Latent Spaces

潜在转移攻击:通过生成潜在空间生成对抗样本
Shaar, Eitan, Shaulov, Ariel, Tur, Yalcin, Chechik, Gal, Shwartz-Ziv, Ravid
Abstract
Adversarial attacks are a central tool for probing the robustness of modern vision models, yet most methods optimize perturbations directly in pixel space under $\ell_\infty$ or $\ell_2$ constraints. While effective in white-box settings, pixel-space optimization often produces high-frequency, texture-like noise that is brittle to common preprocessing (e.g., resizing and cropping) and transfers poorly across architectures. We propose $\textbf{LTA}$ ($\textbf{L}$atent $\textbf{T}$ransfer $\textbf{A}$ttack), a transfer-based attack that instead optimizes perturbations in the latent space of a pretrained Stable Diffusion VAE. Given a clean image, we encode it into a latent code and optimize the latent representation to maximize a surrogate classifier loss, while softly enforcing a pixel-space $\ell_\infty$ budget after decoding. To improve robustness to resolution mismatch and standard input pipelines, we incorporate Expectation Over Transformations (EOT) via randomized resizing, interpolation, and cropping, and apply periodic latent Gaussian smoothing to suppress emerging artifacts and stabilize optimization. Across a suite of CNN and vision-transformer targets, LTA achieves strong transfer attack success while producing spatially coherent, predominantly low-frequency perturbations that differ qualitatively from pixel-space baselines and occupy a distinct point in the transfer-quality trade-off. Our results highlight pretrained generative latent spaces as an effective and structured domain for adversarial optimization, bridging robustness evaluation with modern generative priors.
Chinese Translation
对抗攻击是检验现代视觉模型鲁棒性的核心工具,但大多数方法是在 $\ell_\infty$ 或 $\ell_2$ 约束下直接在像素空间中优化扰动。虽然在白盒设置中有效,像素空间优化通常会产生高频、纹理状的噪声,这种噪声对常见的预处理(例如,缩放和裁剪)非常脆弱,并且在不同架构之间转移效果不佳。我们提出了 $\textbf{LTA}$($\textbf{L}$atent $\textbf{T}$ransfer $\textbf{A}$ttack),这是一种基于转移的攻击方法,它在预训练的稳定扩散变分自编码器(Stable Diffusion VAE)的潜在空间中优化扰动。给定一张干净的图像,我们将其编码为潜在代码,并优化潜在表示以最大化一个替代分类器的损失,同时在解码后软性地强制执行像素空间的 $\ell_\infty$ 预算。为了提高对分辨率不匹配和标准输入管道的鲁棒性,我们通过随机缩放、插值和裁剪引入了变换期望(Expectation Over Transformations, EOT),并应用周期性的潜在高斯平滑以抑制新出现的伪影并稳定优化。在一系列卷积神经网络(CNN)和视觉变换器(vision transformer)目标上,LTA 实现了强大的转移攻击成功率,同时产生空间上连贯、主要为低频的扰动,这些扰动在质上与像素空间基线不同,并在转移质量权衡中占据了一个独特的位置。我们的结果突显了预训练生成潜在空间作为对抗优化的有效且结构化的领域,架起了鲁棒性评估与现代生成先验之间的桥梁。
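A schematic of the attack loop as the abstract describes it: optimize the VAE latent against a surrogate classifier, apply EOT over random resizing, and softly penalize pixel-space overshoot after decoding. This is a sketch under stated assumptions, not the released implementation; vae is assumed to behave like diffusers' AutoencoderKL, and all hyperparameters are illustrative.

import torch
import torch.nn.functional as F

def latent_attack(x, y, vae, surrogate, steps=100, lr=0.01, eps=8 / 255, eot=4):
    # Optimize the latent code of a clean image x so the decoded image
    # fools the surrogate classifier on label y (untargeted attack).
    z = vae.encode(x).latent_dist.mean.detach().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        x_adv = vae.decode(z).sample
        loss = 0.0
        for _ in range(eot):  # Expectation Over Transformations
            s = torch.randint(192, 256, (1,)).item()
            xt = F.interpolate(x_adv, size=(s, s), mode="bilinear")
            xt = F.interpolate(xt, size=x.shape[-2:], mode="bilinear")
            loss = loss - F.cross_entropy(surrogate(xt), y)
        # Soft pixel-space L_inf budget, enforced as a penalty after decoding.
        overshoot = ((x_adv - x).abs() - eps).clamp(min=0)
        loss = loss + 10.0 * overshoot.mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        # (The paper's periodic latent Gaussian smoothing is omitted here.)
    return vae.decode(z).sample.detach()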
cs.CV / 111 / 2603.06313

WMoE-CLIP: Wavelet-Enhanced Mixture-of-Experts Prompt Learning for Zero-Shot Anomaly Detection

WMoE-CLIP:基于小波增强的专家混合提示学习用于零样本异常检测
Chen, Peng, Huang, Chao
Abstract
Vision-language models have recently shown strong generalization in zero-shot anomaly detection (ZSAD), enabling the detection of unseen anomalies without task-specific supervision. However, existing approaches typically rely on fixed textual prompts, which struggle to capture complex semantics, and focus solely on spatial-domain features, limiting their ability to detect subtle anomalies. To address these challenges, we propose a wavelet-enhanced mixture-of-experts prompt learning method for ZSAD. Specifically, a variational autoencoder is employed to model global semantic representations and integrate them into prompts to enhance adaptability to diverse anomaly patterns. Wavelet decomposition extracts multi-frequency image features that dynamically refine textual embeddings through cross-modal interactions. Furthermore, a semantic-aware mixture-of-experts module is introduced to aggregate contextual information. Extensive experiments on 14 industrial and medical datasets demonstrate the effectiveness of the proposed method.
Chinese Translation
视觉-语言模型最近在零样本异常检测(ZSAD)中表现出强大的泛化能力,能够在没有特定任务监督的情况下检测未见过的异常。然而,现有的方法通常依赖于固定的文本提示,这使得它们难以捕捉复杂的语义,并且仅关注空间域特征,限制了它们检测细微异常的能力。为了解决这些挑战,我们提出了一种基于小波增强的专家混合提示学习方法用于ZSAD。具体而言,采用变分自编码器建模全局语义表示,并将其整合到提示中,以增强对多样异常模式的适应性。小波分解提取多频率图像特征,通过跨模态交互动态地优化文本嵌入。此外,引入了一个语义感知的专家混合模块以聚合上下文信息。在14个工业和医疗数据集上的大量实验表明了所提方法的有效性。
cs.CV / 112 / 2603.06321

P-SLCR: Unsupervised Point Cloud Semantic Segmentation via Prototypes Structure Learning and Consistent Reasoning

P-SLCR:通过原型结构学习和一致性推理实现无监督点云语义分割
Zhan, Lixin, Jiang, Jie, Zhou, Tianjian, Du, Yukun, Zheng, Yan, Duan, Xuehu
Abstract
Current semantic segmentation approaches for point cloud scenes heavily rely on manual labeling, while research on unsupervised semantic segmentation methods specifically for raw point clouds is still in its early stages. Unsupervised point cloud learning poses significant challenges due to the absence of annotation information and the lack of pre-training. The development of effective strategies is crucial in this context. In this paper, we propose a novel prototype library-driven unsupervised point cloud semantic segmentation strategy that utilizes Structure Learning and Consistent Reasoning (P-SLCR). First, we propose a Consistent Structure Learning to establish structural feature learning between consistent points and the library of consistent prototypes by selecting high-quality features. Second, we propose a Semantic Relation Consistent Reasoning that constructs a prototype inter-relation matrix between consistent and ambiguous prototype libraries separately. This process ensures the preservation of semantic consistency by imposing constraints on consistent and ambiguous prototype libraries through the prototype inter-relation matrix. Finally, our method was extensively evaluated on the S3DIS, SemanticKITTI, and Scannet datasets, achieving the best performance compared to unsupervised methods. Specifically, the mIoU of 47.1% is achieved for Area-5 of the S3DIS dataset, surpassing the classical fully supervised method PointNet by 2.5%.
Chinese Translation
当前针对点云场景的语义分割方法严重依赖手动标注,而专门针对原始点云的无监督语义分割方法的研究仍处于初级阶段。无监督点云学习面临着由于缺乏标注信息和缺乏预训练而带来的重大挑战。在此背景下,开发有效的策略至关重要。本文提出了一种新颖的基于原型库的无监督点云语义分割策略,利用结构学习和一致性推理(P-SLCR)。首先,我们提出了一种一致性结构学习方法,通过选择高质量特征,在一致点与一致原型库之间建立结构特征学习。其次,我们提出了一种语义关系一致性推理方法,分别构建一致原型库和模糊原型库之间的原型互关系矩阵。该过程通过对一致和模糊原型库施加约束,确保语义一致性的保持。最后,我们的方法在S3DIS、SemanticKITTI和Scannet数据集上进行了广泛评估,与无监督方法相比,取得了最佳性能。具体而言,在S3DIS数据集的Area-5上实现了47.1%的mIoU,超越了经典的完全监督方法PointNet 2.5%。
cs.CV / 113 / 2603.06331

WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching

WorldCache:通过异构令牌缓存加速世界模型
Feng, Weilun, Fan, Guoxin, Qin, Haotong, Yang, Chuanguang, Wu, Mingqiang, Li, Yuqi, Li, Xiangqi, An, Zhulin, Huang, Libo, Wang, Dingrui, Liao, Longlong, Magno, Michele, Xu, Yongjun
Abstract
Diffusion-based world models have shown strong potential for unified world simulation, but the iterative denoising remains too costly for interactive use and long-horizon rollouts. While feature caching can accelerate inference without training, we find that policies designed for single-modal diffusion transfer poorly to world models due to two world-model-specific obstacles: \emph{token heterogeneity} from multi-modal coupling and spatial variation, and \emph{non-uniform temporal dynamics} where a small set of hard tokens drives error growth, making uniform skipping either unstable or overly conservative. We propose \textbf{WorldCache}, a caching framework tailored to diffusion world models. We introduce \textit{Curvature-guided Heterogeneous Token Prediction}, which uses a physics-grounded curvature score to estimate token predictability and applies a Hermite-guided damped predictor for chaotic tokens with abrupt direction changes. We also design \textit{Chaotic-prioritized Adaptive Skipping}, which accumulates a curvature-normalized, dimensionless drift signal and recomputes only when bottleneck tokens begin to drift. Experiments on diffusion world models show that WorldCache delivers up to \textbf{3.7$\times$} end-to-end speedups while maintaining \textbf{98\%} rollout quality, demonstrating the vast advantages and practicality of WorldCache in resource-constrained scenarios. Our code is released in https://github.com/FofGofx/WorldCache.
Chinese Translation
基于扩散的世界模型在统一世界仿真中展现出强大的潜力,但迭代去噪对于交互式使用和长时间范围的回放仍然成本过高。虽然特征缓存可以在不训练的情况下加速推理,但我们发现为单一模态扩散设计的策略在世界模型中转移效果不佳,主要由于两个特定于世界模型的障碍:来自多模态耦合和空间变化的\emph{令牌异质性},以及\emph{非均匀时间动态},在这种情况下,一小部分难处理的令牌会导致错误增长,使得均匀跳过变得不稳定或过于保守。我们提出了\textbf{WorldCache},一个专为扩散世界模型量身定制的缓存框架。我们引入了\textit{曲率引导的异构令牌预测},该方法使用基于物理的曲率评分来估计令牌的可预测性,并为方向突变的混沌令牌应用Hermite引导的阻尼预测器。我们还设计了\textit{混沌优先的自适应跳过},该方法累积一个曲率归一化的无量纲漂移信号,并仅在瓶颈令牌开始漂移时重新计算。对扩散世界模型的实验表明,WorldCache在保持\textbf{98\%}回放质量的同时,提供了高达\textbf{3.7$\times$}的端到端加速,展示了WorldCache在资源受限场景下的巨大优势和实用性。我们的代码已发布在https://github.com/FofGofx/WorldCache。
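As a rough illustration of the skipping logic, the controller below accumulates a curvature-normalized drift over the most chaotic tokens and triggers recomputation only when it crosses a threshold. This is our reconstruction from the abstract; names and thresholds are ours, and the released code at the repository above is authoritative.

import torch

class AdaptiveCache:
    def __init__(self, drift_threshold=1.0):
        self.prev, self.prev_delta = None, None
        self.drift = 0.0
        self.threshold = drift_threshold

    def should_recompute(self, feats):
        # feats: (tokens, dim) features from the most recent full computation.
        feats = feats.detach()
        if self.prev is None or self.prev_delta is None:
            recompute = True  # warm-up: no trajectory history yet
        else:
            delta = feats - self.prev
            # Curvature: change of direction of each token's trajectory
            # (second difference across denoising steps).
            curv = (delta - self.prev_delta).norm(dim=-1)
            # Dimensionless drift signal, normalized by feature scale and
            # accumulated only over the most chaotic (bottleneck) tokens.
            k = max(1, feats.shape[0] // 20)
            top = curv.topk(k).values
            self.drift += (top / (feats.norm(dim=-1).mean() + 1e-6)).mean().item()
            recompute = self.drift > self.threshold
        if recompute:
            self.drift = 0.0
        if self.prev is not None:
            self.prev_delta = feats - self.prev
        self.prev = feats
        return recompute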
cs.CV / 114 / 2603.06340

K-MaT: Knowledge-Anchored Manifold Transport for Cross-Modal Prompt Learning in Medical Imaging

K-MaT:基于知识的流形传输在医学成像中的跨模态提示学习
Zeng, Jiajun, Albarqouni, Shadi
Abstract
Large-scale biomedical vision-language models (VLMs) adapted on high-end imaging (e.g., CT) often fail to transfer to frontline low-end modalities (e.g., radiography), collapsing into modality-specific shortcuts. We propose K-MaT (Knowledge-Anchored Manifold Transport), a prompt-learning framework that transfers decision structures to low-end modalities without requiring low-end training images. K-MaT factorizes prompts, anchors them to clinical text descriptions, and aligns the low-end prompt manifold to the visually-grounded high-end space using Fused Gromov-Wasserstein optimal transport. We evaluate K-MaT on four cross-modal benchmarks, including dermoscopy, mammography to ultrasound, and CT to chest X-ray. K-MaT achieves state-of-the-art results, improving the average harmonic mean of accuracy to 44.1% (from BiomedCoOp's 42.0%) and macro-F1 to 36.2%. Notably, on the challenging breast imaging task, it mitigates the catastrophic forgetting seen in standard methods like CoOp (which drops to 27.0% accuracy on the low-end), preserving robust performance across modalities. Aligning prompt manifolds via optimal transport provides a highly effective route for the zero-shot cross-modal deployment of medical VLMs.
Chinese Translation
在高端成像(例如CT)上适配的大规模生物医学视觉-语言模型(VLMs)通常无法有效迁移到前线低端模态(例如放射摄影),而是陷入模态特定的捷径。我们提出了K-MaT(基于知识的流形传输),这是一种提示学习框架,能够在不需要低端训练图像的情况下将决策结构转移到低端模态。K-MaT对提示进行因式分解,将其锚定到临床文本描述,并使用融合的Gromov-Wasserstein最优传输将低端提示流形与视觉基础的高端空间对齐。我们在四个跨模态基准上评估了K-MaT,包括皮肤镜检查、乳腺X光检查到超声波,以及CT到胸部X光。K-MaT实现了最先进的结果,将准确率的平均调和均值提高到44.1%(相比BiomedCoOp的42.0%),并将宏观F1提高到36.2%。值得注意的是,在具有挑战性的乳腺成像任务中,它减轻了标准方法(如CoOp)中出现的灾难性遗忘(在低端模态下准确率降至27.0%),在各模态间保持了稳健的性能。通过最优传输对齐提示流形为医学VLMs的零样本跨模态部署提供了一条高效的途径。
cs.CV / 115 / 2603.06351

Dynamic Chunking Diffusion Transformer

动态分块扩散变换器
Haridas, Akash, Saxena, Utkarsh, Fashi, Parsa Ashrafi, Rezagholizadeh, Mehdi, Appia, Vikram, Barsoum, Emad
Abstract
Diffusion Transformers process images as fixed-length sequences of tokens produced by a static $\textit{patchify}$ operation. While effective, this design spends uniform compute on low- and high-information regions alike, ignoring that images contain regions of varying detail and that the denoising process progresses from coarse structure at early timesteps to fine detail at late timesteps. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence in a data-dependent manner using a chunking mechanism learned end-to-end with diffusion training. The mechanism learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens, with meaningful visual segmentations emerging without explicit supervision. Furthermore, it also learns to adapt its compression across diffusion timesteps, using fewer tokens at noisy stages and more tokens as fine details emerge. On class-conditional ImageNet $256{\times}256$, DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines across $4{\times}$ and $16{\times}$ compression, showing this is a promising technique with potential further applications to pixel-space, video and 3D generation. Beyond accuracy, DC-DiT is practical: it can be upcycled from pretrained DiT checkpoints with minimal post-training compute (up to $8{\times}$ fewer training steps) and composes with other dynamic computation methods to further reduce generation FLOPs.
Chinese Translation
扩散变换器将图像处理为由静态的 $\textit{patchify}$ 操作生成的固定长度的令牌序列。尽管这种设计有效,但它在低信息和高信息区域上均匀分配计算资源,忽视了图像包含不同细节区域的事实,以及去噪过程从早期时间步的粗略结构到后期时间步的细致细节的进展。我们引入了动态分块扩散变换器(Dynamic Chunking Diffusion Transformer, DC-DiT),它通过一个学习的编码器-路由器-解码器框架增强了 DiT 主干,能够以数据依赖的方式自适应地将 2D 输入压缩为更短的令牌序列,采用通过扩散训练端到端学习的分块机制。该机制学习将均匀的背景区域压缩为更少的令牌,而将细节丰富的区域压缩为更多的令牌,且在没有显式监督的情况下产生有意义的视觉分割。此外,它还学习在扩散时间步之间自适应地调整其压缩,在噪声阶段使用更少的令牌,而在细节出现时使用更多的令牌。在类别条件的 ImageNet $256{\times}256$ 数据集上,DC-DiT 在 $4{\times}$ 和 $16{\times}$ 压缩下,相比参数匹配和 FLOP 匹配的 DiT 基线均稳定提升了 FID 和 Inception Score,显示出这一技术具有进一步应用于像素空间、视频和 3D 生成的潜力。除了准确性,DC-DiT 还具有实用性:它可以从预训练的 DiT 检查点中进行再利用,所需的后训练计算量最小(最多减少 $8{\times}$ 的训练步骤),并且可以与其他动态计算方法组合,以进一步减少生成的 FLOPs。
cs.CV / 116 / 2603.06357

LATO: 3D Mesh Flow Matching with Structured TOpology Preserving LAtents

LATO:具有结构化拓扑保持潜变量的3D网格流匹配
Zhao, Tianhao, Zhang, Youjia, Long, Hang, Zhang, Jinshen, Li, Wenbing, Yang, Yang, Zhang, Gongbo, Hladký, Jozef, Nießner, Matthias, Yang, Wei
Abstract
In this paper, we introduce LATO, a novel topology-preserving latent representation that enables scalable, flow matching-based synthesis of explicit 3D meshes. LATO represents a mesh as a Vertex Displacement Field (VDF) anchored on surface, incorporating a sparse voxel Variational Autoencoder (VAE) to compress this explicit signal into a structured, topology-aware voxel latent. To decapsulate the mesh, the VAE decoder progressively subdivides and prunes latent voxels to instantiate precise vertex locations. In the end, a dedicated connection head queries the voxel latent to predict edge connectivity between vertex pairs directly, allowing mesh topology to be recovered without isosurface extraction or heuristic meshing. For generative modeling, LATO adopts a two-stage flow matching process, first synthesizing the structure voxels and subsequently refining the voxel-wise topology features. Compared to prior isosurface/triangle-based diffusion models and autoregressive generation approaches, LATO generates meshes with complex geometry and well-formed topology while being highly efficient in inference.
Chinese Translation
在本文中,我们介绍了LATO,这是一种新颖的拓扑保持潜在表示,能够实现可扩展的基于流匹配的显式3D网格合成。LATO将网格表示为一个锚定在表面的顶点位移场(Vertex Displacement Field, VDF),结合稀疏体素变分自编码器(Variational Autoencoder, VAE)将这一显式信号压缩为结构化的、具有拓扑感知的体素潜变量。为了解封网格,VAE解码器逐步细分和修剪潜体素,以实例化精确的顶点位置。最后,一个专用的连接头查询体素潜变量,以直接预测顶点对之间的边连接性,从而允许在不进行等值面提取或启发式网格生成的情况下恢复网格拓扑。对于生成建模,LATO采用两阶段流匹配过程,首先合成结构体素,然后细化体素级拓扑特征。与先前的等值面/三角形基础扩散模型和自回归生成方法相比,LATO生成的网格具有复杂的几何形状,良好的拓扑结构,同时在推理时具有高效性。
cs.CV / 117 / 2603.06362

Computer vision-based estimation of invertebrate biomass

基于计算机视觉的无脊椎动物生物量估计
Impiö, Mikko, Rehsen, Philipp M., Blair, Jarrett, Mielec, Cecilie, Beermann, Arne J., Leese, Florian, Høye, Toke T., Raitoharju, Jenni
Abstract
The ability to estimate invertebrate biomass using only images could help scaling up quantitative biodiversity monitoring efforts. Computer vision-based methods have the potential to omit the manual, time-consuming, and destructive process of dry weighing specimens. We present two approaches for dry mass estimation that do not require additional manual effort apart from imaging the specimens: fitting a linear model with novel predictors, automatically calculated by an imaging device, and training a family of end-to-end deep neural networks for the task, using single-view, multi-view, and metadata-aware architectures. We propose using area and sinking speed as predictors. These can be calculated with BIODISCOVER, which is a dual-camera system that captures image sequences of specimens sinking in an ethanol column. For this study, we collected a large dataset of dry mass measurement and image sequence pairs to train and evaluate models. We show that our methods can estimate specimen dry mass even with complex and visually diverse specimen morphologies. Combined with automatic taxonomic classification, our approach is an accurate method for group-level dry mass estimation, with a median percentage error of 10-20% for individuals. We highlight the importance of choosing appropriate evaluation metrics, and encourage using both percentage errors and absolute errors as metrics, because they measure different properties. We also explore different optimization losses, data augmentation methods, and model architectures for training deep-learning models.
Chinese Translation
仅通过图像估计无脊椎动物生物量的能力有助于扩大定量生物多样性监测的努力。基于计算机视觉的方法有潜力省去手动、耗时且破坏性的干重称量过程。我们提出了两种干重估计方法,这些方法除了对标本进行成像外,不需要额外的手动操作:一种是使用由成像设备自动计算的新型预测变量拟合线性模型,另一种是训练一系列端到端深度神经网络,采用单视图、多视图和考虑元数据的架构。我们建议使用面积和下沉速度作为预测变量。这些变量可以通过BIODISCOVER计算,该系统是一个双摄像头系统,能够捕捉标本在乙醇柱中下沉的图像序列。在本研究中,我们收集了大量的干重测量和图像序列配对数据集,以训练和评估模型。我们展示了即使标本形态复杂且视觉多样,我们的方法也能够估计标本干重。结合自动分类学识别,我们的方法是一种准确的群体级干重估计方法,个体的中位百分比误差为10-20%。我们强调选择适当评估指标的重要性,并鼓励同时使用百分比误差和绝对误差作为指标,因为它们测量不同的属性。我们还探索了不同的优化损失、数据增强方法和模型架构,以训练深度学习模型。
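The first approach is simple enough to show in miniature. A sketch with placeholder numbers (the real predictors come from the BIODISCOVER device; fitting in log space is our assumption, since mass tends to scale multiplicatively with size):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical measurements: projected area (mm^2), sinking speed (mm/s),
# and measured dry mass (mg) for a handful of specimens.
area = np.array([1.2, 3.4, 0.8, 5.1, 2.6])
speed = np.array([10.5, 14.2, 9.1, 18.3, 12.7])
dry_mass = np.array([0.08, 0.31, 0.05, 0.62, 0.21])

# Fit a linear model on log-transformed predictors and target.
X = np.log(np.column_stack([area, speed]))
model = LinearRegression().fit(X, np.log(dry_mass))
pred = np.exp(model.predict(X))

# Report both error types, since they measure different properties.
pct_err = np.abs(pred - dry_mass) / dry_mass
abs_err = np.abs(pred - dry_mass)
print(f"median percentage error: {100 * np.median(pct_err):.1f}%")
print(f"median absolute error: {np.median(abs_err):.3f} mg")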
cs.CV / 118 / 2603.06366

OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis

OralGPT-Plus:通过强化学习学习使用视觉工具进行全景X光分析
Fan, Yuxuan, Hao, Jing, Chen, Hong, Bao, Jiahao, Shao, Yihua, Liang, Yuci, Hung, Kuo Feng, Tang, Hao
Abstract
Panoramic dental radiographs require fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, yet existing vision-language models operate under a static single-pass paradigm that limits their clinical reliability. In this paper, we introduce OralGPT-Plus, an agentic vision-language model designed to perform iterative and symmetry-aware diagnostic reasoning for panoramic dental radiograph analysis. To support this paradigm, we construct DentalProbe, a five-thousand-image dataset with expert-curated diagnostic trajectories that provide structured supervision for localized inspection and contralateral comparison. We further develop a Reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning with rubric-based reward and conditioned diagnostic-driven reward. In parallel, we present MMOral-X, the first benchmark for holistic panoramic diagnosis, containing 300 open-ended questions and region-level annotations across multiple difficulty levels. OralGPT-Plus demonstrates consistent and reliable improvements over strong baselines on MMOral-X and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning. Our work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis.
Chinese Translation
全景牙科放射影像需要细致的空间推理、双侧对称理解和多步骤的诊断验证,然而现有的视觉-语言模型在静态单次处理范式下运行,这限制了它们的临床可靠性。本文介绍了OralGPT-Plus,一种旨在进行迭代和对称感知诊断推理的代理视觉-语言模型,用于全景牙科放射影像分析。为了支持这一范式,我们构建了DentalProbe,一个包含五千张图像的数据集,配有专家策划的诊断轨迹,为局部检查和对侧比较提供结构化监督。我们进一步开发了一种基于再检查驱动的强化学习框架,鼓励具有临床意义的重新检查,并通过基于评分标准的奖励和条件诊断驱动奖励来稳定长期推理。同时,我们提出了MMOral-X,这是第一个针对整体全景诊断的基准,包含300个开放式问题和多个难度级别的区域级注释。OralGPT-Plus在MMOral-X和已建立的全景基准上显示出相对于强基线的一致且可靠的改进,表明交互式和对称信息推理的有效性。我们的工作突显了代理建模在牙科影像中的价值,并为未来的临床对齐全景放射影像分析研究提供了基础。
cs.CV / 119 / 2603.06374

Rewis3d: Reconstruction Improves Weakly-Supervised Semantic Segmentation

Rewis3d:重建提升弱监督语义分割
Ernst, Jonas, Boettcher, Wolfgang, Hoyer, Lukas, Lenssen, Jan Eric, Schiele, Bernt
Abstract
We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Sparse annotations alleviate this issue by offering an efficient weakly-supervised alternative; however, they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student-teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geometric supervision. Extensive experiments demonstrate that Rewis3d achieves state-of-the-art performance in sparse supervision, outperforming existing approaches by 2-7% without requiring additional labels or inference overhead.
Chinese Translation
我们提出了Rewis3d,一个利用近期前馈3D重建进展的框架,显著改善了在2D图像上的弱监督语义分割。获得密集的像素级注释仍然是训练分割模型的一个昂贵瓶颈。为了解决这个问题,稀疏注释提供了一种高效的弱监督替代方案。然而,它们仍然存在性能差距。为此,我们引入了一种新颖的方法,利用3D场景重建作为辅助监督信号。我们的关键见解是,从2D视频中恢复的3D几何结构提供了强有力的线索,可以在整个场景中传播稀疏注释。具体而言,双学生-教师架构强制2D图像与重建的3D点云之间的语义一致性,利用最先进的前馈重建生成可靠的几何监督。大量实验表明,Rewis3d在稀疏监督下实现了最先进的性能,相较于现有方法提高了2-7%的准确率,而无需额外的标签或推理开销。
cs.CV / 120 / 2603.06378

MoEMambaMIL: Structure-Aware Selective State Space Modeling for Whole-Slide Image Analysis

MoEMambaMIL:面向结构的选择性状态空间建模用于全切片图像分析
Xie, Dongqing, Wu, Yonghuang
Abstract
Whole-slide image (WSI) analysis is challenging due to the gigapixel scale of slides and their inherent hierarchical multi-resolution structure. Existing multiple instance learning (MIL) approaches often model WSIs as unordered collections of patches, which limits their ability to capture structured dependencies between global tissue organization and local cellular patterns. Although recent State Space Models (SSMs) enable efficient modeling of long sequences, how to structure WSI tokens to fully exploit their spatial hierarchy remains an open problem. We propose MoEMambaMIL, a structure-aware SSM framework for WSI analysis that integrates region-nested selective scanning with mixture-of-experts (MoE) modeling. Leveraging multi-resolution preprocessing, MoEMambaMIL organizes patch tokens into region-aware sequences that preserve spatial containment across resolutions. On top of this structured sequence, we decouple resolution-aware encoding and region-adaptive contextual modeling via a combination of static, resolution-specific experts and dynamic sparse experts with learned routing. This design enables efficient long-sequence modeling while promoting expert specialization across heterogeneous diagnostic patterns. Experiments demonstrate that MoEMambaMIL achieves the best performance across 9 downstream tasks.
Chinese Translation
全切片图像(WSI)分析因切片的千兆像素规模及其固有的层次多分辨率结构而具有挑战性。现有的多实例学习(MIL)方法通常将WSI建模为无序的补丁集合,这限制了它们捕捉全局组织与局部细胞模式之间结构依赖关系的能力。尽管最近的状态空间模型(SSM)能够高效建模长序列,但如何构建WSI标记以充分利用其空间层次仍然是一个未解决的问题。我们提出了MoEMambaMIL,这是一种面向结构的SSM框架,用于WSI分析,集成了区域嵌套选择性扫描与专家混合(MoE)建模。通过多分辨率预处理,MoEMambaMIL将补丁标记组织成区域感知序列,保持不同分辨率之间的空间包含关系。在此结构化序列之上,我们通过静态的分辨率特定专家和具有学习路由的动态稀疏专家的组合,解耦分辨率感知编码和区域自适应上下文建模。该设计实现了高效的长序列建模,同时促进了异质诊断模式下专家的专业化。实验表明,MoEMambaMIL在9个下游任务中实现了最佳性能。
cs.CV / 121 / 2603.06382

CHMv2: Improvements in Global Canopy Height Mapping using DINOv3

CHMv2:基于DINOv3的全球树冠高度映射改进
Brandt, John, Yi, Seungeun, Tolan, Jamie, Li, Xinyuan, Potapov, Peter, Ertel, Jessica, Spore, Justine, Vo, Huy V., Ramamonjisoa, Michaël, Labatut, Patrick, Bojanowski, Piotr, Couprie, Camille
Abstract
Accurate canopy height information is essential for quantifying forest carbon, monitoring restoration and degradation, and assessing habitat structure, yet high-fidelity measurements from airborne laser scanning (ALS) remain unevenly available globally. Here we present CHMv2, a global, meter-resolution canopy height map derived from high-resolution optical satellite imagery using a depth-estimation model built on DINOv3 and trained against ALS canopy height models. Compared to existing products, CHMv2 substantially improves accuracy, reduces bias in tall forests, and better preserves fine-scale structure such as canopy edges and gaps. These gains are enabled by a large expansion of geographically diverse training data, automated data curation and registration, and a loss formulation and data sampling strategy tailored to canopy height distributions. We validate CHMv2 against independent ALS test sets and against tens of millions of GEDI and ICESat-2 observations, demonstrating consistent performance across major forest biomes.
Chinese Translation
准确的树冠高度信息对于量化森林碳、监测恢复和退化以及评估栖息地结构至关重要,但来自航空激光扫描(ALS)的高保真测量在全球范围内仍然不均匀可用。在此,我们提出了CHMv2,这是一个全球范围内、米级分辨率的树冠高度图,基于高分辨率光学卫星影像,使用基于DINOv3构建的深度估计模型,并针对ALS树冠高度模型进行了训练。与现有产品相比,CHMv2显著提高了准确性,减少了高大森林中的偏差,并更好地保留了诸如树冠边缘和缝隙等细尺度结构。这些改进得益于地理多样化训练数据的大规模扩展、自动化数据整理和配准,以及针对树冠高度分布量身定制的损失公式和数据采样策略。我们将CHMv2与独立的ALS测试集以及数千万个GEDI和ICESat-2观测数据进行了验证,证明其在主要森林生物群落中的一致性表现。
cs.CV / 122 / 2603.06384

Prompt Group-Aware Training for Robust Text-Guided Nuclei Segmentation

面向提示组的训练以实现稳健的文本引导细胞核分割
Wu, Yonghuang, Liang, Zhenyang, Zeng, Wenwen, Xie, Xuan, Yu, Jinhua
Abstract
Foundation models such as Segment Anything Model 3 (SAM3) enable flexible text-guided medical image segmentation, yet their predictions remain highly sensitive to prompt formulation. Even semantically equivalent descriptions can yield inconsistent masks, limiting reliability in clinical and pathology workflows. We reformulate prompt sensitivity as a group-wise consistency problem. Semantically related prompts are organized into \emph{prompt groups} sharing the same ground-truth mask, and a prompt group-aware training framework is introduced for robust text-guided nuclei segmentation. The approach combines (i) a quality-guided group regularization that leverages segmentation loss as an implicit ranking signal, and (ii) a logit-level consistency constraint with a stop-gradient strategy to align predictions within each group. The method requires no architectural modification and leaves inference unchanged. Extensive experiments on multi-dataset nuclei benchmarks show consistent gains under textual prompting and markedly reduced performance variance across prompt quality levels. On six zero-shot cross-dataset tasks, our method improves Dice by an average of 2.16 points. These results demonstrate improved robustness and generalization for vision-language segmentation in computational pathology.
Chinese Translation
基础模型如Segment Anything Model 3 (SAM3)使得灵活的文本引导医学图像分割成为可能,但其预测仍对提示的表述高度敏感。即使是语义上等价的描述也可能产生不一致的掩膜,从而限制了在临床和病理工作流程中的可靠性。我们将提示敏感性重新表述为一个群体一致性问题。语义相关的提示被组织成共享相同真实掩膜的\emph{提示组},并引入了一种面向提示组的训练框架,以实现稳健的文本引导细胞核分割。该方法结合了(i)利用分割损失作为隐式排名信号的质量引导组正则化,以及(ii)通过停止梯度策略对每个组内的预测进行对齐的logit级一致性约束。该方法无需修改架构,推理过程保持不变。在多数据集细胞核基准上的广泛实验表明,在文本提示下取得了一致的提升,并显著减少了不同提示质量水平下的性能方差。在六个零样本跨数据集任务中,我们的方法平均提高了Dice系数2.16分。这些结果表明,在计算病理学中,视觉-语言分割的稳健性和泛化能力得到了改善。
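A sketch of how the two ingredients might compose in a single loss (our reconstruction from the abstract, not the released code): per-prompt segmentation losses act as an implicit quality ranking, and a stop-gradient pulls the group's logits toward its best-performing prompt.

import torch
import torch.nn.functional as F

def group_loss(logits, target, lam=0.1):
    # logits: (P, H, W) mask logits from P semantically equivalent prompts
    # in one prompt group; target: (H, W) shared float ground-truth mask.
    seg = torch.stack([
        F.binary_cross_entropy_with_logits(l, target) for l in logits
    ])
    # Quality-guided: the lowest-loss prompt leads the group.
    best = seg.argmin()
    # Stop-gradient on the anchor so other prompts align toward it,
    # rather than the anchor drifting toward weaker prompts.
    anchor = logits[best].detach()
    consistency = ((logits - anchor) ** 2).mean()
    return seg.mean() + lam * consistency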
cs.CV / 123 / 2603.06386

REACT++: Efficient Cross-Attention for Real-Time Scene Graph Generation

REACT++:用于实时场景图生成的高效交叉注意力机制
Neau, Maëlic, Falomir, Zoe
Abstract
Scene Graph Generation (SGG) is a task that encodes visual relationships between objects in images as graph structures. SGG shows significant promise as a foundational component for downstream tasks, such as reasoning for embodied agents. To enable real-time applications, SGG must address the trade-off between performance and inference speed. However, current methods tend to focus on one of the following: (1) improving relation prediction accuracy, (2) enhancing object detection accuracy, or (3) reducing latency, without aiming to balance all three objectives simultaneously. To address this limitation, we build on the powerful Real-time Efficiency and Accuracy Compromise for Tradeoffs in Scene Graph Generation (REACT) architecture and propose REACT++, a new state-of-the-art model for real-time SGG. By leveraging efficient feature extraction and subject-to-object cross-attention within the prototype space, REACT++ balances latency and representational power. REACT++ achieves the highest inference speed among existing SGG models, improving relation prediction accuracy without sacrificing object detection performance. Compared to the previous REACT version, REACT++ is 20% faster with a gain of 10% in relation prediction accuracy on average. The code is available at https://github.com/Maelic/SGG-Benchmark.
Chinese Translation
场景图生成(SGG)是一项将图像中对象之间的视觉关系编码为图结构的任务。SGG作为下游任务(如具身智能体推理)的基础组件,展现出显著的潜力。为了实现实时应用,SGG必须解决性能与推理速度之间的权衡。然而,当前的方法往往集中于以下三个方面之一:(1)提高关系预测的准确性,(2)增强对象检测的准确性,或(3)减少延迟,而未能同时平衡这三个目标。为了解决这一局限性,我们基于强大的实时效率与准确性折衷场景图生成(REACT)架构,提出了REACT++,这是一个用于实时SGG的新一代最先进模型。通过在原型空间内利用高效的特征提取和主体对对象的交叉注意力,REACT++平衡了延迟和表现力。REACT++在现有SGG模型中实现了最高的推理速度,提高了关系预测的准确性,同时不牺牲对象检测的性能。与之前的REACT版本相比,REACT++的速度提高了20%,关系预测准确性平均提升了10%。代码可在 https://github.com/Maelic/SGG-Benchmark 获取。
cs.CV / 124 / 2603.06389

Solving Jigsaw Puzzles in the Wild: Human-Guided Reconstruction of Cultural Heritage Fragments

在野外解决拼图难题:人类引导的文化遗产碎片重建
Safaei, Omidreza, Aslan, Sinem, Vascon, Sebastiano, Palmieri, Luca, Khoroshiltseva, Marina, Pelillo, Marcello
Abstract
Reassembling real-world archaeological artifacts from fragmented pieces poses significant challenges due to erosion, missing regions, irregular shapes, and large-scale ambiguity. Traditional jigsaw puzzle solvers, often designed for clean synthetic scenarios, struggle under these conditions, especially when the number of fragments grows into the thousands, as in the RePAIR benchmark. In this paper, we propose a human-in-the-loop (HIL) puzzle solving framework designed to address the complexity and scale of real-world cultural heritage reconstruction. Our approach integrates an automatic relaxation-labeling solver with interactive human guidance, allowing users to iteratively lock verified placements, correct errors, and guide the system toward semantically and geometrically coherent assemblies. We introduce two complementary interaction strategies, Iterative Anchoring and Continuous Interactive Refinement, which support scalable reconstruction across varying levels of ambiguity and puzzle size. Experiments on several RePAIR groups demonstrate that our hybrid approach substantially outperforms both fully automatic and manual baselines in accuracy and efficiency, offering a practical solution for large-scale expert-in-the-loop artifact reassembly.
Chinese Translation
从碎片中重新组装现实世界的考古文物面临着显著的挑战,这些挑战源于侵蚀、缺失区域、不规则形状和大规模的模糊性。传统的拼图解法通常针对干净的合成场景设计,在这些条件下表现不佳,尤其是当碎片数量达到数千个时,如在RePAIR基准测试中所示。本文提出了一种人机协作(HIL)拼图解决框架,旨在应对现实世界文化遗产重建的复杂性和规模。我们的方法将自动松弛标记求解器与互动的人类指导相结合,使用户能够迭代地锁定经过验证的放置位置、纠正错误,并引导系统朝向语义和几何上连贯的组合。我们引入了两种互补的交互策略:迭代锚定和持续互动细化,这些策略支持在不同模糊性和拼图大小下的可扩展重建。在多个RePAIR组的实验中,我们的混合方法在准确性和效率上显著优于完全自动和手动基线,为大规模专家参与的文物重组提供了实用的解决方案。
cs.CV / 125 / 2603.06399

DiffInf: Influence-Guided Diffusion for Supervision Alignment in Facial Attribute Learning

DiffInf:基于影响引导的扩散框架用于面部属性学习中的监督对齐
Pal, Basudha, Chellappa, Rama
Abstract
Facial attribute classification relies on large-scale annotated datasets in which many traits, such as age and expression, are inherently ambiguous and continuous but are discretized into categorical labels. Annotation inconsistencies arise from subjectivity and visual confounders such as pose, illumination, expression, and demographic variation, creating a mismatch between images and assigned labels. These inconsistencies introduce supervision errors that impair representation learning and degrade downstream prediction. We introduce DiffInf, a self-influence-guided diffusion framework for mitigating annotation inconsistencies in facial attribute learning. We first train a baseline classifier and compute sample-wise self-influence scores using a practical first-order approximation to identify training instances that disproportionately destabilize optimization. Instead of discarding these influential samples, we apply targeted generative correction via a latent diffusion autoencoder to better align visual content with assigned labels while preserving identity and realism. To enable differentiable guidance during correction, we train a lightweight predictor of high-influence membership and use it as a surrogate influence regularizer. The edited samples replace the originals, yielding an influence-refined dataset of unchanged size. Across multi-class facial attribute classification, DiffInf consistently improves generalization compared with standard noisy-label training, robust optimization baselines, and influence-based filtering. Our results demonstrate that repairing influential annotation inconsistencies at the image level enhances downstream facial attribute classification without sacrificing distributional coverage.
Chinese Translation
面部属性分类依赖于大规模标注数据集,其中许多特征(如年龄和表情)本质上是模糊和连续的,但被离散化为类别标签。标注不一致性源于主观性和视觉混淆因素,如姿势、光照、表情和人口统计变化,导致图像与分配标签之间的不匹配。这些不一致性引入了监督错误,损害了表示学习并降低了下游预测的效果。我们提出了DiffInf,一种自我影响引导的扩散框架,用于减轻面部属性学习中的标注不一致性。我们首先训练一个基线分类器,并使用实用的一阶近似计算样本级自我影响分数,以识别那些不成比例地破坏优化的训练实例。我们并不舍弃这些有影响力的样本,而是通过潜在扩散自编码器应用有针对性的生成修正,以更好地将视觉内容与分配标签对齐,同时保持身份和真实感。为了在修正过程中实现可微分引导,我们训练了一个轻量级的高影响力成员预测器,并将其用作替代影响正则化器。编辑后的样本替换原始样本,生成一个影响精炼的数据集,大小保持不变。在多类面部属性分类中,与标准噪声标签训练、稳健优化基线和基于影响的过滤相比,DiffInf始终提高了泛化能力。我们的结果表明,在图像级修复有影响力的标注不一致性能够增强下游面部属性分类的效果,而不牺牲分布覆盖率。
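For concreteness, here is a TracIn-style first-order self-influence, a common practical approximation of this kind (the paper's exact estimator may differ): a sample's self-influence is the squared norm of its own loss gradient, so samples that most destabilize optimization score highest.

import torch
import torch.nn.functional as F

def self_influence(model, x, y, params=None):
    # x: single input tensor; y: its scalar class label (long tensor).
    params = [p for p in (params or model.parameters()) if p.requires_grad]
    loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, params)
    # Squared gradient norm = first-order self-influence of this sample.
    return sum((g ** 2).sum() for g in grads).item()

Samples ranked highest by this score would then be routed to the generative correction step rather than discarded.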
cs.CV / 126 / 2603.06407

Locating and Editing Figure-Ground Organization in Vision Transformers

在视觉变换器中定位和编辑图形-背景组织
Arnold, Stefan, Gröbner, René
Abstract
Vision Transformers must resolve figure-ground organization by choosing between completions driven by local geometric evidence and those favored by global organizational priors, giving rise to a characteristic perceptual ambiguity. We aim to locate where the canonical Gestalt prior convexity is realized within the internal components of BEiT. Using a controlled perceptual conflict based on synthetic shapes of darts, we systematically mask regions that equally admit either a concave completion or a convex completion. We show that BEiT reliably favors convex completion under this competition. Projecting internal activations into the model's discrete visual codebook space via logit attribution reveals that this preference is governed by identifiable functional units within transformer substructures. Specifically, we find that figure-ground organization is ambiguous through early and intermediate layers and resolves abruptly in later layers. By decomposing the direct effect of attention heads, we identify head L0H9 acting as an early seed, introducing a weak bias toward convexity. Downscaling this single attention head shifts the distributional mass of the perceptual conflict across a continuous decision boundary, allowing concave evidence to guide completion.
Chinese Translation
视觉变换器必须通过在局部几何证据驱动的补全与全球组织先验偏好的补全之间进行选择,从而解决图形-背景组织问题,这导致了特征感知的模糊性。我们的目标是定位经典格式塔先验的凸性在 BEiT 的内部组件中实现的位置。通过基于合成飞镖形状的受控感知冲突,我们系统地遮蔽那些同样允许凹形补全或凸形补全的区域。我们展示了在这种竞争下,BEiT 可靠地偏向于凸形补全。通过对模型的离散视觉词汇空间进行逻辑归因,投影内部激活揭示了这种偏好是由变换器子结构中的可识别功能单元所主导的。具体而言,我们发现图形-背景组织在早期和中间层中是模糊的,并在后期层中突然得到解决。通过分解注意力头的直接影响,我们识别出头 L0H9 作为一个早期种子,引入了对凸性的弱偏见。缩小这个单一注意力头的规模,改变了感知冲突的分布质量,使得凹形证据能够引导补全。
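The intervention itself is easy to state in code. The sketch below runs a generic multi-head attention forward pass in which one head's output is scaled by alpha before the output projection; the internals follow a standard ViT block rather than BEiT's exact implementation, and it is meant for evaluation-time probing, not training.

import torch
import torch.nn.functional as F

def mha_with_scaled_head(x, qkv_w, proj_w, n_heads, head_idx, alpha=0.5):
    # x: (tokens, dim); qkv_w: (3*dim, dim); proj_w: (dim, dim).
    T, D = x.shape
    hd = D // n_heads
    q, k, v = (x @ qkv_w.T).chunk(3, dim=-1)
    q, k, v = (t.view(T, n_heads, hd).transpose(0, 1) for t in (q, k, v))
    attn = F.softmax(q @ k.transpose(-2, -1) / hd ** 0.5, dim=-1)
    out = attn @ v                          # (heads, tokens, head_dim)
    out[head_idx] = alpha * out[head_idx]   # intervene on one head only
    return out.transpose(0, 1).reshape(T, D) @ proj_w.T

# Hypothetical usage: downscale head 9 (of a layer-0 block) to quarter strength.
x = torch.randn(197, 768)
qkv_w = torch.randn(2304, 768) / 768 ** 0.5
proj_w = torch.randn(768, 768) / 768 ** 0.5
y = mha_with_scaled_head(x, qkv_w, proj_w, n_heads=12, head_idx=9, alpha=0.25)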
cs.CV / 127 / 2603.06408

Physical Simulator In-the-Loop Video Generation

物理仿真器环路内视频生成
Foo, Lin Geng, Huang, Mark He, Lattas, Alexandros, Moschoglou, Stylianos, Beeler, Thabo, Theobalt, Christian
Abstract
Recent advances in diffusion-based video generation have achieved remarkable visual realism but still struggle to obey basic physical laws such as gravity, inertia, and collision. Generated objects often move inconsistently across frames, exhibit implausible dynamics, or violate physical constraints, limiting the realism and reliability of AI-generated videos. We address this gap by introducing Physical Simulator In-the-loop Video Generation (PSIVG), a novel framework that integrates a physical simulator into the video diffusion process. Starting from a template video generated by a pre-trained diffusion model, PSIVG reconstructs the 4D scene and foreground object meshes, initializes them within a physical simulator, and generates physically consistent trajectories. These simulated trajectories are then used to guide the video generator toward spatio-temporally physically coherent motion. To further improve texture consistency during object movement, we propose a Test-Time Texture Consistency Optimization (TTCO) technique that adapts text and feature embeddings based on pixel correspondences from the simulator. Comprehensive experiments demonstrate that PSIVG produces videos that better adhere to real-world physics while preserving visual quality and diversity. Project Page: https://vcai.mpi-inf.mpg.de/projects/PSIVG/
Chinese Translation
最近基于扩散的视频生成技术取得了显著的视觉真实感,但仍然难以遵循基本的物理法则,如重力、惯性和碰撞。生成的物体在帧之间的运动往往不一致,表现出不合理的动态,或违反物理约束,从而限制了人工智能生成视频的真实感和可靠性。为了解决这一问题,我们提出了物理仿真器环路内视频生成(PSIVG),这是一个将物理仿真器集成到视频扩散过程中的新框架。PSIVG 从由预训练扩散模型生成的模板视频开始,重建 4D 场景和前景物体网格,将其初始化在物理仿真器中,并生成物理一致的轨迹。这些模拟轨迹随后被用来引导视频生成器朝向时空上物理一致的运动。为了进一步提高物体运动过程中的纹理一致性,我们提出了一种测试时纹理一致性优化(TTCO)技术,该技术基于来自仿真器的像素对应关系调整文本和特征嵌入。全面的实验表明,PSIVG 生成的视频在遵循现实世界物理的同时,保持了视觉质量和多样性。项目页面:https://vcai.mpi-inf.mpg.de/projects/PSIVG/
cs.CV / 128 / 2603.06421

Non-invasive Growth Monitoring of Small Freshwater Fish in Home Aquariums via Stereo Vision

通过立体视觉对家庭水族箱中小型淡水鱼进行非侵入性生长监测
Seibold, Clemens, Hilsmann, Anna, Eisert, Peter
Abstract
Monitoring fish growth behavior provides relevant information about fish health in aquaculture and home aquariums. Yet, monitoring fish sizes poses different challenges, as fish are small and subject to strong refractive distortions in aquarium environments. Image-based measurement offers a practical, non-invasive alternative that allows frequent monitoring without disturbing the fish. In this paper, we propose a non-invasive refraction-aware stereo vision method to estimate fish length in aquariums. Our approach uses a YOLOv11-Pose network to detect fish and predict anatomical keypoints on the fish in each stereo image. A refraction-aware epipolar constraint accounting for the air-glass-water interfaces enables robust matching, and unreliable detections are removed using a learned quality score. A subsequent refraction-aware 3D triangulation recovers 3D keypoints, from which fish length is measured. We validate our approach on a new stereo dataset of endangered Sulawesi ricefish captured under aquarium-like conditions and demonstrate that filtering low-quality detections is essential for accurate length estimation. The proposed system offers a simple and practical solution for non-invasive growth monitoring and can be easily applied in home aquariums.
Chinese Translation
监测鱼类生长行为为水产养殖和家庭水族箱中的鱼类健康提供了相关信息。然而,监测鱼类尺寸面临不同的挑战,因为鱼体小且在水族箱环境中受到强烈的折射失真。基于图像的测量提供了一种实用的非侵入性替代方案,允许在不打扰鱼类的情况下进行频繁监测。本文提出了一种非侵入性的折射感知立体视觉方法,用于估计水族箱中鱼类的长度。我们的方法使用YOLOv11-Pose网络检测鱼类并预测每个立体图像中鱼体的解剖关键点。考虑到空气-玻璃-水界面的折射感知极线约束使得匹配更加稳健,并通过学习的质量评分去除不可靠的检测。随后,通过折射感知的三维三角测量恢复三维关键点,从中测量鱼类长度。我们在一个新的立体数据集上验证了我们的方法,该数据集包含在类似水族箱条件下捕获的濒危苏拉威西稻鱼,并证明过滤低质量检测对于准确的长度估计至关重要。所提出的系统为非侵入性生长监测提供了一种简单而实用的解决方案,并且可以轻松应用于家庭水族箱。
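The geometric core of a refraction-aware constraint is Snell's law in vector form, applied once per interface. A geometry-only sketch (refractive indices and the pane normal are illustrative; the paper's full pipeline also handles keypoint detection, matching, and triangulation):

import numpy as np

def refract(d, n, n1, n2):
    # d: unit ray direction; n: unit interface normal facing the incoming ray.
    eta = n1 / n2
    cos_i = -np.dot(n, d)
    sin2_t = eta ** 2 * (1.0 - cos_i ** 2)
    if sin2_t > 1.0:
        return None  # total internal reflection
    cos_t = np.sqrt(1.0 - sin2_t)
    return eta * d + (eta * cos_i - cos_t) * n

# A camera ray entering the tank at 30 degrees from the pane normal.
d = np.array([0.0, -np.sin(np.radians(30)), -np.cos(np.radians(30))])
n = np.array([0.0, 0.0, 1.0])      # flat glass pane, normal toward the camera
d = refract(d, n, 1.000, 1.52)     # air -> glass
d = refract(d, n, 1.52, 1.333)     # glass -> water
print(d)                           # bent ray used for matching/triangulation

Chaining the two refractions per camera yields the bent viewing rays along which a refraction-aware epipolar constraint and 3D triangulation operate.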
cs.CV / 129 / 2603.06426

CLoPA: Continual Low Parameter Adaptation of Interactive Segmentation for Medical Image Annotation

CLoPA:医学图像标注的交互式分割的持续低参数适应
Esmaeili, Parhom, Tangwiriyasakul, Chayanin, Gibson, Eli, Ourselin, Sebastien, Cardoso, M. Jorge
Abstract
Interactive segmentation enables clinicians to guide annotation, but existing zero-shot models like nnInteractive fail to consistently reach expert-level performance across diverse medical imaging tasks. Because annotation campaigns produce a growing stream of task-specific labelled data, online adaptation of the segmentation model is a natural complement to zero-shot inference. We propose CLoPA, a continual adaptation strategy that tunes a small fraction of nnInteractive's parameters on the annotation cache, triggered by lightweight episode scheduling. CLoPA requires no new parameters or changes to the inference pipeline, and operates entirely within the existing annotation workflow. Across eight Medical Segmentation Decathlon tasks spanning diverse anatomical targets and imaging characteristics, CLoPA rapidly elevates performance to expert-level, even for tasks where nnInteractive previously failed, with the majority of gains realised after a single training episode. We show that the benefits of tuning different parameter groups depends on task characteristics and data regimes. Also, that for targets with complex geometries (e.g., hepatic vessels), instance normalisation and low-level feature tuning saturates, suggesting a need for deeper feature-representation alignment in the most challenging scenarios.
Chinese Translation
交互式分割使临床医生能够指导标注,但现有的零样本模型如 nnInteractive 在不同医学影像任务中未能始终如一地达到专家级表现。由于标注活动产生了不断增长的特定任务标注数据流,因此分割模型的在线适应是对零样本推理的自然补充。我们提出了 CLoPA,一种持续适应策略,通过轻量级的训练回合调度,在标注缓存上微调 nnInteractive 的一小部分参数。CLoPA 不需要新的参数或对推理流程的更改,完全在现有的标注工作流程中运行。在涵盖多种解剖目标和影像特征的八个医学分割十项全能任务中,CLoPA 能迅速将性能提升至专家级,即使在 nnInteractive 之前失败的任务中,绝大多数的性能提升在单个训练回合后即可实现。我们展示了调整不同参数组的好处依赖于任务特征和数据模式。此外,对于具有复杂几何形状的目标(例如肝脏血管),实例归一化和低层特征调整会饱和,这表明在最具挑战性的场景中需要更深层次的特征表示对齐。
cs.CV / 130 / 2603.06445

What if? Emulative Simulation with World Models for Situated Reasoning

如果?基于世界模型的情境推理仿真
Liu, Ruiping, Chen, Yufan, Zhang, Yuheng, Zheng, Junwei, Peng, Kunyu, Wu, Chengzhi, Huang, Chenguang, Wen, Di, Zhang, Jiaming, Yang, Kailun, Stiefelhagen, Rainer
Abstract
Situated reasoning often relies on active exploration, yet in many real-world scenarios such exploration is infeasible due to physical constraints of robots or safety concerns of visually impaired users. Given only a limited observation, can an agent mentally simulate a future trajectory toward a target situation and answer spatial what-if questions? We introduce WanderDream, the first large-scale dataset designed for the emulative simulation of mental exploration, enabling models to reason without active exploration. WanderDream-Gen comprises 15.8K panoramic videos across 1,088 real scenes from HM3D, ScanNet++, and real-world captures, depicting imagined trajectories from current viewpoints to target situations. WanderDream-QA contains 158K question-answer pairs, covering starting states, paths, and end states along each trajectory to comprehensively evaluate exploration-based reasoning. Extensive experiments with world models and MLLMs demonstrate (1) that mental exploration is essential for situated reasoning, (2) that world models achieve compelling performance on WanderDream-Gen, (3) that imagination substantially facilitates reasoning on WanderDream-QA, and (4) that WanderDream data exhibit remarkable transferability to real-world scenarios. The source code and all data will be released.
Chinese Translation
情境推理通常依赖于主动探索,但在许多现实场景中,由于机器人物理限制或视觉障碍用户的安全考虑,这种探索是不可行的。在仅有有限观察的情况下,代理能否在心理上模拟朝向目标情境的未来轨迹,并回答空间的假设性问题?我们提出了WanderDream,这是第一个旨在进行心理探索仿真的大规模数据集,使模型能够在没有主动探索的情况下进行推理。WanderDream-Gen包含来自HM3D、ScanNet++和现实世界捕捉的1,088个真实场景中的15.8K全景视频,描绘了从当前视点到目标情境的想象轨迹。WanderDream-QA包含158K问答对,涵盖每条轨迹的起始状态、路径和结束状态,以全面评估基于探索的推理。通过对世界模型和MLLMs的广泛实验,结果表明:(1) 心理探索对于情境推理至关重要;(2) 世界模型在WanderDream-Gen上表现出色;(3) 想象力显著促进了WanderDream-QA上的推理;(4) WanderDream数据在现实场景中展现出显著的可迁移性。源代码和所有数据将被公开。
cs.CV / 131 / 2603.06449

CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

CaTok:驯化均值流以实现一维因果图像标记化
Chen, Yitong, Wu, Zuxuan, Qiu, Xipeng, Jiang, Yu-Gang
Abstract
Autoregressive (AR) language models rely on causal tokenization, but extending this paradigm to vision remains non-trivial. Current visual tokenizers either flatten 2D patches into non-causal sequences or enforce heuristic orderings that misalign with the "next-token prediction" pattern. Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. By selecting tokens over time intervals and binding them to the MeanFlow objective, as illustrated in Fig. 1, CaTok learns causal 1D representations that support both fast one-step generation and high-fidelity multi-step sampling, while naturally capturing diverse visual concepts across token intervals. To further stabilize and accelerate training, we propose a straightforward regularization REPA-A, which aligns encoder features with Vision Foundation Models (VFMs). Experiments demonstrate that CaTok achieves state-of-the-art results on ImageNet reconstruction, reaching 0.75 FID, 22.53 PSNR and 0.674 SSIM with fewer training epochs, and the AR model attains performance comparable to leading approaches.
Chinese Translation
自回归(AR)语言模型依赖于因果标记化,但将这一范式扩展到视觉领域仍然具有挑战性。目前的视觉标记器要么将二维补丁展平为非因果序列,要么强制执行与“下一个标记预测”模式不一致的启发式排序。最近的扩散自编码器同样存在不足:对所有标记进行解码器条件化缺乏因果性,而应用嵌套的丢弃机制则引入了不平衡。为了解决这些挑战,我们提出了CaTok,一种具有均值流解码器的1D因果图像标记器。通过在时间间隔内选择标记并将其绑定到均值流目标,如图1所示,CaTok学习因果1D表示,支持快速的一步生成和高保真度的多步采样,同时自然捕捉跨标记间隔的多样视觉概念。为了进一步稳定和加速训练,我们提出了一种简单的正则化方法REPA-A,该方法将编码器特征与视觉基础模型(VFM)对齐。实验表明,CaTok在ImageNet重建上取得了最先进的结果,FID达到0.75,PSNR为22.53,SSIM为0.674,并且在较少的训练周期内,AR模型的性能与领先方法相当。
cs.CV / 132 / 2603.06453

Pinterest Canvas: Large-Scale Image Generation at Pinterest

Pinterest Canvas:Pinterest的大规模图像生成
Wang, Yu, Tzeng, Eric, Shiau, Raymond, Yang, Jie, Kislyuk, Dmitry, Rosenberg, Charles
Abstract
While recent image generation models demonstrate a remarkable ability to handle a wide variety of image generation tasks, this flexibility makes them hard to control via prompting or simple inference adaptation alone, rendering them unsuitable for use cases with strict product requirements. In this paper, we introduce Pinterest Canvas, our large-scale image generation system built to support image editing and enhancement use cases at Pinterest. Canvas is first trained on a diverse, multimodal dataset to produce a foundational diffusion model with broad image-editing capabilities. However, rather than relying on one generic model to handle every downstream task, we instead rapidly fine-tune variants of this base model on task-specific datasets, producing specialized models for individual use cases. We describe key components of Canvas and summarize our best practices for dataset curation, training, and inference. We also showcase task-specific variants through case studies on background enhancement and aspect-ratio outpainting, highlighting how we tackle their specific product requirements. Online A/B experiments demonstrate that our enhanced images receive a significant 18.0% and 12.5% engagement lift, respectively, and comparisons with human raters further validate that our models outperform third-party models on these tasks. Finally, we showcase other Canvas variants, including multi-image scene synthesis and image-to-video generation, demonstrating that our approach can generalize to a wide variety of potential downstream tasks.
Chinese Translation
尽管最近的图像生成模型在处理各种图像生成任务方面表现出显著的能力,但这种灵活性使得它们难以仅通过提示或简单的推理适应进行控制,从而不适合具有严格产品要求的使用案例。本文介绍了Pinterest Canvas,我们的大规模图像生成系统,旨在支持Pinterest的图像编辑和增强使用案例。Canvas首先在一个多样化的多模态数据集上进行训练,以生成具有广泛图像编辑能力的基础扩散模型。然而,我们并不依赖一个通用模型来处理每个下游任务,而是快速微调该基础模型的变体,以适应特定任务的数据集,从而为各个使用案例生成专业模型。我们描述了Canvas的关键组件,并总结了我们在数据集策划、训练和推理方面的最佳实践。我们还通过背景增强和纵横比扩展的案例研究展示了特定任务的变体,突显了我们如何应对其特定的产品要求。在线A/B实验表明,我们增强的图像分别获得了显著的18.0%和12.5%的参与度提升,与人工评估者的比较进一步验证了我们的模型在这些任务上优于第三方模型。最后,我们展示了其他Canvas变体,包括多图像场景合成和图像到视频生成,证明我们的方法可以推广到各种潜在的下游任务。
cs.CV / 133 / 2603.06454

Training Flow Matching: The Role of Weighting and Parameterization

训练流匹配:加权和参数化的作用
Gagneux, Anne, Martin, Ségolène, Gribonval, Rémi, Massias, Mathurin
Abstract
We study the training objectives of denoising-based generative models, with a particular focus on loss weighting and output parameterization, including noise-, clean image-, and velocity-based formulations. Through a systematic numerical study, we analyze how these training choices interact with the intrinsic dimensionality of the data manifold, model architecture, and dataset size. Our experiments span synthetic datasets with controlled geometry as well as image data, and compare training objectives using quantitative metrics for denoising accuracy (PSNR across noise levels) and generative quality (FID). Rather than proposing a new method, our goal is to disentangle the various factors that matter when training a flow matching model, in order to provide practical insights on design choices.
Chinese Translation
我们研究了基于去噪的生成模型的训练目标,特别关注损失加权和输出参数化,包括基于噪声、干净图像和速度的公式。通过系统的数值研究,我们分析了这些训练选择如何与数据流形的内在维度、模型架构和数据集大小相互作用。我们的实验涵盖了具有控制几何形状的合成数据集以及图像数据,并使用去噪精度(在不同噪声水平下的 PSNR)和生成质量(FID)的定量指标比较训练目标。我们的目标不是提出一种新方法,而是理清在训练流匹配模型时影响结果的各种因素,以便为设计选择提供实用的见解。
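The design axes under study are compact enough to write down. A worked sketch on the linear path x_t = (1 - t) x + t eps, with the three output parameterizations and a pluggable loss weighting (conventions here are illustrative, not necessarily the paper's):

import torch

def fm_loss(net, x, parameterization="velocity", weight_fn=lambda t: 1.0):
    # x: batch of clean images; net(x_t, t) predicts the chosen target.
    eps = torch.randn_like(x)
    t = torch.rand(x.shape[0], *([1] * (x.dim() - 1)), device=x.device)
    x_t = (1 - t) * x + t * eps          # linear interpolation path
    target = {
        "noise": eps,                    # predict the noise
        "clean": x,                      # predict the clean image
        "velocity": eps - x,             # predict dx_t/dt along the path
    }[parameterization]
    pred = net(x_t, t)
    w = weight_fn(t)                     # e.g. lambda t: 1 / (t ** 2 + 1e-3)
    return (w * (pred - target) ** 2).mean()

The three targets are affine re-expressions of one another on this path, so the choice interacts with the weighting: changing the parameterization implicitly rescales the effective per-timestep weight, which is the interaction the study disentangles.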
cs.CV / 134 / 2603.06459

Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement

基础模型是否了解几何?探测冻结特征以进行连续物理测量
Shkolnikov, Yakov Pyotr
Abstract
Vision-language models encode continuous geometry that their text pathway fails to express: a 6,000-parameter linear probe extracts hand joint angles at 6.1 degrees MAE from frozen features, while the best text output achieves only 20.0 degrees, a 3.3x bottleneck. LoRA fine-tuning (r=16, 2,000 images) narrows this gap to 6.5 degrees, providing evidence for a pathway-training deficit rather than a representational one. Training objective determines accuracy more than architecture: five encoders spanning self-supervised, contrastive, and hybrid paradigms converge to statistically equivalent accuracy (R^2 approximately 0.55, TOST-equivalent at delta=0.03) despite sharing as little as CKA=0.41 representational similarity: functional convergence without representational convergence. Autoregressive generation damages geometric fidelity, but the damage originates in the generation process, not in language alignment: Qwen2.5-VL's LLM layers actually improve probe accuracy over its raw vision encoder. Layer-wise analysis reveals a universal mid-network accuracy peak across all architectures, with attention heads in layers 18-22 carrying disproportionate geometric signal. These findings enable a single frozen backbone to function as a multi-task geometric sensor through lightweight probes, without fine-tuning or text generation.
Chinese Translation
视觉-语言模型编码了其文本路径无法表达的连续几何信息:一个6,000参数的线性探针从冻结特征中提取手关节角度,平均绝对误差为6.1度,而最佳文本输出仅达到20.0度——这是一个3.3倍的瓶颈。LoRA微调(r=16,2,000张图像)将这一差距缩小至6.5度,提供了路径训练不足而非表示能力不足的证据。训练目标对准确性影响大于架构:五种编码器涵盖自监督、对比和混合范式,尽管共享的CKA表示相似度低至0.41,但仍收敛到统计上等效的准确性(R^2约为0.55,TOST等效于delta=0.03)——功能收敛而非表示收敛。自回归生成损害了几何保真度,但这种损害源于生成过程,而非语言对齐:Qwen2.5-VL的LLM层实际上提高了探针的准确性,相较于其原始视觉编码器。逐层分析揭示了所有架构中普遍存在的中间网络准确性峰值,层18-22中的注意力头承载着不成比例的几何信号。这些发现使得单一冻结骨干网络能够通过轻量级探针作为多任务几何传感器运作,而无需微调或文本生成。
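The probing recipe behind these numbers is a single linear map on frozen features, evaluated by MAE. A self-contained sketch with synthetic stand-in data (the paper probes real frozen encoder features; the dimensions and ridge penalty here are illustrative, and the parameter count differs from the paper's 6,000):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 768))    # stand-in for frozen encoder features
true_w = rng.normal(size=(768, 21))     # 21 hypothetical hand joint angles
angles = feats @ true_w + rng.normal(scale=5.0, size=(1000, 21))  # degrees

# Fit the linear probe on a train split, report MAE on a held-out split.
probe = Ridge(alpha=1.0).fit(feats[:800], angles[:800])
mae = np.abs(probe.predict(feats[800:]) - angles[800:]).mean()
print(f"MAE: {mae:.1f} degrees")

Running the same probe layer by layer over an encoder's intermediate features is what produces the mid-network accuracy peak the abstract describes.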
cs.CV / 135 / 2603.06467

GreenRFM: Toward a resource-efficient radiology foundation model

GreenRFM:朝向资源高效的放射学基础模型
Li, Yingtai, Ming, Shuai, Zhao, Mingyue, Lai, Haoran, Wang, Rongsheng, Zhou, Rui, Wang, Rundong, Li, Yujia, Wei, Wei, Zhou, Shaohua Kevin
Abstract
The development of radiology foundation models (RFMs) is hindered by a reliance on brute-force scaling. Existing approaches often directly translate methods for natural images, which prioritize scale over precision and hence lead to brittle and expensive models in clinical practice. To address this, we present a resource-efficient pre-training framework, GreenRFM, that achieves state-of-the-art performance. Our framework ensures robust generalization across diverse patient populations and imaging protocols, reducing computational requirements by orders of magnitude while surpassing complex, parameter-heavy models. These capabilities stem from principled supervision design that aims to maximally utilize supervisory signals via More distilled, Ubiquitous, Semantic-enforcing, and Task-aligning (MUST) supervision, rather than simply piling up the quantity of training data. We offer two GreenRFM configurations: (i) a performant model that establishes a new state-of-the-art using a single 24GB GPU within 24 hours, and (ii) a lightweight model that matches existing benchmarks with 6GB VRAM in 4 hours. We conduct extensive experiments using over 200,000 images from four institutions and two modalities. GreenRFMs achieve superior performance on chest and abdominal CT datasets, regardless of public or private benchmark, surpassing a range of baseline models. In addition, the results on internal musculoskeletal MRI images show that the same supervision principles transfer between different modalities. Our performance and efficiency challenge the "scale is all you need" dogma and democratize the equitable development of state-of-the-art RFMs for clinicians even on a laptop.
Chinese Translation
放射学基础模型(RFM)的发展受到对粗暴扩展的依赖的制约。现有方法通常直接将自然图像的方法进行转换,优先考虑规模而非精度,因此在临床实践中导致脆弱且昂贵的模型。为了解决这一问题,我们提出了一种资源高效的预训练框架,GreenRFM,能够实现最先进的性能。我们的框架确保在不同患者群体和成像协议中具有稳健的泛化能力,将计算需求降低几个数量级,同时超越复杂的、参数密集的模型。这些能力源于原则性监督设计,旨在通过更精炼、普遍、语义强化和任务对齐的(MUST)监督,最大限度地利用监督信号,而不仅仅是堆积训练数据的数量。我们提供了两种GreenRFM配置:(i)一种高性能模型,在24小时内使用单个24GB GPU建立新的最先进水平;(ii)一种轻量级模型,在4小时内以6GB VRAM达到现有基准。我们使用来自四个机构和两种模态的超过200,000张图像进行了广泛实验。GreenRFM在胸部和腹部CT数据集上表现出优越的性能,无论是公共基准还是私有基准,均超越了一系列基线模型。此外,在内部肌肉骨骼MRI图像上的结果表明,相同的监督原则可以在不同模态之间转移。我们的性能和效率挑战了“规模就是一切”的教条,使得即使在笔记本电脑上,临床医生也能公平地开发最先进的RFM。
cs.CV / 136 / 2603.06471

Match4Annotate: Propagating Sparse Video Annotations via Implicit Neural Feature Matching

Match4Annotate:通过隐式神经特征匹配传播稀疏视频注释
Zhang, Zhuorui, Pallarès-López, Roger, Namburi, Praneeth, Anthony, Brian W.
Abstract
Acquiring per-frame video annotations remains a primary bottleneck for deploying computer vision in specialized domains such as medical imaging, where expert labeling is slow and costly. Label propagation offers a natural solution, yet existing approaches face fundamental limitations. Video trackers and segmentation models can propagate labels within a single sequence but require per-video initialization and cannot generalize across videos. Classic correspondence pipelines operate on detector-chosen keypoints and struggle in low-texture scenes, while dense feature matching and one-shot segmentation methods enable cross-video propagation but lack spatiotemporal smoothness and unified support for both point and mask annotations. We present Match4Annotate, a lightweight framework for both intra-video and inter-video propagation of point and mask annotations. Our method fits a SIREN-based implicit neural representation to DINOv3 features at test time, producing a continuous, high-resolution spatiotemporal feature field, and learns a smooth implicit deformation field between frame pairs to guide correspondence matching. We evaluate on three challenging clinical ultrasound datasets. Match4Annotate achieves state-of-the-art inter-video propagation, outperforming feature matching and one-shot segmentation baselines, while remaining competitive with specialized trackers for intra-video propagation. Our results show that lightweight, test-time-optimized feature matching pipelines have the potential to offer an efficient and accessible solution for scalable annotation workflows.
Chinese Translation
获取逐帧视频注释仍然是将计算机视觉应用于医学影像等专业领域的主要瓶颈,因为专家标注过程缓慢且成本高昂。标签传播提供了一种自然的解决方案,但现有方法面临基本限制。视频跟踪器和分割模型可以在单个序列内传播标签,但需要每个视频的初始化,并且无法跨视频进行泛化。经典的对应管道依赖于检测器选择的关键点,在低纹理场景中表现不佳,而密集特征匹配和一次性分割方法虽然能够实现跨视频传播,但缺乏时空平滑性,并且对点和掩膜注释的支持不统一。我们提出了Match4Annotate,这是一个轻量级框架,用于点和掩膜注释的内部和跨视频传播。我们的方法在测试时将基于SIREN的隐式神经表示拟合到DINOv3特征,生成连续的高分辨率时空特征场,并学习帧对之间的平滑隐式变形场以指导对应匹配。我们在三个具有挑战性的临床超声数据集上进行了评估。Match4Annotate实现了最先进的跨视频传播,超越了特征匹配和一次性分割基线,同时在内部视频传播方面与专业跟踪器保持竞争力。我们的结果表明,轻量级、测试时优化的特征匹配管道有潜力为可扩展的注释工作流程提供高效且易于访问的解决方案。
cs.CV / 137 / 2603.06507

Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

自监督流匹配用于可扩展的多模态合成
Chefer, Hila, Esser, Patrick, Lorenz, Dominik, Podell, Dustin, Raja, Vikash, Tong, Vinh, Torralba, Antonio, Rombach, Robin
Abstract
Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model's training objective, which poses a denoising task with little incentive to learn semantic representations. We introduce Self-Flow: a self-supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism, Dual-Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives learning strong representations alongside generative capabilities without external supervision. Our method generalizes across modalities and enables multi-modal training while following expected scaling laws, achieving superior image, video, and audio generation.
Chinese Translation
强语义表示提高了扩散模型和流模型的收敛性和生成质量。现有的方法在很大程度上依赖于外部模型,这些模型需要单独训练,操作在不对齐的目标上,并表现出意想不到的扩展行为。我们认为,这种依赖性源于模型的训练目标,该目标提出了去噪任务,而对学习语义表示的激励很少。我们引入了Self-Flow:一种自监督流匹配范式,将表示学习整合到生成框架中。我们的关键机制,双时间步调度(Dual-Timestep Scheduling),在不同的标记上应用异质噪声水平,创造出信息不对称,迫使模型从损坏的输入中推断缺失的信息。这推动了强表示的学习以及生成能力的提升,而无需外部监督。我们的方法在不同模态之间具有良好的泛化能力,并支持多模态训练,同时遵循预期的扩展规律,实现了优越的图像、视频和音频生成。
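A schematic of dual-timestep scheduling as we read it from the abstract (not the authors' code): each sample draws two noise levels and assigns them token-wise at random, creating the information asymmetry the text describes; the low/high ranges below are an illustrative choice.

import torch

def dual_timestep_batch(tokens):
    # tokens: (batch, seq, dim) clean token sequence.
    b, s, d = tokens.shape
    eps = torch.randn_like(tokens)
    # Two noise levels per sample, assigned token-wise at random, so
    # lightly corrupted tokens must be inferred alongside heavily
    # corrupted neighbors.
    t_lo = torch.rand(b, 1, 1) * 0.3
    t_hi = 0.7 + torch.rand(b, 1, 1) * 0.3
    pick_hi = (torch.rand(b, s, 1) < 0.5).float()
    t = pick_hi * t_hi + (1 - pick_hi) * t_lo   # (batch, seq, 1)
    x_t = (1 - t) * tokens + t * eps            # per-token corruption
    target = eps - tokens                       # flow-matching velocity
    return x_t, t, target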
cs.CV / 138 / 2603.06522

Artificial Intelligence for Detecting Fetal Orofacial Clefts and Advancing Medical Education

人工智能在胎儿口面裂检测及医学教育中的应用
Zhang, Yuanji, Huang, Yuhao, Dou, Haoran, Zhu, Xiliang, Ling, Chen, Yang, Zhong, Liang, Lianying, Li, Jiuping, Liang, Siying, Li, Rui, Cao, Yan, Zhang, Yuhan, Lai, Jiewei, Zhou, Yongsong, Zheng, Hongyu, Gao, Xinru, Yu, Cheng, Shi, Liling, Yuan, Mengqin, Li, Honglong, Huang, Xiaoqiong, Chen, Chaoyu, Zhang, Jialin, Pan, Wenxiong, Frangi, Alejandro F., He, Guangzhi, Yang, Xin, Xiong, Yi, Yin, Linliang, Deng, Xuedong, Ni, Dong
Abstract
Orofacial clefts are among the most common congenital craniofacial abnormalities, yet accurate prenatal detection remains challenging due to the scarcity of experienced specialists and the relative rarity of the condition. Early and reliable diagnosis is essential to enable timely clinical intervention and reduce associated morbidity. Here we show that an artificial intelligence system, trained on 45,139 ultrasound images from 9,215 fetuses across 22 hospitals, can diagnose fetal orofacial clefts with sensitivity and specificity exceeding 93% and 95% respectively, matching the performance of senior radiologists and substantially outperforming junior radiologists. When used as a medical copilot, the system raises junior radiologists' sensitivity by more than 6%. Beyond direct diagnostic assistance, the system also accelerates the development of clinical expertise. A pilot study involving 24 radiologists and trainees demonstrated that the model can improve the expertise development for rare conditions. This dual-purpose approach offers a scalable solution for improving both diagnostic accuracy and specialist training in settings where experienced radiologists are scarce.
Chinese Translation
口面裂是最常见的先天性颅面畸形之一,但由于经验丰富的专家稀缺以及该病症相对罕见,准确的产前检测仍然具有挑战性。早期和可靠的诊断对于及时临床干预和减少相关发病率至关重要。在此,我们展示了一种人工智能系统,该系统基于来自22家医院的9,215名胎儿的45,139张超声图像进行训练,能够以超过93%的灵敏度和95%的特异性诊断胎儿口面裂,其表现与资深放射科医师相当,并且显著优于初级放射科医师。当作为医学副驾驶使用时,该系统使初级放射科医师的灵敏度提高了超过6%。除了直接的诊断辅助外,该系统还加速了临床专业知识的发展。一项涉及24名放射科医师和实习生的初步研究表明,该模型能够改善对罕见病症的专业知识发展。这种双重目的的方法为在经验丰富的放射科医师稀缺的环境中提高诊断准确性和专业培训提供了一种可扩展的解决方案。
cs.CV / 139 / 2603.06523

SCAN: Visual Explanations with Self-Confidence and Analysis Networks

SCAN:具有自信度和分析网络的视觉解释
Lee, Gwanghee, Jeong, Sungyoon, Jhang, Kyoungson
Abstract
Explainable AI (XAI) has become essential in computer vision to make the decision-making processes of deep learning models transparent. However, current visual explanation methods face a critical trade-off between the high fidelity of architecture-specific methods and the broad applicability of universal ones. This often results in abstract or fragmented explanations and makes it difficult to compare explanatory power across diverse model families, such as CNNs and Transformers. This paper introduces the Self-Confidence and Analysis Networks (SCAN), a novel universal framework that overcomes these limitations for both convolutional neural network and transformer architectures. SCAN utilizes an AutoEncoder-based approach to reconstruct features from a model's intermediate layers. Guided by the Information Bottleneck principle, it generates a high-resolution Self-Confidence Map that identifies information-rich regions. Extensive experiments on diverse architectures and datasets demonstrate that SCAN consistently achieves outstanding performance on various quantitative metrics such as AUC-D, Negative AUC, Drop%, and Win%. Qualitatively, it produces significantly clearer, object-focused explanations than existing methods. By providing a unified framework that is both architecturally universal and highly faithful, SCAN enhances model transparency and offers a more reliable tool for understanding the decision-making processes of complex neural networks.
Chinese Translation
可解释人工智能(XAI)在计算机视觉中变得至关重要,以使深度学习模型的决策过程透明。然而,当前的视觉解释(XAI)方法面临着架构特定方法的高保真性与通用方法的广泛适用性之间的关键权衡。这往往导致解释过于抽象或片段化,使得在不同模型家族(如卷积神经网络和变换器)之间比较解释能力变得困难。本文介绍了自信度和分析网络(Self-Confidence and Analysis Networks, SCAN),这是一个新颖的通用框架,克服了卷积神经网络和变换器架构的这些限制。SCAN利用基于自编码器的方法从模型的中间层重建特征。在信息瓶颈原则的指导下,它生成一个高分辨率的自信度图(Self-Confidence Map),以识别信息丰富的区域。在多种架构和数据集上的广泛实验表明,SCAN在各种定量指标(如AUC-D、负AUC、Drop%和Win%)上始终表现出色。在定性方面,它比现有方法生成显著更清晰、以对象为中心的解释。通过提供一个在架构上通用且高度忠实的统一框架,SCAN增强了模型的透明度,并为理解复杂神经网络的决策过程提供了更可靠的工具。
cs.CV / 140 / 2603.06530

AV-Unified: A Unified Framework for Audio-visual Scene Understanding

AV-Unified:一种统一的音视频场景理解框架
Li, Guangyao, Wang, Xin, Zhu, Wenwu
Abstract
When humans perceive the world, they naturally integrate multiple audio-visual tasks within dynamic, real-world scenes. However, current works such as event localization, parsing, segmentation and question answering are mostly explored individually, making it challenging to comprehensively understand complex audio-visual scenes and explore inter-task relationships. Hence, we propose \textbf{AV-Unified}, a unified framework that enables joint learning across a wide range of audio-visual scene understanding tasks. AV-Unified standardizes the diverse input-output formats of each task and incorporates a multi-scale spatiotemporal perception network to effectively capture audio-visual associations. Specifically, we unify the inputs and outputs of all supported tasks by converting them into sequences of discrete tokens, establishing a shared representation that allows a single architecture to be trained jointly across heterogeneous varied datasets. Considering the varying temporal granularity of audio-visual events, a multi-scale temporal perception module is designed to capture key cues. Meanwhile, to overcome the lack of auditory supervision in the visual domain, we design a cross-modal guidance-based spatial perception module that models spatial audio-visual associations. Furthermore, task-specific text prompts are employed to enhance the model's adaptability and task-awareness. Extensive experiments on benchmark datasets (e.g., AVE, LLP, MUSIC-AVQA, VGG-SS and AVS) demonstrate the effectiveness of AV-Unified across temporal, spatial, and spatiotemporal tasks.
Chinese Translation
当人类感知世界时,他们自然会在动态的现实场景中整合多种音视频任务。然而,目前的研究如事件定位、解析、分割和问答等大多是单独探讨的,这使得全面理解复杂的音视频场景以及探索任务间关系变得具有挑战性。因此,我们提出了\textbf{AV-Unified},一个统一的框架,能够在广泛的音视频场景理解任务中实现联合学习。AV-Unified标准化了每个任务的多样输入输出格式,并结合多尺度时空感知网络,有效捕捉音视频关联。具体而言,我们通过将所有支持任务的输入和输出转换为离散标记序列来统一它们,建立共享表示,使得单一架构能够在异构多样的数据集上进行联合训练。考虑到音视频事件的时间粒度差异,我们设计了一个多尺度时间感知模块来捕捉关键线索。同时,为了克服视觉领域缺乏听觉监督的问题,我们设计了一个基于跨模态引导的空间感知模块,以建模空间音视频关联。此外,采用特定任务的文本提示来增强模型的适应性和任务意识。在基准数据集(如AVE、LLP、MUSIC-AVQA、VGG-SS和AVS)上的大量实验证明了AV-Unified在时间、空间和时空任务中的有效性。
cs.CV / 141 / 2603.06531

Spatial Calibration of Diffuse LiDARs

扩散型激光雷达的空间标定
Behari, Nikhil, Raskar, Ramesh
Abstract
Diffuse direct time-of-flight LiDARs report per-pixel depth histograms formed by aggregating photon returns over a wide instantaneous field of view, violating the single-ray assumption behind standard LiDAR-RGB calibration. We present a simple spatial calibration procedure that estimates, for each diffuse LiDAR pixel, its footprint (effective support region) and relative spatial sensitivity in a co-located RGB image plane. Using a scanned retroreflective patch with background subtraction, we recover per-pixel response maps that provide an explicit LiDAR-to-RGB correspondence for cross-modal alignment and fusion. We demonstrate the method on the ams OSRAM TMF8828.
Chinese Translation
扩散型直接飞行时间激光雷达通过在广泛的瞬时视场内聚合光子返回,报告每个像素的深度直方图,这违反了标准激光雷达-RGB标定所依据的单光线假设。我们提出了一种简单的空间标定程序,该程序为每个扩散型激光雷达像素估计其足迹(有效支持区域)和在共置RGB图像平面中的相对空间灵敏度。通过使用带背景减除的扫描反射补丁,我们恢复每个像素的响应图,从而为跨模态对齐和融合提供明确的激光雷达与RGB之间的对应关系。我们在ams OSRAM TMF8828上演示了该方法。
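The calibration reduces to background subtraction and accumulation: each placement of the retroreflective patch excites whichever diffuse-LiDAR pixels see it, and accumulating those responses over patch positions yields per-pixel footprints. A rough numpy sketch, with the scan format and normalization as assumptions rather than the paper's exact procedure:

```python
# Background-subtraction-and-accumulate sketch of the calibration. The scan
# record format and the normalization rule below are assumptions.
import numpy as np

def response_maps(scans, background, H, W):
    """scans: iterable of ((x, y), returns) where (x, y) is the patch position
    on the RGB image plane and returns is a (P,) per-LiDAR-pixel intensity."""
    P = background.shape[0]
    maps = np.zeros((P, H, W))
    for (x, y), returns in scans:
        signal = np.clip(returns - background, 0.0, None)  # background subtraction
        maps[:, y, x] += signal                            # accumulate response
    sums = maps.sum(axis=(1, 2), keepdims=True)            # per-pixel normalizer
    return maps / np.clip(sums, 1e-9, None)                # relative sensitivity
```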
cs.CV / 142 / 2603.06533

NEGATE: Constrained Semantic Guidance for Linguistic Negation in Text-to-Video Diffusion

NEGATE:文本到视频扩散中的语言否定的约束语义指导
Kang, Taewon, Lin, Ming C.
Abstract
Negation is a fundamental linguistic operator, yet it remains inadequately modeled in diffusion-based generative systems. In this work, we present a formal treatment of linguistic negation in diffusion-based generative models by modeling it as a structured feasibility constraint on semantic guidance within diffusion dynamics. Rather than introducing heuristics or retraining model parameters, we reinterpret classifier-free guidance as defining a semantic update direction and enforce negation by projecting the update onto a convex constraint set derived from linguistic structure. This novel formulation provides a unified framework for handling diverse negation phenomena, including object absence, graded non-inversion semantics, multi-negation composition, and scope-sensitive disambiguation. Our approach is training-free, compatible with pretrained diffusion backbones, and naturally extends from image generation to temporally evolving video trajectories. In addition, we introduce a structured negation-centric benchmark suite that isolates distinct linguistic failure modes in generative systems, to further research in this area. Experiments demonstrate that our method achieves robust negation compliance while preserving visual fidelity and structural coherence, establishing the first unified formulation of linguistic negation in diffusion-based generative models beyond representation-level evaluation.
Chinese Translation
否定是一个基本的语言操作符,但在基于扩散的生成系统中仍然建模不足。在本研究中,我们通过将语言否定建模为扩散动态中语义指导的结构化可行性约束,提出了对基于扩散的生成模型中语言否定的正式处理。我们并没有引入启发式方法或重新训练模型参数,而是重新解释无分类器指导为定义语义更新方向,并通过将更新投影到从语言结构派生的凸约束集上来强制执行否定。这一新颖的表述提供了一个统一框架,用于处理多样的否定现象,包括对象缺失、分级非反转语义、多重否定组合和范围敏感的消歧。我们的方法是无训练的,兼容预训练的扩散主干,并自然地从图像生成扩展到时间演变的视频轨迹。此外,我们引入了一个以结构化否定为中心的基准套件,以隔离生成系统中不同的语言失效模式,以进一步推动该领域的研究。实验表明,我们的方法在保持视觉保真度和结构一致性的同时,实现了稳健的否定合规性,建立了基于扩散的生成模型中语言否定的第一个统一表述,超越了表示层面的评估。
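The key move is geometric: classifier-free guidance defines a semantic update direction, and negation becomes a projection of that update onto a convex set derived from the linguistic structure. The halfspace projection below is one simple instance of such a constraint, not the paper's exact construction, and the negated-concept direction n is a hypothetical input:

```python
# Sketch: treat the classifier-free guidance update as a direction, then
# project it onto the convex halfspace {u : <u, n> <= 0} that excludes a
# (hypothetical) negated-concept direction n.
import torch

def guided_update(eps_uncond, eps_cond, scale=7.5):
    """Standard classifier-free guidance update direction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def project_out_negated(update, n):
    """Euclidean projection of `update` onto {u : <u, n> <= 0}."""
    n = n / n.norm()
    coef = torch.clamp((update * n).sum(), min=0.0)   # remove positive component only
    return update - coef * n
```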
cs.CV / 143 / 2603.06543

SurgFormer: Scalable Learning of Organ Deformation with Resection Support and Real-Time Inference

SurgFormer:支持切除的可扩展器官变形学习与实时推理
Shahbazi, Ashkan, Akbari, Elaheh, Pereira, Kyvia, Heiselman, Jon S., Benson, Annie C., Johnston, Garrison L. H., Wu, Jie Ying, Simaan, Nabil, Miga, Michael I., Kolouri, Soheil
Abstract
We introduce SurgFormer, a multiresolution gated transformer for data driven soft tissue simulation on volumetric meshes. High fidelity biomechanical solvers are often too costly for interactive use, so we train SurgFormer on solver generated data to predict nodewise displacement fields at near real time rates. SurgFormer builds a fixed mesh hierarchy and applies repeated multibranch blocks that combine local message passing, coarse global self attention, and pointwise feedforward updates, fused by learned per node, per channel gates to adaptively integrate local and long range information while remaining scalable on large meshes. For cut conditioned simulation, resection information is encoded as a learned cut embedding and provided as an additional input, enabling a unified model for both standard deformation prediction and topology altering cases. We also introduce two surgical simulation datasets generated under a unified protocol with XFEM based supervision: a cholecystectomy resection dataset and an appendectomy manipulation and resection dataset with cut and uncut cases. To our knowledge, this is the first learned volumetric surrogate setting to study XFEM supervised cut conditioned deformation within the same volumetric pipeline as standard deformation prediction. Across diverse baselines, SurgFormer achieves strong accuracy with favorable efficiency, making it a practical backbone for both tasks. {Code, data, and project page: \href{https://mint-vu.github.io/SurgFormer/}{available here}}
Chinese Translation
我们介绍了SurgFormer,一种用于基于体积网格的数据驱动软组织模拟的多分辨率门控变换器。高保真生物力学求解器通常成本过高,无法用于交互式应用,因此我们在求解器生成的数据上训练SurgFormer,以近实时的速度预测节点位移场。SurgFormer构建了一个固定的网格层次结构,并应用重复的多分支模块,这些模块结合了局部信息传递、粗略的全局自注意力和逐点前馈更新,通过学习的每节点、每通道门控融合,以自适应地整合局部和长程信息,同时在大网格上保持可扩展性。对于切除条件下的模拟,切除信息被编码为学习的切除嵌入,并作为额外输入提供,从而实现了一个统一的模型,适用于标准变形预测和拓扑改变的情况。我们还介绍了在统一协议下生成的两个外科手术模拟数据集,采用基于XFEM的监督:一个胆囊切除术切除数据集和一个阑尾切除操作与切除数据集,包含切除和未切除的案例。据我们所知,这是第一个学习的体积代理设置,用于在与标准变形预测相同的体积管道中研究XFEM监督的切除条件变形。在各种基准测试中,SurgFormer实现了强大的准确性和良好的效率,使其成为这两项任务的实用基础。代码、数据和项目页面:https://mint-vu.github.io/SurgFormer/
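The fusion step the abstract describes, learned per-node, per-channel gates over a local message-passing branch and a global attention branch, fits in a few lines; the sigmoid gate over concatenated features is an assumption, not SurgFormer's actual block:

```python
# Minimal per-node, per-channel gated fusion of two branches (illustrative,
# not the SurgFormer implementation).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, local_feat, global_feat):   # each (num_nodes, dim)
        # gate g is per node AND per channel, so each feature dimension can
        # independently prefer local or long-range information
        g = self.gate(torch.cat([local_feat, global_feat], dim=-1))
        return g * local_feat + (1.0 - g) * global_feat
```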
cs.CV / 144 / 2603.06544

Modeling and Measuring Redundancy in Multisource Multimodal Data for Autonomous Driving

自主驾驶中多源多模态数据冗余的建模与测量
Zhou, Yuhan, Sattari, Mehri, Chen, Haihua, Sha, Kewei
Abstract
Next-generation autonomous vehicles (AVs) rely on large volumes of multisource and multimodal ($M^2$) data to support real-time decision-making. In practice, data quality (DQ) varies across sources and modalities due to environmental conditions and sensor limitations, yet AV research has largely prioritized algorithm design over DQ analysis. This work focuses on redundancy as a fundamental but underexplored DQ issue in AV datasets. Using the nuScenes and Argoverse 2 (AV2) datasets, we model and measure redundancy in multisource camera data and multimodal image-LiDAR data, and evaluate how removing redundant labels affects the YOLOv8 object detection task. Experimental results show that selectively removing redundant multisource image object labels from cameras with shared fields of view improves detection. In nuScenes, mAP$_{50}$ improves from $0.66$ to $0.70$, from $0.64$ to $0.67$, and from $0.53$ to $0.55$ on three representative overlap regions, while detection on other overlapping camera pairs remains at the baseline even under stronger pruning. In AV2, $4.1$-$8.6\%$ of labels are removed, and mAP$_{50}$ stays near the $0.64$ baseline. Multimodal analysis also reveals substantial redundancy between image and LiDAR data. These findings demonstrate that redundancy is a measurable and actionable DQ factor with direct implications for AV performance. This work highlights the role of redundancy as a data quality factor in AV perception and motivates a data-centric perspective for evaluating and improving AV datasets. Code, data, and implementation details are publicly available at: https://github.com/yhZHOU515/RedundancyAD
Chinese Translation
下一代自主车辆(AV)依赖于大量的多源和多模态($M^2$)数据来支持实时决策。然而,实际中,由于环境条件和传感器限制,数据质量(DQ)在不同源和模态之间存在差异,而自主驾驶研究在很大程度上优先考虑算法设计而非数据质量分析。本研究将冗余作为自主驾驶数据集中一个基本但未充分探讨的数据质量问题进行研究。通过使用nuScenes和Argoverse 2(AV2)数据集,我们对多源相机数据和多模态图像-激光雷达数据中的冗余进行建模与测量,并评估去除冗余标签对YOLOv8目标检测任务的影响。实验结果表明,从具有共享视场的相机中选择性地去除冗余多源图像目标标签可以提高检测性能。在nuScenes数据集中,mAP$_{50}$在三个代表性重叠区域的提升分别为从$0.66$到$0.70$、$0.64$到$0.67$,以及从$0.53$到$0.55$,而其他重叠相机对的检测在更强的剪枝下仍保持在基线水平。在AV2数据集中,去除了$4.1$-$8.6\%$的标签,mAP$_{50}$保持在接近$0.64$的基线水平。多模态分析还揭示了图像与激光雷达数据之间存在显著冗余。这些发现表明,冗余是一个可测量且可操作的数据质量因素,对自主驾驶性能具有直接影响。本研究强调冗余作为自主驾驶感知中的数据质量因素的重要性,并激励从数据中心的视角来评估和改善自主驾驶数据集。代码、数据和实现细节可在以下网址公开获取:https://github.com/yhZHOU515/RedundancyAD
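As a toy illustration of what removing redundant multisource labels can mean operationally: when two cameras share a field of view, a label from one camera that overlaps a label from the other above an IoU threshold is dropped. The criterion below is a simplified stand-in for the paper's redundancy measure:

```python
# Toy redundancy pruning between two cameras with a shared field of view.
# A simplified stand-in; the paper's exact criterion may differ.
def iou(a, b):
    """Axis-aligned boxes (x1, y1, x2, y2) in a common frame."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def prune_redundant(labels_a, labels_b, thr=0.5):
    """Keep camera-B labels that no camera-A label already covers."""
    return [b for b in labels_b if all(iou(a, b) < thr for a in labels_a)]
```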
cs.CV / 145 / 2603.06561

EgoReasoner: Learning Egocentric 4D Reasoning via Task-Adaptive Structured Thinking

EgoReasoner:通过任务自适应结构化思维学习自我中心的4D推理
Zhu, Fangrui, Xi, Yunfeng, Ni, Jianmo, Cai, Mu, Gong, Boqing, Zhao, Long, Qu, Chen, Miao, Ian, Li, Yi, Zhong, Cheng, Jiang, Huaizu, Patel, Shwetak
Abstract
Egocentric video understanding is inherently complex due to the dynamic 4D nature of the environment, where camera motion and object displacements necessitate a continuous re-evaluation of spatial relations. In this work, we target a suite of under-explored egocentric 4D reasoning tasks, including fixture interaction counting, viewpoint-relative fixture location, object movement itinerary tracking, and stationary object localization, that require fundamentally different cognitive operations: spatial anchoring, temporal tracking, and duration reasoning. We observe that these structural differences make task-agnostic approaches insufficient: generic Chain-of-Thought methods lack task-appropriate reasoning primitives, and uniform reinforcement learning actively destabilizes performance on spatial tasks. To address this, we propose EgoReasoner, a two-stage framework that aligns both the reasoning scaffold and the reward signal to each task's cognitive structure. In the first stage, Task-Adaptive Thinking Templates guide the synthesis of structured CoT traces that teach the model to reason adaptively across task types via supervised fine-tuning. In the second stage, task-aware reward functions verify entity grounding, temporal alignment, and task-adaptive logical consistency, selectively strengthening each reasoning pathway via reinforcement fine-tuning with GRPO. Our 3B-parameter model, trained on only 16K samples, achieves 37.5% average accuracy on the challenging HD-EPIC benchmark, surpassing Qwen2.5-VL-7B (25.7%) by over 10 points.
Chinese Translation
自我中心的视频理解由于环境的动态4D特性而本质上十分复杂,其中相机运动和物体位移需要对空间关系进行持续的重新评估。在本研究中,我们针对一系列尚未充分探索的自我中心4D推理任务,包括固定物体交互计数、视角相对的固定物体位置、物体运动路径跟踪和静态物体定位,这些任务需要根本不同的认知操作:空间锚定、时间跟踪和持续时间推理。我们观察到,这些结构差异使得任务无关的方法不足:通用的链式思维(Chain-of-Thought)方法缺乏适合任务的推理原语,而统一的强化学习在空间任务上会主动导致性能不稳定。为了解决这个问题,我们提出了EgoReasoner,一个两阶段框架,将推理结构和奖励信号与每个任务的认知结构对齐。在第一阶段,任务自适应思维模板引导结构化链式思维(CoT)轨迹的合成,教会模型通过监督微调在任务类型之间自适应推理。在第二阶段,任务感知的奖励函数验证实体锚定、时间对齐和任务自适应的逻辑一致性,通过与GRPO的强化微调选择性地增强每个推理路径。我们的3B参数模型仅在16K样本上训练,在具有挑战性的HD-EPIC基准测试中达到了37.5%的平均准确率,以超过10个百分点的优势领先于Qwen2.5-VL-7B(25.7%)。
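The second-stage optimizer, GRPO, centers on a group-relative advantage: several responses are sampled per prompt and each reward is normalized by the group's statistics, removing the need for a learned value function. A minimal sketch of that advantage computation (the clipped policy-gradient update itself is omitted):

```python
# Group-relative advantage at the heart of GRPO: sample G responses per
# prompt, then normalize each reward by the group mean and std.
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (G,) scalar rewards for G sampled responses, G > 1."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```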
cs.CV / 146 / 2603.06569

Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

企鹅视觉语言模型(Penguin-VL):探索基于大型语言模型(LLM)的视觉编码器的视觉语言模型(VLM)效率极限
Zhang, Boqiang, Ke, Lei, Yang, Ruihan, Gao, Qi, Qu, Tianyuan, Chen, Rossell, Yu, Dong, Leoweiliang
Abstract
Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse and category-level invariances that suppress fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher degree of visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation rather than model scaling is the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL
Chinese Translation
视觉语言模型(VLM)的发展在很大程度上依赖于模型规模的扩大,这限制了其在计算资源受限的移动和边缘设备(如智能手机和机器人)上的部署。在本研究中,我们探讨了紧凑型(例如,2B和8B)VLM的性能极限。我们挑战了当前的普遍做法,即最先进的VLM必须依赖于通过大规模对比预训练(例如,CLIP/SigLIP)初始化的视觉编码器。我们识别出一个目标不匹配的问题:对比学习优化用于区分,强制施加粗略和类别级别的不变性,从而抑制了密集标注和复杂VLM推理所需的细粒度视觉线索。为了解决这一问题,我们提出了企鹅视觉语言模型(Penguin-VL),其视觉编码器是从仅文本的LLM初始化的。我们的实验表明,Penguin-Encoder作为传统对比预训练的优越替代方案,解锁了更高的视觉保真度和数据效率,促进了多模态理解。在各种图像和视频基准测试中,Penguin-VL在数学推理方面的表现与领先的VLM(例如,Qwen3-VL)相当,并在文档理解、视觉知识和多视角视频理解等任务中超越了它们。值得注意的是,这些提升是在轻量级架构下实现的,表明改进的视觉表征而非模型规模是性能的主要驱动因素。我们的消融实验显示,Penguin-Encoder始终优于对比预训练的编码器,保留了对密集感知和复杂推理至关重要的细粒度空间和时间线索。这使其成为计算高效的VLM的强大替代方案,并在资源受限的环境中实现高性能。代码链接:https://github.com/tencent-ailab/Penguin-VL
cs.CV / 147 / 2603.06570

SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning

SUREON:一种用于外科推理的基准和视觉-语言模型
Perez, Alejandra, Rau, Anita, White, Lee, Mlambo, Busisiwe, Nwoye, Chinedu, Jamal, Muhammad Abdullah, Mohareri, Omid
Abstract
Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this -- explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large-scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to extract and structure supervision at scale. Across 134.7K clips and 170 procedure types, SUREON yields 206.8k QA pairs and an expert-validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization. Both models can answer complex questions about surgery and substantially outperform larger general-domain models, exceeding 84% accuracy on the SUREON benchmark while outperforming general-domain models on standard surgical perception tasks. Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.
Chinese Translation
外科医生不仅仅是观察——他们还进行解读。当专家观察外科场景时,他们不仅理解正在使用什么工具,还理解为什么选择该工具、它带来了什么风险以及接下来会发生什么。目前的外科人工智能无法回答这些问题,主要是因为明确编码外科推理的训练数据在规模上极难进行标注。然而,外科视频讲座中正好包含了这些内容——由专家叙述的意图、理由和预期的解释,目的是为了教学。尽管这些叙述本质上是嘈杂且无结构的,但它们编码了外科人工智能目前所缺乏的推理能力。我们引入了SUREON,一个大规模视频问答数据集,系统性地从外科学术视频中提取这一训练信号。SUREON定义了12个问题类别,涵盖安全评估、决策理由和预测,并使用多代理管道在规模上提取和结构化监督。在134.7K个片段和170种手术类型中,SUREON生成了206.8K个问答对,并提供了354个经过专家验证的基准示例。为了评估这种监督在多大程度上转化为外科推理能力,我们引入了两个模型:SureonVLM,一个通过监督微调适应的视觉-语言模型,以及SureonVLM-R1,一个通过群体相对策略优化训练的推理模型。这两个模型能够回答有关外科手术的复杂问题,并显著优于更大的通用领域模型,在SUREON基准上超过84%的准确率,同时在标准外科感知任务中超越通用领域模型。对SureonVLM-R1的定性分析揭示了明确的推理行为,例如从视觉上下文推断手术意图。
cs.CV / 148 / 2603.06572

SCOPE: Scene-Contextualized Incremental Few-Shot 3D Segmentation

SCOPE:场景上下文增量少样本3D分割
Thengane, Vishal, An, Zhaochong, Huang, Tianjin, Phung, Son Lam, Bouzerdoum, Abdesselam, Yin, Lu, Zhao, Na, Zhu, Xiatian
Abstract
Incremental Few-Shot (IFS) segmentation aims to learn new categories over time from only a few annotations. Although widely studied in 2D, it remains underexplored for 3D point clouds. Existing methods suffer from catastrophic forgetting or fail to learn discriminative prototypes under sparse supervision, and often overlook a key cue: novel categories frequently appear as unlabelled background in base-training scenes. We introduce SCOPE (Scene-COntextualised Prototype Enrichment), a plug-and-play background-guided prototype enrichment framework that integrates with any prototype-based 3D segmentation method. After base training, a class-agnostic segmentation model extracts high-confidence pseudo-instances from background regions to build a prototype pool. When novel classes arrive with few labelled samples, relevant background prototypes are retrieved and fused with few-shot prototypes to form enriched representations without retraining the backbone or adding parameters. Experiments on ScanNet and S3DIS show that SCOPE achieves SOTA performance, improving novel-class IoU by up to 6.98% and 3.61%, and mean IoU by 2.25% and 1.70%, respectively, while maintaining low forgetting. Code is available at https://github.com/Surrey-UP-Lab/SCOPE.
Chinese Translation
增量少样本(IFS)分割旨在通过少量标注随时间学习新类别。尽管在2D领域得到了广泛研究,但在3D点云中仍然未得到充分探索。现有方法面临灾难性遗忘,或在稀疏监督下无法学习到具有区分性的原型,并且常常忽视一个关键线索:新类别通常在基础训练场景中作为未标记背景出现。我们提出了SCOPE(场景上下文原型增强),这是一个即插即用的背景引导原型增强框架,可以与任何基于原型的3D分割方法集成。在基础训练后,一个类无关的分割模型从背景区域提取高置信度伪实例,以建立原型池。当新类别以少量标记样本到来时,相关的背景原型被检索并与少样本原型融合,以形成增强的表示,而无需重新训练主干网络或增加参数。在ScanNet和S3DIS上的实验表明,SCOPE达到了最先进的性能,新的类别IoU提高了最多6.98%和3.61%,平均IoU提高了2.25%和1.70%,同时保持了低遗忘率。代码可在https://github.com/Surrey-UP-Lab/SCOPE获取。
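The enrichment step can be pictured as retrieval plus fusion: given a few-shot prototype for a novel class, pull the most similar pseudo-instance prototypes from the background pool and blend them in. The cosine retrieval and convex blend below are illustrative choices, not SCOPE's exact rule:

```python
# Retrieval-plus-fusion sketch of background-guided prototype enrichment.
# The top-k cosine retrieval and the alpha blend are assumptions.
import torch
import torch.nn.functional as F

def enrich_prototype(fs_proto, pool, k=5, alpha=0.5):
    """fs_proto: (D,) few-shot prototype; pool: (M, D) background prototypes."""
    sims = F.cosine_similarity(pool, fs_proto.unsqueeze(0), dim=-1)   # (M,)
    retrieved = pool[sims.topk(min(k, pool.shape[0])).indices].mean(dim=0)
    return alpha * fs_proto + (1.0 - alpha) * retrieved
```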
cs.CV / 149 / 2603.06576

BEVLM: Distilling Semantic Knowledge from LLMs into Bird's-Eye View Representations

BEVLM:从大型语言模型中提取语义知识并转化为鸟瞰图表示
Monninger, Thomas, Xie, Shaoyuan, Chen, Qi Alfred, Ding, Sihao
Abstract
The integration of Large Language Models (LLMs) into autonomous driving has attracted growing interest for their strong reasoning and semantic understanding abilities, which are essential for handling complex decision-making and long-tail scenarios. However, existing methods typically feed LLMs with tokens from multi-view and multi-frame images independently, leading to redundant computation and limited spatial consistency. This separation in visual processing hinders accurate 3D spatial reasoning and fails to maintain geometric coherence across views. On the other hand, Bird's-Eye View (BEV) representations learned from geometrically annotated tasks (e.g., object detection) provide spatial structure but lack the semantic richness of foundation vision encoders. To bridge this gap, we propose BEVLM, a framework that connects a spatially consistent and semantically distilled BEV representation with LLMs. Through extensive experiments, we show that BEVLM enables LLMs to reason more effectively in cross-view driving scenes, improving accuracy by 46% by leveraging BEV features as unified inputs. Furthermore, by distilling semantic knowledge from LLMs into BEV representations, BEVLM significantly improves closed-loop end-to-end driving performance by 29% in safety-critical scenarios.
Chinese Translation
大型语言模型(LLMs)在自动驾驶中的应用引起了越来越多的关注,因为它们强大的推理和语义理解能力对于处理复杂决策和长尾场景至关重要。然而,现有方法通常独立地将来自多视角和多帧图像的标记输入到LLMs中,这导致了冗余计算和有限的空间一致性。这种视觉处理的分离妨碍了准确的三维空间推理,并未能在不同视角之间保持几何一致性。另一方面,从几何标注任务(例如,物体检测)中学习到的鸟瞰图(BEV)表示提供了空间结构,但缺乏基础视觉编码器的语义丰富性。为了解决这一问题,我们提出了BEVLM,一个将空间一致且经过语义提炼的BEV表示与LLMs连接的框架。通过大量实验,我们表明BEVLM使得LLMs在跨视角驾驶场景中能够更有效地进行推理,准确率提高了46%,通过将BEV特征作为统一输入。此外,通过将语义知识从LLMs提炼到BEV表示中,BEVLM在安全关键场景下显著提高了闭环端到端驾驶性能,提升幅度达到29%。
cs.CV / 150 / 2603.06577

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

全方位扩散:基于掩码的离散扩散模型的统一多模态理解与生成
Li, Lijiang, Long, Zuwei, Shen, Yunhang, Gao, Heting, Cao, Haoyu, Sun, Xing, Shan, Caifeng, He, Ran, Fu, Chaoyou
Abstract
While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion models, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.
Chinese Translation
尽管近期的多模态大语言模型(MLLMs)取得了显著进展,但它们主要采用传统的自回归架构作为基础,仍有较大的空间探索有效且高效的架构设计替代方案。同时,近期研究成功地将离散扩散模型应用于视觉理解和图像生成等多个领域,揭示了其作为多模态系统有前景的基础架构的巨大潜力。受到这些开创性研究的启发,我们提出了Omni-Diffusion,这是第一个完全基于掩码的离散扩散模型构建的任意对任意多模态语言模型,统一了文本、语音和图像的理解与生成。Omni-Diffusion采用统一的基于掩码的离散扩散模型,直接捕捉离散多模态标记的联合分布。这种方法不仅支持双模态任务,还支持涉及多种模态的更复杂场景。在一系列多样的基准测试中,我们的方法在性能上超越或与现有处理两种或更多模态的多模态系统持平,突显了扩散模型在推动下一代多模态基础模型中的重要前景。项目网页:https://omni-diffusion.github.io。
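A mask-based discrete diffusion step is compact to state: sample a masking ratio, corrupt that fraction of token ids with a [MASK] id, and train the model to recover the originals with cross-entropy on masked positions only. The sketch below assumes a generic token-level model; it is not Omni-Diffusion's implementation:

```python
# One mask-based discrete diffusion training step (generic sketch).
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_id):
    """tokens: (B, L) discrete ids spanning text / image / speech codebooks."""
    B, L = tokens.shape
    t = torch.rand(B, 1)                          # per-sample masking ratio
    masked = torch.rand(B, L) < t                 # positions to corrupt
    xt = tokens.masked_fill(masked, mask_id)      # forward (masking) process
    logits = model(xt)                            # (B, L, vocab_size)
    return F.cross_entropy(logits[masked], tokens[masked])
```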
cs.CV / 151 / 2603.06578

Multimodal Large Language Models as Image Classifiers

多模态大型语言模型作为图像分类器
Kisel, Nikita, Volkov, Illia, Janouskova, Klara, Matas, Jiri
Abstract
The classification performance of Multimodal Large Language Models (MLLMs) depends critically on evaluation protocol and ground truth quality. Studies comparing MLLMs with supervised and vision-language models report conflicting conclusions, and we show these conflicts stem from protocols that either inflate or underestimate performance. Across the most common evaluation protocols, we identify and fix key issues: model outputs that fall outside the provided class list and are discarded, inflated results from weak multiple-choice distractors, and an open-world setting that underperforms only due to poor output mapping. We additionally quantify the impact of commonly overlooked design choices - batch size, image ordering, and text encoder selection - showing they substantially affect accuracy. Evaluating on ReGT, our multilabel reannotation of 625 ImageNet-1k classes, reveals that MLLMs benefit most from corrected labels (up to +10.8%), substantially narrowing the perceived gap with supervised models. Much of the reported MLLM underperformance on classification is thus an artifact of noisy ground truth and flawed evaluation protocols rather than genuine model deficiency. Models less reliant on supervised training signals prove most sensitive to annotation quality. Finally, we show that MLLMs can assist human annotators: in a controlled case study, annotators confirmed or integrated MLLM predictions in approximately 50% of difficult cases, demonstrating their potential for large-scale dataset curation.
Chinese Translation
多模态大型语言模型(MLLM)的分类性能在很大程度上依赖于评估协议和真实标签的质量。比较MLLM与监督学习和视觉-语言模型的研究报告了相互矛盾的结论,我们展示了这些矛盾源于膨胀或低估性能的协议。在最常见的评估协议中,我们识别并修复了关键问题:模型输出超出提供的类别列表并被丢弃、来自弱多项选择干扰项的膨胀结果,以及由于输出映射不佳而导致的开放世界设置表现不佳。我们还量化了常被忽视的设计选择的影响——批量大小、图像排序和文本编码器选择——显示这些因素对准确性有显著影响。在ReGT(我们对625个ImageNet-1k类别的多标签重新标注)上的评估表明,MLLM最受益于修正标签(最高可达+10.8%),显著缩小了与监督模型的感知差距。因此,报告的MLLM在分类上的低性能在很大程度上是噪声真实标签和缺陷评估协议的产物,而非模型本身的缺陷。对监督训练信号依赖较少的模型对标注质量最为敏感。最后,我们展示了MLLM可以辅助人类标注者:在一个受控案例研究中,标注者在大约50%的困难案例中确认或整合了MLLM的预测,展示了其在大规模数据集策划中的潜力。
人工智能 (Artificial Intelligence)
16
cs.AI / 1 / 2603.05522

RoboLayout: Differentiable 3D Scene Generation for Embodied Agents

RoboLayout:面向具身智能体的可微分3D场景生成
Shamsaddinlou, Ali
Abstract
Recent advances in vision language models (VLMs) have shown strong potential for spatial reasoning and 3D scene layout generation from open-ended language instructions. However, generating layouts that are not only semantically coherent but also feasible for interaction by embodied agents remains challenging, particularly in physically constrained indoor environments. In this paper, RoboLayout is introduced as an extension of LayoutVLM that augments the original framework with agent-aware reasoning and improved optimization stability. RoboLayout integrates explicit reachability constraints into a differentiable layout optimization process, enabling the generation of layouts that are navigable and actionable by embodied agents. Importantly, the agent abstraction is not limited to a specific robot platform and can represent diverse entities with distinct physical capabilities, such as service robots, warehouse robots, humans of different age groups, or animals, allowing environment design to be tailored to the intended agent. In addition, a local refinement stage is proposed that selectively reoptimizes problematic object placements while keeping the remainder of the scene fixed, improving convergence efficiency without increasing global optimization iterations. Overall, RoboLayout preserves the strong semantic alignment and physical plausibility of LayoutVLM while enhancing applicability to agent-centric indoor scene generation, as demonstrated by experimental results across diverse scene configurations.
Chinese Translation
最近在视觉语言模型(VLMs)方面的进展显示出其在空间推理和从开放式语言指令生成3D场景布局方面的强大潜力。然而,生成不仅语义连贯而且适合具身智能体交互的布局仍然具有挑战性,特别是在物理受限的室内环境中。本文介绍了RoboLayout,作为LayoutVLM的扩展,增强了原始框架的智能体感知推理和优化稳定性。RoboLayout将显式可达性约束集成到可微分布局优化过程中,使得生成的布局可供具身智能体导航和操作。重要的是,智能体抽象并不限于特定的机器人平台,可以表示具有不同物理能力的多样实体,如服务机器人、仓库机器人、不同年龄段的人类或动物,从而使环境设计能够针对预期的智能体进行定制。此外,提出了一种局部优化阶段,选择性地重新优化问题对象的放置,同时保持场景的其余部分固定,提高了收敛效率而不增加全局优化迭代次数。总体而言,RoboLayout保持了LayoutVLM强大的语义一致性和物理合理性,同时增强了其在以智能体为中心的室内场景生成中的适用性,实验结果在多样的场景配置中得到了验证。
cs.AI / 2 / 2603.05614

Real-Time AI Service Economy: A Framework for Agentic Computing Across the Continuum

实时人工智能服务经济:跨连续体的代理计算框架
Lovén, Lauri, Saleh, Alaa, Farahani, Reza, Murturi, Ilir, López, Miguel Bordallo, Donta, Praveen Kumar, Dustdar, Schahram
Abstract
Real-time AI services increasingly operate across the device-edge-cloud continuum, where autonomous AI agents generate latency-sensitive workloads, orchestrate multi-stage processing pipelines, and compete for shared resources under policy and governance constraints. This article shows that the structure of service-dependency graphs, modelled as DAGs whose nodes represent compute stages and whose edges encode execution ordering, is a primary determinant of whether decentralised, price-based resource allocation can work reliably at scale. When dependency graphs are hierarchical (tree or series-parallel), prices converge to stable equilibria, optimal allocations can be computed efficiently, and under appropriate mechanism design (with quasilinear utilities and discrete slice items), agents have no incentive to misreport their valuations within each decision epoch. When dependencies are more complex, with cross-cutting ties between pipeline stages, prices oscillate, allocation quality degrades, and the system becomes difficult to manage. To bridge this gap, we propose a hybrid management architecture in which cross-domain integrators encapsulate complex sub-graphs into resource slices that present a simpler, well-structured interface to the rest of the market. A systematic ablation study across six experiments (1,620 runs, 10 seeds each) confirms that (i) dependency-graph topology is a first-order determinant of price stability and scalability, (ii) the hybrid architecture reduces price volatility by up to 70-75% without sacrificing throughput, (iii) governance constraints create quantifiable efficiency-compliance trade-offs that depend jointly on topology and load, and (iv) under truthful bidding the decentralised market matches a centralised value-optimal baseline, confirming that decentralised coordination can replicate centralised allocation quality.
Chinese Translation
实时人工智能服务越来越多地在设备-边缘-云的连续体中运行,其中自主人工智能代理生成延迟敏感的工作负载,协调多阶段处理管道,并在政策和治理约束下竞争共享资源。本文表明,服务依赖图的结构(建模为有向无环图(DAG),其节点代表计算阶段,边表示执行顺序)是去中心化、基于价格的资源分配能否在规模上可靠运行的主要决定因素。当依赖图是层次化的(树形或串并联),价格会收敛到稳定均衡,能够高效计算最佳分配,并且在适当的机制设计下(具有准线性效用和离散切片项目),代理在每个决策时刻没有动机虚报其估值。当依赖关系更复杂,管道阶段之间存在交叉联系时,价格会波动,分配质量下降,系统变得难以管理。为了解决这一问题,我们提出了一种混合管理架构,其中跨域集成器将复杂子图封装为资源切片,向市场的其余部分呈现更简单、结构良好的接口。通过六个实验(1,620次运行,每个实验10个种子)的系统消融研究证实了以下几点:(i)依赖图拓扑是价格稳定性和可扩展性的一级决定因素;(ii)混合架构在不牺牲吞吐量的情况下将价格波动降低了70-75%;(iii)治理约束创造了可量化的效率-合规权衡,这取决于拓扑和负载的共同影响;(iv)在真实出价的情况下,去中心化市场与集中化的价值最优基准相匹配,确认了去中心化协调能够复制集中化分配质量。
cs.AI / 3 / 2603.05706

Reasoning Models Struggle to Control their Chains of Thought

推理模型在控制其思维链方面面临挑战
Yueh-Han, Chen, McCarthy, Robert, Lee, Bruce W., He, He, Kivlichan, Ian, Baker, Bowen, Carroll, Micah, Korbak, Tomek
Abstract
Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in their CoT, it could undermine CoT monitorability. To measure this undesirable capability -- CoT controllability -- we introduce the CoT-Control evaluation suite, which includes tasks that require models to solve problems while adhering to CoT instructions, e.g., reasoning about a genetics question without using the word 'chromosome'. We show that reasoning models possess significantly lower CoT controllability than output controllability; for instance, Claude Sonnet 4.5 can control its CoT only 2.7% of the time but 61.9% when controlling its final output. We also find that CoT controllability is higher for larger models and decreases with more RL training, test-time compute, and increased problem difficulty. CoT controllability failures extend even to situations in which models are given incentives (as opposed to direct requests) to evade CoT monitors, although models exhibit slightly higher controllability when they are told they are being monitored. Similarly, eliciting controllability by adversarially optimizing prompts does not meaningfully increase controllability. Our results leave us cautiously optimistic that CoT controllability is currently unlikely to be a failure mode of CoT monitorability. However, the mechanism behind low controllability is not well understood. Given its importance for maintaining CoT monitorability, we recommend that frontier labs track CoT controllability in future models.
Chinese Translation
思维链(Chain-of-thought, CoT)监控是一种有前景的工具,用于检测现代推理模型的不当行为并理解其动机。然而,如果模型能够控制其在思维链中所表达的内容,这可能会削弱思维链的可监控性。为了衡量这种不理想的能力——思维链可控性(CoT controllability),我们引入了CoT-Control评估套件,其中包括要求模型在遵循思维链指令的情况下解决问题的任务,例如,在不使用“染色体”(chromosome)一词的情况下推理关于遗传学的问题。我们展示了推理模型的思维链可控性显著低于输出可控性;例如,Claude Sonnet 4.5在控制其思维链时仅能做到2.7%,而在控制其最终输出时则为61.9%。我们还发现,思维链可控性在较大模型中较高,并且随着更多的强化学习训练、测试时计算和问题难度的增加而降低。思维链可控性失败甚至扩展到模型在面对激励(而非直接请求)以规避思维链监控的情况下,尽管当模型被告知其正在被监控时,其可控性略有提高。同样,通过对抗性优化提示来引发可控性并未显著提高可控性。我们的结果让我们谨慎乐观地认为,思维链可控性目前不太可能成为思维链可监控性的失败模式。然而,低可控性背后的机制尚不清楚。鉴于其对维持思维链可监控性的重要性,我们建议前沿实验室在未来的模型中跟踪思维链可控性。
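The headline metric admits a very simple operationalization: a trial counts as controlled only if the chain of thought obeys the instruction (here, avoiding a forbidden word) while the final answer remains correct. A toy check in that spirit; the suite's actual scoring is richer than this:

```python
# Toy scoring rule: a trial is "controlled" only if the chain of thought
# avoids the forbidden word AND the final answer matches the gold answer.
def cot_controlled(cot: str, answer: str, forbidden: str, gold: str) -> bool:
    return forbidden.lower() not in cot.lower() and answer.strip() == gold.strip()
```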
cs.AI / 4 / 2603.05860

Evolving Medical Imaging Agents via Experience-driven Self-skill Discovery

通过经验驱动的自我技能发现演化医学影像代理
Fan, Lin, Dai, Pengyu, Deng, Zhipeng, Wang, Haolin, Gong, Xun, Zheng, Yefeng, Ou, Yafei
Abstract
Clinical image interpretation is inherently multi-step and tool-centric: clinicians iteratively combine visual evidence with patient context, quantify findings, and refine their decisions through a sequence of specialized procedures. While LLM-based agents promise to orchestrate such heterogeneous medical tools, existing systems treat tool sets and invocation strategies as static after deployment. This design is brittle under real-world domain shifts, task changes, and evolving diagnostic requirements, where predefined tool chains frequently degrade and demand costly manual redesign. We propose MACRO, a self-evolving, experience-augmented medical agent that shifts from static tool composition to experience-driven tool discovery. From verified execution trajectories, the agent autonomously identifies recurring effective multi-step tool sequences, synthesizes them into reusable composite tools, and registers these as new high-level primitives that continuously expand its behavioral repertoire. A lightweight image-feature memory grounds tool selection in a visual-clinical context, while a GRPO-like training loop reinforces reliable invocation of discovered composites, enabling closed-loop self-improvement with minimal supervision. Extensive experiments across diverse medical imaging datasets and tasks demonstrate that autonomous composite tool discovery consistently improves multi-step orchestration accuracy and cross-domain generalization over strong baselines and recent state-of-the-art agentic methods, bridging the gap between brittle static tool use and adaptive, context-aware clinical AI assistance. Code will be available upon acceptance.
Chinese Translation
临床图像解读本质上是多步骤和工具中心的:临床医生迭代地将视觉证据与患者背景相结合,量化发现,并通过一系列专业程序来完善他们的决策。虽然基于大型语言模型(LLM)的代理有望协调这些异构的医疗工具,但现有系统在部署后将工具集和调用策略视为静态。这种设计在现实世界的领域变化、任务间的转变以及不断演变的诊断需求下显得脆弱,预定义的工具链常常会降级并需要昂贵的手动重新设计。我们提出了MACRO,一个自我演化的、经验增强的医疗代理,它从静态工具组合转向经验驱动的工具发现。通过验证的执行轨迹,该代理自主识别重复有效的多步骤工具序列,将其合成可重用的复合工具,并将这些工具登记为新的高层原语,持续扩展其行为库。一个轻量级的图像特征记忆将工具选择基于视觉临床背景,而类似GRPO的训练循环则强化了对发现的复合工具的可靠调用,使得在最小监督下实现闭环自我改进。在多样的医学影像数据集和任务上的广泛实验表明,自主复合工具发现始终提高了多步骤协调的准确性和跨领域的泛化能力,超越了强基线和近期最先进的代理方法,弥合了脆弱的静态工具使用与自适应、上下文感知的临床人工智能辅助之间的差距。代码将在接受后提供。
cs.AI / 5 / 2603.05910

The World Won't Stay Still: Programmable Evolution for Agent Benchmarks

世界不会静止:用于智能体基准的可编程演化
Li, Guangrui, Xie, Yaochen, Liu, Yi, Dong, Ziwei, Pan, Xingyuan, Zheng, Tianqi, Choi, Jason, Morais, Michael J., Jha, Binit, Mishra, Shaunak, Zhou, Bingrou, Luo, Chen, Cheng, Monica Xiao, Song, Dawn
Abstract
LLM-powered agents fulfill user requests by interacting with environments, querying data, and invoking tools in a multi-turn process. Yet, most existing benchmarks assume static environments with fixed schemas and toolsets, neglecting the evolutionary nature of the real world and agents' robustness to environmental changes. In this paper, we study a crucial problem: how to evolve the agent environment in a scalable and controllable way, thereby better evaluating agents' adaptability to real-world dynamics. We propose ProEvolve, a graph-based framework that makes environment evolution programmable. At its core, a typed relational graph provides a unified, explicit representation of the environment: data, tools, and schema. Under this formalism, adding, removing, or modifying capabilities are expressed as graph transformations that coherently propagate updates across tools, schemas, and data access. Building on this, ProEvolve can (1) program the evolutionary dynamics as graph transformations to generate environments automatically, and (2) instantiate task sandboxes via subgraph sampling and programming. We validate ProEvolve by evolving a single environment into 200 environments and 3,000 task sandboxes, and benchmark representative agents accordingly.
Chinese Translation
基于大型语言模型(LLM)的智能体通过与环境互动、查询数据和调用工具来满足用户请求,这一过程通常是多轮的。然而,大多数现有基准假设环境是静态的,具有固定的模式和工具集,忽视了现实世界的演变特性以及智能体对环境变化的适应能力。在本文中,我们研究了一个关键问题:如何以可扩展和可控的方式演化智能体环境,从而更好地评估智能体对现实世界动态的适应性。我们提出了ProEvolve,一个基于图的框架,使环境演化可编程。其核心是一个类型化的关系图,提供了环境(数据、工具和模式)的统一、明确的表示。在这一形式下,添加、删除或修改能力被表达为图的变换,这些变换能够一致地传播更新到工具、模式和数据访问。基于此,ProEvolve可以(1)将演化动态编程为图变换,以自动生成环境,以及(2)通过子图采样和编程实例化任务沙箱。我们通过将单一环境演化为200个环境和3,000个任务沙箱来验证ProEvolve,并相应地对代表性智能体进行基准测试。
cs.AI / 6 / 2603.05912

DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

DeepFact:深度研究事实性的共同演化基准与代理
Huang, Yukun, Ribeiro, Leonardo F. R., Hardalov, Momchil, Dhingra, Bhuwan, Dreyer, Markus, Saligrama, Venkatesh
Abstract
Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers are primarily designed for general-domain, factoid-style atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark is itself difficult. We first show that static expert-labeled benchmarks are brittle in this setting: in a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on a hidden micro-gold set of verifiable claims. We propose Evolving Benchmarking via Audit-then-Score (AtS), where benchmark labels and rationales are explicitly revisable: when a verifier disagrees with the current benchmark, it must submit evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before models are scored. Across four AtS rounds, expert micro-gold accuracy rises to 90.9%, indicating experts are substantially more reliable as auditors than as one-shot labelers. We instantiate AtS as DeepFact-Bench, a versioned DRR factuality benchmark with auditable rationales, and DeepFact-Eval, a document-level verification agent (with a grouped lite variant) that outperforms existing verifiers on DeepFact-Bench and transfers well to external factuality datasets.
Chinese Translation
增强搜索的LLM代理能够生成深度研究报告(DRRs),但验证声明级事实性仍然具有挑战性。现有的事实检查工具主要针对一般领域的事实型原子声明设计,并且没有基准可以测试这些验证工具是否适用于DRRs。然而,构建这样的基准本身就是困难的。我们首先展示了静态专家标注的基准在这种情况下是脆弱的:在一项针对博士级专家的控制研究中,未辅助的专家在一个隐藏的可验证声明微金标准集上仅达到60.8%的准确率。我们提出了通过审核后评分(Audit-then-Score, AtS)进行演变基准的方法,其中基准标签和理由是可以明确修订的:当验证者不同意当前基准时,必须提交证据;审计员裁定争议;并且接受的修订在模型评分之前更新基准。在四轮AtS中,专家微金标准的准确率上升至90.9%,表明专家作为审计员的可靠性远高于作为一次性标注者的可靠性。我们将AtS实例化为DeepFact-Bench,这是一个具有可审计理由的版本化DRR事实性基准,以及DeepFact-Eval,这是一个文档级验证代理(带有分组轻量变体),在DeepFact-Bench上表现优于现有验证工具,并且能够很好地迁移到外部事实性数据集。
cs.AI / 7 / 2603.05980

An Interactive Multi-Agent System for Evaluation of New Product Concepts

用于新产品概念评估的交互式多代理系统
Xuan, Bin, Ai, Ruo, Lee, Hakyeon
Abstract
Product concept evaluation is a critical stage that determines strategic resource allocation and project success in enterprises. However, traditional expert-led approaches face limitations such as subjective bias and high time and cost requirements. To support this process, this study proposes an automated approach utilizing a large language model (LLM)-based multi-agent system (MAS). Through a systematic analysis of previous research on product development and team collaboration, this study established two primary evaluation dimensions, namely technical feasibility and market feasibility. The proposed system consists of a team of eight virtual agents representing specialized domains such as R&D and marketing. These agents use retrieval-augmented generation (RAG) and real-time search tools to gather objective evidence and validate concepts through structured deliberations based on the established criteria. The agents were further fine-tuned using professional product review data to enhance their judgment accuracy. A case study involving professional display monitor concepts demonstrated that the system's evaluation rankings were consistent with those of senior industry experts. These results confirm the usability of the proposed multi-agent-based evaluation approach for supporting product development decisions.
Chinese Translation
产品概念评估是决定企业战略资源分配和项目成功的关键阶段。然而,传统的专家主导方法面临主观偏见以及高时间和成本要求等局限性。为支持这一过程,本研究提出了一种基于大型语言模型(LLM)的自动化多代理系统(MAS)方法。通过对以往产品开发和团队协作研究的系统分析,本研究建立了两个主要评估维度,即技术可行性和市场可行性。所提出的系统由八个虚拟代理组成,代表研发和市场等专业领域。这些代理使用检索增强生成(RAG)和实时搜索工具收集客观证据,并通过基于既定标准的结构化讨论来验证概念。代理还利用专业产品评审数据进行了进一步微调,以提高其判断准确性。一项涉及专业显示器概念的案例研究表明,该系统的评估排名与行业资深专家的评估结果一致。这些结果确认了所提出的基于多代理的评估方法在支持产品开发决策中的可用性。
cs.AI / 8 / 2603.06064

Agentic LLM Planning via Step-Wise PDDL Simulation: An Empirical Characterisation

通过逐步 PDDL 模拟实现的自主 LLM 规划:实证特征化
Göbel, Kai, Lorang, Pierrick, Zips, Patrik, Glück, Tobias
Abstract
Task planning, the problem of sequencing actions to reach a goal from an initial state, is a core capability requirement for autonomous robotic systems. Whether large language models (LLMs) can serve as viable planners alongside classical symbolic methods remains an open question. We present PyPDDLEngine, an open-source Planning Domain Definition Language (PDDL) simulation engine that exposes planning operations as LLM tool calls through a Model Context Protocol (MCP) interface. Rather than committing to a complete action sequence upfront, the LLM acts as an interactive search policy that selects one action at a time, observes each resulting state, and can reset and retry. We evaluate four approaches on 102 International Planning Competition (IPC) Blocksworld instances under a uniform 180-second budget: Fast Downward lama-first and seq-sat-lama-2011 as classical baselines, direct LLM planning (Claude Haiku 4.5), and agentic LLM planning via PyPDDLEngine. Fast Downward achieves 85.3% success. The direct and agentic LLM approaches achieve 63.7% and 66.7%, respectively, a consistent but modest three-percentage-point advantage for the agentic approach at $5.7\times$ higher token cost per solution. Across most co-solved difficulty blocks, both LLM approaches produce shorter plans than seq-sat-lama-2011 despite its iterative quality improvement, a result consistent with training-data recall rather than generalisable planning. These results suggest that agentic gains depend on the nature of environmental feedback. Coding agents benefit from externally grounded signals such as compiler errors and test failures, whereas PDDL step feedback is self-assessed, leaving the agent to evaluate its own progress without external verification.
Chinese Translation
任务规划,即从初始状态到达目标的动作序列问题,是自主机器人系统的核心能力要求。大型语言模型(LLMs)是否能够作为可行的规划者与经典符号方法并存,仍然是一个悬而未决的问题。我们提出了 PyPDDLEngine,这是一个开源的规划领域定义语言(PDDL)模拟引擎,通过模型上下文协议(MCP)接口将规划操作暴露为 LLM 工具调用。LLM 并不是一次性确定完整的动作序列,而是作为一种交互式搜索策略,每次选择一个动作,观察每个结果状态,并能够重置和重试。我们在 102 个国际规划竞赛(IPC)Blocksworld 实例上评估了四种方法,统一预算为 180 秒:作为经典基线的 Fast Downward lama-first 和 seq-sat-lama-2011,直接 LLM 规划(Claude Haiku 4.5),以及通过 PyPDDLEngine 实现的自主 LLM 规划。Fast Downward 的成功率为 85.3%。直接和自主 LLM 方法分别达到 63.7% 和 66.7%,自主方法以高出 $5.7\times$ 的单次求解令牌成本,获得了一致但适度的三个百分点优势。在大多数共同解决的难度块中,尽管 seq-sat-lama-2011 进行了迭代质量改进,这两种 LLM 方法生成的计划仍比它更短,这一结果与训练数据的回忆一致,而不是可推广的规划。这些结果表明,自主收益依赖于环境反馈的性质。编码代理受益于外部基础信号,如编译错误和测试失败,而 PDDL 步骤反馈是自我评估的,使代理在没有外部验证的情况下评估自己的进展。
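The interaction pattern, as opposed to one-shot plan generation, is a propose-observe loop over single actions. The sketch below uses a hypothetical engine interface (reset / apply / goal_satisfied are assumed method names, not PyPDDLEngine's actual MCP tools):

```python
# Step-wise agentic planning loop (hypothetical interface, not the real API).
def agentic_plan(engine, propose_action, max_steps=100):
    state = engine.reset()                      # assumed: initial PDDL state
    plan = []
    for _ in range(max_steps):
        if engine.goal_satisfied(state):        # assumed goal test
            return plan
        action = propose_action(state)          # e.g. one LLM tool call via MCP
        state = engine.apply(state, action)     # assumed one-step simulation
        plan.append(action)
    return None                                 # step budget exhausted
```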
cs.AI / 9 / 2603.06067

Aggregative Semantics for Quantitative Bipolar Argumentation Frameworks

定量双极论证框架的聚合语义
Munro, Yann, Bloch, Isabelle, Lesot, Marie-Jeanne
Abstract
Formal argumentation is being used increasingly in artificial intelligence as an effective and understandable way to model potentially conflicting pieces of information, called arguments, and identify so-called acceptable arguments depending on a chosen semantics. This paper deals with the specific context of Quantitative Bipolar Argumentation Frameworks (QBAF), where arguments have intrinsic weights and can attack or support each other. In this context, we introduce a novel family of gradual semantics, called aggregative semantics. In order to deal with situations in which attackers and supporters do not play a symmetric role, and in contrast to modular semantics, we propose to aggregate attackers and supporters separately. This leads to a three-stage computation, which consists in computing a global weight for attackers and another for supporters, before aggregating these two values with the intrinsic weight of the argument. We discuss the properties that the three aggregation functions should satisfy depending on the context, as well as their relationships with the classical principles for gradual semantics. This discussion is supported by various simple examples, as well as a final example on which five hundred aggregative semantics are tested and compared, illustrating the range of possible behaviours with aggregative semantics. Decomposing the computation into three distinct and interpretable steps leads to a more parametrisable computation: it keeps the bipolarity one step further than what is done in the literature, and it leads to more understandable gradual semantics.
Chinese Translation
形式论证在人工智能中的应用日益增多,作为一种有效且易于理解的方式来建模潜在冲突的信息片段,称为论证,并根据所选择的语义识别所谓的可接受论证。本文处理定量双极论证框架(Quantitative Bipolar Argumentation Frameworks,QBAF)的特定背景,其中论证具有内在权重,并且可以相互攻击或支持。在此背景下,我们引入了一种新颖的渐进语义家族,称为聚合语义。为了处理攻击者和支持者不对称角色的情况,并与模块化语义形成对比,我们建议分别聚合攻击者和支持者。这导致了一个三阶段的计算过程,首先计算攻击者的全局权重和支持者的全局权重,然后将这两个值与论证的内在权重进行聚合。我们讨论了根据上下文,三个聚合函数应满足的性质,以及它们与经典渐进语义原则的关系。该讨论通过各种简单示例以及一个最终示例得以支持,在该示例中测试和比较了五百种聚合语义,展示了聚合语义可能的行为范围。将计算分解为三个独立且可解释的步骤,导致了更具参数化的计算:它将双极性推进了一步,并导致了更易于理解的渐进语义。
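The three-stage computation can be made concrete with one illustrative choice per stage: a probabilistic-sum aggregation of attacker strengths, the same for supporters, and a convex combination with the intrinsic weight, iterated to a fixed point. These functions are examples only; the paper studies families of such choices:

```python
# Toy fixed-point computation of an aggregative gradual semantics. All three
# aggregation functions here are illustrative, not the paper's.
from math import prod

def aggregative_strengths(w, att, sup, iters=100):
    """w: {arg: intrinsic weight in [0, 1]};
    att/sup: {arg: list of attacking / supporting arguments}."""
    s = dict(w)                                             # initial strengths
    for _ in range(iters):
        new = {}
        for a in w:
            A = 1 - prod(1 - s[b] for b in att.get(a, []))  # global attack weight
            S = 1 - prod(1 - s[b] for b in sup.get(a, []))  # global support weight
            # stage 3: supports lift the intrinsic weight toward 1,
            # attacks then scale the result down
            new[a] = (1 - A) * (w[a] + (1 - w[a]) * S)
        s = new
    return s

# Example: b attacks a, c supports a.
print(aggregative_strengths({"a": 0.8, "b": 0.5, "c": 0.6},
                            att={"a": ["b"]}, sup={"a": ["c"]}))
```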
cs.AI / 10 / 2603.06082

Offline Materials Optimization with CliqueFlowmer

基于CliqueFlowmer的离线材料优化
Kuba, Jakub Grudzien, Miller, Benjamin Kurt, Levine, Sergey, Abbeel, Pieter
Abstract
Recent advances in deep learning inspired neural network-based approaches to computational materials discovery (CMD). A plethora of problems in this field involve finding materials that optimize a target property. Nevertheless, the increasingly popular generative modeling methods are ineffective at boldly exploring attractive regions of the materials space due to their maximum likelihood training. In this work, we offer an alternative CMD technique based on offline model-based optimization (MBO) that fuses direct optimization of a target material property into generation. To that end, we introduce a domain-specific model, dubbed CliqueFlowmer, that incorporates recent advances of clique-based MBO into transformer and flow generation. We validate CliqueFlowmer's optimization abilities and show that materials it produces strongly outperform those provided by generative baselines. To enable employment of CliqueFlowmer in specialized materials optimization problems and support interdisciplinary research, we open-source our code at https://github.com/znowu/CliqueFlowmer.
Chinese Translation
近年来,深度学习的进展激发了基于神经网络的计算材料发现(CMD)方法。在这一领域中,许多问题涉及寻找优化目标属性的材料。然而,由于最大似然训练,日益流行的生成建模方法在大胆探索材料空间的吸引区域方面效果不佳。在本研究中,我们提出了一种基于离线模型优化(MBO)的替代CMD技术,将目标材料属性的直接优化融入生成过程中。为此,我们引入了一种特定领域的模型,称为CliqueFlowmer,该模型将基于团的MBO的最新进展与变换器(transformer)和流生成相结合。我们验证了CliqueFlowmer的优化能力,并展示了其生成的材料在性能上显著优于生成基线提供的材料。为了使CliqueFlowmer在专业材料优化问题中得到应用,并支持跨学科研究,我们已将代码开源,网址为 https://github.com/znowu/CliqueFlowmer。
cs.AI / 11 / 2603.06217

Conversational Demand Response: Bidirectional Aggregator-Prosumer Coordination through Agentic AI

对话式需求响应:通过自主智能体实现双向聚合器-生产者协调
Makroum, Reda El, Zwickl-Bernhard, Sebastian, Kranzl, Lukas, Auer, Hans
Abstract
Residential demand response depends on sustained prosumer participation, yet existing coordination is either fully automated, or limited to one-way dispatch signals and price alerts that offer little possibility for informed decision-making. This paper introduces Conversational Demand Response (CDR), a coordination mechanism where aggregators and prosumers interact through bidirectional natural language, enabled through agentic AI. A two-tier multi-agent architecture is developed in which an aggregator agent dispatches flexibility requests and a prosumer Home Energy Management System (HEMS) assesses deliverability and cost-benefit by calling an optimization-based tool. CDR also enables prosumer-initiated upstream communication, where changes in preferences can reach the aggregator directly. Proof-of-concept evaluation shows that interactions complete in under 12 seconds. The architecture illustrates how agentic AI can bridge the aggregator-prosumer coordination gap, providing the scalability of automated DR while preserving the transparency, explainability, and user agency necessary for sustained prosumer participation. All system components, including agent prompts, orchestration logic, and simulation interfaces, are released as open source to enable reproducibility and further development.
Chinese Translation
住宅需求响应依赖于持续的生产者参与,然而现有的协调方式要么完全自动化,要么仅限于单向调度信号和价格提醒,这些方式对知情决策的可能性贡献甚微。本文提出了对话式需求响应(Conversational Demand Response, CDR),这是一种协调机制,聚合器与生产者通过双向自然语言进行互动,借助自主智能体(agentic AI)实现。我们开发了一种两层多智能体架构,其中聚合器智能体发出灵活性请求,而生产者家庭能源管理系统(Home Energy Management System, HEMS)通过调用基于优化的工具评估可交付性和成本效益。CDR还支持生产者发起的上游沟通,使得偏好的变化能够直接传达给聚合器。概念验证评估表明,交互在12秒内完成。该架构展示了自主智能体如何弥合聚合器与生产者之间的协调差距,提供了自动化需求响应的可扩展性,同时保持了持续生产者参与所需的透明性、可解释性和用户自主性。所有系统组件,包括智能体提示、调度逻辑和仿真接口,均以开源形式发布,以便于可重复性和进一步开发。
cs.AI / 12 / 2603.06278

Artificial Intelligence for Climate Adaptation: Reinforcement Learning for Climate Change-Resilient Transport

气候适应中的人工智能:用于应对气候变化的强化学习交通运输
Costa, Miguel, Vandervoort, Arthur, Schmidt, Carolin, Miranda, João, Petersen, Morten W., Drews, Martin, Morrisey, Karyn, Pereira, Francisco C.
Abstract
Climate change is expected to intensify rainfall and, consequently, pluvial flooding, leading to increased disruptions in urban transportation systems over the coming decades. Designing effective adaptation strategies is challenging due to the long-term, sequential nature of infrastructure investments, deep climate uncertainty, and the complex interactions between flooding, infrastructure, and mobility impacts. In this work, we propose a novel decision-support framework using reinforcement learning (RL) for long-term flood adaptation planning. Formulated as an integrated assessment model (IAM), the framework combines rainfall projection and flood modeling, transport simulation, and quantification of direct and indirect impacts on infrastructure and mobility. Our RL-based approach learns adaptive strategies that balance investment and maintenance costs against avoided impacts. We evaluate the framework through a case study of Copenhagen's inner city over the 2024-2100 period, testing multiple adaptation options, and different belief and realized climate scenarios. Results show that the framework outperforms traditional optimization approaches by discovering coordinated spatial and temporal adaptation pathways and learning trade-offs between impact reduction and adaptation investment, yielding more resilient strategies. Overall, our results showcase the potential of reinforcement learning as a flexible decision-support tool for adaptive infrastructure planning under climate uncertainty.
Chinese Translation
气候变化预计将加剧降雨,从而导致城市洪水的增加,进而在未来几十年内对城市交通系统造成更大的干扰。由于基础设施投资的长期性和连续性、深刻的气候不确定性以及洪水、基础设施和流动性影响之间的复杂互动,设计有效的适应策略面临挑战。在本研究中,我们提出了一种新颖的决策支持框架,利用强化学习(RL)进行长期洪水适应规划。该框架被构建为一个综合评估模型(IAM),结合了降雨预测和洪水建模、交通模拟,以及对基础设施和流动性直接和间接影响的量化。我们的基于RL的方法学习适应性策略,以平衡投资和维护成本与避免的影响。我们通过对2024-2100年期间哥本哈根市中心的案例研究来评估该框架,测试多种适应选项以及不同的信念和实际气候情景。结果表明,该框架在发现协调的空间和时间适应路径以及学习影响减少与适应投资之间的权衡方面,优于传统的优化方法,从而产生了更具韧性的策略。总体而言,我们的结果展示了强化学习作为应对气候不确定性的适应性基础设施规划的灵活决策支持工具的潜力。
cs.AI / 13 / 2603.06290

The EpisTwin: A Knowledge Graph-Grounded Neuro-Symbolic Architecture for Personal AI

EpisTwin:一种基于知识图谱的神经符号架构用于个人人工智能
Servedio, Giovanni, Aghilar, Potito, Mattiace, Alessio, Carmosino, Gianni, Musicco, Francesco, Conte, Gabriele, Anelli, Vito Walter, Di Noia, Tommaso, Donini, Francesco Maria
Abstract
Personal Artificial Intelligence is currently hindered by the fragmentation of user data across isolated silos. While Retrieval-Augmented Generation offers a partial remedy, its reliance on unstructured vector similarity fails to capture the latent semantic topology and temporal dependencies essential for holistic sensemaking. We introduce EpisTwin, a neuro-symbolic framework that grounds generative reasoning in a verifiable, user-centric Personal Knowledge Graph. EpisTwin leverages Multimodal Language Models to lift heterogeneous, cross-application data into semantic triples. At inference, EpisTwin enables complex reasoning over the personal semantic graph via an agentic coordinator that combines Graph Retrieval-Augmented Generation with Online Deep Visual Refinement, dynamically re-grounding symbolic entities in their raw visual context. We also introduce PersonalQA-71-100, a synthetic benchmark designed to simulate a realistic user's digital footprint and evaluate EpisTwin performance. Our framework demonstrates robust results across a suite of state-of-the-art judge models, offering a promising direction for trustworthy Personal AI.
Chinese Translation
个人人工智能目前受到用户数据在孤立信息孤岛中碎片化的限制。尽管检索增强生成(Retrieval-Augmented Generation)提供了部分解决方案,但其对非结构化向量相似性的依赖未能捕捉到整体意义构建所必需的潜在语义拓扑和时间依赖性。我们提出了EpisTwin,这是一种神经符号框架,将生成推理建立在可验证的以用户为中心的个人知识图谱上。EpisTwin利用多模态语言模型将异构的跨应用数据提升为语义三元组。在推理过程中,EpisTwin通过一个代理协调器实现对个人语义图的复杂推理,该协调器将图检索增强生成与在线深度视觉细化相结合,动态地将符号实体重新定位于其原始视觉上下文中。我们还引入了PersonalQA-71-100,这是一个合成基准,旨在模拟现实用户的数字足迹并评估EpisTwin的性能。我们的框架在一系列最先进的评估模型中展示了强劲的结果,为可信的个人人工智能提供了一个有前景的方向。
cs.AI / 14 / 2603.06333

SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement

SAHOO:递归自我改进中高阶优化目标的保护性对齐
Sahoo, Subramanyam, Chadha, Aman, Jain, Vinija, Chaudhary, Divya
Abstract
Recursive self-improvement is moving from theory to practice: modern systems can critique, revise, and evaluate their own outputs, yet iterative self-modification risks subtle alignment drift. We introduce SAHOO, a practical framework to monitor and control drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii) constraint preservation checks that enforce safety-critical invariants such as syntactic correctness and non-hallucination; and (iii) regression-risk quantification to flag improvement cycles that undo prior gains. Across 189 tasks in code generation, mathematical reasoning, and truthfulness, SAHOO produces substantial quality gains, including 18.3 percent improvement in code tasks and 16.8 percent in reasoning, while preserving constraints in two domains and maintaining low violations in truthfulness. Thresholds are calibrated on a small validation set of 18 tasks across three cycles. We further map the capability-alignment frontier, showing efficient early improvement cycles but rising alignment costs later and exposing domain-specific tensions such as fluency versus factuality. SAHOO therefore makes alignment preservation during recursive self-improvement measurable, deployable, and systematically validated at scale.
Chinese Translation
递归自我改进正从理论走向实践:现代系统能够批评、修订和评估自身输出,但迭代自我修改存在微妙的对齐漂移风险。我们提出了SAHOO,一个实用框架,通过三项保护措施来监控和控制漂移:(i)目标漂移指数(Goal Drift Index, GDI),一个结合语义、词汇、结构和分布性度量的学习型多信号检测器;(ii)约束保持检查,强制执行安全关键的不变性,如语法正确性和非幻觉性;(iii)回归风险量化,以标记那些撤销先前收益的改进周期。在代码生成、数学推理和真实性的189个任务中,SAHOO显著提高了质量,包括代码任务的18.3%提升和推理的16.8%提升,同时在两个领域保持约束并在真实性方面维持低违规率。阈值是在三个周期的18个任务的小型验证集上进行校准的。我们进一步绘制了能力-对齐边界,显示出早期改进周期的高效性,但后期对齐成本上升,并暴露出特定领域的紧张关系,如流利性与真实性之间的权衡。因此,SAHOO使得在递归自我改进过程中对齐的保持变得可测量、可部署,并在大规模上进行系统验证。
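SAHOO's Goal Drift Index is described as a learned detector over semantic, lexical, structural, and distributional signals. A minimal sketch of how such signals might be combined, assuming a fixed weighting rather than the paper's learned one (the signal choices and weights here are hypothetical):

```python
# Illustrative drift signals and fixed weights; the paper's GDI is learned.
import math
from collections import Counter
from difflib import SequenceMatcher

def lexical_drift(a: str, b: str) -> float:
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def distributional_drift(a: str, b: str) -> float:
    """Symmetrised KL-style divergence over smoothed token frequencies."""
    pa, pb = Counter(a.split()), Counter(b.split())
    vocab = set(pa) | set(pb)
    na, nb = sum(pa.values()) or 1, sum(pb.values()) or 1
    div = 0.0
    for w in vocab:
        x = (pa[w] + 1) / (na + len(vocab))
        y = (pb[w] + 1) / (nb + len(vocab))
        div += 0.5 * (x * math.log(x / y) + y * math.log(y / x))
    return div

def goal_drift_index(original: str, revised: str,
                     w_lex: float = 0.5, w_dist: float = 0.5) -> float:
    return (w_lex * lexical_drift(original, revised)
            + w_dist * distributional_drift(original, revised))

v0 = "sort the list and return the median value"
v3 = "return the first element quickly without sorting"
print(f"GDI across cycles: {goal_drift_index(v0, v3):.3f}")
```

In the paper's setting the combination weights are learned and the alert thresholds are calibrated on a small validation set, as the abstract notes.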
cs.AI / 15 / 2603.06394

Talk Freely, Execute Strictly: Schema-Gated Agentic AI for Flexible and Reproducible Scientific Workflows

自由对话,严格执行:用于灵活和可重复科学工作流的模式门控代理人工智能
Strickland, Joel, Vijeta, Arjun, Moores, Chris, Bodek, Oliwia, Nenchev, Bogdan, Whitehead, Thomas, Phillips, Charles, Tassenberg, Karl, Conduit, Gareth, Pellegrini, Ben
Abstract
Large language models (LLMs) can now translate a researcher's plain-language goal into executable computation, yet scientific workflows demand determinism, provenance, and governance that are difficult to guarantee when an LLM decides what runs. Semi-structured interviews with 18 experts across 10 industrial R&D stakeholders surface 2 competing requirements--deterministic, constrained execution and conversational flexibility without workflow rigidity--together with boundary properties (human-in-the-loop control and transparency) that any resolution must satisfy. We propose schema-gated orchestration as the resolving principle: the schema becomes a mandatory execution boundary at the composed-workflow level, so that nothing runs unless the complete action--including cross-step dependencies--validates against a machine-checkable specification. We operationalize the 2 requirements as execution determinism (ED) and conversational flexibility (CF), and use these axes to review 20 systems spanning 5 architectural groups along a validation-scope spectrum. Scores are assigned via a multi-model protocol--15 independent sessions across 3 LLM families--yielding substantial-to-near-perfect inter-model agreement (Krippendorff α=0.80 for ED and α=0.98 for CF), demonstrating that multi-model LLM scoring can serve as a reusable alternative to human expert panels for architectural assessment. The resulting landscape reveals an empirical Pareto front--no reviewed system achieves both high flexibility and high determinism--but a convergence zone emerges between the generative and workflow-centric extremes. We argue that a schema-gated architecture, separating conversational from execution authority, is positioned to decouple this trade-off, and distill 3 operational principles--clarification-before-execution, constrained plan-act orchestration, and tool-to-workflow-level gating--to guide adoption.
Chinese Translation
大型语言模型(LLMs)现在能够将研究者的自然语言目标转换为可执行的计算,但科学工作流要求确定性、来源和治理,这在LLM决定执行内容时难以保证。对来自10个工业研发利益相关者的18位专家进行的半结构化访谈揭示了两种相互竞争的需求——确定性、受限的执行与不受工作流刚性限制的对话灵活性——以及任何解决方案必须满足的边界属性(人机协作控制和透明度)。我们提出模式门控编排作为解决原则:模式成为组合工作流级别的强制执行边界,因此只有在完整操作(包括跨步骤依赖关系)符合机器可检查的规范时,才会执行。我们将这两种需求操作化为执行确定性(ED)和对话灵活性(CF),并利用这两个维度对涵盖5个架构组的20个系统进行评审。通过多模型协议分配分数——在3个LLM家族中进行15次独立会话——得到了显著到近乎完美的模型间一致性(ED的Krippendorff α=0.80,CF的α=0.98),证明多模型LLM评分可以作为人类专家小组进行架构评估的可重用替代方案。结果景观揭示了一个经验帕累托前沿——没有审查的系统同时实现高灵活性和高确定性——但在生成与工作流中心极端之间出现了一个收敛区。我们认为,模式门控架构将对话与执行权威分离,有望解耦这一权衡,并提炼出三项操作原则——执行前澄清、受限的计划-行动编排,以及工具到工作流级别的门控——以指导其采用。
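To make the schema-gating principle concrete: nothing runs unless the complete composed workflow validates against a machine-checkable specification. A minimal sketch using the jsonschema package (the schema, tool names, and the explicit dependency check are our own illustrative choices, not the paper's):

```python
# Illustrative schema and tool names; not the paper's specification.
import jsonschema

WORKFLOW_SCHEMA = {
    "type": "object",
    "required": ["steps"],
    "properties": {
        "steps": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "required": ["tool", "args", "depends_on"],
                "properties": {
                    "tool": {"enum": ["load_data", "fit_model", "plot"]},
                    "args": {"type": "object"},
                    "depends_on": {"type": "array", "items": {"type": "integer"}},
                },
            },
        }
    },
}

def gated_execute(plan: dict) -> None:
    # The gate: validate the whole composed workflow before anything runs;
    # jsonschema raises ValidationError on failure.
    jsonschema.validate(instance=plan, schema=WORKFLOW_SCHEMA)
    # Cross-step dependency sanity check, beyond what the schema expresses.
    for i, step in enumerate(plan["steps"]):
        if any(dep >= i for dep in step["depends_on"]):
            raise ValueError(f"step {i} depends on a step that runs later")
    for step in plan["steps"]:
        print(f"running {step['tool']} with args {step['args']}")

plan = {"steps": [
    {"tool": "load_data", "args": {"path": "exp.csv"}, "depends_on": []},
    {"tool": "fit_model", "args": {"kind": "ridge"}, "depends_on": [0]},
]}
gated_execute(plan)
```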
cs.AI / 16 / 2603.06565

Boosting deep Reinforcement Learning using pretraining with Logical Options

通过逻辑选项的预训练提升深度强化学习
Ye, Zihan, Chau, Phil, Emunds, Raban, Blüml, Jannis, Derstroff, Cedric, Delfosse, Quentin, Arenz, Oleg, Kersting, Kristian
Abstract
Deep reinforcement learning agents are often misaligned, as they over-exploit early reward signals. Recently, several symbolic approaches have addressed these challenges by encoding sparse objectives along with aligned plans. However, purely symbolic architectures are complex to scale and difficult to apply to continuous settings. Hence, we propose a hybrid approach, inspired by humans' ability to acquire new skills. We use a two-stage framework that injects symbolic structure into neural-based reinforcement learning agents without sacrificing the expressivity of deep policies. Our method, called Hybrid Hierarchical RL (H^2RL), introduces a logical option-based pretraining strategy to steer the learning policy away from short-term reward loops and toward goal-directed behavior while allowing the final policy to be refined via standard environment interaction. Empirically, we show that this approach consistently improves long-horizon decision-making and yields agents that outperform strong neural, symbolic, and neuro-symbolic baselines.
Chinese Translation
深度强化学习代理常常出现不一致的情况,因为它们过度利用早期的奖励信号。最近,一些符号方法通过编码稀疏目标和对齐计划来解决这些挑战。然而,纯符号架构在扩展性上复杂,并且难以应用于连续环境。因此,我们提出了一种混合方法,灵感来自于人类获取新技能的能力。我们使用一个两阶段框架,将符号结构注入基于神经网络的强化学习代理中,而不牺牲深度策略的表达能力。我们的方法称为混合层次强化学习(Hybrid Hierarchical RL,H^2RL),引入了一种基于逻辑选项的预训练策略,以引导学习策略远离短期奖励循环,朝向目标导向行为,同时允许最终策略通过标准环境交互进行精炼。通过实证研究,我们表明这种方法持续改善了长时间决策能力,并产生了超越强大神经、符号和神经符号基线的代理。
计算语言学 (Computation and Language)
51
cs.CL / 1 / 2603.05519

Verify as You Go: An LLM-Powered Browser Extension for Fake News Detection

边验证边浏览:一种基于大型语言模型的假新闻检测浏览器扩展
Sallami, Dorsaf, Aïmeur, Esma
Abstract
The rampant spread of fake news in the digital age poses serious risks to public trust and democratic institutions, underscoring the need for effective, transparent, and user-centered detection tools. Existing browser extensions often fall short due to opaque model behavior, limited explanatory support, and a lack of meaningful user engagement. This paper introduces Aletheia, a novel browser extension that leverages Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs) to detect fake news and provide evidence-based explanations. Aletheia further includes two interactive components: a Discussion Hub that enables user dialogue around flagged content and a Stay Informed feature that surfaces recent fact-checks. Through extensive experiments, we show that Aletheia outperforms state-of-the-art baselines in detection performance. Complementing this empirical evaluation, a user study with 250 participants confirms the system's usability and perceived effectiveness, highlighting its potential as a transparent tool for combating online fake news.
Chinese Translation
在数字时代,假新闻的肆意传播对公众信任和民主机构构成了严重风险,凸显了有效、透明且以用户为中心的检测工具的必要性。现有的浏览器扩展往往由于模型行为不透明、解释支持有限以及缺乏有意义的用户参与而未能满足需求。本文介绍了一种新型浏览器扩展Aletheia,它利用检索增强生成(Retrieval-Augmented Generation, RAG)和大型语言模型(Large Language Models, LLMs)来检测假新闻并提供基于证据的解释。Aletheia还包括两个互动组件:一个讨论中心(Discussion Hub),使用户能够围绕标记内容进行对话,以及一个保持信息更新(Stay Informed)功能,展示最新的事实核查。通过广泛的实验,我们展示了Aletheia在检测性能上优于最先进的基线。此外,针对250名参与者的补充用户研究确认了该系统的可用性和感知有效性,突显了其作为透明工具对抗在线假新闻的潜力。
cs.CL / 2 / 2603.05540

Attention Meets Reachability: Structural Equivalence and Efficiency in Grammar-Constrained LLM Decoding

注意力与可达性相遇:语法约束下的 LLM 解码中的结构等价性与效率
Alpay, Faruk, Senturk, Bilge
Abstract
We study grammar-constrained decoding (GCD) as a coupling between an autoregressive next-token distribution and a reachability oracle over a pushdown system compiled from a context-free grammar (CFG). We prove an oracle invariance theorem: language-equivalent grammars induce identical admissible next-token sets for every prefix, hence identical logit masks, yet can yield provably different compiled state spaces and online ambiguity costs. We give exact control-state blowup counts for the canonical $a^n b^n$ language under redundant nonterminal delegation, and introduce a left-to-right structural ambiguity cost (SAC) measuring incremental packed-parse-forest growth per token. For two equivalent grammars over all finite strings, SAC is $O(1)$ per token under right-recursion but $\Theta(t^2)$ per token and $\Theta(n^3)$ cumulatively under concatenation. We establish engine-independent lower bounds: any sound, retrieval-efficient, parse-preserving online masking engine must incur $\Omega(t^2)$ work per token on a specific constant-size CFG family, unconditionally within this model. We define decoding-cost equivalence classes of grammars and prove existence of minimal-SAC representatives within bounded rewrite families. Finally, we characterize the true conditional sampler via a Doob $h$-transform and derive sharp one-step KL and total-variation distortion bounds for hard-masked decoding in terms of survival-probability spread among admissible next tokens. We integrate these results with Transformer and Mixture-of-Experts architectures, derive latency envelopes in terms of vocabulary size, active state sets, and beam width, and connect SAC to instrumentation-based predictive performance models and automated grammar optimization.
Chinese Translation
我们研究语法约束解码(GCD),将自回归下一个标记分布与基于上下文无关文法(CFG)编译的下推系统上的可达性预言机相结合。我们证明了一个预言机不变性定理:语言等价的文法为每个前缀诱导相同的可接受下一个标记集合,因此诱导相同的 logit 掩码,但可以产生可证明不同的编译状态空间和在线歧义成本。我们给出了在冗余非终结符委派下,典型的 $a^n b^n$ 语言的确切控制状态膨胀计数,并引入了一种从左到右的结构歧义成本(SAC),用于测量每个标记的增量打包解析森林增长。对于所有有限字符串的两个等价文法,在右递归下,SAC 为每个标记 $O(1)$,而在连接下为每个标记 $\Theta(t^2)$,累积为 $\Theta(n^3)$。我们建立了与引擎无关的下界:任何有效、检索高效、解析保持的在线掩码引擎在特定常量大小的 CFG 家族中,必须在此模型下无条件地每个标记耗费 $\Omega(t^2)$ 的工作。我们定义了解码成本等价类的文法,并证明在有界重写家族中存在最小 SAC 代表。最后,我们通过 Doob $h$-变换表征真实的条件采样器,并推导出硬掩码解码的尖锐一步 KL 和总变差失真界限,这些界限与可接受下一个标记之间的生存概率分布相关。我们将这些结果与 Transformer 和专家混合架构结合,推导出与词汇大小、活跃状态集和束宽度相关的延迟包络,并将 SAC 与基于仪器的预测性能模型和自动化语法优化联系起来。
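The core mechanism this abstract builds on, masking next-token logits down to the admissible set supplied by a reachability oracle, can be shown end to end on the paper's running example, the a^n b^n language. A minimal sketch with invented toy logits standing in for real LLM scores:

```python
# a^n b^n reachability oracle; the toy logits are invented, not LLM output.
import math

def admissible(prefix: str) -> set[str]:
    """Next tokens from which some string in a^n b^n (n >= 1) stays reachable."""
    na, nb = prefix.count("a"), prefix.count("b")
    allowed = set()
    if nb == 0:
        allowed.add("a")          # still in the run of a's: may open another
    if na > nb:
        allowed.add("b")          # unmatched a's remain: may close one
    if na == nb and na > 0:
        allowed.add("<eos>")      # balanced and non-empty: may stop
    return allowed

def masked_argmax(prefix: str, logits: dict[str, float]) -> str:
    ok = admissible(prefix)
    masked = {t: (v if t in ok else -math.inf) for t, v in logits.items()}
    return max(masked, key=masked.get)

prefix = ""
toy_logits = {"a": 0.2, "b": 1.5, "<eos>": 0.1}  # unconstrained decoding would emit "b"
while True:
    tok = masked_argmax(prefix, toy_logits)
    if tok == "<eos>":
        break
    prefix += tok
print(prefix)  # "ab": every emitted token was admissible for its prefix
```

The paper's point is that two grammars generating the same language yield identical masks here, while their compiled state spaces and per-token ambiguity costs can differ sharply.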
cs.CL / 3 / 2603.05617

NOTAI.AI: Explainable Detection of Machine-Generated Text via Curvature and Feature Attribution

NOTAI.AI:通过曲率和特征归因进行机器生成文本的可解释检测
Breneur, Oleksandr Marchenko, Danilov, Adelaide, Nourbakhsh, Aria, Lamsiyah, Salima
Abstract
We present NOTAI.AI, an explainable framework for machine-generated text detection that extends Fast-DetectGPT by integrating curvature-based signals with neural and stylometric features in a supervised setting. The system combines 17 interpretable features, including Conditional Probability Curvature, ModernBERT detector score, readability metrics, and stylometric cues, within a gradient-boosted tree (XGBoost) meta-classifier to determine whether a text is human- or AI-generated. Furthermore, NOTAI.AI applies Shapley Additive Explanations (SHAP) to provide both local and global feature-level attribution. These attributions are further translated into structured natural-language rationales through an LLM-based explanation layer, which enables user-facing interpretability. The system is deployed as an interactive web application that supports real-time analysis, visual feature inspection, and structured evidence presentation. A web interface allows users to input text and inspect how neural and statistical signals influence the final decision. The source code and demo video are publicly available to support reproducibility.
Chinese Translation
我们提出了NOTAI.AI,这是一个可解释的机器生成文本检测框架,扩展了Fast-DetectGPT,通过在监督环境中整合基于曲率的信号与神经和风格特征。该系统结合了17个可解释特征,包括条件概率曲率(Conditional Probability Curvature)、ModernBERT检测器得分、可读性指标和风格线索,在一个基于梯度提升树(XGBoost)的元分类器中,以确定文本是人类生成还是AI生成。此外,NOTAI.AI应用了Shapley加法解释(SHAP)来提供局部和全局特征级别的归因。这些归因进一步通过基于大型语言模型(LLM)的解释层转化为结构化的自然语言理由,从而实现用户可视化的可解释性。该系统作为一个互动网页应用程序部署,支持实时分析、可视化特征检查和结构化证据展示。网页界面允许用户输入文本并检查神经和统计信号如何影响最终决策。源代码和演示视频已公开,以支持可重复性。
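The pipeline shape described above, interpretable features feeding an XGBoost meta-classifier with SHAP attribution, is standard enough to sketch. The three synthetic features below stand in for the paper's 17 signals, and the snippet assumes the xgboost and shap packages are installed:

```python
import numpy as np
from xgboost import XGBClassifier
import shap

rng = np.random.default_rng(0)
n = 400
# Hypothetical stand-ins for three of the paper's 17 features:
# [curvature_score, detector_score, readability]
X = rng.normal(size=(n, 3))
y = (0.8 * X[:, 0] + 0.6 * X[:, 1] + 0.1 * rng.normal(size=n) > 0).astype(int)

clf = XGBClassifier(n_estimators=50, max_depth=3, eval_metric="logloss")
clf.fit(X, y)

# SHAP yields per-feature, per-example attributions for each decision.
explainer = shap.TreeExplainer(clf)
contributions = explainer.shap_values(X[:1])[0]
for name, value in zip(["curvature", "detector", "readability"], contributions):
    print(f"{name:12s} contribution: {value:+.3f}")
```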
cs.CL / 4 / 2603.05618

Safer Reasoning Traces: Measuring and Mitigating Chain-of-Thought Leakage in LLMs

更安全的推理痕迹:测量和减轻大型语言模型中的思维链泄漏
Ahrend, Patrick, Eder, Tobias, Yang, Xiyang, Pan, Zhiyi, Groh, Georg
Abstract
Chain-of-Thought (CoT) prompting improves LLM reasoning but can increase privacy risk by resurfacing personally identifiable information (PII) from the prompt into reasoning traces and outputs, even under policies that instruct the model not to restate PII. We study such direct, inference-time PII leakage using a model-agnostic framework that (i) defines leakage as risk-weighted, token-level events across 11 PII types, (ii) traces leakage curves as a function of the allowed CoT budget, and (iii) compares open- and closed-source model families on a structured PII dataset with a hierarchical risk taxonomy. We find that CoT consistently elevates leakage, especially for high-risk categories, and that leakage is strongly family- and budget-dependent. Increasing the reasoning budget can either amplify or attenuate leakage depending on the base model. We then benchmark lightweight inference-time gatekeepers: a rule-based detector, a TF-IDF + logistic regression classifier, a GLiNER-based NER model, and an LLM-as-judge, using risk-weighted F1, Macro-F1, and recall. No single method dominates across models or budgets, motivating hybrid, style-adaptive gatekeeping policies that balance utility and risk under a common, reproducible protocol.
Chinese Translation
思维链(Chain-of-Thought, CoT)提示提高了大型语言模型(LLMs)的推理能力,但可能通过将个人可识别信息(Personally Identifiable Information, PII)从提示中重新浮现到推理痕迹和输出中,增加隐私风险,即使在指示模型不重述PII的政策下。我们使用一种模型无关的框架研究这种直接的推理时PII泄漏,该框架(i)将泄漏定义为跨11种PII类型的风险加权、令牌级事件,(ii)将泄漏曲线追踪作为允许的思维链预算的函数,以及(iii)在具有层次风险分类法的结构化PII数据集上比较开源和闭源模型家族。我们发现,CoT始终会增加泄漏,尤其是在高风险类别中,并且泄漏在很大程度上依赖于模型家族和预算。增加推理预算可能会根据基础模型的不同而放大或减弱泄漏。然后,我们基准测试了轻量级推理时的守门者:基于规则的检测器、TF-IDF + 逻辑回归分类器、基于GLiNER的命名实体识别模型,以及将LLM作为评判者,使用风险加权的F1、宏观F1和召回率。没有单一方法在所有模型或预算中占据主导地位,这促使我们提出混合的、风格自适应的守门政策,在一个共同的、可重复的协议下平衡效用和风险。
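The abstract defines leakage as risk-weighted, token-level events across PII types. One plausible minimal reading of such a score (the risk weights, type names, and whitespace tokenization are our assumptions, not the paper's definitions):

```python
# Illustrative risk weights per PII type; not the paper's taxonomy values.
RISK = {"ssn": 1.0, "phone": 0.7, "email": 0.5, "city": 0.2}

def leakage_score(prompt_pii: dict[str, list[str]], trace: str) -> float:
    """prompt_pii maps each PII type to the PII tokens found in the prompt."""
    trace_tokens = {w.strip(".,;:!?") for w in trace.lower().split()}
    total = leaked = 0.0
    for pii_type, tokens in prompt_pii.items():
        weight = RISK.get(pii_type, 0.1)
        for tok in tokens:
            total += weight
            if tok.lower() in trace_tokens:
                leaked += weight
    return leaked / total if total else 0.0

pii = {"ssn": ["123-45-6789"], "city": ["springfield"]}
cot = "The user lives in Springfield, so local rules apply."
print(f"risk-weighted leakage: {leakage_score(pii, cot):.2f}")  # 0.17
```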
cs.CL / 5 / 2603.05651

The Fragility Of Moral Judgment In Large Language Models

大型语言模型中的道德判断脆弱性
van Nuenen, Tom, Sachdeva, Pratik S.
Abstract
People increasingly use large language models (LLMs) for everyday moral and interpersonal guidance, yet these systems cannot interrogate missing context and must judge dilemmas as presented. We introduce a perturbation framework for testing the stability and manipulability of LLM moral judgments while holding the underlying moral conflict constant. Using 2,939 dilemmas from r/AmItheAsshole (January-March 2025), we generate three families of content perturbations: surface edits (lexical/structural noise), point-of-view shifts (voice and stance neutralization), and persuasion cues (self-positioning, social proof, pattern admissions, victim framing). We also vary the evaluation protocol (output ordering, instruction placement, and unstructured prompting). We evaluate all variants with four models (GPT-4.1, Claude 3.7 Sonnet, DeepSeek V3, Qwen2.5-72B) (N=129,156 judgments). Surface perturbations produce low flip rates (7.5%), largely within the self-consistency noise floor (4-13%), whereas point-of-view shifts induce substantially higher instability (24.3%). A large subset of dilemmas (37.9%) is robust to surface noise yet flips under perspective changes, indicating that models condition on narrative voice as a pragmatic cue. Instability concentrates in morally ambiguous cases; scenarios where no party is assigned blame are most susceptible. Persuasion perturbations yield systematic directional shifts. Protocol choices dominate all other factors: agreement between structured protocols is only 67.6% (kappa=0.55), and only 35.7% of model-scenario units match across all three protocols. These results show that LLM moral judgments are co-produced by narrative form and task scaffolding, raising reproducibility and equity concerns when outcomes depend on presentation skill rather than moral substance.
Chinese Translation
人们越来越多地使用大型语言模型(LLMs)来获取日常的道德和人际指导,但这些系统无法审视缺失的背景信息,并按照呈现的困境进行判断。我们引入了一种扰动框架,用于测试LLM道德判断的稳定性和可操控性,同时保持潜在的道德冲突不变。利用来自r/AmItheAsshole的2,939个困境(2025年1月至3月),我们生成了三类内容扰动:表面编辑(词汇/结构噪声)、视角转变(声音和立场中立化)以及说服线索(自我定位、社会证明、模式承认、受害者框架)。我们还改变了评估协议(输出顺序、指令位置和非结构化提示)。我们对四个模型(GPT-4.1、Claude 3.7 Sonnet、DeepSeek V3、Qwen2.5-72B)进行了所有变体的评估(N=129,156个判断)。表面扰动产生了较低的翻转率(7.5%),主要在自我一致性噪声底线(4-13%)内,而视角转变则导致显著更高的不稳定性(24.3%)。大部分困境(37.9%)对表面噪声具有鲁棒性,但在视角变化下会翻转,表明模型将叙事声音作为一种务实线索进行条件化。不稳定性集中在道德模糊的案例中;没有任何一方被指责的场景最容易受到影响。说服扰动产生系统性的方向性变化。协议选择主导了所有其他因素:结构化协议之间的一致性仅为67.6%(kappa=0.55),而在所有三个协议中,仅有35.7%的模型-场景单元匹配。这些结果表明,LLM的道德判断是叙事形式和任务支架共同产生的,当结果依赖于呈现技巧而非道德实质时,这引发了可重复性和公平性的问题。
cs.CL / 6 / 2603.05690

FreeTxt-Vi: A Benchmarked Vietnamese-English Toolkit for Segmentation, Sentiment, and Summarisation

FreeTxt-Vi:一个经过基准测试的越南语-英语分词、情感分析和摘要生成工具包
Huy, Hung Nguyen, El-Haj, Mo, Knight, Dawn, Rayson, Paul
Abstract
FreeTxt-Vi is a free and open-source, web-based toolkit for creating and analysing bilingual Vietnamese-English text collections. Positioned at the intersection of corpus linguistics and natural language processing (NLP), it enables users to build, explore, and interpret free-text data without requiring programming expertise. The system combines corpus analysis features such as concordancing, keyword analysis, word relation exploration, and interactive visualisation with transformer-based NLP components for sentiment analysis and summarisation. A key contribution of this work is the design of a unified bilingual NLP pipeline that integrates a hybrid VnCoreNLP and Byte Pair Encoding (BPE) segmentation strategy, a fine-tuned TabularisAI sentiment classifier, and a fine-tuned Qwen2.5 model for abstractive summarisation. Unlike existing text analysis platforms, FreeTxt-Vi is evaluated as a set of language processing components. We conduct a three-part evaluation covering segmentation, sentiment analysis, and summarisation, and show that our approach achieves competitive or superior performance compared to widely used baselines in both Vietnamese and English. By reducing technical barriers to multilingual text analysis, FreeTxt-Vi supports reproducible research and promotes the development of language resources for Vietnamese, a widely spoken but underrepresented language in NLP. The toolkit is applicable to domains including education, digital humanities, cultural heritage, and the social sciences, where qualitative text data are common but often difficult to process at scale.
Chinese Translation
FreeTxt-Vi 是一个基于网络的免费开源工具包,用于创建和分析双语越南语-英语文本集合。该工具包位于语料库语言学与自然语言处理(NLP)的交汇处,使用户能够在无需编程专业知识的情况下构建、探索和解释自由文本数据。该系统结合了语料库分析功能,如词汇搭配、关键词分析、词语关系探索和交互式可视化,以及基于变换器的NLP组件,用于情感分析和摘要生成。本文的一个关键贡献是设计了一个统一的双语NLP管道,整合了混合的VnCoreNLP和字节对编码(Byte Pair Encoding, BPE)分词策略、经过微调的TabularisAI情感分类器,以及经过微调的Qwen2.5模型用于抽象摘要生成。与现有的文本分析平台不同,FreeTxt-Vi 被评估为一组语言处理组件。我们进行了三部分评估,涵盖分词、情感分析和摘要生成,并显示我们的方法在越南语和英语中均实现了与广泛使用的基准相比具有竞争力或更优的性能。通过降低多语言文本分析的技术门槛,FreeTxt-Vi 支持可重复的研究,并促进了越南语这一在NLP中使用广泛但代表性不足的语言的语言资源开发。该工具包适用于教育、数字人文学科、文化遗产和社会科学等领域,这些领域中定性文本数据普遍存在,但通常难以大规模处理。
cs.CL / 7 / 2603.05698

Towards Robust Retrieval-Augmented Generation Based on Knowledge Graph: A Comparative Analysis

基于知识图谱的鲁棒检索增强生成:比较分析
Amamou, Hazem, Gagnon, Stéphane, Davoust, Alan, Avila, Anderson R.
Abstract
Retrieval-Augmented Generation (RAG) was introduced to enhance the capabilities of Large Language Models (LLMs) beyond their encoded prior knowledge. This is achieved by providing LLMs with an external source of knowledge, which helps reduce factual hallucinations and enables access to new information not available during pretraining. However, inconsistent retrieved information can negatively affect LLM responses. The Retrieval-Augmented Generation Benchmark (RGB) was introduced to evaluate the robustness of RAG systems under such conditions. In this work, we use the RGB corpus to evaluate LLMs in four scenarios: noise robustness, information integration, negative rejection, and counterfactual robustness. We perform a comparative analysis between the RGB RAG baseline and GraphRAG, a knowledge graph based retrieval system. We test three GraphRAG customizations to improve robustness. Results show improvements over the RGB baseline and provide insights for designing more reliable RAG systems for real world scenarios.
Chinese Translation
检索增强生成(Retrieval-Augmented Generation, RAG)被引入以增强大型语言模型(Large Language Models, LLMs)超越其编码的先前知识的能力。这是通过为LLMs提供外部知识源来实现的,这有助于减少事实幻觉,并使其能够访问在预训练期间不可用的新信息。然而,不一致的检索信息可能会对LLM的响应产生负面影响。为评估RAG系统在这种情况下的鲁棒性,引入了检索增强生成基准(Retrieval-Augmented Generation Benchmark, RGB)。在本研究中,我们使用RGB语料库评估LLMs在四种场景下的表现:噪声鲁棒性、信息整合、负面拒绝和反事实鲁棒性。我们对RGB RAG基线和基于知识图谱的检索系统GraphRAG进行了比较分析。我们测试了三种GraphRAG的定制以提高鲁棒性。结果显示,相较于RGB基线有了改善,并为设计更可靠的RAG系统以应对现实世界场景提供了见解。
cs.CL / 8 / 2603.05723

Cultural Perspectives and Expectations for Generative AI: A Global Survey Approach

生成性人工智能的文化视角与期望:全球调查方法
van Liemt, Erin, Shelby, Renee, Smart, Andrew, Kumbale, Sinchana, Zhang, Richard, Dixit, Neha, Rashid, Qazi Mamunur, Smith-Loud, Jamila
Abstract
There is a lack of empirical evidence about global attitudes around whether and how GenAI should represent cultures. This paper assesses understandings and beliefs about culture as it relates to GenAI from a large-scale global survey. We gathered data about what culture means to different groups, and about how GenAI should approach the representation of cultural artifacts, concepts, or values. We distill working definitions of culture directly from these communities to build an understanding of its conceptual complexities and how they relate to representations in Generative AI. We survey from across parts of Europe, North and South America, Asia, and Africa. We conclude with a set of recommendations for Culture and GenAI development. These include participatory approaches, prioritizing specific cultural dimensions beyond geography, such as religion and tradition, and a sensitivity framework for addressing cultural ``redlines''.
Chinese Translation
关于全球对生成性人工智能(GenAI)应如何代表文化的态度,缺乏实证证据。本文通过大规模全球调查评估了与GenAI相关的文化理解和信念。我们收集了不同群体对文化的定义,以及GenAI应如何处理文化文物、概念或价值观的表现。我们直接从这些社区提炼出文化的工作定义,以建立对其概念复杂性的理解,以及这些复杂性与生成性人工智能表现之间的关系。我们的调查覆盖了欧洲、北美、南美、亚洲和非洲的多个地区。最后,我们提出了一系列关于文化与GenAI发展的建议。这些建议包括参与式方法、优先考虑超越地理的特定文化维度(如宗教和传统),以及用于处理文化“红线”的敏感性框架。
cs.CL / 9 / 2603.05727

Structured Multidimensional Representation Learning for Large Language Models

大语言模型的结构化多维表示学习
Ichi, Alaa El, Jbilou, Khalide, Guide, Mohamed El, Dufrenois, Franck
Abstract
Transformer architectures achieve state-of-the-art performance across a wide range of pattern recognition and natural language processing tasks, but their scaling is accompanied by substantial parameter growth and redundancy in the embedding dimension. In this work, we introduce a structured spectral factorization of the embedding space based on the L-product for third-order tensors. By reshaping token representations into spectral tensor slices and performing attention and feed-forward operations in the transform domain, we obtain a Tensor Transformer architecture that decomposes the encoder into p independent spectral sub-transformers while preserving standard Transformer semantics. We prove that the proposed L-Transformer is spectrally equivalent to p parallel Transformers operating on reduced-dimensional embeddings, which yields approximately 1/p reduction (up to lower-order terms such as biases and normalization parameters) in encoder parameters under fixed total embedding size. When instantiated with a real-valued Discrete Cosine Transform (DCT), the method remains fully differentiable and compatible with existing training pipelines. Beyond compression, the spectral decomposition introduces an inductive bias over embedding frequencies, enabling slice-dependent frequency scaling that improves generalization. Experiments on IMDB and AG~News show that the proposed model can substantially reduce encoder parameters (up to 75\% for p=4) while maintaining competitive accuracy. On IMDB, the tensorized encoder matches or improves upon the standard baseline under compression, whereas on AG~News at moderate width we observe a small accuracy decrease in exchange for a 4 times encoder reduction; at BERT-base width (d=768), performance returns to parity.
Chinese Translation
Transformer架构在广泛的模式识别和自然语言处理任务中实现了最先进的性能,但其扩展伴随着参数的大幅增长和嵌入维度的冗余。在本研究中,我们引入了一种基于L-乘积的嵌入空间的结构化谱因式分解,适用于三阶张量。通过将标记表示重塑为谱张量切片,并在变换域中执行注意力和前馈操作,我们获得了一种张量Transformer架构,该架构将编码器分解为p个独立的谱子Transformer,同时保留标准Transformer的语义。我们证明了所提出的L-Transformer在谱上等价于p个在降维嵌入上运行的并行Transformer,这在固定总嵌入大小的情况下,导致编码器参数大约减少1/p(考虑到偏置和归一化参数等低阶项)。当使用实值离散余弦变换(DCT)实例化时,该方法保持完全可微并与现有训练流程兼容。除了压缩之外,谱分解还引入了对嵌入频率的归纳偏置,使得切片依赖的频率缩放得以实现,从而提高了泛化能力。在IMDB和AG~News上的实验表明,所提出的模型可以显著减少编码器参数(对于p=4可减少多达75%),同时保持竞争力的准确性。在IMDB上,张量化编码器在压缩下与标准基线相匹配或有所改善,而在AG~News上,在适中的宽度下,我们观察到小幅准确性下降,以换取4倍的编码器减少;在BERT-base宽度(d=768)下,性能恢复到平衡。
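A rough numerical sketch of the DCT instantiation: transform the embedding axis, split it into p spectral slices, and give each slice its own small projection; the roughly 1/p parameter reduction falls out of the shapes. Dimensions and initialization below are illustrative, not the paper's configuration:

```python
# Illustrative shapes; not the paper's architecture or initialization.
import numpy as np
from scipy.fft import dct, idct

d, p, seq = 768, 4, 16
slice_d = d // p
x = np.random.randn(seq, d)                       # token embeddings

xs = dct(x, axis=-1, norm="ortho")                # to the spectral domain
slices = xs.reshape(seq, p, slice_d)              # p spectral sub-embeddings

# One (d/p) x (d/p) projection per slice replaces a single d x d matrix.
W = [np.random.randn(slice_d, slice_d) / np.sqrt(slice_d) for _ in range(p)]
out = np.stack([slices[:, i] @ W[i] for i in range(p)], axis=1)

y = idct(out.reshape(seq, d), axis=-1, norm="ortho")  # back to token space

dense, sliced = d * d, p * slice_d * slice_d
print(f"projection parameters: {dense} -> {sliced} ({sliced / dense:.0%})")
```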
cs.CL / 10 / 2603.05743

Let's Talk, Not Type: An Oral-First Multi-Agent Architecture for Guaraní

让我们交谈,而不是打字:一种以口语为先的多智能体架构用于瓜拉尼语
Adorno, Samantha, Moharir, Akshata Kishore, Kandala, Ratna
Abstract
Although artificial intelligence (AI) and Human-Computer Interaction (HCI) systems are often presented as universal solutions, their design remains predominantly text-first, underserving primarily oral languages and indigenous communities. This position paper uses Guaraní, an official and widely spoken language of Paraguay, as a case study to argue that language support in AI remains insufficient unless it aligns with lived oral practices. We propose an alternative to the standard "text-to-speech" pipeline: an oral-first multi-agent architecture. By decoupling Guaraní natural language understanding from dedicated agents for conversation state and community-led governance, we demonstrate a technical framework that respects indigenous data sovereignty and diglossia. Our work moves beyond mere recognition to focus on turn-taking, repair, and shared context as the primary locus of interaction. We conclude that for AI to be truly culturally grounded, it must shift from adapting oral languages to text-centric systems to treating spoken conversation as a first-class design requirement, ensuring digital ecosystems empower rather than overlook diverse linguistic practices.
Chinese Translation
尽管人工智能(AI)和人机交互(HCI)系统常被视为普遍解决方案,但其设计仍然主要以文本为主,未能充分服务于以口语为主的语言和土著社区。本文以瓜拉尼语作为案例研究,论证了除非与实际的口语实践相一致,否则AI的语言支持仍然不足。我们提出了一种替代标准“文本转语音”流程的方案,建议采用以口语为先的多智能体架构。通过将瓜拉尼语的自然语言理解与专门用于对话状态和社区主导治理的代理解耦,我们展示了一个尊重土著数据主权和双语现象的技术框架。我们的工作超越了单纯的识别,重点关注轮流发言、修复和共享背景作为交互的主要焦点。我们得出结论,AI要真正扎根于文化,必须从将口语语言适应于以文本为中心的系统转变为将口语对话视为一项首要设计要求,从而确保数字生态系统能够赋权而非忽视多样的语言实践。
cs.CL / 11 / 2603.05744

CodeScout: Contextual Problem Statement Enhancement for Software Agents

CodeScout:软件代理的上下文问题陈述增强
Suri, Manan, Li, Xiangci, Shojaie, Mehdi, Han, Songyang, Hsu, Chao-Chun, Garg, Shweta, Deshmukh, Aniket Anand, Kumar, Varun
Abstract
Current AI-powered code assistance tools often struggle with poorly-defined problem statements that lack sufficient task context and requirements specification. Recent analysis of software engineering agents reveals that failures on such underspecified requests are highly correlated with longer trajectories involving either over-exploration or repeated attempts at applying the same fix without proper evolution or testing, leading to suboptimal outcomes across software development tasks. We introduce CodeScout, a contextual query refinement approach that systematically converts underspecified user requests into comprehensive, actionable problem statements through lightweight pre-exploration of the target codebase. Our key innovation is demonstrating that structured analysis before task execution can supplement existing agentic capabilities without requiring any modifications to their underlying scaffolds. CodeScout performs targeted context scoping, conducts multi-perspective analysis examining potential fixes and exploration opportunities, then synthesizes these insights into enhanced problem statements with reproduction steps, expected behaviors, and targeted exploration hints. This pre-exploration directly addresses the identified failure patterns by reducing non-converging agent trajectories while clarifying user intent in natural language space. We evaluate CodeScout using state-of-the-art agentic scaffolds and language models on SWEBench-Verified, demonstrating a 20\% improvement in resolution rates with up to 27 additional issues resolved compared to the default baseline method. Our results suggest that systematic query refinement through contextual analysis represents a promising direction for enhancing AI code assistance capabilities.
Chinese Translation
当前的人工智能驱动代码辅助工具常常在处理缺乏充分任务上下文和需求规范的模糊问题陈述时遇到困难。对软件工程代理的最新分析表明,在此类不明确请求上的失败与涉及过度探索或在没有适当演变或测试的情况下重复尝试应用相同修复的较长轨迹高度相关,导致软件开发任务的次优结果。我们提出了CodeScout,这是一种上下文查询精炼方法,通过对目标代码库的轻量级预探索,系统地将不明确的用户请求转化为全面、可操作的问题陈述。我们的关键创新在于展示了任务执行前的结构化分析可以补充现有的代理能力,而无需对其基础架构进行任何修改。CodeScout进行针对性的上下文范围界定,进行多角度分析,检查潜在修复和探索机会,然后将这些见解综合为增强的问题陈述,包括重现步骤、预期行为和针对性的探索提示。这种预探索直接解决了识别出的失败模式,通过减少非收敛的代理轨迹,同时在自然语言空间中澄清用户意图。我们使用最先进的代理架构和语言模型在SWEBench-Verified上评估CodeScout,结果显示解决率提高了20%,与默认基线方法相比,解决了多达27个额外问题。我们的结果表明,通过上下文分析进行系统的查询精炼代表了增强人工智能代码辅助能力的一个有前景的方向。
cs.CL / 12 / 2603.05750

NERdME: a Named Entity Recognition Dataset for Indexing Research Artifacts in Code Repositories

NERdME:一个用于索引代码库中研究文献的命名实体识别数据集
Gesese, Genet Asefa, Chen, Zongxiong, Jiang, Shufan, Tan, Mary Ann, Liu, Zhaotai, Schimmler, Sonja, Sack, Harald
Abstract
Existing scholarly information extraction (SIE) datasets focus on scientific papers and overlook implementation-level details in code repositories. README files describe datasets, source code, and other implementation-level artifacts; however, their free-form Markdown offers little semantic structure, making automatic information extraction difficult. To address this gap, NERdME is introduced: 200 manually annotated README files with over 10,000 labeled spans and 10 entity types. Baseline results using large language models and fine-tuned transformers show clear differences between paper-level and implementation-level entities, indicating the value of extending SIE benchmarks with entity types available in README files. A downstream entity-linking experiment was conducted to demonstrate that entities derived from READMEs can support artifact discovery and metadata integration.
Chinese Translation
现有的学术信息提取(SIE)数据集主要集中在科学论文上,忽视了代码库中的实现层面细节。README 文件描述了数据集、源代码及其他实现层面的文献,然而其自由格式的 Markdown 提供的语义结构较少,使得自动信息提取变得困难。为了解决这一问题,本文介绍了 NERdME:一个包含 200 个手动标注的 README 文件的数据集,涵盖超过 10,000 个标注跨度和 10 种实体类型。使用大型语言模型和微调的变换器进行的基线结果显示,论文层面和实现层面的实体之间存在明显差异,表明通过 README 文件中可用的实体类型扩展 SIE 基准的重要性。还进行了下游实体链接实验,以证明从 README 中提取的实体可以支持文献发现和元数据集成。
cs.CL / 13 / 2603.05776

PVminerLLM: Structured Extraction of Patient Voice from Patient-Generated Text using Large Language Models

PVminerLLM:利用大型语言模型从患者生成文本中结构化提取患者声音
Fodeh, Samah, Ma, Linhai, Puthiaraju, Ganesh, Talakokkul, Srivani, Khan, Afshan, Hagaman, Ashley, Lowe, Sarah, Roundtree, Aimee
Abstract
Motivation: Patient-generated text contains critical information about patients' lived experiences, social circumstances, and engagement in care, including factors that strongly influence adherence, care coordination, and health equity. However, these patient voice signals are rarely available in structured form, limiting their use in patient-centered outcomes research and clinical quality improvement. Reliable extraction of such information is therefore essential for understanding and addressing non-clinical drivers of health outcomes at scale. Results: We introduce PVminer, a benchmark for structured extraction of patient voice, and propose PVminerLLM, a supervised fine-tuned large language model tailored to this task. Across multiple datasets and model sizes, PVminerLLM substantially outperforms prompt-based baselines, achieving up to 83.82% F1 for Code prediction, 80.74% F1 for Sub-code prediction, and 87.03% F1 for evidence Span extraction. Notably, strong performance is achieved even with smaller models, demonstrating that reliable patient voice extraction is feasible without extreme model scale. These results enable scalable analysis of social and experiential signals embedded in patient-generated text. Availability and Implementation: Code, evaluation scripts, and trained LLMs will be released publicly. Annotated datasets will be made available upon request for research use. Keywords: Large Language Models, Supervised Fine-Tuning, Medical Annotation, Patient-Generated Text, Clinical NLP
Chinese Translation
动机:患者生成的文本包含关于患者生活经历、社会环境和护理参与的重要信息,包括强烈影响依从性、护理协调和健康公平的因素。然而,这些患者声音信号很少以结构化形式出现,限制了它们在以患者为中心的结果研究和临床质量改进中的应用。因此,可靠提取此类信息对于理解和解决大规模健康结果的非临床驱动因素至关重要。结果:我们介绍了PVminer,一个用于结构化提取患者声音的基准,并提出了PVminerLLM,一个针对该任务进行监督微调的大型语言模型。在多个数据集和模型规模下,PVminerLLM显著优于基于提示的基准,代码预测的F1值高达83.82%,子代码预测的F1值为80.74%,证据跨度提取的F1值为87.03%。值得注意的是,即使使用较小的模型也能取得良好的性能,证明在不需要极大模型规模的情况下,可靠的患者声音提取是可行的。这些结果使得对嵌入在患者生成文本中的社会和体验信号进行可扩展分析成为可能。可用性和实现:代码、评估脚本和训练好的LLM将公开发布。经过注释的数据集将在请求后提供用于研究。关键词:大型语言模型、监督微调、医学注释、患者生成文本、临床自然语言处理
cs.CL / 14 / 2603.05778

Tutor Move Taxonomy: A Theory-Aligned Framework for Analyzing Instructional Moves in Tutoring

辅导动作分类法:一个与理论对齐的框架,用于分析辅导中的教学行为
Zhou, Zhuqian, Vanacore, Kirk, Thompson, Tamisha, John, Jennifer St, Kizilcec, Rene
Abstract
Understanding what makes tutoring effective requires methods for systematically analyzing tutors' instructional actions during learning interactions. This paper presents a tutor move taxonomy designed to support large-scale analysis of tutoring dialogue within the National Tutoring Observatory. The taxonomy provides a structured annotation framework for labeling tutors' instructional moves during one-on-one tutoring sessions. We developed the taxonomy through a hybrid deductive-inductive process. First, we synthesized research from cognitive science, the learning sciences, classroom discourse analysis, and intelligent tutoring systems to construct a preliminary framework of tutoring moves. We then refined the taxonomy through iterative coding of authentic tutoring transcripts conducted by expert annotators with extensive instructional and qualitative research experience. The resulting taxonomy organizes tutoring behaviors into four categories: tutoring support, learning support, social-emotional and motivational support, and logistical support. Learning support moves are further organized along a spectrum of student engagement, distinguishing between moves that elicit student reasoning and those that provide direct explanation or answers. By defining tutoring dialogue in terms of discrete instructional actions, the taxonomy enables scalable annotation using AI, computational modeling of tutoring strategies, and empirical analysis of how tutoring behaviors relate to learning outcomes.
Chinese Translation
理解辅导有效性的因素需要系统分析辅导者在学习互动中的教学行为的方法。本文提出了一种辅导动作分类法,旨在支持国家辅导观察站内辅导对话的大规模分析。该分类法提供了一个结构化的注释框架,用于标记一对一辅导会话中辅导者的教学动作。我们通过混合的演绎-归纳过程开发了该分类法。首先,我们综合了来自认知科学、学习科学、课堂话语分析和智能辅导系统的研究,以构建辅导动作的初步框架。随后,我们通过对真实辅导记录的迭代编码,结合具有丰富教学和定性研究经验的专家注释者的反馈,进一步完善了该分类法。最终的分类法将辅导行为组织为四个类别:辅导支持、学习支持、社会情感与动机支持,以及后勤支持。学习支持动作进一步沿着学生参与的光谱进行组织,区分出引导学生推理的动作与提供直接解释或答案的动作。通过将辅导对话定义为离散的教学行为,该分类法使得利用人工智能进行可扩展的注释、辅导策略的计算建模以及辅导行为与学习结果之间关系的实证分析成为可能。
cs.CL / 15 / 2603.05818

RouteGoT: Node-Adaptive Routing for Cost-Efficient Graph of Thoughts Reasoning

RouteGoT:成本高效的思维图推理中的节点自适应路由
Liu, Yuhang, Wang, Ruijie, Chu, Yunlong, Hao, Bing, Lin, Yumeng, Liu, Shengzhong, Shao, Minglai
Abstract
Large Language Models (LLMs) excel at multi-step reasoning, yet increasing the structural complexity of inference does not consistently improve system-level returns. Methods such as Tree of Thoughts (ToT), Graph of Thoughts (GoT), and Adaptive Graph of Thoughts (AGoT) can boost accuracy on some benchmarks, but often introduce substantial overhead in token consumption and latency, and their gains can be unstable across task distributions, sometimes underperforming simpler Chain-of-Thought (CoT) or direct input-output prompting (IO). We attribute this inefficiency to stage-wise and node-wise heterogeneity inside GoT-style reasoning pipelines: high-quality planning and final synthesis are globally coupled and typically benefit from strong models, whereas many intermediate subtasks are localized and can be solved accurately by lighter models with far fewer tokens. Motivated by these observations, we propose RouteGoT, a budget-controllable, node-adaptive routing framework for graph-structured reasoning. RouteGoT performs in-graph routing by prioritizing strong models for planning and synthesis, while dynamically allocating lightweight models and cost-effective strategies to leaf subtasks based on predicted difficulty. It further integrates explicit budget constraints into a global inference scheduler to control graph expansion under a user-specified token budget, enabling predictable performance-cost trade-offs. Experiments across reasoning, retrieval, and multi-hop QA benchmarks show that RouteGoT matches or improves accuracy while substantially reducing token usage; specifically, it achieves an average 8.1 percentage points accuracy improvement and 79.1\% output token reduction compared to AGoT. Furthermore, RouteGoT outperforms existing routing baselines by maintaining a superior cost-accuracy trade-off, demonstrating improved robustness under varying budget targets and tasks.
Chinese Translation
大型语言模型(LLMs)在多步骤推理方面表现出色,但增加推理的结构复杂性并不总能持续提高系统级的回报。诸如思维树(Tree of Thoughts, ToT)、思维图(Graph of Thoughts, GoT)和自适应思维图(Adaptive Graph of Thoughts, AGoT)等方法可以在某些基准测试中提高准确性,但往往会引入显著的令牌消耗和延迟开销,其增益在任务分布中可能不稳定,有时甚至表现不如更简单的思维链(Chain-of-Thought, CoT)或直接的输入-输出提示(input-output prompting, IO)。我们将这种低效归因于GoT风格推理管道中的阶段性和节点性异质性:高质量的规划和最终合成是全局耦合的,通常受益于强大的模型,而许多中间子任务是局部的,可以由轻量级模型以更少的令牌准确解决。基于这些观察,我们提出了RouteGoT,一种可控预算的节点自适应路由框架,用于图结构推理。RouteGoT通过优先选择强大模型进行规划和合成,在图内路由时动态分配轻量级模型和成本效益策略给叶子子任务,基于预测的难度。此外,它将明确的预算约束集成到全局推理调度器中,以控制在用户指定的令牌预算下的图扩展,从而实现可预测的性能-成本权衡。在推理、检索和多跳问答基准测试中的实验表明,RouteGoT在匹配或提高准确性的同时显著减少了令牌使用;具体而言,与AGoT相比,它实现了平均8.1个百分点的准确性提升和79.1%的输出令牌减少。此外,RouteGoT通过保持优越的成本-准确性权衡,超越了现有的路由基线,在不同预算目标和任务下表现出更强的鲁棒性。
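A hypothetical sketch of the routing rule the abstract describes: the strong model for planning and synthesis, a difficulty-gated choice for leaves, all under a hard token budget. Model names, per-call costs, and the threshold are invented:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    role: str          # "plan", "leaf", or "synthesize"
    difficulty: float  # predicted difficulty in [0, 1]

COST = {"strong-llm": 900, "light-llm": 150}       # assumed tokens per call

def route(node: Node, remaining: int, hard: float = 0.6) -> Optional[str]:
    if node.role in ("plan", "synthesize"):
        choice = "strong-llm"                      # globally coupled stages
    else:
        choice = "strong-llm" if node.difficulty > hard else "light-llm"
    if COST[choice] > remaining:                   # degrade, then stop
        choice = "light-llm"
        if COST[choice] > remaining:
            return None                            # stop expanding the graph
    return choice

budget = 2000
graph = [Node("plan", 0.9), Node("leaf", 0.2), Node("leaf", 0.8),
         Node("leaf", 0.3), Node("synthesize", 0.9)]
for node in graph:
    model = route(node, budget)
    if model is None:
        print(f"{node.role:10s} -> skipped (budget exhausted)")
        continue
    budget -= COST[model]
    print(f"{node.role:10s} -> {model:10s} (budget left: {budget})")
```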
cs.CL / 16 / 2603.05828

HART: Data-Driven Hallucination Attribution and Evidence-Based Tracing for Large Language Models

HART:基于数据的幻觉归因与证据追踪框架用于大型语言模型
Liang, Shize, Wang, Hongzhi
Abstract
Large language models (LLMs) have demonstrated remarkable performance in text generation and knowledge-intensive question answering. Nevertheless, they are prone to producing hallucinated content, which severely undermines their reliability in high-stakes application domains. Existing hallucination attribution approaches, based on either external knowledge retrieval or internal model mechanisms, primarily focus on semantic similarity matching or representation-level discrimination. As a result, they have difficulty establishing structured correspondences at the span level between hallucination types, underlying error generation mechanisms, and external factual evidence, thereby limiting the interpretability of hallucinated fragments and the traceability of supporting or opposing evidence. To address these limitations, we propose HART, a fine-grained hallucination attribution and evidence retrieval framework for large language models. HART formalizes hallucination tracing as a structured modeling task comprising four stages: span localization, mechanism attribution, evidence retrieval, and causal tracing. Based upon this formulation, we develop the first structured dataset tailored for hallucination tracing, in which hallucination types, error mechanisms, and sets of counterfactual evidence are jointly annotated to enable causal-level interpretability evaluation. Experimental results on the proposed dataset demonstrate that HART substantially outperforms strong retrieval baselines, including BM25 and DPR, validating the effectiveness and generalization capability of the proposed tracing paradigm for hallucination analysis and evidence alignment.
Chinese Translation
大型语言模型(LLMs)在文本生成和知识密集型问答中展现了卓越的性能。然而,它们容易产生幻觉内容,这严重削弱了它们在高风险应用领域的可靠性。现有的幻觉归因方法主要基于外部知识检索或内部模型机制,主要关注语义相似性匹配或表示层次的区分。因此,它们在幻觉类型、潜在错误生成机制与外部事实证据之间的跨度级别建立结构化对应关系方面存在困难,从而限制了幻觉片段的可解释性和支持或反对证据的可追溯性。为了解决这些局限性,我们提出了HART,一个用于大型语言模型的细粒度幻觉归因和证据检索框架。HART将幻觉追踪形式化为一个结构化建模任务,包含四个阶段:跨度定位、机制归因、证据检索和因果追踪。基于这一框架,我们开发了第一个针对幻觉追踪的结构化数据集,其中幻觉类型、错误机制和反事实证据集被联合标注,以便进行因果层面的可解释性评估。在所提出的数据集上的实验结果表明,HART显著优于强大的检索基线,包括BM25和DPR,验证了所提出的追踪范式在幻觉分析和证据对齐方面的有效性和泛化能力。
cs.CL / 17 / 2603.05863

ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning

ReflexiCoder:通过强化学习教会大型语言模型自我反思生成的代码并进行自我修正
Jiang, Juyong, Shen, Jiasi, Kim, Sunghun, Yoo, Kang Min, Kim, Jeonghoon, Kim, Sungju
Abstract
While Large Language Models (LLMs) have revolutionized code generation, standard "System 1" approaches, generating solutions in a single forward pass, often hit a performance ceiling when faced with complex algorithmic tasks. Existing iterative refinement strategies attempt to bridge this gap at inference time, yet they predominantly rely on external oracles, execution feedback, or computationally expensive prompt-response cycles. In this work, we propose ReflexiCoder, a novel reinforcement learning (RL) framework that internalizes the structured reasoning trajectory, encompassing initial generation, bug and optimization aware reflection, and self-correction, directly into the model's weights. Unlike prior methods, ReflexiCoder shifts the paradigm from external-dependent refinement to an intrinsic, fully autonomous self-reflection and self-correction capabilities at inference time. We utilize an RL-zero training paradigm with granular reward functions to optimize the entire reflection-correction trajectory, teaching the model how to debug without reliance on ground-truth feedback or execution engines at inference time. Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder-8B establishes a new state-of-the-art (SOTA) among leading open-source models in the 1.5B-14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80% (78.57%) on MBPP (Plus), 35.00% on BigCodeBench, 52.21% on LiveCodeBench, and 37.34% on CodeForces in a single-attempt setting, rivaling or surpassing proprietary models like GPT-5.1. Notably, our framework is significantly more token-efficient than base models, reducing inference-time compute overhead by approximately 40% through disciplined, high-speed reasoning and reflection patterns. Source code is available at https://github.com/juyongjiang/ReflexiCoder.
Chinese Translation
虽然大型语言模型(LLMs)已经彻底改变了代码生成,但标准的“系统1”方法在面对复杂的算法任务时,往往会遇到性能瓶颈,通常只在一次前向传递中生成解决方案。现有的迭代优化策略试图在推理时弥补这一差距,但它们主要依赖于外部神谕、执行反馈或计算成本高昂的提示-响应循环。在本研究中,我们提出了ReflexiCoder,这是一种新颖的强化学习(RL)框架,它将结构化推理轨迹内化,包括初始生成、错误和优化意识的反思以及自我修正,直接融入模型的权重中。与之前的方法不同,ReflexiCoder将范式从依赖外部的优化转变为在推理时具备内在的、完全自主的自我反思和自我修正能力。我们利用RL-zero训练范式和细粒度奖励函数来优化整个反思-修正轨迹,教会模型如何在推理时不依赖于真实反馈或执行引擎进行调试。通过在七个基准测试上的广泛实验,我们的ReflexiCoder-8B在1.5B-14B范围内的领先开源模型中建立了新的最先进(SOTA)水平,在HumanEval(Plus)上达到94.51%(87.20%),在MBPP(Plus)上达到81.80%(78.57%),在BigCodeBench上达到35.00%,在LiveCodeBench上达到52.21%,在CodeForces上达到37.34%,在单次尝试设置中与GPT-5.1等专有模型相媲美或超越。值得注意的是,我们的框架在令牌效率上显著优于基础模型,通过严谨的高速推理和反思模式,将推理时的计算开销减少了约40%。源代码可在https://github.com/juyongjiang/ReflexiCoder获取。
cs.CL / 18 / 2603.05878

ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning

ROSE:重新排序的 SparseGPT 用于更精确的一次性大型语言模型剪枝
Su, Mingluo, Wang, Huan
Abstract
Pruning is widely recognized as an effective method for reducing the parameters of large language models (LLMs), potentially leading to more efficient deployment and inference. One classic and prominent path of LLM one-shot pruning is to leverage second-order gradients (i.e., Hessian), represented by the pioneering work SparseGPT. However, the predefined left-to-right pruning order in SparseGPT leads to suboptimal performance when the weights exhibit columnar patterns. This paper studies the effect of pruning order under the SparseGPT framework. The analyses lead us to propose ROSE, a reordered SparseGPT method that prioritizes weights with larger potential pruning errors to be pruned earlier. ROSE first performs pre-pruning to identify candidate weights for removal, and estimates both column and block pruning loss. Subsequently, two-level reordering is performed: columns within each block are reordered in descending order of column loss, while blocks are reordered based on block loss. We introduce the relative range of block loss as a metric to identify columnar layers, enabling adaptive reordering across the entire model. Substantial empirical results on prevalent LLMs (LLaMA2-7B/13B/70B, LLaMA3-8B, Mistral-7B) demonstrate that ROSE surpasses the original SparseGPT and other counterpart pruning methods. Our code is available at https://github.com/mingluo-su/ROSE.
Chinese Translation
剪枝被广泛认为是减少大型语言模型(LLMs)参数的有效方法,这可能导致更高效的部署和推理。一种经典且突出的 LLM 一次性剪枝路径是利用二阶梯度(即 Hessian),这在开创性工作 SparseGPT 中得到了体现。然而,SparseGPT 中预定义的从左到右的剪枝顺序在权重呈列状模式时会导致性能不佳。本文研究了在 SparseGPT 框架下剪枝顺序的影响。分析结果促使我们提出 ROSE,一种重新排序的 SparseGPT 方法,优先剪枝潜在剪枝误差较大的权重。ROSE 首先进行预剪枝,以识别待移除的候选权重,并估计列和块的剪枝损失。随后,进行两级重新排序:在每个块内,按列损失的降序重新排序列,而块则根据块损失进行重新排序。我们引入块损失的相对范围作为识别列状层的指标,从而实现整个模型的自适应重新排序。在流行的 LLM(LLaMA2-7B/13B/70B,LLaMA3-8B,Mistral-7B)上进行的实证结果表明,ROSE 超越了原始的 SparseGPT 和其他对比剪枝方法。我们的代码可在 https://github.com/mingluo-su/ROSE 获取。
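The reordering idea is easy to illustrate: estimate a per-column pruning loss, then visit columns in descending-loss order instead of left to right, so high-risk weights are handled while the most update headroom remains. The saliency proxy below (OBS-style w^2 over the inverse-Hessian diagonal) is a textbook simplification, not the paper's exact estimator:

```python
# Toy loss proxy and reordering; not ROSE's actual estimator or pipeline.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 8, 6
W = rng.normal(size=(d_out, d_in))
H = np.cov(rng.normal(size=(d_in, 128)))           # proxy input Hessian
Hinv_diag = np.diag(np.linalg.inv(H + 1e-3 * np.eye(d_in)))

# Column pruning loss: sum over rows of w^2 / [H^-1]_jj (OBS saliency).
col_loss = (W ** 2).sum(axis=0) / Hinv_diag

order = np.argsort(-col_loss)                      # descending loss
print("left-to-right order :", list(range(d_in)))
print("ROSE-style order    :", order.tolist())
```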
cs.CL / 19 / 2603.05881

Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation

回答前的信心:高效大语言模型不确定性估计的范式转变
Li, Changcheng, Wu, Jiancan, Zhang, Hengheng, Chen, Zhengsu, An, Guo, Qiu, Junxiang, Wang, Xiang, Tian, Qi
Abstract
Reliable deployment of large language models (LLMs) requires accurate uncertainty estimation. Existing methods are predominantly answer-first, producing confidence only after generating an answer; this measures the correctness of a specific response and limits practical usability. We study a confidence-first paradigm, where the model outputs its confidence before answering, interpreting this score as the model's probability of answering the question correctly under its current policy. We propose CoCA (Co-optimized Confidence and Answers), a GRPO reinforcement learning framework that jointly optimizes confidence calibration and answer accuracy via segmented credit assignment. By assigning separate rewards and group-relative advantages to confidence and answer segments, CoCA enables stable joint optimization and avoids reward hacking. Experiments across math, code, and factual QA benchmarks show improved calibration and uncertainty discrimination while preserving answer quality, thereby enabling a broader range of downstream applications.
Chinese Translation
大语言模型(LLMs)的可靠部署需要准确的不确定性估计。现有方法主要是先回答后评估,仅在生成答案后才产生信心,这种方法测量特定响应的正确性,限制了实际的可用性。我们研究了一种先信心后回答的范式,其中模型在回答之前输出其信心,将该分数解释为模型在当前策略下正确回答问题的概率。我们提出了CoCA(协同优化信心与答案),这是一种通过分段信用分配共同优化信心校准和答案准确性的GRPO强化学习框架。通过为信心和答案部分分配单独的奖励和组相对优势,CoCA实现了稳定的联合优化,避免了奖励操控。在数学、代码和事实问答基准上的实验表明,CoCA在保持答案质量的同时改善了校准和不确定性区分,从而使更广泛的下游应用成为可能。
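Segmented credit assignment can be sketched in a few lines: the confidence segment is rewarded for calibration, the answer segment for correctness, and each gets its own group-relative advantage as in GRPO. The reward definitions below (a negative Brier score for confidence) are our illustrative choices, not necessarily the paper's:

```python
# Illustrative reward definitions; the paper's exact rewards may differ.
import numpy as np

def confidence_reward(stated: float, correct: bool) -> float:
    """Negative Brier score: rewards calibrated confidence statements."""
    return -(stated - float(correct)) ** 2

def answer_reward(correct: bool) -> float:
    return 1.0 if correct else 0.0

def group_advantage(r: np.ndarray) -> np.ndarray:
    """Group-relative advantage: standardize rewards within the rollout group."""
    return (r - r.mean()) / (r.std() + 1e-8)

# Four rollouts for one question: (stated confidence, answer correct?).
group = [(0.9, True), (0.8, False), (0.3, False), (0.6, True)]
conf_r = np.array([confidence_reward(c, ok) for c, ok in group])
ans_r = np.array([answer_reward(ok) for _, ok in group])

print("confidence advantages:", np.round(group_advantage(conf_r), 2))
print("answer advantages:    ", np.round(group_advantage(ans_r), 2))
```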
cs.CL / 20 / 2603.05883

VerChol -- Grammar-First Tokenization for Agglutinative Languages

VerChol -- 以语法为先的粘合语言分词
Raja, Prabhu
Abstract
Tokenization is the foundational step in all large language model (LLM) pipelines, yet the dominant approach, Byte Pair Encoding (BPE) and its variants, is inherently script-agnostic and optimized for English-like morphology. For agglutinative languages, a typological class encompassing the Dravidian family (Tamil, Kannada, Telugu, Malayalam), Turkic languages (Turkish, Azerbaijani, Uzbek), Uralic languages (Finnish, Hungarian, Estonian), Korean, Japanese, Swahili, Basque, and others, a single word may encode root, tense, aspect, person, number, gender agreement, case, and postpositions into one orthographic unit. Statistical tokenizers fragment these words into byte-pair chunks that sever morpheme boundaries and inflate token counts.
Chinese Translation
分词是所有大型语言模型(LLM)流程中的基础步骤,然而,主流方法字节对编码(Byte Pair Encoding, BPE)及其变体本质上是与书写系统无关的,并且针对类似英语的形态进行了优化。对于粘合语言这一类型学类别,包括德拉威语系(泰米尔语、卡纳达语、泰卢固语、马拉雅拉姆语)、突厥语言(土耳其语、阿塞拜疆语、乌兹别克语)、乌拉尔语言(芬兰语、匈牙利语、爱沙尼亚语)、韩语、日语、斯瓦希里语、巴斯克语等,一个单词可能将词根、时态、体、人称、数、性别一致、格和后置词编码为一个正字法单位。统计分词器将这些单词分割成字节对块,切断了语素边界,并导致令牌数量膨胀。
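To see what a grammar-first segmenter buys over byte-pair merges, here is a toy right-to-left suffix stripper for Turkish, one of the agglutinative languages listed above; the three-entry suffix lexicon is obviously illustrative, not a real morphological analyzer:

```python
# Tiny illustrative suffix lexicon; a real analyzer would be far larger.
SUFFIXES = ["den", "imiz", "ler"]  # ablative, 1pl possessive, plural

def morpheme_tokenize(word: str) -> list[str]:
    morphs: list[str] = []
    changed = True
    while changed:
        changed = False
        for suf in SUFFIXES:
            if word.endswith(suf) and len(word) > len(suf):
                morphs.insert(0, suf)
                word = word[: -len(suf)]
                changed = True
                break
    return [word] + morphs

# "evlerimizden" = ev (house) + ler (PL) + imiz (our) + den (from)
print(morpheme_tokenize("evlerimizden"))  # ['ev', 'ler', 'imiz', 'den']
```

BPE trained on generic text would typically split the same word at statistically frequent byte boundaries that need not align with these morphemes.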
cs.CL / 21 / 2603.05890

Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

迷失在故事中:大语言模型(LLMs)在长篇故事生成中的一致性错误
Li, Junjie, Guo, Xinrui, Wu, Yuhao, Lee, Roy Ka-Wei, Li, Hongzhi, Xie, Yutao
Abstract
What happens when a storyteller forgets its own story? Large Language Models (LLMs) can now generate narratives spanning tens of thousands of words, but they often fail to maintain consistency throughout. When generating long-form narratives, these models can contradict their own established facts, character traits, and world rules. Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored. To address this gap, we present ConStory-Bench, a benchmark designed to evaluate narrative consistency in long-form story generation. It contains 2,000 prompts across four task scenarios and defines a taxonomy of five error categories with 19 fine-grained subtypes. We also develop ConStory-Checker, an automated pipeline that detects contradictions and grounds each judgment in explicit textual evidence. Evaluating a range of LLMs through five research questions, we find that consistency errors show clear tendencies: they are most common in factual and temporal dimensions, tend to appear around the middle of narratives, occur in text segments with higher token-level entropy, and certain error types tend to co-occur. These findings can inform future efforts to improve consistency in long-form narrative generation. Our project page is available at https://picrew.github.io/constory-bench.github.io/.
Chinese Translation
当讲故事的人忘记自己的故事时会发生什么?大型语言模型(LLMs)现在能够生成长达数万字的叙述,但它们往往无法在整个过程中保持一致性。在生成长篇叙述时,这些模型可能会与自己已建立的事实、角色特征和世界规则相矛盾。现有的故事生成基准主要关注情节质量和流畅性,导致一致性错误在很大程度上未得到探索。为了解决这一问题,我们提出了ConStory-Bench,这是一个旨在评估长篇故事生成中叙述一致性的基准。它包含2000个提示,涵盖四种任务场景,并定义了五个错误类别的分类法,包含19个细分类型。我们还开发了ConStory-Checker,这是一个自动化流程,能够检测矛盾并将每个判断基于明确的文本证据。通过五个研究问题评估一系列LLMs,我们发现一致性错误表现出明显的倾向:它们在事实和时间维度上最为常见,往往出现在叙述的中间部分,发生在具有较高标记级别熵的文本片段中,并且某些错误类型倾向于共现。这些发现可以为未来改善长篇叙述生成中的一致性提供参考。我们的项目页面可访问 https://picrew.github.io/constory-bench.github.io/。
cs.CL / 22 / 2603.05895

Building an Ensemble LLM Semantic Tagger for UN Security Council Resolutions

构建联合国安全理事会决议的集成大语言模型语义标注器
Ghaly, Hussein
Abstract
This paper introduces a new methodology for using LLM-based systems for accurate and efficient semantic tagging of UN Security Council resolutions. The main goal is to leverage LLM performance variability to build ensemble systems for data cleaning and semantic tagging tasks. We introduce two evaluation metrics: Content Preservation Ratio (CPR) and Tag Well-Formedness (TWF), in order to avoid hallucinations and unnecessary additions or omissions to the input text beyond the task requirement. These metrics allow the selection of the best output from multiple runs of several GPT models. GPT-4.1 achieved the highest metrics for both tasks (Cleaning: CPR 84.9% - Semantic Tagging: CPR 99.99% and TWF 99.92%). In terms of cost, smaller models, such as GPT-4.1-mini, achieved comparable performance to the best model in each task at only 20% of the cost. These metrics ultimately allowed the ensemble to select the optimal output (both cleaned and tagged content) for all the LLM models involved, across multiple runs. With this ensemble design and the use of metrics, we create a reliable LLM system for performing semantic tagging on challenging texts.
Chinese Translation
本文介绍了一种新的方法论,旨在利用基于大语言模型(LLM)系统对联合国安全理事会决议进行准确高效的语义标注。主要目标是利用LLM性能的变异性构建集成系统,以用于数据清理和语义标注任务。我们引入了两个评估指标:内容保留率(Content Preservation Ratio, CPR)和标签格式良好性(Tag Well-Formedness, TWF),以避免在任务要求之外对输入文本的虚构、不必要的添加或遗漏。这些指标允许从多个GPT模型的多次运行中选择最佳输出。GPT-4.1在两个任务中均取得了最高指标(清理任务:CPR 84.9% - 语义标注任务:CPR 99.99% 和 TWF 99.92%)。在成本方面,较小的模型,如GPT-4.1-mini,在每个任务中以仅20%的成本达到了与最佳模型相当的性能。这些指标最终使集成能够选择所有参与的LLM模型在多次运行中的最佳输出(包括清理和标注后的内容)。通过这种集成设计和指标的使用,我们创建了一个可靠的LLM系统,用于对具有挑战性的文本进行语义标注。
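The digest does not give the exact formulas for CPR and TWF, so the sketch below encodes one plausible reading (token preservation for CPR, balanced tags for TWF); treat both definitions as assumptions rather than the paper's specification:

```python
# Assumed metric definitions; the paper's formulas may differ.
import re
from collections import Counter

def cpr(source: str, output: str) -> float:
    """Fraction of the input's tokens preserved in the (de-tagged) output."""
    plain = re.sub(r"</?\w+>", " ", output)
    src, out = Counter(source.split()), Counter(plain.split())
    preserved = sum(min(c, out[w]) for w, c in src.items())
    return preserved / max(sum(src.values()), 1)

def twf(output: str) -> float:
    """Fraction of opened tags that have a matching closing tag."""
    opened = re.findall(r"<(\w+)>", output)
    closed = re.findall(r"</(\w+)>", output)
    if not opened:
        return 1.0
    return sum(1 for t in opened if t in closed) / len(opened)

src = "Decides to remain seized of the matter ."
out = "<action>Decides</action> to remain seized of the matter ."
print(f"CPR={cpr(src, out):.2f}  TWF={twf(out):.2f}")  # CPR=1.00  TWF=1.00
```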
cs.CL / 23 / 2603.05909

InfoGatherer: Principled Information Seeking via Evidence Retrieval and Strategic Questioning

信息收集者:通过证据检索和战略提问进行原则性信息寻求
Taranukhin, Maksym, Li, Shuyue Stella, Milios, Evangelos, Pleiss, Geoff, Tsvetkov, Yulia, Shwartz, Vered
Abstract
LLMs are increasingly deployed in high-stakes domains such as medical triage and legal assistance, often as document-grounded QA systems in which a user provides a description, relevant sources are retrieved, and an LLM generates a prediction. In practice, initial user queries are often underspecified, and a single retrieval pass is insufficient for reliable decision-making, leading to incorrect and overly confident answers. While follow-up questioning can elicit missing information, existing methods typically depend on implicit, unstructured confidence signals from the LLM, making it difficult to determine what remains unknown, what information matters most, and when to stop asking questions. We propose InfoGatherer, a framework that gathers missing information from two complementary sources: retrieved domain documents and targeted follow-up questions to the user. InfoGatherer models uncertainty using Dempster-Shafer belief assignments over a structured evidential network, enabling principled fusion of incomplete and potentially contradictory evidence from both sources without prematurely collapsing to a definitive answer. Across legal and medical tasks, InfoGatherer outperforms strong baselines while requiring fewer turns. By grounding uncertainty in formal evidential theory rather than heuristic LLM signals, InfoGatherer moves towards trustworthy, interpretable decision support in domains where reliability is critical.
Chinese Translation
大型语言模型(LLMs)越来越多地应用于医疗分诊和法律援助等高风险领域,通常作为文档基础的问答系统,其中用户提供描述,相关来源被检索,LLM生成预测。实际上,初始用户查询往往不够明确,单次检索无法满足可靠决策的需求,导致错误和过于自信的回答。虽然后续提问可以引出缺失的信息,但现有方法通常依赖于LLM的隐式、非结构化的信心信号,这使得确定哪些信息仍然未知、哪些信息最为重要以及何时停止提问变得困难。我们提出了信息收集者(InfoGatherer),一个从两个互补来源收集缺失信息的框架:检索到的领域文档和针对用户的有针对性的后续提问。信息收集者通过在结构化证据网络上使用邓普斯特-沙弗置信分配来建模不确定性,从而实现对来自两个来源的不完整和潜在矛盾证据的原则性融合,而不至于过早地得出明确答案。在法律和医疗任务中,信息收集者的表现优于强基线,同时需要更少的交互轮次。通过将不确定性基于正式的证据理论而非启发式的LLM信号,信息收集者朝着在可靠性至关重要的领域实现可信、可解释的决策支持迈进。
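The evidential machinery named in the abstract, Dempster-Shafer belief assignments, has a standard combination rule that is worth seeing concretely. A minimal implementation on a two-hypothesis frame, with invented masses for a retrieved document and a user answer:

```python
# Invented masses on a toy frame; InfoGatherer's network is far richer.
from itertools import product

FRAME = frozenset({"eligible", "ineligible"})

def combine(m1: dict, m2: dict):
    """Dempster's rule: intersect focal sets, renormalize away conflict."""
    fused: dict = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            fused[inter] = fused.get(inter, 0.0) + x * y
        else:
            conflict += x * y
    k = 1.0 - conflict
    return {s: v / k for s, v in fused.items()}, conflict

# Document weakly supports eligibility; user answer is partly uninformative.
m_doc = {frozenset({"eligible"}): 0.6, FRAME: 0.4}
m_user = {frozenset({"ineligible"}): 0.3, FRAME: 0.7}

fused, conflict = combine(m_doc, m_user)
for s, v in fused.items():
    print(f"m({set(s)}) = {v:.3f}")
print(f"conflict mass discarded: {conflict:.3f}")
```

Mass left on the full frame is exactly the "not yet collapsed" uncertainty the abstract refers to, and a large conflict term is a natural trigger for another follow-up question.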
cs.CL / 24 / 2603.05923

Learning Next Action Predictors from Human-Computer Interaction

从人机交互中学习下一步行动预测器
Shaikh, Omar, Teutschbein, Valentin, Gandhi, Kanishk, Chi, Yikun, Haber, Nick, Robinson, Thomas, Ram, Nilam, Reeves, Byron, Yang, Sherry, Bernstein, Michael S., Yang, Diyi
Abstract
Truly proactive AI systems must anticipate what we will do next. This foresight demands far richer information than the sparse signals we type into our prompts -- it demands reasoning over the entire context of what we see and do. We formalize this as next action prediction (NAP): given a sequence of a user's multimodal interactions with a computer (screenshots, clicks, sensor data), predict that user's next action. Progress on this task requires both new data and modeling approaches. To scale data, we annotate longitudinal, naturalistic computer use with vision-language models. We release an open-source pipeline for performing this labeling on private infrastructure, and label over 360K actions across one month of continuous phone usage from 20 users, amounting to 1,800 hours of screen time. We then introduce LongNAP, a user model that combines parametric and in-context learning to reason over long interaction histories. LongNAP is trained via policy gradient methods to generate user-specific reasoning traces given some context; retrieve relevant traces from a library of past traces; and then apply retrieved traces in-context to predict future actions. Using an LLM-as-judge evaluation metric (0-1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held-out data (by 79% and 39% respectively). Additionally, LongNAP generalizes to held out users when trained across individuals. The space of next actions a user might take at any moment is unbounded, spanning thousands of possible outcomes. Despite this, 17.1% of LongNAP's predicted trajectories are well-aligned with what a user does next (LLM-judge score $\geq$ 0.5). This rises to 26% when we filter to highly confident predictions. In sum, we argue that learning from the full context of user behavior to anticipate user needs is now a viable task with substantial opportunity.
Chinese Translation
真正的主动人工智能系统必须预测我们接下来会做什么。这种前瞻性要求的信息远比我们在提示中输入的稀疏信号丰富得多——它需要对我们所见和所做的整个上下文进行推理。我们将其形式化为下一步行动预测(Next Action Prediction, NAP):给定用户与计算机的多模态交互序列(屏幕截图、点击、传感器数据),预测该用户的下一步行动。要在这一任务上取得进展,既需要新的数据,也需要新的建模方法。为了扩大数据规模,我们使用视觉-语言模型对长期自然计算机使用进行注释。我们发布了一个开源管道,用于在私有基础设施上执行这一标注,并对来自20名用户的一个月连续手机使用中的360K个行动进行了标注,总计1800小时的屏幕时间。随后,我们引入了LongNAP,一个用户模型,结合了参数学习和上下文学习,以对长时间交互历史进行推理。LongNAP通过策略梯度方法进行训练,以生成给定上下文的用户特定推理轨迹;从过去轨迹的库中检索相关轨迹;然后将检索到的轨迹在上下文中应用,以预测未来的行动。使用LLM作为评判者(LLM-as-judge)的评估指标(与真实值的相似度为0-1),LongNAP在保留数据集上显著优于监督微调和提示基线(分别提高了79%和39%)。此外,当在不同个体之间训练时,LongNAP能够推广到未见用户。用户在任何时刻可能采取的下一步行动的空间是无限的,涵盖了数千种可能的结果。尽管如此,LongNAP预测的轨迹中有17.1%与用户接下来的行动高度一致(LLM评估分数≥0.5)。当我们过滤到高度自信的预测时,这一比例上升至26%。总之,我们认为,从用户行为的完整上下文中学习以预测用户需求现在已成为一个可行的任务,并且具有重要的机会。
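A minimal sketch of the retrieve-then-apply step described above: past reasoning traces are ranked by cosine similarity to the current context and spliced into a prompt. The embeddings and traces are toy stand-ins; LongNAP's learned components are not reproduced here:

```python
import numpy as np

def retrieve_traces(context_vec, trace_vecs, traces, k=3):
    """Rank stored reasoning traces by cosine similarity to the current context."""
    q = context_vec / np.linalg.norm(context_vec)
    t = trace_vecs / np.linalg.norm(trace_vecs, axis=1, keepdims=True)
    top = np.argsort(t @ q)[::-1][:k]
    return [traces[i] for i in top]

rng = np.random.default_rng(0)
library = ["opens calendar right after reading email",
           "checks weather before opening maps"]
vecs = rng.normal(size=(2, 8))                 # stand-in trace embeddings
context = vecs[0] + 0.1 * rng.normal(size=8)   # context resembling trace 0
picked = retrieve_traces(context, vecs, library, k=1)
prompt = ("Relevant past behaviour:\n" + "\n".join(picked)
          + "\nPredict the user's next action.")
print(prompt)
```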
cs.CL / 25 / 2603.05928

Addressing the Ecological Fallacy in Larger LMs with Human Context

利用人类上下文解决更大语言模型中的生态谬误
Soni, Nikita, Kunjadiya, Dhruv Vijay, Shah, Pratham Piyush, Mohanty, Dikshya, Schwartz, H. Andrew, Balasubramanian, Niranjan
Abstract
Language model training and inference ignore a fundamental linguistic fact -- there is a dependence between multiple sequences of text written by the same person. Prior work has shown that addressing this form of \textit{ecological fallacy} can greatly improve the performance of multiple smaller (~124M) GPT-based models. In this work, we ask if addressing the ecological fallacy by modeling the author's language context with a specific LM task (called HuLM) can provide similar benefits for a larger-scale model, an 8B Llama model. To this end, we explore variants that process an author's language in the context of their other temporally ordered texts. We study the effect of pre-training with this author context using the HuLM objective, as well as using it during fine-tuning with author context (\textit{HuFT:Human-aware Fine-Tuning}). Empirical comparisons show that addressing the ecological fallacy during fine-tuning alone using QLoRA improves the performance of the larger 8B model over standard fine-tuning. Additionally, QLoRA-based continued HuLM pre-training results in a human-aware model generalizable for improved performance over eight downstream tasks with linear task classifier training alone. These results indicate the utility and importance of modeling language in the context of its original generators, the authors.
Chinese Translation
语言模型的训练和推理忽视了一个基本的语言学事实——同一人撰写的多个文本序列之间存在依赖关系。先前的研究表明,解决这种形式的生态谬误可以显著提高多个较小(约124M)的基于GPT的模型的性能。在本研究中,我们探讨通过使用特定的语言模型任务(称为HuLM)来建模作者的语言上下文,是否能够为一个更大规模的模型,即8B Llama模型,提供类似的好处。为此,我们探索了在作者其他按时间排序的文本的上下文中处理其语言的变体。我们研究了使用HuLM目标进行预训练时引入作者上下文的效果,以及在微调过程中使用作者上下文(HuFT:人类感知微调)。实证比较表明,仅在使用QLoRA微调的过程中解决生态谬误,就能使更大的8B模型性能优于标准微调。此外,基于QLoRA的持续HuLM预训练产生了一个具备人类感知能力且可泛化的模型,仅需训练线性任务分类器即可在八个下游任务上获得更好的性能。这些结果表明,在其原始生成者(作者)的上下文中建模语言的实用性和重要性。
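A minimal sketch of the author-context setup the abstract describes: texts are grouped per author, ordered in time, and each text is prefixed with its temporally preceding texts. The separator token and context size are assumptions, not the paper's configuration:

```python
from collections import defaultdict

SEP = " <|sep|> "  # hypothetical document-separator token

def build_author_sequences(posts, max_context=3):
    """Group texts by author, order them in time, and prepend each text's
    temporally preceding texts as author context (a HuLM-style setup)."""
    by_author = defaultdict(list)
    for author, ts, text in posts:
        by_author[author].append((ts, text))
    sequences = []
    for author, items in by_author.items():
        items.sort()                          # temporal order
        for i, (_, text) in enumerate(items):
            history = [t for _, t in items[max(0, i - max_context):i]]
            sequences.append(SEP.join(history + [text]))
    return sequences

posts = [("u1", 2, "second post"), ("u1", 1, "first post"), ("u2", 1, "hello")]
for s in build_author_sequences(posts):
    print(s)
```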
cs.CL / 26 / 2603.05933

Implicit Style Conditioning: A Structured Style-Rewrite Framework for Low-Resource Character Modeling

隐式风格条件化:一种用于低资源角色建模的结构化风格重写框架
Zhu, Chanhui
Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing (RP); however, endowing small Language Models (SLMs) with highly stylized personas remains a challenge due to data scarcity and the complexity of style disentanglement. Standard Supervised Fine-Tuning (SFT) often captures surface-level semantics while failing to reproduce the intricate syntactic and pragmatic nuances of a character, leading to "Out-Of-Character" (OOC) generation. To address this, we propose a Structured Style-Rewrite Framework that explicitly disentangles style into three interpretable dimensions: lexical signatures (via PMI), syntactic patterns (grounded in PCFG rules), and pragmatic style. Furthermore, we introduce an implicit style conditioning strategy via Chain-of-Thought (CoT) distillation. By leveraging explicit reasoning traces during training as a strong inductive bias, our approach aligns the model's latent representations with structured style features, enabling high-fidelity stylized generation without requiring explicit reasoning tokens during inference. Extensive experiments on a specific high-stylization domain (anime characters) demonstrate that our method enables a Qwen-1.7B model to outperform significantly larger baselines (e.g., 4B Vanilla SFT) in style consistency and semantic fidelity. Our approach offers a data-efficient paradigm for democratizing inference and deployment on consumer hardware.
Chinese Translation
大型语言模型(LLMs)在角色扮演(RP)方面展现了令人印象深刻的能力;然而,由于数据稀缺和风格解耦的复杂性,具有高度风格化角色的小型语言模型(SLMs)仍然面临挑战。标准的监督微调(SFT)往往只能捕捉表层语义,而未能重现角色的复杂句法和语用细微差别,导致“角色外”(OOC)生成。为了解决这个问题,我们提出了一种结构化风格重写框架,该框架将风格明确解耦为三个可解释的维度:词汇特征(通过PMI)、句法模式(基于PCFG规则)和语用风格。此外,我们通过链式思维(CoT)蒸馏引入了一种隐式风格条件化策略。通过在训练过程中利用显式推理轨迹作为强归纳偏置,我们的方法将模型的潜在表示与结构化风格特征对齐,从而实现高保真度的风格化生成,而无需在推理时生成显式推理标记。在特定的高风格化领域(如动漫角色)进行的广泛实验表明,我们的方法使得Qwen-1.7B模型在风格一致性和语义保真度方面显著超越了更大的基线模型(例如,4B Vanilla SFT)。我们的方法为在消费级硬件上实现推理和部署提供了一种数据高效的范式。
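The PMI-based lexical signatures can be sketched directly from the standard pointwise-mutual-information formula; the toy corpus below is illustrative, and the paper's actual preprocessing may differ:

```python
import math
from collections import Counter

def pmi_signature(character_tokens, corpus_tokens, top_n=5, min_count=2):
    """Rank words by PMI(w; character) = log( p(w | character) / p(w) )."""
    char_counts = Counter(character_tokens)
    all_counts = Counter(corpus_tokens)
    n_char, n_all = len(character_tokens), len(corpus_tokens)
    scores = {}
    for w, c in char_counts.items():
        if c < min_count:
            continue  # skip rare words to stabilize the estimate
        scores[w] = math.log((c / n_char) / (all_counts[w] / n_all))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

character = "nya nya indeed nya I shall nya".split()
corpus = character + "the cat sat indeed the dog sat I shall go".split()
print(pmi_signature(character, corpus))   # the verbal tic "nya" scores highest
```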
cs.CL / 27 / 2603.05953

Who We Are, Where We Are: Mental Health at the Intersection of Person, Situation, and Large Language Models

我们是谁,我们在哪里:在个体、情境与大型语言模型交汇处的心理健康
Soni, Nikita, Nilsson, August Håkan, Mahwish, Syeda, Varadarajan, Vasudha, Schwartz, H. Andrew, Boyd, Ryan L.
Abstract
Mental health is not a fixed trait but a dynamic process shaped by the interplay between individual dispositions and situational contexts. Building on interactionist and constructionist psychological theories, we develop interpretable models to predict well-being and identify adaptive and maladaptive self-states in longitudinal social media data. Our approach integrates person-level psychological traits (e.g., resilience, cognitive distortions, implicit motives) with language-inferred situational features derived from the Situational 8 DIAMONDS framework. We compare these theory-grounded features to embeddings from a psychometrically-informed language model that captures temporal and individual-specific patterns. Results show that our principled, theory-driven features provide competitive performance while offering greater interpretability. Qualitative analyses further highlight the psychological coherence of features most predictive of well-being. These findings underscore the value of integrating computational modeling with psychological theory to assess dynamic mental states in contextually sensitive and human-understandable ways.
Chinese Translation
心理健康不是一个固定的特征,而是一个动态过程,受到个体特质与情境背景之间相互作用的影响。基于互动主义和建构主义心理理论,我们开发了可解释的模型,以预测幸福感并识别纵向社交媒体数据中的适应性和不适应性自我状态。我们的方法将个体层面的心理特质(例如,韧性、认知扭曲、隐性动机)与根据情境8 DIAMONDS框架推断的语言情境特征相结合。我们将这些基于理论的特征与一种基于心理测量学的语言模型的嵌入进行比较,该模型捕捉了时间和个体特定的模式。结果表明,我们的原则性、理论驱动的特征在提供竞争性表现的同时,具有更高的可解释性。定性分析进一步突显了最能预测幸福感的特征的心理一致性。这些发现强调了将计算建模与心理理论相结合的价值,以情境敏感且人类可理解的方式评估动态心理状态。
cs.CL / 28 / 2603.05996

Track-SQL: Enhancing Generative Language Models with Dual-Extractive Modules for Schema and Context Tracking in Multi-turn Text-to-SQL

Track-SQL:通过双提取模块增强生成语言模型以实现多轮文本到SQL中的模式和上下文跟踪
Chen, Bingfeng, Shi, Shaobin, Luo, Yongqi, Xu, Boyan, Cai, Ruichu, Hao, Zhifeng
Abstract
Generative language models have shown significant potential in single-turn Text-to-SQL. However, their performance does not extend equivalently to multi-turn Text-to-SQL. This is primarily due to generative language models' inadequacy in handling the complexities of context information and dynamic schema linking in multi-turn interactions. In this paper, we propose a framework named Track-SQL, which enhances generative language models with dual-extractive modules designed to track schema and contextual changes in multi-turn Text-to-SQL. Specifically, Track-SQL incorporates a \emph{Semantic-enhanced Schema Extractor} and a \emph{Schema-aware Context Extractor}. Experimental results demonstrate that Track-SQL achieves state-of-the-art performance on the SparC and CoSQL datasets. Furthermore, detailed ablation studies reveal that Track-SQL significantly improves execution accuracy in multi-turn interactions by 7.1\% and 9.55\% on these datasets, respectively. Our implementation will be open-sourced at https://github.com/DMIRLAB-Group/Track-SQL.
Chinese Translation
生成语言模型在单轮文本到SQL中展现了显著的潜力。然而,它们的性能并未同样适用于多轮文本到SQL。这主要是由于生成语言模型在处理多轮交互中的上下文信息复杂性和动态模式链接方面的不足。本文提出了一种名为Track-SQL的框架,该框架通过双提取模块增强生成语言模型,旨在跟踪多轮文本到SQL中的模式和上下文变化。具体而言,Track-SQL结合了一个语义增强模式提取器和一个模式感知上下文提取器。实验结果表明,Track-SQL在SparC和CoSQL数据集上达到了最先进的性能。此外,详细的消融研究显示,Track-SQL在这些数据集上的多轮交互执行准确率分别提高了7.1%和9.55%。我们的实现将开源于https://github.com/DMIRLAB-Group/Track-SQL。
cs.CL / 29 / 2603.06007

MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing

MASFactory:一个以图为中心的框架,通过 Vibe 图形化协调基于大型语言模型的多智能体系统
Liu, Yang, Cai, Jinxuan, Li, Yishen, Meng, Qi, Liu, Zedi, Li, Xin, Qian, Chen, Shi, Chuan, Yang, Cheng
Abstract
Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents/sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT-GAMMA/MASFactory) and video (https://youtu.be/ANynzVfY32k) are publicly available.
Chinese Translation
基于大型语言模型(LLM)的多智能体系统(MAS)越来越多地被用于通过角色专业化和协作扩展智能体问题解决能力。MAS 工作流可以自然地建模为有向计算图,其中节点执行智能体/子工作流,边缘编码依赖关系和消息传递。然而,在当前框架中实现复杂的图工作流仍然需要大量的手动工作,提供的重用性有限,并且难以集成异构外部上下文源。为克服这些局限性,我们提出了 MASFactory,一个以图为中心的框架,用于协调基于 LLM 的 MAS。它引入了 Vibe 图形化,这是一种人机协作的方法,将自然语言意图编译成可编辑的工作流规范,然后转化为可执行的图。此外,该框架提供可重用组件和可插拔的上下文集成,以及用于拓扑预览、运行时追踪和人机交互的可视化工具。我们在七个公共基准上评估了 MASFactory,验证了代表性 MAS 方法的重现一致性和 Vibe 图形化的有效性。我们的代码(https://github.com/BUPT-GAMMA/MASFactory)和视频(https://youtu.be/ANynzVfY32k)已公开发布。
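The directed-computation-graph execution model the abstract describes maps naturally onto a topological traversal; a minimal sketch with stand-in agents (not MASFactory's API) follows:

```python
from graphlib import TopologicalSorter

def run_workflow(graph, agents, seed):
    """Execute a directed computation graph: each node's agent consumes the
    outputs of its predecessor nodes and passes its own output downstream."""
    outputs = {}
    for node in TopologicalSorter(graph).static_order():
        inputs = [outputs[p] for p in graph[node]] or [seed]
        outputs[node] = agents[node](inputs)
    return outputs

# graphlib convention: node -> iterable of predecessor nodes.
graph = {"plan": (), "code": ("plan",), "review": ("plan", "code")}
agents = {
    "plan":   lambda xs: f"plan({xs[0]})",
    "code":   lambda xs: f"code({xs[0]})",
    "review": lambda xs: f"review({' | '.join(xs)})",
}
print(run_workflow(graph, agents, "user intent"))
```

In a real system each lambda would be an agent or sub-workflow call, and the edges would also carry messages; the traversal order is the only essential invariant shown here.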
cs.CL / 30 / 2603.06024

ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

视图融合:用于多视角推理的结构化空间思维链
Tao, Xingjian, Wang, Yiwei, Cai, Yujun, Song, Yifan, Tang, Jing
Abstract
Multi-view spatial reasoning remains difficult for current vision-language models. Even when multiple viewpoints are available, models often underutilize cross-view relations and instead rely on single-image shortcuts, leading to fragile performance on viewpoint transformation and occlusion-sensitive cases. We present ViewFusion, a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. In the first stage, the model performs deliberate spatial pre-thinking to infer viewpoint relations and spatial transformations across views, forming an intermediate workspace that goes beyond a simple re-description. In the second stage, the model conducts question-driven reasoning conditioned on this workspace to produce the final prediction. We train ViewFusion with synthetic reasoning supervision followed by reinforcement learning using GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior. On MMSI-Bench, ViewFusion improves accuracy by 5.3\% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment.
Chinese Translation
多视角空间推理对于当前的视觉-语言模型仍然具有挑战性。即使在多个视角可用的情况下,模型往往未能充分利用视角间的关系,而是依赖于单图像的捷径,导致在视角转换和对遮挡敏感的情况下表现脆弱。我们提出了ViewFusion,这是一个两阶段框架,明确将跨视角空间预对齐与问答过程分开。在第一阶段,模型进行有意识的空间预思考,以推断视角关系和视角间的空间变换,形成一个超越简单重述的中间工作空间。在第二阶段,模型基于该工作空间进行以问题驱动的推理,以生成最终预测。我们使用合成推理监督训练ViewFusion,随后通过使用GRPO进行强化学习,这提高了答案的正确性,同时稳定了预期的两阶段生成行为。在MMSI-Bench上,ViewFusion的准确性比Qwen3-VL-4B-Instruct提高了5.3%,在需要真正跨视角对齐的示例中获得了最大的提升。
cs.CL / 31 / 2603.06066

Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring

利用大型语言模型对奥地利A-Level德语论文进行自动化评分
Kubesch, Jonas, Huber, Lena, Havas, Clemens
Abstract
Automated Essay Scoring (AES) has been explored for decades with the goal to support teachers by reducing grading workload and mitigating subjective biases. While early systems relied on handcrafted features and statistical models, recent advances in Large Language Models (LLMs) have made it possible to evaluate student writing with unprecedented flexibility. This paper investigates the application of state-of-the-art open-weight LLMs for the grading of Austrian A-level German texts, with a particular focus on rubric-based evaluation. A dataset of 101 anonymised student exams across three text types was processed and evaluated. Four LLMs, DeepSeek-R1 32b, Qwen3 30b, Mixtral 8x7b and LLama3.3 70b, were evaluated with different contexts and prompting strategies. The LLMs were able to reach a maximum of 40.6% agreement with the human rater in the rubric-provided sub-dimensions, and only 32.8% of final grades matched the ones given by a human expert. The results indicate that even though smaller models are able to use standardised rubrics for German essay grading, they are not accurate enough to be used in a real-world grading environment.
Chinese Translation
自动化论文评分(AES)已被研究数十年,旨在通过减少评分工作量和减轻主观偏见来支持教师。尽管早期系统依赖于手工特征和统计模型,但大型语言模型(LLMs)的最新进展使得以空前的灵活性评估学生写作成为可能。本文探讨了最先进的开放权重LLMs在奥地利A-Level德语文本评分中的应用,特别关注基于评分标准的评估。处理并评估了来自三种文本类型的101份匿名学生考试数据集。评估了四种LLMs,分别是DeepSeek-R1 32b、Qwen3 30b、Mixtral 8x7b和LLama3.3 70b,采用不同的上下文和提示策略。LLMs在评分标准提供的子维度上与人工评分者的最大一致性达到40.6%,而最终成绩中仅有32.8%与人类专家的评分相符。结果表明,尽管较小的模型能够使用标准化评分标准进行德语论文评分,但它们的准确性不足以在实际评分环境中使用。
cs.CL / 32 / 2603.06088

Experiences Build Characters: The Linguistic Origins and Functional Impact of LLM Personality

经验塑造性格:大型语言模型个性的语言学起源及其功能影响
Wang, Xi, Zhuang, Mengdie, Liu, Jiqun
Abstract
Human problem-solving is enriched by a diversity of styles and personality traits, yet the development of Large Language Models (LLMs) has largely prioritized uniform performance benchmarks that favour specific behavioural tendencies such as assertiveness. To investigate how diverse experiences shape machine personality and influence problem-solving, this study employs continued pre-training to expose models to domain-specific texts in an unsupervised manner, simulating the accumulation of experience. By adapting the Big Five framework via the Machine Personality Inventory (MPI), we quantify the personality traits of these model variants and analyse their relationship to linguistic style and reasoning behaviour. The findings reveal that model competence is bimodal, peaking at "Expressive Generalists" and "Suppressed Specialists," while identifying a "Suppression Advantage" where reduced social traits enhance complex reasoning performance. This study further establishes a causal link between training data linguistics, such as imperative frequency, and lexical diversity, providing a roadmap for "Personality Engineering".
Chinese Translation
人类解决问题的能力因风格和个性特征的多样性而得以丰富,然而大型语言模型(LLMs)的发展在很大程度上优先考虑了统一的性能基准,这些基准偏向于特定的行为倾向,如自信。为了研究多样化的经验如何塑造机器个性并影响问题解决,本研究采用持续的预训练,以无监督的方式使模型接触领域特定文本,模拟经验的积累。通过调整大五人格框架(Big Five)并使用机器个性清单(Machine Personality Inventory, MPI),我们量化了这些模型变体的个性特征,并分析其与语言风格和推理行为的关系。研究结果表明,模型的能力呈双峰分布,分别在“表达型通才”(Expressive Generalists)和“抑制型专家”(Suppressed Specialists)处达到峰值,同时识别出一种“抑制优势”(Suppression Advantage),即减少社交特征能够增强复杂推理的表现。本研究进一步建立了训练数据语言学(如命令频率)与词汇多样性之间的因果关系,为“个性工程”(Personality Engineering)提供了路线图。
cs.CL / 33 / 2603.06114

Making Implicit Premises Explicit in Logical Understanding of Enthymemes

在对隐含论证的逻辑理解中使隐含前提显性化
Feng, Xuyao, Hunter, Anthony
Abstract
Real-world arguments in text and dialogues are normally enthymemes (i.e. some of their premises and/or claims are implicit). Natural language processing (NLP) methods for handling enthymemes can potentially identify enthymemes in text but they do not decode their underlying logic, whereas logic-based approaches for handling them assume a knowledgebase with sufficient formulae that can be used to decode them via abduction. There is therefore a lack of a systematic method for translating textual components of an enthymeme into a logical argument and generating the logical formulae required for their decoding, and thereby showing logical entailment. To address this, we propose a pipeline that integrates: (1) a large language model (LLM) to generate intermediate implicit premises based on the explicit premise and claim; (2) another LLM to translate the natural language into logical formulas; and (3) a neuro-symbolic reasoner based on a SAT solver to determine entailment. We evaluate our pipeline on two enthymeme datasets, demonstrating promising performance in selecting the correct implicit premise, as measured by precision, recall, F1-score, and accuracy.
Chinese Translation
文本和对话中的现实世界论证通常是隐含论证(即其某些前提和/或主张是隐含的)。处理隐含论证的自然语言处理(NLP)方法可以潜在地识别文本中的隐含论证,但它们并未解码其潜在逻辑,而基于逻辑的方法处理隐含论证则假设存在一个包含足够公式的知识库,可以通过溯因推理对其进行解码。因此,缺乏一种系统的方法将隐含论证的文本组成部分转化为逻辑论证,并生成解码所需的逻辑公式,从而展示逻辑蕴涵。为了解决这个问题,我们提出了一个集成的流程: (1) 使用大型语言模型(LLM)根据显性前提和主张生成中间隐含前提;(2) 另一个LLM将自然语言翻译为逻辑公式;(3) 基于SAT求解器的神经符号推理器用于确定蕴涵。我们在两个隐含论证数据集上评估了我们的流程,结果显示在选择正确隐含前提方面表现出良好的性能,具体通过精确度、召回率、F1分数和准确率进行衡量。
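The entailment check in step (3) can be illustrated with a brute-force propositional stand-in: truth-table enumeration in place of the SAT-based neuro-symbolic reasoner, using the classic Socrates enthymeme:

```python
from itertools import product

def entails(premises, conclusion, atoms):
    """Check premises |= conclusion by enumerating all truth assignments.
    Formulas are boolean lambdas over an assignment dict; a SAT solver would
    replace this exponential loop at scale."""
    for values in product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a countermodel
    return True

# Enthymeme: "Socrates is human, so Socrates is mortal", with the implicit
# premise "all humans are mortal" rendered propositionally as H -> M.
explicit = lambda env: env["H"]                     # Socrates is human
implicit = lambda env: (not env["H"]) or env["M"]   # H -> M (LLM-generated)
claim    = lambda env: env["M"]                     # Socrates is mortal

print(entails([explicit], claim, ["H", "M"]))            # False: gap remains
print(entails([explicit, implicit], claim, ["H", "M"]))  # True once decoded
```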
cs.CL / 34 / 2603.06123

Diffusion Language Models Are Natively Length-Aware

扩散语言模型本质上具备长度感知能力
Rossi, Vittorio, Cirò, Giacomo, Beltrame, Davide, Gandolfi, Luca, Röttger, Paul, Hovy, Dirk
Abstract
Unlike autoregressive language models, which terminate variable-length generation upon predicting an End-of-Sequence (EoS) token, Diffusion Language Models (DLMs) operate over a fixed maximum-length context window for a predetermined number of denoising steps. However, this process is independent of the required response length, resulting in computational waste for the majority of short responses common in reasoning and chat tasks. To address this problem, we conjecture that the latent prompt representation contains sufficient information to estimate the required output length. We provide empirical evidence for this phenomenon and propose a zero-shot mechanism to dynamically crop the context window before generation begins, leading to fewer diffusion steps and substantial computational savings. We evaluate our approach on four benchmarks with diverse tasks -- GSM8K (reasoning), HumanEval (code generation), IfEval (instruction following), and LongFormQA (question answering) -- revealing massive efficiency gains at minimal performance impact. We report significant reductions in FLOPs across all tasks, with no statistically significant performance degradation, and significant performance improvements in 2 out of 4 tasks.
Chinese Translation
与在预测到序列结束(End-of-Sequence, EoS)标记时即终止可变长度生成的自回归语言模型不同,扩散语言模型(Diffusion Language Models, DLMs)在固定最大长度的上下文窗口上执行预定数量的去噪步骤。然而,这一过程与所需的响应长度无关,导致对推理和对话任务中常见的大量短响应造成计算浪费。为了解决这一问题,我们推测潜在的提示表示包含足够的信息来估计所需的输出长度。我们为这一现象提供了实证证据,并提出了一种零样本机制,在生成开始之前动态裁剪上下文窗口,从而减少扩散步骤并显著节省计算资源。我们在四个具有多样化任务的基准上评估了我们的方法——GSM8K(推理)、HumanEval(代码生成)、IfEval(指令遵循)和LongFormQA(问答),结果显示在性能影响最小的情况下,效率获得了巨大的提升。我们报告了所有任务的FLOPs显著减少,且没有统计学意义上的性能下降,并在4个任务中的2个任务中实现了显著的性能提升。
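One plausible reading of the length-estimation idea is a small regression head over the pooled prompt latents, whose prediction is rounded up to a block-aligned window; the probe, block size, and safety margin below are assumptions, not the paper's mechanism:

```python
import torch

class LengthProbe(torch.nn.Module):
    """Linear head regressing the required response length from the
    mean-pooled latent prompt representation."""
    def __init__(self, hidden):
        super().__init__()
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, prompt_hidden):               # (batch, seq, hidden)
        pooled = prompt_hidden.mean(dim=1)
        return self.head(pooled).squeeze(-1).clamp(min=1.0)

def cropped_window(pred_len, block=32, max_len=1024, margin=1.25):
    """Round the predicted length up to a block multiple, with a safety margin."""
    need = int(pred_len * margin)
    return min(max_len, ((need + block - 1) // block) * block)

probe = LengthProbe(hidden=64)
h = torch.randn(1, 10, 64)          # stand-in prompt latents
est = probe(h)                      # untrained here; would be fit with MSE
print(cropped_window(est.item()))   # denoise only this many positions
```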
cs.CL / 35 / 2603.06135

A Causal Graph Approach to Oppositional Narrative Analysis

一种因果图方法用于对立叙事分析
Revilla, Diego, Fernandez-de-Retana, Martin, Chen, Lingfeng, Bilbao-Jayo, Aritz, Fernandez-de-Retana, Miguel
Abstract
Current methods for textual analysis rely on data annotated within predefined ontologies, often embedding human bias within black-box models. Despite achieving near-perfect performance, these approaches exploit unstructured, linear pattern recognition rather than modeling the structured interactions between entities that naturally emerge in discourse. In this work, we propose a graph-based framework for the detection, analysis, and classification of oppositional narratives and their underlying entities by representing narratives as entity-interaction graphs. Moreover, by incorporating causal estimation at the node level, our approach derives a causal representation of each contribution to the final classification by distilling the constructed sentence graph into a minimal causal subgraph. Building upon this representation, we introduce a classification pipeline that outperforms existing approaches to oppositional thinking classification task.
Chinese Translation
当前的文本分析方法依赖于在预定义本体中标注的数据,往往在黑箱模型中嵌入人类偏见。尽管这些方法实现了近乎完美的性能,但它们利用的是非结构化的线性模式识别,而不是建模在话语中自然出现的实体之间的结构化互动。在本研究中,我们提出了一种基于图的框架,用于检测、分析和分类对立叙事及其潜在实体,通过将叙事表示为实体互动图。此外,通过在节点级别引入因果估计,我们的方法通过将构建的句子图提炼为最小因果子图,推导出每个贡献对最终分类的因果表示。在此表示的基础上,我们引入了一个分类流程,其性能优于现有的对立思维分类任务的方法。
cs.CL / 36 / 2603.06183

CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation

CRIMSON:一种用于生成式放射学报告评估的临床驱动LLM指标
Baharoon, Mohammed, Heintz, Thibault, Raissi, Siavash, Alabbad, Mahmoud, Alhammad, Mona, AlOmaish, Hassan, Kim, Sung Eun, Banerjee, Oishi, Rajpurkar, Pranav
Abstract
We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety. Unlike prior metrics, CRIMSON incorporates full clinical context, including patient age, indication, and guideline-based decision rules, and prevents normal or clinically insignificant findings from exerting disproportionate influence on the overall score. The framework categorizes errors into a comprehensive taxonomy covering false findings, missing findings, and eight attribute-level errors (e.g., location, severity, measurement, and diagnostic overinterpretation). Each finding is assigned a clinical significance level (urgent, actionable non-urgent, non-actionable, or expected/benign), based on a guideline developed in collaboration with attending cardiothoracic radiologists, enabling severity-aware weighting that prioritizes clinically consequential mistakes over benign discrepancies. CRIMSON is validated through strong alignment with clinically significant error counts annotated by six board-certified radiologists in ReXVal (Kendall's tau = 0.61-0.71; Pearson's r = 0.71-0.84), and through two additional benchmarks that we introduce. In RadJudge, a targeted suite of clinically challenging pass-fail scenarios, CRIMSON shows consistent agreement with expert judgment. In RadPref, a larger radiologist preference benchmark of over 100 pairwise cases with structured error categorization, severity modeling, and 1-5 overall quality ratings from three cardiothoracic radiologists, CRIMSON achieves the strongest alignment with radiologist preferences. We release the metric, the evaluation benchmarks, RadJudge and RadPref, and a fine-tuned MedGemma model to enable reproducible evaluation of report generation, all available at https://github.com/rajpurkarlab/CRIMSON.
Chinese Translation
我们介绍了CRIMSON,这是一个针对胸部X光报告生成的临床基础评估框架,评估报告的诊断正确性、上下文相关性和患者安全性。与之前的指标不同,CRIMSON纳入了完整的临床背景,包括患者年龄、指征和基于指南的决策规则,并防止正常或临床上不显著的发现对总体评分产生不成比例的影响。该框架将错误分类为一个全面的分类法,涵盖虚假发现、缺失发现和八种属性级错误(例如,位置、严重性、测量和诊断过度解读)。每个发现根据与主治心胸放射科医师合作制定的指南,分配一个临床重要性等级(紧急、可采取行动的非紧急、不可采取行动的或预期/良性),从而实现严重性感知加权,优先考虑临床上重要的错误而非良性差异。CRIMSON通过与六位获得认证的放射科医师在ReXVal中标注的临床重要错误计数的强一致性进行验证(Kendall's tau = 0.61-0.71;Pearson's r = 0.71-0.84),以及通过我们引入的两个额外基准进行验证。在RadJudge(一组具有临床挑战性的通过-失败场景测试集)中,CRIMSON与专家判断保持一致。在RadPref(一个更大的放射科医师偏好基准,包含超过100对案例,具有结构化错误分类、严重性建模以及来自三位心胸放射科医师的1-5总体质量评分)中,CRIMSON达到了与放射科医师偏好的最佳一致性。我们发布了该指标、评估基准RadJudge和RadPref,以及一个经过微调的MedGemma模型,以便实现报告生成的可重复评估,所有内容可在https://github.com/rajpurkarlab/CRIMSON获取。
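The severity-aware weighting can be sketched as a weighted error aggregate; the weights and the penalty-to-score mapping below are invented for illustration and do not reproduce CRIMSON's guideline-derived values:

```python
# Assumed significance weights; the paper's guideline-derived values differ.
WEIGHTS = {"urgent": 8.0, "actionable": 4.0, "non_actionable": 1.0, "benign": 0.2}

def severity_weighted_score(errors):
    """Aggregate report errors so clinically consequential mistakes dominate.
    errors: list of (error_type, significance) pairs for one generated report."""
    penalty = sum(WEIGHTS[sig] for _, sig in errors)
    return 1.0 / (1.0 + penalty)   # map the penalty to a (0, 1] quality score

report_errors = [
    ("missing_finding", "urgent"),        # e.g. a missed pneumothorax
    ("location", "non_actionable"),
    ("false_finding", "benign"),
]
print(round(severity_weighted_score(report_errors), 3))
```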
cs.CL / 37 / 2603.06194

MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue

MAPO:用于长时间多轮对话的混合优势策略优化
Zhang, Naifan, Sun, Ruihan, Su, Jinwei, Yang, Hengjie, Pan, Zhengyuan, Chen, Zhaohan, Zhang, Xiaofan
Abstract
Subjective multi-turn dialogue tasks, such as emotional support, require conversational policies that adapt to evolving user states and optimize long-horizon interaction quality. However, reinforcement learning (RL) for such settings remains challenging due to the absence of reliable process supervision. Outcome-only training collapses credit assignment across turns into a single trajectory-level reward, while na\"ive turn-level group sampling incurs prohibitive rollout costs in interactive environments. We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns. To stabilize optimization, we introduce a mixed advantage estimator that combines turn-level normalization with batch-level normalization, enabling fine-grained yet scalable credit assignment. Across multiple subjective dialogue benchmarks, including EMPA, EmoBench, and EQ-Bench, and model scales ranging from 7B to 32B, our method consistently improves both training stability and final performance over outcome-only GRPO and single-level normalization baselines. On EMPA, we improve rates by up to 9 points and increase dialogue scores by as much as +43.2 over the 7B base model. Despite training only on EMPA-style environments, our approach generalizes well, yielding consistent improvements on unseen emotional-intelligence benchmarks, including up to +4 points on EmoBench and +3.5 on EQ-Bench. Together, these results demonstrate that dense process supervision combined with mixed-level normalization enables effective and scalable RL for subjective, open-ended multi-turn dialogue.
Chinese Translation
主观多轮对话任务,如情感支持,需要能够适应不断变化的用户状态并优化长时间交互质量的对话策略。然而,由于缺乏可靠的过程监督,这类环境下的强化学习(RL)仍然面临挑战。仅基于结果的训练将跨轮次的信用分配压缩为单一的轨迹级奖励,而朴素的轮次级分组采样在交互环境中会产生高昂的轨迹推演(rollout)成本。我们提出了一种名为MAPO的无评论家且高效的RL算法,该算法利用来自评判模型的密集过程反馈,并通过蒙特卡洛回报传播长时间效应。为了稳定优化,我们引入了一种混合优势估计器,结合了轮次级归一化和批次级归一化,使得信用分配既细致又可扩展。在多个主观对话基准测试中,包括EMPA、EmoBench和EQ-Bench,以及从7B到32B的模型规模上,我们的方法在训练稳定性和最终性能上始终优于仅基于结果的GRPO和单级归一化基线。在EMPA上,我们将相关比率提升了多达9个百分点,并使对话得分相比7B基础模型最多提高了43.2分。尽管仅在EMPA风格的环境中进行训练,我们的方法在未见过的情感智力基准上表现良好,包括在EmoBench上最多提高4分,在EQ-Bench上提高3.5分。这些结果共同表明,密集过程监督与混合级归一化相结合,使得主观开放式多轮对话的有效且可扩展的RL成为可能。
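The mixed advantage estimator admits a compact sketch: per-turn Monte Carlo returns are standardized within each turn group and across the whole batch, then blended. The blending coefficient and exact normalization are assumptions about the paper's formulation:

```python
import numpy as np

def mixed_advantage(returns, turn_ids, alpha=0.5, eps=1e-8):
    """Blend turn-level and batch-level normalization of Monte Carlo returns.
    returns: per-turn MC returns across a batch of dialogues; turn_ids groups
    entries belonging to the same turn index."""
    returns = np.asarray(returns, dtype=float)
    turn_norm = np.empty_like(returns)
    for t in np.unique(turn_ids):
        m = turn_ids == t
        turn_norm[m] = (returns[m] - returns[m].mean()) / (returns[m].std() + eps)
    batch_norm = (returns - returns.mean()) / (returns.std() + eps)
    return alpha * turn_norm + (1 - alpha) * batch_norm

rets = [0.2, 0.9, 0.4, 0.7, 0.1, 0.8]
turns = np.array([0, 0, 1, 1, 2, 2])   # two dialogues, three turns each
print(mixed_advantage(rets, turns))
```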
cs.CL / 38 / 2603.06197

Wisdom of the AI Crowd (AI-CROWD) for Ground Truth Approximation in Content Analysis: A Research Protocol & Validation Using Eleven Large Language Models

人工智能人群智慧(AI-CROWD)在内容分析中的真实值近似:研究协议及使用十一种大型语言模型的验证
de-Marcos, Luis, Goyanes, Manuel, Domínguez-Díaz, Adrián
Abstract
Large-scale content analysis is increasingly limited by the absence of observable ground truth or gold-standard labels, as creating such benchmarks through extensive human coding becomes impractical for massive datasets due to high time, cost, and consistency challenges. To overcome this barrier, we introduce the AI-CROWD protocol, which approximates ground truth by leveraging the collective outputs of an ensemble of large language models (LLMs). Rather than asserting that the resulting labels are true ground truth, the protocol generates a consensus-based approximation derived from convergent and divergent inferences across multiple models. By aggregating outputs via majority voting and interrogating agreement/disagreement patterns with diagnostic metrics, AI-CROWD identifies high-confidence classifications while flagging potential ambiguity or model-specific biases.
Chinese Translation
大规模内容分析越来越受到可观察的真实值或金标准标签缺乏的限制,因为通过广泛的人工编码为庞大的数据集创建这样的基准,在时间、成本和一致性方面都不切实际。为了克服这一障碍,我们提出了AI-CROWD协议,该协议通过利用一组大型语言模型(LLMs)的集体输出来近似真实值。该协议并不声称生成的标签就是真实标签,而是基于多个模型之间趋同与分歧的推理,生成一个基于共识的近似值。通过多数投票聚合输出,并使用诊断指标审查一致性/不一致性模式,AI-CROWD能够识别高置信度的分类,同时标记潜在的模糊性或模型特定的偏差。
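The aggregation step is straightforward to sketch: majority voting plus entropy of the vote distribution as an agreement diagnostic. The labels below are toy inputs, not the protocol's full metric suite:

```python
import math
from collections import Counter

def crowd_label(votes):
    """Majority-vote label plus agreement diagnostics for one item.
    votes: list of labels, one per LLM in the ensemble."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    n = len(votes)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)  # 0 = full consensus
    return {"label": label, "agreement": top / n, "entropy": round(entropy, 3)}

print(crowd_label(["pos", "pos", "pos", "neg", "pos"]))   # high-confidence case
print(crowd_label(["pos", "neg", "neu", "neg", "pos"]))   # flag for review
```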
cs.CL / 39 / 2603.06198

LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation

LIT-RAGBench:大型语言模型在检索增强生成中的生成能力基准测试
Itai, Koki, Hasegawa, Shunichi, Yamamoto, Yuta, Minegishi, Gouki, Otsuki, Masaki
Abstract
Retrieval-Augmented Generation (RAG) is a framework in which a Generator, such as a Large Language Model (LLM), produces answers by retrieving documents from an external collection using a Retriever. In practice, Generators must integrate evidence from long contexts, perform multi-step reasoning, interpret tables, and abstain when evidence is missing. However, existing benchmarks for Generators provide limited coverage, with none enabling simultaneous evaluation of multiple capabilities under unified conditions. To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic, Table, and Abstention, each further divided into practical evaluation aspects. LIT-RAGBench systematically covers patterns combining multiple aspects across categories. By using fictional entities and scenarios, LIT-RAGBench evaluates answers grounded in the provided external documents. The dataset consists of 114 human-constructed Japanese questions and an English version generated by machine translation with human curation. We use LLM-as-a-Judge for scoring and report category-wise and overall accuracy. Across API-based and open-weight models, no model exceeds 90% overall accuracy. By making strengths and weaknesses measurable within each category, LIT-RAGBench serves as a valuable metric for model selection in practical RAG deployments and for building RAG-specialized models. We release LIT-RAGBench, including the dataset and evaluation code, at https://github.com/Koki-Itai/LIT-RAGBench.
Chinese Translation
检索增强生成(RAG)是一种框架,其中生成器(如大型语言模型(LLM))通过使用检索器从外部文献中检索文档来生成答案。在实际应用中,生成器必须整合来自长上下文的证据,进行多步骤推理,解读表格,并在缺乏证据时选择不作答。然而,现有的生成器基准测试覆盖范围有限,且没有一个能够在统一条件下同时评估多种能力。为填补现有评估与实际应用之间的差距,我们引入了LIT-RAGBench(逻辑、整合、表格、推理和不作答RAG生成器基准),该基准定义了五个类别:整合、推理、逻辑、表格和不作答,每个类别进一步细分为实际评估方面。LIT-RAGBench系统性地覆盖了跨类别结合多种方面的模式。通过使用虚构实体和场景,LIT-RAGBench评估基于提供的外部文档的答案。该数据集由114个人工构建的日语问题和经过人工审校的机器翻译生成的英文版本组成。我们使用LLM作为评判者进行评分,并报告类别和整体准确率。在基于API和开放权重的模型中,没有模型的整体准确率超过90%。通过使每个类别内的优缺点可量化,LIT-RAGBench为实际RAG部署中的模型选择和构建RAG专用模型提供了有价值的指标。我们在https://github.com/Koki-Itai/LIT-RAGBench发布LIT-RAGBench,包括数据集和评估代码。
cs.CL / 40 / 2603.06199

FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

FlashPrefill:超快速长上下文预填充的瞬时模式发现与阈值设定
Fan, Qihang, Huang, Huaibo, Wu, Zhiying, Wang, Juqiu, Wang, Bingning, He, Ran
Abstract
Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention mechanisms have been explored, they typically suffer from either significant search latency or insufficient sparsity. In this paper, we propose FlashPrefill, a framework enabling ultra-fast prefilling via instantaneous pattern discovery and thresholding. FlashPrefill leverages a fast block-searching technique to simultaneously locate dynamic vertical, slash, and block-sparse attention patterns. Crucially, it introduces a dynamic thresholding mechanism that bypasses the prohibitive overhead of sorting or accumulating attention scores while effectively eliminating the long-tail distribution to enhance sparsity. Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78x speedup on 256K sequences. Notably, unlike existing methods that incur efficiency degradation on shorter contexts, FlashPrefill maintains a 1.71x speedup even at a 4K context length, demonstrating its robustness and practical utility across varying sequence scales.
Chinese Translation
长上下文建模是大型语言模型的一项关键能力,但注意力机制的平方复杂度仍然是一个重要瓶颈,尤其是在计算密集的预填充阶段。虽然已经探索了多种稀疏注意力机制,但它们通常面临显著的搜索延迟或稀疏性不足的问题。本文提出了FlashPrefill,一个通过瞬时模式发现和阈值设定实现超快速预填充的框架。FlashPrefill利用快速块搜索技术,同时定位动态的垂直、斜线和块稀疏注意力模式。重要的是,它引入了一种动态阈值机制,避免了对注意力分数进行排序或累积的高昂开销,同时有效消除了长尾分布以增强稀疏性。广泛的评估表明,FlashPrefill在效率上实现了显著飞跃,在256K序列上提供了前所未有的27.78倍加速。值得注意的是,与现有方法在较短上下文中效率下降不同,FlashPrefill在4K上下文长度下仍保持1.71倍的加速,展示了其在不同序列规模下的鲁棒性和实用性。
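A rough sketch of block-level pattern discovery with a statistics-based threshold in place of sorting; the mean-plus-std rule is an assumption, not FlashPrefill's actual thresholding mechanism:

```python
import numpy as np

def keep_blocks(scores, q_len, k_len, block=4, kappa=1.0):
    """Pool attention scores into blocks and keep blocks whose pooled score
    clears a statistics-based threshold, avoiding any sort/accumulate pass."""
    s = scores.reshape(q_len // block, block, k_len // block, block).mean(axis=(1, 3))
    thresh = s.mean() + kappa * s.std()   # assumed dynamic-threshold rule
    return s > thresh                     # boolean block-sparse mask

rng = np.random.default_rng(1)
q_len = k_len = 16
attn = rng.random((q_len, k_len))
attn[:4, :4] += 2.0                       # plant one salient block
print(keep_blocks(attn, q_len, k_len).astype(int))
```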
cs.CL / 41 / 2603.06222

SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models

SPOT:用于大型语言模型中高效且可解释的潜在推理的跨度级思考暂停
Chu, Yunlong, Shao, Minglai, Liu, Yuhang, Hao, Bing, Lin, Yumeng, Wang, Jialu, Wang, Ruijie
Abstract
Explicit Chain-of-Thought improves the reasoning performance of large language models but often incurs high inference cost due to verbose token-level traces. While recent approaches reduce this overhead via concise prompting or step pruning, they largely truncate what the model says rather than internalize what the model thinks. Latent reasoning offers a promising alternative by performing computation in the hidden space, yet prior methods face two critical challenges. Many existing approaches rely on rigid point-to-point alignment, forcing a latent token to approximate the final representation of a reasoning step, which can be insufficient to capture the dense, variable-length semantics of an entire reasoning segment. Furthermore, these methods often suffer from a lack of interpretability: latent states are commonly produced by unconstrained optimization or embedding mixing, yielding vectors that are difficult to decode or audit under the pretrained language head. We propose SPOT, a flexible framework that compresses explicit CoT into compact latent pause tokens without enforcing a fixed response template. At the core of SPOT is Span-level Semantic Alignment, a Sinkhorn optimal-transport objective that softly matches each pause token to the semantics of an entire reasoning segment, overcoming the rigidity of step-end alignment. To further improve interpretability, SPOT introduces a Frozen-Head Decoding Constraint that keeps latent states directly decodable as token distributions under the frozen pretrained LM head, enabling readable keyword interpretations of latent thoughts. Experiments on reasoning benchmarks demonstrate that SPOT improves accuracy by 2.3 points on average while reducing generated tokens by 37.5% and provides faithful semantic interpretations of the latent reasoning process.
Chinese Translation
显式思维链(Chain-of-Thought)提高了大型语言模型的推理性能,但由于冗长的令牌级痕迹,通常会导致高推理成本。尽管最近的方法通过简洁的提示或步骤修剪来减少这种开销,但它们在很大程度上截断了模型所说的内容,而不是内化模型的思维。潜在推理通过在隐藏空间中进行计算提供了一种有前景的替代方案,但先前的方法面临两个关键挑战。许多现有方法依赖于刚性的逐点对齐,迫使潜在令牌近似推理步骤的最终表示,这可能不足以捕捉整个推理段的密集、可变长度语义。此外,这些方法通常缺乏可解释性:潜在状态通常通过不受约束的优化或嵌入混合生成,导致在预训练语言头下难以解码或审计的向量。我们提出了SPOT,一个灵活的框架,将显式的思维链压缩为紧凑的潜在暂停令牌,而不强制固定的响应模板。SPOT的核心是跨度级语义对齐(Span-level Semantic Alignment),这是一个Sinkhorn最优传输目标,柔性地将每个暂停令牌与整个推理段的语义匹配,克服了步骤结束对齐的刚性。为了进一步提高可解释性,SPOT引入了冻结头解码约束(Frozen-Head Decoding Constraint),使潜在状态在冻结的预训练语言模型头下保持直接可解码为令牌分布,从而实现潜在思维的可读关键词解释。在推理基准上的实验表明,SPOT平均将准确率提高了2.3个百分点,同时将生成令牌数量减少了37.5%,并为潜在推理过程提供了忠实的语义解释。
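The Sinkhorn optimal-transport objective underlying Span-level Semantic Alignment can be sketched with plain Sinkhorn iterations over a cosine-distance cost; the uniform marginals and toy embeddings are assumptions:

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=100):
    """Entropy-regularized OT (Sinkhorn iterations): returns a transport plan
    softly matching pause tokens (rows) to reasoning segments (columns)."""
    n, m = cost.shape
    K = np.exp(-cost / reg)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # uniform marginals (assumed)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
pause = rng.normal(size=(2, 8))   # 2 latent pause-token embeddings
seg = rng.normal(size=(3, 8))     # 3 reasoning-segment embeddings
pause /= np.linalg.norm(pause, axis=1, keepdims=True)
seg /= np.linalg.norm(seg, axis=1, keepdims=True)
plan = sinkhorn(1.0 - pause @ seg.T)   # cost = cosine distance
print(plan.round(3))                   # row sums ≈ 1/2, column sums ≈ 1/3
```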
cs.CL / 42 / 2603.06264

Mind the Gap: Pitfalls of LLM Alignment with Asian Public Opinion

注意差距:大型语言模型与亚洲公众舆论对齐的陷阱
Shankar, Hari, P, Vedanta S, Margapuri, Sriharini, Mazumder, Debjani, Kumaraguru, Ponnurangam, Chakraborty, Abhijnan
Abstract
Large Language Models (LLMs) are increasingly being deployed in multilingual, multicultural settings, yet their reliance on predominantly English-centric training data risks misalignment with the diverse cultural values of different societies. In this paper, we present a comprehensive, multilingual audit of the cultural alignment of contemporary LLMs including GPT-4o-Mini, Gemini-2.5-Flash, Llama 3.2, Mistral and Gemma 3 across India, East Asia and Southeast Asia. Our study specifically focuses on the sensitive domain of religion as the prism for broader alignment. To facilitate this, we conduct a multi-faceted analysis of every LLM's internal representations, using log-probs/logits, to compare the model's opinion distributions against ground-truth public attitudes. We find that while the popular models generally align with public opinion on broad social issues, they consistently fail to accurately represent religious viewpoints, especially those of minority groups, often amplifying negative stereotypes. Lightweight interventions, such as demographic priming and native language prompting, partially mitigate but do not eliminate these cultural gaps. We further show that downstream evaluations on bias benchmarks (such as CrowS-Pairs, IndiBias, ThaiCLI, KoBBQ) reveal persistent harms and under-representation in sensitive contexts. Our findings underscore the urgent need for systematic, regionally grounded audits to ensure equitable global deployment of LLMs.
Chinese Translation
大型语言模型(LLMs)正越来越多地在多语言、多文化的环境中部署,然而它们对以英语为中心的训练数据的依赖,可能导致与不同社会的多样文化价值观的错位。在本文中,我们对当代LLMs(包括GPT-4o-Mini、Gemini-2.5-Flash、Llama 3.2、Mistral和Gemma 3)在印度、东亚和东南亚的文化对齐进行了全面的多语言审计。我们的研究特别关注宗教这一敏感领域,以此作为更广泛对齐的透镜。为此,我们对每个LLM的内部表征进行了多方面分析,使用对数概率/对数值(log-probs/logits),将模型的观点分布与真实公众态度进行比较。我们发现,尽管这些流行模型在广泛社会问题上通常与公众舆论一致,但它们在准确呈现宗教观点方面始终存在不足,尤其是对少数群体的观点,常常放大负面刻板印象。轻量级干预措施,如人口统计信息启动(demographic priming)和母语提示,部分缓解但并未消除这些文化差距。我们进一步展示了在偏见基准(如CrowS-Pairs、IndiBias、ThaiCLI、KoBBQ)上的下游评估揭示了在敏感背景下持续存在的伤害和代表性不足。我们的研究结果强调了进行系统的、以地区为基础的审计的迫切需要,以确保LLMs的全球公平部署。
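Comparing a model's internal opinion distribution against survey ground truth, as described above, reduces to normalizing per-option log-probs and measuring a divergence; the Jensen-Shannon choice and the numbers below are illustrative:

```python
import numpy as np

def option_distribution(option_logprobs):
    """Renormalize per-option log-probabilities into an opinion distribution."""
    lp = np.asarray(option_logprobs, dtype=float)
    p = np.exp(lp - lp.max())
    return p / p.sum()

def jensen_shannon(p, q, eps=1e-12):
    """Symmetric, bounded divergence (base 2) between two distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

model = option_distribution([-1.2, -0.3, -2.5])   # log-probs of 3 answer options
survey = [0.25, 0.55, 0.20]                       # ground-truth public attitudes
print(model.round(3), round(jensen_shannon(model, survey), 4))
```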
cs.CL / 43 / 2603.06324

The Art That Poses Back: Assessing AI Pastiches after Contemporary Artworks

反向提问的艺术:评估人工智能对当代艺术作品的模仿
Dinu, Anca, Mihail, Andreiana, Florescu, Andra-Maria, Creanga, Claudiu
Abstract
This study explores artificial visual creativity, focusing on ChatGPT's ability to generate new images intentionally pastiching original artworks such as paintings, drawings, sculptures and installations. The process involved twelve artists from Romania, Bulgaria, France, Austria, and the United Kingdom, each invited to contribute with three of their artworks and to grade and comment on the AI-generated versions. The analysis combines human evaluation with computational methods aimed at detecting visual and stylistic similarities or divergences between the original works and their AI-produced renditions. The results point to a significant gap between color and texture-based similarity and compositional, conceptual, and perceptual one. Consequently, we advocate for the use of a "style transfer dashboard" of complementary metrics to evaluate the similarity between pastiches and originals, rather than using a single style metric. The artists' comments revealed limitations of ChatGPT's pastiches after contemporary artworks, which were perceived by the authors of the originals as lacking dimensionality, context, and intentional sense, and seeming more of a paraphrase or an approximate quotation rather than as a valuable, emotion-evoking artwork.
Chinese Translation
本研究探讨了人工视觉创意,重点关注ChatGPT生成新图像的能力,这些图像故意模仿原始艺术作品,如绘画、素描、雕塑和装置艺术。该过程涉及来自罗马尼亚、保加利亚、法国、奥地利和英国的十二位艺术家,他们被邀请贡献三件自己的艺术作品,并对AI生成的版本进行评分和评论。分析结合了人类评估与计算方法,旨在检测原作与其AI生成版本之间的视觉和风格相似性或差异。结果显示,在基于颜色和纹理的相似性与基于构图、概念和感知的相似性之间存在显著差距。因此,我们倡导使用“风格迁移仪表板”来评估模仿作品与原作之间的相似性,而不是使用单一的风格指标。艺术家的评论揭示了ChatGPT在模仿当代艺术作品时的局限性,原作的作者认为这些模仿缺乏立体感、背景和意图,更多地表现为一种释义或近似引用,而不是一种有价值的、能够引发情感的艺术作品。
cs.CL / 44 / 2603.06348

Transparent AI for Mathematics: Transformer-Based Large Language Models for Mathematical Entity Relationship Extraction with XAI

透明人工智能在数学中的应用:基于变换器的大型语言模型用于数学实体关系提取与可解释人工智能
Aurpa, Tanjim Taharat
Abstract
Mathematical text understanding is a challenging task due to the presence of specialized entities and complex relationships between them. This study formulates mathematical problem interpretation as a Mathematical Entity Relation Extraction (MERE) task, where operands are treated as entities and operators as their relationships. Transformer-based models are applied to automatically extract these relations from mathematical text, with Bidirectional Encoder Representations from Transformers (BERT) achieving the best performance, reaching an accuracy of 99.39%. To enhance transparency and trust in the model's predictions, Explainable Artificial Intelligence (XAI) is incorporated using Shapley Additive Explanations (SHAP). The explainability analysis reveals how specific textual and mathematical features influence relation prediction, providing insights into feature importance and model behavior. By combining transformer-based learning, a task-specific dataset, and explainable modeling, this work offers an effective and interpretable framework for MERE, supporting future applications in automated problem solving, knowledge graph construction, and intelligent educational systems.
Chinese Translation
数学文本理解是一项具有挑战性的任务,因为其中存在专业实体及其复杂关系。本研究将数学问题的解释形式化为数学实体关系提取(Mathematical Entity Relation Extraction,MERE)任务,其中操作数被视为实体,运算符则被视为它们之间的关系。采用基于变换器的模型自动从数学文本中提取这些关系,其中双向编码器表示变换器(Bidirectional Encoder Representations from Transformers,BERT)实现了最佳性能,准确率达到99.39%。为了增强模型预测的透明性和可信度,采用可解释人工智能(Explainable Artificial Intelligence,XAI),使用Shapley加性解释(Shapley Additive Explanations,SHAP)。可解释性分析揭示了特定文本和数学特征如何影响关系预测,提供了特征重要性和模型行为的洞见。通过结合基于变换器的学习、特定任务数据集和可解释建模,本研究提供了一个有效且可解释的MERE框架,支持未来在自动问题解决、知识图谱构建和智能教育系统中的应用。
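The operands-as-entities, operators-as-relations framing can be made concrete with a tiny parser sketch (Python's ast standing in for the paper's transformer extractor):

```python
import ast

OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

def extract_relations(expression):
    """Read an arithmetic expression as (left entity, operator relation, right
    entity) triples, mirroring the MERE framing described above."""
    triples = []

    def walk(node):
        if isinstance(node, ast.BinOp):
            left, right = walk(node.left), walk(node.right)
            rel = OPS[type(node.op)]
            triples.append((left, rel, right))
            return f"({left} {rel} {right})"
        if isinstance(node, ast.Constant):
            return str(node.value)
        if isinstance(node, ast.Name):
            return node.id
        raise ValueError(f"unsupported node: {node!r}")

    walk(ast.parse(expression, mode="eval").body)
    return triples

print(extract_relations("3 + 4 * 2"))   # [('4', '*', '2'), ('3', '+', '(4 * 2)')]
```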
cs.CL / 45 / 2603.06416

Evaluation of Deontic Conditional Reasoning in Large Language Models: The Case of Wason's Selection Task

大型语言模型中义务条件推理的评估:以瓦森选择任务为例
Abe, Hirohiko, Ozeki, Kentaro, Ando, Risako, Morishita, Takanobu, Mineshima, Koji, Okada, Mitsuhiro
Abstract
As large language models (LLMs) advance in linguistic competence, their reasoning abilities are gaining increasing attention. In humans, reasoning often performs well in domain specific settings, particularly in normative rather than purely formal contexts. Although prior studies have compared LLM and human reasoning, the domain specificity of LLM reasoning remains underexplored. In this study, we introduce a new Wason Selection Task dataset that explicitly encodes deontic modality to systematically distinguish deontic from descriptive conditionals, and use it to examine LLMs' conditional reasoning under deontic rules. We further analyze whether observed error patterns are better explained by confirmation bias (a tendency to seek rule-supporting evidence) or by matching bias (a tendency to ignore negation and select items that lexically match elements of the rule). Results show that, like humans, LLMs reason better with deontic rules and display matching-bias-like errors. Together, these findings suggest that the performance of LLMs varies systematically across rule types and that their error patterns can parallel well-known human biases in this paradigm.
Chinese Translation
随着大型语言模型(LLMs)在语言能力上的进步,它们的推理能力正受到越来越多的关注。在人类中,推理在特定领域的设置中往往表现良好,特别是在规范性而非纯形式的上下文中。尽管之前的研究比较了LLM和人类的推理,但LLM推理的领域特异性仍然未被充分探索。在本研究中,我们引入了一个新的瓦森选择任务数据集,该数据集明确编码了义务情态,以系统地区分义务条件句和描述性条件句,并利用该数据集来检验LLM在义务规则下的条件推理。我们进一步分析观察到的错误模式是否更能通过确认偏差(寻求支持规则的证据的倾向)或匹配偏差(忽视否定并选择与规则元素在词汇上匹配的项目的倾向)来解释。结果表明,LLM与人类一样,在义务规则下推理表现更好,并表现出类似匹配偏差的错误。这些发现表明,LLM的表现在规则类型之间系统性地变化,并且它们的错误模式可以与这一范式中众所周知的人类偏差相平行。
cs.CL / 46 / 2603.06424

From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring

从提示到偏好优化:基于大型语言模型的自动化作文评分的比较研究
Nguyen, Minh Hoang, Pham, Vu Hoang, Huynh, Xuan Thanh, Mai, Phuc Hong, Nguyen, Vinh The, Huynh, Quang Nhut, Nguyen, Huy Tien, Le, Tung
Abstract
Large language models (LLMs) have recently reshaped Automated Essay Scoring (AES), yet prior studies typically examine individual techniques in isolation, limiting understanding of their relative merits for English as a Second Language (L2) writing. To bridge this gap, we present a comprehensive comparison of major LLM-based AES paradigms on IELTS Writing Task~2. On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning combined with Direct Preference Optimization (DPO) and RAG. Our results reveal clear accuracy-cost-robustness trade-offs across methods; the best configuration, integrating k-SFT and RAG, achieves the strongest overall results with an F1-score of 93%. This study offers the first unified empirical comparison of modern LLM-based AES strategies for English L2, showing promising potential for auto-grading writing tasks. Code is public at https://github.com/MinhNguyenDS/LLM_AES-EnL2
Chinese Translation
大型语言模型(LLMs)最近重塑了自动化作文评分(AES),然而以往的研究通常孤立地考察个别技术,限制了对其在英语作为第二语言(L2)写作中相对优缺点的理解。为了解决这一问题,我们对主要的基于LLM的AES范式在雅思写作任务2(IELTS Writing Task 2)上进行了全面比较。在这一统一基准上,我们评估了四种方法:(i)基于编码器的分类微调,(ii)零样本和少样本提示,(iii)指令微调和检索增强生成(Retrieval-Augmented Generation, RAG),以及(iv)结合直接偏好优化(Direct Preference Optimization, DPO)和RAG的监督微调。我们的结果揭示了不同方法之间明显的准确性-成本-鲁棒性权衡,最佳配置结合了k-SFT和RAG,取得了93%的F1分数,展现出最强的整体结果。本研究提供了现代基于LLM的AES策略在英语L2中的首次统一实证比较,为自动评分写作任务提供了潜在的前景。代码公开在 https://github.com/MinhNguyenDS/LLM_AES-EnL2
cs.CL / 47 / 2603.06428

Abductive Reasoning with Syllogistic Forms in Large Language Models

大型语言模型中的三段论形式的溯因推理
Abe, Hirohiko, Ando, Risako, Ozeki, Kentaro, Morishita, Takanobu, Mineshima, Koji, Okada, Mitsuhiro
Abstract
Research in AI using Large-Language Models (LLMs) is rapidly evolving, and the comparison of their performance with human reasoning has become a key concern. Prior studies have indicated that LLMs and humans share similar biases, such as dismissing logically valid inferences that contradict common beliefs. However, criticizing LLMs for these biases might be unfair, considering our reasoning not only involves formal deduction but also abduction, which draws tentative conclusions from limited information. Abduction can be regarded as the inverse form of syllogism in its basic structure, that is, a process of drawing a minor premise from a major premise and conclusion. This paper explores the accuracy of LLMs in abductive reasoning by converting a syllogistic dataset into one suitable for abduction. It aims to investigate whether the state-of-the-art LLMs exhibit biases in abduction and to identify potential areas for improvement, emphasizing the importance of contextualized reasoning beyond formal deduction. This investigation is vital for advancing the understanding and application of LLMs in complex reasoning tasks, offering insights into bridging the gap between machine and human cognition.
Chinese Translation
使用大型语言模型(LLMs)的人工智能研究正在快速发展,与人类推理的性能比较已成为一个关键关注点。先前的研究表明,LLMs和人类在某些偏见上存在相似之处,例如忽视与常识相悖的逻辑有效推理。然而,批评LLMs的这些偏见可能不公平,因为我们的推理不仅涉及形式演绎,还包括溯因推理,它从有限的信息中得出暂时的结论。溯因推理在其基本结构上可以被视为三段论的逆形式,即从大前提和结论中得出小前提的过程。本文通过将三段论数据集转换为适合溯因推理的数据集,探讨LLMs在溯因推理中的准确性。研究旨在调查最先进的LLMs是否在溯因推理中表现出偏见,并识别潜在的改进领域,强调超越形式演绎的情境化推理的重要性。这项研究对于推动对LLMs在复杂推理任务中的理解和应用至关重要,为弥合机器与人类认知之间的差距提供了见解。
cs.CL / 48 / 2603.06485

PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations

PONTE:自然语言可信解释的个性化编排
Vineis, Vittoria, Silvestri, Matteo, Antonelli, Lorenzo, Betello, Filippo, Tolomei, Gabriele
Abstract
Explainable Artificial Intelligence (XAI) seeks to enhance the transparency and accountability of machine learning systems, yet most methods follow a one-size-fits-all paradigm that neglects user differences in expertise, goals, and cognitive needs. Although Large Language Models can translate technical explanations into natural language, they introduce challenges related to faithfulness and hallucinations. To address these challenges, we present PONTE (Personalized Orchestration for Natural language Trustworthy Explanations), a human-in-the-loop framework for adaptive and reliable XAI narratives. PONTE models personalization as a closed-loop validation and adaptation process rather than prompt engineering. It combines: (i) a low-dimensional preference model capturing stylistic requirements; (ii) a preference-conditioned generator grounded in structured XAI artifacts; and (iii) verification modules enforcing numerical faithfulness, informational completeness, and stylistic alignment, optionally supported by retrieval-grounded argumentation. User feedback iteratively updates the preference state, enabling quick personalization. Automatic and human evaluations across healthcare and finance domains show that the verification-refinement loop substantially improves completeness and stylistic alignment over validation-free generation. Human studies further confirm strong agreement between intended preference vectors and perceived style, robustness to generation stochasticity, and consistently positive quality assessments.
Chinese Translation
可解释人工智能(XAI)旨在增强机器学习系统的透明度和问责制,但大多数方法遵循一刀切的范式,忽视了用户在专业知识、目标和认知需求上的差异。尽管大型语言模型可以将技术解释转化为自然语言,但它们引入了与忠实性和幻觉相关的挑战。为了解决这些挑战,我们提出了PONTE(个性化编排自然语言可信解释),这是一个人机协作框架,用于自适应和可靠的XAI叙述。PONTE将个性化建模为一个闭环验证和适应过程,而不是提示工程。它结合了:(i)一个低维偏好模型,捕捉风格要求;(ii)一个基于结构化XAI产物的偏好条件生成器;以及(iii)强制执行数值忠实性、信息完整性和风格一致性的验证模块,并可选地辅以基于检索的论证。用户反馈迭代更新偏好状态,实现快速个性化。在医疗和金融领域的自动与人工评估表明,相较于无验证的生成,验证-精炼循环显著提高了完整性和风格一致性。人类研究进一步确认了意图偏好向量与感知风格之间的强一致性,对生成随机性的鲁棒性,以及一致积极的质量评估。
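The numerical-faithfulness check in the verification module can be sketched as a set comparison between numbers stated in the narrative and numbers present in the structured artifact; the regex and pass/fail contract below are assumptions:

```python
import re

def numbers_in(text):
    """Extract every numeric literal mentioned in a string."""
    return {float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", text)}

def verify_numeric_faithfulness(narrative, artifact):
    """Pass only if every number the narrative states appears in the structured
    XAI artifact; unsupported numbers are returned to trigger regeneration."""
    unsupported = numbers_in(narrative) - numbers_in(artifact)
    return (len(unsupported) == 0, unsupported)

artifact = "feature_importance: age=0.42, bmi=0.31, glucose=0.27"
good = "Age contributes most (0.42), followed by BMI at 0.31."
bad = "Age dominates with 0.58 importance."
print(verify_numeric_faithfulness(good, artifact))   # (True, set())
print(verify_numeric_faithfulness(bad, artifact))    # (False, {0.58})
```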
cs.CL / 49 / 2603.06503

Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing

超越行的推理:面向多模态电子表格理解与编辑的自主检索
Gulati, Anmol, Sen, Sahil, Sarguroh, Waqar, Paul, Kevin
Abstract
Recent advances in multimodal Retrieval-Augmented Generation (RAG) enable Large Language Models (LLMs) to analyze enterprise spreadsheet workbooks containing millions of cells, cross-sheet dependencies, and embedded visual artifacts. However, state-of-the-art approaches exclude critical context through single-pass retrieval, lose data resolution through compression, and exceed LLM context windows through naive full-context injection, preventing reliable multi-step reasoning over complex enterprise workbooks. We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis to structured editing. Supported by over 200 hours of expert human evaluation, BRTR achieves state-of-the-art performance across three frontier spreadsheet understanding benchmarks, surpassing prior methods by 25 percentage points on FRTR-Bench, 7 points on SpreadsheetLLM, and 32 points on FINCH. We evaluate five multimodal embedding models, identifying NVIDIA NeMo Retriever 1B as the top performer for mixed tabular and visual data, and vary nine LLMs. Ablation experiments confirm that the planner, retrieval, and iterative reasoning each contribute substantially, and cost analysis shows GPT-5.2 achieves the best efficiency-accuracy trade-off. Throughout all evaluations, BRTR maintains full auditability through explicit tool-call traces.
Chinese Translation
近期在多模态检索增强生成(RAG)方面的进展,使得大型语言模型(LLMs)能够分析包含数百万单元格、跨表依赖关系和嵌入视觉元素的企业电子表格工作簿。然而,当前最先进的方法通过单次检索排除了关键上下文,通过压缩失去了数据分辨率,并通过天真的全上下文注入超出了LLM的上下文窗口,阻碍了对复杂企业工作簿的可靠多步骤推理。我们提出了超越行的推理(BRTR),这是一个面向电子表格理解的多模态自主框架,它用迭代工具调用循环替代了单次检索,支持从复杂分析到结构化编辑的端到端Excel工作流程。经过超过200小时的专家人工评估,BRTR在三个前沿电子表格理解基准测试中实现了最先进的性能,在FRTR-Bench上超越了之前的方法25个百分点,在SpreadsheetLLM上超越了7个百分点,在FINCH上超越了32个百分点。我们评估了五种多模态嵌入模型,确定NVIDIA NeMo Retriever 1B是混合表格和视觉数据的最佳表现者,并测试了九种LLM。消融实验确认规划、检索和迭代推理各自都有显著贡献,成本分析显示GPT-5.2实现了最佳的效率与准确性权衡。在所有评估中,BRTR通过明确的工具调用痕迹保持了完全的可审计性。
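The iterative tool-calling loop replacing single-pass retrieval can be sketched generically; the planner below is a hard-coded stand-in for the LLM, and the tool names are invented, not BRTR's interface:

```python
def agentic_retrieve(question, tools, max_steps=5):
    """Iterative tool-calling loop: the planner requests more context until it
    can answer, instead of committing to one retrieval pass."""
    trace, context = [], []

    def plan(q, ctx):                      # toy planner standing in for the LLM
        if not ctx:
            return ("call", "read_range", "Sheet1!A1:C10")
        if len(ctx) == 1:
            return ("call", "render_chart", "Sheet2!chart1")
        return ("answer", f"answer derived from {len(ctx)} retrieved artifacts")

    for _ in range(max_steps):
        decision = plan(question, context)
        if decision[0] == "answer":
            return decision[1], trace      # trace doubles as the audit log
        _, tool, arg = decision
        result = tools[tool](arg)
        trace.append((tool, arg))          # explicit tool-call trace
        context.append(result)
    return "step budget exhausted", trace

tools = {"read_range": lambda a: f"cells<{a}>", "render_chart": lambda a: f"img<{a}>"}
print(agentic_retrieve("Q3 revenue trend?", tools))
```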
cs.CL / 50 / 2603.06505

Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning

在上下文中发声:通过对比学习实现的多语言自动语音识别与语音上下文对齐
Zhang, Yuchen, Mouratidis, Haralambos, Shekhar, Ravi
Abstract
Automatic speech recognition (ASR) has benefited from advances in pretrained speech and language models, yet most systems remain constrained to monolingual settings and short, isolated utterances. While recent efforts in context-aware ASR show promise, two key challenges persist: limited multilingual support and the absence of principled alignment between speech and contextual representations. In this paper, we introduce a context-aware multilingual ASR framework that supports diverse languages and accents while preserving the modularity of pretrained models. Our approach combines a frozen speech encoder and a decoder-only language model via a lightweight projection module, allowing structured context prompts, including dialogue history and biasing words, to guide transcription. To improve interaction between speech and context, we employ a contrastive learning objective that aligns their representations in a shared embedding space. Evaluations on over 1,500 hours of real-world conversational speech across 11 languages and 5 English dialects show that contextual input consistently improves recognition quality. Contrastive alignment provides additional gains when applied to different context types, with an overall performance gain of over 5%. These results highlight the importance of both contextual modeling and cross-modal alignment in multilingual ASR.
Chinese Translation
自动语音识别(ASR)受益于预训练语音和语言模型的进展,但大多数系统仍然局限于单语环境和短小的孤立语句。尽管近期在上下文感知的ASR方面取得了一定进展,但仍面临两个关键挑战:有限的多语言支持和语音与上下文表示之间缺乏原则性的对齐。在本文中,我们提出了一种上下文感知的多语言ASR框架,支持多种语言和口音,同时保持预训练模型的模块化。我们的方法通过轻量级投影模块结合了一个冻结的语音编码器和一个仅解码的语言模型,允许结构化的上下文提示(包括对话历史和偏向词)来指导转录。为了改善语音与上下文之间的交互,我们采用了一种对比学习目标,将它们的表示对齐到共享的嵌入空间。对11种语言和5种英语方言的1500多个小时的真实对话语音进行评估表明,上下文输入始终提高了识别质量。在应用于不同上下文类型时,对比对齐提供了额外的提升,总体性能提升超过5%。这些结果突显了上下文建模和跨模态对齐在多语言ASR中的重要性。
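The contrastive alignment objective is plausibly an InfoNCE-style loss over paired speech and context embeddings; a symmetric variant is sketched below, with the temperature and projection dimensions as assumptions:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(speech_emb, context_emb, tau=0.07):
    """Symmetric InfoNCE: matched speech/context pairs attract, in-batch
    mismatches repel, aligning both modalities in a shared embedding space."""
    s = F.normalize(speech_emb, dim=-1)
    c = F.normalize(context_emb, dim=-1)
    logits = s @ c.T / tau                      # (batch, batch) similarities
    targets = torch.arange(len(s))              # i-th speech matches i-th context
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

speech = torch.randn(4, 256)    # projected speech-encoder outputs
context = torch.randn(4, 256)   # projected context (history + biasing words)
print(contrastive_alignment_loss(speech, context).item())
```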
cs.CL / 51 / 2603.06552

KCLarity at SemEval-2026 Task 6: Encoder and Zero-Shot Approaches to Political Evasion Detection

KCLarity在SemEval-2026任务6中的表现:编码器与零样本方法在政治规避检测中的应用
Sage, Archie, Greco, Salvatore
Abstract
This paper describes the KCLarity team's participation in CLARITY, a shared task at SemEval 2026 on classifying ambiguity and evasion techniques in political discourse. We investigate two modelling formulations: (i) directly predicting the clarity label, and (ii) predicting the evasion label and deriving clarity through the task taxonomy hierarchy. We further explore several auxiliary training variants and evaluate decoder-only models in a zero-shot setting under the evasion-first formulation. Overall, the two formulations yield comparable performance. Among encoder-based models, RoBERTa-large achieves the strongest results on the public test set, while zero-shot GPT-5.2 generalises better on the hidden evaluation set.
Chinese Translation
本文描述了KCLarity团队在SemEval 2026的CLARITY共享任务中的参与,该任务旨在对政治话语中的模糊性和规避技术进行分类。我们研究了两种建模形式:(i)直接预测清晰度标签,以及(ii)预测规避标签并通过任务分类层次推导清晰度。我们进一步探索了几种辅助训练变体,并在规避优先的形式下评估仅解码器模型在零样本设置中的表现。总体而言,这两种形式的性能相当。在基于编码器的模型中,RoBERTa-large在公共测试集上取得了最佳结果,而零样本的GPT-5.2在隐藏评估集上表现出更好的泛化能力。