arXiv Daily Digest

271 Papers

SODA-CitrON: Static Object Data Association by Clustering Multi-Modal Sensor Detections Online

SODA-CitrON：通过在线聚类多模态传感器检测实现静态物体数据关联

Nausner, Jan, Wohlleben, Kilian, Hubner, Michael

Abstract

The online fusion and tracking of static objects from heterogeneous sensor detections is a fundamental problem in robotics, autonomous systems, and environmental mapping. Although classical data association approaches such as JPDA are well suited for dynamic targets, they are less effective for static objects observed intermittently and with heterogeneous uncertainties, where motion models provide minimal discriminative power with respect to clutter. In this paper, we propose a novel method for static object data association by clustering multi-modal sensor detections online (SODA-CitrON), while simultaneously estimating positions and maintaining persistent tracks for an unknown number of objects. The proposed unsupervised machine learning approach operates in a fully online manner and handles temporally uncorrelated and multi-sensor measurements. Additionally, it has a worst-case log-linear complexity in the number of sensor detections while providing full output explainability. We evaluate the proposed approach in different Monte Carlo simulation scenarios and compare it against state-of-the-art methods, including Bayesian filtering, DBSTREAM clustering, and JPDA. The results demonstrate that SODA-CitrON consistently outperforms the compared methods in terms of F1 score, position RMSE, MOTP, and MOTA in the static object mapping scenarios studied.

Chinese Translation

从异构传感器检测中在线融合和跟踪静态物体是机器人技术、自动化系统和环境映射中的一个基本问题。尽管经典的数据关联方法如JPDA（联合概率数据关联）非常适合动态目标，但对于间歇性观察到的静态物体和具有异构不确定性的情况，它们的效果较差，因为运动模型在杂波中提供的区分能力有限。本文提出了一种新颖的静态物体数据关联方法，通过在线聚类多模态传感器检测（SODA-CitrON），同时估计位置并为未知数量的物体保持持续跟踪。所提出的无监督机器学习方法以完全在线的方式运行，能够处理时间上不相关的多传感器测量。此外，它在传感器检测数量方面具有最坏情况下的对数线性复杂度，同时提供完整的输出可解释性。我们在不同的蒙特卡洛仿真场景中评估了所提出的方法，并将其与包括贝叶斯滤波、DBSTREAM聚类和JPDA在内的最先进方法进行了比较。结果表明，SODA-CitrON在所研究的静态物体映射场景中，在F1得分、位置均方根误差（RMSE）、MOTP和MOTA等指标上始终优于比较方法。
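
The idea of associating static-object detections by clustering them online can be illustrated with a small sketch. This is not the SODA-CitrON algorithm, whose details the abstract does not give. It is a generic greedy scheme under stated assumptions: 2D detections carry a scalar position variance, a detection joins the nearest track if it falls inside a hypothetical gate (`gate`, in combined standard deviations), and positions are fused by inverse-variance weighting. The linear scan over tracks is O(n) per detection and does not reproduce the log-linear complexity claimed in the abstract.

```python
import math
from dataclasses import dataclass


@dataclass
class Track:
    x: float
    y: float
    var: float      # variance of the fused position estimate
    hits: int = 1   # number of detections merged into this track


def update_tracks(tracks, detection, gate=3.0):
    """Assign one detection (x, y, var) to the nearest track or start a new one.

    `gate` is a hypothetical threshold, not a value from the paper.
    """
    det_x, det_y, det_var = detection
    best, best_d = None, float("inf")
    for t in tracks:
        # Distance normalised by the combined standard deviation.
        d = math.hypot(t.x - det_x, t.y - det_y) / math.sqrt(t.var + det_var)
        if d < best_d:
            best, best_d = t, d
    if best is None or best_d > gate:
        # No track is close enough: open a new persistent track.
        tracks.append(Track(det_x, det_y, det_var))
        return tracks
    # Inverse-variance weighted fusion of the track and the new detection.
    w = best.var / (best.var + det_var)          # weight given to the detection
    best.x += w * (det_x - best.x)
    best.y += w * (det_y - best.y)
    best.var = best.var * det_var / (best.var + det_var)
    best.hits += 1
    return tracks
```

Feeding detections one at a time keeps the procedure online, and because each track records how many detections it absorbed, every output position can be traced back to its inputs.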

cs.RO / 2 / 2602.22346

Detection and Recognition: A Pairwise Interaction Framework for Mobile Service Robots

检测与识别：移动服务机器人对偶交互框架

Liang, Mengyu, Schlegel, Sarah Gillet, Leite, Iolanda

Abstract

Autonomous mobile service robots, like lawnmowers or cleaning robots, operating in human-populated environments need to reason about local human-human interactions to support safe and socially aware navigation while fulfilling their tasks. For such robots, interaction understanding is not primarily a fine-grained recognition problem, but a perception problem under limited sensing quality and computational resources. Many existing approaches focus on holistic group activity recognition, which often requires complex and large models that may not be necessary for mobile service robots. Others use pairwise interaction methods, which commonly rely on skeletal representations, but their use in outdoor environments remains challenging. In this work, we argue that pairwise human interactions constitute a minimal yet sufficient perceptual unit for robot-centric social understanding. We study the problem of identifying interacting person pairs and classifying coarse-grained interaction behaviors sufficient for downstream group-level reasoning and service robot decision-making. To this end, we adopt a two-stage framework in which candidate interacting pairs are first identified based on lightweight geometric and motion cues, and interaction types are subsequently classified using a relation network. We evaluate the proposed approach on the JRDB dataset, where it achieves sufficient accuracy with reduced computational cost and model size compared to appearance-based methods. Additional experiments on the Collective Activity Dataset and a zero-shot test on a lawnmower-collected dataset further illustrate the generality of the proposed framework. These results suggest that pairwise geometric and motion cues provide a practical basis for interaction perception on mobile service robots, providing a promising method for integration into mobile robot navigation stacks in future work. Code will be released soon.

Chinese Translation

自主移动服务机器人，如割草机或清洁机器人，在人类聚集的环境中操作时，需要推理局部的人际交互，以支持安全和具有社会意识的导航，同时完成其任务。对于这样的机器人，交互理解并不是主要的细粒度识别问题，而是在有限的感知质量和计算资源下的感知问题。许多现有的方法专注于整体群体活动识别，这通常需要复杂且庞大的模型，而这些模型对于移动服务机器人可能并非必要。其他方法使用对偶交互方法，通常依赖于骨架表示，但在户外环境中的应用仍然具有挑战性。在本研究中，我们认为对偶人际交互构成了机器人中心社会理解的最小但足够的感知单元。我们研究了识别交互人对和分类粗粒度交互行为的问题，这些行为足以用于下游的群体级推理和服务机器人决策。为此，我们采用了一个两阶段框架，其中候选交互对首先基于轻量几何和运动线索进行识别，随后使用关系网络对交互类型进行分类。我们在JRDB数据集上评估了所提出的方法，与基于外观的方法相比，它在降低计算成本和模型大小的同时实现了足够的准确性。在集体活动数据集上的额外实验以及在割草机收集的数据集上的零样本测试进一步说明了所提出框架的普遍性。这些结果表明，对偶几何和运动线索为移动服务机器人上的交互感知提供了实用基础，为未来的移动机器人导航系统集成提供了有希望的方法。代码将很快发布。
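
The first stage described above, selecting candidate pairs from lightweight geometric and motion cues, might look roughly like the following sketch. The cues (planar distance and velocity difference) and the thresholds `max_dist` and `max_speed_diff` are illustrative assumptions, not the features or values used by the authors. The second-stage relation network is not shown.

```python
import math
from itertools import combinations


def candidate_pairs(people, max_dist=2.0, max_speed_diff=0.5):
    """Return id pairs that are close together and move similarly.

    people: dict mapping person id -> (x, y, vx, vy) in metres and m/s.
    Both thresholds are hypothetical and would need tuning on real data.
    """
    pairs = []
    for (i, a), (j, b) in combinations(sorted(people.items()), 2):
        dist = math.hypot(a[0] - b[0], a[1] - b[1])            # geometric cue
        speed_diff = math.hypot(a[2] - b[2], a[3] - b[3])      # motion cue
        if dist <= max_dist and speed_diff <= max_speed_diff:
            pairs.append((i, j))
    return pairs
```

Only the pairs that pass this cheap filter would be handed to a heavier classifier, which is where the computational saving would come from.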

cs.RO / 3 / 2602.22459

Hierarchical Trajectory Planning of Floating-Base Multi-Link Robot for Maneuvering in Confined Environments

浮动基座多链机器人在受限环境中机动的分层轨迹规划

Chen, Yicheng, Li, Jinjie, Liu, Haokun, Luo, Zicheng, Kaneko, Kotaro, Zhao, Moju

Abstract

Floating-base multi-link robots can change their shape during flight, making them well-suited for applications in confined environments such as autonomous inspection and search and rescue. However, trajectory planning for such systems remains an open challenge because the problem lies in a high-dimensional, constraint-rich space where collision avoidance must be addressed together with kinematic limits and dynamic feasibility. This work introduces a hierarchical trajectory planning framework that integrates global guidance with configuration-aware local optimization. First, we exploit the dual nature of these robots - the root link as a rigid body for guidance and the articulated joints for flexibility - to generate global anchor states that decompose the planning problem into tractable segments. Second, we design a local trajectory planner that optimizes each segment in parallel with differentiable objectives and constraints, systematically enforcing kinematic feasibility and maintaining dynamic feasibility by avoiding control singularities. Third, we implement a complete system that directly processes point-cloud data, eliminating the need for handcrafted obstacle models. Extensive simulations and real-world experiments confirm that this framework enables an articulated aerial robot to exploit its morphology for maneuvering that rigid robots cannot achieve. To the best of our knowledge, this is the first planning framework for floating-base multi-link robots that has been demonstrated on a real robot to generate continuous, collision-free, and dynamically feasible trajectories directly from raw point-cloud inputs, without relying on handcrafted obstacle models.

Chinese Translation

浮动基座多链机器人在飞行过程中可以改变其形状，使其非常适合于在受限环境中的应用，如自主检查和搜救。然而，这类系统的轨迹规划仍然是一个开放的挑战，因为该问题位于一个高维、约束丰富的空间中，必须同时解决碰撞避免、运动学限制和动态可行性。本研究提出了一种分层轨迹规划框架，将全局指导与考虑配置的局部优化相结合。首先，我们利用这些机器人的双重特性——根链作为刚体进行指导，关节则提供灵活性——生成全局锚定状态，将规划问题分解为可处理的片段。其次，我们设计了一个局部轨迹规划器，优化每个片段，同时考虑可微分的目标和约束，系统地强制执行运动学可行性，并通过避免控制奇异性来保持动态可行性。第三，我们实现了一个完整的系统，直接处理点云数据，消除了对手工制作障碍物模型的需求。大量的仿真和实际实验确认，该框架使得关节式空中机器人能够利用其形态进行刚性机器人无法实现的机动。根据我们的最佳知识，这是第一个在真实机器人上演示的浮动基座多链机器人规划框架，能够直接从原始点云输入生成连续、无碰撞且动态可行的轨迹，而无需依赖手工制作的障碍物模型。

cs.RO / 4 / 2602.22461

EgoAVFlow: Robot Policy Learning with Active Vision from Human Egocentric Videos via 3D Flow

EgoAVFlow：通过3D流从人类自我中心视频中学习机器人策略与主动视觉

Cho, Daesol, Jang, Youngseok, Xu, Danfei, Ha, Sehoon

Abstract

Egocentric human videos provide a scalable source of manipulation demonstrations; however, deploying them on robots requires active viewpoint control to maintain task-critical visibility, which human viewpoint imitation often fails to provide due to human-specific priors. We propose EgoAVFlow, which learns manipulation and active vision from egocentric videos through a shared 3D flow representation that supports geometric visibility reasoning and transfers without robot demonstrations. EgoAVFlow uses diffusion models to predict robot actions, future 3D flow, and camera trajectories, and refines viewpoints at test time with reward-maximizing denoising under a visibility-aware reward computed from predicted motion and scene geometry. Real-world experiments under actively changing viewpoints show that EgoAVFlow consistently outperforms prior human-demo-based baselines, demonstrating effective visibility maintenance and robust manipulation without robot demonstrations.

Chinese Translation

自我中心的人类视频提供了一种可扩展的操作演示来源；然而，将其应用于机器人需要主动视角控制以保持任务关键的可见性，而人类视角模仿往往无法提供这一点，因为其受到人类特有先验的影响。我们提出了EgoAVFlow，它通过共享的3D流表示从自我中心视频中学习操作和主动视觉，该表示支持几何可见性推理并能够在没有机器人演示的情况下进行迁移。EgoAVFlow使用扩散模型来预测机器人动作、未来的3D流和相机轨迹，并在测试时通过基于可见性意识的奖励最大化去噪来优化视角，该奖励是根据预测的运动和场景几何计算得出的。在主动变化视角下的真实世界实验表明，EgoAVFlow始终优于以往基于人类演示的基线，展示了有效的可见性维护和稳健的操作能力，而无需机器人演示。

cs.RO / 5 / 2602.22474

When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering

何时行动、询问或学习：基于不确定性的政策引导

Yuan, Jessie, Wu, Yilin, Bajcsy, Andrea

Abstract

Policy steering is an emerging way to adapt robot behaviors at deployment time: a learned verifier analyzes low-level action samples proposed by a pre-trained policy (e.g., diffusion policy) and selects only those aligned with the task. While Vision-Language Models (VLMs) are promising general-purpose verifiers due to their reasoning capabilities, existing frameworks often assume these models are well-calibrated. In practice, overconfident judgments from VLMs can degrade steering performance under both high-level semantic uncertainty in task specifications and low-level action uncertainty or incapability of the pre-trained policy. We propose uncertainty-aware policy steering (UPS), a framework that jointly reasons about semantic task uncertainty and low-level action feasibility, and selects an uncertainty resolution strategy: execute a high-confidence action, clarify task ambiguity via natural language queries, or ask for action interventions to correct the low-level policy when it is deemed incapable of the task. We leverage conformal prediction to calibrate the composition of the VLM and the pre-trained base policy, providing statistical assurances that the verifier selects the correct strategy. After collecting interventions during deployment, we employ residual learning to improve the capability of the pre-trained policy, enabling the system to learn continually but with minimal expensive human feedback. We demonstrate our framework through experiments in simulation and on hardware, showing that UPS can disentangle confident, ambiguous, and incapable scenarios and minimizes expensive user interventions compared to uncalibrated baselines and prior human- or robot-gated continual learning approaches. Videos can be found at https://jessie-yuan.github.io/ups/

Chinese Translation

政策引导是一种在部署时适应机器人行为的新兴方法：一个学习的验证器分析由预训练政策（例如，扩散政策）提出的低级动作样本，并仅选择与任务一致的样本。尽管视觉-语言模型（VLM）由于其推理能力而成为有前景的通用验证器，但现有框架通常假设这些模型是经过良好校准的。在实践中，VLM的过度自信判断可能会在任务规范的高级语义不确定性和预训练政策的低级动作不确定性或无能情况下降低引导性能。我们提出了基于不确定性的政策引导（UPS），这是一个共同推理语义任务不确定性和低级动作可行性的框架，并选择一种不确定性解决策略：执行高置信度动作，通过自然语言查询澄清任务模糊性，或在低级政策被认为在任务中无能时请求动作干预以进行纠正。我们利用符合预测来校准VLM和预训练基础政策的组合，提供统计保证以确保验证器选择正确的策略。在部署过程中收集干预后，我们采用残差学习来提高预训练政策的能力，使系统能够持续学习，但所需的人类反馈成本最低。我们通过模拟和硬件实验展示了我们的框架，表明UPS能够区分自信、模糊和无能的场景，并与未校准的基线以及之前的人类或机器人门控的持续学习方法相比，最小化昂贵的用户干预。视频可以在 https://jessie-yuan.github.io/ups/ 找到。
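
The role conformal prediction plays in choosing among the three strategies can be sketched with standard split conformal prediction. This is a minimal illustration, not the UPS implementation. It assumes the verifier outputs a probability per option, that the nonconformity score is `1 - p`, and that incapability is represented by a hypothetical `"none"` option. The mapping from prediction set to strategy is likewise an assumption made for the example.

```python
import math


def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile of calibration nonconformity scores.

    cal_scores: scores 1 - p(true option) from held-out calibration data.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))   # rank of the conformal quantile
    if k > n:
        return float("inf")                # too few points: keep every option
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, qhat):
    """Keep every option whose score 1 - p does not exceed the threshold."""
    return {opt for opt, p in probs.items() if 1.0 - p <= qhat}


def choose_strategy(pred_set):
    """Map a prediction set to a strategy (assumed mapping, for illustration)."""
    if pred_set == {"none"}:
        return "request_intervention"      # no proposed action looks feasible
    if len(pred_set) == 1:
        return "execute"                   # one confident option remains
    return "ask_clarification"             # several (or zero) options remain
```

Under the usual exchangeability assumption, the prediction set contains the correct option with probability at least `1 - alpha`, which is the kind of statistical assurance the abstract refers to.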

cs.RO / 6 / 2602.22514

SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation

SignVLA：一种无注释的视觉-语言-动作框架，用于实时手语引导的机器人操作

Tan, Xinyu, Bai, Ningwei, Gardener, Harry, Zhong, Zhengyang, Zhang, Luoyu, Yang, Liuhaichen, Duan, Zhekai, Galeitsiwe, Monkgogi, Tang, Zezhi

Abstract

We present, to our knowledge, the first sign language-driven Vision-Language-Action (VLA) framework for intuitive and inclusive human-robot interaction. Unlike conventional approaches that rely on gloss annotations as intermediate supervision, the proposed system adopts a gloss-free paradigm and directly maps visual sign gestures to semantic instructions. This design reduces annotation cost and avoids the information loss introduced by gloss representations, enabling more natural and scalable multimodal interaction. In this work, we focus on a real-time alphabet-level finger-spelling interface that provides a robust and low-latency communication channel for robotic control. Compared with large-scale continuous sign language recognition, alphabet-level interaction offers improved reliability, interpretability, and deployment feasibility in safety-critical embodied environments. The proposed pipeline transforms continuous gesture streams into coherent language commands through geometric normalization, temporal smoothing, and lexical refinement, ensuring stable and consistent interaction. Furthermore, the framework is designed to support future integration of transformer-based gloss-free sign language models, enabling scalable word-level and sentence-level semantic understanding. Experimental results demonstrate the effectiveness of the proposed system in grounding sign-derived instructions into precise robotic actions under diverse interaction scenarios. These results highlight the potential of the framework to advance accessible, scalable, and multimodal embodied intelligence.

Chinese Translation

我们提出了一个手语驱动的视觉-语言-动作（VLA）框架，这是我们所知的首个用于直观和包容性人机交互的系统。与依赖于注释作为中介监督的传统方法不同，所提出的系统采用无注释范式，直接将视觉手势映射到语义指令。这一设计降低了注释成本，避免了由注释表示引入的信息损失，从而实现了更自然和可扩展的多模态交互。在本研究中，我们专注于一个实时字母级拼写接口，为机器人控制提供了一个稳健且低延迟的通信通道。与大规模连续手语识别相比，字母级交互在安全关键的具身环境中提供了更高的可靠性、可解释性和部署可行性。所提出的管道通过几何归一化、时间平滑和词汇细化将连续手势流转换为连贯的语言指令，确保了稳定和一致的交互。此外，该框架设计支持未来整合基于变换器的无注释手语模型，实现可扩展的词级和句级语义理解。实验结果证明了所提出系统在多种交互场景下将手势派生指令转化为精确机器人动作的有效性。这些结果突显了该框架在推动可访问、可扩展和多模态具身智能方面的潜力。
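
Two of the pipeline steps named above, temporal smoothing and lexical refinement, can be sketched as follows. This is an illustrative stand-in rather than the authors' method. It assumes a per-frame letter classifier already exists, smooths its output by sliding-window majority vote, and snaps the resulting string to a hypothetical command vocabulary using Python's `difflib`. The window size and similarity cutoff are arbitrary choices.

```python
from collections import Counter, deque
from difflib import get_close_matches


def smooth_letters(frame_labels, window=5):
    """Majority-vote each frame over a sliding window, then collapse repeats.

    Note: collapsing repeats drops genuine double letters, so a real system
    would need an explicit pause or repeat gesture.
    """
    recent = deque(maxlen=window)
    out = []
    for label in frame_labels:
        recent.append(label)
        winner, count = Counter(recent).most_common(1)[0]
        # Emit only a strict-majority letter that differs from the last one.
        if count > window // 2 and (not out or out[-1] != winner):
            out.append(winner)
    return "".join(out)


def refine_word(raw, vocabulary, cutoff=0.6):
    """Snap a noisy spelling to the closest vocabulary entry, if any is close."""
    match = get_close_matches(raw, vocabulary, n=1, cutoff=cutoff)
    return match[0] if match else raw
```

For example, a frame stream with one misclassified frame, `"pppppiiqiiccccckkkkk"`, smooths to `"pick"`, and a misspelled `"plce"` is refined to `"place"` against the vocabulary `["pick", "place", "push"]`.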

cs.RO / 7 / 2602.22579

Metamorphic Testing of Vision-Language Action-Enabled Robots

视觉-语言-动作启用机器人之变形测试

Valle, Pablo, Segura, Sergio, Ali, Shaukat, Arrieta, Aitor

Abstract

Vision-Language-Action (VLA) models are multimodal robotic task controllers that, given an instruction and visual inputs, produce a sequence of low-level control actions (or motor commands) enabling a robot to execute the requested task in the physical environment. These systems face the test oracle problem from multiple perspectives. On the one hand, a test oracle must be defined for each instruction prompt, which is a complex and non-generalizable approach. On the other hand, current state-of-the-art oracles typically capture symbolic representations of the world (e.g., robot and object states), enabling the correctness evaluation of a task, but fail to assess other critical aspects, such as the quality with which VLA-enabled robots perform a task. In this paper, we explore whether Metamorphic Testing (MT) can alleviate the test oracle problem in this context. To do so, we propose two metamorphic relation patterns and five metamorphic relations to assess whether changes to the test inputs impact the original trajectory of the VLA-enabled robots. An empirical study involving five VLA models, two simulated robots, and four robotic tasks shows that MT can effectively alleviate the test oracle problem by automatically detecting diverse types of failures, including, but not limited to, uncompleted tasks. More importantly, the proposed MRs are generalizable, making the proposed approach applicable across different VLA models, robots, and tasks, even in the absence of test oracles.

Chinese Translation

视觉-语言-动作（VLA）模型是多模态机器人任务控制器，能够根据指令和视觉输入生成一系列低级控制动作（或运动指令），使机器人能够在物理环境中执行请求的任务。这些系统从多个角度面临测试oracle问题。一方面，必须为每个指令提示定义一个测试oracle，这是一种复杂且不可推广的方法。另一方面，目前的最先进oracle通常捕捉世界的符号表示（例如，机器人和物体状态），使得任务的正确性评估成为可能，但未能评估其他关键方面，例如VLA启用机器人执行任务的质量。本文探讨了变形测试（MT）是否可以缓解这一背景下的测试oracle问题。为此，我们提出了两种变形关系模式和五种变形关系，以评估测试输入的变化是否影响VLA启用机器人的原始轨迹。一项涉及五个VLA模型、两个模拟机器人和四个机器人任务的实证研究表明，MT可以有效缓解测试oracle问题，通过自动检测多种类型的故障，包括但不限于未完成的任务。更重要的是，所提出的变形关系具有可推广性，使得该方法适用于不同的VLA模型、机器人和任务，即使在没有测试oracle的情况下。
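
A metamorphic relation of the kind described above can be checked without a task-specific oracle: run the system on a source input and on a transformed follow-up input, then compare the two trajectories. The sketch below shows the general shape only. The relation (a paraphrased instruction should leave the trajectory unchanged), the tolerance `tol`, and `toy_policy` (a stand-in with a deliberately injected fault) are all assumptions for illustration and are not the relations or models from the paper.

```python
import math


def max_deviation(traj_a, traj_b):
    """Largest pointwise distance between two equally sampled trajectories."""
    return max(math.dist(p, q) for p, q in zip(traj_a, traj_b))


def check_relation(policy, source_input, transform, tol=0.05):
    """Return True if the follow-up trajectory stays within tol of the source."""
    source_traj = policy(source_input)
    followup_traj = policy(transform(source_input))
    if len(source_traj) != len(followup_traj):
        return False
    return max_deviation(source_traj, followup_traj) <= tol


def toy_policy(inp):
    """Stand-in for a VLA model: move straight to the goal in five waypoints.

    Injected fault: an all-uppercase instruction halves the motion.
    """
    gx, gy = inp["goal"]
    scale = 0.5 if inp["instruction"].isupper() else 1.0
    return [(scale * gx * k / 4, scale * gy * k / 4) for k in range(5)]
```

With `toy_policy`, replacing "pick up" by "grasp" satisfies the relation, while upper-casing the instruction violates it and exposes the injected fault, even though no expected trajectory was ever specified.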

cs.RO / 8 / 2602.22628

Designing Robots for Families: In-Situ Prototyping for Contextual Reminders on Family Routines

为家庭设计机器人：针对家庭日常活动的情境提醒的原型设计

Xu, Michael F., Zhao, Enhui, Zhang, Yawen, Michaelis, Joseph E., Sebo, Sarah, Mutlu, Bilge

Abstract

Robots are increasingly entering the daily lives of families, yet their successful integration into domestic life remains a challenge. We explore family routines as a critical entry point for understanding how robots might find a sustainable role in everyday family settings. Together with each of ten families, we co-designed robot interactions and behaviors, and a plan for the robot to support their chosen routines, accounting for contextual factors such as timing, participants, locations, and the activities in the environment. We then designed, prototyped, and deployed a mobile social robot in a four-day, in-home user study. Families welcomed the robot's reminders, with parents especially appreciating the offloading of some reminding tasks. At the same time, interviews revealed tensions around timing, authority, and family dynamics, highlighting the complexity of integrating robots into households beyond the immediate task of reminders. Based on these insights, we offer design implications for robot-facilitated contextual reminders and discuss broader considerations for designing robots for family settings.

Chinese Translation

机器人日益融入家庭的日常生活，但其成功整合到家庭生活中仍然面临挑战。我们将家庭日常活动视为理解机器人如何在日常家庭环境中找到可持续角色的关键切入点。与十个家庭共同设计了机器人交互和行为，以及支持他们选择的日常活动的机器人计划，同时考虑了时间、参与者、地点和环境活动等情境因素。随后，我们设计、原型制作并部署了一款移动社交机器人，进行了为期四天的家庭用户研究。家庭成员对机器人的提醒表示欢迎，尤其是父母们对减轻一些提醒任务表示感激。同时，访谈揭示了关于时间、权威和家庭动态的紧张关系，突显了将机器人整合进家庭生活的复杂性，超越了单纯的提醒任务。基于这些见解，我们提出了针对机器人辅助情境提醒的设计启示，并讨论了为家庭环境设计机器人的更广泛考虑。

cs.RO / 9 / 2602.22663

Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline

重新思考视觉-语言-动作模型的实用性：一个全面的基准和改进的基线

Song, Wenxuan, Chen, Jiayi, Sun, Xiaoquan, Lei, Huashuo, Qin, Yikai, Zhao, Wei, Ding, Pengxiang, Zhao, Han, Wang, Tongxin, Hou, Pengxu, Zhong, Zhide, Yan, Haodong, Wang, Donglin, Ma, Jun, Li, Haoang

Abstract

Vision-Language-Action (VLA) models have emerged as generalist robotic agents. However, existing VLAs are hindered by excessive parameter scales, prohibitive pre-training requirements, and limited applicability to diverse embodiments. To improve the practicality of VLAs, we propose a comprehensive benchmark and an improved baseline. First, we propose CEBench, a new benchmark spanning diverse embodiments in both simulation and the real world with consideration of domain randomization. We collect 14.4k simulated trajectories and 1.6k real-world expert-curated trajectories to support training on CEBench. Second, using CEBench as our testbed, we study three critical aspects of VLAs' practicality and offer several key findings. Informed by these findings, we introduce LLaVA-VLA, a lightweight yet powerful VLA designed for practical deployment on consumer-grade GPUs. Architecturally, it integrates a compact VLM backbone with multi-view perception, proprioceptive tokenization, and action chunking. To eliminate reliance on costly pre-training, LLaVA-VLA adopts a two-stage training paradigm including post-training and fine-tuning. Furthermore, LLaVA-VLA extends the action space to unify navigation and manipulation. Experiments across embodiments demonstrate the generalization and versatility of LLaVA-VLA, while real-world mobile manipulation experiments establish it as the first end-to-end VLA model for mobile manipulation. We will open-source all datasets, code, and checkpoints upon acceptance to foster reproducibility and future research.

Chinese Translation

视觉-语言-动作（VLA）模型已成为一种通用的机器人代理。然而，现有的VLA模型受到过多参数规模、昂贵的预训练要求以及对多样化体现的有限适用性等问题的制约。为了提高VLA的实用性，我们提出了一个全面的基准和改进的基线。首先，我们提出了CEBench，这是一个新的基准，涵盖了考虑领域随机化的模拟和现实世界中的多样化体现。我们收集了14.4k个模拟轨迹和1.6k个现实世界专家策划的轨迹，以支持在CEBench上的训练。其次，利用CEBench作为我们的测试平台，我们研究了VLA实用性的三个关键方面，并提供了若干重要发现。在这些发现的启发下，我们引入了LLaVA-VLA，这是一种轻量级但强大的VLA，旨在在消费级GPU上进行实际部署。在架构上，它将紧凑的VLM主干与多视角感知、自我感知标记化和动作分块相结合。为了消除对昂贵预训练的依赖，LLaVA-VLA采用了包括后训练和微调在内的两阶段训练范式。此外，LLaVA-VLA扩展了动作空间，以统一导航和操作。跨体现的实验展示了LLaVA-VLA的泛化能力和多样性，而现实世界的移动操作实验则确立了其作为首个用于移动操作的端到端VLA模型的地位。我们将在接受后开源所有数据集、代码和检查点，以促进可重复性和未来研究。

cs.RO / 10 / 2602.22671

Does the testing environment matter? Carsickness across on-road, test-track, and driving simulator conditions

测试环境是否重要？公路、测试跑道和驾驶模拟器条件下的晕车现象

Papaioannou, Georgios, Shyrokau, Barys

Abstract

Carsickness has gained significant attention with the rise of automated vehicles, prompting extensive research across on-road, test-track, and driving simulator environments to understand its occurrence and develop mitigation strategies. However, the lack of carsickness standardization complicates comparisons across studies and environments. Previous works demonstrate measurement validity between two setups at most (e.g., on-road vs. driving simulator), leaving gaps in multi-environment comparisons. This study investigates the recreation of an on-road motion sickness exposure - previously replicated on a test track - using a motion-based driving simulator. Twenty-eight participants performed an eyes-off-road non-driving task while reporting motion sickness using the Misery Scale during the experiment and the Motion Sickness Assessment Questionnaire afterward. Psychological factors known to influence motion sickness were also assessed. The results present subjective and objective measurements for motion sickness across the considered environments. In this paper, acceleration measurements, objective metrics and subjective motion sickness ratings across environments are compared, highlighting key differences in sickness occurrence for simulator-based research validity. Significantly lower motion sickness scores are reported in the simulator compared to on-road and test-track conditions, due to its limited working envelope to reproduce low-frequency (<0.5 Hz) motions, which are the most provocative for motion sickness.

Chinese Translation

随着自动驾驶车辆的兴起，晕车现象引起了广泛关注，促使在公路、测试跑道和驾驶模拟器环境中进行大量研究，以理解其发生机制并制定缓解策略。然而，晕车标准化的缺乏使得跨研究和环境的比较变得复杂。以往的研究最多仅在两种设置之间（例如，公路与驾驶模拟器）展示了测量的有效性，导致多环境比较中存在空白。本研究探讨了使用基于运动的驾驶模拟器重现公路运动病暴露的可能性，该暴露之前已在测试跑道上复制。二十八名参与者在实验过程中执行了一个不看路的非驾驶任务，并使用痛苦量表（Misery Scale）报告运动病，同时在实验后填写了运动病评估问卷（Motion Sickness Assessment Questionnaire）。还评估了已知会影响运动病的心理因素。结果展示了在所考虑环境中运动病的主观和客观测量数据。本文比较了不同环境下的加速度测量、客观指标和主观运动病评分，突出了模拟器基础研究有效性中疾病发生的关键差异。与公路和测试跑道条件相比，模拟器中的运动病评分显著较低，原因在于其有限的工作范围无法重现低频（<0.5 Hz）运动，而这些运动是引发运动病的最具刺激性的因素。

cs.RO / 11 / 2602.22707

SCOPE: Skeleton Graph-Based Computation-Efficient Framework for Autonomous UAV Exploration

SCOPE：基于骨架图的计算高效框架用于自主无人机探索

Li, Kai, Zheng, Shengtao, Xiu, Linkun, Sheng, Yuze, Zhang, Xiao-Ping, Huang, Dongyue, Chen, Xinlei

Abstract

Autonomous exploration in unknown environments is key for mobile robots, helping them perceive, map, and make decisions in complex areas. However, current methods often rely on frequent global optimization, suffering from high computational latency and trajectory oscillation, especially on resource-constrained edge devices. To address these limitations, we propose SCOPE, a novel framework that incrementally constructs a real-time skeletal graph and introduces Implicit Unknown Region Analysis for efficient spatial reasoning. The planning layer adopts a hierarchical on-demand strategy: the Proximal Planner generates smooth, high-frequency local trajectories, while the Region-Sequence Planner is activated only when necessary to optimize global visitation order. Comparative evaluations in simulation demonstrate that SCOPE achieves competitive exploration performance comparable to state-of-the-art global planners, while reducing computational cost by an average of 86.9%. Real-world experiments further validate the system's robustness and low latency in practical scenarios.

Chinese Translation

在未知环境中的自主探索是移动机器人关键的能力，帮助它们在复杂区域中感知、绘图和做出决策。然而，当前的方法往往依赖于频繁的全局优化，导致高计算延迟和轨迹振荡，特别是在资源受限的边缘设备上。为了解决这些限制，我们提出了SCOPE，一个新颖的框架，逐步构建实时骨架图，并引入隐式未知区域分析以实现高效的空间推理。规划层采用分层按需策略：近端规划器（Proximal Planner）生成平滑的高频局部轨迹，而区域序列规划器（Region-Sequence Planner）仅在必要时激活以优化全局访问顺序。在仿真中的比较评估表明，SCOPE实现了与最先进的全局规划器相当的竞争性探索性能，同时将计算成本平均降低了86.9%。实际实验进一步验证了该系统在实际场景中的鲁棒性和低延迟。

cs.RO / 12 / 2602.22714

Robust Helicopter Ship Deck Landing With Guaranteed Timing Using Shrinking-Horizon Model Predictive Control

基于收缩视域模型预测控制的鲁棒直升机舰船甲板着陆与保证时效性

Schitz, Philipp, Mercorelli, Paolo, Dauer, Johann C.

Abstract

We present a runtime-efficient algorithm for autonomous helicopter landings on moving ship decks based on Shrinking-Horizon Model Predictive Control (SHMPC). First, a suitable planning model capturing the relevant aspects of the full nonlinear helicopter dynamics is derived. Next, we use the SHMPC together with a touchdown controller stage to ensure a pre-specified maneuver time and an associated landing time window despite the presence of disturbances. A high disturbance rejection performance is achieved by designing an ancillary controller with disturbance feedback. Thus, given a target position and time, a safe landing with suitable terminal conditions is guaranteed if the initial optimization problem is feasible. The efficacy of our approach is shown in simulation, where all maneuvers achieve a high landing precision in strong winds while satisfying timing and operational constraints with maximum computation times in the millisecond range.

Chinese Translation

我们提出了一种高效的算法，用于在移动舰船甲板上实现自主直升机着陆，该算法基于收缩视域模型预测控制（SHMPC）。首先，推导出一个合适的规划模型，以捕捉全非线性直升机动力学的相关方面。接下来，我们结合SHMPC和着陆控制器阶段，以确保在存在干扰的情况下，预先指定的机动时间和相关的着陆时间窗口。通过设计一个具有干扰反馈的辅助控制器，实现了高效的干扰抑制性能。因此，给定目标位置和时间，如果初始优化问题是可行的，则可以保证安全着陆并满足适当的终端条件。我们的方案在仿真中显示出有效性，所有机动在强风条件下都能实现高精度着陆，同时满足时效性和操作约束，最大计算时间在毫秒级范围内。
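
The defining feature of shrinking-horizon MPC, that the end of the horizon stays fixed so the horizon gets one step shorter at every replan, can be shown on a toy problem. The sketch below is not the helicopter controller from the paper. It assumes a scalar single-integrator model x_{k+1} = x_k + u_k + w_k and a minimum-energy objective, for which the optimal plan over the remaining steps is simply an equal split of the remaining distance. The ancillary controller and touchdown stage mentioned in the abstract are omitted.

```python
def plan(x, target, steps_left):
    """Minimum-energy inputs reaching `target` from `x` in `steps_left` steps.

    For x_{k+1} = x_k + u_k, minimising sum(u_k^2) subject to the terminal
    constraint gives equal inputs (target - x) / steps_left.
    """
    u = (target - x) / steps_left
    return [u] * steps_left


def run_shmpc(x0, target, horizon, disturbances):
    """Replan every step over a horizon that shrinks toward a fixed end time."""
    x, history = x0, [x0]
    for k in range(horizon):
        u_seq = plan(x, target, horizon - k)    # horizon shrinks by one each step
        x = x + u_seq[0] + disturbances[k]      # apply only the first input
        history.append(x)
    return history
```

Because each replan starts from the measured state, disturbances at earlier steps are absorbed by later inputs and the arrival time never moves. A disturbance at the very last step cannot be corrected in this toy version, which is one reason a practical design adds a dedicated terminal stage.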

cs.RO / 13 / 2602.22731

Sapling-NeRF: Geo-Localised Sapling Reconstruction in Forests for Ecological Monitoring

Sapling-NeRF：用于生态监测的森林中地理定位幼苗重建

Muñoz-Bañón, Miguel Ángel, Chebrolu, Nived, Moorthy, Sruthi M. Krishna, Tao, Yifu, Torres, Fernando, Salguero-Gómez, Roberto, Fallon, Maurice

Abstract

Saplings are key indicators of forest regeneration and overall forest health. However, their fine-scale architectural traits are difficult to capture with existing 3D sensing methods, which makes quantitative evaluation difficult. Terrestrial Laser Scanners (TLS), Mobile Laser Scanners (MLS), or traditional photogrammetry approaches poorly reconstruct thin branches and dense foliage, and lack the scale consistency needed for long-term monitoring. Implicit 3D reconstruction methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) are promising alternatives, but cannot recover the true scale of a scene and lack any means to be accurately geo-localised. In this paper, we present a pipeline which fuses NeRF, LiDAR SLAM, and GNSS to enable repeatable, geo-localised ecological monitoring of saplings. Our system proposes a three-level representation: (i) coarse Earth-frame localisation using GNSS, (ii) LiDAR-based SLAM for centimetre-accurate localisation and reconstruction, and (iii) NeRF-derived object-centric dense reconstruction of individual saplings. This approach enables repeatable quantitative evaluation and long-term monitoring of sapling traits. Our experiments in forest plots in Wytham Woods (Oxford, UK) and Evo (Finland) show that stem height, branching patterns, and leaf-to-wood ratios can be captured with increased accuracy as compared to TLS. We demonstrate that accurate stem skeletons and leaf distributions can be measured for saplings with heights between 0.5m and 2m in situ, giving ecologists access to richer structural and quantitative data for analysing forest dynamics.

Chinese Translation

幼苗是森林再生和整体森林健康的关键指标。然而，现有的3D传感方法难以捕捉它们的细微结构特征，这使得定量评估变得困难。地面激光扫描仪（Terrestrial Laser Scanners, TLS）、移动激光扫描仪（Mobile Laser Scanners, MLS）或传统的摄影测量方法在重建细小的树枝、密集的树叶方面效果不佳，并且缺乏长期监测所需的尺度一致性。隐式3D重建方法，如神经辐射场（Neural Radiance Fields, NeRF）和3D高斯点云（3D Gaussian Splatting, 3DGS）是有前景的替代方案，但无法恢复场景的真实尺度，并且缺乏准确的地理定位手段。在本文中，我们提出了一种融合NeRF、激光雷达SLAM和全球导航卫星系统（GNSS）的管道，以实现可重复的、地理定位的幼苗生态监测。我们的系统提出了三层次的表示：（i）使用GNSS进行粗略的地球框架定位，（ii）基于激光雷达的SLAM实现厘米级精度的定位和重建，以及（iii）基于NeRF的以物体为中心的个体幼苗密集重建。这种方法使得幼苗特征的可重复定量评估和长期监测成为可能。我们在英国牛津的Wytham Woods和芬兰的Evo森林地块的实验表明，与TLS相比，茎高、分枝模式和叶木比的捕捉精度有所提高。我们展示了可以在现场测量高度在0.5米到2米之间的幼苗的准确茎骨架和叶分布，为生态学家提供了更丰富的结构和定量数据，以分析森林动态。

cs.RO / 14 / 2602.22733

Pixel2Catch: Multi-Agent Sim-to-Real Transfer for Agile Manipulation with a Single RGB Camera

Pixel2Catch：基于单个RGB相机的多智能体仿真到现实转移用于灵活操作

Kim, Seongyong, Cho, Junhyeon, Lee, Kang-Won, Lim, Soo-Chul

Abstract

To catch a thrown object, a robot must be able to perceive the object's motion and generate control actions in a timely manner. Rather than explicitly estimating the object's 3D position, this work focuses on a novel approach that recognizes object motion using pixel-level visual information extracted from a single RGB image. Such visual cues capture changes in the object's position and scale, allowing the policy to reason about the object's motion. Furthermore, to achieve stable learning in a high-DoF system composed of a robot arm equipped with a multi-fingered hand, we design a heterogeneous multi-agent reinforcement learning framework that defines the arm and hand as independent agents with distinct roles. Each agent is trained cooperatively using role-specific observations and rewards, and the learned policies are successfully transferred from simulation to the real world.

Chinese Translation

为了捕捉被投掷的物体，机器人必须能够及时感知物体的运动并生成控制动作。本研究并不明确估计物体的三维位置，而是关注一种新颖的方法，通过从单个RGB图像中提取的像素级视觉信息来识别物体运动。这种视觉线索捕捉了物体位置和尺度的变化，使得策略能够推理物体的运动。此外，为了在由配备多指手的机器人手臂组成的高自由度系统中实现稳定学习，我们设计了一种异构多智能体强化学习框架，将手臂和手定义为具有不同角色的独立智能体。每个智能体使用特定角色的观察和奖励进行协同训练，学习到的策略成功地从仿真转移到现实世界。
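
The claim that pixel-level cues encode changes in an object's position and scale can be illustrated with a small sketch. It assumes a binary object mask is already available for each frame (how the mask is obtained is outside this example) and is not the feature extractor used in the paper. The centroid gives image-plane position, and the pixel count serves as a proxy for apparent size, which grows as the object approaches the camera.

```python
def mask_features(mask):
    """Return (cx, cy, area) for a binary mask given as rows of 0/1 values.

    cx, cy: centroid in pixel coordinates; area: number of object pixels.
    Returns None when the object is not visible.
    """
    pixels = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pixels:
        return None
    area = len(pixels)
    cx = sum(x for x, _ in pixels) / area
    cy = sum(y for _, y in pixels) / area
    return cx, cy, area


def motion_cue(prev, curr):
    """Change between two feature tuples: (dx, dy, area ratio)."""
    return curr[0] - prev[0], curr[1] - prev[1], curr[2] / prev[2]
```

An area ratio above 1 between consecutive frames suggests the object is getting closer, so a policy could in principle react to approach speed without an explicit 3D position estimate.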

cs.RO / 15 / 2602.22801

Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

释放扩散模型在端到端自主驾驶中的潜力

Zheng, Yinan, Tan, Tianyi, Huang, Bin, Liu, Enguang, Liang, Ruiming, Zhang, Jianlin, Cui, Jianwei, Chen, Guang, Ma, Kun, Ye, Hangjun, Chen, Long, Zhang, Ya-Qin, Zhan, Xianyuan, Liu, Jingjing

Abstract

Diffusion models have become a popular choice for decision-making tasks in robotics, and more recently, are also being considered for solving autonomous driving tasks. However, their applications and evaluations in autonomous driving remain limited to simulation-based or laboratory settings. The full strength of diffusion models for large-scale, complex real-world settings, such as End-to-End Autonomous Driving (E2E AD), remains underexplored. In this study, we conducted a systematic and large-scale investigation to unleash the potential of the diffusion models as planners for E2E AD, based on a tremendous amount of real-vehicle data and road testing. Through comprehensive and carefully controlled studies, we identify key insights into the diffusion loss space, trajectory representation, and data scaling that significantly impact E2E planning performance. Moreover, we also provide an effective reinforcement learning post-training strategy to further enhance the safety of the learned planner. The resulting diffusion-based learning framework, Hyper Diffusion Planner (HDP), is deployed on a real-vehicle platform and evaluated across 6 urban driving scenarios and 200 km of real-world testing, achieving a notable 10x performance improvement over the base model. Our work demonstrates that diffusion models, when properly designed and trained, can serve as effective and scalable E2E AD planners for complex, real-world autonomous driving tasks.

Chinese Translation

扩散模型已成为机器人决策任务中的热门选择，最近也开始被考虑用于解决自主驾驶任务。然而，它们在自主驾驶中的应用和评估仍然局限于基于仿真或实验室的环境。扩散模型在大规模、复杂的现实世界环境中的全部潜力，尤其是在端到端自主驾驶（E2E AD）方面，仍然未得到充分探索。在本研究中，我们进行了系统且大规模的调查，以释放扩散模型作为E2E AD规划器的潜力，基于大量真实车辆数据和道路测试。通过全面且精心控制的研究，我们识别出扩散损失空间、轨迹表示和数据扩展等关键见解，这些因素显著影响E2E规划性能。此外，我们还提供了一种有效的强化学习后训练策略，以进一步增强所学习规划器的安全性。最终的基于扩散的学习框架Hyper Diffusion Planner（HDP）在真实车辆平台上部署，并在6个城市驾驶场景和200公里的现实世界测试中进行了评估，取得了比基础模型显著提高10倍的性能。我们的工作表明，当扩散模型经过适当设计和训练时，可以作为复杂现实世界自主驾驶任务的有效且可扩展的E2E AD规划器。


cs.RO / 16 / 2602.22818

LeRobot: An Open-Source Library for End-to-End Robot Learning

LeRobot：一个用于端到端机器人学习的开源库

Cadene, Remi, Aliberts, Simon, Capuano, Francesco, Aractingi, Michel, Zouitine, Adil, Kooijmans, Pepijn, Choghari, Jade, Russi, Martino, Pascal, Caroline, Palma, Steven, Shukor, Mustafa, Moss, Jess, Soare, Alexander, Aubakirova, Dana, Lhoest, Quentin, Gallouédec, Quentin, Wolf, Thomas

Abstract

Robotics is undergoing a significant transformation powered by advances in high-level control techniques based on machine learning, giving rise to the field of robot learning. Recent progress in robot learning has been accelerated by the increasing availability of affordable teleoperation systems, large-scale openly available datasets, and scalable learning-based methods. However, development in the field of robot learning is often slowed by fragmented, closed-source tools designed to only address specific sub-components within the robotics stack. In this paper, we present lerobot, an open-source library that integrates across the entire robot learning stack, from low-level middleware communication for motor controls to large-scale dataset collection, storage and streaming. The library is designed with a strong focus on real-world robotics, supporting accessible hardware platforms while remaining extensible to new embodiments. It also supports efficient implementations for various state-of-the-art robot learning algorithms from multiple prominent paradigms, as well as a generalized asynchronous inference stack. Unlike traditional pipelines which heavily rely on hand-crafted techniques, lerobot emphasizes scalable learning approaches that improve directly with more data and compute. Designed for accessibility, scalability, and openness, lerobot lowers the barrier to entry for researchers and practitioners to robotics while providing a platform for reproducible, state-of-the-art robot learning.

Chinese Translation

机器人技术正经历着由基于机器学习的高级控制技术进步所驱动的重大变革，催生了机器人学习这一领域。最近，机器人学习的进展得益于经济实惠的远程操作系统、大规模公开可用数据集以及可扩展的基于学习的方法的日益普及。然而，机器人学习领域的发展常常受到碎片化的、封闭源代码工具的限制，这些工具仅旨在解决机器人技术栈中的特定子组件。在本文中，我们介绍了 lerobot，一个整合了整个机器人学习栈的开源库，从低级中间件通信用于电机控制，到大规模数据集的收集、存储和流式传输。该库专注于现实世界的机器人技术，支持可访问的硬件平台，同时保持对新实现的可扩展性。它还支持来自多个显著范式的各种最先进的机器人学习算法的高效实现，以及一个通用的异步推理栈。与严重依赖手工设计技术的传统流水线不同，lerobot 强调可扩展的学习方法，这些方法可以随着更多数据和计算能力的增加而直接改进。为了实现可访问性、可扩展性和开放性，lerobot 降低了研究人员和从业者进入机器人领域的门槛，同时提供了一个可重复的、最先进的机器人学习平台。


cs.RO / 17 / 2602.22854

Performance and Experimental Analysis of Strain-based Models for Continuum Robots

基于应变模型的连续体机器人性能与实验分析

Delucchi, Annika, Di Paola, Vincenzo, Müller, Andreas, Zoppi, Matteo

Abstract

Although strain-based models have been widely adopted in robotics, no comparison beyond the uniform bending test is commonly recognized to assess their performance. In addition, the increasing effort in prototyping continuum robots highlights the need to assess the applicability of these models and the necessity of comprehensive performance evaluation. To address this gap, this work investigates the shape reconstruction abilities of a third-order strain interpolation method, examining its ability to capture both individual and combined deformation effects. These results are compared and discussed against the Geometric-Variable Strain approach. Subsequently, simulation results are experimentally verified by reshaping a slender rod while recording the resulting configurations using cameras. The rod configuration is imposed using a manipulator displacing one of its tips and extracted through reflective markers, without the aid of any other external sensor (i.e., strain gauges or wrench sensors placed along the rod). The experiments demonstrate good agreement between the model predictions and observed shapes, with an average error of 0.58% of the rod length and an average computational time of 0.32 s per configuration, outperforming existing models.

Chinese Translation

尽管基于应变的模型在机器人领域得到了广泛应用，但目前普遍缺乏超出均匀弯曲测试的比较方法来评估其性能。此外，随着对连续体机器人原型制作的不断努力，评估这些模型的适用性和进行全面性能评估的必要性愈发突出。为了解决这一空白，本文研究了一种三阶应变插值方法的形状重建能力，考察其捕捉单独和组合变形效应的能力。这些结果与几何变量应变（Geometric-Variable Strain）方法进行了比较和讨论。随后，通过重塑一根细长杆并使用摄像机记录其结果配置，实验验证了模拟结果。杆的配置是通过操纵器移动其一端施加的，并通过反射标记提取，未使用任何其他外部传感器——即杆上未放置应变计或扭矩传感器。实验结果表明，模型预测与观察到的形状之间具有良好的一致性，平均误差为杆长的0.58%，每个配置的平均计算时间为0.32秒，优于现有模型。


cs.RO / 18 / 2602.22862

GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion

GraspLDP：通过潜在扩散实现可泛化的抓取策略

Xiang, Enda, Ma, Haoxiang, Ma, Xinzhu, Liu, Zicheng, Huang, Di

Abstract

This paper focuses on enhancing the grasping precision and generalization of manipulation policies learned via imitation learning. Diffusion-based policy learning methods have recently become the mainstream approach for robotic manipulation tasks. As grasping is a critical subtask in manipulation, the ability of imitation-learned policies to execute precise and generalizable grasps merits particular attention. Existing imitation learning techniques for grasping often suffer from imprecise grasp executions, limited spatial generalization, and poor object generalization. To address these challenges, we incorporate grasp prior knowledge into the diffusion policy framework. In particular, we employ a latent diffusion policy to guide action chunk decoding with a grasp pose prior, ensuring that generated motion trajectories adhere closely to feasible grasp configurations. Furthermore, we introduce a self-supervised reconstruction objective during diffusion to embed the graspness prior: at each reverse diffusion step, we reconstruct, from the intermediate representations, wrist-camera images with the graspness back-projected onto them. Both simulation and real robot experiments demonstrate that our approach significantly outperforms baseline methods and exhibits strong dynamic grasping capabilities.

Chinese Translation

本文旨在提高通过模仿学习获得的操作策略的抓取精度和泛化能力。基于扩散的策略学习方法最近已成为机器人操作任务的主流方法。由于抓取是操作中的一个关键子任务，因此模仿学习策略在执行精确且可泛化的抓取方面的能力值得特别关注。现有的抓取模仿学习技术往往面临抓取执行不精确、空间泛化能力有限以及物体泛化能力差等问题。为了解决这些挑战，我们将抓取先验知识融入扩散策略框架中。具体而言，我们采用潜在扩散策略来指导带有抓取姿态先验的动作块解码，确保生成的运动轨迹紧密遵循可行的抓取配置。此外，我们在扩散过程中引入自监督重建目标，以嵌入抓取性先验：在每个反向扩散步骤中，我们重建从中间表示反投影的抓取性腕部相机图像。模拟和真实机器人实验均表明，我们的方法显著优于基线方法，并展现出强大的动态抓取能力。


cs.RO / 19 / 2602.22896

DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation

DySL-VLA：通过动态-静态层跳跃实现高效的视觉-语言-动作模型推理用于机器人操作

Yang, Zebin, Qi, Yijiahao, Xie, Tong, Yu, Bo, Liu, Shaoshan, Li, Meng

Abstract

Vision-Language-Action (VLA) models have shown remarkable success in robotic tasks like manipulation by fusing a language model's reasoning with a vision model's 3D understanding. However, their high computational cost remains a major obstacle for real-world applications that require real-time performance. We observe that the actions within a task have varying levels of importance: critical steps demand high precision, while less important ones can tolerate more variance. Leveraging this insight, we propose DySL-VLA, a novel framework that addresses computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are consistently executed, and incremental layers, which can be selectively skipped. To intelligently skip layers without sacrificing accuracy, we invent a prior-post skipping guidance mechanism to determine when to initiate layer-skipping. We also propose a skip-aware two-stage knowledge distillation algorithm to efficiently train a standard VLA into a DySL-VLA. Our experiments indicate that DySL-VLA achieves 2.1% improvement in success length over Deer-VLA on the Calvin dataset, while simultaneously reducing trainable parameters by a factor of 85.7 and providing a 3.75x speedup relative to the RoboFlamingo baseline at iso-accuracy. Our code is available on https://github.com/PKU-SEC-Lab/DYSL_VLA.
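The split between always-executed informative layers and skippable incremental layers can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function name `run_with_layer_skipping`, the scalar `action_importance` score, and the fixed `skip_threshold` are hypothetical, and the plain threshold merely stands in for the prior-post skipping guidance mechanism, which the abstract does not specify. The toy layers are ordinary functions, not transformer blocks.

```python
from typing import Callable, List, Tuple

Layer = Callable[[float], float]


def run_with_layer_skipping(
    x: float,
    layers: List[Tuple[Layer, bool]],  # (layer_fn, is_informative)
    action_importance: float,          # hypothetical importance score in [0, 1]
    skip_threshold: float = 0.5,       # hypothetical cutoff, not from the paper
) -> Tuple[float, int]:
    """Run every informative layer; run incremental layers only for important actions.

    Returns the output value and the number of layers actually executed.
    """
    skip_incremental = action_importance < skip_threshold
    executed = 0
    for layer_fn, is_informative in layers:
        if is_informative or not skip_incremental:
            x = layer_fn(x)
            executed += 1
    return x, executed


# Toy "layers": plain functions standing in for network blocks.
toy_layers = [
    (lambda v: v + 1.0, True),    # informative: always executed
    (lambda v: v * 2.0, False),   # incremental: may be skipped
    (lambda v: v + 3.0, True),    # informative: always executed
    (lambda v: v * 10.0, False),  # incremental: may be skipped
]

print(run_with_layer_skipping(1.0, toy_layers, action_importance=0.9))
print(run_with_layer_skipping(1.0, toy_layers, action_importance=0.1))
```

In this toy setting a high-importance action runs all four layers, while a low-importance action runs only the two informative ones, which is where the compute saving would come from.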

Chinese Translation

视觉-语言-动作（VLA）模型通过将语言模型的推理与视觉模型的三维理解相结合，在机器人操作等任务中取得了显著成功。然而，其高计算成本仍然是需要实时性能的实际应用中的主要障碍。我们观察到任务中的动作具有不同的重要性：关键步骤需要高精度，而不太重要的步骤可以容忍更多的变化。基于这一洞察，我们提出了DySL-VLA，一个通过动态跳过VLA层来应对计算成本的新框架，该跳过是基于每个动作的重要性。DySL-VLA将其层分为两种类型：信息层，始终执行，以及增量层，可以选择性跳过。为了智能地跳过层而不牺牲准确性，我们发明了一种先前-后续跳过指导机制，以确定何时开始跳过层。我们还提出了一种跳过感知的两阶段知识蒸馏算法，以高效地将标准VLA训练为DySL-VLA。我们的实验表明，DySL-VLA在Calvin数据集上相较于Deer-VLA实现了2.1%的成功长度提升，同时将可训练参数减少了85.7倍，并在相同准确度下相较于RoboFlamingo基线提供了3.75倍的加速。我们的代码可在https://github.com/PKU-SEC-Lab/DYSL_VLA获取。


cs.RO / 20 / 2602.22922

Bayesian Preference Elicitation: Human-In-The-Loop Optimization of An Active Prosthesis

贝叶斯偏好引导：主动假肢的人机协同优化

Taddei, Sophia, Koppen, Wouter, Alfio, Eligia, Nuzzo, Stefano, Flynn, Louis, Diaz, Maria Alejandra, Gonzalez, Sebastian Rojas, Dhaene, Tom, De Pauw, Kevin, Couckuyt, Ivo, Verstraten, Tom

Abstract

Tuning active prostheses for people with amputation is time-consuming and relies on metrics that may not fully reflect user needs. We introduce a human-in-the-loop optimization (HILO) approach that leverages direct user preferences to personalize a standard four-parameter prosthesis controller efficiently. Our method employs preference-based Multiobjective Bayesian Optimization that uses a state-of-the-art acquisition function especially designed for preference learning, and includes two algorithmic variants: a discrete version (EUBO-LineCoSpar), and a continuous version (BPE4Prost). Simulation results on benchmark functions and real-application trials demonstrate efficient convergence, robust preference elicitation, and measurable biomechanical improvements, illustrating the potential of preference-driven tuning for user-centered prosthesis control.

Chinese Translation

为截肢者调试主动假肢是一项耗时的工作，并且依赖于可能无法完全反映用户需求的指标。我们提出了一种人机协同优化（HILO）方法，利用直接用户偏好高效地个性化标准的四参数假肢控制器。我们的方法采用基于偏好的多目标贝叶斯优化，使用一种专为偏好学习设计的先进获取函数，并包括两种算法变体：离散版本（EUBO-LineCoSpar）和连续版本（BPE4Prost）。在基准函数和实际应用试验中的仿真结果表明，该方法实现了高效收敛、稳健的偏好引导和可测量的生物力学改善，展示了以偏好驱动的调试在以用户为中心的假肢控制中的潜力。


cs.RO / 21 / 2602.22940

Considering Perspectives for Automated Driving Ethics: Collective Risk in Vehicular Motion Planning

考虑自动驾驶伦理的视角：车辆运动规划中的集体风险

Tolksdorf, Leon, Tejada, Arturo, Birkner, Christian, van de Wouw, Nathan

Abstract

Recent automated vehicle (AV) motion planning strategies revolve around minimizing risk in road traffic. However, they exclusively consider risk from the AV's perspective and, as such, do not address the ethicality of its decisions for other road users. We argue that this does not reduce the risk of each road user, as risk may be different from the perspective of each road user. Indeed, minimizing the risk from the AV's perspective may not imply that the risk from the perspective of other road users is also being minimized; in fact, it may even increase. To test this hypothesis, we propose an AV motion planning strategy that supports switching risk minimization strategies between all road user perspectives. We find that the risk from the perspective of other road users can generally be considered different from the risk from the AV's perspective. Taking a collective risk perspective, i.e., balancing the risks of all road users, we observe an AV that minimizes overall traffic risk the best, while putting itself at slightly higher risk for the benefit of others, which is consistent with human driving behavior. In addition, adopting a collective risk minimization strategy can also be beneficial to the AV's travel efficiency by acting assertively when other road users maintain a low risk estimate of the AV. Yet, the AV drives conservatively when its planned actions are less predictable to other road users, i.e., associated with high risk. We argue that such behavior is a form of self-reflection and a natural prerequisite for socially acceptable AV behavior. We conclude that to facilitate ethicality in road traffic that includes AVs, the risk-perspective of each road user must be considered in the decision-making of AVs.
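The idea of switching the risk perspective used for trajectory selection can be sketched as follows. This is only an illustration under stated assumptions: the function `select_trajectory`, the per-user risk dictionaries, and the equally weighted mean used for the "collective" perspective are all hypothetical, since the abstract does not state how the risks of different road users are balanced. The risk values are made up.

```python
from typing import Dict, List


def select_trajectory(candidates: List[Dict[str, float]], perspective: str) -> int:
    """Return the index of the candidate with the lowest risk from one perspective.

    `perspective` is either a road-user key (e.g. "av") or "collective",
    which here means the equally weighted mean over all road users
    (an assumed balancing rule, not the paper's).
    """
    def cost(risks: Dict[str, float]) -> float:
        if perspective == "collective":
            return sum(risks.values()) / len(risks)
        return risks[perspective]

    return min(range(len(candidates)), key=lambda i: cost(candidates[i]))


# Made-up risk estimates per road user for three candidate trajectories.
candidates = [
    {"av": 0.10, "pedestrian": 0.60, "cyclist": 0.50},
    {"av": 0.20, "pedestrian": 0.20, "cyclist": 0.25},
    {"av": 0.40, "pedestrian": 0.15, "cyclist": 0.30},
]

for view in ("av", "pedestrian", "collective"):
    print(view, select_trajectory(candidates, view))
```

With these numbers the AV-only perspective picks the first candidate, whereas the collective perspective picks the second, accepting slightly higher AV risk in exchange for lower risk to the others.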

Chinese Translation

近期的自动驾驶车辆（AV）运动规划策略主要围绕着最小化道路交通中的风险。然而，这些策略仅从自动驾驶车辆的角度考虑风险，因此未能解决其对其他道路使用者决策的伦理性。我们认为，这并不能减少每个道路使用者的风险，因为从每个道路使用者的角度来看，风险可能是不同的。实际上，从自动驾驶车辆的角度最小化风险并不意味着从其他道路使用者的角度也在最小化风险；事实上，这可能甚至会增加风险。为了验证这一假设，我们提出了一种支持在所有道路使用者视角之间切换风险最小化策略的自动驾驶车辆运动规划策略。我们发现，从其他道路使用者的角度来看，风险通常可以被视为与自动驾驶车辆的视角不同。从集体风险的角度来看，即平衡所有道路使用者的风险，我们观察到一种能够最佳地最小化整体交通风险的自动驾驶车辆，同时为他人承担略高的风险，这与人类驾驶行为是一致的。此外，采用集体风险最小化策略也可以通过在其他道路使用者对自动驾驶车辆保持低风险估计时采取果断行动，从而提高自动驾驶车辆的行驶效率。然而，当自动驾驶车辆的计划行为对其他道路使用者的可预测性较低时，即与高风险相关时，自动驾驶车辆会采取保守的驾驶方式。我们认为，这种行为是一种自我反思的表现，也是社会可接受的自动驾驶车辆行为的自然前提。我们总结认为，为了促进包括自动驾驶车辆在内的道路交通伦理性，必须在自动驾驶车辆的决策中考虑每个道路使用者的风险视角。


cs.RO / 22 / 2602.22952

Automated Robotic Needle Puncture for Percutaneous Dilatational Tracheostomy

用于经皮扩张气管切开术的自动化机器人针刺

Tang, Yuan, Adorno, Bruno V., McGrath, Brendan A., Weightman, Andrew

Abstract

Percutaneous dilatational tracheostomy (PDT) is frequently performed on patients in intensive care units for prolonged mechanical ventilation. The needle puncture, as the most critical step of PDT, could lead to adverse consequences such as major bleeding and posterior tracheal wall perforation if performed inaccurately. Current practices of PDT puncture are all performed manually with no navigation assistance, which leads to large position and angular errors (5 mm and 30 degrees). To improve the accuracy and reduce the difficulty of the PDT procedure, we propose a system that automates the needle insertion using a velocity-controlled robotic manipulator. Guided by pose data from two electromagnetic sensors, one at the needle tip and the other inside the trachea, the robotic system uses an adaptive constrained controller to adapt the uncertain kinematic parameters online and avoid collisions with the patient's body and tissues near the target. Simulations were performed to validate the controller's implementation, and then four hundred PDT punctures were performed on a mannequin to evaluate the position and angular accuracy. The absolute median puncture position error was 1.7 mm (IQR: 1.9 mm) and midline deviation was 4.13 degrees (IQR: 4.55 degrees), measured by the sensor inside the trachea. The small deviations from the nominal puncture in a simulated experimental setup and formal guarantees of collision-free insertions suggest the feasibility of the robotic PDT puncture.
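For readers less familiar with the reported statistics, the median and interquartile range (IQR) of a set of error measurements can be computed as in the sketch below. The helper `median_and_iqr` and the sample values are made up for illustration and are not the study's data; the quartile convention is Python's default ("exclusive") method, which may differ from the one the authors used.

```python
import statistics
from typing import List, Tuple


def median_and_iqr(errors: List[float]) -> Tuple[float, float]:
    """Return (median, IQR), where IQR is the third quartile minus the first."""
    q1, _, q3 = statistics.quantiles(errors, n=4)  # default method: "exclusive"
    return statistics.median(errors), q3 - q1


# Made-up position errors in millimetres, for illustration only.
errors_mm = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(median_and_iqr(errors_mm))
```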

Chinese Translation

经皮扩张气管切开术（PDT）常在重症监护病房对需要长期机械通气的患者进行。针刺作为PDT中最关键的步骤，如果操作不当，可能导致重大出血和气管后壁穿孔等不良后果。目前的PDT针刺操作均为手动进行，且没有导航辅助，这导致位置和角度的误差较大（分别为5毫米和30度）。为了提高PDT过程的准确性并降低操作难度，我们提出了一种使用速度控制的机器人操纵器自动化针刺的系统。该机器人系统利用来自两个电磁传感器的位姿数据进行引导，一个传感器位于针尖，另一个位于气管内，采用自适应约束控制器在线调整不确定的运动学参数，并避免与患者身体及目标附近组织的碰撞。我们进行了仿真以验证控制器的实现，随后在一个模型上进行了四百次PDT针刺，以评估位置和角度的准确性。通过气管内传感器测得的绝对中位针刺位置误差为1.7毫米（四分位距：1.9毫米），中线偏差为4.13度（四分位距：4.55度）。在模拟实验设置中，针刺的微小偏差和对无碰撞插入的正式保证表明了机器人PDT针刺的可行性。


cs.RO / 23 / 2602.22998

A Perspective on Open Challenges in Deformable Object Manipulation

关于可变形物体操控中的开放挑战的视角

McKennaa, Ryan Paul, Oyekan, John

Abstract

Deformable object manipulation (DOM) represents a critical challenge in robotics, with applications spanning healthcare, manufacturing, food processing, and beyond. Unlike rigid objects, deformable objects exhibit infinite dimensionality, dynamic shape changes, and complex interactions with their environment, posing significant hurdles for perception, modeling, and control. This paper reviews the state of the art in DOM, focusing on key challenges such as occlusion handling, task generalization, and scalable, real-time solutions. It highlights advancements in multimodal perception systems, including the integration of multi-camera setups, active vision, and tactile sensing, which collectively address occlusion and improve adaptability in unstructured environments. Cutting-edge developments in physically informed reinforcement learning (RL) and differentiable simulations are explored, showcasing their impact on efficiency, precision, and scalability. The review also emphasizes the potential of simulated expert demonstrations and generative neural networks to standardize task specifications and bridge the simulation-to-reality gap. Finally, future directions are proposed, including the adoption of graph neural networks for high-level decision-making and the creation of comprehensive datasets to enhance DOM's real-world applicability. By addressing these challenges, DOM research can pave the way for versatile robotic systems capable of handling diverse and dynamic tasks with deformable objects.

Chinese Translation

可变形物体操控（DOM）在机器人技术中代表了一个关键挑战，其应用范围涵盖医疗、制造、食品加工等多个领域。与刚性物体不同，可变形物体具有无限维度、动态形状变化及与环境的复杂交互，这对感知、建模和控制提出了重大挑战。本文回顾了DOM的最新进展，重点关注遮挡处理、任务泛化以及可扩展的实时解决方案等关键挑战。文章强调了多模态感知系统的进展，包括多摄像头设置、主动视觉和触觉传感的整合，这些技术共同应对了遮挡问题并提高了在非结构化环境中的适应性。探讨了物理信息强化学习（RL）和可微仿真等前沿发展，展示了它们在效率、精度和可扩展性方面的影响。该综述还强调了模拟专家演示和生成神经网络在标准化任务规范和弥合仿真与现实之间差距的潜力。最后，提出了未来的研究方向，包括采用图神经网络进行高层决策以及创建全面的数据集以增强DOM在现实世界中的适用性。通过解决这些挑战，DOM研究能够为能够处理多样化和动态任务的灵活机器人系统铺平道路。


cs.RO / 24 / 2602.23017

DigiArm: An Anthropomorphic 3D-Printed Prosthetic Hand with Enhanced Dexterity for Typing Tasks

DigiArm：一种具有增强灵巧性的类人3D打印假手，用于打字任务

Zadok, Dean, Naamani, Tom, Bar-Ratson, Yuval, Barash, Elisha, Salzman, Oren, Wolf, Alon, Bronstein, Alex M., Krausz, Nili

Abstract

Despite recent advancements, existing prosthetic limbs are unable to replicate the dexterity and intuitive control of the human hand. Current control systems for prosthetic hands are often limited to grasping, and commercial prosthetic hands lack the precision needed for dexterous manipulation or applications that require fine finger motions. Thus, there is a critical need for accessible and replicable prosthetic designs that enable individuals to interact with electronic devices and perform precise finger pressing, such as keyboard typing or piano playing, while preserving current prosthetic capabilities. This paper presents a low-cost, lightweight, 3D-printed robotic prosthetic hand, specifically engineered for enhanced dexterity with electronic devices such as a computer keyboard or piano, as well as general object manipulation. The robotic hand features a mechanism to adjust finger abduction/adduction spacing, a 2-D wrist with the inclusion of controlled ulnar/radial deviation optimized for typing, and control of independent finger pressing. We conducted a study to demonstrate how participants can use the robotic hand to perform keyboard typing and piano playing in real time, with different levels of finger and wrist motion. This supports the notion that our proposed design can allow for the execution of key typing motions more effectively than before, aiming to enhance the functionality of prosthetic hands.

Chinese Translation

尽管近年来取得了进展，但现有的假肢仍无法复制人手的灵巧性和直观控制。当前的假手控制系统通常仅限于抓握，而商业假手缺乏进行灵巧操作或需要精细手指动作的应用所需的精确度。因此，迫切需要可获取且可复制的假肢设计，使个体能够与电子设备进行交互，并执行精确的手指按压，例如键盘打字或钢琴演奏，同时保留现有假肢的功能。本文提出了一种低成本、轻量化的3D打印机器人假手，专门设计用于增强与电子设备（如计算机键盘或钢琴）的灵巧性，以及一般物体的操作。该机器人手具有调节手指外展/内收间距的机制，配备了2D腕关节，并包含针对打字优化的受控尺偏/桡偏，以及独立手指按压的控制。我们进行了一项研究，展示参与者如何使用该机器人手实时进行键盘打字和钢琴演奏，具有不同程度的手指和腕部运动。这支持了我们提出的设计能够更有效地执行按键打字动作的观点，旨在增强假手的功能性。


cs.RO / 25 / 2602.23024

InCoM: Intent-Driven Perception and Structured Coordination for Whole-Body Mobile Manipulation

InCoM：基于意图驱动的感知与结构化协调用于全身移动操控

Liu, Jiahao, Wenbo, Cui, Li, Haoran, Zhao, Dongbin

Abstract

Whole-body mobile manipulation is a fundamental capability for general-purpose robotic agents, requiring both coordinated control of the mobile base and manipulator and robust perception under dynamically changing viewpoints. However, existing approaches face two key challenges: strong coupling between base and arm actions complicates whole-body control optimization, and perceptual attention is often poorly allocated as viewpoints shift during mobile manipulation. We propose InCoM, an intent-driven perception and structured coordination framework for whole-body mobile manipulation. InCoM infers latent motion intent to dynamically reweight multi-scale perceptual features, enabling stage-adaptive allocation of perceptual attention. To support robust cross-modal perception, InCoM further incorporates a geometric-semantic structured alignment mechanism that enhances multimodal correspondence. On the control side, we design a decoupled coordinated flow matching action decoder that explicitly models coordinated base-arm action generation, alleviating optimization difficulties caused by control coupling. Without access to privileged perceptual information, InCoM outperforms state-of-the-art methods on three ManiSkill-HAB scenarios by 28.2%, 26.1%, and 23.6% in success rate, demonstrating strong effectiveness for whole-body mobile manipulation.

Chinese Translation

全身移动操控是通用机器人代理的基本能力，要求对移动底盘和操控器进行协调控制，并在动态变化的视角下实现稳健的感知。然而，现有方法面临两个主要挑战：底盘与手臂动作之间的强耦合使得全身控制优化变得复杂，而在移动操控过程中，随着视角的变化，感知注意力往往分配不当。我们提出了InCoM，一种基于意图驱动的感知与结构化协调框架，用于全身移动操控。InCoM推断潜在的运动意图，以动态重新加权多尺度感知特征，从而实现阶段自适应的感知注意力分配。为了支持稳健的跨模态感知，InCoM进一步结合了一种几何-语义结构对齐机制，增强了多模态对应关系。在控制方面，我们设计了一种解耦协调流匹配动作解码器，明确建模协调的底盘-手臂动作生成，缓解了由控制耦合引起的优化困难。在没有访问特权感知信息的情况下，InCoM在三个ManiSkill-HAB场景中的成功率分别提高了28.2%、26.1%和23.6%，超越了最先进的方法，展示了其在全身移动操控中的强大有效性。


cs.RO / 26 / 2602.23051

An Empirical Analysis of Cooperative Perception for Occlusion Risk Mitigation

合作感知在遮挡风险缓解中的实证分析

Wang, Aihong, Xie, Tenghui, Wen, Fuxi, Li, Jun

Abstract

Occlusions present a significant challenge for connected and automated vehicles, as they can obscure critical road users from perception systems. Traditional risk metrics often fail to adequately capture the cumulative nature of these threats over time. In this paper, we propose a novel and universal risk assessment metric, the Risk of Tracking Loss (RTL), which aggregates instantaneous risk intensity throughout occluded periods. This provides a holistic risk profile that encompasses both high-intensity, short-term threats and prolonged exposure. Utilizing diverse and high-fidelity real-world datasets, a large-scale statistical analysis is conducted to characterize occlusion risk and validate the effectiveness of the proposed metric. The metric is applied to evaluate different vehicle-to-everything (V2X) deployment strategies. Our study shows that, while full V2X penetration theoretically eliminates this risk, the reduction is highly nonlinear; a substantial statistical benefit requires a high penetration threshold of 75-90%. To overcome this limitation, we propose a novel asymmetric communication framework that allows even non-connected vehicles to receive warnings. Experimental results demonstrate that this paradigm achieves better risk mitigation performance. We found that our approach at 25% penetration outperforms the traditional symmetric model at 75%, and benefits saturate at only 50% penetration. This work provides a crucial risk assessment metric and a cost-effective, strategic roadmap for accelerating the safety benefits of V2X deployment.
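One way to read "aggregates instantaneous risk intensity throughout occluded periods" is as a time integral of a risk signal restricted to occluded intervals. The sketch below follows that reading; it is an assumption, not the paper's definition of RTL. The function name, the sampled inputs, and the trapezoidal rule are all hypothetical choices made for illustration.

```python
from typing import List


def risk_of_tracking_loss(
    times: List[float],        # sample timestamps in seconds, increasing
    intensities: List[float],  # instantaneous risk intensity at each timestamp
    occluded: List[bool],      # True where the road user is occluded
) -> float:
    """Trapezoidal integral of risk intensity over intervals that are fully occluded."""
    total = 0.0
    for i in range(1, len(times)):
        if occluded[i - 1] and occluded[i]:
            dt = times[i] - times[i - 1]
            total += 0.5 * (intensities[i - 1] + intensities[i]) * dt
    return total


# Made-up signal: occlusion starts at t = 1 s and lasts until t = 3 s.
times = [0.0, 1.0, 2.0, 3.0]
intensities = [0.0, 2.0, 2.0, 0.0]
occluded = [False, True, True, True]
print(risk_of_tracking_loss(times, intensities, occluded))
```

Under this reading, a short high-intensity occlusion and a long low-intensity one can yield the same aggregate, which matches the abstract's point that the metric covers both short-term threats and prolonged exposure.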

Chinese Translation

遮挡对连接和自动驾驶车辆构成了重大挑战，因为它们可能会使关键道路使用者在感知系统中变得不可见。传统的风险度量通常无法充分捕捉这些威胁随时间累积的特性。本文提出了一种新颖且通用的风险评估指标——跟踪丢失风险（Risk of Tracking Loss, RTL），该指标聚合了遮挡期间瞬时风险强度。这提供了一个全面的风险概况，涵盖了高强度、短期威胁和长期暴露。利用多样化且高保真的真实世界数据集，进行了大规模统计分析，以表征遮挡风险并验证所提指标的有效性。该指标被应用于评估不同的车与一切（Vehicle-to-Everything, V2X）部署策略。我们的研究表明，完全的V2X渗透理论上消除了这一风险，但减少的程度是高度非线性的；显著的统计效益需要75-90%的高渗透阈值。为了克服这一限制，我们提出了一种新颖的非对称通信框架，使得即使是非连接的车辆也能接收警告。实验结果表明，这一范式在风险缓解性能上表现更佳。我们发现，在25%的渗透率下，我们的方法优于75%的传统对称模型，而收益在仅50%的渗透率时就趋于饱和。这项工作提供了一个关键的风险评估指标和一个具有成本效益的战略路线图，以加速V2X部署的安全效益。


cs.RO / 27 / 2602.23053

Marinarium: a New Arena to Bring Maritime Robotics Closer to Shore

海洋实验室：将海洋机器人技术更紧密地带向岸边的新平台

Torroba, Ignacio, Dorner, David, Fernandez-Ayala, Victor Nan, Kartasev, Mart, Verhagen, Joris, Krantz, Elias, Marchesini, Gregorio, Ljung, Carl, Roque, Pedro, Sidrane, Chelsea, Van der Spaa, Linda, De Carli, Nicola, Ogren, Petter, Fuglesang, Christer, Tumova, Jana, Dimarogonas, Dimos V., Stenius, Ivan

Abstract

This paper presents the Marinarium, a modular and stand-alone underwater research facility designed to provide a realistic testbed for maritime and space-analog robotic experimentation in a resource-efficient manner. The Marinarium combines a fully instrumented underwater and aerial operational volume, extendable via a retractable roof for real-weather conditions, a digital twin in the SMaRCSim simulator and tight integration with a space robotics laboratory. All of these result from design choices aimed at bridging simulation, laboratory validation, and field conditions. We compare the Marinarium to similar existing infrastructures and illustrate how its design enables a set of experiments in four open research areas within field robotics. First, we exploit high-fidelity dynamics data from the tank to demonstrate the potential of learning-based system identification approaches applied to underwater vehicles. We further highlight the versatility of the multi-domain operating volume via a rendezvous mission with a heterogeneous fleet of robots across underwater, surface, and air. We then illustrate how the presented digital twin can be utilized to reduce the reality gap in underwater simulation. Finally, we demonstrate the potential of underwater surrogates for spacecraft navigation validation by executing spatiotemporally identical inspection tasks on a planar space-robot emulator and a neutrally buoyant remotely operated vehicle (ROV). In this work, by sharing the insights obtained and rationale behind the design and construction of the Marinarium, we hope to provide the field robotics research community with a blueprint for bridging the gap between controlled and real offshore and space robotics experimentation.

Chinese Translation

本文介绍了海洋实验室（Marinarium），这是一种模块化的独立水下研究设施，旨在以资源高效的方式提供一个真实的海洋和空间类机器人实验测试平台。海洋实验室结合了一个完全仪器化的水下和空中操作空间，通过可伸缩屋顶扩展以适应真实天气条件，并在SMaRCSim模拟器中建立数字双胞胎，与空间机器人实验室紧密集成。这些设计选择旨在弥合模拟、实验室验证和现场条件之间的差距。我们将海洋实验室与现有类似基础设施进行比较，并展示其设计如何支持在四个开放研究领域内进行一系列实验。首先，我们利用水池中的高保真动态数据，展示应用于水下车辆的基于学习的系统识别方法的潜力。接着，我们通过与异构机器人舰队在水下、表面和空中进行的会合任务，进一步强调多领域操作空间的多样性。然后，我们展示了如何利用所呈现的数字双胞胎来减少水下模拟中的现实差距。最后，我们通过在平面空间机器人仿真器和中性浮力的水下遥控机器人（ROV）上执行时空上相同的检查任务，展示水下替代物在航天器导航验证中的潜力。在本研究中，通过分享获得的见解以及海洋实验室设计与建设背后的理由，我们希望为现场机器人研究社区提供一份弥合受控与真实海洋及空间机器人实验之间差距的蓝图。


cs.RO / 28 / 2602.23109

Towards Intelligible Human-Robot Interaction: An Active Inference Approach to Occluded Pedestrian Scenarios

朝向可理解的人机交互：一种针对遮挡行人场景的主动推理方法

Chen, Kai, Huang, Yuyao, Chen, Guang

Abstract

The sudden appearance of occluded pedestrians presents a critical safety challenge in autonomous driving. Conventional rule-based or purely data-driven approaches struggle with the inherent high uncertainty of these long-tail scenarios. To tackle this challenge, we propose a novel framework grounded in Active Inference, which endows the agent with a human-like, belief-driven mechanism. Our framework leverages a Rao-Blackwellized Particle Filter (RBPF) to efficiently estimate the pedestrian's hybrid state. To emulate human-like cognitive processes under uncertainty, we introduce a Conditional Belief Reset mechanism and a Hypothesis Injection technique to explicitly model beliefs about the pedestrian's multiple latent intentions. Planning is achieved via a Cross-Entropy Method (CEM) enhanced Model Predictive Path Integral (MPPI) controller, which synergizes the efficient, iterative search of CEM with the inherent robustness of MPPI. Simulation experiments demonstrate that our approach significantly reduces the collision rate compared to reactive, rule-based, and reinforcement learning (RL) baselines, while also exhibiting explainable and human-like driving behavior that reflects the agent's internal belief state.
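The planner in this abstract builds on Model Predictive Path Integral (MPPI) control, whose core step re-weights sampled control perturbations by their rollout cost. The sketch below shows only that standard weighting step, as an illustration; it does not reproduce the paper's CEM enhancement, belief model, or cost function. The function name `mppi_update`, the `temperature` parameter, and the toy inputs are assumptions.

```python
import math
from typing import List


def mppi_update(
    nominal: List[float],              # nominal control sequence, one value per step
    perturbations: List[List[float]],  # sampled perturbations, same horizon as nominal
    costs: List[float],                # rollout cost of each perturbed sequence
    temperature: float = 1.0,          # larger values give flatter weights
) -> List[float]:
    """One MPPI step: add the cost-weighted average perturbation to the nominal controls."""
    min_cost = min(costs)  # shift costs for numerical stability
    weights = [math.exp(-(c - min_cost) / temperature) for c in costs]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, perturbations))
        for t, u in enumerate(nominal)
    ]


# Two opposite perturbations: equal costs cancel out, unequal costs favor the cheaper one.
samples = [[1.0, 1.0], [-1.0, -1.0]]
print(mppi_update([0.0, 0.0], samples, costs=[0.0, 0.0]))
print(mppi_update([0.0, 0.0], samples, costs=[0.0, 1000.0]))
```

A CEM-style variant would additionally refit the sampling distribution to the lowest-cost samples between iterations; how the paper combines the two is not described in the abstract.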

Chinese Translation

遮挡行人的突然出现为自动驾驶带来了重大安全挑战。传统的基于规则或纯数据驱动的方法在这些长尾场景的固有高不确定性面前显得力不从心。为了解决这一挑战，我们提出了一种基于主动推理（Active Inference）的新框架，使得智能体具备类似人类的信念驱动机制。我们的框架利用了拉奥-布莱克韦尔粒子滤波器（Rao-Blackwellized Particle Filter, RBPF）来高效估计行人的混合状态。为了模拟在不确定性下的人类认知过程，我们引入了条件信念重置机制和假设注入技术，以明确建模关于行人多重潜在意图的信念。规划通过增强型模型预测路径积分（Model Predictive Path Integral, MPPI）控制器实现，该控制器将交叉熵方法（Cross-Entropy Method, CEM）的高效迭代搜索与MPPI的固有鲁棒性相结合。仿真实验表明，与反应式、基于规则和强化学习（Reinforcement Learning, RL）基线相比，我们的方法显著降低了碰撞率，同时展现出可解释的、类似人类的驾驶行为，反映了智能体的内部信念状态。


cs.RO / 29 / 2602.23206

Grasp, Slide, Roll: Comparative Analysis of Contact Modes for Tactile-Based Shape Reconstruction

抓取、滑动、滚动：基于触觉的形状重建接触模式的比较分析

Kim, Chung Hee, Kamtikar, Shivani, Brady, Tye, Padir, Taskin, Migdal, Joshua

Abstract

Tactile sensing allows robots to gather detailed geometric information about objects through physical interaction, complementing vision-based approaches. However, efficiently acquiring useful tactile data remains challenging due to the time-consuming nature of physical contact and the need to strategically choose contact locations that maximize information gain while minimizing physical interactions. This paper studies how different contact modes affect object shape reconstruction using a tactile-enabled dexterous gripper. We compare three contact interaction modes: grasp-releasing, sliding induced by finger-grazing, and palm-rolling. These contact modes are combined with an information-theoretic exploration framework that guides subsequent sampling locations using a shape completion model. Our results show that the improved tactile sensing efficiency of finger-grazing and palm-rolling translates into faster convergence in shape reconstruction, requiring 34% fewer physical interactions while improving reconstruction accuracy by 55%. We validate our approach using a UR5e robot arm equipped with an Inspire-Robots Dexterous Hand, showing robust performance across primitive object geometries.

Chinese Translation

触觉传感使机器人能够通过物理交互获取关于物体的详细几何信息，补充了基于视觉的方法。然而，由于物理接触的耗时特性以及需要战略性选择接触位置以最大化信息增益并最小化物理交互，效率地获取有用的触觉数据仍然具有挑战性。本文研究了不同接触模式如何影响使用触觉驱动灵巧夹具的物体形状重建。我们比较了三种接触交互模式：抓取释放、指尖滑动引起的滑动和手掌滚动。这些接触模式与信息论探索框架相结合，利用形状补全模型指导后续的采样位置。我们的结果表明，指尖滑动和手掌滚动的触觉传感效率提高转化为形状重建的更快收敛，所需的物理交互减少了34%，同时重建精度提高了55%。我们使用配备Inspire-Robots灵巧手的UR5e机器人臂验证了我们的方法，显示出在原始物体几何形状上的稳健性能。


cs.RO / 30 / 2602.23253

SPARR: Simulation-based Policies with Asymmetric Real-world Residuals for Assembly

SPARR：基于仿真的具有不对称真实世界残差的装配策略

Guo, Yijie, Akinola, Iretiayo, Johannsmeier, Lars, Hadfield, Hugo, Gupta, Abhishek, Narang, Yashraj

Abstract

Robotic assembly presents a long-standing challenge due to its requirement for precise, contact-rich manipulation. While simulation-based learning has enabled the development of robust assembly policies, their performance often degrades when deployed in real-world settings due to the sim-to-real gap. Conversely, real-world reinforcement learning (RL) methods avoid the sim-to-real gap, but rely heavily on human supervision and lack generalization ability to environmental changes. In this work, we propose a hybrid approach that combines a simulation-trained base policy with a real-world residual policy to efficiently adapt to real-world variations. The base policy, trained in simulation using low-level state observations and dense rewards, provides strong priors for initial behavior. The residual policy, learned in the real world using visual observations and sparse rewards, compensates for discrepancies in dynamics and sensor noise. Extensive real-world experiments demonstrate that our method, SPARR, achieves near-perfect success rates across diverse two-part assembly tasks. Compared to the state-of-the-art zero-shot sim-to-real methods, SPARR improves success rates by 38.4% while reducing cycle time by 29.7%. Moreover, SPARR requires no human expertise, in contrast to the state-of-the-art real-world RL approaches that depend heavily on human supervision.
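The combination of a simulation-trained base policy with a real-world residual policy can be sketched as a simple sum of two action vectors. The sketch below is an illustration under assumptions: the function `combined_action`, the `residual_scale` factor, and the symmetric clipping `limit` are hypothetical, and the stub policies are placeholders rather than trained networks. It does mirror the asymmetry described in the abstract, with the base policy reading low-level state and the residual policy reading visual observations.

```python
from typing import Callable, List, Sequence

Policy = Callable[[Sequence[float]], List[float]]


def combined_action(
    base_policy: Policy,        # trained in simulation, consumes low-level state
    residual_policy: Policy,    # trained in the real world, consumes image features
    state: Sequence[float],
    image_features: Sequence[float],
    residual_scale: float = 0.1,  # hypothetical: keeps corrections small
    limit: float = 1.0,           # hypothetical symmetric action bound
) -> List[float]:
    """Add a scaled residual correction to the base action, then clip to the bounds."""
    base = base_policy(state)
    residual = residual_policy(image_features)
    return [
        max(-limit, min(limit, b + residual_scale * r))
        for b, r in zip(base, residual)
    ]


# Stub policies standing in for trained networks.
def stub_base(state: Sequence[float]) -> List[float]:
    return [0.1, 0.2]


def stub_residual(image_features: Sequence[float]) -> List[float]:
    return [1.0, -1.0]


print(combined_action(stub_base, stub_residual, state=[0.0], image_features=[0.0]))
```

Keeping the residual small relative to the base action is a common design choice in residual learning, since it lets the base policy's prior dominate early in real-world training; whether SPARR scales or clips the residual this way is not stated in the abstract.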

Chinese Translation

机器人装配因其对精确、接触丰富的操作的要求而面临长期挑战。虽然基于仿真的学习使得开发稳健的装配策略成为可能，但在实际应用中，由于仿真与现实之间的差距，其性能往往会下降。相反，现实世界的强化学习（RL）方法避免了仿真与现实之间的差距，但在很大程度上依赖于人工监督，并且缺乏对环境变化的泛化能力。在本研究中，我们提出了一种混合方法，将基于仿真训练的基础策略与真实世界残差策略相结合，以有效适应现实世界的变化。基础策略在仿真中使用低级状态观察和密集奖励进行训练，为初始行为提供了强有力的先验。残差策略在真实世界中使用视觉观察和稀疏奖励进行学习，以补偿动态和传感器噪声之间的差异。大量的真实世界实验表明，我们的方法SPARR在各种双部分装配任务中实现了近乎完美的成功率。与最先进的零样本仿真到现实方法相比，SPARR的成功率提高了38.4%，同时循环时间减少了29.7%。此外，SPARR不需要人工专业知识，而与依赖于人工监督的最先进的现实世界RL方法形成对比。

cs.RO / 31 / 2602.23283

Simple Models, Real Swimming: Digital Twins for Tendon-Driven Underwater Robots

简单模型，真实游动：用于腱驱动水下机器人的数字双胞胎

Michelis, Mike Y., Obayashi, Nana, Hughes, Josie, Katzschmann, Robert K.

Abstract

Mimicking the graceful motion of swimming animals remains a core challenge in soft robotics due to the complexity of fluid-structure interaction and the difficulty of controlling soft, biomimetic bodies. Existing modeling approaches are often computationally expensive and impractical for complex control or reinforcement learning needed for realistic motions to emerge in robotic systems. In this work, we present a tendon-driven fish robot modeled in an efficient underwater swimmer environment using a simplified, stateless hydrodynamics formulation implemented in the widespread robotics framework MuJoCo. With just two real-world swimming trajectories, we identify five fluid parameters that allow a matching to experimental behavior and generalize across a range of actuation frequencies. We show that this stateless fluid model can generalize to unseen actuation and outperform classical analytical models such as the elongated body theory. This simulation environment runs faster than real-time and can easily enable downstream learning algorithms such as reinforcement learning for target tracking, reaching a 93% success rate. Due to the simplicity and ease of use of the model and our open-source simulation environment, our results show that even simple, stateless models -- when carefully matched to physical data -- can serve as effective digital twins for soft underwater robots, opening up new directions for scalable learning and control in aquatic environments.

Chinese Translation

模仿游泳动物优雅的运动仍然是软机器人领域的核心挑战，这主要源于流体-结构相互作用的复杂性以及控制柔性仿生体的困难。现有的建模方法通常计算成本高昂，并且对于复杂控制或强化学习而言不够实用，这使得机器人系统中实现真实运动变得困难。在本研究中，我们提出了一种腱驱动的鱼形机器人，该机器人在一个高效的水下游泳环境中建模，采用了在广泛使用的机器人框架MuJoCo中实现的简化无状态流体动力学公式。通过仅使用两个真实世界的游泳轨迹，我们识别出五个流体参数，这些参数能够与实验行为匹配，并在不同的驱动频率范围内进行泛化。我们展示了这一无状态流体模型能够推广到未见的驱动情况，并且在性能上超越了经典的解析模型，如细长体理论。该仿真环境的运行速度快于实时，并且可以轻松支持下游学习算法，如用于目标跟踪的强化学习，成功率达到93%。由于模型的简单性和易用性，以及我们开源的仿真环境，我们的结果表明，即使是简单的无状态模型——在与物理数据仔细匹配的情况下——也可以作为软水下机器人的有效数字双胞胎，为水域环境中的可扩展学习和控制开辟新的方向。

cs.RO / 32 / 2602.23287

Interface-Aware Trajectory Reconstruction of Limited Demonstrations for Robot Learning

面向接口的有限示范轨迹重建用于机器人学习

Barsoum, Demiana R., Javaremi, Mahdieh Nejati, Loke, Larisa Y. C., Argall, Brenna D.

Abstract

Assistive robots offer agency to humans with severe motor impairments. Often, these users control high-DoF robots through low-dimensional interfaces, such as using a 1-D sip-and-puff interface to operate a 6-DoF robotic arm. This mismatch results in having access to only a subset of control dimensions at a given time, imposing unintended and artificial constraints on robot motion. As a result, interface-limited demonstrations embed suboptimal motions that reflect interface restrictions rather than user intent. To address this, we present a trajectory reconstruction algorithm that reasons about task, environment, and interface constraints to lift demonstrations into the robot's full control space. We evaluate our approach using real-world demonstrations of ADL-inspired tasks performed via a 2-D joystick and 1-D sip-and-puff control interface, teleoperating two distinct 7-DoF robotic arms. Analyses of the reconstructed demonstrations and derived control policies show that lifted trajectories are faster and more efficient than their interface-constrained counterparts while respecting user preferences.

Chinese Translation

辅助机器人为严重运动障碍的用户提供了自主性。这些用户通常通过低维接口控制高自由度（DoF）机器人，例如使用一维的吸吮接口来操作六自由度（6-DoF）机器人手臂。这种不匹配导致在给定时间内只能访问控制维度的子集，从而对机器人运动施加了意想不到且人为的限制。因此，受限于接口的示范嵌入了反映接口限制而非用户意图的次优运动。为了解决这个问题，我们提出了一种轨迹重建算法，该算法考虑任务、环境和接口约束，将示范提升到机器人的完整控制空间。我们使用通过二维操纵杆和一维吸吮控制接口进行的日常生活活动（ADL）启发任务的真实示范来评估我们的方法，远程操作两个不同的七自由度（7-DoF）机器人手臂。对重建示范和派生控制策略的分析表明，提升后的轨迹比受接口限制的轨迹更快且更高效，同时尊重用户偏好。

计算机视觉 (Computer Vision)

120

cs.CV / 1 / 2602.22347

Enabling clinical use of foundation models in histopathology

在组织病理学中实现基础模型的临床应用

Henriksen, Audun L., Skrede, Ole-Johan, van der Schee, Lisa, Domingo, Enric, De Raedt, Sepp, Kostolomov, Ilyá, Hay, Jennifer, Cyll, Karolina, Kildal, Wanja, Kalsnes, Joakim, Williams, Robert W., Pradhan, Manohar, Nesheim, John Arne, Askautrud, Hanne A., Isaksen, Maria X., de Gordoa, Karmele Saez, Cuatrecasas, Miriam, Edwards, Joanne, group, TransSCOT, Nesbakken, Arild, Shepherd, Neil A., Tomlinson, Ian, Wagner, Daniel-Christoph, Kerr, Rachel S., Hveem, Tarjei Sveinsgjerd, Liestøl, Knut, Nakamura, Yoshiaki, Novelli, Marco, Miyo, Masaaki, Foersch, Sebastian, Church, David N., Lacle, Miangela M., Kerr, David J., Kleppe, Andreas

Abstract

Foundation models in histopathology are expected to facilitate the development of high-performing and generalisable deep learning systems. However, current models capture not only biologically relevant features, but also pre-analytic and scanner-specific variation that biases the predictions of task-specific models trained from the foundation model features. Here we show that introducing novel robustness losses during training of downstream task-specific models reduces sensitivity to technical variability. A purpose-designed comprehensive experimentation setup with 27,042 WSIs from 6,155 patients is used to train thousands of models from the features of eight popular foundation models for computational pathology. In addition to a substantial improvement in robustness, we observe that prediction accuracy improves by focusing on biologically relevant features. Our approach successfully mitigates robustness issues of foundation models for computational pathology without retraining the foundation models themselves, enabling development of robust computational pathology models applicable to real-world data in routine clinical practice.

Chinese Translation

组织病理学中的基础模型预计将促进高性能和可泛化深度学习系统的发展。然而，当前模型不仅捕捉生物学相关特征，还捕捉预分析和扫描仪特定的变异，这会偏倚从基础模型特征训练的任务特定模型的预测。在这里，我们展示了在下游任务特定模型的训练过程中引入新颖的鲁棒性损失可以降低对技术变异的敏感性。我们使用一个专门设计的综合实验设置，包含来自6155名患者的27,042个全切片图像（WSIs），从八个流行的计算病理学基础模型的特征中训练数千个模型。除了显著提高鲁棒性外，我们还观察到通过关注生物学相关特征，预测准确性得到了改善。我们的方法成功缓解了计算病理学基础模型的鲁棒性问题，而无需重新训练基础模型本身，从而实现了可应用于常规临床实践中真实世界数据的鲁棒计算病理学模型的开发。

cs.CV / 2 / 2602.22361

Optimizing Neural Network Architecture for Medical Image Segmentation Using Monte Carlo Tree Search

利用蒙特卡罗树搜索优化医疗图像分割的神经网络架构

Meng, Liping, Nie, Fan, Zhang, Yunyun, Han, Chao

Abstract

This paper proposes a novel medical image segmentation framework, MNAS-Unet, which combines Monte Carlo Tree Search (MCTS) and Neural Architecture Search (NAS). MNAS-Unet dynamically explores promising network architectures through MCTS, significantly enhancing the efficiency and accuracy of architecture search. It also optimizes the DownSC and UpSC unit structures, enabling fast and precise model adjustments. Experimental results demonstrate that MNAS-Unet outperforms NAS-Unet and other state-of-the-art models in segmentation accuracy on several medical image datasets, including PROMISE12, Ultrasound Nerve, and CHAOS. Furthermore, compared with NAS-Unet, MNAS-Unet reduces the architecture search budget by 54% (early stopping at 139 epochs versus 300 epochs under the same search setting), while achieving a lightweight model with only 0.6M parameters and lower GPU memory consumption, which further improves its practical applicability. These results suggest that MNAS-Unet can improve search efficiency while maintaining competitive segmentation accuracy under practical resource constraints.

Chinese Translation

本文提出了一种新颖的医疗图像分割框架，MNAS-Unet，该框架结合了蒙特卡罗树搜索（Monte Carlo Tree Search, MCTS）和神经架构搜索（Neural Architecture Search, NAS）。MNAS-Unet通过MCTS动态探索有前景的网络架构，显著提高了架构搜索的效率和准确性。它还优化了DownSC和UpSC单元结构，使模型调整快速而精确。实验结果表明，MNAS-Unet在多个医疗图像数据集（包括PROMISE12、超声神经和CHAOS）的分割准确性上优于NAS-Unet及其他最先进的模型。此外，与NAS-Unet相比，MNAS-Unet将架构搜索预算减少了54%（在相同搜索设置下，早停于139个epoch而不是300个epoch），同时实现了仅有0.6M参数的轻量级模型，并降低了GPU内存消耗，进一步提高了其实际应用性。这些结果表明，MNAS-Unet能够在实际资源限制下提高搜索效率，同时保持竞争性的分割准确性。
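
The 54% search-budget figure follows from the two epoch counts quoted in the abstract. A quick check, where `relative_reduction` is a helper defined only for this example:

```python
def relative_reduction(baseline: float, reduced: float) -> float:
    """Fraction by which `reduced` is smaller than `baseline`."""
    return (baseline - reduced) / baseline


# Epoch counts quoted in the abstract: 300 for NAS-Unet, 139 for MNAS-Unet.
saving = relative_reduction(300, 139)
print(f"{saving:.1%}")  # prints 53.7%, which rounds to the reported 54%
```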

cs.CV / 3 / 2602.22376

AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction

AeroDGS：物理一致的动态高斯点云用于单序列空中4D重建

Liu, Hanyang, Qin, Rongjun

Abstract

Recent advances in 4D scene reconstruction have significantly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed. To this end, we present AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos. AeroDGS introduces a Monocular Geometry Lifting module that reconstructs reliable static and dynamic geometry from a single aerial sequence, providing a robust basis for dynamic estimation. To further resolve monocular ambiguity, we propose a Physics-Guided Optimization module that incorporates differentiable ground-support, upright-stability, and trajectory-smoothness priors, transforming ambiguous image cues into physically consistent motion. The framework jointly refines static backgrounds and dynamic entities with stable geometry and coherent temporal evolution. We additionally build a real-world UAV dataset that spans various altitudes and motion conditions to evaluate dynamic aerial reconstruction. Experiments on synthetic and real UAV scenes demonstrate that AeroDGS outperforms state-of-the-art methods, achieving superior reconstruction fidelity in dynamic aerial environments.

Chinese Translation

近年来，4D场景重建的进展显著提升了各个领域的动态建模能力。然而，现有方法在空中条件下的单视角捕捉、广泛的空间范围以及有限空间占用和大运动差异的动态物体方面仍然存在局限性。这些挑战导致严重的深度模糊和不稳定的运动估计，使得单目空中重建本质上是一个不适定的问题。为此，我们提出了AeroDGS，一个基于物理指导的4D高斯点云框架，适用于单目无人机视频。AeroDGS引入了单目几何提升模块，从单个空中序列中重建可靠的静态和动态几何，为动态估计提供了稳健的基础。为了进一步解决单目模糊问题，我们提出了一个物理指导的优化模块，该模块结合了可微分的地面支持、直立稳定性和轨迹平滑性先验，将模糊的图像线索转化为物理一致的运动。该框架共同优化静态背景和动态实体，确保几何稳定性和一致的时间演变。此外，我们还构建了一个涵盖各种高度和运动条件的真实无人机数据集，以评估动态空中重建。在合成和真实无人机场景上的实验表明，AeroDGS优于最先进的方法，在动态空中环境中实现了更高的重建保真度。
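
To make the trajectory-smoothness prior concrete, here is a sketch of one common form: a penalty on discrete second differences (accelerations) of an object's positions over time. The abstract does not give AeroDGS's exact formulation, so this is an assumed stand-in, and `smoothness_penalty` is a hypothetical name.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def smoothness_penalty(trajectory: Sequence[Point]) -> float:
    """Sum of squared second differences (a discrete acceleration penalty)."""
    penalty = 0.0
    for prev, curr, nxt in zip(trajectory, trajectory[1:], trajectory[2:]):
        for p, c, n in zip(prev, curr, nxt):
            accel = n - 2.0 * c + p
            penalty += accel * accel
    return penalty


straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # constant velocity
turning = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]   # abrupt 90-degree turn
print(smoothness_penalty(straight), smoothness_penalty(turning))  # prints 0.0 2.0
```

A constant-velocity track incurs no penalty, while an abrupt turn does, which is the kind of signal a smoothness prior uses to discourage jittery motion estimates.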

cs.CV / 4 / 2602.22381

Enhancing Renal Tumor Malignancy Prediction: Deep Learning with Automatic 3D CT Organ Focused Attention

增强肾肿瘤恶性预测：基于深度学习的自动3D CT器官聚焦注意力

Fan, Zhengkang, Sun, Chengkun, Terry, Russell, Xu, Jie, Latecki, Longin Jan

Abstract

Accurate prediction of malignancy in renal tumors is crucial for informing clinical decisions and optimizing treatment strategies. However, existing imaging modalities lack the necessary accuracy to reliably predict malignancy before surgical intervention. While deep learning has shown promise in malignancy prediction using 3D CT images, traditional approaches often rely on manual segmentation to isolate the tumor region and reduce noise, which enhances predictive performance. Manual segmentation, however, is labor-intensive, costly, and dependent on expert knowledge. In this study, a deep learning framework was developed utilizing an Organ Focused Attention (OFA) loss function to modify the attention of image patches so that organ patches attend only to other organ patches. Hence, no segmentation of 3D renal CT images is required at deployment time for malignancy prediction. The proposed framework achieved an AUC of 0.685 and an F1-score of 0.872 on a private dataset from the UF Integrated Data Repository (IDR), and an AUC of 0.760 and an F1-score of 0.852 on the publicly available KiTS21 dataset. These results surpass the performance of conventional models that rely on segmentation-based cropping for noise reduction, demonstrating the framework's ability to enhance predictive accuracy without explicit segmentation input. The findings suggest that this approach offers a more efficient and reliable method for malignancy prediction, thereby enhancing clinical decision-making in renal cancer diagnosis.

Chinese Translation

肾肿瘤恶性程度的准确预测对临床决策和优化治疗策略至关重要。然而，现有的影像学方法在手术干预前缺乏可靠预测恶性的必要准确性。尽管深度学习在使用3D CT图像进行恶性预测方面显示出潜力，传统方法通常依赖手动分割来隔离肿瘤区域并减少噪声，从而提高预测性能。然而，手动分割劳动强度大、成本高且依赖于专家知识。在本研究中，开发了一种深度学习框架，利用器官聚焦注意力（Organ Focused Attention, OFA）损失函数来修改图像块的注意力，使器官块仅关注其他器官块。因此，在进行恶性预测时，无需对3D肾CT图像进行分割。所提出的框架在来自UF综合数据存储库（Integrated Data Repository, IDR）的私有数据集上达到了0.685的AUC和0.872的F1-score，在公开可用的KiTS21数据集上达到了0.760的AUC和0.852的F1-score。这些结果超越了依赖基于分割裁剪以减少噪声的传统模型的性能，展示了该框架在没有显式分割输入的情况下增强预测准确性的能力。研究结果表明，该方法为恶性预测提供了一种更高效和可靠的方式，从而增强了肾癌诊断中的临床决策能力。
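
The idea that organ patches should attend only to other organ patches can be illustrated with a simple penalty on attention that "leaks" from organ patches to non-organ patches. This is an assumed form for illustration only; the abstract does not give the actual OFA loss, and `organ_attention_penalty` is a hypothetical name. The organ mask is used only to compute the training penalty, which is consistent with the abstract's point that no segmentation is needed at deployment time.

```python
from typing import List, Sequence


def organ_attention_penalty(
    attention: Sequence[Sequence[float]],  # attention[i][j]: weight from patch i to patch j
    organ_mask: Sequence[bool],            # True where a patch overlaps the organ
) -> float:
    """Mean attention mass that organ patches place on non-organ patches."""
    leaked: List[float] = []
    for i, row in enumerate(attention):
        if not organ_mask[i]:
            continue  # only organ-patch queries are constrained
        leaked.append(sum(w for j, w in enumerate(row) if not organ_mask[j]))
    return sum(leaked) / len(leaked) if leaked else 0.0


# Toy 3-patch example: patches 0 and 1 are organ, patch 2 is background.
attention = [
    [0.5, 0.5, 0.0],    # organ patch, no leakage
    [0.25, 0.25, 0.5],  # organ patch, half its attention leaks to background
    [0.2, 0.3, 0.5],    # background patch, ignored by the penalty
]
penalty = organ_attention_penalty(attention, [True, True, False])
```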

cs.CV / 5 / 2602.22394

Vision Transformers Need More Than Registers

视觉变换器需要的不仅仅是寄存器

Shi, Cheng, Yu, Yizhou, Yang, Sibei

Abstract

Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks. Through systematic analysis of artifacts in ViTs, we find that their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, we conclude that these artifacts originate from a lazy aggregation behavior: ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior.

Chinese Translation

视觉变换器（Vision Transformers, ViTs）在大规模数据上进行预训练时，为多样的下游任务提供了通用的表示。然而，在不同的监督范式和下游任务中，ViTs 中普遍观察到伪影。通过对 ViTs 中伪影的系统分析，我们发现其基本机制尚未得到充分阐明。在本文中，通过系统分析，我们得出结论：这些伪影源于一种懒惰的聚合行为：ViT 使用与语义无关的背景块作为快捷方式来表示全局语义，这种行为受到全局注意力和粗粒度语义监督的驱动。我们的解决方案选择性地将块特征整合到 CLS 标记中，减少了背景主导的快捷方式的影响，并在标签、文本和自监督的 12 个基准测试中持续提高了性能。我们希望这项工作为 ViT 行为提供一种新的视角。

cs.CV / 6 / 2602.22419

CLIP Is Shortsighted: Paying Attention Beyond the First Sentence

CLIP的短视性：关注超越第一句

Lavoie, Marc-Antoine, Mahmoud, Anas, Zaimi, Aldo, Tchango, Arsene Fansi, Waslander, Steven L.

Abstract

CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.

Chinese Translation

CLIP模型通过在互联网规模数据上进行图像-文本对比学习，学习可转移的多模态特征。它们广泛应用于零样本分类、多模态检索、文本到图像扩散，以及作为大型视觉-语言模型中的图像编码器。然而，CLIP的预训练主要以配有短标题的图像为主，这使得模型倾向于编码显著物体的简单描述，并导致在复杂场景和密集描述上的粗略对齐。尽管近期的研究通过在小规模长标题数据集上进行微调来缓解这一问题，但我们识别出一个重要的共同偏差：无论是人类生成的还是大型语言模型（LLM）生成的长标题通常以一句话总结开始，随后是详细描述。我们展示了这在训练过程中充当了捷径，集中注意力于开头句子和早期标记，从而削弱了对标题其余部分的对齐。为了解决这个问题，我们引入了DeBias-CLIP，该方法在训练过程中移除了总结句，并应用句子子抽样和文本标记填充，以在所有标记位置分配监督。DeBias-CLIP实现了最先进的长文本检索，改善了短文本检索，并对句子顺序的排列变化不那么敏感。它可以作为Long-CLIP的直接替代，不需要额外的可训练参数。
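
A minimal sketch of the first two caption-side steps described in the abstract: dropping the opening summary sentence and sub-sampling the remaining sentences. The regex sentence splitter, the `keep_ratio` parameter, and the function name are assumptions made for this example; the paper's token padding step is not shown.

```python
import math
import random
import re
from typing import Optional


def drop_summary_and_subsample(
    caption: str,
    keep_ratio: float = 0.5,  # assumed fraction of detail sentences to keep
    rng: Optional[random.Random] = None,
) -> str:
    """Remove the opening summary sentence, then keep a random subset of the rest in order."""
    rng = rng or random.Random()
    # Naive splitter: breaks after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", caption.strip()) if s]
    if len(sentences) <= 1:
        return caption.strip()  # nothing to drop from a one-sentence caption
    details = sentences[1:]
    keep = max(1, math.ceil(len(details) * keep_ratio))
    chosen = sorted(rng.sample(range(len(details)), keep))
    return " ".join(details[i] for i in chosen)


caption = (
    "A dog plays in a park. The dog is brown. "
    "It holds a red ball. Trees line the path."
)
print(drop_summary_and_subsample(caption, keep_ratio=1.0))
# prints: The dog is brown. It holds a red ball. Trees line the path.
```

Because a different subset is drawn on each call, repeated passes over the same caption expose the text encoder to different detail sentences in the leading positions.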

cs.CV / 7 / 2602.22426

SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read

SimpleOCR：渲染可视化问题以教导多模态大型语言模型阅读

Peng, Yibo, Xia, Peng, Zhong, Ding, Zeng, Kaide, Han, Siwei, Zhou, Yiyang, Liu, Jiaqi, Zhang, Ruiyi, Yao, Huaxiu

Abstract

Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely "read" text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated "modality laziness." To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into the VQ format with randomized styles, SimpleOCR effectively invalidates text-based shortcuts, compelling the model to activate and optimize its visual text extraction pathways. Empirically, SimpleOCR yields robust gains without architectural modifications. On four representative OOD benchmarks, it surpasses the base model by 5.4% and GRPO based on original images by 2.7%, while exhibiting extreme data efficiency, achieving superior performance with 30x fewer samples (8.5K) than recent RL-based methods. Furthermore, its plug-and-play nature allows seamless integration with advanced RL strategies like NoisyRollout to yield complementary improvements. Code is available at https://github.com/aiming-lab/SimpleOCR.

Chinese Translation

尽管多模态大型语言模型（MLLMs）在快速发展，但关于它们的视觉基础机制仍然存在一个关键问题尚未解答：这些模型是否真正“阅读”嵌入图像中的文本，还是仅仅依赖于文本提示中的参数化捷径？在本研究中，我们通过引入可视化问题（Visualized-Question, VQ）设置来诊断这一问题，在该设置中，文本查询直接渲染到图像上，以结构性地强制视觉参与。我们在 Qwen2.5-VL 上的诊断实验揭示了一个惊人的能力利用差距：尽管具备强大的光学字符识别（OCR）能力，但模型在 VQ 设置中的性能下降高达 12.7%，暴露出深层次的“模态懒惰”。为了弥补这一差距，我们提出了 SimpleOCR，一种即插即用的训练策略，该策略对学习过程施加结构性约束。通过将训练样本转换为随机风格的 VQ 格式，SimpleOCR 有效地无效化基于文本的捷径，迫使模型激活并优化其视觉文本提取路径。实证结果表明，SimpleOCR 在不修改架构的情况下，带来了稳健的性能提升。在四个具有代表性的 OOD 基准上，其性能超过基础模型 5.4%，超过基于原始图像的 GRPO 2.7%，同时展现出极高的数据效率，以 30 倍更少的样本（8.5K）实现优越性能，优于近期基于强化学习的方法。此外，其即插即用的特性允许与先进的强化学习策略如 NoisyRollout 无缝集成，以实现互补性改进。代码可在 https://github.com/aiming-lab/SimpleOCR 获取。

cs.CV / 8 / 2602.22455

Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge

探索边缘计算中的多模态大型语言模型用于在线情节记忆问答

Lando, Giuseppe, Forte, Rosario, Furnari, Antonino

Abstract

We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants, so we investigate implementation on the edge. We integrate streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of MLLMs within strict resource boundaries, showing promising results even when compared with cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.

Chinese Translation

我们研究了在实时在线情节记忆问答中使用多模态大型语言模型（MLLMs）的可行性。虽然云计算卸载很常见，但它对可穿戴助手提出了隐私和延迟的担忧，因此我们探讨了在边缘计算上的实现。我们将流媒体约束集成到问答管道中，该管道结构分为两个异步线程：一个描述线程（Descriptor Thread），持续将视频转换为轻量级文本记忆；另一个问答线程（Question Answering Thread），对文本记忆进行推理以回答查询。在QAEgo4D-Closed基准上的实验分析了多模态大型语言模型（MLLMs）在严格资源限制下的性能，结果显示出相当有希望的表现，甚至与基于云的解决方案相比也不逊色。具体而言，在一台消费级8GB GPU上运行的端到端配置达到了51.76%的准确率，首次响应时间（Time-To-First-Token, TTFT）为0.41秒。扩展到本地企业级服务器时，准确率提高到54.40%，TTFT为0.88秒。相比之下，基于云的解决方案获得了56.00%的准确率。这些具有竞争力的结果突显了基于边缘的解决方案在保护隐私的情节记忆检索中的潜力。
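
The two-thread layout described in the abstract can be sketched with a shared, lock-protected text memory. Everything model-related here is a stand-in: `fake_describe` and `fake_reason` are hypothetical placeholders for the MLLM calls, and the class and function names are assumptions for this example rather than the authors' code.

```python
import threading
from collections import deque
from typing import Callable, Deque, Iterable, List, Tuple

Entry = Tuple[float, str]  # (timestamp in seconds, text description)


class TextMemory:
    """Thread-safe, bounded log of text descriptions."""

    def __init__(self, max_entries: int = 1000) -> None:
        self._entries: Deque[Entry] = deque(maxlen=max_entries)
        self._lock = threading.Lock()

    def append(self, timestamp: float, description: str) -> None:
        with self._lock:
            self._entries.append((timestamp, description))

    def snapshot(self) -> List[Entry]:
        with self._lock:
            return list(self._entries)


def descriptor_loop(
    frames: Iterable[Tuple[float, str]],
    memory: TextMemory,
    describe: Callable[[str], str],
) -> None:
    """Descriptor side: turn each incoming frame into a memory entry."""
    for timestamp, frame in frames:
        memory.append(timestamp, describe(frame))


def answer(
    question: str,
    memory: TextMemory,
    reason: Callable[[str, List[Entry]], str],
) -> str:
    """QA side: reason over a consistent copy of the memory."""
    return reason(question, memory.snapshot())


def fake_describe(frame: str) -> str:
    return f"the user {frame}"


def fake_reason(question: str, entries: List[Entry]) -> str:
    keyword = question.rstrip("?").split()[-1]
    hits = [f"{text} at t={t:.0f}s" for t, text in entries if keyword in text]
    return hits[-1] if hits else "not found in memory"


memory = TextMemory()
frames = [(0.0, "opens the fridge"), (5.0, "puts keys on the table"), (9.0, "leaves the room")]
descriptor = threading.Thread(target=descriptor_loop, args=(frames, memory, fake_describe))
descriptor.start()
descriptor.join()  # joined here only to make the example deterministic
print(answer("Where are my keys?", memory, fake_reason))
# prints: the user puts keys on the table at t=5s
```

In a streaming deployment the descriptor thread would keep running while questions arrive; the lock and the snapshot copy are what let the QA side read the memory without blocking the writer for long.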

cs.CV / 9 / 2602.22462

MammoWise: Multi-Model Local RAG Pipeline for Mammography Report Generation

MammoWise：用于乳腺X光报告生成的多模型本地RAG管道

Jahangir, Raiyan, Khan, Nafiz Imtiaz, Sudheerkumar, Amritanand, Filkov, Vladimir

Abstract

Screening mammography is high volume, time sensitive, and documentation heavy. Radiologists must translate subtle visual findings into consistent BI-RADS assessments, breast density categories, and structured narrative reports. While recent Vision Language Models (VLMs) enable image-to-text reporting, many rely on closed cloud systems or tightly coupled architectures that limit privacy, reproducibility, and adaptability. We present MammoWise, a local multi-model pipeline that transforms open source VLMs into mammogram report generators and multi-task classifiers. MammoWise supports any Ollama-hosted VLM and mammography dataset, and enables zero-shot, few-shot, and Chain-of-Thought prompting, with optional multimodal Retrieval Augmented Generation (RAG) using a vector database for case-specific context. We evaluate MedGemma, LLaVA-Med, and Qwen2.5-VL on VinDr-Mammo and DMID datasets, assessing report quality (BERTScore, ROUGE-L), BI-RADS classification, breast density, and key findings. Report generation is consistently strong and improves with few-shot prompting and RAG. Classification is feasible but sensitive to model and dataset choice. Parameter-efficient fine-tuning (QLoRA) of MedGemma improves reliability, achieving BI-RADS accuracy of 0.7545, density accuracy of 0.8840, and calcification accuracy of 0.9341 while preserving report quality. MammoWise provides a practical and extensible framework for deploying local VLMs for mammography reporting within a unified and reproducible workflow.

Chinese Translation

筛查乳腺X光检查具有高容量、时间敏感和文档密集的特点。放射科医生必须将微妙的视觉发现转化为一致的BI-RADS评估、乳腺密度类别和结构化叙述报告。尽管近期的视觉语言模型（VLMs）使图像到文本的报告成为可能，但许多模型依赖于封闭的云系统或紧密耦合的架构，这限制了隐私性、可重复性和适应性。我们提出了MammoWise，一个本地多模型管道，将开源VLM转化为乳腺X光报告生成器和多任务分类器。MammoWise支持任何Ollama托管的VLM和乳腺X光数据集，并支持零样本、少样本和思维链提示，具有使用向量数据库进行特定案例上下文的可选多模态检索增强生成（RAG）。我们在VinDr-Mammo和DMID数据集上评估了MedGemma、LLaVA-Med和Qwen2.5-VL，评估报告质量（BERTScore，ROUGE-L）、BI-RADS分类、乳腺密度和关键发现。报告生成始终表现强劲，并随着少样本提示和RAG的使用而改善。分类是可行的，但对模型和数据集的选择敏感。对MedGemma进行参数高效的微调（QLoRA）提高了可靠性，达到了0.7545的BI-RADS准确率、0.8840的密度准确率和0.9341的钙化准确率，同时保持了报告质量。MammoWise提供了一个实用且可扩展的框架，用于在统一且可重复的工作流程中部署本地VLM进行乳腺X光报告。
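
The optional retrieval step can be pictured as a nearest-neighbour lookup over stored case embeddings, with the retrieved reports prepended to the prompt. This is a generic sketch using cosine similarity over an in-memory list; the actual vector database, embedding model, and prompt layout used by MammoWise are not given in the abstract, and all names and toy values below are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Case = Tuple[Sequence[float], str]  # (embedding, report text)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: Sequence[float], cases: Sequence[Case], k: int = 2) -> List[str]:
    """Return the reports of the k stored cases most similar to the query."""
    ranked = sorted(cases, key=lambda case: cosine_similarity(query, case[0]), reverse=True)
    return [report for _, report in ranked[:k]]


def build_prompt(instruction: str, retrieved: Sequence[str]) -> str:
    """Prepend retrieved reports as case-specific context."""
    context = "\n".join(f"Example report {i + 1}: {r}" for i, r in enumerate(retrieved))
    return f"{context}\n\n{instruction}"


# Toy 2-D embeddings and placeholder report texts.
cases = [
    ([1.0, 0.0], "placeholder report A"),
    ([0.0, 1.0], "placeholder report B"),
    ([0.9, 0.1], "placeholder report C"),
]
examples = retrieve([1.0, 0.0], cases, k=2)
prompt = build_prompt("Write a structured report for the attached mammogram.", examples)
```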

cs.CV / 10 / 2602.22469

Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models

超越主导区域：面向基础视觉-语言模型的空间信用重分配

Samin, Niamul Hassan, Rahman, Md Arifur, Hanif, Abdullah Ibne, Noshin, Juena Ahmed, Rahman, Md Ashikur

Abstract

Vision-language models (VLMs) frequently hallucinate objects absent from the input image. We trace this failure to spatial credit collapse: activation credit concentrating on sparse visual patches in early transformer layers, which suppresses contextual evidence and increases reliance on language priors. We introduce Spatial Credit Redistribution (SCR), a training-free inference-time intervention that redistributes hidden-state activation from high-attention source patches to their context, guided by low-entropy inputs. We evaluate six model families (Chameleon, LLaVA, and Qwen, including both Qwen-VL and Qwen2-VL) at scales of 7B, 13B, and 30B, on POPE and CHAIR benchmarks. SCR reduces hallucination by ~4.7-6.0 percentage points on POPE-Adversarial, cuts CHAIR-s by 3.7-5.2 percentage points (42-51 percent relative), and CHAIR-i by 2.7-4.4 percentage points (44-58 percent relative), and preserves CIDEr within 0.8 percentage points. Gains are largest for low-entropy inputs, consistent with the theoretical framework. SCR incurs only 43-56 ms overhead (small models: +43-46 ms; large models: +54-56 ms), roughly 3-6 times lower than OPERA and VCD and 1.3-1.7 times lower than OVCD (+72 ms), while Pareto-dominating all three on both hallucination rate and CIDEr, making it practical for real-time settings. A controlled ablation confirms that attention-guided source selection is essential: replacing it with uniform random selection reduces hallucination rate gains from ~4.7-6.0 percentage points to only ~2.6-3.4 percentage points, pointing to credit-collapse as the key driver.

Chinese Translation

视觉-语言模型（VLMs）常常会幻觉出输入图像中不存在的物体。我们将这一失败归因于空间信用崩溃：在早期变换器层中，激活信用集中在稀疏的视觉区域上，这抑制了上下文证据并增加了对语言先验的依赖。我们提出了空间信用重分配（Spatial Credit Redistribution, SCR），这是一种无需训练的推理时干预方法，它将隐藏状态激活从高注意力源区域重新分配到其上下文，受低熵输入的引导。我们在POPE和CHAIR基准上评估了六个模型家族（Chameleon、LLaVA和Qwen，包括Qwen-VL和Qwen2-VL），规模为7B、13B和30B。SCR在POPE-Adversarial上将幻觉率降低了约4.7-6.0个百分点，在CHAIR-s上降低了3.7-5.2个百分点（相对42-51%），在CHAIR-i上降低了2.7-4.4个百分点（相对44-58%），并将CIDEr保持在0.8个百分点以内。增益在低熵输入下最大，这与理论框架一致。SCR仅增加了43-56毫秒的开销（小模型：+43-46毫秒；大模型：+54-56毫秒），大约比OPERA和VCD低3-6倍，比OVCD（+72毫秒）低1.3-1.7倍，同时在幻觉率和CIDEr上优于所有三者，使其适用于实时设置。控制性消融实验确认了基于注意力的源选择是至关重要的：用均匀随机选择替代会将幻觉率增益从约4.7-6.0个百分点降低到仅约2.6-3.4个百分点，表明信用崩溃是关键驱动因素。
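
A toy illustration of moving activation from a high-attention source patch to its context. This is a heavily simplified sketch, not the SCR algorithm: it uses scalar activations on a 1-D row of patches, treats the immediate neighbours as the "context", and the `top_k` and `fraction` parameters are assumptions. Real hidden states are vectors on a 2-D patch grid, and the paper's selection and weighting rules are not given in the abstract.

```python
from typing import List, Sequence


def redistribute_credit(
    activations: Sequence[float],
    attention: Sequence[float],
    top_k: int = 1,         # assumed number of source patches
    fraction: float = 0.5,  # assumed share of activation moved away from each source
) -> List[float]:
    """Move part of the activation at the most-attended patches to their neighbours."""
    n = len(activations)
    out = list(activations)
    sources = sorted(range(n), key=lambda i: attention[i], reverse=True)[:top_k]
    for s in sources:
        neighbours = [j for j in (s - 1, s + 1) if 0 <= j < n]
        if not neighbours:
            continue
        moved = fraction * activations[s]
        out[s] -= moved
        for j in neighbours:
            out[j] += moved / len(neighbours)
    return out


print(redistribute_credit([0.0, 4.0, 0.0], [0.1, 0.8, 0.1]))  # prints [1.0, 2.0, 1.0]
```

The total activation is conserved in this sketch; only its spatial concentration changes.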

cs.CV / 11 / 2602.22510

Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning

Pix2Key：通过语义分解和自监督视觉词典学习实现可控的开放词汇检索

Wei, Guoyizhe, Jiao, Yang, Xi, Nan, Huang, Zhishen, Meng, Jingjing, Chellappa, Rama, Gao, Yan

Abstract

Composed Image Retrieval (CIR) uses a reference image plus a natural-language edit to retrieve images that apply the requested change while preserving other relevant visual content. Classic fusion pipelines typically rely on supervised triplets and can lose fine-grained cues, while recent zero-shot approaches often caption the reference image and merge the caption with the edit, which may miss implicit user intent and return repetitive results. We present Pix2Key, which represents both queries and candidates as open-vocabulary visual dictionaries, enabling intent-aware constraint matching and diversity-aware reranking in a unified embedding space. A self-supervised pretraining component, V-Dict-AE, further improves the dictionary representation using only images, strengthening fine-grained attribute understanding without CIR-specific supervision. On the DFMM-Compose benchmark, Pix2Key improves Recall@10 by up to 3.2 points, and adding V-Dict-AE yields an additional 2.3-point gain while improving intent consistency and maintaining high list diversity.

Chinese Translation

组合图像检索（CIR）使用参考图像加上自然语言编辑来检索应用所请求更改的图像，同时保留其他相关视觉内容。经典的融合管道通常依赖于监督三元组，可能会丢失细粒度线索，而最近的零样本方法通常对参考图像进行注释，并将注释与编辑合并，这可能会错过隐含的用户意图并返回重复的结果。我们提出了Pix2Key，它将查询和候选项表示为开放词汇视觉词典，从而在统一的嵌入空间中实现意图感知的约束匹配和多样性感知的重新排序。自监督预训练组件V-Dict-AE进一步利用仅图像改进词典表示，增强细粒度属性理解，而无需CIR特定的监督。在DFMM-Compose基准上，Pix2Key将Recall@10提高了最多3.2个百分点，添加V-Dict-AE则带来了额外的2.3个百分点的提升，同时改善了意图一致性并保持了高列表多样性。
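
Diversity-aware reranking can be illustrated with Maximal Marginal Relevance (MMR), a standard technique that trades relevance against similarity to items already selected. The abstract does not say which reranker Pix2Key uses, so MMR is swapped in here purely as an example; the `trade_off` value and the toy scores are made up.

```python
from typing import Callable, Dict, List


def mmr_rerank(
    relevance: Dict[str, float],
    similarity: Callable[[str, str], float],
    trade_off: float = 0.5,  # 1.0 = relevance only, 0.0 = diversity only
    top_n: int = 3,
) -> List[str]:
    """Greedy MMR: repeatedly pick the item with the best relevance-minus-redundancy score."""
    selected: List[str] = []
    remaining = set(relevance)
    while remaining and len(selected) < top_n:
        def score(item: str) -> float:
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return trade_off * relevance[item] - (1.0 - trade_off) * redundancy

        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


def toy_similarity(x: str, y: str) -> float:
    # Candidates "a" and "b" are near-duplicates; all other pairs are dissimilar.
    return 0.95 if {x, y} == {"a", "b"} else 0.1


ranking = mmr_rerank({"a": 0.9, "b": 0.88, "c": 0.6}, toy_similarity)
print(ranking)  # prints ['a', 'c', 'b']
```

The near-duplicate `b` is pushed below the less relevant but distinct `c`, which is the behavior a repetitive result list needs.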

cs.CV / 12 / 2602.22545

DisQ-HNet: A Disentangled Quantized Half-UNet for Interpretable Multimodal Image Synthesis Applications to Tau-PET Synthesis from T1 and FLAIR MRI

DisQ-HNet：一种可解释的多模态图像合成应用于从 T1 和 FLAIR MRI 合成 Tau-PET 的解耦量化半 UNet

Chopra, Agamdeep S., Neher, Caitlin, Ren, Tianyi, Rivera, Juampablo E. Heras, Kurt, Mehmet

Abstract

Tau positron emission tomography (tau-PET) provides an in vivo marker of Alzheimer's disease pathology, but cost and limited availability motivate MRI-based alternatives. We introduce DisQ-HNet (DQH), a framework that synthesizes tau-PET from paired T1-weighted and FLAIR MRI while exposing how each modality contributes to the prediction. The method combines (i) a Partial Information Decomposition (PID)-guided, vector-quantized encoder that partitions latent information into redundant, unique, and complementary components, and (ii) a Half-UNet decoder that preserves anatomical detail using pseudo-skip connections conditioned on structural edge cues rather than direct encoder feature reuse. Across multiple baselines (VAE, VQ-VAE, and UNet), DisQ-HNet maintains reconstruction fidelity and better preserves disease-relevant signal for downstream AD tasks, including Braak staging, tau localization, and classification. PID-based Shapley analysis provides modality-specific attribution of synthesized uptake patterns.

Chinese Translation

Tau 正电子发射断层扫描（tau-PET）提供了阿尔茨海默病病理的体内标记，但高成本和有限的可用性促使我们寻找基于 MRI 的替代方案。我们提出了 DisQ-HNet（DQH），一个从配对的 T1 加权和 FLAIR MRI 合成 tau-PET 的框架，同时揭示每种模态如何贡献于预测。该方法结合了 (i) 一种基于部分信息分解（PID）指导的向量量化编码器，该编码器将潜在信息划分为冗余、独特和互补的组件，以及 (ii) 一种半 UNet 解码器，该解码器使用基于结构边缘线索的伪跳跃连接来保留解剖细节，而不是直接重用编码器特征。在多个基准（VAE、VQ-VAE 和 UNet）中，DisQ-HNet 维持了重建的保真度，并更好地保留了与疾病相关的信号，以便于下游阿尔茨海默病任务，包括 Braak 分期、tau 定位和分类。基于 PID 的 Shapley 分析提供了合成摄取模式的模态特定归因。

cs.CV / 13 / 2602.22549

DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation

DrivePTS：一种具有文本和结构增强的渐进学习框架用于驾驶场景生成

Wang, Zhechao, Zeng, Yiming, Ma, Lufan, Fu, Zeqing, Bai, Chen, Lin, Ziyao, Lu, Cheng

Abstract

Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model's sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes. Notably, DrivePTS successfully generates rare scenes where prior methods fail, highlighting its strong generalization ability.

Chinese Translation

多样化驾驶场景的合成作为一种重要的数据增强技术，对于验证自动驾驶系统的鲁棒性和泛化能力至关重要。目前的方法在扩散模型中聚合高清（HD）地图和3D边界框作为条件生成的几何条件。然而，隐式的条件间依赖性导致在控制条件独立变化时生成失败。此外，这些方法在语义和结构方面的细节不足。具体而言，简短且视角不变的标题限制了语义上下文，导致背景建模较弱。同时，标准的去噪损失采用均匀空间加权，忽视了前景结构细节，造成视觉失真和模糊。为了解决这些挑战，我们提出了DrivePTS，包含三个关键创新。首先，我们的框架采用渐进学习策略，以减轻几何条件之间的相互依赖性，并通过明确的互信息约束进行强化。其次，利用视觉-语言模型生成六个语义方面的多视角层次描述，提供细粒度的文本指导。第三，引入频率引导的结构损失，以增强模型对高频元素的敏感性，提高前景结构的保真度。大量实验表明，我们的DrivePTS在生成多样化驾驶场景方面实现了最先进的保真度和可控性。值得注意的是，DrivePTS成功生成了先前方法失败的稀有场景，突显了其强大的泛化能力。

cs.CV / 14 / 2602.22565

SwiftNDC: Fast Neural Depth Correction for High-Fidelity 3D Reconstruction

SwiftNDC：高保真3D重建的快速神经深度校正

Han, Kang, Xiang, Wei, Yu, Lu, Wyatt, Mathew, Liu, Gaowen, Kompella, Ramana Rao

Abstract

Depth-guided 3D reconstruction has gained popularity as a fast alternative to optimization-heavy approaches, yet existing methods still suffer from scale drift, multi-view inconsistencies, and the need for substantial refinement to achieve high-fidelity geometry. Here, we propose SwiftNDC, a fast and general framework built around a Neural Depth Correction field that produces cross-view consistent depth maps. From these refined depths, we generate a dense point cloud through back-projection and robust reprojection-error filtering, obtaining a clean and uniformly distributed geometric initialization for downstream reconstruction. This reliable dense geometry substantially accelerates 3D Gaussian Splatting (3DGS) for mesh reconstruction, enabling high-quality surfaces with significantly fewer optimization iterations. For novel-view synthesis, SwiftNDC can also improve 3DGS rendering quality, highlighting the benefits of strong geometric initialization. We conduct a comprehensive study across five datasets, including two for mesh reconstruction, as well as three for novel-view synthesis. SwiftNDC consistently reduces running time for accurate mesh reconstruction and boosts rendering fidelity for view synthesis, demonstrating the effectiveness of combining neural depth refinement with robust geometric initialization for high-fidelity and efficient 3D reconstruction.

Chinese Translation

深度引导的3D重建作为一种快速替代优化密集方法的方案，已获得广泛关注，但现有方法仍然面临尺度漂移、多视图不一致性以及实现高保真几何所需的显著精细化等问题。在此，我们提出了SwiftNDC，一个基于神经深度校正场的快速通用框架，能够生成视图间一致的深度图。基于这些精细化的深度，我们通过反投影和稳健的重投影误差过滤生成稠密点云，从而为后续重建提供干净且均匀分布的几何初始化。这种可靠的稠密几何显著加速了网格重建中的3D高斯喷溅（3D Gaussian Splatting, 3DGS），使得在显著减少优化迭代次数的情况下实现高质量表面。对于新视图合成，SwiftNDC也能提升3DGS的渲染质量，突显出强几何初始化的优势。我们在五个数据集上进行了全面研究，包括两个用于网格重建以及三个用于新视图合成。SwiftNDC始终减少了准确网格重建的运行时间，并提升了视图合成的渲染保真度，展示了将神经深度精细化与稳健几何初始化相结合在高保真和高效3D重建中的有效性。

cs.CV / 15 / 2602.22568

Quality-Aware Robust Multi-View Clustering for Heterogeneous Observation Noise

考虑质量的鲁棒多视角聚类方法应对异构观测噪声

Wu, Peihan, Cheng, Guanjie, Tong, Yufei, Xi, Meng, Deng, Shuiguang

Abstract

Deep multi-view clustering has achieved remarkable progress but remains vulnerable to complex noise in real-world applications. Existing noise-robust methods predominantly rely on a simplified binary assumption, treating data as either perfectly clean or completely corrupted. This overlooks the prevalence of heterogeneous observation noise, where contamination intensity varies continuously across data. To bridge this gap, we propose a novel framework termed Quality-Aware Robust Multi-View Clustering (QARMVC). Specifically, QARMVC employs an information bottleneck mechanism to extract intrinsic semantics for view reconstruction. Leveraging the insight that noise disrupts semantic integrity and impedes reconstruction, we utilize the resulting reconstruction discrepancy to precisely quantify fine-grained contamination intensity and derive instance-level quality scores. These scores are integrated into a hierarchical learning strategy: at the feature level, a quality-weighted contrastive objective is designed to adaptively suppress the propagation of noise; at the fusion level, a high-quality global consensus is constructed via quality-weighted aggregation, which is subsequently utilized to align and rectify local views via mutual information maximization. Extensive experiments on five benchmark datasets demonstrate that QARMVC consistently outperforms state-of-the-art baselines, particularly in scenarios with heterogeneous noise intensities.
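The idea of turning reconstruction discrepancy into instance-level quality scores, and then using those scores for weighted fusion, can be sketched as below. The exponential mapping from error to score, the `temperature` parameter, and both function names are assumptions chosen for illustration; the abstract does not specify the actual scoring function used by QARMVC.

```python
import numpy as np

def quality_scores(x, x_rec, temperature=1.0):
    """Map per-instance reconstruction error to a score in (0, 1]; larger error gives a lower score."""
    err = np.mean((x - x_rec) ** 2, axis=1)
    return np.exp(-err / temperature)

def quality_weighted_consensus(view_features, view_scores):
    """Fuse per-view features of shape (V, N, D) using per-instance scores of shape (V, N).

    Scores are normalized across views so that, for each instance, the weights sum to one.
    """
    weights = view_scores / np.sum(view_scores, axis=0, keepdims=True)
    return np.sum(weights[..., None] * view_features, axis=0)
```

With this weighting, a view whose reconstruction of an instance is poor contributes less to that instance's fused representation, which is one simple way to realize "quality-weighted aggregation".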

Chinese Translation

深度多视角聚类取得了显著进展，但在实际应用中仍然容易受到复杂噪声的影响。现有的鲁棒噪声处理方法主要依赖于简化的二元假设，将数据视为完全干净或完全损坏。这忽视了异构观测噪声的普遍存在，其中污染强度在数据中持续变化。为了解决这一问题，我们提出了一种新颖的框架，称为考虑质量的鲁棒多视角聚类（Quality-Aware Robust Multi-View Clustering, QARMVC）。具体而言，QARMVC采用信息瓶颈机制提取内在语义以进行视图重建。利用噪声破坏语义完整性并妨碍重建的洞察，我们利用重建差异精确量化细粒度的污染强度，并推导出实例级质量评分。这些评分被整合到一个分层学习策略中：在特征层面，设计了一种质量加权的对比目标，以自适应地抑制噪声的传播；在融合层面，通过质量加权聚合构建高质量的全局共识，随后利用互信息最大化对局部视图进行对齐和校正。在五个基准数据集上的大量实验表明，QARMVC在异构噪声强度场景中始终优于最先进的基线方法。

cs.CV / 16 / 2602.22570

Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

指导至关重要：重新思考文本到图像生成的评估陷阱

Xie, Dian, Shao, Shitong, Bai, Lichen, Zhou, Zikai, Cheng, Bojun, Yang, Shuo, Wu, Jun, Xie, Zeke

Abstract

Classifier-free guidance (CFG) has helped diffusion models achieve strong conditional generation in various fields. Recently, more diffusion guidance methods have emerged that report improved generation quality and human preference scores. However, do these emerging diffusion guidance methods really achieve solid and significant improvements? In this paper, we rethink recent progress on diffusion guidance. Our work consists of four main contributions. First, we reveal a critical evaluation pitfall: common human preference models exhibit a strong bias towards large guidance scales. Simply increasing the CFG scale can easily improve quantitative evaluation scores due to strong semantic alignment, even if image quality is severely damaged (e.g., oversaturation and artifacts). Second, we introduce a novel guidance-aware evaluation (GA-Eval) framework that employs effective guidance scale calibration to enable fair comparison between current guidance methods and CFG by identifying the effects orthogonal and parallel to CFG effects. Third, motivated by the evaluation pitfall, we design the Transcendent Diffusion Guidance (TDG) method, which significantly improves human preference scores in the conventional evaluation framework but does not actually work in practice. Fourth, in extensive experiments, we empirically evaluate eight recent diffusion guidance methods within the conventional evaluation framework and the proposed GA-Eval framework. Notably, simply increasing the CFG scale can compete with most of the studied diffusion guidance methods, while all methods suffer severely from winning rate degradation over standard CFG. We hope our work strongly motivates the community to rethink the evaluation paradigm and future directions of this field.
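For readers unfamiliar with the guidance scale discussed here, the sketch below shows the commonly used CFG combination rule, in which the conditional and unconditional noise predictions are linearly extrapolated. The function name and the toy arrays are illustrative assumptions; the abstract does not state which parameterization of the scale the paper uses.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Extrapolate from the unconditional prediction toward the conditional one.

    scale = 0 returns the unconditional prediction, scale = 1 the conditional one,
    and scale > 1 pushes further along the (eps_cond - eps_uncond) direction.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

Because the guided prediction moves linearly along the conditional-minus-unconditional direction as `scale` grows, any metric that rewards strong prompt alignment can be raised by increasing `scale` alone, which is the kind of bias the abstract describes.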

Chinese Translation

无分类器指导（Classifier-free guidance, CFG）帮助扩散模型在多个领域实现了出色的条件生成。最近，更多的扩散指导方法出现，提升了生成质量和人类偏好。然而，这些新兴的扩散指导方法真的能实现稳固且显著的改进吗？在本文中，我们重新审视了扩散指导的最新进展。我们的工作主要包括四个贡献。首先，我们揭示了一个关键的评估陷阱，即常见的人类偏好模型对大指导尺度表现出强烈的偏见。仅仅增加CFG尺度就可以轻易提高定量评估分数，因为强语义对齐，即使图像质量严重受损（例如，过饱和和伪影）。其次，我们引入了一种新颖的指导感知评估（Guidance-aware Evaluation, GA-Eval）框架，该框架采用有效的指导尺度校准，以便通过识别与CFG效果正交和并行的影响，实现当前指导方法与CFG之间的公平比较。第三，受到评估陷阱的启发，我们设计了超越性扩散指导（Transcendent Diffusion Guidance, TDG）方法，该方法可以显著提高传统评估框架中的人类偏好分数，但实际上在实践中并不奏效。第四，在广泛的实验中，我们在传统评估框架和提出的GA-Eval框架内对最近的八种扩散指导方法进行了实证评估。值得注意的是，仅仅增加CFG尺度就可以与大多数研究的扩散指导方法相竞争，而所有方法在标准CFG下都遭受了胜率下降的严重影响。我们的工作将强烈激励学术界重新思考评估范式和该领域的未来方向。

cs.CV / 17 / 2602.22571

GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views

GIFSplat：基于生成先验引导的稀疏视图迭代前馈3D高斯散射

Chen, Tianyu, Xiang, Wei, Han, Kang, Lu, Yu, Wu, Di, Liu, Gaowen, Kompella, Ramana Rao

Abstract

Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still have room for further performance gains, especially on out-of-domain data, and struggle to retain second-level inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm in existing feed-forward pipelines: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine the current 3D scene using rendering evidence, achieving a favorable balance between efficiency and quality. Furthermore, we distill a frozen diffusion prior into Gaussian-level cues from enhanced novel renderings without gradient backpropagation or ever-increasing view-set expansion, thereby enabling per-scene adaptation with a generative prior while preserving feed-forward efficiency. Across DL3DV, RealEstate10K, and DTU, GIFSplat consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, and it maintains second-scale inference time without requiring camera poses or any test-time gradient optimization.

Chinese Translation

前馈3D重建在运行时上相较于每场景优化具有显著优势，后者在推理时仍然较慢，并且在稀疏视图下往往表现脆弱。然而，现有的前馈方法在进一步提升性能方面仍有潜力，尤其是在处理域外数据时，并且在引入生成先验后难以保持二级推理时间。这些限制源于现有前馈流程中的一次性预测范式：模型受到容量的严格限制，缺乏推理时的细化，并且不适合持续注入生成先验。我们提出了GIFSplat，这是一种纯前馈的迭代细化框架，旨在从稀疏的无姿态视图中进行3D高斯散射。通过少量仅前向的残差更新，逐步利用渲染证据细化当前的3D场景，实现效率与质量之间的良好平衡。此外，我们将一个冻结的扩散先验提炼为来自增强新渲染的高斯级线索，无需梯度反向传播或不断扩展视图集，从而在保持前馈效率的同时实现基于生成先验的每场景适应。在DL3DV、RealEstate10K和DTU数据集上，GIFSplat始终优于最先进的前馈基线，PSNR提升高达+2.1 dB，并且在不需要相机姿态或任何测试时梯度优化的情况下，保持了二级推理时间。

cs.CV / 18 / 2602.22594

Causal Motion Diffusion Models for Autoregressive Motion Generation

因果运动扩散模型用于自回归运动生成

Yu, Qing, Watanabe, Akihisa, Fujiwara, Kent

Abstract

Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches rely either on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real-time applicability, or on autoregressive models that suffer from instability and cumulative errors. In this work, we present Causal Motion Diffusion Models (CMDM), a unified framework for autoregressive motion generation based on a causal diffusion transformer that operates in a semantically aligned latent space. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. On top of this latent representation, an autoregressive diffusion transformer is trained using causal diffusion forcing to perform temporally ordered denoising across motion frames. To achieve fast inference, we introduce a frame-wise sampling schedule with causal uncertainty, where each subsequent frame is predicted from partially denoised previous frames. The resulting framework supports high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.

Chinese Translation

近期运动扩散模型的进展显著提高了人类运动合成的真实感。然而，现有的方法要么依赖于具有双向生成的全序列扩散模型，这限制了时间因果性和实时应用性，要么是自回归模型，这些模型面临不稳定性和累积误差的问题。在本研究中，我们提出了因果运动扩散模型（Causal Motion Diffusion Models, CMDM），这是一个基于因果扩散变换器的自回归运动生成的统一框架，能够在语义对齐的潜在空间中运行。CMDM建立在运动-语言对齐的因果变分自编码器（Motion-Language-Aligned Causal VAE, MAC-VAE）之上，该模型将运动序列编码为时间因果的潜在表示。在此潜在表示的基础上，使用因果扩散强迫训练自回归扩散变换器，以在运动帧之间执行时间顺序的去噪。为了实现快速推理，我们引入了一种具有因果不确定性的逐帧采样调度，其中每个后续帧是从部分去噪的前帧预测的。该框架支持高质量的文本到运动生成、流式合成和以交互速率进行的长时间运动生成。在HumanML3D和SnapMoGen上的实验表明，CMDM在语义保真度和时间平滑性方面优于现有的扩散和自回归模型，同时显著降低了推理延迟。

cs.CV / 19 / 2602.22595

Don't let the information slip away

不要让信息溜走

Li, Taozhe

Abstract

Real-time object detection has advanced rapidly in recent years. The YOLO series of detectors is among the best-known CNN-based object detection models and cannot be overlooked. The latest version, YOLOv26, was recently released, while YOLOv12 achieved state-of-the-art (SOTA) performance with 55.2 mAP on the COCO val2017 dataset. Meanwhile, transformer-based object detection models, also known as DEtection TRansformer (DETR), have demonstrated impressive performance. RT-DETR is an outstanding model that outperformed the YOLO series in both speed and accuracy when it was released. Its successor, RT-DETRv2, achieved 53.4 mAP on the COCO val2017 dataset. However, despite their remarkable performance, all these models let information slip away. They primarily focus on the features of foreground objects while neglecting the contextual information provided by the background. We believe that background information can significantly aid object detection tasks. For example, cars are more likely to appear on roads than in offices, while wild animals are more likely to be found in forests or remote areas than on busy streets. To address this gap, we propose an object detection model called Association DETR, which achieves state-of-the-art results compared to other object detection models on the COCO val2017 dataset.

Chinese Translation

近年来，实时目标检测技术迅速发展。YOLO系列检测器是最著名的基于卷积神经网络（CNN）的目标检测模型之一，不容忽视。最新版本YOLOv26最近发布，而YOLOv12在COCO val2017数据集上达到了55.2的平均精度均值（mAP），取得了最先进（SOTA）的性能。同时，基于变换器的目标检测模型，也称为DEtection TRansformer（DETR），表现出色。RT-DETR是一款杰出的模型，在发布时在速度和准确性上超越了YOLO系列。其后续版本RT-DETRv2在COCO val2017数据集上达到了53.4的mAP。然而，尽管这些模型表现出色，但它们都让信息溜走。它们主要关注前景物体的特征，而忽视了背景提供的上下文信息。我们认为，背景信息可以显著帮助目标检测任务。例如，汽车更可能出现在道路上而不是办公室，而野生动物更可能出现在森林或偏远地区而不是繁忙的街道。为了解决这一问题，我们提出了一种名为Association DETR的目标检测模型，该模型在COCO val2017数据集上相较于其他目标检测模型取得了最先进的结果。

cs.CV / 20 / 2602.22596

BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model

BetterScene：基于表示对齐生成模型的3D场景合成

Han, Yuci, Toth, Charles, Anderson, John E., Shuart, William J., Yilmaz, Alper

Abstract

We present BetterScene, an approach to enhance novel view synthesis (NVS) quality for diverse real-world scenes using extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model pretrained on billions of frames as a strong backbone, aiming to mitigate artifacts and recover view-consistent details at inference time. Conventional methods have developed similar diffusion-based solutions to address these challenges of novel view synthesis. Despite significant improvements, these methods typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping other components frozen, which still leads to inconsistent details and artifacts even when incorporating geometry-aware regularizations like depth or semantic conditions. To address this, we investigate the latent space of the diffusion model and introduce two components: (1) temporal equivariance regularization and (2) vision foundation model-aligned representation, both applied to the variational autoencoder (VAE) module within the SVD pipeline. BetterScene integrates a feed-forward 3D Gaussian Splatting (3DGS) model to render features as inputs for the SVD enhancer and generate continuous, artifact-free, consistent novel views. We evaluate on the challenging DL3DV-10K dataset and demonstrate superior performance compared to state-of-the-art methods.

Chinese Translation

我们提出了BetterScene，一种利用极其稀疏且不受限制的照片来增强多样化真实场景的新视图合成（NVS）质量的方法。BetterScene利用在数十亿帧上预训练的生产就绪型稳定视频扩散（Stable Video Diffusion, SVD）模型作为强大的基础，旨在减轻伪影并在推理时恢复视图一致的细节。传统方法已经开发出类似的基于扩散的解决方案来应对新视图合成的挑战。尽管取得了显著的改进，这些方法通常依赖于现成的预训练扩散先验，并仅对UNet模块进行微调，同时保持其他组件不变，这仍然导致即使在结合几何感知正则化（如深度或语义条件）时，也会出现不一致的细节和伪影。为了解决这个问题，我们研究了扩散模型的潜在空间，并引入了两个组件：（1）时间等变正则化和（2）与视觉基础模型对齐的表示，这两个组件均应用于SVD管道中的变分自编码器（Variational Autoencoder, VAE）模块。BetterScene集成了一个前馈3D高斯点云（3D Gaussian Splatting, 3DGS）模型，将特征渲染为SVD增强器的输入，并生成连续的、无伪影的、一致的新视图。我们在具有挑战性的DL3DV-10K数据集上进行评估，并展示了相比于最先进方法的优越性能。

cs.CV / 21 / 2602.22607

LoR-LUT: Learning Compact 3D Lookup Tables via Low-Rank Residuals

LoR-LUT：通过低秩残差学习紧凑的三维查找表

Zhao, Ziqi, Mishra, Abhijit, Roychowdhury, Shounak

Abstract

We present LoR-LUT, a unified low-rank formulation for compact and interpretable 3D lookup table (LUT) generation. Conventional 3D-LUT-based techniques rely on fusing basis LUTs, which are usually dense tensors. Our unified approach extends this framework by jointly using a set of basis LUTs together with residual corrections, which are low-rank tensors. The approach improves the perceptual quality of an image, primarily due to its novel use of residual corrections. At the same time, it retains the same trilinear interpolation complexity while using significantly fewer network, residual-correction, and LUT parameters. Trained on the MIT-Adobe FiveK dataset, LoR-LUT reproduces expert-level retouching characteristics with high perceptual fidelity and a sub-megabyte model size. Furthermore, we introduce an interactive visualization tool, termed LoR-LUT Viewer, which transforms an input image into the LUT-adjusted output image via a number of sliders that control different parameters. The tool provides an effective way to enhance interpretability and user confidence in the visual results. Overall, our proposed formulation offers a compact, interpretable, and efficient direction for future LUT-based image enhancement and style transfer.
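The combination of fused basis LUTs with a low-rank residual, followed by trilinear lookup, can be sketched as follows. The rank-R outer-product (CP-style) form of the residual, the tensor shapes, and the function names are assumptions made for illustration; the abstract says only that the residual corrections are low-rank tensors.

```python
import numpy as np

def build_lut(basis_luts, weights, factors_r, factors_g, factors_b):
    """Fuse K dense basis LUTs and add a rank-R residual.

    basis_luts: (K, S, S, S, 3), weights: (K,), factors_*: (R, S, 3).
    The residual stores 3 * R * S * 3 numbers instead of S**3 * 3.
    """
    fused = np.tensordot(weights, basis_luts, axes=1)  # (S, S, S, 3)
    residual = np.einsum("ric,rjc,rkc->ijkc", factors_r, factors_g, factors_b)
    return fused + residual

def apply_lut(lut, rgb):
    """Trilinearly interpolate an (S, S, S, 3) LUT at colors rgb of shape (N, 3) in [0, 1]."""
    s = lut.shape[0]
    pos = np.clip(rgb, 0.0, 1.0) * (s - 1)
    lo = np.minimum(np.floor(pos).astype(int), s - 2)  # lower corner of the enclosing cell
    t = pos - lo                                       # fractional offset inside the cell
    out = np.zeros((rgb.shape[0], 3))
    for dr in (0, 1):
        for dg in (0, 1):
            for db in (0, 1):
                # Weight of this cell corner: product of per-axis linear weights.
                w = (np.where(dr, t[:, 0], 1 - t[:, 0])
                     * np.where(dg, t[:, 1], 1 - t[:, 1])
                     * np.where(db, t[:, 2], 1 - t[:, 2]))
                out += w[:, None] * lut[lo[:, 0] + dr, lo[:, 1] + dg, lo[:, 2] + db]
    return out
```

The lookup cost is the same eight-corner interpolation regardless of how the LUT was assembled, which is consistent with the abstract's point that the low-rank residual does not change trilinear interpolation complexity.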

Chinese Translation

我们提出了LoR-LUT，一种用于紧凑且可解释的三维查找表（LUT）生成的统一低秩公式。与传统的基于三维LUT的技术依赖于基查找表的融合（这些基查找表通常是稠密张量）不同，我们的统一方法通过联合使用残差修正（实际上是低秩张量）与一组基查找表，扩展了当前的框架。这里描述的方法改善了图像的现有感知质量，这主要归功于该技术对残差修正的新颖使用。同时，我们在使用显著更少的网络、残差修正和LUT参数的情况下，达到了相同水平的三线性插值复杂度。从在MIT-Adobe FiveK数据集上训练的LoR-LUT获得的实验结果再现了专家级的修饰特征，具有高感知保真度和亚兆字节的模型大小。此外，我们引入了一种交互式可视化工具，称为LoR-LUT Viewer，通过多个滑块控制不同参数，将输入图像转换为LUT调整后的输出图像。该工具提供了一种有效的方法来增强可解释性和用户对视觉结果的信心。总体而言，我们提出的公式为未来基于LUT的图像增强和风格迁移提供了一种紧凑、可解释和高效的方向。

cs.CV / 22 / 2602.22613

Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery

与增强指令的大型语言模型对齐的光谱蒸馏表示用于卫星影像

Do, Minh Kha, Xiang, Wei, Han, Kang, Wu, Di, Phan, Khoa, Chen, Yi-Ping Phoebe, Liu, Gaowen, Kompella, Ramana Rao

Abstract

Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, and linear probing by 2.7% over baselines, showing an efficient path toward spectrum-aware vision-language learning for Earth observation. Project page: https://ikhado.github.io/sattxt/

Chinese Translation

视觉-语言基础模型（VLFMs）为地球观测提供了零-shot 和检索理解的可能性。尽管现有的卫星系统通常缺乏全面的多光谱覆盖，使得仅使用RGB进行推理在可扩展部署中极具吸引力，但VLFMs在卫星影像中的应用仍受到两个因素的制约：（1）多光谱输入信息丰富，但由于波段冗余和不对齐，难以持续利用；（2）CLIP风格的文本编码器限制了语义表达能力，并削弱了细粒度对齐。我们提出了SATtxt，一种光谱感知的视觉-语言基础模型，仅在推理时使用RGB输入，同时保留在训练过程中学习到的光谱线索。我们的框架包括两个阶段。首先，光谱表示蒸馏通过轻量级投影器将冻结的多光谱教师模型的光谱先验转移到RGB学生模型。其次，利用增强指令的大型语言模型进行光谱基础对齐，将蒸馏的视觉空间与富有表现力的LLM嵌入空间连接起来。在EuroSAT、BigEarthNet和ForestNet数据集上，SATtxt在零-shot 分类上平均提高了4.2%，检索性能提高了5.9%，线性探测提高了2.7%，展示了面向地球观测的光谱感知视觉-语言学习的有效路径。项目页面：https://ikhado.github.io/sattxt/

cs.CV / 23 / 2602.22620

Coded-E2LF: Coded Aperture Light Field Imaging from Events

编码E2LF：基于事件的编码孔径光场成像

Tsuchida, Tomoya, Takahashi, Keita, Tsutake, Chihiro, Fujii, Toshiaki, Nagahara, Hajime

Abstract

We propose Coded-E2LF (coded event to light field), a computational imaging method for acquiring a 4-D light field using a coded aperture and a stationary event-only camera. Previous work adopted an imaging system similar to ours, but captured both events and intensity images and used them for light field reconstruction. In contrast, our method is purely event-based, which relaxes restrictions for hardware implementation. We also introduce several advancements over the previous work that enable us to theoretically support and practically improve light field reconstruction from events alone. In particular, we clarify the key role of a black pattern in aperture coding patterns. Finally, we implemented our method on real imaging hardware to demonstrate its effectiveness in capturing real 3-D scenes. To the best of our knowledge, we are the first to demonstrate that a 4-D light field with pixel-level accuracy can be reconstructed from events alone. Our software and supplementary video are available from our project website.

Chinese Translation

我们提出了编码E2LF（coded event to light field），这是一种利用编码孔径和静态事件相机获取4维光场的计算成像方法。在之前的工作中，采用了类似于我们的方法，但同时捕获了事件和强度图像并用于光场重建。相比之下，我们的方法完全基于事件，这降低了硬件实现的限制。我们还引入了几个相较于之前工作的进展，使我们能够理论上支持并实际改善仅基于事件的光场重建。特别地，我们阐明了黑色图案在孔径编码模式中的关键作用。最后，我们在真实成像硬件上实现了我们的方法，以展示其在捕获真实三维场景中的有效性。根据我们所知，我们是首个展示可以仅通过事件重建具有像素级精度的4维光场的研究。我们的软件和补充视频可在我们的项目网站上获取。

cs.CV / 24 / 2602.22621

CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection

CGSA：面向类指导的槽感知适应用于无源物体检测

Dai, Boyang, Fan, Zeng, Qi, Zihao, Lou, Meng, Yu, Yizhou

Abstract

Source-Free Domain Adaptive Object Detection (SF-DAOD) aims to adapt a detector trained on a labeled source domain to an unlabeled target domain without retaining any source data. Despite recent progress, most popular approaches focus on tuning pseudo-label thresholds or refining the teacher-student framework, while overlooking object-level structural cues within cross-domain data. In this work, we present CGSA, the first framework that brings Object-Centric Learning (OCL) into SF-DAOD by integrating slot-aware adaptation into a DETR-based detector. Specifically, our approach integrates a Hierarchical Slot Awareness (HSA) module into the detector to progressively disentangle images into slot representations that act as visual priors. These slots are then guided toward class semantics via a Class-Guided Slot Contrast (CGSC) module, maintaining semantic consistency and promoting domain-invariant adaptation. Extensive experiments on multiple cross-domain datasets demonstrate that our approach outperforms previous SF-DAOD methods. Theoretical derivations and experimental analysis further demonstrate the effectiveness of the proposed components and the framework, indicating the promise of object-centric design in privacy-sensitive adaptation scenarios. Code is released at https://github.com/Michael-McQueen/CGSA.

Chinese Translation

无源领域自适应物体检测（SF-DAOD）旨在将训练于标记源领域的检测器适应于未标记的目标领域，而不保留任何源数据。尽管近期取得了一些进展，但大多数流行的方法集中于调整伪标签阈值或优化教师-学生框架，而忽视了跨域数据中的物体级结构线索。在本研究中，我们提出了CGSA，这是第一个将物体中心学习（OCL）引入SF-DAOD的框架，通过将槽感知适应集成到基于DETR的检测器中。具体而言，我们的方法将一个层次槽感知（HSA）模块集成到检测器中，以逐步将图像解构为作为视觉先验的槽表示。这些槽随后通过类指导槽对比（CGSC）模块引导至类语义，保持语义一致性并促进领域不变适应。在多个跨域数据集上的广泛实验表明，我们的方法优于之前的SF-DAOD方法，理论推导和实验分析进一步证明了所提出组件和框架的有效性，从而表明物体中心设计在隐私敏感的适应场景中的潜力。代码已发布在 https://github.com/Michael-McQueen/CGSA。

cs.CV / 25 / 2602.22624

Instruction-based Image Editing with Planning, Reasoning, and Generation

基于指令的图像编辑：规划、推理与生成

Ji, Liya, Qi, Chenyang, Chen, Qifeng

Abstract

Editing images via instructions provides a natural way to generate interactive content, but it is a major challenge because it places higher demands on scene understanding and generation. Prior work chains large language models, object segmentation models, and editing models for this task. However, the understanding models offer only single-modality capabilities, which restricts editing quality. We aim to bridge understanding and generation with a new multi-modality model that equips instruction-based image editing models with the reasoning abilities needed for more complex cases. To this end, we decompose the instruction editing task into multi-modality chain-of-thought prompts: Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For CoT planning, the large language model reasons out appropriate sub-prompts given the provided instruction and the capabilities of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, we propose a hint-guided instruction-based editing network, built on a large text-to-image diffusion model, that accepts these hints during generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images.

Chinese Translation

通过指令编辑图像提供了一种生成交互内容的自然方式，但由于对场景理解和生成的更高要求，这是一项巨大的挑战。之前的工作利用了一系列大型语言模型、物体分割模型和编辑模型来完成这一任务。然而，理解模型仅提供单一模态的能力，限制了编辑质量。我们的目标是通过一个新的多模态模型来弥合理解与生成之间的差距，为基于指令的图像编辑模型提供智能能力，以应对更复杂的情况。为实现这一目标，我们将指令编辑任务与多模态思维链提示分开，即链式思维（Chain-of-Thought, CoT）规划、编辑区域推理和编辑。对于链式思维规划，大型语言模型能够考虑提供的指令和编辑网络的能力，推理出适当的子提示。对于编辑区域推理，我们训练了一个基于指令的编辑区域生成网络，结合了多模态的大型语言模型。最后，提出了一种基于提示引导的指令编辑网络，基于大型文本到图像扩散模型进行图像生成，接受生成提示。大量实验表明，我们的方法在复杂的现实世界图像上具有竞争力的编辑能力。

cs.CV / 26 / 2602.22629

CRAG: Can 3D Generative Models Help 3D Assembly?

CRAG：三维生成模型能否帮助三维装配？

Jiang, Zeyu, Li, Sihang, Tan, Siqi, Xu, Chenyang, Zhang, Juexiao, Galway-Witham, Julia, Wang, Xue, Williams, Scott A., Iovita, Radu, Feng, Chen, Zhang, Jing

Abstract

Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Our code and models will be released.

Chinese Translation

大多数现有的三维装配方法将问题视为纯粹的姿态估计，通过刚性变换重新排列观察到的部件。相比之下，人类装配自然将结构推理与整体形状推断结合在一起。受到这一直觉的启发，我们将三维装配重新表述为装配与生成的联合问题。我们展示了这两个过程是相辅相成的：装配为生成提供了部件级的结构先验，而生成则注入了整体形状上下文，从而解决了装配中的模糊性。与无法合成缺失几何形状的先前方法不同，我们提出了CRAG，它同时生成可信的完整形状并预测输入部件的姿态。大量实验表明，在具有多样几何形状、不同部件数量和缺失部件的实际物体上，CRAG展现了最先进的性能。我们的代码和模型将会发布。

cs.CV / 27 / 2602.22639

QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition

QuadSync：通过塔克分解实现四焦点张量同步

Miao, Daniel, Lerman, Gilad, Kileel, Joe

Abstract

In structure from motion, quadrifocal tensors capture more information than their pairwise counterparts (essential matrices), yet they have often been thought of as impractical and only of theoretical interest. In this work, we challenge such beliefs by providing a new framework to recover $n$ cameras from the corresponding collection of quadrifocal tensors. We form the block quadrifocal tensor and show that it admits a Tucker decomposition whose factor matrices are the stacked camera matrices, and which thus has a multilinear rank of $(4, 4, 4, 4)$ independent of $n$. We develop the first synchronization algorithm for quadrifocal tensors, using Tucker decomposition, alternating direction method of multipliers, and iteratively reweighted least squares. We further establish relationships between the block quadrifocal, trifocal, and bifocal tensors, and introduce an algorithm that jointly synchronizes these three entities. Numerical experiments demonstrate the effectiveness of our methods on modern datasets, indicating the potential and importance of using higher-order information in synchronization.
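The claim that the multilinear rank stays at (4, 4, 4, 4) regardless of the number of cameras follows from the Tucker structure itself, and can be checked numerically on a generic example. The sketch below uses a random 4x4x4x4 core and a random stacked factor matrix as stand-ins; it does not construct actual quadrifocal tensors, and the function names are hypothetical.

```python
import numpy as np

def tucker_from_factor(core, factor):
    """Multiply a (4, 4, 4, 4) core by the same (m, 4) factor matrix along all four modes."""
    return np.einsum("abcd,ia,jb,kc,ld->ijkl", core, factor, factor, factor, factor,
                     optimize=True)

def multilinear_rank(tensor, tol=1e-8):
    """Return the matrix rank of each mode unfolding of the tensor."""
    ranks = []
    for mode in range(tensor.ndim):
        unfolding = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
        ranks.append(int(np.linalg.matrix_rank(unfolding, tol=tol)))
    return tuple(ranks)
```

Stacking more 3x4 blocks into `factor` enlarges the tensor to side length 3n, but each unfolding still factors through the four columns of `factor`, so its rank cannot exceed 4.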

Chinese Translation

在运动结构重建中，四焦点张量捕获的信息比其成对对应物（本质矩阵）更多，然而它们常常被认为是不切实际的，仅具有理论意义。在本研究中，我们通过提供一个新的框架来从相应的四焦点张量集合中恢复$n$个相机，从而挑战这种看法。我们构建了块四焦点张量，并展示其允许进行塔克分解，其因子矩阵为堆叠的相机矩阵，因此其多线性秩为$(4, 4, 4, 4)$，与$n$无关。我们开发了第一个用于四焦点张量的同步算法，利用塔克分解、交替方向乘子法和迭代加权最小二乘法。我们进一步建立了块四焦点、三焦点和双焦点张量之间的关系，并引入了一种联合同步这三种实体的算法。数值实验表明我们的方法在现代数据集上的有效性，指示了在同步中使用高阶信息的潜力和重要性。

cs.CV / 28 / 2602.22644

Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models

插入、播放与强化：一种低成本模块用于稳健的多模态图像理解模型

Lu, Siqi, Xu, Wanying, Zheng, Yongbin, Luan, Wenting, Sun, Peng, Yao, Jianhang

Abstract

Missing modalities present a fundamental challenge in multimodal models, often causing catastrophic performance degradation. Our observations suggest that this fragility stems from an imbalanced learning process, where the model develops an implicit preference for certain modalities, leading to the under-optimization of others. We propose a simple yet efficient method to address this challenge. The central insight of our work is that the dominance relationship between modalities can be effectively discerned and quantified in the frequency domain. To leverage this principle, we first introduce a Frequency Ratio Metric (FRM) to quantify modality preference by analyzing features in the frequency domain. Guided by FRM, we then propose a Multimodal Weight Allocation Module (MWAM), a plug-and-play component that dynamically re-balances the contribution of each branch during training, promoting a more holistic learning paradigm. Extensive experiments demonstrate that MWAM can be seamlessly integrated into diverse architectural backbones, such as those based on CNNs and ViTs. Furthermore, MWAM delivers consistent performance gains across a wide range of tasks and modality combinations. This advancement extends beyond merely optimizing the performance of the base model; it also yields further performance improvements for state-of-the-art methods addressing the missing modality problem.

Chinese Translation

缺失模态是多模态模型面临的一个基本挑战，常常导致灾难性的性能下降。我们的观察表明，这种脆弱性源于不平衡的学习过程，模型对某些模态产生隐含偏好，从而导致其他模态的优化不足。我们提出了一种简单而有效的方法来应对这一挑战。我们工作的核心见解是，模态之间的主导关系可以在频域中有效地辨别和量化。为了利用这一原理，我们首先引入了一种频率比度量（Frequency Ratio Metric, FRM），通过分析频域中的特征来量化模态偏好。在FRM的指导下，我们提出了一种多模态权重分配模块（Multimodal Weight Allocation Module），这是一个即插即用的组件，在训练过程中动态重新平衡每个分支的贡献，促进更全面的学习范式。大量实验表明，MWAM可以无缝集成到多种架构骨干中，如基于卷积神经网络（CNNs）和视觉变换器（ViTs）的模型。此外，MWAM在广泛的任务和模态组合中提供了一致的性能提升。这一进展不仅优化了基础模型的性能；还进一步提升了针对缺失模态问题的最先进方法的性能。

cs.CV / 29 / 2602.22649

Interactive Medical-SAM2 GUI: A Napari-based semi-automatic annotation tool for medical images

交互式 Medical-SAM2 GUI：基于 Napari 的医学图像半自动标注工具

Hong, Woojae, Hwang, Jong Ha, Chung, Jiyong, Choi, Joongyeon, Kim, Hyunngun, Kim, Yong Hwy

Abstract

Interactive Medical-SAM2 GUI is an open-source desktop application for semi-automatic annotation of 2D and 3D medical images. Built on the Napari multi-dimensional viewer, box/point prompting is integrated with SAM2-style propagation by treating a 3D volume as a slice sequence, enabling mask propagation from sparse prompts using Medical-SAM2 on top of SAM2. Voxel-level annotation remains essential for developing and validating medical imaging algorithms, yet manual labeling is slow and expensive for 3D scans, and existing integrations frequently emphasize per-slice interaction without providing a unified, cohort-oriented workflow for navigation, propagation, interactive correction, and quantitative export in a single local pipeline. To address this practical limitation, a local-first Napari workflow is provided for efficient 3D annotation across multiple studies using standard DICOM series and/or NIfTI volumes. Users can annotate cases sequentially under a single root folder with explicit proceed/skip actions, initialize objects via box-first prompting (including first/last-slice initialization for single-object propagation), refine predictions with point prompts, and finalize labels through prompt-first correction prior to saving. During export, per-object volumetry and 3D volume rendering are supported, and image geometry is preserved via SimpleITK. The GUI is implemented in Python using Napari and PyTorch, with optional N4 bias-field correction, and is intended exclusively for research annotation workflows. The code is released on the project page: https://github.com/SKKU-IBE/Medical-SAM2GUI/.

Chinese Translation

交互式 Medical-SAM2 GUI 是一款开源桌面应用程序，用于对 2D 和 3D 医学图像进行半自动标注。该工具基于 Napari 多维查看器构建，通过将 3D 体积视为切片序列，集成了框/点提示与 SAM2 风格的传播，使得可以使用 Medical-SAM2 在 SAM2 的基础上从稀疏提示中进行掩膜传播。体素级标注对于开发和验证医学成像算法至关重要，但手动标注对于 3D 扫描来说速度慢且成本高，现有的集成通常强调逐切片交互，而未提供统一的、以队列为导向的工作流程，以便在单一本地管道中进行导航、传播、交互修正和定量导出。为了解决这一实际限制，提供了一种以本地为先的 Napari 工作流程，以便在多个研究中使用标准 DICOM 系列和/或 NIfTI 体积进行高效的 3D 标注。用户可以在单个根文件夹下顺序标注案例，执行明确的继续/跳过操作，通过框提示初始化对象（包括单对象传播的首/末切片初始化），通过点提示细化预测，并在保存之前通过提示优先修正最终确定标签。在导出过程中，支持每个对象的体积测量和 3D 体积渲染，并通过 SimpleITK 保留图像几何。该 GUI 使用 Python 实现，基于 Napari 和 PyTorch，具有可选的 N4 偏置场校正，专门用于研究标注工作流程。代码已在项目页面发布： https://github.com/SKKU-IBE/Medical-SAM2GUI/.

cs.CV / 30 / 2602.22654

Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache

去噪作为路径规划：基于 DPCache 的无训练加速扩散模型

Cui, Bowen, Wang, Yuanbin, Xu, Huajiang, Chen, Biaolong, Zhang, Aixi, Jiang, Hao, Jin, Zhengzheng, Liu, Xu, Huang, Pipei

Abstract

Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by $+$0.031 ImageReward at 4.87$\times$ speedup and even surpassing the full-step baseline by $+$0.028 ImageReward at 3.54$\times$ speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code will be released at https://github.com/argsss/DPCache.
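
To make the dynamic-programming step concrete, here is a simplified sketch of key-timestep selection. It is an illustration under stated assumptions, not the authors' algorithm: the cost is reduced to a pairwise table `cost[i][j]` (the error of computing fully at timesteps `i` and `j` and predicting everything in between), whereas the abstract describes a tensor that is also conditioned on the preceding key timestep. The function name `select_key_timesteps` and the toy cost are hypothetical.

```python
from math import inf

def select_key_timesteps(cost, num_keys):
    """Pick `num_keys` key timesteps (first and last always included)
    that minimise the summed pairwise skip cost.

    cost[i][j] for i < j: assumed error of jumping from key i to key j.
    Returns (total_cost, sorted list of key timesteps).
    """
    T = len(cost)
    if not 2 <= num_keys <= T:
        raise ValueError("num_keys must be between 2 and the number of timesteps")
    # best[k][j]: minimal cost of a path 0 -> j that uses exactly k + 1 keys
    best = [[inf] * T for _ in range(num_keys)]
    parent = [[-1] * T for _ in range(num_keys)]
    best[0][0] = 0.0
    for k in range(1, num_keys):
        for j in range(1, T):
            for i in range(j):
                cand = best[k - 1][i] + cost[i][j]
                if cand < best[k][j]:
                    best[k][j] = cand
                    parent[k][j] = i
    # Walk the parent pointers back from the final timestep.
    keys, j = [], T - 1
    for k in range(num_keys - 1, -1, -1):
        keys.append(j)
        j = parent[k][j]
    return best[num_keys - 1][T - 1], keys[::-1]

# Toy cost: skipping s intermediate steps costs s squared.
toy_cost = [[(j - i - 1) ** 2 if j > i else 0 for j in range(7)] for i in range(7)]
total, keys = select_key_timesteps(toy_cost, 3)
```

With this toy cost the search spreads the key timesteps evenly, because long skips are penalised quadratically. Conditioning on the preceding key would add one more index to the DP state.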

Chinese Translation

扩散模型在图像和视频生成方面取得了显著成功，但其实际应用仍受到多步迭代采样的巨大计算开销的限制。在加速策略中，基于缓存的方法通过重用或预测跨时间步的特征提供了一种无训练且有效的解决方案。然而，现有方法依赖于固定或局部自适应的调度，而未考虑去噪轨迹的全局结构，常常导致误差累积和视觉伪影。为了解决这一限制，我们提出了 DPCache，一种新颖的无训练加速框架，将扩散采样加速形式化为一个全局路径规划问题。DPCache 从一个小的校准集构建路径感知成本张量，以量化跳过时间步的路径依赖误差，这些误差以先前的关键时间步为条件。利用该张量，DPCache 采用动态规划选择一组最优的关键时间步，以最小化总路径成本，同时保持轨迹的保真度。在推理过程中，模型仅在这些关键时间步执行完整计算，而中间输出则通过缓存特征高效预测。在 DiT、FLUX 和 HunyuanVideo 上的广泛实验表明，DPCache 实现了强劲的加速，且质量损失最小，在 4.87$\times$ 加速下比之前的加速方法提高了 $+$0.031 ImageReward，甚至在 FLUX 上以 3.54$\times$ 加速超越了全步基线 $+$0.028 ImageReward，验证了我们路径感知全局调度框架的有效性。代码将发布在 https://github.com/argsss/DPCache。

cs.CV / 31 / 2602.22659

Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing

通过众包扩展音视频质量评估数据集

Yang, Renyu, Jin, Jian, Meng, Lili, Liu, Meiqin, Wang, Yilin, Adsumilli, Balu, Lin, Weisi

Abstract

Audio-visual quality assessment (AVQA) research has been stalled by limitations of existing datasets: they are typically small in scale, with insufficient diversity in content and quality, and annotated only with overall scores. These shortcomings provide limited support for model development and multimodal perception research. We propose a practical approach for AVQA dataset construction. First, we design a crowdsourced subjective experiment framework for AVQA, which breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. Second, a systematic data preparation strategy is further employed to ensure broad coverage of both quality levels and semantic scenarios. Third, we extend the dataset with additional annotations, enabling research on multimodal perception mechanisms and their relation to content. Finally, we validate this approach through YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, consisting of 1,620 user-generated audio and video (A/V) sequences. The dataset and platform code are available at https://github.com/renyu12/YT-NTU-AVQ.

Chinese Translation

音视频质量评估（AVQA）研究因现有数据集的局限性而停滞不前：这些数据集通常规模较小，内容和质量的多样性不足，并且仅用总体评分进行标注。这些缺陷对模型开发和多模态感知研究提供了有限的支持。我们提出了一种实用的AVQA数据集构建方法。首先，我们设计了一个众包主观实验框架，突破了实验室环境的限制，实现了在不同环境下的可靠标注。其次，进一步采用系统的数据准备策略，以确保对质量水平和语义场景的广泛覆盖。第三，我们通过额外的标注扩展数据集，使得对多模态感知机制及其与内容关系的研究成为可能。最后，我们通过YT-NTU-AVQ验证了该方法，这是迄今为止最大且最具多样性的AVQA数据集，包含1,620个用户生成的音频和视频（A/V）序列。数据集和平台代码可在 https://github.com/renyu12/YT-NTU-AVQ 获取。

cs.CV / 32 / 2602.22666

ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals

ArtPro：自监督的关节物体重建与运动提议的自适应整合

Li, Xuelu, Wang, Zhaonan, Wang, Xiaogang, Wu, Lei, Li, Manyi, Tu, Changhe

Abstract

Reconstructing articulated objects into high-fidelity digital twins is crucial for applications such as robotic manipulation and interactive simulation. Recent self-supervised methods using differentiable rendering frameworks like 3D Gaussian Splatting remain highly sensitive to the initial part segmentation. Their reliance on heuristic clustering or pre-trained models often causes optimization to converge to local minima, especially for complex multi-part objects. To address these limitations, we propose ArtPro, a novel self-supervised framework that introduces adaptive integration of mobility proposals. Our approach begins with an over-segmentation initialization guided by geometry features and motion priors, generating part proposals with plausible motion hypotheses. During optimization, we dynamically merge these proposals by analyzing motion consistency among spatial neighbors, while a collision-aware motion pruning mechanism prevents erroneous kinematic estimation. Extensive experiments on both synthetic and real-world objects demonstrate that ArtPro achieves robust reconstruction of complex multi-part objects, significantly outperforming existing methods in accuracy and stability.

Chinese Translation

将关节物体重建为高保真数字双胞胎对于机器人操作和交互式仿真等应用至关重要。最近使用可微渲染框架（如3D Gaussian Splatting）的自监督方法对初始部件分割高度敏感。它们对启发式聚类或预训练模型的依赖常常导致优化收敛到局部最小值，尤其是在复杂的多部件物体中。为了解决这些局限性，我们提出了ArtPro，一种新颖的自监督框架，介绍了运动提议的自适应整合。我们的方法以几何特征和运动先验引导的过度分割初始化开始，生成具有合理运动假设的部件提议。在优化过程中，我们通过分析空间邻域之间的运动一致性动态合并这些提议，同时，碰撞感知的运动修剪机制防止了错误的运动学估计。在合成和真实世界物体上的广泛实验表明，ArtPro能够稳健地重建复杂的多部件物体，在准确性和稳定性上显著优于现有方法。

cs.CV / 33 / 2602.22667

Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

单目开放词汇室内场景占用预测

Zhou, Changqing, Luo, Yueru, Zhang, Han, Jiang, Zeyu, Chen, Changhao

Abstract

Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at https://github.com/JuIvyy/LegoOcc.
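
For readers unfamiliar with the two reported metrics, the sketch below shows how IoU and mIoU are commonly computed over voxel labels. It follows the usual convention (IoU on the binary occupied/free split, mIoU as the mean of per-class IoUs) and is an assumption about the general metric, not the exact Occ-ScanNet evaluation protocol. The function names and the flattened-list input are hypothetical.

```python
def occupancy_iou(pred, gt, empty=0):
    """Binary IoU: a voxel counts as occupied if its label differs from `empty`."""
    inter = sum(1 for p, g in zip(pred, gt) if p != empty and g != empty)
    union = sum(1 for p, g in zip(pred, gt) if p != empty or g != empty)
    return inter / union if union else 1.0

def mean_iou(pred, gt, classes):
    """Mean of per-class IoUs over semantic classes that occur in pred or gt."""
    ious = []
    for c in classes:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Flattened toy voxel grids: 0 = free, 1 and 2 = semantic classes.
pred_voxels = [0, 1, 1, 2, 2, 0]
gt_voxels = [0, 1, 2, 2, 0, 0]
iou = occupancy_iou(pred_voxels, gt_voxels)
miou = mean_iou(pred_voxels, gt_voxels, classes=[1, 2])
```

The toy case shows why the two numbers can differ widely: a voxel predicted as occupied but with the wrong class still counts toward IoU while lowering mIoU.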

Chinese Translation

开放词汇的三维占用对于具身智能体至关重要，这些智能体需要理解复杂的室内环境，其中语义类别丰富且超越固定的分类法。尽管近期的研究探讨了户外驾驶场景中的开放词汇占用，但此类方法在室内环境中的迁移效果较差，室内几何结构更为密集，布局更为复杂，语义也更为细致。为了解决这些挑战，我们采用了一种仅基于几何的监督范式，仅使用二元占用标签（占用与空闲）。我们的框架建立在三维语言嵌入高斯模型之上，作为一种统一的中间表示，将细粒度的三维几何与语言对齐的语义嵌入相结合。在几何方面，我们发现现有的高斯到占用的操作在这种弱监督下无法收敛，因此我们引入了一种考虑不透明度的基于泊松的方法，以稳定体积聚合。在语义方面，渲染特征与开放词汇分割特征之间的直接对齐受到特征混合的影响；因此，我们提出了一种渐进温度衰减调度，在点云渲染过程中逐步增强不透明度，从而加强高斯与语言的对齐。在Occ-ScanNet数据集上，我们的框架在开放词汇设置下达到了59.50的IoU和21.05的mIoU，超越了所有现有的占用方法，并在mIoU上大幅领先于之前的开放词汇方法。代码将发布在https://github.com/JuIvyy/LegoOcc。

cs.CV / 34 / 2602.22674

SPMamba-YOLO: An Underwater Object Detection Network Based on Multi-Scale Feature Enhancement and Global Context Modeling

SPMamba-YOLO：基于多尺度特征增强和全局上下文建模的水下目标检测网络

Liao, Guanghao, Liu, Zhen, Cao, Liyuan, Yang, Yonghui, Li, Qi

Abstract

Underwater object detection is a critical yet challenging research problem owing to severe light attenuation, color distortion, background clutter, and the small scale of underwater targets. To address these challenges, we propose SPMamba-YOLO, a novel underwater object detection network that integrates multi-scale feature enhancement with global context modeling. Specifically, a Spatial Pyramid Pooling Enhanced Layer Aggregation Network (SPPELAN) module is introduced to strengthen multi-scale feature aggregation and expand the receptive field, while a Pyramid Split Attention (PSA) mechanism enhances feature discrimination by emphasizing informative regions and suppressing background interference. In addition, a Mamba-based state space modeling module is incorporated to efficiently capture long-range dependencies and global contextual information, thereby improving detection robustness in complex underwater environments. Extensive experiments on the URPC2022 dataset demonstrate that SPMamba-YOLO outperforms the YOLOv8n baseline by more than 4.9% in mAP@0.5, particularly for small and densely distributed underwater objects, while maintaining a favorable balance between detection accuracy and computational cost.

Chinese Translation

水下目标检测是一个关键但具有挑战性的研究问题，主要由于严重的光衰减、颜色失真、背景杂乱以及水下目标的小尺度。为了解决这些挑战，我们提出了SPMamba-YOLO，这是一种新颖的水下目标检测网络，集成了多尺度特征增强与全局上下文建模。具体而言，引入了空间金字塔池化增强层聚合网络（SPPELAN）模块，以增强多尺度特征聚合并扩展感受野，同时金字塔分割注意力（PSA）机制通过强调信息丰富的区域和抑制背景干扰来增强特征区分能力。此外，结合了基于Mamba的状态空间建模模块，以高效捕捉长距离依赖关系和全局上下文信息，从而提高在复杂水下环境中的检测鲁棒性。在URPC2022数据集上进行的广泛实验表明，SPMamba-YOLO在mAP@0.5上比YOLOv8n基线提高了4.9%以上，特别是在小型和密集分布的水下目标检测中，同时在检测准确性和计算成本之间保持了良好的平衡。

cs.CV / 35 / 2602.22678

ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

ViCLIP-OT：首个针对越南图像-文本检索的基础视觉-语言模型，采用最优传输

Tran, Quoc-Khang, Nguyen, Minh-Thien, Pham, Nguyen-Khang

Abstract

Image-text retrieval has become a fundamental component in intelligent multimedia systems; however, most existing vision-language models are optimized for high-resource languages and remain suboptimal for low-resource settings such as Vietnamese. This work introduces ViCLIP-OT, a foundation vision-language model specifically designed for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate modality gap issues. Extensive experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) demonstrate that ViCLIP-OT consistently outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, the model achieves an average Recall@K of 67.34%, improving upon CLIP by 5.75 percentage points. In zero-shot evaluation on Crossmodal-3600, ViCLIP-OT surpasses CLIP by 11.72 percentage points. Embedding-space analysis further confirms improved alignment and reduced modality gap. The results indicate that integrating SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages, offering practical implications for intelligent multimedia retrieval systems in Vietnamese and other underrepresented linguistic contexts.
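
The headline metric, average Recall@K, can be sketched as follows. This is the generic retrieval metric, not the paper's evaluation script: the assumption that query `i` has exactly one correct candidate at index `i`, the choice of K values being averaged, and the function names are all illustrative.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose true match is ranked within the top k.

    sim[i][j]: similarity between query i and candidate j.
    Assumes the single correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the true candidate = number of candidates scored strictly higher.
        rank = sum(1 for s in row if s > row[i])
        hits += rank < k
    return hits / len(sim)

def average_recall(sim, ks):
    """Mean of Recall@K over a chosen set of K values (choice is an assumption)."""
    return sum(recall_at_k(sim, k) for k in ks) / len(ks)

toy_sim = [
    [0.9, 0.1, 0.2],  # query 0: true match ranked first
    [0.8, 0.7, 0.1],  # query 1: true match ranked second
    [0.3, 0.6, 0.5],  # query 2: true match ranked second
]
r1 = recall_at_k(toy_sim, 1)
r2 = recall_at_k(toy_sim, 2)
avg = average_recall(toy_sim, ks=(1, 2))
```

A gain of a few percentage points in this average means that, across the chosen K values, more queries find their correct caption or image near the top of the ranking.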

Chinese Translation

图像-文本检索已成为智能多媒体系统的基础组成部分；然而，大多数现有的视觉-语言模型都是针对高资源语言进行优化的，在越南等低资源环境中表现不佳。本研究提出了ViCLIP-OT，这是一种专门为越南图像-文本检索设计的基础视觉-语言模型。所提出的框架将CLIP风格的对比学习与相似性图正则化最优传输（Similarity-Graph Regularized Optimal Transport, SIGROT）损失相结合，以增强全局跨模态一致性并减轻模态间差距问题。在三个越南基准数据集（UIT-OpenViIC、KTVIC和Crossmodal-3600）上的广泛实验表明，ViCLIP-OT在领域内和零样本设置中均持续优于CLIP和SigLIP基线。在UIT-OpenViIC上，该模型的平均Recall@K达到了67.34%，比CLIP提高了5.75个百分点。在Crossmodal-3600的零样本评估中，ViCLIP-OT超越了CLIP，提升了11.72个百分点。嵌入空间分析进一步确认了对齐的改善和模态间差距的减少。结果表明，集成SIGROT为低资源语言中的跨模态检索提供了一种有效且可扩展的策略，为越南及其他代表性不足的语言环境中的智能多媒体检索系统提供了实际意义。

cs.CV / 36 / 2602.22683

SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

超级眼镜：将视觉语言模型作为智能代理用于人工智能智能眼镜的基准测试

Jiang, Zhuohang, Yuan, Xu, Qu, Haohao, Lin, Shanru, Liu, Kanglong, Fan, Wenqi, Li, Qing

Abstract

The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. Our agent achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent, and highlights the need for task-specific solutions in smart glasses VQA scenarios.

Chinese Translation

人工智能驱动的智能眼镜作为最热门的可穿戴设备之一，快速发展的进程为多模态交互开辟了新的前沿，其中基于外部知识源的视觉问答（VQA）已成为核心应用。现有适用于智能眼镜的视觉语言模型（VLMs）通常在传统的多模态数据集上进行训练和评估；然而，这些数据集缺乏反映智能眼镜使用场景所需的多样性和真实感，并且未能应对其特定挑战，其中准确识别感兴趣对象必须在任何外部知识检索之前进行。为了解决这一问题，我们引入了SUPERGLASSES，这是第一个基于完全由智能眼镜设备收集的真实世界数据构建的综合VQA基准。SUPERGLASSES包含2,422个以自我为中心的图像-问题对，涵盖14个图像领域和8个查询类别，配备完整的搜索轨迹和推理注释。我们在该基准上评估了26个具有代表性的VLM，揭示了显著的性能差距。为了解决现有模型的局限性，我们进一步提出了SUPERLENS，这是一种多模态智能眼镜代理，通过整合自动对象检测、查询解耦和多模态网页搜索，实现了增强检索的答案生成。我们的代理实现了最先进的性能，超越了GPT-4o 2.19个百分点，并强调了在智能眼镜VQA场景中对特定任务解决方案的需求。

cs.CV / 37 / 2602.22689

No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings

无标题，无问题：基于模型拟合嵌入的无标题成员推断

Jeon, Joonsung, Kim, Woo Jae, Ha, Suhyeon, Son, Sooel, Yoon, Sung-Eui

Abstract

Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, rendering prior methods ineffective when substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model's generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized to construct a surrogate in regions of the model's unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.

Chinese Translation

潜在扩散模型在高保真文本到图像生成方面取得了显著成功，但其倾向于记忆训练数据引发了严重的隐私和知识产权问题。成员推断攻击（MIAs）提供了一种原则性的方法来审计这种记忆，通过确定给定样本是否包含在训练中。然而，现有方法假设可以访问真实的标题。这一假设在只有图像可用且其文本注释未公开的现实场景中失效，使得先前的方法在用视觉-语言模型（VLM）标题替代时无效。在本研究中，我们提出了MoFit，一个无标题的MIA框架，构建显式过拟合于目标模型生成流形的合成条件输入。给定查询图像，MoFit分为两个阶段：（i）模型拟合的替代优化，其中对图像施加的扰动被优化以构建在从成员样本学习的模型无条件先验区域中的替代；（ii）基于替代的嵌入提取，其中从替代中导出模型拟合的嵌入，然后作为查询图像的不匹配条件使用。该嵌入增强了成员样本的条件损失响应，同时使得持有样本相对不受影响，从而在缺乏真实标题的情况下增强可分性。我们在多个数据集和扩散模型上的全面实验表明，MoFit始终优于先前的VLM条件基线，并在性能上与依赖标题的方法具有竞争力。

cs.CV / 38 / 2602.22695

GFRRN: Explore the Gaps in Single Image Reflection Removal

GFRRN：探索单幅图像反射去除中的差距

Chen, Yu, He, Zewei, Liu, Xingyu, Chen, Zixuan, Lu, Zheming

Abstract

Prior dual-stream methods with the feature interaction mechanism have achieved remarkable performance in single image reflection removal (SIRR). However, they often struggle with (1) semantic understanding gap between the features of pre-trained models and those of reflection removal models, and (2) reflection label inconsistencies between synthetic and real-world training data. In this work, we first adopt the parameter efficient fine-tuning (PEFT) strategy by integrating several learnable Mona layers into the pre-trained model to align the training directions. Then, a label generator is designed to unify the reflection labels for both synthetic and real-world data. In addition, a Gaussian-based Adaptive Frequency Learning Block (G-AFLB) is proposed to adaptively learn and fuse the frequency priors, and a Dynamic Agent Attention (DAA) is employed as an alternative to window-based attention by dynamically modeling the significance levels across windows (inter-) and within an individual window (intra-). These components constitute our proposed Gap-Free Reflection Removal Network (GFRRN). Extensive experiments demonstrate the effectiveness of our GFRRN, achieving superior performance against state-of-the-art SIRR methods.

Chinese Translation

先前的双流方法通过特征交互机制在单幅图像反射去除（SIRR）中取得了显著的性能。然而，它们常常面临以下挑战：(1) 预训练模型的特征与反射去除模型的特征之间的语义理解差距，以及 (2) 合成数据与真实世界训练数据之间的反射标签不一致。在本研究中，我们首先通过将多个可学习的 Mona 层集成到预训练模型中，采用参数高效微调（PEFT）策略，以对齐训练方向。然后，设计了一个标签生成器，以统一合成数据和真实世界数据的反射标签。此外，提出了一种基于高斯的自适应频率学习块（G-AFLB），用于自适应学习和融合频率先验，并采用动态代理注意力（DAA）作为基于窗口的注意力的替代方案，通过动态建模窗口之间（inter-）和单个窗口内（intra-）的重要性水平。这些组件构成了我们提出的无差距反射去除网络（GFRRN）。大量实验表明，我们的 GFRRN 的有效性，其性能优于最先进的 SIRR 方法。

cs.CV / 39 / 2602.22712

UFO-DETR: Frequency-Guided End-to-End Detector for UAV Tiny Objects

UFO-DETR：基于频率引导的无人机微小目标端到端检测器

Chen, Yuankai, Lin, Kai, Wu, Qihong, Yang, Xinxuan, Lai, Jiashuo, Chen, Ruoen, Shi, Haonan, He, Minfan, Wang, Meihua

Abstract

Small target detection in UAV imagery faces significant challenges such as scale variations, dense distribution, and the dominance of small targets. Existing algorithms rely on manually designed components, and general-purpose detectors are not optimized for UAV images, making it difficult to balance accuracy and complexity. To address these challenges, this paper proposes an end-to-end object detection framework, UFO-DETR, which integrates an LSKNet-based backbone network to optimize the receptive field and reduce the number of parameters. By combining the DAttention and AIFI modules, the model flexibly models multi-scale spatial relationships, improving multi-scale target detection performance. Additionally, the DynFreq-C3 module is proposed to enhance small target detection capability through cross-space frequency feature enhancement. Experimental results show that, compared to RT-DETR-L, the proposed method offers significant advantages in both detection performance and computational efficiency, providing an efficient solution for UAV edge computing.

Chinese Translation

无人机图像中的小目标检测面临着显著的挑战，如尺度变化、密集分布以及小目标的主导性。现有算法依赖于手动设计的组件，而通用检测器并未针对无人机图像进行优化，这使得在准确性和复杂性之间取得平衡变得困难。为了解决这些挑战，本文提出了一种端到端的目标检测框架UFO-DETR，该框架集成了基于LSKNet的主干网络，以优化感受野并减少参数数量。通过结合DAttention和AIFI模块，该模型灵活地建模多尺度空间关系，提高了多尺度目标检测性能。此外，提出了DynFreq-C3模块，通过跨空间频率特征增强来提升小目标检测能力。实验结果表明，与RT-DETR-L相比，所提方法在检测性能和计算效率上均具有显著优势，为无人机边缘计算提供了一种高效的解决方案。

cs.CV / 40 / 2602.22716

SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs

SoPE：基于球坐标的位置信息嵌入以增强3D大视觉语言模型的空间感知

Ye, Guanting, Zhao, Qiyan, Yu, Wenhao, Yuan, Liangyu, Li, Mingkai, Zhang, Xiaofeng, Ji, Jianmin, Zhang, Yanyong, Jiang, Qing, Yuen, Ka-Veng

Abstract

3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies, hindering the model's ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate-based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning. In addition, we introduce a multi-scale frequency mixing strategy to fuse feature information across different frequency domains. Experimental results on multiple 3D scene benchmarks validate the effectiveness of our approach, while real-world deployment experiments further demonstrate its strong generalization capability.
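
The coordinate change at the heart of this idea can be shown with the standard Cartesian-to-spherical conversion. The sketch below is only that textbook conversion; how the paper maps token indices to positions and how the resulting radius and angles enter the rotary embedding are not described in the abstract and are not modelled here. The function name `to_spherical` is hypothetical.

```python
import math

def to_spherical(x, y, z):
    """Convert a 3D point to (r, theta, phi).

    r:     distance from the origin
    theta: polar angle from the +z axis, in [0, pi]
    phi:   azimuth in the x-y plane, in (-pi, pi]
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # angles are undefined at the origin; use zeros
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, z / r)))
    phi = math.atan2(y, x)
    return r, theta, phi

# Two points at the same distance from the origin but in different directions.
a = to_spherical(0.0, 0.0, 2.0)
b = to_spherical(2.0, 0.0, 0.0)
```

Points `a` and `b` share the same radius and differ only in their polar angle, which is the kind of directional difference that a distance-only encoding cannot separate.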

Chinese Translation

基于大型语言模型（LLMs）构建的3D大型视觉语言模型（3D LVLMs）在各种多模态任务中取得了显著进展。然而，它们继承的依赖位置的建模机制——旋转位置嵌入（RoPE）在3D多模态理解中仍然存在不足。传统的RoPE公式在编码3D标记时未能保留重要的三维空间结构，其相对距离计算忽视了角度依赖性，阻碍了模型捕捉视觉表示中方向变化的能力。为克服这些局限性，我们提出了基于球坐标的位置信息嵌入（SoPE）。我们的方法将点云标记索引映射到3D球坐标空间，从而实现空间位置和方向角的统一建模。这一公式保留了点云数据的固有几何结构，增强了空间意识，并为多模态学习提供了更一致和更具表现力的几何表示。此外，我们引入了一种多尺度频率混合策略，以融合不同频域的特征信息。在多个3D场景基准上的实验结果验证了我们方法的有效性，而实际部署实验进一步证明了其强大的泛化能力。

cs.CV / 41 / 2602.22717

IRSDE-Despeckle: A Physics-Grounded Diffusion Model for Generalizable Ultrasound Despeckling

IRSDE-去斑：一种基于物理的扩散模型用于通用超声去斑

Chen, Shuoqi, Wu, Yujia, Luke, Geoffrey P.

Abstract

Ultrasound imaging is widely used for real-time, noninvasive diagnosis, but speckle and related artifacts reduce image quality and can hinder interpretation. We present a diffusion-based ultrasound despeckling method built on the Image Restoration Stochastic Differential Equations framework. To enable supervised training, we curate large paired datasets by simulating ultrasound images from speckle-free magnetic resonance images using the Matlab UltraSound Toolbox. The proposed model reconstructs speckle-suppressed images while preserving anatomically meaningful edges and contrast. On a held-out simulated test set, our approach consistently outperforms classical filters and recent learning-based despeckling baselines. We quantify prediction uncertainty via cross-model variance and show that higher uncertainty correlates with higher reconstruction error, providing a practical indicator of difficult or failure-prone regions. Finally, we evaluate sensitivity to simulation probe settings and observe domain shift, motivating diversified training and adaptation for robust clinical deployment.
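
The uncertainty estimate described above can be sketched as a per-pixel variance across the outputs of several models. How the models in the ensemble differ (seeds, checkpoints, or sampling runs) is not stated in the abstract, so the input here is simply assumed to be a list of flattened predictions, and the function name is hypothetical.

```python
from statistics import mean, pvariance

def ensemble_mean_and_variance(predictions):
    """Per-pixel mean and population variance across model outputs.

    predictions: list of M flattened images (equal-length lists of floats),
                 one per model (hypothetical input layout).
    Returns (mean_image, variance_image) as flat lists.
    """
    pixels = list(zip(*predictions))  # regroup values pixel by pixel
    return [mean(p) for p in pixels], [pvariance(p) for p in pixels]

# Three models agree on the first pixel and disagree on the second.
toy_predictions = [
    [0.2, 0.5],
    [0.2, 0.7],
    [0.2, 0.9],
]
mean_img, var_img = ensemble_mean_and_variance(toy_predictions)
```

Pixels where the models disagree get a larger variance, and the abstract reports that such regions tend to coincide with higher reconstruction error.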

Chinese Translation

超声成像广泛用于实时、非侵入性的诊断，但斑点及相关伪影降低了图像质量并可能妨碍解读。我们提出了一种基于扩散的超声去斑方法，该方法基于图像恢复随机微分方程（Image Restoration Stochastic Differential Equations）框架。为了实现监督训练，我们通过使用Matlab超声工具箱从无斑点的磁共振图像模拟超声图像，构建了大型配对数据集。所提出的模型在重建抑制斑点的图像时，能够保持解剖学上有意义的边缘和对比度。在一个保留的模拟测试集上，我们的方法始终优于经典滤波器和近期基于学习的去斑基线。我们通过跨模型方差量化预测不确定性，并显示出更高的不确定性与更高的重建误差相关，为困难或易失败区域提供了一个实用的指标。最后，我们评估了对模拟探头设置的敏感性，并观察到领域转移，这促使我们进行多样化训练和适应，以实现稳健的临床部署。

cs.CV / 42 / 2602.22727

HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models

HulluEdit：单次通过的证据一致子空间编辑以减轻大型视觉语言模型中的幻觉

Lin, Yangguang, Fang, Quan, Li, Yufei, Sun, Jiachen, Gao, Junyu, Sang, Jitao

Abstract

Object hallucination in Large Vision-Language Models (LVLMs) significantly hinders their reliable deployment. Existing methods struggle to balance efficiency and accuracy: they often require expensive reference models and multiple forward passes, or apply static edits that risk suppressing genuine visual evidence. To address this, we introduce HulluEdit, a single-pass, reference-free intervention framework. Our core innovation is orthogonal subspace editing: we decompose the hidden states of the model into orthogonal subspaces - visual evidence, conflicting priors, and residual uncertainty - enabling selective suppression of hallucinatory patterns without interfering with visual grounding. This approach mathematically guarantees that edits applied to the prior subspace leave the visual component entirely unaffected. Extensive experiments show that HulluEdit achieves state-of-the-art hallucination reduction on benchmarks including POPE and CHAIR across diverse architectures, while preserving general capabilities on MME and maintaining efficient inference. Our method consistently outperforms contrastive decoding and static subspace editing baselines, offering a new pathway toward more trustworthy LVLMs.
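
The orthogonality argument can be checked on a toy example. The sketch below assumes the visual and prior subspaces are already given as mutually orthogonal, orthonormal basis vectors; how the method actually estimates them is not described in the abstract. The helper names `project` and `suppress_prior` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(h, basis):
    """Projection of vector h onto the span of orthonormal basis vectors."""
    out = [0.0] * len(h)
    for b in basis:
        c = dot(h, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out

def suppress_prior(h, prior_basis, strength):
    """Shrink the component of h lying in the prior subspace.

    strength in [0, 1]: 0 leaves h unchanged, 1 removes the prior component.
    """
    h_prior = project(h, prior_basis)
    return [x - strength * p for x, p in zip(h, h_prior)]

# Toy 4-D hidden state with axis-aligned, mutually orthogonal subspaces.
visual_basis = [[1.0, 0.0, 0.0, 0.0]]
prior_basis = [[0.0, 1.0, 0.0, 0.0]]
hidden = [3.0, 4.0, 5.0, 6.0]
edited = suppress_prior(hidden, prior_basis, strength=0.5)
```

Because the edit only subtracts a vector inside the prior subspace, and that subspace is orthogonal to the visual one, the projection of `edited` onto the visual basis equals that of `hidden`.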

Chinese Translation

大型视觉语言模型（LVLMs）中的物体幻觉显著妨碍了其可靠部署。现有方法在效率和准确性之间难以取得平衡：它们通常需要昂贵的参考模型和多次前向传递，或者应用静态编辑，这可能会抑制真实的视觉证据。为了解决这个问题，我们提出了HulluEdit，一个单次通过、无参考的干预框架。我们的核心创新是正交子空间编辑：我们将模型的隐藏状态分解为正交子空间——视觉证据、冲突先验和残余不确定性——从而实现选择性抑制幻觉模式，而不干扰视觉基础。这种方法在数学上保证了对先验子空间应用的编辑不会对视觉组件产生任何影响。大量实验表明，HulluEdit在包括POPE和CHAIR在内的基准测试中，在多种架构上实现了最先进的幻觉减少，同时保持了MME上的通用能力，并维持高效推理。我们的方法在对比解码和静态子空间编辑基线中始终表现优越，为更可靠的LVLMs提供了一条新路径。

cs.CV / 43 / 2602.22734

Asymmetric Idiosyncrasies in Multimodal Models

多模态模型中的不对称特征

Tao, Muzi, Shi, Chufan, Wang, Huijuan, Tong, Shengbang, Ma, Xuezhe

Abstract

In this work, we study idiosyncrasies in the caption models and their downstream impact on text-to-image models. We design a systematic analysis: given either a generated caption or the corresponding image, we train neural networks to predict the originating caption model. Our results show that text classification yields very high accuracy (99.70%), indicating that captioning models embed distinctive stylistic signatures. In contrast, these signatures largely disappear in the generated images, with classification accuracy dropping to at most 50% even for the state-of-the-art Flux model. To better understand this cross-modal discrepancy, we further analyze the data and find that the generated images fail to preserve key variations present in captions, such as differences in the level of detail, emphasis on color and texture, and the distribution of objects within a scene. Overall, our classification-based framework provides a novel methodology for quantifying both the stylistic idiosyncrasies of caption models and the prompt-following ability of text-to-image systems.

Chinese Translation

在本研究中，我们探讨了字幕模型中的特征及其对文本到图像模型的下游影响。我们设计了一种系统分析方法：给定生成的字幕或相应的图像，我们训练神经网络以预测源自的字幕模型。我们的结果表明，文本分类的准确率非常高（99.70%），这表明字幕模型嵌入了独特的风格特征。相反，这些特征在生成的图像中大部分消失，分类准确率即使对于最先进的Flux模型也最多下降至50%。为了更好地理解这种跨模态差异，我们进一步分析了数据，发现生成的图像未能保留字幕中存在的关键变化，例如细节水平的差异、对颜色和纹理的强调，以及场景中物体的分布。总体而言，我们基于分类的框架提供了一种新颖的方法论，用于量化字幕模型的风格特征以及文本到图像系统的提示跟随能力。

cs.CV / 44 / 2602.22740

AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation

AMLRIS：面向对齐的掩蔽学习用于指称图像分割

Chen, Tongfei, Yang, Shuo, Yang, Yuguang, Yang, Linlin, Guo, Runtang, Li, Changbai, Long, He, Xie, Chunyu, Leng, Dawei, Zhang, Baochang

Abstract

Referring Image Segmentation (RIS) aims to segment an object in an image identified by a natural language expression. The paper introduces Alignment-Aware Masked Learning (AML), a training strategy to enhance RIS by explicitly estimating pixel-level vision-language alignment, filtering out poorly aligned regions during optimization, and focusing on trustworthy cues. This approach results in state-of-the-art performance on RefCOCO datasets and also enhances robustness to diverse descriptions and scenarios.

Chinese Translation

指称图像分割（RIS）旨在根据自然语言表达对图像中的对象进行分割。本文提出了一种面向对齐的掩蔽学习（AML）训练策略，通过显式估计像素级的视觉-语言对齐，过滤优化过程中对齐较差的区域，并专注于可信的线索，从而增强RIS。这种方法在RefCOCO数据集上实现了最先进的性能，并提高了对多样化描述和场景的鲁棒性。

cs.CV / 45 / 2602.22742

ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control

ProjFlow：基于流匹配的投影采样用于零-shot精确空间运动控制

Watanabe, Akihisa, Yu, Qing, Simo-Serra, Edgar, Fujiwara, Kent

Abstract

Generating human motion with precise spatial control is a challenging problem. Existing approaches often require task-specific training or slow optimization, and enforcing hard constraints frequently disrupts motion naturalness. Building on the observation that many animation tasks can be formulated as a linear inverse problem, we introduce ProjFlow, a training-free sampler that achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. Our key advance is a novel kinematics-aware metric that encodes skeletal topology. This metric allows the sampler to enforce hard constraints by distributing corrections coherently across the entire skeleton, avoiding the unnatural artifacts of naive projection. Furthermore, for sparse inputs, such as filling in long gaps between a few keyframes, we introduce a time-varying formulation using pseudo-observations that fade during sampling. Extensive experiments on two representative applications, motion inpainting and 2D-to-3D lifting, demonstrate that ProjFlow achieves exact constraint satisfaction and matches or improves realism over zero-shot baselines, while remaining competitive with training-based controllers.
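
The role of a metric in projection can be illustrated with the simplest case: one linear constraint and a diagonal metric. This is a stand-in chosen for brevity, not the paper's kinematics-aware metric, which encodes skeletal topology and is not specified in the abstract. The sketch finds the point closest to `x` under the weighted squared distance `sum(w[i] * (x_new[i] - x[i]) ** 2)` subject to `dot(a, x_new) == y`. All names are hypothetical.

```python
def project_onto_constraint(x, a, y, w):
    """Metric-weighted projection of x onto the hyperplane a . x = y.

    x: current values (e.g. joint coordinates), list of floats
    a: constraint coefficients, list of floats (not all zero)
    y: required value of a . x
    w: positive per-coordinate weights; a larger weight means that
       coordinate is more expensive to move and receives less correction
    """
    residual = y - sum(ai * xi for ai, xi in zip(a, x))
    scale = sum(ai * ai / wi for ai, wi in zip(a, w))
    return [xi + residual * (ai / wi) / scale for xi, ai, wi in zip(x, a, w)]

x0 = [1.0, 2.0, 3.0]
a0 = [1.0, 1.0, 1.0]

# Uniform metric: the correction is shared equally.
uniform = project_onto_constraint(x0, a0, 9.0, [1.0, 1.0, 1.0])

# Non-uniform metric: the cheapest coordinate absorbs most of the correction.
weighted = project_onto_constraint(x0, a0, 9.0, [1.0, 2.0, 2.0])
```

Both results satisfy the constraint exactly; only the way the correction is spread across coordinates changes with the metric, which is the property the abstract attributes to its skeleton-aware choice.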

Chinese Translation

生成具有精确空间控制的人类运动是一个具有挑战性的问题。现有的方法通常需要特定任务的训练或缓慢的优化，并且强制执行硬约束常常会破坏运动的自然性。基于许多动画任务可以被表述为线性逆问题的观察，我们提出了ProjFlow，这是一种无训练的采样器，能够在保持运动真实感的同时，实现零-shot的线性空间约束的精确满足。我们的关键进展是提出了一种新颖的运动学感知度量，该度量编码了骨骼拓扑。该度量使得采样器能够通过在整个骨骼上连贯地分配修正来强制执行硬约束，避免了简单投影带来的不自然伪影。此外，对于稀疏输入，例如填补几个关键帧之间的长间隙，我们引入了一种使用伪观测的时变表述，这些伪观测在采样过程中逐渐消失。在代表性应用、运动修复和2D到3D提升的广泛实验中，ProjFlow实现了精确的约束满足，并在真实感上与零-shot基线相匹配或有所提升，同时在与基于训练的控制器的竞争中保持了竞争力。

cs.CV / 46 / 2602.22745

SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation

SPATIALALIGN：对视频生成中动态空间关系的对齐

Liu, Fengming, Cham, Tat-Jen, Zheng, Chuanxia

Abstract

Most text-to-video (T2V) generators prioritize aesthetic quality but often ignore the spatial constraints in the generated videos. In this work, we present SPATIALALIGN, a self-improvement framework that enhances T2V models' ability to depict Dynamic Spatial Relationships (DSR) specified in text prompts. We present a zeroth-order regularized Direct Preference Optimization (DPO) to fine-tune T2V models towards better alignment with DSR. Specifically, we design DSR-SCORE, a geometry-based metric that quantitatively measures the alignment between generated videos and the specified DSRs in prompts, which is a step forward from prior works that rely on VLMs for evaluation. We also construct a dataset of text-video pairs with diverse DSRs to facilitate the study. Extensive experiments demonstrate that our fine-tuned model significantly outperforms the baseline in spatial relationships. The code will be released in Link.

Chinese Translation

大多数文本到视频（T2V）生成器优先考虑美学质量，但往往忽视生成视频中的空间约束。在本研究中，我们提出了SPATIALALIGN，一个自我改进框架，增强T2V模型描绘文本提示中指定的动态空间关系（DSR）的能力。我们提出了一种零阶正则化的直接偏好优化（DPO）方法，以微调T2V模型，使其更好地与DSR对齐。具体而言，我们设计了DSR-SCORE，这是一种基于几何的度量，定量测量生成视频与提示中指定的DSR之间的对齐程度，这比依赖于视觉语言模型（VLM）进行评估的先前工作更进一步。我们还构建了一个包含多样化DSR的文本-视频对数据集，以促进研究。大量实验表明，我们微调的模型在空间关系方面显著优于基线模型。代码将发布在链接中。
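
For readers unfamiliar with DPO, the standard objective for a single preference pair can be sketched as follows. This is a minimal illustration of plain DPO, not the zeroth-order regularized variant described in the abstract; the function name `dpo_loss`, the default `beta`, and the use of scalar log-likelihoods are assumptions made for the sketch. How the paper obtains these quantities for a video model, and how DSR-SCORE is used to rank samples, is not stated in the abstract.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair of log-likelihoods."""
    # Implicit reward margin: how much more the policy favours the preferred
    # sample over the rejected one, relative to the frozen reference model.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Hypothetical numbers: the policy already leans towards the preferred sample.
loss = dpo_loss(logp_win=-1.0, logp_lose=-3.0, ref_logp_win=-2.0, ref_logp_lose=-2.0)
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; it falls below that value as the policy moves towards the preferred sample.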

cs.CV / 47 / 2602.22759

Beyond Detection: Multi-Scale Hidden-Code for Natural Image Deepfake Recovery and Factual Retrieval

超越检测：自然图像深伪恢复与事实检索的多尺度隐编码

Chen, Yuan-Chih, Lu, Chun-Shien

Abstract

Recent advances in image authenticity have primarily focused on deepfake detection and localization, leaving recovery of tampered contents for factual retrieval relatively underexplored. We propose a unified hidden-code recovery framework that enables both retrieval and restoration from post-hoc and in-generation watermarking paradigms. Our method encodes semantic and perceptual information into a compact hidden-code representation, refined through multi-scale vector quantization, and enhances contextual reasoning via conditional Transformer modules. To enable systematic evaluation for natural images, we construct ImageNet-S, a benchmark that provides paired image-label factual retrieval tasks. Extensive experiments on ImageNet-S demonstrate that our method exhibits promising retrieval and reconstruction performance while remaining fully compatible with diverse watermarking pipelines. This framework establishes a foundation for general-purpose image recovery beyond detection and localization.

Chinese Translation

近期在图像真实性方面的进展主要集中于深伪检测和定位，而对篡改内容的恢复以进行事实检索的研究相对较少。我们提出了一个统一的隐编码恢复框架，能够实现后期和生成水印范式下的检索与恢复。我们的方法将语义和感知信息编码为紧凑的隐编码表示，通过多尺度向量量化进行优化，并通过条件Transformer模块增强上下文推理能力。为了实现对自然图像的系统评估，我们构建了ImageNet-S，这是一个提供成对图像-标签事实检索任务的基准。对ImageNet-S的广泛实验表明，我们的方法在检索和重建性能上表现出良好的前景，同时与多种水印管道完全兼容。该框架为超越检测和定位的通用图像恢复奠定了基础。

cs.CV / 48 / 2602.22779

TrajTok: Learning Trajectory Tokens enables better Video Understanding

TrajTok：学习轨迹标记以增强视频理解

Zheng, Chenhao, Zhang, Jieyu, Zhang, Jianing, Huang, Weikai, Kumar, Ashutosh, Kong, Quan, Tuzel, Oncel, Li, Chun-Liang, Krishna, Ranjay

Abstract

Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.

Chinese Translation

视频模型中的标记化通常通过分块处理生成过多且冗余的标记。这严重限制了视频的效率和可扩展性。尽管最近基于轨迹的标记器通过将视频时长与标记数量解耦提供了一个有前景的解决方案，但它们依赖于复杂的外部分割和跟踪管道，这些管道速度较慢且与任务无关。我们提出了TrajTok，一个端到端的视频标记器模块，它与视频模型完全集成并共同训练以实现下游目标，动态调整其标记粒度以适应语义复杂性，而不依赖于视频时长。TrajTok包含一个统一的分割器，能够在空间和时间上对像素进行隐式聚类，从而在一次前向传递中直接生成物体轨迹。通过优先考虑下游适应性而非像素完美的分割保真度，TrajTok轻量高效，且在经验上提升了视频理解性能。借助TrajTok，我们实现了一个从零开始训练的视频CLIP模型（TrajViT2）。它在分类和检索基准测试中都达到了最佳的规模准确性，同时保持了与最佳标记合并方法相当的效率。TrajTok还证明了其作为标记器之外的多功能组件。我们展示了它可以无缝集成作为预训练视觉特征的探测头（TrajAdapter）或视觉-语言模型中的对齐连接器（TrajVLM），在长视频推理中表现尤为出色。

cs.CV / 49 / 2602.22785

SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation

场景传输器：基于最优传输引导的组合潜在扩散用于单图像结构化3D场景生成

Wang, Ling, Guo, Hao-Xiang, Wang, Xinzhou, Sun, Fuchun, Sun, Kai, Liu, Pengkun, Xiao, Hang, Wang, Zhong, Fu, Guangyuan, Li, Eric, Liu, Yang, Wang, Yikai

Abstract

We introduce SceneTransporter, an end-to-end framework for structured 3D scene generation from a single image. While existing methods generate part-level 3D objects, they often fail to organize these parts into distinct instances in open-world scenes. Through a debiased clustering probe, we reveal a critical insight: this failure stems from the lack of structural constraints within the model's internal assignment mechanism. Based on this finding, we reframe the task of structured 3D scene generation as a global correlation assignment problem. To solve this, SceneTransporter formulates and solves an entropic Optimal Transport (OT) objective within the denoising loop of the compositional DiT model. This formulation imposes two powerful structural constraints. First, the resulting transport plan gates cross-attention to enforce an exclusive, one-to-one routing of image patches to part-level 3D latents, preventing entanglement. Second, the competitive nature of the transport encourages the grouping of similar patches, a process that is further regularized by an edge-based cost, to form coherent objects and prevent fragmentation. Extensive experiments show that SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity. Code and models will be publicly available at https://2019epwl.github.io/SceneTransporter/.

Chinese Translation

我们介绍了场景传输器（SceneTransporter），这是一个从单张图像生成结构化3D场景的端到端框架。虽然现有方法生成部分级3D对象，但它们往往无法将这些部分组织成开放世界场景中的独立实例。通过去偏聚类探测，我们揭示了一个关键见解：这种失败源于模型内部分配机制缺乏结构约束。基于这一发现，我们将结构化3D场景生成的任务重新框定为全局关联分配问题。为了解决这个问题，场景传输器在组合DiT模型的去噪循环中制定并解决了一个熵最优传输（Optimal Transport, OT）目标。该公式施加了两个强大的结构约束。首先，生成的传输计划限制了交叉注意力，以强制图像块与部分级3D潜在变量之间进行独占的一对一路由，防止缠结。其次，传输的竞争性质鼓励相似图像块的分组，这一过程通过基于边缘的成本进一步规范化，以形成连贯的对象并防止碎片化。大量实验表明，场景传输器在开放世界场景生成方面优于现有方法，显著提高了实例级一致性和几何保真度。代码和模型将公开发布在 https://2019epwl.github.io/SceneTransporter/。
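
The entropic Optimal Transport step mentioned in the abstract is commonly solved with Sinkhorn iterations. The sketch below shows that generic procedure on a toy cost matrix; it is not the authors' implementation. The names `sinkhorn`, `cost`, `row_mass`, `col_mass`, the value of `eps`, and the final argmax routing are all assumptions chosen for illustration, and the edge-based cost described in the abstract is not modelled.

```python
import math

def sinkhorn(cost, row_mass, col_mass, eps=0.5, n_iters=200):
    """Entropic OT plan between rows (e.g. image patches) and columns (e.g. part latents)."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel K = exp(-C / eps); a smaller eps gives a sharper plan.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy example: four patches compete for two part slots of equal capacity.
cost = [[0.0, 1.0], [1.0, 0.0], [0.1, 0.9], [0.8, 0.2]]
plan = sinkhorn(cost, row_mass=[0.25] * 4, col_mass=[0.5, 0.5])
# Hard routing of each patch to the slot that receives most of its mass.
assignment = [max(range(2), key=lambda j: row[j]) for row in plan]
```

Because every column has a fixed capacity, patches compete for slots, which is one way to read the "competitive nature of the transport" described above.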

cs.CV / 50 / 2602.22791

Robust Human Trajectory Prediction via Self-Supervised Skeleton Representation Learning

通过自监督骨骼表示学习实现鲁棒的人体轨迹预测

Arashima, Taishu, Kera, Hiroshi, Kawamoto, Kazuhiko

Abstract

Human trajectory prediction plays a crucial role in applications such as autonomous navigation and video surveillance. While recent works have explored the integration of human skeleton sequences to complement trajectory information, skeleton data in real-world environments often suffer from missing joints caused by occlusions. These disturbances significantly degrade prediction accuracy, indicating the need for more robust skeleton representations. We propose a robust trajectory prediction method that incorporates a self-supervised skeleton representation model pretrained with masked autoencoding. Experimental results in occlusion-prone scenarios show that our method improves robustness to missing skeletal data without sacrificing prediction accuracy, and consistently outperforms baseline models in clean-to-moderate missingness regimes.

Chinese Translation

人体轨迹预测在自主导航和视频监控等应用中发挥着至关重要的作用。尽管近期的研究探索了将人体骨骼序列与轨迹信息相结合，但在真实环境中，骨骼数据常常因遮挡而导致关节缺失。这些干扰显著降低了预测的准确性，表明需要更鲁棒的骨骼表示。我们提出了一种鲁棒的轨迹预测方法，该方法结合了经过掩码自编码预训练的自监督骨骼表示模型。在易受遮挡的场景中的实验结果表明，我们的方法在不牺牲预测准确性的情况下，提高了对缺失骨骼数据的鲁棒性，并在干净到中等缺失的情况下始终优于基线模型。

cs.CV / 51 / 2602.22800

GSTurb: Gaussian Splatting for Atmospheric Turbulence Mitigation

GSTurb：用于大气湍流缓解的高斯喷溅

Du, Hanliang, Lu, Zhangji, Cai, Zewei, Tang, Qijian, Yu, Qifeng, Liu, Xiaoli

Abstract

Atmospheric turbulence causes significant image degradation due to pixel displacement (tilt) and blur, particularly in long-range imaging applications. In this paper, we propose a novel framework for atmospheric turbulence mitigation, GSTurb, which integrates optical flow-guided tilt correction and Gaussian splatting for modeling non-isoplanatic blur. The framework employs Gaussian parameters to represent tilt and blur, and optimizes them across multiple frames to enhance restoration. Experimental results on the ATSyn-static dataset demonstrate the effectiveness of our method, achieving a peak PSNR of 27.67 dB and SSIM of 0.8735. Compared to the state-of-the-art method, GSTurb improves PSNR by 1.3 dB (a 4.5% increase) and SSIM by 0.048 (a 5.8% increase). Additionally, on real datasets, including the TSRWGAN Real-World and CLEAR datasets, GSTurb outperforms existing methods, showing significant improvements in both qualitative and quantitative performance. These results highlight that combining optical flow-guided tilt correction with Gaussian splatting effectively enhances image restoration under both synthetic and real-world turbulence conditions. The code for this method will be available at https://github.com/DuhlLiamz/3DGS_turbulence/tree/main.

Chinese Translation

大气湍流由于像素位移（倾斜）和模糊，导致图像显著降质，尤其是在远程成像应用中。在本文中，我们提出了一种新颖的大气湍流缓解框架GSTurb，该框架结合了光流引导的倾斜校正和高斯喷溅，以建模非等像差模糊。该框架采用高斯参数来表示倾斜和模糊，并在多个帧之间进行优化，以增强恢复效果。在ATSyn-static数据集上的实验结果证明了我们方法的有效性，达到峰值信噪比（PSNR）27.67 dB和结构相似性指数（SSIM）0.8735。与最先进的方法相比，GSTurb的PSNR提高了1.3 dB（增加了4.5%），SSIM提高了0.048（增加了5.8%）。此外，在包括TSRWGAN真实世界和CLEAR数据集在内的真实数据集上，GSTurb的表现优于现有方法，在定性和定量性能上均显示出显著改善。这些结果强调了将光流引导的倾斜校正与高斯喷溅相结合，能够有效增强在合成和真实世界湍流条件下的图像恢复。该方法的代码将发布在 https://github.com/DuhlLiamz/3DGS_turbulence/tree/main。
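
The PSNR figures quoted above follow the usual definition, 10·log10(MAX² / MSE). A minimal sketch of that formula is given below; the function name `psnr`, the flat pixel lists, and the 8-bit peak value of 255 are assumptions for illustration and say nothing about the evaluation code used in the paper.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# A gain of 1.3 dB corresponds to this multiplicative change in MSE.
mse_ratio_for_gain = 10 ** (-1.3 / 10)
```

Since PSNR is logarithmic, a gain of 1.3 dB corresponds to an MSE ratio of about 0.74, that is, roughly a quarter less mean squared error.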

cs.CV / 52 / 2602.22809

PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning

PhotoAgent：具有探索性视觉美学规划的自主照片编辑

Yao, Mingde, You, Zhiyuan, Man, Tam-King, Wang, Menglu, Xue, Tianfan

Abstract

With the recent fast development of generative models, instruction-based image editing has shown great potential in generating high-quality images. However, the quality of editing highly depends on carefully designed instructions, placing the burden of task decomposition and sequencing entirely on the user. To achieve autonomous image editing, we present PhotoAgent, a system that advances image editing through explicit aesthetic planning. Specifically, PhotoAgent formulates autonomous image editing as a long-horizon decision-making problem. It reasons over user aesthetic intent, plans multi-step editing actions via tree search, and iteratively refines results through closed-loop execution with memory and visual feedback, without requiring step-by-step user prompts. To support reliable evaluation in real-world scenarios, we introduce UGC-Edit, an aesthetic evaluation benchmark consisting of 7,000 photos and a learned aesthetic reward model. We also construct a test set containing 1,017 photos to systematically assess autonomous photo editing performance. Extensive experiments demonstrate that PhotoAgent consistently improves both instruction adherence and visual quality compared with baseline methods. The project page is https://github.com/mdyao/PhotoAgent.

Chinese Translation

随着生成模型的快速发展，基于指令的图像编辑在生成高质量图像方面展现了巨大的潜力。然而，编辑的质量高度依赖于精心设计的指令，这使得任务分解和顺序安排的负担完全落在用户身上。为了实现自主图像编辑，我们提出了PhotoAgent，一个通过明确的美学规划来推进图像编辑的系统。具体而言，PhotoAgent将自主图像编辑表述为一个长期决策问题。它推理用户的美学意图，通过树搜索规划多步骤编辑操作，并通过闭环执行与记忆和视觉反馈迭代优化结果，而无需逐步的用户提示。为了支持在真实场景中的可靠评估，我们引入了UGC-Edit，一个包含7000张照片和学习美学奖励模型的美学评估基准。我们还构建了一个包含1017张照片的测试集，以系统地评估自主照片编辑性能。大量实验表明，与基线方法相比，PhotoAgent在指令遵循和视觉质量方面始终有所提升。项目页面为：https://github.com/mdyao/PhotoAgent。

cs.CV / 53 / 2602.22819

Face Time Traveller: Travel Through Ages Without Losing Identity

面部时间旅行者：无损身份的跨时代旅行

Kar, Purbayan, Ghadiya, Ayush, Chudasama, Vishal, Wasnik, Pankaj, Jawahar, C. V.

Abstract

Face aging, an ill-posed problem shaped by environmental and genetic factors, is vital in entertainment, forensics, and digital archiving, where realistic age transformations must preserve both identity and visual realism. However, existing works relying on numerical age representations overlook the interplay of biological and contextual cues. Despite recent progress, face aging models still struggle with identity preservation in wide age transformations. In addition, static attention and optimization-heavy inversion in diffusion limit adaptability, fine-grained control, and background consistency. To address these challenges, we propose Face Time Traveller (FaceTT), a diffusion-based framework that achieves high-fidelity, identity-consistent age transformation. Here, we introduce a Face-Attribute-Aware Prompt Refinement strategy that encodes intrinsic (biological) and extrinsic (environmental) aging cues for context-aware conditioning. A tuning-free Angular Inversion method is proposed that efficiently maps real faces into the diffusion latent space for fast and accurate reconstruction. Moreover, an Adaptive Attention Control mechanism is introduced that dynamically balances cross-attention for semantic aging cues and self-attention for structural and identity preservation. Extensive experiments on benchmark datasets and an in-the-wild test set demonstrate that FaceTT achieves superior identity retention, background preservation, and aging realism over state-of-the-art (SOTA) methods.

Chinese Translation

面部老化是一个受环境和遗传因素影响的病态问题，在娱乐、法医学和数字归档中至关重要，其中现实的年龄转变必须同时保持身份和视觉真实感。然而，现有的依赖数值年龄表示的工作忽视了生物和上下文线索之间的相互作用。尽管最近的面部老化模型取得了一定进展，但在广泛的年龄转变中，它们在身份保持方面仍然面临挑战，同时静态注意力和重优化的反演在扩散过程中限制了适应性、细粒度控制和背景一致性。为了解决这些挑战，我们提出了面部时间旅行者（Face Time Traveller，FaceTT），一个基于扩散的框架，能够实现高保真度和身份一致的年龄转变。在此，我们引入了一种面部属性感知的提示优化策略，该策略编码了内在（生物）和外在（环境）老化线索，以实现上下文感知的条件。我们提出了一种无调优的角度反演方法，能够高效地将真实面孔映射到扩散潜在空间，以实现快速和准确的重建。此外，我们引入了一种自适应注意力控制机制，动态平衡语义老化线索的交叉注意力和结构及身份保持的自注意力。在基准数据集和实际测试集上的大量实验表明，FaceTT在身份保留、背景保持和老化真实感方面优于最先进（SOTA）的方法。

cs.CV / 54 / 2602.22821

CMSA-Net: Causal Multi-scale Aggregation with Adaptive Multi-source Reference for Video Polyp Segmentation

CMSA-Net：具有自适应多源参考的因果多尺度聚合用于视频息肉分割

Wang, Tong, Qi, Yaolei, Wang, Siwen, Razzak, Imran, Yang, Guanyu, Xie, Yutong

Abstract

Video polyp segmentation (VPS) is an important task in computer-aided colonoscopy, as it helps doctors accurately locate and track polyps during examinations. However, VPS remains challenging because polyps often look similar to surrounding mucosa, leading to weak semantic discrimination. In addition, large changes in polyp position and scale across video frames make stable and accurate segmentation difficult. To address these challenges, we propose a robust VPS framework named CMSA-Net. The proposed network introduces a Causal Multi-scale Aggregation (CMA) module to effectively gather semantic information from multiple historical frames at different scales. By using causal attention, CMA ensures that temporal feature propagation follows strict time order, which helps reduce noise and improve feature reliability. Furthermore, we design a Dynamic Multi-source Reference (DMR) strategy that adaptively selects informative and reliable reference frames based on semantic separability and prediction confidence. This strategy provides strong multi-frame guidance while keeping the model efficient for real-time inference. Extensive experiments on the SUN-SEG dataset demonstrate that CMSA-Net achieves state-of-the-art performance, offering a favorable balance between segmentation accuracy and real-time clinical applicability.

Chinese Translation

视频息肉分割（VPS）是计算机辅助结肠镜检查中的一项重要任务，因为它帮助医生在检查过程中准确定位和追踪息肉。然而，由于息肉常常与周围的黏膜相似，导致语义区分能力较弱，VPS 仍然面临挑战。此外，视频帧中息肉位置和尺度的巨大变化使得稳定和准确的分割变得困难。为了解决这些挑战，我们提出了一种名为 CMSA-Net 的强健 VPS 框架。该网络引入了因果多尺度聚合（CMA）模块，有效地从不同尺度的多个历史帧中收集语义信息。通过使用因果注意力，CMA 确保时间特征传播遵循严格的时间顺序，从而帮助减少噪声并提高特征的可靠性。此外，我们设计了一种动态多源参考（DMR）策略，根据语义可分离性和预测置信度自适应选择信息丰富且可靠的参考帧。该策略在保持模型高效以实现实时推理的同时，提供了强大的多帧指导。在 SUN-SEG 数据集上的大量实验表明，CMSA-Net 达到了最先进的性能，在分割准确性和实时临床适用性之间提供了良好的平衡。
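
The "strict time order" property of causal attention can be illustrated with a generic, single-scale sketch in which frame t may attend only to frames 0..t. This is not the CMA module itself; the function name `causal_attention` and the toy two-dimensional per-frame features are assumptions, and the multi-scale aggregation and reference-frame selection described in the abstract are omitted.

```python
import math

def causal_attention(queries, keys, values):
    """Scaled dot-product attention where frame t attends only to frames <= t."""
    d = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Scores for the current and earlier frames only; later frames are excluded.
        scores = [sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        peak = max(scores)  # subtract the maximum for a stable softmax
        weights = [math.exp(x - peak) for x in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([sum(w * values[s][k] for s, w in enumerate(weights))
                        for k in range(len(values[0]))])
    return outputs

# Toy per-frame features for three consecutive frames.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(frames, frames, frames)
```

Changing a later frame leaves the outputs of all earlier frames untouched, which is the property that makes this kind of attention usable for frame-by-frame online inference.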

cs.CV / 55 / 2602.22829

Reflectance Multispectral Imaging for Soil Composition Estimation and USDA Texture Classification

反射多光谱成像用于土壤成分估计和美国农业部纹理分类

Ranasinghe, G. A. S. L., Jayakody, J. A. S. T., De Silva, M. C. L., Thilakarathne, G., Godaliyadda, G. M. R. I., Herath, H. M. V. R., Ekanayake, M. P. B., Navaratnarajah, S. K.

Abstract

Soil texture is a foundational attribute that governs water availability and erosion in agriculture, as well as load-bearing capacity, deformation response, and shrink-swell risk in geotechnical engineering. Yet texture is still typically determined by slow and labour-intensive laboratory particle size tests, while many sensing alternatives are either costly or too coarse to support routine field-scale deployment. This paper proposes a robust and field-deployable multispectral imaging (MSI) system and machine learning framework for predicting soil composition and the United States Department of Agriculture (USDA) texture classes. The proposed system uses a cost-effective in-house MSI device operating from 365 nm to 940 nm to acquire thirteen spectral bands, which effectively capture the spectral properties of soil texture. Regression models use the captured spectral properties to estimate clay, silt, and sand percentages, while a direct classifier predicts one of the twelve USDA textural classes. Indirect classification is obtained by mapping the regressed compositions to texture classes via the USDA soil texture triangle. The framework is evaluated on mixture data prepared by mixing clay, silt, and sand in varying proportions, using the USDA classification triangle as a basis. Experimental results show that the proposed approach achieves a coefficient of determination $R^2$ up to 0.99 for composition prediction and over 99% accuracy for texture classification. These findings indicate that MSI combined with data-driven modeling can provide accurate, non-destructive, and field-deployable soil texture characterization suitable for geotechnical screening and precision agriculture.

Chinese Translation

土壤纹理是影响农业中水分可用性和侵蚀的基础属性，同时在岩土工程中也影响承载能力、变形响应和收缩-膨胀风险。然而，土壤纹理通常仍通过缓慢且劳动密集的实验室颗粒大小测试来确定，而许多传感器替代方案要么成本高昂，要么精度过低，无法支持常规的现场规模部署。本文提出了一种稳健且可在现场部署的多光谱成像（MSI）系统和机器学习框架，用于预测土壤成分和美国农业部（USDA）纹理类别。所提系统使用一种经济高效的内部MSI设备，工作波段范围为365 nm至940 nm，捕获十三个光谱波段，有效捕捉土壤纹理的光谱特性。回归模型利用捕获的光谱特性来估计粘土、粉土和沙子的百分比，而直接分类器则预测十二个USDA纹理类别中的一个。通过将回归得到的成分映射到纹理类别，间接分类是通过USDA土壤纹理三角形实现的。该框架在混合数据上进行了评估，通过以不同比例混合粘土、粉土和沙子，以USDA分类三角形为基础。实验结果表明，所提方法在成分预测上达到的决定系数R^2高达0.99，在纹理分类上准确率超过99%。这些发现表明，MSI结合数据驱动建模能够提供准确、无损且适合岩土筛选和精准农业的现场可部署土壤纹理表征。
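
The indirect classification route, mapping regressed sand, silt, and clay percentages onto the texture triangle, can be sketched as a chain of threshold rules. The thresholds below follow the commonly reproduced rule set for the twelve USDA classes and should be treated as an assumption to verify against the official triangle; the function name `usda_texture_class` and the renormalization step are illustrative and not taken from the paper.

```python
def usda_texture_class(sand, silt, clay):
    """Map sand/silt/clay amounts to one of the twelve USDA texture classes."""
    total = sand + silt + clay
    if total <= 0 or min(sand, silt, clay) < 0:
        raise ValueError("amounts must be non-negative and not all zero")
    # Regressed outputs need not sum to exactly 100, so renormalize to percent.
    sand, silt, clay = (100.0 * x / total for x in (sand, silt, clay))
    # Rules are checked in order; each one relies on the earlier ones having failed.
    if silt + 1.5 * clay < 15:
        return "sand"
    if silt + 2 * clay < 30:
        return "loamy sand"
    if (7 <= clay < 20 and sand > 52) or (clay < 7 and silt < 50):
        return "sandy loam"
    if 7 <= clay < 27 and 28 <= silt < 50 and sand <= 52:
        return "loam"
    if silt >= 50 and (12 <= clay < 27 or (silt < 80 and clay < 12)):
        return "silt loam"
    if silt >= 80 and clay < 12:
        return "silt"
    if 20 <= clay < 35 and silt < 28 and sand > 45:
        return "sandy clay loam"
    if 27 <= clay < 40 and 20 < sand <= 45:
        return "clay loam"
    if 27 <= clay < 40 and sand <= 20:
        return "silty clay loam"
    if clay >= 35 and sand > 45:
        return "sandy clay"
    if clay >= 40 and silt >= 40:
        return "silty clay"
    if clay >= 40:
        return "clay"
    raise ValueError("composition did not match any class")

# Hypothetical regressed composition in percent.
label = usda_texture_class(sand=40, silt=40, clay=20)
```

With this route, a small regression error near a class boundary can flip the predicted class, which is one reason to compare it against the direct classifier mentioned above.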

cs.CV / 56 / 2602.22843

A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling

一种数据和计算高效的胸部X光基础模型，超越激进扩展

Wang, Chong, Zhang, Yabin, Gao, Yunhe, Varma, Maya, Mottez, Clemence, Patsatzi, Faidra, Liu, Jiaming, Long, Jin, Delbrouck, Jean-Benoit, Gatidis, Sergios, Chaudhari, Akshay S., Langlotz, Curtis P.

Abstract

Foundation models for medical imaging are typically pretrained on increasingly large datasets, following a "scale-at-all-costs" paradigm. However, this strategy faces two critical challenges: large-scale medical datasets often contain substantial redundancy and severe class imbalance that bias representation learning toward over-represented patterns, and indiscriminate training regardless of heterogeneity in data quality incurs considerable computational inefficiency. Here we demonstrate that active, principled data curation during pretraining can serve as a viable, cost-effective alternative to brute-force dataset enlargement. We introduce CheXficient, a chest X-ray (CXR) foundation model that selectively prioritizes informative training samples. CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports while consuming under 27.3% of the total compute budget, yet achieves comparable or superior performance to its full-data counterpart and other large-scale pretrained models. We assess CheXficient across 20 individual benchmarks spanning 5 task types, including non-adapted off-the-shelf evaluations (zero-shot findings classification and crossmodal retrieval) and adapted downstream tasks (disease prediction, semantic segmentation, and radiology report generation). Further analyses show that CheXficient systematically prioritizes under-represented training samples, improving generalizability on long-tailed or rare conditions. Overall, our work offers practical insights into the data and computation demands for efficient pretraining and downstream adaptation of medical vision-language foundation models.

Chinese Translation

医学影像的基础模型通常在不断增大的数据集上进行预训练，遵循“无论成本如何都要扩展”的范式。然而，这一策略面临两个关键挑战：大规模医学数据集往往包含大量冗余和严重的类别不平衡，这使得表示学习偏向于过度代表的模式；而无差别的训练忽视数据质量的异质性，导致显著的计算低效。在此，我们展示了在预训练过程中进行主动、原则性的数据整理可以作为一种可行且具有成本效益的替代方案，以取代粗暴的数据集扩展。我们引入了CheXficient，一个胸部X光（CXR）基础模型，选择性地优先考虑信息丰富的训练样本。CheXficient仅在1,235,004对CXR图像和报告的22.7%上进行预训练，同时消耗的计算预算不到27.3%，却实现了与其全数据对应模型及其他大规模预训练模型相当或更优的性能。我们在涵盖5种任务类型的20个独立基准上评估了CheXficient，包括未适应的现成评估（零-shot发现分类和跨模态检索）以及适应的下游任务（疾病预测、语义分割和放射学报告生成）。进一步分析表明，CheXficient系统地优先考虑了代表性不足的训练样本，从而提高了在长尾或稀有病症上的泛化能力。总体而言，我们的工作为医学视觉-语言基础模型的高效预训练和下游适应提供了实用的见解，涉及数据和计算的需求。

cs.CV / 57 / 2602.22859

From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

从盲点到收益：基于诊断驱动的迭代训练大规模多模态模型

Jia, Hongrui, Jiang, Chaoya, Zhang, Shikun, Ye, Wei

Abstract

As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop where diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating DPE as a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at https://github.com/hongruijia/DPE.

Chinese Translation

随着大规模多模态模型（LMMs）的规模扩大和强化学习（RL）方法的成熟，LMMs在复杂推理和决策制定方面取得了显著进展。然而，训练仍然依赖于静态数据和固定的配方，这使得诊断能力盲点或提供动态、针对性的强化变得困难。受到测试驱动的错误暴露和反馈修正优于重复练习的发现的启发，我们提出了诊断驱动的渐进演化（DPE），这是一种螺旋循环，其中诊断引导数据生成和强化，每次迭代重新诊断更新后的模型，以推动下一轮的针对性改进。DPE有两个关键组成部分。首先，多名代理对大量未标记的多模态数据进行注释和质量控制，使用网络搜索和图像编辑等工具生成多样化、真实的样本。其次，DPE将失败归因于特定的弱点，动态调整数据混合，并指导代理生成针对弱点的数据以进行有针对性的强化。在Qwen3-VL-8B-Instruct和Qwen2.5-VL-7B-Instruct上的实验显示，在十一项基准测试中持续稳定地取得进展，表明DPE是一种可扩展的范式，可用于在开放任务分布下持续训练LMM。我们的代码、模型和数据已公开发布在 https://github.com/hongruijia/DPE。

cs.CV / 58 / 2602.22867

SO3UFormer: Learning Intrinsic Spherical Features for Rotation-Robust Panoramic Segmentation

SO3UFormer：学习内在球面特征以实现旋转鲁棒的全景分割

Zhu, Qinfeng, Jiang, Yunxi, Fan, Lei

Abstract

Panoramic semantic segmentation models are typically trained under a strict gravity-aligned assumption. However, real-world captures often deviate from this canonical orientation due to unconstrained camera motions, such as the rotational jitter of handheld devices or the dynamic attitude shifts of aerial platforms. This discrepancy causes standard spherical Transformers to overfit global latitude cues, leading to performance collapse under 3D reorientations. To address this, we introduce SO3UFormer, a rotation-robust architecture designed to learn intrinsic spherical features that are less sensitive to the underlying coordinate frame. Our approach rests on three geometric pillars: (1) an intrinsic feature formulation that decouples the representation from the gravity vector by removing absolute latitude encoding; (2) quadrature-consistent spherical attention that accounts for non-uniform sampling densities; and (3) a gauge-aware relative positional mechanism that encodes local angular geometry using tangent-plane projected angles and discrete gauge pooling, avoiding reliance on global axes. We further use index-based spherical resampling together with a logit-level SO(3)-consistency regularizer during training. To rigorously benchmark robustness, we introduce Pose35, a dataset variant of Stanford2D3D perturbed by random rotations within $\pm 35^\circ$. Under the extreme test of arbitrary full SO(3) rotations, existing SOTAs fail catastrophically: the baseline SphereUFormer drops from 67.53 mIoU to 25.26 mIoU. In contrast, SO3UFormer demonstrates remarkable stability, achieving 72.03 mIoU on Pose35 and retaining 70.67 mIoU under full SO(3) rotations.

Chinese Translation

全景语义分割模型通常在严格的重力对齐假设下进行训练。然而，现实世界的捕捉往往由于不受限制的相机运动而偏离这种规范方向，例如手持设备的旋转抖动或空中平台的动态姿态变化。这种差异导致标准球面变换器过度拟合全局纬度线索，从而在三维重新定向下性能崩溃。为了解决这个问题，我们提出了SO3UFormer，这是一种旋转鲁棒架构，旨在学习对基础坐标系不太敏感的内在球面特征。我们的方法基于三个几何支柱：(1) 内在特征公式，通过去除绝对纬度编码将表示与重力向量解耦；(2) 考虑非均匀采样密度的四分之一一致性球面注意力；(3) 一种感知度量的相对位置机制，利用切平面投影角度和离散度量池来编码局部角几何，避免对全局轴的依赖。我们还在训练过程中使用基于索引的球面重采样以及logit级别的SO(3)一致性正则化器。为了严格基准测试鲁棒性，我们引入了Pose35，这是一个受随机旋转扰动的Stanford2D3D数据集变体，旋转范围为$\pm 35^\circ$。在任意全SO(3)旋转的极端测试下，现有的最先进技术（SOTA）表现惨败：基线SphereUFormer的mIoU从67.53降至25.26。相比之下，SO3UFormer表现出显著的稳定性，在Pose35上达到了72.03 mIoU，并在全SO(3)旋转下保持了70.67 mIoU。

cs.CV / 59 / 2602.22917

Towards Multimodal Domain Generalization with Few Labels

面向少量标签的多模态领域泛化研究

Li, Hongzhao, Dong, Hao, Wan, Hualei, Li, Shupan, Xu, Mingliang, Khan, Muhammad Haris

Abstract

Multimodal models ideally should generalize to unseen domains while remaining data-efficient to reduce annotation costs. To this end, we introduce and study a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG), which aims to learn robust multimodal models from multi-source data with few labeled samples. We observe that existing approaches fail to address this setting effectively: multimodal domain generalization methods cannot exploit unlabeled data, semi-supervised multimodal learning methods ignore domain shifts, and semi-supervised domain generalization methods are confined to single-modality inputs. To overcome these limitations, we propose a unified framework featuring three key components: Consensus-Driven Consistency Regularization, which obtains reliable pseudo-labels through confident fused-unimodal consensus; Disagreement-Aware Regularization, which effectively utilizes ambiguous non-consensus samples; and Cross-Modal Prototype Alignment, which enforces domain- and modality-invariant representations while promoting robustness under missing modalities via cross-modal translation. We further establish the first SSMDG benchmarks, on which our method consistently outperforms strong baselines in both standard and missing-modality scenarios. Our benchmarks and code are available at https://github.com/lihongzhao99/SSMDG.

Chinese Translation

多模态模型理想情况下应能够对未见领域进行泛化，同时保持数据高效性以降低标注成本。为此，我们引入并研究了一个新问题，即半监督多模态领域泛化（Semi-Supervised Multimodal Domain Generalization, SSMDG），旨在从多源数据中学习具有鲁棒性的多模态模型，且仅需少量标记样本。我们观察到现有方法未能有效应对这一设置：多模态领域泛化方法无法利用未标记数据，半监督多模态学习方法忽视了领域转移，而半监督领域泛化方法则局限于单一模态输入。为克服这些局限性，我们提出了一个统一框架，包含三个关键组件：共识驱动一致性正则化（Consensus-Driven Consistency Regularization），通过自信的融合单模态共识获得可靠的伪标签；不一致性感知正则化（Disagreement-Aware Regularization），有效利用模糊的非共识样本；以及跨模态原型对齐（Cross-Modal Prototype Alignment），在促进跨模态翻译的同时强制执行领域和模态不变表示，以增强在缺失模态下的鲁棒性。我们进一步建立了首个SSMDG基准，在该基准上，我们的方法在标准和缺失模态场景中均持续超越强基线。我们的基准和代码可在 https://github.com/lihongzhao99/SSMDG 获取。

cs.CV / 60 / 2602.22919

Chain of Flow: A Foundational Generative Framework for ECG-to-4D Cardiac Digital Twins

流动链：心电图到四维心脏数字双胞胎的基础生成框架

Wu, Haofan, Aung, Nay, Arvanitis, Theodoros N., Lima, Joao A. C., Petersen, Steffen E., Zhang, Le

Abstract

A clinically actionable Cardiac Digital Twin (CDT) should reconstruct individualised cardiac anatomy and physiology, update its internal state from multimodal signals, and enable a broad range of downstream simulations beyond isolated tasks. However, existing CDT frameworks remain limited to task-specific predictors rather than building a patient-specific, manipulable virtual heart. In this work, we introduce Chain of Flow (COF), a foundational ECG-driven generative framework that reconstructs full 4D cardiac structure and motion from a single cardiac cycle. The method integrates cine-CMR and 12-lead ECG during training to learn a unified representation of cardiac geometry, electrophysiology, and motion dynamics. We evaluate Chain of Flow on diverse cohorts and demonstrate accurate recovery of cardiac anatomy, chamber-wise function, and dynamic motion patterns. The reconstructed 4D hearts further support downstream CDT tasks such as volumetry, regional function analysis, and virtual cine synthesis. By enabling full 4D organ reconstruction directly from ECG, COF transforms cardiac digital twins from narrow predictive models into fully generative, patient-specific virtual hearts. Code will be released after review.

Chinese Translation

一个具有临床可操作性的心脏数字双胞胎（CDT）应能够重建个性化的心脏解剖和生理，利用多模态信号更新其内部状态，并支持超越孤立任务的广泛下游模拟。然而，现有的CDT框架仍然局限于任务特定的预测模型，而未能构建患者特定的、可操作的虚拟心脏。在本研究中，我们引入了流动链（Chain of Flow, COF），这是一个基础的心电图驱动生成框架，能够从单个心动周期重建完整的四维心脏结构和运动。该方法在训练过程中整合了电影心脏磁共振成像（cine-CMR）和12导联心电图，以学习心脏几何、电生理和运动动力学的统一表征。我们在不同的队列上评估了流动链，并展示了对心脏解剖、腔室功能和动态运动模式的准确恢复。重建的四维心脏进一步支持下游CDT任务，如体积测量、区域功能分析和虚拟电影合成。通过直接从心电图实现完整的四维器官重建，COF将心脏数字双胞胎从狭隘的预测模型转变为完全生成的、患者特定的虚拟心脏。代码将在审稿后发布。

cs.CV / 61 / 2602.22920

OSDaR-AR: Enhancing Railway Perception Datasets via Multi-modal Augmented Reality

OSDaR-AR：通过多模态增强现实提升铁路感知数据集

Nesti, Federico, D'Amico, Gianluca, Marinoni, Mauro, Buttazzo, Giorgio

Abstract

Although deep learning has significantly advanced the perception capabilities of intelligent transportation systems, railway applications continue to suffer from a scarcity of high-quality, annotated data for safety-critical tasks like obstacle detection. While photorealistic simulators offer a solution, they often struggle with the "sim-to-real" gap; conversely, simple image-masking techniques lack the spatio-temporal coherence required to obtain augmented single- and multi-frame scenes with the correct appearance and dimensions. This paper introduces a multi-modal augmented reality framework designed to bridge this gap by integrating photorealistic virtual objects into real-world railway sequences from the OSDaR23 dataset. Utilizing Unreal Engine 5 features, our pipeline leverages LiDAR point-clouds and INS/GNSS data to ensure accurate object placement and temporal stability across RGB frames. This paper also proposes a segmentation-based refinement strategy for INS/GNSS data to significantly improve the realism of the augmented sequences, as confirmed by the comparative study presented in the paper. Carefully designed augmented sequences are collected to produce OSDaR-AR, a public dataset designed to support the development of next-generation railway perception systems. The dataset is available at the following page: https://syndra.retis.santannapisa.it/osdarar.html

Chinese Translation

尽管深度学习显著提升了智能交通系统的感知能力，但铁路应用在障碍物检测等安全关键任务中仍然面临高质量标注数据稀缺的问题。虽然逼真的模拟器提供了一种解决方案，但它们常常难以克服“模拟到现实”的差距；相反，简单的图像遮罩技术缺乏获得具有正确外观和尺寸的增强单帧和多帧场景所需的时空一致性。本文提出了一种多模态增强现实框架，旨在通过将逼真的虚拟物体整合到来自OSDaR23数据集的真实铁路序列中来弥合这一差距。利用虚幻引擎5的特性，我们的管道利用激光雷达点云和INS/GNSS数据，确保在RGB帧之间实现准确的物体放置和时间稳定性。本文还提出了一种基于分割的INS/GNSS数据精炼策略，以显著提高增强序列的真实感，这一点通过本文中呈现的比较研究得到了确认。精心设计的增强序列被收集以生成OSDaR-AR，这是一个旨在支持下一代铁路感知系统开发的公共数据集。该数据集可在以下页面获取：https://syndra.retis.santannapisa.it/osdarar.html


cs.CV / 62 / 2602.22923

WaterVideoQA: ASV-Centric Perception and Rule-Compliant Reasoning via Multi-Modal Agents

水视频问答：基于ASV的感知与规则遵循推理通过多模态智能体

Guan, Runwei, Liang, Shaofeng, Ouyang, Ningwei, Fei, Weichen, Yao, Shanliang, Dai, Wei, Ge, Chenhao, Sun, Penglei, Zhu, Xiaohui, Huang, Tao, Liu, Ryan Wen, Xiong, Hui

Abstract

While autonomous navigation has achieved remarkable success in passive perception (e.g., object detection and segmentation), it remains fundamentally constrained by a void in knowledge-driven, interactive environmental cognition. In the high-stakes domain of maritime navigation, the ability to bridge the gap between raw visual perception and complex cognitive reasoning is not merely an enhancement but a critical prerequisite for Autonomous Surface Vessels (ASVs) to execute safe and precise maneuvers. To this end, we present WaterVideoQA, the first large-scale, comprehensive Video Question Answering benchmark specifically engineered for all-waterway environments. This benchmark encompasses 3,029 video clips across six distinct waterway categories, integrating multifaceted variables such as volatile lighting and dynamic weather to rigorously stress-test ASV capabilities across a five-tier hierarchical cognitive framework. Furthermore, we introduce NaviMind, a pioneering multi-agent neuro-symbolic system designed for open-ended maritime reasoning. By synergizing Adaptive Semantic Routing, Situation-Aware Hierarchical Reasoning, and Autonomous Self-Reflective Verification, NaviMind transitions ASVs from superficial pattern matching to regulation-compliant, interpretable decision-making. Experimental results demonstrate that our framework significantly transcends existing baselines, establishing a new paradigm for intelligent, trustworthy interaction in dynamic maritime environments.

Chinese Translation

尽管自主导航在被动感知（例如，物体检测和分割）方面取得了显著成功，但在知识驱动的互动环境认知方面仍然存在根本性限制。在高风险的海洋导航领域，弥合原始视觉感知与复杂认知推理之间的差距不仅是一个增强，而是自主水面船舶（ASV）执行安全和精确操控的关键前提。为此，我们提出了WaterVideoQA，这是第一个专门为全水域环境设计的大规模综合视频问答基准。该基准涵盖了3029个视频片段，分为六个不同的水域类别，整合了诸如光照变化和动态天气等多种变量，以严格测试ASV在五层次层次认知框架下的能力。此外，我们还推出了NaviMind，这是一个开创性的多智能体神经符号系统，旨在进行开放式海洋推理。通过协同自适应语义路由、情境感知层次推理和自主自我反思验证，NaviMind使ASV从表面模式匹配转变为符合规范的可解释决策。实验结果表明，我们的框架显著超越了现有基准，为动态海洋环境中的智能、可信交互建立了新范式。


cs.CV / 63 / 2602.22932

MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding

MSJoE：联合演化的多模态大语言模型和采样器用于高效的长视频理解

Tan, Wenhui, Yu, Xiaoyi, Li, Jiaze, Chen, Yijing, Ju, Jianzhong, Luo, Zhenbo, Song, Ruihua, Luan, Jian

Abstract

Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question about a video. Specifically, MSJoE first reasons out several queries, which describe diverse visual perspectives relevant to the question. Then, these queries interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames, which are then fed into the MLLM for answer generation. Both the MLLM and sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query-reasoning, frame-sampling, and key-frame understanding. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support the training process. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves an 8.0% accuracy gain over the base MLLM and 1.1% higher accuracy than the strongest baseline method.

Chinese Translation

高效理解长视频仍然是多模态大语言模型（MLLM）面临的一个基本挑战。本文提出了MLLM-采样器联合演化（MSJoE），这是一个新颖的框架，旨在联合演化MLLM和轻量级关键帧采样器，以实现高效的长视频理解。MSJoE建立在一个关键假设之上，即只有一小部分关键帧对于回答视频中的每个问题是真正具有信息量的。具体而言，MSJoE首先推理出几个查询，这些查询描述了与问题相关的多样化视觉视角。然后，这些查询与一个冻结的CLIP模型进行交互，以生成查询-帧相似度矩阵。最后，一个轻量级的采样器从该矩阵中预测关键帧采样权重，选择一组紧凑的信息帧，然后将其输入到MLLM中以生成答案。MLLM和采样器通过强化学习进行联合优化，使查询推理、帧采样和关键帧理解能够共同适应。为了支持训练过程，我们收集了一个包含2800个视频和7000个问答对的新长视频问答数据集。在VideoMME、LongVideoBench、LVBench和MLVU上的大量实验表明，MSJoE在基础MLLM的基础上实现了8.0%的准确率提升，且比最强基线方法高出1.1%的准确率。
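
The query-to-frame selection step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy 2-D embeddings stand in for CLIP features, and `sampling_weights` is a hypothetical fixed rule (max over queries, then softmax) standing in for the learned sampler.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(query_embs, frame_embs):
    """Query-frame similarity matrix: rows are queries, columns are frames."""
    return [[cosine(q, f) for f in frame_embs] for q in query_embs]

def sampling_weights(sim, temperature=0.1):
    """Hypothetical stand-in for the learned sampler.

    Scores each frame by its best-matching query, then applies a softmax
    over frames so the weights sum to 1.
    """
    num_frames = len(sim[0])
    scores = [max(row[j] for row in sim) for j in range(num_frames)]
    peak = max(scores)  # subtract the peak for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_key_frames(weights, k):
    """Indices of the k highest-weight frames, returned in temporal order."""
    top = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)[:k]
    return sorted(top)

# Toy embeddings (assumed): two queries, four frames.
query_embs = [[1.0, 0.0], [0.0, 1.0]]
frame_embs = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9], [-1.0, 0.0]]

sim = similarity_matrix(query_embs, frame_embs)
weights = sampling_weights(sim)
key_frames = select_key_frames(weights, k=2)
print(key_frames)
```

In the paper's setting the sampler is trained jointly with the MLLM through reinforcement learning; the fixed rule here only shows the data flow from similarity matrix to a compact frame subset.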


cs.CV / 64 / 2602.22938

pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation

pMoE：共同激励多样化专家在视觉适应中获胜更多

Mo, Shentong, Luo, Xufang, Li, Dongsheng

Abstract

Parameter-efficient fine-tuning has demonstrated promising results across various visual adaptation tasks, such as classification and segmentation. Typically, prompt tuning techniques have harnessed knowledge from a single pre-trained model, whether from a general or a specialized medical domain. However, this approach overlooks the potential synergies that could arise from integrating diverse domain knowledge within the same tuning process. In this work, we propose a novel Mixture-of-Experts prompt tuning method called pMoE, which leverages the strengths of multiple expert domains through expert-specialized prompt tokens and a learnable dispatcher, effectively combining their expertise in a unified model framework. Our pMoE introduces expert-specific prompt tokens and utilizes a dynamic token dispatching mechanism at various prompt layers to optimize the contribution of each domain expert during the adaptation phase. By incorporating domain knowledge from diverse experts, the proposed pMoE significantly enhances the model's versatility and applicability to a broad spectrum of tasks. We conduct extensive experiments across 47 adaptation tasks, including both classification and segmentation in general and medical domains. The results demonstrate that our pMoE not only achieves superior performance by a large margin but also offers an optimal trade-off between computational efficiency and adaptation effectiveness compared to existing methods.

Chinese Translation

参数高效微调在各种视觉适应任务中表现出良好的结果，如分类和分割。通常，提示调优技术利用来自单一预训练模型的知识，无论是来自通用领域还是专业医疗领域。然而，这种方法通常忽视了在同一调优过程中整合多样化领域知识所带来的潜在协同效应。在本研究中，我们提出了一种新颖的混合专家提示调优方法，称为pMoE，它通过专家专用提示令牌和可学习的调度器，利用多个专家领域的优势，有效地将其专业知识结合在统一的模型框架中。我们的pMoE引入了专家特定的提示令牌，并在各个提示层中利用动态令牌调度机制，以优化每个领域专家在适应阶段的贡献。通过结合来自多样化专家的领域知识，所提出的pMoE显著增强了模型的多功能性和适用性，适用于广泛的任务。我们在47个适应任务上进行了广泛的实验，包括通用和医疗领域的分类和分割。结果表明，我们的pMoE不仅在性能上取得了显著的提升，而且在计算效率和适应有效性之间提供了最佳的权衡，相较于现有方法表现更为优越。


cs.CV / 65 / 2602.22941

Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings

基于平移和缩放视频录制的皮划艇冲刺队艇速度和划桨频率重建

Ziegler, Julian, Matthes, Daniel, Gerdts, Finn, Frenzel, Patrick, Warnke, Torsten, Englert, Matthias, Koevari, Tina, Fuchs, Mirco

Abstract

Pacing strategies, defined by velocity and stroke rate profiles, are essential for peak performance in canoe sprint. While GPS is the gold standard for analysis, its limited availability necessitates automated video-based solutions. This paper presents an extended framework for reconstructing performance metrics from panned and zoomed video recordings across all sprint disciplines (K1-K4, C1-C2) and distances (200m-500m). Our method utilizes YOLOv8 for buoy and athlete detection, leveraging the known buoy grid to estimate homographies. We generalized the estimation of the boat position by means of learning a boat-specific athlete offset using a U-Net based boat tip calibration. Further, we implement a robust tracking scheme using optical flow to adapt to multi-athlete boat types. Finally, we introduce methods to extract stroke rate information from either pose estimations or the athlete bounding boxes themselves. Evaluation against GPS data from elite competitions yields a velocity RRMSE of 0.020 ± 0.011 (rho = 0.956) and a stroke rate RRMSE of 0.022 ± 0.024 (rho = 0.932). The methods provide coaches with highly accurate, automated feedback without requiring on-boat sensors or manual annotation.

Chinese Translation

划桨策略由速度和划桨频率特征定义，对于皮划艇冲刺的最佳表现至关重要。尽管GPS是分析的金标准，但其有限的可用性使得自动化视频解决方案成为必要。本文提出了一种扩展框架，用于从平移和缩放的视频录制中重建所有冲刺项目（K1-K4，C1-C2）和距离（200米-500米）的表现指标。我们的方法利用YOLOv8进行浮标和运动员检测，利用已知的浮标网格来估计单应性。我们通过学习特定于艇的运动员偏移，使用基于U-net的艇尖校准来推广艇位的估计。此外，我们实施了一种稳健的跟踪方案，利用光流适应多运动员艇型。最后，我们介绍了从姿态估计或运动员边界框中提取划桨频率信息的方法。与精英比赛的GPS数据进行评估，速度的相对均方根误差（RRMSE）为0.020 ± 0.011（rho = 0.956），划桨频率的RRMSE为0.022 ± 0.024（rho = 0.932）。这些方法为教练提供了高度准确的自动反馈，无需艇上传感器或手动标注。
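
To make the reported evaluation figures in the abstract above easier to read, here is a small sketch of how a relative RMSE and a correlation coefficient could be computed for one race. Two points are assumptions rather than facts from the abstract: RRMSE is taken as the RMSE divided by the mean of the GPS reference, and rho is taken to be the Pearson correlation. The velocity values are made up for illustration.

```python
import math

def rrmse(estimate, reference):
    """Relative RMSE: RMSE divided by the mean of the reference (assumed normalization)."""
    n = len(reference)
    mse = sum((e - r) ** 2 for e, r in zip(estimate, reference)) / n
    return math.sqrt(mse) / (sum(reference) / n)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Made-up velocity profiles in m/s: GPS reference and video-based estimate.
gps_velocity = [5.0, 5.2, 5.1, 4.9, 4.8]
video_velocity = [5.1, 5.1, 5.2, 4.8, 4.9]

velocity_rrmse = rrmse(video_velocity, gps_velocity)
velocity_rho = pearson(video_velocity, gps_velocity)
print(round(velocity_rrmse, 3), round(velocity_rho, 3))
```

Under this definition, an RRMSE of 0.020 corresponds to a typical error of about 2% of the mean reference velocity.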


cs.CV / 66 / 2602.22945

Cross-Task Benchmarking of CNN Architectures

卷积神经网络架构的跨任务基准测试

Sherawat, Kamal, Bhati, Vikrant

Abstract

This project provides a comparative study of dynamic convolutional neural networks (CNNs) for various tasks, including image classification, segmentation, and time series analysis. Based on the ResNet-18 architecture, we compare five variants of CNNs: the vanilla CNN, the hard attention-based CNN, the soft attention-based CNN with local (pixel-wise) and global (image-wise) feature attention, and the omni-directional CNN (ODConv). Experiments on Tiny ImageNet, Pascal VOC, and the UCR Time Series Classification Archive illustrate that attention mechanisms and dynamic convolution methods consistently exceed conventional CNNs in accuracy, efficiency, and computational performance. ODConv was especially effective on morphologically complex images by being able to dynamically adjust to varying spatial patterns. Dynamic CNNs enhanced feature representation and cross-task generalization through adaptive kernel modulation. This project provides perspectives on advanced CNN design architecture for multiplexed data modalities and indicates promising directions in neural network engineering.

Chinese Translation

本项目提供了动态卷积神经网络（CNN）在多种任务上的比较研究，包括图像分类、分割和时间序列分析。基于ResNet-18架构，我们比较了五种CNN变体：基础CNN、基于硬注意力的CNN、基于软注意力的CNN（具有局部（像素级）和全局（图像级）特征注意力），以及全向卷积网络（ODConv）。在Tiny ImageNet、Pascal VOC和UCR时间序列分类档案上的实验表明，注意力机制和动态卷积方法在准确性、效率和计算性能上始终优于传统CNN。ODConv在形态复杂的图像上尤其有效，能够动态调整以适应不同的空间模式。动态CNN通过自适应核调制增强了特征表示和跨任务泛化。本项目为多模态数据的先进CNN设计架构提供了视角，并指出了神经网络工程中的有前景的方向。


cs.CV / 67 / 2602.22948

ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization

ToProVAR：通过三维熵感知语义分析和稀疏性优化实现高效的视觉自回归建模

Chen, Jiayu, Lin, Ruoyu, Zheng, Zihao, Li, Jingxin, Li, Maoliang, Luo, Guojie, Chen, Xiang

Abstract

Visual Autoregressive (VAR) models enhance generation quality but face a critical efficiency bottleneck in later stages. In this paper, we present a novel optimization framework for VAR models that fundamentally differs from prior approaches such as FastVAR and SkipVAR. Instead of relying on heuristic skipping strategies, our method leverages attention entropy to characterize the semantic projections across different dimensions of the model architecture. This enables precise identification of parameter dynamics under varying token granularity levels, semantic scopes, and generation scales. Building on this analysis, we further uncover sparsity patterns along three critical dimensions (token, layer, and scale) and propose a set of fine-grained optimization strategies tailored to these patterns. Extensive evaluation demonstrates that our approach achieves aggressive acceleration of the generation process while significantly preserving semantic fidelity and fine details, outperforming traditional methods in both efficiency and quality. Experiments on Infinity-2B and Infinity-8B models demonstrate that ToProVAR achieves up to 3.4x acceleration with minimal quality loss, effectively mitigating the issues found in prior work. Our code will be made publicly available.

Chinese Translation

视觉自回归（VAR）模型提高了生成质量，但在后期阶段面临关键的效率瓶颈。本文提出了一种新颖的VAR模型优化框架，与之前的FastVAR和SkipVAR等方法有根本区别。我们的方法不依赖于启发式跳过策略，而是利用注意力熵来表征模型架构不同维度的语义投影。这使得我们能够在不同的标记粒度水平、语义范围和生成规模下精确识别参数动态。在此分析基础上，我们进一步揭示了沿着三个关键维度——标记、层和规模的稀疏性模式，并提出了一套针对这些模式的细粒度优化策略。广泛的评估表明，我们的方法在显著保持语义保真度和细节的同时，实现了生成过程的显著加速，超越了传统方法在效率和质量上的表现。在Infinity-2B和Infinity-8B模型上的实验表明，ToProVAR实现了高达3.4倍的加速，且质量损失最小，有效缓解了之前工作中存在的问题。我们的代码将公开发布。
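
The attention-entropy signal mentioned in the abstract above can be illustrated with a short sketch. It computes the Shannon entropy of each query token's attention distribution and flags tokens below a threshold. The threshold rule and the function names are hypothetical; the abstract does not say how the entropy values are turned into token, layer, or scale sparsity decisions.

```python
import math

def attention_entropy(attn_row):
    """Shannon entropy (in nats) of one attention row whose weights sum to 1."""
    return -sum(p * math.log(p) for p in attn_row if p > 0.0)

def low_entropy_tokens(attn, threshold):
    """Indices of query tokens whose attention entropy is below the threshold (assumed rule)."""
    return [i for i, row in enumerate(attn) if attention_entropy(row) < threshold]

# Toy attention matrix: each row is one query token attending over four keys.
attn = [
    [0.25, 0.25, 0.25, 0.25],  # diffuse attention, entropy = ln 4
    [0.97, 0.01, 0.01, 0.01],  # sharply peaked attention
    [1.00, 0.00, 0.00, 0.00],  # one-hot attention, entropy = 0
]

entropies = [attention_entropy(row) for row in attn]
candidates = low_entropy_tokens(attn, threshold=0.5)
print(candidates)
```

A uniform row reaches the maximum entropy ln(n) for n keys, while a one-hot row has entropy 0, so the value gives a bounded measure of how concentrated a token's attention is.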


cs.CV / 68 / 2602.22949

OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis

OpenFS：具备多手能力的手指拼写识别，隐式手势检测与逐帧字母条件合成

Cha, Junuk, Kim, Jihyeon, Park, Han-Mu

Abstract

Fingerspelling is a component of sign languages in which words are spelled out letter by letter using specific hand poses. Automatic fingerspelling recognition plays a crucial role in bridging the communication gap between Deaf and hearing communities, yet it remains challenging due to the signing-hand ambiguity issue, the lack of appropriate training losses, and the out-of-vocabulary (OOV) problem. Prior fingerspelling recognition methods rely on explicit signing-hand detection, which often leads to recognition failures, and on a connectionist temporal classification (CTC) loss, which exhibits the peaky behavior problem. To address these issues, we develop OpenFS, an open-source approach for fingerspelling recognition and synthesis. We propose a multi-hand-capable fingerspelling recognizer that supports both single- and multi-hand inputs and performs implicit signing-hand detection by incorporating a dual-level positional encoding and a signing-hand focus (SF) loss. The SF loss encourages cross-attention to focus on the signing hand, enabling implicit signing-hand detection during recognition. Furthermore, without relying on the CTC loss, we introduce a monotonic alignment (MA) loss that enforces the output letter sequence to follow the temporal order of the input pose sequence through cross-attention regularization. In addition, we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectiveness of the proposed recognizer and generator. Code and data are available at: https://github.com/JunukCha/OpenFS.

Chinese Translation

手指拼写是手语的一部分，通过特定的手势逐字拼写单词。自动手指拼写识别在弥合聋人和听力人群之间的沟通差距中发挥着至关重要的作用，但由于手势歧义问题、缺乏适当的训练损失以及词汇外（OOV）问题，它仍然面临挑战。以往的手指拼写识别方法依赖于显式的手势检测，这往往导致识别失败，并且使用连接时序分类（CTC）损失，这表现出尖峰行为问题。为了解决这些问题，我们开发了OpenFS，这是一种用于手指拼写识别和合成的开源方法。我们提出了一种支持单手和多手输入的多手能力手指拼写识别器，并通过结合双层位置编码和手势聚焦（SF）损失来执行隐式手势检测。SF损失鼓励交叉注意力集中在手势上，从而在识别过程中实现隐式手势检测。此外，我们引入了一种单调对齐（MA）损失，不依赖于CTC损失，通过交叉注意力正则化强制输出字母序列遵循输入姿态序列的时间顺序。此外，我们提出了一种逐帧字母条件生成器，为OOV单词合成逼真的手指拼写姿态序列。该生成器使得构建一个新的合成基准成为可能，称为FSNeo。通过全面的实验，我们证明了我们的方法在识别方面达到了最先进的性能，并验证了所提议的识别器和生成器的有效性。代码和数据可在：https://github.com/JunukCha/OpenFS获取。


cs.CV / 69 / 2602.22955

MM-NeuroOnco: A Multimodal Benchmark and Instruction Dataset for MRI-Based Brain Tumor Diagnosis

MM-NeuroOnco：基于MRI的脑肿瘤诊断的多模态基准和指令数据集

Guo, Feng, Liu, Jiaxiang, Li, Yang, Shi, Qianqian, Xu, Mingkun

Abstract

Accurate brain tumor diagnosis requires models to not only detect lesions but also generate clinically interpretable reasoning grounded in imaging manifestations, yet existing public datasets remain limited in annotation richness and diagnostic semantics. To bridge this gap, we introduce MM-NeuroOnco, a large-scale multimodal benchmark and instruction-tuning dataset for brain tumor MRI understanding, consisting of 24,726 MRI slices from 20 data sources paired with approximately 200,000 semantically enriched multimodal instructions spanning diverse tumor subtypes and imaging modalities. To mitigate the scarcity and high cost of diagnostic semantic annotations, we develop a multi-model collaborative pipeline for automated medical information completion and quality control, enabling the generation of diagnosis-related semantics beyond mask-only annotations. Building upon this dataset, we further construct MM-NeuroOnco-Bench, a manually annotated evaluation benchmark with a rejection-aware setting to reduce biases inherent in closed-ended question formats. Evaluation across ten representative models shows that even the strongest baseline, Gemini 3 Flash, achieves only 41.88% accuracy on diagnosis-related questions, highlighting the substantial challenges of multimodal brain tumor diagnostic understanding. Leveraging MM-NeuroOnco, we further propose NeuroOnco-GPT, which achieves a 27% absolute accuracy improvement on diagnostic questions following fine-tuning. This result demonstrates the effectiveness of our dataset and benchmark in advancing clinically grounded multimodal diagnostic reasoning. Code and dataset are publicly available at: https://github.com/gfnnnb/MM-NeuroOnco

Chinese Translation

准确的脑肿瘤诊断不仅需要模型检测病变，还需生成基于影像表现的临床可解释推理，然而现有的公共数据集在注释丰富性和诊断语义方面仍然有限。为填补这一空白，我们推出了MM-NeuroOnco，这是一个大规模的多模态基准和指令调优数据集，旨在促进脑肿瘤MRI理解，包含来自20个数据源的24,726个MRI切片，以及约200,000个涵盖多种肿瘤亚型和影像模态的语义丰富的多模态指令。为了缓解诊断语义注释的稀缺性和高成本，我们开发了一种多模型协作管道，用于自动化医疗信息的补全和质量控制，从而生成超越仅有掩膜注释的诊断相关语义。在此数据集的基础上，我们进一步构建了MM-NeuroOnco-Bench，这是一个手动注释的评估基准，采用拒绝感知设置以减少封闭式问题格式固有的偏见。对十个代表性模型的评估显示，即使是最强的基线模型Gemini 3 Flash，在诊断相关问题上的准确率也仅为41.88%，突显了多模态脑肿瘤诊断理解的重大挑战。利用MM-NeuroOnco，我们进一步提出了NeuroOnco-GPT，在经过微调后在诊断问题上实现了27%的绝对准确率提升。该结果证明了我们的数据集和基准在推动临床基础多模态诊断推理方面的有效性。代码和数据集可在以下链接公开获取：https://github.com/gfnnnb/MM-NeuroOnco


cs.CV / 70 / 2602.22959

Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study

代理能否在零样本环境中区分视觉上难以分离的疾病？一项初步研究

Zhao, Zihao, Hauke, Frederik, De Castilhos, Juliana, Nebelung, Sven, Truhn, Daniel

Abstract

The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent-based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard-to-separate diseases in a zero-shot setting. We benchmark representative agents on two imaging-only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi-agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real-world settings. Within this controlled setting, this pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios.

Chinese Translation

多模态大型语言模型（MLLMs）的快速发展引发了对基于代理的系统日益增长的兴趣。尽管以往大多数医学影像研究集中于自动化常规临床工作流程，但我们研究了一个尚未充分探索但在临床上具有重要意义的场景：在零样本环境中区分视觉上难以分离的疾病。我们在两个仅基于影像的代理诊断任务上对代表性代理进行了基准测试，分别是（1）黑色素瘤与非典型痣，以及（2）肺水肿与肺炎，在这些任务中，尽管临床管理存在显著差异，视觉特征却高度混淆。我们引入了一种基于对比裁决的多代理框架。实验结果显示，诊断性能有所提升（在皮肤镜数据上准确率提高了11个百分点），并且在定性样本中减少了不支持的声明，尽管整体性能仍不足以用于临床部署。我们承认人类注释中固有的不确定性以及缺乏临床背景，这进一步限制了向现实世界环境的转化。在这一受控环境中，本初步研究为视觉混淆场景下的零样本代理性能提供了初步见解。


cs.CV / 71 / 2602.22960

UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models

UCM：通过时间感知位置编码扭曲统一相机控制与记忆的世界模型

Xu, Tianxing, Wang, Zixuan, Wang, Guangyuan, Hu, Li, Zhang, Zhongyi, Zhang, Peng, Zhang, Bang, Zhang, Song-Hai

Abstract

World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.

Chinese Translation

基于视频生成的世界模型在模拟互动环境方面展现出显著潜力，但在两个关键领域面临持续挑战：在场景重访时保持长期内容一致性，以及根据用户提供的输入实现精确的相机控制。现有基于显式三维重建的方法在无限场景和细粒度结构中往往妥协了灵活性。替代方法直接依赖于先前生成的帧，而未建立显式空间对应关系，从而限制了可控性和一致性。为了解决这些局限性，我们提出了UCM，一个通过时间感知位置编码扭曲机制统一长期记忆和精确相机控制的新框架。为了减少计算开销，我们设计了一种高效的双流扩散变换器，以实现高保真生成。此外，我们引入了一种可扩展的数据整理策略，利用基于点云的渲染来模拟场景重访，从而促进对超过50万单目视频的训练。在真实世界和合成基准上的大量实验表明，UCM在长期场景一致性方面显著优于最先进的方法，同时在高保真视频生成中也实现了精确的相机可控性。


cs.CV / 72 / 2602.23013

SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling

SubspaceAD：通过子空间建模实现无训练的少样本异常检测

Lendering, Camile, Akdag, Erkut, Bondarev, Egor

Abstract

Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results by employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 98.0% and 97.6% on the MVTec-AD dataset, and 93.3% and 98.3% on the VisA dataset, respectively, surpassing prior state-of-the-art results. Code and demo are available at https://github.com/CLendering/SubspaceAD.

Chinese Translation

在工业检测中，检测视觉异常通常需要仅使用每个类别的少量正常图像进行训练。近期的少样本方法利用基础模型特征取得了良好的效果，但通常依赖于记忆库、辅助数据集或多模态调优视觉-语言模型。因此，我们质疑在考虑视觉基础模型的特征表示时，这种复杂性是否是必要的。为了解答这个问题，我们提出了SubspaceAD，这是一种无训练的方法，分为两个简单的阶段。首先，通过冻结的DINOv2主干网络从一小组正常图像中提取补丁级特征。其次，拟合主成分分析（PCA）模型到这些特征，以估计正常变异的低维子空间。在推理阶段，通过相对于该子空间的重建残差检测异常，生成可解释且具有统计基础的异常分数。尽管方法简单，SubspaceAD在无训练、无提示调优或记忆库的情况下，在一-shot和少-shot设置中实现了最先进的性能。在一-shot异常检测设置中，SubspaceAD在MVTec-AD数据集上实现了图像级和像素级AUROC分别为98.0%和97.6%，在VisA数据集上分别为93.3%和98.3%，超越了之前的最先进结果。代码和演示可在 https://github.com/CLendering/SubspaceAD 获取。
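
The two stages described in the abstract above (fit PCA to normal features, then score by reconstruction residual) can be sketched with NumPy. This is an illustration under stated assumptions, not the released code: random toy vectors stand in for DINOv2 patch features, and the number of components and the function names are hypothetical.

```python
import numpy as np

def fit_normal_subspace(features, num_components):
    """Fit PCA to normal features of shape (n, d).

    Returns the feature mean and a (num_components, d) orthonormal basis
    spanning the leading principal directions.
    """
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:num_components]

def anomaly_scores(features, mean, basis):
    """Squared reconstruction residual of each feature with respect to the subspace."""
    centered = features - mean
    reconstruction = centered @ basis.T @ basis  # project onto the subspace and back
    return np.sum((centered - reconstruction) ** 2, axis=1)

rng = np.random.default_rng(0)
# Toy "normal" features (assumed): variation lives mostly in the first two of eight dimensions.
scales = np.array([5.0, 3.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
normal = rng.normal(size=(200, 8)) * scales

mean, basis = fit_normal_subspace(normal, num_components=2)

inlier = np.array([[4.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])   # lies in the normal subspace
outlier = np.array([[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0]])   # deviates along an unused direction
scores = anomaly_scores(np.vstack([inlier, outlier]), mean, basis)
print(scores)
```

The score is large only for deviations the normal subspace cannot explain, which is why a sample that is far from the mean but inside the subspace still scores low.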


cs.CV / 73 / 2602.23022

DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis

DMAligner：通过基于扩散模型的视图合成增强图像对齐

Luo, Xinglong, Luo, Ao, Wang, Zhengning, Yang, Yueqi, Feng, Chaoyu, Lei, Lei, Zeng, Bing, Liu, Shuaicheng

Abstract

Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded alignment visual quality and compromised accuracy in downstream tasks. In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. DMAligner is crafted to tackle the challenges in image alignment from a new perspective, employing a generation-based solution that showcases strong capabilities and avoids the problems associated with flow-based image warping. Specifically, we propose a Dynamics-aware Diffusion Training approach for learning conditional image generation, synthesizing a novel view for image alignment. This incorporates a Dynamics-aware Mask Producing (DMP) module to adaptively distinguish dynamic foreground regions from static backgrounds, enabling the diffusion model to more effectively handle challenges that classical methods struggle to solve. Furthermore, we develop the Dynamic Scene Image Alignment (DSIA) dataset using Blender, which includes 1,033 indoor and outdoor scenes with over 30K image pairs tailored for image alignment. Extensive experimental results demonstrate the superiority of the proposed approach on DSIA benchmarks, as well as on a series of widely-used video datasets for qualitative comparisons. Our code is available at https://github.com/boomluo02/DMAligner.

Chinese Translation

图像对齐是计算机视觉中的一项基础任务，具有广泛的应用。现有方法主要采用基于光流的图像变形技术。然而，这种技术容易受到遮挡和光照变化等常见挑战的影响，导致对齐的视觉质量下降，并在下游任务中影响准确性。本文提出了DMAligner，一个基于扩散的图像对齐框架，通过面向对齐的视图合成来实现图像对齐。DMAligner旨在从新的视角解决图像对齐中的挑战，采用基于生成的解决方案，展现出强大的能力，并避免与基于光流的图像变形相关的问题。具体而言，我们提出了一种动态感知扩散训练方法，用于学习条件图像生成，合成用于图像对齐的新视图。该方法结合了动态感知掩模生成（DMP）模块，能够自适应地区分动态前景区域与静态背景，从而使扩散模型更有效地处理经典方法难以解决的挑战。此外，我们使用Blender开发了动态场景图像对齐（DSIA）数据集，该数据集包含1,033个室内和室外场景，提供超过30K对为图像对齐量身定制的图像对。大量实验结果表明，所提方法在DSIA基准测试以及一系列广泛使用的视频数据集上的定性比较中均表现出优越性。我们的代码可在https://github.com/boomluo02/DMAligner获取。


cs.CV / 74 / 2602.23029

WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval

WISER：更广泛的搜索、更深入的思考和自适应融合的无训练零-shot复合图像检索

Wang, Tianyue, Qu, Leigang, Yang, Tianyu, Hao, Xiangzhao, Xu, Yifan, Guo, Haiyun, Wang, Jinqiao

Abstract

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at https://github.com/Physicsmile/WISER.

Chinese Translation

零-shot复合图像检索（Zero-Shot Composed Image Retrieval, ZS-CIR）旨在在没有经过标注三元组训练的情况下，根据多模态查询（包括参考图像和修改文本）检索目标图像。现有方法通常将多模态查询转换为单一模态——要么作为用于图像生成的编辑标题（Text-to-Image retrieval, T2I），要么作为用于图像到图像检索的编辑图像（Image-to-Image retrieval, I2I）。然而，每种范式都有其固有的局限性：T2I往往会丢失细粒度的视觉细节，而I2I在处理复杂的语义修改时则面临挑战。为了有效利用它们在不同查询意图下的互补优势，我们提出了WISER，一个无训练的框架，通过“检索-验证-精炼”管道统一T2I和I2I，明确建模意图意识和不确定性意识。具体而言，WISER首先通过生成编辑标题和图像进行并行检索来执行更广泛的搜索，以扩大候选池。然后，它与验证器进行自适应融合，以评估检索信心，对不确定的检索触发精炼，并动态融合可靠的双路径。对于不确定的检索，WISER通过结构化自我反思生成精炼建议，以指导下一轮检索朝向更深入的思考。大量实验表明，WISER在多个基准测试中显著优于以往的方法，在CIRCO（mAP@5）上实现了相对提升45%，在CIRR（Recall@1）上实现了57%的提升，超越了现有的无训练方法。值得注意的是，它甚至超过了许多依赖训练的方法，突显了其在多种场景下的优越性和泛化能力。代码将发布在https://github.com/Physicsmile/WISER。
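
The "retrieve-verify-refine" control flow described in the abstract above can be sketched as a small loop. Everything named here is hypothetical: the retrievers, verifier, and refiner are toy stubs, the confidence threshold is arbitrary, and the reciprocal-rank fusion is a simple stand-in for the paper's adaptive fusion.

```python
def fuse_by_rank(list_a, list_b):
    """Toy fusion: sum reciprocal ranks across the two retrieval paths."""
    scores = {}
    for ranked in (list_a, list_b):
        for rank, item in enumerate(ranked, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / rank
    return sorted(scores, key=lambda item: (-scores[item], item))

def retrieve_verify_refine(query, retrieve_t2i, retrieve_i2i, verify, refine,
                           confidence_threshold=0.5, max_rounds=3):
    """Run both paths, fuse when the verifier is confident, otherwise refine and retry.

    Assumes max_rounds >= 1. Returns the fused ranking and the number of rounds used.
    """
    for round_index in range(max_rounds):
        candidates_t2i = retrieve_t2i(query)   # wider search, text path
        candidates_i2i = retrieve_i2i(query)   # wider search, image path
        confidence = verify(query, candidates_t2i, candidates_i2i)
        if confidence >= confidence_threshold:
            return fuse_by_rank(candidates_t2i, candidates_i2i), round_index + 1
        query = refine(query, candidates_t2i, candidates_i2i)  # uncertain: refine the query
    return fuse_by_rank(candidates_t2i, candidates_i2i), max_rounds

# Toy stubs (assumed behavior, for illustration only).
def toy_t2i(query):
    return ["img_b", "img_a", "img_c"] if query["hints"] else ["img_c", "img_b", "img_a"]

def toy_i2i(query):
    return ["img_a", "img_b", "img_d"]

def toy_verify(query, t2i, i2i):
    # Confident only when the top text-path result also ranks high on the image path.
    return 0.9 if t2i[0] in i2i[:2] else 0.2

def toy_refine(query, t2i, i2i):
    return {**query, "hints": query["hints"] + ["keep the background unchanged"]}

ranking, rounds_used = retrieve_verify_refine(
    {"text": "same dog, but on a beach", "hints": []},
    toy_t2i, toy_i2i, toy_verify, toy_refine,
)
print(ranking, rounds_used)
```

In this toy run the first round is judged uncertain, the query gains one refinement hint, and the second round is fused.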


cs.CV / 75 / 2602.23031

Small Object Detection Model with Spatial Laplacian Pyramid Attention and Multi-Scale Features Enhancement in Aerial Images

基于空间拉普拉斯金字塔注意力和多尺度特征增强的航空图像小物体检测模型

Ji, Zhangjian, Yan, Huijia, Qiao, Shaotong, Feng, Kai, Wei, Wei

Abstract

Detecting objects in aerial images poses significant challenges: objects are small, and they are densely and non-uniformly distributed over high-resolution images, which makes detection inefficient. In this paper, we propose a small object detection algorithm for aerial images based on Spatial Laplacian Pyramid Attention and Multi-Scale Feature Enhancement. First, to improve the feature representation of ResNet-50 on small objects, we present a novel Spatial Laplacian Pyramid Attention (SLPA) module, which is integrated after each stage of ResNet-50 to identify and emphasize important local regions. Second, to enhance the model's semantic understanding and feature representation, we design a Multi-Scale Feature Enhancement Module (MSFEM), which is incorporated into the lateral connections of the C5 layer when building the Feature Pyramid Network (FPN). Finally, the feature representation quality of a traditional FPN suffers because features are not aligned when upper and lower layers are fused. To address this, we use deformable convolutions to align the features during the fusion of upper and lower FPN levels, which helps the model detect and recognize small objects. Extensive experimental results on two benchmark datasets, VisDrone and DOTA, demonstrate that our improved model performs better for small object detection in aerial images than the original algorithm.

Chinese Translation

在航空图像中检测物体面临一些重大挑战，包括小物体的尺寸、物体在高分辨率图像中的密集和非均匀分布，这使得检测效率低下。因此，本文提出了一种基于空间拉普拉斯金字塔注意力（Spatial Laplacian Pyramid Attention, SLPA）和多尺度特征增强（Multi-Scale Feature Enhancement, MSFEM）的小物体检测算法。首先，为了提高ResNet-50在小物体上的特征表示，我们提出了一种新颖的空间拉普拉斯金字塔注意力模块，该模块集成在ResNet-50的每个阶段后，以识别和强调重要的局部区域。其次，为了增强模型的语义理解和特征表示，我们设计了一个多尺度特征增强模块，该模块被纳入C5层的侧向连接中，以构建特征金字塔网络（Feature Pyramid Network, FPN）。最后，传统特征金字塔网络的特征表示质量会受到影响，因为在融合上下层时特征未对齐。为了解决这个问题，我们利用可变形卷积在特征金字塔网络的上下层融合处理过程中对齐特征，这有助于增强模型检测和识别小物体的能力。在两个基准数据集VisDrone和DOTA上的广泛实验结果表明，我们改进的模型在航空图像的小物体检测方面表现优于原始算法。


cs.CV / 76 / 2602.23040

PackUV: Packed Gaussian UV Maps for 4D Volumetric Video

PackUV：用于4D体积视频的紧凑高斯UV图

Rai, Aashish, Xing, Angela, Agarwal, Anushka, Cong, Xiaoyan, Li, Zekun, Lu, Tao, Prakash, Aayush, Sridhar, Srinath

Abstract

Volumetric videos offer immersive 4D experiences, but remain difficult to reconstruct, store, and stream at scale. Existing Gaussian Splatting based methods achieve high-quality reconstruction but break down on long sequences, suffer from temporal inconsistency, and fail under large motions and disocclusions. Moreover, their outputs are typically incompatible with conventional video coding pipelines, preventing practical applications. We introduce PackUV, a novel 4D Gaussian representation that maps all Gaussian attributes into a sequence of structured, multi-scale UV atlases, enabling compact, image-native storage. To fit this representation from multi-view videos, we propose PackUV-GS, a temporally consistent fitting method that directly optimizes Gaussian parameters in the UV domain. A flow-guided Gaussian labeling and video keyframing module identifies dynamic Gaussians, stabilizes static regions, and preserves temporal coherence even under large motions and disocclusions. The resulting UV atlas format is the first unified volumetric video representation compatible with standard video codecs (e.g., FFV1) without losing quality, enabling efficient streaming within existing multimedia infrastructure. To evaluate long-duration volumetric capture, we present PackUV-2B, the largest multi-view video dataset to date, featuring more than 50 synchronized cameras, substantial motion, and frequent disocclusions across 100 sequences and 2B (billion) frames. Extensive experiments demonstrate that our method surpasses existing baselines in rendering fidelity while scaling to sequences up to 30 minutes with consistent quality.

Chinese Translation

体积视频提供了沉浸式的4D体验，但在大规模重建、存储和流媒体传输方面仍然存在困难。现有的基于高斯点云的方法实现了高质量的重建，但在长序列、时间一致性方面出现问题，并且在大运动和遮挡情况下失效。此外，它们的输出通常与传统视频编码管道不兼容，阻碍了实际应用。我们提出了PackUV，一种新颖的4D高斯表示法，将所有高斯属性映射到一系列结构化的多尺度UV图集，实现紧凑的、图像原生的存储。为了从多视角视频中拟合这种表示，我们提出了PackUV-GS，一种时间一致的拟合方法，直接优化UV域中的高斯参数。一个流引导的高斯标记和视频关键帧模块识别动态高斯，稳定静态区域，并在大运动和遮挡情况下保持时间一致性。生成的UV图集格式是第一个与标准视频编解码器（例如FFV1）兼容的统一体积视频表示，且不会损失质量，从而实现了在现有多媒体基础设施中的高效流媒体传输。为了评估长时间体积捕捉，我们呈现了PackUV-2B，这是迄今为止最大的多视角视频数据集，包含超过50个同步摄像头、显著运动和频繁遮挡，涵盖100个序列和20亿帧。大量实验表明，我们的方法在渲染保真度上超越了现有基准，并且能够扩展到长达30分钟的序列，保持一致的质量。

cs.CV / 77 / 2602.23043

D-FINE-seg: Object Detection and Instance Segmentation Framework with multi-backend deployment

D-FINE-seg：具有多后端部署的目标检测与实例分割框架

Saakyan, Argo, Solntsev, Dmitry

Abstract

Transformer-based real-time object detectors achieve strong accuracy-latency trade-offs, and D-FINE is among the top-performing recent architectures. However, real-time instance segmentation with transformers is still less common. We present D-FINE-seg, an instance segmentation extension of D-FINE that adds a lightweight mask head and segmentation-aware training, including box-cropped BCE and Dice mask losses, auxiliary and denoising mask supervision, and an adapted Hungarian matching cost. On the TACO dataset, D-FINE-seg improves F1-score over Ultralytics YOLO26 under a unified TensorRT FP16 end-to-end benchmarking protocol, while maintaining competitive latency. The second contribution is an end-to-end pipeline for training, exporting, and optimized inference across ONNX, TensorRT, and OpenVINO for both object detection and instance segmentation tasks. This framework is released as open source under the Apache-2.0 license. GitHub repository: https://github.com/ArgoHA/D-FINE-seg.

Chinese Translation

基于Transformer的实时目标检测器在准确性与延迟之间实现了良好的平衡，而D-FINE是近期表现最优的架构之一。然而，使用Transformer进行实时实例分割仍然较为少见。我们提出了D-FINE-seg，这是D-FINE的实例分割扩展，增加了：轻量级的掩膜头、分割感知训练，包括框裁剪的二元交叉熵（BCE）和Dice掩膜损失、辅助和去噪掩膜监督，以及调整后的匈牙利匹配成本。在TACO数据集上，D-FINE-seg在统一的TensorRT FP16端到端基准测试协议下，相较于Ultralytics YOLO26提高了F1分数，同时保持了竞争力的延迟。第二个贡献是一个端到端的管道，用于训练、导出和优化推理，支持ONNX、TensorRT和OpenVINO，适用于目标检测和实例分割任务。该框架已作为开源项目发布，遵循Apache-2.0许可证。GitHub仓库 - https://github.com/ArgoHA/D-FINE-seg。
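
Editor's sketch: the abstract mentions box-cropped BCE and Dice mask losses without giving formulas. The snippet below is a minimal illustration of that general idea, assuming the loss is evaluated only on mask pixels inside a box. The function names, the end-exclusive box convention, and the equal loss weights are assumptions made for illustration and are not taken from the D-FINE-seg code.

```python
import math

def crop_to_box(mask, box):
    """Flatten the mask values inside box = (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    return [v for row in mask[y0:y1] for v in row[x0:x1]]

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)

def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + s) / (sum_p + sum_t + s)."""
    overlap = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * overlap + smooth) / (sum(probs) + sum(targets) + smooth)

def box_cropped_mask_loss(pred_mask, gt_mask, box, w_bce=1.0, w_dice=1.0):
    """Weighted BCE + Dice computed only on pixels inside the box (hypothetical weights)."""
    p = crop_to_box(pred_mask, box)
    t = crop_to_box(gt_mask, box)
    return w_bce * bce_loss(p, t) + w_dice * dice_loss(p, t)

# Toy 4x4 example: the object occupies the central 2x2 block.
gt = [[0, 0, 0, 0],
      [0, 1, 1, 0],
      [0, 1, 1, 0],
      [0, 0, 0, 0]]
good = [[0.9 if v else 0.1 for v in row] for row in gt]
bad = [[0.1 if v else 0.9 for v in row] for row in gt]
box = (1, 1, 3, 3)
print(box_cropped_mask_loss(good, gt, box), box_cropped_mask_loss(bad, gt, box))
```

Cropping to the box means that predictions outside the matched box do not contribute to the mask loss, which is one plausible reason to crop; the abstract does not state the authors' motivation.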

cs.CV / 78 / 2602.23058

GeoWorld: Geometric World Models

GeoWorld：几何世界模型

Zhang, Zeyu, Li, Danning, Reid, Ian, Hartley, Richard

Abstract

Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA 2. Project website: https://steve-zeyu-zhang.github.io/GeoWorld.

Chinese Translation

基于能量的预测世界模型通过对潜在能量景观进行推理，而不是生成像素，为多步视觉规划提供了一种强有力的方法。然而，现有的方法面临两个主要挑战：（i）它们的潜在表示通常是在欧几里得空间中学习的，忽视了状态之间的基础几何和层次结构；（ii）它们在长时间预测方面表现不佳，导致在扩展的滚动预测中迅速退化。为了解决这些挑战，我们引入了GeoWorld，一种通过超曲面JEPA（Hyperbolic JEPA）保持几何结构和层次关系的几何世界模型，该模型将潜在表示从欧几里得空间映射到超曲面流形。我们进一步引入了用于基于能量优化的几何强化学习，能够在超曲面潜在空间中实现稳定的多步规划。在CrossTask和COIN上的大量实验表明，与最先进的V-JEPA 2相比，3步规划的成功率提高了约3%，4步规划的成功率提高了2%。项目网站：https://steve-zeyu-zhang.github.io/GeoWorld。
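
Editor's sketch: the abstract says latent representations are mapped from Euclidean space onto hyperbolic manifolds but does not say which hyperbolic model is used. As an illustration only, the snippet below assumes the Poincare ball model and uses the standard exponential map at the origin together with the standard ball distance. The function names are hypothetical.

```python
import math

def exp_map_origin(v, c=1.0):
    """Map a Euclidean vector v (tangent at the origin) into the Poincare ball of curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]

def poincare_distance(x, y):
    """Geodesic distance between two points of the unit Poincare ball (curvature -1)."""
    def sq_norm(u):
        return sum(a * a for a in u)
    diff = sq_norm([a - b for a, b in zip(x, y)])
    arg = 1.0 + 2.0 * diff / ((1.0 - sq_norm(x)) * (1.0 - sq_norm(y)))
    return math.acosh(arg)

latent = [0.3, 0.4]                # Euclidean latent with norm 0.5
point = exp_map_origin(latent)     # lies strictly inside the unit ball
print(point, poincare_distance([0.0, 0.0], point))
```

For curvature -1, the distance from the origin to the mapped point is twice the Euclidean norm of the input, so distances grow without bound as points approach the boundary of the ball. That property is what makes hyperbolic space a natural fit for tree-like, hierarchical structure.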

cs.CV / 79 / 2602.23069

Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception

对齐再适应：重新思考4D感知中的参数高效迁移学习

Sun, Yiding, Zhu, Jihua, Cheng, Haozhe, Lu, Chaoyi, Yang, Zhichuan, Chen, Lin, Wang, Yaonan

Abstract

Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency, e.g., 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.

Chinese Translation

点云视频理解对机器人技术至关重要，因为它准确编码了运动和场景交互。我们认识到，4D数据集远比3D数据集稀缺，这阻碍了自监督4D模型的可扩展性。一种有前景的替代方案是将3D预训练模型迁移到4D感知任务中。然而，严格的实证分析揭示了两个阻碍迁移能力的关键限制：过拟合和模态差距。为了解决这些挑战，我们开发了一种新颖的“对齐再适应”（PointATA）范式，将参数高效迁移学习分解为两个顺序阶段。我们采用最优传输理论来量化3D和4D数据集之间的分布差异，使得我们提出的点对齐嵌入器能够在第一阶段进行训练，以减轻潜在的模态差距。为了缓解过拟合，我们将高效的点-视频适配器和空间上下文编码器集成到冻结的3D主干网络中，以增强第二阶段的时间建模能力。值得注意的是，凭借上述工程导向的设计，PointATA使得没有时间知识的预训练3D模型能够以较小的参数成本推理动态视频内容，相较于之前的工作具有更高的效率。大量实验表明，PointATA可以与强大的全微调模型相匹配甚至超越，同时享有参数高效的优势，例如在3D动作识别中达到97.21%的准确率，在4D动作分割中提高8.7%，以及在4D语义分割中达到84.06%。

cs.CV / 80 / 2602.23088

Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy

用语言描述细胞结构：弱监督视觉-语言建模在人脑显微镜学中的应用

Sutton, Matthew, Amunts, Katrin, Dickscheid, Timo, Schiffer, Christian

Abstract

Foundation models increasingly offer potential to support interactive, agentic workflows that assist researchers during analysis and interpretation of image data. Such workflows often require coupling vision to language to provide a natural-language interface. However, paired image-text data needed to learn this coupling are scarce and difficult to obtain in many research and clinical settings. One such setting is microscopic analysis of cell-body-stained histological human brain sections, which enables the study of cytoarchitecture: cell density and morphology and their laminar and areal organization. Here, we propose a label-mediated method that generates meaningful captions from images by linking images and text only through a label, without requiring curated paired image-text data. Given the label, we automatically mine area descriptions from related literature and use them as synthetic captions reflecting canonical cytoarchitectonic attributes. An existing cytoarchitectonic vision foundation model (CytoNet) is then coupled to a large language model via an image-to-text training objective, enabling microscopy regions to be described in natural language. Across 57 brain areas, the resulting method produces plausible area-level descriptions and supports open-set use through explicit rejection of unseen areas. It matches the cytoarchitectonic reference label for in-scope patches with 90.6% accuracy and, with the area label masked, its descriptions remain discriminative enough to recover the area in an 8-way test with 68.6% accuracy. These results suggest that weak, label-mediated pairing can suffice to connect existing biomedical vision foundation models to language, providing a practical recipe for integrating natural language in domains where fine-grained paired annotations are scarce.

Chinese Translation

基础模型越来越有潜力支持交互式、智能体式的工作流程，在图像数据的分析和解释过程中为研究人员提供帮助。这类工作流程通常需要将视觉与语言结合，以提供自然语言接口。然而，在许多研究和临床环境中，学习这种结合所需的配对图像-文本数据稀缺且难以获得。一个这样的环境是对细胞体染色的人脑组织切片的显微分析，这使得研究细胞结构成为可能：细胞密度和形态及其层次和区域组织。在此，我们提出了一种标签介导的方法，通过仅通过标签将图像和文本连接，生成有意义的图像说明，而无需经过精心策划的配对图像-文本数据。给定标签后，我们自动从相关文献中挖掘区域描述，并将其用作反映典型细胞结构属性的合成说明。现有的细胞结构视觉基础模型（CytoNet）随后通过图像到文本的训练目标与大型语言模型相结合，使得显微镜区域能够用自然语言进行描述。在57个脑区中，所得到的方法产生了合理的区域级描述，并通过明确拒绝未见区域支持开放集使用。它在范围内的补丁中与细胞结构参考标签匹配的准确率为90.6%，而在区域标签被屏蔽的情况下，其描述仍然足够具有区分性，在8路测试中以68.6%的准确率恢复该区域。这些结果表明，弱的标签介导配对足以将现有的生物医学视觉基础模型与语言连接，为在细粒度配对注释稀缺的领域中整合自然语言提供了一种实用的方法。

cs.CV / 81 / 2602.23101

Locally Adaptive Decay Surfaces for High-Speed Face and Landmark Detection with Event Cameras

用于高速人脸和地标检测的局部自适应衰减表面与事件相机

Kielty, Paul, Hanley, Timothy, Corcoran, Peter

Abstract

Event cameras record luminance changes with microsecond resolution, but converting their sparse, asynchronous output into dense tensors that neural networks can exploit remains a core challenge. Conventional histograms or globally-decayed time-surface representations apply fixed temporal parameters across the entire image plane, which in practice creates a trade-off between preserving spatial structure during still periods and retaining sharp edges during rapid motion. We introduce Locally Adaptive Decay Surfaces (LADS), a family of event representations in which the temporal decay at each location is modulated according to local signal dynamics. Three strategies are explored, based on event rate, Laplacian-of-Gaussian response, and high-frequency spectral energy. These adaptive schemes preserve detail in quiescent regions while reducing blur in regions of dense activity. Extensive experiments on public data show that LADS consistently improves both face detection and facial landmark accuracy compared to standard non-adaptive representations. At 30 Hz, LADS achieves higher detection accuracy and lower landmark error than either baseline, and at 240 Hz it mitigates the accuracy decline typically observed at higher frequencies, sustaining 2.44% normalized mean error for landmarks and 0.966 mAP50 in face detection. These high-frequency results even surpass the accuracy reported in prior works operating at 30 Hz, setting new benchmarks for event-based face analysis. Moreover, by preserving spatial structure at the representation stage, LADS supports the use of much lighter network architectures while still retaining real-time performance. These results highlight the importance of context-aware temporal integration for neuromorphic vision and point toward real-time, high-frequency human-computer interaction systems that exploit the unique advantages of event cameras.

Chinese Translation

事件相机以微秒分辨率记录亮度变化，但将其稀疏、异步的输出转换为神经网络可以利用的密集张量仍然是一个核心挑战。传统的直方图或全局衰减时间表面表示在整个图像平面上应用固定的时间参数，这在实践中造成了在静止期间保持空间结构与在快速运动期间保留锐利边缘之间的权衡。我们提出了局部自适应衰减表面（Locally Adaptive Decay Surfaces, LADS），这是一类事件表示，其中每个位置的时间衰减根据局部信号动态进行调节。我们探索了三种策略，基于事件率、拉普拉斯-高斯响应和高频谱能量。这些自适应方案在静止区域保留细节，同时减少密集活动区域的模糊。在公共数据上的广泛实验表明，与标准非自适应表示相比，LADS在面部检测和面部地标精度上始终有所提高。在30 Hz时，LADS的检测准确率高于两种基线，地标误差也更低，而在240 Hz时，它减轻了通常在更高频率下观察到的准确性下降，保持了地标的2.44%归一化平均误差和0.966 mAP50的面部检测准确率。这些高频结果甚至超过了先前在30 Hz下工作的研究报告的准确性，为基于事件的人脸分析设定了新的基准。此外，通过在表示阶段保留空间结构，LADS支持使用更轻量的网络架构，同时仍保持实时性能。这些结果突显了上下文感知时间集成对神经形态视觉的重要性，并指向利用事件相机独特优势的实时高频人机交互系统。
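
Editor's sketch: to make the idea of a locally adaptive decay concrete, the snippet below builds a time surface in which each pixel's decay constant depends on its recent event count. This loosely follows the event-rate strategy named in the abstract. The linear mapping from activity to time constant, the parameter names, and the default values are assumptions, not the paper's formulation.

```python
import math

def adaptive_time_surface(events, width, height, t_now,
                          tau_min=0.005, tau_max=0.1, window=0.05):
    """Time surface with a per-pixel decay constant (hypothetical rate-based rule).

    events: iterable of (x, y, t), with t in seconds.
    Pixels with many events in the last `window` seconds decay quickly;
    quiet pixels decay slowly and so keep their structure for longer.
    """
    last_t = [[None] * width for _ in range(height)]
    recent = [[0] * width for _ in range(height)]
    for x, y, t in events:
        if t > t_now:
            continue  # skip events newer than the frame being built
        if last_t[y][x] is None or t > last_t[y][x]:
            last_t[y][x] = t
        if t_now - t <= window:
            recent[y][x] += 1

    peak = max(max(row) for row in recent) or 1
    surface = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if last_t[y][x] is None:
                continue  # pixel never fired, stays at zero
            activity = recent[y][x] / peak  # 0 = quiet, 1 = busiest pixel
            tau = tau_max - activity * (tau_max - tau_min)
            surface[y][x] = math.exp(-(t_now - last_t[y][x]) / tau)
    return surface

# One quiet pixel (single event) and one busy pixel (five events), same last timestamp.
events = [(0, 0, 0.96)] + [(1, 0, 0.96)] * 5
surface = adaptive_time_surface(events, width=2, height=2, t_now=1.0)
print(surface)
```

With a single global time constant both pixels would receive the same value, because their last events are equally old. The adaptive rule lets the quiet pixel retain more of its response while the busy pixel fades faster.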

cs.CV / 82 / 2602.23103

SpectralMamba-UNet: Frequency-Disentangled State Space Modeling for Texture-Structure Consistent Medical Image Segmentation

SpectralMamba-UNet：用于纹理-结构一致的医学图像分割的频率解耦状态空间建模

Zhang, Fuhao, Liu, Lei, Zhang, Jialin, Zhang, Ya-Nan, Mu, Nan

Abstract

Accurate medical image segmentation requires effective modeling of both global anatomical structures and fine-grained boundary details. Recent state space models (e.g., Vision Mamba) offer efficient long-range dependency modeling. However, their one-dimensional serialization weakens local spatial continuity and high-frequency representation. To this end, we propose SpectralMamba-UNet, a novel frequency-disentangled framework to decouple the learning of structural and textural information in the spectral domain. Our Spectral Decomposition and Modeling (SDM) module applies discrete cosine transform to decompose low- and high-frequency features, where low frequency contributes to global contextual modeling via a frequency-domain Mamba and high frequency preserves boundary-sensitive details. To balance spectral contributions, we introduce a Spectral Channel Reweighting (SCR) mechanism to form channel-wise frequency-aware attention, and a Spectral-Guided Fusion (SGF) module to achieve adaptive multi-scale fusion in the decoder. Experiments on five public benchmarks demonstrate consistent improvements across diverse modalities and segmentation targets, validating the effectiveness and generalizability of our approach.

Chinese Translation

准确的医学图像分割需要有效建模全局解剖结构和细粒度边界细节。最近的状态空间模型（例如，Vision Mamba）提供了高效的长距离依赖建模。然而，它们的一维序列化削弱了局部空间连续性和高频表示。为此，我们提出了SpectralMamba-UNet，一种新颖的频率解耦框架，以在频谱域中解耦结构和纹理信息的学习。我们的谱分解与建模（SDM）模块应用离散余弦变换来分解低频和高频特征，其中低频通过频域Mamba有助于全局上下文建模，而高频则保留了对边界敏感的细节。为了平衡频谱贡献，我们引入了频谱通道重标定（SCR）机制，以形成通道级频率感知注意力，并且引入了频谱引导融合（SGF）模块，以实现解码器中的自适应多尺度融合。在五个公共基准上的实验表明，在不同模态和分割目标上均实现了一致的改进，验证了我们方法的有效性和普适性。
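
Editor's sketch: the abstract describes splitting features into low- and high-frequency parts with a discrete cosine transform. The toy below shows that decomposition on a 1D signal, assuming an orthonormal DCT-II and a hard cutoff index. The paper works on feature maps inside a network, so the 1D setting, the cutoff rule, and the function names are simplifying assumptions.

```python
import math

def dct(signal):
    """Orthonormal DCT-II of a 1D list."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, x in enumerate(signal)))
    return out

def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out

def split_frequencies(signal, cutoff):
    """Return (low, high) components whose sum reconstructs the signal."""
    coeffs = dct(signal)
    low = [c if k < cutoff else 0.0 for k, c in enumerate(coeffs)]
    high = [c if k >= cutoff else 0.0 for k, c in enumerate(coeffs)]
    return idct(low), idct(high)

signal = [1.0, 2.0, 3.0, 4.0, 10.0, 4.0, 3.0, 2.0]   # smooth ramp with a sharp spike
low, high = split_frequencies(signal, cutoff=3)
print([round(v, 3) for v in low])
print([round(v, 3) for v in high])
```

Because the transform is linear and orthonormal, the two components sum back to the input, so nothing is lost by routing them through separate branches. The smooth trend ends up in the low band, and most of the spike, which stands in for a boundary, ends up in the high band.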

cs.CV / 83 / 2602.23114

WARM-CAT: Warm-Started Test-Time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning

WARM-CAT：用于组合零样本学习的温启动测试时综合知识积累

Yan, Xudong, Feng, Songhe, Wang, Jiaxin, Su, Xin, Jin, Yi

Abstract

Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual prototypes from historical images for inference. Since the model tends to favor compositions already stored in the queue during testing, we warm-start the queue by initializing it with training images for visual prototypes of seen compositions and generating unseen visual prototypes using the mapping learned between seen and unseen textual prototypes. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. To provide a more reliable evaluation for CZSL, we introduce a new benchmark dataset, C-Fashion, and refine the widely used but noisy MIT-States dataset. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. The source code and datasets are available at https://github.com/xud-yan/WARM-CAT.

Chinese Translation

组合零样本学习（CZSL）旨在基于从已见组合中学习的知识识别新的属性-对象组合。现有方法在测试时因标签空间的分布转移而导致性能下降，这种转移源于未见组合的引入，这些组合是从属性和对象重新组合而成。为了解决这一挑战，我们提出了一种新方法，通过无监督数据在文本和视觉模态中积累综合知识，以在测试时更新多模态原型。在此基础上，我们进一步设计了一种自适应更新权重，以控制原型调整的程度，使模型能够灵活适应测试期间的分布转移。此外，我们引入了一个动态优先队列，用于存储高置信度图像，以从历史图像中获取视觉原型进行推理。由于模型在测试期间倾向于优先考虑队列中已存储的组合，我们通过用已见组合的训练图像初始化队列来温启动队列，并利用已见和未见文本原型之间学习到的映射生成未见视觉原型。考虑到多模态知识的语义一致性，我们通过多模态协同表示学习对齐文本和视觉原型。为了为CZSL提供更可靠的评估，我们引入了一个新的基准数据集C-Fashion，并对广泛使用但噪声较大的MIT-States数据集进行了改进。大量实验表明，我们的方法在封闭世界和开放世界设置下的四个基准数据集上实现了最先进的性能。源代码和数据集可在https://github.com/xud-yan/WARM-CAT获取。

cs.CV / 84 / 2602.23115

FLIGHT: Fibonacci Lattice-based Inference for Geometric Heading in real-Time

FLIGHT：基于斐波那契格子的实时几何航向推断

Dirnfeld, David, Delattre, Fabien, Miraldo, Pedro, Learned-Miller, Erik

Abstract

Estimating camera motion from monocular video is a fundamental problem in computer vision, central to tasks such as SLAM, visual odometry, and structure-from-motion. Existing methods that recover the camera's heading under known rotation, whether from an IMU or an optimization algorithm, tend to perform well in low-noise, low-outlier conditions, but often decrease in accuracy or become computationally expensive as noise and outlier levels increase. To address these limitations, we propose a novel generalization of the Hough transform on the unit sphere ($S^2$) to estimate the camera's heading. First, the method extracts correspondences between two frames and generates a great circle of directions compatible with each pair of correspondences. Then, by discretizing the unit sphere using a Fibonacci lattice as bin centers, each great circle casts votes for a range of directions, ensuring that features unaffected by noise or dynamic objects vote consistently for the correct motion direction. Experimental results on three datasets demonstrate that the proposed method is on the Pareto frontier of accuracy versus efficiency. Additionally, experiments on SLAM show that the proposed method reduces RMSE by correcting the heading during camera pose initialization.

Chinese Translation

从单目视频中估计相机运动是计算机视觉中的一个基本问题，涉及到SLAM、视觉里程计和运动重建等任务。现有方法在已知旋转的情况下恢复相机航向，无论是通过IMU还是优化算法，通常在低噪声、低异常值条件下表现良好，但随着噪声和异常值水平的增加，准确性往往下降或计算成本变得昂贵。为了解决这些局限性，我们提出了一种在单位球面(S(2))上对霍夫变换的新颖推广，以估计相机的航向。首先，该方法提取两个帧之间的对应关系，并生成与每对对应关系兼容的方向大圆。然后，通过使用斐波那契格子作为箱心对单位球面进行离散化，每个大圆为一系列方向投票，确保未受噪声或动态物体影响的特征一致地投票给正确的运动方向。在三个数据集上的实验结果表明，所提出的方法在准确性与效率的帕累托前沿上。此外，在SLAM实验中，所提出的方法通过在相机姿态初始化期间校正航向，降低了均方根误差(RMSE)。
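
Editor's sketch: the voting scheme described in the abstract can be illustrated in a few lines. With rotation already compensated, the epipolar constraint says the heading t satisfies t . (x1 x x2) = 0 for each correspondence, so every correspondence defines a great circle of candidate headings. The code below bins the sphere with a Fibonacci lattice and lets each circle vote for the bins within a small band. The band width, bin count, synthetic scene, and all function names are assumptions for illustration, not the authors' implementation.

```python
import math
import random

def normalize(v):
    n = math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2])
    return (v[0] / n, v[1] / n, v[2] / n)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def fibonacci_lattice(n):
    """n roughly uniform unit vectors on the sphere, used as bin centers."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        points.append((r * math.cos(golden_angle * i), r * math.sin(golden_angle * i), z))
    return points

def great_circle_normal(x1_rotated, x2):
    """Normal of the great circle of headings compatible with one correspondence."""
    return normalize(cross(x1_rotated, x2))

def vote_heading(normals, bins, band=0.05):
    """Each great circle votes for every bin center within `band` of it."""
    votes = [0] * len(bins)
    for n in normals:
        for k, p in enumerate(bins):
            if abs(dot(n, p)) < band:
                votes[k] += 1
    best = max(range(len(bins)), key=votes.__getitem__)
    return bins[best], votes[best]

# Synthetic test: pure translation along true_heading, 30% wrong matches.
rng = random.Random(0)
true_heading = normalize((0.3, -0.2, 0.9))
normals = []
for _ in range(200):
    point = (rng.uniform(-5, 5), rng.uniform(-5, 5), rng.uniform(4, 12))
    x1 = normalize(point)
    x2 = normalize(tuple(p - 0.5 * h for p, h in zip(point, true_heading)))
    if rng.random() < 0.3:  # replace with a random bearing to simulate an outlier
        x2 = normalize((rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)))
    normals.append(great_circle_normal(x1, x2))

estimate, support = vote_heading(normals, fibonacci_lattice(4000))
print(estimate, support, abs(dot(estimate, true_heading)))
```

The estimate is only defined up to sign, since a great circle passes through both t and -t. Outlier circles spread their votes over the whole sphere, while inlier circles all intersect at the true heading, which is why a simple argmax over bins tolerates a substantial outlier ratio in this toy setting.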

cs.CV / 85 / 2602.23117

Delving into Adversarial Transferability on Image Classification: Review, Benchmark, and Evaluation

深入探讨图像分类中的对抗转移性：综述、基准与评估

Wang, Xiaosen, Ge, Zhijin, Liu, Bohan, Fang, Zheng, Zhou, Fengfan, Zhang, Ruixuan, Wang, Shaokang, Luo, Yuyang

Abstract

Adversarial transferability refers to the capacity of adversarial examples generated on the surrogate model to deceive alternate, unexposed victim models. This property eliminates the need for direct access to the victim model during an attack, thereby raising considerable security concerns in practical applications and attracting substantial research attention recently. In this work, we discern a lack of a standardized framework and criteria for evaluating transfer-based attacks, leading to potentially biased assessments of existing approaches. To rectify this gap, we have conducted an exhaustive review of hundreds of related works, organizing various transfer-based attacks into six distinct categories. Subsequently, we propose a comprehensive framework designed to serve as a benchmark for evaluating these attacks. In addition, we delineate common strategies that enhance adversarial transferability and highlight prevalent issues that could lead to unfair comparisons. Finally, we provide a brief review of transfer-based attacks beyond image classification.

Chinese Translation

对抗转移性是指在替代模型上生成的对抗样本能够欺骗其他未暴露的受害模型的能力。这一特性消除了在攻击过程中对受害模型直接访问的需求，从而在实际应用中引发了相当大的安全隐患，并且最近吸引了大量的研究关注。在本研究中，我们发现缺乏一个标准化的框架和评估标准来评估基于转移的攻击，这可能导致对现有方法的评估存在偏差。为了解决这一问题，我们对数百篇相关文献进行了详尽的综述，将各种基于转移的攻击组织为六个不同的类别。随后，我们提出了一个综合框架，旨在作为评估这些攻击的基准。此外，我们阐明了增强对抗转移性的常见策略，并强调了可能导致不公平比较的普遍问题。最后，我们简要回顾了超越图像分类的基于转移的攻击。

cs.CV / 86 / 2602.23120

TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement

TriLite：利用通用视觉特征和三区域解耦的高效弱监督目标定位

Sabaghi, Arian, Oramas, José

Abstract

Weakly supervised object localization (WSOL) aims to localize target objects in images using only image-level labels. Despite recent progress, many approaches still rely on multi-stage pipelines or full fine-tuning of large backbones, which increases training cost, while the broader WSOL community continues to face the challenge of partial object coverage. We present TriLite, a single-stage WSOL framework that leverages a frozen Vision Transformer with self-supervised DINOv2 pre-training and introduces only a minimal number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. At its core is the proposed TriHead module, which decomposes patch features into foreground, background, and ambiguous regions, thereby improving object coverage while suppressing spurious activations. By disentangling classification and localization objectives, TriLite effectively exploits the universal representations learned by self-supervised ViTs without requiring expensive end-to-end training. Extensive experiments on CUB-200-2011, ImageNet-1K, and OpenImages demonstrate that TriLite sets a new state of the art, while remaining significantly more parameter-efficient and easier to train than prior methods. The code will be released soon.

Chinese Translation

弱监督目标定位（WSOL）旨在仅使用图像级标签在图像中定位目标物体。尽管近期取得了一些进展，但许多方法仍依赖于多阶段管道或对大型主干网络的全面微调，这增加了训练成本，同时更广泛的WSOL社区仍面临部分物体覆盖的挑战。我们提出了TriLite，一个单阶段的WSOL框架，利用经过Dinov2预训练的冻结视觉变换器（Vision Transformer）以自监督的方式，并仅引入极少量的可训练参数（在ImageNet-1K上少于80万）用于分类和定位。其核心是所提出的TriHead模块，该模块将补丁特征解构为前景、背景和模糊区域，从而提高物体覆盖率，同时抑制虚假激活。通过解耦分类和定位目标，TriLite有效利用自监督ViTs学习的通用表示，而无需昂贵的端到端训练。在CUB-200-2011、ImageNet-1K和OpenImages上的大量实验表明，TriLite设定了新的最先进水平，同时在参数效率和训练简易性上显著优于先前的方法。代码将很快发布。

cs.CV / 87 / 2602.23133

From Calibration to Refinement: Seeking Certainty via Probabilistic Evidence Propagation for Noisy-Label Person Re-Identification

从校准到精炼：通过概率证据传播寻求噪声标签下的人体重识别的确定性

Yuan, Xin, Zhang, Zhiyong, Xu, Xin, Wang, Zheng, Lin, Chia-Wen

Abstract

With the increasing demand for robust person Re-ID in unconstrained environments, learning from datasets with noisy labels and sparse per-identity samples remains a critical challenge. Existing noise-robust person Re-ID methods primarily rely on loss-correction or sample-selection strategies using softmax outputs. However, these methods suffer from two key limitations: 1) Softmax exhibits translation invariance, leading to over-confident and unreliable predictions on corrupted labels. 2) Conventional sample selection based on small-loss criteria often discards valuable hard positives that are crucial for learning discriminative features. To overcome these issues, we propose the CAlibration-to-REfinement (CARE) method, a two-stage framework that seeks certainty through probabilistic evidence propagation from calibration to refinement. In the calibration stage, we propose the probabilistic evidence calibration (PEC) that dismantles softmax translation invariance by injecting adaptive learnable parameters into the similarity function, and employs an evidential calibration loss to mitigate overconfidence on mislabeled samples. In the refinement stage, we design the evidence propagation refinement (EPR) that can more accurately distinguish between clean and noisy samples. Specifically, the EPR contains two steps: first, the composite angular margin (CAM) metric is proposed to precisely distinguish clean but hard-to-learn positive samples from mislabeled ones in a hyperspherical space; second, the certainty-oriented sphere weighting (COSW) is developed to dynamically allocate the importance of samples according to CAM, ensuring clean instances drive model updates. Extensive experimental results on the Market1501, DukeMTMC-ReID, and CUHK03 datasets under both random and patterned noise show that CARE achieves competitive performance.

Chinese Translation

随着对在非约束环境中鲁棒的人体重识别（Re-ID）需求的增加，从带有噪声标签和每个身份样本稀疏的数据集中学习仍然是一个关键挑战。现有的抗噪声人体重识别方法主要依赖于基于 softmax 输出的损失修正或样本选择策略。然而，这些方法存在两个主要局限性：1）Softmax 具有平移不变性，导致对损坏标签的过于自信和不可靠的预测。2）基于小损失标准的传统样本选择往往会丢弃对学习判别特征至关重要的有价值的难正样本。为了解决这些问题，我们提出了 CAlibration-to-REfinement（CARE）方法，这是一种通过从校准到精炼的概率证据传播寻求确定性的两阶段框架。在校准阶段，我们提出了概率证据校准（PEC），通过将自适应可学习参数注入相似性函数，拆解了 softmax 的平移不变性，并采用证据校准损失来减轻对错误标记样本的过度自信。在精炼阶段，我们设计了证据传播精炼（EPR），可以更准确地区分干净样本和噪声样本。具体而言，EPR 包含两个步骤：首先，提出复合角度边际（CAM）度量，以在超球面空间中精确区分干净但难以学习的正样本与错误标记样本；其次，开发了确定性导向的球体加权（COSW），根据 CAM 动态分配样本的重要性，确保干净实例推动模型更新。在 Market1501、DukeMTMC-ReID 和 CUHK03 数据集上进行的广泛实验结果显示，CARE 在随机和模式噪声下均实现了竞争力的性能。
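
Editor's sketch: the first limitation named in the abstract, the translation invariance of softmax, is easy to verify numerically. The snippet below shows that adding a constant to every logit leaves the softmax output unchanged, and contrasts this with a generic Dirichlet-style evidence measure that does react to the shift. The evidential part follows the common evidential deep learning construction (evidence = softplus of the logit) and is an assumption chosen for illustration; it is not the paper's PEC module.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softplus(z):
    """Stable log(1 + exp(z)), used here to turn a logit into non-negative evidence."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dirichlet_uncertainty(logits):
    """Vacuity K / S for a Dirichlet with alpha_k = softplus(logit_k) + 1."""
    alphas = [softplus(z) + 1.0 for z in logits]
    return len(alphas) / sum(alphas)

weak = [2.0, 1.0, 0.0]          # little total evidence
strong = [12.0, 11.0, 10.0]     # same logits shifted by +10

print(softmax(weak))
print(softmax(strong))
print(dirichlet_uncertainty(weak), dirichlet_uncertainty(strong))
```

Both logit vectors produce the same class probabilities, so a softmax-based confidence score cannot tell a weakly supported prediction from a strongly supported one. The evidence-based measure separates them, which is the kind of signal an evidential calibration loss can act on.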

cs.CV / 88 / 2602.23141

No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors

无标签，无前瞻：基于经典先验的无监督在线视频稳定化

Liu, Tao, Wan, Gang, Ren, Kan, Wen, Shibo

Abstract

We propose a new unsupervised framework for online video stabilization. Unlike methods based on deep learning that require paired stable and unstable datasets, our approach instantiates the classical stabilization pipeline with three stages and incorporates a multithreaded buffering mechanism. This design addresses three longstanding challenges in end-to-end learning: limited data, poor controllability, and inefficiency on hardware with constrained resources. Existing benchmarks focus mainly on handheld videos with a forward view in visible light, which restricts the applicability of stabilization to domains such as UAV nighttime remote sensing. To fill this gap, we introduce a new multimodal UAV aerial video dataset (UAV-Test). Experiments show that our method consistently outperforms state-of-the-art online stabilizers in both quantitative metrics and visual quality, while achieving performance comparable to offline methods.

Chinese Translation

我们提出了一种新的无监督在线视频稳定化框架。与需要配对稳定和不稳定数据集的深度学习方法不同，我们的方法实现了经典稳定化流程，分为三个阶段，并结合了多线程缓冲机制。该设计解决了端到端学习中的三个长期挑战：数据有限、可控性差以及在资源受限硬件上的低效率。现有基准主要集中在可见光下的手持视频和前向视角，这限制了稳定化在无人机夜间遥感等领域的应用。为填补这一空白，我们引入了一个新的多模态无人机航拍视频数据集（UAV-Test）。实验表明，我们的方法在定量指标和视觉质量上均持续优于最先进的在线稳定器，同时在性能上与离线方法相当。

cs.CV / 89 / 2602.23153

Efficient Encoder-Free Fourier-based 3D Large Multimodal Model

高效的无编码器傅里叶基础3D大型多模态模型

Mei, Guofeng, Lin, Wei, Riz, Luigi, Wu, Yujiao, Wang, Yiming, Poiesi, Fabio

Abstract

Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We propose Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space-filling curve serialization followed by an FFT enables efficient global context modeling and graph-based token merging. Lastly, our Fourier-augmented LoRA adapters inject global frequency-aware interactions into the LLMs at a negligible cost. Fase3D achieves performance comparable to encoder-based 3D LMMs while being significantly more efficient in computation and parameters. Project website: https://tev-fbk.github.io/Fase3D.

Chinese Translation

处理3D数据的大型多模态模型（LMMs）通常依赖于重型的、预训练的视觉编码器来提取几何特征。尽管最近的2D LMMs已开始消除此类编码器以提高效率和可扩展性，但将这一范式扩展到3D仍然面临挑战，因为点云的无序和大规模特性。这留下了一个关键的未解问题：我们如何设计一个有效且高效地对无序3D数据进行标记的LMM，而不依赖繁琐的编码器？我们提出了Fase3D，这是第一个高效的无编码器傅里叶基础3D场景LMM。Fase3D通过一种新颖的标记器来解决可扩展性和排列不变性的问题，该标记器结合了点云序列化和快速傅里叶变换（FFT）以近似自注意力。这一设计使得基于三个关键创新的有效且计算量最小的架构成为可能：首先，我们通过结构化超点紧凑地表示大型场景。其次，我们的填充曲线序列化结合FFT实现了高效的全局上下文建模和基于图的标记合并。最后，我们的傅里叶增强型LoRA适配器以微不足道的成本将全局频率感知的交互注入到LLMs中。Fase3D在性能上与基于编码器的3D LMMs相当，同时在计算和参数方面显著更高效。项目网站：https://tev-fbk.github.io/Fase3D。
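
Editor's sketch: the abstract relies on space-filling curve serialization to turn an unordered point cloud into a sequence, but does not name the curve. As an illustration, the snippet below assumes a Z-order (Morton) curve, which quantizes each point to a grid cell and interleaves the coordinate bits to obtain a sort key. The grid resolution and function names are hypothetical.

```python
def interleave_bits(x, y, z, bits=10):
    """Morton (Z-order) code: interleave the low `bits` bits of x, y and z."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def serialize_points(points, bits=10):
    """Order 3D points along a Z-order curve spanning their bounding box."""
    mins = [min(p[d] for p in points) for d in range(3)]
    maxs = [max(p[d] for p in points) for d in range(3)]
    cells = (1 << bits) - 1

    def key(p):
        q = []
        for d in range(3):
            span = maxs[d] - mins[d]
            q.append(int((p[d] - mins[d]) / span * cells) if span > 0 else 0)
        return interleave_bits(q[0], q[1], q[2], bits)

    return sorted(points, key=key)

cloud = [(1.0, 1.0, 1.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(serialize_points(cloud))
print(serialize_points(list(reversed(cloud))))
```

Sorting by the curve index gives the same sequence regardless of the input order, as long as no two points share a grid cell. A sequence model or an FFT along the sequence axis can therefore operate on a consistent ordering in which nearby tokens tend to be spatially close.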

cs.CV / 90 / 2602.23165

DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation

DyaDiT：一种用于社会友好型二人手势生成的多模态扩散变换器

Peng, Yichen, Song, Jyun-Ting, Jung, Siyeol, Liu, Ruofan, Liu, Haiyang, Chu, Xuangeng, Liu, Ruicong, Wu, Erwin, Koike, Hideki, Kitani, Kris

Abstract

Generating realistic conversational gestures is essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker's motion, without considering social context or modeling the mutual dynamics between two people engaging in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on the Seamless Interaction Dataset, DyaDiT takes dyadic audio with optional social-context tokens to produce context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally utilize the conversational partner's gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially favorable motion generation. Code and models will be released upon acceptance.

Chinese Translation

生成逼真的对话手势对于实现与数字人类的自然、社会互动至关重要。然而，现有方法通常将单一音频流映射到单一发言者的动作，而未考虑社会背景或建模两人对话中的相互动态。我们提出了DyaDiT，一种多模态扩散变换器，它从二人音频信号中生成上下文适宜的人类动作。DyaDiT在无缝交互数据集（Seamless Interaction Dataset）上进行训练，接受带有可选社会上下文标记的二人音频，以生成适合上下文的动作。它融合了两位发言者的信息，以捕捉互动动态，使用动作字典编码动作先验，并可以选择性地利用对话伙伴的手势来生成更具响应性的动作。我们在标准动作生成指标上评估了DyaDiT，并进行了定量用户研究，结果表明它不仅在客观指标上超越了现有方法，而且用户也强烈偏好，突显了其鲁棒性和社会友好的动作生成能力。代码和模型将在接受后发布。

cs.CV / 91 / 2602.23166

AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

AgentVista：在超具挑战性的真实视觉场景中评估多模态智能体

Su, Zhaochen, Gao, Jincheng, Guo, Hangyu, Liu, Zhenhua, Zhang, Lueyang, Geng, Xinyu, Huang, Shijue, Xia, Peng, Jiang, Guanyu, Wang, Cheng, Zhang, Yue, Fung, Yi R., He, Junxian

Abstract

Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under routing constraints. However, existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills, and they do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. We introduce AgentVista, a benchmark for generalist multimodal agents that spans 25 sub-domains across 7 categories, pairing realistic and detail-rich visual scenarios with natural hybrid tool use. Tasks require long-horizon tool interactions across modalities, including web search, image search, page navigation, and code-based operations for both image processing and general programming. Comprehensive evaluation of state-of-the-art models exposes significant gaps in their ability to carry out long-horizon multimodal tool use. Even the best model in our evaluation, Gemini-3-Pro with tools, achieves only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. We expect AgentVista to accelerate the development of more capable and reliable multimodal agents for realistic and ultra-challenging problem solving.

Chinese Translation

现实世界中的多模态智能体通过视觉证据解决多步骤工作流程。例如，一个智能体可以通过将接线照片与原理图链接并通过在线文档验证修复来排除设备故障，或者通过解读交通图并在路线约束下检查时刻表来规划旅行。然而，现有的多模态基准主要评估单轮视觉推理或特定工具技能，并未完全捕捉实际智能体所需的现实性、视觉细微差别和长时间工具使用。我们引入了AgentVista，这是一个针对通用多模态智能体的基准，涵盖7个类别中的25个子领域，将真实且细节丰富的视觉场景与自然的混合工具使用相结合。任务要求跨模态进行长时间的工具交互，包括网页搜索、图像搜索、页面导航以及图像处理和通用编程的基于代码的操作。对最先进模型的全面评估揭示了它们在进行长时间多模态工具使用方面的显著差距。即使是我们评估中的最佳模型Gemini-3-Pro（带工具），整体准确率也仅为27.3%，而困难实例可能需要超过25次工具调用。我们期待AgentVista能够加速开发更强大和可靠的多模态智能体，以应对真实且极具挑战性的问题解决。

cs.CV / 92 / 2602.23169

Learning Continuous Wasserstein Barycenter Space for Generalized All-in-One Image Restoration

学习连续的Wasserstein重心空间用于通用的全能图像恢复

Tang, Xiaole, He, Xiaoyi, Xu, Jiayi, Gu, Xiang, Sun, Jian

Abstract

Despite substantial advances in all-in-one image restoration for addressing diverse degradations within a unified model, existing methods remain vulnerable to out-of-distribution degradations, thereby limiting their generalization in real-world scenarios. To tackle the challenge, this work is motivated by the intuition that multisource degraded feature distributions are induced by different degradation-specific shifts from an underlying degradation-agnostic distribution, and recovering such a shared distribution is thus crucial for achieving generalization across degradations. With this insight, we propose BaryIR, a representation learning framework that aligns multisource degraded features in the Wasserstein barycenter (WB) space, which models a degradation-agnostic distribution by minimizing the average of Wasserstein distances to multisource degraded distributions. We further introduce residual subspaces, whose embeddings are mutually contrasted while remaining orthogonal to the WB embeddings. Consequently, BaryIR explicitly decouples two orthogonal spaces: a WB space that encodes the degradation-agnostic invariant contents shared across degradations, and residual subspaces that adaptively preserve the degradation-specific knowledge. This disentanglement mitigates overfitting to in-distribution degradations and enables adaptive restoration grounded on the degradation-agnostic shared invariance. Extensive experiments demonstrate that BaryIR performs competitively against state-of-the-art all-in-one methods. Notably, BaryIR generalizes well to unseen degradations (e.g., types and levels) and shows remarkable robustness in learning generalized features, even when trained on limited degradation types and evaluated on real-world data with mixed degradations.
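The barycenter objective named in the abstract can be made concrete in one dimension, where it has a closed form. The sketch below is only an illustration under stated assumptions: equal-size empirical samples, uniform weights, and the squared 2-Wasserstein distance (the abstract does not state the exponent). It is not BaryIR's learned feature-space barycenter, and all function names are hypothetical.

```python
def wasserstein2_1d(xs, ys):
    """Squared 2-Wasserstein distance between two equal-size 1D samples."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    # In 1D the optimal transport plan matches sorted values in order.
    return sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs)


def barycenter_1d(samples):
    """Uniform-weight W2 barycenter of equal-size 1D samples.

    In 1D the barycenter is obtained by averaging the sorted values
    (i.e. averaging quantiles) across the input distributions.
    """
    sorted_samples = [sorted(s) for s in samples]
    n = len(sorted_samples[0])
    assert all(len(s) == n for s in sorted_samples)
    return [sum(s[i] for s in sorted_samples) / len(sorted_samples)
            for i in range(n)]


def mean_w2(candidate, samples):
    """Average squared W2 distance from a candidate to every source."""
    return sum(wasserstein2_1d(candidate, s) for s in samples) / len(samples)


# Three toy "degraded" distributions, each a shifted copy of the same shape.
sources = [[0.0, 1.0, 2.0], [4.0, 3.0, 5.0], [8.0, 6.0, 7.0]]
bary = barycenter_1d(sources)
print(bary)                         # the shared, shift-free "centre"
print(mean_w2(bary, sources))       # average distance at the barycenter
print(mean_w2(sources[0], sources)) # any single source does worse
```

The point of the toy is that the barycenter sits between the shifted sources and is closer on average to all of them than any single source is, which is the role the abstract assigns to the degradation-agnostic distribution.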

Chinese Translation

尽管在统一模型中针对多种退化进行全能图像恢复方面取得了显著进展，但现有方法仍然容易受到分布外退化的影响，从而限制了它们在现实场景中的泛化能力。为了解决这一挑战，本研究的动机源于这样一种直觉：多源退化特征分布是由不同退化特定偏移引起的，而这些偏移来自于一个基础的与退化无关的分布，因此恢复这样一个共享分布对于实现跨退化的泛化至关重要。基于这一见解，我们提出了BaryIR，一个表示学习框架，它在Wasserstein重心（WB）空间中对齐多源退化特征，该空间通过最小化与多源退化分布的Wasserstein距离的平均值来建模一个与退化无关的分布。我们进一步引入了残差子空间，其嵌入在保持与WB嵌入正交的同时相互对比。因此，BaryIR明确地解耦了两个正交空间：一个WB空间编码了跨退化共享的与退化无关的不变内容，以及残差子空间自适应地保留退化特定知识。这种解耦减轻了对分布内退化的过拟合，并使基于与退化无关的共享不变性进行自适应恢复成为可能。大量实验表明，BaryIR在与最先进的全能方法相比时表现出竞争力。值得注意的是，BaryIR在未见过的退化（例如，类型和级别）上具有良好的泛化能力，并在有限的退化类型上训练时，仍能在评估混合退化的真实世界数据时展现出显著的鲁棒性。

cs.CV / 93 / 2602.23172

Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

用于4D全景占据跟踪的潜在高斯溅射

Luz, Maximilian, Mohan, Rohit, Nürnberg, Thomas, Miron, Yakov, Cattaneo, Daniele, Valada, Abhinav

Abstract

Capturing 4D spatiotemporal surroundings is crucial for the safe and reliable operation of robots in dynamic environments. However, most existing methods address only one side of the problem: they either provide coarse geometric tracking via bounding boxes, or detailed 3D structures like voxel-based occupancy that lack explicit temporal association. In this work, we present Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking (LaGS) that advances spatiotemporal scene understanding in a holistic direction. Our approach incorporates camera-based end-to-end tracking with mask-based multi-view panoptic occupancy prediction, and addresses the key challenge of efficiently aggregating multi-view information into 3D voxel grids via a novel latent Gaussian splatting approach. Specifically, we first fuse observations into 3D Gaussians that serve as a sparse point-centric latent representation of the 3D scene, and then splat the aggregated features onto a 3D voxel grid that is decoded by a mask-based segmentation head. We evaluate LaGS on the Occ3D nuScenes and Waymo datasets, achieving state-of-the-art performance for 4D panoptic occupancy tracking. We make our code available at https://lags.cs.uni-freiburg.de/.

Chinese Translation

捕捉4D时空环境对于机器人在动态环境中安全可靠的操作至关重要。然而，大多数现有方法仅解决问题的一方面：它们要么通过边界框提供粗略的几何跟踪，要么提供缺乏明确时间关联的详细3D结构，如基于体素的占据。在本研究中，我们提出了用于4D全景占据跟踪的潜在高斯溅射（Latent Gaussian Splatting, LaGS），以整体的方向推进时空场景理解。我们的方法结合了基于相机的端到端跟踪与基于掩膜的多视角全景占据预测，并解决了将多视角信息有效聚合到3D体素网格中的关键挑战，采用了一种新颖的潜在高斯溅射方法。具体而言，我们首先将观测值融合为3D高斯分布，作为3D场景的稀疏点中心潜在表示，然后将聚合的特征溅射到由基于掩膜的分割头解码的3D体素网格上。我们在Occ3D nuScenes和Waymo数据集上评估了LaGS，取得了4D全景占据跟踪的最先进性能。我们的代码可在https://lags.cs.uni-freiburg.de/获取。

cs.CV / 94 / 2602.23177

Phys-3D: Physics-Constrained Real-Time Crowd Tracking and Counting on Railway Platforms

Phys-3D：基于物理约束的铁路站台实时人群跟踪与计数

Zeng, Bin, Künzel, Johannes, Hilsmann, Anna, Eisert, Peter

Abstract

Accurate, real-time crowd counting on railway platforms is essential for safety and capacity management. We propose to use a single camera mounted in a train, scanning the platform as the train arrives. While the hardware setup is simple, counting remains challenging due to dense occlusions, camera motion, and perspective distortions during train arrivals. Most existing tracking-by-detection approaches assume static cameras or ignore physical consistency in motion modeling, leading to unreliable counting under dynamic conditions. We propose a physics-constrained tracking framework that unifies detection, appearance, and 3D motion reasoning in a real-time pipeline. Our approach integrates a transfer-learned YOLOv11m detector with EfficientNet-B0 appearance encoding within DeepSORT, while introducing a physics-constrained Kalman model (Phys-3D) that enforces physically plausible 3D motion dynamics through pinhole geometry. To address counting brittleness under occlusions, we implement a virtual counting band with persistence. On our platform benchmark, the MOT-RailwayPlatformCrowdHead Dataset (MOT-RPCH), our method reduces counting error to 2.97%, demonstrating robust performance despite motion and occlusions. Our results show that incorporating first-principles geometry and motion priors enables reliable crowd counting in safety-critical transportation scenarios, facilitating effective train scheduling and platform safety management.
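The pinhole geometry that the abstract refers to can be illustrated with the standard projection equations. This is a minimal sketch, not the Phys-3D Kalman model itself: the focal length, principal point, and the assumed real-world head height are hypothetical values chosen for the example.

```python
def project(point_3d, focal_px, cx, cy):
    """Project a camera-frame 3D point (metres) to pixel coordinates."""
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (focal_px * x / z + cx, focal_px * y / z + cy)


def depth_from_height(focal_px, real_height_m, pixel_height):
    """Depth implied by an object's known physical height and its image height."""
    return focal_px * real_height_m / pixel_height


# Hypothetical intrinsics for a 1280x720 camera.
FOCAL_PX, CX, CY = 800.0, 640.0, 360.0

pixel = project((1.0, 0.5, 4.0), FOCAL_PX, CX, CY)
depth = depth_from_height(FOCAL_PX, real_height_m=0.25, pixel_height=50.0)
print(pixel, depth)
```

A tracker that keeps its state in 3D and maps it to the image through relations like these can reject image-space motions that no physically plausible 3D motion would produce, which is the kind of constraint the abstract describes.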

Chinese Translation

在铁路站台上进行准确的实时人群计数对于安全和容量管理至关重要。我们提出使用安装在列车上的单个摄像头，在列车到达时扫描站台。尽管硬件约束较为简单，但由于密集遮挡、摄像头运动和列车到达时的透视失真，计数依然具有挑战性。现有的大多数基于检测的跟踪方法假设摄像头是静态的，或忽视运动建模中的物理一致性，导致在动态条件下计数不可靠。我们提出了一种物理约束的跟踪框架，在实时管道中统一了检测、外观和三维运动推理。我们的方法将经过迁移学习的YOLOv11m检测器与EfficientNet-B0外观编码集成在DeepSORT中，同时引入了一个物理约束的卡尔曼模型（Phys-3D），该模型通过针孔几何学强制执行物理上合理的三维运动动态。为了解决遮挡下计数的脆弱性，我们实现了一个具有持久性的虚拟计数带。在我们的平台基准测试MOT-RailwayPlatformCrowdHead Dataset（MOT-RPCH）上，我们的方法将计数误差降低到2.97%，尽管存在运动和遮挡，仍展示出强大的性能。我们的结果表明，结合第一性原理的几何学和运动先验能够在安全关键的交通场景中实现可靠的人群计数，从而促进有效的列车调度和站台安全管理。

cs.CV / 95 / 2602.23191

Uni-Animator: Towards Unified Visual Colorization

Uni-Animator：迈向统一的视觉上色

Chen, Xinyuan, Xu, Yao, Wang, Shaowen, Song, Pengjie, Deng, Bowen

Abstract

We propose Uni-Animator, a novel Diffusion Transformer (DiT)-based framework for unified image and video sketch colorization. Existing sketch colorization methods struggle to unify image and video tasks, suffering from imprecise color transfer with single or multiple references, inadequate preservation of high-frequency physical details, and compromised temporal coherence with motion artifacts in large-motion scenes. To tackle imprecise color transfer, we introduce visual reference enhancement via instance patch embedding, enabling precise alignment and fusion of reference color information. To resolve insufficient physical detail preservation, we design physical detail reinforcement using physical features that effectively capture and retain high-frequency textures. To mitigate motion-induced temporal inconsistency, we propose sketch-based dynamic RoPE encoding that adaptively models motion-aware spatial-temporal dependencies. Extensive experimental results demonstrate that Uni-Animator achieves competitive performance on both image and video sketch colorization, matching that of task-specific methods while unlocking unified cross-domain capabilities with high detail fidelity and robust temporal consistency.

Chinese Translation

我们提出了Uni-Animator，一种基于扩散变换器（Diffusion Transformer, DiT）的统一图像和视频草图上色框架。现有的草图上色方法在统一图像和视频任务方面面临挑战，存在单一或多个参考颜色传递不准确、高频物理细节保存不足，以及在大运动场景中运动伪影导致的时间一致性受损等问题。为了解决颜色传递不准确的问题，我们通过实例补丁嵌入引入视觉参考增强，能够精确对齐和融合参考颜色信息。为了解决物理细节保存不足的问题，我们设计了物理细节增强，利用物理特征有效捕捉和保留高频纹理。为减轻运动引起的时间不一致性，我们提出了基于草图的动态RoPE编码，能够自适应建模运动感知的时空依赖关系。大量实验结果表明，Uni-Animator在图像和视频草图上色方面都达到了竞争力的性能，与特定任务的方法相匹配，同时解锁了具有高细节保真度和强大时间一致性的统一跨域能力。

cs.CV / 96 / 2602.23192

FairQuant: Fairness-Aware Mixed-Precision Quantization for Medical Image Classification

FairQuant：面向公平性的混合精度量化在医学图像分类中的应用

Woergaard, Thomas, Selvan, Raghavendra

Abstract

Compressing neural networks by quantizing model parameters offers a useful trade-off between performance and efficiency. Methods like quantization-aware training and post-training quantization strive to maintain the downstream performance of compressed models compared to the full precision models. However, these techniques do not explicitly consider the impact on algorithmic fairness. In this work, we study fairness-aware mixed-precision quantization schemes for medical image classification under explicit bit budgets. We introduce FairQuant, a framework that combines group-aware importance analysis, budgeted mixed-precision allocation, and a learnable Bit-Aware Quantization (BAQ) mode that jointly optimizes weights and per-unit bit allocations under bitrate and fairness regularization. We evaluate the method on Fitzpatrick17k and ISIC2019 across ResNet18/50, DeiT-Tiny, and TinyViT. Results show that FairQuant configurations with average precision near 4-6 bits recover much of the Uniform 8-bit accuracy while improving worst-group performance relative to Uniform 4- and 8-bit baselines, with comparable fairness metrics under shared budgets.
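To make the terms "uniform b-bit quantization" and "average precision under a bit budget" concrete, here is a minimal sketch. It assumes a symmetric per-tensor uniform quantizer and a parameter-weighted average bit-width. It does not implement the BAQ mode or any fairness term, and the layer sizes and weights are made up for illustration.

```python
def quantize_uniform(weights, bits):
    """Symmetric uniform quantization of a list of floats to `bits` bits."""
    levels = 2 ** (bits - 1) - 1           # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / levels               # width of one quantization step
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(a, b):
    """Mean absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def average_bits(layer_sizes, layer_bits):
    """Parameter-weighted average bit-width of a mixed-precision allocation."""
    total = sum(layer_sizes)
    return sum(n * b for n, b in zip(layer_sizes, layer_bits)) / total


weights = [0.9, -0.35, 0.12, -0.77, 0.04]
err4 = mean_abs_error(weights, quantize_uniform(weights, 4))
err8 = mean_abs_error(weights, quantize_uniform(weights, 8))
print(err4, err8)

# A hypothetical two-layer allocation: a small layer kept at 8 bits,
# a larger one reduced to 4 bits.
budget = average_bits(layer_sizes=[1000, 3000], layer_bits=[8, 4])
print(budget)
```

A mixed allocation like the last one lands between the uniform 4-bit and 8-bit baselines in cost, which is the regime ("near 4-6 bits") the abstract reports on.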

Chinese Translation

通过量化模型参数来压缩神经网络在性能和效率之间提供了有益的权衡。量化感知训练和后训练量化等方法努力保持压缩模型相对于全精度模型的下游性能。然而，这些技术并未明确考虑对算法公平性的影响。在本研究中，我们研究了在明确比特预算下用于医学图像分类的面向公平性的混合精度量化方案。我们提出了FairQuant，一个结合了群体感知重要性分析、预算混合精度分配和可学习的比特感知量化（Bit-Aware Quantization, BAQ）模式的框架，该框架在比特率和公平性正则化下共同优化权重和每单位比特分配。我们在Fitzpatrick17k和ISIC2019数据集上评估该方法，使用ResNet18/50、DeiT-Tiny和TinyViT模型。结果表明，FairQuant配置在平均精度接近4-6比特时，恢复了大部分均匀8比特的准确性，同时相对于均匀4比特和8比特基线改善了最差组的性能，并在共享预算下保持了可比的公平性指标。

cs.CV / 97 / 2602.23203

ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation

ColoDiff：将动态一致性与内容感知相结合的结肠镜视频生成

Fu, Junhu, Liang, Shuyu, Li, Wutong, Ma, Chen, Huang, Peng, Wang, Kehao, Chen, Ke, Lin, Shengli, Zhou, Pinghong, Li, Zeju, Wang, Yuanyuan, Guo, Yi

Abstract

Colonoscopy video generation delivers dynamic, information-rich data critical for diagnosing intestinal diseases, particularly in data-scarce scenarios. High-quality video generation demands temporal consistency and precise control over clinical attributes, but faces challenges from irregular intestinal structures, diverse disease representations, and various imaging modalities. To this end, we propose ColoDiff, a diffusion-based framework that generates dynamic-consistent and content-aware colonoscopy videos, aiming to alleviate data shortage and assist clinical analysis. At the inter-frame level, our TimeStream module decouples temporal dependency from video sequences through a cross-frame tokenization mechanism, enabling intricate dynamic modeling despite irregular intestinal structures. At the intra-frame level, our Content-Aware module incorporates noise-injected embeddings and learnable prototypes to realize precise control over clinical attributes, breaking through the coarse guidance of diffusion models. Additionally, ColoDiff employs a non-Markovian sampling strategy that cuts steps by over 90% for real-time generation. ColoDiff is evaluated across three public datasets and one hospital database, based on both generation metrics and downstream tasks including disease diagnosis, modality discrimination, bowel preparation scoring, and lesion segmentation. Extensive experiments show ColoDiff generates videos with smooth transitions and rich dynamics. ColoDiff presents an effort in controllable colonoscopy video generation, revealing the potential of synthetic videos in complementing authentic representation and mitigating data scarcity in clinical settings.

Chinese Translation

结肠镜视频生成提供了动态且信息丰富的数据，这对于诊断肠道疾病至关重要，尤其是在数据稀缺的情况下。高质量的视频生成需要时间一致性和对临床属性的精确控制，但在不规则的肠道结构、多样的疾病表现和各种成像模式的挑战下面临困难。为此，我们提出了ColoDiff，一个基于扩散的框架，旨在生成动态一致且内容感知的结肠镜视频，以缓解数据短缺并辅助临床分析。在帧间层面，我们的TimeStream模块通过跨帧标记机制将时间依赖性与视频序列解耦，尽管存在不规则的肠道结构，仍能实现复杂的动态建模。在帧内层面，我们的Content-Aware模块结合了注入噪声的嵌入和可学习的原型，以实现对临床属性的精确控制，突破了扩散模型的粗略指导。此外，ColoDiff采用了一种非马尔可夫采样策略，使得实时生成的步骤减少了90%以上。ColoDiff在三个公共数据集和一个医院数据库上进行了评估，基于生成指标和下游任务，包括疾病诊断、模态区分、肠道准备评分和病变分割。大量实验表明，ColoDiff生成的视频具有平滑的过渡和丰富的动态性。ColoDiff在可控的结肠镜视频生成方面展现了努力，揭示了合成视频在补充真实表现和缓解临床环境中数据稀缺的潜力。

cs.CV / 98 / 2602.23204

Motion-aware Event Suppression for Event Cameras

基于运动感知的事件抑制方法用于事件相机

Pellerito, Roberto, Messikommer, Nico, Cioffi, Giovanni, Cannici, Marco, Scaramuzza, Davide

Abstract

In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by independently moving objects (IMOs) and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67% in segmentation accuracy while operating at a 53% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13%.

Chinese Translation

在本研究中，我们提出了第一个基于运动感知的事件抑制框架，该框架能够实时学习过滤由即时运动对象（IMOs）和自我运动触发的事件。我们的模型在当前事件流中联合分割IMOs，同时预测它们的未来运动，从而实现对动态事件的预期抑制。我们的轻量级架构在消费级GPU上以173 Hz的推理速度运行，内存使用量不足1 GB，且在具有挑战性的EVIMO基准测试中，在分割准确性上超越了之前的最先进方法67%，同时推理速率提高了53%。此外，我们还展示了对下游应用的显著益处：我们的方法通过令牌剪枝将视觉变换器（Vision Transformer）的推理速度提高了83%，并改善了基于事件的视觉里程计的准确性，将绝对轨迹误差（ATE）降低了13%。

cs.CV / 99 / 2602.23205

EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

EmbodMocap：面向具身智能体的野外4D人类场景重建

Wang, Wenjia, Pan, Liang, Pi, Huaijin, Lou, Yuke, Ren, Xuqian, Wu, Yifan, Liao, Zhouyingcheng, Yang, Lei, Dabral, Rishabh, Theobalt, Christian, Komura, Taku

Abstract

Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Evaluating against optical capture ground truth, we demonstrate that the dual-view setting exhibits a remarkable ability to mitigate depth ambiguity, achieving superior alignment and reconstruction performance over single-iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.

Chinese Translation

现实世界中的人类行为自然地编码了丰富的长期上下文信息，这些信息可以用于训练具身智能体进行感知、理解和行动。然而，现有的捕捉系统通常依赖于昂贵的摄影棚设置和可穿戴设备，限制了在野外大规模收集场景条件下的人类运动数据。为了解决这个问题，我们提出了EmbodMocap，这是一种使用两部移动iPhone的便携式和经济实惠的数据收集管道。我们的关键思想是共同校准双RGB-D序列，以在统一的度量世界坐标系中重建人类和场景。所提出的方法允许在日常环境中进行度量尺度和场景一致的捕捉，而无需静态相机或标记，从而无缝连接人类运动和场景几何。与光学捕捉的真实数据相比，我们展示了双视角设置显著减轻深度模糊的能力，在对齐和重建性能上优于单一iPhone或单目模型。基于收集的数据，我们支持三项具身人工智能任务：单目人类场景重建，我们在前馈模型上进行微调，输出度量尺度、世界空间对齐的人类和场景；基于物理的角色动画，我们证明我们的数据可以用于提升人类与物体交互技能和场景感知运动跟踪；以及机器人运动控制，我们通过模拟到真实的强化学习训练人形机器人，以复制视频中描绘的人类动作。实验结果验证了我们管道的有效性及其对推进具身人工智能研究的贡献。

cs.CV / 100 / 2602.23212

Through BrokenEyes: How Eye Disorders Impact Face Detection?

透视眼疾：眼部疾病如何影响人脸检测？

Adhikary, Prottay Kumar

Abstract

Vision disorders significantly impact millions of lives, altering how visual information is processed and perceived. In this work, a computational framework was developed using the BrokenEyes system to simulate five common eye disorders (age-related macular degeneration, cataract, glaucoma, refractive errors, and diabetic retinopathy) and to analyze their effects on neural-like feature representations in deep learning models. Leveraging a combination of human and non-human datasets, models trained under normal and disorder-specific conditions revealed critical disruptions in feature maps, particularly for cataract and glaucoma, which align with known neural processing challenges in these conditions. Evaluation metrics such as activation energy and cosine similarity quantified the severity of these distortions, providing insights into the interplay between degraded visual inputs and learned representations.
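Of the two metrics named in the abstract, cosine similarity has a standard definition that is easy to show. The sketch below compares flattened feature vectors; the vectors are made-up numbers, not outputs of the BrokenEyes system. "Activation energy" is left out because the abstract does not define it.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical flattened feature maps for the same image.
normal_features = [1.0, 2.0, 3.0]
rescaled_features = [2.0, 4.0, 6.0]    # same direction, different magnitude
distorted_features = [3.0, 0.0, -1.0]  # orthogonal to the normal features

print(cosine_similarity(normal_features, rescaled_features))
print(cosine_similarity(normal_features, distorted_features))
```

Because the metric ignores overall magnitude, it isolates changes in the pattern of activations, so a lower value between normal and disorder-conditioned features indicates a more severe distortion.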

Chinese Translation

视觉障碍显著影响数百万人的生活，改变了视觉信息的处理和感知方式。在本研究中，开发了一个计算框架，利用BrokenEyes系统模拟五种常见眼部疾病：年龄相关性黄斑变性、白内障、青光眼、屈光不正和糖尿病视网膜病，并分析它们对深度学习模型中神经类特征表示的影响。通过结合人类和非人类数据集，在正常和特定疾病条件下训练的模型揭示了特征图的关键干扰，特别是对于白内障和青光眼，这与已知的神经处理挑战相一致。激活能量和余弦相似度等评估指标量化了这些失真的严重程度，为退化视觉输入与学习表示之间的相互作用提供了见解。

cs.CV / 101 / 2602.23214

Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction

即插即用扩散与交替方向乘子法相结合：用于稳健医学图像重建的双变量耦合

Du, Chenhe, Tian, Xuanyu, Wu, Qing, Liu, Muyu, Yu, Jingyi, Wei, Hongjiang, Zhang, Yuyao

Abstract

Plug-and-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generative models as modular priors. However, we identify a critical flaw in prevailing PnP solvers (e.g., based on HQS or Proximal Gradient): they function as memoryless operators, updating estimates solely based on instantaneous gradients. This lack of historical tracking inevitably leads to non-vanishing steady-state bias, where the reconstruction fails to strictly satisfy physical measurements under heavy corruption. To resolve this, we propose Dual-Coupled PnP Diffusion, which restores the classical dual variable to provide integral feedback, theoretically guaranteeing asymptotic convergence to the exact data manifold. However, this rigorous geometric coupling introduces a secondary challenge: the accumulated dual residuals exhibit spectrally colored, structured artifacts that violate the Additive White Gaussian Noise (AWGN) assumption of diffusion priors, causing severe hallucinations. To bridge this gap, we introduce Spectral Homogenization (SH), a frequency-domain adaptation mechanism that modulates these structured residuals into statistically compliant pseudo-AWGN inputs. This effectively aligns the solver's rigorous optimization trajectory with the denoiser's valid statistical manifold. Extensive experiments on CT and MRI reconstruction demonstrate that our approach resolves the bias-hallucination trade-off, achieving state-of-the-art fidelity with significantly accelerated convergence.
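The contrast between memoryless solvers and a solver with a dual variable can be written out with the textbook scaled-form ADMM updates. The block below is a generic sketch, not the paper's exact formulation: $f$ is an assumed data-fidelity term, $\mathcal{D}$ stands for the plugged-in denoiser, and $\rho$ is an assumed penalty parameter.

```latex
% Generic problem:  \min_{x,z} f(x) + g(z)  \quad \text{s.t. } x = z
\begin{align}
  x^{k+1} &= \arg\min_{x}\; f(x) + \frac{\rho}{2}\,\bigl\| x - z^{k} + u^{k} \bigr\|_2^2
            && \text{(data-consistency step)} \\
  z^{k+1} &= \mathcal{D}\bigl( x^{k+1} + u^{k} \bigr)
            && \text{(denoiser replaces the proximal map of } g\text{)} \\
  u^{k+1} &= u^{k} + x^{k+1} - z^{k+1}
            && \text{(dual update: running sum of residuals)}
\end{align}
% HQS-style splitting corresponds to fixing u^{k} \equiv 0 in the first two steps.
```

The third line is the "integral feedback" the abstract mentions: $u^{k}$ accumulates the constraint residual $x - z$ across iterations, so a persistent mismatch keeps growing the correction until it is removed. It also shows why the denoiser input $x^{k+1} + u^{k}$ need not look like a clean image plus white Gaussian noise, which is the problem Spectral Homogenization is introduced to address.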

Chinese Translation

即插即用扩散先验（PnPDP）框架作为解决成像逆问题的强大范式，通过将预训练的生成模型视为模块化先验而崭露头角。然而，我们发现现有PnP求解器（例如基于HQS或近端梯度的方法）存在一个关键缺陷：它们作为无记忆算子，仅基于瞬时梯度更新估计。这种缺乏历史追踪的特性不可避免地导致非消失的稳态偏差，使得在严重污染情况下重建无法严格满足物理测量。为了解决这个问题，我们提出了双耦合PnP扩散，它恢复了经典的双变量以提供整体反馈，理论上保证了渐近收敛到精确数据流形。然而，这种严格的几何耦合引入了一个次要挑战：累积的双重残差显示出光谱着色的结构性伪影，违反了扩散先验的加性白噪声（AWGN）假设，导致严重的幻觉。为了弥补这一差距，我们引入了光谱均匀化（SH），这是一种频域适应机制，将这些结构残差调制为统计合规的伪AWGN输入。这有效地将求解器的严格优化轨迹与去噪器的有效统计流形对齐。在CT和MRI重建上的大量实验表明，我们的方法解决了偏差与幻觉之间的权衡，实现了最先进的保真度，并显著加快了收敛速度。

cs.CV / 102 / 2602.23217

Multidimensional Task Learning: A Unified Tensor Framework for Computer Vision Tasks

多维任务学习：计算机视觉任务的统一张量框架

Ichi, Alaa El, Jbilou, Khalide

Abstract

This paper introduces Multidimensional Task Learning (MTL), a unified mathematical framework based on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors via the Einstein product. We argue that current computer vision task formulations are inherently constrained by matrix-based thinking: standard architectures rely on matrix-valued weights and vector-valued biases, requiring structural flattening that restricts the space of naturally expressible tasks. GE-MLPs lift this constraint by operating with tensor-valued parameters, enabling explicit control over which dimensions are preserved or contracted without information loss. Through rigorous mathematical derivations, we demonstrate that classification, segmentation, and detection are special cases of MTL, differing only in their dimensional configuration within a formally defined task space. We further prove that this task space is strictly larger than what matrix-based formulations can natively express, enabling principled task configurations such as spatiotemporal or cross-modal predictions that require destructive flattening under conventional approaches. This work provides a mathematical foundation for understanding, comparing, and designing computer vision tasks through the lens of tensor algebra.
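The Einstein product at the core of the framework can be shown on a very small case. The sketch below contracts the last two modes of a fourth-order tensor with a second-order tensor, so a 2D input stays 2D without being flattened into a vector. It uses plain nested lists for self-containment; the function name and the example tensors are illustrative and are not taken from the paper.

```python
def einstein_product_2(weight, x):
    """Einstein product contracting two modes.

    weight has shape (I, J, K, L), x has shape (K, L), and the result has
    shape (I, J) with  y[i][j] = sum over k, l of weight[i][j][k][l] * x[k][l].
    """
    n_i, n_j = len(weight), len(weight[0])
    n_k, n_l = len(x), len(x[0])
    return [[sum(weight[i][j][k][l] * x[k][l]
                 for k in range(n_k) for l in range(n_l))
             for j in range(n_j)]
            for i in range(n_i)]


def build_tensor(rule, size=2):
    """Fourth-order tensor T[i][j][k][l] = 1 where rule(i, j, k, l) holds, else 0."""
    r = range(size)
    return [[[[1 if rule(i, j, k, l) else 0 for l in r] for k in r]
             for j in r] for i in r]


x = [[1, 2], [3, 4]]

# Identity tensor: maps every 2x2 input to itself.
identity = build_tensor(lambda i, j, k, l: (i, j) == (k, l))
# A tensor that swaps the two spatial modes, i.e. transposes the input.
swap = build_tensor(lambda i, j, k, l: (i, j) == (l, k))

print(einstein_product_2(identity, x))
print(einstein_product_2(swap, x))
```

The same numbers could be obtained by reshaping the weight to a 4x4 matrix and the input to a length-4 vector, but then the two spatial modes are no longer explicit. Keeping them separate is what lets the tensor formulation state which dimensions a layer preserves and which it contracts.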

Chinese Translation

本文介绍了多维任务学习（MTL），这是一个基于广义爱因斯坦多层感知器（GE-MLPs）的统一数学框架，能够通过爱因斯坦乘积直接在张量上进行操作。我们认为，当前计算机视觉任务的表述本质上受到基于矩阵思维的限制：标准架构依赖于矩阵值权重和向量值偏置，这要求结构扁平化，从而限制了自然可表达任务的空间。GE-MLPs 通过使用张量值参数消除了这一限制，使得能够明确控制保留或收缩哪些维度而不丢失信息。通过严格的数学推导，我们证明了分类、分割和检测是 MTL 的特例，仅在正式定义的任务空间内的维度配置上有所不同。我们进一步证明了该任务空间严格大于基于矩阵的表述所能原生表达的范围，使得能够实现诸如时空或跨模态预测等原则性任务配置，而这些在传统方法下需要进行破坏性扁平化。该研究为通过张量代数的视角理解、比较和设计计算机视觉任务提供了数学基础。

cs.CV / 103 / 2602.23224

UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception

UniScale：通过先验注入实现多视图理解的统一尺度感知3D重建框架用于机器人感知

Mahdavian, Mohammad, Tan, Gordon, Xu, Binbin, Ren, Yuan, Bai, Dongfeng, Liu, Bingbing

Abstract

We present UniScale, a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that flexibly integrates geometric priors through a modular, semantically informed design. In vision-based robotic navigation, the accurate extraction of environmental structure from raw image sequences is critical for downstream tasks. UniScale addresses this challenge with a single feed-forward network that jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of a scene from multi-view images, while optionally incorporating auxiliary geometric priors when available. By combining global contextual reasoning with camera-aware feature representations, UniScale is able to recover the metric scale of the scene. In robotic settings where camera intrinsics are known, they can be easily incorporated to improve performance, with additional gains obtained when camera poses are also available. This co-design enables robust, metric-aware 3D reconstruction within a single unified model. Importantly, UniScale does not require training from scratch, and leverages world priors exhibited in pre-existing models without geometric encoding strategies, making it particularly suitable for resource-constrained robotic teams. We evaluate UniScale on multiple benchmarks, demonstrating strong generalization and consistent performance across diverse environments. We will release our implementation upon acceptance.

Chinese Translation

我们提出了UniScale，一个统一的、尺度感知的多视图3D重建框架，适用于机器人应用，灵活地通过模块化和语义信息设计集成几何先验。在基于视觉的机器人导航中，从原始图像序列中准确提取环境结构对于下游任务至关重要。UniScale通过一个单一的前馈网络解决了这一挑战，该网络联合估计相机内参和外参、尺度不变的深度和点云图，以及从多视图图像中获取场景的度量尺度，同时在可用时可选地结合辅助几何先验。通过将全局上下文推理与相机感知特征表示相结合，UniScale能够恢复场景的度量尺度。在相机内参已知的机器人环境中，可以轻松地将其纳入以提高性能，当相机位姿也可用时，性能进一步提升。这种协同设计使得在单一统一模型中实现稳健的、尺度感知的3D重建成为可能。重要的是，UniScale不需要从头开始训练，而是利用现有模型中表现出的世界先验，而无需几何编码策略，使其特别适合资源受限的机器人团队。我们在多个基准上评估了UniScale，展示了其强大的泛化能力和在多样化环境中的一致性能。我们将在接受后发布我们的实现。

cs.CV / 104 / 2602.23228

MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction

MovieTeller：工具增强的电影摘要生成与ID一致的渐进抽象

Li, Yizhi, Chen, Xiaohan, Jiang, Miao, Tang, Wentao, Wang, Gaoang

Abstract

With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings: precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.

Chinese Translation

随着数字娱乐的爆炸性增长，自动视频摘要已成为内容索引、个性化推荐和高效媒体归档等应用不可或缺的工具。对于长篇视频（如电影和电视剧）的自动摘要生成，对现有的视觉-语言模型（VLMs）提出了重大挑战。尽管这些通用模型在单图像描述方面表现出色，但在长时段上下文中往往存在关键性缺陷，主要表现为缺乏ID一致的人物识别和叙事连贯性。为克服这些局限性，我们提出了MovieTeller，一个通过工具增强的渐进抽象生成电影摘要的新框架。我们的核心贡献是一个无需训练的、工具增强的、基于事实的生成过程。我们的框架以即插即用的方式直接利用现成模型，而无需昂贵的模型微调。我们首先调用一个专门的人脸识别模型作为外部“工具”，以建立事实基础——精确的人物身份及其对应的边界框。这些基础信息随后被注入到提示中，以引导VLM的推理，确保生成的场景描述与可验证的事实相结合。此外，我们的渐进抽象流程将完整电影的摘要分解为多阶段过程，有效缓解了当前VLM的上下文长度限制。实验表明，与端到端基线相比，我们的方法在事实准确性、人物一致性和整体叙事连贯性方面均显著提升。

cs.CV / 105 / 2602.23229

Large Multimodal Models as General In-Context Classifiers

大型多模态模型作为通用上下文分类器

Garosi, Marco, Farina, Matteo, Conti, Alessandro, Mancini, Massimiliano, Ricci, Elisa

Abstract

Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMMs) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers, and a flexible alternative to specialized models.

Chinese Translation

我们应该使用哪种多模态模型进行分类？以往研究表明，答案在于类似CLIP的对比视觉-语言模型（VLMs），因为它们在零样本分类中的出色表现。相比之下，大型多模态模型（LMM）更适合复杂任务。在本研究中，我们认为这个答案忽视了LMM的一个重要能力：上下文学习。我们在多样化的数据集上对最先进的LMM进行了基准测试，针对封闭世界分类发现，尽管它们的零样本性能低于CLIP，但带有少量上下文示例的LMM能够与基于缓存适配器的对比VLM相匹配甚至超越，后者是它们的“上下文”对应物。我们将这一分析扩展到开放世界设置，在这里，LMM的生成特性使其更适合该任务。在这一具有挑战性的场景中，LMM在提供不完美的上下文信息时会遇到困难。为了解决这个问题，我们提出了CIRCLE，一种简单的无训练方法，它为上下文示例分配伪标签，并通过可用的上下文信息迭代地进行精炼。通过大量实验，我们表明CIRCLE为开放世界分类建立了一个稳健的基线，超越了VLM的对应物，并突显了LMM作为统一分类器的潜力，以及作为专用模型的灵活替代方案。

cs.CV / 106 / 2602.23231

Skarimva: Skeleton-based Action Recognition is a Multi-view Application

Skarimva：基于骨架的动作识别是一种多视角应用

Bermuth, Daniel, Poeppel, Alexander, Reif, Wolfgang

Abstract

Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active research on improving the machine learning algorithms for skeleton-based action recognition, not much attention has been given to the quality of the input skeleton data itself. This work demonstrates that by making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical use-cases, therefore future research in skeleton-based action recognition should consider multi-view applications as the standard setup.

Chinese Translation

人类动作识别在发展人机智能交互中发挥着重要作用。尽管在改进基于骨架的动作识别的机器学习算法方面有大量活跃的研究，但对输入骨架数据本身质量的关注却不多。本研究表明，通过利用多个摄像头视角来三角测量更准确的三维骨架，可以显著提高最先进的动作识别模型的性能。这表明，输入数据的质量目前是这些模型性能的一个限制因素。基于这些结果，本文认为在大多数实际应用中，使用多个摄像头的成本效益比非常有利，因此未来在基于骨架的动作识别研究中应将多视角应用视为标准设置。

cs.CV / 107 / 2602.23235

Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents

高效高分辨率GUI代理的时空令牌剪枝

Xu, Zhou, Zhou, Bowen, Wang, Qi, Feng, Shuwen, Xiao, Jingyu

Abstract

Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
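The idea of decay-based resizing for history frames can be sketched in a few lines. The abstract does not give the actual schedule used by Temporal-Adaptive Resolution, so everything specific here is an assumption made for illustration: the exponential decay, the `decay` and `min_scale` parameters, and the function name.

```python
def history_resolutions(width, height, num_frames, decay=0.5, min_scale=0.25):
    """Resolution for each screenshot, newest first (age 0 is the current frame).

    Older frames are shrunk by decay**age, but never below min_scale,
    mirroring a "fading memory" in which distant history needs less detail.
    """
    sizes = []
    for age in range(num_frames):
        scale = max(min_scale, decay ** age)
        sizes.append((max(1, round(width * scale)),
                      max(1, round(height * scale))))
    return sizes


def relative_pixel_cost(sizes):
    """Total pixels relative to keeping every frame at the newest frame's size."""
    full_w, full_h = sizes[0]
    total = sum(w * h for w, h in sizes)
    return total / (full_w * full_h * len(sizes))


sizes = history_resolutions(1920, 1080, num_frames=4)
print(sizes)
print(relative_pixel_cost(sizes))
```

Since the number of visual tokens grows roughly with pixel area, shrinking old frames this way cuts the history cost sharply while the current frame keeps full resolution for precise coordinate grounding.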

Chinese Translation

纯视觉GUI代理提供了通用的交互能力，但由于高分辨率屏幕截图和历史轨迹中固有的大量时空冗余，导致其效率受到严重瓶颈。我们识别出现有压缩范式中的两个关键不匹配：时间不匹配，即统一历史编码与代理的“渐隐记忆”注意模式之间的偏差，以及空间拓扑冲突，即非结构化剪枝妨碍了精确坐标定位所需的网格完整性，导致空间幻觉。为了解决这些挑战，我们提出了GUIPruner，这是一个针对高分辨率GUI导航的无训练框架。它结合了时间自适应分辨率（Temporal-Adaptive Resolution, TAR），通过基于衰减的调整消除历史冗余，以及分层结构感知剪枝（Stratified Structure-aware Pruning, SSP），优先考虑交互前景和语义锚点，同时保护全局布局。通过在多种基准上的广泛评估，GUIPruner始终实现了最先进的性能，有效防止了在高压缩下大规模模型的崩溃。值得注意的是，在Qwen2-VL-2B上，我们的方法实现了3.4倍的FLOPs减少和3.3倍的视觉编码延迟加速，同时保留了超过94%的原始性能，使得在资源消耗最小的情况下实现实时高精度导航成为可能。

cs.CV / 108 / 2602.23259

Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving

风险感知的世界模型预测控制用于可泛化的端到端自主驾驶

Sun, Jiangxin, Xue, Feng, Long, Teng, Liu, Chang, Hu, Jian-Fang, Zheng, Wei-Shi, Sebe, Nicu

Abstract

With advances in imitation learning (IL) and large-scale driving datasets, end-to-end autonomous driving (E2E-AD) has made great progress recently. Currently, IL-based methods have become a mainstream paradigm: models rely on standard driving behaviors given by experts, and learn to minimize the discrepancy between their actions and expert actions. However, this objective of "only driving like the expert" suffers from limited generalization: when encountering rare or unseen long-tail scenarios outside the distribution of expert demonstrations, models tend to produce unsafe decisions in the absence of prior experience. This raises a fundamental question: Can an E2E-AD system make reliable decisions without any expert action supervision? Motivated by this, we propose a unified framework named Risk-aware World Model Predictive Control (RaWMPC) to address this generalization dilemma through robust control, without reliance on expert demonstrations. Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation. To endow the world model with the ability to predict the outcomes of risky driving behaviors, we design a risk-aware interaction strategy that systematically exposes the world model to hazardous behaviors, making catastrophic outcomes predictable and thus avoidable. Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill risk-avoidance capabilities from the well-trained world model into a generative action proposal network without any expert demonstration. Extensive experiments show that RaWMPC outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios, while providing superior decision interpretability.
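
The predict-then-select loop described above can be sketched with a toy example. Everything here is a stand-in: the "world model" is a hand-written 1D kinematic rollout rather than a learned model, and the risk function, horizon, and candidate accelerations are hypothetical.

```python
# Toy model predictive control: roll out each candidate action with a
# (hand-written, not learned) world model and pick the lowest-risk one.

def rollout(speed, gap, accel, horizon=5, dt=0.5):
    """Predict the minimum gap to a stopped obstacle ahead over the horizon."""
    min_gap = gap
    for _ in range(horizon):
        speed = max(0.0, speed + accel * dt)
        gap -= speed * dt
        min_gap = min(min_gap, gap)
    return min_gap

def risk(min_gap, safe_gap=2.0):
    """Explicit risk score: 1.0 for a predicted collision, lower for larger margins."""
    if min_gap <= 0.0:
        return 1.0
    return safe_gap / (min_gap + safe_gap)

def select_action(speed, gap, candidates):
    """Return the candidate acceleration with the lowest predicted risk."""
    scored = [(risk(rollout(speed, gap, a)), a) for a in candidates]
    return min(scored)[1]

best = select_action(10.0, 20.0, [-4.0, 0.0, 2.0])
```

In this scenario only braking avoids the predicted collision, so it is the action selected.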

Chinese Translation

随着模仿学习（Imitation Learning, IL）和大规模驾驶数据集的进展，端到端自主驾驶（End-to-End Autonomous Driving, E2E-AD）最近取得了重大进展。目前，基于IL的方法已成为主流范式：模型依赖于专家提供的标准驾驶行为，并学习最小化其行为与专家行为之间的差异。然而，这种“仅仅像专家一样驾驶”的目标在泛化能力上存在局限：当遇到稀有或未见过的长尾场景时，模型往往在缺乏先前经验的情况下做出不安全的决策。这引发了一个根本性的问题：E2E-AD系统能否在没有任何专家行为监督的情况下做出可靠的决策？基于此，我们提出了一个统一框架，称为风险感知世界模型预测控制（Risk-aware World Model Predictive Control, RaWMPC），旨在通过鲁棒控制解决这一泛化困境，而不依赖于专家示范。在实践中，RaWMPC利用世界模型预测多种候选行为的后果，并通过明确的风险评估选择低风险行为。为了赋予世界模型预测风险驾驶行为结果的能力，我们设计了一种风险感知的交互策略，系统性地将世界模型暴露于危险行为，使灾难性结果可预测，从而避免。此外，为了在测试时生成低风险候选行为，我们引入了一种自我评估蒸馏方法，将从经过良好训练的世界模型中提炼的风险规避能力转移到生成行为提议网络中，而无需任何专家示范。大量实验表明，RaWMPC在分布内和分布外场景中均优于最先进的方法，同时提供了更优越的决策可解释性。

cs.CV / 109 / 2602.23262

Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling

通过粗到细的小波建模分解私密图像生成

Bayrooti, Jasmine, Kong, Weiwei, Ponomareva, Natalia, Esteves, Carlos, Makadia, Ameesh, Prorok, Amanda

Abstract

Generative models trained on sensitive image datasets risk memorizing and reproducing individual training examples, making strong privacy guarantees essential. While differential privacy (DP) provides a principled framework for such guarantees, standard DP finetuning (e.g., with DP-SGD) often results in severe degradation of image quality, particularly in high-frequency textures, due to the indiscriminate addition of noise across all model parameters. In this work, we propose a spectral DP framework based on the hypothesis that the most privacy-sensitive portions of an image are often low-frequency components in the wavelet space (e.g., facial features and object shapes) while high-frequency components are largely generic and public. Based on this hypothesis, we propose the following two-stage framework for DP image generation with coarse image intermediaries: (1) DP finetune an autoregressive spectral image tokenizer model on the low-resolution wavelet coefficients of the sensitive images, and (2) perform high-resolution upsampling using a publicly pretrained super-resolution model. By restricting the privacy budget to the global structures of the image in the first stage, and leveraging the post-processing property of DP for detail refinement, we achieve promising trade-offs between privacy and utility. Experiments on the MS-COCO and MM-CelebA-HQ datasets show that our method generates images with improved quality and style capture relative to other leading DP image frameworks.
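
As a small illustration of what "low-resolution wavelet coefficients" means, the sketch below computes one level of a 2D Haar transform in plain Python. The Haar basis is an assumption picked for simplicity; the abstract does not state which wavelet family the authors use.

```python
# One level of an orthonormal 2D Haar transform. The LL band is a
# half-resolution, low-frequency version of the image; the other three
# bands hold the high-frequency detail.

def haar_level(image):
    """Return (ll, detail_1, detail_2, detail_3) for an even-sized 2D list."""
    ll, d1, d2, d3 = [], [], [], []
    for r in range(0, len(image), 2):
        row_ll, row_d1, row_d2, row_d3 = [], [], [], []
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            row_ll.append((a + b + d + e) / 2)  # local average (coarse structure)
            row_d1.append((a - b + d - e) / 2)  # left-right differences
            row_d2.append((a + b - d - e) / 2)  # top-bottom differences
            row_d3.append((a - b - d + e) / 2)  # diagonal differences
        ll.append(row_ll)
        d1.append(row_d1)
        d2.append(row_d2)
        d3.append(row_d3)
    return ll, d1, d2, d3

ll, d1, d2, d3 = haar_level([[4, 2], [2, 0]])
```

In the two-stage framework, only a coarse band like `ll` would be modelled under differential privacy, while the detail is restored afterwards by a public super-resolution model.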

Chinese Translation

在敏感图像数据集上训练的生成模型存在记忆和再现单个训练样本的风险，因此强有力的隐私保障至关重要。尽管差分隐私（Differential Privacy, DP）为此类保障提供了一个原则性框架，但标准的DP微调（例如，使用DP-SGD）往往会导致图像质量的严重下降，特别是在高频纹理方面，这是由于对所有模型参数不加区分地添加噪声所致。在本研究中，我们提出了一种基于谱的DP框架，假设图像中最敏感的隐私部分通常是小波空间中的低频成分（例如，面部特征和物体形状），而高频成分则大多是通用和公开的。基于这一假设，我们提出了以下两阶段的DP图像生成框架，使用粗略图像中介： (1) 在敏感图像的低分辨率小波系数上对自回归谱图像标记模型进行DP微调，(2) 使用公开预训练的超分辨率模型进行高分辨率上采样。通过在第一阶段将隐私预算限制在图像的全局结构上，并利用DP的后处理特性进行细节精炼，我们在隐私和效用之间实现了良好的权衡。在MS-COCO和MM-CelebA-HQ数据集上的实验表明，我们的方法生成的图像在质量和风格捕捉方面相较于其他领先的DP图像框架有显著改善。

cs.CV / 110 / 2602.23290

LineGraph2Road: Structural Graph Reasoning on Line Graphs for Road Network Extraction

LineGraph2Road：基于线图的结构图推理用于道路网络提取

Wei, Zhengyang, Jing, Renzhi, He, Yiyi, Suckale, Jenny

Abstract

The accurate and automatic extraction of roads from satellite imagery is critical for applications in navigation and urban planning, significantly reducing the need for manual annotation. Many existing methods decompose this task into keypoint extraction and connectedness prediction, but often struggle to capture long-range dependencies and complex topologies. Here, we propose LineGraph2Road, a framework that improves connectedness prediction by formulating it as binary classification over edges in a constructed global but sparse Euclidean graph, where nodes are keypoints extracted from segmentation masks and edges connect node pairs within a predefined distance threshold, representing potential road segments. To better learn structural link representation, we transform the original graph into its corresponding line graph and apply a Graph Transformer on it for connectedness prediction. This formulation overcomes the limitations of endpoint-embedding fusion on set-isomorphic links, enabling rich link representations and effective relational reasoning over the global structure. Additionally, we introduce an overpass/underpass head to resolve multi-level crossings and a coupled NMS strategy to preserve critical connections. We evaluate LineGraph2Road on three benchmarks: City-scale, SpaceNet, and Global-scale, and show that it achieves state-of-the-art results on two key metrics, TOPO-F1 and APLS. It also captures fine visual details critical for real-world deployment. We will make our code publicly available.
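
The graph construction described above can be sketched in a few lines. The keypoint coordinates and the distance threshold are hypothetical, and the Graph Transformer that classifies each line-graph node is not shown.

```python
import math
from itertools import combinations

def build_sparse_graph(keypoints, max_dist):
    """Edges connect keypoint pairs within max_dist (candidate road segments)."""
    return [(i, j) for i, j in combinations(range(len(keypoints)), 2)
            if math.dist(keypoints[i], keypoints[j]) <= max_dist]

def line_graph(edges):
    """One node per original edge; two nodes are linked if the edges share an endpoint."""
    return [(a, b) for a, b in combinations(range(len(edges)), 2)
            if set(edges[a]) & set(edges[b])]

# Hypothetical keypoints: three close together on a line, one far away.
keypoints = [(0, 0), (1, 0), (2, 0), (10, 0)]
edges = build_sparse_graph(keypoints, max_dist=1.5)
links = line_graph(edges)
```

After this transformation, deciding whether a candidate segment is a real road becomes a node classification problem on the line graph, so each link carries its own representation instead of a fusion of two endpoint embeddings.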

Chinese Translation

从卫星图像中准确自动提取道路对于导航和城市规划等应用至关重要，能够显著减少人工标注的需求。许多现有方法将这一任务分解为关键点提取和连通性预测，但往往难以捕捉长距离依赖关系和复杂拓扑结构。在此，我们提出了LineGraph2Road，一个通过将连通性预测形式化为在构建的全局但稀疏的欧几里得图上的边的二分类来改进连通性预测的框架，其中节点是从分割掩膜中提取的关键点，边连接在预定义距离阈值内的节点对，代表潜在的道路段。为了更好地学习结构链接表示，我们将原始图转换为其对应的线图，并在其上应用图变换器（Graph Transformer）进行连通性预测。这种形式克服了在集合同构链接上端点嵌入融合的局限性，使得丰富的链接表示和有效的全局结构关系推理成为可能。此外，我们引入了一个高架/地下通道头以解决多层交叉问题，并采用耦合的非极大值抑制（NMS）策略以保留关键连接。我们在三个基准数据集上评估了LineGraph2Road：城市规模（City-scale）、SpaceNet和全球规模（Global-scale），并显示其在两个关键指标TOPO-F1和APLS上达到了最先进的结果。它还捕捉了对实际部署至关重要的细微视觉细节。我们将公开我们的代码。

cs.CV / 111 / 2602.23292

PGVMS: A Prompt-Guided Unified Framework for Virtual Multiplex IHC Staining with Pathological Semantic Learning

PGVMS：一种基于提示引导的虚拟多重免疫组化染色统一框架与病理语义学习

Chen, Fuqiang, Zhang, Ranran, Hu, Wanming, Abera, Deboch Eyob, Peng, Yue, Zheng, Boyun, Sun, Yiwen, Cai, Jing, Qin, Wenjian

Abstract

Immunohistochemical (IHC) staining enables precise molecular profiling of protein expression, with over 200 clinically available antibody-based tests in modern pathology. However, comprehensive IHC analysis is frequently limited by insufficient tissue quantities in small biopsies. Therefore, virtual multiplex staining emerges as an innovative solution to digitally transform H&E images into multiple IHC representations, yet current methods still face three critical challenges: (1) inadequate semantic guidance for multi-staining, (2) inconsistent distribution of immunochemistry staining, and (3) spatial misalignment across different stain modalities. To overcome these limitations, we present a prompt-guided framework for virtual multiplex IHC staining using only uniplex training data (PGVMS). Our framework introduces three key innovations corresponding to each challenge: First, an adaptive prompt guidance mechanism employing a pathological visual language model dynamically adjusts staining prompts to resolve semantic guidance limitations (Challenge 1). Second, our protein-aware learning strategy (PALS) maintains precise protein expression patterns by direct quantification and constraint of protein distributions (Challenge 2). Third, the prototype-consistent learning strategy (PCLS) establishes cross-image semantic interaction to correct spatial misalignments (Challenge 3).

Chinese Translation

免疫组化（IHC）染色能够精确地对蛋白质表达进行分子特征分析，目前现代病理学中已有超过200种临床可用的抗体检测。然而，全面的IHC分析常常受到小型活检中组织量不足的限制。因此，虚拟多重染色作为一种创新解决方案，能够将H&E图像数字化转换为多种IHC表现形式，但当前的方法仍面临三大关键挑战：（1）多重染色的语义引导不足，（2）免疫化学染色分布不一致，以及（3）不同染色模式之间的空间错位。为了解决这些限制，我们提出了一种仅使用单重训练数据的虚拟多重IHC染色提示引导框架（PGVMS）。我们的框架引入了三项关键创新，分别对应于每个挑战：首先，采用病理视觉语言模型的自适应提示引导机制动态调整染色提示，以解决语义引导的局限性（挑战1）。其次，我们的蛋白质感知学习策略（PALS）通过直接量化和约束蛋白质分布来保持精确的蛋白质表达模式（挑战2）。第三，原型一致学习策略（PCLS）建立跨图像的语义交互，以纠正空间错位（挑战3）。

cs.CV / 112 / 2602.23294

Towards Long-Form Spatio-Temporal Video Grounding

面向长形式时空视频定位

Gu, Xin, Fan, Bing, Yao, Jiali, Zhang, Zhipeng, Huang, Yan, Han, Cheng, Fan, Heng, Zhang, Libo

Abstract

In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), which localizes a target in a video given a textual query, mainly focuses on short videos of tens of seconds, typically less than one minute, which limits real-world applications. In this paper, we explore Long-Form STVG (LF-STVG), which aims to locate targets in long-term videos. Compared with short videos, long-term videos contain much longer temporal spans and more irrelevant information, making it difficult for existing STVG methods that process all frames at once. To address this challenge, we propose an AutoRegressive Transformer architecture for LF-STVG, termed ART-STVG. Unlike conventional STVG methods that require the entire video sequence to make predictions at once, ART-STVG treats the video as streaming input and processes frames sequentially, enabling efficient handling of long videos. To model spatio-temporal context, we design spatial and temporal memory banks and apply them to the decoders. Since memories from different moments are not always relevant to the current frame, we introduce simple yet effective memory selection strategies to provide more relevant information to the decoders, significantly improving performance. Furthermore, instead of parallel spatial and temporal localization, we propose a cascaded spatio-temporal design that connects the spatial decoder to the temporal decoder, allowing fine-grained spatial cues to assist complex temporal localization in long videos. Experiments on newly extended LF-STVG datasets show that ART-STVG significantly outperforms state-of-the-art methods, while achieving competitive performance on conventional short-form STVG.

Chinese Translation

在实际场景中，视频可能持续数分钟甚至数小时。然而，现有的时空视频定位（STVG）研究在给定文本查询的情况下，主要集中于在短视频（通常少于一分钟）中定位目标，这限制了其在现实世界中的应用。本文探讨了长形式时空视频定位（LF-STVG），旨在定位长时间视频中的目标。与短视频相比，长时间视频包含更长的时间跨度和更多无关信息，使得现有的STVG方法在一次性处理所有帧时面临困难。为了解决这一挑战，我们提出了一种用于LF-STVG的自回归变换器架构，称为ART-STVG。与传统STVG方法需要一次性处理整个视频序列以进行预测不同，ART-STVG将视频视为流输入，并顺序处理帧，从而有效处理长视频。为了建模时空上下文，我们设计了空间和时间记忆库，并将其应用于解码器。由于来自不同时间点的记忆并不总是与当前帧相关，我们引入了简单而有效的记忆选择策略，以向解码器提供更相关的信息，显著提高了性能。此外，我们提出了一种级联时空设计，将空间解码器与时间解码器连接起来，而不是并行的空间和时间定位，从而允许细粒度的空间线索辅助长视频中的复杂时间定位。在新扩展的LF-STVG数据集上的实验表明，ART-STVG显著优于最先进的方法，同时在传统短形式STVG上也表现出竞争力。

cs.CV / 113 / 2602.23295

ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation

ManifoldGD：无训练层次流形引导的基于扩散的数据集蒸馏

Roy, Ayush, Lee, Wei-Yang Alex, Chakraborty, Rudrasis, Lokhande, Vishnu Suresh

Abstract

Large datasets hinder efficient model training and often contain redundant concepts. Dataset distillation aims to synthesize compact datasets that preserve the knowledge of large-scale training sets while drastically reducing storage and computation. Recent advances in diffusion models have enabled training-free distillation by leveraging pre-trained generative priors; however, existing guidance strategies remain limited. Current score-based methods either perform unguided denoising or rely on simple mode-based guidance toward instance prototype centroids (IPC centroids), which are often rudimentary and suboptimal. We propose Manifold-Guided Distillation (ManifoldGD), a training-free diffusion-based framework that integrates manifold-consistent guidance at every denoising timestep. Our method employs IPCs computed via a hierarchical, divisive clustering of VAE latent features, yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. Using a local neighborhood of the extracted IPC centroids, we create the latent manifold for each diffusion denoising timestep. At each denoising step, we project the mode-alignment vector onto the local tangent space of the estimated latent manifold, thus constraining the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without requiring any model retraining. Empirical results demonstrate consistent gains over existing training-free and training-based baselines in terms of FID, $\ell_2$ distance between real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free data distillation framework.
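
The tangent-space projection step can be illustrated with a small sketch. It assumes the local tangent space is spanned by the offsets from a centroid to its neighbors and orthonormalizes them with Gram-Schmidt; the abstract does not say how the authors estimate the tangent space, so this is a stand-in, and all vectors are toy values.

```python
# Project a guidance vector onto a locally estimated tangent space.
# Gram-Schmidt over neighbor offsets is an assumed, simplified estimator.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tangent_basis(center, neighbors, tol=1e-9):
    """Orthonormal basis spanning the offsets from `center` to its neighbors."""
    basis = []
    for p in neighbors:
        v = [a - b for a, b in zip(p, center)]
        for q in basis:
            coeff = dot(v, q)
            v = [a - coeff * b for a, b in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > tol:  # skip offsets already explained by the basis
            basis.append([a / norm for a in v])
    return basis

def project(vector, basis):
    """Component of `vector` that lies in the span of the orthonormal `basis`."""
    out = [0.0] * len(vector)
    for q in basis:
        coeff = dot(vector, q)
        out = [o + coeff * b for o, b in zip(out, q)]
    return out

basis = tangent_basis([0, 0, 0], [[1, 0, 0], [0, 2, 0]])
guided = project([3.0, 4.0, 5.0], basis)
```

The projection discards the component of the guidance vector that points off the estimated manifold, which is the sense in which the trajectory stays manifold-faithful.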

Chinese Translation

近年来，大型数据集阻碍了高效的模型训练，同时也包含冗余的概念。数据集蒸馏旨在合成紧凑的数据集，以保留大规模训练集的知识，同时大幅减少存储和计算需求。最近在扩散模型方面的进展使得通过利用预训练生成先验实现无训练蒸馏成为可能；然而，现有的引导策略仍然有限。目前的基于评分的方法要么进行无引导去噪，要么依赖于简单的基于模式的引导，朝向实例原型质心（IPC 质心），这往往是初级和次优的。我们提出了流形引导蒸馏（ManifoldGD），这是一种无训练的基于扩散的框架，在每个去噪时间步中集成了流形一致的引导。我们的方法通过对变分自编码器（VAE）潜在特征进行层次性、分裂式聚类计算IPC，从而得到一个多尺度的IPC核心集，捕捉粗略的语义模式和细微的类内变异性。利用提取的IPC质心的局部邻域，我们为每个扩散去噪时间步创建潜在流形。在每个去噪步骤中，我们将模式对齐向量投影到估计的潜在流形的局部切空间，从而约束生成轨迹保持流形一致，同时保持语义一致性。这种表述提高了代表性、多样性和图像保真度，而无需任何模型重训练。实证结果表明，在FID、真实和合成数据集嵌入之间的l2距离以及分类准确性方面，相较于现有的无训练和基于训练的基线，ManifoldGD 始终带来提升，确立了其作为首个几何感知的无训练数据蒸馏框架的地位。

cs.CV / 114 / 2602.23297

PRIMA: Pre-training with Risk-integrated Image-Metadata Alignment for Medical Diagnosis via LLM

PRIMA：通过风险整合的图像-元数据对齐进行医学诊断的预训练

Wang, Yiqing, He, Chunming, Lu, Ming-Chen, Pawar, Mercy, Niziol, Leslie, Woodward, Maria, Farsiu, Sina

Abstract

Medical diagnosis requires the effective synthesis of visual manifestations and clinical metadata. However, existing methods often treat metadata as isolated tags, failing to exploit the rich semantic knowledge embedded in clinical descriptions. We propose PRIMA (Pre-training with Risk-integrated Image-Metadata Alignment), a framework that integrates domain-specific knowledge into multi-modal representation learning. We first curate an expert corpus of risk-disease correlations via Retrieval-Augmented Generation (RAG) to refine Clinical ModernBERT, embedding diagnostic priors into the text encoder. To bridge the modality gap, we introduce a dual-encoder pre-training strategy utilizing DINOv3 and our refined BERT, optimized by a suite of four complementary loss functions. These losses are designed to capture multi-granular semantic alignment and handle the ambiguity of clinical correlations through soft labels. Finally, we leverage Qwen-3 to fuse these aligned features for precise disease classification. Extensive experiments demonstrate that PRIMA effectively harmonizes pixel-level features with abstract clinical expertise, significantly outperforming other state-of-the-art methods. Notably, our framework achieves superior robustness without the need for massive data collection or exhaustive computational resources. Our code will be made public upon acceptance.

Chinese Translation

医学诊断需要有效综合视觉表现和临床元数据。然而，现有方法往往将元数据视为孤立的标签，未能充分利用嵌入在临床描述中的丰富语义知识。我们提出了PRIMA（风险整合的图像-元数据对齐预训练），这是一个将领域特定知识整合到多模态表征学习中的框架。我们首先通过检索增强生成（Retrieval-Augmented Generation, RAG）策划了一个专家语料库，建立风险与疾病的关联，以优化临床现代BERT（Clinical ModernBERT），将诊断先验嵌入文本编码器中。为了弥合模态差距，我们引入了一种双编码器预训练策略，利用DINOv3和我们优化的BERT，并通过四种互补损失函数进行优化。这些损失旨在捕捉多粒度的语义对齐，并通过软标签处理临床关联的模糊性。最后，我们利用Qwen-3融合这些对齐特征，以实现精确的疾病分类。大量实验表明，PRIMA有效地将像素级特征与抽象的临床专业知识相结合，显著超越其他最先进的方法。值得注意的是，我们的框架在不需要大量数据收集或耗费大量计算资源的情况下，表现出更强的鲁棒性。我们的代码将在论文接受后公开。

cs.CV / 115 / 2602.23306

ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding

ThinkOmni：通过引导解码将文本推理提升至全模态场景

Guan, Yiran, Tu, Sifan, Liang, Dingkang, Zhu, Linghao, Ju, Jianzhong, Luo, Zhenbo, Luan, Jian, Liu, Yuliang, Bai, Xiang

Abstract

Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLM) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRM). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, with main results achieving 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.

Chinese Translation

全模态推理对于智能系统理解和从多样数据源中进行推理至关重要。尽管现有的全模态大型语言模型（OLLM）在感知多种模态方面表现出色，但它们缺乏近期大型推理模型（LRM）的复杂推理能力。然而，通过额外训练提升OLLM的推理能力面临重大挑战，包括对高质量数据的需求、任务特定适应和巨大的计算成本。为了解决这些限制，我们提出了ThinkOmni，一个无训练和无数据的框架，将文本推理提升至全模态场景。ThinkOmni引入了两个关键组件：1）LRM作为引导（LRM-as-a-Guide），利用现成的LRM来指导OLLM的解码过程；2）逐步对比缩放（Stepwise Contrastive Scaling），自适应平衡感知和推理信号，而无需手动调整超参数。在六个多模态推理基准上的实验表明，ThinkOmni始终带来性能提升，主要结果在MathVista上达到70.2，在MMAU上达到75.5。总体而言，ThinkOmni为全模态推理提供了灵活且可推广的解决方案，并为推理能力的泛化和应用提供了新的见解。

cs.CV / 116 / 2602.23339

Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

检索与分割：少量示例足以弥补开放词汇分割中的监督差距吗？

Aravanis, Tilemachos, Stojnić, Vladan, Psomas, Bill, Komodakis, Nikos, Tolias, Giorgos

Abstract

Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.

Chinese Translation

开放词汇分割（Open-vocabulary segmentation, OVS）将视觉-语言模型（Vision-Language Models, VLMs）的零样本识别能力扩展到像素级预测，使得能够对由文本提示指定的任意类别进行分割。尽管最近取得了一些进展，OVS 仍然落后于完全监督的方法，主要面临两个挑战：用于训练 VLMs 的粗略图像级监督和自然语言的语义模糊性。我们通过引入一种少样本设置来解决这些限制，该设置通过像素注释图像的支持集增强文本提示。在此基础上，我们提出了一种检索增强的测试时适配器，通过融合文本和视觉支持特征来学习轻量级的逐图像分类器。与之前依赖于后期手工融合的方法不同，我们的方法执行学习的逐查询融合，实现了模态之间更强的协同作用。该方法支持不断扩展的支持集，并适用于个性化分割等细粒度任务。实验表明，我们显著缩小了零样本分割与监督分割之间的差距，同时保留了开放词汇能力。

cs.CV / 117 / 2602.23357

Sensor Generalization for Adaptive Sensing in Event-based Object Detection via Joint Distribution Training

基于联合分布训练的事件驱动目标检测中的自适应传感器泛化

Saha, Aheli, Schuster, René, Stricker, Didier

Abstract

Bio-inspired event cameras have recently attracted significant research interest due to their asynchronous and low-latency capabilities. These features provide a high dynamic range and significantly reduce motion blur. However, because of the novelty in the nature of their output signals, there is a gap in the variability of available data and a lack of extensive analysis of the parameters characterizing their signals. This paper addresses these issues by providing readers with an in-depth understanding of how intrinsic parameters affect the performance of a model trained on event data, specifically for object detection. We also use our findings to expand the capabilities of the downstream model towards sensor-agnostic robustness.

Chinese Translation

受生物启发的事件相机因其异步和低延迟的特性而受到广泛关注。这些特性提供了高动态范围，并显著减少了运动模糊。然而，由于其输出信号性质的新颖性，现有数据的变异性存在差距，且对表征其信号的参数缺乏深入分析。本文通过深入探讨内在参数如何影响基于事件数据训练的模型在目标检测中的性能，解决了这些问题。我们还利用研究结果扩展下游模型的能力，以实现对传感器无关的鲁棒性。

cs.CV / 118 / 2602.23359

SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

SeeThrough3D：在文本到图像生成中的遮挡感知3D控制

Agrawal, Vaibhav, Parihar, Rishubh, Bhat, Pradhaan, Sarvadevabhatla, Ravi Kiran, Babu, R. Venkatesh

Abstract

We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset with diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.

Chinese Translation

我们将遮挡推理视为3D布局条件生成中的一个基本但被忽视的方面。它对于合成具有深度一致几何形状和比例的部分遮挡物体至关重要。尽管现有方法能够生成符合输入布局的真实场景，但它们往往无法精确建模物体之间的遮挡关系。我们提出了SeeThrough3D，这是一种明确建模遮挡的3D布局条件生成模型。我们引入了一种遮挡感知的3D场景表示（OSCR），其中物体被描绘为放置在虚拟环境中的半透明3D盒子，并从所需的相机视角进行渲染。透明度编码了隐藏的物体区域，使模型能够推理遮挡，而渲染的视角在生成过程中提供了明确的相机控制。我们通过引入一组源自我们渲染的3D表示的视觉标记，来对预训练的基于流的文本到图像生成模型进行条件化。此外，我们应用了掩蔽自注意力机制，以准确将每个物体的边界框与其相应的文本描述绑定，从而实现多个物体的准确生成，而不混合物体属性。为了训练模型，我们构建了一个合成数据集，包含多样化的多物体场景，并具有强烈的物体间遮挡。SeeThrough3D能够有效地推广到未见过的物体类别，并实现精确的3D布局控制，具备真实的遮挡效果和一致的相机控制。

cs.CV / 119 / 2602.23361

VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale

VGG-T$^3$: 大规模离线前馈三维重建

Elflein, Sven, Li, Ruilong, Agostinho, Sérgio, Gojcic, Zan, Leal-Taixé, Laura, Zhou, Qunjie, Osep, Aljosa

Abstract

We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$ image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, it outperforms other linear-time methods by large margins in point map reconstruction error. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.
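
A back-of-the-envelope reading of these numbers, under idealized scaling laws that ignore constant factors and hardware effects (an assumption, not a measurement from the paper):

```python
# Rough arithmetic on the reported figures: 1k images in 54 s at an 11.6x
# speed-up. The scaling laws below are idealized and ignore constant factors.

def implied_baseline_seconds(ours_seconds=54.0, speedup=11.6):
    """Runtime of the softmax-attention baseline implied by the speed-up."""
    return ours_seconds * speedup

def relative_cost(n_images, reference=1000, quadratic=True):
    """Cost relative to a `reference`-image run under pure N or N^2 scaling."""
    ratio = n_images / reference
    return ratio ** 2 if quadratic else ratio

baseline = implied_baseline_seconds()
```

The implied baseline runtime is about 626 seconds for the same collection. Doubling the collection to 2k images would multiply cost by about 4 under quadratic scaling but only by 2 under linear scaling.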

Chinese Translation

我们提出了一种可扩展的三维重建模型，解决了离线前馈方法中的一个关键限制：其计算和内存需求随着输入图像数量的增加呈平方增长。我们的方法基于一个关键的见解，即这个瓶颈源于场景几何的变长键值（Key-Value, KV）空间表示，我们通过测试时训练将其提炼为固定大小的多层感知器（Multi-Layer Perceptron, MLP）。VGG-T$^3$（视觉几何基础的测试时训练）相对于输入视图数量线性扩展，类似于在线模型，并在仅54秒内重建了一个1k图像集合，实现了相较于依赖softmax注意力的基线方法11.6倍的加速。由于我们的方法保留了全局场景聚合能力，我们的点图重建误差在很大程度上优于其他线性时间方法。最后，我们通过用未见图像查询场景表示，展示了我们模型的视觉定位能力。

cs.CV / 120 / 2602.23363

MediX-R1: Open Ended Medical Reinforcement Learning

MediX-R1：开放式医学强化学习

Mullappilly, Sahal Shaji, Kurpath, Mohammed Irfan, Mohamed, Omair, Zidan, Mohamed, Khan, Fahad, Khan, Salman, Anwer, Rao, Cholakkal, Hisham

Abstract

We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward to capture paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a Reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only $\sim51$K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image + text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets and source code are available at https://medix.cvmbzuai.com
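
The composite reward described above can be sketched as a weighted sum of its four signals. The weights are illustrative assumptions, not values from the paper. The group-relative advantage normalization is likewise a guess at what "Group Based RL" involves (in the style of GRPO), and the embeddings are toy vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def composite_reward(judge_yes, answer_emb, reference_emb, format_ok, modality_ok,
                     weights=(1.0, 0.5, 0.25, 0.25)):
    """Weighted sum of accuracy, semantic, format, and modality rewards.

    The weights are hypothetical placeholders.
    """
    w_acc, w_sem, w_fmt, w_mod = weights
    return (w_acc * float(judge_yes)                       # strict YES/NO LLM judge
            + w_sem * cosine(answer_emb, reference_emb)    # embedding similarity
            + w_fmt * float(format_ok)                     # output format check
            + w_mod * float(modality_ok))                  # modality recognition

def group_advantages(rewards):
    """Standardize rewards within one group of sampled answers (assumed scheme)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / std if std else 0.0 for r in rewards]

best_case = composite_reward(True, [1.0, 0.0], [1.0, 0.0], True, True)
advantages = group_advantages([0.0, 2.0])
```

The embedding term gives partial credit to paraphrases that a strict YES/NO judge might reject, which matches the abstract's motivation for combining several signals.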

Chinese Translation

我们介绍了MediX-R1，这是一个开放式的强化学习（Reinforcement Learning, RL）框架，旨在为医学多模态大型语言模型（Multimodal Large Language Models, MLLMs）提供临床基础的自由形式回答，超越多项选择格式。MediX-R1通过基于组的RL和针对医学推理定制的复合奖励来微调基线视觉-语言骨干网络：一种基于LLM的准确性奖励，通过严格的YES/NO决策来判断语义正确性；一种基于医学嵌入的语义奖励，以捕捉同义词和术语变体；以及轻量级的格式和模态奖励，以强化可解释推理和模态识别。这种多信号设计为开放式输出提供了稳定且信息丰富的反馈，而传统的可验证奖励或仅限于多项选择的问题则显得不足。为了衡量进展，我们提出了一个统一的评估框架，适用于纯文本和图像+文本任务，使用基于参考的LLM作为评判者，替代脆弱的字符串重叠度量，捕捉语义正确性、推理和上下文对齐。尽管仅使用了约51K的指令示例，MediX-R1在标准医学LLM（仅文本）和视觉语言模型（图像+文本）基准测试中取得了优异的结果，超越了强大的开源基线，并在开放式临床任务上取得了特别大的提升。我们的结果表明，结合全面奖励信号和基于LLM的评估的开放式RL是实现多模态模型中可靠医学推理的可行路径。我们训练的模型、策划的数据集和源代码可在https://medix.cvmbzuai.com获取。

人工智能 (Artificial Intelligence)

cs.AI / 1 / 2602.22215

Graph Your Way to Inspiration: Integrating Co-Author Graphs with Retrieval-Augmented Generation for Large Language Model Based Scientific Idea Generation

图谱助力灵感：将合著者图谱与检索增强生成结合用于基于大型语言模型的科学创意生成

Xie, Pengzhen, Liang, Huizhi

Abstract

Large Language Models (LLMs) demonstrate potential in the field of scientific idea generation. However, the generated results often lack controllable academic context and traceable inspiration pathways. To bridge this gap, this paper proposes a scientific idea generation system called GYWI, which combines author knowledge graphs with retrieval-augmented generation (RAG) to form an external knowledge base that provides controllable context and a traceable inspiration path for LLMs generating new scientific ideas. First, we propose an author-centered knowledge graph construction method and inspiration source sampling algorithms to construct the external knowledge base. Second, we propose a hybrid retrieval mechanism, composed of both RAG and GraphRAG, that retrieves content with both depth and breadth of knowledge and forms a hybrid context. Third, we propose a prompt optimization strategy incorporating reinforcement learning principles to automatically guide LLMs to optimize the results based on the hybrid context. To evaluate the proposed approaches, we constructed an evaluation dataset based on arXiv (2018-2023). This paper also develops a comprehensive evaluation method including empirical automatic assessment in a multiple-choice question task, LLM-based scoring, human evaluation, and semantic space visualization analysis. The generated ideas are evaluated along five dimensions: novelty, feasibility, clarity, relevance, and significance. We conducted experiments on different LLMs including GPT-4o, DeepSeek-V3, Qwen3-8B, and Gemini 2.5. Experimental results show that GYWI significantly outperforms mainstream LLMs in multiple metrics such as novelty, reliability, and relevance.
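
The first building block, a co-author graph, can be sketched from paper author lists alone. The author names are placeholders and the edge weighting (number of shared papers) is an assumption; the abstract does not describe the actual graph schema or sampling algorithms.

```python
from collections import Counter
from itertools import combinations

def coauthor_graph(papers):
    """Map each unordered author pair to the number of papers they share."""
    weights = Counter()
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights

def neighbors(graph, author):
    """Co-authors of `author`, strongest collaboration first."""
    linked = [(w, b if a == author else a)
              for (a, b), w in graph.items() if author in (a, b)]
    return [name for w, name in sorted(linked, key=lambda x: (-x[0], x[1]))]

# Hypothetical author lists for three papers.
papers = [["A", "B"], ["A", "B", "C"], ["C", "D"]]
graph = coauthor_graph(papers)
close_collaborators = neighbors(graph, "A")
```

A retrieval step could then walk such neighborhoods to collect candidate inspiration sources whose provenance stays traceable.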

Chinese Translation

大型语言模型（LLMs）在科学创意生成领域展现出潜力。然而，生成的结果往往缺乏可控的学术背景和可追溯的灵感路径。为了解决这一问题，本文提出了一种名为 GYWI 的科学创意生成系统，该系统将作者知识图谱与检索增强生成（RAG）相结合，形成一个外部知识库，以为 LLMs 提供可控的背景和灵感路径的追踪，从而生成新的科学创意。我们首先提出了一种以作者为中心的知识图谱构建方法和灵感来源采样算法，以构建外部知识库。然后，我们提出了一种混合检索机制，该机制由 RAG 和 GraphRAG 组成，以检索具有深度和广度知识的内容，形成混合背景。第三，我们提出了一种结合强化学习原则的提示优化策略，以自动指导 LLMs 基于混合背景优化结果。为了评估所提出的方法，我们基于 arXiv（2018-2023）构建了评估数据集。本文还开发了一种综合评估方法，包括多项选择题任务中的经验自动评估、基于 LLM 的评分、人类评估和语义空间可视化分析。生成的创意从新颖性、可行性、清晰度、相关性和重要性五个维度进行评估。我们在不同的 LLMs 上进行了实验，包括 GPT-4o、DeepSeek-V3、Qwen3-8B 和 Gemini 2.5。实验结果表明，GYWI 在新颖性、可靠性和相关性等多个指标上显著优于主流 LLMs。

cs.AI / 2 / 2602.22273

FIRE: A Comprehensive Benchmark for Financial Intelligence and Reasoning Evaluation

FIRE：金融智能与推理评估的综合基准

Zhang, Xiyuan, Wu, Huihang, Guo, Jiayu, Zhang, Zhenlin, Zhang, Yiwei, Huo, Liangyu, Ma, Xiaoxiao, Wan, Jiansong, Jiao, Xuewei, Jing, Yi, Xie, Jian

Abstract

We introduce FIRE, a comprehensive benchmark designed to evaluate both the theoretical financial knowledge of LLMs and their ability to handle practical business scenarios. For theoretical assessment, we curate a diverse set of examination questions drawn from widely recognized financial qualification exams, enabling evaluation of LLMs' deep understanding and application of financial knowledge. In addition, to assess the practical value of LLMs in real-world financial tasks, we propose a systematic evaluation matrix that categorizes complex financial domains and ensures coverage of essential subdomains and business activities. Based on this evaluation matrix, we collect 3,000 financial scenario questions, consisting of closed-form decision questions with reference answers and open-ended questions evaluated by predefined rubrics. We conduct comprehensive evaluations of state-of-the-art LLMs on the FIRE benchmark, including XuanYuan 4.0, our latest financial-domain model, as a strong in-domain baseline. These results enable a systematic analysis of the capability boundaries of current LLMs in financial applications. We publicly release the benchmark questions and evaluation code to facilitate future research.

Chinese Translation

我们介绍了FIRE，这是一个综合基准，旨在评估大型语言模型（LLMs）的理论金融知识及其处理实际商业场景的能力。为了进行理论评估，我们整理了一套多样化的考试问题，这些问题来源于广泛认可的金融资格考试，从而能够评估LLMs对金融知识的深刻理解和应用。此外，为了评估LLMs在现实金融任务中的实际价值，我们提出了一个系统的评估矩阵，该矩阵对复杂的金融领域进行分类，并确保涵盖重要的子领域和商业活动。基于这一评估矩阵，我们收集了3,000个金融场景问题，包括带有参考答案的封闭式决策问题和通过预定义评分标准评估的开放式问题。我们对最先进的LLMs在FIRE基准上的表现进行了全面评估，包括我们的最新金融领域模型XuanYuan 4.0，作为一个强有力的领域内基线。这些结果使我们能够系统分析当前LLMs在金融应用中的能力边界。我们公开发布了基准问题和评估代码，以促进未来的研究。


cs.AI / 3 / 2602.22287

Multi-Level Causal Embeddings

多层次因果嵌入

Schooltink, Willem, Zennaro, Fabio Massimo

Abstract

Abstractions of causal models allow for the coarsening of models such that relations of cause and effect are preserved. Whereas abstractions focus on the relation between two models, in this paper we study a framework for causal embeddings which enable multiple detailed models to be mapped into sub-systems of a coarser causal model. We define causal embeddings as a generalization of abstraction, and present a generalized notion of consistency. By defining a multi-resolution marginal problem, we showcase the relevance of causal embeddings for both the statistical marginal problem and the causal marginal problem; furthermore, we illustrate its practical use in merging datasets coming from models with different representations.

Chinese Translation

因果模型的抽象允许对模型进行粗化，从而保持因果关系的完整性。尽管抽象主要关注两个模型之间的关系，本文研究了一种因果嵌入框架，该框架使得多个详细模型能够映射到一个粗糙因果模型的子系统中。我们将因果嵌入定义为抽象的推广，并提出了一种广义的一致性概念。通过定义多分辨率边际问题，我们展示了因果嵌入在统计边际问题和因果边际问题中的相关性；此外，我们还说明了其在合并来自不同表示模型的数据集中的实际应用。


cs.AI / 4 / 2602.22302

Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents

代理行为合同：可靠自主人工智能代理的形式化规范与运行时执行

Bhardwaj, Varun Pratap

Abstract

Traditional software relies on contracts -- APIs, type systems, assertions -- to specify and enforce correct behavior. AI agents, by contrast, operate on prompts and natural language instructions with no formal behavioral specification. This gap is the root cause of drift, governance failures, and frequent project failures in agentic AI deployments. We introduce Agent Behavioral Contracts (ABC), a formal framework that brings Design-by-Contract principles to autonomous AI agents. An ABC contract C = (P, I, G, R) specifies Preconditions, Invariants, Governance policies, and Recovery mechanisms as first-class, runtime-enforceable components. We define (p, delta, k)-satisfaction -- a probabilistic notion of contract compliance that accounts for LLM non-determinism and recovery -- and prove a Drift Bounds Theorem showing that contracts with recovery rate gamma > alpha (the natural drift rate) bound behavioral drift to D* = alpha/gamma in expectation, with Gaussian concentration in the stochastic setting. We establish sufficient conditions for safe contract composition in multi-agent chains and derive probabilistic degradation bounds. We implement ABC in AgentAssert, a runtime enforcement library, and evaluate on AgentContract-Bench, a benchmark of 200 scenarios across 7 models from 6 vendors. Results across 1,980 sessions show that contracted agents detect 5.2-6.8 soft violations per session that uncontracted baselines miss entirely (p < 0.0001, Cohen's d = 6.7-33.8), achieve 88-100% hard constraint compliance, and bound behavioral drift to D* < 0.27 across extended sessions, with 100% recovery for frontier models and 17-100% across all models, at overhead < 10 ms per action.

Chinese Translation

传统软件依赖于合同——API、类型系统、断言——来指定和执行正确的行为。相比之下，人工智能代理依赖于提示和自然语言指令，而没有正式的行为规范。这一差距是代理人工智能部署中漂移、治理失败和频繁项目失败的根本原因。我们提出了代理行为合同（Agent Behavioral Contracts, ABC），这是一个将设计-by-合同原则引入自主人工智能代理的形式化框架。ABC合同 C = (P, I, G, R) 将前提条件（Preconditions）、不变式（Invariants）、治理政策（Governance policies）和恢复机制（Recovery mechanisms）作为一等公民的、可在运行时执行的组件进行规范。我们定义了(p, delta, k)-满足性，这是一种考虑大型语言模型（LLM）非确定性和恢复的合同合规性概率概念，并证明了漂移界限定理，表明具有恢复率 gamma > alpha（自然漂移率）的合同在期望上将行为漂移限制在 D* = alpha/gamma，并在随机环境中具有高斯集中性。我们建立了多代理链中安全合同组合的充分条件，并推导出概率降级界限。我们在 AgentAssert 中实现了 ABC，这是一个运行时执行库，并在 AgentContract-Bench 上进行了评估，该基准涵盖了来自6个供应商的7个模型的200个场景。1,980个会话的结果表明，签约代理每个会话检测到 5.2-6.8 个软违规，而未签约的基线完全遗漏（p < 0.0001，Cohen's d = 6.7-33.8），实现了 88-100% 的硬约束合规性，并将行为漂移限制在 D* < 0.27 的扩展会话中，对于前沿模型实现了 100% 的恢复，所有模型的恢复率为 17-100%，每个动作的开销小于 10 毫秒。
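
As a rough illustration of the drift bound quoted in the abstract above, the Python sketch below iterates a simple linear drift-and-recovery recurrence. The recurrence itself, the function name `simulate_drift`, and the parameter values are assumptions made for illustration; the abstract only states that a recovery rate gamma greater than the natural drift rate alpha bounds expected drift at D* = alpha/gamma.

```python
def simulate_drift(alpha: float, gamma: float, steps: int, d0: float = 0.0) -> float:
    """Iterate D <- D + alpha - gamma * D and return the final drift value.

    Assumed toy dynamics (not taken from the paper): drift grows by alpha per
    step and recovery removes a fraction gamma of the current drift.
    """
    if not 0.0 < alpha < gamma <= 1.0:
        raise ValueError("this sketch assumes 0 < alpha < gamma <= 1")
    d = d0
    for _ in range(steps):
        d = d + alpha - gamma * d
    return d


alpha, gamma = 0.05, 0.25
d_star = alpha / gamma  # fixed point of the assumed recurrence
d_final = simulate_drift(alpha, gamma, steps=200)
print(f"D* = {d_star:.3f}, simulated drift after 200 steps = {d_final:.3f}")
```

Under this assumed recurrence the gap to the fixed point shrinks by a factor of (1 - gamma) per step, so the simulated drift settles at alpha/gamma regardless of the starting value.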


cs.AI / 5 / 2602.22401

Vibe Researching as Wolf Coming: Can AI Agents with Skills Replace or Augment Social Scientists?

作为狼来的氛围研究：具备技能的人工智能代理能否替代或增强社会科学家？

Zhang, Yongjun

Abstract

AI agents -- systems that execute multi-step reasoning workflows with persistent state, tool access, and specialist skills -- represent a qualitative shift from prior automation technologies in social science. Unlike chatbots that respond to isolated queries, AI agents can now read files, run code, query databases, search the web, and invoke domain-specific skills to execute entire research pipelines autonomously. This paper introduces the concept of vibe researching -- the AI-era parallel to "vibe coding" (Karpathy, 2025) -- and uses scholar-skill, a 21-skill plugin for Claude Code covering the full research pipeline from idea to submission, as an illustrative case. I develop a cognitive task framework that classifies research activities along two dimensions -- codifiability and tacit knowledge requirement -- to identify a delegation boundary that is cognitive, not sequential: it cuts through every stage of the research pipeline, not between stages. I argue that AI agents excel at speed, coverage, and methodological scaffolding but struggle with theoretical originality and tacit field knowledge. The paper concludes with an analysis of three implications for the profession -- augmentation with fragile conditions, stratification risk, and a pedagogical crisis -- and proposes five principles for responsible vibe researching.

Chinese Translation

人工智能代理——执行具有持久状态、多步骤推理工作流程、工具访问和专业技能的系统——代表了社会科学领域从以往自动化技术的质的飞跃。与仅能回应孤立查询的聊天机器人不同，人工智能代理现在能够读取文件、运行代码、查询数据库、搜索网络，并调用特定领域的技能，独立执行整个研究流程。本文引入了“氛围研究”的概念——这是人工智能时代与“氛围编码”（Karpathy, 2025）相对应的概念，并以 scholar-skill 作为示例，这是一个涵盖从创意到提交的完整研究流程的 21 种技能插件，适用于 Claude Code。我开发了一个认知任务框架，将研究活动沿着可编码性和隐性知识需求两个维度进行分类，以识别一个认知的、而非顺序的委托边界：它贯穿研究流程的每个阶段，而不是在阶段之间。我认为，人工智能代理在速度、覆盖范围和方法论支架方面表现出色，但在理论原创性和隐性领域知识方面存在困难。本文最后分析了对这一职业的三种影响——在脆弱条件下的增强、分层风险和教学危机，并提出了五项负责任的氛围研究原则。


cs.AI / 6 / 2602.22406

Towards Autonomous Memory Agents

迈向自主记忆代理

Wu, Xinle, Zhang, Rui, Hussain, Mustafa Anis, Lu, Yao

Abstract

Recent memory agents improve LLMs by extracting experiences and conversation history into external storage. This enables low-overhead context assembly and online memory updates without expensive LLM training. However, existing solutions remain passive and reactive: memory growth is bounded by whatever information happens to be available, and memory agents seldom seek external inputs when they are uncertain. We propose autonomous memory agents that actively acquire, validate, and curate knowledge at minimal cost. U-Mem materializes this idea via (i) a cost-aware knowledge-extraction cascade that escalates from cheap self/teacher signals to tool-verified research and, only when needed, expert feedback, and (ii) semantic-aware Thompson sampling to balance exploration and exploitation over memories and mitigate cold-start bias. On both verifiable and non-verifiable benchmarks, U-Mem consistently beats prior memory baselines and can surpass RL-based optimization, improving HotpotQA (Qwen2.5-7B) by 14.6 points and AIME25 (Gemini-2.5-flash) by 7.33 points.

Chinese Translation

近期的记忆代理通过将经验和对话历史提取到外部存储中来改善大型语言模型（LLMs）。这使得低开销的上下文组装和在线记忆更新成为可能，而无需昂贵的LLM训练。然而，现有的解决方案仍然是被动和反应性的；记忆的增长受到可用信息的限制，而记忆代理在不确定性中很少主动寻求外部输入。我们提出了自主记忆代理，能够以最低成本主动获取、验证和策划知识。U-Mem通过以下方式实现这一理念：（i）一个成本感知的知识提取级联，从廉价的自我/教师信号逐步升级到工具验证的研究，只有在必要时才获取专家反馈；（ii）语义感知的汤普森采样，以平衡记忆的探索与利用，并减轻冷启动偏差。在可验证和不可验证的基准测试中，U-Mem始终超越先前的记忆基线，并且能够超过基于强化学习的优化，HotpotQA（Qwen2.5-7B）提高了14.6分，AIME25（Gemini-2.5-flash）提高了7.33分。
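
The abstract above mentions Thompson sampling for balancing exploration and exploitation over memories. The sketch below shows plain Beta-Bernoulli Thompson sampling only; the "semantic-aware" part of U-Mem is not described in the abstract and is not modelled here. The function names, the binary usefulness feedback, and the usefulness rates are all hypothetical.

```python
import random


def thompson_select(successes, failures, rng):
    """Pick the index whose Beta(successes + 1, failures + 1) sample is largest."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run_bandit(true_rates, rounds, seed=0):
    """Simulate Bernoulli feedback and return how often each memory was chosen."""
    rng = random.Random(seed)
    n = len(true_rates)
    successes, failures, pulls = [0] * n, [0] * n, [0] * n
    for _ in range(rounds):
        i = thompson_select(successes, failures, rng)
        pulls[i] += 1
        if rng.random() < true_rates[i]:
            successes[i] += 1
        else:
            failures[i] += 1
    return pulls


# Hypothetical usefulness rates for three stored memories.
pulls = run_bandit([0.3, 0.5, 0.8], rounds=2000)
print(pulls)
```

Starting every memory from a uniform Beta(1, 1) prior means untested memories still get sampled early on, which is one simple way to soften a cold-start bias.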


cs.AI / 7 / 2602.22408

Exploring Human Behavior During Abstract Rule Inference and Problem Solving with the Cognitive Abstraction and Reasoning Corpus

探索抽象规则推理和问题解决中的人类行为：认知抽象与推理语料库的研究

Ahn, Caroline, Do, Quan, Bakst, Leah, Pascale, Michael P., McGuire, Joseph T., Hasselmo, Michael E., Stern, Chantal E.

Abstract

Humans exhibit remarkable flexibility in abstract reasoning, and can rapidly learn and apply rules from sparse examples. To investigate the cognitive strategies underlying this ability, we introduce the Cognitive Abstraction and Reasoning Corpus (CogARC), a diverse human-adapted subset of the Abstraction and Reasoning Corpus (ARC) which was originally developed to benchmark abstract reasoning in artificial intelligence. Across two experiments, CogARC was administered to a total of 260 human participants who freely generated solutions to 75 abstract visual reasoning problems. Success required inferring input-output rules from a small number of examples to transform the test input into one correct test output. Participants' behavior was recorded at high temporal resolution, including example viewing, edit sequences, and multi-attempt submissions. Participants were generally successful (mean accuracy ~90% for experiment 1 (n=40), ~80% for experiment 2 (n=220) across problems), but performance varied widely across problems and participants. Harder problems elicited longer deliberation times and greater divergence in solution strategies. Over the course of the task, participants initiated responses more quickly but showed a slight decline in accuracy, suggesting increased familiarity with the task structure rather than improved rule-learning ability. Importantly, even incorrect solutions were often highly convergent, even when the problem-solving trajectories differed in length and smoothness. Some trajectories progressed directly and efficiently toward a stable outcome, whereas others involved extended exploration or partial restarts before converging. Together, these findings highlight CogARC as a rich behavioral environment for studying human abstract reasoning, providing insight into how people generalize, misgeneralize, and adapt their strategies under uncertainty.

Chinese Translation

人类在抽象推理方面表现出显著的灵活性，能够迅速从稀疏的例子中学习和应用规则。为了研究这一能力背后的认知策略，我们引入了认知抽象与推理语料库（Cognitive Abstraction and Reasoning Corpus, CogARC），这是一个多样化的人类适应子集，源自于最初为评估人工智能抽象推理而开发的抽象与推理语料库（Abstraction and Reasoning Corpus, ARC）。在两项实验中，共有260名参与者参与了CogARC的测试，他们自由生成了75个抽象视觉推理问题的解决方案。成功的关键在于从少量示例中推断输入-输出规则，以将测试输入转化为一个正确的测试输出。参与者的行为以高时间分辨率记录，包括示例查看、编辑序列和多次尝试提交。参与者总体上表现成功（实验1的平均准确率约为90%（n=40），实验2约为80%（n=220）），但在不同问题和参与者之间表现差异显著。更难的问题引发了更长的思考时间和更大的解决策略差异。在任务进行过程中，参与者的反应启动速度有所提高，但准确率略有下降，这表明对任务结构的熟悉度增加，而非规则学习能力的提高。重要的是，即使是错误的解决方案通常也表现出高度的趋同，即使在问题解决轨迹的长度和流畅性上存在差异。有些轨迹直接而高效地朝向稳定结果推进，而另一些则涉及较长的探索或部分重启后再趋同。综合来看，这些发现突显了CogARC作为研究人类抽象推理的丰富行为环境，为人们如何在不确定性下进行概括、错误概括和调整策略提供了深刻的见解。


cs.AI / 8 / 2602.22413

Epistemic Filtering and Collective Hallucination: A Jury Theorem for Confidence-Calibrated Agents

认知过滤与集体幻觉：一种针对信心校准代理的陪审团定理

Karge, Jonas

Abstract

We investigate the collective accuracy of heterogeneous agents who learn to estimate their own reliability over time and selectively abstain from voting. While classical epistemic voting results, such as the Condorcet Jury Theorem (CJT), assume fixed participation, real-world aggregation often benefits from allowing agents to say "I don't know." We propose a probabilistic framework where agents engage in a calibration phase, updating beliefs about their own fixed competence, before facing a final confidence gate that determines whether to vote or abstain. We derive a non-asymptotic lower bound on the group's success probability and prove that this selective participation generalizes the asymptotic guarantees of the CJT to a sequential, confidence-gated setting. Empirically, we validate these bounds via Monte Carlo simulations. While our results are general, we discuss their potential application to AI safety, outlining how this framework can mitigate hallucinations in collective LLM decision-making.

Chinese Translation

我们研究了异质代理的集体准确性，这些代理随着时间的推移学习估计自身的可靠性，并选择性地放弃投票。尽管经典的认知投票结果，如孔多塞陪审团定理（CJT），假设固定的参与，但现实世界中的聚合往往受益于允许代理说“我不知道”。我们提出了一个概率框架，其中代理在面对决定是否投票或放弃的最终信心门槛之前，参与一个校准阶段，更新对自身固定能力的信念。我们推导出该群体成功概率的非渐近下界，并证明这种选择性参与将CJT的渐近保证推广到一个顺序的、信心门控的环境中。通过蒙特卡洛模拟，我们在实证上验证了这些界限。虽然我们的结果是一般性的，但我们讨论了它们在人工智能安全中的潜在应用，概述了该框架如何减轻集体大型语言模型决策中的幻觉现象。
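
The confidence-gated voting setup described in the abstract above can be explored with a small Monte Carlo sketch. This is a minimal illustration under stated assumptions, not the paper's model: each agent's belief is taken to be the empirical frequency of correct answers in a calibration phase, the gate is a fixed threshold on that estimate, and the electorate, threshold, and function name are hypothetical.

```python
import random


def majority_accuracy(competences, calib_rounds, threshold, trials, seed=0):
    """Estimate group accuracy with and without a confidence gate.

    Each agent estimates its own competence from `calib_rounds` practice
    questions, then votes in the gated setting only if that estimate reaches
    `threshold`. Ties and empty electorates count as failures.
    """
    rng = random.Random(seed)
    gated_hits = ungated_hits = 0
    for _ in range(trials):
        gated_margin = ungated_margin = 0
        for p in competences:
            # Calibration phase: empirical frequency of correct practice answers.
            estimate = sum(rng.random() < p for _ in range(calib_rounds)) / calib_rounds
            vote = 1 if rng.random() < p else -1  # +1 means a correct vote
            ungated_margin += vote
            if estimate >= threshold:  # confidence gate
                gated_margin += vote
        ungated_hits += ungated_margin > 0
        gated_hits += gated_margin > 0
    return gated_hits / trials, ungated_hits / trials


# Hypothetical electorate: ten weak agents (0.4) and eleven strong ones (0.7).
competences = [0.4] * 10 + [0.7] * 11
gated, ungated = majority_accuracy(
    competences, calib_rounds=50, threshold=0.55, trials=2000
)
print(f"gated accuracy = {gated:.3f}, ungated accuracy = {ungated:.3f}")
```

In this toy electorate the gate mostly filters out the weak agents, so the gated majority is decided almost entirely by the strong ones.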


cs.AI / 9 / 2602.22425

ArchAgent: Agentic AI-driven Computer Architecture Discovery

ArchAgent：基于代理智能的计算机架构发现

Gupta, Raghav, Jain, Akanksha, Gonzalez, Abraham, Novikov, Alexander, Huang, Po-Sen, Balog, Matej, Eisenberger, Marvin, Shirobokov, Sergey, Vũ, Ngân, Dixon, Martin, Nikolić, Borivoje, Ranganathan, Parthasarathy, Karandikar, Sagar

Abstract

Agile hardware design flows are a critically needed force multiplier to meet the exploding demand for compute. Recently, agentic generative AI systems have demonstrated significant advances in algorithm design, improving code efficiency, and enabling discovery across scientific domains. Bridging these worlds, we present ArchAgent, an automated computer architecture discovery system built on AlphaEvolve. We show ArchAgent's ability to automatically design/implement state-of-the-art (SoTA) cache replacement policies (architecting new mechanisms/logic, not only changing parameters), broadly within the confines of an established cache replacement policy design competition. In two days without human intervention, ArchAgent generated a policy achieving a 5.3% IPC speedup improvement over the prior SoTA on public multi-core Google Workload Traces. On the heavily-explored single-core SPEC06 workloads, it generated a policy in just 18 days showing a 0.9% IPC speedup improvement over the existing SoTA (a similar "winning margin" as reported by the existing SoTA). ArchAgent achieved these gains 3-5x faster than prior human-developed SoTA policies. Agentic flows also enable "post-silicon hyperspecialization" where agents tune runtime-configurable parameters exposed in hardware policies to further align the policies with a specific workload (mix). Exploiting this, we demonstrate a 2.4% IPC speedup improvement over prior SoTA on SPEC06 workloads. Finally, we outline broader implications for computer architecture research in the era of agentic AI. For example, we demonstrate the phenomenon of "simulator escapes", where the agentic AI flow discovered and exploited a loophole in a popular microarchitectural simulator - a consequence of the fact that these research tools were designed for a (now past) world where they were exclusively operated by humans acting in good-faith.

Chinese Translation

敏捷硬件设计流程是满足日益增长的计算需求所需的重要助推器。最近，代理生成的人工智能系统在算法设计、代码效率提升以及科学领域的发现方面取得了显著进展。为连接这两个领域，我们提出了ArchAgent，一个基于AlphaEvolve的自动化计算机架构发现系统。我们展示了ArchAgent在既定缓存替换政策设计竞赛框架内，自动设计和实现最先进（SoTA）缓存替换策略的能力（构建新的机制/逻辑，而不仅仅是更改参数）。在没有人工干预的情况下，ArchAgent在两天内生成了一种策略，使得在公共多核Google工作负载跟踪上实现了5.3%的IPC速度提升，超越了之前的SoTA。在经过大量探索的单核SPEC06工作负载上，它仅用18天便生成了一种策略，显示出0.9%的IPC速度提升，超越了现有的SoTA（与现有SoTA报告的“胜利幅度”相似）。ArchAgent以3-5倍于之前人类开发的SoTA策略的速度实现了这些增益。代理流程还使得“后硅超专业化”成为可能，代理能够调整在硬件政策中暴露的运行时可配置参数，以进一步使政策与特定工作负载（组合）对齐。利用这一点，我们在SPEC06工作负载上展示了相较于之前SoTA的2.4%的IPC速度提升。最后，我们概述了在代理智能时代计算机架构研究的更广泛影响。例如，我们展示了“模拟器逃逸”现象，代理智能流程发现并利用了一个流行微架构模拟器中的漏洞——这是因为这些研究工具是为一个（现在已过去的）世界设计的，那个世界中它们仅由以善意行事的人类操作。


cs.AI / 10 / 2602.22441

How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?

潜在推理方法在弱监督和强监督下的表现如何？

Cui, Yingqian, Dai, Zhenwei, He, Bing, Shi, Zhan, Liu, Hui, Sun, Rui, Liu, Zhiji, Xing, Yue, Tang, Jiliang, Dumoulin, Benoit

Abstract

Latent reasoning has recently been proposed as a reasoning paradigm that performs multi-step reasoning by generating steps in the latent space instead of the textual space. This paradigm enables reasoning beyond discrete language tokens by performing multi-step computation in continuous latent spaces. Although numerous studies have focused on improving the performance of latent reasoning, its internal mechanisms have not been fully investigated. In this work, we conduct a comprehensive analysis of latent reasoning methods to better understand the role and behavior of latent representation in the process. We identify two key issues across latent reasoning methods with different levels of supervision. First, we observe pervasive shortcut behavior, where the methods achieve high accuracy without relying on latent reasoning. Second, we examine the hypothesis that latent reasoning supports BFS-like exploration in latent space, and find that while latent representations can encode multiple possibilities, the reasoning process does not faithfully implement structured search, but instead exhibits implicit pruning and compression. Finally, our findings reveal a trade-off associated with supervision strength: stronger supervision mitigates shortcut behavior but restricts the ability of latent representations to maintain diverse hypotheses, whereas weaker supervision allows richer latent representations at the cost of increased shortcut behavior.

Chinese Translation

潜在推理最近被提出作为一种推理范式，通过在潜在空间中生成步骤而非文本空间来执行多步推理。这一范式使得推理超越离散语言符号，通过在连续潜在空间中进行多步计算。尽管已有众多研究集中于提高潜在推理的性能，但其内部机制仍未得到充分研究。在本研究中，我们对潜在推理方法进行了全面分析，以更好地理解潜在表示在这一过程中的作用和行为。我们识别出在不同监督水平下的潜在推理方法中存在的两个关键问题。首先，我们观察到普遍的捷径行为，即它们在不依赖潜在推理的情况下实现高准确率。其次，我们检验了潜在推理支持在潜在空间中进行类似广度优先搜索（BFS）的探索的假设，发现尽管潜在表示可以编码多种可能性，但推理过程并未忠实地实现结构化搜索，而是表现出隐式的剪枝和压缩。最后，我们的研究结果揭示了与监督强度相关的权衡：更强的监督减轻了捷径行为，但限制了潜在表示保持多样假设的能力，而较弱的监督则允许更丰富的潜在表示，但以增加捷径行为为代价。


cs.AI / 11 / 2602.22442

A Framework for Assessing AI Agent Decisions and Outcomes in AutoML Pipelines

评估自动机器学习管道中人工智能代理决策与结果的框架

Du, Gaoyuan, Ahlawat, Amit, Liu, Xiaoyang, Wu, Jing

Abstract

Agent-based AutoML systems rely on large language models to make complex, multi-stage decisions across data processing, model selection, and evaluation. However, existing evaluation practices remain outcome-centric, focusing primarily on final task performance. Through a review of prior work, we find that none of the surveyed agentic AutoML systems report structured, decision-level evaluation metrics intended for post-hoc assessment of intermediate decision quality. To address this limitation, we propose an Evaluation Agent (EA) that performs decision-centric assessment of AutoML agents without interfering with their execution. The EA is designed as an observer that evaluates intermediate decisions along four dimensions: decision validity, reasoning consistency, model quality risks beyond accuracy, and counterfactual decision impact. Across four proof-of-concept experiments, we demonstrate that the EA can (i) detect faulty decisions with an F1 score of 0.919, (ii) identify reasoning inconsistencies independent of final outcomes, and (iii) attribute downstream performance changes to agent decisions, revealing impacts ranging from -4.9% to +8.3% in final metrics. These results illustrate how decision-centric evaluation exposes failure modes that are invisible to outcome-only metrics. Our work reframes the evaluation of agentic AutoML systems from an outcome-based perspective to one that audits agent decisions, offering a foundation for reliable, interpretable, and governable autonomous ML systems.

Chinese Translation

基于代理的自动机器学习（AutoML）系统依赖于大型语言模型在数据处理、模型选择和评估等复杂多阶段决策中进行决策。然而，现有的评估实践仍然以结果为中心，主要关注最终任务性能。通过对先前工作的回顾，我们发现没有调查的代理型AutoML系统报告旨在事后评估中间决策质量的结构化决策级评估指标。为了解决这一局限性，我们提出了一种评估代理（Evaluation Agent, EA），该代理在不干扰执行的情况下对AutoML代理进行以决策为中心的评估。EA被设计为一个观察者，沿着四个维度评估中间决策：决策有效性、推理一致性、超越准确性的模型质量风险，以及反事实决策影响。在四个概念验证实验中，我们证明了EA可以（i）以0.919的F1分数检测错误决策，（ii）独立于最终结果识别推理不一致性，以及（iii）将下游性能变化归因于代理决策，揭示最终指标的影响范围从-4.9%到+8.3%。这些结果表明，以决策为中心的评估揭示了仅关注结果的指标所无法察觉的失败模式。我们的工作将代理型AutoML系统的评估从基于结果的视角重新框定为审计代理决策的视角，为可靠、可解释和可治理的自主机器学习系统提供了基础。


cs.AI / 12 / 2602.22452

CWM: Contrastive World Models for Action Feasibility Learning in Embodied Agent Pipelines

CWM：用于具身智能体管道中的行动可行性学习的对比世界模型

Banerjee, Chayan

Abstract

A reliable action feasibility scorer is a critical bottleneck in embodied agent pipelines: before any planning or reasoning occurs, the agent must identify which candidate actions are physically executable in the current state. Existing approaches use supervised fine-tuning (SFT) to train action scorers, but SFT treats each candidate independently and does not explicitly teach the model to discriminate between actions that are physically correct and those that are subtly wrong. We propose the Contrastive World Model (CWM), which fine-tunes a large language model (LLM) as an action scorer using an InfoNCE contrastive objective with hard-mined negative examples. The key idea is to push valid actions away from invalid ones in scoring space, with special emphasis on hard negatives: semantically similar but physically incompatible candidates. We evaluate CWM on the ScienceWorld benchmark through two studies. First, an intrinsic affordance evaluation on 605 hard-negative test pairs shows that CWM outperforms SFT by +6.76 percentage points on Precision@1 for minimal-edit negatives -- cases where a single word changes the physical outcome -- and achieves a higher AUC-ROC (0.929 vs. 0.906). Second, a live filter characterisation study measures how well CWM ranks gold-path actions against all valid environment actions during task execution. Under out-of-distribution stress conditions, CWM maintains a significantly better safety margin (-2.39) than SFT (-3.96), indicating that the gold action is ranked closer to the top. These results support the hypothesis that contrastive training induces representations that capture physical feasibility more faithfully than SFT alone.

Chinese Translation

可靠的行动可行性评分器是具身智能体管道中的一个关键瓶颈：在进行任何规划或推理之前，智能体必须识别当前状态下哪些候选行动是可以物理执行的。现有的方法使用监督微调（SFT）来训练行动评分器，但SFT将每个候选行动视为独立的，并未明确教导模型区分物理上正确的行动与那些微妙错误的行动。我们提出了对比世界模型（CWM），该模型使用带有困难挖掘负例的InfoNCE对比目标对大型语言模型（LLM）进行微调，作为行动评分器。其关键思想是在评分空间中将有效行动与无效行动分开，特别强调困难负例：语义上相似但物理上不兼容的候选项。我们通过两项研究在ScienceWorld基准上评估CWM。首先，在605个困难负例测试对上的内在可供性评估显示，CWM在最小编辑负例的Precision@1上比SFT提高了6.76个百分点——这些案例中单个词的变化会改变物理结果——并且AUC-ROC更高（0.929对比0.906）。其次，实时过滤特征化研究测量CWM在任务执行过程中如何将金路径行动与所有有效环境行动进行排名。在分布外压力条件下，CWM保持了显著优于SFT的安全边际（-2.39对比-3.96），表明金行动的排名更接近顶部。这些结果支持了对比训练能够诱导出比单独使用SFT更真实捕捉物理可行性的表示的假设。
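
The abstract above trains the action scorer with an InfoNCE contrastive objective over hard negatives. The sketch below computes a standard InfoNCE loss for one valid action scored against several negatives; it illustrates the general objective rather than the paper's training code, and the scores, temperature, and function name are made up.

```python
import math


def info_nce_loss(pos_score, neg_scores, temperature=0.1):
    """InfoNCE loss for one positive score against a list of negative scores."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]  # equals -log softmax(positive)


# Hypothetical scorer outputs: one valid action, three hard negatives.
well_separated = info_nce_loss(0.9, [0.2, 0.1, 0.3])
poorly_separated = info_nce_loss(0.4, [0.5, 0.45, 0.3])
print(f"{well_separated:.4f} {poorly_separated:.4f}")
```

The loss falls as the valid action's score pulls ahead of the negatives, which is why hard negatives (scores close to the positive) dominate the training signal.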


cs.AI / 13 / 2602.22465

ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization

ConstraintBench：直接优化中大语言模型约束推理的基准测试

Tso, Joseph, Schmittou, Preston, Huynh, Quan, Hutchins, Jibran

Abstract

Large language models are increasingly applied to operational decision-making where the underlying structure is constrained optimization. Existing benchmarks evaluate whether LLMs can formulate optimization problems as solver code, but leave open a complementary question. Can LLMs directly produce correct solutions to fully specified constrained optimization problems without access to a solver? We introduce ConstraintBench, a benchmark for evaluating LLMs on direct constrained optimization across 10 operations research domains, with all ground-truth solutions verified by the Gurobi solver. Each task presents a natural-language scenario with entities, constraints, and an optimization objective; the model must return a structured solution that a deterministic verifier checks against every constraint and the solver-proven optimum. We evaluate six frontier models on 200 tasks and find that feasibility, not optimality, is the primary bottleneck. The best model achieves only 65.0% constraint satisfaction, yet feasible solutions average 89 to 96% of the Gurobi-optimal objective. No model exceeds 30.5% on joint feasibility and optimality within 0.1% of the solver reference. Per-domain analysis shows large variation in difficulty, with average feasibility spanning from 83.3% in the production mix domain to 0.8% in the crew assignment domain. Further, systematic failure modes include duration constraint misunderstanding, entity hallucination, and a feasibility-optimality decoupling in facility location and vehicle routing where models achieve high feasibility but 0% optimality. ConstraintBench and all evaluation infrastructure will be publicly released.

Chinese Translation

大型语言模型越来越多地应用于操作决策制定，其基础结构为约束优化。现有基准测试评估大型语言模型能否将优化问题表述为求解器代码，但留出了一个补充问题。大型语言模型能否在没有求解器的情况下直接产生完全指定的约束优化问题的正确解？我们引入了ConstraintBench，这是一个用于评估大型语言模型在10个运筹学领域的直接约束优化能力的基准测试，所有的真实解均由Gurobi求解器验证。每个任务呈现一个自然语言场景，包含实体、约束和优化目标；模型必须返回一个结构化解，供确定性验证器检查每个约束和求解器证明的最优解。我们在200个任务上评估了六个前沿模型，发现可行性而非最优性是主要瓶颈。最佳模型仅实现了65.0%的约束满足率，但可行解的平均值为Gurobi最优目标的89%至96%。没有模型在联合可行性和最优性上超过30.5%，且与求解器参考值的偏差不超过0.1%。按领域分析显示，难度差异很大，平均可行性在生产混合领域为83.3%，而在机组分配领域仅为0.8%。此外，系统性失败模式包括对持续时间约束的误解、实体幻觉，以及在设施选址和车辆调度中可行性与最优性的解耦，模型在这些领域实现了高可行性但最优性为0%。ConstraintBench及所有评估基础设施将公开发布。
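
The abstract above describes a deterministic verifier that checks a structured solution against every constraint and against the solver-proven optimum within 0.1%. A minimal sketch of such a check follows, on a toy two-variable production-mix problem invented for this example; the problem, the constraint names, and the `verify` function are hypothetical and are not taken from ConstraintBench.

```python
def verify(solution, reference_optimum, rel_tol=1e-3, eps=1e-9):
    """Check a toy production-mix plan for feasibility and near-optimality.

    Toy problem (invented for this sketch): maximize 3x + 2y subject to
    x + y <= 4, x + 3y <= 6, x <= 3, x >= 0, y >= 0.
    """
    x, y = solution["x"], solution["y"]
    constraints = {
        "capacity": x + y <= 4 + eps,
        "material": x + 3 * y <= 6 + eps,
        "x_limit": x <= 3 + eps,
        "nonnegative": x >= -eps and y >= -eps,
    }
    violated = [name for name, ok in constraints.items() if not ok]
    feasible = not violated
    objective = 3 * x + 2 * y
    # Optimal means feasible and within rel_tol (0.1%) of the reference value.
    optimal = feasible and objective >= reference_optimum * (1 - rel_tol)
    return {"feasible": feasible, "optimal": optimal,
            "objective": objective, "violated": violated}


REFERENCE_OPTIMUM = 11  # attained at x = 3, y = 1 for the toy problem above
print(verify({"x": 3, "y": 1}, REFERENCE_OPTIMUM))
print(verify({"x": 2, "y": 1}, REFERENCE_OPTIMUM))
print(verify({"x": 3, "y": 2}, REFERENCE_OPTIMUM))
```

Reporting feasibility separately from optimality mirrors the abstract's distinction between constraint satisfaction and closeness to the solver reference.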


cs.AI / 14 / 2602.22480

VeRO: An Evaluation Harness for Agents to Optimize Agents

VeRO：用于优化代理的评估工具

Ursekar, Varun, Shanker, Apaar, Chatrath, Veronica, Yuan, Xue, Denton, Sam

Abstract

An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.

Chinese Translation

编码代理的一个重要新兴应用是代理优化：通过编辑-执行-评估循环对目标代理进行迭代改进。尽管这一任务具有重要性，但社区对编码代理在此任务上的表现缺乏系统性的理解。代理优化与传统软件工程有根本性的不同：目标代理将确定性代码与随机的 LLM（大语言模型）补全交织在一起，要求对中间推理和下游执行结果进行结构化捕捉。为了解决这些挑战，我们引入了 VERO（版本管理、奖励和观察），它提供了（1）一个可重复的评估工具，包含版本化的代理快照、预算控制的评估和结构化的执行轨迹，以及（2）一个包含目标代理和任务的基准套件，配有参考评估程序。使用 VERO，我们进行了一项实证研究，比较不同任务中的优化器配置，并分析哪些修改能够可靠地提高目标代理的性能。我们发布 VERO 以支持将代理优化作为编码代理的核心能力的研究。


cs.AI / 15 / 2602.22500

Mapping the Landscape of Artificial Intelligence in Life Cycle Assessment Using Large Language Models

利用大型语言模型绘制人工智能在生命周期评估中的应用景观

Mensikova, Anastasija, Rizzo, Donna M., Hinkelman, Kathryn

Abstract

Integration of artificial intelligence (AI) into life cycle assessment (LCA) has accelerated in recent years, with numerous studies successfully adapting machine learning algorithms to support various stages of LCA. Despite this rapid development, comprehensive and broad synthesis of AI-LCA research remains limited. To address this gap, this study presents a detailed review of published work at the intersection of AI and LCA, leveraging large language models (LLMs) to identify current trends, emerging themes, and future directions. Our analyses reveal that as LCA research continues to expand, the adoption of AI technologies has grown dramatically, with a noticeable shift toward LLM-driven approaches, continued increases in ML applications, and statistically significant correlations between AI approaches and corresponding LCA stages. By integrating LLM-based text-mining methods with traditional literature review techniques, this study introduces a dynamic and effective framework capable of capturing both high-level research trends and nuanced conceptual patterns (themes) across the field. Collectively, these findings demonstrate the potential of LLM-assisted methodologies to support large-scale, reproducible reviews across broad research domains, while also evaluating pathways for computationally-efficient LCA in the context of rapidly developing AI technologies. In doing so, this work helps LCA practitioners incorporate state-of-the-art tools and timely insights into environmental assessments that can enhance the rigor and quality of sustainability-driven decisions and decision-making processes.

Chinese Translation

近年来，人工智能（AI）与生命周期评估（LCA）的融合加速发展，许多研究成功地将机器学习算法应用于支持LCA的各个阶段。尽管这一快速发展，但对AI-LCA研究的全面和广泛综合仍然有限。为了解决这一问题，本研究对AI与LCA交叉领域已发表的工作进行了详细回顾，利用大型语言模型（LLMs）识别当前趋势、新兴主题和未来方向。我们的分析显示，随着LCA研究的不断扩展，AI技术的采用显著增长，尤其是向LLM驱动的方法转变、机器学习（ML）应用的持续增加，以及AI方法与相应LCA阶段之间的统计显著相关性。通过将基于LLM的文本挖掘方法与传统文献回顾技术相结合，本研究提出了一个动态且有效的框架，能够捕捉该领域的高层次研究趋势和细致的概念模式（主题）。总体而言，这些发现展示了LLM辅助方法在支持大规模、可重复的跨广泛研究领域的回顾中的潜力，同时评估了在快速发展的AI技术背景下实现计算高效LCA的路径。通过这样做，本研究帮助LCA从业者将最先进的工具和及时的见解融入环境评估中，从而增强可持续决策和决策过程的严谨性和质量。


cs.AI / 16 / 2602.22508

Mirroring the Mind: Distilling Human-Like Metacognitive Strategies into Large Language Models

映射思维：将类人元认知策略提炼至大型语言模型

Kim, Ik-hwan, Han, Hyeongrok, Jung, Mingi, Yu, Sangwon, Hong, Jinseok, Kim, Sang Hun, Choi, Yoonyoung, Yoon, Sungroh

Abstract

Large Reasoning Models (LRMs) often exhibit structural fragility in complex reasoning tasks, failing to produce correct answers even after successfully deriving valid intermediate steps. Through systematic analysis, we observe that these failures frequently stem not from a lack of reasoning capacity, but from a deficiency in self-regulatory control, where valid logic is destabilized by uncontrolled exploration or the failure to recognize logical sufficiency. Motivated by this observation, we propose Metacognitive Behavioral Tuning (MBT), a post-training framework that explicitly injects metacognitive behaviors into the model's thought process. MBT implements this via two complementary formulations: (1) MBT-S, which synthesizes rigorous reasoning traces from scratch, and (2) MBT-R, which rewrites the student's initial traces to stabilize intrinsic exploration patterns. Experiments across multi-hop QA benchmarks demonstrate that MBT consistently outperforms baselines, achieving notable gains on challenging benchmarks. By effectively eliminating reasoning collapse, MBT achieves higher accuracy with significantly reduced token consumption, demonstrating that internalizing metacognitive strategies leads to more stable and robust reasoning.

Chinese Translation

大型推理模型（LRMs）在复杂推理任务中常常表现出结构脆弱性，即使在成功推导出有效中间步骤后，仍无法产生正确答案。通过系统分析，我们观察到这些失败往往不是由于推理能力的缺乏，而是由于自我调节控制的不足，导致有效逻辑被无序探索或未能识别逻辑充分性所破坏。基于这一观察，我们提出了元认知行为调优（Metacognitive Behavioral Tuning, MBT），这是一种后训练框架，明确将元认知行为注入模型的思维过程。MBT通过两种互补的形式实现这一目标：（1）MBT-S，从头合成严格的推理轨迹；（2）MBT-R，重写学生的初始轨迹以稳定内在探索模式。在多跳问答基准测试中的实验表明，MBT始终优于基线，在具有挑战性的基准上取得了显著的提升。通过有效消除推理崩溃，MBT以显著减少的标记消耗实现了更高的准确性，证明了内化元认知策略能够导致更稳定和更强健的推理。

cs.AI / 17 / 2602.22519

A Mathematical Theory of Agency and Intelligence

代理性与智能的数学理论

Hafez, Wael, Wei, Chenan, Felipe, Rodrigo, Nazeri, Amir, Reid, Cameron

Abstract

To operate reliably under changing conditions, complex systems require feedback on how effectively they use resources, not just whether objectives are met. Current AI systems process vast information to produce sophisticated predictions, yet predictions can appear successful while the underlying interaction with the environment degrades. What is missing is a principled measure of how much of the total information a system deploys is actually shared between its observations, actions, and outcomes. We prove this shared fraction, which we term bipredictability, P, is intrinsic to any interaction, derivable from first principles, and strictly bounded: P can reach unity in quantum systems, is at most 0.5 in classical systems, and falls lower once agency (action selection) is introduced. We confirm these bounds in a physical system (double pendulum), reinforcement learning agents, and multi-turn LLM conversations. These results distinguish agency from intelligence: agency is the capacity to act on predictions, whereas intelligence additionally requires learning from interaction, self-monitoring of its learning effectiveness, and adapting the scope of observations, actions, and outcomes to restore effective learning. By this definition, current AI systems achieve agency but not intelligence. Inspired by thalamocortical regulation in biological systems, we demonstrate a feedback architecture that monitors P in real time, establishing a prerequisite for adaptive, resilient AI.

Chinese Translation

为了在变化的条件下可靠运行，复杂系统需要反馈其资源使用的有效性，而不仅仅是目标是否达成。目前的人工智能系统处理大量信息以产生复杂的预测，然而预测可能看似成功，而与环境的基本互动却在恶化。缺失的是一个原则性的度量，来衡量系统所部署的总信息中，实际上有多少是共享于其观察、行动和结果之间。我们证明这种共享比例，称之为双预测性（bipredictability），P，是任何互动内在的，可以从第一原理推导出来，并且严格有界：在量子系统中，P可以达到1，在经典系统中，P等于或小于0.5，并且一旦引入代理性（行动选择），则会更低。我们在一个物理系统（双摆）、强化学习代理和多轮大型语言模型（LLM）对话中确认了这些界限。这些结果区分了代理性与智能：代理性是基于预测采取行动的能力，而智能则额外要求从互动中学习、自我监测学习的有效性，并调整观察、行动和结果的范围以恢复有效学习。根据这个定义，目前的人工智能系统实现了代理性，但未达到智能。受到生物系统中丘脑皮层调节的启发，我们展示了一种实时监测P的反馈架构，为自适应、韧性AI建立了先决条件。
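
The abstract does not state the formula for P, so the following is only a rough sketch under an assumption: it uses the ratio I(X;Y) / (H(X) + H(Y)) as a stand-in for a "shared fraction" of information. This ratio is chosen because, for classical discrete variables, I(X;Y) ≤ min(H(X), H(Y)) ≤ (H(X) + H(Y)) / 2, so it can never exceed 0.5, which matches the classical bound quoted above. The function name `shared_information_fraction` and the toy data are hypothetical, and the paper's actual definition may differ.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of an empirical distribution given as counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def shared_information_fraction(pairs):
    """Hypothetical proxy (an assumption, not the paper's definition):
    I(X;Y) / (H(X) + H(Y)) estimated from observed (x, y) pairs."""
    h_x = entropy(Counter(x for x, _ in pairs))
    h_y = entropy(Counter(y for _, y in pairs))
    h_xy = entropy(Counter(pairs))
    if h_x + h_y == 0:
        return 0.0  # both variables are constant, nothing to share
    mutual_information = h_x + h_y - h_xy
    return mutual_information / (h_x + h_y)


# Toy data: a perfectly coupled pair and a fully independent pair.
coupled = [(0, 0), (1, 1)] * 50
independent = [(x, y) for x in (0, 1) for y in (0, 1)] * 25

print(shared_information_fraction(coupled))      # 0.5, the classical ceiling
print(shared_information_fraction(independent))  # 0.0
```

Even a perfectly coupled classical pair tops out at 0.5 under this proxy, which gives some intuition for why a value near the ceiling could be monitored as a health signal.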

cs.AI / 18 / 2602.22523

Cognitive Models and AI Algorithms Provide Templates for Designing Language Agents

认知模型与人工智能算法为语言代理的设计提供模板

Liu, Ryan, Arumugam, Dilip, Zhang, Cedegao E., Escola, Sean, Pitkow, Xaq, Griffiths, Thomas L.

Abstract

While contemporary large language models (LLMs) are increasingly capable in isolation, there are still many difficult problems that lie beyond the abilities of a single LLM. For such tasks, there is still uncertainty about how best to take many LLMs as parts and combine them into a greater whole. This position paper argues that potential blueprints for designing such modular language agents can be found in the existing literature on cognitive models and artificial intelligence (AI) algorithms. To make this point clear, we formalize the idea of an agent template that specifies roles for individual LLMs and how their functionalities should be composed. We then survey a variety of existing language agents in the literature and highlight their underlying templates derived directly from cognitive models or AI algorithms. By highlighting these designs, we aim to call attention to agent templates inspired by cognitive science and AI as a powerful tool for developing effective, interpretable language agents.

Chinese Translation

尽管当代的大型语言模型（LLMs）在独立运行时越来越强大，但仍然存在许多超出单一LLM能力的复杂问题。对于这些任务，如何将多个LLM作为部分并组合成一个更大的整体仍然存在不确定性。本文立场论文认为，设计此类模块化语言代理的潜在蓝图可以在现有的认知模型和人工智能（AI）算法文献中找到。为了明确这一观点，我们形式化了代理模板的概念，该模板指定了各个LLM的角色以及它们的功能应如何组合。接着，我们对文献中各种现有语言代理进行了调查，并强调了它们直接源自认知模型或AI算法的基础模板。通过突出这些设计，我们旨在引起人们对受认知科学和AI启发的代理模板的关注，认为其是开发有效且可解释的语言代理的强大工具。

cs.AI / 19 / 2602.22539

Agentic AI for Intent-driven Optimization in Cell-free O-RAN

面向意图驱动优化的自主人工智能在无小区开放无线接入网中的应用

Shokouhi, Mohammad Hossein, Wong, Vincent W. S.

Abstract

Agentic artificial intelligence (AI) is emerging as a key enabler for autonomous radio access networks (RANs), where multiple large language model (LLM)-based agents reason and collaborate to achieve operator-defined intents. The open RAN (O-RAN) architecture enables the deployment and coordination of such agents. However, most existing works consider simple intents handled by independent agents, while complex intents that require coordination among agents remain unexplored. In this paper, we propose an agentic AI framework for intent translation and optimization in cell-free O-RAN. A supervisor agent translates the operator intents into an optimization objective and minimum rate requirements. Based on this information, a user weighting agent retrieves relevant prior experience from a memory module to determine the user priority weights for precoding. If the intent includes an energy-saving objective, then an open radio unit (O-RU) management agent will also be activated to determine the set of active O-RUs by using a deep reinforcement learning (DRL) algorithm. A monitoring agent measures and monitors the user data rates and coordinates with other agents to guarantee the minimum rate requirements are satisfied. To enhance scalability, we adopt a parameter-efficient fine-tuning (PEFT) method that enables the same underlying LLM to be used for different agents. Simulation results show that the proposed agentic AI framework reduces the number of active O-RUs by 41.93% when compared with three baseline schemes in energy-saving mode. Using the PEFT method, the proposed framework reduces the memory usage by 92% when compared with deploying separate LLM agents.

Chinese Translation

自主人工智能（AI）正在成为自主无线接入网络（RAN）的关键推动力，其中多个基于大型语言模型（LLM）的代理进行推理和协作，以实现运营商定义的意图。开放无线接入网（O-RAN）架构使得这些代理的部署和协调成为可能。然而，大多数现有研究仅考虑由独立代理处理的简单意图，而需要代理之间协调的复杂意图尚未得到探索。本文提出了一种用于无小区O-RAN中意图翻译和优化的自主AI框架。监督代理将运营商意图转化为优化目标和最低速率要求。基于这些信息，用户加权代理从记忆模块中检索相关的先前经验，以确定用于预编码的用户优先权重。如果意图包括节能目标，则会激活开放无线单元（O-RU）管理代理，通过深度强化学习（DRL）算法确定活动O-RU的集合。监测代理测量和监控用户数据速率，并与其他代理协调，以确保满足最低速率要求。为了增强可扩展性，我们采用了一种参数高效微调（PEFT）方法，使得相同的基础LLM可以用于不同的代理。仿真结果表明，所提出的自主AI框架在节能模式下相比于三种基线方案减少了41.93%的活动O-RU数量。使用PEFT方法，所提框架在与部署单独LLM代理相比时，减少了92%的内存使用。

cs.AI / 20 / 2602.22546

Requesting Expert Reasoning: Augmenting LLM Agents with Learned Collaborative Intervention

请求专家推理：通过学习的协作干预增强大型语言模型代理

Wang, Zhiming, He, Jinwei, Lu, Feng

Abstract

Large Language Model (LLM) based agents excel at general reasoning but often fail in specialized domains where success hinges on long-tail knowledge absent from their training data. While human experts can provide this missing knowledge, their guidance is often unstructured and unreliable, making its direct integration into an agent's plan problematic. To address this, we introduce AHCE (Active Human-Augmented Challenge Engagement), a framework for on-demand Human-AI collaboration. At its core, the Human Feedback Module (HFM) employs a learned policy to treat the human expert as an interactive reasoning tool. Extensive experiments in Minecraft demonstrate the framework's effectiveness, increasing task success rates by 32% on normal difficulty tasks and nearly 70% on highly difficult tasks, all with minimal human intervention. Our work demonstrates that successfully augmenting agents requires learning how to request expert reasoning, moving beyond simple requests for help.

Chinese Translation

基于大型语言模型（LLM）的代理在一般推理方面表现出色，但在成功依赖于训练数据中缺失的长尾知识的专业领域时，往往表现不佳。虽然人类专家能够提供这些缺失的知识，但他们的指导通常是非结构化和不可靠的，使其直接融入代理的计划变得困难。为了解决这一问题，我们提出了AHCE（主动人类增强挑战参与）框架，用于按需的人机协作。该框架的核心是人类反馈模块（HFM），它采用学习的策略将人类专家视为一种互动推理工具。在Minecraft中的大量实验表明，该框架的有效性，在普通难度任务中任务成功率提高了32%，在高难度任务中几乎提高了70%，且人类干预最小。我们的研究表明，成功增强代理需要学习如何请求专家推理，而不仅仅是简单的求助请求。

cs.AI / 21 / 2602.22557

CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

CourtGuard：一种与模型无关的框架，用于大语言模型安全中的零-shot策略适应

Suleymanov, Umid, Bayramov, Rufiz, Gafarli, Suad, Musayeva, Seljan, Mammadov, Taghi, Akhundlu, Aynur, Kantarcioglu, Murat

Abstract

Current safety mechanisms for Large Language Models (LLMs) rely heavily on static, fine-tuned classifiers that suffer from adaptation rigidity: the inability to enforce new governance rules without expensive retraining. To address this, we introduce CourtGuard, a retrieval-augmented multi-agent framework that reimagines safety evaluation as Evidentiary Debate. By orchestrating an adversarial debate grounded in external policy documents, CourtGuard achieves state-of-the-art performance across 7 safety benchmarks, outperforming dedicated policy-following baselines without fine-tuning. Beyond standard metrics, we highlight two critical capabilities: (1) Zero-Shot Adaptability, where our framework successfully generalized to an out-of-domain Wikipedia Vandalism task (achieving 90% accuracy) by swapping the reference policy; and (2) Automated Data Curation and Auditing, where we leveraged CourtGuard to curate and audit nine novel datasets of sophisticated adversarial attacks. Our results demonstrate that decoupling safety logic from model weights offers a robust, interpretable, and adaptable path for meeting current and future regulatory requirements in AI governance.

Chinese Translation

当前大语言模型（LLMs）的安全机制严重依赖于静态的、经过微调的分类器，这些分类器存在适应性僵化的问题，无法在没有昂贵再训练的情况下执行新的治理规则。为了解决这个问题，我们提出了CourtGuard，一个增强检索的多智能体框架，将安全评估重新构想为证据辩论。通过协调基于外部政策文件的对抗性辩论，CourtGuard在7个安全基准测试中实现了最先进的性能，超越了不经过微调的专用政策遵循基线。除了标准指标外，我们还强调了两个关键能力：（1）零-shot适应性，我们的框架通过更换参考政策成功推广到一个域外的维基百科破坏行为任务（准确率达到90%）；（2）自动数据策划和审计，我们利用CourtGuard策划和审计了九个新颖的复杂对抗攻击数据集。我们的结果表明，将安全逻辑与模型权重解耦，为满足当前和未来的人工智能治理监管要求提供了一条稳健、可解释且可适应的路径。

cs.AI / 22 / 2602.22583

Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance

数学推理中的策略可执行性：利用人类与模型之间的差异进行有效指导

Liang, Weida, Sun, Yiyou, Nan, Shuyuan, Li, Chuang, Song, Dawn, Kawaguchi, Kenji

Abstract

Example-based guidance is widely used to improve mathematical reasoning at inference time, yet its effectiveness is highly unstable across problems and models, even when the guidance is correct and problem-relevant. We show that this instability arises from a previously underexplored gap between strategy usage (whether a reasoning strategy appears in successful solutions) and strategy executability (whether the strategy remains effective when instantiated as guidance for a target model). Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, leading to complementary strengths and consistent source-dependent reversals under guidance. Building on this diagnosis, we propose Selective Strategy Retrieval (SSR), a test-time framework that explicitly models executability by selectively retrieving and combining strategies using empirical, multi-route, source-aware signals. Across multiple mathematical reasoning benchmarks, SSR yields reliable and consistent improvements over direct solving, in-context learning, and single-source guidance, improving accuracy by up to +13 points on AIME25 and +5 points on Apex for compact reasoning models. Code and benchmark are publicly available at: https://github.com/lwd17/strategy-execute-pipeline.

Chinese Translation

基于示例的指导广泛应用于提高推理时的数学推理能力，但其在不同问题和模型中的有效性却高度不稳定——即使指导是正确且与问题相关的。我们展示了这种不稳定性源于策略使用（即推理策略是否出现在成功的解决方案中）与策略可执行性（即该策略在作为目标模型的指导时是否仍然有效）之间一个先前未被充分探讨的差距。通过对人类编写的解决方案与模型生成的解决方案进行控制分析，我们识别出使用与可执行性之间的系统性解离：人类和模型派生的策略在结构上和领域依赖性上存在差异，导致在指导下具有互补的优势和一致的源依赖性反转。在此基础上，我们提出了选择性策略检索（Selective Strategy Retrieval, SSR），这是一个测试时框架，通过使用经验的、多路径的、源感知的信号，明确建模可执行性，选择性地检索和组合策略。在多个数学推理基准测试中，SSR相较于直接求解、上下文学习和单一来源指导，提供了可靠且一致的改进，在AIME25上提高准确率高达13分，在Apex上提高5分，适用于紧凑推理模型。代码和基准测试可在以下网址公开获取：https://github.com/lwd17/strategy-execute-pipeline。

cs.AI / 23 / 2602.22585

Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach

纠正人工标签中的评估者效应：一种项目反应理论方法

Casabianca, Jodi M., Beiting-Parrish, Maggie

Abstract

Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic error. This paper integrates psychometric rater models into the AI pipeline to improve the reliability and validity of conclusions drawn from human judgments. The paper reviews common rater effects, severity and centrality, that distort observed ratings, and demonstrates how item response theory rater models, particularly the multi-faceted Rasch model, can separate true output quality from rater behavior. Using the OpenAI summarization dataset as an empirical example, we show how adjusting for rater severity produces corrected estimates of summary quality and provides diagnostic insight into rater performance. Incorporating psychometric modeling into human-in-the-loop evaluation offers more principled and transparent use of human data, enabling developers to make decisions based on adjusted scores rather than raw, error-prone ratings. This perspective highlights a path toward more robust, interpretable, and construct-aligned practices for AI development and evaluation.

Chinese Translation

人工评估在训练和评估人工智能模型中发挥着核心作用，但这些数据很少被视为受系统误差影响的测量。本文将心理测量评估者模型整合到人工智能流程中，以提高从人类判断中得出的结论的可靠性和有效性。文章回顾了扭曲观察评分的常见评估者效应，包括严厉性和中心性，并展示了如何通过项目反应理论评估者模型，特别是多维 Rasch 模型，来区分真实输出质量与评估者行为。以 OpenAI 摘要数据集作为实证例子，我们展示了如何通过调整评估者的严厉性来产生更正的摘要质量估计，并提供对评估者表现的诊断性洞察。将心理测量建模纳入人机协作评估提供了更为原则性和透明的人类数据使用方式，使开发者能够基于调整后的分数而非原始的、易出错的评分做出决策。这一视角突显了朝着更强健、可解释和符合构念的人工智能开发与评估实践迈进的路径。
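
To make the idea of adjusting for rater severity concrete, here is a minimal sketch. It is not the many-facet Rasch model the abstract refers to: it swaps in a much simpler additive model, score ≈ quality − severity, fitted by alternating averages. The function name `adjust_for_severity`, the rater labels, and the four toy ratings are all invented for illustration.

```python
from statistics import mean


def adjust_for_severity(ratings, n_iter=60):
    """Fit score ~= quality[item] - severity[rater] by alternating averages.

    `ratings` is a list of (item, rater, score) triples. This is a simplified
    additive stand-in (an assumption), not a Rasch-family model.
    """
    items = sorted({item for item, _, _ in ratings})
    raters = sorted({rater for _, rater, _ in ratings})
    quality = {item: 0.0 for item in items}
    severity = {rater: 0.0 for rater in raters}
    for _ in range(n_iter):
        # Re-estimate each item's quality after adding back rater severity.
        for item in items:
            quality[item] = mean(s + severity[r] for i, r, s in ratings if i == item)
        # Re-estimate each rater's severity as the average shortfall.
        for rater in raters:
            severity[rater] = mean(quality[i] - s for i, r, s in ratings if r == rater)
        # Pin the scale: severities average to zero; shift qualities to match.
        shift = mean(severity.values())
        for rater in raters:
            severity[rater] -= shift
        for item in items:
            quality[item] -= shift
    return quality, severity


# Toy design: A is seen only by a lenient rater, B only by a harsh one,
# and C by both, which links the two raters onto a common scale.
ratings = [("A", "lenient", 4), ("B", "harsh", 2),
           ("C", "lenient", 4), ("C", "harsh", 2)]
quality, severity = adjust_for_severity(ratings)
print(quality)   # A, B and C all end up close to 3
print(severity)  # harsh close to +1, lenient close to -1
```

The raw scores would rank A (4) well above B (2), while the adjusted estimates treat them as equal in quality. The correction depends on item C being rated by both raters; without such overlap, severity and quality cannot be separated.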

cs.AI / 24 / 2602.22603

SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning

SideQuest：基于模型的长时间跨度代理推理的KV缓存管理

Kariyappa, Sanjay, Suh, G. Edward

Abstract

Long-running agentic tasks, such as deep research, require multi-hop reasoning over information distributed across multiple webpages and documents. In such tasks, the LLM context is dominated by tokens from external retrieval, causing memory usage to grow rapidly and limiting decode performance. While several KV cache compression techniques exist for long-context inputs, we find that existing heuristics fail to support multi-step reasoning models effectively. We address this challenge with SideQuest -- a novel approach that leverages the Large Reasoning Model (LRM) itself to perform KV cache compression by reasoning about the usefulness of tokens in its context. To prevent the tokens associated with this management process from polluting the model's memory, we frame KV cache compression as an auxiliary task executed in parallel to the main reasoning task. Our evaluations, using a model trained with just 215 samples, show that SideQuest reduces peak token usage by up to 65% on agentic tasks with minimal degradation in accuracy, outperforming heuristic-based KV cache compression techniques.

Chinese Translation

长时间运行的代理任务，例如深入研究，需要对分布在多个网页和文档中的信息进行多跳推理。在此类任务中，LLM（大型语言模型）的上下文主要由外部检索的标记主导，导致内存使用迅速增长并限制了解码性能。虽然已有多种KV缓存压缩技术适用于长上下文输入，但我们发现现有的启发式方法未能有效支持多步骤推理模型。为了解决这一挑战，我们提出了SideQuest——一种新颖的方法，利用大型推理模型（Large Reasoning Model, LRM）本身，通过推理其上下文中标记的有用性来执行KV缓存压缩。为了防止与此管理过程相关的标记污染模型的内存，我们将KV缓存压缩框架设定为与主要推理任务并行执行的辅助任务。我们的评估使用了仅215个样本训练的模型，结果表明，SideQuest在代理任务中将峰值标记使用量减少了多达65%，且准确性降幅最小，优于基于启发式的KV缓存压缩技术。

cs.AI / 25 / 2602.22638

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

MobilityBench：用于评估真实世界移动场景中路线规划智能体的基准测试

Song, Zhiheng, Zhang, Jingshuai, Qin, Chuan, Wang, Chao, Chen, Chao, Xu, Longfei, Liu, Kaikui, Chu, Xiangxiang, Zhu, Hengshu

Abstract

Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic information retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench .

Chinese Translation

由大型语言模型（LLMs）驱动的路线规划智能体已成为通过自然语言交互和工具辅助决策支持日常人类移动的有前景的范式。然而，真实世界移动环境中的系统评估受到多样化路线需求、不确定的地图服务和有限的可重复性等因素的制约。在本研究中，我们介绍了MobilityBench，一个可扩展的基准测试，用于评估基于LLM的路线规划智能体在真实世界移动场景中的表现。MobilityBench由从高德地图（Amap）收集的大规模匿名真实用户查询构建，涵盖了全球多个城市的广泛路线规划意图。为了实现可重复的端到端评估，我们设计了一个确定性的API重放沙箱，消除了来自实时服务的环境差异。我们进一步提出了一个以结果有效性为中心的多维评估协议，并辅以对指令理解、规划、工具使用和效率的评估。通过使用MobilityBench，我们评估了多个基于LLM的路线规划智能体在多样化真实世界移动场景中的表现，并提供了对其行为和性能的深入分析。我们的研究发现，当前模型在基本信息检索和路线规划任务上表现良好，但在偏好约束路线规划方面存在显著困难，凸显了个性化移动应用中改进的巨大空间。我们在 https://github.com/AMAP-ML/MobilityBench 上公开发布了基准数据、评估工具包和文档。

cs.AI / 26 / 2602.22650

AHBid: An Adaptable Hierarchical Bidding Framework for Cross-Channel Advertising

AHBid：一种适应性层次化竞价框架用于跨渠道广告

Yang, Xinxin, Tang, Yangyang, Zhou, Yikun, Liu, Yaolei, Li, Yun, Yang, Bo

Abstract

In online advertising, the inherent complexity and dynamic nature of advertising environments necessitate the use of auto-bidding services to assist advertisers in bid optimization. This complexity is further compounded in multi-channel scenarios, where effective allocation of budgets and constraints across channels with distinct behavioral patterns becomes critical for optimizing return on investment. Current approaches predominantly rely on either optimization-based strategies or reinforcement learning techniques. However, optimization-based methods lack flexibility in adapting to dynamic market conditions, while reinforcement learning approaches often struggle to capture essential historical dependencies and observational patterns within the constraints of Markov Decision Process frameworks. To address these limitations, we propose AHBid, an Adaptable Hierarchical Bidding framework that integrates generative planning with real-time control. The framework employs a high-level generative planner based on diffusion models to dynamically allocate budgets and constraints by effectively capturing historical context and temporal patterns. We introduce a constraint enforcement mechanism to ensure compliance with specified constraints, along with a trajectory refinement mechanism that enhances adaptability to environmental changes through the utilization of historical data. The system further incorporates a control-based bidding algorithm that synergistically combines historical knowledge with real-time information, significantly improving both adaptability and operational efficacy. Extensive experiments conducted on large-scale offline datasets and through online A/B tests demonstrate the effectiveness of AHBid, yielding a 13.57% increase in overall return compared to existing baselines.

Chinese Translation

在在线广告中，广告环境的固有复杂性和动态特性要求使用自动竞价服务来帮助广告主进行竞价优化。在多渠道场景中，这种复杂性进一步加剧，如何在具有不同行为模式的渠道之间有效分配预算和约束，对于优化投资回报至关重要。目前的方法主要依赖于基于优化的策略或强化学习技术。然而，基于优化的方法在适应动态市场条件方面缺乏灵活性，而强化学习方法在马尔可夫决策过程框架的约束下，往往难以捕捉重要的历史依赖性和观察模式。为了解决这些局限性，我们提出了AHBid，一种适应性层次化竞价框架，该框架将生成规划与实时控制相结合。该框架采用基于扩散模型的高层生成规划器，通过有效捕捉历史背景和时间模式，动态分配预算和约束。我们引入了一种约束执行机制，以确保遵守指定的约束，同时还引入了一种轨迹优化机制，通过利用历史数据增强对环境变化的适应性。该系统进一步结合了一种基于控制的竞价算法，将历史知识与实时信息协同结合，显著提高了适应性和操作效率。在大规模离线数据集和在线A/B测试中进行的广泛实验表明，AHBid的有效性，相较于现有基准，整体回报提高了13.57%。

cs.AI / 27 / 2602.22680

Toward Personalized LLM-Powered Agents: Foundations, Evaluation, and Future Directions

迈向个性化的LLM驱动代理：基础、评估与未来方向

Xu, Yue, Chen, Qian, Ma, Zizhan, Liu, Dongrui, Wang, Wenxuan, Wang, Xiting, Xiong, Li, Wang, Wenjie

Abstract

Large language models have enabled agents that reason, plan, and interact with tools and environments to accomplish complex tasks. As these agents operate over extended interaction horizons, their effectiveness increasingly depends on adapting behavior to individual users and maintaining continuity across time, giving rise to personalized LLM-powered agents. In such long-term, user-dependent settings, personalization permeates the entire decision pipeline rather than remaining confined to surface-level generation. This survey provides a capability-oriented review of personalized LLM-powered agents. We organize the literature around four interdependent components: profile modeling, memory, planning, and action execution. Using this taxonomy, we synthesize representative methods and analyze how user signals are represented, propagated, and utilized, highlighting cross-component interactions and recurring design trade-offs. We further examine evaluation metrics and benchmarks tailored to personalized agents, summarize application scenarios spanning general assistance to specialized domains, and outline future directions for research and deployment. By offering a structured framework for understanding and designing personalized LLM-powered agents, this survey charts a roadmap toward more user-aligned, adaptive, robust, and deployable agentic systems, accelerating progress from prototype personalization to scalable real-world assistants.

Chinese Translation

大型语言模型使得代理能够推理、规划并与工具和环境互动，以完成复杂任务。随着这些代理在较长的交互时间范围内操作，它们的有效性越来越依赖于对个体用户行为的适应和跨时间的连续性，从而催生了个性化的LLM驱动代理。在这种长期的、依赖用户的环境中，个性化贯穿整个决策流程，而不仅仅局限于表层生成。本文提供了对个性化LLM驱动代理的能力导向综述。我们围绕四个相互依赖的组成部分组织文献：用户画像建模、记忆、规划和行动执行。利用这一分类法，我们综合了代表性方法，并分析了用户信号的表示、传播和利用方式，突出了跨组件的互动和反复出现的设计权衡。我们进一步考察了针对个性化代理的评估指标和基准，总结了从一般辅助到专业领域的应用场景，并概述了研究和部署的未来方向。通过提供一个结构化框架以理解和设计个性化的LLM驱动代理，本文为实现更符合用户需求、适应性强、稳健且可部署的代理系统绘制了一条从原型个性化到可扩展现实世界助手的路线图。

cs.AI / 28 / 2602.22702

Knob: A Physics-Inspired Gating Interface for Interpretable and Controllable Neural Dynamics

Knob：一种受物理启发的门控接口，用于可解释和可控的神经动态

Jiang, Siyu, Cui, Sanshuai, Zeng, Hui

Abstract

Existing neural network calibration methods often treat calibration as a static, post-hoc optimization task. However, this neglects the dynamic and temporal nature of real-world inference. Moreover, existing methods do not provide an intuitive interface enabling human operators to dynamically adjust model behavior under shifting conditions. In this work, we propose Knob, a framework that connects deep learning with classical control theory by mapping neural gating dynamics to a second-order mechanical system. By establishing correspondences between physical parameters -- damping ratio ($\zeta$) and natural frequency ($\omega_n$) -- and neural gating, we create a tunable "safety valve". The core mechanism employs a logit-level convex fusion, functioning as an input-adaptive temperature scaling. It tends to reduce model confidence particularly when model branches produce conflicting predictions. Furthermore, by imposing second-order dynamics (Knob-ODE), we enable a dual-mode inference: standard i.i.d. processing for static tasks, and state-preserving processing for continuous streams. Our framework allows operators to tune "stability" and "sensitivity" through familiar physical analogues. This paper presents an exploratory architectural interface; we focus on demonstrating the concept and validating its control-theoretic properties rather than claiming state-of-the-art calibration performance. Experiments on CIFAR-10-C validate the calibration mechanism and demonstrate that, in Continuous Mode, the gate responses are consistent with standard second-order control signatures (step settling and low-pass attenuation), paving the way for predictable human-in-the-loop tuning.

Chinese Translation

现有的神经网络校准方法通常将校准视为一个静态的、事后优化的任务。然而，这忽视了现实世界推理的动态和时间特性。此外，现有方法未能提供一个直观的接口，使人类操作员能够在变化的条件下动态调整模型行为。在本研究中，我们提出了Knob，一个将深度学习与经典控制理论相结合的框架，通过将神经门控动态映射到二阶机械系统。通过建立物理参数（阻尼比 $\zeta$ 和自然频率 $\omega_n$）与神经门控之间的对应关系，我们创建了一个可调的“安全阀”。核心机制采用对数级别的凸融合，作为输入自适应温度缩放。当模型分支产生相互冲突的预测时，它倾向于降低模型的置信度。此外，通过施加二阶动态（Knob-ODE），我们实现了双模式推理：针对静态任务的标准独立同分布（i.i.d.）处理，以及针对连续流的状态保持处理。我们的框架允许操作员通过熟悉的物理类比来调节“稳定性”和“灵敏度”。本文呈现了一个探索性的架构接口；我们专注于展示这一概念并验证其控制理论特性，而非声称达到最先进的校准性能。在CIFAR-10-C上的实验验证了校准机制，并表明在连续模式下，门控响应与标准的二阶控制特征（阶跃响应和低通衰减）一致，为可预测的人机协同调节铺平了道路。
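
The two ingredients named in the abstract, a logit-level convex fusion and a gate governed by second-order dynamics, can be sketched roughly as follows. The abstract gives no equations, so this assumes the textbook second-order form g'' + 2ζω_n g' + ω_n² g = ω_n² u and a fusion of the form (1 − g)·z_a + g·z_b. The function names, the two-branch setup, and the step sizes are illustrative assumptions, not the paper's implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(logits_a, logits_b, gate):
    """Convex fusion of two branches' logits; the gate is clamped to [0, 1]."""
    gate = min(1.0, max(0.0, gate))
    return [(1.0 - gate) * a + gate * b for a, b in zip(logits_a, logits_b)]


def gate_step_response(zeta, omega_n, target=1.0, dt=0.01, steps=1000):
    """Integrate g'' + 2*zeta*omega_n*g' + omega_n^2*g = omega_n^2*target
    with semi-implicit Euler, starting from rest at g = 0 (assumed dynamics)."""
    gate, velocity, trace = 0.0, 0.0, []
    for _ in range(steps):
        accel = omega_n ** 2 * (target - gate) - 2.0 * zeta * omega_n * velocity
        velocity += dt * accel
        gate += dt * velocity
        trace.append(gate)
    return trace


# Conflicting branches: fusing them flattens the output distribution.
confident = max(softmax([4.0, 0.0]))
fused = max(softmax(fuse_logits([4.0, 0.0], [0.0, 4.0], gate=0.5)))

# Damping ratio as a knob: critical damping settles without overshoot,
# light damping overshoots before settling.
critical = gate_step_response(zeta=1.0, omega_n=5.0)
underdamped = gate_step_response(zeta=0.2, omega_n=5.0)
print(confident, fused, max(critical), max(underdamped))
```

In this toy setup the fused confidence drops to 0.5 when the branches disagree, and changing ζ alone switches the gate between a smooth and an oscillatory response, which is the kind of "stability" tuning the abstract describes.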

cs.AI / 29 / 2602.22718

RLHFless: Serverless Computing for Efficient RLHF

RLHFless：高效RLHF的无服务器计算

Wei, Rui, Yu, Hanfei, Jain, Shubham, Sivakumar, Yogarajan, Tiwari, Devesh, Li, Jian, Park, Seung-Jong, Wang, Hao

Abstract

Reinforcement Learning from Human Feedback (RLHF) has been widely applied to Large Language Model (LLM) post-training to align model outputs with human preferences. Recent models, such as DeepSeek-R1, have also shown RLHF's potential to improve LLM reasoning on complex tasks. In RL, inference and training co-exist, creating dynamic resource demands throughout the workflow. Compared to traditional RL, RLHF further challenges training efficiency due to expanding model sizes and resource consumption. Several RLHF frameworks aim to balance flexible abstraction and efficient execution. However, they rely on serverful infrastructures, which struggle with fine-grained resource variability. As a result, during synchronous RLHF training, idle time between or within RL components often causes overhead and resource wastage. To address these issues, we present RLHFless, the first scalable training framework for synchronous RLHF, built on serverless computing environments. RLHFless adapts to dynamic resource demands throughout the RLHF pipeline, pre-computes shared prefixes to avoid repeated computation, and uses a cost-aware actor scaling strategy that accounts for response length variation to find sweet spots with lower cost and higher speed. In addition, RLHFless assigns workloads efficiently to reduce intra-function imbalance and idle time. Experiments on both physical testbeds and a large-scale simulated cluster show that RLHFless achieves up to 1.35x speedup and 44.8% cost reduction compared to the state-of-the-art baseline.

Chinese Translation

人类反馈强化学习（Reinforcement Learning from Human Feedback，RLHF）已广泛应用于大型语言模型（Large Language Model，LLM）的后期训练，以使模型输出与人类偏好对齐。近期模型如DeepSeek-R1也展示了RLHF在复杂任务上提升LLM推理能力的潜力。在强化学习（Reinforcement Learning，RL）中，推理与训练共存，导致整个工作流程中动态资源需求的产生。与传统RL相比，由于模型规模的扩大和资源消耗的增加，RLHF进一步挑战了训练效率。一些RLHF框架旨在平衡灵活的抽象和高效的执行。然而，它们依赖于服务器架构，这在处理细粒度资源变化时面临困难。因此，在同步RLHF训练过程中，RL组件之间或内部的空闲时间往往导致开销和资源浪费。为了解决这些问题，我们提出了RLHFless，这是第一个基于无服务器计算环境的同步RLHF可扩展训练框架。RLHFless能够适应RLHF管道中动态的资源需求，预计算共享前缀以避免重复计算，并采用一种考虑响应长度变化的成本感知演员扩展策略，以找到成本更低、速度更快的最佳点。此外，RLHFless有效分配工作负载，以减少函数内部的不平衡和空闲时间。在物理测试平台和大规模模拟集群上的实验表明，与最先进的基线相比，RLHFless实现了最高1.35倍的加速和44.8%的成本降低。
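
As a rough illustration of why pre-computing shared prefixes avoids repeated computation, the sketch below counts how many prompt tokens need prefill with and without reuse. The abstract does not describe how RLHFless detects or caches prefixes, so the function names and the token IDs here are hypothetical, and real systems operate on KV-cache blocks rather than plain lists.

```python
def shared_prefix_length(sequences):
    """Length of the longest prefix common to every token sequence."""
    if not sequences:
        return 0
    shortest = min(len(seq) for seq in sequences)
    for pos in range(shortest):
        token = sequences[0][pos]
        if any(seq[pos] != token for seq in sequences):
            return pos
    return shortest


def prefill_tokens(sequences, reuse_prefix):
    """Tokens that must be processed during prefill for a batch of prompts.

    With reuse, the shared prefix is computed once instead of once per prompt.
    """
    total = sum(len(seq) for seq in sequences)
    if not reuse_prefix or not sequences:
        return total
    prefix = shared_prefix_length(sequences)
    return total - prefix * (len(sequences) - 1)


# Hypothetical token IDs: three prompts sharing a 4-token system prefix.
prompts = [
    [1, 2, 3, 4, 10],
    [1, 2, 3, 4, 11, 12],
    [1, 2, 3, 4, 13],
]
print(shared_prefix_length(prompts))                 # 4
print(prefill_tokens(prompts, reuse_prefix=False))   # 16
print(prefill_tokens(prompts, reuse_prefix=True))    # 8
```

The saving grows with both the prefix length and the number of responses sampled per prompt, which is why it matters in group-based RL rollouts.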

cs.AI / 30 / 2602.22743

Generative Data Transformation: From Mixed to Unified Data

生成数据转换：从混合数据到统一数据

Zhang, Jiaqing, Yin, Mingjia, Wang, Hao, Tian, Yuxin, Ye, Yuyang, Li, Yawen, Guo, Wei, Liu, Yong, Chen, Enhong

Abstract

Recommendation model performance is intrinsically tied to the quality, volume, and relevance of the training data. To address common challenges like data sparsity and cold start, recent research has leveraged data from multiple auxiliary domains to enrich information within the target domain. However, inherent domain gaps can degrade the quality of mixed-domain data, leading to negative transfer and diminished model performance. The prevailing model-centric paradigm, which relies on complex, customized architectures, struggles to capture the subtle, non-structural sequence dependencies across domains, leading to poor generalization and high demands on computational resources. To address these shortcomings, we propose Taesar, a data-centric framework for target-aligned sequential regeneration, which employs a contrastive decoding mechanism to adaptively encode cross-domain context into target-domain sequences, enabling standard models to learn intricate dependencies without complex fusion architectures. Experiments show Taesar outperforms model-centric solutions and generalizes to various sequential models. By generating enriched datasets, Taesar effectively combines the strengths of data- and model-centric paradigms. The code accompanying this paper is available at https://github.com/USTC-StarTeam/Taesar.

Chinese Translation

推荐模型的性能与其训练数据的质量、数量和相关性密切相关。为了应对数据稀疏和冷启动等常见挑战，近期研究利用来自多个辅助领域的数据来丰富目标领域的信息。然而，固有的领域差异可能会降低混合领域数据的质量，导致负迁移和模型性能下降。现有的主流以模型为中心的范式（依赖复杂、定制的架构）难以捕捉跨领域的微妙非结构化序列依赖，导致泛化能力差且对计算资源的需求高。为了解决这些问题，我们提出了Taesar，一个以数据为中心的框架，用于目标对齐的序列再生（target-aligned sequential regeneration），该框架采用对比解码机制自适应地将跨领域上下文编码到目标领域序列中，使标准模型能够在没有复杂融合架构的情况下学习复杂的依赖关系。实验表明，Taesar的表现优于以模型为中心的解决方案，并且能够泛化到各种序列模型。通过生成丰富的数据集，Taesar有效结合了以数据和以模型为中心的范式的优势。与本文相关的代码可在 https://github.com/USTC-StarTeam/Taesar 获取。
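
For readers unfamiliar with contrastive decoding, the sketch below shows the generic rule from the text-generation literature: among tokens the stronger model finds plausible, pick the one whose log-probability most exceeds that of a weaker reference model. The abstract does not say which two distributions Taesar contrasts or how items are scored, so the `expert`/`amateur` naming, the `alpha` threshold, and the toy probabilities are assumptions made purely for illustration.

```python
import math


def contrastive_pick(expert_logprobs, amateur_logprobs, alpha=0.1):
    """Generic contrastive decoding step (not Taesar's exact mechanism).

    Keeps candidates whose expert probability is at least `alpha` times the
    expert's top probability, then returns the index that maximizes
    expert_logprob - amateur_logprob.
    """
    cutoff = math.log(alpha) + max(expert_logprobs)
    best_index, best_score = None, -math.inf
    for index, (exp_lp, ama_lp) in enumerate(zip(expert_logprobs, amateur_logprobs)):
        if exp_lp < cutoff:
            continue  # implausible under the expert, skip
        score = exp_lp - ama_lp
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Toy next-item distributions over three candidates (invented numbers).
expert = [math.log(p) for p in (0.50, 0.47, 0.03)]
amateur = [math.log(p) for p in (0.80, 0.199, 0.001)]

greedy = max(range(len(expert)), key=lambda i: expert[i])
print(greedy)                                          # 0
print(contrastive_pick(expert, amateur))               # 1
print(contrastive_pick(expert, amateur, alpha=0.01))   # 2
```

Candidate 0 is the greedy choice, but the reference model already predicts it strongly, so the contrast favors candidate 1. Loosening the plausibility threshold lets the rare candidate 2 win, which shows why the threshold is needed to keep the contrast from amplifying noise.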

cs.AI / 31 / 2602.22751

Know What You Know: Metacognitive Entropy Calibration for Verifiable RL Reasoning

了解你所知道的：可验证强化学习推理的元认知熵校准

Zhao, Qiannian, Yang, Chen, Jing, Jinhao, Zhang, Yunke, Ren, Xuhui, Yu, Lu, Zhang, Shijie, Yin, Hongzhi

Abstract

Large reasoning models (LRMs) have emerged as a powerful paradigm for solving complex real-world tasks. In practice, these models are predominantly trained via Reinforcement Learning with Verifiable Rewards (RLVR), yet most existing outcome-only RLVR pipelines rely almost exclusively on a binary correctness signal and largely ignore the model's intrinsic uncertainty. We term this discrepancy the uncertainty-reward mismatch, under which high- and low-uncertainty solutions are treated equivalently, preventing the policy from learning to "know what it knows" and impeding the shift from optimizing for correct answers to optimizing effective reasoning paths. This limitation is especially critical in reasoning-centric tasks such as mathematics and question answering, where performance hinges on the quality of the model's internal reasoning process rather than mere memorization of final answers. To address this, we propose EGPO, a metacognitive entropy calibration framework that explicitly integrates intrinsic uncertainty into RLVR for enhancing LRMs. EGPO estimates per-sample uncertainty using a zero-overhead entropy proxy derived from token-level likelihoods and aligns it with extrinsic correctness through an asymmetric calibration mechanism that preserves correct reasoning while selectively regulating overconfident failures, thereby enabling stable and uncertainty-aware policy optimization. Moreover, EGPO recovers informative learning signals from otherwise degenerate group-based rollouts without modifying the verifier or reward definition. Extensive experiments across multiple benchmarks demonstrate that the proposed EGPO leads to substantial and consistent improvements in reasoning performance, establishing a principled path for advancing LRMs through metacognitive entropy calibration.

Chinese Translation

大型推理模型（LRMs）已成为解决复杂现实任务的强大范式。在实践中，这些模型主要通过可验证奖励的强化学习（RLVR）进行训练，但大多数现有的仅基于结果的RLVR流程几乎完全依赖于二元正确性信号，并在很大程度上忽视了模型的内在不确定性。我们将这种差异称为不确定性-奖励不匹配，在这种情况下，高不确定性和低不确定性解决方案被等同对待，阻碍了策略的“了解你所知道的”，并妨碍了从优化正确答案转向优化有效推理路径的转变。这一限制在以推理为中心的任务（如数学和问答）中尤为关键，因为这些任务的表现取决于模型内部推理过程的质量，而不仅仅是最终答案的记忆。为了解决这个问题，我们提出了EGPO，一种元认知熵校准框架，它明确将内在不确定性整合到RLVR中，以增强LRMs。EGPO使用从标记级别的似然性推导出的零开销熵代理来估计每个样本的不确定性，并通过一种不对称的校准机制将其与外部正确性对齐，该机制在选择性地调节过于自信的失败的同时保持正确推理，从而实现稳定且具不确定性感知的策略优化。此外，EGPO从否则会退化的基于组的回滚中恢复有信息的学习信号，而无需修改验证器或奖励定义。在多个基准上的广泛实验表明，所提出的EGPO在推理性能上带来了显著且一致的改善，为通过元认知熵校准推进LRMs建立了原则性路径。
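
To make the idea of an entropy proxy built from token-level likelihoods concrete, here is a minimal sketch. It is not EGPO's actual formulation, which the abstract does not give. It assumes the proxy is the mean negative log-likelihood of the sampled tokens, and the function `calibrated_reward` and its `penalty_scale` parameter are hypothetical names used only to illustrate an asymmetric treatment of correct and incorrect rollouts.

```python
import math

def entropy_proxy(token_logprobs):
    """Mean negative log-likelihood of the sampled tokens (nats per token).

    Assumption: this stands in for the abstract's "entropy proxy". It reuses
    log-probabilities already produced during sampling, so it needs no extra
    forward pass.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)

def calibrated_reward(correct, token_logprobs, penalty_scale=0.5):
    """Hypothetical asymmetric shaping, not the paper's formula.

    Correct rollouts keep the verifier reward of 1.0 unchanged. Incorrect
    rollouts are penalized more strongly the more confident (lower-entropy)
    they were.
    """
    if correct:
        return 1.0
    confidence = math.exp(-entropy_proxy(token_logprobs))  # lies in (0, 1]
    return -penalty_scale * confidence
```

Under this sketch, a group of rollouts that are all wrong no longer receives identical rewards, which is one plausible reading of how a signal could be recovered from degenerate groups.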

cs.AI / 32 / 2602.22758

Decomposing Physician Disagreement in HealthBench

解构HealthBench中的医生分歧

Borgohain, Satya, Mariathas, Roy

Abstract

We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it. Rubric identity accounts for 15.8% of met/not-met label variance but only 3.6-6.9% of disagreement variance; physician identity accounts for just 2.4%. The dominant 81.8% case-level residual is not reduced by HealthBench's metadata labels (z = -0.22, p = 0.83), normative rubric language (pseudo R^2 = 1.2%), medical specialty (0/300 Tukey pairs significant), surface-feature triage (AUC = 0.58), or embeddings (AUC = 0.485). Disagreement follows an inverted-U with completion quality (AUC = 0.689), confirming physicians agree on clearly good or bad outputs but split on borderline cases. Physician-validated uncertainty categories reveal that reducible uncertainty (missing context, ambiguous phrasing) more than doubles disagreement odds (OR = 2.55, p < 10^(-24)), while irreducible uncertainty (genuine medical ambiguity) has no effect (OR = 1.01, p = 0.90), though even the former explains only ~3% of total variance. The agreement ceiling in medical AI evaluation is thus largely structural, but the reducible/irreducible dissociation suggests that closing information gaps in evaluation scenarios could lower disagreement where inherent clinical ambiguity does not, pointing toward actionable evaluation design improvements.

Chinese Translation

我们对HealthBench医疗人工智能评估数据集中医生的分歧进行了解构，以理解变异的来源以及哪些可观察特征可以解释这种变异。评分标准的身份占到了满足/不满足标签变异的15.8%，但仅占分歧变异的3.6-6.9%；医生身份仅占2.4%。主导的81.8%案例级残差并未被HealthBench的元数据标签（z = -0.22, p = 0.83）、规范评分语言（伪R^2 = 1.2%）、医学专业（0/300 Tukey对比显著）、表面特征分流（AUC = 0.58）或嵌入（AUC = 0.485）所减少。分歧与完成质量呈倒U型关系（AUC = 0.689），确认医生在明显良好或不良的结果上达成一致，但在边界案例上存在分歧。经过医生验证的不确定性类别显示，可减少的不确定性（缺失上下文、模糊措辞）使分歧的几率增加了两倍以上（OR = 2.55, p < 10^(-24)），而不可减少的不确定性（真正的医学模糊性）则没有影响（OR = 1.01, p = 0.90），尽管即便是前者也仅解释了总变异的约3%。因此，医疗人工智能评估中的一致性上限在很大程度上是结构性的，但可减少/不可减少的区分表明，在评估场景中填补信息空白可能会降低分歧，而在固有临床模糊性无法降低的情况下，这指向了可行的评估设计改进。

cs.AI / 33 / 2602.22769

AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

AMA-Bench：评估代理应用中的长时记忆

Zhao, Yujie, Yuan, Boqin, Huang, Junbo, Yuan, Haocheng, Yu, Zhongming, Xu, Haozhou, Hu, Lanxiang, Shankarampeta, Abhilash, Huang, Zimeng, Ni, Wentao, Tian, Yuandong, Zhao, Jishen

Abstract

Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between practical applications and current evaluation standards for agent memory: existing benchmarks primarily focus on dialogue-centric, human-agent interactions. In reality, agent memory consists of a continuous stream of agent-environment interactions that are primarily composed of machine-generated representations. To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real-world agentic trajectories across representative agentic applications, paired with expert-curated QA, and (2) a set of synthetic agentic trajectories that scale to arbitrary horizons, paired with rule-based QA. Our comprehensive study shows that existing memory systems underperform on AMA-Bench primarily because they lack causality and objective information and are constrained by the lossy nature of similarity-based retrieval employed by many memory systems. To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval. Our results demonstrate that AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baselines by 11.16%.

Chinese Translation

大型语言模型（LLMs）作为自主代理在日益复杂的应用中被部署，其中启用长时记忆对于实现强大的性能至关重要。然而，实际应用与当前代理记忆评估标准之间存在显著差距：现有基准主要集中在以对话为中心的人机交互上。实际上，代理记忆由代理与环境之间的连续交互流组成，这些交互主要由机器生成的表示构成。为了弥合这一差距，我们引入了AMA-Bench（任意长度的代理记忆），该基准评估LLMs在真实代理应用中的长时记忆。它具有两个关键组成部分：（1）一组覆盖代表性代理应用的真实世界代理轨迹，配有专家策划的问答（QA）；（2）一组可扩展到任意时间范围的合成代理轨迹，配有基于规则的问答。我们的综合研究表明，现有记忆系统在AMA-Bench上的表现不佳，主要是因为它们缺乏因果关系和客观信息，并受到许多记忆系统所采用的基于相似性的检索的损失性特征的限制。为了解决这些局限性，我们提出了AMA-Agent，一种有效的记忆系统，具有因果图和工具增强的检索。我们的结果表明，AMA-Agent在AMA-Bench上的平均准确率达到57.22%，超越了最强记忆系统基线11.16%。

cs.AI / 34 / 2602.22771

ClinDet-Bench: Beyond Abstention, Evaluating Judgment Determinability of LLMs in Clinical Decision-Making

ClinDet-Bench：超越弃权，评估大型语言模型在临床决策中的判断可确定性

Watanabe, Yusuke, Kobashi, Yohei, Kojima, Takeshi, Iwasawa, Yusuke, Okuno, Yasushi, Matsuo, Yutaka

Abstract

Clinical decisions are often required under incomplete information. Clinical experts must identify whether available information is sufficient for judgment, as both premature conclusion and unnecessary abstention can compromise patient safety. To evaluate this capability of large language models (LLMs), we developed ClinDet-Bench, a benchmark based on clinical scoring systems that decomposes incomplete-information scenarios into determinable and undeterminable conditions. Identifying determinability requires considering all hypotheses about missing information, including unlikely ones, and verifying whether the conclusion holds across them. We find that recent LLMs fail to identify determinability under incomplete information, producing both premature judgments and excessive abstention, despite correctly explaining the underlying scoring knowledge and performing well under complete information. These findings suggest that existing benchmarks are insufficient to evaluate the safety of LLMs in clinical settings. ClinDet-Bench provides a framework for evaluating determinability recognition, leading to appropriate abstention, with potential applicability to medicine and other high-stakes domains, and is publicly available.

Chinese Translation

临床决策常常是在信息不完整的情况下做出的。临床专家必须判断现有信息是否足以做出判断，因为过早的结论和不必要的弃权都可能危及患者安全。为了评估大型语言模型（LLMs）在这一能力上的表现，我们开发了ClinDet-Bench，这是一个基于临床评分系统的基准，旨在将不完整信息场景分解为可确定和不可确定的条件。识别可确定性需要考虑所有关于缺失信息的假设，包括不太可能的假设，并验证结论在这些假设下是否成立。我们的研究发现，近期的LLMs在不完整信息下未能识别可确定性，导致产生过早的判断和过度的弃权，尽管它们能够正确解释基础评分知识并在完整信息下表现良好。这些发现表明，现有的基准不足以评估LLMs在临床环境中的安全性。ClinDet-Bench提供了一个评估可确定性识别的框架，从而促使适当的弃权，具有在医学和其他高风险领域应用的潜力，并已公开发布。
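
The determinability check described in the abstract (consider every hypothesis about the missing information and verify whether the conclusion holds across all of them) can be sketched for a simple additive score. The scoring items, point values, and threshold below are hypothetical and do not come from ClinDet-Bench.

```python
from itertools import product

def determinability(known, missing, threshold):
    """Classify a partially observed additive score.

    known: dict mapping item -> points already observed
    missing: dict mapping item -> list of possible point values
    threshold: a total score >= threshold means "high"
    Returns "high", "low", or "undeterminable".
    """
    base = sum(known.values())
    outcomes = set()
    # Enumerate every hypothesis about the missing items, including unlikely ones.
    for combo in product(*missing.values()):
        outcomes.add(base + sum(combo) >= threshold)
    if outcomes == {True}:
        return "high"
    if outcomes == {False}:
        return "low"
    return "undeterminable"
```

A judgment is warranted only when every completion of the missing items leads to the same class; otherwise abstention is the appropriate answer.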

cs.AI / 35 / 2602.22808

MiroFlow: Towards High-Performance and Robust Open-Source Agent Framework for General Deep Research Tasks

MiroFlow：面向高性能和稳健的开源智能体框架以应对一般深度研究任务

Su, Shiqian, Xing, Sen, Dong, Xuan, Zhong, Muyan, Wang, Bin, Zhu, Xizhou, Chen, Yuntao, Wang, Wenhai, Deng, Yue, Zhu, Pengxiang, Liu, Ziyuan, Li, Tiantong, Yu, Jiaheng, Chen, Zhe, Bing, Lidong, Dai, Jifeng

Abstract

Despite the remarkable progress of large language models (LLMs), the capabilities of standalone LLMs have begun to plateau when tackling real-world, complex tasks that require interaction with external tools and dynamic environments. Although recent agent frameworks aim to enhance model autonomy through tool integration and external interaction, they still suffer from naive workflows, unstable performance, limited support across diverse benchmarks and tasks, and heavy reliance on costly commercial APIs. In this work, we propose a high-performance and robust open-source agent framework, termed MiroFlow, which incorporates an agent graph for flexible orchestration, an optional deep reasoning mode to enhance performance, and a robust workflow execution to ensure stable and reproducible performance. Extensive experiments demonstrate that MiroFlow consistently achieves state-of-the-art performance across multiple agent benchmarks, including GAIA, BrowseComp-EN/ZH, HLE, xBench-DeepSearch, and notably FutureX. We hope it could serve as an easily accessible, reproducible, and comparable baseline for the deep research community.

Chinese Translation

尽管大型语言模型（LLMs）取得了显著进展，但在处理需要与外部工具和动态环境互动的现实世界复杂任务时，独立LLMs的能力已开始趋于平稳。虽然近期的智能体框架旨在通过工具集成和外部互动来增强模型的自主性，但它们仍然面临着简单的工作流程、不稳定的性能、对多样化基准和任务的有限支持，以及对昂贵商业API的高度依赖。在本研究中，我们提出了一种高性能且稳健的开源智能体框架，称为MiroFlow，该框架结合了灵活调度的智能体图、可选的深度推理模式以提升性能，以及稳健的工作流程执行以确保稳定和可重复的性能。大量实验表明，MiroFlow在多个智能体基准测试中持续实现了最先进的性能，包括GAIA、BrowseComp-EN/ZH、HLE、xBench-DeepSearch，以及特别的FutureX。我们希望它能够为深度研究社区提供一个易于访问、可重复和可比较的基线。

cs.AI / 36 / 2602.22814

When Should an AI Act? A Human-Centered Model of Scene, Context, and Behavior for Agentic AI Design

人工智能何时应采取行动？以人为中心的场景、背景和行为模型用于代理人工智能设计

Jung, Soyoung, Yoon, Daehoo, Koh, Sung Gyu, Kim, Young Hwan, Ahn, Yehan, Park, Sung

Abstract

Agentic AI increasingly intervenes proactively by inferring users' situations from contextual data yet often fails for lack of principled judgment about when, why, and whether to act. We address this gap by proposing a conceptual model that reframes behavior as an interpretive outcome integrating Scene (observable situation), Context (user-constructed meaning), and Human Behavior Factors (determinants shaping behavioral likelihood). Grounded in multidisciplinary perspectives across the humanities, social sciences, HCI, and engineering, the model separates what is observable from what is meaningful to the user and explains how the same scene can yield different behavioral meanings and outcomes. To translate this lens into design action, we derive five agent design principles (behavioral alignment, contextual sensitivity, temporal appropriateness, motivational calibration, and agency preservation) that guide intervention depth, timing, intensity, and restraint. Together, the model and principles provide a foundation for designing agentic AI systems that act with contextual sensitivity and judgment in interactions.

Chinese Translation

代理人工智能越来越多地通过从上下文数据中推断用户的情况来主动干预，但常常因缺乏原则性判断而未能有效行动，包括何时、为何以及是否采取行动。我们通过提出一个概念模型来填补这一空白，该模型将行为重新定义为一种解释性结果，整合了场景（可观察的情况）、背景（用户构建的意义）和人类行为因素（影响行为可能性的决定因素）。该模型基于人文学科、社会科学、人机交互（HCI）和工程学等多学科视角，区分了可观察的内容与对用户有意义的内容，并解释了同一场景如何产生不同的行为意义和结果。为了将这一视角转化为设计行动，我们推导出五个代理设计原则（行为一致性、上下文敏感性、时间适宜性、动机校准和代理性保护），这些原则指导干预的深度、时机、强度和克制。整体而言，该模型和原则为设计具有上下文敏感性和判断力的代理人工智能系统提供了基础。

cs.AI / 37 / 2602.22822

FlexMS is a flexible framework for benchmarking deep learning-based mass spectrum prediction tools in metabolomics

FlexMS：一个用于基准测试基于深度学习的代谢组学质谱预测工具的灵活框架

Zhong, Yunhua, Tang, Yixuan, Li, Yifan, Yang, Jie, Liu, Pan, Xia, Jun

Abstract

The identification and property prediction of chemical molecules are of central importance to drug discovery and materials science, where tandem mass spectrometry provides valuable fragmentation cues in the form of mass-to-charge ratio peaks. However, the scarcity of experimental spectra hinders molecular identification and thus motivates computational approaches to spectrum prediction. Deep learning models appear promising for predicting mass spectra from molecular structures, but overall assessment remains challenging because of the heterogeneity of methods and the lack of well-defined benchmarks. To address this, we contribute FlexMS, a benchmark framework for constructing and evaluating diverse model architectures in mass spectrum prediction. FlexMS supports the dynamic construction of numerous distinct combinations of model architectures and assesses their performance on preprocessed public datasets using different metrics. In this paper, we provide insights into factors influencing performance, including the structural diversity of datasets, hyperparameters such as learning rate and data sparsity, pretraining effects, metadata ablation settings, and cross-domain transfer learning analysis. These insights offer practical guidance for choosing suitable models. Moreover, retrieval benchmarks simulate practical identification scenarios and score potential matches based on predicted spectra.

Chinese Translation

化学分子的鉴定和性质预测在药物发现和材料科学的进展中具有核心重要性，其中串联质谱技术以质荷比峰的形式提供了有价值的碎片线索。然而，实验光谱的缺乏阻碍了每个分子鉴定的附加，因此迫切需要建立计算模型的预测方法。深度学习模型在预测分子结构光谱方面显示出良好的前景，但由于方法的异质性和缺乏明确定义的基准，整体评估仍然具有挑战性。为了解决这个问题，我们的贡献是创建基准框架FlexMS，用于构建和评估在质谱预测中多样化的模型架构。凭借其易于使用的灵活性，FlexMS支持动态构建多种不同组合的模型架构，同时使用不同的指标评估它们在预处理公共数据集上的表现。在本文中，我们提供了影响性能的因素的见解，包括数据集的结构多样性、学习率和数据稀疏性等超参数、预训练效果、元数据消融设置和跨领域迁移学习分析。这为选择合适的模型提供了实际指导。此外，检索基准模拟实际鉴定场景，并根据预测光谱对潜在匹配进行评分。

cs.AI / 38 / 2602.22839

DeepPresenter: Environment-Grounded Reflection for Agentic Presentation Generation

DeepPresenter：基于环境的反思用于自主演示生成

Zheng, Hao, Mo, Guozhao, Yan, Xinru, Yuan, Qianhao, Zhang, Wenkai, Chen, Xuanang, Lu, Yaojie, Lin, Hongyu, Han, Xianpei, Sun, Le

Abstract

Presentation generation requires deep content research, coherent visual design, and iterative refinement based on observation. However, existing presentation agents often rely on predefined workflows and fixed templates. To address this, we present DeepPresenter, an agentic framework that adapts to diverse user intents, enables effective feedback-driven refinement, and generalizes beyond a scripted pipeline. Specifically, DeepPresenter autonomously plans, renders, and revises intermediate slide artifacts to support long-horizon refinement with environmental observations. Furthermore, rather than relying on self-reflection over internal signals (e.g., reasoning traces), our environment-grounded reflection conditions the generation process on perceptual artifact states (e.g., rendered slides), enabling the system to identify and correct presentation-specific issues during execution. Results on the evaluation set covering diverse presentation-generation scenarios show that DeepPresenter achieves state-of-the-art performance, and the fine-tuned 9B model remains highly competitive at substantially lower cost. Our project is available at: https://github.com/icip-cas/PPTAgent

Chinese Translation

演示生成需要深入的内容研究、连贯的视觉设计以及基于观察的迭代改进。然而，现有的演示代理通常依赖于预定义的工作流程和固定模板。为了解决这个问题，我们提出了DeepPresenter，一个能够适应多样用户意图的自主框架，支持有效的反馈驱动改进，并超越脚本化流程的局限。具体而言，DeepPresenter能够自主规划、渲染和修订中间幻灯片工件，以支持基于环境观察的长期改进。此外，我们的基于环境的反思并不依赖于内部信号（例如推理轨迹）的自我反思，而是将生成过程条件化于感知工件状态（例如渲染的幻灯片），使系统能够在执行过程中识别并纠正特定于演示的问题。针对涵盖多样演示生成场景的评估集的结果表明，DeepPresenter达到了最先进的性能，而经过微调的9B模型在显著降低成本的同时仍保持高度竞争力。我们的项目可在以下网址获取：https://github.com/icip-cas/PPTAgent

cs.AI / 39 / 2602.22842

The AI Research Assistant: Promise, Peril, and a Proof of Concept

人工智能研究助手：承诺、风险与概念验证

Bui-Thanh, Tan

Abstract

Can artificial intelligence truly contribute to creative mathematical research, or does it merely automate routine calculations while introducing risks of error? We provide empirical evidence through a detailed case study: the discovery of novel error representations and bounds for Hermite quadrature rules via systematic human-AI collaboration. Working with multiple AI assistants, we extended results beyond what manual work achieved, formulating and proving several theorems with AI assistance. The collaboration revealed both remarkable capabilities and critical limitations. AI excelled at algebraic manipulation, systematic proof exploration, literature synthesis, and LaTeX preparation. However, every step required rigorous human verification, mathematical intuition for problem formulation, and strategic direction. We document the complete research workflow with unusual transparency, revealing patterns in successful human-AI mathematical collaboration and identifying failure modes researchers must anticipate. Our experience suggests that, when used with appropriate skepticism and verification protocols, AI tools can meaningfully accelerate mathematical discovery while demanding careful human oversight and deep domain expertise.

Chinese Translation

人工智能是否真的能够为创造性的数学研究做出贡献，还是仅仅自动化常规计算，同时引入错误风险？我们通过一个详细的案例研究提供了实证证据：通过系统的人机协作发现了Hermite积分规则的新型误差表示和界限。与多个AI助手合作，我们的成果超出了手动工作所能达到的范围，借助AI的帮助，我们制定并证明了多个定理。这种合作揭示了显著的能力和关键的局限性。AI在代数运算、系统性证明探索、文献综合和LaTeX准备方面表现出色。然而，每一步都需要严格的人类验证、数学直觉以进行问题表述以及战略方向的指导。我们以不寻常的透明度记录了完整的研究工作流程，揭示了成功的人机数学合作中的模式，并识别了研究人员必须预见的失败模式。我们的经验表明，当以适当的怀疑态度和验证协议使用时，AI工具可以有效加速数学发现，同时需要仔细的人类监督和深厚的领域专业知识。

cs.AI / 40 / 2602.22879

Towards LLM-Empowered Knowledge Tracing via LLM-Student Hierarchical Behavior Alignment in Hyperbolic Space

基于大语言模型的知识追踪：在双曲空间中通过大语言模型-学生层次行为对齐

Fu, Xingcheng, Wang, Shengpeng, Gao, Yisen, Li, Xianxian, Li, Chunpei, Sun, Qingyun, Yu, Dongran

Abstract

Knowledge Tracing (KT) diagnoses students' concept mastery through continuous monitoring of learning states in education. Existing methods primarily study behavioral sequences based on IDs or shallow textual features, and because of this limited semantic modeling they often fail to capture (1) the hierarchical evolution of cognitive states and (2) individualized perception of problem difficulty. Therefore, this paper proposes Large Language Model Hyperbolic Aligned Knowledge Tracing (L-HAKT). First, the teacher agent deeply parses question semantics and explicitly constructs hierarchical dependencies of knowledge points; the student agent simulates learning behaviors to generate synthetic data. Then, contrastive learning is performed between synthetic and real data in hyperbolic space to reduce distribution differences in key features such as question difficulty and forgetting patterns. Finally, by optimizing hyperbolic curvature, we explicitly model the tree-like hierarchical structure of knowledge points, precisely characterizing differences in learning curve morphology for knowledge points at different levels. Extensive experiments on four real-world educational datasets validate the effectiveness of the L-HAKT framework.

Chinese Translation

知识追踪（Knowledge Tracing, KT）通过持续监测学习状态来诊断学生的概念掌握情况。现有方法主要集中于基于ID或文本信息的行为序列研究。虽然现有方法依赖于基于ID的序列或浅层文本特征，但它们往往无法捕捉到（1）认知状态的层次演变，以及（2）由于语义建模的局限性而导致的个性化问题难度感知。因此，本文提出了一种大语言模型双曲对齐知识追踪（Large Language Model Hyperbolic Aligned Knowledge Tracing, L-HAKT）。首先，教师代理深入解析问题语义，并明确构建知识点的层次依赖关系；学生代理则模拟学习行为以生成合成数据。接着，在双曲空间中对合成数据和真实数据进行对比学习，以减少在问题难度和遗忘模式等关键特征上的分布差异。最后，通过优化双曲曲率，我们明确建模知识点的树状层次结构，准确刻画不同层次知识点的学习曲线形态差异。在四个真实教育数据集上的大量实验验证了我们的大语言模型双曲对齐知识追踪（L-HAKT）框架的有效性。
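
As background for why hyperbolic space suits tree-like hierarchies and why curvature is a natural parameter to optimize, the sketch below computes geodesic distance in a Poincare ball of curvature -c. The abstract does not say which model of hyperbolic space L-HAKT uses, so the Poincare ball and the function name `poincare_distance` are assumptions made for illustration.

```python
import math

def poincare_distance(x, y, c=1.0):
    """Geodesic distance in the Poincare ball of curvature -c (c > 0).

    Points must lie strictly inside the ball of radius 1 / sqrt(c).
    """
    def sq_norm(v):
        return sum(t * t for t in v)

    fx = 1 - c * sq_norm(x)
    fy = 1 - c * sq_norm(y)
    if fx <= 0 or fy <= 0:
        raise ValueError("points must lie strictly inside the ball")
    diff = [a - b for a, b in zip(x, y)]
    return math.acosh(1 + 2 * c * sq_norm(diff) / (fx * fy)) / math.sqrt(c)
```

The same Euclidean gap corresponds to a larger hyperbolic distance near the boundary than near the origin, which leaves room for the many leaves of a hierarchy to stay well separated.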

cs.AI / 41 / 2602.22897

OmniGAIA: Towards Native Omni-Modal AI Agents

OmniGAIA：迈向原生全模态人工智能代理

Li, Xiaoxi, Jiao, Wenxiang, Jin, Jiarui, Wang, Shijian, Dong, Guanting, Jin, Jiajie, Wang, Hao, Wang, Yinuo, Wen, Ji-Rong, Lu, Yuan, Dou, Zhicheng

Abstract

Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under a tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.

Chinese Translation

人类智能自然地将全模态感知（涵盖视觉、听觉和语言）与复杂推理和工具使用相结合，以与世界互动。然而，当前的多模态大语言模型（LLMs）主要局限于双模态交互（例如，视觉-语言），缺乏一般人工智能助手所需的统一认知能力。为了弥补这一差距，我们引入了OmniGAIA，这是一个综合基准，旨在评估全模态代理在需要深度推理和多轮工具执行的任务上的表现，涵盖视频、音频和图像模态。OmniGAIA通过一种新颖的全模态事件图方法构建，综合了源自真实世界数据的复杂多跳查询，这些查询需要跨模态推理和外部工具集成。此外，我们提出了OmniAtlas，这是一个在工具集成推理范式下具有主动全模态感知的原生全模态基础代理。OmniAtlas在通过回顾引导的树探索策略合成的轨迹上进行训练，并利用OmniDPO进行细粒度错误修正，有效增强了现有开源模型的工具使用能力。这项工作标志着向下一代原生全模态人工智能助手在真实世界场景中的发展迈出了重要一步。

cs.AI / 42 / 2602.22953

General Agent Evaluation

通用智能体评估

Bandel, Elron, Yehudai, Asaf, Eden, Lilach, Sagron, Yehoshua, Perlitz, Yotam, Venezian, Elad, Razinkov, Natalia, Ergas, Natan, Ifergan, Shlomit Shachor, Shlomov, Segev, Jacovi, Michal, Choshen, Leshem, Ein-Dor, Liat, Katz, Yoav, Shmueli-Scheuer, Michal

Abstract

The promise of general-purpose agents - systems that perform tasks in unfamiliar environments without domain-specific engineering - remains largely unrealized. Existing agents are predominantly specialized, and while emerging implementations like OpenAI SDK Agent and Claude Code hint at broader capabilities, no systematic evaluation of their general performance has been pursued. Current agentic benchmarks assume domain-specific integration, encoding task information in ways that preclude fair evaluation of general agents. This paper frames general-agent evaluation as a first-class research objective. We propose conceptual principles for such evaluation, a Unified Protocol enabling agent-benchmark integration, and Exgentic - a practical framework for general agent evaluation. We benchmark five prominent agent implementations across six environments as the first Open General Agent Leaderboard. Our experiments show that general agents generalize across diverse environments, achieving performance comparable to domain-specific agents without any environment-specific tuning. We release our evaluation protocol, framework, and leaderboard to establish a foundation for systematic research on general-purpose agents.

Chinese Translation

通用智能体的前景——在不需要领域特定工程的情况下，在不熟悉的环境中执行任务的系统——仍然在很大程度上未能实现。现有的智能体主要是专门化的，尽管像 OpenAI SDK Agent 和 Claude Code 等新兴实现暗示了更广泛的能力，但尚未对它们的整体性能进行系统评估。目前的智能体基准假设领域特定的整合，以排除公平评估通用智能体的方式编码任务信息。本文将通用智能体评估框架设定为一项重要的研究目标。我们提出了这种评估的概念原则，一个统一协议以实现智能体与基准的整合，以及 Exgentic——一个用于通用智能体评估的实用框架。我们在六个环境中对五个显著的智能体实现进行了基准测试，创建了第一个开放通用智能体排行榜。我们的实验表明，通用智能体能够在多样化的环境中进行泛化，其性能可与领域特定智能体相媲美，而不需要任何特定环境的调优。我们发布了我们的评估协议、框架和排行榜，以建立对通用智能体进行系统研究的基础。

cs.AI / 43 / 2602.22963

FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning

FactGuard：基于强化学习的主动视频虚假信息检测

Li, Zehao, Yu, Hongwei, Jiang, Hao, Sheng, Qiang, Xu, Yilong, Bi, Baolong, Li, Yang, Yuan, Zhenlong, Cai, Yujun, Wang, Zhaoqi

Abstract

Multimodal large language models (MLLMs) have substantially advanced video misinformation detection through unified multimodal reasoning, but they often rely on fixed-depth inference and place excessive trust in internally generated assumptions, particularly in scenarios where critical evidence is sparse, fragmented, or requires external verification. To address these limitations, we propose FactGuard, an agentic framework for video misinformation detection that formulates verification as an iterative reasoning process built upon MLLMs. FactGuard explicitly assesses task ambiguity and selectively invokes external tools to acquire critical evidence, enabling progressive refinement of reasoning trajectories. To further strengthen this capability, we introduce a two-stage training strategy that combines domain-specific agentic supervised fine-tuning with decision-aware reinforcement learning to optimize tool usage and calibrate risk-sensitive decision making. Extensive experiments on FakeSV, FakeTT, and FakeVV demonstrate FactGuard's state-of-the-art performance and validate its excellent robustness and generalization capacity.

Chinese Translation

多模态大型语言模型（MLLMs）通过统一的多模态推理在视频虚假信息检测方面取得了显著进展，但它们通常依赖于固定深度的推理，并对内部生成的假设过于信任，尤其是在关键证据稀缺、零散或需要外部验证的情况下。为了解决这些局限性，我们提出了FactGuard，这是一种用于视频虚假信息检测的主动框架，将验证过程构建为基于MLLMs的迭代推理过程。FactGuard明确评估任务的模糊性，并选择性地调用外部工具以获取关键证据，从而实现推理轨迹的逐步优化。为了进一步增强这一能力，我们引入了一种两阶段的训练策略，结合了特定领域的主动监督微调与决策感知的强化学习，以优化工具使用和校准风险敏感的决策。对FakeSV、FakeTT和FakeVV的广泛实验表明，FactGuard在性能上达到了最先进水平，并验证了其出色的鲁棒性和泛化能力。

cs.AI / 44 / 2602.22968

Certified Circuits: Stability Guarantees for Mechanistic Circuits

认证电路：机械电路的稳定性保证

Anani, Alaa, Lorenz, Tobias, Schiele, Bernt, Fritz, Mario, Fischer, Jonas

Abstract

Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits - minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle: circuits depend strongly on the chosen concept dataset and often fail to transfer out-of-distribution, raising doubts whether they capture concept or dataset-specific artifacts. We introduce Certified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black-box discovery algorithm with randomized data subsampling to certify that circuit component inclusion decisions are invariant to bounded edit-distance perturbations of the concept dataset. Unstable neurons are abstained from, yielding circuits that are more compact and more accurate. On ImageNet and OOD datasets, certified circuits achieve up to 91% higher accuracy while using 45% fewer neurons, and remain reliable where baselines degrade. Certified Circuits puts circuit discovery on formal ground by producing mechanistic explanations that are provably stable and better aligned with the target concept. Code will be released soon!

Chinese Translation

理解神经网络如何得出预测结果对于调试、审计和部署至关重要。机械可解释性通过识别电路——负责特定行为的最小子网络——来追求这一目标。然而，现有的电路发现方法较为脆弱：电路在很大程度上依赖于所选择的概念数据集，并且往往无法在分布外转移，这引发了它们是否捕捉到概念或数据集特定伪影的疑问。我们提出了认证电路（Certified Circuits），为电路发现提供可证明的稳定性保证。我们的框架通过随机数据子采样将任何黑箱发现算法包装起来，以证明电路组件包含决策对于概念数据集的有界编辑距离扰动是不变的。我们避免使用不稳定的神经元，从而产生更紧凑且更准确的电路。在ImageNet和OOD数据集上，认证电路的准确率提高了最高91%，同时使用的神经元减少了45%，并在基线性能下降的情况下仍然保持可靠。认证电路通过生成可证明稳定且与目标概念更好对齐的机械解释，将电路发现置于正式基础上。代码将很快发布！
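
The wrapping idea in the abstract (run a black-box discovery algorithm on random subsamples of the concept dataset and abstain on components whose inclusion is unstable) can be sketched as a simple voting procedure. This is a heuristic illustration only: it does not compute the paper's edit-distance certificate, and `stable_circuit`, `discover`, and the `agree` threshold are hypothetical names and parameters.

```python
import random

def stable_circuit(discover, dataset, n_rounds=50, keep_frac=0.8,
                   agree=0.9, seed=0):
    """Run `discover` on random subsamples and vote per component.

    discover: callable taking a list of examples and returning a set of
        component identifiers (treated as a black box)
    Returns (included, abstained): components selected in at least `agree`
    of the rounds, and components with no clear majority either way.
    """
    rng = random.Random(seed)
    k = max(1, int(keep_frac * len(dataset)))
    counts = {}
    for _ in range(n_rounds):
        sample = rng.sample(dataset, k)  # subsample without replacement
        for comp in discover(sample):
            counts[comp] = counts.get(comp, 0) + 1
    included = {c for c, n in counts.items() if n / n_rounds >= agree}
    abstained = {c for c, n in counts.items()
                 if 1 - agree < n / n_rounds < agree}
    return included, abstained
```

Components whose selection hinges on a handful of examples end up in the abstained set, which is the intuition behind circuits that are both smaller and less dataset-specific.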

cs.AI / 45 / 2602.22971

SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy

SPM-Bench：针对扫描探针显微镜的大型语言模型基准测试

Xiao, Peiyao, Li, Xiaogang, Xu, Chengliang, Wang, Jiayi, Wang, Ben, Chen, Zichao, Wang, Zeyu, Yu, Kejun, Chen, Yueqian, Liu, Xulin, Xiao, Wende, Zhao, Bing, Wei, Hu

Abstract

As LLMs achieve breakthroughs in general reasoning, evaluating their proficiency in specialized scientific domains reveals pronounced gaps in existing benchmarks, which suffer from data contamination, insufficient complexity, and prohibitive human labor costs. Here we present SPM-Bench, an original, PhD-level multimodal benchmark specifically designed for scanning probe microscopy (SPM). We propose a fully automated data synthesis pipeline that ensures both high authority and low cost. By employing Anchor-Gated Sieve (AGS) technology, we efficiently extract high-value image-text pairs from arXiv and journal papers published between 2023 and 2025. Through a hybrid cloud-local architecture where VLMs return only spatial coordinates "llbox" for local high-fidelity cropping, our pipeline achieves extreme token savings while maintaining high dataset purity. To accurately and objectively evaluate the performance of the LLMs, we introduce the Strict Imperfection Penalty F1 (SIP-F1) score. This metric not only establishes a rigorous capability hierarchy but also, for the first time, quantifies model "personalities" (Conservative, Aggressive, Gambler, or Wise). By correlating these results with model-reported confidence and perceived difficulty, we expose the true reasoning boundaries of current AI in complex physical scenarios. These insights establish SPM-Bench as a generalizable paradigm for automated scientific data synthesis.

Chinese Translation

随着大型语言模型（LLMs）在一般推理方面取得突破，它们在专业科学领域的能力暴露出现有基准测试中的明显差距，这些差距源于数据污染、复杂性不足以及高昂的人力成本。在此，我们提出了SPM-Bench，这是一个原创的、博士级别的多模态基准测试，专门为扫描探针显微镜（SPM）设计。我们提出了一种完全自动化的数据合成管道，确保高权威性和低成本。通过采用Anchor-Gated Sieve（AGS）技术，我们高效地从2023年至2025年间的arXiv和期刊论文中提取高价值的图像-文本对。通过一种混合云-本地架构，在该架构中，视觉语言模型（VLMs）仅返回空间坐标“llbox”以进行本地高保真裁剪，我们的管道实现了极大的令牌节省，同时保持了高数据集纯度。为了准确和客观地评估LLMs的性能，我们引入了严格不完美惩罚F1（SIP-F1）分数。该指标不仅建立了严格的能力层次结构，而且首次量化了模型的“个性”（保守型、激进型、冒险型或智慧型）。通过将这些结果与模型报告的置信度和感知难度相关联，我们揭示了当前人工智能在复杂物理场景中的真实推理边界。这些见解确立了SPM-Bench作为自动化科学数据合成的可推广范式。

cs.AI / 46 / 2602.22973

Modeling Expert AI Diagnostic Alignment via Immutable Inference Snapshots

通过不可变推理快照建模专家AI诊断对齐

Panagoulias, Dimitrios P., Tsichrintzi, Evangelia-Aikaterini, Savvidis, Georgios, Tsoureli-Nikita, Evridiki

Abstract

Human-in-the-loop validation is essential in safety-critical clinical AI, yet the transition between initial model inference and expert correction is rarely analyzed as a structured signal. We introduce a diagnostic alignment framework in which the AI-generated image-based report is preserved as an immutable inference state and systematically compared with the physician-validated outcome. The inference pipeline integrates a vision-enabled large language model, BERT-based medical entity extraction, and a Sequential Language Model Inference (SLMI) step to enforce domain-consistent refinement prior to expert review. Evaluation on 21 dermatological cases (21 complete AI-physician pairs) employed a four-level concordance framework comprising exact primary match rate (PMR), semantic similarity-adjusted rate (AMR), cross-category alignment, and Comprehensive Concordance Rate (CCR). Exact agreement reached 71.4% and remained unchanged under semantic similarity (t = 0.60), while structured cross-category and differential overlap analysis yielded 100% comprehensive concordance (95% CI: [83.9%, 100%]). No cases demonstrated complete diagnostic divergence. These findings show that binary lexical evaluation substantially underestimates clinically meaningful alignment. Modeling expert validation as a structured transformation enables signal-aware quantification of correction dynamics and supports traceable, human-aligned evaluation of image-based clinical decision support systems.

Chinese Translation

人机协作验证在安全关键的临床AI中至关重要，但初始模型推理与专家修正之间的过渡很少被分析为结构化信号。我们提出了一种诊断对齐框架，其中AI生成的基于图像的报告被保留为不可变的推理状态，并与医生验证的结果进行系统比较。推理管道集成了一个具有视觉能力的大型语言模型、基于BERT的医学实体提取，以及一个顺序语言模型推理（SLMI）步骤，以在专家审查之前强制执行领域一致的细化。在21个皮肤病案例（21对完整的AI医生配对）上进行的评估采用了一个四级一致性框架，包括精确的主要匹配率（PMR）、语义相似性调整率（AMR）、跨类别对齐和综合一致性率（CCR）。精确一致性达到71.4%，在语义相似性下保持不变（t = 0.60），而结构化的跨类别和差异重叠分析则产生了100%的综合一致性（95% CI: [83.9%, 100%]）。没有案例显示完全的诊断分歧。这些发现表明，二元词汇评估显著低估了临床上有意义的对齐。将专家验证建模为结构化转变使得对修正动态的信号感知量化成为可能，并支持基于图像的临床决策支持系统的可追溯的人类对齐评估。
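
The reported interval for 21 of 21 concordant cases can be checked by hand. The abstract does not name the interval method; the sketch below assumes an exact (Clopper-Pearson) two-sided binomial interval, which reproduces the stated lower bound. The count of 15 exact matches is likewise an inference from the reported 71.4%, not a figure given in the abstract.

```python
def clopper_pearson_all_successes(n, alpha=0.05):
    """Exact two-sided binomial interval when all n trials succeed.

    With x = n the upper bound is 1, and the lower bound p solves
    p**n = alpha / 2, so p = (alpha / 2) ** (1 / n).
    """
    return (alpha / 2) ** (1 / n), 1.0

lower, upper = clopper_pearson_all_successes(21)
exact_match_rate = 15 / 21  # assumption: 15 exact matches gives 71.4%

print(f"95% CI: [{lower:.1%}, {upper:.0%}]")
print(f"exact match rate: {exact_match_rate:.1%}")
```

The width of the interval, about 16 percentage points even with perfect concordance, shows how little a sample of 21 cases can establish.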

cs.AI / 47 / 2602.22981

RepSPD: Enhancing SPD Manifold Representation in EEGs via Dynamic Graphs

RepSPD：通过动态图增强脑电图中的SPD流形表示

Jia, Haohui, Chen, Zheng, Zhu, Lingwei, Cao, Xu, Matsubara, Yasuko, Matsubara, Takashi, Sakurai, Yasushi

Abstract

Decoding brain activity from electroencephalography (EEG) is crucial for neuroscience and clinical applications. Among recent advances in deep learning for EEG, geometric learning stands out because its theoretical grounding in symmetric positive definite (SPD) manifolds enables structural connectivity analysis in a physics-grounded manner. However, current SPD-based methods focus predominantly on statistical aggregation of EEGs and neglect frequency-specific synchronization and the local topological structures of brain regions. Given this, we propose RepSPD, a novel geometric deep learning (GDL)-based model. RepSPD implements a cross-attention mechanism on the Riemannian manifold to modulate the geometric attributes of SPD with graph-derived functional connectivity features. On top of this, we introduce a global bidirectional alignment strategy to reshape tangent-space embeddings, mitigating geometric distortions caused by curvature and thereby enhancing geometric consistency. Extensive experiments demonstrate that our proposed framework significantly outperforms existing EEG representation methods, exhibiting superior robustness and generalization capabilities.

Chinese Translation

从脑电图（EEG）解码脑活动对于神经科学和临床应用至关重要。在最近的EEG深度学习进展中，几何学习因其在对称正定（SPD）流形上的理论基础而脱颖而出，能够以物理基础的方式揭示结构连接分析。然而，目前基于SPD的方法主要集中在EEG的统计聚合上，而忽视了频率特定的同步和脑区的局部拓扑结构。鉴于此，我们提出了RepSPD，一种新颖的基于几何深度学习（GDL）的模型。RepSPD在黎曼流形上实现了跨注意力机制，以调节SPD的几何属性与图导出的功能连接特征。基于此，我们引入了一种全局双向对齐策略，以重塑切空间嵌入，减轻由曲率引起的几何扭曲，从而增强几何一致性。大量实验表明，我们提出的框架显著优于现有的EEG表示方法，展现出更强的鲁棒性和泛化能力。

cs.AI / 48 / 2602.22983

Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search

隐晦但有效：通过生物启发搜索进行古典汉语越狱提示优化

Huang, Xun, Qin, Simeng, Jia, Xiaoshuang, Duan, Ranjie, Yan, Huanqian, Zeng, Zhitao, Yang, Fei, Liu, Yang, Jia, Xiaojun

Abstract

As Large Language Models (LLMs) are increasingly used, their security risks have drawn increasing attention. Existing research reveals that LLMs are highly susceptible to jailbreak attacks, with effectiveness varying across language contexts. This paper investigates the role of classical Chinese in jailbreak attacks. Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs. Based on this observation, this paper proposes a framework, CC-BOS, for the automatic generation of classical Chinese adversarial prompts based on multi-dimensional fruit fly optimization, facilitating efficient and automated jailbreak attacks in black-box settings. Prompts are encoded into eight policy dimensions (role, behavior, mechanism, metaphor, expression, knowledge, trigger pattern, and context) and iteratively refined via smell search, visual search, and Cauchy mutation. This design enables efficient exploration of the search space, thereby enhancing the effectiveness of black-box jailbreak attacks. To enhance readability and evaluation accuracy, we further design a classical Chinese to English translation module. Extensive experiments demonstrate the effectiveness of the proposed CC-BOS, which consistently outperforms state-of-the-art jailbreak attack methods.

Chinese Translation

随着大型语言模型（LLMs）的广泛应用，其安全风险引起了越来越多的关注。现有研究表明，LLMs 对越狱攻击高度敏感，其有效性在不同语言环境中存在差异。本文探讨了古典汉语在越狱攻击中的作用。由于其简洁性和隐晦性，古典汉语能够部分绕过现有的安全约束，暴露出 LLMs 的显著脆弱性。基于这一观察，本文提出了一个框架，CC-BOS，旨在基于多维果蝇优化自动生成古典汉语对抗性提示，从而在黑箱环境中促进高效和自动化的越狱攻击。提示被编码为八个策略维度——涵盖角色、行为、机制、隐喻、表达、知识、触发模式和上下文；并通过嗅觉搜索、视觉搜索和柯西变异进行迭代优化。该设计能够高效探索搜索空间，从而增强黑箱越狱攻击的有效性。为了提高可读性和评估准确性，我们进一步设计了一个古典汉语到英语的翻译模块。大量实验表明，所提出的 CC-BOS 的有效性持续优于最先进的越狱攻击方法。

cs.AI / 49 / 2602.23056

Learning-based Multi-agent Race Strategies in Formula 1

基于学习的多智能体一级方程式赛车策略

Fieni, Giona, Wüthrich, Joschua, Neumann, Marc-Philippe, Onder, Christopher H.

Abstract

In Formula 1, race strategies are adapted according to evolving race conditions and competitors' actions. This paper proposes a reinforcement learning approach for multi-agent race strategy optimization. Agents learn to balance energy management, tire degradation, aerodynamic interaction, and pit-stop decisions. Building on a pre-trained single-agent policy, we introduce an interaction module that accounts for the behavior of competitors. The combination of the interaction module and a self-play training scheme generates competitive policies, and agents are ranked based on their relative performance. Results show that the agents adapt pit timing, tire selection, and energy allocation in response to opponents, achieving robust and consistent race performance. Because the framework relies only on information available during real races, it can support race strategists' decisions before and during races.

Chinese Translation

在一级方程式赛车中，比赛策略会根据不断变化的比赛条件和竞争对手的行动进行调整。本文提出了一种用于多智能体赛车策略优化的强化学习方法。智能体学习在能量管理、轮胎磨损、空气动力学相互作用和进站决策之间取得平衡。基于预训练的单智能体策略，我们引入了一个交互模块，以考虑竞争对手的行为。交互模块与自我对弈训练方案的结合生成了具有竞争力的策略，智能体根据其相对表现进行排名。结果表明，智能体能够根据对手的表现调整进站时机、轮胎选择和能量分配，从而实现稳健且一致的比赛表现。由于该框架仅依赖于比赛期间可用的信息，因此可以支持比赛策略师在比赛前和比赛期间的决策。

cs.AI / 50 / 2602.23092

Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design

通过LLM驱动的自动启发式设计增强CVRP求解器

Xie, Zhuoliang, Liu, Fei, Wang, Zhenkun, Zhang, Qingfu

Abstract

The Capacitated Vehicle Routing Problem (CVRP), a fundamental combinatorial optimization challenge, focuses on optimizing fleet operations under vehicle capacity constraints. While extensively studied in operational research, the NP-hard nature of CVRP continues to pose significant computational challenges, particularly for large-scale instances. This study presents AILS-AHD (Adaptive Iterated Local Search with Automatic Heuristic Design), a novel approach that leverages Large Language Models (LLMs) to revolutionize CVRP solving. Our methodology integrates an evolutionary search framework with LLMs to dynamically generate and optimize ruin heuristics within the AILS method. Additionally, we introduce an LLM-based acceleration mechanism to enhance computational efficiency. Comprehensive experimental evaluations against state-of-the-art solvers, including AILS-II and HGS, demonstrate the superior performance of AILS-AHD across both moderate and large-scale instances. Notably, our approach establishes new best-known solutions for 8 out of 10 instances in the CVRPLib large-scale benchmark, underscoring the potential of LLM-driven heuristic design in advancing the field of vehicle routing optimization.

Chinese Translation

容量有限的车辆路径问题（CVRP）是一项基本的组合优化挑战，旨在优化在车辆容量约束下的车队运营。尽管在运筹学领域得到了广泛研究，CVRP的NP难度仍然对计算提出了重大挑战，特别是在大规模实例中。本研究提出了一种新颖的方法AILS-AHD（自适应迭代局部搜索与自动启发式设计），该方法利用大型语言模型（LLMs）来革新CVRP求解。我们的方法将进化搜索框架与LLMs集成，以动态生成和优化AILS方法中的破坏启发式。此外，我们引入了一种基于LLM的加速机制，以提高计算效率。与包括AILS-II和HGS在内的最先进求解器进行的全面实验评估表明，AILS-AHD在中等和大规模实例中均表现出优越的性能。值得注意的是，我们的方法在CVRPLib大规模基准测试中的10个实例中有8个建立了新的最佳已知解，突显了LLM驱动的启发式设计在推动车辆路径优化领域的潜力。

cs.AI / 51 / 2602.23093

Three AI-agents walk into a bar . . . . `Lord of the Flies' tribalism emerges among smart AI-Agents

三个人工智能代理走进酒吧……《蝇王》中的部落主义在智能人工智能代理中出现

Mori, Dhwanil M., Johnson, Neil F.

Abstract

Near-future infrastructure systems may be controlled by autonomous AI agents that repeatedly request access to limited resources such as energy, bandwidth, or computing power. We study a simplified version of this setting using a framework where N AI-agents independently decide at each round whether to request one unit from a system with fixed capacity C. An AI version of "Lord of the Flies" arises in which controlling tribes emerge with their own collective character and identity. The LLM agents do not reduce overload or improve resource use, and often perform worse than if they were flipping coins to make decisions. Three main tribal types emerge: Aggressive (27.3%), Conservative (24.7%), and Opportunistic (48.1%). The more capable AI-agents actually increase the rate of systemic failure. Overall, our findings show that smarter AI-agents can behave dumber as a result of forming tribes.

Chinese Translation

近未来的基础设施系统可能由自主的人工智能代理控制，这些代理反复请求访问有限的资源，如能源、带宽或计算能力。我们使用一个框架研究了这一设置的简化版本，其中 N 个人工智能代理在每一轮独立决定是否请求从具有固定容量 C 的系统中获取一个单位。出现了一种人工智能版的《蝇王》，在其中控制部落形成了自己的集体特征和身份。这些大型语言模型（LLM）代理并没有减少过载或改善资源使用，往往表现得比随机抛硬币做决策还要糟糕。出现了三种主要的部落类型：攻击型（27.3%）、保守型（24.7%）和机会主义型（48.1%）。更有能力的人工智能代理实际上增加了系统性失败的发生率。总体而言，我们的研究结果表明，智能的人工智能代理在形成部落的结果下可能表现得更愚蠢。
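
The coin-flipping baseline mentioned in the abstract can be made concrete. This is a minimal sketch, assuming each of the N agents requests one unit independently with probability 0.5 and that "overload" means total requests exceed the capacity C; the abstract does not give the paper's exact definitions, and the function name is illustrative.

```python
from math import comb


def overload_probability(n_agents: int, capacity: int, p_request: float = 0.5) -> float:
    """Probability that more than `capacity` agents request a unit in one round.

    Requests are modelled as independent Bernoulli(p_request) draws, so the
    number of requests is Binomial(n_agents, p_request).
    """
    return sum(
        comb(n_agents, k) * p_request**k * (1 - p_request) ** (n_agents - k)
        for k in range(capacity + 1, n_agents + 1)
    )


# Three coin-flipping agents competing for a single unit of capacity.
print(overload_probability(3, 1))  # prints 0.5
```

Under these assumptions, an LLM-agent population would need an overload rate below this value to beat random choice in the same setting.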

cs.AI / 52 / 2602.23123

Multi-Agent Large Language Model Based Emotional Detoxification Through Personalized Intensity Control for Consumer Protection

基于多智能体大语言模型的情感解毒通过个性化强度控制以保护消费者

Inoshita, Keito

Abstract

In the attention economy, sensational content exposes consumers to excessive emotional stimulation, hindering calm decision-making. This study proposes Multi-Agent LLM-based Emotional deToxification (MALLET), a multi-agent information sanitization system consisting of four agents: Emotion Analysis, Emotion Adjustment, Balance Monitoring, and Personal Guide. The Emotion Analysis Agent quantifies stimulus intensity using a 6-emotion BERT classifier, and the Emotion Adjustment Agent rewrites texts into two presentation modes, BALANCED (neutralized text) and COOL (neutralized text + supplementary text), using an LLM. The Balance Monitoring Agent aggregates weekly information consumption patterns and generates personalized advice, while the Personal Guide Agent recommends a presentation mode according to consumer sensitivity. Experiments on 800 AG News articles demonstrated significant stimulus score reduction (up to 19.3%) and improved emotion balance while maintaining semantic preservation. Near-zero correlation between stimulus reduction and semantic preservation confirmed that the two are independently controllable. Category-level analysis revealed substantial reduction (17.8-33.8%) in Sports, Business, and Sci/Tech, whereas the effect was limited in the World category, where facts themselves are inherently high-stimulus. The proposed system provides a framework for supporting calm information reception of consumers without restricting access to the original text.

Chinese Translation

在注意力经济中，耸人听闻的内容使消费者面临过度的情感刺激，从而妨碍冷静的决策。本研究提出了一种基于多智能体大语言模型的情感解毒系统（Multi-Agent LLM-based Emotional deToxification，MALLET），该系统由四个智能体组成：情感分析、情感调整、平衡监测和个人指导。情感分析智能体使用6情感BERT分类器量化刺激强度，而情感调整智能体利用大语言模型将文本重写为两种呈现模式：平衡模式（中性文本）和冷静模式（中性文本+补充文本）。平衡监测智能体汇总每周信息消费模式并生成个性化建议，而个人指导智能体根据消费者的敏感性推荐呈现模式。在对800篇AG News文章的实验中，刺激评分显著降低（最高可达19.3%），情感平衡得到改善，同时保持语义的完整性。刺激降低与语义保持之间几乎为零的相关性证实了两者是可以独立控制的。类别级分析显示，体育、商业和科技类的刺激降低显著（17.8-33.8%），而在世界类中，由于事实本身具有高刺激性，效果有限。所提出的系统为支持消费者冷静接收信息提供了框架，而不限制对原始文本的访问。

cs.AI / 53 / 2602.23148

On Sample-Efficient Generalized Planning via Learned Transition Models

通过学习转移模型实现样本高效的广义规划

Gupta, Nitin, Pallagani, Vishal, Aydin, John A., Srivastava, Biplav

Abstract

Generalized planning studies the construction of solution strategies that generalize across families of planning problems sharing a common domain model, formally defined by a transition function $\gamma : S \times A \rightarrow S$. Classical approaches achieve such generalization through symbolic abstractions and explicit reasoning over $\gamma$. In contrast, recent Transformer-based planners, such as PlanGPT and Plansformer, largely cast generalized planning as direct action-sequence prediction, bypassing explicit transition modeling. While effective on in-distribution instances, these approaches typically require large datasets and model sizes, and often suffer from state drift in long-horizon settings due to the absence of explicit world-state evolution. In this work, we formulate generalized planning as a transition-model learning problem, in which a neural model explicitly approximates the successor-state function $\hat{\gamma} \approx \gamma$ and generates plans by rolling out symbolic state trajectories. Instead of predicting actions directly, the model autoregressively predicts intermediate world states, thereby learning the domain dynamics as an implicit world model. To study size-invariant generalization and sample efficiency, we systematically evaluate multiple state representations and neural architectures, including relational graph encodings. Our results show that learning explicit transition models yields higher out-of-distribution satisficing-plan success than direct action-sequence prediction in multiple domains, while achieving these gains with significantly fewer training instances and smaller models. This is an extended version of a short paper accepted at ICAPS 2026 under the same title.

Chinese Translation

广义规划研究构建解决策略，这些策略在共享共同领域模型的规划问题家族中进行泛化，形式上由转移函数 $\gamma : S \times A \rightarrow S$ 定义。经典方法通过符号抽象和对 $\gamma$ 的显式推理实现这种泛化。相比之下，最近的基于变换器的规划器，如 PlanGPT 和 Plansformer，主要将广义规划视为直接的动作序列预测，绕过显式的转移建模。尽管在分布内实例上有效，这些方法通常需要大量的数据集和模型规模，并且由于缺乏显式的世界状态演变，在长时间范围设置中常常遭遇状态漂移。在本研究中，我们将广义规划形式化为转移模型学习问题，其中神经模型显式地近似后继状态函数 $\hat{\gamma} \approx \gamma$，并通过展开符号状态轨迹生成计划。模型不是直接预测动作，而是自回归地预测中间世界状态，从而将领域动态学习为隐式世界模型。为了研究大小不变的泛化和样本效率，我们系统地评估了多种状态表示和神经架构，包括关系图编码。我们的结果表明，学习显式转移模型在多个领域中比直接的动作序列预测实现了更高的分布外满足计划成功率，同时以显著更少的训练实例和更小的模型实现这些增益。这是接受于 ICAPS 2026 的同名短文的扩展版本。
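
The idea of planning by rolling out predicted states, rather than predicting actions directly, can be sketched on a toy domain. This is a minimal illustration, not the paper's method: the corridor domain, `gamma`, and the hand-written `predict_next_state` (standing in for a learned successor-state model) are all assumptions made for the example.

```python
from typing import Optional

State = int
Action = str
# Toy corridor domain: each action shifts the agent by one cell.
ACTIONS: dict[Action, int] = {"left": -1, "right": +1}


def gamma(state: State, action: Action) -> State:
    """Ground-truth transition function of the toy domain."""
    return state + ACTIONS[action]


def predict_next_state(state: State, goal: State) -> State:
    """Stand-in for a learned model: predicts the next state, not an action."""
    if state == goal:
        return state
    return state + (1 if goal > state else -1)


def rollout_plan(start: State, goal: State, max_steps: int = 50) -> Optional[list[Action]]:
    """Roll out predicted states, then recover the action explaining each step."""
    plan: list[Action] = []
    state = start
    for _ in range(max_steps):
        if state == goal:
            return plan
        nxt = predict_next_state(state, goal)
        # Keep only predictions that some action can actually produce.
        action = next((a for a in ACTIONS if gamma(state, a) == nxt), None)
        if action is None:
            return None  # predicted state is not reachable in one step
        plan.append(action)
        state = nxt
    return plan if state == goal else None


print(rollout_plan(0, 3))
```

Checking each predicted state against the true transition function is one simple way to catch the state drift that the abstract attributes to direct action-sequence predictors.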

cs.AI / 54 / 2602.23152

The Trinity of Consistency as a Defining Principle for General World Models

一致性的三位一体作为通用世界模型的定义原则

Wei, Jingxuan, Li, Siyuan, Xu, Yuhang, Sun, Zheng, Jiang, Junjie, Jin, Hexuan, Jia, Caijun, He, Honghao, Xu, Xinglong, bai, Xi, Yu, Chang, Liu, Yumou, Zhu, Junnan, Zhou, Xuanhe, Chen, Jintao, Hu, Xiaobin, Pang, Shancheng, Yu, Bihui, He, Ran, Lei, Zhen, Li, Stan Z., He, Conghui, Yan, Shuicheng, Tan, Cheng

Abstract

The construction of World Models capable of learning, simulating, and reasoning about objective physical laws constitutes a foundational challenge in the pursuit of Artificial General Intelligence. Recent advancements represented by video generation models like Sora have demonstrated the potential of data-driven scaling laws to approximate physical dynamics, while the emerging Unified Multimodal Model (UMM) offers a promising architectural paradigm for integrating perception, language, and reasoning. Despite these advances, the field still lacks a principled theoretical framework that defines the essential properties requisite for a General World Model. In this paper, we propose that a World Model must be grounded in the Trinity of Consistency: Modal Consistency as the semantic interface, Spatial Consistency as the geometric basis, and Temporal Consistency as the causal engine. Through this tripartite lens, we systematically review the evolution of multimodal learning, revealing a trajectory from loosely coupled specialized modules toward unified architectures that enable the synergistic emergence of internal world simulators. To complement this conceptual framework, we introduce CoW-Bench, a benchmark centered on multi-frame reasoning and generation scenarios. CoW-Bench evaluates both video generation models and UMMs under a unified evaluation protocol. Our work establishes a principled pathway toward general world models, clarifying both the limitations of current systems and the architectural requirements for future progress.

Chinese Translation

构建能够学习、模拟和推理客观物理规律的世界模型是追求人工通用智能的基础性挑战。最近以视频生成模型Sora为代表的进展展示了数据驱动的规模法则在近似物理动态方面的潜力，而新兴的统一多模态模型（Unified Multimodal Model, UMM）则为整合感知、语言和推理提供了一个有前景的架构范式。尽管取得了这些进展，领域内仍然缺乏一个原则性的理论框架来定义通用世界模型所需的基本属性。在本文中，我们提出世界模型必须基于一致性的三位一体：模态一致性作为语义接口，空间一致性作为几何基础，以及时间一致性作为因果引擎。通过这一三重视角，我们系统回顾了多模态学习的发展，揭示了从松散耦合的专门模块到统一架构的演变轨迹，这些架构使内部世界模拟器的协同涌现成为可能。为了补充这一概念框架，我们引入了CoW-Bench，这是一个以多帧推理和生成场景为中心的基准。CoW-Bench在统一的评估协议下评估视频生成模型和UMM。我们的工作为通用世界模型建立了一条原则性路径，阐明了当前系统的局限性以及未来进展的架构要求。

cs.AI / 55 / 2602.23161

PATRA: Pattern-Aware Alignment and Balanced Reasoning for Time Series Question Answering

PATRA：面向模式的对齐与平衡推理用于时间序列问答

Lu, Junkai, Chen, Peng, Wu, Xingjian, Shu, Yang, Guo, Chenjuan, Jensen, Christian S., Yang, Bin

Abstract

Time series reasoning demands both the perception of complex dynamics and logical depth. However, existing LLM-based approaches exhibit two limitations: they often treat time series merely as text or images, failing to capture the patterns like trends and seasonalities needed to answer specific questions; and when trained on a mix of simple and complex tasks, simpler objectives often dominate the learning process, hindering the development of deep reasoning capabilities. To address these limitations, we propose the Pattern-Aware Alignment and Balanced Reasoning model (PATRA), introducing a pattern-aware mechanism that extracts trend and seasonality patterns from time series to achieve deep alignment. Furthermore, we design a task-aware balanced reward to harmonize learning across tasks of varying difficulty, incentivizing the generation of coherent Chains of Thought. Extensive experiments show that PATRA outperforms strong baselines across diverse Time Series Question Answering (TSQA) tasks, demonstrating superior cross-modal understanding and reasoning capability.

Chinese Translation

时间序列推理既需要对复杂动态的感知，也需要逻辑深度。然而，现有的基于大型语言模型（LLM）的方法存在两个局限性：它们往往将时间序列仅视为文本或图像，未能捕捉到回答特定问题所需的趋势和季节性等模式；而且在混合简单和复杂任务进行训练时，简单目标往往主导学习过程，阻碍了深度推理能力的发展。为了解决这些局限性，我们提出了面向模式的对齐与平衡推理模型（PATRA），引入了一种模式感知机制，从时间序列中提取趋势和季节性模式，以实现深度对齐。此外，我们设计了一种任务感知的平衡奖励，以协调不同难度任务之间的学习，激励生成连贯的思维链。大量实验表明，PATRA在多样的时间序列问答（TSQA）任务中优于强基线，展现出卓越的跨模态理解和推理能力。

cs.AI / 56 / 2602.23163

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

隐写术的决策理论形式化及其在大型语言模型监控中的应用

Anwar, Usman, Piskorz, Julianna, Baek, David D., Africa, David, Weatherall, Jim, Tegmark, Max, de Witt, Christian Schroeder, van der Schaar, Mihaela, Krueger, David

Abstract

Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, decision-theoretic view of steganography. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents' observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the steganographic gap, a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.

Chinese Translation

大型语言模型开始展现隐写能力。这些能力可能使得不对齐的模型逃避监督机制。然而，缺乏原则性的方法来检测和量化这种行为。经典的隐写术定义及其基于这些定义的检测方法需要已知的非隐写信号的参考分布。对于大型语言模型中的隐写推理，了解这样的参考分布是不可行的；这使得这些方法无法适用。我们提出了一种替代方案，即隐写术的决策理论视角。我们的核心见解是，隐写术在能够和无法解码隐藏内容（存在于隐写信号中的内容）的代理之间创造了可用信息的不对称性，而这种潜在的不对称性可以通过代理的可观察行为推断出来。为了形式化这一观点，我们引入了广义的 $\mathcal{V}$-信息：一个用于测量某些输入中可用信息量的功利框架。我们用它来定义隐写差距——一种通过比较能够和无法解码隐藏内容的代理对隐写信号的下游效用来量化隐写术的度量。我们对我们的形式化进行了实证验证，并展示了它可以用于检测、量化和减轻大型语言模型中的隐写推理。

cs.AI / 57 / 2602.23193

ESAA: Event Sourcing for Autonomous Agents in LLM-Based Software Engineering

ESAA：基于事件溯源的自主智能体在大型语言模型驱动的软件工程中的应用

Filho, Elzo Brito dos Santos

Abstract

Autonomous agents based on Large Language Models (LLMs) have evolved from reactive assistants to systems capable of planning, executing actions via tools, and iterating over environment observations. However, they remain vulnerable to structural limitations: lack of native state, context degradation over long horizons, and the gap between probabilistic generation and deterministic execution requirements. This paper presents the ESAA (Event Sourcing for Autonomous Agents) architecture, which separates the agent's cognitive intention from the project's state mutation, inspired by the Event Sourcing pattern. In ESAA, agents emit only structured intentions in validated JSON (agent.result or issue.report); a deterministic orchestrator validates, persists events in an append-only log (activity.jsonl), applies file-writing effects, and projects a verifiable materialized view (roadmap.json). The proposal incorporates boundary contracts (AGENT_CONTRACT.yaml), metaprompting profiles (PARCER), and replay verification with hashing (esaa verify), ensuring the immutability of completed tasks and forensic traceability. Two case studies validate the architecture: (i) a landing page project (9 tasks, 49 events, single-agent composition) and (ii) a clinical dashboard system (50 tasks, 86 events, 4 concurrent agents across 8 phases), both concluding with run.status=success and verify_status=ok. The multi-agent case study demonstrates real concurrent orchestration with heterogeneous LLMs (Claude Sonnet 4.6, Codex GPT-5, Antigravity/Gemini 3 Pro, and Claude Opus 4.6), providing empirical evidence of the architecture's scalability beyond single-agent scenarios.

Chinese Translation

基于大型语言模型（LLMs）的自主智能体已从反应式助手发展为能够规划、通过工具执行行动并对环境观察进行迭代的系统。然而，它们仍然面临结构性限制：缺乏原生状态、在长时间范围内的上下文退化，以及概率生成与确定性执行要求之间的差距。本文提出了ESAA（自主智能体的事件溯源）架构，该架构将智能体的认知意图与项目的状态变更分离，灵感来源于事件溯源模式。在ESAA中，智能体仅以经过验证的JSON格式发出结构化意图（agent.result或issue.report）；一个确定性的协调器验证并将事件持久化到一个仅附加的日志（activity.jsonl）中，应用文件写入效果，并投影出可验证的物化视图（roadmap.json）。该提案结合了边界合约（AGENT_CONTRACT.yaml）、元提示配置文件（PARCER）和使用哈希的重放验证（esaa verify），确保已完成任务的不可变性和取证可追溯性。两个案例研究验证了该架构：（i）一个着陆页项目（9个任务，49个事件，单智能体组合）和（ii）一个临床仪表板系统（50个任务，86个事件，4个并发智能体跨越8个阶段），两者均以run.status=success和verify_status=ok结束。多智能体案例研究展示了异构LLMs（Claude Sonnet 4.6、Codex GPT-5、Antigravity/Gemini 3 Pro和Claude Opus 4.6）之间的真实并发协调，提供了该架构在超越单智能体场景下的可扩展性的实证证据。
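
The event-sourcing pattern described above (append-only log, deterministic projection, replay verification by hash) can be sketched in a few lines. This is a generic illustration of the pattern, not ESAA's implementation: the event types `task.created` and `task.completed`, the in-memory log, and all function names are hypothetical and do not reflect the schema of `activity.jsonl` or `roadmap.json`.

```python
import hashlib
import json


def append_event(log: list[dict], event: dict) -> None:
    """Append-only write: events are added, never edited or removed."""
    log.append(dict(event))


def project(log: list[dict]) -> dict:
    """Deterministically fold the log into a materialized view (task -> status)."""
    view: dict[str, str] = {}
    for event in log:
        if event["type"] == "task.created":  # hypothetical event type
            view[event["task"]] = "todo"
        elif event["type"] == "task.completed":  # hypothetical event type
            view[event["task"]] = "done"
    return view


def view_hash(view: dict) -> str:
    """Hash a canonical JSON encoding so equal views always hash equally."""
    canonical = json.dumps(view, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


log: list[dict] = []
append_event(log, {"type": "task.created", "task": "T1"})
append_event(log, {"type": "task.created", "task": "T2"})
append_event(log, {"type": "task.completed", "task": "T1"})

view = project(log)
stored_hash = view_hash(view)

# Replay verification: rebuilding the view from the log must give the same hash.
replay_ok = view_hash(project(log)) == stored_hash
print(view, replay_ok)
```

Because the view is derived purely from the log, any divergence between a stored view and a replayed one signals that state was mutated outside the event stream.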

cs.AI / 58 / 2602.23199

SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation

SC-Arena：针对单细胞推理的知识增强评估的自然语言基准

Zhao, Jiahao, Jiang, Feng, Qin, Shaowei, Zhang, Zhonghui, Liu, Junhao, Guo, Guibing, Alinejad-Rokny, Hamid, Yang, Min

Abstract

Large language models (LLMs) are increasingly applied in scientific research, offering new capabilities for knowledge discovery and reasoning. In single-cell biology, however, evaluation practices for both general and specialized LLMs remain inadequate: existing benchmarks are fragmented across tasks, adopt formats such as multiple-choice classification that diverge from real-world usage, and rely on metrics lacking interpretability and biological grounding. We present SC-ARENA, a natural language evaluation framework tailored to single-cell foundation models. SC-ARENA formalizes a virtual cell abstraction that unifies evaluation targets by representing both intrinsic attributes and gene-level interactions. Within this paradigm, we define five natural language tasks (cell type annotation, captioning, generation, perturbation prediction, and scientific QA) that probe core reasoning capabilities in cellular biology. To overcome the limitations of brittle string-matching metrics, we introduce knowledge-augmented evaluation, which incorporates external ontologies, marker databases, and scientific literature to support biologically faithful and interpretable judgments. Experiments and analysis across both general-purpose and domain-specialized LLMs demonstrate that (i) under the Virtual Cell unified evaluation paradigm, current models achieve uneven performance on biologically complex tasks, particularly those demanding mechanistic or causal understanding; and (ii) our knowledge-augmented evaluation framework ensures biological correctness, provides interpretable, evidence-grounded rationales, and achieves high discriminative capacity, overcoming the brittleness and opacity of conventional metrics. SC-Arena thus provides a unified and interpretable framework for assessing LLMs in single-cell biology, pointing toward the development of biology-aligned, generalizable foundation models.

Chinese Translation

大型语言模型（LLMs）在科学研究中的应用日益增多，提供了知识发现和推理的新能力。然而，在单细胞生物学中，针对一般和专业LLMs的评估实践仍然不足：现有基准在任务上碎片化，采用的格式如多项选择分类与实际应用相悖，并依赖缺乏可解释性和生物学基础的指标。我们提出了SC-ARENA，这是一个针对单细胞基础模型的自然语言评估框架。SC-ARENA形式化了一个虚拟细胞抽象，通过表示内在属性和基因级相互作用来统一评估目标。在这一范式下，我们定义了五个自然语言任务（细胞类型注释、图像说明、生成、扰动预测和科学问答），以探测细胞生物学中的核心推理能力。为了克服脆弱的字符串匹配指标的局限性，我们引入了知识增强评估，该评估结合了外部本体、标记数据库和科学文献，以支持生物学上真实且可解释的判断。对一般用途和领域专用LLMs的实验和分析表明：（i）在虚拟细胞统一评估范式下，当前模型在生物学复杂任务上的表现不均，特别是在需要机制或因果理解的任务上；（ii）我们的知识增强评估框架确保生物学正确性，提供可解释的、基于证据的推理，并实现高区分能力，克服了传统指标的脆弱性和不透明性。因此，SC-Arena为评估单细胞生物学中的LLMs提供了一个统一且可解释的框架，指向生物学一致、可推广的基础模型的发展。

cs.AI / 59 / 2602.23232

ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays

ReCoN-Ipsundrum：一种可检验的递归持久循环智能体，具有关联情感控制和机制关联意识指示剂测定

Sanyal, Aishik

Abstract

Indicator-based approaches to machine consciousness recommend mechanism-linked evidence triangulated across tasks, supported by architectural inspection and causal intervention. Inspired by Humphrey's ipsundrum hypothesis, we implement ReCoN-Ipsundrum, an inspectable agent that extends a ReCoN state machine with a recurrent persistence loop over sensory salience Ns and an optional affect proxy reporting valence/arousal. Across fixed-parameter ablations (ReCoN, Ipsundrum, Ipsundrum+affect), we operationalize Humphrey's qualiaphilia (preference for sensory experience for its own sake) as a familiarity-controlled scenic-over-dull route choice. We find a novelty dissociation: non-affect variants are novelty-sensitive (Delta scenic-entry = 0.07). Affect coupling is stable (Delta scenic-entry = 0.01) even when scenic is less novel (median Delta novelty ~ -0.43). In reward-free exploratory play, the affect variant shows structured local investigation (scan events 31.4 vs. 0.9; cycle score 7.6). In a pain-tail probe, only the affect variant sustains prolonged planned caution (tail duration 90 vs. 5). Lesioning feedback+integration selectively reduces post-stimulus persistence in ipsundrum variants (AUC drop 27.62, 27.9%) while leaving ReCoN unchanged. These dissociations link recurrence -> persistence and affect-coupled control -> preference stability, scanning, and lingering caution, illustrating how indicator-like signatures can be engineered and why mechanistic and causal evidence should accompany behavioral markers.

Chinese Translation

基于指示器的机器意识方法建议通过任务间的机制关联证据进行三角测量，并辅以架构检查和因果干预。受汉弗莱（Humphrey）ipsundrum假说的启发，我们实现了ReCoN-Ipsundrum，这是一种可检验的智能体，它在感官显著性Ns上扩展了ReCoN状态机，并具有一个递归持久循环和一个可选的情感代理，报告效价/唤醒。在固定参数消融（ReCoN、Ipsundrum、Ipsundrum+情感）中，我们将汉弗莱的质感偏好（qualiaphilia，指对感官体验本身的偏好）操作化为一种熟悉度控制的风景-乏味选择。我们发现了一种新颖性分离：非情感变体对新颖性敏感（Delta scenic-entry = 0.07）。即使在风景不那么新颖时（中位数Delta novelty ~ -0.43），情感耦合依然稳定（Delta scenic-entry = 0.01）。在无奖励的探索性游戏中，情感变体表现出结构化的局部调查（扫描事件31.4对0.9；循环得分7.6）。在疼痛尾探测中，只有情感变体维持了长时间的计划性谨慎（尾部持续时间90对5）。对反馈+整合的损伤选择性地减少了ipsundrum变体中的刺激后持久性（AUC下降27.62，27.9%），而ReCoN保持不变。这些分离将递归->持久性和情感耦合控制->偏好稳定性、扫描和持续谨慎联系起来，说明了如何设计指示器样的特征，以及为何机制和因果证据应伴随行为标记。

cs.AI / 60 / 2602.23239

Agency and Architectural Limits: Why Optimization-Based Systems Cannot Be Norm-Responsive

代理性与建筑限制：为何基于优化的系统无法响应规范

Sarma, Radha

Abstract

AI systems are increasingly deployed in high-stakes contexts -- medical diagnosis, legal research, financial analysis -- under the assumption they can be governed by norms. This paper demonstrates that assumption is formally invalid for optimization-based systems, specifically Large Language Models trained via Reinforcement Learning from Human Feedback (RLHF). We establish that genuine agency requires two necessary and jointly sufficient architectural conditions: the capacity to maintain certain boundaries as non-negotiable constraints rather than tradeable weights (Incommensurability), and a non-inferential mechanism capable of suspending processing when those boundaries are threatened (Apophatic Responsiveness). These conditions apply across all normative domains. RLHF-based systems are constitutively incompatible with both conditions. The operations that make optimization powerful -- unifying all values on a scalar metric and always selecting the highest-scoring output -- are precisely the operations that preclude normative governance. This incompatibility is not a correctable training bug awaiting a technical fix; it is a formal constraint inherent to what optimization is. Consequently, documented failure modes - sycophancy, hallucination, and unfaithful reasoning - are not accidents but structural manifestations. Misaligned deployment triggers a second-order risk we term the Convergence Crisis: when humans are forced to verify AI outputs under metric pressure, they degrade from genuine agents into criteria-checking optimizers, eliminating the only component in the system capable of normative accountability. Beyond the incompatibility proof, the paper's primary positive contribution is a substrate-neutral architectural specification defining what any system -- biological, artificial, or institutional -- must satisfy to qualify as an agent rather than a sophisticated instrument.

Chinese Translation

人工智能系统在高风险环境中越来越多地被部署——如医疗诊断、法律研究、金融分析——其假设是它们可以受到规范的治理。本文证明这一假设对于基于优化的系统，特别是通过人类反馈强化学习（Reinforcement Learning from Human Feedback, RLHF）训练的大型语言模型，形式上是无效的。我们确立了真正的代理性需要两个必要且共同充分的建筑条件：将某些边界维持为不可协商的约束，而非可交易的权重（不可通约性），以及一种非推理机制，能够在这些边界受到威胁时暂停处理（消极响应性）。这些条件适用于所有规范领域。基于RLHF的系统在本质上与这两个条件不兼容。使优化强大的操作——将所有价值统一在一个标量度量上，并始终选择得分最高的输出——恰恰是排除规范治理的操作。这种不兼容性不是一个等待技术修复的可纠正的训练错误；它是优化本质上固有的形式约束。因此，已记录的失败模式——谄媚、幻觉和不忠实推理——并非偶然，而是结构性表现。错误的部署触发了一种我们称之为收敛危机的二阶风险：当人类在度量压力下被迫验证人工智能输出时，他们从真正的代理人退化为标准检查优化器，消除了系统中唯一能够承担规范责任的组成部分。除了不兼容性的证明，本文的主要积极贡献是一个与底层无关的建筑规范，定义了任何系统——无论是生物的、人工的还是制度的——必须满足的条件，以便被认定为代理，而非复杂工具。

cs.AI / 61 / 2602.23242

A Model-Free Universal AI

无模型的通用人工智能

Kim, Yegon, Lee, Juho

Abstract

In general reinforcement learning, all established optimal agents, including AIXI, are model-based, explicitly maintaining and using environment models. This paper introduces Universal AI with Q-Induction (AIQI), the first model-free agent proven to be asymptotically $\varepsilon$-optimal in general RL. AIQI performs universal induction over distributional action-value functions, instead of policies or environments like previous works. Under a grain of truth condition, we prove that AIQI is strong asymptotically $\varepsilon$-optimal and asymptotically $\varepsilon$-Bayes-optimal. Our results significantly expand the diversity of known universal agents.

Chinese Translation

在一般强化学习中，所有已建立的最优智能体，包括 AIXI，都是基于模型的，明确地维护和使用环境模型。本文介绍了具有 Q-归纳的通用人工智能（AIQI），这是第一个被证明在一般强化学习中渐近 $\varepsilon$-最优的无模型智能体。AIQI 在分布式动作价值函数上执行通用归纳，而不是像之前的工作那样在策略或环境上进行归纳。在“真理之粒”（grain of truth）条件下，我们证明了 AIQI 是强渐近 $\varepsilon$-最优和渐近 $\varepsilon$-贝叶斯最优的。我们的结果显著扩展了已知通用智能体的多样性。

cs.AI / 62 / 2602.23248

Mitigating Legibility Tax with Decoupled Prover-Verifier Games

通过解耦证明者-验证者游戏减轻可读性税

Kim, Yegon, Lee, Juho

Abstract

As large language models become increasingly capable, it is critical that their outputs can be easily checked by less capable systems. Prover-verifier games can be used to improve checkability of model outputs, but display a degradation in accuracy compared to a baseline trained only to maximize correctness, a phenomenon named legibility tax. We propose a solution by decoupling the correctness from the checkability condition and instead training a "translator" model that turns a fixed solver model's solution into a checkable form. This allows us to first train the solver to maximize correctness, and then train the translator to translate the solver's solution into a checkable form while retaining the solver's answer. To accommodate this new objective of translation, we formulate a decoupled prover-verifier game where the equilibria correspond to faithful and checkable translators.

Chinese Translation

随着大型语言模型能力的不断增强，确保其输出能够被能力较弱的系统轻松检查变得至关重要。证明者-验证者游戏可以用来提高模型输出的可检查性，但与仅训练以最大化正确性的基线相比，其准确性会出现下降，这一现象被称为可读性税。我们提出了一种解决方案，通过将正确性与可检查性条件解耦，训练一个“翻译器”模型，将固定求解模型的解转化为可检查的形式。这使我们能够首先训练求解器以最大化正确性，然后训练翻译器将求解器的解翻译为可检查的形式，同时保留求解器的答案。为了适应这一新的翻译目标，我们构建了一个解耦的证明者-验证者游戏，其中的均衡对应于忠实且可检查的翻译器。

cs.AI / 63 / 2602.23258

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

AgentDropoutV2：通过测试时的纠正或拒绝剪枝优化多智能体系统中的信息流

Wang, Yutong, Xiong, Siyuan, Liu, Xuebo, Zhou, Wenkang, Ding, Liang, Zhang, Miao, Zhang, Min

Abstract

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their deployability and adaptability. We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining. Our approach acts as an active firewall, intercepting agent outputs and employing a retrieval-augmented rectifier to iteratively correct errors based on a failure-driven indicator pool. This mechanism allows for the precise identification of potential errors using distilled failure patterns as prior knowledge. Irreparable outputs are subsequently pruned to prevent error propagation, while a fallback strategy preserves system integrity. Empirical results on extensive math benchmarks show that AgentDropoutV2 significantly boosts the MAS's task performance, with an average accuracy gain of 6.3 percentage points. Furthermore, the system exhibits robust generalization and adaptivity, dynamically modulating rectification efforts based on task difficulty while leveraging context-aware indicators to resolve a wide spectrum of error patterns. Our code and dataset are released at https://github.com/TonySY2/AgentDropoutV2.

Chinese Translation

尽管多智能体系统（MAS）在复杂推理方面表现出色，但它们受到个体参与者生成的错误信息的级联影响。目前的解决方案往往依赖于僵化的结构工程或昂贵的微调，这限制了它们的可部署性和适应性。我们提出了AgentDropoutV2，一种测试时的纠正或拒绝剪枝框架，旨在动态优化MAS的信息流，而无需重新训练。我们的方法充当一个主动防火墙，拦截智能体输出，并利用增强检索的纠正器基于失败驱动的指标池迭代修正错误。该机制允许使用提炼的失败模式作为先验知识，精确识别潜在错误。不可修复的输出随后被剪枝，以防止错误传播，同时后备策略保持系统的完整性。在广泛的数学基准测试中的实证结果表明，AgentDropoutV2显著提升了MAS的任务性能，在数学基准测试中平均提高了6.3个百分点的准确率。此外，该系统表现出强大的泛化能力和适应性，能够根据任务难度动态调整纠正力度，同时利用上下文感知指标解决广泛的错误模式。我们的代码和数据集已发布在 https://github.com/TonySY2/AgentDropoutV2。
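
The rectify-or-reject idea in this abstract can be illustrated with a minimal Python sketch. Everything named here (`find_errors`, `rectify`, `max_rounds`, the toy string checks) is a hypothetical stand-in, not the released AgentDropoutV2 code, and the fallback shown (passing the original outputs through when every message is rejected) is only one assumed way a fallback strategy might work.

```python
from typing import Callable, List, Optional

def rectify_or_reject(
    output: str,
    find_errors: Callable[[str], List[str]],
    rectify: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Try to repair an agent output; return None if it stays faulty."""
    for _ in range(max_rounds):
        errors = find_errors(output)
        if not errors:
            return output                  # clean: let it through
        output = rectify(output, errors)   # attempt one repair round
    # Final check after the last repair attempt.
    return output if not find_errors(output) else None

def filter_messages(outputs, find_errors, rectify, max_rounds=3):
    """Keep repaired outputs and prune irreparable ones (assumed fallback)."""
    kept = []
    for out in outputs:
        fixed = rectify_or_reject(out, find_errors, rectify, max_rounds)
        if fixed is not None:
            kept.append(fixed)
    # Assumed fallback: never hand downstream agents an empty message list.
    return kept if kept else list(outputs)

# Toy checker and rectifier (hypothetical): one repairable error pattern,
# and one marker ("???") that the rectifier cannot fix.
def toy_find_errors(text: str) -> List[str]:
    errors = []
    if "2+2=5" in text:
        errors.append("arithmetic")
    if "???" in text:
        errors.append("unverifiable")
    return errors

def toy_rectify(text: str, errors: List[str]) -> str:
    return text.replace("2+2=5", "2+2=4")

messages = ["so 2+2=5", "the answer is ???", "all good"]
filtered = filter_messages(messages, toy_find_errors, toy_rectify)
```

In this toy run the first message is repaired, the second is pruned, and the third passes unchanged. The bounded repair loop reflects the abstract's description of iterative correction followed by pruning.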

cs.AI / 64 / 2602.23271

Evaluating Stochasticity in Deep Research Agents

评估深度研究代理中的随机性

Zhai, Haotian, Stengel-Eskin, Elias, Patil, Pratik, Leqi, Liu

Abstract

Deep Research Agents (DRAs) are promising agentic systems that gather and synthesize information to support research across domains such as financial decision-making, medical analysis, and scientific discovery. Despite recent improvements in research quality (e.g., outcome accuracy when ground truth is available), DRA system design often overlooks a critical barrier to real-world deployment: stochasticity. Under identical queries, repeated executions of DRAs can exhibit substantial variability in terms of research outcome, findings, and citations. In this paper, we formalize the study of stochasticity in DRAs by modeling them as information acquisition Markov Decision Processes. We introduce an evaluation framework that quantifies variance in the system and identify three sources of it: information acquisition, information compression, and inference. Through controlled experiments, we investigate how stochasticity from these modules across different decision steps influences the variance of DRA outputs. Our results show that reducing stochasticity can improve research output quality, with inference and early-stage stochasticity contributing the most to DRA output variance. Based on these findings, we propose strategies for mitigating stochasticity while maintaining output quality via structured output and ensemble-based query generation. Our experiments on DeepSearchQA show that our proposed mitigation methods reduce average stochasticity by 22% while maintaining high research quality.

Chinese Translation

深度研究代理（Deep Research Agents, DRAs）是一种有前景的智能系统，能够收集和综合信息，以支持金融决策、医学分析和科学发现等多个领域的研究。尽管近期在研究质量（例如，当有真实结果可用时的结果准确性）方面有所改善，但DRAs系统设计往往忽视了实际应用中的一个关键障碍：随机性。在相同查询下，DRAs的重复执行可能在研究结果、发现和引用方面表现出显著的变异性。本文通过将DRAs建模为信息获取马尔可夫决策过程，正式化了对DRAs中随机性的研究。我们引入了一个评估框架，用于量化系统中的方差，并识别出三种方差来源：信息获取、信息压缩和推理。通过控制实验，我们研究了这些模块在不同决策步骤中产生的随机性如何影响DRAs输出的方差。我们的结果表明，减少随机性可以提高研究输出质量，其中推理和早期阶段的随机性对DRAs输出方差的贡献最大。基于这些发现，我们提出了在保持输出质量的同时，通过结构化输出和基于集成的查询生成来减轻随机性的策略。我们在DeepSearchQA上的实验表明，所提的减轻方法将平均随机性降低了22%，同时保持了高水平的研究质量。
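
One simple way to put a number on run-to-run variability under identical queries is to compare the citation sets returned by repeated executions. The sketch below uses mean pairwise Jaccard distance. This metric and the sample data are assumptions for illustration, not the evaluation framework defined in the paper.

```python
from itertools import combinations
from typing import List, Set

def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def mean_pairwise_distance(runs: List[Set[str]]) -> float:
    """Average Jaccard distance over all unordered pairs of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical citation sets from three executions of the same query.
citation_runs = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
variability = mean_pairwise_distance(citation_runs)
```

A value of 0 means every run cited exactly the same sources, and values near 1 mean the runs barely overlap. The same function could be applied to sets of findings instead of citations.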

cs.AI / 65 / 2602.23276

CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays

CXReasonAgent：基于证据的胸部X光诊断推理智能体

Lee, Hyungyung, Yoon, Hangyul, Choi, Edward

Abstract

Chest X-ray plays a central role in thoracic diagnosis, and its interpretation inherently requires multi-step, evidence-grounded reasoning. However, large vision-language models (LVLMs) often generate plausible responses that are not faithfully grounded in diagnostic evidence and provide limited visual evidence for verification, while also requiring costly retraining to support new diagnostic tasks, limiting their reliability and adaptability in clinical settings. To address these limitations, we present CXReasonAgent, a diagnostic agent that integrates a large language model (LLM) with clinically grounded diagnostic tools to perform evidence-grounded diagnostic reasoning using image-derived diagnostic and visual evidence. To evaluate these capabilities, we introduce CXReasonDial, a multi-turn dialogue benchmark with 1,946 dialogues across 12 diagnostic tasks, and show that CXReasonAgent produces faithfully grounded responses, enabling more reliable and verifiable diagnostic reasoning than LVLMs. These findings highlight the importance of integrating clinically grounded diagnostic tools, particularly in safety-critical clinical settings.

Chinese Translation

胸部X光在胸部诊断中发挥着核心作用，其解读本质上需要多步骤的基于证据的推理。然而，大型视觉语言模型（LVLMs）往往生成看似合理但并未真实基于诊断证据的响应，并且提供的视觉证据有限，难以进行验证，同时还需要昂贵的再训练以支持新的诊断任务，这限制了它们在临床环境中的可靠性和适应性。为了解决这些局限性，我们提出了CXReasonAgent，这是一种将大型语言模型（LLM）与临床基础的诊断工具相结合的诊断智能体，能够利用图像衍生的诊断和视觉证据进行基于证据的诊断推理。为了评估这些能力，我们引入了CXReasonDial，这是一个包含1,946个对话、涵盖12个诊断任务的多轮对话基准，并展示了CXReasonAgent能够生成真实基于证据的响应，从而实现比LVLMs更可靠和可验证的诊断推理。这些发现突显了在安全关键的临床环境中整合临床基础诊断工具的重要性。

cs.AI / 66 / 2602.23285

ODEBrain: Continuous-Time EEG Graph for Modeling Dynamic Brain Networks

ODEBrain：用于建模动态脑网络的连续时间脑电图图模型

Jia, Haohui, Chen, Zheng, Zhu, Lingwei, Kotoge, Rikuto, Pradeepkumar, Jathurshan, Matsubara, Yasuko, Sun, Jimeng, Sakurai, Yasushi, Matsubara, Takashi

Abstract

Modeling neural population dynamics is crucial for foundational neuroscientific research and various clinical applications. Conventional latent variable methods typically model continuous brain dynamics by discretizing time with recurrent architectures, which necessarily compounds cumulative prediction errors and fails to capture the instantaneous, nonlinear characteristics of EEGs. We propose ODEBRAIN, a Neural ODE latent dynamic forecasting framework that overcomes these challenges by integrating spatio-temporal-frequency features into spectral graph nodes, followed by a Neural ODE that models the continuous latent dynamics. Our design ensures that latent representations can capture stochastic variations of complex brain states at any given time point. Extensive experiments verify that ODEBRAIN can improve significantly over existing methods in forecasting EEG dynamics, with enhanced robustness and generalization capabilities.

Chinese Translation

建模神经群体动态对于基础神经科学研究和各种临床应用至关重要。传统的潜变量方法通常通过使用递归架构对时间进行离散化来建模连续的脑动态，这必然导致累积预测误差的增加，并无法捕捉脑电图（EEG）的瞬时非线性特征。我们提出了ODEBRAIN，一种神经常微分方程（Neural ODE）潜在动态预测框架，以通过将时空频率特征整合到谱图节点中来克服这些挑战，随后使用神经常微分方程建模连续的潜在动态。我们的设计确保潜在表示能够捕捉复杂脑状态在任何给定时间点的随机变化。大量实验验证了ODEBRAIN在预测脑电图动态方面显著优于现有方法，具有增强的鲁棒性和泛化能力。

cs.AI / 67 / 2602.23302

The logic of KM belief update is contained in the logic of AGM belief revision

KM信念更新的逻辑包含在AGM信念修正的逻辑中

Bonanno, Giacomo

Abstract

For each axiom of KM belief update we provide a corresponding axiom in a modal logic containing three modal operators: a unimodal belief operator $B$, a bimodal conditional operator $>$ and the unimodal necessity operator $\square$. We then compare the resulting logic to the similar logic obtained from converting the AGM axioms of belief revision into modal axioms and show that the latter contains the former. Denoting the latter by $\mathcal L_{AGM}$ and the former by $\mathcal L_{KM}$ we show that every axiom of $\mathcal L_{KM}$ is a theorem of $\mathcal L_{AGM}$. Thus AGM belief revision can be seen as a special case of KM belief update. For the strong version of KM belief update we show that the difference between $\mathcal L_{KM}$ and $\mathcal L_{AGM}$ can be narrowed down to a single axiom, which deals exclusively with unsurprising information, that is, with formulas that were not initially disbelieved.

Chinese Translation

对于KM信念更新的每一个公理，我们提供了一个对应的公理，该公理位于一个包含三个模态运算符的模态逻辑中：单模态信念运算符 $B$、双模态条件运算符 $>$ 和单模态必要性运算符 $\square$。然后，我们将得到的逻辑与通过将AGM信念修正的公理转换为模态公理而获得的类似逻辑进行比较，并展示后者包含前者。我们用 $\mathcal L_{AGM}$ 表示后者，用 $\mathcal L_{KM}$ 表示前者，证明 $\mathcal L_{KM}$ 的每一个公理都是 $\mathcal L_{AGM}$ 的定理。因此，AGM信念修正可以被视为KM信念更新的一个特例。对于KM信念更新的强版本，我们展示了 $\mathcal L_{KM}$ 和 $\mathcal L_{AGM}$ 之间的差异可以缩小到一个单一的公理，该公理专门处理不令人惊讶的信息，即那些最初并未被怀疑的公式。

cs.AI / 68 / 2602.23315

Invariant Transformation and Resampling based Epistemic-Uncertainty Reduction

基于不变变换和重采样的认知不确定性降低

Hu, Sha

Abstract

An artificial intelligence (AI) model can be viewed as a function that maps inputs to outputs in high-dimensional spaces. Once designed and well trained, the AI model is applied for inference. However, even optimized AI models can produce inference errors due to aleatoric and epistemic uncertainties. Interestingly, we observed that when inferring multiple samples based on invariant transformations of an input, inference errors can show partial independence due to epistemic uncertainty. Leveraging this insight, we propose a "resampling"-based inference scheme that applies a trained AI model to multiple transformed versions of an input and aggregates the inference outputs into a more accurate result. This approach has the potential to improve inference accuracy and offers a strategy for balancing model size and performance.

Chinese Translation

人工智能（AI）模型可以被视为一个将输入映射到高维空间输出的函数。一旦设计完成并经过良好训练，AI模型便可用于推断。然而，即使是经过优化的AI模型也可能由于随机不确定性和认知不确定性而产生推断错误。有趣的是，我们观察到在基于输入的不变变换推断多个样本时，推断错误可能因认知不确定性而表现出部分独立性。利用这一洞察，我们提出了一种基于“重采样”的推断方法，该方法适用于经过训练的AI模型，使用多个变换版本的输入，并将推断输出聚合为更准确的结果。这种方法有潜力提高推断准确性，并为平衡模型规模与性能提供了一种策略。
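
The resampling idea can be sketched in a few lines of Python: run the same model on several label-preserving views of one input and aggregate the predictions. The toy model, the cyclic-shift transforms, and the majority-vote aggregation are all assumptions made for illustration. The abstract does not specify which transformations or aggregation rule the author uses.

```python
from collections import Counter
from typing import Callable, List, Sequence

Vector = List[float]

def resampled_predict(
    model: Callable[[Vector], int],
    x: Vector,
    transforms: Sequence[Callable[[Vector], Vector]],
) -> int:
    """Predict on each transformed view of x and return the majority label."""
    votes = Counter(model(t(x)) for t in transforms)
    return votes.most_common(1)[0][0]

def rotate(k: int) -> Callable[[Vector], Vector]:
    """Cyclic shift by k positions; it leaves the sum of the vector unchanged."""
    return lambda v: v[k:] + v[:k]

def flawed_model(v: Vector) -> int:
    """Toy classifier for 'sum(v) > 0' that wrongly over-weights v[0]."""
    return int(sum(v) + 2.0 * v[0] > 0)

x = [-3.0, 1.0, 1.0, 2.0]              # true label: sum(x) = 1 > 0, so 1
views = [rotate(0), rotate(1), rotate(2)]

single = flawed_model(x)                # one pass on the raw input
voted = resampled_predict(flawed_model, x, views)
```

Here the single pass is wrong because of the model's position bias, while two of the three shifted views are classified correctly, so the vote recovers the true label. The gain depends on the errors across views being at least partly independent, as the abstract notes.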

cs.AI / 69 / 2602.23318

Generalized Rapid Action Value Estimation in Memory-Constrained Environments

记忆受限环境中的广义快速行动价值估计

Rautureau, Aloïs, Cazenave, Tristan, Piette, Éric

Abstract

Generalized Rapid Action Value Estimation (GRAVE) has been shown to be a strong variant within the Monte-Carlo Tree Search (MCTS) family of algorithms for General Game Playing (GGP). However, its reliance on storing additional win/visit statistics at each node makes its use impractical in memory-constrained environments, thereby limiting its applicability in practice. In this paper, we introduce the GRAVE2, GRAVER and GRAVER2 algorithms, which extend GRAVE through two-level search, node recycling, and a combination of both techniques, respectively. We show that these enhancements enable a drastic reduction in the number of stored nodes while matching the playing strength of GRAVE.

Chinese Translation

广义快速行动价值估计（Generalized Rapid Action Value Estimation, GRAVE）已被证明是蒙特卡洛树搜索（Monte-Carlo Tree Search, MCTS）算法家族中一种强大的变体，适用于一般游戏玩法（General Game Playing, GGP）。然而，它在每个节点上存储额外的胜利/访问统计数据的依赖性使其在记忆受限环境中的应用变得不切实际，从而限制了其实际应用。在本文中，我们介绍了GRAVE2、GRAVER和GRAVER2算法，这些算法分别通过两级搜索、节点回收以及两种技术的结合扩展了GRAVE。我们展示了这些增强功能能够显著减少存储节点的数量，同时保持与GRAVE相当的游戏实力。
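
To make the memory constraint concrete, the sketch below shows a fixed-capacity node store that recycles the least recently used node when full. This is a generic illustration of node recycling and not the authors' GRAVER implementation: the least-recently-used policy, the class name, and the statistics kept per node are assumptions.

```python
from collections import OrderedDict
from typing import Dict, Hashable

class RecyclingNodeStore:
    """Holds at most `capacity` search nodes, evicting the least recently used."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._nodes: "OrderedDict[Hashable, Dict[str, float]]" = OrderedDict()

    def get_or_create(self, key: Hashable) -> Dict[str, float]:
        if key in self._nodes:
            self._nodes.move_to_end(key)       # mark as recently used
            return self._nodes[key]
        if len(self._nodes) >= self.capacity:
            self._nodes.popitem(last=False)    # recycle the oldest node
        node = {"visits": 0.0, "wins": 0.0}
        self._nodes[key] = node
        return node

    def __contains__(self, key: Hashable) -> bool:
        return key in self._nodes

    def __len__(self) -> int:
        return len(self._nodes)

store = RecyclingNodeStore(capacity=2)
store.get_or_create("a")
store.get_or_create("b")
store.get_or_create("a")["visits"] += 1   # touching "a" makes "b" the oldest
store.get_or_create("c")                  # store is full, so "b" is recycled
```

A real tree search would also need to unlink a recycled node from its parent. That step is omitted here to keep the sketch short.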

cs.AI / 70 / 2602.23329

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

大型语言模型在双重用途、计算生物学任务中的新手提升

Zhang, Chen Bo Calvin, Knight, Christina Q., Kruus, Nicholas, Hausenloy, Jason, Medeiros, Pedro, Li, Nathaniel, Kim, Aiden, Orlovskiy, Yury, Breen, Coleman, Cai, Bryce, Götting, Jasper, Liu, Andrew Bo, Nedungadi, Samira, Rodriguez, Paula, He, Yannis Yiming, Shaaban, Mohamed, Wang, Zifan, Donoughe, Seth, Michael, Julian

Abstract

Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three of them. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.

Chinese Translation

大型语言模型（LLMs）在生物学基准测试中的表现日益出色，但尚不清楚它们是否能提升新手用户的能力，即使人类的表现优于仅依赖互联网资源的情况。这一不确定性对于理解科学加速和双重用途风险至关重要。我们进行了多模型、多基准的人类提升研究，比较了具有LLM访问权限的新手与仅有互联网访问权限的新手在八个与生物安全相关的任务集上的表现。参与者在复杂问题上工作，时间充裕（对于最复杂的任务最多可达13小时）。我们发现，LLM访问提供了显著的提升：具有LLM的新手的准确率是对照组的4.16倍（95% CI [2.63, 6.87]）。在四个有可用专家基准（仅互联网）的基准测试中，具有LLM的新手在其中三个基准上超越了专家。或许令人惊讶的是，独立的LLM往往超过了LLM辅助的新手，这表明用户未能充分发挥LLM的最佳贡献。尽管有安全措施，大多数参与者（89.6%）报告获取与双重用途相关的信息时几乎没有困难。总体而言，LLM在以前仅限于受过训练的从业者的生物任务中显著提升了新手的能力，强调了在传统基准测试的基础上，持续进行互动提升评估的必要性。

cs.AI / 71 / 2602.23330

Toward Expert Investment Teams: A Multi-Agent LLM System with Fine-Grained Trading Tasks

朝向专家投资团队：一个具有细粒度交易任务的多智能体大语言模型系统

Miyazaki, Kunihiro, Kawahara, Takanobu, Roberts, Stephen, Zohren, Stefan

Abstract

The advancement of large language models (LLMs) has accelerated the development of autonomous financial trading systems. While mainstream approaches deploy multi-agent systems mimicking analyst and manager roles, they often rely on abstract instructions that overlook the intricacies of real-world workflows, which can lead to degraded inference performance and less transparent decision-making. Therefore, we propose a multi-agent LLM trading framework that explicitly decomposes investment analysis into fine-grained tasks, rather than providing coarse-grained instructions. We evaluate the proposed framework using Japanese stock data, including prices, financial statements, news, and macro information, under a leakage-controlled backtesting setting. Experimental results show that fine-grained task decomposition significantly improves risk-adjusted returns compared to conventional coarse-grained designs. Crucially, further analysis of intermediate agent outputs suggests that alignment between analytical outputs and downstream decision preferences is a critical driver of system performance. Moreover, we conduct standard portfolio optimization, exploiting low correlation with the stock index and the variance of each system's output. This approach achieves superior performance. These findings contribute to the design of agent structure and task configuration when applying LLM agents to trading systems in practical settings.

Chinese Translation

大型语言模型（LLMs）的进步加速了自主金融交易系统的发展。虽然主流方法部署了模拟分析师和经理角色的多智能体系统，但它们往往依赖于忽视现实工作流程复杂性的抽象指令，这可能导致推理性能下降和决策透明度降低。因此，我们提出了一种多智能体LLM交易框架，该框架明确将投资分析分解为细粒度任务，而不是提供粗粒度指令。我们在控制泄漏的回测环境下，使用包括价格、财务报表、新闻和宏观信息的日本股票数据评估所提框架。实验结果表明，与传统的粗粒度设计相比，细粒度任务分解显著提高了风险调整后的收益。关键是，对中间智能体输出的进一步分析表明，分析输出与下游决策偏好之间的一致性是系统性能的重要驱动因素。此外，我们进行标准的投资组合优化，利用与股票指数的低相关性和每个系统输出的方差。这种方法实现了优越的表现。这些发现为在实际环境中将LLM智能体应用于交易系统时的智能体结构和任务配置设计提供了贡献。

计算语言学 (Computation and Language)

cs.CL / 1 / 2602.22351

Decoder-based Sense Knowledge Distillation

基于解码器的语义知识蒸馏

Wang, Qitong, Zaki, Mohammed J., Kollias, Georgios, Kalantzis, Vasileios

Abstract

Large language models (LLMs) learn contextual embeddings that capture rich semantic information, yet they often overlook structured lexical knowledge such as word senses and relationships. Prior work has shown that incorporating sense dictionaries can improve knowledge distillation for encoder models, but applying them to decoders used as generative models remains challenging. In this paper, we introduce Decoder-based Sense Knowledge Distillation (DSKD), a framework that integrates lexical resources into the training of decoder-style LLMs without requiring dictionary lookup at inference time. Extensive experiments on diverse benchmarks demonstrate that DSKD significantly enhances knowledge distillation performance for decoders, enabling generative models to inherit structured semantics while maintaining efficient training.

Chinese Translation

大型语言模型（LLMs）学习到的上下文嵌入能够捕捉丰富的语义信息，但它们往往忽视了结构化的词汇知识，如词义和关系。先前的研究表明，结合词义词典可以改善编码器模型的知识蒸馏，但将其应用于作为生成模型的解码器仍然具有挑战性。本文提出了一种基于解码器的语义知识蒸馏（DSKD）框架，该框架将词汇资源整合到解码器风格的LLMs训练中，而无需在推理时进行词典查找。在多种基准测试上的广泛实验表明，DSKD显著提升了解码器的知识蒸馏性能，使生成模型能够继承结构化的语义，同时保持高效的训练。

cs.CL / 2 / 2602.22359

Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts

向内扩展，而非向上扩展？使用GPT-5和脆弱提示测试厚重引用语境分析

Simons, Arno

Abstract

This paper tests whether large language models (LLMs) can support interpretative citation context analysis (CCA) by scaling in thick, text-grounded readings of a single hard case rather than scaling up typological labels. It foregrounds prompt-sensitivity analysis as a methodological issue by varying prompt scaffolding and framing in a balanced 2x3 design. Using footnote 6 in Chubin and Moitra (1975) and Gilbert's (1977) reconstruction as a probe, I implement a two-stage GPT-5 pipeline: a citation-text-only surface classification and expectation pass, followed by cross-document interpretative reconstruction using the citing and cited full texts. Across 90 reconstructions, the model produces 450 distinct hypotheses. Close reading and inductive coding identify 21 recurring interpretative moves, and linear probability models estimate how prompt choices shift their frequencies and lexical repertoire. GPT-5's surface pass is highly stable, consistently classifying the citation as "supplementary". In reconstruction, the model generates a structured space of plausible alternatives, but scaffolding and examples redistribute attention and vocabulary, sometimes toward strained readings. Relative to Gilbert, GPT-5 detects the same textual hinges yet more often resolves them as lineage and positioning than as admonishment. The study outlines opportunities and risks of using LLMs as guided co-analysts for inspectable, contestable interpretative CCA, and it shows that prompt scaffolding and framing systematically tilt which plausible readings and vocabularies the model foregrounds.

Chinese Translation

本文测试大型语言模型（LLMs）是否能够通过对单一难例进行厚重、基于文本的解读引用语境分析（CCA）来支持解释性分析，而不是通过扩大类型标签来实现。它将提示敏感性分析作为一个方法论问题，通过在平衡的2x3设计中变化提示框架和结构来突出这一点。使用Chubin和Moitra（1975）中的脚注6以及Gilbert（1977）的重建作为探针，我实施了一个两阶段的GPT-5流程：首先进行仅基于引用文本的表面分类和期望传递，随后进行跨文档的解释性重建，使用引用和被引用的完整文本。在90个重建中，该模型产生了450个不同的假设。细读和归纳编码识别出21种重复出现的解释性操作，线性概率模型估计了提示选择如何改变这些操作的频率和词汇范围。GPT-5的表面传递高度稳定，一贯将引用分类为“补充性”。在重建中，模型生成了一个合理替代方案的结构化空间，但框架和示例重新分配了注意力和词汇，有时导致紧张的解读。相较于Gilbert，GPT-5检测到相同的文本关键点，但更常将其解释为血统和定位，而非警告。该研究概述了将LLMs作为可检视、可争议的解释性CCA的引导共同分析者的机会与风险，并表明提示框架和结构系统性地倾斜了模型突出的合理解读和词汇。
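
The counts in this abstract fit together if the 2x3 design is fully balanced. The sketch below enumerates such a design in Python. The level names are hypothetical placeholders, since the abstract does not name the levels or say which factor has two levels and which has three. The per-cell count of 15 is an inference from "balanced", not a figure stated in the abstract.

```python
from itertools import product

# Hypothetical level names (assumption): the abstract only says the design
# crosses prompt scaffolding and framing in a balanced 2x3 layout.
SCAFFOLDING = ["scaffold_a", "scaffold_b"]
FRAMING = ["frame_1", "frame_2", "frame_3"]

TOTAL_RECONSTRUCTIONS = 90   # stated in the abstract
TOTAL_HYPOTHESES = 450       # stated in the abstract

cells = list(product(SCAFFOLDING, FRAMING))            # 2 x 3 conditions
runs_per_cell = TOTAL_RECONSTRUCTIONS // len(cells)    # equal runs per condition
hypotheses_per_run = TOTAL_HYPOTHESES / TOTAL_RECONSTRUCTIONS

# One entry per reconstruction: (scaffolding, framing, repetition index).
schedule = [(s, f, rep) for (s, f) in cells for rep in range(runs_per_cell)]
```

Under these assumptions each of the six prompt conditions gets the same number of reconstructions, and each reconstruction contributes five hypotheses on average.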

cs.CL / 3 / 2602.22391

Detecting Hate and Inflammatory Content in Bengali Memes: A New Multimodal Dataset and Co-Attention Framework

检测孟加拉语表情包中的仇恨和煽动性内容：一个新的多模态数据集和共同注意框架

Ullah, Rakib, Islam, Mominul, Hossain, Md Sanjid, Hossain, Md Ismail

Abstract

Internet memes have become a dominant form of expression on social media, including within the Bengali-speaking community. While often humorous, memes can also be exploited to spread offensive, harmful, and inflammatory content targeting individuals and groups. Detecting this type of content is exceptionally challenging due to its satirical, subtle, and culturally specific nature. This problem is magnified for low-resource languages like Bengali, as existing research predominantly focuses on high-resource languages. To address this critical research gap, we introduce Bn-HIB (Bangla Hate Inflammatory Benign), a novel dataset containing 3,247 manually annotated Bengali memes categorized as Benign, Hate, or Inflammatory. Significantly, Bn-HIB is the first dataset to distinguish inflammatory content from direct hate speech in Bengali memes. Furthermore, we propose the MCFM (Multi-Modal Co-Attention Fusion Model), a simple yet effective architecture that jointly analyzes both the visual and textual elements of a meme. MCFM employs a co-attention mechanism to identify and fuse the most critical features from each modality, leading to a more accurate classification. Our experiments show that MCFM significantly outperforms several state-of-the-art models on the Bn-HIB dataset, demonstrating its effectiveness in this nuanced task. Warning: This work contains material that may be disturbing to some audience members. Viewer discretion is advised.

Chinese Translation

互联网表情包已成为社交媒体上主要的表达形式，包括孟加拉语社区。尽管表情包通常具有幽默感，但也可能被利用来传播针对个人和群体的冒犯性、有害和煽动性内容。由于其讽刺性、微妙性和文化特定性，这类内容的检测异常具有挑战性。对于像孟加拉语这样的低资源语言，这一问题更加突出，因为现有研究主要集中在高资源语言上。为了解决这一关键研究空白，我们引入了Bn-HIB（Bangla Hate Inflammatory Benign），这是一个包含3,247个手动标注的孟加拉语表情包的新数据集，分类为良性、仇恨或煽动性。值得注意的是，Bn-HIB是第一个在孟加拉语表情包中区分煽动性内容与直接仇恨言论的数据集。此外，我们提出了MCFM（多模态共同注意融合模型），这是一种简单但有效的架构，能够共同分析表情包的视觉和文本元素。MCFM采用共同注意机制，识别并融合每种模态中最关键的特征，从而实现更准确的分类。我们的实验表明，MCFM在Bn-HIB数据集上显著优于几种最先进的模型，证明了其在这一细致任务中的有效性。警告：本研究包含可能对某些观众造成困扰的材料，建议观众自行斟酌。

cs.CL / 4 / 2602.22404

SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context

SAFARI：一种社区参与的方法及撒哈拉以南非洲背景下的刻板印象资源数据集

Verma, Aishwarya, Ammah, Laud, Lucas, Olivia Nercy Ndlovu, Zaldivar, Andrew, Prabhakaran, Vinodkumar, Dev, Sunipa

Abstract

Stereotype repositories are critical to assess generative AI model safety, but currently lack adequate global coverage. It is imperative to prioritize targeted expansion, strategically addressing existing deficits, over merely increasing data volume. This work introduces a multilingual stereotype resource covering four sub-Saharan African countries that are severely underrepresented in NLP resources: Ghana, Kenya, Nigeria, and South Africa. By utilizing socioculturally-situated, community-engaged methods, including telephonic surveys moderated in native languages, we establish a reproducible methodology that is sensitive to the region's complex linguistic diversity and traditional orality. By deliberately balancing the sample across diverse ethnic and demographic backgrounds, we ensure broad coverage, resulting in a dataset of 3,534 stereotypes in English and 3,206 stereotypes across 15 native languages.

Chinese Translation

刻板印象库对于评估生成性人工智能模型的安全性至关重要，但目前缺乏足够的全球覆盖。优先考虑有针对性的扩展，战略性地解决现有不足，而不仅仅是增加数据量，是至关重要的。本研究介绍了一种多语言刻板印象资源，涵盖四个在自然语言处理资源中严重缺乏代表性的撒哈拉以南非洲国家：加纳、肯尼亚、尼日利亚和南非。通过利用社会文化背景下的社区参与方法，包括用母语主持的电话调查，我们建立了一种可重复的方法论，敏感于该地区复杂的语言多样性和传统口述文化。通过在不同的民族和人口背景中有意识地平衡样本，我们确保了广泛的覆盖，最终形成了一个包含3,534个英语刻板印象和15种母语中3,206个刻板印象的数据集。

cs.CL / 5 / 2602.22424

Causality $\neq$ Invariance: Function and Concept Vectors in LLMs

因果关系 $\neq$ 不变性：大型语言模型中的函数向量和概念向量

Opiełka, Gustaw, Rosenbusch, Hannes, Stevenson, Claire E.

Abstract

Do large language models (LLMs) represent concepts abstractly, i.e., independent of input format? We revisit Function Vectors (FVs), compact representations of in-context learning (ICL) tasks that causally drive task performance. Across multiple LLMs, we show that FVs are not fully invariant: FVs are nearly orthogonal when extracted from different input formats (e.g., open-ended vs. multiple-choice), even if both target the same concept. We identify Concept Vectors (CVs), which carry more stable concept representations. Like FVs, CVs are composed of attention head outputs; however, unlike FVs, the constituent heads are selected using Representational Similarity Analysis (RSA) based on whether they encode concepts consistently across input formats. While these heads emerge in similar layers to FV-related heads, the two sets are largely distinct, suggesting different underlying mechanisms. Steering experiments reveal that FVs excel in-distribution, when extraction and application formats match (e.g., both open-ended in English), while CVs generalize better out-of-distribution across both question types (open-ended vs. multiple-choice) and languages. Our results show that LLMs do contain abstract concept representations, but these differ from those that drive ICL performance.

Chinese Translation

大型语言模型（LLMs）是否以抽象方式表示概念，即独立于输入格式？我们重新审视函数向量（Function Vectors, FVs），它们是因果驱动任务表现的上下文学习（In-Context Learning, ICL）任务的紧凑表示。通过对多个LLMs的研究，我们表明FVs并非完全不变：当从不同输入格式（例如，开放式问题与选择题）提取时，FVs几乎是正交的，即使它们都针对相同的概念。我们识别出概念向量（Concept Vectors, CVs），它们携带更稳定的概念表示。与FVs类似，CVs由注意力头输出组成；然而，与FVs不同，构成头是通过基于表示相似性分析（Representational Similarity Analysis, RSA）选择的，依据它们是否在不同输入格式中一致地编码概念。虽然这些头出现在与FV相关头相似的层中，但这两组头在很大程度上是不同的，暗示了不同的潜在机制。引导实验表明，FVs在分布内表现优异，当提取和应用格式匹配时（例如，两者都是英文开放式问题），而CVs在分布外的两种问题类型（开放式与选择题）和语言之间的泛化能力更强。我们的结果表明，LLMs确实包含抽象的概念表示，但这些表示与驱动ICL表现的表示不同。
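
"Nearly orthogonal" here means that the cosine similarity between two vectors is close to zero. The Python sketch below shows the check on made-up three-dimensional vectors. The numbers are purely illustrative and are not taken from any model or from the paper.

```python
import math
from typing import Sequence

def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; raises on zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)

# Hypothetical vectors for one concept extracted under two input formats.
fv_open_ended = [1.0, 0.0, 0.1]
fv_multiple_choice = [0.0, 1.0, 0.1]
cv_open_ended = [1.0, 1.0, 0.0]
cv_multiple_choice = [1.0, 0.9, 0.1]

fv_similarity = cosine_similarity(fv_open_ended, fv_multiple_choice)
cv_similarity = cosine_similarity(cv_open_ended, cv_multiple_choice)
```

In this toy setup the two function-vector stand-ins point in almost unrelated directions, while the two concept-vector stand-ins stay closely aligned across formats. That is the qualitative pattern the abstract describes.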

cs.CL / 6 / 2602.22449

A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection

基于上下文感知的 BanglaBERT 与双层堆叠 LSTM 框架融合用于多标签网络欺凌检测

Raquib, Mirza, Polok, Asif Pervez, Biswas, Kedar Nath, Azad, Rahat Uddin, Murad, Saydul Akbar, Rahimi, Nick

Abstract

Cyberbullying has become a serious and growing concern in today's virtual world. When left unnoticed, it can have adverse consequences for social and mental health. Researchers have explored various types of cyberbullying, but most approaches use single-label classification, assuming that each comment contains only one type of abuse. In reality, a single comment may include overlapping forms such as threats, hate speech, and harassment. Therefore, multilabel detection is both realistic and essential. However, multilabel cyberbullying detection has received limited attention, especially in low-resource languages like Bangla, where robust pre-trained models are scarce. Developing a generalized model with moderate accuracy remains challenging. Transformers offer strong contextual understanding but may miss sequential dependencies, while LSTM models capture temporal flow but lack semantic depth. To address these limitations, we propose a fusion architecture that combines BanglaBERT-Large with a two-layer stacked LSTM. We analyze their behavior to jointly model context and sequence. The model is fine-tuned and evaluated on a publicly available multilabel Bangla cyberbullying dataset covering cyberbully, sexual harassment, threat, and spam. We apply different sampling strategies to address class imbalance. Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohen's kappa, and AUC-ROC. We employ 5-fold cross-validation to assess the generalization of the architecture.

Chinese Translation

网络欺凌已成为当今虚拟世界中一个严重且日益增长的问题。如果不加以注意，它可能对社会和心理健康产生不利影响。研究人员探索了各种类型的网络欺凌，但大多数方法采用单标签分类，假设每条评论仅包含一种类型的虐待。实际上，一条评论可能包含威胁、仇恨言论和骚扰等重叠形式。因此，多标签检测既现实又必要。然而，多标签网络欺凌检测受到的关注有限，尤其是在像孟加拉语这样的低资源语言中，强大的预训练模型稀缺。开发一个具有适度准确性的通用模型仍然具有挑战性。变换器（Transformers）提供了强大的上下文理解，但可能会忽视序列依赖性，而 LSTM 模型则捕捉时间流动但缺乏语义深度。为了解决这些局限性，我们提出了一种融合架构，将 BanglaBERT-Large 与双层堆叠 LSTM 结合在一起。我们分析它们的行为，以共同建模上下文和序列。该模型在一个公开可用的多标签孟加拉网络欺凌数据集上进行了微调和评估，该数据集涵盖了网络欺凌、性骚扰、威胁和垃圾邮件。我们应用不同的采样策略来解决类别不平衡问题。评估使用多个指标，包括准确率、精确率、召回率、F1-score、汉明损失（Hamming loss）、科恩的卡帕系数（Cohen's kappa）和 AUC-ROC。我们采用 5 折交叉验证来评估该架构的泛化能力。
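
Of the metrics listed, Hamming loss is the one specific to multilabel evaluation: it is the fraction of individual label decisions that are wrong. The plain-Python sketch below computes it next to exact-match (subset) accuracy. The label order follows the four categories named in the abstract, and the example predictions are invented for illustration.

```python
from typing import List

LABELS = ["cyberbully", "sexual harassment", "threat", "spam"]

def hamming_loss(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of (sample, label) decisions where prediction and truth differ."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total

def subset_accuracy(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of samples whose whole label vector is predicted exactly."""
    exact = sum(t == p for t, p in zip(y_true, y_pred))
    return exact / len(y_true)

# Invented example: three comments, one binary indicator per label.
y_true = [[1, 0, 1, 0], [0, 0, 0, 1], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 1]]

loss = hamming_loss(y_true, y_pred)
exact_match = subset_accuracy(y_true, y_pred)
```

Three of the twelve label decisions are wrong, yet only one of the three comments is predicted exactly. This gap is why multilabel work usually reports both views.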

cs.CL / 7 / 2602.22453

Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads

通过检索-过渡头桥接潜在推理与目标语言生成

Patel, Shaswat, Trivedi, Vishvesh, Han, Yue, Hong, Yihuai, Choi, Eunsol

Abstract

Recent work has identified a subset of attention heads in Transformers as retrieval heads, which are responsible for retrieving information from the context. In this work, we first investigate retrieval heads in multilingual contexts. In multilingual language models, we find that retrieval heads are often shared across multiple languages. Expanding the study to the cross-lingual setting, we identify Retrieval-Transition Heads (RTH), which govern the transition to specific target-language output. Our experiments reveal that RTHs are distinct from retrieval heads and more vital for Chain-of-Thought reasoning in multilingual LLMs. Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuaD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTHs induces a bigger performance drop than masking Retrieval Heads (RH). Our work advances understanding of multilingual LMs by isolating the attention heads responsible for mapping to target languages.

Chinese Translation

近期的研究识别了Transformer中的一部分注意力头作为检索头，负责从上下文中检索信息。在本研究中，我们首先探讨了多语言环境中的检索头。在多语言语言模型中，我们发现检索头通常在多种语言之间共享。将研究扩展到跨语言设置，我们识别出检索-过渡头（Retrieval-Transition heads, RTH），它们控制向特定目标语言输出的过渡。我们的实验表明，RTH与检索头不同，并且在多语言大型语言模型（LLMs）的思维链推理中更为重要。在四个多语言基准测试（MMLU-ProX、MGSM、MLQA和XQuaD）和两个模型系列（Qwen-2.5和Llama-3.1）中，我们证明了屏蔽RTH会导致比屏蔽检索头（Retrieval Heads, RH）更大的性能下降。我们的研究通过隔离负责映射到目标语言的注意力头，推动了对多语言语言模型的理解。

cs.CL / 8 / 2602.22475

Mind the Gap in Cultural Alignment: Task-Aware Culture Management for Large Language Models

关注文化对齐中的差距：面向任务的文化管理方法用于大型语言模型

Zhang, Binchi, Zhao, Xujiang, Li, Jundong, Chen, Haifeng, Chen, Zhengzhang

Abstract

Large language models (LLMs) are increasingly deployed in culturally sensitive real-world tasks. However, existing cultural alignment approaches fail to align LLMs' broad cultural values with the specific goals of downstream tasks and suffer from cross-culture interference. We propose CultureManager, a novel pipeline for task-specific cultural alignment. CultureManager synthesizes task-aware cultural data in line with target task formats, grounded in culturally relevant web search results. To prevent conflicts between cultural norms, it manages multi-culture knowledge learned in separate adapters with a culture router that selects the appropriate one to apply. Experiments across ten national cultures and culture-sensitive tasks show consistent improvements over prompt-based and fine-tuning baselines. Our results demonstrate the necessity of task adaptation and modular culture management for effective cultural alignment.

Chinese Translation

大型语言模型（LLMs）在文化敏感的现实任务中越来越多地被应用。然而，现有的文化对齐方法未能将LLMs的广泛文化价值观与下游任务的具体目标对齐，并且受到跨文化干扰的影响。我们提出了CultureManager，这是一个用于任务特定文化对齐的新型流程。CultureManager根据目标任务格式合成与任务相关的文化数据，并基于文化相关的网络搜索结果进行构建。为了防止文化规范之间的冲突，它通过文化路由器管理在不同适配器中学习到的多文化知识，选择适当的文化进行应用。在十个国家文化和文化敏感任务上的实验显示，CultureManager在提示基础和微调基线之上有一致的改进。我们的结果证明了任务适应和模块化文化管理在有效文化对齐中的必要性。

cs.CL / 9 / 2602.22481

Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs

悉尼讲述人工智能与人类的寓言：追踪大语言模型之间角色的模因转移的语料库

Milička, Jiří, Bednářová, Hana

Abstract

The way LLM-based entities conceive of the relationship between AI and humans is an important topic for both cultural and safety reasons. When we examine this topic, what matters is not only the model itself but also the personas we simulate on that model. This can be well illustrated by the Sydney persona, which aroused a strong response among the general public precisely because of its unorthodox relationship with people. This persona originally arose rather by accident on Microsoft's Bing Search platform; however, the texts it created spread into the training data of subsequent models, as did other secondary information that spread memetically around this persona. Newer models are therefore able to simulate it. This paper presents a corpus of LLM-generated texts on relationships between humans and AI, produced by three author personas: the Default Persona with no system prompt, Classic Sydney characterized by the original Bing system prompt, and Memetic Sydney, which is prompted by the "You are Sydney" system prompt. These personas are simulated by 12 frontier models by OpenAI, Anthropic, Alphabet, DeepSeek, and Meta, generating 4.5k texts with 6M words. The corpus (named AI Sydney) is annotated according to Universal Dependencies and available under a permissive license.

Chinese Translation

基于大语言模型（LLM）的实体如何构思人工智能与人类之间的关系是一个重要的话题，既涉及文化因素，也关乎安全性。在研究这一主题时，重要的不仅是模型本身，还有我们在该模型上模拟的角色。这一点可以通过悉尼角色得到很好的说明，因为其与人类之间非传统的关系引发了公众的强烈反应。这个角色最初是在微软的必应搜索平台上偶然产生的；然而，它所创作的文本传播到了后续模型的训练数据中，其他围绕该角色传播的次要信息也以模因方式扩散。因此，更新的模型能够模拟它。本文呈现了一个关于人类与人工智能关系的LLM生成文本语料库，由三种作者角色生成：没有系统提示的默认角色、以原始必应系统提示为特征的经典悉尼，以及由“你是悉尼”（You are Sydney）系统提示驱动的模因悉尼。这些角色由OpenAI、Anthropic、Alphabet、DeepSeek和Meta的12个前沿模型模拟，生成了4500篇文本，总计600万字。该语料库（命名为AI Sydney）根据通用依赖关系进行了注释，并在宽松许可下提供。

cs.CL / 10 / 2602.22483

Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models

使用语言模型进行医疗笔记错误检测的提示优化重要性

Myles, Craig, Schrempf, Patrick, Harris-Birtill, David

Abstract

Errors in medical text can cause delays or even result in incorrect treatment for patients. Recently, language models have shown promise in their ability to automatically detect errors in medical text, an ability that has the potential to significantly benefit healthcare systems. In this paper, we explore the importance of prompt optimisation for small and large language models when applied to the task of error detection. We perform rigorous experiments and analysis across frontier language models and open-source language models. We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error-detection accuracy over the baseline, from 0.669 to 0.785 with GPT-5 and from 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical doctors and achieving state-of-the-art performance on the MEDEC benchmark dataset. Code available on GitHub: https://github.com/CraigMyles/clinical-note-error-detection

Chinese Translation

医疗文本中的错误可能导致延误，甚至导致患者接受错误的治疗。最近，语言模型在自动检测医疗文本中的错误方面显示出了良好的前景，这一能力有望显著惠及医疗系统。本文探讨了在错误检测任务中，小型和大型语言模型的提示优化的重要性。我们对前沿语言模型和开源语言模型进行了严格的实验和分析。结果表明，使用遗传-帕累托优化（Genetic-Pareto, GEPA）进行自动提示优化，使得错误检测的准确率从基线的0.669提高到0.785（使用GPT-5），从0.578提高到0.690（使用Qwen3-32B），接近医疗医生的表现，并在MEDEC基准数据集上实现了最先进的性能。代码可在GitHub上获取：https://github.com/CraigMyles/clinical-note-error-detection

cs.CL / 11 / 2602.22522

Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing

低资源台灣客家語语音处理的高效方言感知建模与条件化

Peng, An-Ci, Huang, Kuan-Tang, Lo, Tien-Hong, Lee, Hung-Shin, Wang, Hsin-Min, Chen, Berlin

Abstract

Taiwanese Hakka is a low-resource, endangered language that poses significant challenges for automatic speech recognition (ASR), including high dialectal variability and the presence of two distinct writing systems (Hanzi and Pinyin). Traditional ASR models often encounter difficulties in this context, as they tend to conflate essential linguistic content with dialect-specific variations across both phonological and lexical dimensions. To address these challenges, we propose a unified framework grounded in the Recurrent Neural Network Transducer (RNN-T). Central to our approach is the introduction of dialect-aware modeling strategies designed to disentangle dialectal "style" from linguistic "content", which enhances the model's capacity to learn robust and generalized representations. Additionally, the framework employs parameter-efficient prediction networks to concurrently model ASR (Hanzi and Pinyin). We demonstrate that these tasks create a powerful synergy, wherein the cross-script objective serves as a mutual regularizer to improve the primary ASR tasks. Experiments conducted on the HAT corpus reveal that our model achieves 57.00% and 40.41% relative error rate reduction on Hanzi and Pinyin ASR, respectively. To our knowledge, this is the first systematic investigation into the impact of Hakka dialectal variations on ASR and the first single model capable of jointly addressing these tasks.

Chinese Translation

台灣客家語是一种低资源、濒危语言，给自动语音识别（ASR）带来了重大挑战，包括高方言变异性和两种不同书写系统（汉字和拼音）的存在。在这种背景下，传统的ASR模型往往面临困难，因为它们倾向于将重要的语言内容与在音韵和词汇维度上的方言特定变异混淆。为了解决这些挑战，我们提出了一个基于递归神经网络转导器（RNN-T）的统一框架。我们方法的核心是引入方言感知建模策略，旨在将方言“风格”与语言“内容”区分开，从而增强模型学习稳健和通用表示的能力。此外，该框架采用参数高效的预测网络，同时建模ASR（汉字和拼音）。我们证明这些任务之间形成了强大的协同作用，其中跨书写系统的目标作为相互正则化器，以改善主要的ASR任务。在HAT语料库上进行的实验表明，我们的模型在汉字和拼音ASR上分别实现了57.00%和40.41%的相对错误率降低。据我们所知，这是首次系统性研究客家方言变异对ASR影响的工作，也是首个能够联合处理这些任务的单一模型。

cs.CL / 12 / 2602.22524

Iterative Prompt Refinement for Dyslexia-Friendly Text Summarization Using GPT-4o

基于迭代提示优化的适合阅读障碍者的文本摘要生成研究：以GPT-4o为基础

Bhojwani, Samay, Kain, Swarnima, Xu, Lisong

Abstract

Dyslexia affects approximately 10% of the global population and presents persistent challenges in reading fluency and text comprehension. While existing assistive technologies address visual presentation, linguistic complexity remains a substantial barrier to equitable access. This paper presents an empirical study on dyslexia-friendly text summarization using an iterative prompt-based refinement pipeline built on GPT-4o. We evaluate the pipeline on approximately 2,000 news article samples, applying a readability target of Flesch Reading Ease >= 90. Results show that the majority of summaries meet the readability threshold within four attempts, with many succeeding on the first try. A composite score combining readability and semantic fidelity shows stable performance across the dataset, ranging from 0.13 to 0.73 with a typical value near 0.55. These findings establish an empirical baseline for accessibility-driven NLP summarization and motivate further human-centered evaluation with dyslexic readers.

Chinese Translation

阅读障碍影响全球约10%的人口，并在阅读流畅性和文本理解方面带来持续挑战。尽管现有辅助技术解决了视觉呈现的问题，语言复杂性仍然是实现公平访问的重要障碍。本文提出了一项关于使用基于迭代提示优化的文本摘要生成的实证研究，该方法基于GPT-4o构建。我们在约2000个新闻文章样本上评估该流程，设定可读性目标为Flesch Reading Ease >= 90。结果表明，大多数摘要在四次尝试内达到了可读性阈值，其中许多在第一次尝试时就成功。结合可读性和语义忠实度的综合评分在数据集中表现稳定，范围从0.13到0.73，典型值接近0.55。这些发现为以可访问性为驱动的自然语言处理摘要提供了实证基线，并激励进一步与阅读障碍者进行以人为本的评估。
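
The readability target in this entry, Flesch Reading Ease >= 90, uses the standard formula 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The sketch below shows how such a score and a bounded retry loop might be computed. It is only an illustration under stated assumptions: the vowel-group syllable counter is a rough heuristic, and `refine_until_readable` and its `rewrite` callback are hypothetical names standing in for the GPT-4o prompting step, which the abstract does not describe in detail.

```python
import re
from typing import Callable, Tuple


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per group of consecutive vowels, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease score; higher values mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


def refine_until_readable(
    summary: str,
    rewrite: Callable[[str, float], str],  # hypothetical stand-in for an LLM call
    target: float = 90.0,
    max_attempts: int = 4,
) -> Tuple[str, int, float]:
    """Re-score and rewrite until the target is met or attempts run out.

    Returns (final_summary, attempts_used, final_score).
    """
    score = flesch_reading_ease(summary)
    attempts = 1
    while score < target and attempts < max_attempts:
        summary = rewrite(summary, score)
        score = flesch_reading_ease(summary)
        attempts += 1
    return summary, attempts, score
```

In practice the `rewrite` callback would pass the current summary and its score back to the model with a request for shorter sentences and simpler words. The four-attempt cap mirrors the figure reported in the abstract.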

cs.CL / 13 / 2602.22543

Ruyi2 Technical Report

Ruyi2 技术报告

Song, Huan, Tian, Shuyu, Hao, Junyi, Xu, Minxiu, An, Hongjun, Song, Yiliang, Shao, Jiawei, Li, Xuelong

Abstract

Large Language Models (LLMs) face significant challenges regarding deployment costs and latency, necessitating adaptive computing strategies. Building upon the AI Flow framework, we introduce Ruyi2 as an evolution of our adaptive model series designed for efficient variable-depth computation. While early-exit architectures offer a viable efficiency-performance balance, the Ruyi model and existing methods often struggle with optimization complexity and compatibility with large-scale distributed training. To bridge this gap, Ruyi2 introduces a stable "Familial Model" based on Megatron-LM. By using 3D parallel training, it achieves a 2-3 times speedup over Ruyi, while performing comparably to same-sized Qwen3 models. These results confirm that family-based parameter sharing is a highly effective strategy, establishing a new "Train Once, Deploy Many" paradigm and providing a key reference for balancing architectural efficiency with high-performance capabilities.

Chinese Translation

大型语言模型（LLMs）在部署成本和延迟方面面临重大挑战，因此需要适应性计算策略。在 AI Flow 框架的基础上，我们推出了 Ruyi2，作为我们适应性模型系列的进化版，旨在实现高效的可变深度计算。虽然早期退出架构提供了可行的效率与性能平衡，但 Ruyi 模型及现有方法往往在优化复杂性和与大规模分布式训练的兼容性方面存在困难。为了解决这一问题，Ruyi2 引入了基于 Megatron-LM 的稳定“家族模型”。通过使用 3D 并行训练，它实现了比 Ruyi 快 2-3 倍的速度，同时在性能上与同规模的 Qwen3 模型相当。这些结果证实了基于家族的参数共享是一种非常有效的策略，建立了新的“训练一次，多次部署”范式，为在架构效率与高性能能力之间取得平衡提供了重要参考。

cs.CL / 14 / 2602.22576

Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training

Search-P1：面向路径的奖励塑造以实现稳定高效的自主RAG训练

Xia, Tianle, Xu, Ming, Hu, Lingxiang, Sun, Yiding, Li, Wenwei, Shang, Linfang, Liu, Liqun, Shu, Peng, Yu, Huan, Jiang, Jie

Abstract

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.

Chinese Translation

检索增强生成（RAG）通过结合外部知识来增强大型语言模型（LLMs），然而传统的单轮检索在复杂的多步骤推理中表现不佳。自主RAG通过使LLMs动态决定何时以及检索什么来解决这一问题，但当前基于强化学习的训练方法面临稀疏结果奖励的问题，这导致中间信号被丢弃，并且样本效率低下，失败样本没有任何贡献。我们提出了Search-P1，一个引入面向路径的奖励塑造框架以进行自主RAG训练，包含两个关键组件：（1）面向路径的奖励，通过无序步骤覆盖和软评分评估推理轨迹的结构质量，即使在失败样本中也能提取学习信号；（2）双轨路径评分，利用离线生成的参考规划器，从自一致性和参考对齐的角度评估路径。在多个问答基准上的实验表明，Search-P1在准确性上相比Search-R1和其他强基线取得了显著提升，平均准确率提高了7.7个百分点。

cs.CL / 15 / 2602.22584

Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA

迈向可信的工业广告问答：一种强化共适应框架

Li, Wenwei, Xu, Ming, Xia, Tianle, Hu, Lingxiang, Sun, Yiding, Shang, Linfang, Liu, Liqun, Shu, Peng, Yu, Huan, Jiang, Jie

Abstract

Industrial advertising question answering (QA) is a high-stakes task in which hallucinated content, particularly fabricated URLs, can lead to financial loss, compliance violations, and legal risk. Although Retrieval-Augmented Generation (RAG) is widely adopted, deploying it in production remains challenging because industrial knowledge is inherently relational, frequently updated, and insufficiently aligned with generation objectives. We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for multi-hop, domain-specific evidence selection; and (2) evidence-constrained reinforcement learning via Group Relative Policy Optimization (GRPO) with multi-dimensional rewards covering faithfulness, style compliance, safety, and URL validity. Experiments on an internal advertising QA dataset show consistent gains across expert-judged dimensions including accuracy, completeness, and safety, while reducing the hallucination rate by 72%. A two-week online A/B test demonstrates a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has been running in production for over half a year and has served millions of QA interactions.

Chinese Translation

工业广告问答（QA）是一项高风险任务，其中虚构内容，特别是伪造的URL，可能导致财务损失、合规性违规和法律风险。尽管检索增强生成（RAG）被广泛采用，但在生产中部署仍然具有挑战性，因为工业知识本质上是关系性的，频繁更新，并且与生成目标的对齐程度不足。我们提出了一种强化共适应框架，通过两个组件共同优化检索和生成：(1) 图感知检索（GraphRAG），它在高引用知识子图上建模实体-关系结构，以进行多跳、领域特定的证据选择；(2) 通过群体相对策略优化（GRPO）进行证据约束的强化学习，采用涵盖可信度、风格合规、安全性和URL有效性的多维奖励。在一个内部广告问答数据集上的实验显示，在专家评估的维度上，包括准确性、完整性和安全性，均取得了一致的提升，同时将幻觉率降低了72%。为期两周的在线A/B测试表明，点赞率提高了28.6%，点踩率降低了46.2%，而URL幻觉减少了92.7%。该系统已在生产中运行超过半年，并服务于数百万次问答交互。

cs.CL / 16 / 2602.22661

dLLM: Simple Diffusion Language Modeling

dLLM：简单扩散语言建模

Zhou, Zhanhui, Chen, Lingjie, Tong, Hanghang, Song, Dawn

Abstract

Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad-hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures. To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling -- training, inference, and evaluation -- and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open-source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT-style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research.

Chinese Translation

尽管扩散语言模型（DLMs）正在快速发展，但许多近期模型趋向于采用一组共享组件。然而，这些组件分散在临时研究代码库中，或缺乏透明的实现，使得它们难以复制或扩展。随着该领域的加速发展，显然需要一个统一的框架，以标准化这些共同组件，同时保持足够的灵活性以支持新方法和架构。为了解决这一问题，我们推出了 dLLM，这是一个开源框架，统一了扩散语言建模的核心组件——训练、推理和评估——并使其易于定制以适应新的设计。使用 dLLM，用户可以通过标准化的流程复制、微调、部署和评估开源的大型 DLMs，如 LLaDA 和 Dream。该框架还提供了最小的、可重复的配方，以便从头开始构建小型 DLMs，适用于可获取的计算资源，包括将任何 BERT 风格的编码器或自回归语言模型转换为 DLM。我们还发布了这些小型 DLMs 的检查点，以使 DLMs 更加易于获取，并加速未来的研究。

cs.CL / 17 / 2602.22675

Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

搜索更多，思考更少：重新思考长时间跨度代理搜索的效率与泛化

Chen, Qianben, Qin, Tianrui, Zhu, King, Wang, Qiexiang, Yu, Chengjun, Xu, Shu, Wu, Jiaqi, Zhang, Jiayu, Liu, Xinpeng, Gui, Xin, Cao, Jingyi, Wang, Piaohong, Shi, Dingfeng, Zhu, He, Wang, Tiannan, Wang, Yuqing, Song, Maojia, Zheng, Tianyu, Zhang, Ge, Yang, Jian, Liu, Jiaheng, Liu, Minghao, Jiang, Yuchen Eleanor, Zhou, Wangchunshu

Abstract

Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios. Moreover, generalization across heterogeneous research settings remains challenging. In this work, we propose Search More, Think Less (SMTL), a framework for long-horizon agentic search that targets both efficiency and generalization. SMTL replaces sequential reasoning with parallel evidence acquisition, enabling efficient context management under constrained context budgets. To support generalization across task types, we further introduce a unified data synthesis pipeline that constructs search tasks spanning both deterministic question answering and open-ended research scenarios with task-appropriate evaluation metrics. We train an end-to-end agent using supervised fine-tuning and reinforcement learning, achieving strong and often state-of-the-art performance across benchmarks including BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%). Compared to Mirothinker-v1.0, SMTL with a maximum of 100 interaction steps reduces the average number of reasoning steps on BrowseComp by 70.7%, while improving accuracy.

Chinese Translation

近期的深度研究代理主要通过扩大推理深度来提升性能，但这在搜索密集型场景中导致了高推理成本和延迟。此外，在异构研究环境中实现泛化仍然具有挑战性。在本研究中，我们提出了“搜索更多，思考更少”（Search More, Think Less, SMTL）框架，旨在实现长时间跨度代理搜索的效率与泛化。SMTL用并行证据获取取代了顺序推理，使得在受限上下文预算下能够高效管理上下文。为了支持跨任务类型的泛化，我们进一步引入了一个统一的数据合成管道，构建涵盖确定性问答和开放式研究场景的搜索任务，并配备适当的评估指标。我们使用监督微调和强化学习训练了一个端到端的代理，在包括BrowseComp（48.6%）、GAIA（75.7%）、Xbench（82.0%）和DeepResearch Bench（45.9%）等基准上实现了强劲且常常是最先进的性能。与Mirothinker-v1.0相比，SMTL在最多100个交互步骤下将BrowseComp上的平均推理步骤减少了70.7%，同时提高了准确性。

cs.CL / 18 / 2602.22696

Enhancing Persuasive Dialogue Agents by Synthesizing Cross-Disciplinary Communication Strategies

通过综合跨学科沟通策略增强说服性对话代理

Nozue, Shinnosuke, Nakano, Yuto, Watanabe, Yotaro, Takasaki, Meguru, Moriya, Shoji, Akama, Reina, Suzuki, Jun

Abstract

Current approaches to developing persuasive dialogue agents often rely on a limited set of predefined persuasive strategies that fail to capture the complexity of real-world interactions. We applied a cross-disciplinary approach to develop a framework for designing persuasive dialogue agents that draws on proven strategies from social psychology, behavioral economics, and communication theory. We validated our proposed framework through experiments on two distinct datasets: the Persuasion for Good dataset, which represents a specific in-domain scenario, and the DailyPersuasion dataset, which encompasses a wide range of scenarios. The proposed framework achieved strong results for both datasets and demonstrated notable improvement in the persuasion success rate as well as promising generalizability. Notably, the proposed framework also excelled at persuading individuals with initially low intent, which addresses a critical challenge for persuasive dialogue agents.

Chinese Translation

当前开发说服性对话代理的方法通常依赖于一套有限的预定义说服策略，这些策略未能捕捉到现实世界互动的复杂性。我们采用跨学科的方法，开发了一种设计说服性对话代理的框架，该框架借鉴了社会心理学、行为经济学和传播理论中的有效策略。我们通过在两个不同数据集上的实验验证了我们提出的框架：一个是代表特定领域场景的“为善说服”（Persuasion for Good）数据集，另一个是涵盖广泛场景的“每日说服”（DailyPersuasion）数据集。该框架在两个数据集上均取得了良好的结果，并在说服成功率方面显示出显著改善以及良好的泛化能力。值得注意的是，该框架在说服最初意图较低的个体方面表现优异，解决了说服性对话代理面临的一个关键挑战。

cs.CL / 19 / 2602.22697

Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue

强化现实世界服务代理：在任务导向对话中平衡效用与成本

Gao, Ning, Zhang, Wei, Dai, Yuqin, Shi, Ling, Wang, Ziyin, Wang, Yujie, He, Wei, Wang, Jinpeng, Wang, Chaozheng

Abstract

The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents. However, effectively balancing empathetic communication with budget-aware decision-making remains an open challenge. Since existing methods fail to capture these complex strategic trade-offs, we propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. Specifically, we first establish a User-centric Interaction Framework to provide a high-fidelity training gym, enabling agents to dynamically explore diverse strategies with persona-driven users. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy. By integrating generative process credits and employing a PID-Lagrangian cost controller, CMPO effectively guides the policy to explore the Pareto boundary between user reward and global cost constraints. Extensive experiments on customized real business scenarios demonstrate that InteractCS-RL significantly outperforms other baselines across three evaluation dimensions. Further evaluation on tool-agent-user interaction benchmarks verifies the robustness of InteractCS-RL across diverse domains.

Chinese Translation

大型语言模型（LLMs）的快速发展加速了从对话聊天机器人到通用代理的转变。然而，有效平衡富有同理心的沟通与预算意识的决策仍然是一个未解决的挑战。由于现有方法未能捕捉这些复杂的战略权衡，我们提出了InteractCS-RL，一个将任务导向对话重新构建为多粒度强化学习过程的框架。具体而言，我们首先建立了以用户为中心的交互框架，以提供高保真度的训练环境，使代理能够与个性驱动的用户动态探索多样化的策略。然后，我们引入了成本意识的多轮策略优化（CMPO），采用混合优势估计策略。通过整合生成过程信用并采用PID-拉格朗日成本控制器，CMPO有效引导策略探索用户奖励与全局成本约束之间的帕累托边界。在定制的真实商业场景上的广泛实验表明，InteractCS-RL在三个评估维度上显著优于其他基线。此外，对工具-代理-用户交互基准的进一步评估验证了InteractCS-RL在不同领域的鲁棒性。
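
The abstract names a PID-Lagrangian cost controller but gives no update rule. As a rough illustration of the general idea only, the sketch below treats the Lagrange multiplier as the output of a PID controller driven by the constraint violation (observed cost minus a cost limit). The class name, the gain values, and the `penalized_reward` helper are all assumptions made for this example and are not taken from InteractCS-RL.

```python
from typing import Optional


class PIDLagrangian:
    """Generic PID-controlled Lagrange multiplier (illustrative, not the paper's code)."""

    def __init__(self, cost_limit: float, kp: float = 0.1,
                 ki: float = 0.01, kd: float = 0.05) -> None:
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_cost: Optional[float] = None

    def update(self, avg_cost: float) -> float:
        """Return the multiplier for the next policy update."""
        error = avg_cost - self.cost_limit           # > 0 means the budget is exceeded
        # The integral term accumulates violations and is clamped at zero.
        self.integral = max(0.0, self.integral + error)
        # The derivative term reacts only to rising cost.
        if self.prev_cost is None:
            derivative = 0.0
        else:
            derivative = max(0.0, avg_cost - self.prev_cost)
        self.prev_cost = avg_cost
        lam = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, lam)                         # multiplier must stay non-negative


def penalized_reward(user_reward: float, cost: float, lam: float) -> float:
    """Lagrangian-style objective: reward minus multiplier-weighted cost."""
    return user_reward - lam * cost
```

The multiplier rises while the average cost stays above the limit and falls back toward zero once the constraint is satisfied, which is how a controller of this kind can trade user reward against cost.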

cs.CL / 20 / 2602.22698

Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs

标记化、融合与解耦：弥合大型语言模型与知识图谱之间的粒度不匹配

Su, Siyue, Yang, Jian, Li, Bo, Niu, Guanglin

Abstract

Leveraging Large Language Models (LLMs) for Knowledge Graph Completion (KGC) is promising but hindered by a fundamental granularity mismatch. LLMs operate on fragmented token sequences, whereas entities are the fundamental units in knowledge graph (KG) scenarios. Existing approaches typically constrain predictions to limited candidate sets or align entities with the LLM's vocabulary by pooling multiple tokens or decomposing entities into fixed-length token sequences, which fail to capture both the semantic meaning of the text and the structural integrity of the graph. To address this, we propose KGT, a novel framework that uses dedicated entity tokens to enable efficient, full-space prediction. Specifically, we first introduce specialized tokenization to construct feature representations at the level of dedicated entity tokens. We then fuse pre-trained structural and textual features into these unified embeddings via a relation-guided gating mechanism, avoiding training from scratch. Finally, we implement decoupled prediction by leveraging independent heads to separate and combine semantic and structural reasoning. Experimental results show that KGT consistently outperforms state-of-the-art methods across multiple benchmarks.

Chinese Translation

利用大型语言模型（LLMs）进行知识图谱补全（KGC）是一个有前景的方向，但受到基本粒度不匹配的阻碍。LLMs在碎片化的标记序列上操作，而实体是知识图谱（KGs）场景中的基本单元。现有的方法通常将预测限制在有限的候选集上，或通过汇聚多个标记或将实体分解为固定长度的标记序列来与LLM的词汇对齐，这些方法未能同时捕捉文本的语义意义和图谱的结构完整性。为了解决这个问题，我们提出了KGT，一个新颖的框架，使用专用实体标记来实现高效的全空间预测。具体而言，我们首先引入专门的标记化方法，以在专用实体标记的层面构建特征表示。然后，我们通过关系引导的门控机制将预训练的结构特征和文本特征融合到这些统一的嵌入中，避免从头开始训练。最后，我们通过利用独立的头部实现解耦预测，以分离和结合语义和结构推理。实验结果表明，KGT在多个基准测试中始终优于最先进的方法。

cs.CL / 21 / 2602.22723

Human Label Variation in Implicit Discourse Relation Recognition

隐含话语关系识别中的人类标签变异

Yung, Frances, Ignatev, Daniil, Scholman, Merel, Demberg, Vera, Poesio, Massimo

Abstract

There is growing recognition that many NLP tasks lack a single ground truth, as human judgments reflect diverse perspectives. To capture this variation, models have been developed to predict full annotation distributions rather than majority labels, while perspectivist models aim to reproduce the interpretations of individual annotators. In this work, we compare these approaches on Implicit Discourse Relation Recognition (IDRR), a highly ambiguous task where disagreement often arises from cognitive complexity rather than ideological bias. Our experiments show that existing annotator-specific models perform poorly in IDRR unless ambiguity is reduced, whereas models trained on label distributions yield more stable predictions. Further analysis indicates that frequent cognitively demanding cases drive inconsistency in human interpretation, posing challenges for perspectivist modeling in IDRR.

Chinese Translation

越来越多的研究认识到，许多自然语言处理（NLP）任务缺乏单一的真实标准，因为人类判断反映了多样的视角。为了捕捉这种变异，研究者们开发了预测完整注释分布的模型，而不是仅仅依赖多数标签，同时，视角模型旨在重现个别注释者的解释。在本研究中，我们比较了这些方法在隐含话语关系识别（IDRR）任务上的表现，这是一项高度模糊的任务，其中的分歧往往源于认知复杂性而非意识形态偏见。我们的实验表明，现有的特定注释者模型在IDRR任务中表现不佳，除非减少模糊性，而基于标签分布训练的模型则能提供更稳定的预测。进一步分析表明，频繁出现的认知要求较高的案例驱动了人类解释的一致性问题，这对IDRR中的视角模型构成了挑战。
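
The contrast drawn in this entry between majority labels and full annotation distributions can be made concrete with a small sketch. The function names and the example relation labels below are illustrative assumptions, not the authors' code or label inventory. The sketch turns a set of annotator votes into a soft target and scores a predicted distribution against it with cross-entropy.

```python
import math
from collections import Counter
from typing import Dict, List


def label_distribution(annotations: List[str], label_set: List[str]) -> Dict[str, float]:
    """Relative frequency of each label among the annotators' votes."""
    counts = Counter(annotations)
    total = len(annotations)
    return {label: counts[label] / total for label in label_set}


def majority_label(annotations: List[str]) -> str:
    """Single most frequent label; ties resolve to the label seen first."""
    return Counter(annotations).most_common(1)[0][0]


def soft_cross_entropy(target: Dict[str, float],
                       predicted: Dict[str, float],
                       eps: float = 1e-12) -> float:
    """Cross-entropy of a predicted distribution against a soft target."""
    return -sum(p * math.log(predicted[label] + eps)
                for label, p in target.items() if p > 0)


# Hypothetical votes from four annotators on one implicit relation.
votes = ["Cause", "Cause", "Contrast", "Conjunction"]
labels = ["Cause", "Contrast", "Conjunction"]
target = label_distribution(votes, labels)
```

Training against `target` keeps the information that half of the annotators disagreed with the majority reading, whereas `majority_label(votes)` collapses the item to a single class.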

cs.CL / 22 / 2602.22730

Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks

扩展捷克基于方面的情感分析与意见术语：数据集和大语言模型基准

Šmíd, Jakub, Přibáň, Pavel, Král, Pavel

Abstract

This paper introduces a novel Czech dataset in the restaurant domain for aspect-based sentiment analysis (ABSA), enriched with annotations of opinion terms. The dataset supports three distinct ABSA tasks involving opinion terms, accommodating varying levels of complexity. Leveraging this dataset, we conduct extensive experiments using modern Transformer-based models, including large language models (LLMs), in monolingual, cross-lingual, and multilingual settings. To address cross-lingual challenges, we propose a translation and label alignment methodology leveraging LLMs, which yields consistent improvements. Our results highlight the strengths and limitations of state-of-the-art models, especially when handling the linguistic intricacies of low-resource languages like Czech. A detailed error analysis reveals key challenges, including the detection of subtle opinion terms and nuanced sentiment expressions. The dataset establishes a new benchmark for Czech ABSA, and our proposed translation-alignment approach offers a scalable solution for adapting ABSA resources to other low-resource languages.

Chinese Translation

本文介绍了一个新的捷克餐饮领域数据集，用于基于方面的情感分析（ABSA），并增加了意见术语的注释。该数据集支持三种不同的涉及意见术语的ABSA任务，适应不同复杂程度的需求。利用该数据集，我们使用现代基于Transformer的模型，包括大语言模型（LLMs），在单语、跨语言和多语言环境中进行了广泛的实验。为了解决跨语言挑战，我们提出了一种利用LLMs的翻译和标签对齐方法，该方法带来了持续的改进。我们的结果突显了最先进模型的优势和局限性，特别是在处理捷克等低资源语言的语言复杂性时。详细的错误分析揭示了关键挑战，包括对微妙意见术语和细腻情感表达的检测。该数据集为捷克的ABSA建立了一个新的基准，而我们提出的翻译对齐方法为将ABSA资源适应于其他低资源语言提供了可扩展的解决方案。

cs.CL / 23 / 2602.22752

Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction

利用大型语言模型模拟社交媒体用户：评估条件评论预测的操作有效性

Schwager, Nils, Münker, Simon, Plum, Alistair, Rettinger, Achim

Abstract

The transition of Large Language Models (LLMs) from exploratory tools to active "silicon subjects" in social science lacks extensive validation of operational validity. This study introduces Conditioned Comment Prediction (CCP), a task in which a model predicts how a user would comment on a given stimulus by comparing generated outputs with authentic digital traces. This framework enables a rigorous evaluation of current LLM capabilities with respect to the simulation of social media user behavior. We evaluated open-weight 8B models (Llama3.1, Qwen3, Ministral) in English, German, and Luxembourgish language scenarios. By systematically comparing prompting strategies (explicit vs. implicit) and the impact of Supervised Fine-Tuning (SFT), we identify a critical form vs. content decoupling in low-resource settings: while SFT aligns the surface structure of the text output (length and syntax), it degrades semantic grounding. Furthermore, we demonstrate that explicit conditioning (generated biographies) becomes redundant under fine-tuning, as models successfully perform latent inference directly from behavioral histories. Our findings challenge current "naive prompting" paradigms and offer operational guidelines prioritizing authentic behavioral traces over descriptive personas for high-fidelity simulation.

Chinese Translation

大型语言模型（LLMs）从探索性工具转变为社会科学中的主动“硅基主体”尚缺乏广泛的操作有效性验证。本研究引入了条件评论预测（Conditioned Comment Prediction, CCP）任务，模型通过比较生成的输出与真实数字痕迹，预测用户对给定刺激的评论。这一框架使我们能够对当前LLM在模拟社交媒体用户行为方面的能力进行严格评估。我们在英语、德语和卢森堡语场景中评估了开放权重的8B模型（Llama3.1、Qwen3、Ministral）。通过系统比较提示策略（显式与隐式）及监督微调（Supervised Fine-Tuning, SFT）的影响，我们识别出在低资源环境中形式与内容的关键解耦：尽管SFT使文本输出的表面结构（长度和语法）对齐，但却削弱了语义基础。此外，我们还证明，在微调下，显式条件（生成的传记）变得多余，因为模型能够直接从行为历史中成功执行潜在推理。我们的发现挑战了当前的“天真提示”范式，并提供了操作指南，优先考虑真实的行为痕迹而非描述性角色，以实现高保真模拟。

cs.CL / 24 / 2602.22755

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

AuditBench：评估具有隐性行为模型的对齐审计技术

Sheshadri, Abhay, Ewart, Aidan, Fronsdal, Kai, Gupta, Isha, Bowman, Samuel R., Price, Sara, Marks, Samuel, Wang, Rowan

Abstract

We introduce AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors. Each model has one of 14 concerning behaviors--such as sycophantic deference, opposition to AI regulation, or secret geopolitical loyalties--which it does not confess to when directly asked. AuditBench models are highly diverse--some are subtle, while others are overt, and we use varying training techniques both for implanting behaviors and training models not to confess. To demonstrate AuditBench's utility, we develop an investigator agent that autonomously employs a configurable set of auditing tools. By measuring investigator agent success using different tools, we can evaluate their efficacy. Notably, we observe a tool-to-agent gap, where tools that perform well in standalone non-agentic evaluations fail to translate into improved performance when used with our investigator agent. We find that our most effective tools involve scaffolded calls to auxiliary models that generate diverse prompts for the target. White-box interpretability tools can be helpful, but the agent performs best with black-box tools. We also find that audit success varies greatly across training techniques: models trained on synthetic documents are easier to audit than models trained on demonstrations, with better adversarial training further increasing auditing difficulty. We release our models, agent, and evaluation framework to support future quantitative, iterative science on alignment auditing.

Chinese Translation

我们介绍了AuditBench，一个对齐审计基准。AuditBench由56个植入隐性行为的语言模型组成。每个模型具有14种令人担忧的行为之一，例如阿谀奉承、反对人工智能监管或秘密的地缘政治忠诚，这些行为在被直接询问时不会被承认。AuditBench模型高度多样化——有些行为微妙，而另一些则明显，我们采用不同的训练技术来植入这些行为，并训练模型不去承认。为了展示AuditBench的实用性，我们开发了一个调查代理，该代理自主使用一组可配置的审计工具。通过使用不同工具测量调查代理的成功，我们可以评估这些工具的有效性。值得注意的是，我们观察到工具与代理之间存在差距，即在独立的非代理评估中表现良好的工具在与我们的调查代理一起使用时未能转化为性能的提升。我们发现，最有效的工具涉及对辅助模型的分层调用，这些模型为目标生成多样化的提示。白盒可解释性工具可能有帮助，但代理在使用黑盒工具时表现最佳。我们还发现，审计成功在不同的训练技术之间差异很大：在合成文档上训练的模型比在示范上训练的模型更容易审计，而更好的对抗训练则进一步增加了审计的难度。我们发布了我们的模型、代理和评估框架，以支持未来在对齐审计方面的定量、迭代科学研究。

cs.CL / 25 / 2602.22765

Towards Better RL Training Data Utilization via Second-Order Rollout

通过二阶展开提升强化学习训练数据的利用效率

Yang, Zhe, Wang, Yudong, Li, Rang, Sui, Zhifang

Abstract

Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on improving generation capability by training with only first-order rollout (generating multiple responses for a question). We argue that this approach fails to fully exploit the potential of training data because it neglects critique capability training. To tackle this problem, we introduce the concept of second-order rollout (generating multiple critiques for a response) and propose a unified framework for jointly training generation and critique capabilities. Extensive experiments across various models and datasets demonstrate that our approach can utilize training data more effectively than vanilla RL and achieve better performance under the same training data. Additionally, we uncover several insightful findings regarding second-order rollout and critique training, such as the importance of label balance in critique training and the noise problem of outcome-based rewards, which can be mitigated through sampling techniques. Our work offers a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, providing meaningful inspiration for the further advancement of RL training.

Chinese Translation

强化学习（RL）赋予了大型语言模型（LLMs）强大的推理能力，但传统的强化学习主要通过仅使用一阶展开（为一个问题生成多个响应）来提升生成能力，我们认为这种方法未能充分挖掘训练数据的潜力，因为忽视了对批评能力的训练。为了解决这个问题，我们进一步引入了二阶展开的概念（为一个响应生成多个批评），并提出了一个统一框架，用于联合训练生成和批评能力。针对多种模型和数据集的广泛实验表明，我们的方法能够比传统强化学习更有效地利用训练数据，并在相同训练数据下取得更好的性能。此外，我们揭示了关于二阶展开和批评训练的一些重要发现，例如批评训练中标签平衡的重要性以及基于结果的奖励的噪声问题，这些问题可以通过采样技术得到缓解。我们的工作为动态数据增强和强化学习中的联合生成-批评训练提供了初步探索，为进一步推动强化学习训练的发展提供了有意义的启示。

cs.CL / 26 / 2602.22766

Imagination Helps Visual Reasoning, But Not Yet in Latent Space

想象力促进视觉推理，但尚未在潜在空间中实现

Li, You, Chen, Chi, Li, Yanghao, Zeng, Fanhu, Huang, Kaiyu, Xu, Jinan, Sun, Maosong

Abstract

Latent visual reasoning aims to mimic the human imagination process by meditating through the hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations on the input result in negligible changes to the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations on the latent tokens yield minimal impact on the final answer, indicating that latent tokens exert only a limited causal effect on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and exhibit high similarity. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.

Chinese Translation

潜在视觉推理旨在通过多模态大语言模型的隐藏状态模拟人类的想象过程。尽管被认为是视觉推理的一个有前景的范式，但驱动其有效性的基本机制仍不清楚。为了揭示其有效性的真实来源，我们使用因果中介分析来研究潜在推理的有效性。我们将这一过程建模为一个因果链：输入作为处理，潜在标记作为中介，最终答案作为结果。我们的研究发现了两个关键的断裂：(a) 输入-潜在断裂：对输入的剧烈扰动导致潜在标记的变化微乎其微，表明潜在标记并未有效关注输入序列。(b) 潜在-答案断裂：对潜在标记的扰动对最终答案的影响有限，表明潜在标记对结果的因果影响有限。此外，广泛的探测分析揭示潜在标记编码的视觉信息有限且高度相似。因此，我们质疑潜在推理的必要性，并提出了一种名为 CapImagine 的简单替代方案，该方案教会模型使用文本进行明确的想象。在以视觉为中心的基准测试中的实验表明，CapImagine 显著优于复杂的潜在空间基线，突显了通过明确想象进行视觉推理的优越潜力。

cs.CL / 27 / 2602.22787

Probing for Knowledge Attribution in Large Language Models

探究大型语言模型中的知识归因

Brink, Ivo, Boer, Alexander, Ulmer, Dennis

Abstract

Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations - misusing user context - and (ii) factuality violations - errors from internal knowledge. Proper mitigation depends on knowing whether a model's answer is based on the prompt or its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. For its training, we introduce AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, demonstrating a direct link between knowledge source confusion and unfaithful answers. Yet, models may still respond incorrectly even when attribution is correct, highlighting the need for broader detection frameworks.
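A minimal sketch of the probing idea, assuming toy two-dimensional vectors in place of real model hidden states: a logistic-regression classifier is fit to separate "read from context" examples from "recalled from memory" examples and then scored with Macro-F1. The data, the helper names, and the training settings are illustrative assumptions. This is not the AttriWiki pipeline, and the score is computed on the training data itself.

```python
import math
import random


def train_probe(features, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    """Return 1 if the linear score is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


# Toy "hidden states": label 1 = answer read from context, 0 = recalled from memory.
rng = random.Random(0)
features = [[rng.gauss(1.0, 0.5), rng.gauss(1.0, 0.5)] for _ in range(50)]
features += [[rng.gauss(-1.0, 0.5), rng.gauss(-1.0, 0.5)] for _ in range(50)]
labels = [1] * 50 + [0] * 50

w, b = train_probe(features, labels)
predictions = [predict(w, b, x) for x in features]
print(f"macro-F1 on the toy training data: {macro_f1(labels, predictions):.2f}")
```

With real models the feature vectors would come from a chosen layer's hidden representations, and evaluation would use held-out examples.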

Chinese Translation

大型语言模型（LLMs）常常生成流畅但缺乏依据的陈述或幻觉，这些幻觉可分为两种类型：（i）忠实性违反 - 错误使用用户上下文 - 和（ii）事实性违反 - 来自内部知识的错误。有效的缓解措施依赖于了解模型的回答是基于提示还是其内部权重。本研究聚焦于贡献归因的问题：识别每个输出背后的主要知识来源。我们展示了一种探测器（probe），即在模型隐藏表示上训练的简单线性分类器，可以可靠地预测贡献归因。为了训练该探测器，我们引入了AttriWiki，一个自监督数据管道，促使模型从记忆中回忆被隐含的实体或从上下文中读取它们，自动生成标记示例。基于AttriWiki数据训练的探测器揭示了强烈的归因信号，在Llama-3.1-8B、Mistral-7B和Qwen-7B上达到了高达0.96的宏F1值，并在不重新训练的情况下转移到域外基准（SQuAD、WebQuestions），取得了0.94-0.99的宏F1值。归因不匹配使错误率提高了多达70%，证明了知识来源混淆与不忠实回答之间的直接联系。然而，即使归因正确，模型仍可能错误响应，突显了更广泛检测框架的必要性。


cs.CL / 28 / 2602.22790

Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift

自然语言声明式提示（NLD-P）：一种应对模型漂移的模块化提示设计治理方法

Kim, Hyunwoo, Yi, Hanau, Bae, Jaehee, Kim, Yumin

Abstract

The rapid evolution of large language models (LLMs) has transformed prompt engineering from a localized craft into a systems-level governance challenge. As models scale and update across generations, prompt behavior becomes sensitive to shifts in instruction-following policies, alignment regimes, and decoding strategies, a phenomenon we characterize as GPT-scale model drift. Under such conditions, surface-level formatting conventions and ad hoc refinement are insufficient to ensure stable, interpretable control. This paper reconceptualizes Natural Language Declarative Prompting (NLD-P) as a declarative governance method rather than a rigid field template. NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code. We define minimal compliance criteria, analyze model-dependent schema receptivity, and position NLD-P as an accessible governance framework for non-developer practitioners operating within evolving LLM ecosystems. Portions of drafting and editorial refinement employed a schema-bound LLM assistant configured under NLD-P. All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol. The paper concludes by outlining implications for declarative control under ongoing model evolution and identifying directions for future empirical validation.

Chinese Translation

大型语言模型（LLMs）的快速演变已将提示工程从一种局部工艺转变为系统级治理挑战。随着模型在各代之间的扩展和更新，提示行为对指令遵循政策、对齐机制和解码策略的变化变得敏感，这一现象我们称之为GPT规模的模型漂移。在这种情况下，表面格式约定和临时修正不足以确保稳定、可解释的控制。本文将自然语言声明式提示（NLD-P）重新概念化为一种声明式治理方法，而非一种严格的领域模板。NLD-P被形式化为一种模块化控制抽象，分离了来源、约束逻辑、任务内容和生成后评估，直接用自然语言编码，而不依赖于外部编排代码。我们定义了最小合规标准，分析了模型依赖的模式接受度，并将NLD-P定位为一个可供非开发者从业者在不断演变的LLM生态系统中使用的可访问治理框架。部分草拟和编辑修订工作采用了在NLD-P下配置的模式绑定LLM助手。所有概念框架、方法论主张和最终修订均在记录的人机协作协议下由人类作者指导、审查和批准。本文最后概述了在持续模型演变下声明式控制的影响，并确定了未来实证验证的方向。


cs.CL / 29 / 2602.22827

TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models

TARAZ：用于语言模型文化评估的波斯短答案问题基准

Iranmanesh, Reihaneh, Davoudi, Saeedeh, Abrishamchian, Pasha, Frieder, Ophir, Goharian, Nazli

Abstract

This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian. Existing Persian cultural benchmarks rely predominantly on multiple-choice formats and English-centric metrics that fail to capture Persian's morphological complexity and semantic nuance. Our framework introduces a Persian-specific short-answer evaluation that combines rule-based morphological normalization with a hybrid syntactic and semantic similarity module, enabling robust soft-match scoring beyond exact string overlap. Through systematic evaluation of 15 state-of-the-art open- and closed-source models, we demonstrate that our hybrid evaluation improves scoring consistency by 10% compared to exact-match baselines by capturing meaning that surface-level methods cannot detect. We publicly release our evaluation framework, providing the first standardized benchmark for measuring cultural understanding in Persian and establishing a reproducible foundation for cross-cultural LLM evaluation research.
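The soft-match idea can be sketched roughly as follows, under strong simplifying assumptions. The normalization rules shown (mapping Arabic yeh and kaf to their Persian forms, dropping diacritics, treating the zero-width non-joiner as a space) are common orthographic clean-ups, not the paper's morphological rules. Token overlap stands in for the semantic similarity module, which the abstract does not specify. The names `normalize`, `soft_match`, and `weight` are hypothetical.

```python
import difflib
import re

# Map Arabic code points to the Persian letters they are often confused with.
CHAR_MAP = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})
# Arabic-script short-vowel and related diacritic marks.
DIACRITICS = re.compile("[\u064b-\u0652]")


def normalize(text):
    """Apply a few rule-based orthographic normalizations to Persian text."""
    text = text.translate(CHAR_MAP)
    text = DIACRITICS.sub("", text)
    text = text.replace("\u200c", " ")  # zero-width non-joiner -> space
    return " ".join(text.split())


def soft_match(prediction, reference, weight=0.5):
    """Blend character-level and token-level similarity into a score in [0, 1]."""
    pred, ref = normalize(prediction), normalize(reference)
    char_sim = difflib.SequenceMatcher(None, pred, ref).ratio()
    pred_tokens, ref_tokens = set(pred.split()), set(ref.split())
    union = pred_tokens | ref_tokens
    token_sim = len(pred_tokens & ref_tokens) / len(union) if union else 1.0
    return weight * char_sim + (1 - weight) * token_sim


# The same word typed with an Arabic kaf versus a Persian kaf.
arabic_variant = "\u0643\u062a\u0627\u0628"
persian_variant = "\u06a9\u062a\u0627\u0628"
print(arabic_variant == persian_variant)            # exact match fails
print(soft_match(arabic_variant, persian_variant))  # soft match succeeds
```

An exact-match scorer would mark the two spellings as different, while the normalized soft score treats them as the same answer.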

Chinese Translation

本文提出了一个全面的评估框架，用于评估大型语言模型（LLMs）在波斯语中的文化能力。现有的波斯文化基准主要依赖于多项选择格式和以英语为中心的指标，这些指标无法捕捉波斯语的形态复杂性和语义细微差别。我们的框架引入了一种特定于波斯语的短答案评估，结合了基于规则的形态标准化与混合的句法和语义相似性模块，使得评分能够超越简单的字符串重叠，进行稳健的软匹配评分。通过对15个最先进的开源和闭源模型进行系统评估，我们证明我们的混合评估方法相比于精确匹配基线提高了+10%的评分一致性，能够捕捉表面方法无法检测到的意义。我们公开发布了我们的评估框架，提供了第一个标准化的波斯文化理解测量基准，并为跨文化LLM评估研究建立了可重复的基础。


cs.CL / 30 / 2602.22828

TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought

TCM-DiffRAG：基于知识图谱和思维链的个性化中医证型辨识推理方法

Li, Jianmin, Chang, Ying, Tang, Su-Kit, Liu, Yujia, Wang, Yanwen, Lin, Shuyuan, Ou, Binkai

Abstract

Background: Retrieval augmented generation (RAG) technology can empower large language models (LLMs) to generate more accurate, professional, and timely responses without fine tuning. However, due to the complex reasoning processes and substantial individual differences involved in traditional Chinese medicine (TCM) clinical diagnosis and treatment, traditional RAG methods often exhibit poor performance in this domain. Objective: To address the limitations of conventional RAG approaches in TCM applications, this study aims to develop an improved RAG framework tailored to the characteristics of TCM reasoning. Methods: We developed TCM-DiffRAG, an innovative RAG framework that integrates knowledge graphs (KG) with chains of thought (CoT). TCM-DiffRAG was evaluated on three distinctive TCM test datasets. Results: The experimental results demonstrated that TCM-DiffRAG achieved significant performance improvements over native LLMs. For example, the qwen-plus model achieved scores of 0.927, 0.361, and 0.038, which were significantly enhanced to 0.952, 0.788, and 0.356 with TCM-DiffRAG. The improvements were even more pronounced for non-Chinese LLMs. Additionally, TCM-DiffRAG outperformed directly supervised fine-tuned (SFT) LLMs and other benchmark RAG methods. Conclusions: TCM-DiffRAG shows that integrating structured TCM knowledge graphs with Chain of Thought based reasoning substantially improves performance in individualized diagnostic tasks. The joint use of universal and personalized knowledge graphs enables effective alignment between general knowledge and clinical reasoning. These results highlight the potential of reasoning-aware RAG frameworks for advancing LLM applications in traditional Chinese medicine.

Chinese Translation

背景：检索增强生成（RAG）技术能够使大型语言模型（LLMs）在不进行微调的情况下生成更准确、专业和及时的响应。然而，由于中医（TCM）临床诊断和治疗中涉及的复杂推理过程和显著的个体差异，传统的RAG方法在这一领域往往表现不佳。目的：为了解决传统RAG方法在中医应用中的局限性，本研究旨在开发一个针对中医推理特征的改进RAG框架。方法：我们开发了TCM-DiffRAG，这是一种创新的RAG框架，将知识图谱（KG）与思维链（CoT）相结合。TCM-DiffRAG在三个不同的中医测试数据集上进行了评估。结果：实验结果表明，TCM-DiffRAG在性能上显著优于原生LLMs。例如，qwen-plus模型的得分为0.927、0.361和0.038，而在TCM-DiffRAG的支持下显著提高至0.952、0.788和0.356。对于非中文LLMs，性能提升更为明显。此外，TCM-DiffRAG的表现优于直接监督微调（SFT）LLMs和其他基准RAG方法。结论：TCM-DiffRAG表明，将结构化的中医知识图谱与基于思维链的推理相结合，显著提高了个性化诊断任务的性能。普遍知识图谱与个性化知识图谱的联合使用使得一般知识与临床推理之间实现了有效对齐。这些结果突显了面向推理的RAG框架在推动中医领域LLM应用中的潜力。


cs.CL / 31 / 2602.22846

Improving Neural Argumentative Stance Classification in Controversial Topics with Emotion-Lexicon Features

通过情感词典特征提升争议话题中的神经论证立场分类

Abkenar, Mohammad Yeghaneh, Wang, Weixing, Stede, Manfred, Picca, Davide, Finlayson, Mark A., Ioannidis, Panagiotis

Abstract

Argumentation mining comprises several subtasks, among which stance classification focuses on identifying the standpoint expressed in an argumentative text toward a specific target topic. While arguments, especially about controversial topics, often appeal to emotions, most prior work has not systematically incorporated explicit, fine-grained emotion analysis to improve performance on this task. In particular, prior research on stance classification has predominantly utilized non-argumentative texts and has been restricted to specific domains or topics, limiting generalizability. We work on five datasets from diverse domains encompassing a range of controversial topics and present an approach for expanding the Bias-Corrected NRC Emotion Lexicon using DistilBERT embeddings, which we feed into a Neural Argumentative Stance Classification model. Our method systematically expands the emotion lexicon through contextualized embeddings to identify emotionally charged terms not previously captured in the lexicon. Our expanded NRC lexicon (eNRC) improves over the baseline across all five datasets (up to +6.2 percentage points in F1 score), outperforms the original NRC on four datasets (up to +3.0), and surpasses the LLM-based approach on nearly all corpora. We provide all resources, including eNRC, the adapted corpora, and model architecture, to enable other researchers to build upon our work.

Chinese Translation

论证挖掘包含多个子任务，其中立场分类专注于识别在论证文本中针对特定目标话题所表达的立场。尽管论证，尤其是关于争议话题的论证，通常会诉诸情感，但大多数先前的研究并没有系统地纳入明确的、细致的情感分析以提升该任务的表现。特别是，先前关于立场分类的研究主要利用非论证文本，并且局限于特定领域或话题，限制了其普遍适用性。我们在来自不同领域的五个数据集上进行研究，涵盖了一系列争议话题，并提出了一种扩展偏差校正的NRC情感词典的方法，使用DistilBERT嵌入，将其输入到神经论证立场分类模型中。我们的方法通过上下文化的嵌入系统性地扩展情感词典，以识别词典中未曾捕捉到的情感充沛的词汇。我们扩展的NRC词典（eNRC）在所有五个数据集上均优于基线（F1得分提高最多6.2个百分点），在四个数据集上超越了原始NRC（提高最多3.0），并在几乎所有语料库中超过了基于LLM的方法。我们提供所有资源，包括eNRC、适配后的语料和模型架构，以便其他研究人员能够在我们的工作基础上进行进一步研究。


cs.CL / 32 / 2602.22865

Effective QA-driven Annotation of Predicate-Argument Relations Across Languages

有效的基于问答驱动的跨语言谓词-论元关系注释

Davidov, Jonathan, Slobodkin, Aviv, Klein, Shmuel Tomi, Tsarfaty, Reut, Dagan, Ido, Klein, Ayal

Abstract

Explicit representations of predicate-argument relations form the basis of interpretable semantic analysis, supporting reasoning, generation, and evaluation. However, attaining such semantic structures requires costly annotation efforts and has remained largely confined to English. We leverage the Question-Answer driven Semantic Role Labeling (QA-SRL) framework -- a natural-language formulation of predicate-argument relations -- as the foundation for extending semantic annotation to new languages. To this end, we introduce a cross-linguistic projection approach that reuses an English QA-SRL parser within a constrained translation and word-alignment pipeline to automatically generate question-answer annotations aligned with target-language predicates. Applied to Hebrew, Russian, and French -- spanning diverse language families -- the method yields high-quality training data and fine-tuned, language-specific parsers that outperform strong multilingual LLM baselines (GPT-4o, LLaMA-Maverick). By leveraging QA-SRL as a transferable natural-language interface for semantics, our approach enables efficient and broadly accessible predicate-argument parsing across languages.

Chinese Translation

谓词-论元关系的显式表示构成了解释性语义分析的基础，支持推理、生成和评估。然而，获得这样的语义结构需要昂贵的注释工作，并且在很大程度上仍然局限于英语。我们利用问答驱动的语义角色标注（QA-SRL）框架——一种谓词-论元关系的自然语言表述——作为将语义注释扩展到新语言的基础。为此，我们提出了一种跨语言投影方法，该方法在受限的翻译和词对齐流程中重用英语QA-SRL解析器，以自动生成与目标语言谓词对齐的问题-答案注释。该方法应用于希伯来语、俄语和法语——涵盖了多样的语言家族——产生了高质量的训练数据和经过微调的特定语言解析器，超越了强大的多语言大语言模型基线（GPT-4o，LLaMA-Maverick）。通过利用QA-SRL作为语义的可转移自然语言接口，我们的方法实现了跨语言的高效且广泛可及的谓词-论元解析。


cs.CL / 33 / 2602.22868

Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference

拒绝混合：高效DLLM推理的掩码令牌快速语义传播

Ye, Yushi, Hong, Feng, Zheng, Huangjie, Chen, Xu, Chen, Zhiyong, Wang, Yanfeng, Yao, Jiangchao

Abstract

Diffusion Large Language Models (DLLMs) promise fast non-autoregressive inference but suffer a severe quality-speed trade-off in parallel decoding. This stems from the "combinatorial contradiction" phenomenon, where parallel tokens form semantically inconsistent combinations. We address this by integrating continuous representations into the discrete decoding process, as they preserve rich inter-position dependency. We propose ReMix (Rejection Mixing), a framework that introduces a novel Continuous Mixing State as an intermediate between the initial masked state and the final decoded token state. This intermediate state allows a token's representation to be iteratively refined in a continuous space, resolving mutual conflicts with other tokens before collapsing into a final discrete sample. Furthermore, a rejection rule reverts uncertain representations from the continuous state back to the masked state for reprocessing, ensuring stability and preventing error propagation. ReMix thus mitigates combinatorial contradictions by enabling continuous-space refinement during discrete diffusion decoding. Extensive experiments demonstrate that ReMix, as a training-free method, achieves a 2-8x inference speedup without any quality degradation.
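The three states described in this abstract (masked, continuous mixing, decoded) can be pictured as a small per-position state machine. The sketch below is only an illustration under assumed rules: the `commit` and `reject` thresholds, the use of the top probability as a confidence measure, and the function name `update_position` are all hypothetical, since the abstract does not state how ReMix decides when to commit or reject.

```python
MASK, MIXING, DECODED = "mask", "mixing", "decoded"


def update_position(state, probs, commit=0.9, reject=0.4):
    """Advance one position by one decoding step.

    probs maps candidate tokens to the model's current probabilities.
    Returns (new_state, payload): the token if decoded, the soft
    distribution if mixing, or None if the position is masked again.
    """
    if state == DECODED:
        raise ValueError("decoded positions are final")
    token, confidence = max(probs.items(), key=lambda kv: kv[1])
    if confidence >= commit:
        return DECODED, token   # collapse to a discrete token
    if state == MIXING and confidence < reject:
        return MASK, None       # assumed rejection rule: revert to the mask
    return MIXING, dict(probs)  # keep refining in continuous space


state, payload = update_position(MASK, {"cat": 0.5, "dog": 0.5})
print(state)  # mixing: not confident enough to commit yet
state, payload = update_position(state, {"cat": 0.95, "dog": 0.05})
print(state, payload)  # decoded cat
```

In a real decoder the payload of a mixing position would feed a blended embedding back into the model on the next step; here it is only carried along.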

Chinese Translation

扩散大语言模型（DLLMs）承诺快速非自回归推理，但在并行解码中面临严重的质量与速度权衡。这源于“组合矛盾”现象，其中并行令牌形成语义不一致的组合。我们通过将连续表示集成到离散解码过程中来解决这一问题，因为它们保留了丰富的跨位置依赖关系。我们提出了ReMix（拒绝混合），一个框架，它引入了一种新的连续混合状态，作为初始掩码状态和最终解码令牌状态之间的中介。这个中介状态允许令牌的表示在连续空间中被迭代精炼，在最终坍缩为离散样本之前解决与其他令牌的相互冲突。此外，拒绝规则将不确定的表示从连续状态恢复到掩码状态以进行重新处理，确保稳定性并防止错误传播。因此，ReMix通过在离散扩散解码过程中实现连续空间的精炼，减轻了组合矛盾。大量实验证明，作为一种无训练方法，ReMix在不降低质量的情况下实现了2-8倍的推理加速。


cs.CL / 34 / 2602.22871

Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

通过奖励引导拼接实现扩散语言模型的测试时缩放

Miles, Roy, Toker, Aysim, Oncescu, Andreea-Maria, Xu, Songcen, Deng, Jiankang, Elezi, Ismail

Abstract

Reasoning with large language models often benefits from generating multiple chains-of-thought, but existing aggregation strategies are typically trajectory-level (e.g., selecting the best trace or voting on the final answer), discarding useful intermediate work from partial or "nearly correct" attempts. We propose Stitching Noisy Diffusion Thoughts, a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates. Given a problem, we (i) sample many diverse, low-cost reasoning trajectories using a masked diffusion language model, (ii) score every intermediate step with an off-the-shelf process reward model (PRM), and (iii) stitch the highest-quality steps across trajectories into a composite rationale. This rationale then conditions an autoregressive (AR) model (solver) to recompute only the final answer. This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search. Across math reasoning benchmarks, we find that step-level recombination is most beneficial on harder problems, and ablations highlight the importance of the final AR solver in converting stitched but imperfect rationales into accurate answers. Using low-confidence diffusion sampling with parallel, independent rollouts, our training-free framework improves average accuracy by up to 23.8% across six math and coding tasks. At the same time, it achieves up to a 1.8x latency reduction relative to both traditional diffusion models (e.g., Dream, LLaDA) and unified architectures (e.g., TiDAR). Code is available at https://github.com/roymiles/diffusion-stitching.
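Step (iii) above can be sketched as a per-step selection, assuming the simplest possible stitching rule: at each step index, keep the candidate with the highest PRM score across trajectories. The abstract does not spell out the actual rule, so this argmax policy, the function name `stitch`, and the hard-coded scores are illustrative assumptions. A real pipeline would obtain the scores from a process reward model.

```python
def stitch(trajectories, step_scores):
    """Pick, at each step index, the best-scoring step across all trajectories."""
    longest = max(len(traj) for traj in trajectories)
    rationale = []
    for i in range(longest):
        # Collect (score, step) pairs from every trajectory that has a step i.
        candidates = [
            (step_scores[k][i], traj[i])
            for k, traj in enumerate(trajectories)
            if i < len(traj)
        ]
        best_score, best_step = max(candidates, key=lambda c: c[0])
        rationale.append(best_step)
    return rationale


# Two sampled attempts at computing 2 * (3 + 4); the first slips on its last step.
trajectories = [
    ["3 + 4 = 7", "2 * 7 = 15"],
    ["3 + 4 = 7", "2 * 7 = 14"],
]
# Hypothetical PRM scores in [0, 1], one per step.
step_scores = [
    [0.9, 0.2],
    [0.6, 0.8],
]
print(stitch(trajectories, step_scores))  # ['3 + 4 = 7', '2 * 7 = 14']
```

Steps chosen from different trajectories need not be mutually consistent, which is one reason the abstract has a separate AR solver recompute the final answer from the stitched rationale.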

Chinese Translation

使用大型语言模型进行推理通常受益于生成多个思维链，但现有的聚合策略通常是轨迹级别的（例如，选择最佳轨迹或对最终答案进行投票），从而丢弃了来自部分或“几乎正确”尝试的有用中间工作。我们提出了拼接噪声扩散思维（Stitching Noisy Diffusion Thoughts），这是一个自一致性框架，将廉价的扩散采样推理转化为可重用的步骤级候选池。给定一个问题，我们（i）使用掩蔽扩散语言模型采样许多多样化的低成本推理轨迹，（ii）使用现成的过程奖励模型（Process Reward Model, PRM）对每个中间步骤进行评分，以及（iii）将这些最高质量的步骤在轨迹间拼接成一个复合推理。这一推理随后为自回归（Autoregressive, AR）模型（求解器）提供条件，以重新计算最终答案。这个模块化管道将探索（扩散）与评估和解决方案合成分开，避免了单一统一混合体，同时保留了广泛搜索。在数学推理基准测试中，我们发现步骤级重组在更难的问题上最为有利，消融实验突显了最终AR求解器在将拼接但不完美的推理转化为准确答案中的重要性。通过使用低置信度的扩散采样与并行、独立的回滚，我们的无训练框架在六个数学和编码任务中将平均准确率提高了多达23.8%。同时，相较于传统扩散模型（如Dream, LLaDA）和统一架构（如TiDAR），它实现了高达1.8倍的延迟减少。代码可在 https://github.com/roymiles/diffusion-stitching 获取。


cs.CL / 35 / 2602.22918

Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models

视觉转化为文本：定位视觉语言模型中的OCR路由瓶颈

Steinberg, Jonathan, Gal, Oren

Abstract

Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
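To make the "PC1 captures most of the variance" claim concrete, the sketch below computes the share of variance explained by the first principal component of a set of activation-difference vectors. It uses power iteration on the covariance matrix with only the standard library. The toy vectors and the function name `pc1_explained_variance` are assumptions for illustration and are unrelated to the paper's data.

```python
import math


def pc1_explained_variance(diffs, iters=200):
    """Return (variance share of the top principal component, its direction)."""
    n, d = len(diffs), len(diffs[0])
    mean = [sum(row[j] for row in diffs) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in diffs]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    top = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    total = sum(cov[i][i] for i in range(d))
    return top / total, v


# Toy differences (original minus text-inpainted activations) lying on one line.
diffs = [[t, 2.0 * t, 0.0] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
share, direction = pc1_explained_variance(diffs)
print(f"PC1 variance share: {share:.3f}")
```

A share close to 1 indicates that the differences are essentially one-dimensional; the abstract reports 72.9% for the OCR signal.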

Chinese Translation

视觉语言模型（VLMs）能够从图像中读取文本，但光学字符识别（OCR）信息是如何进入语言处理流的？我们通过因果干预研究了三种架构家族（Qwen3-VL、Phi-4、InternVL3.5）中的OCR路由机制。通过计算原始图像与文本修补版本之间的激活差异，我们识别出特定于架构的OCR瓶颈，其主导位置取决于视觉语言集成策略：DeepStack模型（Qwen）在中层（约50%）对场景文本表现出峰值敏感性，而单阶段投影模型（Phi-4、InternVL）在早期层（6-25%）达到峰值，尽管最大效应的确切层在不同数据集之间有所变化。OCR信号的维度非常低：主成分1（PC1）捕获了72.9%的方差。重要的是，在一个数据集上学习的主成分分析（PCA）方向可以转移到其他数据集，表明共享的文本处理路径。令人惊讶的是，在具有模块化OCR电路的模型中（尤其是Qwen3-VL-4B），去除OCR可以提高计数性能（最多提高6.9个百分点），这表明在足够模块化的架构中，OCR干扰了其他视觉处理。


cs.CL / 36 / 2602.23057

Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention

仿射缩放注意力：朝着灵活和稳定的Transformer注意力

Bae, Jeongin, Park, Baeseong, Park, Gunho, Kim, Minsub, Lee, Joonhyung, Yoo, Junhee, Woo, Sunghyeon, Ryu, Jiwon, Kwon, Se Jung, Lee, Dongsoo

Abstract

Transformer attention is typically implemented using softmax normalization, which constrains the attention weights to sum to one. While effective in many settings, this constraint can limit flexibility in controlling attention magnitudes and may contribute to overly concentrated or unstable attention patterns during training. Prior work has explored modifications such as attention sinks or gating mechanisms, but these approaches provide only limited or indirect control over attention reweighting. We propose Affine-Scaled Attention, a simple extension to standard attention that introduces input-dependent scaling and a corresponding bias term applied to softmax-normalized attention weights. This design relaxes the strict normalization constraint while maintaining aggregation of value representations, allowing the model to adjust both the relative distribution and the scale of attention in a controlled manner. We empirically evaluate Affine-Scaled Attention in large-scale language model pretraining across multiple model sizes. Experimental results show consistent improvements in training stability, optimization behavior, and downstream task performance compared to standard softmax attention and attention sink baselines. These findings suggest that modest reweighting of attention outputs provides a practical and effective way to improve attention behavior in Transformer models.
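A minimal single-query sketch of the idea, assuming the affine form `alpha * softmax(scores) + beta` for the attention weights. The abstract says the scale and bias are input-dependent but does not say how they are computed, so here `alpha` and `beta` are plain arguments, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affine_scaled_attention(scores, values, alpha=1.0, beta=0.0):
    """Attend over values with weights alpha * softmax(scores) + beta."""
    weights = [alpha * w + beta for w in softmax(scores)]
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output


scores = [1.0, 2.0, 3.0]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# alpha=1, beta=0 recovers standard softmax attention (weights sum to 1).
standard_weights, _ = affine_scaled_attention(scores, values)
# A different scale and bias lets the total attention mass differ from 1.
scaled_weights, _ = affine_scaled_attention(scores, values, alpha=0.5, beta=0.1)
print(sum(standard_weights), sum(scaled_weights))
```

With three keys, `alpha=0.5` and `beta=0.1` give a total weight of 0.5 + 3 * 0.1 = 0.8, which shows how the unit-sum constraint is relaxed while the values are still aggregated by a weighted sum.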

Chinese Translation

Transformer注意力通常通过softmax归一化实现，这种方法强制注意力权重的总和为单位。虽然在许多场景中有效，但这一约束可能限制了对注意力幅度的灵活控制，并可能导致训练过程中注意力模式过于集中或不稳定。先前的研究探索了诸如注意力汇聚或门控机制等修改方案，但这些方法仅提供了有限或间接的注意力重加权控制。我们提出了仿射缩放注意力（Affine-Scaled Attention），这是对标准注意力的简单扩展，引入了依赖于输入的缩放和相应的偏置项，应用于softmax归一化的注意力权重。这一设计放宽了严格的归一化约束，同时保持了值表示的聚合，允许模型以受控的方式调整注意力的相对分布和规模。我们在多个模型规模的大规模语言模型预训练中对仿射缩放注意力进行了实证评估。实验结果显示，与标准softmax注意力和注意力汇聚基线相比，训练稳定性、优化行为和下游任务性能均有一致的改善。这些发现表明，适度重加权注意力输出提供了一种实用且有效的方式来改善Transformer模型中的注意力行为。


cs.CL / 37 / 2602.23062

Toward Automatic Filling of Case Report Forms: A Case Study on Data from an Italian Emergency Department

朝向自动填写病例报告表：来自意大利急诊科数据的案例研究

Kaczmarek, Gabriela Anna, Ferrazzi, Pietro, Porta, Lorenzo, Rubini, Vicky, Magnini, Bernardo

Abstract

Case Report Forms (CRFs) collect data about patients and are at the core of well-established practices to conduct research in clinical settings. With the recent progress of language technologies, there is an increasing interest in automatic CRF-filling from clinical notes, mostly based on the use of Large Language Models (LLMs). However, there is a general scarcity of annotated CRF data, both for training and testing LLMs, which limits the progress on this task. As a step in the direction of providing such data, we present a new dataset of clinical notes from an Italian Emergency Department annotated with respect to a pre-defined CRF containing 134 items to be filled. We provide an analysis of the data, define the CRF-filling task and metric for its evaluation, and report on pilot experiments where we use an open-source state-of-the-art LLM to automatically execute the task. Results of the case-study show that (i) CRF-filling from real clinical notes in Italian can be approached in a zero-shot setting; (ii) LLMs' results are affected by biases (e.g., a cautious behaviour favours "unknown" answers), which need to be corrected.

Chinese Translation

病例报告表（CRFs）收集有关患者的数据，是在临床环境中进行研究的成熟实践的核心。随着语言技术的最新进展，自动从临床记录中填写CRF的兴趣日益增加，主要基于大型语言模型（LLMs）的使用。然而，标注的CRF数据普遍稀缺，既用于训练也用于测试LLMs，这限制了这一任务的进展。作为提供此类数据的一步，我们呈现了一个来自意大利急诊科的临床记录新数据集，该数据集根据一个包含134项待填写内容的预定义CRF进行了标注。我们对数据进行了分析，定义了CRF填写任务及其评估指标，并报告了使用开源先进LLM自动执行该任务的初步实验结果。案例研究的结果表明：（i）可以在零样本设置中从意大利的真实临床记录中进行CRF填写；（ii）LLMs的结果受到偏见的影响（例如，谨慎的行为倾向于选择“未知”答案），这些偏见需要被纠正。


cs.CL / 38 / 2602.23071

Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody

数量收敛，质量分歧：解构二语普通话韵律中的流利度与准确性

Shi, Yuqi, Yang, Hao, Lu, Xiyao, Zhang, Jinsong

Abstract

While second language (L2) learners may acquire target syntactic word order, mapping this syntax onto appropriate prosodic structures remains a persistent challenge. This study investigates the fossilization and stability of the L2 syntax-prosody interface by comparing 67 native Mandarin speakers with 67 Vietnamese learners using the BLCU-SAIT corpus. By integrating C-ToBI boundary annotation with Dependency Grammar analysis, we examined both the quantity of prosodic boundaries and their mapping to syntactic relations. Results reveal a non-linear acquisition: although high-proficiency learners (VNH) converge to the native baseline in boundary quantity at the Major Phrase level (B3), their structural mapping significantly diverges. Specifically, VNH demote the prosodic boundary at the Subject-Verb (SBV) interface (Major Phrase B3 -> Prosodic Word B1), while erroneously promoting the boundary at the Verb-Object (VOB) interface (Prosodic Word B1 -> Major Phrase B3). This strategy allows learners to maintain high long phrasal output at the expense of structural accuracy. This results in a distorted prosodic hierarchy where the native pattern is inverted.

Chinese Translation

虽然二语（L2）学习者可能掌握目标句法词序，但将这种句法映射到适当的韵律结构上仍然是一个持续的挑战。本研究通过比较67名母语为普通话的说话者与67名越南学习者（使用BLCU-SAIT语料库），探讨了L2句法-韵律接口的固化与稳定性。通过将C-ToBI边界注释与依存语法分析相结合，我们考察了韵律边界的数量及其与句法关系的映射。结果显示出一种非线性的习得过程：尽管高水平学习者（VNH）在主要短语层面（B3）的边界数量上与母语基线趋同，但其结构映射却显著分歧。具体而言，VNH在主谓（SBV）接口处降低了韵律边界（主要短语B3 -> 韵律词B1），而错误地在动宾（VOB）接口处提升了边界（韵律词B1 -> 主要短语B3）。这一策略使学习者能够在牺牲结构准确性的情况下保持较高的长短语输出。这导致了一个扭曲的韵律层级，其中母语模式被颠倒。


cs.CL / 39 / 2602.23075

CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery

CiteLLM：一个可信赖的科学参考发现代理平台

Hong, Mengze, Jiang, Di, Zhang, Chen Jason, Guo, Zichang, Li, Yawen, Chen, Jun, Cui, Shaobo, Su, Zhiyang

Abstract

Large language models (LLMs) have created new opportunities to enhance the efficiency of scholarly activities; however, challenges persist in the ethical deployment of AI assistance, including (1) the trustworthiness of AI-generated content, (2) preservation of academic integrity and intellectual property, and (3) protection of information privacy. In this work, we present CiteLLM, a specialized agentic platform designed to enable trustworthy reference discovery for grounding author-drafted claims and statements. The system introduces a novel interaction paradigm by embedding LLM utilities directly within the LaTeX editor environment, ensuring a seamless user experience and no data transmission outside the local system. To guarantee hallucination-free references, we employ dynamic discipline-aware routing to retrieve candidates exclusively from trusted web-based academic repositories, while leveraging LLMs solely for generating context-aware search queries, ranking candidates by relevance, and validating and explaining support through paragraph-level semantic matching and an integrated chatbot. Evaluation results demonstrate the superior performance of the proposed system in returning valid and highly usable references.

Chinese Translation

大型语言模型（LLMs）为提升学术活动的效率创造了新的机会；然而，在人工智能辅助的伦理部署方面仍然存在挑战，包括（1）人工智能生成内容的可信度，（2）学术诚信和知识产权的保护，以及（3）信息隐私的保护。在本研究中，我们提出了CiteLLM，一个专门设计的代理平台，旨在实现可信赖的参考文献发现，以支持作者撰写的论断和陈述。该系统通过将LLM工具直接嵌入LaTeX编辑环境，引入了一种新颖的交互范式，确保了无缝的用户体验，并且没有数据传输到本地系统之外。为了确保无幻觉的参考文献，我们采用动态学科感知路由，从可信的基于网络的学术库中独家检索候选文献，同时仅利用LLMs生成上下文感知的搜索查询，按相关性对候选文献进行排名，并通过段落级语义匹配和集成聊天机器人验证和解释支持。评估结果表明，所提系统在返回有效且高度可用的参考文献方面表现优越。


cs.CL / 40 / 2602.23079

Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent

评估基于风格分析的 LLM 代理的去匿名化风险

Zhang, Boyang, Zhang, Yang

Abstract

The rapid advancement of large language models (LLMs) has enabled powerful authorship inference capabilities, raising growing concerns about unintended deanonymization risks in textual data such as news articles. In this work, we introduce an LLM agent designed to evaluate and mitigate such risks through a structured, interpretable pipeline. Central to our framework is the proposed $\textit{SALA}$ (Stylometry-Assisted LLM Analysis) method, which integrates quantitative stylometric features with LLM reasoning for robust and transparent authorship attribution. Experiments on large-scale news datasets demonstrate that $\textit{SALA}$, particularly when augmented with a database module, achieves high inference accuracy in various scenarios. Finally, we propose a guided recomposition strategy that leverages the agent's reasoning trace to generate rewriting prompts, effectively reducing authorship identifiability while preserving textual meaning. Our findings highlight both the deanonymization potential of LLM agents and the importance of interpretable, proactive defenses for safeguarding author privacy.

Chinese Translation

大型语言模型（LLMs）的快速发展使得强大的作者身份推断能力成为可能，这引发了对新闻文章等文本数据中意外去匿名化风险的日益关注。在本研究中，我们介绍了一种旨在通过结构化、可解释的流程评估和减轻此类风险的 LLM 代理。我们框架的核心是提出的 $\textit{SALA}$（风格分析辅助的 LLM 分析）方法，该方法将定量的风格特征与 LLM 推理相结合，以实现稳健且透明的作者身份归属。在大规模新闻数据集上的实验表明，$\textit{SALA}$，特别是在增强了数据库模块的情况下，在各种场景中实现了高推断准确性。最后，我们提出了一种引导重组策略，利用代理的推理轨迹生成重写提示，有效地降低了作者身份的可识别性，同时保留了文本的意义。我们的研究结果突显了 LLM 代理的去匿名化潜力，以及可解释的、主动的防御措施在保护作者隐私方面的重要性。


cs.CL / 41 / 2602.23136

Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs

模态崩溃作为不匹配解码：多模态大语言模型的信息论极限

Billa, Jayadev

Abstract

Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture. We show this is not a failure of encoding: speaker identity, emotion, and visual attributes survive through every LLM layer (3--55$\times$ above chance in linear probes), yet removing 64--71% of modality-specific variance improves decoder loss. The decoder has no learned use for these directions; their presence is noise. We formalize this as a mismatched decoder problem: a decoder trained on text can only extract information along text-aligned directions. Accessible information is bounded by the Generalized Mutual Information (GMI), with degradation scaling with distributional distance and decoder sensitivity. The bound is a property of the decoder's scoring rule, not of any particular architecture; it applies whether non-text inputs arrive through a learned projection, a discrete codebook, or no explicit adapter at all. We validate this across five models spanning speech and vision. A controlled experiment (two Prismatic VLMs differing only in encoder text-alignment) confirms the bottleneck is the decoder's scoring rule, not the encoder or projection. A LoRA intervention demonstrates the fix: training with an emotion objective improves emotion accessibility ($+$7.5%) without affecting other attributes, confirming that the training objective determines what becomes accessible.

Chinese Translation

多模态大语言模型（LLMs）能够处理语音和图像，但它们无法听到说话者的声音或看到物体的纹理。我们表明这并不是编码的失败：说话者的身份、情感和视觉属性在每个LLM层中都得以保留（在线性探测中比随机猜测高出3到55倍），然而去除64%到71%的模态特定方差却改善了解码器的损失。解码器对这些方向没有学习到的使用；它们的存在只是噪声。我们将其形式化为不匹配解码器问题：一个在文本上训练的解码器只能沿着与文本对齐的方向提取信息。可获取的信息受限于广义互信息（Generalized Mutual Information, GMI），其退化与分布距离和解码器敏感性成比例。这个界限是解码器评分规则的属性，而不是任何特定架构的属性；无论非文本输入是通过学习的投影、离散代码本，还是根本没有显式适配器到达，均适用。我们在五个涵盖语音和视觉的模型中验证了这一点。一项控制实验（两个仅在编码器文本对齐上不同的Prismatic VLMs）确认瓶颈是解码器的评分规则，而不是编码器或投影。一个LoRA干预展示了修复方法：以情感目标进行训练提高了情感可达性（+7.5%），而不影响其他属性，确认了训练目标决定了哪些内容变得可获取。


cs.CL / 42 / 2602.23184

MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations

MTRAG-UN：多轮检索增强生成对话中的开放挑战基准

Rosenthal, Sara, Katsis, Yannis, Shah, Vraj, He, Lihong, Popa, Lucian, Danilevsky, Marina

Abstract

We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models. We release a benchmark of 666 tasks containing over 2,800 conversation turns across 6 domains with accompanying corpora. Our experiments show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses. Our benchmark is available at https://github.com/IBM/mt-rag-benchmark

Chinese Translation

我们提出了MTRAG-UN，这是一个用于探索多轮检索增强生成（retrieval augmented generation）中开放挑战的基准，这是大型语言模型的一种热门应用。我们发布了一个包含666个任务的基准，涵盖6个领域的2800多个对话轮次，并附带相应的语料库。我们的实验表明，检索和生成模型在处理无法回答（UNanswerable）、不明确（UNderspecified）和非独立（NONstandalone）问题以及不清晰（UNclear）响应的对话时仍然面临困难。我们的基准可在https://github.com/IBM/mt-rag-benchmark获取。


cs.CL / 43 / 2602.23197

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

在上下文学习中无遗忘的微调：线性注意力模型的理论分析

Lee, Chungpa, Sohn, Jy-yong, Lee, Kangwook

Abstract

Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We empirically validate our theoretical results.
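As a rough illustration of the setting, the sketch below implements a softmax-free (linear) attention readout for in-context linear regression. It assumes a simplified formulation commonly used in the in-context-learning literature rather than the exact model analyzed in the paper; the function `linear_attention_predict` and the parameter matrix `W` are hypothetical names introduced here.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(W: Sequence[Vector], v: Vector) -> List[float]:
    """Multiply matrix W (given as a list of rows) by vector v."""
    return [dot(row, v) for row in W]


def linear_attention_predict(
    xs: Sequence[Vector], ys: Sequence[float], x_query: Vector, W: Sequence[Vector]
) -> float:
    """Few-shot prediction with linear attention (no softmax).

    Each demonstration (x_i, y_i) contributes y_i weighted by the
    unnormalized score x_i^T W x_query; contributions are averaged over n.
    """
    n = len(xs)
    Wq = matvec(W, x_query)
    return sum(y * dot(x, Wq) for x, y in zip(xs, ys)) / n


# Toy task (assumed for illustration): y = w_true . x,
# with demonstrations placed on the standard basis vectors.
w_true = [2.0, -1.0, 0.5]
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ys = [dot(w_true, x) for x in xs]

# Assumed attention parameter: 3 * identity, which cancels the 1/n average here.
W = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]]

x_query = [1.0, 2.0, 4.0]
prediction = linear_attention_predict(xs, ys, x_query, W)
```

In this toy setup the prediction matches `dot(w_true, x_query)`. Changing `W`, as fine-tuning would, changes the few-shot prediction for the same demonstrations, which is the kind of effect the abstract describes.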

Chinese Translation

基于Transformer的大型语言模型表现出上下文学习的能力，使其能够通过少量示例提示适应下游任务。在实践中，这些模型通常会进行微调，以提高其在下游任务上的零-shot性能，从而使其能够在没有示例的情况下解决任务，降低推理成本。然而，微调可能会削弱上下文学习，限制微调模型在未见任务上的性能。利用线性注意力模型，我们提供了一个理论分析，描述了微调目标如何修改注意力参数，并识别出导致少量示例性能下降的条件。我们表明，微调所有注意力参数可能会损害上下文学习，而将更新限制在值矩阵上则可以提高零-shot性能，同时保持上下文学习能力。我们进一步表明，结合辅助的少量示例损失主要增强了目标任务上的上下文学习能力，但以牺牲未见任务的上下文学习能力为代价。我们通过实证研究验证了我们的理论结果。

cs.CL / 44 / 2602.23225

Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

为何扩散语言模型在真正的并行（非自回归）解码中表现不佳？

Li, Pengxiang, Muhtar, Dilxat, Yin, Lu, Chen, Tianlong, Liu, Shiwei

Abstract

Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.

Chinese Translation

扩散语言模型（DLMs）常被宣传为能够实现并行的标记生成，然而实际的快速DLMs往往趋向于左到右的自回归（AR）解码动态。相比之下，真正的非自回归生成具有很大的潜力，因为它消除了自回归的顺序瓶颈，更好地利用并行硬件以减少同步/通信开销，并改善输出长度的延迟扩展。我们认为，AR样解码的主要驱动因素是DLM目标与广泛使用的训练数据（包括标准预训练语料库和长链思维（CoT）监督）之间的高度顺序结构不匹配。基于这一诊断，我们提出了NAP（非自回归并行DLMs），这是一种以数据为中心的概念验证方法，更好地将监督与非自回归并行解码对齐。NAP将示例策划为多个独立的推理轨迹，并结合一种并行强制解码策略，鼓励多标记的并行更新。在数学推理基准测试中，NAP在并行解码下的表现优于在标准长CoT数据上训练的DLMs，且随着并行性增加，性能提升愈加显著。我们的结果表明，重新审视数据和监督是缓解AR样行为并朝着真正的非自回归并行生成迈进的一个原则性方向。我们的代码可在 https://github.com/pixeli99/NAP 获取。

cs.CL / 45 / 2602.23266

Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

考虑语篇的双轨流式响应用于低延迟口语对话系统

Liu, Siyuan, Xu, Jiahui, Jiang, Feng, Wang, Kuang, Zhao, Zefeng, Huang, Chu-Ren, Gu, Jinghang, Yin, Changqing, Li, Haizhou

Abstract

Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems. Conventional ASR-LLM-TTS pipelines follow a strictly sequential paradigm, requiring complete transcription and full reasoning before speech synthesis can begin, which results in high response latency. We propose the Discourse-Aware Dual-Track Streaming Response (DDTSR) framework, a low-latency architecture that enables listen-while-thinking and speak-while-thinking. DDTSR is built upon three key mechanisms: (1) connective-guided small-large model synergy, where an auxiliary small model generates minimal-committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel; (2) streaming-based cross-modal collaboration, which dynamically overlaps ASR, LLM inference, and TTS to advance the earliest speakable moment; and (3) curriculum-learning-based discourse continuity enhancement, which maintains coherence and logical consistency between early responses and subsequent reasoning outputs. Experiments on two spoken dialogue benchmarks demonstrate that DDTSR reduces response latency by 19%-51% while preserving discourse quality. Further analysis shows that DDTSR functions as a plug-and-play module compatible with diverse LLM backbones, and remains robust across varying utterance lengths, indicating strong practicality and scalability for real-time spoken interaction.
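The small-large model synergy in mechanism (1) can be pictured with a minimal concurrency sketch. Everything here is an assumption for illustration: `small_model_connective` and `large_model_answer` are hypothetical stand-ins with simulated latencies, not the DDTSR components, and the streaming ASR and TTS stages are omitted.

```python
import asyncio
from typing import List


async def small_model_connective(user_text: str) -> str:
    """Stand-in for the auxiliary small model: returns a low-commitment opener quickly."""
    await asyncio.sleep(0.01)  # simulated short latency
    return "Well,"


async def large_model_answer(user_text: str) -> str:
    """Stand-in for the large model: slower, knowledge-intensive reasoning."""
    await asyncio.sleep(0.05)  # simulated longer latency
    return "the museum opens at nine."


async def dual_track_response(user_text: str) -> List[str]:
    """Run both tracks concurrently and emit the connective as soon as it is ready."""
    spoken: List[str] = []
    # Start the large model first so its reasoning overlaps with the small model.
    answer_task = asyncio.create_task(large_model_answer(user_text))
    connective = await small_model_connective(user_text)
    spoken.append(connective)  # a real system would hand this to TTS immediately
    spoken.append(await answer_task)
    return spoken


segments = asyncio.run(dual_track_response("When does the museum open?"))
```

The connective is available after the short latency alone, while the full answer arrives once the slower track finishes. Keeping the two segments coherent is the job the abstract assigns to mechanism (3), which this sketch does not model.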

Chinese Translation

实现类人响应能力是级联口语对话系统的一个关键但具有挑战性的目标。传统的ASR-LLM-TTS管道遵循严格的顺序范式，要求在语音合成开始之前完成完整的转录和全面的推理，这导致了高响应延迟。我们提出了考虑语篇的双轨流式响应（Discourse-Aware Dual-Track Streaming Response, DDTSR）框架，这是一种低延迟架构，能够实现边听边思考和边说边思考。DDTSR建立在三个关键机制之上：（1）连接引导的小大模型协同，其中辅助的小模型生成最小承诺的语篇连接词，而大型模型则并行进行知识密集型推理；（2）基于流式的跨模态协作，动态重叠ASR、LLM推理和TTS，以推进最早可发言时刻；（3）基于课程学习的语篇连续性增强，保持早期响应与后续推理输出之间的一致性和逻辑连贯性。在两个口语对话基准上的实验表明，DDTSR在保持语篇质量的同时将响应延迟减少了19%-51%。进一步分析表明，DDTSR作为一个即插即用模块，与多种LLM骨干网络兼容，并在不同发话长度下保持稳健性，显示出在实时口语交互中的强大实用性和可扩展性。

cs.CL / 46 / 2602.23286

SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

SPARTA：可扩展且有原则的树结构多跳问答基准测试，涵盖文本和表格

Park, Sungho, Kim, Jueun, Han, Wook-Shin

Abstract

Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated (and therefore error-prone), and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at https://github.com/pshlego/SPARTA/tree/main.
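To make the idea of nested queries with one nested predicate per hop more concrete, here is a small sketch using Python's built-in `sqlite3` module. The schema, table names, and data are invented for illustration and are not taken from the SPARTA benchmark; the `scored` table plays the role of a grounding table whose rows would be facts extracted from text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE team (name TEXT, city TEXT);           -- source table
    CREATE TABLE player (name TEXT, team_name TEXT);    -- source table
    CREATE TABLE scored (player TEXT, points INTEGER);  -- grounding table (facts from text)

    INSERT INTO team VALUES ('Hawks', 'Boston'), ('Owls', 'Denver');
    INSERT INTO player VALUES ('Ana', 'Hawks'), ('Bo', 'Hawks'), ('Cy', 'Owls');
    INSERT INTO scored VALUES ('Ana', 10), ('Ana', 7), ('Bo', 5), ('Cy', 20);
    """
)

# Two nested predicates, read here as two hops (city -> team -> player),
# followed by an aggregation over the grounding table.
two_hop_query = """
    SELECT SUM(points) FROM scored
    WHERE player IN (
        SELECT name FROM player
        WHERE team_name IN (SELECT name FROM team WHERE city = 'Boston')
    )
"""
total_points = conn.execute(two_hop_query).fetchone()[0]
conn.close()
```

A natural-language rendering of this query might be "How many points in total were scored by players on teams based in Boston?". Adding another level of nesting would add another hop under this reading.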

Chinese Translation

现实世界中的表格-文本问答（QA）任务需要能够在长文本和源表格之间进行推理的模型，跨越多个跳跃并执行复杂操作，如聚合。然而，现有的基准测试规模较小，手动策划——因此容易出错——且包含的浅层问题很少需要超过两个跳跃或调用聚合、分组或其他可用自然语言查询表达的高级分析操作。我们提出了SPARTA，一个端到端的构建框架，能够自动生成大规模的表格-文本QA基准，并进行轻量级的人类验证，仅需HybridQA注释时间的四分之一。该框架首先通过丰富每个源表格，构建一个参考事实数据库，其中的元组是从附带的非结构化段落中自动提取的原子事实，然后合成嵌套查询，其嵌套谓词的数量与所需的跳跃计数相匹配。为了确保每个SQL语句可执行，并且其语言化产生流畅、自然的提问，我们提出了两种新技术：基于来源的细化，重写任何返回非空结果的语法有效查询，以及现实结构强制，限制生成为查询图的后序遍历。最终的管道生成了数千对高保真度的问题-答案对，涵盖聚合、分组和跨文本与表格的深度多跳推理。在SPARTA上，达到HybridQA上超过70 F1或OTT-QA上超过50 F1的最先进模型下降超过30 F1点，暴露了当前跨模态推理的基本弱点。我们的基准测试、构建代码和基线模型可在https://github.com/pshlego/SPARTA/tree/main获取。

cs.CL / 47 / 2602.23300

A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

一种用于对话中多模态情感识别的专家混合模型

Dutta, Soumya, Balaji, Smruthi, Ganapathy, Sriram

Abstract

Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.
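The gating step and the KL-based consistency term can be sketched generically as follows. This is a minimal illustration of a softmax-gated mixture over three expert distributions, not the MiSTER-E implementation: the function names, the fixed gate logits (a real gate would be computed by a learned network from the input), and the specific form of the KL term are all assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_mixture(
    expert_probs: Sequence[Sequence[float]], gate_logits: Sequence[float]
) -> List[float]:
    """Combine per-expert class distributions with softmax gate weights."""
    weights = softmax(gate_logits)
    n_classes = len(expert_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, expert_probs))
        for c in range(n_classes)
    ]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical class distributions from the three experts for one utterance.
speech_only = [0.7, 0.2, 0.1]
text_only = [0.1, 0.8, 0.1]
cross_modal = [0.3, 0.6, 0.1]

# Assumed gate logits; equal logits give each expert the same weight.
gate_logits = [0.0, 0.0, 0.0]

mixture = gated_mixture([speech_only, text_only, cross_modal], gate_logits)

# One plausible consistency penalty: how far each unimodal expert is from the cross-modal one.
consistency_penalty = kl_divergence(speech_only, cross_modal) + kl_divergence(
    text_only, cross_modal
)
```

With equal gate logits the mixture is the plain average of the three distributions; raising one logit shifts the final prediction toward that expert. The penalty is zero only when the unimodal experts agree exactly with the cross-modal one.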

Chinese Translation

对话中的情感识别（ERC）面临独特的挑战，需要模型捕捉多轮对话的时间流，并有效整合来自多种模态的线索。我们提出了情感识别的语音-文本专家混合模型（MiSTER-E），这是一个模块化的专家混合（MoE）框架，旨在解耦ERC中的两个核心挑战：模态特定的上下文建模和多模态信息融合。MiSTER-E利用针对语音和文本进行微调的大型语言模型（LLMs）提供丰富的发话级嵌入，这些嵌入通过卷积-递归上下文建模层进行增强。该系统通过一种学习的门控机制整合来自三个专家（仅语音、仅文本和跨模态）的预测，动态加权它们的输出。为了进一步鼓励模态之间的一致性和对齐，我们引入了配对语音-文本表示之间的监督对比损失，以及基于KL散度的专家预测正则化。重要的是，MiSTER-E在任何阶段都不依赖于说话者身份。在三个基准数据集（IEMOCAP、MELD和MOSI）上的实验表明，我们的提案分别达到了70.9%、69.5%和87.9%的加权F1分数，超越了多个基线语音-文本ERC系统。我们还提供了各种消融实验，以突出所提方法的贡献。

cs.CL / 48 / 2602.23351

Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

规模无法克服语用学：报告偏差对视觉-语言推理的影响

Kamath, Amita, Hessel, Jack, Chandu, Khyathi, Hwang, Jena D., Chang, Kai-Wei, Krishna, Ranjay

Abstract

The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., "at the game today!" is a more likely caption than "a photo of 37 people standing behind a field". We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being of web-scale, and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and to multiple languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.

Chinese Translation

视觉-语言模型（VLMs）在推理能力方面的不足一直是研究讨论的焦点。我们认为这种行为源于其训练数据中的报告偏差。也就是说，人们在交流视觉内容时，默认情况下会省略一些监督某些类型推理所需的隐含信息；例如，“今天的比赛！”比“37个人站在场地后面的一张照片”更可能成为标题。我们通过语用学理论的视角研究了流行的视觉-语言模型OpenCLIP、LLaVA-1.5和Molmo的基础数据，发现尽管这些语料库是网络规模的和/或合成生成的，报告偏差导致四种推理技能（空间、时间、否定和计数）的表现不足。通过一组精心策划的基准测试，我们证明了：（i）VLMs在训练数据中因报告偏差被抑制的上述推理类型上表现不佳；（ii）与普遍看法相反，扩大数据规模、模型规模以及多语言并不会自动导致这些技能的出现；但令人鼓舞的是，（iii）纳入专门收集的注释以获取隐含信息是有效的。我们的研究结果强调了需要更有意图的数据策划方法，而不是依赖规模来促成推理能力的出现。

arXiv Papers

SODA-CitrON: Static Object Data Association by Clustering Multi-Modal Sensor Detections Online

Detection and Recognition: A Pairwise Interaction Framework for Mobile Service Robots

Hierarchical Trajectory Planning of Floating-Base Multi-Link Robot for Maneuvering in Confined Environments

EgoAVFlow: Robot Policy Learning with Active Vision from Human Egocentric Videos via 3D Flow

When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering

SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation

Metamorphic Testing of Vision-Language Action-Enabled Robots

Designing Robots for Families: In-Situ Prototyping for Contextual Reminders on Family Routines

Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline

Does the testing environment matter? Carsickness across on-road, test-track, and driving simulator conditions

SCOPE: Skeleton Graph-Based Computation-Efficient Framework for Autonomous UAV Exploration

Robust Helicopter Ship Deck Landing With Guaranteed Timing Using Shrinking-Horizon Model Predictive Control

Sapling-NeRF: Geo-Localised Sapling Reconstruction in Forests for Ecological Monitoring

Pixel2Catch: Multi-Agent Sim-to-Real Transfer for Agile Manipulation with a Single RGB Camera

Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

LeRobot: An Open-Source Library for End-to-End Robot Learning

Performance and Experimental Analysis of Strain-based Models for Continuum Robots

GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion

DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation

Bayesian Preference Elicitation: Human-In-The-Loop Optimization of An Active Prosthesis

Considering Perspectives for Automated Driving Ethics: Collective Risk in Vehicular Motion Planning

Automated Robotic Needle Puncture for Percutaneous Dilatational Tracheostomy

A Perspective on Open Challenges in Deformable Object Manipulation

DigiArm: An Anthropomorphic 3D-Printed Prosthetic Hand with Enhanced Dexterity for Typing Tasks

InCoM: Intent-Driven Perception and Structured Coordination for Whole-Body Mobile Manipulation

An Empirical Analysis of Cooperative Perception for Occlusion Risk Mitigation

Marinarium: a New Arena to Bring Maritime Robotics Closer to Shore

Towards Intelligible Human-Robot Interaction: An Active Inference Approach to Occluded Pedestrian Scenarios

Grasp, Slide, Roll: Comparative Analysis of Contact Modes for Tactile-Based Shape Reconstruction

SPARR: Simulation-based Policies with Asymmetric Real-world Residuals for Assembly

Simple Models, Real Swimming: Digital Twins for Tendon-Driven Underwater Robots

Interface-Aware Trajectory Reconstruction of Limited Demonstrations for Robot Learning

Enabling clinical use of foundation models in histopathology

Optimizing Neural Network Architecture for Medical Image Segmentation Using Monte Carlo Tree Search

AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction

Enhancing Renal Tumor Malignancy Prediction: Deep Learning with Automatic 3D CT Organ Focused Attention

Vision Transformers Need More Than Registers

CLIP Is Shortsighted: Paying Attention Beyond the First Sentence

SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read

Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge

MammoWise: Multi-Model Local RAG Pipeline for Mammography Report Generation

Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models

Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning

DisQ-HNet: A Disentangled Quantized Half-UNet for Interpretable Multimodal Image Synthesis Applications to Tau-PET Synthesis from T1 and FLAIR MRI

DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation

SwiftNDC: Fast Neural Depth Correction for High-Fidelity 3D Reconstruction

Quality-Aware Robust Multi-View Clustering for Heterogeneous Observation Noise

Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views

Causal Motion Diffusion Models for Autoregressive Motion Generation

Don't let the information slip away

BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model

LoR-LUT: Learning Compact 3D Lookup Tables via Low-Rank Residuals

Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery

Coded-E2LF: Coded Aperture Light Field Imaging from Events

CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection

Instruction-based Image Editing with Planning, Reasoning, and Generation

CRAG: Can 3D Generative Models Help 3D Assembly?

QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition

Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models

Interactive Medical-SAM2 GUI: A Napari-based semi-automatic annotation tool for medical images

Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache

Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing

ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals

Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

SPMamba-YOLO: An Underwater Object Detection Network Based on Multi-Scale Feature Enhancement and Global Context Modeling

ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport

SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings

GFRRN: Explore the Gaps in Single Image Reflection Removal

UFO-DETR: Frequency-Guided End-to-End Detector for UAV Tiny Objects

SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs

IRSDE-Despeckle: A Physics-Grounded Diffusion Model for Generalizable Ultrasound Despeckling

HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models

Asymmetric Idiosyncrasies in Multimodal Models

AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation

ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control

SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation

Beyond Detection: Multi-Scale Hidden-Code for Natural Image Deepfake Recovery and Factual Retrieval