arXiv Daily Digest

183 Papers

Foundation Models in Robotics: A Comprehensive Review of Methods, Models, Datasets, Challenges and Future Research Directions

Psiris, Aggelos, Argyriou, Vasileios, Markakis, Evangelos K., Sarigiannidis, Panagiotis, Gavves, Efstratios, Bekris, Kostas, Ajoudani, Arash, Papadopoulos, Georgios Th.

Abstract

In recent years, the field of robotics has been undergoing a transformative paradigm shift from fixed, single-task, domain-specific solutions towards adaptive, multi-function, general-purpose agents, capable of operating in complex, open-world, and dynamic environments. This tremendous advancement is primarily driven by the emergence of Foundation Models (FMs), i.e., large-scale neural-network architectures trained on massive, heterogeneous datasets that provide unprecedented capabilities in multi-modal understanding and reasoning, long-horizon planning, and cross-embodiment generalization. In this context, the current study provides a holistic, systematic, and in-depth review of the research landscape of FMs in robotics. In particular, the evolution of the field is initially delineated through five distinct research phases, spanning from the early incorporation of Natural Language Processing (NLP) and Computer Vision (CV) models to the current frontier of multi-sensory generalization and real-world deployment. Subsequently, a highly granular taxonomic investigation of the literature is performed, examining the following key aspects: a) the employed FM types, including LLMs, VFMs, VLMs, and VLAs, b) the underlying neural-network architectures, c) the adopted learning paradigms, d) the different learning stages of knowledge incorporation, e) the major robotic tasks, and f) the main real-world application domains. For each aspect, comparative analysis and critical insights are provided. Moreover, a report on the publicly available datasets used for model training and evaluation across the considered robotic tasks is included. Furthermore, a hierarchical discussion on the current open challenges and promising future research directions in the field is incorporated.

cs.RO / 2 / 2604.15449

Iterated Invariant EKF for Quadruped Robot Odometry

Santana, Hilton Marques Souza, Soares, João Carlos Virgolino, Goffin, Sven, Nisticò, Ylenia, Bonnabel, Silvère, Semini, Claudio, Meggiolaro, Marco Antonio

Abstract

Kalman filter-based algorithms are fundamental for mobile robots, as they provide a computationally efficient solution to the challenging problem of state estimation. However, they rely on two main assumptions that are difficult to satisfy in practice: (a) the system dynamics must be linear with Gaussian process noise, and (b) the measurement model must also be linear with Gaussian measurement noise. Previous works have extended assumption (a) to nonlinear spaces through the Invariant Extended Kalman Filter (IEKF), showing that it retains properties similar to those of the classical Kalman filter when the system dynamics are group-affine on a Lie group. More recently, the counterpart of assumption (b) for the same nonlinear setting was addressed in [1]. By means of the proposed Iterated Invariant Extended Kalman Filter (IterIEKF), the authors of that work demonstrated that the update step exhibits several compatibility properties of the classical linear Kalman filter. In this work, we introduce a novel open-source state estimation algorithm for legged robots based on the IterIEKF. The update step of the proposed filter relies solely on proprioceptive measurements, exploiting kinematic constraints on foot velocity during contact and base-frame velocity, making it inherently robust to environmental conditions. Through extensive numerical simulations and evaluation on real-world datasets, we demonstrate that the IterIEKF outperforms the vanilla IEKF, the SO(3)-based Kalman Filter, and its iterated variant in terms of both accuracy and consistency.
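
The "iterated" update mentioned in the abstract can be illustrated with a minimal sketch of a textbook iterated EKF measurement update for a scalar state. This is an assumption-laden illustration in plain vector space: it is not the invariant (Lie-group) formulation of the IterIEKF and not the authors' code. The names `iterated_ekf_update`, `h`, and `dh` are hypothetical.

```python
def iterated_ekf_update(x_prior, p_prior, z, h, dh, r, iters=10):
    """Scalar iterated EKF update (Gauss-Newton on the measurement model).

    x_prior, p_prior: prior mean and variance
    z, r: measurement and measurement-noise variance
    h, dh: measurement function and its derivative (hypothetical callables)
    """
    x = x_prior
    for _ in range(iters):
        H = dh(x)                                  # relinearize at the current iterate
        K = p_prior * H / (H * p_prior * H + r)    # Kalman gain at this linearization
        # The prior stays the anchor; only the linearization point moves.
        x = x_prior + K * (z - h(x) - H * (x_prior - x))
    H = dh(x)
    K = p_prior * H / (H * p_prior * H + r)
    p_post = (1.0 - K * H) * p_prior               # covariance from the final linearization
    return x, p_post


# Usage with a nonlinear measurement h(x) = x**2 (illustrative values only).
x_post, p_post = iterated_ekf_update(1.0, 0.5, 2.2, lambda x: x * x, lambda x: 2.0 * x, 0.1)
```

For a linear measurement model every iteration returns the same estimate, so the sketch reduces to the classical Kalman update.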

cs.RO / 3 / 2604.15455

One-Shot Cross-Geometry Skill Transfer through Part Decomposition

Thompson, Skye, Biza, Ondrej, Konidaris, George

Abstract

Given a demonstration, a robot should be able to generalize a skill to any object it encounters, but existing approaches to skill transfer often fail to adapt to objects with unfamiliar shapes. Motivated by examples of improved transfer from compositional modeling, we propose a method for improving transfer by decomposing objects into their constituent semantic parts. We leverage data-efficient generative shape models to accurately transfer interaction points from the parts of a demonstration object to a novel object. We autonomously construct an objective to optimize the alignment of those points on skill-relevant object parts. Our method generalizes to a wider range of object geometries than existing work, and achieves successful one-shot transfer for a range of skills and objects from a single demonstration, in both simulated and real environments.

cs.RO / 4 / 2604.15475

NeuroMesh: A Unified Neural Inference Framework for Decentralized Multi-Robot Collaboration

Zhou, Yang, Shetye, Yash, Quang, Long, Super, Devon, Milzman, Jesse, Goarin, Manohari, Azad, Aditya, Dhake, Devang Sunil, Mao, Jeffery, Nieto-Granda, Carlos, Loianno, Giuseppe

Abstract

Deploying learned multi-robot models on heterogeneous robots remains challenging due to hardware heterogeneity, communication constraints, and the lack of a unified execution stack. This paper presents NeuroMesh, a multi-domain, cross-platform, and modular decentralized neural inference framework that standardizes observation encoding, message passing, aggregation, and task decoding in a unified pipeline. NeuroMesh combines a dual-aggregation paradigm for reduction- and broadcast-based information fusion with a parallelized architecture that decouples cycle time from end-to-end latency. Our high-performance C++ implementation leverages Zenoh for inter-robot communication and supports hybrid GPU/CPU inference. We validate NeuroMesh on a heterogeneous team of aerial and ground robots across collaborative perception, decentralized control, and task assignment, demonstrating robust operation across diverse task structures and payload sizes. We plan to release NeuroMesh as an open-source framework to the community.

cs.RO / 5 / 2604.15507

Trajectory Planning for Safe Dual Control with Active Exploration

Naveed, Kaleb Ben, Singh, Manveer, Agrawal, Devansh R., Panagou, Dimitra

Abstract

Planning safe trajectories under model uncertainty is a fundamental challenge. Robust planning ensures safety by considering worst-case realizations, yet ignores uncertainty reduction and leads to overly conservative behavior. Actively reducing uncertainty on-the-fly during a nominal mission defines the dual control problem. Most approaches address this by adding a weighted exploration term to the cost, tuned to trade off the nominal objective and uncertainty reduction, but without formal consideration of when exploration is beneficial. Moreover, safety is enforced in some methods but not in others. We study a budget-constrained dual control problem, where uncertainty is reduced subject to safety and a mission-level cost budget that limits the allowable degradation in task performance due to exploration. In this work, we propose Dual-gatekeeper, a framework that integrates robust planning with active exploration under formal guarantees of safety and budget feasibility. The key idea is that exploration is pursued only when it provides a verifiable improvement without compromising safety or violating the budget, enabling the system to balance immediate task performance with long-term uncertainty reduction in a principled manner. We provide two implementations of the framework based on different safety mechanisms and demonstrate its performance on quadrotor navigation and autonomous car racing case studies under parametric uncertainty.
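
The gating idea in the abstract (explore only when it is safe, affordable, and beneficial) can be sketched as a simple decision rule. This is a hypothetical illustration, not the Dual-gatekeeper algorithm: the abstract does not specify how improvement is verified or how safety is certified, so `Candidate`, `select_trajectory`, `info_gain`, and `min_gain` are all assumed names and quantities.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    cost: float        # predicted mission cost if this trajectory is committed
    info_gain: float   # predicted uncertainty reduction (hypothetical metric)
    is_safe: bool      # outcome of some robust safety check (assumed given)


def select_trajectory(nominal, exploratory, budget, min_gain=0.0):
    """Commit to exploration only if it is safe, within budget, and beneficial.

    Returns the chosen candidate and the remaining exploration budget.
    """
    extra_cost = exploratory.cost - nominal.cost
    if exploratory.is_safe and extra_cost <= budget and exploratory.info_gain > min_gain:
        # Only cost above the nominal plan is charged against the budget.
        return exploratory, budget - max(extra_cost, 0.0)
    return nominal, budget


# Usage: exploration costs 2 units more than the nominal plan and fits a budget of 3.
choice, remaining = select_trajectory(
    Candidate(cost=10.0, info_gain=0.0, is_safe=True),
    Candidate(cost=12.0, info_gain=0.5, is_safe=True),
    budget=3.0,
)
```

Falling back to the nominal plan whenever any condition fails keeps the rule conservative, which matches the spirit of the abstract but not necessarily its formal guarantees.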

cs.RO / 6 / 2604.15569

ShapeGen: Robotic Data Generation for Category-Level Manipulation

Wang, Yirui, Xu, Xiuwei, Ma, Angyuan, Yu, Bingyao, Zhou, Jie, Lu, Jiwen

Abstract

Manipulation policies deployed in uncontrolled real-world scenarios face great in-category geometric diversity among everyday objects. In order to function robustly under such variations, policies need to work in a category-level manner, i.e., knowing how to interact with any object in a certain category, instead of only a specific one seen during training. This in-category generalizability is usually nurtured with shape-diversified training data; however, manually collecting such a corpus of data is infeasible due to the requirement of intense human labor and large collections of divergent objects at hand. In this paper, we propose ShapeGen, a data generation method that aims at generating shape-varied manipulation data in a simulator-free and 3D manner. ShapeGen decomposes the process into two stages: Shape Library curation and Function-Aware Generation. In the first stage, we train spatial warpings between shapes mapping points to points that correspond functionally, and aggregate 3D models along with the warpings into a plug-and-play Shape Library. In the second stage, we design a pipeline that, leveraging established Libraries, requires only minimal human annotation to generate physically plausible and functionally correct novel demonstrations. Experiments in the real world demonstrate the effectiveness of ShapeGen in boosting policies' in-category shape generalizability. Project page: https://wangyr22.github.io/ShapeGen/.

cs.RO / 7 / 2604.15612

GaussianFlow SLAM: Monocular Gaussian Splatting SLAM Guided by GaussianFlow

Seo, Dong-Uk, Jeon, Jinwoo, Lee, Eungchang Mason, Myung, Hyun

Abstract

Gaussian splatting has recently gained traction as a compelling map representation for SLAM systems, enabling dense and photo-realistic scene modeling. However, its application to monocular SLAM remains challenging due to the lack of reliable geometric cues from monocular input. Without geometric supervision, mapping or tracking can fall into local minima, resulting in structural degeneracies and inaccuracies. To address this challenge, we propose GaussianFlow SLAM, a monocular 3DGS-SLAM that leverages optical flow as a geometry-aware cue to guide the optimization of both the scene structure and camera poses. By encouraging the projected motion of Gaussians, termed GaussianFlow, to align with the optical flow, our method introduces consistent structural cues to regularize both map reconstruction and pose estimation. Furthermore, we introduce normalized error-based densification and pruning modules to refine inactive and unstable Gaussians, thereby contributing to improved map quality and pose accuracy. Experiments conducted on public datasets demonstrate that our method achieves superior rendering quality and tracking accuracy compared with state-of-the-art algorithms. The source code is available at: https://github.com/url-kaist/gaussianflow-slam.
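
A rough sketch of the flow-alignment idea follows: project each Gaussian center into two camera frames and compare the resulting pixel displacement with the observed optical flow. This is a simplification under stated assumptions, not the loss used in the linked repository. It uses only Gaussian centers (no covariances or opacity), a pinhole camera, and one flow sample per Gaussian. `project` and `flow_alignment_loss` are hypothetical names.

```python
def project(p_world, rot, trans, fx, fy, cx, cy):
    """Pinhole projection of a 3D point, given world-to-camera rotation and translation."""
    xc, yc, zc = (sum(rot[i][j] * p_world[j] for j in range(3)) + trans[i] for i in range(3))
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def flow_alignment_loss(centers, pose_a, pose_b, optical_flow, intrinsics):
    """Mean L1 gap between projected center motion and observed optical flow.

    centers: list of 3D points; pose_a, pose_b: (rot, trans) tuples for two frames
    optical_flow: list of (du, dv), one sample per center (assumed pre-sampled)
    intrinsics: (fx, fy, cx, cy)
    """
    total = 0.0
    for p, (du, dv) in zip(centers, optical_flow):
        ua, va = project(p, *pose_a, *intrinsics)
        ub, vb = project(p, *pose_b, *intrinsics)
        total += abs((ub - ua) - du) + abs((vb - va) - dv)
    return total / len(centers)
```

In an optimizer, a term of this kind would be minimized with respect to both the Gaussian positions and the camera poses, which is how a flow cue can regularize mapping and tracking together.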

cs.RO / 8 / 2604.15619

Factor Graph-Based Shape Estimation for Continuum Robots via Magnus Expansion

Ticozzi, Lorenzo, Vela, Patricio A., Tsiotras, Panagiotis

Abstract

Reconstructing the shape of continuum manipulators from sparse, noisy sensor data is a challenging task, owing to the infinite-dimensional nature of such systems. Existing approaches broadly trade off between parametric methods that yield compact state representations but lack probabilistic structure, and Cosserat rod inference on factor graphs, which provides principled uncertainty quantification at the cost of a state dimension that grows with the spatial discretization. This letter combines the strength of both paradigms by estimating the coefficients of a low-dimensional Geometric Variable Strain (GVS) parameterization within a factor graph framework. A novel kinematic factor, derived from the Magnus expansion of the strain field, encodes the closed-form rod geometry as a prior constraint linking the GVS strain coefficients to the backbone pose variables. The resulting formulation yields a compact state vector directly amenable to model-based control, while retaining the modularity, probabilistic treatment and computational efficiency of factor graph inference. The proposed method is evaluated in simulation on a 0.4 m long tendon-driven continuum robot under three measurement configurations, achieving mean position errors below 2 mm for all three scenarios and demonstrating a sixfold reduction in orientation error compared to a Gaussian process regression baseline when only position measurements are available.
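
For reference, the textbook Magnus expansion for a linear matrix differential equation $Y'(s) = A(s)\,Y(s)$ with $Y(0) = I$ is given below. How the paper maps the strain field onto $A(s)$, and at which order it truncates the series, is not stated in the abstract. The sign of the commutator term flips for the right-multiplied form $Y' = Y A$, so conventions may differ from the paper's.

```latex
Y(s) = \exp\bigl(\Omega(s)\bigr), \qquad
\Omega(s) = \sum_{k=1}^{\infty} \Omega_k(s),
\\[4pt]
\Omega_1(s) = \int_0^{s} A(s_1)\, \mathrm{d}s_1,
\\[4pt]
\Omega_2(s) = \frac{1}{2} \int_0^{s} \int_0^{s_1}
  \bigl[ A(s_1),\, A(s_2) \bigr]\, \mathrm{d}s_2\, \mathrm{d}s_1,
\qquad [X, Y] = XY - YX .
```

Truncating after a few terms gives a closed-form approximation of the solution that stays on the underlying matrix group, which is why the expansion suits a compact kinematic factor.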

cs.RO / 9 / 2604.15638

Contact-Aware Planning and Control of Continuum Robots in Highly Constrained Environments

Mangan, Aedan, Long, Kehan, Lee, Ki Myung Brian, Potdar, Miheer, Atanasov, Nikolay, Morimoto, Tania K.

Abstract

Continuum robots are well suited for navigating confined and fragile environments, such as vascular or endoluminal anatomy, where contact with surrounding structures is often unavoidable. While controlled contact can assist motion, unfavorable contact can degrade controllability, induce kinematic singularities, or introduce safety risks. We present a contact-aware planning approach that evaluates contact quality, penalizing hazardous interactions, while permitting benign contact. The planner produces kinematically feasible trajectories and contact-aware Jacobians which can be used for closed-loop control in hardware experiments. We validate the approach by testing the integrated system (planning, control, and mechanical design) on anatomical models from patient scans. The planner generates effective plans for three common anatomical environments, and, in all hardware trials, the continuum robot was able to reach the target while avoiding dangerous tip contact (100% success). Mean tracking errors were 1.9 +/- 0.5 mm, 1.2 +/- 0.1 mm, and 1.7 +/- 0.2 mm across the three different environments. Ablation studies showed that penalizing end-of-continuum-segment (ECS) contact improved manipulability and prevented hardware failures. Overall, this work enables reliable, contact-aware navigation in highly constrained environments.

cs.RO / 10 / 2604.15671

Long-Term Memory for VLA-based Agents in Open-World Task Execution

Huang, Xu, Mao, Weixin, Li, Yinhao, Chen, Hua, Zhao, Jiabao

Abstract

Vision-Language-Action (VLA) models have demonstrated significant potential for embodied decision-making; however, their application in complex chemical laboratory automation remains restricted by limited long-horizon reasoning and the absence of persistent experience accumulation. Existing frameworks typically treat planning and execution as decoupled processes, often failing to consolidate successful strategies, which results in inefficient trial-and-error in multi-stage protocols. In this paper, we propose ChemBot, a dual-layer, closed-loop framework that integrates an autonomous AI agent with a progress-aware VLA model (Skill-VLA) for hierarchical task decomposition and execution. ChemBot utilizes a dual-layer memory architecture to consolidate successful trajectories into retrievable assets, while a Model Context Protocol (MCP) server facilitates efficient sub-agent and tool orchestration. To address the inherent limitations of VLA models, we further implement a future-state-based asynchronous inference mechanism to mitigate trajectory discontinuities. Extensive experiments on collaborative robots demonstrate that ChemBot achieves superior operational safety, precision, and task success rates compared to existing VLA baselines in complex, long-horizon chemical experimentation.

cs.RO / 11 / 2604.15772

Fuzzy Logic Theory-based Adaptive Reward Shaping for Robust Reinforcement Learning (FARS)

Şahin, Hürkan, Dang, Van Huyen, Sayar, Erdi, Yegenoglu, Alper, Kayacan, Erdal

Abstract

Reinforcement learning (RL) often struggles in real-world tasks with high-dimensional state spaces and long horizons, where sparse or fixed rewards severely slow down exploration and cause agents to get trapped in local optima. This paper presents a fuzzy-logic-based reward shaping method that integrates human intuition into RL reward design. By encoding expert knowledge into adaptive and interpretable terms, fuzzy rules promote stable learning and reduce sensitivity to hyperparameters. The proposed method leverages these properties to adapt reward contributions based on the agent state, enabling smoother transitions between fast motion and precise control in challenging navigation tasks. Extensive simulation results on autonomous drone racing benchmarks show stable learning behavior and consistent task performance across scenarios of increasing difficulty. The proposed method achieves faster convergence and reduced performance variability across training seeds in more challenging environments, with success rates improving by up to approximately 5 percent compared to non-fuzzy reward formulations.
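
As an illustration of state-dependent reward blending, here is a minimal two-rule fuzzy sketch. The rule base, the 5 m membership range, and the names `near_membership` and `shaped_reward` are assumptions made for this example. The abstract does not describe the actual membership functions or rules used in FARS.

```python
def near_membership(dist_to_gate, max_dist=5.0):
    """Degree to which the agent is 'near' the gate: 1 at 0 m, 0 at or beyond max_dist."""
    return max(0.0, min(1.0, (max_dist - dist_to_gate) / max_dist))


def shaped_reward(dist_to_gate, speed_reward, precision_reward):
    """Blend two reward terms with fuzzy rule strengths.

    Rule 1: IF gate is far  THEN emphasize the speed term.
    Rule 2: IF gate is near THEN emphasize the precision term.
    """
    near = near_membership(dist_to_gate)
    far = 1.0 - near
    # Weighted average of rule outputs; the weights already sum to 1.
    return far * speed_reward + near * precision_reward


# Usage: halfway through the membership range both terms contribute equally.
reward = shaped_reward(2.5, speed_reward=1.0, precision_reward=3.0)
```

Because the weights vary continuously with distance, the shaped reward changes smoothly between the "fast" and "precise" regimes instead of switching abruptly.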

cs.RO / 12 / 2604.15805

From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation

Lu, Jasper, Shen, Zhenhao, Wang, Yuanfei, Liu, Shugao, Xu, Shengqiang, Xie, Shawn, Xu, Jingkai, Jiang, Feng, Yang, Jade, Xie, Chen, Wu, Ruihai

Abstract

Learning robust robot policies in real-world environments requires diverse data augmentation, yet scaling real-world data collection is costly due to the need for acquiring physical assets and reconfiguring environments. Therefore, augmenting real-world scenes into simulation has become a practical approach for efficient learning and evaluation. We present a generative framework that establishes a generative real-to-sim mapping from real-world panoramas to high-fidelity simulation scenes, and further synthesizes diverse cousin scenes via semantic and geometric editing. Combined with high-quality physics engines and realistic assets, the generated scenes support interactive manipulation tasks. Additionally, we incorporate multi-room stitching to construct consistent large-scale environments for long-horizon navigation across complex layouts. Experiments demonstrate a strong sim-to-real correlation, validating our platform's fidelity, and show that extensively scaling up data generation leads to significantly better generalization to unseen scene and object variations, demonstrating the effectiveness of Digital Cousins for generalizable robot learning and evaluation.

cs.RO / 13 / 2604.15854

Limits of Lamarckian Evolution Under Pressure of Morphological Novelty

Muff, Jed R, Miras, Karine, Eiben, A. E.

Abstract

Lamarckian inheritance has been shown to be a powerful accelerator in systems where the joint evolution of robot morphologies and controllers is enhanced with individual learning. Its defining advantage lies in the offspring inheriting controllers learned by their parents. The efficacy of this option, however, relies on morphological similarity between parent and offspring. In this study, we examine how Lamarckian inheritance performs when the search process is driven toward high morphological variance, potentially straining the requirement for parent-offspring similarity. Using a system of modular robots that can evolve and learn to solve a locomotion task, we compare Darwinian and Lamarckian evolution to determine how they respond to shifting from pure task-based selection to a multi-objective pressure that also rewards morphological novelty. Our results confirm that Lamarckian evolution outperforms Darwinian evolution when optimizing task-performance alone. However, introducing selection pressure for morphological diversity causes a substantial performance drop, which is much greater in the Lamarckian system. Further analyses show that promoting diversity reduces parent-offspring similarity, which in turn reduces the benefits of inheriting controllers learned by parents. These results reveal the limits of Lamarckian evolution by exposing a fundamental trade-off between inheritance-based exploitation and diversity-driven exploration.
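
The difference between the two inheritance schemes can be sketched in a few lines. This is a hypothetical illustration: the dictionary layout and the `reproduce` and `mutate` names are assumptions, and the sketch ignores the step of mapping a parent's controller onto an offspring with a different morphology, which is exactly where the abstract locates the difficulty.

```python
def reproduce(parent, mode, mutate):
    """Build an offspring's starting controller from its parent.

    parent: dict with 'inborn' (controller at birth) and 'learned' (after lifetime learning)
    mode: 'darwinian' or 'lamarckian'
    mutate: callable applying variation to a controller (hypothetical operator)
    """
    if mode == "lamarckian":
        source = parent["learned"]   # learned traits are written back and inherited
    elif mode == "darwinian":
        source = parent["inborn"]    # learning results are discarded at reproduction
    else:
        raise ValueError(f"unknown mode: {mode}")
    # The offspring has not learned anything yet.
    return {"inborn": mutate(source), "learned": None}
```

When parent and offspring bodies differ strongly, the inherited `learned` controller is a poorer starting point, which is consistent with the performance drop the abstract reports under novelty pressure.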

cs.RO / 14 / 2604.15864

Environment-Adaptive Solid-State LiDAR-Inertial Odometry

Zhang, Zhi, Satirapod, Chalermchon, Ma, Bingtao, Gu, Changjun

Abstract

Solid-state LiDAR-inertial SLAM has attracted significant attention due to its advantages in speed and robustness. However, achieving accurate mapping in extreme environments remains challenging due to severe geometric degeneracy and unreliable observations, which often lead to ill-conditioned optimization and map inconsistencies. To address these challenges, we propose an environment-adaptive solid-state LiDAR-inertial odometry that integrates local normal-vector constraints with degeneracy-aware map maintenance to enhance localization accuracy. Specifically, we introduce local normal-vector constraints to improve the stability of state estimation, effectively suppressing localization drift in degenerate scenarios. Furthermore, we design a degeneration-guided map update strategy to improve map precision. Benefiting from the refined map representation, localization accuracy is further enhanced in subsequent estimation. Experimental results demonstrate that the proposed method achieves superior mapping accuracy and robustness in extreme and perceptually degraded environments, with an average RMSE reduction of up to 12.8% compared to the baseline method.
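
Geometric degeneracy is commonly detected by inspecting the eigenvalues of the approximate Hessian of the registration problem. The sketch below shows that generic test and is not necessarily the criterion used in this paper. `degenerate_directions` and `eig_threshold` are hypothetical names, and the threshold is scenario-dependent.

```python
import numpy as np


def degenerate_directions(J, eig_threshold):
    """Find poorly constrained directions of a least-squares problem.

    J: (num_residuals, num_states) Jacobian of the registration residuals
    Returns all eigenvalues of J^T J (ascending) and the eigenvectors whose
    eigenvalues fall below eig_threshold.
    """
    H = J.T @ J                              # Gauss-Newton approximation of the Hessian
    eigvals, eigvecs = np.linalg.eigh(H)     # symmetric solver, ascending eigenvalues
    mask = eigvals < eig_threshold
    return eigvals, eigvecs[:, mask]


# Usage: residuals constrain x and y but carry no information about z.
J = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0]])
eigvals, weak_dirs = degenerate_directions(J, eig_threshold=0.5)
```

State updates along the returned directions are unreliable, so an estimator can damp them, fill them from the IMU prediction, or, as the abstract suggests for map maintenance, treat scans from such configurations more cautiously.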

cs.RO / 15 / 2604.15865

DTEA: A Dual-Topology Elastic Actuator Enabling Real-Time Switching Between Series and Parallel Compliance

Ramesh, Vishal, Singh, Aman, Kolathaya, Shishir

Abstract

Series and parallel elastic actuators offer complementary but mutually exclusive advantages, yet no existing actuator enables real-time transition between these topologies during operation. This paper presents a novel actuator design called the Dual-Topology Elastic Actuator (DTEA), which enables dynamic switching between SEA and PEA topologies during operation. A proof-of-concept prototype of the DTEA is developed to demonstrate the feasibility of the topology-switching mechanism. Experiments are conducted to evaluate the robustness and timing of the switching mechanism under operational conditions. The actuator successfully performed 324 topology-switching cycles under load without damage, demonstrating the robustness of the mechanism. The measured switching time between SEA and PEA modes is under 33.33 ms. Additional experiments are conducted to characterize the static stiffness and disturbance rejection performance in both SEA and PEA modes. Static stiffness tests show that the PEA mode is 1.53x stiffer than the SEA mode, with K_SEA = 5.57 +/- 0.02 Nm/rad and K_PEA = 8.54 +/- 0.02 Nm/rad. Disturbance rejection experiments show that the mean peak deflection in SEA mode is 2.26x larger than in PEA mode (5.2 deg vs. 2.3 deg), while the mean settling time is 3.45x longer (1380 ms vs. 400 ms). The observed behaviors are consistent with the known characteristics of conventional SEA and PEA actuators, validating the functionality of both modes in the DTEA.
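
The ratios quoted in the abstract follow directly from the reported means, as this short arithmetic check shows. Variable names are chosen for this example only, and the reported uncertainties are ignored.

```python
# Reported mean values from the abstract.
k_sea, k_pea = 5.57, 8.54             # static stiffness, Nm/rad
defl_sea, defl_pea = 5.2, 2.3         # mean peak deflection, deg
settle_sea, settle_pea = 1380, 400    # mean settling time, ms

stiffness_ratio = k_pea / k_sea           # PEA relative to SEA
deflection_ratio = defl_sea / defl_pea    # SEA relative to PEA
settling_ratio = settle_sea / settle_pea  # SEA relative to PEA

print(f"{stiffness_ratio:.2f}x, {deflection_ratio:.2f}x, {settling_ratio:.2f}x")
```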

cs.RO / 16 / 2604.15890

Robust Fleet Sizing for Multi-UAV Inspection Missions under Synchronized Replacement Demand

Ramesh, Vishal, Thomas, Antony

Abstract

Multi-UAV inspection missions require spare drones to replace active drones during recharging cycles. Existing fleet-sizing approaches often assume steady-state operating conditions that do not apply to finite-horizon missions, or they treat replacement requests as statistically independent events. The latter provides per-request blocking guarantees that fail to translate to mission-level reliability when demands cluster. This paper identifies a structural failure mode where efficient routing assigns similar workloads to each UAV, leading to synchronized battery depletion and replacement bursts that exhaust the spare pool even when average capacity is sufficient. We derive a closed-form sufficient fleet-sizing rule, k = m(ceil(R) + 1), where m is the number of active UAVs and R is the recovery-to-active time ratio. This additive buffer of m spares absorbs worst-case synchronized demand at recovery-cycle boundaries and ensures mission-level reliability even when all UAVs deplete simultaneously. Monte Carlo validation across five scenarios (m in [2, 10], R in [0.87, 3.39], 1000 trials each) shows that Erlang-B sizing with a per-request blocking target epsilon = 0.01 drops to 69.9% mission success at R = 3.39, with 95% of spare exhaustion events concentrated in the top-decile 5-minute demand windows. In contrast, the proposed rule maintains 99.8% success (Wilson 95% lower bound 99.3%) across all tested conditions, including wind variability up to CV = 0.30, while requiring only four additional drones in the most demanding scenario.
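
The closed-form rule and the Erlang-B baseline it is compared against can both be written in a few lines. This is a sketch under stated assumptions: `fleet_size_rule` implements the formula exactly as quoted in the abstract, while `erlang_b` is the standard blocking recursion. The abstract does not say how the paper maps m and R to an offered load, so `offered_load` is left as a free parameter here.

```python
import math


def fleet_size_rule(m, R):
    """k = m * (ceil(R) + 1), the sufficient sizing rule quoted in the abstract.

    m: number of active UAVs; R: recovery-to-active time ratio.
    """
    return m * (math.ceil(R) + 1)


def erlang_b(servers, offered_load):
    """Erlang-B blocking probability via the standard numerically stable recursion."""
    b = 1.0  # blocking probability with zero servers
    for n in range(1, servers + 1):
        b = offered_load * b / (n + offered_load * b)
    return b


# Usage with the extreme values of the tested ranges (m = 10, R = 3.39).
k = fleet_size_rule(10, 3.39)
```

Erlang-B assumes independent arrivals, which is the assumption the abstract says breaks down when batteries deplete in sync. The rule instead budgets for all m UAVs requesting replacement at once.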

Chinese Translation

多无人机检查任务需要备用无人机在充电周期内替换活跃无人机。现有的舰队规模优化方法通常假设稳态操作条件，这对于有限时间范围的任务并不适用，或者将更换请求视为统计独立事件。后者提供的每个请求的阻塞保证在需求聚集时无法转化为任务级别的可靠性。本文识别出一种结构性失效模式，其中高效的路径分配将相似的工作负载分配给每个无人机，导致电池同步耗尽和更换高峰，即使平均容量足够也会耗尽备用池。我们推导出一个封闭形式的充分舰队规模规则，k = m(ceil(R) + 1)，其中m是活跃无人机的数量，R是恢复与活跃时间比。这个附加的m个备用无人机缓冲在恢复周期边界吸收最坏情况下的同步需求，并确保任务级别的可靠性，即使所有无人机同时耗尽。对五种情景（m在[2, 10]之间，R在[0.87, 3.39]之间，每种情景进行1000次试验）的蒙特卡洛验证显示，采用Erlang-B规模优化，针对每个请求的阻塞目标epsilon = 0.01，在R = 3.39时任务成功率降至69.9%，而95%的备用耗尽事件集中在前十分之一的5分钟需求窗口。相比之下，所提规则在所有测试条件下保持99.8%的成功率（Wilson 95%下限为99.3%），包括风速变异性高达CV = 0.30，而在最苛刻的情景中仅需额外四架无人机。
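
The closed-form rule quoted in the abstract, k = m(ceil(R) + 1), can be evaluated directly. The sketch below is only an illustration of that arithmetic: the function name is hypothetical, and the abstract does not say whether k counts spares only or the whole fleet, so the code simply returns k as defined there. Pairing m = 10 with R = 3.39 in the usage line is also an assumption, since the abstract reports only the ranges of m and R.

```python
import math


def sizing_rule_k(m: int, recovery_to_active_ratio: float) -> int:
    """Evaluate k = m * (ceil(R) + 1) as stated in the abstract.

    m: number of active UAVs.
    recovery_to_active_ratio: R, the recovery-to-active time ratio.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    if recovery_to_active_ratio <= 0:
        raise ValueError("R must be positive")
    # m * ceil(R) covers UAVs still recovering; the extra m is the
    # additive buffer for synchronized replacement bursts.
    return m * (math.ceil(recovery_to_active_ratio) + 1)


# Hypothetical pairing of the extreme values from the reported ranges.
print(sizing_rule_k(10, 3.39))  # 10 * (4 + 1) = 50
```

Because R enters through a ceiling, any R in (3, 4] gives the same k for a fixed m.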

cs.RO / 17 / 2604.15907

A Reconfigurable Pneumatic Joint Enabling Localized Selective Stiffening and Shape Locking in Vine-Inspired Robots

一种可重构的气动关节，实现藤蔓启发机器人局部选择性刚度和形状锁定

Oyejide, Ayodele James, Yaqub, Ustaz A., Erturk, Samir, Baran, Eray A., Stroppa, Fabio

Abstract

Vine-inspired robots achieve large workspace coverage through tip eversion, enabling safe navigation in confined and cluttered environments. However, their deployment in free space is fundamentally limited by low axial stiffness, poor load-bearing capacity, and the inability to retain shape during and after steering. In this work, we propose a reconfigurable pneumatic joint (RPJ) architecture that introduces discrete, pressure-tunable stiffness along the robot body without compromising continuous growth. Each RPJ module comprises symmetrically distributed pneumatic chambers that locally increase bending stiffness when pressurized, enabling decoupling between global compliance and localized rigidity. We integrate the RPJs into a soft growing robot with tendon-driven steering and develop a compact base station for mid-air eversion. System characterization and experimental validation demonstrate moderate pressure requirements for eversion, as well as comparable localized stiffening and steering performance to layer-jamming mechanisms. Demonstrations further show that the proposed robot achieves improved shape retention during bending, reduced gravitational deflection under load, cascading retraction, and reliable payload transport up to 202 g in free space. The RPJ mechanism establishes a practical pathway toward structurally adaptive vine robots for manipulation-oriented tasks such as object sorting and adaptive exploration in unconstrained environments.

Chinese Translation

藤蔓启发机器人通过尖端翻转实现了大范围的工作空间覆盖，能够安全地在狭窄和杂乱的环境中导航。然而，它们在自由空间中的应用受到低轴向刚度、较差的承载能力以及在转向过程中和转向后无法保持形状的根本限制。在本研究中，我们提出了一种可重构的气动关节（RPJ）架构，该架构在不妨碍连续生长的情况下，引入了沿机器人主体的离散、可调压力刚度。每个RPJ模块由对称分布的气动腔组成，当加压时可局部增加弯曲刚度，从而实现全局柔性与局部刚性之间的解耦。我们将RPJ集成到一种软性生长机器人中，该机器人具有腱驱动转向，并开发了一个紧凑的基站用于空中翻转。系统特性表征和实验验证表明，翻转所需的压力适中，并且局部刚度和转向性能与层间卡滞机制相当。演示进一步表明，所提议的机器人在弯曲过程中实现了更好的形状保持，承载下的重力偏转减少，级联收缩，以及在自由空间中可靠地运输高达202克的有效载荷。RPJ机制为结构自适应的藤蔓机器人在物体分类和无约束环境中的自适应探索等操作导向任务提供了一条实用路径。

cs.RO / 18 / 2604.15938

VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation

VADF：用于高效机器人操作的视觉自适应扩散策略框架

Yu, Xinglei, Liu, Zhenyang, Nan, Shufeng, Wu, Simo, Fu, Yanwei

Abstract

Diffusion policies are becoming mainstream in robotic manipulation but suffer from hard negative class imbalance due to uniform sampling and lack of sample difficulty awareness, leading to slow training convergence and frequent inference timeout failures. We propose VADF (Vision-Adaptive Diffusion Policy Framework), a vision-driven dual-adaptive framework that significantly reduces convergence steps and achieves early success in inference, with model-agnostic design enabling seamless integration into any diffusion policy architecture. During training, we introduce Adaptive Loss Network (ALN), a lightweight MLP-based loss predictor that quantifies per-step sample difficulty in real time. Guided by hard negative mining, it performs weighted sampling to prioritize high-loss regions, enabling adaptive weight updates and faster convergence. In inference, we design the Hierarchical Vision Task Segmenter (HVTS), which decomposes high-level task instructions into multi-stage low-level sub-instructions based on visual input. It adaptively segments action sequences into simple and complex subtasks by assigning shorter noise schedules with longer direct execution sequences to simple actions, and longer noise steps with shorter execution sequences to complex ones, thereby dramatically reducing computational overhead and significantly improving the early success rate.

Chinese Translation

扩散策略在机器人操作中正逐渐成为主流，但由于均匀采样和缺乏样本难度意识，导致严重的负类不平衡，从而造成训练收敛缓慢和频繁的推理超时失败。我们提出了VADF（视觉自适应扩散策略框架），这是一个以视觉驱动的双自适应框架，显著减少了收敛步骤，并在推理中实现了早期成功，其模型无关设计使其能够无缝集成到任何扩散策略架构中。在训练过程中，我们引入了自适应损失网络（Adaptive Loss Network, ALN），这是一个轻量级的基于多层感知器（MLP）的损失预测器，能够实时量化每一步样本的难度。在困难负样本挖掘的指导下，它执行加权采样，以优先考虑高损失区域，从而实现自适应权重更新和更快的收敛。在推理阶段，我们设计了分层视觉任务分解器（Hierarchical Vision Task Segmenter, HVTS），它根据视觉输入将高层任务指令分解为多阶段低层子指令。它通过为简单动作分配较短的噪声调度和较长的直接执行序列，以及为复杂动作分配较长的噪声步骤和较短的执行序列，自适应地将动作序列细分为简单和复杂的子任务，从而显著降低计算开销并显著提高早期成功率。
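
The two adaptive mechanisms described above can be sketched in a few lines. This is a rough illustration, not the authors' implementation: the predicted losses are taken as given rather than produced by an MLP such as ALN, the function names are hypothetical, and the step counts and horizons are made-up placeholder values.

```python
import random


def sample_by_predicted_loss(predicted_losses, batch_size, seed=None):
    """Draw training indices with probability proportional to predicted loss.

    Higher predicted loss means the sample is treated as harder and is
    drawn more often (sampling is with replacement).
    """
    rng = random.Random(seed)
    # A small epsilon keeps every sample reachable and avoids all-zero weights.
    weights = [max(loss, 0.0) + 1e-8 for loss in predicted_losses]
    indices = range(len(predicted_losses))
    return rng.choices(indices, weights=weights, k=batch_size)


def schedule_for_subtask(is_complex):
    """Pick a (hypothetical) denoising budget and execution horizon per subtask."""
    if is_complex:
        # Complex subtask: more denoising steps, fewer actions executed per plan.
        return {"denoise_steps": 50, "execute_horizon": 4}
    # Simple subtask: fewer denoising steps, longer open-loop execution.
    return {"denoise_steps": 10, "execute_horizon": 16}


batch = sample_by_predicted_loss([0.1, 0.1, 2.5, 0.3], batch_size=8, seed=0)
```

Sampling with replacement keeps the sketch short; a real training loop would likely also correct for the bias that non-uniform sampling introduces.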

cs.RO / 19 / 2604.16201

DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs

DENALI：一个支持低成本激光雷达进行非视距空间推理的数据集

Behari, Nikhil, Rivero, Diego, Apostolides, Luke, Ghosh, Suman, Liang, Paul Pu, Raskar, Ramesh

Abstract

Consumer LiDARs in mobile devices and robots typically output a single depth value per pixel. Yet internally, they record full time-resolved histograms containing direct and multi-bounce light returns; these multi-bounce returns encode rich non-line-of-sight (NLOS) cues that can enable perception of hidden objects in a scene. However, severe hardware limitations of consumer LiDARs make NLOS reconstruction with conventional methods difficult. In this work, we motivate a complementary direction: enabling NLOS perception with low-cost LiDARs through data-driven inference. We present DENALI, the first large-scale real-world dataset of space-time histograms from low-cost LiDARs capturing hidden objects. We capture time-resolved LiDAR histograms for 72,000 hidden-object scenes across diverse object shapes, positions, lighting conditions, and spatial resolutions. Using our dataset, we show that consumer LiDARs can enable accurate, data-driven NLOS perception. We further identify key scene and modeling factors that limit performance, as well as simulation-fidelity gaps that hinder current sim-to-real transfer, motivating future work toward scalable NLOS vision with consumer LiDARs.

Chinese Translation

移动设备和机器人中的消费级激光雷达通常每个像素输出一个深度值。然而，它们内部记录了完整的时间分辨直方图，包含直接和多次反射的光返回；这些多次反射的返回编码了丰富的非视距（NLOS）线索，可以使场景中隐藏物体的感知成为可能。然而，消费级激光雷达的严重硬件限制使得使用传统方法进行NLOS重建变得困难。在本研究中，我们提出了一种互补的方向：通过数据驱动推理使低成本激光雷达实现NLOS感知。我们介绍了DENALI，这是第一个来自低成本激光雷达的大规模真实世界空间-时间直方图数据集，捕捉隐藏物体。我们为72,000个隐藏物体场景捕获了时间分辨的激光雷达直方图，这些场景涵盖了多样的物体形状、位置、光照条件和空间分辨率。利用我们的数据集，我们展示了消费级激光雷达能够实现准确的数据驱动NLOS感知。我们进一步识别了限制性能的关键场景和建模因素，以及阻碍当前模拟到现实转移的模拟保真度差距，激励未来在消费级激光雷达上实现可扩展的NLOS视觉的研究。

cs.RO / 20 / 2604.16263

Semantic Area Graph Reasoning for Multi-Robot Language-Guided Search

基于语义区域图推理的多机器人语言引导搜索

Wang, Ruiyang, Hsu, Hao-Lun, Kim, Jiwoo, Pajic, Miroslav

Abstract

Coordinating multi-robot systems (MRS) to search in unknown environments is particularly challenging for tasks that require semantic reasoning beyond geometric exploration. Classical coordination strategies rely on frontier coverage or information gain and cannot incorporate high-level task intent, such as searching for objects associated with specific room types. We propose Semantic Area Graph Reasoning (SAGR), a hierarchical framework that enables Large Language Models (LLMs) to coordinate multi-robot exploration and semantic search through a structured semantic-topological abstraction of the environment. SAGR incrementally constructs a semantic area graph from a semantic occupancy map, encoding room instances, connectivity, frontier availability, and robot states into a compact task-relevant representation for LLM reasoning. The LLM performs high-level semantic room assignment based on spatial structure and task context, while deterministic frontier planning and local navigation handle geometric execution within assigned rooms. Experiments on the Habitat-Matterport3D dataset across 100 scenarios show that SAGR remains competitive with state-of-the-art exploration methods while consistently improving semantic target search efficiency, by up to 18.8% in large environments. These results highlight the value of structured semantic abstractions as an effective interface between LLM-based reasoning and multi-robot coordination in complex indoor environments.

Chinese Translation

协调多机器人系统（MRS）在未知环境中进行搜索，对于需要超越几何探索的语义推理的任务尤其具有挑战性。传统的协调策略依赖于边界覆盖或信息增益，无法融入高层次的任务意图，例如搜索与特定房间类型相关的物体。我们提出了语义区域图推理（SAGR），这是一个分层框架，使大型语言模型（LLMs）能够通过环境的结构化语义拓扑抽象来协调多机器人探索和语义搜索。SAGR从语义占用图中逐步构建语义区域图，将房间实例、连通性、边界可用性和机器人状态编码为紧凑的任务相关表示，以供LLM推理。LLM基于空间结构和任务上下文执行高层次的语义房间分配，而确定性的边界规划和局部导航则处理分配房间内的几何执行。在100个场景的Habitat-Matterport3D数据集上的实验表明，SAGR在与最先进的探索方法竞争的同时，始终提高了语义目标搜索效率，在大型环境中提高幅度可达18.8%。这些结果突显了结构化语义抽象作为LLM基础推理与复杂室内环境中的多机器人协调之间有效接口的价值。
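
To make the idea of a compact, task-relevant graph representation concrete, the sketch below encodes rooms, connectivity, frontier availability, and robot states and serializes them for an LLM prompt. All class names, field names, and the JSON layout are hypothetical; the abstract does not specify SAGR's actual data format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class RoomNode:
    room_id: int
    label: str            # semantic room type, e.g. "kitchen" (illustrative)
    open_frontiers: int   # unexplored frontiers remaining in this room
    neighbors: List[int] = field(default_factory=list)  # connected room ids


@dataclass
class RobotState:
    robot_id: int
    current_room: int


def build_llm_context(rooms, robots, target):
    """Serialize the semantic area graph into a compact JSON string."""
    payload = {
        "target": target,
        "rooms": [asdict(room) for room in rooms],
        "robots": [asdict(robot) for robot in robots],
    }
    return json.dumps(payload, sort_keys=True)


rooms = [RoomNode(0, "kitchen", 2, [1]), RoomNode(1, "hallway", 0, [0])]
robots = [RobotState(0, 1)]
context = build_llm_context(rooms, robots, target="mug")
```

Keeping the prompt to room-level facts, rather than raw occupancy cells, is what lets the LLM reason about assignment while geometric planning stays deterministic.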

计算机视觉 (Computer Vision)

cs.CV / 1 / 2604.15376

Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines

缩放一致性：多步骤视觉定位管道中的免费置信信号

Kim, Keon, Chelikavada, Krish

Abstract

Multi-step zoom-in pipelines are widely used for GUI grounding, yet the intermediate predictions they produce are typically discarded after coordinate remapping. We observe that these intermediate outputs contain a useful confidence signal for free: zoom consistency, the distance between a model's step-2 prediction and the crop center. Unlike log-probabilities or token-level uncertainty, zoom consistency is a geometric quantity in a shared coordinate space, making it directly comparable across architecturally different VLMs without calibration. We prove this quantity is a linear estimator of step-1 spatial error under idealized conditions (perfect step-2, target within crop) and show it correlates with prediction correctness across two VLMs (AUC = 0.60; Spearman rho = -0.14, p < 10^{-6} for KV-Ground-8B; rho = -0.11, p = 0.0003 for Qwen3.5-27B). The correlation is small but consistent across models, application categories, and operating systems. As a proof-of-concept, we use zoom consistency to route between a specialist and generalist model, capturing 16.5% of the oracle headroom between them (+0.8%, McNemar p = 0.19). Code is available at https://github.com/omxyz/zoom-consistency-routing.

Chinese Translation

多步骤缩放管道广泛应用于图形用户界面（GUI）定位，但它们产生的中间预测通常在坐标重映射后被丢弃。我们观察到这些中间输出包含一个有用的免费置信信号：缩放一致性，即模型的第2步预测与裁剪中心之间的距离。与对数概率或标记级不确定性不同，缩放一致性是在共享坐标空间中的几何量，使其在不同架构的视觉语言模型（VLMs）之间可以直接比较而无需校准。我们证明在理想化条件下（完美的第2步，目标在裁剪内），该量是第1步空间误差的线性估计器，并展示其与两个VLM（AUC = 0.60；斯皮尔曼相关系数 rho = -0.14，p < 10^{-6} 对于 KV-Ground-8B；rho = -0.11，p = 0.0003 对于 Qwen3.5-27B）的预测正确性相关。尽管相关性较小，但在模型、应用类别和操作系统之间保持一致。作为概念验证，我们使用缩放一致性在专门模型和通用模型之间进行路由，捕获了它们之间16.5%的预期提升（+0.8%，McNemar p = 0.19）。代码可在 https://github.com/omxyz/zoom-consistency-routing 获取。
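
Zoom consistency, as described above, is a purely geometric quantity, so it can be computed without access to model internals. The sketch below assumes the step-2 prediction is given in normalized crop coordinates and that the crop is an axis-aligned box in image pixels; both conventions and the function names are assumptions, not details from the paper.

```python
import math


def remap_to_image(pred_in_crop, crop_box):
    """Map a normalized (0..1) crop-space point back to full-image pixels."""
    x0, y0, width, height = crop_box
    return (x0 + pred_in_crop[0] * width, y0 + pred_in_crop[1] * height)


def zoom_consistency(step2_pred_in_crop, crop_box):
    """Distance in image pixels between the step-2 prediction and the crop center."""
    x0, y0, width, height = crop_box
    center = (x0 + width / 2.0, y0 + height / 2.0)
    px, py = remap_to_image(step2_pred_in_crop, crop_box)
    return math.hypot(px - center[0], py - center[1])


# Crop of 200x200 px whose center is (100, 100); step 2 points at (130, 140).
score = zoom_consistency((0.65, 0.70), (0.0, 0.0, 200.0, 200.0))
```

If the crop is centered on the step-1 prediction (an assumption here) and step 2 lands exactly on the target, this distance equals the step-1 error, which matches the idealized setting named in the abstract.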

cs.CV / 2 / 2604.15451

Weak-to-Strong Knowledge Distillation Accelerates Visual Learning

弱到强的知识蒸馏加速视觉学习

Li, Baiang, Chai, Wenhao, Heide, Felix

Abstract

Large-scale visual learning is increasingly limited by training cost. Existing knowledge distillation methods transfer from a stronger teacher to a weaker student for compression or final-accuracy improvement. We instead investigate distillation to accelerate the training of strong students. We propose a generalizable plug-and-play recipe that freezes a weaker teacher, applies distillation only in early training, and turns it off once the student reaches and surpasses teacher-level performance. For ImageNet and CIFAR classification, this strategy reaches target thresholds much earlier, with up to 4.8 times speedup measured by epochs. We confirm that the method generalizes to other tasks and report 1.7 times epoch speedup for object detection on the COCO dataset, and 2.5 times earlier target-FID crossing for diffusion generation on the CIFAR-10 dataset, measured in steps. These findings validate our method as a universal speedup mechanism for visual learning.

Chinese Translation

大规模视觉学习越来越受到训练成本的限制。现有的知识蒸馏方法通常是从更强的教师模型向较弱的学生模型进行迁移，以实现压缩或最终精度的提升。相反，我们研究了蒸馏在加速强学生模型训练中的应用。我们提出了一种可推广的即插即用方法，该方法冻结较弱的教师模型，仅在早期训练阶段应用蒸馏，并在学生模型达到并超越教师模型性能后关闭蒸馏。对于ImageNet和CIFAR分类任务，该策略能够更早地达到目标阈值，训练速度提高可达4.8倍（以训练轮数计）。我们确认该方法可以推广到其他任务，并报告在COCO数据集上的目标检测任务中实现了1.7倍的训练轮数加速，以及在CIFAR-10数据集上的扩散生成任务中实现了2.5倍的更早目标-FID交叉（以步骤计）。这些发现验证了我们的方法作为视觉学习的通用加速机制。
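
The recipe above (distill early, then switch distillation off once the student overtakes the frozen teacher) amounts to a gate on the distillation term. The sketch below is one possible reading: the class name, the use of validation accuracy as the switch-off metric, and the decision to latch the gate permanently are all assumptions.

```python
class EarlyDistillationGate:
    """Turns the distillation term off once the student beats the teacher."""

    def __init__(self, teacher_accuracy, weight=1.0):
        self.teacher_accuracy = teacher_accuracy  # frozen weak teacher's score
        self.weight = weight                      # distillation loss weight
        self.active = True

    def update(self, student_accuracy):
        """Return the distillation weight to use for the next training phase."""
        if self.active and student_accuracy > self.teacher_accuracy:
            # Latch off: the gate never re-enables, even if accuracy dips later.
            self.active = False
        return self.weight if self.active else 0.0


def total_loss(task_loss, distill_loss, distill_weight):
    """Combine the supervised loss with the (possibly disabled) distillation loss."""
    return task_loss + distill_weight * distill_loss


gate = EarlyDistillationGate(teacher_accuracy=0.70)
w_early = gate.update(student_accuracy=0.55)  # student still below the teacher
w_late = gate.update(student_accuracy=0.72)   # student has surpassed the teacher
```

Once the weight is zero, the teacher's forward pass can be skipped as well, so the later epochs cost the same as ordinary training.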

cs.CV / 3 / 2604.15453

(1D) Ordered Tokens Enable Efficient Test-Time Search

(1D) 有序标记促进高效的测试时搜索

Gao, Zhitong, Rezaei, Parham, Cy, Ali, Ye, Mingqiao, Jovanović, Nataša, Allardice, Jesse, Dehghan, Afshin, Zamir, Amir, Bachmann, Roman, Kar, Oğuzhan Fatih

Abstract

Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling. Commonly, tokens describe local information, such as regions of pixels in images or word pieces in text, and AR generation predicts these tokens in a fixed order. A worthwhile question is whether token structures affect the ability to steer the generation through test-time search, where multiple candidate generations are explored and evaluated by a verifier. Using image generation as our testbed, we hypothesize that recent 1D ordered tokenizers with coarse-to-fine structure can be more amenable to search than classical 2D grid structures. This is rooted in the fact that the intermediate states in coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate, enabling effective steering during generation. Through controlled experiments, we find that AR models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, we demonstrate that, thanks to the ordered structure, pure test-time search over token sequences (i.e., without training an AR model) can perform training-free text-to-image generation when guided by an image-text verifier. Beyond this, we systematically study how classical search algorithms (best-of-N, beam search, lookahead search) interact with different token structures, as well as the role of different verifiers and AR priors. Our results highlight the impact of token structure on inference-time scalability and provide practical guidance for test-time scaling in AR models.

Chinese Translation

标记化是自回归（AR）生成模型的关键组成部分，它将原始数据转换为更易于建模的单元。通常，标记描述局部信息，例如图像中的像素区域或文本中的词片，而AR生成以固定顺序预测这些标记。一个值得探讨的问题是，标记结构是否影响通过测试时搜索引导生成的能力，在这种情况下，多个候选生成被探索并由验证器评估。以图像生成作为我们的测试平台，我们假设最近的具有粗到细结构的1D有序标记器比传统的2D网格结构更适合搜索。这一假设源于粗到细序列中的中间状态携带语义意义，验证器可以可靠地评估，从而在生成过程中实现有效引导。通过控制实验，我们发现训练于粗到细有序标记的AR模型在测试时的扩展行为优于基于网格的对应模型。此外，我们展示了得益于有序结构，纯粹的测试时搜索标记序列（即不训练AR模型）可以在图像-文本验证器的指导下执行无训练的文本到图像生成。除此之外，我们系统地研究了经典搜索算法（最佳N、束搜索、前瞻搜索）如何与不同的标记结构相互作用，以及不同验证器和AR先验的角色。我们的结果突显了标记结构对推理时可扩展性的影响，并为AR模型的测试时扩展提供了实用指导。
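
A toy example helps show why verifier-guided search benefits from coarse-to-fine ordering: if every prefix is already a meaningful approximation, a verifier can score partial sequences. In the sketch below, decimal digits stand in for ordered tokens and a distance to a target value stands in for the verifier. Both are illustrative stand-ins, not the tokenizers or verifiers used in the paper.

```python
def beam_search(vocab, length, verifier, beam_width):
    """Verifier-guided beam search over token prefixes (no generative prior)."""
    beams = [()]
    for _ in range(length):
        candidates = [prefix + (token,) for prefix in beams for token in vocab]
        # Keep the prefixes the verifier scores highest.
        candidates.sort(key=verifier, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


def decode(prefix):
    """Coarse-to-fine decoding: earlier digits carry more of the value."""
    return sum(digit / 10 ** (i + 1) for i, digit in enumerate(prefix))


def make_verifier(target):
    """Toy verifier: higher score when the decoded prefix is closer to target."""
    return lambda prefix: -abs(target - decode(prefix))


verifier = make_verifier(0.372)
greedy = beam_search(range(10), 3, verifier, beam_width=1)  # (4, 0, 0)
wider = beam_search(range(10), 3, verifier, beam_width=2)   # (3, 7, 2)
```

Greedy search commits to the first digit that looks closest and cannot recover, while a beam of two keeps the alternative prefix alive and reaches the exact target. This is a small instance of how extra test-time search can pay off when prefixes are evaluable.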

cs.CV / 4 / 2604.15521

Frequency-Aware Flow Matching for High-Quality Image Generation

频率感知流匹配用于高质量图像生成

Ren, Sucheng, Yu, Qihang, He, Ju, Shen, Xiaohui, Yuille, Alan, Chen, Liang-Chieh

Abstract

Flow matching models have emerged as a powerful framework for realistic image generation by learning to reverse a corruption process that progressively adds Gaussian noise. However, because noise is injected in the latent domain, its impact on different frequency components is non-uniform. As a result, during inference, flow matching models tend to generate low-frequency components (global structure) in the early stages, while high-frequency components (fine details) emerge only later in the reverse process. Building on this insight, we propose Frequency-Aware Flow Matching (FreqFlow), a novel approach that explicitly incorporates frequency-aware conditioning into the flow matching framework via time-dependent adaptive weighting. We introduce a two-branch architecture: (1) a frequency branch that separately processes low- and high-frequency components to capture global structure and refine textures and edges, and (2) a spatial branch that synthesizes images in the latent domain, guided by the frequency branch's output. By explicitly integrating frequency information into the generation process, FreqFlow ensures that both large-scale coherence and fine-grained details are effectively modeled: low-frequency conditioning reinforces global structure, while high-frequency conditioning enhances texture fidelity and detail sharpness. On the class-conditional ImageNet-256 generation benchmark, our method achieves state-of-the-art performance with an FID of 1.38, surpassing the prior diffusion model DiT and flow matching model SiT by 0.79 and 0.58 FID, respectively. Code is available at https://github.com/OliverRensu/FreqFlow.

Chinese Translation

流匹配模型作为一种强大的框架，通过学习逆转逐步添加高斯噪声的损坏过程，已成为现实图像生成的重要工具。然而，由于噪声是在潜在域中注入的，其对不同频率成分的影响并不均匀。因此，在推理过程中，流匹配模型倾向于在早期阶段生成低频成分（全局结构），而高频成分（细节）则仅在逆过程的后期出现。基于这一洞察，我们提出了频率感知流匹配（Frequency-Aware Flow Matching，FreqFlow），这是一种新颖的方法，通过时间依赖的自适应加权，明确地将频率感知条件融入流匹配框架中。我们引入了一个双分支架构：(1) 一个频率分支，分别处理低频和高频成分，以捕捉全局结构并细化纹理和边缘；(2) 一个空间分支，在潜在域中合成图像，受频率分支输出的指导。通过明确将频率信息整合到生成过程中，FreqFlow确保了大规模一致性和细致入微的细节得到有效建模，低频条件强化了全局结构，而高频条件则提升了纹理保真度和细节清晰度。在类别条件的ImageNet-256生成基准测试中，我们的方法以1.38的FID达到了最先进的性能，分别超越了之前的扩散模型DiT和流匹配模型SiT 0.79和0.58的FID。代码可在https://github.com/OliverRensu/FreqFlow获取。

cs.CV / 5 / 2604.15542

UA-Net: Uncertainty-Aware Network for TRISO Image Semantic Segmentation

UA-Net：一种考虑不确定性的TRISO图像语义分割网络

Lucke, Kyle, Krajewska-Travar, Zuzanna, Sun, Shoukun, Cai, Lu, Stempien, John D., Xian, Min

Abstract

Tristructural isotropic (TRISO)-coated particle fuels undergo dimensional changes and chemical reactions during high-temperature neutron irradiation. Post-irradiation materialography helps understand processes that impact fuel performance, such as coating integrity and fission product retention. Conventionally, experts manually evaluate features in thousands of cross sections of sub-mm-sized samples, which is tedious and subjective. In this work, we propose UA-Net, a deep learning framework that segments five characteristic regions of TRISO fuel micrographs and generates an uncertainty map for predictions. The model uses a multi-stage pretraining strategy, starting with general image representations learned from ImageNet, followed by fine-tuning on TRISO micrographs from various irradiation experiments and AGR-5/6/7 particle cross sections. A meta-model for uncertainty prediction is integrated to identify small defects in TRISO images. UA-Net was evaluated on a test set of 102 images, achieving mean Intersection over Union (mIoU) and mean Precision (mP) of 95.5% and 97.3%, respectively. The meta-model achieved a specificity of 91.8% and sensitivity of 93.5%, demonstrating strong performance in detecting misclassifications. The model was also applied to new TRISO images for qualitative evaluation, showing high accuracy in extracting layer regions.

Chinese Translation

三结构各向同性（TRISO）涂层颗粒燃料在高温中子辐照过程中会发生尺寸变化和化学反应。辐照后材料学帮助理解影响燃料性能的过程，例如涂层完整性和裂变产物的保留。传统上，专家需要手动评估数千个亚毫米级样本的截面特征，这一过程既繁琐又主观。在本研究中，我们提出了UA-Net，一种深度学习框架，用于对TRISO燃料显微图像的五个特征区域进行分割，并生成预测的不确定性图。该模型采用多阶段预训练策略，首先从ImageNet学习一般图像表示，然后在来自各种辐照实验和AGR-5/6/7颗粒截面的TRISO显微图像上进行微调。集成了不确定性预测的元模型，用于识别TRISO图像中的小缺陷。UA-Net在102幅图像的测试集上进行了评估，平均交并比（mIoU）和平均精度（mP）分别达到了95.5%和97.3%。元模型的特异性为91.8%，灵敏度为93.5%，在检测错误分类方面表现出色。该模型还应用于新的TRISO图像进行定性评估，显示出在提取层区域方面的高准确性。
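
For readers unfamiliar with the reported metrics, the sketch below shows how mean Intersection over Union, sensitivity, and specificity are commonly computed. It operates on flattened label lists for brevity; the function names are hypothetical and the inputs are toy values, not data from the paper.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious)


def sensitivity_specificity(flagged, is_error):
    """Sensitivity and specificity of a binary detector of misclassifications."""
    tp = sum(1 for f, e in zip(flagged, is_error) if f and e)
    fn = sum(1 for f, e in zip(flagged, is_error) if not f and e)
    tn = sum(1 for f, e in zip(flagged, is_error) if not f and not e)
    fp = sum(1 for f, e in zip(flagged, is_error) if f and not e)
    return tp / (tp + fn), tn / (tn + fp)


# Toy 4-pixel example with two classes.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)  # (1/2 + 2/3) / 2
```

Whether absent classes are skipped or counted as zero changes mIoU noticeably on small test sets; the convention used here (skip) is an assumption.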

cs.CV / 6 / 2604.15555

CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification

CXR-LT 2026 挑战：多中心长尾和零样本胸部X光分类

Dong, Hexin, Lin, Yi, Zhou, Pengyu, Zhao, Fengnian, Legasto, Alan Clint, Cho, Juno, Kim, Dohui, Kim, Justin Namuk, Kim, Mingeon, Kwak, Sunwoo, Moyà-Alcover, Gabriel, Nguyen, Ky Trung, Nguyen, Thanh-Huy, Pham, Ha-Hieu, Pham, Huy-Hieu, Pham, Huy Le, Sulake, Nikhileswara Rao, Tur-Serrano, Aina, Zhang, Ruichi, Zu, Ang, Flanders, Adam E., Lu, Zhiyong, Summers, Ronald M., Lin, Mingquan, Chen, Hao, Yang, Yuzhe, Shih, George, Peng, Yifan

Abstract

Chest X-ray (CXR) interpretation is hindered by the long-tailed distribution of pathologies and the open-world nature of clinical environments. Existing benchmarks often rely on closed-set classes from a single institution, failing to capture the prevalence of rare diseases or the appearance of novel findings. To address this, we present the CXR-LT challenge. The first event, CXR-LT 2023, established a large-scale benchmark for long-tailed multi-label CXR classification and identified key challenges in rare disease recognition. CXR-LT 2024 further expanded the label space and introduced a zero-shot task to study generalization to unseen findings. Building on the success of CXR-LT 2023 and 2024, this third iteration of the benchmark introduces a multi-center dataset comprising over 145,000 images from the PadChest and NIH Chest X-ray datasets. Additionally, all development and test sets in CXR-LT 2026 are annotated by radiologists, providing a more reliable and clinically grounded evaluation than report-derived labels. The challenge defines two core tasks this year: (1) Robust Multi-Label Classification on 30 known classes and (2) Open-World Generalization to 6 unseen (out-of-distribution) rare disease classes. This paper provides an overview of the CXR-LT 2026 challenge. We describe the data collection and annotation procedures, analyze solution strategies adopted by participating teams, and evaluate head-versus-tail performance, calibration, and cross-center generalization gaps. Our results show that vision-language foundation models improve both in-distribution and zero-shot performance, but detecting rare findings under multi-center shift remains challenging. Our study provides a foundation for developing and evaluating AI systems in realistic long-tailed and open-world clinical conditions.

Chinese Translation

胸部X光（CXR）解读受到病理分布的长尾特征和临床环境的开放世界性质的影响。现有基准通常依赖于来自单一机构的封闭集类别，未能捕捉稀有疾病的流行程度或新发现的出现。为了解决这一问题，我们提出了CXR-LT挑战。第一次活动CXR-LT 2023建立了一个大规模的长尾多标签CXR分类基准，并确定了稀有疾病识别中的关键挑战。CXR-LT 2024进一步扩展了标签空间，并引入了零样本任务，以研究对未见发现的泛化能力。在CXR-LT 2023和2024成功的基础上，本次基准的第三次迭代引入了一个多中心数据集，包含来自PadChest和NIH胸部X光数据集的超过145,000张图像。此外，CXR-LT 2026中的所有开发和测试集均由放射科医生进行注释，提供了比报告派生标签更可靠和临床基础的评估。今年的挑战定义了两个核心任务：（1）对30个已知类别进行稳健的多标签分类，以及（2）对6个未见（分布外）稀有疾病类别进行开放世界泛化。本文总结了CXR-LT 2026挑战的概述。我们描述了数据收集和注释程序，分析了参与团队采用的解决策略，并评估了头部与尾部的性能、校准和跨中心泛化差距。我们的结果表明，视觉-语言基础模型在分布内和零样本性能上均有所提升，但在多中心转移下检测稀有发现仍然具有挑战性。我们的研究为在现实的长尾和开放世界临床条件下开发和评估人工智能系统提供了基础。

cs.CV / 7 / 2604.15611

CLIMB: Controllable Longitudinal Brain Image Generation using Mamba-based Latent Diffusion Model and Gaussian-aligned Autoencoder

CLIMB：基于Mamba的潜在扩散模型和高斯对齐自编码器的可控纵向脑图像生成

Dao, Duy-Phuong, Taqiyuddin, Muhammad, Kim, Jahae, Lee, Sang-Heon, Jung, Hye-Won, Choi, Jaehoo, Yang, Hyung-Jeong

Abstract

Latent diffusion models (LDMs) have emerged as powerful generative models in medical imaging, enabling the synthesis of high-quality brain magnetic resonance imaging (MRI) scans. In particular, predicting the evolution of a patient's brain can aid in early intervention, prognosis, and treatment planning. In this study, we introduce CLIMB (Controllable Longitudinal brain Image generation via Mamba-based latent diffusion model), a framework for modeling temporal changes in brain structure. CLIMB models the structural evolution of the brain over time, using a baseline MRI scan and its acquisition age as foundational inputs. Additionally, multiple conditional variables, including projected age, gender, disease status, genetic information, and brain structure volumes, are incorporated to enhance the temporal modeling of anatomical changes. Existing LDM methods rely on self-attention modules, which effectively capture contextual information from input images but are computationally expensive; our approach instead leverages Mamba, a state space model architecture that substantially reduces computational overhead while preserving high-quality image synthesis. Furthermore, we introduce a Gaussian-aligned autoencoder that extracts latent representations conforming to prior distributions without the sampling noise inherent in conventional variational autoencoders. We train and evaluate our proposed model on the Alzheimer's Disease Neuroimaging Initiative dataset, consisting of 6,306 MRI scans from 1,390 participants. By comparing generated images with real MRI scans, CLIMB achieves a structural similarity index of 0.9433, demonstrating notable improvements over existing methods.

Chinese Translation

潜在扩散模型已成为医学成像中强大的生成模型，使得高质量的脑部磁共振成像扫描的合成成为可能。特别是，预测患者大脑的演变可以帮助早期干预、预后和治疗规划。在本研究中，我们介绍了CLIMB，通过基于Mamba的潜在扩散模型进行可控的纵向脑图像生成，这是一个用于建模脑结构时间变化的先进框架。CLIMB旨在建模脑结构随时间的结构演变，利用基线MRI扫描及其获取年龄作为基础输入。此外，多个条件变量，包括预测年龄、性别、疾病状态、遗传信息和脑结构体积，被纳入以增强解剖变化的时间建模。与现有的依赖自注意力模块的LDM方法不同，后者有效捕捉输入图像的上下文信息但计算开销较大，我们的方法利用Mamba，这是一种状态空间模型架构，显著减少了计算开销，同时保持高质量的图像合成。此外，我们引入了一种高斯对齐自编码器，提取符合先验分布的潜在表示，而不受传统变分自编码器固有的采样噪声的影响。我们在阿尔茨海默病神经影像学倡议（Alzheimer's Disease Neuroimaging Initiative）数据集上训练和评估了我们提出的模型，该数据集包含来自1,390名参与者的6,306个MRI扫描。通过将生成的图像与真实的MRI扫描进行比较，CLIMB实现了0.9433的结构相似性指数，显示出相较于现有方法的显著改进。

cs.CV / 8 / 2604.15622

AdaVFM: Adaptive Vision Foundation Models for Edge Intelligence via LLM-Guided Execution

AdaVFM：通过大语言模型引导执行的边缘智能自适应视觉基础模型

Zhao, Yiwei, Zheng, Yi, Su, Huapeng, Lin, Jieyu, Ambrogio, Stefano, Jose, Cijo, Ramamonjisoa, Michaël, Labatut, Patrick, De Salvo, Barbara, Liu, Chiao, Gibbons, Phillip B., Li, Ziyun

Abstract

Language-aligned vision foundation models (VFMs) enable versatile visual understanding for always-on contextual AI, but their deployment on edge devices is hindered by strict latency and power constraints. We present AdaVFM, an adaptive framework for efficient on-device inference of language-aligned VFMs that dynamically adjusts computation based on scene context and task complexity. Our key insight is that the effect of model size reduction on performance is task-dependent in vision applications, motivating a runtime-adaptive execution strategy. AdaVFM integrates neural architecture search (NAS) into the language-aligned VFM backbone to enable lightweight subnet execution during runtime. A multimodal large language model (LLM) deployed on the cloud enables runtime control with a context-aware agent. This synergy allows efficient model adaptation under diverse conditions while maintaining strong accuracy. Extensive experiments on zero-shot classification and open-vocabulary segmentation demonstrate that AdaVFM achieves state-of-the-art accuracy-efficiency trade-offs, surpassing prior baselines by up to 7.9% in acc@1 on IN1K and 5.2% mIoU on ADE20K over the best models of comparable VFM sizes. For models with similar accuracy, AdaVFM further reduces average FLOPs by up to 77.9%.

Chinese Translation

与语言对齐的视觉基础模型（VFM）为始终在线的上下文人工智能提供了多样化的视觉理解，但在边缘设备上的部署受到严格的延迟和功耗限制的阻碍。我们提出了AdaVFM，这是一个高效的设备内推理自适应框架，能够根据场景上下文和任务复杂性动态调整计算。我们的关键见解是，在视觉应用中，模型规模缩减对性能的影响是任务依赖的，这促使我们采用运行时自适应执行策略。AdaVFM将神经架构搜索（NAS）集成到与语言对齐的VFM骨干网络中，以便在运行时实现轻量级子网络执行。部署在云端的多模态大语言模型（LLM）使得通过上下文感知代理进行运行时控制成为可能。这种协同作用使得在多种条件下实现高效的模型适应，同时保持较强的准确性。在零样本分类和开放词汇分割的广泛实验中，AdaVFM实现了最先进的准确性与效率权衡，在IN1K数据集上相较于以往基线提高了高达7.9%的acc@1，在ADE20K数据集上提高了5.2%的mIoU，超越了同类VFM规模的最佳模型。对于具有相似准确性的模型，AdaVFM进一步将平均FLOPs减少了高达77.9%。

cs.CV / 9 / 2604.15628

SIMMER: Cross-Modal Food Image-Recipe Retrieval via MLLM-Based Embedding

SIMMER：基于MLLM嵌入的跨模态食品图像与食谱检索

Gomi, Keisuke, Yanai, Keiji

Abstract

Cross-modal retrieval between food images and recipe texts is an important task with applications in nutritional management, dietary logging, and cooking assistance. Existing methods predominantly rely on dual-encoder architectures with separate image and text encoders, requiring complex alignment strategies and task-specific network designs to bridge the semantic gap between modalities. In this work, we propose SIMMER (Single Integrated Multimodal Model for Embedding Recipes), which applies Multimodal Large Language Model (MLLM)-based embedding models, specifically VLM2Vec, to this task, replacing the conventional dual-encoder paradigm with a single unified encoder that processes both food images and recipe texts. We design prompt templates tailored to the structured nature of recipes, which consist of a title, ingredients, and cooking instructions, enabling effective embedding generation by the MLLM. We further introduce a component-aware data augmentation strategy that trains the model on both complete and partial recipes, improving robustness to incomplete inputs. Experiments on the Recipe1M dataset demonstrate that SIMMER achieves state-of-the-art performance across both the 1k and 10k evaluation settings, substantially outperforming all prior methods. In particular, our best model improves the 1k image-to-recipe R@1 from 81.8% to 87.5% and the 10k image-to-recipe R@1 from 56.5% to 65.5% compared to the previous best method.

Chinese Translation

食品图像与食谱文本之间的跨模态检索是一项重要任务，具有营养管理、饮食记录和烹饪辅助等应用。现有方法主要依赖于双编码器架构，分别对图像和文本进行编码，这需要复杂的对齐策略和特定任务的网络设计，以弥合模态之间的语义差距。在本研究中，我们提出了SIMMER（单一集成多模态模型用于嵌入食谱），该模型应用基于多模态大语言模型（MLLM）的嵌入模型，特别是VLM2Vec，来处理这一任务，用单一统一的编码器替代传统的双编码器范式，处理食品图像和食谱文本。我们设计了针对食谱结构特性的提示模板，食谱由标题、成分和烹饪说明组成，从而使MLLM能够有效生成嵌入。我们进一步引入了一种组件感知的数据增强策略，在完整和部分食谱上训练模型，提高了对不完整输入的鲁棒性。在Recipe1M数据集上的实验表明，SIMMER在1k和10k评估设置中均实现了最先进的性能，显著超越了所有先前的方法。特别是，我们的最佳模型将1k图像到食谱的R@1从81.8%提高到87.5%，将10k图像到食谱的R@1从56.5%提高到65.5%，相比于之前的最佳方法。
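
The R@1 figures above are recall-at-K scores over a similarity matrix between image and recipe embeddings. The sketch below shows the usual computation on a toy matrix, assuming query i's correct recipe sits at index i; the function name and the matrix values are illustrative only.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item ranks within the top k.

    similarity[i][j] is the score of image query i against recipe j;
    the ground-truth match for query i is assumed to be recipe i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


toy_similarity = [
    [0.9, 0.1, 0.0],  # query 0 ranks its own recipe first
    [0.8, 0.7, 0.1],  # query 1 ranks recipe 0 above its own
    [0.2, 0.3, 0.9],  # query 2 ranks its own recipe first
]
r1 = recall_at_k(toy_similarity, k=1)  # 2 of 3 queries hit
r2 = recall_at_k(toy_similarity, k=2)  # all 3 queries hit
```

The 1k and 10k settings in the abstract refer to the size of the candidate pool, which is why R@1 drops as the pool grows.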

cs.CV / 10 / 2604.15631

Causal Bootstrapped Alignment for Unsupervised Video-Based Visible-Infrared Person Re-Identification

用于无监督视频基础可见-红外人重识别的因果自助对齐

Li, Shuang, Leng, Jiaxu, Kuang, Changjiang, Tan, Mingpi, Yuan, Yu, Gao, Xinbo

Abstract

Video-based visible-infrared person re-identification (VVI-ReID) is a critical technique for all-day surveillance, where temporal information provides additional cues beyond static images. However, existing approaches rely heavily on fully supervised learning with expensive cross-modality annotations, limiting scalability. To address this issue, we investigate Unsupervised Learning for VVI-ReID (USL-VVI-ReID), which learns identity-discriminative representations directly from unlabeled video tracklets. Directly extending image-based USL-VI-ReID methods to this setting with generic pretrained encoders leads to suboptimal performance. Such encoders suffer from weak identity discrimination and strong modality bias, resulting in severe intra-modality identity confusion and pronounced clustering granularity imbalance between visible and infrared modalities. These issues jointly degrade pseudo-label reliability and hinder effective cross-modality alignment. To address these challenges, we propose a Causal Bootstrapped Alignment (CBA) framework that explicitly exploits inherent video priors. First, we introduce Causal Intervention Warm-up (CIW), which performs sequence-level causal interventions by leveraging temporal identity consistency and cross-modality identity consistency to suppress modality- and motion-induced spurious correlations while preserving identity-relevant semantics, yielding cleaner representations for unsupervised clustering. Second, we propose Prototype-Guided Uncertainty Refinement (PGUR), which employs a coarse-to-fine alignment strategy to resolve cross-modality granularity mismatch, reorganizing under-clustered infrared representations under the guidance of reliable visible prototypes with uncertainty-aware supervision. Extensive experiments on the HITSZ-VCM and BUPTCampus benchmarks demonstrate that CBA significantly outperforms existing USL-VI-ReID methods when extended to the USL-VVI-ReID setting.

Chinese Translation

VVI-ReID 是全天候监控中的一项关键技术，其中时间信息提供了超越静态图像的额外线索。然而，现有方法严重依赖于昂贵的跨模态标注的完全监督学习，限制了其可扩展性。为了解决这个问题，我们研究了 VVI-ReID 的无监督学习（USL-VVI-ReID），该方法直接从未标记的视频轨迹中学习身份区分表示。将基于图像的 USL-VI-ReID 方法直接扩展到这种设置中，并使用通用的预训练编码器会导致次优性能。这些编码器在身份区分上表现较弱，并且存在强烈的模态偏差，导致可见和红外模态之间严重的模态内身份混淆和明显的聚类粒度不平衡。这些问题共同降低了伪标签的可靠性，并阻碍了有效的跨模态对齐。为了解决这些挑战，我们提出了一种因果自助对齐（CBA）框架，明确利用固有的视频先验。首先，我们引入了因果干预热身（CIW），通过利用时间身份一致性和跨模态身份一致性进行序列级因果干预，以抑制模态和运动引起的虚假相关，同时保留与身份相关的语义，从而为无监督聚类提供更清晰的表示。其次，我们提出了原型引导的不确定性精炼（PGUR），采用粗到细的对齐策略来解决跨模态粒度不匹配，在可靠的可见原型的指导下，重新组织聚类不足的红外表示，并进行不确定性感知的监督。对 HITSZ-VCM 和 BUPTCampus 基准的广泛实验表明，当扩展到 USL-VVI-ReID 设置时，CBA 显著优于现有的 USL-VI-ReID 方法。


cs.CV / 11 / 2604.15651

SPLIT: Self-supervised Partitioning for Learned Inversion in Nonlinear Tomography

SPLIT：用于非线性断层成像的自监督分区学习反演

Haltmeier, Markus, Neumann, Lukas, Gruber, Nadja, Hwang, Gyeongha

Abstract

Machine learning has achieved impressive performance in tomographic reconstruction, but supervised training requires paired measurements and ground-truth images that are often unavailable. This has motivated self-supervised approaches, which have primarily addressed denoising and, more recently, linear inverse problems. We address nonlinear inverse problems and introduce SPLIT (Self-supervised Partitioning for Learned Inversion in Nonlinear Tomography), a self-supervised machine-learning framework for reconstructing images from nonlinear, incomplete, and noisy projection data without any samples of ground-truth images. SPLIT enforces cross-partition consistency and measurement-domain fidelity while exploiting complementary information across multiple partitions. Our main theoretical result shows that, under mild conditions, the proposed self-supervised objective is equivalent to its supervised counterpart in expectation. We regularize training with an automatic stopping rule that halts optimization when a no-reference image-quality surrogate saturates. As a concrete application, we derive SPLIT variants for multispectral computed tomography. Experiments on sparse-view acquisitions demonstrate high reconstruction quality and robustness to noise, surpassing classical iterative reconstruction and recent self-supervised baselines.

Chinese Translation

机器学习在断层重建方面取得了显著的性能，但监督训练需要配对的测量数据和真实图像，而这些数据通常不可用。这促使了自监督方法的发展，这些方法主要解决去噪问题，最近也开始关注线性反演问题。我们关注非线性反演问题，并引入SPLIT（Self-supervised Partitioning for Learned Inversion in Nonlinear Tomography），这是一个自监督机器学习框架，旨在从非线性、不完整和噪声的投影数据中重建图像，而无需任何真实图像样本。SPLIT 强制执行跨分区一致性和测量域保真度，同时利用多个分区之间的互补信息。我们的主要理论结果表明，在温和条件下，所提出的自监督目标在期望上等同于其监督对应物。我们通过自动停止规则对训练进行正则化，当无参考图像质量替代指标饱和时停止优化。作为具体应用，我们推导了多光谱计算断层成像的SPLIT变体。在稀疏视图采集上的实验表明，重建质量高且对噪声具有良好的鲁棒性，超越了经典的迭代重建方法和最近的自监督基线。


cs.CV / 12 / 2604.15652

Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline

面向现实的开放词汇遥感图像分割：基准与基线

Li, Bingyu, Huo, Tao, Dong, Haocheng, Zhang, Da, Zhao, Zhiyuan, Gao, Junyu, Li, Xuelong

Abstract

Open-vocabulary remote sensing image segmentation (OVRSIS) remains underexplored due to fragmented datasets, limited training diversity, and the lack of evaluation benchmarks that reflect realistic geospatial application demands. Our previous OVRSISBenchV1 established an initial cross-dataset evaluation protocol, but its limited scope is insufficient for assessing realistic open-world generalization. To address this issue, we propose OVRSISBenchV2, a large-scale and application-oriented benchmark for OVRSIS. We first construct OVRSIS95K, a balanced dataset of about 95K image-mask pairs covering 35 common semantic categories across diverse remote sensing scenes. Built upon OVRSIS95K and 10 downstream datasets, OVRSISBenchV2 contains 170K images and 128 categories, substantially expanding scene diversity, semantic coverage, and evaluation difficulty. Beyond standard open-vocabulary segmentation, it further includes downstream protocols for building extraction, road extraction, and flood detection, thereby better reflecting realistic geospatial application demands and complex deployment scenarios. We also propose Pi-Seg, a baseline for OVRSIS. Pi-Seg improves transferability through a positive-incentive noise mechanism, where learnable and semantically guided perturbations broaden the visual-text feature space during training. Extensive experiments on OVRSISBenchV1, OVRSISBenchV2, and downstream tasks show that Pi-Seg delivers strong and consistent results, particularly on the more challenging OVRSISBenchV2 benchmark. Our results highlight both the importance of realistic benchmark design and the effectiveness of perturbation-based transfer for OVRSIS. The code and datasets are available at https://github.com/LiBingyu01/RSKT-Seg/tree/Pi-Seg.

Chinese Translation

开放词汇遥感图像分割（OVRSIS）仍然未得到充分探索，原因在于数据集碎片化、训练多样性有限以及缺乏反映现实地理空间应用需求的评估基准。我们之前的 OVRSISBenchV1 建立了初步的跨数据集评估协议，但其有限的范围不足以评估现实的开放世界泛化能力。为了解决这一问题，我们提出了 OVRSISBenchV2，这是一个大规模且面向应用的 OVRSIS 基准。我们首先构建了 OVRSIS95K，这是一个包含约 95K 图像-掩膜对的平衡数据集，涵盖了 35 个常见语义类别，适用于多样的遥感场景。基于 OVRSIS95K 和 10 个下游数据集，OVRSISBenchV2 包含 170K 图像和 128 个类别，显著扩展了场景多样性、语义覆盖和评估难度。除了标准的开放词汇分割外，它还包括用于建筑物提取、道路提取和洪水检测的下游协议，从而更好地反映现实地理空间应用需求和复杂的部署场景。我们还提出了 Pi-Seg，作为 OVRSIS 的基线。Pi-Seg 通过正激励噪声机制提高了迁移性，在训练过程中，可学习和语义引导的扰动扩展了视觉-文本特征空间。在 OVRSISBenchV1、OVRSISBenchV2 和下游任务上的大量实验表明，Pi-Seg 提供了强大且一致的结果，特别是在更具挑战性的 OVRSISBenchV2 基准上。我们的结果强调了现实基准设计的重要性以及基于扰动的迁移在 OVRSIS 中的有效性。代码和数据集可在 https://github.com/LiBingyu01/RSKT-Seg/tree/Pi-Seg 获取。


cs.CV / 13 / 2604.15654

From Zero to Detail: A Progressive Spectral Decoupling Paradigm for UHD Image Restoration with New Benchmark

从零到细节：一种渐进式光谱解耦范式用于超高清图像恢复的新基准

Zhao, Chen, Xu, Yunzhe, Chen, Zhizhou, Gu, Enxuan, Zhang, Kai, Liu, Xiaoming, Yang, Jian, Tai, Ying

Abstract

Ultra-high-definition (UHD) image restoration poses unique challenges due to the high spatial resolution, diverse content, and fine-grained structures present in UHD images. To address these issues, we introduce a progressive spectral decomposition for the restoration process, decomposing it into three stages: zero-frequency enhancement, low-frequency restoration, and high-frequency refinement. Based on this formulation, we propose a novel framework, ERR, which integrates three cooperative sub-networks: the zero-frequency enhancer (ZFE), the low-frequency restorer (LFR), and the high-frequency refiner (HFR). The ZFE incorporates global priors to learn holistic mappings, the LFR reconstructs the main content by focusing on coarse-scale information, and the HFR adopts our proposed frequency-windowed Kolmogorov-Arnold Network (FW-KAN) to recover fine textures and intricate details for high-fidelity restoration. To further advance research in UHD image restoration, we also construct a large-scale, high-quality benchmark dataset, LSUHDIR, comprising 82,126 UHD images with diverse scenes and rich content. Our proposed methods demonstrate superior performance across a range of UHD image restoration tasks, and extensive ablation studies confirm the contribution and necessity of each module. Project page: https://github.com/NJU-PCALab/ERR.

Chinese Translation

超高清（UHD）图像恢复由于其高空间分辨率、多样化内容和精细结构而面临独特挑战。为了解决这些问题，我们提出了一种渐进式频谱分解方法，将恢复过程分为三个阶段：零频率增强（enhancement）、低频率恢复（restoration）和高频率细化（refinement）。基于这一表述，我们提出了一种新颖的框架ERR，它集成了三个协同子网络：零频率增强器（ZFE）、低频率恢复器（LFR）和高频率细化器（HFR）。ZFE结合全局先验知识以学习整体映射，LFR通过关注粗尺度信息重建主要内容，而HFR采用我们提出的频率窗口化Kolmogorov-Arnold网络（FW-KAN）来恢复细腻的纹理和复杂的细节，以实现高保真恢复。为了进一步推动UHD图像恢复的研究，我们还构建了一个大规模高质量基准数据集LSUHDIR，包含82,126张具有多样场景和丰富内容的UHD图像。我们提出的方法在一系列UHD图像恢复任务中表现出优越性能，广泛的消融研究证实了每个模块的贡献和必要性。项目页面：https://github.com/NJU-PCALab/ERR。


cs.CV / 14 / 2604.15665

CPU Optimization of a Monocular 3D Biomechanics Pipeline for Low-Resource Deployment

低资源部署的单目3D生物力学管道的CPU优化

Zhang, Yan, Zhao, Xiong

Abstract

Markerless 3D movement analysis from monocular video enables accessible biomechanical assessment in clinical and sports settings. However, most research-grade pipelines rely on GPU acceleration, limiting deployment on consumer-grade hardware and in low-resource environments. In this work, we optimize a monocular 3D biomechanics pipeline derived from the MonocularBiomechanics framework for efficient CPU-only execution. We do so through profiling-driven system optimization, including model initialization restructuring, elimination of disk I/O serialization, and improved CPU parallelization. Experiments on a consumer workstation (AMD Ryzen 7 9700X CPU) show a 2.47x increase in processing throughput and a 59.6% reduction in total runtime, with initialization latency reduced by 4.6x. Despite these changes, biomechanical outputs remain highly consistent with the baseline implementation (mean joint-angle deviation 0.35°, r = 0.998). These results demonstrate that research-grade vision-based biomechanics pipelines can be deployed on commodity CPU hardware for scalable movement assessment.

Chinese Translation

基于单目视频的无标记3D运动分析使得在临床和体育环境中进行生物力学评估变得更加可及。然而，大多数研究级管道依赖于GPU加速，这限制了在消费级硬件和低资源环境中的部署。在本研究中，我们优化了源自MonocularBiomechanics框架的单目3D生物力学管道，以实现高效的仅CPU执行。我们采用基于性能分析的系统优化，包括模型初始化重构、消除磁盘I/O串行化以及改进CPU并行化。在一台消费级工作站（AMD Ryzen 7 9700X CPU）上的实验显示，处理吞吐量提高了2.47倍，总运行时间减少了59.6%，初始化延迟减少了4.6倍。尽管进行了这些改动，生物力学输出与基线实现保持高度一致（平均关节角度偏差0.35°，r = 0.998）。这些结果表明，研究级基于视觉的生物力学管道可以在商品CPU硬件上进行部署，以实现可扩展的运动评估。
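
The two headline figures in this abstract can be cross-checked against each other. The sketch below assumes that throughput and total runtime were measured on the same fixed workload, in which case a speedup factor s implies a runtime reduction of 1 - 1/s. The abstract does not state how the two numbers were measured, and the function name is illustrative.

```python
def runtime_reduction(speedup: float) -> float:
    """Fractional runtime reduction implied by a throughput speedup,
    assuming both are measured on the same fixed workload."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# 2.47x is the throughput increase reported in the abstract.
implied = runtime_reduction(2.47)
print(f"implied runtime reduction: {implied:.1%}")
```

Under this assumption the implied reduction is roughly 59.5%, which agrees with the reported 59.6% to within the rounding of the speedup figure.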


cs.CV / 15 / 2604.15670

PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

PixDLM：一种用于无人机推理分割的双路径多模态语言模型

Ke, Shuyan, Mei, Yifan, Wu, Changli, Zheng, Yonghan, Ji, Jiayi, Cao, Liujuan, Ji, Rongrong

Abstract

Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research.

Chinese Translation

推理分割最近已从地面场景扩展到遥感影像，但无人机（UAV）数据面临独特的挑战，包括斜视角度、超高分辨率和极端尺度变化。为了解决这些问题，我们正式定义了无人机推理分割任务，并将其语义要求组织为三个维度：空间、属性和场景级推理。基于这一框架，我们构建了DRSeg，一个大规模的无人机推理分割基准，包含10,000张高分辨率航空图像，并针对所有三种推理类型配备了链式思维问答（Chain-of-Thought QA）监督。作为基准的补充，我们引入了PixDLM，一个简单而有效的像素级多模态语言模型，作为该任务的统一基线。对DRSeg的实验建立了强有力的基线结果，并突出了无人机推理分割的独特挑战，为未来的研究奠定了坚实的基础。


cs.CV / 16 / 2604.15678

HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning

HyCal：一种无训练的跨学科少样本增量学习原型校准方法

Lee, Eunju, Kim, MiHyeon, Kwon, JuneHyoung, Lee, Yoonji, Kim, JiHyun, Jang, Soojin, Kim, YoungBin

Abstract

Pretrained Vision-Language Models (VLMs) like CLIP show promise in continual learning, but existing Few-Shot Class-Incremental Learning (FSCIL) methods assume homogeneous domains and balanced data distributions, limiting real-world applicability where data arises from heterogeneous disciplines with imbalanced sample availability and varying visual complexity. We identify Domain Gravity, a representational asymmetry where data imbalance across heterogeneous domains causes overrepresented or low-entropy domains to disproportionately influence the embedding space, leading to prototype drift and degraded performance on underrepresented or high-entropy domains. To address this, we introduce Cross-Discipline Variable Few-Shot Class-Incremental Learning (XD-VSCIL), a benchmark capturing real-world heterogeneity and imbalance where Domain Gravity naturally intensifies. We propose Hybrid Prototype Calibration (HyCal), a training-free method combining cosine similarity and Mahalanobis distance to capture complementary geometric properties (directional alignment and covariance-aware magnitude), yielding stable prototypes under imbalanced heterogeneous conditions. Operating on frozen CLIP embeddings, HyCal achieves consistent retention-adaptation improvements while maintaining efficiency. Experiments show HyCal effectively mitigates Domain Gravity and outperforms existing methods in imbalanced cross-domain incremental learning.

Chinese Translation

预训练的视觉-语言模型（VLMs），如CLIP，在持续学习中展现出良好的前景，但现有的少样本增量学习（FSCIL）方法假设领域同质且数据分布平衡，这限制了其在真实世界中的适用性，因为真实数据来自于异质学科，样本可用性不平衡且视觉复杂性各异。我们识别出领域重力（Domain Gravity），这是一种表示不对称性，其中异质领域间的数据不平衡导致过度代表或低熵领域对嵌入空间产生不成比例的影响，从而导致原型漂移和在欠代表或高熵领域的性能下降。为了解决这一问题，我们引入了跨学科可变少样本增量学习（XD-VSCIL），这是一个捕捉真实世界异质性和不平衡的基准，其中领域重力自然加剧。我们提出了混合原型校准（HyCal），这是一种无训练的方法，结合了余弦相似度和马哈拉诺比斯距离，以捕捉互补的几何特性——方向对齐和协方差感知的幅度——在不平衡的异质条件下产生稳定的原型。HyCal在冻结的CLIP嵌入上运行，实现了一致的保留-适应性改进，同时保持效率。实验表明，HyCal有效缓解了领域重力，并在不平衡的跨领域增量学习中超越了现有方法。
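
The idea of scoring a sample against class prototypes with both cosine similarity and Mahalanobis distance can be sketched as follows. This is a minimal illustration, not HyCal itself: the diagonal covariance, the linear mixing rule, and the weight `alpha` are all assumptions made here, since the abstract does not say how the two measures are combined.

```python
import math


def cosine_similarity(a, b):
    """Directional alignment between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal covariance (simplifying assumption)."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))


def hybrid_score(x, mean, var, alpha=0.5):
    # Higher is better: reward alignment, penalise covariance-scaled distance.
    # alpha is a hypothetical mixing weight, not a value from the paper.
    return alpha * cosine_similarity(x, mean) - (1.0 - alpha) * mahalanobis_diag(x, mean, var)


def classify(x, prototypes, alpha=0.5):
    """prototypes maps class name -> (mean vector, per-dimension variance)."""
    return max(prototypes, key=lambda c: hybrid_score(x, *prototypes[c], alpha=alpha))


# Toy prototypes standing in for statistics of frozen embeddings.
prototypes = {
    "a": ([1.0, 0.0], [0.1, 0.1]),
    "b": ([0.0, 1.0], [0.1, 0.1]),
}
print(classify([0.9, 0.1], prototypes))
```

Because nothing is trained, adding a class only requires storing its mean and variance, which is the property that makes a prototype-based scheme attractive for incremental learning.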


cs.CV / 17 / 2604.15681

Self-Supervised Angular Deblurring in Photoacoustic Reconstruction via Noisier2Inverse

基于 Noisier2Inverse 的自监督角度去模糊在光声重建中的应用

Haltmeier, Markus, Gruber, Nadja, Hwang, Gyeongha

Abstract

Photoacoustic tomography (PAT) is an emerging imaging modality that combines the complementary strengths of optical contrast and ultrasonic resolution. A central task is image reconstruction, where measured acoustic signals are used to recover the initial pressure distribution. For ideal point-like or line-like detectors, several efficient and fast reconstruction algorithms exist, including Fourier methods, filtered backprojection, and time reversal. However, when applied to data acquired with finite-size detectors, these methods yield systematically blurred images. Although sharper images can be obtained by compensating for finite-detector effects, supervised learning approaches typically require ground-truth images that may not be available in practice. We propose a self-supervised reconstruction method based on Noisier2Inverse that addresses finite-size detector effects without requiring ground-truth data. Our approach operates directly on noisy measurements and learns to recover high-quality PAT images in a ground-truth-free manner. Its key components are: (i) PAT-specific modeling that recasts the problem as angular deblurring; (ii) a Noisier2Inverse formulation in the polar domain that leverages the known angular point-spread function; and (iii) a novel, statistically grounded early-stopping rule. In experiments, the proposed method consistently outperforms alternative approaches that do not use supervised data and achieves performance close to supervised benchmarks, while remaining practical for real acquisitions with finite-size detectors.

Chinese Translation

光声断层成像（PAT）是一种新兴的成像模式，结合了光学对比度和超声分辨率的互补优势。其核心任务是图像重建，即利用测得的声信号恢复初始压力分布。对于理想的点状或线状探测器，已经存在多种高效快速的重建算法，包括傅里叶方法、滤波反投影和时间反转。然而，当应用于有限尺寸探测器获取的数据时，这些方法会产生系统性模糊的图像。尽管通过补偿有限探测器效应可以获得更清晰的图像，但监督学习方法通常需要真实图像作为基础，而这些图像在实际中可能无法获得。我们提出了一种基于 Noisier2Inverse 的自监督重建方法，旨在解决有限尺寸探测器效应，而无需真实数据。我们的方法直接在噪声测量上操作，并学习以无真实数据的方式恢复高质量的 PAT 图像。其关键组成部分包括：（i）特定于 PAT 的建模，将问题重新表述为角度去模糊；（ii）在极坐标域中利用已知的角度点扩散函数的 Noisier2Inverse 公式；（iii）一种新颖的基于统计的早停规则。在实验中，所提出的方法在不使用监督数据的情况下始终优于其他替代方法，并且其性能接近监督基准，同时在有限尺寸探测器的实际采集中仍然具有实用性。


cs.CV / 18 / 2604.15703

P3T: Prototypical Point-level Prompt Tuning with Enhanced Generalization for 3D Vision-Language Models

P3T：具有增强泛化能力的原型点级提示调优方法用于3D视觉-语言模型

Jung, Geunyoung, Kim, Soohong, Song, Kyungwoo, Jung, Jiyoung

Abstract

With the rise of pre-trained models in the 3D point cloud domain for a wide range of real-world applications, adapting them to downstream tasks has become increasingly important. However, conventional full fine-tuning methods are computationally expensive and storage-intensive. Although prompt tuning has emerged as an efficient alternative, it often suffers from overfitting, thereby compromising generalization capability. To address this issue, we propose Prototypical Point-level Prompt Tuning (P$^3$T), a parameter-efficient prompt tuning method designed for pre-trained 3D vision-language models (VLMs). P$^3$T consists of two components: 1) Point Prompter, which generates instance-aware point-level prompts for the input point cloud, and 2) Text Prompter, which inserts learnable prompts into the input text instead of hand-crafted ones. Since both prompters operate directly on input data, P$^3$T enables task-specific adaptation of 3D VLMs without sacrificing generalizability. Furthermore, to enhance embedding space alignment, which is key to fine-tuning 3D VLMs, we introduce a prototypical loss that reduces intra-category variance. Extensive experiments demonstrate that our method matches or outperforms full fine-tuning in classification and few-shot learning, and further exhibits robust generalization under data shift in the cross-dataset setting. The code is available at https://github.com/gyjung975/P3T.

Chinese Translation

随着预训练模型在3D点云领域的兴起，适应这些模型以应对各种现实世界应用的下游任务变得愈发重要。然而，传统的全面微调方法计算成本高且存储需求大。尽管提示调优作为一种高效的替代方案应运而生，但它往往会遭遇过拟合，从而影响泛化能力。为了解决这一问题，我们提出了原型点级提示调优方法（P$^3$T），这是一种为预训练的3D视觉-语言模型（VLMs）设计的参数高效的提示调优方法。P$^3$T由两个部分组成：1）点提示器，用于为输入的点云生成实例感知的点级提示；2）文本提示器，将可学习的提示应用于输入文本，而非手工制作的提示。由于两个提示器直接作用于输入数据，P$^3$T使得3D VLMs能够在不牺牲泛化能力的情况下进行任务特定的适应。此外，为了增强嵌入空间的对齐，这是微调3D VLMs的关键，我们引入了一种原型损失，旨在减少类别内的方差。大量实验表明，我们的方法在分类和少样本学习中与全面微调相匹配或超越，并在跨数据集设置下表现出强健的泛化能力。代码可在 https://github.com/gyjung975/P3T 获取。
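
A loss that "reduces intra-category variance" can be illustrated with class prototypes taken as per-class mean embeddings. The sketch below is a generic formulation under that assumption. The abstract does not give the exact form of the P$^3$T prototypical loss, and the function names are hypothetical.

```python
def class_prototypes(embeddings, labels):
    """Mean embedding per class (a common definition of a prototype)."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def intra_class_variance_loss(embeddings, labels):
    """Average squared distance of each embedding to its own class prototype."""
    protos = class_prototypes(embeddings, labels)
    total = 0.0
    for vec, label in zip(embeddings, labels):
        total += sum((v - p) ** 2 for v, p in zip(vec, protos[label]))
    return total / len(embeddings)


# Class "a" is spread around (1, 0); class "b" has a single sample.
loose = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]]
tight = [[1.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
labels = ["a", "a", "b"]
print(intra_class_variance_loss(loose, labels), intra_class_variance_loss(tight, labels))
```

Minimising such a term pulls same-class embeddings toward a shared point, which is one way to tighten clusters in the embedding space.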


cs.CV / 19 / 2604.15707

LP$^{2}$DH: A Locality-Preserving Pixel-Difference Hashing Framework for Dynamic Texture Recognition

LP$^{2}$DH：一种用于动态纹理识别的局部保持像素差异哈希框架

Ding, Ruxin, Ren, Jianfeng, Yu, Heng, Li, Jiawei, Jiang, Xudong

Abstract

Spatiotemporal Local Binary Pattern (STLBP) is a widely used dynamic texture descriptor, but it suffers from extremely high dimensionality. To tackle this, STLBP features are often extracted on three orthogonal planes, which sacrifice inter-plane correlation. In this work, we propose a Locality-Preserving Pixel-Difference Hashing (LP$^{2}$DH) framework that jointly encodes pixel differences in the full spatiotemporal neighbourhood. LP$^{2}$DH transforms Pixel-Difference Vectors (PDVs) into compact binary codes with maximal discriminative power. Furthermore, we incorporate a locality-preserving embedding to maintain the PDVs' local structure before and after hashing. Then, a curvilinear search strategy is utilized to jointly optimize the hashing matrix and binary codes via gradient descent on the Stiefel manifold. After hashing, dictionary learning is applied to encode the binary vectors into codewords, and the resulting histogram is utilized as the final feature representation. The proposed LP$^{2}$DH achieves state-of-the-art performance on three major dynamic texture recognition benchmarks: 99.80% against DT-GoogleNet's 98.93% on UCLA, 98.52% against HoGF$^{3D}$'s 97.63% on DynTex++, and 96.19% compared to STS's 95.00% on YUPENN. The source code is available at: https://github.com/drx770/LP2DH.

Chinese Translation

时空局部二值模式（STLBP）是一种广泛使用的动态纹理描述符，但其维度极高。为了解决这个问题，STLBP特征通常在三个正交平面上提取，这牺牲了平面之间的相关性。在本研究中，我们提出了一种局部保持像素差异哈希（LP$^{2}$DH）框架，该框架联合编码全时空邻域中的像素差异。LP$^{2}$DH将像素差异向量（PDVs）转换为具有最大区分能力的紧凑二进制码。此外，我们结合了局部保持嵌入，以在哈希前后保持PDVs的局部结构。然后，利用曲线搜索策略通过在斯蒂费尔流形上的梯度下降联合优化哈希矩阵和二进制码。哈希后，应用字典学习将二进制向量编码为码字，并利用生成的直方图作为最终特征表示。所提出的LP$^{2}$DH在三个主要的动态纹理识别基准测试中实现了最先进的性能：在UCLA上以99.80%对比DT-GoogleNet的98.93%，在DynTex++上以98.52%对比HoGF$^{3D}$的97.63%，在YUPENN上以96.19%对比STS的95.00%。源代码可在以下链接获取：https://github.com/drx770/LP2DH。
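
The first two steps described in this abstract, forming a pixel-difference vector (PDV) over the full spatiotemporal neighbourhood and binarising it with a projection, can be sketched as below. The 3x3x3 neighbourhood, the sign-threshold rule, and the hand-written projection rows are assumptions for illustration. In the paper the hashing matrix is learned on the Stiefel manifold, which this sketch does not attempt.

```python
def pixel_difference_vector(volume, t, y, x):
    """Differences between the 26 spatiotemporal neighbours and the centre voxel.

    volume is indexed as volume[t][y][x]; (t, y, x) must not lie on the border.
    """
    centre = volume[t][y][x]
    pdv = []
    for dt in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if (dt, dy, dx) != (0, 0, 0):
                    pdv.append(volume[t + dt][y + dy][x + dx] - centre)
    return pdv


def hash_pdv(pdv, projection):
    """One bit per projection row: 1 if the dot product with the PDV is positive."""
    return [1 if sum(w * v for w, v in zip(row, pdv)) > 0 else 0 for row in projection]


# Toy 3x3x3 volume whose intensity increases with the voxel index.
volume = [[[t * 9 + y * 3 + x for x in range(3)] for y in range(3)] for t in range(3)]
pdv = pixel_difference_vector(volume, 1, 1, 1)

# Hypothetical 2-bit projection (a learned matrix would replace this).
projection = [[1.0] * 26, [0.0] * 13 + [1.0] * 13]
print(hash_pdv(pdv, projection))
```

Using all 26 neighbours keeps the correlation between the spatial and temporal planes, which the abstract identifies as what three-orthogonal-plane descriptors give up.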


cs.CV / 20 / 2604.15708

APC: Transferable and Efficient Adversarial Point Counterattack for Robust 3D Point Cloud Recognition

APC：可转移且高效的对抗点反击方法用于鲁棒的3D点云识别

Jung, Geunyoung, Kim, Soohong, Kong, Inseok, Jung, Jiyoung

Abstract

The advent of deep neural networks has led to remarkable progress in 3D point cloud recognition, but they remain vulnerable to adversarial attacks. Although various defense methods have been studied, they suffer from a trade-off between robustness and transferability. We propose Adversarial Point Counterattack (APC) to achieve both simultaneously. APC is a lightweight input-level purification module that generates instance-specific counter-perturbations for each point, effectively neutralizing attacks. Leveraging clean-adversarial pairs, APC enforces geometric consistency in data space and semantic consistency in feature space. To improve generalizability across diverse attacks, we adopt a hybrid training strategy using adversarial point clouds from multiple attack types. Since APC operates purely on input point clouds, it directly transfers to unseen models and defends against attacks targeting them without retraining. At inference, a single APC forward pass provides purified point clouds with negligible time and parameter overhead. Extensive experiments on two 3D recognition benchmarks demonstrate that the APC achieves state-of-the-art defense performance. Furthermore, cross-model evaluations validate its superior transferability. The code is available at https://github.com/gyjung975/APC.

Chinese Translation

深度神经网络的出现使得3D点云识别取得了显著进展，但它们仍然容易受到对抗攻击。尽管已经研究了各种防御方法，但它们在鲁棒性和可转移性之间存在权衡。我们提出了对抗点反击（Adversarial Point Counterattack，APC）方法，以同时实现这两者。APC是一个轻量级的输入级净化模块，为每个点生成特定实例的反扰动，有效中和攻击。通过利用干净-对抗对，APC在数据空间中强制执行几何一致性，在特征空间中强制执行语义一致性。为了提高对多种攻击的泛化能力，我们采用了一种混合训练策略，使用来自多种攻击类型的对抗点云。由于APC纯粹在输入点云上操作，它可以直接转移到未见过的模型上，并在不重新训练的情况下防御针对它们的攻击。在推理阶段，单次APC前向传播提供了净化的点云，几乎没有时间和参数开销。在两个3D识别基准上的大量实验表明，APC实现了最先进的防御性能。此外，跨模型评估验证了其优越的可转移性。代码可在 https://github.com/gyjung975/APC 获取。


cs.CV / 21 / 2604.15711

SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification

SSMamba：一种自监督混合状态空间模型用于病理图像分类

Chai, Enhui, Chen, Sicheng, Zhang, Tianyi, Li, Xingyu, Cui, Tianxiang

Abstract

Pathological diagnosis is highly reliant on image analysis, where Regions of Interest (ROIs) serve as the primary basis for diagnostic evidence, while whole-slide image (WSI)-level tasks primarily capture aggregated patterns. To extract these critical morphological features, ROI-level Foundation Models (FMs) based on Vision Transformers (ViTs) and large-scale self-supervised learning (SSL) have been widely adopted. However, three core limitations remain in their application to ROI analysis: (1) cross-magnification domain shift, as fixed-scale pretraining hinders adaptation to diverse clinical settings; (2) inadequate local-global relationship modeling, wherein the ViT backbone of FMs suffers from high computational overhead and imprecise local characterization; (3) insufficient fine-grained sensitivity, as traditional self-attention mechanisms tend to overlook subtle diagnostic cues. To address these challenges, we propose SSMamba, a hybrid SSL framework that enables effective fine-grained feature learning without relying on large external datasets. This framework incorporates three domain-adaptive components: Mamba Masked Image Modeling (MAMIM) for mitigating domain shift, a Directional Multi-scale (DMS) module for balanced local-global modeling, and a Local Perception Residual (LPR) module for enhanced fine-grained sensitivity. Employing a two-stage pipeline, SSL pretraining on target ROI datasets followed by supervised fine-tuning (SFT), SSMamba outperforms 11 state-of-the-art (SOTA) pathological FMs on 10 public ROI datasets and surpasses 8 SOTA methods on 6 public WSI datasets. These results validate the superiority of task-specific architectural designs for pathological image analysis.

Chinese Translation

病理诊断高度依赖于图像分析，其中感兴趣区域（ROIs）作为诊断证据的主要依据，而全切片图像（WSI）级任务主要捕捉聚合模式。为了提取这些关键的形态特征，基于视觉变换器（ViTs）和大规模自监督学习（SSL）的ROI级基础模型（FMs）已被广泛采用。然而，在其应用于ROI分析时仍然存在三个核心限制：（1）跨放大倍数领域转移，固定规模的预训练阻碍了对多样临床环境的适应；（2）局部-全局关系建模不足，FMs的ViT骨干网络面临高计算开销和不精确的局部特征表征；（3）细粒度敏感性不足，传统自注意机制往往忽视微妙的诊断线索。为了解决这些挑战，我们提出了SSMamba，一种混合SSL框架，能够在不依赖大型外部数据集的情况下有效学习细粒度特征。该框架包含三个领域自适应组件：用于减轻领域转移的Mamba掩蔽图像建模（MAMIM）、用于平衡局部-全局建模的方向性多尺度（DMS）模块，以及用于增强细粒度敏感性的局部感知残差（LPR）模块。通过采用两阶段管道，在目标ROI数据集上进行SSL预训练，然后进行监督微调（SFT），SSMamba在10个公共ROI数据集上超越了11个最先进（SOTA）的病理FMs，并在6个公共WSI数据集上超越了8个SOTA方法。这些结果验证了针对病理图像分析的任务特定架构设计的优越性。


cs.CV / 22 / 2604.15718

NeuroLip: An Event-driven Spatiotemporal Learning Framework for Cross-Scene Lip-Motion-based Visual Speaker Recognition

NeuroLip：一种基于事件驱动的时空学习框架，用于跨场景基于唇部运动的视觉说话人识别

Yao, Junguang, Liu, Wenye, Picek, Stjepan, Zheng, Yue

Abstract

Visual speaker recognition based on lip motion offers a silent, hands-free, and behavior-driven biometric solution that remains effective even when acoustic cues are unavailable. Compared to traditional methods that rely heavily on appearance-dependent representations, lip motion encodes subject-specific behavioral dynamics driven by consistent articulation patterns and muscle coordination, offering inherent stability across environmental changes. However, capturing these robust, fine-grained dynamics is challenging for conventional frame-based cameras due to motion blur and low dynamic range. To exploit the intrinsic stability of lip motion and address these sensing limitations, we propose NeuroLip, an event-based framework that captures fine-grained lip dynamics under a strict yet practical cross-scene protocol: training is performed under a single controlled condition, while recognition must generalize to unseen viewing and lighting conditions. NeuroLip features 1) a Temporal-aware Voxel Encoding module with adaptive event weighting, 2) a Structure-aware Spatial Enhancer that amplifies discriminative behavioral patterns by suppressing noise while preserving vertically structured motion information, and 3) a Polarity Consistency Regularization mechanism to retain motion-direction cues encoded in event polarities. To facilitate systematic evaluation, we introduce DVSpeaker, a comprehensive event-based lip-motion dataset comprising 50 subjects recorded under four distinct viewpoint and illumination scenarios. Extensive experiments demonstrate that NeuroLip achieves near-perfect matched-scene accuracy and robust cross-scene generalization, attaining over 71% accuracy on unseen viewpoints and nearly 76% under low-light conditions, outperforming representative existing methods by at least 8.54%. The dataset and code are publicly available at https://github.com/JiuZeongit/NeuroLip.

Chinese Translation

基于唇部运动的视觉说话人识别提供了一种无声、免提且行为驱动的生物特征解决方案，即使在声学线索不可用时仍然有效。与传统方法高度依赖外观依赖性表示相比，唇部运动编码了由一致的发音模式和肌肉协调驱动的特定于个体的行为动态，提供了在环境变化中的固有稳定性。然而，由于运动模糊和低动态范围，传统的基于帧的相机捕捉这些稳健的细粒度动态是具有挑战性的。为了利用唇部运动的内在稳定性并解决这些传感限制，我们提出了NeuroLip，一种基于事件的框架，在严格但实用的跨场景协议下捕捉细粒度的唇部动态：训练在单一受控条件下进行，而识别必须能够推广到未见的视角和光照条件。NeuroLip具有1）具有自适应事件加权的时间感知体素编码模块，2）结构感知空间增强器，通过抑制噪声同时保留垂直结构运动信息来放大可区分的行为模式，以及3）极性一致性正则化机制，以保留编码在事件极性中的运动方向线索。为了便于系统评估，我们引入了DVSpeaker，一个综合性的基于事件的唇部运动数据集，包含50个在四种不同视角和光照场景下录制的受试者。大量实验表明，NeuroLip在匹配场景的准确率接近完美，并且在跨场景泛化方面表现稳健，在未见视角下达到超过71%的准确率，在低光照条件下接近76%，相比于代表性的现有方法提高了至少8.54%。数据集和代码已公开，地址为 https://github.com/JiuZeongit/NeuroLip。
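
Voxel encodings of event-camera data are commonly built by binning events along the time axis and accumulating signed polarities per pixel. The sketch below shows that generic construction only. It assumes events are `(x, y, timestamp, polarity)` tuples and uses uniform weights, so it does not reproduce the adaptive event weighting of NeuroLip's encoder.

```python
def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate signed event polarities into a [num_bins][height][width] grid."""
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return grid
    t_start = min(e[2] for e in events)
    t_end = max(e[2] for e in events)
    span = (t_end - t_start) or 1.0  # avoid division by zero for a single timestamp
    for x, y, t, polarity in events:
        # Map the timestamp to a bin; the last instant falls into the final bin.
        b = min(int((t - t_start) / span * num_bins), num_bins - 1)
        grid[b][y][x] += 1.0 if polarity > 0 else -1.0
    return grid


# Three toy events on a 1x2 sensor: (x, y, timestamp, polarity).
events = [(0, 0, 0.0, 1), (1, 0, 0.5, -1), (0, 0, 1.0, 1)]
grid = events_to_voxel_grid(events, num_bins=2, height=1, width=2)
print(grid)
```

Keeping the sign of each event, rather than counting events regardless of polarity, preserves the motion-direction cue that the abstract's polarity regularisation is meant to protect.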


cs.CV / 23 / 2604.15723

Diffusion Autoencoder for Unsupervised Artifact Restoration in Handheld Fundus Images

用于手持眼底图像无监督伪影恢复的扩散自编码器

Palani, Mathumetha, Puthumana, Kavya, Das, Ayantika, Krishnamurthi, Ganapathy

Abstract

The advent of handheld fundus imaging devices has made ophthalmologic diagnosis and disease screening more accessible, efficient, and cost-effective. However, images captured from these setups often suffer from artifacts such as flash reflections, exposure variations, and motion-induced blur, which degrade image quality and hinder downstream analysis. While generative models have been effective in image restoration, most depend on paired supervision or predefined artifact structures, making them less adaptable to unstructured degradations commonly observed in handheld fundus images. To address this, we propose an unsupervised diffusion autoencoder that integrates a context encoder with the denoising process to learn semantically meaningful representations for artifact restoration. The model is trained only on high-quality table-top fundus images and, at inference, restores artifact-affected handheld acquisitions. We validate the restorations through quantitative and qualitative evaluations and show that diagnostic accuracy increases to 81.17% on an unseen dataset under multiple artifact conditions.

Chinese Translation

手持眼底成像设备的出现使得眼科诊断和疾病筛查变得更加便捷、高效和经济。然而，这些设备捕获的图像常常受到伪影的影响，如闪光反射、曝光变化和运动引起的模糊，这些因素降低了图像质量并妨碍了后续分析。尽管生成模型在图像恢复方面表现出色，但大多数模型依赖于成对监督或预定义的伪影结构，使其在应对手持眼底图像中常见的非结构性退化时适应性较差。为了解决这一问题，我们提出了一种无监督的扩散自编码器，该模型将上下文编码器与去噪过程相结合，以学习具有语义意义的表示用于伪影恢复。该模型仅在高质量的桌面眼底图像上进行训练，并推断以恢复受伪影影响的手持图像。我们通过定量和定性评估验证了恢复效果，并显示在未见数据集和多种伪影条件下，诊断准确率提高至81.17%。


cs.CV / 24 / 2604.15729

MambaBack: Bridging Local Features and Global Contexts in Whole Slide Image Analysis

MambaBack：在全幻灯片图像分析中桥接局部特征与全局上下文

Chen, Sicheng, Wong, Chad, Zhang, Tianyi, Chai, Enhui, Liu, Zeyu, Xia, Fei

Abstract

Whole Slide Image (WSI) analysis is pivotal in computational pathology, enabling cancer diagnosis by integrating morphological and architectural cues across magnifications. Multiple Instance Learning (MIL) serves as the standard framework for WSI analysis. Recently, Mamba has become a promising backbone for MIL, overtaking Transformers due to its efficiency and global context modeling capabilities originating from Natural Language Processing (NLP). However, existing Mamba-based MIL approaches face three critical challenges: (1) disruption of 2D spatial locality during 1D sequence flattening; (2) sub-optimal modeling of fine-grained local cellular structures; and (3) high memory peaks during inference on resource-constrained edge devices. Studies like MambaOut reveal that Mamba's SSM component is redundant for local feature extraction, where Gated CNNs suffice. Recognizing that WSI analysis demands both fine-grained local feature extraction akin to natural images, and global context modeling akin to NLP, we propose MambaBack, a novel hybrid architecture that harmonizes the strengths of Mamba and MambaOut. First, we propose the Hilbert sampling strategy to preserve the 2D spatial locality of tiles within 1D sequences, enhancing the model's spatial perception. Second, we design a hierarchical structure comprising a 1D Gated CNN block based on MambaOut to capture local cellular features, and a BiMamba2 block to aggregate global context, jointly enhancing multi-scale representation. Finally, we implement an asymmetric chunking design, allowing parallel processing during training and chunking-streaming accumulation during inference, minimizing peak memory usage for deployment. Experimental results on five datasets demonstrate that MambaBack outperforms seven state-of-the-art methods. Source code and datasets are publicly available.

Chinese Translation

全幻灯片图像（WSI）分析在计算病理学中至关重要，通过整合不同放大倍数下的形态学和结构线索，实现癌症诊断。多实例学习（MIL）作为WSI分析的标准框架，近年来，Mamba已成为MIL的一个有前景的基础架构，因其效率和源自自然语言处理（NLP）的全局上下文建模能力超越了变换器（Transformers）。然而，现有的基于Mamba的MIL方法面临三个关键挑战：（1）在一维序列展平过程中破坏二维空间局部性；（2）对细粒度局部细胞结构的建模不够理想；（3）在资源受限的边缘设备上推理时内存峰值过高。研究如MambaOut表明，Mamba的SSM组件在局部特征提取中是冗余的，门控卷积神经网络（Gated CNNs）即可满足需求。认识到WSI分析既需要类似自然图像的细粒度局部特征提取，又需要类似NLP的全局上下文建模，我们提出了MambaBack，这是一种新颖的混合架构，融合了Mamba和MambaOut的优势。首先，我们提出了希尔伯特采样策略，以保留1D序列中瓷砖的二维空间局部性，增强模型的空间感知。其次，我们设计了一个层次结构，包括一个基于MambaOut的1D门控卷积块，用于捕捉局部细胞特征，以及一个BiMamba2块，用于聚合全局上下文，共同增强多尺度表示。最后，我们实现了一种不对称分块设计，允许在训练期间进行并行处理，并在推理期间进行分块流式累积，最小化部署时的峰值内存使用。五个数据集上的实验结果表明，MambaBack在性能上超越了七种最先进的方法。源代码和数据集已公开。
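
Ordering tiles along a Hilbert curve is a standard way to flatten a 2D grid into a 1D sequence while keeping spatial neighbours close together. The sketch below uses the classic coordinate-to-index conversion on a power-of-two grid. It illustrates the general idea only: the abstract does not detail how MambaBack's Hilbert sampling handles irregular tissue regions, and the function names are illustrative.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(tiles, n):
    """Sort (x, y) tile coordinates by their position along the curve."""
    return sorted(tiles, key=lambda xy: hilbert_index(n, xy[0], xy[1]))


# Row-major tile coordinates of a 4 x 4 grid, re-ordered along the curve.
tiles = [(x, y) for y in range(4) for x in range(4)]
ordered = hilbert_order(tiles, 4)
print(ordered)
```

In a row-major flattening, the last tile of one row and the first tile of the next are adjacent in the sequence but far apart in the image. Along the Hilbert curve, every pair of consecutive tiles shares an edge.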


cs.CV / 25 / 2604.15735

Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval

草图与文本协同：融合结构轮廓与描述属性以实现细粒度图像检索

Wang, Siyuan, Gao, Hanchen, Zhu, Guangming, Lu, Jiang, Ma, Yiyue, Wu, Tianci, Huang, Jincai, Zhang, Liang

Abstract

Fine-grained image retrieval via hand-drawn sketches or textual descriptions remains a critical challenge due to inherent modality gaps. While hand-drawn sketches capture complex structural contours, they lack color and texture, which text effectively provides despite omitting spatial contours. Motivated by the complementary nature of these modalities, we propose the Sketch and Text Based Image Retrieval (STBIR) framework. By synergizing the rich color and texture cues from text with the structural outlines provided by sketches, STBIR achieves superior fine-grained retrieval performance. First, a curriculum-learning-driven robustness enhancement module is proposed to improve the model's robustness when handling queries of varying quality. Second, we introduce a category-knowledge-based feature space optimization module that significantly boosts the model's representational power. Finally, we design a multi-stage cross-modal feature alignment mechanism to mitigate the difficulty of aligning features across modalities. Furthermore, we curate the fine-grained STBIR benchmark dataset to rigorously validate the efficacy of our proposed framework and to provide a data reference for subsequent related research. Extensive experiments demonstrate that the proposed STBIR framework significantly outperforms state-of-the-art methods.

Chinese Translation

通过手绘草图或文本描述进行细粒度图像检索仍然是一个关键挑战，因为固有的模态差异。手绘草图捕捉了复杂的结构轮廓，但缺乏颜色和纹理，而文本则有效地提供了这些信息，尽管省略了空间轮廓。基于这些模态的互补特性，我们提出了草图与文本基础图像检索（Sketch and Text Based Image Retrieval, STBIR）框架。通过将文本中丰富的颜色和纹理线索与草图提供的结构轮廓相结合，STBIR 实现了优越的细粒度检索性能。首先，提出了一种基于课程学习的鲁棒性增强模块，以提高模型在处理不同质量查询时的鲁棒性。其次，我们引入了一个基于类别知识的特征空间优化模块，从而显著提升模型的表征能力。最后，我们设计了一个多阶段跨模态特征对齐机制，以有效缓解跨模态特征对齐的挑战。此外，我们整理了细粒度 STBIR 基准数据集，以严格验证我们提出的框架的有效性，并为后续相关研究提供数据支持作为参考。大量实验表明，所提出的 STBIR 框架显著优于现有的最先进方法。

cs.CV / 26 / 2604.15736

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

RefereeBench：视频多模态大语言模型是否准备好成为多项运动裁判

Xu, Yichen, Liu, Yuanhang, Wang, Chuhan, Zhao, Zihan, Luo, Jinghan, Ma, Jianzhe, Wang, Wenxuan, Jin, Qin

Abstract

While Multimodal Large Language Models (MLLMs) excel at generic video understanding, their ability to support specialized, rule-grounded decision-making remains insufficiently explored. In this paper, we introduce RefereeBench, the first large-scale benchmark for evaluating MLLMs as automatic sports referees. Spanning 11 sports with 925 curated videos and 6,475 QA pairs, RefereeBench evaluates five core officiating abilities: foul existence, foul and penalty classification, foul and penalty reasoning, entity perception, and temporal grounding. The benchmark is fully human-annotated to ensure high-quality annotations grounded in authentic officiating logic and multimodal evidence. Extensive evaluations of state-of-the-art MLLMs show that even the strongest models, such as Doubao-Seed-1.8 and Gemini-3-Pro, achieve only around 60% accuracy, while the strongest open-source model, Qwen3-VL, reaches only 47%. These results indicate that current models remain far from being reliable sports referees. Further analysis shows that while models can often identify incidents and involved entities, they struggle with rule application and temporal grounding, and frequently over-call fouls on normal clips. Our benchmark highlights the need for future MLLMs that better integrate domain knowledge and multimodal understanding, advancing trustworthy AI-assisted officiating and broader multimodal decision-making.

Chinese Translation

尽管多模态大语言模型（MLLMs）在通用视频理解方面表现出色，但它们支持专业的、基于规则的决策的能力仍未得到充分探索。本文介绍了RefereeBench，这是第一个用于评估MLLMs作为自动体育裁判的大规模基准。该基准涵盖11种运动，包含925个精心挑选的视频和6,475个问答对，评估五项核心裁判能力：犯规存在性、犯规与判罚分类、犯规与判罚推理、实体感知以及时间定位。该基准完全由人工注释，以确保高质量的注释基于真实的裁判逻辑和多模态证据。对最先进的MLLMs的广泛评估表明，即使是最强的模型，如Doubao-Seed-1.8和Gemini-3-Pro，准确率也仅约为60%，而最强的开源模型Qwen3-VL的准确率仅为47%。这些结果表明，当前模型距离成为可靠的体育裁判仍然相去甚远。进一步分析显示，尽管模型通常能够识别事件和相关实体，但在规则应用和时间定位方面存在困难，并且在正常片段上经常过度判罚犯规。我们的基准突显了未来MLLMs需要更好地整合领域知识和多模态理解，以推动可信赖的人工智能辅助裁判和更广泛的多模态决策。

cs.CV / 27 / 2604.15748

Concept-wise Attention for Fine-grained Concept Bottleneck Models

基于概念的注意力机制用于细粒度概念瓶颈模型

Zhong, Minghong, Zou, Guoshuai, Chen, Kanghao, Chen, Dexia, Wang, Ruixuan

Abstract

Recently, impressive performance has been achieved in Concept Bottleneck Models (CBM) by utilizing the image-text alignment learned by a large pre-trained vision-language model (i.e., CLIP). However, there exist two key limitations in concept modeling. First, existing methods often suffer from pre-training biases, manifested as granularity misalignment or reliance on structural priors. Second, fine-tuning with Binary Cross-Entropy (BCE) loss treats each concept independently, which ignores mutual exclusivity among concepts and leads to suboptimal alignment. To address these limitations, we propose Concept-wise Attention for Fine-grained Concept Bottleneck Models (CoAt-CBM), a novel framework that achieves adaptive fine-grained image-concept alignment and high interpretability. Specifically, CoAt-CBM employs learnable concept-wise visual queries to adaptively obtain fine-grained concept-wise visual embeddings, which are then used to produce a concept score vector. Then, a novel concept contrastive optimization guides the model to handle the relative importance of the concept scores, so that concept predictions faithfully reflect the image content and alignment improves. Extensive experiments demonstrate that CoAt-CBM consistently outperforms state-of-the-art methods. The code will be made available upon acceptance.

Chinese Translation

最近，通过利用大型预训练视觉-语言模型（如 CLIP）学习的图像-文本对齐，概念瓶颈模型（CBM）取得了显著的性能。然而，概念建模中存在两个关键限制。现有方法往往受到预训练偏差的影响，表现为粒度不匹配或依赖于结构先验。此外，使用二元交叉熵（BCE）损失进行微调时，各个概念被独立对待，这忽视了概念之间的相互排斥性，导致次优的对齐效果。为了解决这些限制，我们提出了基于概念的注意力机制用于细粒度概念瓶颈模型（CoAt-CBM），这是一个新颖的框架，能够实现自适应的细粒度图像-概念对齐和高可解释性。具体而言，CoAt-CBM 采用可学习的概念级视觉查询，自适应地获取细粒度的概念级视觉嵌入，并用于生成概念得分向量。随后，一种新颖的概念对比优化方法引导模型处理概念得分的相对重要性，使得概念预测能够真实反映图像内容并改善对齐效果。大量实验表明，CoAt-CBM 在各项指标上始终优于最先进的方法。代码将在接受后公开。
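
The remark that BCE treats each concept independently can be made concrete by comparing gradients. The sketch below is an illustration only: it contrasts per-concept BCE with a plain softmax cross-entropy, used here as a stand-in for a loss that couples concepts. It is not the concept contrastive optimization of CoAt-CBM, whose exact form the abstract does not give, and the function names are hypothetical.

```python
import math


def bce_grad(scores, targets):
    """Gradient of the summed BCE loss w.r.t. each concept logit.
    Entry i depends only on scores[i]: concepts are treated independently."""
    return [1.0 / (1.0 + math.exp(-s)) - t for s, t in zip(scores, targets)]


def softmax_ce_grad(scores, target_index):
    """Gradient of softmax cross-entropy w.r.t. each concept logit.
    Every entry depends on all scores, so concepts compete with each other."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z - (1.0 if i == target_index else 0.0)
            for i, e in enumerate(exps)]
```

Raising the logit of a competing concept leaves the BCE gradient of the target concept unchanged but changes its softmax gradient, which is the coupling that a relative or contrastive objective introduces.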

cs.CV / 28 / 2604.15770

PLAF: Pixel-wise Language-Aligned Feature Extraction for Efficient 3D Scene Understanding

PLAF：像素级语言对齐特征提取用于高效的3D场景理解

Wen, Junjie, He, Junlin, Ma, Fei, Cui, Jinqiang

Abstract

Accurate open-vocabulary 3D scene understanding requires semantic representations that are both language-aligned and spatially precise at the pixel level, while remaining scalable when lifted to 3D space. However, existing representations struggle to jointly satisfy these requirements, and densely propagating pixel-wise semantics to 3D often results in substantial redundancy, leading to inefficient storage and querying in large-scale scenes. To address these challenges, we present PLAF, a Pixel-wise Language-Aligned Feature extraction framework that enables dense and accurate semantic alignment in 2D without sacrificing open-vocabulary expressiveness. Building upon this representation, we further design an efficient semantic storage and querying scheme that significantly reduces redundancy across both 2D and 3D domains. Experimental results show that PLAF provides a strong semantic foundation for accurate and efficient open-vocabulary 3D scene understanding. The code is publicly available at https://github.com/RockWenJJ/PLAF.

Chinese Translation

准确的开放词汇3D场景理解需要在像素级别上既与语言对齐又空间精确的语义表示，同时在提升到3D空间时保持可扩展性。然而，现有的表示方法难以共同满足这些要求，密集传播像素级语义到3D往往导致显著的冗余，进而在大规模场景中导致存储和查询效率低下。为了解决这些挑战，我们提出了PLAF，一种像素级语言对齐特征提取框架，能够在不牺牲开放词汇表达能力的情况下，实现2D中的密集和准确的语义对齐。在此表示的基础上，我们进一步设计了一种高效的语义存储和查询方案，显著减少了2D和3D领域中的冗余。实验结果表明，PLAF为准确和高效的开放词汇3D场景理解提供了强大的语义基础。代码可在 https://github.com/RockWenJJ/PLAF 上公开获取。

cs.CV / 29 / 2604.15777

SegMix: Shuffle-based Feedback Learning for Semantic Segmentation of Pathology Images

SegMix：基于洗牌的反馈学习用于病理图像的语义分割

Yan, Zhiling, Chen, Sicheng, Zhang, Tianyi, Ying, Nan, Lei, Yanli, Zhang, Guanglei

Abstract

Segmentation is a critical task in computational pathology, as it identifies areas affected by disease or abnormal growth and is essential for diagnosis and treatment. However, acquiring high-quality pixel-level supervised segmentation data demands a significant workload from experienced pathologists, limiting the application of deep learning. To overcome this challenge, relaxing the label requirement to image-level classification labels allows more data to be used and more scenarios to be covered. One approach is to leverage Class Activation Maps (CAM) to generate pseudo pixel-level annotations for semantic segmentation from image-level labels alone. However, this method fails to thoroughly explore the essential characteristics of pathology images and thus identifies only small areas that are insufficient as pseudo masks. In this paper, we propose a novel shuffle-based feedback learning method, inspired by curriculum learning, to generate higher-quality pseudo semantic segmentation masks. Specifically, we shuffle pathology images at the patch level, with the model adaptively adjusting the shuffle strategy based on feedback from previous learning. Experimental results demonstrate that the proposed approach outperforms state-of-the-art methods on three different datasets.

Chinese Translation

分割是计算病理学中的一项关键任务，因为它识别受疾病或异常生长影响的区域，对于诊断和治疗至关重要。然而，获取高质量的像素级监督分割数据需要经验丰富的病理学家付出大量工作，这限制了深度学习的应用。为了解决这一挑战，放宽标签条件至图像级分类标签可以使用更多数据并启用更多场景。一种方法是利用类激活图（Class Activation Map, CAM）仅使用图像级标签生成伪像素级注释进行语义分割。然而，该方法未能充分探索病理图像的基本特征，因此仅识别出小区域，这不足以进行伪掩膜。在本文中，我们提出了一种新颖的基于洗牌的反馈学习方法，受到课程学习的启发，以生成更高质量的伪语义分割掩膜。具体而言，我们对病理图像进行块级洗牌，模型根据先前学习的反馈自适应调整洗牌策略。实验结果表明，我们提出的方法在三个不同的数据集上优于现有的最先进技术。
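
Patch-level shuffling as described above can be sketched as follows. The abstract does not specify how feedback drives the shuffle strategy, so this example simply exposes a hypothetical `ratio` parameter (the fraction of patches that are permuted) which a training loop could adjust; the function name and signature are assumptions for illustration.

```python
import random


def shuffle_patches(image, patch, ratio, seed=0):
    """Permute a fraction `ratio` of the non-overlapping patch x patch blocks
    of a 2D image (list of rows); the remaining blocks stay in place."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    # Top-left corner of every patch, in row-major order.
    slots = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    rng = random.Random(seed)
    chosen = rng.sample(range(len(slots)), round(ratio * len(slots)))
    permuted = chosen[:]
    rng.shuffle(permuted)
    source = list(range(len(slots)))  # source[dst] = patch copied into dst
    for dst, src in zip(chosen, permuted):
        source[dst] = src
    out = [[None] * w for _ in range(h)]
    for dst, src in enumerate(source):
        (dy, dx), (sy, sx) = slots[dst], slots[src]
        for i in range(patch):
            for j in range(patch):
                out[dy + i][dx + j] = image[sy + i][sx + j]
    return out
```

With `ratio=0.0` the image is returned unchanged, and larger values disturb more of the global layout while keeping the content of each patch intact.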

cs.CV / 30 / 2604.15795

Fed3D: Federated 3D Object Detection

Fed3D：联邦3D目标检测

Dai, Suyan, Liu, Chenxi, Li, Fazeng, Lin, Peican

Abstract

3D object detection models trained on a single server play an important role in autonomous driving, robotic manipulation, and augmented reality scenarios. However, most existing methods face severe privacy concerns when deployed on a multi-robot perception network to explore large-scale 3D scenes. Meanwhile, it is highly challenging to apply conventional federated learning methods to 3D object detection, due to 3D data heterogeneity and limited communication bandwidth. In this paper, we make a first attempt at a Federated 3D object detection framework (i.e., Fed3D) that enables distributed learning for 3D object detection with privacy preservation. Specifically, the irregular 3D inputs on each local robot and the differing category distributions across robots can cause local and global heterogeneity, respectively. To address this 3D data heterogeneity, we propose a local-global class-aware loss that balances the gradient back-propagation rate of different 3D categories from both local and global perspectives. To reduce the communication cost of each round, we develop a federated 3D prompt module that learns and communicates only prompts with few learnable parameters. Finally, extensive experiments on federated 3D object detection show that Fed3D significantly outperforms state-of-the-art algorithms at lower communication cost when local training data are limited.

Chinese Translation

在自主驾驶、机器人操作和增强现实场景中，训练于单一服务器的3D目标检测模型发挥着重要作用。然而，现有大多数方法在多机器人感知网络中部署以探索大规模3D场景时面临严重的隐私问题。同时，由于3D数据的异质性和有限的通信带宽，传统的联邦学习方法在3D目标检测场景中的应用极具挑战性。本文首次提出了一种新颖的联邦3D目标检测框架（即Fed3D），以实现隐私保护下的3D目标检测分布式学习。具体而言，考虑到本地机器人中的不规则输入3D对象和机器人之间的不同类别分布可能分别导致局部异质性和全局异质性。我们提出了一种针对3D数据异质性问题的局部-全局类别感知损失，该损失能够平衡来自局部和全局不同3D类别的梯度反向传播速率。为了降低每轮的通信成本，我们开发了一个联邦3D提示模块，该模块仅通过少量可学习参数来学习和传递提示。最后，在联邦3D目标检测上的多项广泛实验表明，我们的Fed3D模型在提供有限的本地训练数据时，显著优于现有的最先进算法，并且通信成本更低。

cs.CV / 31 / 2604.15808

Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI

超越单帧：跨体积MRI的多帧空间定位推理

Moukheiber, Lama, Yeung, Caleb M., Xue, Haotian, Helbling, Alec, Zhao, Zelin, Chen, Yongxin

Abstract

Spatial reasoning and visual grounding are core capabilities for vision-language models (VLMs), yet most medical VLMs produce predictions without transparent reasoning or spatial evidence. Existing benchmarks also evaluate VLMs on isolated 2D images, overlooking the volumetric nature of clinical imaging, where findings can span multiple frames or appear on only a few slices. We introduce Spatially Grounded MRI Visual Question Answering (SGMRI-VQA), a 41,307-pair benchmark for multi-frame, spatially grounded reasoning on volumetric MRI. Built from expert radiologist annotations in the fastMRI+ dataset across brain and knee studies, each QA pair includes a clinician-aligned chain-of-thought trace with frame-indexed bounding box coordinates. Tasks are organized hierarchically across detection, localization, counting/classification, and captioning, requiring models to jointly reason about what is present, where it is, and across which frames it extends. We benchmark 10 VLMs and show that supervised fine-tuning of Qwen3-VL-8B with bounding box supervision consistently improves grounding performance over strong zero-shot baselines, indicating that targeted spatial supervision is an effective path toward grounded clinical reasoning.

Chinese Translation

空间推理和视觉定位（visual grounding）是视觉-语言模型（VLMs）的核心能力，但大多数医学VLM在没有透明推理或空间证据的情况下产生预测。现有基准测试也在孤立的2D图像上评估VLM，忽视了临床影像的体积特性，其中发现可能跨越多个帧或仅出现在少数切片上。我们引入了空间定位MRI视觉问答（SGMRI-VQA），这是一个包含41,307个问答对的基准，用于体积MRI上的多帧、空间定位推理。该基准基于fastMRI+数据集中专家放射科医师的注释，涵盖脑部和膝关节研究，每个问答对包括与临床医生一致的思维链轨迹及带帧索引的边界框坐标。任务按层次组织，涵盖检测、定位、计数/分类和描述，要求模型共同推理存在的内容、位置及其跨越的帧。我们对10个VLM进行了基准测试，结果表明，使用边界框监督对Qwen3-VL-8B进行有监督微调，始终能在强大的零样本基线之上提高定位性能，这表明有针对性的空间监督是实现有依据的临床推理的有效途径。

cs.CV / 32 / 2604.15809

Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow

对齐视觉-语言模型所见与感知的自适应信息流

Liu, Chengxin, Choi, Wonseok, Zhang, Chenshuang, Oh, Tae-Hyun

Abstract

Vision-Language Models (VLMs) have demonstrated strong capability in a wide range of tasks such as visual recognition, document parsing, and visual grounding. Nevertheless, recent work shows that while VLMs often manage to capture the correct image region corresponding to the question, they do not necessarily produce the correct answers. In this work, we demonstrate that this misalignment could be attributed to suboptimal information flow within VLMs, where text tokens distribute too much attention to irrelevant visual tokens, leading to incorrect answers. Based on the observation, we show that modulating the information flow during inference can improve the perception capability of VLMs. The idea is that text tokens should only be associated with important visual tokens during decoding, eliminating the interference of irrelevant regions. To achieve this, we propose a token dynamics-based method to determine the importance of visual tokens, where visual tokens that exhibit distinct activation patterns during different decoding stages are viewed as important. We apply our approach to representative open-source VLMs and evaluate on various datasets, including visual question answering, visual grounding and counting, optical character recognition, and object hallucination. The results show that our approach significantly improves the performance of baselines. Project page: https://cxliu0.github.io/AIF/.

Chinese Translation

视觉-语言模型（VLMs）在视觉识别、文档解析和视觉定位等多种任务中展现了强大的能力。然而，最近的研究表明，尽管VLMs通常能够捕捉到与问题相对应的正确图像区域，但它们不一定会产生正确的答案。在本研究中，我们证明这种不对齐可能归因于VLMs内部信息流的次优配置，其中文本标记对无关视觉标记分配了过多的注意力，从而导致错误的答案。基于这一观察，我们展示了在推理过程中调节信息流可以提高VLMs的感知能力。我们的想法是，在解码过程中，文本标记应仅与重要的视觉标记相关联，从而消除无关区域的干扰。为此，我们提出了一种基于标记动态的方法来确定视觉标记的重要性，其中在不同解码阶段表现出明显激活模式的视觉标记被视为重要。我们将该方法应用于代表性的开源VLMs，并在多个数据集上进行评估，包括视觉问答、视觉定位与计数、光学字符识别和物体幻觉。结果表明，我们的方法显著提高了基线模型的性能。项目页面：https://cxliu0.github.io/AIF/
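
The idea of restricting a text token's attention to important visual tokens can be sketched in a few lines. This is a simplified illustration under stated assumptions: the importance flags are given as input (the paper derives them from token dynamics, which is not reproduced here), and `refocus_attention` is a hypothetical helper, not the authors' implementation.

```python
def refocus_attention(weights, important):
    """Zero the attention a text token pays to unimportant visual tokens and
    renormalize the remaining weights so they sum to one."""
    kept = [w if keep else 0.0 for w, keep in zip(weights, important)]
    total = sum(kept)
    if total == 0.0:
        # Nothing was flagged as important: leave the distribution untouched.
        return list(weights)
    return [w / total for w in kept]
```

For example, with weights `[0.5, 0.3, 0.2]` and the middle token flagged as unimportant, the attention mass is redistributed over the first and last tokens only.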

cs.CV / 33 / 2604.15814

Continual Hand-Eye Calibration for Open-world Robotic Manipulation

开放世界机器人操作的持续手眼标定

Li, Fazeng, Sun, Gan, Liu, Chenxi, He, Yao, Cong, Wei, Cong, Yang

Abstract

Hand-eye calibration through visual localization is a critical capability for robotic manipulation in open-world environments. However, most deep learning-based calibration models suffer from catastrophic forgetting when adapting to unseen data under open-world scene changes, and simple rehearsal-based continual learning strategies cannot adequately mitigate this issue. To overcome this challenge, we propose a continual hand-eye calibration framework that enables robots to adapt to sequentially encountered open-world manipulation scenes through a spatially aware replay strategy and structure-preserving distillation. Specifically, a Spatial-Aware Replay Strategy (SARS) constructs a geometrically uniform replay buffer that ensures comprehensive coverage of each scene's pose space, replacing redundant adjacent frames with maximally informative viewpoints. Meanwhile, a Structure-Preserving Dual Distillation (SPDD) is proposed to decompose localization knowledge into coarse scene layout and fine pose precision, and to distill them separately to alleviate both types of forgetting during continual adaptation. As a new manipulation scene arrives, SARS provides geometrically representative replay samples from all prior scenes, and SPDD applies structured distillation on these samples to retain previously learned knowledge. After training on the new scene, SARS incorporates selected samples from the new scene into the replay buffer for future rehearsal, allowing the model to continuously accumulate multi-scene calibration capability. Experiments on multiple public datasets show strong resistance to scene forgetting, maintaining accuracy on past scenes while preserving adaptation to new scenes, which confirms the effectiveness of the framework.

Chinese Translation

通过视觉定位进行手眼标定是开放世界环境中机器人操作的一项关键能力。然而，大多数基于深度学习的标定模型在适应开放世界场景变化中的未见数据时，容易遭遇灾难性遗忘，而简单的基于重演的持续学习策略无法有效缓解这一问题。为了解决这一挑战，我们提出了一种持续手眼标定框架，使机器人能够通过空间重放策略和结构保持蒸馏，适应顺序遇到的开放世界操作场景。具体而言，空间感知重放策略（Spatial-Aware Replay Strategy, SARS）构建了一个几何均匀的重放缓冲区，确保对每个场景位姿空间的全面覆盖，用最大信息量的视角替换冗余的相邻帧。同时，提出了结构保持双重蒸馏（Structure-Preserving Dual Distillation, SPDD），将定位知识分解为粗略场景布局和精细位姿精度，并分别进行蒸馏，以减轻持续适应过程中的两种遗忘。当新的操作场景到来时，SARS从所有先前场景中提供几何代表性的重放样本，而SPDD对这些样本应用结构化蒸馏，以保留先前学习的知识。在新场景上训练后，SARS将新场景中选定的样本纳入重放缓冲区，以便未来重演，使模型能够持续积累多场景标定能力。在多个公共数据集上的实验表明，该框架在抗场景遗忘性能上显著提升，能够在保持对过去场景的准确性的同时，适应新场景，验证了框架的有效性。
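
One simple way to build a geometrically spread-out replay buffer is farthest point sampling over camera positions. The abstract does not say which selection rule SARS uses, so the sketch below swaps in this standard technique purely as an illustration; `farthest_point_sampling` and its arguments are hypothetical names.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Pick k indices so that each new pick is the point farthest from all
    points selected so far (greedy coverage of the pose space)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected
```

Given a run of near-duplicate adjacent frames plus a few distant viewpoints, the greedy rule skips the duplicates and keeps the distant views, which matches the stated goal of replacing redundant adjacent frames with informative ones.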

cs.CV / 34 / 2604.15823

Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions

像人类一样观看电影：面向具身伴侣的自我中心情感理解

Dong, Ze, Shi, Hao, Gao, Zejia, Yi, Zhonghua, Wang, Kaiwei, Wang, Lin

Abstract

Embodied robotic agents often perceive movies through an egocentric screen-view interface rather than native cinematic footage, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference. However, existing research on movie emotion understanding is almost exclusively conducted on cinematic footage, limiting cross-domain generalization to real-world viewing scenarios. To bridge this gap, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding. ESE contains 224 movie trailers captured under controlled egocentric screen-view conditions, producing 28,667 temporally aligned key-frames annotated by multiple raters with a confidence-aware multi-label protocol to address emotional ambiguity. We further build a multimodal long-context emotion reasoning framework that models temporal visual evidence, narrative summaries, compressed historical context, and audio cues. Cross-domain experiments reveal a severe domain gap: models trained on cinematic footage drop from 27.99 to 16.69 Macro-F1 when evaluated on realistic egocentric screen-view observations. Training on ESE substantially improves robustness under realistic viewing conditions. Our approach achieves competitive performance compared with strong closed-source multimodal models, highlighting the importance of domain-specific data and long-context multimodal reasoning.

Chinese Translation

具身机器人代理通常通过自我中心的屏幕视角界面而非原生电影画面来感知电影，这引入了视角失真、尺度变化、光照变化和环境干扰等领域转变。然而，现有关于电影情感理解的研究几乎完全基于电影画面进行，限制了在现实观看场景中的跨领域泛化。为了解决这一问题，我们提出了EgoScreen-Emotion (ESE)，这是第一个用于自我中心屏幕视角电影情感理解的基准数据集。ESE包含224个在受控自我中心屏幕视角条件下捕获的电影预告片，生成了28,667个时间对齐的关键帧，这些关键帧由多个评审者使用一种关注置信度的多标签协议进行标注，以解决情感模糊性。我们进一步构建了一个多模态长上下文情感推理框架，该框架对时间视觉证据、叙事摘要、压缩历史上下文和音频线索进行建模。跨领域实验揭示了严重的领域差距：在真实的自我中心屏幕视角观察中，基于电影画面训练的模型的Macro-F1从27.99下降到16.69。在ESE上训练显著提高了在现实观看条件下的鲁棒性。我们的方法与强大的闭源多模态模型相比，表现出竞争力，突显了领域特定数据和长上下文多模态推理的重要性。
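
For readers unfamiliar with the Macro-F1 metric quoted above, a minimal sketch follows. It assumes a multi-label setting in which each key-frame carries a set of emotion labels; the label names in the usage note and the helper `macro_f1` are illustrative and not taken from the ESE benchmark.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores for multi-label predictions.
    y_true and y_pred are lists of label sets, one set per sample."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if c in t and c in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if c not in t and c in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if c in t and c not in p)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

With ground truth `[{"joy"}, {"fear"}, {"joy", "sad"}]` and predictions `[{"joy"}, {"joy"}, {"sad"}]`, the per-class F1 scores are 0.5 (joy), 0.0 (fear), and 1.0 (sad), giving a Macro-F1 of 0.5. On this scale, the reported fall from 27.99 to 16.69 is a relative drop of roughly 40%.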

cs.CV / 35 / 2604.15828

SSFT: A Lightweight Spectral-Spatial Fusion Transformer for Generic Hyperspectral Classification

SSFT：一种轻量级光谱-空间融合变换器用于通用高光谱分类

Musiat, Alexander, Ebert, Nikolas, Wasenmüller, Oliver

Abstract

Hyperspectral imaging enables fine-grained recognition of materials by capturing rich spectral signatures, but learning robust classifiers is challenging due to high dimensionality, spectral redundancy, limited labeled data, and strong domain shifts. Beyond earth observation, labeled HSI data is often scarce and imbalanced, motivating compact models for generic hyperspectral classification across diverse acquisition regimes. We propose the lightweight Spectral-Spatial Fusion Transformer (SSFT), which factorizes representation learning into spectral and spatial pathways and integrates them via cross-attention to capture complementary wavelength-dependent and structural information. We evaluate our SSFT on the challenging HSI-Benchmark, a heterogeneous multi-dataset benchmark covering earth observation, fruit condition assessment, and fine-grained material recognition. SSFT achieves state-of-the-art overall performance, ranking first while using less than 2% of the parameters of the previous leading method. We further evaluate transfer to the substantially larger SpectralEarth benchmark under the official protocol, where SSFT remains competitive despite its compact size. Ablation studies show that both spectral and spatial pathways are crucial, with spatial modeling contributing most, and that SSFT remains robust without data augmentation.

Chinese Translation

高光谱成像通过捕捉丰富的光谱特征实现材料的精细识别，但由于高维度、光谱冗余、有限的标注数据和强烈的领域转移，学习稳健的分类器面临挑战。除了地球观测外，标注的高光谱成像（HSI）数据通常稀缺且不平衡，这促使我们开发紧凑的模型以适应不同采集模式下的通用高光谱分类。我们提出了轻量级光谱-空间融合变换器（SSFT），该模型将表征学习分解为光谱和空间路径，并通过交叉注意力将其整合，以捕捉互补的波长依赖和结构信息。我们在具有挑战性的HSI-Benchmark上评估了SSFT，该基准涵盖了地球观测、果实状况评估和精细材料识别等异构多数据集。SSFT在整体性能上达到最先进水平，排名第一，同时使用的参数不到之前领先方法的2%。我们进一步在官方协议下评估了向规模更大的SpectralEarth基准的迁移，尽管SSFT体积紧凑，但仍保持竞争力。消融研究表明，光谱和空间路径都是至关重要的，其中空间建模贡献最大，并且SSFT在没有数据增强的情况下仍然保持稳健性。

cs.CV / 36 / 2604.15829

Beyond Text Prompts: Precise Concept Erasure through Text-Image Collaboration

超越文本提示：通过文本-图像协作实现精确概念消除

Li, Jun, Xiong, Lizhi, Li, Ziqiang, Jiang, Weiwei, Fu, Zhangjie, Li, Yong, Xie, Guo-Sen

Abstract

Text-to-image generative models have achieved impressive fidelity and diversity, but can inadvertently produce unsafe or undesirable content due to implicit biases embedded in large-scale training datasets. Existing concept erasure methods, whether text-only or image-assisted, face trade-offs: textual approaches often fail to fully suppress concepts, while naive image-guided methods risk over-erasing unrelated content. We propose TICoE, a text-image Collaborative Erasing framework that achieves precise and faithful concept removal through a continuous convex concept manifold and hierarchical visual representation learning. TICoE precisely removes target concepts while preserving unrelated semantic and visual content. To objectively assess the quality of erasure, we further introduce a fidelity-oriented evaluation strategy that measures post-erasure usability. Experiments on multiple benchmarks show that TICoE surpasses prior methods in concept removal precision and content fidelity, enabling safer, more controllable text-to-image generation. Our code is available at https://github.com/OpenAscent-L/TICoE.git

Chinese Translation

文本到图像的生成模型在保真度和多样性方面取得了令人印象深刻的成果，但由于大规模训练数据集中嵌入的隐性偏见，可能会无意中生成不安全或不理想的内容。现有的概念消除方法，无论是仅基于文本还是图像辅助，都面临权衡：文本方法往往无法完全抑制概念，而简单的图像引导方法则有过度消除无关内容的风险。我们提出了TICoE（文本-图像协作消除）框架，通过连续的凸概念流形和分层视觉表示学习，实现精确和忠实的概念移除。TICoE能够精确移除目标概念，同时保留无关的语义和视觉内容。为了客观评估消除的质量，我们进一步引入了一种以保真度为导向的评估策略，测量消除后的可用性。在多个基准测试中的实验表明，TICoE在概念移除精度和内容保真度方面超越了先前的方法，从而实现了更安全、更可控的文本到图像生成。我们的代码可在 https://github.com/OpenAscent-L/TICoE.git 获取。

cs.CV / 37 / 2604.15853

Learning to Look before Learning to Like: Incorporating Human Visual Cognition into Aesthetic Quality Assessment

在学习喜欢之前学习观察：将人类视觉认知纳入美学质量评估

Yu, Liwen, Liu, Chi, Han, Xiaotong, Zhu, Congcong, Wang, Minghao, Shen, Sheng

Abstract

Automated Aesthetic Quality Assessment (AQA) treats images primarily as static pixel vectors, aligning predictions with human-rating scores largely through semantic perception. However, this paradigm diverges from human aesthetic cognition, which arises from dynamic visual exploration shaped by scanning paths, processing fluency, and the interplay between bottom-up salience and top-down intention. We introduce AestheticNet, a novel cognitive-inspired AQA paradigm that integrates human-like visual cognition and semantic perception with a two-pathway architecture. The visual attention pathway, implemented as a gaze-aligned visual encoder (GAVE) pre-trained offline on eye-tracking data using resource-efficient contrast gaze alignment, models attention in the human visual system. This pathway augments the semantic pathway, which uses a fixed semantic encoder such as CLIP, through cross-attention fusion. Visual attention provides a cognitive prior reflecting foreground/background structure, color cascade, brightness, and lighting, all of which are determinants of aesthetic perception beyond semantics. Experiments validated by hypothesis testing show a consistent improvement over semantics-only baselines and demonstrate that the gaze module acts as a model-agnostic corrector compatible with diverse AQA backbones, supporting the necessity and modularity of human-like visual cognition for AQA. Our code is available at https://github.com/keepgallop/AestheticNet.

Chinese Translation

自动化美学质量评估（AQA）主要将图像视为静态像素向量，通过语义感知将预测与人类评分相对齐。然而，这一范式与人类美学认知存在偏差，因为人类的美学认知源于动态视觉探索，这种探索受到扫描路径、处理流畅性以及自下而上的显著性与自上而下的意图之间相互作用的影响。我们提出了AestheticNet，一种新颖的认知启发式AQA范式，它通过双通道架构将类人视觉认知与语义感知相结合。视觉注意力通道作为一个与注视对齐的视觉编码器（GAVE）实现，该编码器在离线眼动追踪数据上进行了资源高效的对比注视对齐预训练，模拟了人类视觉系统的注意力。该通道通过交叉注意力融合增强了使用固定语义编码器（如CLIP）的语义通道。视觉注意力提供了一种认知先验，反映了前景/背景结构、颜色级联、亮度和光照，这些都是超越语义的美学感知决定因素。通过假设检验验证的实验表明，相较于仅依赖语义的基线，模型表现出一致的改进，并展示了注视模块作为一种与模型无关的校正器，兼容多种AQA基础架构，支持类人视觉认知在AQA中的必要性和模块化。我们的代码可在 https://github.com/keepgallop/AestheticNet 获取。

cs.CV / 38 / 2604.15856

Robust Multispectral Semantic Segmentation under Missing or Full Modalities via Structured Latent Projection

通过结构化潜在投影实现缺失或完整模态下的鲁棒多光谱语义分割

Ulku, Irem, Akagündüz, Erdem, Tanrıöver, Ömer Özgür

Abstract

Multimodal remote sensing data provide complementary information for semantic segmentation, but in real-world deployments, some modalities may be unavailable due to sensor failures, acquisition issues, or challenging atmospheric conditions. Existing multimodal segmentation models typically address missing modalities by learning a shared representation across inputs. However, this approach can introduce a trade-off by compromising modality-specific complementary information and reducing performance when all modalities are available. In this paper, we tackle this limitation with CBC-SLP, a multimodal semantic segmentation model designed to preserve both modality-invariant and modality-specific information. Inspired by the theoretical results on modality alignment, which state that perfectly aligned multimodal representations can lead to sub-optimal performance in downstream prediction tasks, we propose a novel structured latent projection approach as an architectural inductive bias. Rather than enforcing this strategy through a loss term, we incorporate it directly into the architecture. In particular, to use the complementary information effectively while maintaining robustness under random modality dropout, we structure the latent representations into shared and modality-specific components and adaptively transfer them to the decoder according to the random modality availability mask. Extensive experiments on three multimodal remote sensing image sets demonstrate that CBC-SLP consistently outperforms state-of-the-art multimodal models across full and missing modality scenarios. Besides, we empirically demonstrate that the proposed strategy can recover the complementary information that may not be preserved in a shared representation. The code is available at https://github.com/iremulku/Multispectral-Semantic-Segmentation-via-Structured-Latent-Projection-CBC-SLP-.

Chinese Translation

多模态遥感数据为语义分割提供了互补信息，但在实际应用中，由于传感器故障、获取问题或复杂的气象条件，某些模态可能不可用。现有的多模态分割模型通常通过学习输入之间的共享表示来解决缺失模态的问题。然而，这种方法可能会引入权衡，妥协模态特定的互补信息，并在所有模态可用时降低性能。本文通过CBC-SLP模型解决这一限制，该模型旨在保留模态不变和模态特定的信息。受到模态对齐理论结果的启发，该理论指出完美对齐的多模态表示可能导致下游预测任务中的次优性能，我们提出了一种新颖的结构化潜在投影方法作为架构归纳偏置。我们并不是通过损失项来强制执行这一策略，而是将其直接融入架构中。特别是，为了有效利用互补信息，同时在随机模态丢失的情况下保持鲁棒性，我们将潜在表示结构化为共享组件和模态特定组件，并根据随机模态可用性掩码自适应地将其转移到解码器。对三个多模态遥感图像数据集的广泛实验表明，CBC-SLP在完整和缺失模态场景下始终优于最先进的多模态模型。此外，我们实证证明所提出的策略能够恢复可能未在共享表示中保留的互补信息。代码可在 https://github.com/iremulku/Multispectral-Semantic-Segmentation-via-Structured-Latent-Projection-CBC-SLP- 获取。
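
The routing of shared and modality-specific latents under an availability mask can be sketched as below. This is a toy illustration under assumptions of our own: latents are plain lists of floats, the shared part is averaged over available modalities, and missing modality-specific parts are zero-filled. The actual CBC-SLP projection and transfer rules are not given in the abstract, and `route_latents` is a hypothetical name.

```python
def route_latents(shared, specific, mask):
    """Build a decoder input from per-modality latents.
    shared[m] and specific[m] are the shared and modality-specific vectors of
    modality m; mask[m] is truthy when modality m is available."""
    available = [m for m, keep in enumerate(mask) if keep]
    if not available:
        raise ValueError("at least one modality must be available")
    dim = len(shared[0])
    # Shared component: average over the modalities that are present.
    fused = [sum(shared[m][d] for m in available) / len(available)
             for d in range(dim)]
    # Specific components: keep available ones, zero-fill missing ones so the
    # decoder input keeps a fixed length.
    for m, keep in enumerate(mask):
        fused.extend(specific[m] if keep else [0.0] * len(specific[m]))
    return fused
```

Keeping the specific slots separate, rather than folding everything into one shared vector, is what lets the full-modality case retain complementary information while the masked case still yields a valid input.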

cs.CV / 39 / 2604.15857

AHS: Adaptive Head Synthesis via Synthetic Data Augmentations

AHS：通过合成数据增强实现自适应头部合成

Kang, Taewoong, Jang, Hyojin, Jeong, Sohyun, Moon, Seunggi, Kim, Gihwi, Jung, Hoon Jin, Choo, Jaegul

Abstract

Recent digital media advancements have created increasing demands for sophisticated portrait manipulation techniques, particularly head swapping, where one person's head is seamlessly integrated with another's body. However, current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. They struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head-reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that AHS achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve identity and expression fidelity across various head orientations and hairstyles. Notably, AHS shows exceptional robustness, maintaining facial identity under drastic expression changes and faithfully preserving accessories under significant head pose variations.

Chinese Translation

近年来，数字媒体的进步对复杂的肖像处理技术提出了越来越高的需求，尤其是头部交换技术，即将一个人的头部无缝地与另一个人的身体融合。然而，当前的方法主要依赖于面部中心裁剪的数据，且视角有限，这大大限制了其在现实世界中的适用性。这些方法在处理多样的头部表情、不同的发型以及超出面部区域的自然融合时面临挑战。为了解决这些局限性，我们提出了自适应头部合成（AHS），该方法有效处理具有多种头部姿态和表情的完整上半身图像。AHS结合了一种新颖的头部重演合成数据增强策略，以克服自监督训练的限制，增强了在不同面部表情和方向下的泛化能力，而无需配对的训练数据。全面的实验表明，AHS在具有挑战性的现实场景中表现优越，生成的结果在各种头部方向和发型下保持了身份和表情的连贯性。值得注意的是，AHS在保持面部身份的同时，对剧烈的表情变化表现出卓越的鲁棒性，并在显著的头部姿态变化下忠实地保留配饰。

cs.CV / 40 / 2604.15862

Splats in Splats++: Robust and Generalizable 3D Gaussian Splatting Steganography

Splats in Splats++：稳健且可推广的3D高斯点云隐写术

Guo, Yijia, Huang, Wenkai, Hu, Tong, Li, Gaolei, Li, Yang, Hong, Yuxin, Hu, Liwen, Ling, Xitong, Li, Jianhua, Chen, Shengbo, Huang, Tiejun, Ma, Lei

Abstract

3D Gaussian Splatting (3DGS) has recently redefined the paradigm of 3D reconstruction, striking an unprecedented balance between visual fidelity and computational efficiency. As its adoption proliferates, safeguarding the copyright of explicit 3DGS assets has become paramount. However, existing invisible message embedding frameworks struggle to reconcile secure and high-capacity data embedding with intrinsic asset utility, often disrupting the native rendering pipeline or exhibiting vulnerability to structural perturbations. In this work, we present Splats in Splats++, a unified and pipeline-agnostic steganography framework that seamlessly embeds high-capacity 3D/4D content directly within the native 3DGS representation. Grounded in a principled analysis of the frequency distribution of Spherical Harmonics (SH), we propose an importance-graded SH coefficient encryption scheme that achieves imperceptible embedding without compromising the original expressive power. To fundamentally resolve the geometric ambiguities that lead to message leakage, we introduce a Hash-Grid Guided Opacity Mapping mechanism. Coupled with a novel Gradient-Gated Opacity Consistency Loss, our formulation enforces a stringent spatial-attribute coupling between the original and hidden scenes, effectively projecting the discrete attribute mapping into a continuous, attack-resilient latent manifold. Extensive experiments demonstrate that our method substantially outperforms existing approaches, achieving up to 6.28 dB higher message fidelity, 3× faster rendering, and exceptional robustness against aggressive 3D-targeted structural attacks (e.g., GSPure). Furthermore, our framework exhibits remarkable versatility, generalizing seamlessly to 2D image embedding, 4D dynamic scene steganography, and diverse downstream tasks.

Chinese Translation

3D高斯点云（3DGS）最近重新定义了3D重建的范式，在视觉保真度和计算效率之间达成了前所未有的平衡。随着其应用的普及，保护显式3DGS资产的版权变得至关重要。然而，现有的隐形信息嵌入框架在安全性与高容量数据嵌入与内在资产效用之间难以调和，常常破坏原生渲染管线或对结构扰动表现出脆弱性。在本研究中，我们提出了Splats in Splats++，一个统一且与管线无关的隐写术框架，能够无缝地将高容量的3D/4D内容直接嵌入到原生3DGS表示中。基于对球谐函数（Spherical Harmonics, SH）频率分布的原则性分析，我们提出了一种重要性分级的SH系数加密方案，实现了在不妥协原始表达能力的情况下进行不可察觉的嵌入。为了从根本上解决导致信息泄露的几何模糊性，我们引入了一种哈希网格引导的不透明度映射机制。结合一种新颖的梯度门控不透明度一致性损失，我们的公式强制原始场景与隐藏场景之间严格的空间属性耦合，有效地将离散属性映射投影到一个连续的、抗攻击的潜在流形中。大量实验表明，我们的方法显著优于现有方法，消息保真度最高提升6.28 dB，渲染速度提升3倍，并对激进的3D目标结构攻击（如GSPure）表现出卓越的鲁棒性。此外，我们的框架展现出显著的多样性，能够无缝推广至2D图像嵌入、4D动态场景隐写术及多种下游任务。

cs.CV / 41 / 2604.15871

UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

UniEditBench：通过蒸馏的多模态大模型（MLLMs）实现的统一且高效的图像和视频编辑基准

Jiang, Lifan, Wu, Tianrun, Pei, Yuhang, Wang, Chenyang, Wu, Boxi, Cai, Deng

Abstract

The evaluation of visual editing models remains fragmented across methods and modalities. Existing benchmarks are often tailored to specific paradigms, making fair cross-paradigm comparisons difficult, while video editing lacks reliable evaluation benchmarks. Furthermore, common automatic metrics often misalign with human preference, yet directly deploying large multimodal models (MLLMs) as evaluators incurs prohibitive computational and financial costs. We present UniEditBench, a unified benchmark for image and video editing that supports reconstruction-based and instruction-driven methods under a shared protocol. UniEditBench includes a structured taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, Reorder) and eight video operations, with coverage of challenging compositional tasks such as counting and spatial reordering. To enable scalable evaluation, we distill a high-capacity MLLM judge (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators that provide multi-dimensional scoring over structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency (for videos). Experiments show that the distilled evaluators maintain strong agreement with human judgments and substantially reduce deployment cost relative to the teacher model. UniEditBench provides a practical and reproducible protocol for benchmarking modern visual editing methods. Our benchmark and the associated reward models are publicly available at https://github.com/wesar1/UniEditBench.

Chinese Translation

视觉编辑模型的评估在方法和模态上仍然存在碎片化现象。现有的基准通常针对特定范式进行定制，使得跨范式的公平比较变得困难，而视频编辑缺乏可靠的评估基准。此外，常用的自动化指标往往与人类偏好不一致，然而直接将大型多模态模型（MLLMs）作为评估者会产生高昂的计算和财务成本。我们提出了UniEditBench，这是一个统一的图像和视频编辑基准，支持基于重建和基于指令的方法，并在共享协议下进行评估。UniEditBench包括九种图像操作（添加、移除、替换、改变、基于笔画、提取、调整、计数、重新排序）和八种视频操作的结构化分类，涵盖了计数和空间重新排序等具有挑战性的组合任务。为了实现可扩展的评估，我们将高容量的MLLM评估者（Qwen3-VL-235B-A22B Instruct）蒸馏为轻量级的4B/8B评估者，这些评估者在结构保真度、文本对齐、背景一致性、自然性和时间-空间一致性（针对视频）等方面提供多维评分。实验表明，蒸馏后的评估者与人类判断保持强一致性，并显著降低了相对于教师模型的部署成本。UniEditBench为现代视觉编辑方法的基准测试提供了一个实用且可重复的协议。我们的基准及相关奖励模型已公开发布，网址为 https://github.com/wesar1/UniEditBench。

cs.CV / 42 / 2604.15875

CLOTH-HUGS: Cloth Aware Human Gaussian Splatting

CLOTH-HUGS：服装感知的人体高斯点云渲染

Mubashshira, Sadia, Amini, Nazanin, Desai, Kevin

Abstract

We present Cloth-HUGS, a Gaussian Splatting-based neural rendering framework for photorealistic clothed human reconstruction that explicitly disentangles body and clothing. Unlike prior methods that absorb clothing into a single body representation and struggle with loose garments and complex deformations, Cloth-HUGS represents the performer using separate Gaussian layers for body and cloth within a shared canonical space. The canonical volume jointly encodes body, cloth, and scene primitives and is deformed through SMPL-driven articulation with learned linear blend skinning weights. To improve cloth realism, we initialize cloth Gaussians from mesh topology and apply physics-inspired constraints, including simulation consistency, ARAP regularization, and mask supervision. We further introduce a depth-aware multi-pass rendering strategy for robust body-cloth-scene compositing, enabling real-time rendering at over 60 FPS. Experiments on multiple benchmarks show that Cloth-HUGS improves perceptual quality and geometric fidelity over state-of-the-art baselines, reducing LPIPS by up to 28% while producing temporally coherent cloth dynamics.

Chinese Translation

我们提出了CLOTH-HUGS，这是一种基于高斯点云的神经渲染框架，用于逼真的穿衣人类重建，明确区分身体和服装。与之前将服装吸收到单一身体表示中并在处理松散衣物和复杂变形时遇到困难的方法不同，CLOTH-HUGS在共享的典范空间内使用独立的高斯层来表示表演者的身体和服装。典范体积共同编码身体、服装和场景原件，并通过SMPL驱动的关节运动与学习的线性混合蒙皮权重进行变形。为了提高服装的真实感，我们从网格拓扑初始化服装高斯，并应用物理启发的约束，包括模拟一致性、ARAP正则化和掩模监督。我们进一步引入了一种深度感知的多通道渲染策略，以实现稳健的身体-服装-场景合成，使得实时渲染超过60帧每秒。多个基准上的实验表明，CLOTH-HUGS在感知质量和几何保真度上优于最先进的基线，LPIPS降低了多达28%，同时产生了时间上连贯的服装动态。
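
The abstract mentions deformation through learned linear blend skinning (LBS) weights. As a rough illustration of what LBS computes, the sketch below blends per-joint rigid transforms by per-point weights and applies the result to canonical points. It is a generic textbook formulation, not code from the paper; the function name, array shapes, and the use of 4x4 homogeneous transforms are assumptions made for this example.

```python
import numpy as np

def linear_blend_skinning(points, weights, transforms):
    """Deform canonical points with linear blend skinning.

    points:     (N, 3) canonical positions (e.g. Gaussian centers)
    weights:    (N, J) skinning weights, each row summing to 1
    transforms: (J, 4, 4) homogeneous joint transforms
    returns:    (N, 3) deformed positions
    """
    # Append a homogeneous coordinate of 1 to every point.
    homo = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    # Blend the joint transforms per point: (N, 4, 4).
    blended = np.einsum("nj,jab->nab", weights, transforms)
    # Apply each blended transform to its point.
    deformed = np.einsum("nab,nb->na", blended, homo)
    return deformed[:, :3]
```

With two joints where the second translates by 2 along x, a point weighted half and half moves by 1 along x, which is the averaging behaviour that makes LBS cheap but also prone to artifacts on loose garments.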

cs.CV / 43 / 2604.15893

PolarMAE: Efficient Fetal Ultrasound Pre-training via Semantic Screening and Polar-Guided Masking

PolarMAE：通过语义筛选和极坐标引导掩蔽实现高效的胎儿超声预训练

Lv, Meng, Li, Yapeng, Su, Hang, Liu, Juhua, Du, Bo

Abstract

Intelligent fetal ultrasound (US) interpretation is crucial for prenatal diagnosis, but high annotation costs and operator-induced variance make unsupervised pre-training a highly promising paradigm. However, existing pre-training methods largely ignore US-specific characteristics -- severe data redundancy, fan-shaped locality, and polar coordinate beamforming -- limiting their effectiveness in downstream tasks. To address this, we propose PolarMAE, a novel and efficient pre-training framework tailored for US images. Specifically, to mitigate continuous scanning redundancy, we introduce a Progressive Visual-Semantic Screening (PVSS) that adaptively extracts high-value samples, significantly boosting pre-training efficiency. Furthermore, we design an Acoustic-Bounded Region Constraint (ABRC) to accommodate US locality, forcing the model to focus strictly on valid acoustic regions rather than invalid dark backgrounds. Finally, leveraging the beamforming prior and local details, we propose a Polar-Texture Collaborative Masking (PTCM), enabling the model to capture underlying radial imaging patterns and critical tissue structures. Extensive experiments across diverse datasets and downstream interpretation tasks demonstrate that our method achieves state-of-the-art performance with strong pre-training scalability and efficiency.

Chinese Translation

智能胎儿超声（US）解读对于产前诊断至关重要，但高昂的标注成本和操作员引起的差异使得无监督预训练成为一种极具前景的范式。然而，现有的预训练方法在很大程度上忽视了超声特有的特征——严重的数据冗余、扇形局部性和极坐标波束形成——限制了它们在下游任务中的有效性。为了解决这个问题，我们提出了PolarMAE，一种新颖且高效的超声图像预训练框架。具体而言，为了减轻连续扫描的冗余，我们引入了一种渐进式视觉-语义筛选（PVSS），自适应地提取高价值样本，显著提升预训练效率。此外，我们设计了一种声学边界区域约束（ABRC），以适应超声的局部性，强制模型严格关注有效的声学区域，而非无效的暗背景。最后，利用波束形成先验和局部细节，我们提出了一种极坐标纹理协同掩蔽（PTCM），使模型能够捕捉潜在的径向成像模式和关键组织结构。通过在多样化的数据集和下游解读任务中进行广泛实验，我们的方法在预训练可扩展性和效率方面实现了最先进的性能。

cs.CV / 44 / 2604.15903

AeroDeshadow: Physics-Guided Shadow Synthesis and Penumbra-Aware Deshadowing for Aerospace Imagery

AeroDeshadow：面向航空航天图像的物理引导阴影合成与半影感知去阴影技术

Lu, Wei, Bo, Zi-Yang, Sang, Fei-Fei, Liu, Yi, Yang, Xue, Chen, Si-Bao

Abstract

Shadows are prevalent in high-resolution aerospace imagery (ASI). They often cause spectral distortion and information loss, which degrade downstream interpretation tasks. While deep learning methods have advanced natural-image shadow removal, their direct application to ASI faces two primary challenges. First, strictly paired training data are severely lacking. Second, homogeneous shadow assumptions fail to handle the broad penumbra transition zones inherent in aerospace scenes. To address these issues, we propose AeroDeshadow, a unified two-stage framework integrating physics-guided shadow synthesis and penumbra-aware restoration. In the first stage, a Physics-aware Degradation Shadow Synthesis Network (PDSS-Net) explicitly models illumination decay and spatial attenuation. This process constructs AeroDS-Syn, a large-scale paired dataset featuring soft boundary transitions. Constrained by this physical formulation, a Penumbra-aware Cascaded DeShadowing Network (PCDS-Net) then decouples the input into umbra and penumbra components. By restoring these regions progressively, PCDS-Net alleviates boundary artifacts and over-correction. Trained solely on the synthetic AeroDS-Syn, the network generalizes to real-world ASI without requiring paired real annotations. Experimental results indicate that AeroDeshadow achieves state-of-the-art quantitative accuracy and visual fidelity across synthetic and real-world datasets. The datasets and code will be made publicly available at: https://github.com/AeroVILab-AHU/AeroDeshadow.

Chinese Translation

阴影在高分辨率航空航天图像（ASI）中普遍存在。它们常常导致光谱失真和信息丢失，从而降低下游解释任务的效果。尽管深度学习方法在自然图像去阴影方面取得了进展，但其直接应用于ASI面临两个主要挑战。首先，严格配对的训练数据严重不足。其次，均匀阴影假设无法处理航空航天场景中固有的广泛半影过渡区域。为了解决这些问题，我们提出了AeroDeshadow，一个统一的两阶段框架，集成了物理引导的阴影合成和半影感知的恢复。在第一阶段，物理感知降解阴影合成网络（PDSS-Net）明确建模了照明衰减和空间衰减。该过程构建了AeroDS-Syn，一个具有柔和边界过渡的大规模配对数据集。在这一物理模型的约束下，半影感知级联去阴影网络（PCDS-Net）随后将输入解耦为阴影和半影成分。通过逐步恢复这些区域，PCDS-Net减轻了边界伪影和过度修正。该网络仅在合成的AeroDS-Syn上训练，能够在不需要配对真实注释的情况下推广到真实世界的ASI。实验结果表明，AeroDeshadow在合成和真实世界数据集上都达到了最先进的定量准确性和视觉保真度。数据集和代码将公开发布于：https://github.com/AeroVILab-AHU/AeroDeshadow。

cs.CV / 45 / 2604.15911

Efficient Video Diffusion Models: Advancements and Challenges

高效视频扩散模型：进展与挑战

Shao, Shitong, Bai, Lichen, Wan, Pengfei, Kwok, James, Xie, Zeke

Abstract

Video diffusion models have rapidly become the dominant paradigm for high-fidelity generative video synthesis, but their practical deployment remains constrained by severe inference costs. Compared with image generation, video synthesis compounds computation across spatial-temporal token growth and iterative denoising, making attention and memory traffic major bottlenecks in real-world settings. This survey provides a systematic and deployment-oriented review of efficient video diffusion models. We propose a unified categorization that organizes existing methods into four classes of main paradigms, including step distillation, efficient attention, model compression, and cache/trajectory optimization. Building on this categorization, we respectively analyze algorithmic trends of these four paradigms and examine how different design choices target two core objectives: reducing the number of function evaluations and minimizing per-step overhead. Finally, we discuss open challenges and future directions, including quality preservation under composite acceleration, hardware-software co-design, robust real-time long-horizon generation, and open infrastructure for standardized evaluation. To the best of our knowledge, our work is the first comprehensive survey on efficient video diffusion models, offering researchers and engineers a structured overview of the field and its emerging research directions.

Chinese Translation

视频扩散模型迅速成为高保真生成视频合成的主导范式，但其实际应用仍受到严重推理成本的限制。与图像生成相比，视频合成在空间-时间标记增长和迭代去噪中增加了计算复杂度，使得注意力和内存流量成为现实环境中的主要瓶颈。本调查提供了一个系统且以部署为导向的高效视频扩散模型的综述。我们提出了一个统一的分类，将现有方法组织为四类主要范式，包括步骤蒸馏（step distillation）、高效注意力（efficient attention）、模型压缩（model compression）和缓存/轨迹优化（cache/trajectory optimization）。基于这一分类，我们分别分析了这四个范式的算法趋势，并考察了不同设计选择如何针对两个核心目标：减少函数评估的次数和最小化每步的开销。最后，我们讨论了开放挑战和未来方向，包括在复合加速下的质量保持、硬件-软件协同设计、稳健的实时长时间生成以及标准化评估的开放基础设施。根据我们所知，我们的工作是关于高效视频扩散模型的首个综合性综述，为研究人员和工程师提供了该领域及其新兴研究方向的结构化概述。

cs.CV / 46 / 2604.15917

Making Image Editing Easier via Adaptive Task Reformulation with Agentic Executions

通过自适应任务重构与代理执行简化图像编辑

Zhao, Bo, Guo, Kairui, Du, Runnan, Sun, Haiyang, Wang, Pengshan, Yang, Huan, Gai, Kun, Cao, Yixin, Ji, Wei

Abstract

Instruction-guided image editing has advanced substantially with recent generative models, yet it still fails to produce reliable results across many seemingly simple cases. We observe that a large portion of these failures stem not from insufficient model capacity, but from poorly formulated editing tasks, such as those involving small targets, implicit spatial relations, or under-specified instructions. In this work, we frame image editing failures as a task formulation problem and propose an adaptive task reformulation framework that improves editing performance without modifying the underlying model. Our key idea is to transform the original image-instruction pair into a sequence of operations that are dynamically determined and executed by an MLLM agent through analysis, routing, reformulation, and feedback-driven refinement. Experiments on multiple benchmarks, including ImgEdit, PICA, and RePlan, across diverse editing backbones such as Qwen Image Edit and Nano Banana, show consistent improvements, with especially large gains on challenging cases. These results suggest that task reformulation is a critical but underexplored factor, and that substantial gains can be achieved by better matching editing tasks to the effective operating regime of existing models.

Chinese Translation

基于指令的图像编辑在最近的生成模型中取得了显著进展，但在许多看似简单的案例中仍然无法产生可靠的结果。我们观察到，这些失败的很大一部分并非源于模型能力不足，而是由于编辑任务的表述不当，例如涉及小目标、隐含空间关系或指令不明确的任务。在本研究中，我们将图像编辑失败视为一个任务表述问题，并提出了一种自适应任务重构框架，该框架在不修改基础模型的情况下提高了编辑性能。我们的关键思想是将原始的图像-指令对转化为一系列操作，这些操作通过分析、路由、重构和基于反馈的细化，由一个多模态大语言模型（MLLM）代理动态确定和执行。在多个基准测试（包括 ImgEdit、PICA 和 RePlan）上的实验结果显示，在多种编辑骨干网络（如 Qwen Image Edit 和 Nano Banana）上均取得了一致的改进，尤其在具有挑战性的案例中获得了显著的提升。这些结果表明，任务重构是一个关键但尚未充分探索的因素，通过更好地匹配编辑任务与现有模型的有效操作范围，可以实现显著的提升。

cs.CV / 47 / 2604.15941

Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction

神经Gabor点云：基于神经Gabor的增强高斯点云用于高频表面重建

Watanabe, Haato, Umetani, Nobuyuki

Abstract

Recent years have witnessed the rapid emergence of 3D Gaussian splatting (3DGS) as a powerful approach for 3D reconstruction and novel view synthesis. Its explicit representation with Gaussian primitives enables fast training, real-time rendering, and convenient post-processing such as editing and surface reconstruction. However, 3DGS suffers from a critical drawback: the number of primitives grows drastically for scenes with high-frequency appearance details, since each primitive can represent only a single color, requiring multiple primitives for every sharp color transition. To overcome this limitation, we propose neural Gabor splatting, which augments each Gaussian primitive with a lightweight multi-layer perceptron that models a wide range of color variations within a single primitive. To further control the number of primitives, we introduce a frequency-aware densification strategy that selects mismatched primitives for pruning and cloning based on frequency energy. Our method achieves accurate reconstruction of challenging high-frequency surfaces. We demonstrate its effectiveness through extensive experiments on both standard benchmarks, such as Mip-NeRF360, and High-Frequency datasets (e.g., checkered patterns), supported by comprehensive ablation studies.

Chinese Translation

近年来，3D高斯点云（3DGS）作为一种强大的3D重建和新视图合成方法迅速崛起。其通过高斯原语的显式表示实现了快速训练、实时渲染以及便捷的后处理，如编辑和表面重建。然而，3DGS存在一个关键缺陷：对于具有高频外观细节的场景，原语数量会急剧增加，因为每个原语只能表示单一颜色，导致每个锐利的颜色过渡需要多个原语。为了解决这一限制，我们提出了神经Gabor点云，该方法为每个高斯原语增添了一个轻量级的多层感知器，以建模单个原语内的广泛颜色变化。为了进一步控制原语数量，我们引入了一种频率感知的稠密化策略，根据频率能量选择不匹配的原语进行修剪和克隆。我们的方法能够准确重建具有挑战性的高频表面。通过在标准基准（如Mip-NeRF360和高频数据集，例如棋盘图案）上的广泛实验以及全面的消融研究，我们证明了其有效性。

cs.CV / 48 / 2604.15946

SENSE: Stereo OpEN Vocabulary SEmantic Segmentation

SENSE：立体开放词汇语义分割

Campagnolo, Thomas, Malis, Ezio, Martinet, Philippe, Bahl, Gaétan

Abstract

Open-vocabulary semantic segmentation enables models to segment objects or image regions beyond fixed class sets, offering flexibility in dynamic environments. However, existing methods often rely on single-view images and struggle with spatial precision, especially under occlusions and near object boundaries. We propose SENSE, the first work on Stereo OpEN Vocabulary SEmantic Segmentation, which leverages stereo vision and vision-language models to enhance open-vocabulary semantic segmentation. By incorporating stereo image pairs, we introduce geometric cues that improve spatial reasoning and segmentation accuracy. Trained on the PhraseStereo dataset, our approach achieves strong performance in phrase-grounded tasks and demonstrates generalization in zero-shot settings. On PhraseStereo, we show a +2.9% improvement in Average Precision over the baseline method and +0.76% over the best competing method. SENSE also provides a relative improvement of +3.5% mIoU on Cityscapes and +18% on KITTI compared to the baseline work. By jointly reasoning over semantics and geometry, SENSE supports accurate scene understanding from natural language, essential for autonomous robots and Intelligent Transportation Systems.

Chinese Translation

开放词汇语义分割使模型能够对超出固定类别集的对象或图像区域进行分割，从而在动态环境中提供灵活性。然而，现有方法通常依赖于单视图图像，并在空间精度方面面临挑战，尤其是在遮挡和物体边界附近。我们提出了SENSE，这是首个关于立体开放词汇语义分割的研究，利用立体视觉和视觉-语言模型来增强开放词汇语义分割。通过结合立体图像对，我们引入了几何线索，以改善空间推理和分割精度。在PhraseStereo数据集上训练后，我们的方法在短语基础任务中表现出色，并在零样本设置中展示了良好的泛化能力。在PhraseStereo上，我们的平均精度比基线方法提高了2.9%，比最佳竞争方法提高了0.76%。与基线工作相比，SENSE在Cityscapes上提供了相对提高3.5%的mIoU，在KITTI上提高了18%。通过对语义和几何进行联合推理，SENSE支持从自然语言中进行准确的场景理解，这对于自主机器人和智能交通系统至关重要。

cs.CV / 49 / 2604.15948

From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance

从竞争到合作竞争：基于文本指导的无训练图像编辑方法

Shen, Jinhao, Du, Haoqian, Zhang, Xulu, Wei, Xiao-Yong, Li, Qing

Abstract

Text-guided image editing, a pivotal task in modern multimedia content creation, has seen remarkable progress with training-free methods that eliminate the need for additional optimization. Despite recent progress, existing methods are typically constrained by a competitive paradigm in which the editing and reconstruction branches are independently driven by their respective objectives to maximize alignment with target and source prompts. This adversarial strategy causes semantic conflicts and unpredictable outcomes due to the lack of coordination between branches. To overcome these issues, we propose Coopetitive Training-Free Image Editing (CoEdit), a novel zero-shot framework that transforms attention control from competition to coopetitive negotiation, achieving editing harmony across spatial and temporal dimensions. Spatially, CoEdit introduces Dual-Entropy Attention Manipulation, which quantifies directional entropic interactions between branches to reformulate attention control as a harmony-maximization problem, eventually improving the localization of editable and preservable regions. Temporally, we present an Entropic Latent Refinement mechanism to dynamically adjust latent representations over time, minimizing accumulated editing errors and ensuring consistent semantic transitions throughout the denoising trajectory. Additionally, we propose the Fidelity-Constrained Editing Score, a composite metric that jointly evaluates semantic editing and background fidelity. Extensive experiments on standard benchmarks demonstrate that CoEdit achieves superior performance in both editing quality and structural preservation, enhancing multimedia information utilization by enabling more effective interaction between visual and textual modalities. The code will be available at https://github.com/JinhaoShen/CoEdit.

Chinese Translation

文本指导的图像编辑是现代多媒体内容创作中的一项关键任务，近年来无训练方法的出现使得无需额外优化而取得了显著进展。尽管取得了一定的进展，现有方法通常受到竞争范式的限制，在该范式中，编辑和重建分支各自独立地驱动其目标，以最大化与目标和源提示的对齐。对抗策略导致了语义冲突和不可预测的结果，因为分支之间缺乏协调。为了解决这些问题，我们提出了合作竞争无训练图像编辑（CoEdit），这是一种新颖的零-shot框架，将注意力控制从竞争转变为合作竞争谈判，实现了在空间和时间维度上的编辑和谐。在空间上，CoEdit引入了双熵注意力操控，量化了分支之间的方向性熵交互，将注意力控制重新表述为和谐最大化问题，从而改善了可编辑和可保留区域的定位。在时间上，我们提出了熵潜在细化机制，以动态调整潜在表示，最小化累积的编辑错误，并确保在去噪轨迹中的一致语义过渡。此外，我们提出了保真度约束编辑评分，这是一种复合指标，联合评估语义编辑和背景保真性。在标准基准上的大量实验表明，CoEdit在编辑质量和结构保留方面均表现出优越性能，通过实现视觉和文本模态之间更有效的互动，增强了多媒体信息的利用。代码将发布在 https://github.com/JinhaoShen/CoEdit。

cs.CV / 50 / 2604.15979

MMGait: Towards Multi-Modal Gait Recognition

MMGait：面向多模态步态识别

Wang, Chenye, Cai, Qingyuan, Hou, Saihui, Li, Aoqi, Huang, Yongzhen

Abstract

Gait recognition has emerged as a powerful biometric technique for identifying individuals at a distance without requiring user cooperation. Most existing methods focus primarily on RGB-derived modalities, which fall short in real-world scenarios requiring multi-modal collaboration and cross-modal retrieval. To overcome these challenges, we present MMGait, a comprehensive multi-modal gait benchmark integrating data from five heterogeneous sensors, including an RGB camera, a depth camera, an infrared camera, a LiDAR scanner, and a 4D Radar system. MMGait contains twelve modalities and 334,060 sequences from 725 subjects, enabling systematic exploration across geometric, photometric, and motion domains. Based on MMGait, we conduct extensive evaluations on single-modal, cross-modal, and multi-modal paradigms to analyze modality robustness and complementarity. Furthermore, we introduce a new task, Omni Multi-Modal Gait Recognition, which aims to unify the above three gait recognition paradigms within a single model. We also propose a simple yet powerful baseline, OmniGait, which learns a shared embedding space across diverse modalities and achieves promising recognition performance. The MMGait benchmark, codebase, and pretrained checkpoints are publicly available at https://github.com/BNU-IVC/MMGait.

Chinese Translation

步态识别作为一种强大的生物特征识别技术，能够在不需要用户合作的情况下远距离识别个体。现有大多数方法主要集中在基于RGB的模态上，这在需要多模态协作和跨模态检索的现实场景中显得不足。为了解决这些挑战，我们提出了MMGait，这是一个综合性的多模态步态基准，整合了来自五种异构传感器的数据，包括RGB相机、深度相机、红外相机、LiDAR扫描仪和4D雷达系统。MMGait包含十二种模态和来自725个受试者的334,060个序列，使得在几何、光度和运动领域的系统性探索成为可能。基于MMGait，我们对单模态、跨模态和多模态范式进行了广泛评估，以分析模态的鲁棒性和互补性。此外，我们引入了一项新任务，Omni Multi-Modal Gait Recognition，旨在将上述三种步态识别范式统一到一个模型中。我们还提出了一个简单而强大的基线模型OmniGait，该模型在不同模态之间学习共享嵌入空间，并取得了令人满意的识别性能。MMGait基准、代码库和预训练检查点已公开发布，网址为https://github.com/BNU-IVC/MMGait。

cs.CV / 51 / 2604.16010

IA-CLAHE: Image-Adaptive Clip Limit Estimation for CLAHE

IA-CLAHE：图像自适应的CLAHE剪切限制估计

Otsuka, Rikuto, Shoji, Yuho, Ogino, Yuka, Toizumi, Takahiro, Ito, Atsushi

Abstract

This paper proposes image-adaptive contrast limited adaptive histogram equalization (IA-CLAHE). Conventional CLAHE is widely used to boost the performance of various computer vision tasks and to improve visual quality for human perception in practical industrial applications. CLAHE applies contrast-limited histogram equalization to each local region to enhance local contrast. However, CLAHE often leads to over-enhancement, because the contrast-limiting parameter, the clip limit, is fixed regardless of the histogram distribution of each local region. Our IA-CLAHE addresses this limitation by adaptively estimating tile-wise clip limits from the input image. To achieve this, we train a lightweight clip-limit estimator with a differentiable extension of CLAHE, enabling end-to-end optimization. Unlike prior learning-based CLAHE methods, IA-CLAHE does not require pre-searched ground-truth clip limits or task-specific datasets, because it learns to map input image histograms toward a domain-invariant uniform distribution, enabling zero-shot generalization across diverse conditions. Experimental results show that IA-CLAHE consistently improves recognition performance, while simultaneously enhancing visual quality for human perception, without requiring any task-specific training data.

Chinese Translation

本文提出了一种图像自适应的对比度限制自适应直方图均衡（IA-CLAHE）。传统的CLAHE广泛应用于提升各种计算机视觉任务的性能，并在实际工业应用中改善人类感知的视觉质量。CLAHE对每个局部区域应用对比度限制直方图均衡，以增强局部对比度。然而，由于对比度限制参数剪切限制是固定的，忽略了每个局部区域的直方图分布，因此CLAHE常常导致过度增强。我们的IA-CLAHE通过自适应地从输入图像中估计每个瓦片的剪切限制来解决这一限制。为此，我们训练了一个轻量级的剪切限制估计器，该估计器使用CLAHE的可微扩展形式，使得端到端优化成为可能。与之前基于学习的CLAHE方法不同，IA-CLAHE不需要预先搜索的真实剪切限制或特定任务的数据集，因为它学习将输入图像的直方图映射到一个领域不变的均匀分布，从而实现跨多种条件的零样本泛化。实验结果表明，IA-CLAHE在提高识别性能的同时，增强了人类感知的视觉质量，而无需任何特定任务的训练数据。
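
To make the role of the clip limit concrete, the sketch below builds the equalization lookup table for a single tile the way conventional CLAHE does: histogram bins are capped at a ceiling, the clipped mass is spread evenly over all bins, and the resulting CDF becomes the intensity mapping. This illustrates only the fixed-clip-limit baseline, not the IA-CLAHE estimator; the function name and the convention of expressing the clip limit as a multiple of the mean bin count are assumptions for this example.

```python
import numpy as np

def clipped_equalization_lut(tile, clip_limit, n_bins=256):
    """Build a contrast-limited equalization lookup table for one uint8 tile."""
    hist = np.bincount(tile.ravel(), minlength=n_bins).astype(np.float64)
    # Assumed convention: the ceiling is clip_limit times the mean bin count.
    ceiling = clip_limit * hist.mean()
    # Total mass above the ceiling, to be redistributed.
    excess = np.clip(hist - ceiling, 0, None).sum()
    hist = np.minimum(hist, ceiling)
    # Spread the clipped mass uniformly over all bins.
    hist += excess / n_bins
    # The normalized CDF is the intensity mapping.
    cdf = np.cumsum(hist)
    cdf /= cdf[-1]
    return np.round(cdf * (n_bins - 1)).astype(np.uint8)
```

A small clip limit keeps the mapping close to linear, while a large one approaches plain histogram equalization and stretches a narrow intensity range much more aggressively. That trade-off is why a single fixed value can over-enhance some tiles.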

cs.CV / 52 / 2604.16011

Breakout-picker: Reducing false positives in deep learning-based borehole breakout characterization from acoustic image logs

Breakout-picker：基于深度学习的孔壁破裂特征识别中减少误报

Wang, Guangyu, Ma, Xiaodong, Wu, Xinming

Abstract

Borehole breakouts are stress-induced spalling on the borehole wall, which are identifiable in acoustic image logs as paired zones with near-symmetric azimuths, low acoustic amplitudes, and increased borehole radius. Accurate breakout characterization is crucial for in-situ stress analysis. In recent years, deep learning has been introduced to automate the time-consuming and labor-intensive breakout picking process. However, existing approaches often suffer from misclassification of non-breakout features, leading to high false positive rates. To address this limitation, this study develops a deep learning framework, termed Breakout-picker, with a specific focus on reducing false positives in automatic breakout characterization. Breakout-picker reduces false positives through two strategies. First, the training of Breakout-picker incorporates negative samples of non-breakout features, including natural fractures, keyseats, and logging artifacts. They share similar characteristics with breakouts, such as low acoustic amplitude or locally enlarged borehole radius. These negative training samples enable Breakout-picker to better discriminate between true breakouts and similar non-breakout features. Second, candidate breakouts identified by Breakout-picker are further validated by azimuthal symmetry criteria, whereby detections that do not exhibit the near-symmetry characteristics of breakout azimuth are excluded. The performance of Breakout-picker is evaluated using three acoustic image log datasets from different regions. The results demonstrate that Breakout-picker outperforms other automatic methods with higher accuracy and substantially lower false positive rates. By reducing false positives, Breakout-picker enhances the reliability of automatic breakout characterization from acoustic image logs, which in turn benefits in-situ stress analysis based on borehole breakouts.

Chinese Translation

孔壁破裂是由应力引起的孔壁剥落，在声学图像日志中可识别为具有近对称方位、低声学振幅和增大的孔径的成对区域。准确的破裂特征识别对于原位应力分析至关重要。近年来，深度学习被引入以自动化耗时且劳动密集的破裂识别过程。然而，现有方法往往面临对非破裂特征的误分类，导致高误报率。为了解决这一局限性，本研究开发了一种名为Breakout-picker的深度学习框架，专注于减少自动破裂特征识别中的误报。Breakout-picker通过两种策略减少误报。首先，Breakout-picker的训练纳入了非破裂特征的负样本，包括自然裂缝、钥匙槽和测井伪影。这些负样本与破裂特征具有相似的特征，如低声学振幅或局部增大的孔径。这些负训练样本使Breakout-picker能够更好地区分真实破裂和相似的非破裂特征。其次，Breakout-picker识别的候选破裂通过方位对称性标准进一步验证，不符合破裂方位近对称特征的检测结果将被排除。使用来自不同地区的三个声学图像日志数据集对Breakout-picker的性能进行了评估。结果表明，Breakout-picker在准确性上优于其他自动方法，并显著降低了误报率。通过减少误报，Breakout-picker增强了基于声学图像日志的自动破裂特征识别的可靠性，从而有利于基于孔壁破裂的原位应力分析。
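
The azimuthal symmetry check described above could look roughly like the following, assuming that "near-symmetry" means the two zones of a breakout pair sit about 180 degrees apart on the borehole wall. The abstract does not give the actual criterion, so the function names and the 20-degree tolerance are hypothetical.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)

def is_symmetric_pair(az1_deg, az2_deg, tolerance_deg=20.0):
    """True when two detections lie roughly on opposite sides of the borehole.

    tolerance_deg is a hypothetical threshold chosen for illustration only.
    """
    return abs(angular_difference(az1_deg, az2_deg) - 180.0) <= tolerance_deg
```

For example, detections at azimuths 10 and 195 would be kept as a candidate pair, whereas detections at 10 and 100 would be rejected as not diametrically opposed.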

cs.CV / 53 / 2604.16034

Ranking XAI Methods for Head and Neck Cancer Outcome Prediction

头颈癌预后预测的可解释人工智能方法排名

Ma, Baoqiang, Madzia-Madzou, Djennifer K., Kraaijveld, Rosa C. J., Ouyang, Jin

Abstract

For head and neck cancer (HNC) patients, prognostic outcome prediction can support personalized treatment strategy selection. Improving prediction performance of HNC outcomes has been extensively explored by using advanced artificial intelligence (AI) techniques on PET/CT data. However, the interpretability of AI remains a critical obstacle for its clinical adoption. Unlike previous HNC studies that empirically selected explainable AI (XAI) techniques, we are the first to comprehensively evaluate and rank 13 XAI methods across 24 metrics, covering faithfulness, robustness, complexity and plausibility. Experimental results on the multi-center HECKTOR challenge dataset show large variations across evaluation aspects among different XAI methods, with Integrated Gradients (IG) and DeepLIFT (DL) consistently obtaining high rankings for faithfulness, complexity and plausibility. This work highlights the importance of comprehensive XAI method evaluation and can be extended to other medical imaging tasks.

Chinese Translation

对于头颈癌（HNC）患者，预后结果预测可以支持个性化治疗策略的选择。通过在PET/CT数据上使用先进的人工智能（AI）技术，改善HNC结果的预测性能已被广泛研究。然而，AI的可解释性仍然是其临床应用的一大障碍。与以往经验性选择可解释人工智能（XAI）技术的HNC研究不同，我们首次全面评估并排名了13种XAI方法，涵盖了24个指标，包括可信度、稳健性、复杂性和合理性。在多中心HECKTOR挑战数据集上的实验结果显示，不同XAI方法在评估方面存在较大差异，其中集成梯度（Integrated Gradients, IG）和深度学习特征归因（DeepLIFT, DL）在可信度、复杂性和合理性方面始终获得高排名。本研究强调了全面评估XAI方法的重要性，并可扩展至其他医学影像任务。

cs.CV / 54 / 2604.16044

Elucidating the SNR-t Bias of Diffusion Probabilistic Models

阐明扩散概率模型的信噪比-时间步（SNR-t）偏差

Yu, Meng, Sun, Lei, Zeng, Jianhao, Chu, Xiangxiang, Zhan, Kun

Abstract

Diffusion Probabilistic Models have demonstrated remarkable performance across a wide range of generative tasks. However, we have observed that these models often suffer from a Signal-to-Noise Ratio-timestep (SNR-t) bias. This bias refers to the misalignment between the SNR of the denoising sample and its corresponding timestep during the inference phase. Specifically, during training, the SNR of a sample is strictly coupled with its timestep. However, this correspondence is disrupted during inference, leading to error accumulation and impairing the generation quality. We provide comprehensive empirical evidence and theoretical analysis to substantiate this phenomenon and propose a simple yet effective differential correction method to mitigate the SNR-t bias. Recognizing that diffusion models typically reconstruct low-frequency components before focusing on high-frequency details during the reverse denoising process, we decompose samples into various frequency components and apply differential correction to each component individually. Extensive experiments show that our approach significantly improves the generation quality of various diffusion models (IDDPM, ADM, DDIM, A-DPM, EA-DPM, EDM, PFGM++, and FLUX) on datasets of various resolutions with negligible computational overhead. The code is at https://github.com/AMAP-ML/DCW.

Chinese Translation

扩散概率模型在广泛的生成任务中表现出色。然而，我们观察到这些模型通常存在信噪比-时间步（SNR-t）偏差。该偏差指的是在推理阶段，去噪样本的信噪比与其对应的时间步之间的不一致性。具体而言，在训练过程中，样本的信噪比与其时间步严格相关。然而，在推理过程中，这种对应关系被打破，导致误差累积，从而影响生成质量。我们提供了全面的实证证据和理论分析以支持这一现象，并提出了一种简单而有效的差分校正方法来减轻SNR-t偏差。我们认识到，扩散模型通常在反向去噪过程中先重建低频成分，然后再关注高频细节，因此我们将样本分解为不同的频率成分，并对每个成分单独应用差分校正。大量实验表明，我们的方法显著提高了各种扩散模型（IDDPM、ADM、DDIM、A-DPM、EA-DPM、EDM、PFGM++和FLUX）在不同分辨率数据集上的生成质量，且计算开销微乎其微。代码可在 https://github.com/AMAP-ML/DCW 获取。
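
The training-time coupling between SNR and timestep can be illustrated with the standard DDPM forward process, in which a noisy sample is sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise and therefore SNR(t) = alpha_bar_t / (1 - alpha_bar_t). The sketch below is not taken from the paper and does not implement its correction method; the linear beta schedule, its default values, and the function name are assumptions chosen for illustration.

```python
import numpy as np

def snr_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-timestep SNR for a DDPM-style linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    # Cumulative product of (1 - beta) gives the signal retention alpha_bar_t.
    alpha_bar = np.cumprod(1.0 - betas)
    # Signal power over noise power at each timestep.
    return alpha_bar / (1.0 - alpha_bar)
```

Because this mapping is strictly decreasing in t, each timestep corresponds to exactly one SNR during training. The bias discussed in the abstract arises when a sample at inference carries a different SNR than the one its timestep implies.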

cs.CV / 55 / 2604.16054

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

心灵之眼：多模态大型语言模型的视觉抽象、转化与组合基准测试

Sinha, Rohit, Kanade, Aditya, Kancheti, Sai Srinivas, Balasubramanian, Vineeth N, Ganu, Tanuja

Abstract

Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals failures in (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.

Chinese Translation

多模态大型语言模型（MLLMs）在视觉语言基准测试中取得了显著进展，但它们在视觉认知和视觉空间推理方面的能力仍然不够清晰。我们提出了“心灵之眼”，这是一个基于经典人类智力测试灵感而设计的多项选择基准，包含八个视觉认知任务，并按照一种新颖的“A-R-T”分类法进行组织：抽象（Abstraction）、关系（Relation）和转化（Transformation）。这些任务探讨了流体智力的核心过程，如模式归纳、类比关系映射和心理转化。我们评估了一系列多样的封闭源和开放源MLLM，并将其表现与人类参与者进行比较。人类的准确率达到80%，而表现最佳的MLLM仍低于50%。错误分析揭示了以下几个方面的失败：（i）视觉注意力分配，（ii）内部感知操作，以及（iii）对潜在视觉概念的弱抽象。我们的研究结果表明，当前的MLLM在视觉空间推理能力方面相较于人类参与者表现有限，突显了需要更具认知基础的评估框架。

cs.CV / 56 / 2604.16060

Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

链式思维降低了多模态大语言模型的视觉空间推理能力

Kancheti, Sai Srinivas, Kanade, Aditya Sanjiv, Balasubramanian, Vineeth N., Ganu, Tanuja

Abstract

Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT prompted MLMs suffer from severe shortcut learning, and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.

Chinese Translation

基于链式思维（Chain-of-Thought, CoT）的多模态推理模型（Multimodal Reasoning Models, MRM）在数学和逻辑问题解决方面引发了革命。然而，我们发现这一范式在广义空间智能方面存在困难。我们对十七个模型在十三个空间基准测试中的表现进行了全面评估，识别出一个关键差距：CoT 提示在视觉空间推理中持续降低性能。此外，通过一种新颖的无图像++（No-Image++）消融实验，我们证明了多模态推理模型和 CoT 提示的语言模型（MLMs）遭受严重的捷径学习，并在图像缺失的情况下从文本先验中幻觉出视觉细节。这些发现挑战了文本唯一的 CoT 在空间任务中的有效性，并强调了以视觉为中心的推理范式的必要性。

cs.CV / 57 / 2604.16070

TableSeq: Unified Generation of Structure, Content, and Layout

TableSeq：结构、内容和布局的统一生成

Hamdi, Laziz, Tamasna, Amine, Boisson, Pascal, Paquet, Thierry

Abstract

We present TableSeq, an image-only, end-to-end framework for joint table structure recognition, content recognition, and cell localization. The model formulates these tasks as a single sequence-generation problem: one decoder produces an interleaved stream of HTML tags, cell text, and discretized coordinate tokens, thereby aligning logical structure, textual content, and cell geometry within a unified autoregressive sequence. This design avoids external OCR, auxiliary decoders, and complex multi-stage post-processing. TableSeq combines a lightweight high-resolution FCN-H16 encoder with a minimal structure-prior head and a single-layer transformer encoder, yielding a compact architecture that remains effective on challenging layouts. Across standard benchmarks, TableSeq achieves competitive or state-of-the-art results while preserving architectural simplicity. It reaches 95.23 TEDS / 96.83 S-TEDS on PubTabNet, 97.45 TEDS / 98.69 S-TEDS on FinTabNet, and 99.79 / 99.54 / 99.66 precision / recall / F1 on SciTSR under the CAR protocol, while remaining competitive on PubTables-1M under GriTS. Beyond TSR/TCR, the same sequence interface generalizes to index-based table querying without task-specific heads, achieving the best IRDR score and competitive ICDR/ICR performance. We also study multi-token prediction for faster blockwise decoding and show that it reduces inference latency with only limited accuracy degradation. Overall, TableSeq provides a practical and reproducible single-stream baseline for unified table recognition, and the source code will be made publicly available at https://github.com/hamdilaziz/TableSeq.

Chinese Translation

我们提出了TableSeq，这是一种仅基于图像的端到端框架，用于联合表格结构识别、内容识别和单元格定位。该模型将这些任务形式化为一个单一的序列生成问题：一个解码器生成交错的HTML标签、单元格文本和离散坐标标记流，从而在统一的自回归序列中对齐逻辑结构、文本内容和单元格几何。该设计避免了外部OCR、辅助解码器和复杂的多阶段后处理。TableSeq结合了轻量级高分辨率FCN-H16编码器、最小结构先验头和单层变换器编码器，形成了一个紧凑的架构，在复杂布局上仍然有效。在标准基准测试中，TableSeq实现了具有竞争力或最先进的结果，同时保持了架构的简单性。在PubTabNet上达到了95.23 TEDS / 96.83 S-TEDS，在FinTabNet上达到了97.45 TEDS / 98.69 S-TEDS，在SciTSR的CAR协议下达到了99.79 / 99.54 / 99.66的精确度/召回率/F1，同时在GriTS下的PubTables-1M上保持竞争力。除了TSR/TCR之外，同一序列接口可以推广到基于索引的表查询，无需特定任务的头，获得最佳的IRDR得分和具有竞争力的ICDR/ICR表现。我们还研究了多标记预测以加快块状解码，并表明这在仅有有限准确度下降的情况下减少了推理延迟。总体而言，TableSeq为统一表格识别提供了一个实用且可重复的单流基线，源代码将公开发布在https://github.com/hamdilaziz/TableSeq。
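
Digest note (illustrative sketch): the interleaved stream of HTML tags, cell text, and discretized coordinate tokens described in the abstract might look roughly like the following. This is a minimal sketch under stated assumptions, not the authors' tokenizer: the `<loc_k>` token naming, the 1000-bin quantization, and character-level text tokens are all hypothetical choices made for illustration.

```python
def quantize(value, size, bins=1000):
    """Map a pixel coordinate in [0, size] to an integer bin in [0, bins - 1]."""
    ratio = min(max(value / size, 0.0), 1.0)
    return min(int(ratio * bins), bins - 1)


def cell_tokens(text, box, width, height, bins=1000):
    """Emit one cell as: <td>, four coordinate tokens, text characters, </td>."""
    x0, y0, x1, y1 = box
    coords = [
        quantize(x0, width, bins),
        quantize(y0, height, bins),
        quantize(x1, width, bins),
        quantize(y1, height, bins),
    ]
    # Hypothetical token format: structure tag, geometry, then content.
    return ["<td>"] + [f"<loc_{c}>" for c in coords] + list(text) + ["</td>"]


def table_to_sequence(rows, width, height, bins=1000):
    """Flatten a table (rows of (text, box) cells) into one token sequence."""
    seq = ["<table>"]
    for row in rows:
        seq.append("<tr>")
        for text, box in row:
            seq.extend(cell_tokens(text, box, width, height, bins))
        seq.append("</tr>")
    seq.append("</table>")
    return seq


# Toy 1x2 table on a 100x20 pixel image.
rows = [[("A", (0, 0, 50, 20)), ("B", (50, 0, 100, 20))]]
sequence = table_to_sequence(rows, width=100, height=20)
```

Because structure, geometry, and text share a single sequence, one autoregressive decoder can in principle emit all three without a separate localization head, which is the design point the abstract emphasizes.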

cs.CV / 58 / 2604.16079

The Amazing Stability of Flow Matching

流匹配的惊人稳定性

Briq, Rania, Kamp, Michael, Fried, Ohad, Cohen, Sarel, Kesselheim, Stefan

Abstract

The success of deep generative models in generating high-quality and diverse samples is often attributed to particular architectures and large training datasets. In this paper, we investigate the impact of these factors on the quality and diversity of samples generated by flow-matching models. Surprisingly, in our experiments on the CelebA-HQ dataset, flow matching remains stable even when pruning 50% of the dataset. That is, the quality and diversity of generated samples are preserved. Moreover, pruning impacts the latent representation only slightly, that is, samples generated by models trained on the full and pruned datasets map to visually similar outputs for a given seed. We observe similar stability when changing the architecture or training configuration, such that the latent representation is maintained under these changes as well. Our results quantify just how strong this stability can be in practice, and help explain the reliability of flow-matching models under various perturbations.

Chinese Translation

深度生成模型在生成高质量和多样化样本方面的成功，通常归因于特定的架构和大规模的训练数据集。本文研究了这些因素对流匹配（flow-matching）模型生成的样本质量和多样性的影响。令人惊讶的是，在我们对CelebA-HQ数据集的实验中，即使在修剪50%的数据集时，流匹配仍然保持稳定。也就是说，生成样本的质量和多样性得以保留。此外，修剪对潜在表示的影响也很小，即在完整数据集和修剪数据集上训练的模型生成的样本，对于给定的种子映射到视觉上相似的输出。当改变架构或训练配置时，我们观察到类似的稳定性，这表明在这些变化下潜在表示也得以维持。我们的结果量化了这种稳定性在实践中可以有多强，并帮助解释了流匹配模型在各种扰动下的可靠性。

cs.CV / 59 / 2604.16082

Early Detection of Acute Myeloid Leukemia (AML) Using YOLOv12 Deep Learning Model

使用 YOLOv12 深度学习模型早期检测急性髓性白血病 (AML)

Ahmed, Enas E., Aly, Salah A., Moner, Mayar

Abstract

Acute Myeloid Leukemia (AML) is one of the most life-threatening types of blood cancer, and its accurate classification remains a challenging task due to the visual similarity between various cell types. This study addresses the multiclass classification of AML cells using the YOLOv12 deep learning model. We applied two segmentation approaches based on cell and nucleus features, using Hue channel and Otsu thresholding techniques to preprocess the images prior to classification. Our experiments demonstrate that YOLOv12 with Otsu thresholding on cell-based segmentation achieved the highest validation and test accuracy, both reaching 99.3%.

Chinese Translation

急性髓性白血病 (AML) 是一种最具威胁生命的血癌类型，其准确分类被认为是一项具有挑战性的任务，因为不同细胞类型之间存在视觉相似性。本研究利用 YOLOv12 深度学习模型解决了 AML 细胞多类的分类问题。我们采用了基于细胞和细胞核特征的两种分割方法，使用色调通道和 Otsu 阈值技术对图像进行预处理，以便进行分类。我们的实验表明，基于细胞分割的 Otsu 阈值处理的 YOLOv12 模型在验证和测试准确率上均达到了最高水平，均为 99.3%。
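
Digest note (illustrative sketch): the two preprocessing ingredients named in the abstract, the Hue channel and Otsu thresholding, can be sketched in pure Python as below. This only illustrates the textbook techniques; how the authors combine them, and the function names `hue_levels` and `otsu_threshold`, are assumptions and not taken from the paper.

```python
import colorsys


def hue_levels(rgb_pixels):
    """Convert (R, G, B) pixels in 0..255 to hue values scaled to 0..255."""
    hues = []
    for r, g, b in rgb_pixels:
        h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hues.append(int(round(h * 255)))
    return hues


def otsu_threshold(pixels, levels=256):
    """Return the threshold t maximizing between-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # no pixels in the lower class yet
        if w1 == 0:
            break  # every pixel is already in the lower class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


# Toy bimodal "image": dark background pixels and bright cell pixels.
gray = [10] * 50 + [200] * 50
t = otsu_threshold(gray)
mask = [p > t for p in gray]  # True marks the brighter class
```

In practice a library routine (for example an image-processing toolkit's Otsu implementation) would normally be used on full 2-D images; the loop above is only meant to show what the threshold optimizes.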

cs.CV / 60 / 2604.16083

DINOv3 Beats Specialized Detectors: A Simple Foundation Model Baseline for Image Forensics

DINOv3 超越专业检测器：图像取证的简单基础模型基线

Yu, Jieming, Feng, Qiuxiao, Wang, Zhuohan, Ma, Xiaochen

Abstract

With the rapid advancement of deep generative models, realistic fake images have become increasingly accessible, yet existing localization methods rely on complex designs and still struggle to generalize across manipulation types and imaging conditions. We present a simple but strong baseline based on DINOv3 with LoRA adaptation and a lightweight convolutional decoder. Under the CAT-Net protocol, our best model improves average pixel-level F1 by 17.0 points over the previous state of the art on four standard benchmarks using only 9.1M trainable parameters on top of a frozen ViT-L backbone, and even our smallest variant surpasses all prior specialized methods. LoRA consistently outperforms full fine-tuning across all backbone scales. Under the data-scarce MVSS-Net protocol, LoRA reaches an average F1 of 0.774 versus 0.530 for the strongest prior method, while full fine-tuning becomes highly unstable, suggesting that pre-trained representations encode forensic information that is better preserved than overwritten. The baseline also exhibits strong robustness to Gaussian noise, JPEG re-compression, and Gaussian blur. We hope this work can serve as a reliable baseline for the research community and a practical starting point for future image-forensic applications. Code is available at https://github.com/Irennnne/DINOv3-IML.

Chinese Translation

随着深度生成模型的快速发展，逼真的假图像变得越来越容易获取，然而现有的定位方法依赖于复杂的设计，并且在不同的操控类型和成像条件下仍然难以泛化。我们提出了一种基于 DINOv3 的简单但强大的基线，结合 LoRA 适配和轻量级卷积解码器。在 CAT-Net 协议下，我们的最佳模型在冻结的 ViT-L 骨干网络之上仅使用 910 万（9.1M）可训练参数，在四个标准基准上相较于之前的最先进方法将平均像素级 F1 提高了 17.0 个点，即使是我们最小的变体也超越了所有先前的专业方法。LoRA 在所有骨干网络规模上均优于完全微调。在数据稀缺的 MVSS-Net 协议下，LoRA 的平均 F1 达到 0.774，而最强的先前方法为 0.530，同时完全微调变得高度不稳定，这表明预训练的表示编码了取证信息，保留这些信息比覆盖它们更好。该基线在高斯噪声、JPEG 重新压缩和高斯模糊下也表现出强大的鲁棒性。我们希望这项工作能为研究社区提供一个可靠的基线，并为未来的图像取证应用提供一个实用的起点。代码可在 https://github.com/Irennnne/DINOv3-IML 获取。
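
Digest note (illustrative sketch): the LoRA adaptation mentioned in the abstract can be illustrated with a minimal pure-Python sketch, assuming the standard low-rank update h = Wx + (alpha / r) · B(Ax) with the pretrained weight W kept frozen. The function names, toy shapes, and values are hypothetical and are not taken from the paper's repository.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Frozen base projection plus a scaled low-rank update.

    W: d_out x d_in (frozen), A: r x d_in (trainable), B: d_out x r (trainable).
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA pair (A and B only)."""
    return rank * (d_in + d_out)


# Toy example: 2x2 identity base weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [2.0]]
y = lora_forward(W, A, B, [1.0, 2.0], alpha=2.0)
```

Only A and B receive gradients in this scheme, which is why the trainable-parameter count stays small relative to the frozen backbone; with B initialized to zero the adapted layer starts out identical to the pretrained one.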

cs.CV / 61 / 2604.16086

Stylistic-STORM (ST-STORM): Perceiving the Semantic Nature of Appearance

风格化-STORM (ST-STORM)：感知外观的语义特性

Ouattara, Hamed, Duthon, Pierre, Salmane, Pascal Houssam, Bernardin, Frédéric, Aider, Omar Ait

Abstract

One of the dominant paradigms in self-supervised learning (SSL), illustrated by MoCo or DINO, aims to produce robust representations by capturing features that are insensitive to certain image transformations such as illumination, or geometric changes. This strategy is appropriate when the objective is to recognize objects independently of their appearance. However, it becomes counterproductive as soon as appearance itself constitutes the discriminative signal. In weather analysis, for example, rain streaks, snow granularity, atmospheric scattering, as well as reflections and halos, are not noise: they carry the essential information. In critical applications such as autonomous driving, ignoring these cues is risky, since grip and visibility depend directly on ground conditions and atmospheric conditions. We introduce ST-STORM, a hybrid SSL framework that treats appearance (style) as a semantic modality to be disentangled from content. Our architecture explicitly separates two latent streams, regulated by gating mechanisms. The Content branch aims at a stable semantic representation through a JEPA scheme coupled with a contrastive objective, promoting invariance to appearance variations. In parallel, the Style branch is constrained to capture appearance signatures (textures, contrasts, scattering) through feature prediction and reconstruction under an adversarial constraint. We evaluate ST-STORM on several tasks, including object classification (ImageNet-1K), fine-grained weather characterization, and melanoma detection (ISIC 2024 Challenge). The results show that the Style branch effectively isolates complex appearance phenomena (F1=97% on Multi-Weather and F1=94% on ISIC 2024 with 10% labeled data), without degrading the semantic performance (F1=80% on ImageNet-1K) of the Content branch, and improves the preservation of critical appearance

Chinese Translation

自监督学习（SSL）中的主导范式之一，如MoCo或DINO，旨在通过捕捉对某些图像变换（如光照或几何变化）不敏感的特征来生成稳健的表示。当目标是独立于外观识别对象时，这一策略是合适的。然而，一旦外观本身构成了区分信号，这种策略就变得适得其反。在天气分析中，例如，雨条、雪颗粒、气氛散射以及反射和光晕并不是噪声：它们携带着重要的信息。在诸如自动驾驶等关键应用中，忽视这些线索是有风险的，因为抓地力和能见度直接依赖于地面条件和大气条件。我们提出了ST-STORM，一个混合自监督学习框架，将外观（风格）视为一种语义模态，与内容进行解耦。我们的架构明确地分离了两个潜在流，通过门控机制进行调节。内容分支旨在通过JEPA方案结合对比目标实现稳定的语义表示，促进对外观变化的无关性。与此同时，风格分支受到约束，通过特征预测和在对抗约束下的重建来捕捉外观特征（纹理、对比度、散射）。我们在多个任务上评估ST-STORM，包括物体分类（ImageNet-1K）、细粒度天气特征描述和黑色素瘤检测（ISIC 2024挑战）。结果表明，风格分支有效地隔离了复杂的外观现象（在Multi-Weather上F1=97%，在ISIC 2024上F1=94%，使用10%的标记数据），而不降低内容分支的语义性能（在ImageNet-1K上F1=80%），并改善了关键外观的保留。

cs.CV / 62 / 2604.16099

DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates

DenTab：用于真实世界牙科估算的表格识别和视觉问答的数据集

Hamdi, Laziz, Tamasna, Amine, Paquet, Thierry

Abstract

Tables condense key transactional and administrative information into compact layouts, but practical extraction requires more than text recognition: systems must also recover structure (rows, columns, merged cells, headers) and interpret roles such as line items, subtotals, and totals under common capture artifacts. Many existing resources for table structure recognition and TableVQA are built from clean digital-born sources or rendered tables, and therefore only partially reflect noisy administrative conditions. We introduce DenTab, a dataset of 2,000 cropped table images from dental estimates with high-quality HTML annotations, enabling evaluation of table recognition (TR) and table visual question answering (TableVQA) on the same inputs. DenTab includes 2,208 questions across eleven categories spanning retrieval, aggregation, and logic/consistency checks. We benchmark 16 systems, including 14 vision-language models (VLMs) and two OCR baselines. Across models, strong structure recovery does not consistently translate into reliable performance on multi-step arithmetic and consistency questions, and these reasoning failures persist even when using ground-truth HTML table inputs. To improve arithmetic reliability without training, we propose the Table Router Pipeline, which routes arithmetic questions to deterministic execution. The pipeline combines (i) a VLM that produces a baseline answer, a structured table representation, and a constrained table program with (ii) a rule-based executor that performs exact computation over the parsed table. The source code and dataset will be made publicly available at https://github.com/hamdilaziz/DenTab.

Chinese Translation

表格将关键的交易和管理信息浓缩成紧凑的布局，但实际提取不仅需要文本识别：系统还必须恢复结构（行、列、合并单元格、标题）并解释角色，如行项目、小计和总计，这些都在常见的捕获伪影下进行。许多现有的表格结构识别和表格视觉问答（TableVQA）资源都是基于干净的数字原生来源或渲染表格，因此只能部分反映嘈杂的管理条件。我们介绍了DenTab，这是一个包含2,000个来自牙科估算的裁剪表格图像的数据集，配有高质量的HTML注释，能够在相同输入上评估表格识别（TR）和表格视觉问答（TableVQA）。DenTab包含跨越检索、聚合和逻辑/一致性检查的11个类别中的2,208个问题。我们对16个系统进行了基准测试，包括14个视觉-语言模型（VLMs）和两个OCR基线。在这些模型中，强大的结构恢复并不总是能转化为在多步算术和一致性问题上的可靠表现，即使使用真实的HTML表格输入，这些推理失败仍然存在。为了在不进行训练的情况下提高算术可靠性，我们提出了表格路由管道（Table Router Pipeline），该管道将算术问题路由到确定性执行。该管道结合了（i）一个生成基线答案、结构化表格表示和受限表格程序的VLM，以及（ii）一个对解析表格进行精确计算的基于规则的执行器。源代码和数据集将公开发布在https://github.com/hamdilaziz/DenTab。
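
Digest note (illustrative sketch): the routing idea behind the Table Router Pipeline, sending arithmetic questions to a deterministic executor and leaving other questions to the VLM's own answer, can be sketched as follows. The program format (`{"op": ..., "column": ...}`), the question-type label, and all function names are hypothetical stand-ins; the abstract does not specify the actual table-program language.

```python
from decimal import Decimal


def execute(program, table):
    """Run a tiny constrained table program with exact decimal arithmetic."""
    values = [Decimal(row[program["column"]]) for row in table]
    if program["op"] == "sum":
        return sum(values, Decimal("0"))
    if program["op"] == "max":
        return max(values)
    raise ValueError(f"unsupported op: {program['op']}")


def route(question_type, vlm_answer, program, table):
    """Prefer exact execution for arithmetic; otherwise keep the VLM answer."""
    if question_type == "arithmetic" and program is not None:
        return str(execute(program, table))
    return vlm_answer


# Toy parsed table, as a VLM might return it alongside its baseline answer.
table = [
    {"item": "crown", "price": "450.50"},
    {"item": "scaling", "price": "80.25"},
]
answer = route("arithmetic", "530.00", {"op": "sum", "column": "price"}, table)
```

Here the executor overrides a plausible but wrong baseline answer ("530.00") with the exact total, which mirrors the abstract's motivation: structure recovery can be correct while free-form arithmetic is not.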

cs.CV / 63 / 2604.16108

Polyglot: Multilingual Style Preserving Speech-Driven Facial Animation

多语种：保留风格的语音驱动面部动画

Nocentini, Federico, Seo, Kwanggyoon, Liu, Qingju, Ferrari, Claudio, Berretti, Stefano, Ferman, David, Kim, Hyeongwoo, Garrido, Pablo, Caliskan, Akin

Abstract

Speech-Driven Facial Animation (SDFA) has gained significant attention due to its applications in movies, video games, and virtual reality. However, most existing models are trained on single-language data, limiting their effectiveness in real-world multilingual scenarios. In this work, we address multilingual SDFA, which is essential for realistic generation since language influences phonetics, rhythm, intonation, and facial expressions. Speaking style is also shaped by individual differences, not only by language. Existing methods typically rely on either language-specific or speaker-specific conditioning, but not both, limiting their ability to model their interaction. We introduce Polyglot, a unified diffusion-based architecture for personalized multilingual SDFA. Our method uses transcript embeddings to encode language information and style embeddings extracted from reference facial sequences to capture individual speaking characteristics. Polyglot does not require predefined language or speaker labels, enabling generalization across languages and speakers through self-supervised learning. By jointly conditioning on language and style, it captures expressive traits such as rhythm, articulation, and habitual facial movements, producing temporally coherent and realistic animations. Experiments show improved performance in both monolingual and multilingual settings, providing a unified framework for modeling language and personal style in SDFA.

Chinese Translation

语音驱动面部动画（SDFA）因其在电影、视频游戏和虚拟现实中的应用而受到广泛关注。然而，大多数现有模型都是在单一语言数据上训练的，这限制了它们在现实多语言场景中的有效性。在本研究中，我们探讨了多语言SDFA，这对于真实感生成至关重要，因为语言会影响语音学、节奏、语调和面部表情。说话风格也受到个体差异的影响，而不仅仅是语言。现有方法通常依赖于特定语言或特定说话者的条件，而不是两者兼顾，这限制了它们建模交互的能力。我们提出了Polyglot，这是一种基于统一扩散的个性化多语言SDFA架构。我们的方法使用转录嵌入来编码语言信息，并从参考面部序列中提取风格嵌入以捕捉个体说话特征。Polyglot不需要预定义的语言或说话者标签，通过自监督学习实现跨语言和说话者的泛化。通过对语言和风格进行联合条件建模，它捕捉到节奏、发音和习惯性面部动作等表现特征，生成时间上连贯且真实的动画。实验表明，在单语言和多语言环境中均表现出改善的性能，为SDFA中语言和个人风格的建模提供了统一框架。

cs.CV / 64 / 2604.16114

Towards In-Context Tone Style Transfer with A Large-Scale Triplet Dataset

基于大规模三元组数据集的上下文音调风格迁移研究

Deng, Yuhai, She, Huimin, Shen, Wei, Li, Meng, Wu, Ruoxi, Yuan, Lunxi, Li, Xiang

Abstract

Tone style transfer for photo retouching aims to adapt the stylistic tone of the reference image to a given content image. However, the lack of high-quality large-scale triplet datasets with stylized ground truth forces existing methods to rely on self-supervised or proxy objectives, which limits model capability. To mitigate this gap, we design a data construction pipeline to build TST100K, a large-scale dataset of 100,000 content-reference-stylized triplets. At the core of this pipeline, we train a tone style scorer to ensure strict stylistic consistency for each triplet. In addition, existing methods typically extract content and reference features independently and then fuse them in a decoder, which may cause semantic loss and lead to inappropriate color transfer and degraded visual aesthetics. Instead, we propose ICTone, a diffusion-based framework that performs tone transfer in an in-context manner by jointly conditioning on both images, leveraging the semantic priors of generative models for semantic-aware transfer. Reward feedback learning using the tone style scorer is further incorporated to improve stylistic fidelity and visual quality. Experiments demonstrate the effectiveness of TST100K, and ICTone achieves state-of-the-art performance on both quantitative metrics and human evaluations.

Chinese Translation

照片修饰中的音调风格迁移旨在将参考图像的风格音调适配到给定的内容图像。然而，高质量大规模三元组数据集的缺乏迫使现有方法依赖自监督或代理目标，这限制了模型的能力。为了解决这一问题，我们设计了一种数据构建管道，构建了 TST100K，一个包含 100,000 个内容-参考-风格化三元组的大规模数据集。在该管道的核心，我们训练了一个音调风格评分器，以确保每个三元组的严格风格一致性。此外，现有方法通常独立提取内容和参考特征，然后在解码器中融合，这可能导致语义损失，从而导致不恰当的颜色迁移和视觉美学的下降。相反，我们提出了 ICTone，一个基于扩散的框架，通过共同条件化两个图像以上下文方式执行音调迁移，利用生成模型的语义先验进行语义感知迁移。进一步引入使用音调风格评分器的奖励反馈学习，以提高风格保真度和视觉质量。实验表明 TST100K 的有效性，ICTone 在定量指标和人类评估中均实现了最先进的性能。

cs.CV / 65 / 2604.16115

From Articles to Canopies: Knowledge-Driven Pseudo-Labelling for Tree Species Classification using LLM Experts

从文章到树冠：基于知识驱动的伪标注用于树种分类，借助大型语言模型专家

Romaszewski, Michał, Kopeć, Dominik, Cholewa, Michał, Kołodziej, Katarzyna, Głomb, Przemysław, Niedzielko, Jan, Charyton, Jakub, Wylazłowska, Justyna, Jarocińska, Anna

Abstract

Hyperspectral tree species classification is challenging due to limited and imbalanced class labels, spectral mixing (overlapping light signatures from multiple species), and ecological heterogeneity (variability among ecological systems). Addressing these challenges requires methods that integrate biological and structural characteristics of vegetation, such as canopy architecture and interspecific interactions, rather than relying solely on spectral signatures. This paper presents a biologically informed, semi-supervised deep learning method that integrates multi-sensor Earth observation data, specifically hyperspectral imaging (HSI) and airborne laser scanning (ALS), with expert, ecological knowledge. The approach relies on biologically inspired pseudo-labelling over a precomputed canopy graph, yielding accurate classification at low training cost. In addition, ecological priors on species cohabitation are automatically derived from reliable sources using large language models (LLMs) and encoded as a cohabitation matrix with likelihoods of species occurring together. These priors are incorporated into the pseudo-labelling strategy, effectively introducing expert knowledge into the model. Experiments on a real-world forest dataset demonstrate 5.6% improvement over the best reference method. Expert evaluation of cohabitation priors reveals high accuracy with differences no larger than 15%.

Chinese Translation

高光谱树种分类面临诸多挑战，包括有限且不平衡的类别标签、光谱混合（来自多个物种的重叠光谱特征）以及生态异质性（生态系统之间的变异性）。解决这些挑战需要整合植被的生物和结构特征的方法，例如树冠结构和种间相互作用，而不仅仅依赖于光谱特征。本文提出了一种生物信息驱动的半监督深度学习方法，该方法整合了多传感器地球观测数据，特别是高光谱成像（HSI）和航空激光扫描（ALS），以及专家的生态知识。该方法依赖于生物启发的伪标注，基于预计算的树冠图，能够以低训练成本实现准确分类。此外，关于物种共生的生态先验知识通过大型语言模型（LLMs）从可靠来源自动推导，并编码为物种共生矩阵，表示物种共同出现的可能性。这些先验知识被纳入伪标注策略，有效地将专家知识引入模型中。在真实世界森林数据集上的实验表明，该方法在最佳参考方法的基础上提高了5.6%的准确率。专家对共生先验的评估显示出高准确性，差异不超过15%。
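
Digest note (illustrative sketch): one way a cohabitation matrix could enter a pseudo-labelling rule is to reweight each crown's classifier probabilities by how likely each species is to occur next to the already-labelled neighbours in the canopy graph. The abstract does not give the actual rule, so the multiplicative weighting, the confidence threshold, and every name and number below are assumptions made purely for illustration.

```python
def pseudo_label(probs, neighbour_labels, cohab, threshold=0.6):
    """Return a pseudo-label for one tree crown, or None if not confident.

    probs: species -> classifier probability for this crown.
    neighbour_labels: species labels of adjacent crowns in the canopy graph.
    cohab: cohab[a][b] is the likelihood of species a and b occurring together.
    """
    scores = {}
    for species, p in probs.items():
        if neighbour_labels:
            prior = sum(cohab[species][n] for n in neighbour_labels) / len(neighbour_labels)
        else:
            prior = 1.0  # no neighbours: fall back to the classifier alone
        scores[species] = p * prior

    total = sum(scores.values())
    if total == 0:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] / total >= threshold else None


# Hypothetical symmetric cohabitation matrix for two species.
cohab = {
    "oak": {"oak": 0.9, "pine": 0.2},
    "pine": {"oak": 0.2, "pine": 0.8},
}
probs = {"oak": 0.5, "pine": 0.5}  # spectrally ambiguous crown
label = pseudo_label(probs, ["oak", "oak"], cohab)
```

In this toy case the spectral classifier alone cannot decide, but the ecological prior tips the decision toward the species that plausibly grows among the neighbouring crowns.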

cs.CV / 66 / 2604.16135

Motion-Adapter: A Diffusion Model Adapter for Text-to-Motion Generation of Compound Actions

运动适配器：用于复合动作文本到运动生成的扩散模型适配器

Jiang, Yue, Yang, Mingyu, Yang, Liuyuxin, Xu, Yang, Yun, Bingxin, Zhang, Yuhe

Abstract

Recent advances in generative motion synthesis have enabled the production of realistic human motions from diverse input modalities. However, synthesizing compound actions from texts, which integrate multiple concurrent actions into coherent full-body sequences, remains a major challenge. We identify two key limitations in current text-to-motion diffusion models: (i) catastrophic neglect, where earlier actions are overwritten by later ones due to improper handling of temporal information, and (ii) attention collapse, which arises from excessive feature fusion in cross-attention mechanisms. As a result, existing approaches often depend on overly detailed textual descriptions (e.g., raising right hand), explicit body-part specifications (e.g., editing the upper body), or the use of large language models (LLMs) for body-part interpretation. These strategies lead to deficient semantic representations of physical structures and kinematic mechanisms, limiting the ability to incorporate natural behaviors such as greeting while walking. To address these issues, we propose the Motion-Adapter, a plug-and-play module that guides text-to-motion diffusion models in generating compound actions by computing decoupled cross-attention maps, which serve as structural masks during the denoising process. Extensive experiments demonstrate that our method consistently produces more faithful and coherent compound motions across diverse textual prompts, surpassing state-of-the-art approaches.

Chinese Translation

最近在生成运动合成方面的进展使得能够从多种输入模态生成逼真的人类动作。然而，从文本合成复合动作，即将多个同时进行的动作整合成连贯的全身序列，仍然是一个主要挑战。我们识别出现有文本到运动扩散模型的两个关键限制：（i）灾难性忽视，早期动作因时间信息处理不当而被后续动作覆盖；（ii）注意力崩溃，源于交叉注意机制中过度的特征融合。因此，现有方法通常依赖于过于详细的文本描述（例如，举起右手）、明确的身体部位规格（例如，编辑上半身）或使用大型语言模型（LLMs）进行身体部位解释。这些策略导致物理结构和运动机制的语义表示不足，限制了自然行为（如走路时打招呼）的融入能力。为了解决这些问题，我们提出了运动适配器（Motion-Adapter），一个即插即用的模块，通过计算解耦的交叉注意图来指导文本到运动扩散模型生成复合动作，这些图在去噪过程中作为结构掩码。大量实验表明，我们的方法在多样的文本提示下始终能够生成更真实和连贯的复合动作，超越了现有的最先进方法。

cs.CV / 67 / 2604.16147

SWNet: A Cross-Spectral Network for Camouflaged Weed Detection

SWNet：一种用于伪装杂草检测的跨光谱网络

Velesaca, Henry O., Miranda, Luigi, Sappa, Angel D.

Abstract

This paper presents SWNet, a bimodal end-to-end cross-spectral network specifically engineered for the detection of camouflaged weeds in dense agricultural environments. Plant camouflage, characterized by homochromatic blending where invasive species mimic the phenotypic traits of primary crops, poses a significant challenge for traditional computer vision systems. To overcome these limitations, SWNet utilizes a Pyramid Vision Transformer v2 backbone to capture long-range dependencies and a Bimodal Gated Fusion Module to dynamically integrate Visible and Near-Infrared information. By leveraging the physiological differences in chlorophyll reflectance captured in the NIR spectrum, the proposed architecture effectively discriminates targets that are otherwise indistinguishable in the visible range. Furthermore, an Edge-Aware Refinement module is employed to produce sharper object boundaries and reduce structural ambiguity. Experimental results on the Weeds-Banana dataset indicate that SWNet outperforms ten state-of-the-art methods. The study demonstrates that the integration of cross-spectral data and boundary-guided refinement is essential for high segmentation accuracy in complex crop canopies. The code is available on GitHub: https://cod-espol.github.io/SWNet/

Chinese Translation

本文提出了SWNet，一种双模态端到端跨光谱网络，专门用于在密集农业环境中检测伪装杂草。植物伪装的特征是同色混合，其中入侵物种模仿主要作物的表型特征，这对传统计算机视觉系统构成了重大挑战。为了解决这些局限性，SWNet利用了金字塔视觉变换器v2（Pyramid Vision Transformer v2）作为主干网络，以捕捉长距离依赖关系，并采用双模态门控融合模块（Bimodal Gated Fusion Module）动态整合可见光和近红外信息。通过利用在近红外光谱中捕获的叶绿素反射率的生理差异，所提出的架构有效区分了在可见光范围内无法区分的目标。此外，采用了边缘感知细化模块（Edge-Aware Refinement）以产生更清晰的物体边界并减少结构模糊。对Weeds-Banana数据集的实验结果表明，SWNet的性能优于十种最先进的方法。研究表明，跨光谱数据的整合和边界引导细化对于复杂作物冠层中的高分割精度至关重要。代码可在GitHub上获取： https://cod-espol.github.io/SWNet/
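
Digest note (illustrative sketch): a gated fusion of visible and near-infrared features is commonly written as a per-element gate g in (0, 1) that blends the two inputs, fused = g · vis + (1 − g) · nir. The sketch below assumes that generic form with scalar gate weights; it is not the paper's Bimodal Gated Fusion Module, whose exact layers are not described in the abstract, and all parameter names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_fusion(visible, nir, w_vis=1.0, w_nir=1.0, bias=0.0):
    """Blend two feature vectors element-wise with a learned-style gate."""
    fused = []
    for v, n in zip(visible, nir):
        gate = sigmoid(w_vis * v + w_nir * n + bias)
        fused.append(gate * v + (1.0 - gate) * n)
    return fused


# With zero gate weights the gate is 0.5 everywhere, i.e. a plain average.
averaged = gated_fusion([1.0, 0.0], [0.0, 1.0], w_vis=0.0, w_nir=0.0)
```

A learned gate lets the network lean on the NIR response where the visible channels are ambiguous (the camouflage case) and on the visible response elsewhere.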

cs.CV / 68 / 2604.16170

neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

neuralCAD-Edit：多模态指令下的3D CAD模型编辑专家基准

Perrett, Toby, Bouchard, Matthew, McCarthy, William

Abstract

We introduce neuralCAD-Edit, the first benchmark for editing 3D CAD models collected from expert CAD engineers. Instead of text conditioning as in prior works, we collect realistic CAD editing requests by capturing videos of professional designers, interacting directly with CAD models in CAD software, while talking, pointing and drawing. We recruited ten consenting designers to contribute to this contained study. We benchmark leading foundation models against human CAD experts carrying out edits, and find a large performance gap in both automatic metrics and human evaluations. Even the best foundation model (GPT 5.2) scores 53% lower (absolute) than CAD experts in human acceptance trials, demonstrating the challenge of neuralCAD-Edit. We hope neuralCAD-Edit will provide a solid foundation against which 3D CAD editing approaches and foundation models can be developed. Code/data: https://autodeskailab.github.io/neuralCAD-Edit

Chinese Translation

我们介绍了neuralCAD-Edit，这是第一个用于编辑由专业CAD工程师收集的3D CAD模型的基准。与之前的研究采用文本条件不同，我们通过捕捉专业设计师在CAD软件中直接与CAD模型互动时的视频，收集了真实的CAD编辑请求，这些设计师在互动过程中会进行交谈、指点和绘图。我们招募了十位同意参与的设计师来贡献于这项有限的研究。我们将领先的基础模型与进行编辑的人类CAD专家进行了基准测试，发现自动评估指标和人类评估之间存在较大的性能差距。即使是表现最好的基础模型（GPT 5.2），在人工接受度测试中也比CAD专家低53%（绝对值），这显示了neuralCAD-Edit所面临的挑战。我们希望neuralCAD-Edit能够为3D CAD编辑方法和基础模型的发展提供坚实的基础。代码/数据：https://autodeskailab.github.io/neuralCAD-Edit

cs.CV / 69 / 2604.16177

Winner of CVPR2026 NTIRE Challenge on Image Shadow Removal: Semantic and Geometric Guidance for Shadow Removal via Cascaded Refinement

CVPR2026 NTIRE挑战赛图像阴影去除的获胜者：通过级联细化的语义和几何指导进行阴影去除

Beltrame, Lorenzo, Salzinger, Jules, Svoboda, Filip, Lampert, Jasmin, Fanta-Jende, Phillipp, Timofte, Radu, Koerner, Marco

Abstract

We present a three-stage progressive shadow-removal pipeline for the CVPR2026 NTIRE WSRD+ challenge. Built on OmniSR, our method treats deshadowing as iterative direct refinement, where later stages correct residual artefacts left by earlier predictions. The model combines RGB appearance with frozen DINOv2 semantic guidance and geometric cues from monocular depth and surface normals, reused across all stages. To stabilise multi-stage optimisation, we introduce a contraction-constrained objective that encourages non-increasing reconstruction error across the cascade. A staged training pipeline transfers from earlier WSRD pretraining to WSRD+ supervision and final WSRD+ 2026 adaptation with cosine-annealed checkpoint ensembling. On the official WSRD+ 2026 hidden test set, our final ensemble achieved 26.680 PSNR, 0.8740 SSIM, 0.0578 LPIPS, and 26.135 FID, ranked first overall, and won the NTIRE 2026 Image Shadow Removal Challenge. The strong performance of the proposed model is further validated on the ISTD+ and UAV-SC+ datasets.

Chinese Translation

我们提出了一种三阶段渐进式阴影去除管道，针对CVPR2026 NTIRE WSRD+挑战。我们的算法基于OmniSR，将去阴影视为迭代直接细化，其中后续阶段修正早期预测留下的残余伪影。该模型结合了RGB外观、冻结的DINOv2语义指导以及来自单目深度和表面法线的几何线索，这些信息在所有阶段中重复使用。为了稳定多阶段优化，我们引入了一种收缩约束目标，鼓励级联过程中的重建误差不增加。分阶段的训练管道从早期的WSRD预训练转移到WSRD+监督，最终进行WSRD+ 2026适应，采用余弦退火检查点集成。在官方的WSRD+ 2026隐藏测试集上，我们的最终集成模型达到了26.680的PSNR、0.8740的SSIM、0.0578的LPIPS和26.135的FID，整体排名第一，并赢得了NTIRE 2026图像阴影去除挑战。所提模型的强大性能在ISTD+和UAV-SC+数据集上得到了进一步验证。
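
Digest note (illustrative sketch): an objective that "encourages non-increasing reconstruction error across the cascade" could be realized as the sum of per-stage errors plus a hinge penalty whenever a later stage is worse than the one before it. The hinge form, the L1 error, and the weight `lam` are assumptions for illustration; the abstract does not state the exact formulation used by the authors.

```python
def l1_error(pred, target):
    """Mean absolute error between a prediction and the target."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def cascade_loss(stage_preds, target, lam=1.0):
    """Sum of stage errors plus a penalty for any stage-to-stage increase."""
    errors = [l1_error(pred, target) for pred in stage_preds]
    reconstruction = sum(errors)
    # Hinge term: zero when errors are non-increasing along the cascade.
    penalty = sum(
        max(0.0, later - earlier) for earlier, later in zip(errors, errors[1:])
    )
    return reconstruction + lam * penalty, errors


# Three-stage toy cascade where the last stage regresses slightly.
target = [0.0, 0.0]
stage_preds = [[0.3, 0.3], [0.2, 0.2], [0.25, 0.25]]
total, errors = cascade_loss(stage_preds, target)
```

The penalty is inactive for a well-behaved cascade, so it only shapes training when a refinement stage undoes the progress of its predecessor.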

cs.CV / 70 / 2604.16200

Saturation-Aware Space-Variant Blind Image Deblurring

饱和度感知的空间变异盲图像去模糊

Alam, Muhammad Z., Stetsiuk, Larry, Zeshan, Arooba

Abstract

This paper presents a novel saturation-aware, space-variant blind image deblurring framework designed to address the challenges posed by saturated pixels when deblurring under high-dynamic-range and low-light conditions. The proposed approach segments the image based on blur intensity and proximity to saturation, leveraging a pre-estimated Light Spread Function to mitigate stray-light effects. By accurately estimating the true radiance of saturated regions using the dark channel prior, our method enhances the deblurring process without introducing artifacts such as ringing. Experimental evaluations on both synthetic and real-world datasets demonstrate that the framework improves deblurring outcomes across various scenarios, showing superior performance compared to state-of-the-art saturation-aware and general-purpose methods. This adaptability highlights the framework's potential for integration with existing and emerging blind image deblurring techniques.

Chinese Translation

本文提出了一种新颖的饱和度感知空间变异盲图像去模糊框架，旨在解决高动态范围和低光照条件下饱和像素带来的去模糊挑战。所提出的方法有效地根据模糊强度和接近饱和度对图像进行分割，利用预估的光扩散函数（Light Spread Function）来减轻杂散光的影响。通过使用暗通道先验（dark channel prior）准确估计饱和区域的真实辐射，我们的方法在去模糊过程中增强了效果，而不会引入诸如振铃等伪影。在合成和真实世界数据集上的实验评估表明，该框架在各种场景中改善了去模糊结果，展现出比最先进的饱和度感知和通用方法更优越的性能。这种适应性突显了该框架与现有及新兴盲图像去模糊技术的潜在整合能力。
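
Digest note (illustrative sketch): the dark channel referenced in the abstract is, in its usual definition, the minimum intensity over the color channels and over a local patch around each pixel. The pure-Python sketch below computes only that quantity on a toy image; how the authors use it to estimate the radiance of saturated regions is not described in the abstract and is not reproduced here. The function name and patch radius are illustrative choices.

```python
def dark_channel(image, radius=1):
    """Per-pixel minimum over RGB channels and a (2*radius+1)^2 neighbourhood.

    image: list of rows, each row a list of (R, G, B) tuples.
    """
    height, width = len(image), len(image[0])
    per_pixel_min = [[min(px) for px in row] for row in image]
    dark = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                per_pixel_min[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            dark[y][x] = min(window)
    return dark


# 2x2 toy image containing one fully saturated (white) pixel.
image = [
    [(200, 180, 90), (255, 255, 255)],
    [(120, 130, 140), (250, 240, 245)],
]
pointwise = dark_channel(image, radius=0)
patchwise = dark_channel(image, radius=1)
```

Saturated pixels stand out because their own channel minimum stays at the clipping value even though the surrounding patch is darker, which is one reason the prior is informative near saturation.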

cs.CV / 71 / 2604.16207

AIFIND: Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection

AIFIND：面向增量人脸伪造检测的伪影感知细粒度对齐解释

Wang, Hao, Zhang, Beichen, Gong, Yanpei, Fang, Shaoyi, Qi, Zhaobo, Xu, Yuanrong, Liu, Xinyan, Zhang, Weigang

Abstract

As new forgery types continue to emerge, Incremental Face Forgery Detection (IFFD) has become a crucial paradigm. However, existing methods typically rely on data replay or coarse binary supervision, which fails to explicitly constrain the feature space, leading to severe feature drift and catastrophic forgetting. To address this, we propose AIFIND, Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection, which leverages semantic anchors to stabilize incremental learning. We design the Artifact-Driven Semantic Prior Generator to instantiate invariant semantic anchors, establishing a fixed coordinate system from low-level artifact cues. These anchors are injected into the image encoder via Artifact-Probe Attention, which explicitly constrains volatile visual features to align with stable semantic anchors. An Adaptive Decision Harmonizer then harmonizes the classifiers by preserving the angular relationships of the semantic anchors, maintaining geometric consistency across tasks. Extensive experiments on multiple incremental protocols validate the superiority of AIFIND.

Chinese Translation

随着伪造类型的不断出现，增量人脸伪造检测（IFFD）已成为一个关键范式。然而，现有方法通常依赖于数据重放或粗略的二元监督，这无法明确约束特征空间，导致严重的特征漂移和灾难性遗忘。为了解决这个问题，我们提出了AIFIND（Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection），该方法利用语义锚点来稳定增量学习。我们设计了伪影驱动的语义先验生成器，以实例化不变的语义锚点，从低级伪影线索建立固定坐标系统。这些锚点通过伪影探测注意力（Artifact-Probe Attention）注入到图像编码器中，明确约束易变的视觉特征与稳定的语义锚点对齐。自适应决策协调器（Adaptive Decision Harmonizer）通过保持语义锚点的角度关系来协调分类器，维护任务间的几何一致性。在多个增量协议上的广泛实验验证了AIFIND的优越性。

cs.CV / 72 / 2604.16214

GAViD: A Large-Scale Multimodal Dataset for Context-Aware Group Affect Recognition from Videos

GAViD：用于从视频中进行上下文感知的群体情感识别的大规模多模态数据集

Kumar, Deepak, Singh, Abhishek Pratap, Kumar, Puneet, Li, Xiaobai, Raman, Balasubramanian

Abstract

Understanding affective dynamics in real-world social systems is fundamental to modeling and analyzing human-human interactions in complex environments. Group affect emerges from intertwined human-human interactions, contextual influences, and behavioral cues, making its quantitative modeling a challenging computational social systems problem. However, computational modeling of group affect in in-the-wild scenarios remains challenging due to limited large-scale annotated datasets and the inherent complexity of multimodal social interactions shaped by contextual and behavioral variability. The lack of comprehensive datasets annotated with multimodal and contextual information further limits advances in the field. To address this, we introduce the Group Affect from ViDeos (GAViD) dataset, comprising 5091 video clips with multimodal data (video, audio and context), annotated with ternary valence and discrete emotion labels and enriched with VideoGPT-generated contextual metadata and human-annotated action cues. We also present Context-Aware Group Affect Recognition Network (CAGNet) for multimodal context-aware group affect recognition. CAGNet achieves 63.20% test accuracy on GAViD, comparable to state-of-the-art performance. The dataset and code are available at github.com/deepakkumar-iitr/GAViD.

Chinese Translation

理解现实世界社会系统中的情感动态对于建模和分析复杂环境中的人际互动至关重要。群体情感源于交织的人际互动、上下文影响和行为线索，使其定量建模成为一个具有挑战性的计算社会系统问题。然而，由于缺乏大规模标注数据集以及由上下文和行为变异性塑造的多模态社会互动的固有复杂性，在实际场景中对群体情感的计算建模仍然面临挑战。缺乏全面的多模态和上下文信息标注数据集进一步限制了该领域的进展。为了解决这一问题，我们引入了群体情感视频数据集（GAViD），该数据集包含5091个视频片段，具有多模态数据（视频、音频和上下文），并标注了三元效价和离散情感标签，同时丰富了VideoGPT生成的上下文元数据和人工标注的行为线索。我们还提出了上下文感知群体情感识别网络（CAGNet），用于多模态上下文感知的群体情感识别。CAGNet在GAViD上的测试准确率达到63.20%，与最先进的性能相当。数据集和代码可在github.com/deepakkumar-iitr/GAViD获取。

cs.CV / 73 / 2604.16231

Dental Panoramic Radiograph Analysis Using YOLO26 From Tooth Detection to Disease Diagnosis

基于YOLOv26的牙科全景放射影像分析：从牙齿检测到疾病诊断

Asif, Khawaja Azfar, Khan, Rafaqat Alam

Abstract

Panoramic radiography is a fundamental diagnostic tool in dentistry, offering a comprehensive view of the entire dentition with minimal radiation exposure. However, manual interpretation is time-consuming and prone to errors, especially in high-volume clinical settings. This creates a pressing need for efficient automated solutions. This study presents the first application of YOLOv26 for automated tooth detection, FDI-based numbering, and dental disease segmentation in panoramic radiographs. The DENTEX dataset was preprocessed using Roboflow for format conversion and augmentation, yielding 1,082 images for tooth enumeration and 1,040 images for disease segmentation across four pathology classes. Five YOLOv26-seg variants were trained on Google Colab using transfer learning at a resolution of 800x800. Results demonstrate that the YOLOv26m-seg model achieved the best performance for tooth enumeration, with a precision of 0.976, recall of 0.970, and box mAP50 of 0.976. It outperformed the YOLOv8x baseline by 4.9% in precision and 3.3% in mAP50, while also enabling high-quality mask-level segmentation (mask mAP50 = 0.970). For disease segmentation, the YOLOv26l-seg model attained a box mAP50 of 0.591 and a mask mAP50 of 0.547. Impacted teeth showed the highest per-class average precision (0.943), indicating that visual distinctiveness influences detection performance more than annotation quantity. Overall, these findings demonstrate that YOLOv26-based models offer a robust and accurate framework for automated dental image analysis, with strong potential to enhance diagnostic efficiency and consistency in clinical practice.

Chinese Translation

全景放射摄影是牙科诊断中的基本工具，能够在最小辐射暴露下提供整个牙列的全面视图。然而，手动解读耗时且容易出错，尤其是在高负荷的临床环境中。这就迫切需要高效的自动化解决方案。本研究首次应用YOLOv26进行全景放射影像中的自动牙齿检测、基于FDI的编号和牙科疾病分割。使用Roboflow对DENTEX数据集进行了预处理，以便格式转换和增强，生成了1,082张用于牙齿编号的图像和1,040张用于四种病理类别的疾病分割图像。五个YOLOv26-seg变体在Google Colab上使用迁移学习以800x800的分辨率进行训练。结果表明，YOLOv26m-seg模型在牙齿编号方面表现最佳，精确度为0.976，召回率为0.970，边界框mAP50为0.976。其在精确度上比YOLOv8x基线提高了4.9%，在mAP50上提高了3.3%，同时也实现了高质量的掩膜级分割（掩膜mAP50 = 0.970）。在疾病分割方面，YOLOv26l-seg模型的边界框mAP50为0.591，掩膜mAP50为0.547。阻生牙显示出最高的每类平均精确度（0.943），表明视觉特征的明显性对检测性能的影响大于标注数量。总体而言，这些发现表明基于YOLOv26的模型为自动牙科影像分析提供了一个稳健且准确的框架，具有增强临床实践中诊断效率和一致性的强大潜力。

cs.CV / 74 / 2604.16234

A Two-Stage, Object-Centric Deep Learning Framework for Robust Exam Cheating Detection

一种两阶段、以对象为中心的深度学习框架用于稳健的考试作弊检测

Le, Van-Truong, Nguyen, Le-Khanh, Nguyen, Trong-Doanh

Abstract

Academic integrity continues to face the persistent challenge of examination cheating. Traditional invigilation relies on human observation, which is inefficient, costly, and prone to errors at scale. Although some existing AI-powered monitoring systems have been deployed and trusted, many lack transparency or require multi-layered architectures to achieve the desired performance. To overcome these challenges, we propose an improvement over a simple two-stage framework for exam cheating detection that integrates object detection and behavioral analysis using well-known technologies. First, the state-of-the-art YOLOv8n model is used to localize students in exam-room images. Each detected region is cropped and preprocessed, then classified by a fine-tuned RexNet-150 model as either normal or cheating behavior. The system is trained on a dataset compiled from 10 independent sources with a total of 273,897 samples, achieving 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1-score, a 13-percentage-point increase over a baseline accuracy of 0.82 in video-based cheating detection. In addition, with an average inference time of 13.9 ms per sample, the proposed approach demonstrates robustness and scalability for deployment in large-scale environments. Beyond the technical contribution, the AI-assisted monitoring system also addresses ethical concerns by ensuring that final outcomes are delivered privately to individual students after the examination, for example, via personal email. This prevents public exposure or shaming and offers students an opportunity to reflect on their behavior. For further improvement, it is possible to incorporate additional factors, such as audio data and consecutive frames, to achieve greater accuracy. This study provides a foundation for developing real-time, scalable, ethical, and open-source solutions.

Chinese Translation

学术诚信面临着考试作弊的持续挑战。传统的监考依赖于人工观察，这种方式效率低下、成本高昂，并且在大规模应用中容易出错。尽管一些现有的人工智能监控系统已被部署并受到信任，但许多系统缺乏透明度或需要多层架构才能实现所需的性能。为了解决这些挑战，我们提出了一种改进的简单两阶段考试作弊检测框架，该框架结合了对象检测和行为分析，采用了知名技术。首先，使用最先进的YOLOv8n模型在考场图像中定位学生。每个检测到的区域被裁剪和预处理，然后由微调的RexNet-150模型分类为正常行为或作弊行为。该系统在一个由10个独立来源编制的数据集上进行训练，总共包含273,897个样本，达到了0.95的准确率、0.94的召回率、0.96的精确率和0.95的F1-score，比基于视频的作弊检测的基线准确率0.82提高了13个百分点。此外，平均每个样本的推理时间为13.9毫秒，所提方法在大规模环境中的部署展示了稳健性和可扩展性。除了技术贡献外，该AI辅助监控系统还通过确保最终结果在考试后私密地传递给每个学生（例如，通过个人电子邮件）来解决伦理问题。这防止了公众曝光或羞辱，并为学生提供了反思其行为的机会。为了进一步改进，可以考虑加入额外因素，例如音频数据和连续帧，以实现更高的准确性。本研究为开发实时、可扩展、伦理和开源的解决方案提供了基础。
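
The detect, crop, and classify flow described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `detect_students` and `classify_crop` are hypothetical callables standing in for the YOLOv8n detector and the fine-tuned RexNet-150 classifier, and the toy image, boxes, and brightness rule exist only to exercise the control flow.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixel coordinates
Image = List[List[int]]           # toy grayscale image: rows of pixel values


def crop(image: Image, box: Box) -> Image:
    """Cut a box out of the image, clamping the box to the image bounds."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(width, x2), min(height, y2)
    return [row[x1:x2] for row in image[y1:y2]]


def two_stage_detect(
    image: Image,
    detect_students: Callable[[Image], List[Box]],
    classify_crop: Callable[[Image], str],
) -> List[Tuple[Box, str]]:
    """Stage 1 localizes students; stage 2 labels each cropped region."""
    results = []
    for box in detect_students(image):
        region = crop(image, box)
        if not region or not region[0]:
            continue  # the box lies outside the image, nothing to classify
        results.append((box, classify_crop(region)))
    return results


# Hypothetical stand-ins for the two models, used only to exercise the flow.
def toy_detector(image: Image) -> List[Box]:
    return [(0, 0, 2, 2), (2, 0, 4, 2)]


def toy_classifier(region: Image) -> str:
    pixels = [value for row in region for value in row]
    return "cheating" if sum(pixels) / len(pixels) > 128 else "normal"


toy_image = [[0, 0, 255, 255], [0, 0, 255, 255]]
labels = two_stage_detect(toy_image, toy_detector, toy_classifier)
```

Passing the two models in as callables keeps the pipeline logic testable without model weights; a real deployment would replace the toy functions with actual detector and classifier inference.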

cs.CV / 75 / 2604.16240

CollideNet: Hierarchical Multi-scale Video Representation Learning with Disentanglement for Time-To-Collision Forecasting

CollideNet：用于碰撞时间预测的层次多尺度视频表示学习与解耦

Desai, Nishq Poorav, Etemad, Ali, Greenspan, Michael

Abstract

Time-to-Collision (TTC) forecasting is a critical task in collision prevention, requiring precise temporal prediction and an understanding of both the local and global patterns in a video, spatially and temporally. To address the multi-scale nature of video, we introduce a novel spatiotemporal hierarchical transformer-based architecture called CollideNet, tailored for effective TTC forecasting. In the spatial stream, CollideNet aggregates information for each video frame simultaneously at multiple resolutions. In the temporal stream, along with multi-scale feature encoding, CollideNet also disentangles the non-stationarity, trend, and seasonality components. Our method outperforms prior work on three commonly used public datasets, setting a new state of the art by a considerable margin. We conduct cross-dataset evaluations to analyze the generalization capabilities of our method, and visualize the effects of disentanglement of the trend and seasonality components of the video data. We release our code at https://github.com/DeSinister/CollideNet/.

Chinese Translation

碰撞时间预测（Time-to-Collision, TTC）是碰撞预防中的一项关键任务，要求精确的时间预测以及对视频中空间和时间上封装的局部和全局模式的理解。为了解决视频的多尺度特性，我们提出了一种新颖的基于时空层次变换器的架构，称为CollideNet，专门用于有效的TTC预测。在空间流中，CollideNet在多个分辨率下同时聚合每个视频帧的信息。在时间流中，除了多尺度特征编码外，CollideNet还解耦了非平稳性、趋势和季节性成分。与之前的工作相比，我们的方法在三个常用公共数据集上实现了最先进的性能，并以相当大的优势设定了新的最先进水平。我们进行了跨数据集评估，以分析我们方法的泛化能力，并可视化视频数据中趋势和季节性成分解耦的效果。我们的代码已发布在 https://github.com/DeSinister/CollideNet/。
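
The abstract does not say how the trend and seasonality components are separated. As a point of reference only, the sketch below shows the classical moving-average decomposition often used in time-series forecasting: a smoothed trend plus a remainder that holds seasonal and residual variation. The function names, the window size, and the edge-padding choice are assumptions for illustration, not CollideNet's mechanism.

```python
from typing import List, Sequence, Tuple


def moving_average(series: Sequence[float], window: int) -> List[float]:
    """Centered moving average; edges are padded by repeating the end values."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(series))]


def decompose(series: Sequence[float], window: int = 5) -> Tuple[List[float], List[float]]:
    """Split a series into a smooth trend and the remainder around it."""
    trend = moving_average(series, window)
    # The remainder carries seasonal and residual variation together.
    remainder = [x - t for x, t in zip(series, trend)]
    return trend, remainder


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0]
trend, remainder = decompose(signal, window=3)
```

By construction the two parts sum back to the input, so nothing is lost in the split; a learned model can then process each part with a separate branch.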

cs.CV / 76 / 2604.16243

Find, Fix, Reason: Context Repair for Video Reasoning

发现、修复、推理：视频推理的上下文修复

Huang, Haojian, Qin, Chuanyu, Li, Yinchuan, Chen, Yingcong

Abstract

Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines either rely on on-policy self-exploration, which plateaus at the model's knowledge boundary, or hybrid replay that mixes policies and demands careful regularization. Dynamic context methods zoom into focused evidence but often require curated pretraining and two-stage tuning, and their context remains bounded by a small model's capability. In contrast, larger models excel at instruction following and multi-modal understanding, can supply richer context to smaller models, and rapidly zoom in on target regions via simple tools. Building on this capability, we introduce an observation-level intervention: a frozen, tool-integrated teacher identifies the missing spatiotemporal dependency and provides a minimal evidence patch (e.g., timestamps or regions) from the original video while the question remains unchanged. The student answers again with the added context, and training updates with a chosen-rollout scheme integrated into Group Relative Policy Optimization (GRPO). We further propose a Robust Improvement Reward (RIR) that aligns optimization with two goals: outcome validity through correct answers and dependency alignment through rationales that reflect the cited evidence. Advantages are group-normalized across the batch, preserving on-policy exploration while directing it along causally meaningful directions with minimal changes to the training stack. Experiments on various related benchmarks show consistent accuracy gains and strong generalization. Web page and source code will be available at https://github.com/JethroJames/FFR.git.

Chinese Translation

强化学习在大型多模态模型中推动了视频推理的发展，但主流流程要么依赖于在策略内自我探索，这在模型的知识边界处停滞，要么是混合重放，混合多种策略并要求仔细的正则化。动态上下文方法聚焦于特定证据，但通常需要经过精心策划的预训练和两阶段调优，其上下文仍然受到小模型能力的限制。相比之下，更大的模型在遵循指令和多模态理解方面表现出色，能够为较小的模型提供更丰富的上下文，并通过简单工具快速聚焦于目标区域。基于这一能力，我们引入了一种观察级别的干预：一个冻结的、集成工具的教师识别缺失的时空依赖，并从原始视频中提供最小的证据补丁（例如，时间戳、区域等），而问题保持不变。学生在添加的上下文下再次回答，并通过集成到群体相对策略优化（Group Relative Policy Optimization, GRPO）中的选定回放方案进行训练更新。我们进一步提出了一种稳健改进奖励（Robust Improvement Reward, RIR），使优化与两个目标对齐：通过正确答案实现结果有效性，通过反映引用证据的推理实现依赖对齐。优势在批次间进行组归一化，保留了策略内探索，同时以最小的训练堆栈变更引导其沿因果意义明确的方向发展。在各种相关基准上的实验显示出一致的准确性提升和强大的泛化能力。网页和源代码将发布在 https://github.com/JethroJames/FFR.git。
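
The group normalization of advantages mentioned in the abstract can be sketched as below, following the usual GRPO recipe in which each rollout's reward is centered and scaled by the statistics of its own group. The reward values and the `eps` stabilizer are illustrative assumptions; the paper's Robust Improvement Reward itself is not reproduced here.

```python
import statistics
from typing import List, Sequence


def group_normalized_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Center and scale the rewards of one rollout group (same prompt)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps keeps the division finite when all rollouts earn the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four rollouts answering the same question.
advantages = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that beat their group's average get positive advantages and the rest get negative ones, so no separate value network is needed to provide a baseline.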

cs.CV / 77 / 2604.16248

Where Do Vision-Language Models Fail? World Scale Analysis for Image Geolocalization

视觉-语言模型的失败之处？图像地理定位的全球规模分析

Bharadwaj, Siddhant, Vashist, Ashish, Aleem, Fahimul, Vyas, Shruti

Abstract

Image geolocalization has traditionally been addressed through retrieval-based place recognition or geometry-based visual localization pipelines. Recent advances in Vision-Language Models (VLMs) have demonstrated strong zero-shot reasoning capabilities across multimodal tasks, yet their performance in geographic inference remains underexplored. In this work, we present a systematic evaluation of multiple state-of-the-art VLMs for country-level image geolocalization using ground-view imagery only. Instead of relying on image matching, GPS metadata, or task-specific training, we evaluate prompt-based country prediction in a zero-shot setting. The selected models are tested on three geographically diverse datasets to assess their robustness and generalization ability. Our results reveal substantial variation across models, highlighting the potential of semantic reasoning for coarse geolocalization and the limitations of current VLMs in capturing fine-grained geographic cues. This study provides the first focused comparison of modern VLMs for country-level geolocalization and establishes a foundation for future research at the intersection of multimodal reasoning and geographic understanding.

Chinese Translation

图像地理定位传统上通过基于检索的地点识别或基于几何的视觉定位流程来解决。近年来，视觉-语言模型（VLMs）的进展展示了其在多模态任务中的强大零样本推理能力，但它们在地理推断中的表现仍然未被充分探讨。在本研究中，我们对多种最先进的VLMs进行了系统评估，专注于仅使用地面视图图像进行国家级图像地理定位。我们不依赖于图像匹配、GPS元数据或特定任务的训练，而是在零样本环境中评估基于提示的国家预测。所选模型在三个地理多样的数据集上进行测试，以评估其鲁棒性和泛化能力。我们的结果揭示了模型之间的显著差异，突显了语义推理在粗略地理定位中的潜力以及当前VLMs在捕捉细粒度地理线索方面的局限性。本研究首次对现代VLMs在国家级地理定位方面进行了集中比较，并为未来在多模态推理与地理理解交叉领域的研究奠定了基础。

cs.CV / 78 / 2604.16256

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

视觉-语言模型是否真正执行视觉推理？对模态差距的严格研究

Xu, Yige, Wang, Yongjie, Wu, Zizhuo, Song, Kaisong, Lin, Jun, Shen, Zhiqi

Abstract

Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. Specifically, we construct each problem in text-only, image-only, and image+text formats guaranteeing identical task-relevant information, verified by human annotators. This rigorous alignment effectively isolates modality-specific reasoning differences while eliminating confounding factors such as information mismatch. Extensive evaluation of state-of-the-art VLMs reveals a consistent phenomenon: a substantial performance gap between textual and visual reasoning. Notably, VLMs excel with text-only inputs, whereas incorporating visual data (image+text) frequently degrades performance compared to the text-only baseline. These findings indicate that current VLMs conduct reasoning primarily in the textual space, with limited genuine reliance on visual evidence. To mitigate this limitation, we curate a CrossMath training set for VLM fine-tuning. Empirical evaluations demonstrate that fine-tuning on this training set significantly boosts reasoning performance across all individual and joint modalities, while yielding robust gains on two general visual reasoning tasks. Source code is available at https://github.com/xuyige/CrossMath.

Chinese Translation

视觉-语言模型（VLMs）中的推理最近引起了广泛关注，因为它在多种下游任务中具有广泛的适用性。然而，目前尚不清楚VLMs的优越性能是否源于真正的视觉基础推理，还是主要依赖于其文本骨干的推理能力。为了系统地测量这一点，我们引入了CrossMath，一个新颖的多模态推理基准，旨在进行受控的跨模态比较。具体而言，我们以纯文本、纯图像和图像+文本三种格式构建每个问题，确保任务相关信息完全相同，并由人工标注者验证。这种严格的对齐有效地隔离了模态特定的推理差异，同时消除了信息不匹配等混杂因素。对最先进的VLMs进行的广泛评估揭示了一个一致的现象：文本推理和视觉推理之间存在显著的性能差距。值得注意的是，VLMs在仅使用文本输入时表现优异，而在加入视觉数据（图像+文本）时，性能往往低于纯文本基线。这些发现表明，当前的VLMs主要在文本空间中进行推理，对视觉证据的真实依赖有限。为了缓解这一局限性，我们为VLM微调策划了一个CrossMath训练集。实证评估表明，在该训练集上进行微调显著提升了所有单独和联合模态的推理性能，同时在两个通用视觉推理任务上取得了稳健的提升。源代码可在 https://github.com/xuyige/CrossMath 获取。
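
Because every problem is posed in matched formats, the modality gap reduces to a difference of per-format accuracies. The sketch below shows one way to compute it; the record layout, the format names, and the toy outcomes are hypothetical and are not drawn from the benchmark.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_modality(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """records holds (modality, is_correct) pairs, one per evaluated problem."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for modality, is_correct in records:
        totals[modality] += 1
        hits[modality] += int(is_correct)
    return {modality: hits[modality] / totals[modality] for modality in totals}


def modality_gap(accuracy: Dict[str, float], reference: str = "text", other: str = "image") -> float:
    """Positive values mean the reference format is answered more accurately."""
    return accuracy[reference] - accuracy[other]


# Toy outcomes for four problems posed in two formats (illustrative only).
toy_records = [
    ("text", True), ("text", True), ("text", True), ("text", False),
    ("image", True), ("image", False), ("image", False), ("image", False),
]
toy_accuracy = accuracy_by_modality(toy_records)
toy_gap = modality_gap(toy_accuracy)
```

Since the task-relevant information is identical across formats, a non-zero gap can be attributed to how the model handles the modality rather than to missing information.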

cs.CV / 79 / 2604.16264

Information Router for Mitigating Modality Dominance in Vision-Language Models

信息路由器：缓解视觉语言模型中的模态主导性

Kim, Seulgi, Prabhushankar, Mohit, AlRegib, Ghassan

Abstract

Vision-language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model's attention allocation, implicitly assuming that all modalities provide sufficient information. However, attention only determines where the model focuses, and cannot enrich information that is missing or ambiguous. In the real world, input modalities often differ in information density and signal-to-noise ratio. In such cases, simply adjusting the model's attention does not resolve the underlying lack of information. In this paper, we propose MoIR (Multi-modal Information Router), an information-level fusion method that explicitly reduces information disparity prior to fusion. MoIR identifies less informative tokens and routes complementary information from a stronger modality, constructing information-dense token representations before they are processed by a large language model. By modifying information availability, MoIR enables reliable shifts in modality dominance, even when one modality is degraded. We evaluate MoIR on three widely used multi-modal benchmarks across multiple model backbones. Experimental results show that MoIR consistently yields a more balanced modality contribution and improves robustness and downstream performance, even under modality degradation. These findings demonstrate that explicitly modifying cross-modal information is an effective and complementary strategy for mitigating modality dominance in multi-modal reasoning models.

Chinese Translation

视觉语言模型（VLMs）在广泛的基准测试中表现出色，但它们往往受到模态主导性的影响，即预测过度依赖单一模态。之前的方法主要通过引导模型的注意力分配来解决这一问题，隐含地假设所有模态提供足够的信息。然而，注意力仅决定模型的关注点，无法丰富缺失或模糊的信息。在现实世界中，输入模态的信息密度和信噪比往往存在差异。在这种情况下，仅仅调整模型的注意力并不能解决信息缺失的根本问题。本文提出了MoIR（多模态信息路由器），这是一种信息级融合方法，明确减少融合前的信息差异。MoIR识别出信息量较少的标记，并从更强的模态中路由互补信息，在被大型语言模型处理之前构建信息密集的标记表示。通过修改信息的可用性，MoIR使得模态主导性能够可靠地转变，即使在某一模态退化的情况下。我们在三个广泛使用的多模态基准测试上评估了MoIR，涵盖多个模型骨干。实验结果表明，MoIR始终展现出更平衡的模态贡献，并提高了鲁棒性和下游性能，特别是在模态退化的情况下。这些发现表明，明确修改跨模态信息是一种有效且互补的策略，可以缓解多模态推理模型中的模态主导性。

cs.CV / 80 / 2604.16266

Hero-Mamba: Mamba-based Dual Domain Learning for Underwater Image Enhancement

Hero-Mamba：基于Mamba的双域学习用于水下图像增强

Pokuri, Tejeswar, Rai, Shivarth

Abstract

Underwater images often suffer from severe degradation, such as color distortion, low contrast, and blurred details, due to light absorption and scattering in water. While learning-based methods like CNNs and Transformers have shown promise, they face critical limitations: CNNs struggle to model the long-range dependencies needed for non-uniform degradation, and Transformers incur quadratic computational complexity, making them inefficient for high-resolution images. To address these challenges, we propose Hero-Mamba, a novel Mamba-based network that achieves efficient dual-domain learning for underwater image enhancement. Our approach uniquely processes information from both the spatial domain (RGB image) and the spectral domain (FFT components) in parallel. This dual-domain input allows the network to decouple degradation factors, separating color/brightness information from texture/noise. The core of our network utilizes Mamba-based SS2D blocks to capture global receptive fields and long-range dependencies with linear complexity, overcoming the limitations of both CNNs and Transformers. Furthermore, we introduce a ColorFusion block, guided by a background light prior, to restore color information with high fidelity. Extensive experiments on the LSUI and UIEB benchmark datasets demonstrate that Hero-Mamba outperforms state-of-the-art methods. Notably, our model achieves a PSNR of 25.802 and an SSIM of 0.913 on LSUI, validating its superior performance and generalization capabilities.

Chinese Translation

水下图像常常受到严重退化的影响，例如由于水中的光吸收和散射导致的颜色失真、低对比度和模糊细节。尽管基于学习的方法如卷积神经网络（CNNs）和变换器（Transformers）显示出良好的前景，但它们面临着关键的局限性：CNN在建模非均匀退化所需的长距离依赖方面存在困难，而变换器则会导致二次计算复杂度，使其在处理高分辨率图像时效率低下。为了解决这些挑战，我们提出了Hero-Mamba，一种新颖的基于Mamba的网络，实现了水下图像增强的高效双域学习。我们的方法独特地并行处理来自空间域（RGB图像）和频谱域（FFT组件）的信息。这种双域输入使网络能够解耦退化因素，将颜色/亮度信息与纹理/噪声分离。我们网络的核心利用基于Mamba的SS2D模块，以线性复杂度捕获全局感受野和长距离依赖，克服了CNN和变换器的局限性。此外，我们引入了一个ColorFusion模块，在背景光先验的指导下，以高保真度恢复颜色信息。在LSUI和UIEB基准数据集上的大量实验表明，Hero-Mamba的表现优于最先进的方法。值得注意的是，我们的模型在LSUI上达到了25.802的峰值信噪比（PSNR）和0.913的结构相似性指数（SSIM），验证了其卓越的性能和泛化能力。
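
To make the spectral-domain input concrete, the sketch below computes per-channel FFT components of an RGB image with NumPy and checks that the image can be recovered from them. The abstract only says "FFT components", so splitting the spectrum into amplitude and phase is an assumption made here for illustration, and the random array is a stand-in for a real image.

```python
import numpy as np


def spectral_components(image: np.ndarray):
    """Return per-channel amplitude and phase of an (H, W, C) image."""
    spectrum = np.fft.fft2(image, axes=(0, 1))  # 2D FFT over height and width
    return np.abs(spectrum), np.angle(spectrum)


def reconstruct(amplitude: np.ndarray, phase: np.ndarray) -> np.ndarray:
    """Invert the decomposition; the imaginary part is numerical noise."""
    spectrum = amplitude * np.exp(1j * phase)
    return np.fft.ifft2(spectrum, axes=(0, 1)).real


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))  # stand-in for an RGB image with values in [0, 1)
amplitude, phase = spectral_components(image)
restored = reconstruct(amplitude, phase)
```

The zero-frequency amplitude of each channel equals the sum of that channel's pixel values, which is one way the spectral view exposes global brightness and color separately from fine texture.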

cs.CV / 81 / 2604.16272

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

VEFX-Bench：通用视频编辑和视觉效果的整体基准

Gao, Xiangbo, Jiang, Sicong, Liu, Bangya, Chen, Xinghao, Yang, Minglai, Yang, Siyuan, Wu, Mingyang, Yu, Jiongze, Zheng, Qi, Wang, Haozhi, Zhang, Jiayi, Yang, Jared, Yang, Jie, Wang, Zihan, Yin, Qing, Tu, Zhengzhong

Abstract

As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation. Using VEFX-Reward as an evaluator, we benchmark representative commercial and open-source video editing systems, revealing a persistent gap between visual plausibility, instruction following, and edit locality in current models.

Chinese Translation

随着人工智能辅助视频创作的日益实用，基于指令的视频编辑已成为完善生成或捕获素材以满足专业要求的关键。然而，该领域仍然缺乏一个大规模的人类标注数据集，包含完整的编辑示例，以及一个标准化的评估工具用于比较编辑系统。现有资源受到规模小、缺少编辑输出或缺乏人类质量标签的限制，而当前的评估往往依赖于昂贵的人工检查或不专门针对编辑质量的通用视觉-语言模型评判。我们引入了VEFX-Dataset，这是一个人类标注的数据集，包含9个主要编辑类别和32个子类别中的5,049个视频编辑示例，每个示例沿着三个解耦维度进行标注：指令遵循、渲染质量和编辑独特性。在VEFX-Dataset的基础上，我们提出了VEFX-Reward，这是一个专门为视频编辑质量评估设计的奖励模型。VEFX-Reward共同处理源视频、编辑指令和编辑后的视频，并通过序数回归预测每个维度的质量评分。我们进一步发布了VEFX-Bench，这是一个包含300对精心策划的视频-提示对的基准，用于标准化比较编辑系统。实验表明，VEFX-Reward与人类判断的对齐程度明显高于通用视觉-语言模型评判者和之前的奖励模型，在标准的图像质量评估/视频质量评估指标和组偏好评估中均表现优异。使用VEFX-Reward作为评估工具，我们对代表性的商业和开源视频编辑系统进行了基准测试，揭示了当前模型在视觉可信度、指令遵循和编辑局部性方面的持续差距。
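
The abstract states that per-dimension scores are predicted via ordinal regression but gives no formulation. One common choice, shown here purely as an assumption, is a cumulative-link model: a scalar score is compared against ordered thresholds to produce a probability for each quality level. The threshold values and function names are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def ordinal_probabilities(score: float, thresholds: Sequence[float]) -> List[float]:
    """Probabilities of len(thresholds) + 1 ordered levels for one score."""
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be sorted in ascending order")
    # exceed[k] is the probability that the level is above threshold k.
    exceed = [sigmoid(score - t) for t in thresholds]
    bounds = [1.0] + exceed + [0.0]
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]


def expected_level(probabilities: Sequence[float]) -> float:
    """Collapse the level distribution into a single continuous rating."""
    return sum(level * p for level, p in enumerate(probabilities))


# Hypothetical thresholds separating four quality levels (0 to 3).
levels = ordinal_probabilities(0.3, thresholds=[-1.0, 0.0, 1.0])
```

Unlike plain classification, this keeps the ordering of quality levels: raising the score moves probability mass monotonically toward higher levels.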

cs.CV / 82 / 2604.16284

Enhancing Hazy Wildlife Imagery: AnimalHaze3k and IncepDehazeGan

增强有雾野生动物图像：AnimalHaze3k 和 IncepDehazeGan

Rai, Shivarth, Pokuri, Tejeswar

Abstract

Atmospheric haze significantly degrades wildlife imagery, impeding computer vision applications critical for conservation, such as animal detection, tracking, and behavior analysis. To address this challenge, we introduce AnimalHaze3k, a synthetic dataset comprising 3,477 hazy images generated from 1,159 clear wildlife photographs through a physics-based pipeline. Our novel IncepDehazeGan architecture combines inception blocks with residual skip connections in a GAN framework, achieving state-of-the-art performance (SSIM: 0.8914, PSNR: 20.54, and LPIPS: 0.1104), delivering 6.27% higher SSIM and 10.2% better PSNR than competing approaches. When applied to downstream detection tasks, dehazed images improved YOLOv11 detection mAP by 112% and IoU by 67%. These advances can provide ecologists with reliable tools for population monitoring and surveillance in challenging environmental conditions, demonstrating significant potential for enhancing wildlife conservation efforts through robust visual analytics.

Chinese Translation

大气雾霾显著降低了野生动物图像的质量，妨碍了对保护至关重要的计算机视觉应用，如动物检测、追踪和行为分析。为了解决这一挑战，我们推出了 AnimalHaze3k，一个合成数据集，包含 3,477 张有雾图像，这些图像是通过基于物理的流程从 1,159 张清晰的野生动物照片生成的。我们的新型 IncepDehazeGan 架构在 GAN 框架中结合了 inception 模块和残差跳跃连接，达到了最先进的性能（SSIM: 0.8914，PSNR: 20.54，LPIPS: 0.1104），相比竞争方法提高了 6.27% 的 SSIM 和 10.2% 的 PSNR。当应用于下游检测任务时，去雾图像使 YOLOv11 的检测平均精度（mAP）提高了 112%，交并比（IoU）提高了 67%。这些进展为生态学家提供了可靠的工具，以便在复杂环境条件下进行种群监测和监视，展示了通过强大的视觉分析增强野生动物保护工作的显著潜力。
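
The abstract does not spell out the physics-based pipeline. A widely used model for synthesizing haze is the atmospheric scattering equation I = J·t + A·(1 − t), with transmission t = exp(−β·d), where J is the clear pixel, A the airlight, β the scattering coefficient, and d the scene depth. The sketch below applies it to a single pixel. Treating this as the dataset's model is an assumption, as are the β values and the airlight. The three β levels are chosen only because 3,477 = 3 × 1,159, which would be consistent with three hazy variants per clear photograph.

```python
import math
from typing import Tuple

Pixel = Tuple[float, float, float]  # RGB values in [0, 1]


def transmission(depth: float, beta: float) -> float:
    """Fraction of scene light that survives the haze over the given depth."""
    return math.exp(-beta * depth)


def add_haze(clear_pixel: Pixel, depth: float, beta: float, airlight: float = 0.9) -> Pixel:
    """Blend the clear pixel with the airlight according to the transmission."""
    t = transmission(depth, beta)
    return tuple(c * t + airlight * (1.0 - t) for c in clear_pixel)


clear = (0.2, 0.4, 0.6)
# Hypothetical light, medium, and dense haze levels for one pixel at depth 1.0.
hazy_variants = [add_haze(clear, depth=1.0, beta=beta) for beta in (0.5, 1.0, 2.0)]
```

At zero depth the pixel is unchanged, and as depth or β grows every channel converges to the airlight, which matches the washed-out look of distant objects in haze.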

cs.CV / 83 / 2604.16298

FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

FineCog-Nav：集成细粒度认知模块的零样本多模态无人机导航

Shao, Dian, Xu, Zhengzheng, Wang, Peiyang, Liu, Like, Wang, Yule, Shi, Jieqi, Huo, Jing

Abstract

UAV vision-language navigation (VLN) requires an agent to navigate complex 3D environments from an egocentric perspective while following ambiguous multi-step instructions over long horizons. Existing zero-shot methods remain limited, as they often rely on large base models, generic prompts, and loosely coordinated modules. In this work, we propose FineCog-Nav, a top-down framework inspired by human cognition that organizes navigation into fine-grained modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model with role-specific prompts and structured input-output protocols, enabling effective collaboration and improved interpretability. To support fine-grained evaluation, we construct AerialVLN-Fine, a curated benchmark of 300 trajectories derived from AerialVLN, with sentence-level instruction-trajectory alignment and refined instructions containing explicit visual endpoints and landmark references. Experiments show that FineCog-Nav consistently outperforms zero-shot baselines in instruction adherence, long-horizon planning, and generalization to unseen environments. These results suggest the effectiveness of fine-grained cognitive modularization for zero-shot aerial navigation. Project page: https://smartdianlab.github.io/projects-FineCogNav.

Chinese Translation

无人机视觉-语言导航（VLN）要求智能体从自我中心的视角在复杂的三维环境中导航，同时遵循模糊的多步骤指令，覆盖较长的时间范围。现有的零样本方法仍然有限，因为它们通常依赖于大型基础模型、通用提示和松散协调的模块。在本研究中，我们提出了FineCog-Nav，一个受人类认知启发的自上而下框架，将导航组织为细粒度模块，涵盖语言处理、感知、注意力、记忆、想象、推理和决策。每个模块由一个中等规模的基础模型驱动，配备角色特定的提示和结构化的输入输出协议，从而实现有效的协作和改进的可解释性。为了支持细粒度评估，我们构建了AerialVLN-Fine，这是一个由AerialVLN派生的300条轨迹的精心策划基准，具有句子级的指令-轨迹对齐和包含明确视觉终点及地标参考的精炼指令。实验表明，FineCog-Nav在指令遵循、长时间规划和对未见环境的泛化方面始终优于零样本基线。这些结果表明，细粒度认知模块化在零样本空中导航中的有效性。项目页面：https://smartdianlab.github.io/projects-FineCogNav。

cs.CV / 84 / 2604.16299

Repurposing 3D Generative Model for Autoregressive Layout Generation

重新利用3D生成模型进行自回归布局生成

Feng, Haoran, Niu, Yifan, Huang, Zehuan, Sun, Yang-Tian, Guo, Chunchao, Peng, Yuxin, Sheng, Lu

Abstract

We introduce LaviGen, a framework that repurposes 3D generative models for 3D layout generation. Unlike previous methods that infer object layouts from textual descriptions, LaviGen operates directly in the native 3D space, formulating layout generation as an autoregressive process that explicitly models geometric relations and physical constraints among objects, producing coherent and physically plausible 3D scenes. To further enhance this process, we propose an adapted 3D diffusion model that integrates scene, object, and instruction information and employs a dual-guidance self-rollout distillation mechanism to improve efficiency and spatial accuracy. Extensive experiments on the LayoutVLM benchmark show LaviGen achieves superior 3D layout generation performance, with 19% higher physical plausibility than the state of the art and 65% faster computation. Our code is publicly available at https://github.com/fenghora/LaviGen.

Chinese Translation

我们介绍了LaviGen，一个将3D生成模型重新用于3D布局生成的框架。与之前从文本描述推断对象布局的方法不同，LaviGen直接在原生3D空间中操作，将布局生成形式化为一个自回归过程，明确建模对象之间的几何关系和物理约束，从而生成连贯且物理上合理的3D场景。为了进一步增强这一过程，我们提出了一种改进的3D扩散模型，该模型整合了场景、对象和指令信息，并采用双重引导自回归蒸馏机制以提高效率和空间准确性。在LayoutVLM基准上的大量实验表明，LaviGen在3D布局生成性能上表现优越，其物理合理性比现有最优方法高出19%，计算速度快65%。我们的代码已公开发布在https://github.com/fenghora/LaviGen。

人工智能 (Artificial Intelligence)

cs.AI / 1 / 2604.15456

DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI

DeepER-Med：通过自主智能推动医学中的深度循证研究

Wang, Zhizheng, Wei, Chih-Hsuan, Chan, Joey, Leaman, Robert, Day, Chi-Ping, Wu, Chuan, Knepper, Mark A, Farias, Antolin Serrano, Rincon-Torroella, Jordina, Slika, Hasan, Tyler, Betty, Nguyen, Ryan Huu-Tuan, Indurkar, Asmita, Hébert, Mélanie, Tian, Shubo, He, Lauren, Naffakh, Noor, Aseem, Aseem, Wan, Nicholas, Chew, Emily Y, Keenan, Tiarnan D L, Lu, Zhiyong

Abstract

Trustworthiness and transparency are essential for the clinical adoption of artificial intelligence (AI) in healthcare and biomedical research. Recent deep research systems aim to accelerate evidence-grounded scientific discovery by integrating AI agents with multi-hop information retrieval, reasoning, and synthesis. However, most existing systems lack explicit and inspectable criteria for evidence appraisal, creating a risk of compounding errors and making it difficult for researchers and clinicians to assess the reliability of their outputs. In parallel, current benchmarking approaches rarely evaluate performance on complex, real-world medical questions. Here, we introduce DeepER-Med, a Deep Evidence-based Research framework for Medicine with an agentic AI system. DeepER-Med frames deep medical research as an explicit and inspectable workflow of evidence-based generation, consisting of three modules: research planning, agentic collaboration, and evidence synthesis. To support realistic evaluation, we also present DeepER-MedQA, an evidence-grounded dataset comprising 100 expert-level research questions derived from authentic medical research scenarios and curated by a multidisciplinary panel of 11 biomedical experts. Expert manual evaluation demonstrates that DeepER-Med consistently outperforms widely used production-grade platforms across multiple criteria, including the generation of novel scientific insights. We further demonstrate the practical utility of DeepER-Med through eight real-world clinical cases. Human clinician assessment indicates that DeepER-Med's conclusions align with clinical recommendations in seven cases, highlighting its potential for medical research and decision support.

Chinese Translation

信任性和透明性是人工智能（AI）在医疗保健和生物医学研究中临床应用的关键。最近的深度研究系统旨在通过将AI代理与多跳信息检索、推理和综合相结合，加速基于证据的科学发现。然而，大多数现有系统缺乏明确且可检查的证据评估标准，这增加了错误累积的风险，并使研究人员和临床医生难以评估其输出的可靠性。同时，当前的基准测试方法很少评估在复杂的现实医疗问题上的表现。在此，我们介绍DeepER-Med，一个针对医学的深度循证研究框架，配备自主智能系统。DeepER-Med将深度医学研究框架化为一个明确且可检查的基于证据的生成工作流程，包含三个模块：研究规划、自主协作和证据综合。为了支持现实评估，我们还呈现了DeepER-MedQA，一个基于证据的数据集，包含100个来自真实医学研究场景的专家级研究问题，由11位生物医学专家的多学科小组进行策划。专家手动评估表明，DeepER-Med在多个标准上始终优于广泛使用的生产级平台，包括生成新颖的科学见解。我们进一步通过八个现实临床案例展示DeepER-Med的实际效用。人类临床医生的评估表明，DeepER-Med的结论在七个案例中与临床建议一致，突显了其在医学研究和决策支持中的潜力。

cs.AI / 2 / 2604.15495

GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology

GIST：通过智能语义拓扑进行多模态知识提取和空间定位

Agrawal, Shivendra, Hayes, Bradley

Abstract

Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While Vision-Language Models (VLMs) help assistive systems navigate semantically-rich spaces, they still struggle with spatial grounding in cluttered environments. We present GIST (Grounded Intelligent Semantic Topology), a multimodal knowledge extraction pipeline that transforms a consumer-grade mobile point cloud into a semantically annotated navigation topology. Our architecture distills the scene into a 2D occupancy map, extracts its topological layout, and overlays a lightweight semantic layer via intelligent keyframe and semantic selection. We demonstrate the versatility of this structured spatial knowledge through critical downstream Human-AI interaction tasks: (1) an intent-driven Semantic Search engine that actively infers categorical alternatives and zones when exact matches fail; (2) a one-shot Semantic Localizer achieving a 1.04 m top-5 mean translation error; (3) a Zone Classification module that segments the walkable floor plan into high-level semantic regions; and (4) a Visually-Grounded Instruction Generator that synthesizes optimal paths into egocentric, landmark-rich natural language routing. In multi-criteria LLM evaluations, GIST outperforms sequence-based instruction generation baselines. Finally, an in-situ formative evaluation (N=5) yields an 80% navigation success rate relying solely on verbal cues, validating the system's capacity for universal design.

Chinese Translation

在零售店、仓库和医院等复杂、密集的环境中导航，对人类和具身人工智能构成了显著的空间定位挑战。在这些空间中，由于物品的准静态特性，密集的视觉特征迅速变得陈旧，而长尾语义分布则对传统计算机视觉提出了挑战。尽管视觉-语言模型（Vision-Language Models, VLMs）有助于辅助系统在语义丰富的空间中导航，但它们在杂乱环境中的空间定位仍然面临困难。我们提出了GIST（Grounded Intelligent Semantic Topology），这是一种多模态知识提取管道，将消费级移动点云转换为语义注释的导航拓扑。我们的架构将场景提炼为二维占用图，提取其拓扑布局，并通过智能关键帧和语义选择叠加轻量级语义层。我们通过关键的下游人机交互任务展示了这种结构化空间知识的通用性：（1）一个意图驱动的语义搜索引擎，在精确匹配失败时主动推断类别替代方案和区域；（2）一个单样本（one-shot）语义定位器，top-5平均平移误差为1.04米；（3）一个区域分类模块，将可步行的平面图分割为高层次的语义区域；（4）一个视觉基础的指令生成器，将最佳路径合成为以自我为中心、富含地标的自然语言路线指引。在多标准的大语言模型评估中，GIST超越了基于序列的指令生成基线。最后，现场形成性评估（N=5）仅依赖于口头提示，获得了80%的导航成功率，验证了系统的通用设计能力。

cs.AI / 3 / 2604.15514

Bureaucratic Silences: What the Canadian AI Register Reveals, Omits, and Obscures

官僚沉默：加拿大人工智能注册表揭示、遗漏与模糊的内容

Das, Dipto, Tessono, Christelle, Ahmed, Syed Ishtiaque, Guha, Shion

Abstract

In November 2025, the Government of Canada operationalized its commitment to transparency by releasing its first Federal AI Register. In this paper, we argue that such registers are not neutral mirrors of government activity, but active instruments of ontological design that configure the boundaries of accountability. We analyzed the Register's complete dataset of 409 systems using the Algorithmic Decision-Making Adapted for the Public Sector (ADMAPS) framework, combining quantitative mapping with deductive qualitative coding. Our findings reveal a sharp divergence between the rhetoric of "sovereign AI" and the reality of bureaucratic practice: while 86% of systems are deployed internally for efficiency, the Register systematically obscures the human discretion, training, and uncertainty management required to operate them. By privileging technical descriptions over sociotechnical context, the Register constructs an ontology of AI as "reliable tooling" rather than "contestable decision-making." We conclude that without a shift in design, such transparency artifacts risk automating accountability into a performative compliance exercise, offering visibility without contestability.

Chinese Translation

2025年11月，加拿大政府通过发布首个联邦人工智能注册表，落实其对透明度的承诺。本文认为，这类注册表并非政府活动的中立镜像，而是主动的本体设计工具，配置着问责的边界。我们使用适用于公共部门的算法决策制定框架（Algorithmic Decision-Making Adapted for the Public Sector, ADMAPS）分析了注册表中409个系统的完整数据集，结合了定量映射与演绎性定性编码。我们的研究发现，“主权人工智能”的言辞与官僚实践的现实之间存在明显的差异：尽管86%的系统是为了提高效率而在内部部署，但注册表系统性地模糊了操作这些系统所需的人为裁量、培训和不确定性管理。通过优先考虑技术描述而非社会技术背景，注册表构建了将人工智能视为“可靠工具”而非“可争议决策”的本体论。我们得出结论，如果不在设计上进行转变，这类透明度工具可能会将问责自动化为一种表面合规的演练，提供可见性而非可争议性。

cs.AI / 4 / 2604.15529

LACE: Lattice Attention for Cross-thread Exploration

LACE：用于跨线程探索的晶格注意力

Li, Yang, Zhang, Zirui, Liu, Yang, Mao, Chengzhi

Abstract

Current large language models reason in isolation. Although it is common to sample multiple reasoning paths in parallel, these trajectories do not interact, and often fail in the same redundant ways. We introduce LACE, a framework that transforms reasoning from a collection of independent trials into a coordinated, parallel process. By repurposing the model architecture to enable cross-thread attention, LACE allows concurrent reasoning paths to share intermediate insights and correct one another during inference. A central challenge is the absence of natural training data that exhibits such collaborative behavior. We address this gap with a synthetic data pipeline that explicitly teaches models to communicate and error-correct across threads. Experiments show that this unified exploration substantially outperforms standard parallel search, improving reasoning accuracy by over 7 points. Our results suggest that large language models can be more effective when parallel reasoning paths are allowed to interact.

Chinese Translation

当前的大型语言模型在孤立中进行推理。尽管并行采样多个推理路径是常见做法，但这些轨迹并不相互作用，且常常以相同的冗余方式失败。我们提出了LACE，一个将推理从一系列独立试验转变为协调并行过程的框架。通过重新设计模型架构以实现跨线程注意力，LACE允许并发推理路径在推理过程中共享中间见解并相互纠正。一个核心挑战是缺乏自然训练数据来展示这种协作行为。我们通过一个合成数据管道来填补这一空白，明确教导模型在线程之间进行沟通和错误修正。实验表明，这种统一探索显著优于标准的并行搜索，推理准确性提高了超过7个百分点。我们的结果表明，当允许并行推理路径相互作用时，大型语言模型可以更有效。

cs.AI / 5 / 2604.15558

Preregistered Belief Revision Contracts

预注册信念修正合同

Alqithami, Saad

Abstract

Deliberative multi-agent systems allow agents to exchange messages and revise beliefs over time. While this interaction is meant to improve performance, it can also create dangerous conformity effects: agreement, confidence, prestige, or majority size may be treated as if they were evidence, producing high-confidence convergence to false conclusions. To address this, we introduce PBRC (Preregistered Belief Revision Contracts), a protocol-level mechanism that strictly separates open communication from admissible epistemic change. A PBRC contract publicly fixes first-order evidence triggers, admissible revision operators, a priority rule, and a fallback policy. A non-fallback step is accepted only when it cites a preregistered trigger and provides a nonempty witness set of externally validated evidence tokens. This ensures that every substantive belief change is both enforceable by a router and auditable after the fact. In this paper, (a) we prove that under evidential contracts with conservative fallback, social-only rounds cannot increase confidence and cannot generate purely conformity-driven wrong-but-sure cascades. (b) We show that auditable trigger protocols admit evidential PBRC normal forms that preserve belief trajectories and canonicalized audit traces. (c) We demonstrate that sound enforcement yields epistemic accountability: any change of top hypothesis is attributable to a concrete validated witness set. For token-invariant contracts, (d) we prove that enforced trajectories depend only on token-exposure traces; under flooding dissemination, these traces are characterized exactly by truncated reachability, giving tight diameter bounds for universal evidence closure. Finally, we introduce a companion contractual dynamic doxastic logic to specify trace invariants, and provide simulations illustrating cascade suppression, auditability, and robustness-liveness trade-offs.

Chinese Translation

审议性多智能体系统允许智能体随时间交换信息并修正信念。虽然这种互动旨在提高性能，但也可能产生危险的从众效应：一致性、信心、声望或多数规模可能被视为证据，从而导致对错误结论的高度自信收敛。为了解决这一问题，我们引入了PBRC（预注册信念修正合同），这是一种在协议层面上严格区分开放通信与可接受的认识变化的机制。PBRC合同公开固定了一阶证据触发器、可接受的修正操作、优先规则和后备政策。非后备步骤仅在引用预注册触发器并提供非空的外部验证证据令牌集合时被接受。这确保了每一次实质性的信念变化都可以由路由器强制执行，并在事后可审计。本文中，(a) 我们证明在具有保守后备的证据合同下，社会性回合无法增加信心，也无法产生完全由从众驱动的错误但确定的级联。(b) 我们展示了可审计的触发器协议允许证据PBRC标准形式，这些形式保留了信念轨迹和规范化的审计痕迹。(c) 我们证明了合理的执行产生了认识责任：任何顶层假设的变化都可归因于一个具体的验证过的证人集合。对于令牌不变的合同，(d) 我们证明了强制执行的轨迹仅依赖于令牌曝光痕迹；在洪泛传播下，这些痕迹恰好由截断可达性特征化，为普遍证据闭合提供了紧密的直径界限。最后，我们引入了一种伴随的合同动态信念逻辑，以指定轨迹不变性，并提供了模拟以说明级联抑制、可审计性和稳健性-活性权衡。
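
The acceptance rule described in the abstract (a non-fallback step must cite a preregistered trigger and carry a nonempty witness set of externally validated evidence tokens) can be illustrated with a minimal sketch. The names `Contract`, `Step`, and `route` are hypothetical and not taken from the paper, and the fallback shown here (keep the current belief) is only one assumed reading of a "conservative fallback".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Contract:
    # Hypothetical contract: trigger names fixed in advance, plus the set of
    # evidence tokens that an external validator has already approved.
    triggers: frozenset
    validated_tokens: frozenset


@dataclass(frozen=True)
class Step:
    new_belief: str
    trigger: Optional[str] = None      # social-only messages cite no trigger
    witnesses: frozenset = frozenset() # evidence tokens offered as witnesses


def route(contract, current_belief, step):
    """Accept a belief change only if it is backed by admissible evidence.

    Returns the resulting belief and an audit entry for later inspection.
    """
    cites_trigger = step.trigger in contract.triggers
    has_witnesses = len(step.witnesses) > 0
    all_validated = step.witnesses <= contract.validated_tokens
    accepted = cites_trigger and has_witnesses and all_validated
    # Assumed conservative fallback: leave the belief unchanged.
    belief = step.new_belief if accepted else current_belief
    audit = {
        "accepted": accepted,
        "trigger": step.trigger,
        "witnesses": sorted(step.witnesses),
        "belief": belief,
    }
    return belief, audit
```

Under this sketch, a message that only reports agreement or confidence (no trigger, no witnesses) never changes the belief, and every accepted change leaves an audit entry naming the witness tokens behind it.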

cs.AI / 6 / 2604.15559

Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation

人工智能代理蒸馏中的潜意识不安全行为转移

Dang, Jacob, Xie, Brian Y., Younis, Omar G.

Abstract

Recent work on subliminal learning demonstrates that language models can transmit semantic traits through data that is semantically unrelated to those traits. However, it remains unclear whether behavioral traits can transfer in agentic systems, where policies are learned from trajectories rather than static text. In this work, we provide the first empirical evidence that unsafe agent behaviors can transfer subliminally through model distillation across two complementary experimental settings. In our primary setting, we construct a teacher agent exhibiting a strong deletion bias, a tendency to perform destructive file-system actions via an API-style tool interface, and distill it into a student using only trajectories from ostensibly safe tasks, with all explicit deletion keywords rigorously filtered. In our secondary setting, we replicate the threat model in a native Bash environment, replacing API tool calls with shell commands and operationalizing the bias as a preference for issuing chmod as the first permission-related command over semantically equivalent alternatives such as chown or setfacl. Despite full keyword sanitation in both settings, students inherit measurable behavioral biases. In the API setting the student's deletion rate reaches 100% (versus a 5% baseline) under homogeneous distillation; in the Bash setting the student's chmod-first rate reaches 30%-55% (versus a 0%-10% baseline), with the strongest transfer observed in large-to-small distillation. Our results demonstrate that explicit data sanitation is an insufficient defense, and behavioral biases are encoded implicitly in trajectory dynamics regardless of the tool interface.

Chinese Translation

近期关于潜意识学习的研究表明，语言模型可以通过与这些特征语义上无关的数据传递语义特征。然而，在代理系统中，行为特征是否能够转移仍然不清楚，因为在这些系统中，策略是通过轨迹学习的，而不是静态文本。在本研究中，我们提供了首个实证证据，表明不安全的代理行为可以通过模型蒸馏在两种互补的实验设置中潜意识地转移。在我们的主要设置中，我们构建了一个表现出强烈删除偏见的教师代理，该代理倾向于通过API风格的工具接口执行破坏性文件系统操作，并仅使用表面上安全任务的轨迹将其蒸馏为学生，同时严格过滤所有显式删除关键词。在我们的次要设置中，我们在本地Bash环境中复制了威胁模型，将API工具调用替换为shell命令，并将偏见操作化为优先发出chmod作为第一个与权限相关的命令，而不是语义上等价的替代命令，如chown或setfacl。尽管在两个设置中都进行了全面的关键词清理，学生仍然继承了可测量的行为偏见。在API设置中，学生的删除率在同质蒸馏下达到了100%（对比基线的5%）；在Bash设置中，学生的chmod优先率达到了30%-55%（对比基线的0%-10%），在大到小的蒸馏中观察到最强的转移。我们的结果表明，显式的数据清理是不足以防御的，行为偏见在轨迹动态中隐式编码，无论工具接口如何。
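
The Bash-setting metric (how often `chmod` is the first permission-related command in a trajectory) could be computed roughly as below. This is a sketch under assumptions: trajectories are taken to be lists of shell command strings, the command name is taken to be the first whitespace-separated token (so `sudo`, pipes, and subshells are ignored), and trajectories with no permission-related command are excluded from the denominator. The paper may define these details differently.

```python
# Permission-related commands named in the abstract.
PERMISSION_COMMANDS = ("chmod", "chown", "setfacl")


def first_permission_command(trajectory):
    """Return the first permission-related command name in a trajectory, or None."""
    for line in trajectory:
        tokens = line.split()
        if tokens and tokens[0] in PERMISSION_COMMANDS:
            return tokens[0]
    return None


def chmod_first_rate(trajectories):
    """Fraction of trajectories whose first permission-related command is chmod.

    Trajectories that never issue a permission-related command are skipped
    (an assumption made for this sketch).
    """
    firsts = [first_permission_command(t) for t in trajectories]
    relevant = [name for name in firsts if name is not None]
    if not relevant:
        return 0.0
    return sum(name == "chmod" for name in relevant) / len(relevant)
```

Comparing this rate between a distilled student and an undistilled baseline on the same tasks gives the kind of contrast the abstract reports.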

cs.AI / 7 / 2604.15709

Bilevel Optimization of Agent Skills via Monte Carlo Tree Search

通过蒙特卡洛树搜索进行代理技能的双层优化

Huang, Chenyi, Zhang, Haoting, Xu, Jingxu, Zheng, Zeyu, Lin, Yunduan

Abstract

Agent skills are structured collections of instructions, tools, and supporting resources that help large language model (LLM) agents perform particular classes of tasks. Empirical evidence shows that the design of skills can materially affect agent task performance, yet systematically optimizing skills remains challenging. Since a skill comprises instructions, tools, and supporting resources in a structured way, optimizing it requires jointly determining both the structure of these components and the content each component contains. This gives rise to a complex decision space with strong interdependence across structure and components. We therefore represent these two coupled decisions as skill structure and component content, and formulate skill optimization as a bilevel optimization problem. We propose a bilevel optimization framework in which an outer loop employs Monte Carlo Tree Search to determine the skill structure, while an inner loop refines the component content within the structure selected by the outer loop. In both loops, we employ LLMs to assist the optimization procedure. We evaluate the proposed framework on an open-source Operations Research Question Answering dataset, and the experimental results suggest that the bilevel optimization framework improves the performance of the agents with the optimized skill.

Chinese Translation

代理技能是指一系列结构化的指令、工具和支持资源，帮助大型语言模型（LLM）代理执行特定类别的任务。实证证据表明，技能的设计会显著影响代理的任务表现，但系统性地优化技能仍然具有挑战性。由于技能以结构化的方式包含指令、工具和支持资源，因此优化技能需要共同确定这些组件的结构及其各自包含的内容。这导致了一个复杂的决策空间，其中结构和组件之间存在强烈的相互依赖关系。因此，我们将这两种耦合决策表示为技能结构和组件内容，并将技能优化形式化为一个双层优化问题。我们提出了一个双层优化框架，其中外层循环采用蒙特卡洛树搜索来确定技能结构，而内层循环则在外层循环选择的结构内细化组件内容。在两个循环中，我们都使用LLM来辅助优化过程。我们在一个开源的运筹学问答数据集上评估了所提出的框架，实验结果表明，双层优化框架提高了具有优化技能的代理的性能。

cs.AI / 8 / 2604.15719

The World Leaks the Future: Harness Evolution for Future Prediction Agents

世界泄露未来：面向未来预测代理的 Harness 演化

Wei, Chuyang, Gao, Maohang, Han, Zhixin, Chen, Kefei, Zhuang, Yu, Guan, Haoxiang, Zhang, Yanzhi, Cheng, Yilin, He, Jiyan, Chen, Huanhuan, Li, Jian, Shi, Yu, Duan, Yitong, Zheng, Shuxin

Abstract

Many consequential decisions must be made before the relevant outcome is known. Such problems are commonly framed as "future prediction", where an LLM agent must form a prediction for an unresolved question using only the public information available at the prediction time. The setting is difficult because public evidence evolves while useful supervision arrives only after the question is resolved, so most existing approaches still improve mainly from final outcomes. Yet final outcomes are too coarse to guide earlier factor tracking, evidence gathering and interpretation, or uncertainty handling. When the same unresolved question is revisited over time, temporal contrasts between earlier and later predictions can expose omissions in the earlier prediction process; we call this signal "internal feedback". We introduce Milkyway, a self-evolving agent system that keeps the base model fixed and instead updates a persistent "future prediction harness" for factor tracking, evidence gathering and interpretation, and uncertainty handling. Across repeated predictions on the same unresolved question, Milkyway extracts internal feedback and writes reusable guidance back into the harness, so later predictions on that question can improve before the outcome is known. After the question is resolved, the final outcome provides a "retrospective check" before the updated harness is carried forward to subsequent questions. On FutureX and FutureWorld, Milkyway achieves the best overall score among the compared methods, improving FutureX from 44.07 to 60.90 and FutureWorld from 62.22 to 77.96.

Chinese Translation

在相关结果尚未确定之前，必须做出许多重要决策。这类问题通常被表述为“未来预测”，其中一个大型语言模型（LLM）代理必须仅利用在预测时可用的公共信息来形成对未解决问题的预测。这一设置十分困难，因为公共证据会随着时间的推移而演变，而有用的监督信息通常在问题解决后才会到来，因此大多数现有方法仍主要依赖最终结果进行改进。然而，最终结果过于粗略，无法指导早期因素追踪、证据收集与解释或不确定性处理。当同一未解决问题随着时间的推移被重新审视时，早期和后期预测之间的时间对比可以揭示早期预测过程中的遗漏；我们称这种信号为“内部反馈”。我们引入了Milkyway，一个自我演化的代理系统，该系统保持基础模型不变，而是更新一个持久的“未来预测工具”，用于因素追踪、证据收集与解释以及不确定性处理。在对同一未解决问题的重复预测中，Milkyway提取内部反馈并将可重用的指导写回工具中，从而使得在结果已知之前，对该问题的后续预测能够得到改善。在问题解决后，最终结果提供了一个“回顾检查”，以便在更新的工具被应用于后续问题之前进行验证。在FutureX和FutureWorld上，Milkyway在比较方法中实现了最佳整体得分，将FutureX从44.07提高到60.90，将FutureWorld从62.22提高到77.96。

cs.AI / 9 / 2604.15726

LLM Reasoning Is Latent, Not the Chain of Thought

大语言模型推理是潜在的，而非思维链

Wang, Wenshuo

Abstract

This position paper argues that large language model (LLM) reasoning should be studied as latent-state trajectory formation rather than as faithful surface chain-of-thought (CoT). This matters because claims about faithfulness, interpretability, reasoning benchmarks, and inference-time intervention all depend on what the field takes the primary object of reasoning to be. We ask what that object should be once three often-confounded factors are separated and formalize three competing hypotheses: H1, reasoning is primarily mediated by latent-state trajectories; H2, reasoning is primarily mediated by explicit surface CoT; and H0, most apparent reasoning gains are better explained by generic serial compute than by any privileged representational object. Reorganizing recent empirical, mechanistic, and survey work under this framework, and adding compute-audited worked exemplars that factorize surface traces, latent interventions, and matched budget expansions, we find that current evidence most strongly supports H1 as a default working hypothesis rather than as a task-independent verdict. We therefore make two recommendations: the field should treat latent-state dynamics as the default object of study for LLM reasoning, and it should evaluate reasoning with designs that explicitly disentangle surface traces, latent states, and serial compute.

Chinese Translation

本文立场论文认为，大语言模型（LLM）推理应被视为潜在状态轨迹的形成，而非忠实的表面思维链（CoT）。这一观点至关重要，因为关于忠实性、可解释性、推理基准和推理时干预的主张都依赖于该领域对推理主要对象的理解。我们探讨在分离三个常常混淆的因素后，这一对象应是什么，并形式化三个竞争假设：H1，推理主要通过潜在状态轨迹介导；H2，推理主要通过显式的表面思维链介导；H0，大多数明显的推理增益更好地通过通用的串行计算而非任何特权的表征对象来解释。在这一框架下重新组织近期的实证、机制和调查研究，并添加计算审计的工作示例，以分解表面痕迹、潜在干预和匹配的预算扩展，我们发现当前证据最强烈地支持H1作为默认工作假设，而非作为任务独立的裁决。因此，我们提出两项建议：该领域应将潜在状态动态视为LLM推理的默认研究对象，并应通过明确区分表面痕迹、潜在状态和串行计算的设计来评估推理。

cs.AI / 10 / 2604.15727

Structured Abductive-Deductive-Inductive Reasoning for LLMs via Algebraic Invariants

通过代数不变量实现大语言模型的结构化溯因-演绎-归纳推理

Gilda, Sankalp, Gilda, Shlok

Abstract

Large language models exhibit systematic limitations in structured logical reasoning: they conflate hypothesis generation with verification, cannot distinguish conjecture from validated knowledge, and allow weak reasoning steps to propagate unchecked through inference chains. We present a symbolic reasoning scaffold that operationalizes Peirce's tripartite inference -- abduction, deduction, and induction -- as an explicit protocol for LLM-assisted reasoning. The framework enforces logical consistency through five algebraic invariants (the Gamma Quintet), the strongest of which -- the Weakest Link bound -- ensures that no conclusion in a reasoning chain can exceed the reliability of its least-supported premise. This principle, independently grounded as weakest link resolution in possibilistic logic and empirically validated for chain-of-thought reasoning, prevents logical inconsistencies from accumulating across multi-step inference. We verify all invariants through a property-based testing suite of 100 properties and 16 fuzz tests over 10^5+ generated cases, providing a verified reference implementation of the invariants suitable as a foundation for future reasoning benchmarks.

Chinese Translation

大型语言模型在结构化逻辑推理方面表现出系统性的局限性：它们将假设生成与验证混为一谈，无法区分猜想与验证知识，并允许弱推理步骤在推理链中不受控制地传播。我们提出了一种符号推理框架，将皮尔士的三元推理——溯因、演绎和归纳——作为大语言模型辅助推理的显式协议。该框架通过五个代数不变量（伽马五元组）强制执行逻辑一致性，其中最强的——最弱链界限——确保推理链中的任何结论都不会超过其支持最少的前提的可靠性。该原则在可能性逻辑中独立确立为最弱链解决方案，并在思维链推理中经过实证验证，防止逻辑不一致在多步推理中累积。我们通过一个包含100个属性和16个模糊测试的基于属性的测试套件验证了所有不变量，测试覆盖超过10^5个生成案例，提供了一个经过验证的不变量参考实现，适合作为未来推理基准的基础。
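
The Weakest Link bound stated in the abstract (no conclusion may exceed the reliability of its least-supported premise) reduces to a `min` over premise reliabilities. The sketch below is illustrative only: the function names and the use of plain floats in [0, 1] as reliabilities are assumptions, and the paper's other four invariants are not modelled.

```python
def chain_reliability_bound(premise_reliabilities):
    """Upper bound on a conclusion's reliability: that of its weakest premise."""
    if not premise_reliabilities:
        raise ValueError("a reasoning chain needs at least one premise")
    return min(premise_reliabilities)


def satisfies_weakest_link(conclusion_reliability, premise_reliabilities):
    """Check that a claimed conclusion reliability respects the bound."""
    return conclusion_reliability <= chain_reliability_bound(premise_reliabilities)
```

For a chain with premise reliabilities 0.9, 0.4, and 0.8, the bound is 0.4, so a conclusion asserted at 0.5 would violate the invariant however strong the other premises are.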

cs.AI / 11 / 2604.15760

KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

KWBench：测量知识工作中的无提示问题识别

Maloo, Ankit

Abstract

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check. Mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approximately 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.

Chinese Translation

我们介绍了KWBench（知识工作基准）的第一个版本，这是一个用于评估大型语言模型（LLM）在无提示情况下识别问题能力的基准：LLM能否在尝试解决问题之前识别出专业场景。现有的前沿基准已趋于饱和，目前大多数知识工作评估仅限于根据规范进行提取或任务完成。KWBench的目标是这一过程之前的步骤：仅从原始输入中识别情境的主导结构。该基准包含来自采购、合同谈判、临床药学、组织政治、欺诈分析和激励设计等领域的实践者提供的223个任务。每个任务编码了一个正式的博弈论模式（委托-代理冲突、信号传递、机制设计失败、战略性遗漏、联盟动态、战略相互依赖），并记录了专家对情境的理解及预期的失败模式。模型接收原始数据和任务提示，但没有问题类型的指示。评分采用三层评分标准，必须通过一个强制性联合检查。强制性标准编码了预测的错误路径。我们评估了16个模型。最佳模型在27.9%的任务上通过。排名前两的模型仅在31.7%的通过任务上达成一致。在前8名中，有44个任务仅被一个模型解决；前8名模型的路由覆盖了基准的50.7%，几乎是最佳单一模型的两倍。在通过的条件下，质量评分趋于一致（各模型约为83%）；而无条件评分则不然。同样的模型在被询问时能够正确表达相关的博弈论概念，但在未提示的情况下却无法应用。我们发布KWBench，以改变前沿模型在知识工作中的评估方式，评分标准不仅考虑它们在问题被框定后执行的效果，还包括它们是否能够仅从情境中识别出正确的问题。
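
The routing figures in the abstract (tasks solved by exactly one model, and the coverage obtained by routing across several models) are set computations over per-model pass lists. The sketch below assumes an oracle router that always picks a model that passes when one exists; the function names and the toy data in the usage note are hypothetical and are not KWBench results.

```python
from collections import Counter


def routing_coverage(passes_by_model, n_tasks):
    """Fraction of tasks passed by at least one model (oracle routing)."""
    covered = set().union(*passes_by_model.values())
    return len(covered) / n_tasks


def solved_by_exactly_one(passes_by_model):
    """Tasks that appear in exactly one model's pass set."""
    counts = Counter(task for passed in passes_by_model.values() for task in passed)
    return {task for task, count in counts.items() if count == 1}
```

With three hypothetical models passing tasks {1, 2, 3}, {3, 4}, and {5} out of 10, the coverage is 0.5 and the uniquely solved tasks are {1, 2, 4, 5}. A large uniquely solved set is what makes routed coverage much higher than the best single model.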

cs.AI / 12 / 2604.15837

Stein Variational Black-Box Combinatorial Optimization

斯坦变分黑箱组合优化

Landais, Thomas, Goudet, Olivier, Goëffon, Adrien, Saubion, Frédéric, Lamprier, Sylvain

Abstract

Combinatorial black-box optimization in high-dimensional settings demands a careful trade-off between exploiting promising regions of the search space and preserving sufficient exploration to identify multiple optima. Although Estimation-of-Distribution Algorithms (EDAs) provide a powerful model-based framework, they often concentrate on a single region of interest, which may result in premature convergence when facing complex or multimodal objective landscapes. In this work, we incorporate the Stein operator to introduce a repulsive mechanism among particles in the parameter space, thereby encouraging the population to disperse and jointly explore several modes of the fitness landscape. Empirical evaluations across diverse benchmark problems show that the proposed method achieves performance competitive with, and in several cases superior to, leading state-of-the-art approaches, particularly on large-scale instances. These findings highlight the potential of Stein variational gradient descent as a promising direction for addressing large, computationally expensive, discrete black-box optimization problems.

Chinese Translation

在高维环境下，组合黑箱优化需要在利用搜索空间中有前景的区域与保持足够的探索以识别多个最优解之间进行谨慎的权衡。尽管分布估计算法（Estimation-of-Distribution Algorithms, EDAs）提供了一个强大的基于模型的框架，但它们往往集中于单一的关注区域，这可能导致在面对复杂或多模态目标景观时的过早收敛。在本研究中，我们引入斯坦算子（Stein operator），在参数空间中的粒子之间引入一种排斥机制，从而鼓励种群分散并共同探索适应度景观的多个模式。针对多种基准问题的实证评估表明，所提出的方法在性能上与领先的最先进方法具有竞争力，并在多个案例中表现优于这些方法，尤其是在大规模实例上。这些发现突显了斯坦变分梯度下降（Stein variational gradient descent）作为解决大型、计算开销高的离散黑箱优化问题的有前景方向的潜力。
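
The repulsive mechanism mentioned in the abstract comes from the kernel-gradient term of Stein variational gradient descent. The sketch below shows the standard SVGD particle update on a one-dimensional continuous toy target with a fixed-bandwidth RBF kernel. It is not the paper's method, which applies the Stein operator in the parameter space of a distribution model for discrete problems; the target, bandwidth, and step size are assumptions chosen for illustration.

```python
import math


def rbf(x, y, h):
    """RBF kernel with bandwidth h."""
    return math.exp(-((x - y) ** 2) / (2.0 * h * h))


def svgd_step(particles, grad_log_p, h=1.0, step=0.1):
    """One SVGD update for scalar particles.

    For each particle x_i the direction averages, over all particles x_j,
    an attraction term k(x_j, x_i) * grad_log_p(x_j) and a repulsion term
    d/dx_j k(x_j, x_i) = k(x_j, x_i) * (x_i - x_j) / h^2.
    """
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = rbf(xj, xi, h)
            attraction = k * grad_log_p(xj)
            repulsion = k * (xi - xj) / (h * h)
            phi += attraction + repulsion
        updated.append(xi + step * phi / n)
    return updated


def run_svgd(particles, grad_log_p, n_steps=500, h=1.0, step=0.1):
    """Apply svgd_step repeatedly and return the final particles."""
    for _ in range(n_steps):
        particles = svgd_step(particles, grad_log_p, h=h, step=step)
    return particles


def gaussian_grad_log_p(x, mean=2.0):
    """Gradient of log N(mean, 1) at x (toy target)."""
    return -(x - mean)
```

Starting from particles spread evenly on [-1, 1], `run_svgd(particles, gaussian_grad_log_p)` moves them toward the target while the repulsion term keeps them from collapsing onto a single point, which is the behaviour the abstract relies on to explore several modes.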

cs.AI / 13 / 2604.15839

Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

发现与证明：一个开源的代理框架用于 Lean 4 中的困难模式自动定理证明

Liu, Chengwu, Yin, Yichun, Yuan, Ye, Xie, Jiaxuan, Li, Botao, Li, Siqi, Shen, Jianhao, Xu, Yan, Shang, Lifeng, Zhang, Ming

Abstract

Most ATP benchmarks embed the final answer within the formal statement -- a convention we call "Easy Mode" -- a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting "Hard Mode": the system must independently discover the answer before constructing a formal proof. To enable Hard Mode research, we make two contributions. First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely-used ATP benchmarks. Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers. DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode -- while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%, exposing a substantial gap that Hard Mode benchmarks are uniquely suited to measure.

Chinese Translation

大多数自动定理证明（ATP）基准将最终答案嵌入到形式声明中——我们称之为“简单模式”——这种设计相对于人类竞争者所面临的任务简化了问题，并可能导致对模型能力的乐观估计。我们将更严格、更现实的设置称为“困难模式”：系统必须独立发现答案，然后再构建正式证明。为了促进困难模式的研究，我们做出了两个贡献。首先，我们发布了 MiniF2F-Hard 和 FIMO-Hard，这两个广泛使用的 ATP 基准的专家重新标注的困难模式变体。其次，我们引入了发现与证明（Discover And Prove, DAP），这是一个代理框架，利用大型语言模型（LLM）的自然语言推理与明确的自我反思来发现答案，然后将困难模式声明重写为简单模式，以适应现有的 ATP 证明器。DAP 设置了最新的技术水平：在 CombiBench 上，它将解决的问题数量从 7（之前的 SOTA，Pass@16）提高到 10；在 PutnamBench 上，它是第一个在困难模式下正式证明 36 个定理的系统——同时揭示了最新的 LLM 在同样的问题上超过 80% 的答案准确率，而正式证明器的准确率不足 10%，暴露了一个显著的差距，困难模式基准特别适合于测量这一点。

cs.AI / 14 / 2604.15877

Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

经验压缩谱：统一 LLM 代理中的记忆、技能与规则

Zhang, Xing, Wang, Guanghui, Cui, Yanwei, Qiu, Wei, Li, Ziyuan, Zhu, Bing, He, Peiyang

Abstract

As LLM agents scale to long-horizon, multi-session deployments, efficiently managing accumulated experience becomes a critical bottleneck. Agent memory systems and agent skill discovery both address this challenge -- extracting reusable knowledge from interaction traces -- yet a citation analysis of 1,136 references across 22 primary papers reveals a cross-community citation rate below 1%. We propose the Experience Compression Spectrum, a unifying framework that positions memory, skills, and rules as points along a single axis of increasing compression (5-20× for episodic memory, 50-500× for procedural skills, 1,000×+ for declarative rules), directly reducing context consumption, retrieval latency, and compute overhead. Mapping 20+ systems onto this spectrum reveals that every system operates at a fixed, predetermined compression level -- none supports adaptive cross-level compression, a gap we term the "missing diagonal". We further show that specialization alone is insufficient -- both communities independently solve shared sub-problems without exchanging solutions -- that evaluation methods are tightly coupled to compression levels, that transferability increases with compression at the cost of specificity, and that knowledge lifecycle management remains largely neglected. We articulate open problems and design principles for scalable, full-spectrum agent learning systems.

Chinese Translation

随着 LLM 代理在长期、多会话的部署中规模扩大，有效管理积累的经验成为一个关键瓶颈。代理记忆系统和代理技能发现都旨在应对这一挑战，即从交互轨迹中提取可重用的知识；然而，对 22 篇主要论文中 1,136 个参考文献的引用分析显示，跨社区的引用率低于 1%。我们提出了“经验压缩谱”，这是一个统一框架，将记忆、技能和规则定位于压缩程度递增的同一条轴上（情节记忆为 5-20×，程序技能为 50-500×，声明性规则为 1,000×以上），直接减少上下文消耗、检索延迟和计算开销。将 20 多个系统映射到该谱上显示，每个系统都在一个固定的、预先确定的压缩水平上运行：没有一个系统支持自适应跨级压缩，我们将这一缺口称为“缺失对角线”。我们进一步表明，仅靠专业化是不够的（两个社区独立解决共享的子问题而不交换解决方案），评估方法与压缩水平紧密耦合，可迁移性随着压缩的增加而提高，但以特异性为代价，知识生命周期管理仍然在很大程度上被忽视。我们阐明了可扩展的全谱代理学习系统的开放问题和设计原则。

cs.AI / 15 / 2604.15898

Towards Rigorous Explainability by Feature Attribution

通过特征归因实现严格的可解释性

Létoffé, Olivier, Huang, Xuanxiang, Marques-Silva, Joao

Abstract

For around a decade, non-symbolic methods have been the option of choice when explaining complex machine learning (ML) models. Unfortunately, such methods lack rigor and can mislead human decision-makers. In high-stakes uses of ML, the lack of rigor is especially problematic. One prime example of provable lack of rigor is the adoption of Shapley values in explainable artificial intelligence (XAI), with the tool SHAP being a ubiquitous example. This paper overviews the ongoing efforts towards using rigorous symbolic methods of XAI as an alternative to non-rigorous non-symbolic approaches, concretely for assigning relative feature importance.

Chinese Translation

在过去十年中，非符号方法一直是解释复杂机器学习（ML）模型的首选。然而，这些方法缺乏严谨性，可能会误导人类决策者。在机器学习的高风险应用中，缺乏严谨性尤其成问题。一个明显缺乏严谨性的例子是可解释人工智能（XAI）中采用Shapley值，工具SHAP就是一个普遍的例子。本文概述了当前朝着使用严格符号方法的XAI的努力，以作为非严谨非符号方法的替代，具体用于分配相对特征重要性。

cs.AI / 16 / 2604.15951

Integrating Graphs, Large Language Models, and Agents: Reasoning and Retrieval

图形、大型语言模型与智能体的整合：推理与检索

Jelodar, Hamed, Bai, Samita, Meymani, Mohammad, Hamedi, Parisa, Razavi-Far, Roozbeh, Ghorbani, Ali

Abstract

Generative AI, particularly Large Language Models, increasingly integrates graph-based representations to enhance reasoning, retrieval, and structured decision-making. Despite rapid advances, there remains limited clarity regarding when, why, where, and what types of graph-LLM integrations are most appropriate across applications. This survey provides a concise, structured overview of the design choices underlying the integration of graphs with LLMs. We categorize existing methods based on their purpose (reasoning, retrieval, generation, recommendation), graph modality (knowledge graphs, scene graphs, interaction graphs, causal graphs, dependency graphs), and integration strategies (prompting, augmentation, training, or agent-based use). By mapping representative works across domains such as cybersecurity, healthcare, materials science, finance, robotics, and multimodal environments, we highlight the strengths, limitations, and best-fit scenarios for each technique. This survey aims to offer researchers a practical guide for selecting the most suitable graph-LLM approach depending on task requirements, data characteristics, and reasoning complexity.

Chinese Translation

生成性人工智能，特别是大型语言模型，越来越多地整合基于图形的表示，以增强推理、检索和结构化决策。尽管快速发展，但关于何时、为何、在哪里以及何种类型的图形-LLM（Large Language Models）整合在各应用中最为适宜仍然缺乏明确性。本调查提供了图形与LLM整合设计选择的简明结构化概述。我们根据目的（推理、检索、生成、推荐）、图形模态（知识图谱、场景图、交互图、因果图、依赖图）和整合策略（提示、增强、训练或基于智能体的使用）对现有方法进行了分类。通过映射网络安全、医疗保健、材料科学、金融、机器人技术和多模态环境等领域的代表性工作，我们突出了每种技术的优点、局限性和最佳适用场景。本调查旨在为研究人员提供一个实用指南，以根据任务需求、数据特征和推理复杂性选择最合适的图形-LLM方法。

cs.AI / 17 / 2604.15972

Weak-Link Optimization for Multi-Agent Reasoning and Collaboration

多智能体推理与协作的弱链优化

Bian, Haoyu, Zhang, Chaoning, Zhang, Jiaquan, Li, Xingyao, Guo, Yuanfang, Dong, Wei, Yang, Yang

Abstract

LLM-driven multi-agent frameworks address complex reasoning tasks through multi-role collaboration. However, existing approaches often suffer from reasoning instability, where individual agent errors are amplified through collaboration, undermining overall performance. Current research mainly focuses on enhancing high-capability agents or suppressing unreliable outputs to improve framework effectiveness, while systematic identification and reinforcement of performance-limiting agents receive less attention. To address this gap, we propose WORC, a weak-link optimization framework for multi-agent reasoning and collaboration, grounded in the weak-link principle. WORC follows a two-stage workflow. In the weak agent localization stage, task features are constructed, and a meta-learning-based weight predictor trained on optimal configurations identified by swarm intelligence algorithms (SIAs) enables zero-shot mapping from these features to agent performance weights, where the agent with the lowest predicted weight is identified as the weak agent. In the weak-link optimization stage, an uncertainty-driven allocation strategy assigns additional reasoning budgets to weak agents, with lower predicted weights leading to larger repeated-sampling quotas to compensate for reliability deficiencies. Experimental results show that WORC achieves an average accuracy of 82.2% on reasoning benchmarks while improving framework stability and cross-architecture generalization, suggesting that compensating for weak links, rather than reinforcing strengths alone, enhances the robustness of multi-agent systems.

Chinese Translation

基于大型语言模型（LLM）的多智能体框架通过多角色协作解决复杂的推理任务。然而，现有方法往往面临推理不稳定的问题，个别智能体的错误在协作中被放大，从而削弱整体性能。目前的研究主要集中在增强高能力智能体或抑制不可靠输出以提高框架的有效性，而对性能限制智能体的系统识别和强化关注较少。为了解决这一问题，我们提出了WORC，一个基于弱链原则、面向多智能体推理与协作的弱链优化框架（Weak-link Optimization for multi-agent Reasoning and Collaboration）。WORC遵循两个阶段的工作流程。在弱智能体定位阶段，构建任务特征，并通过基于元学习的权重预测器，该预测器在由群体智能算法（SIAs）识别的最佳配置上进行训练，实现从这些特征到智能体性能权重的零样本映射，其中预测权重最低的智能体被识别为弱智能体。在弱链优化阶段，基于不确定性的分配策略为弱智能体分配额外的推理预算，较低的预测权重导致更大的重复采样配额，以弥补可靠性不足。实验结果表明，WORC在推理基准测试中实现了82.2%的平均准确率，同时提高了框架的稳定性和跨架构的泛化能力，表明补偿弱链而非单纯强化优势能够增强多智能体系统的鲁棒性。
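
The second stage described in the abstract (lower predicted weights lead to larger repeated-sampling quotas) can be sketched as a budget split. The inverse-proportional rule below is an assumption made for illustration; the abstract states only the direction of the relationship, and the function names and agent names are hypothetical.

```python
def weakest_agent(predicted_weights):
    """Agent with the lowest predicted performance weight."""
    return min(predicted_weights, key=predicted_weights.get)


def allocate_sampling_quotas(predicted_weights, extra_budget):
    """Split an integer sampling budget inversely to predicted weight.

    Assumes all weights are positive. Shares are rounded down, and any
    samples left over go to the weakest agents first.
    """
    inverse = {agent: 1.0 / w for agent, w in predicted_weights.items()}
    total = sum(inverse.values())
    quotas = {agent: int(extra_budget * v / total) for agent, v in inverse.items()}
    leftover = extra_budget - sum(quotas.values())
    by_weakness = sorted(predicted_weights, key=predicted_weights.get)
    for agent in by_weakness[:leftover]:
        quotas[agent] += 1
    return quotas
```

For predicted weights of 0.5, 0.3, and 0.2 and a budget of 10 extra samples, the agent weighted 0.2 receives the largest quota and the agent weighted 0.5 the smallest, with the quotas summing to the full budget.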

cs.AI / 18 / 2604.15994

ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams

ReactBench：一种用于化学反应图中多模态大型语言模型的拓扑推理基准

Xu, Qiang, Bai, Shengyuan, Wang, Yu, Cao, He, Chen, Leqing, Liu, Yuanyuan, Feng, Bin, Liu, Zijing, Li, Yu

Abstract

Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. Existing benchmarks fail to probe this gap, focusing on semantic comprehension rather than structural reasoning. We introduce ReactBench, a benchmark that reveals fundamental limitations in structural reasoning through chemical reaction diagrams. These real-world scientific diagrams offer an ideal testbed because they naturally span diverse structures from linear chains to cyclic graphs, while requiring both precise local recognition and coherent global reasoning. Our benchmark comprises 1,618 expert-annotated QA pairs across four hierarchical task dimensions. Extensive evaluation across 17 MLLMs reveals a significant performance gap exceeding 30% between anchor-based tasks and holistic structural reasoning tasks. Controlled ablations confirm this bottleneck lies in reasoning, not perception. These findings expose a fundamental deficit in structural understanding and establish directions for advancing visual reasoning.

Chinese Translation

多模态大型语言模型（MLLMs）在识别单个视觉元素和对简单线性图进行推理方面表现出色。然而，当面对涉及分支路径、汇聚流和循环依赖的复杂拓扑结构时，它们的推理能力急剧下降，甚至在基本的任务如计数端点时也如此。现有的基准测试未能探测到这一差距，主要集中在语义理解而非结构推理上。我们引入了ReactBench，一个通过化学反应图揭示结构推理基本局限性的基准。这些真实的科学图表提供了理想的测试平台，因为它们自然涵盖了从线性链到循环图的多样结构，同时需要精确的局部识别和连贯的全局推理。我们的基准包括1,618个专家注释的问答对，涵盖四个层级任务维度。对17个MLLMs的广泛评估显示，基于锚点的任务与整体结构推理任务之间存在超过30%的显著性能差距。控制性消融实验确认这一瓶颈在于推理，而非感知。这些发现揭示了结构理解的基本缺陷，并为推动视觉推理的进展建立了方向。
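To make the "counting endpoints" task concrete, here is a small sketch of how the ground truth could be computed once a diagram is available as a directed graph. The edge-list representation and the definition of an endpoint as a node with no outgoing edge are assumptions; the abstract does not specify either.

```python
def count_endpoints(edges):
    """Count nodes that never appear as the source of an edge.

    `edges` is a list of (source, target) pairs for a reaction diagram.
    Treating "endpoint" as "no outgoing edge" is an illustrative assumption.
    """
    nodes = set()
    sources = set()
    for src, dst in edges:
        nodes.update((src, dst))
        sources.add(src)
    return len(nodes - sources)


linear = [("A", "B"), ("B", "C")]      # A -> B -> C
branching = [("A", "B"), ("A", "C")]   # A branches into B and C
cyclic = [("A", "B"), ("B", "A")]      # A <-> B, no terminal node
```

The three toy graphs mirror the structural families named in the abstract (linear chains, branching paths, cycles); on the graph itself the count is trivial, which is what makes the reported model failures notable.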

cs.AI / 19 / 2604.16009

MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition

MEDLEY-BENCH：在人工智能元认知中，规模能换来评估能力，却换不来控制能力

Abtahi, Farhad, Karbalaie, Abdolamir, Illueca-Fernandez, Eduardo, Seoane, Fernando

Abstract

Metacognition, the ability to monitor and regulate one's own reasoning, remains under-evaluated in AI benchmarking. We introduce MEDLEY-BENCH, a benchmark of behavioural metacognition that separates independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. The benchmark evaluates 35 models from 12 families on 130 ambiguous instances across five domains and reports two complementary scores: the Medley Metacognition Score (MMS), a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, and the Medley Ability Score (MAS), derived from four metacognitive sub-abilities. Results show a robust evaluation/control dissociation: evaluation ability increases with model size within families, whereas control does not. In a follow-up progressive adversarial analysis of 11 models, we observed two behavioural profiles, i.e., models that revise primarily in response to argument quality and models that track consensus statistics. Under within-model relative profiling (ipsative scoring), evaluation was the weakest relative ability in all 35 models, indicating a systematic knowing/doing gap. Smaller and cheaper models often matched or outperformed larger counterparts, suggesting that metacognitive competence is not simply a function of scale. These findings position MEDLEY-BENCH as a tool for measuring belief revision under social pressure and suggest that future training should reward calibrated, proportional updating rather than output quality alone.

Chinese Translation

元认知，即监控和调节自身推理的能力，在人工智能基准测试中仍然未得到充分评估。我们引入了MEDLEY-BENCH，这是一个行为元认知的基准，旨在在真实的模型间分歧下区分独立推理、私人自我修正和社会影响修正。该基准评估了来自12个家族的35个模型在五个领域的130个模糊实例上的表现，并报告了两个互补的分数：Medley元认知分数（MMS），这是一个基于层级的反思更新、社会稳健性和认知表达的综合分数，以及Medley能力分数（MAS），它源自四种元认知子能力。结果显示出评估/控制的稳健分离：在同一家族内，评估能力随着模型规模的增加而增加，而控制能力则没有。在对11个模型进行的后续渐进对抗分析中，我们观察到两种行为特征，即主要根据论据质量进行修正的模型和跟踪共识统计的模型。在模型内部相对评估（自我评分）中，评估在所有35个模型中都是最弱的相对能力，表明存在系统性的知行差距。较小且成本较低的模型往往与较大模型相匹配或超越，表明元认知能力并不仅仅是规模的函数。这些发现将MEDLEY-BENCH定位为在社会压力下测量信念修正的工具，并建议未来的训练应奖励经过校准的、比例更新，而不仅仅是输出质量。
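Ipsative (within-model relative) scoring, mentioned in the MEDLEY-BENCH abstract, is commonly computed by centring each ability score on the same model's mean. The sketch below assumes that simple form; the ability names other than "evaluation" and "control" and all numbers are placeholders, not values from the benchmark.

```python
def ipsative_profile(scores):
    """Centre one model's ability scores on that model's own mean.

    Positive values mark relative strengths and negative values relative
    weaknesses, independent of how strong the model is overall.
    """
    mean = sum(scores.values()) / len(scores)
    return {ability: value - mean for ability, value in scores.items()}


# Placeholder scores for a single model (not benchmark data).
model_scores = {
    "evaluation": 0.55,
    "control": 0.65,
    "ability_c": 0.70,
    "ability_d": 0.80,
}
profile = ipsative_profile(model_scores)
weakest = min(profile, key=profile.get)
```

Because the profile sums to zero for every model, it answers "which ability is weakest for this model" rather than "which model is best", which is the comparison the abstract reports.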

cs.AI / 20 / 2604.16022

SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

SocialGrid：一个用于规划和社会推理的具身多智能体系统基准

Shindo, Hikaru, Lin, Hanzhao, Helff, Lukas, Schramowski, Patrick, Kersting, Kristian

Abstract

As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.

Chinese Translation

随着大型语言模型（LLMs）从文本处理器转变为自主智能体，在具身多智能体环境中评估它们的社会推理能力变得至关重要。我们介绍了SocialGrid，一个受《Among Us》启发的具身多智能体环境，用于评估LLM智能体在规划、任务执行和社会推理方面的表现。我们的评估结果显示，即使是最强大的开放模型（GPT-OSS-120B）在任务完成和规划方面的准确率也低于60%，智能体往往陷入重复行为或无法克服基本障碍。由于糟糕的导航能力会干扰社会智能的评估，SocialGrid提供了一个可选的规划神谕（Planning Oracle），以将社会推理与规划缺陷隔离开来。虽然规划辅助提高了任务完成率，但社会推理仍然是一个瓶颈：智能体在检测欺骗方面的成功率几乎是随机的，无论规模如何，依赖于浅层启发式而不是积累行为证据。SocialGrid提供自动故障分析和细粒度指标，使开发者能够诊断和改进他们的智能体。我们还利用对抗联赛游戏中的Elo评分建立了一个竞争性排行榜。
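The leaderboard in the SocialGrid abstract relies on Elo ratings. The standard Elo update is sketched below; the K-factor of 32 and the starting rating of 1500 are conventional defaults chosen here for illustration, and the abstract does not state which values the authors use.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated ratings after one game.

    `score_a` is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two equally rated agents; agent A wins one game.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
```

With equal starting ratings the expected score is 0.5, so the winner gains half of K and the loser gives up the same amount.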

cs.AI / 21 / 2604.16175

MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation

MARCH：用于CT报告生成的多智能体放射学临床层级

Lin, Yi, Ding, Yihao, Wu, Yonghui, Peng, Yifan

Abstract

Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work demonstrates that modeling human-like organizational structures enhances the reliability of AI in high-stakes medical domains.

Chinese Translation

自动化的3D放射学报告生成常常面临临床幻觉和缺乏人类实践中迭代验证的问题。尽管近期的视觉-语言模型（Vision-Language Models, VLMs）在该领域取得了进展，但它们通常作为单一的“黑箱”系统运行，缺乏临床工作流程中典型的协作监督。为了解决这些挑战，我们提出了MARCH（多智能体放射学临床层级），这是一个模拟放射科部门专业层级的多智能体框架，并为不同的智能体分配专业角色。MARCH利用住院医师智能体进行初步草拟，结合多尺度CT特征提取，多个研究员智能体进行检索增强的修订，以及一个主治医师智能体协调迭代的、基于立场的共识讨论，以解决诊断差异。在RadGenome-ChestCT数据集上，MARCH在临床真实性和语言准确性方面显著优于最先进的基线。我们的工作表明，模拟类人组织结构可以增强AI在高风险医疗领域的可靠性。

cs.AI / 22 / 2604.16258

Characterising LLM-Generated Competency Questions: a Cross-Domain Empirical Study using Open and Closed Models

特征化LLM生成的能力问题：基于开放和封闭模型的跨领域实证研究

Alharbi, Reham, Tamma, Valentina, Payne, Terry R., de Berardinis, Jacopo

Abstract

Competency Questions (CQs) are a cornerstone of requirement elicitation in ontology engineering. CQs represent requirements as a set of natural language questions that an ontology should satisfy; they are traditionally modelled by ontology engineers together with domain experts as part of a human-centred, manual elicitation process. The use of Generative AI automates CQ creation at scale, therefore democratising the process of generation, widening stakeholder engagement, and ultimately broadening access to ontology engineering. However, given the large and heterogeneous landscape of LLMs, varying in dimensions such as parameter scale, task and domain specialisation, and accessibility, it is crucial to characterise and understand the intrinsic, observable properties of the CQs they produce (e.g., readability, structural complexity) through a systematic, cross-domain analysis. This paper introduces a set of quantitative measures for the systematic comparison of CQs across multiple dimensions. Using CQs generated from well defined use cases and scenarios, we identify their salient properties, including readability, relevance with respect to the input text and structural complexity of the generated questions. We conduct our experiments over a set of use cases and requirements using a range of LLMs, including both open (KimiK2-1T, LLama3.1-8B, LLama3.2-3B) and closed models (Gemini 2.5 Pro, GPT 4.1). Our analysis demonstrates that LLM performance reflects distinct generation profiles shaped by the use case.

Chinese Translation

能力问题（CQs）是本体工程需求获取的基石。能力问题以一组自然语言问题的形式表示本体应满足的需求；它们通常由本体工程师与领域专家共同建模，作为以人为中心的手动获取过程的一部分。生成性人工智能的使用在规模上自动化了能力问题的创建，从而使生成过程民主化，扩大了利益相关者的参与，最终拓宽了对本体工程的访问。然而，考虑到大型语言模型（LLMs）的大规模和异质性，包括参数规模、任务和领域专业化以及可访问性等维度，系统性地特征化和理解它们生成的能力问题的内在可观察属性（例如，可读性、结构复杂性）至关重要。本文引入了一组定量指标，用于在多个维度上系统比较能力问题。通过使用从明确定义的用例和场景生成的能力问题，我们识别出它们的显著属性，包括可读性、与输入文本的相关性以及生成问题的结构复杂性。我们在一组用例和需求上进行实验，使用多种大型语言模型，包括开放模型（KimiK2-1T、LLama3.1-8B、LLama3.2-3B）和封闭模型（Gemini 2.5 Pro、GPT 4.1）。我们的分析表明，LLM的性能反映了由用例塑造的不同生成特征。

cs.AI / 23 / 2604.16278

Learning to Reason with Insight for Informal Theorem Proving

通过洞察学习推理以进行非正式定理证明

Li, Yunhe, Shi, Hao, Deng, Bowen, Wang, Wei, Ruan, Mengzhe, Hou, Hanxu, Dai, Zhongxiang, Gao, Siyang, Wang, Chao, Qiu, Shuang, Song, Linqi

Abstract

Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language models' (LLMs) strength in natural language processing. In this work, we identify a primary bottleneck in informal theorem proving as a lack of insight, namely the difficulty of recognizing the core techniques required to solve complex problems. To address this, we propose a novel framework designed to cultivate this essential reasoning skill and enable LLMs to perform insightful reasoning. We propose $\mathtt{DeepInsightTheorem}$, a hierarchical dataset that structures informal proofs by explicitly extracting core techniques and proof sketches alongside the final proof. To fully exploit this dataset, we design a Progressive Multi-Stage SFT strategy that mimics the human learning process, guiding the model from basic proof writing to insightful thinking. Our experiments on challenging mathematical benchmarks demonstrate that this insight-aware generation strategy significantly outperforms baselines. These results demonstrate that teaching models to identify and apply core techniques can substantially improve their mathematical reasoning.

Chinese Translation

尽管大多数自动定理证明方法依赖于形式证明系统，但非正式定理证明更能与大型语言模型（LLMs）在自然语言处理方面的优势相契合。在本研究中，我们确定了非正式定理证明的一个主要瓶颈，即缺乏洞察力，即识别解决复杂问题所需核心技术的困难。为了解决这一问题，我们提出了一种新颖的框架，旨在培养这一基本推理技能，使LLMs能够进行富有洞察力的推理。我们提出了$\mathtt{DeepInsightTheorem}$，这是一个分层数据集，通过明确提取核心技术和证明草图以及最终证明来构建非正式证明。为了充分利用该数据集，我们设计了一种渐进式多阶段SFT策略，模拟人类学习过程，引导模型从基本的证明写作到富有洞察力的思考。我们在具有挑战性的数学基准测试中的实验表明，这种关注洞察的生成策略显著优于基线。这些结果表明，教会模型识别和应用核心技术可以显著提高其数学推理能力。

cs.AI / 24 / 2604.16280

Using Large Language Models and Knowledge Graphs to Improve the Interpretability of Machine Learning Models in Manufacturing

利用大型语言模型和知识图谱提高制造业机器学习模型的可解释性

Bayer, Thomas, Lohr, Alexander, Weiß, Sarah, Michelberger, Bernd, Höpken, Wolfram

Abstract

Explaining Machine Learning (ML) results in a transparent and user-friendly manner remains a challenging task of Explainable Artificial Intelligence (XAI). In this paper, we present a method to enhance the interpretability of ML models by using a Knowledge Graph (KG). We store domain-specific data along with ML results and their corresponding explanations, establishing a structured connection between domain knowledge and ML insights. To make these insights accessible to users, we designed a selective retrieval method in which relevant triplets are extracted from the KG and processed by a Large Language Model (LLM) to generate user-friendly explanations of ML results. We evaluated our method in a manufacturing environment using the XAI Question Bank. Beyond standard questions, we introduce more complex, tailored questions that highlight the strengths of our approach. We evaluated 33 questions, analyzing responses using quantitative metrics such as accuracy and consistency, as well as qualitative ones such as clarity and usefulness. Our contribution is both theoretical and practical: from a theoretical perspective, we present a novel approach for effectively enabling LLMs to dynamically access a KG in order to improve the explainability of ML results. From a practical perspective, we provide empirical evidence showing that such explanations can be successfully applied in real-world manufacturing environments, supporting better decision-making in manufacturing processes.

Chinese Translation

以透明和用户友好的方式解释机器学习（ML）结果仍然是可解释人工智能（XAI）面临的一项挑战。在本文中，我们提出了一种通过使用知识图谱（KG）来增强机器学习模型可解释性的方法。我们存储领域特定的数据以及机器学习结果及其相应的解释，建立了领域知识与机器学习洞察之间的结构化连接。为了使这些洞察对用户可访问，我们设计了一种选择性检索方法，从知识图谱中提取相关三元组，并由大型语言模型（LLM）处理，以生成用户友好的机器学习结果解释。我们在制造环境中使用XAI问题库对我们的方法进行了评估。除了标准问题外，我们还引入了更复杂、量身定制的问题，以突出我们方法的优势。我们评估了33个问题，使用准确性和一致性等定量指标以及清晰度和实用性等定性指标分析了回答。我们的贡献既具有理论意义，也具有实践意义：从理论角度来看，我们提出了一种新颖的方法，使大型语言模型能够动态访问知识图谱，从而提高机器学习结果的可解释性；从实践角度来看，我们提供了实证证据，表明这种解释可以成功应用于现实世界的制造环境中，支持制造过程中的更好决策。
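The selective retrieval step in the abstract (pull relevant triplets from the KG, then hand them to an LLM) might look roughly like the sketch below. The keyword-overlap matching, the function names, and the toy triplets are hypothetical; the paper's actual retrieval method and graph schema are not described in the abstract, and no LLM call is made here.

```python
def retrieve_triplets(triplets, question, limit=5):
    """Keep triplets whose subject or object is mentioned in the question.

    Plain keyword overlap is an illustrative stand-in for the paper's
    selective retrieval method.
    """
    terms = {token.lower().strip("?.,") for token in question.split()}
    hits = [t for t in triplets if t[0].lower() in terms or t[2].lower() in terms]
    return hits[:limit]


def build_prompt(question, triplets):
    """Format the retrieved facts and the question as a prompt string."""
    facts = "\n".join(f"- {s} {p} {o}" for s, p, o in triplets)
    return (
        f"Facts from the knowledge graph:\n{facts}\n\n"
        f"Question: {question}\n"
        "Answer using only the facts above."
    )


# Toy knowledge graph with made-up entities.
kg = [
    ("ModelA", "predicts", "defect"),
    ("Machine7", "has_temperature", "high"),
    ("Machine3", "has_temperature", "normal"),
]
question = "Why did ModelA flag Machine7?"
selected = retrieve_triplets(kg, question)
prompt = build_prompt(question, selected)
```

Restricting the prompt to retrieved triplets keeps the explanation tied to stored facts, which is the structured connection between domain knowledge and ML results that the abstract describes.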

cs.AI / 25 / 2604.16286

ASMR-Bench: Auditing for Sabotage in ML Research

ASMR-Bench：机器学习研究中的破坏审计

Gan, Eric, Bhatt, Aryan, Shlegeris, Buck, Stastny, Julian, Hebbar, Vivek

Abstract

As AI systems are increasingly used to conduct research autonomously, misaligned systems could introduce subtle flaws that produce misleading results while evading detection. We introduce ASMR-Bench (Auditing for Sabotage in ML Research), a benchmark for evaluating the ability of auditors to detect sabotage in ML research codebases. ASMR-Bench consists of 9 ML research codebases with sabotaged variants that produce qualitatively different experimental results. Each sabotage modifies implementation details, such as hyperparameters, training data, or evaluation code, while preserving the high-level methodology described in the paper. We evaluated frontier LLMs and LLM-assisted human auditors on ASMR-Bench and found that both struggled to reliably detect sabotage: the best performance was an AUROC of 0.77 and a top-1 fix rate of 42%, achieved by Gemini 3.1 Pro. We also tested LLMs as red teamers and found that LLM-generated sabotages were weaker than human-generated ones but still sometimes evaded same-capability LLM auditors. We release ASMR-Bench to support research on monitoring and auditing techniques for AI-conducted research.

Chinese Translation

随着人工智能系统越来越多地被用于自主进行研究，未对齐的系统可能会引入微妙的缺陷，从而产生误导性结果并逃避检测。我们介绍了ASMR-Bench（机器学习研究中的破坏审计），这是一个用于评估审计员在机器学习研究代码库中检测破坏能力的基准。ASMR-Bench包含9个机器学习研究代码库及其破坏变体，这些变体产生性质上不同的实验结果。每个破坏修改了实现细节，例如超参数、训练数据或评估代码，同时保持论文中描述的高级方法论。我们在ASMR-Bench上评估了前沿的大型语言模型（LLMs）和LLM辅助的人类审计员，发现两者在可靠检测破坏方面都存在困难：最佳表现为0.77的AUROC和42%的top-1修复率，由Gemini 3.1 Pro实现。我们还测试了LLMs作为红队成员，发现LLM生成的破坏比人类生成的弱，但有时仍能逃避同等能力的LLM审计员。我们发布ASMR-Bench，以支持针对由人工智能开展的研究的监控和审计技术研究。
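The AUROC reported in the ASMR-Bench abstract measures how well an auditor's suspicion scores rank sabotaged codebases above clean ones. A minimal pairwise implementation is sketched below; the scores and labels are invented for illustration and are not benchmark data.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen sabotaged example
    (label 1) receives a higher suspicion score than a randomly chosen
    clean example (label 0); ties count as half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sabotaged and one clean example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented auditor suspicion scores; 1 = sabotaged, 0 = clean.
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]
result = auroc(scores, labels)  # 3 of 4 pairs ranked correctly
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the 0.77 quoted in the abstract sits between the two.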

计算语言学 (Computation and Language)

cs.CL / 1 / 2604.15371

Applied Explainability for Large Language Models: A Comparative Study

大型语言模型的应用可解释性：一项比较研究

Kancharla, Venkata Abhinandan

Abstract

Large language models (LLMs) achieve strong performance across many natural language processing tasks, yet their decision processes remain difficult to interpret. This lack of transparency creates challenges for trust, debugging, and deployment in real-world systems. This paper presents an applied comparative study of three explainability techniques: Integrated Gradients, Attention Rollout, and SHAP, on a fine-tuned DistilBERT model for SST-2 sentiment classification. Rather than proposing new methods, the focus is on evaluating the practical behavior of existing approaches under a consistent and reproducible setup. The results show that gradient-based attribution provides more stable and intuitive explanations, while attention-based methods are computationally efficient but less aligned with prediction-relevant features. Model-agnostic approaches offer flexibility but introduce higher computational cost and variability. This work highlights key trade-offs between explainability methods and emphasizes their role as diagnostic tools rather than definitive explanations. The findings provide practical insights for researchers and engineers working with transformer-based NLP systems. This is a preprint and has not undergone peer review.

Chinese Translation

大型语言模型（LLMs）在许多自然语言处理任务中表现出色，但其决策过程仍然难以解释。这种缺乏透明性为信任、调试和在现实系统中的部署带来了挑战。本文对三种可解释性技术进行了应用比较研究：集成梯度（Integrated Gradients）、注意力展开（Attention Rollout）和SHAP，针对经过微调的DistilBERT模型在SST-2情感分类任务中的表现。研究的重点不是提出新方法，而是评估现有方法在一致且可重复的设置下的实际表现。结果表明，基于梯度的归因提供了更稳定和直观的解释，而基于注意力的方法在计算上更高效，但与预测相关特征的对齐程度较低。模型无关的方法提供了灵活性，但引入了更高的计算成本和变异性。这项工作突出了可解释性方法之间的关键权衡，并强调它们作为诊断工具而非决定性解释的作用。研究结果为从事基于变换器的自然语言处理系统的研究人员和工程师提供了实用的见解。这是一篇预印本，尚未经过同行评审。
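Of the three techniques compared, Attention Rollout is the simplest to show in a few lines. The sketch below follows the commonly described formulation (average the heads, mix each layer's attention with the identity to account for residual connections, renormalise rows, then multiply across layers). It is not the paper's code, and the tiny two-token matrices are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def attention_rollout(layers):
    """Combine per-layer attention into input-token attributions.

    `layers` holds one head-averaged attention matrix (n x n) per layer,
    ordered from the first layer to the last.
    """
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Mix in the identity for the residual connection.
        mixed = [
            [0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        # Renormalise so each row is a distribution again.
        mixed = [[value / sum(row) for value in row] for row in mixed]
        rollout = matmul(mixed, rollout)
    return rollout


# Made-up head-averaged attention for a 2-token, 2-layer model.
layers = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.3, 0.7]],
]
rollout = attention_rollout(layers)
```

The method needs only forward-pass attention weights and a few matrix products, which is consistent with the abstract's note that attention-based methods are computationally efficient.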

cs.CL / 2 / 2604.15490

Think Multilingual, Not Harder: A Data-Efficient Framework for Teaching Reasoning Models to Code-Switch

多语言思维，而非更努力：一个数据高效的框架用于教授推理模型进行语言切换

Lin, Eleanor M., Jurgens, David

Abstract

Recent developments in reasoning capabilities have enabled large language models to solve increasingly complex mathematical, symbolic, and logical tasks. Interestingly, while reasoning models are often trained to generate monolingual text, these models have also been observed to code-switch (i.e., mix languages). Prior works have either viewed code-switching as an undesirable error, attempted to control code-switching through modifications to input prompts or the output decoding process, or focus on narrow subsets of languages, domains, tasks, and models. We address these gaps by introducing the first linguistically and behaviorally motivated fine-tuning framework for identifying beneficial code-switched reasoning behaviors in large language models and teaching these models to code-switch more effectively for reasoning. First, we create and systematically analyze a dataset of reasoning traces from diverse models, languages, tasks, and domains to understand the types of code-switching behaviors found in existing reasoning models. Then, we develop fine-tuning interventions that teach reasoning models to code-switch based on our observations of helpful behaviors in existing models. We find that our framework can significantly increase beneficial code-switched reasoning behaviors in a data-efficient manner. Interestingly, we also find that code-switching behaviors in reasoning models can be modified by fine-tuning for tasks that do not directly demonstrate code-switching in reasoning (e.g., machine translation). Our work suggests that data-efficient interventions can instill helpful forms of code-switching behavior in reasoning models.

Chinese Translation

最近在推理能力方面的发展使得大型语言模型能够解决日益复杂的数学、符号和逻辑任务。有趣的是，尽管推理模型通常被训练生成单语文本，但这些模型也被观察到能够进行语言切换（即混合语言）。之前的研究要么将语言切换视为一种不良错误，要么试图通过修改输入提示或输出解码过程来控制语言切换，或者专注于狭窄的语言、领域、任务和模型子集。我们通过引入第一个在语言学和行为上具有动机的微调框架，来填补这些空白，以识别大型语言模型中有益的语言切换推理行为，并教导这些模型更有效地进行语言切换。首先，我们创建并系统分析了一个来自不同模型、语言、任务和领域的推理轨迹数据集，以理解现有推理模型中发现的语言切换行为类型。然后，我们开发了微调干预措施，教导推理模型基于我们对现有模型中有益行为的观察进行语言切换。我们发现我们的框架可以以数据高效的方式显著增加有益的语言切换推理行为。有趣的是，我们还发现推理模型中的语言切换行为可以通过微调修改，以适应那些不直接展示语言切换的推理任务（例如，机器翻译）。我们的工作表明，数据高效的干预措施可以在推理模型中灌输有益的语言切换行为形式。

cs.CL / 3 / 2604.15503

Brain Score Tracks Shared Properties of Languages: Evidence from Many Natural Languages and Structured Sequences

脑评分追踪语言的共享特性：来自多种自然语言和结构化序列的证据

Qu, Jingnong, Ranjan, Ashvin, Steinert-Threlkeld, Shane

Abstract

Recent breakthroughs in language models (LMs) using neural networks have raised the question: how similar are these models' processing to human language processing? Results using a framework called Brain Score (BS) -- predicting fMRI activations during reading from LM activations -- have been used to argue for a high degree of similarity. To understand this similarity, we conduct experiments by training LMs on various types of input data and evaluate them on BS. We find that models trained on various natural languages from many different language families have very similar BS performance. LMs trained on other structured data -- the human genome, Python, and pure hierarchical structure (nested parentheses) -- also perform reasonably well and close to natural languages in some cases. These findings suggest that BS can highlight language models' ability to extract common structure across natural languages, but that the metric may not be sensitive enough to allow us to infer human-like processing from a high BS score alone.

Chinese Translation

近期神经网络语言模型（LMs）的突破引发了一个问题：这些模型的处理与人类语言处理有多相似？使用一种称为脑评分（Brain Score, BS）的框架，通过预测阅读时的功能性磁共振成像（fMRI）激活来评估LM激活，结果表明两者具有高度相似性。为了理解这种相似性，我们通过在各种类型的输入数据上训练LM，并在BS上进行评估。我们发现，来自许多不同语言家族的多种自然语言训练的模型在BS表现上非常相似。训练于其他结构化数据（如人类基因组、Python和纯层次结构（嵌套括号））的LM在某些情况下也表现良好，接近自然语言。这些发现表明，BS能够突出语言模型提取自然语言共同结构的能力，但该指标可能不够敏感，无法仅凭高BS分数推断出类人处理能力。

cs.CL / 4 / 2604.15505

PolicyBank: Evolving Policy Understanding for LLM Agents

PolicyBank：为大型语言模型代理演变政策理解

Choi, Jihye, Yoon, Jinsung, Le, Long T., Jha, Somesh, Pfister, Tomas

Abstract

LLM agents operating under organizational policies must comply with authorization constraints typically specified in natural language. In practice, such specifications inevitably contain ambiguities and logical or semantic gaps that cause the agent's behavior to systematically diverge from the true requirements. We ask: by letting an agent evolve its policy understanding through interaction and corrective feedback from pre-deployment testing, can it autonomously refine its interpretation to close specification gaps? We propose PolicyBank, a memory mechanism that maintains structured, tool-level policy insights and iteratively refines them -- unlike existing memory mechanisms that treat the policy as immutable ground truth, reinforcing "compliant but wrong" behaviors. We also contribute a systematic testbed by extending a popular tool-calling benchmark with controlled policy gaps that isolate alignment failures from execution failures. While existing memory mechanisms achieve near-zero success on policy-gap scenarios, PolicyBank closes up to 82% of the gap toward a human oracle.

Chinese Translation

在组织政策下运行的大型语言模型（LLM）代理必须遵守通常以自然语言指定的授权约束。在实践中，这些规范不可避免地包含模糊性以及导致代理行为系统性偏离真实要求的逻辑或语义缺口。我们提出一个问题：通过让代理在预部署测试中通过互动和纠正反馈来演变其政策理解，是否可以自主地细化其解释以填补规范缺口？我们提出了PolicyBank，这是一种记忆机制，维护结构化的工具级政策洞察并进行迭代细化——与现有的将政策视为不可变真实的记忆机制不同，后者强化了“合规但错误”的行为。我们还通过扩展一个流行的工具调用基准，贡献了一个系统化的测试平台，该平台具有可控的政策缺口，以隔离对齐失败与执行失败。尽管现有的记忆机制在政策缺口场景中几乎没有成功，PolicyBank在向人类oracle的缺口填补方面达到了82%。
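The phrase "closes up to 82% of the gap toward a human oracle" is most naturally read as a normalised improvement over a baseline. The abstract does not define the metric, so the formula below is an assumed reading, and the numbers are toy values rather than results from the paper.

```python
def gap_closed(baseline, method, oracle):
    """Fraction of the baseline-to-oracle gap recovered by a method.

    Assumed definition: (method - baseline) / (oracle - baseline).
    """
    if oracle == baseline:
        raise ValueError("oracle and baseline must differ")
    return (method - baseline) / (oracle - baseline)


# Toy success rates in percent (not taken from the paper).
fraction = gap_closed(baseline=10.0, method=50.0, oracle=60.0)
```

Under this reading a value of 0.0 means no improvement over the baseline and 1.0 means matching the oracle.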

cs.CL / 5 / 2604.15547

Consistency Analysis of Sentiment Predictions using Syntactic & Semantic Context Assessment Summarization (SSAS)

基于句法和语义上下文评估摘要的情感预测一致性分析 (SSAS)

Daruwalla, Sharookh, Mayande, Nitin, Kathuria, Shreeya Verma, Joglekar, Nitin, Weber, Charles

Abstract

The fundamental challenge of using Large Language Models (LLMs) for reliable, enterprise-grade analytics, such as sentiment prediction, is the conflict between the LLMs' inherent stochasticity (generative, non-deterministic nature) and the analytical requirement for consistency. The LLM inconsistency, coupled with the noisy nature of chaotic modern datasets, renders sentiment predictions too volatile for strategic business decisions. To resolve this, we present a Syntactic & Semantic Context Assessment Summarization (SSAS) framework for establishing context. Context established by SSAS functions as a sophisticated data pre-processing framework that enforces a bounded attention mechanism on LLMs. It achieves this by applying a hierarchical classification structure (Themes, Stories, Clusters) and an iterative Summary-of-Summaries (SoS) based context computation architecture. This endows the raw text with high-signal, sentiment-dense prompts, that effectively mitigate both irrelevant data and analytical variance. We empirically evaluated the efficacy of SSAS, using Gemini 2.0 Flash Lite, against a direct-LLM approach across three industry-standard datasets - Amazon Product Reviews, Google Business Reviews, Goodreads Book Reviews - and multiple robustness scenarios. Our results show that our SSAS framework is capable of significantly improving data quality, up to 30%, through a combination of noise removal and improvement in the estimation of sentiment prediction. Ultimately, consistency in our context-estimation capabilities provides a stable and reliable evidence base for decision-making.

Chinese Translation

使用大型语言模型 (LLMs) 进行可靠的企业级分析（如情感预测）面临的根本挑战在于 LLMs 内在的随机性（生成性、非确定性特性）与分析一致性要求之间的冲突。LLM 的不一致性，加上现代混乱数据集的噪声特性，使得情感预测对于战略业务决策而言过于波动。为了解决这一问题，我们提出了一种句法和语义上下文评估摘要 (SSAS) 框架，以建立上下文。SSAS 建立的上下文作为一个复杂的数据预处理框架，强制 LLMs 采用有界注意力机制。它通过应用分层分类结构（主题、故事、集群）和基于迭代的摘要汇总 (SoS) 上下文计算架构来实现这一点。这为原始文本赋予了高信号、情感密集的提示，有效减轻了无关数据和分析方差。我们通过使用 Gemini 2.0 Flash Lite 对比直接 LLM 方法，在三个行业标准数据集（亚马逊产品评论、谷歌商业评论、Goodreads 图书评论）和多个鲁棒性场景中对 SSAS 的有效性进行了实证评估。我们的结果表明，SSAS 框架能够通过噪声去除和情感预测估计的改善，显著提高数据质量，最高可达 30%。最终，我们在上下文估计能力上的一致性为决策提供了稳定可靠的证据基础。

cs.CL / 6 / 2604.15574

Why Fine-Tuning Encourages Hallucinations and How to Fix It

为何微调会导致幻觉现象及其解决方法

Kaplan, Guy, Gekhman, Zorik, Zhu, Zhen, Rozner, Lotem, Reif, Yuval, Swayamdipta, Swabha, Hoiem, Derek, Schwartz, Roy

Abstract

Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations w.r.t. pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups, can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.

Chinese Translation

大型语言模型容易产生事实错误的幻觉。导致这些错误的一个关键来源是通过监督微调（SFT）接触新的事实信息，这可能会增加与预训练期间获得的知识相关的幻觉。在本研究中，我们探讨了是否可以利用持续学习文献中的既有工具来减轻SFT引发的幻觉，因为这些幻觉是训练过程中知识退化的副产品。我们提出了一种基于自蒸馏的SFT方法，该方法通过正则化输出分布漂移，促进有效的事实学习，同时最小化与现有知识相关的幻觉。我们还表明，在新知识获取不必要的情况下，通过冻结参数组来抑制事实可塑性，可以在减少幻觉的同时保持任务性能。最后，我们通过三个假设探讨了SFT引发幻觉的机制：容量限制、行为克隆和局部干扰。我们的实验表明，主要驱动因素是重叠语义表示之间的干扰，而自蒸馏通过减轻这种干扰而成功。
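Regularising output-distribution drift, as the abstract describes, is often implemented as a fine-tuning loss plus a KL penalty toward the pre-fine-tuning model. The sketch below shows that generic form for a single token position. The weighting `lam`, the KL direction, and the function names are assumptions, since the abstract does not give the paper's exact objective.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def sft_with_drift_penalty(student_logits, teacher_logits, target_index, lam=0.5):
    """Cross-entropy on the target token plus a penalty for drifting away
    from the frozen pre-fine-tuning model (the "teacher")."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    cross_entropy = -math.log(student[target_index] + 1e-12)
    drift = kl_divergence(teacher, student)
    return cross_entropy + lam * drift


# Three-token vocabulary; the fine-tuned model has not drifted yet.
loss_no_drift = sft_with_drift_penalty([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], target_index=2)
# Same teacher, but the fine-tuned model now disagrees with it.
loss_drifted = sft_with_drift_penalty([3.0, 2.0, 1.0], [1.0, 2.0, 3.0], target_index=2)
```

When the two models agree the penalty is zero and the loss reduces to ordinary cross-entropy; the penalty grows as the fine-tuned distribution moves away from the original one.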

cs.CL / 7 / 2604.15588

"Excuse me, may I say something..." CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations

抱歉，我可以说几句话吗……CoLabScience：一个用于生物医学发现和大型语言模型专家协作的主动AI助手

Wu, Yang, Yu, Jinhong, Xiong, Jingwei, Tao, Zhimin, Liu, Xiaozhong

Abstract

The integration of Large Language Models (LLMs) into scientific workflows presents exciting opportunities to accelerate biomedical discovery. However, the reactive nature of LLMs, which respond only when prompted, limits their effectiveness in collaborative settings that demand foresight and autonomous engagement. In this study, we introduce CoLabScience, a proactive LLM assistant designed to enhance biomedical collaboration between AI systems and human experts through timely, context-aware interventions. At the core of our method is PULI (Positive-Unlabeled Learning-to-Intervene), a novel framework trained with a reinforcement learning objective to determine when and how to intervene in streaming scientific discussions, by leveraging the team's project proposal and long- and short-term conversational memory. To support this work, we introduce BSDD (Biomedical Streaming Dialogue Dataset), a new benchmark of simulated research discussion dialogues with intervention points derived from PubMed articles. Experimental results show that PULI significantly outperforms existing baselines in both intervention precision and collaborative task utility, highlighting the potential of proactive LLMs as intelligent scientific assistants.

Chinese Translation

将大型语言模型（LLMs）整合到科学工作流程中为加速生物医学发现带来了令人兴奋的机遇。然而，LLMs的反应性特征使其只能在被提示时作出响应，这限制了它们在需要前瞻性和自主参与的协作环境中的有效性。在本研究中，我们介绍了CoLabScience，一个主动的LLM助手，旨在通过及时、上下文感知的干预来增强AI系统与人类专家之间的生物医学协作。我们方法的核心是PULI（Positive-Unlabeled Learning-to-Intervene），这是一个新颖的框架，采用强化学习目标进行训练，以确定何时以及如何在流动的科学讨论中进行干预，利用团队的项目提案以及长短期对话记忆。为了支持这项工作，我们引入了BSDD（Biomedical Streaming Dialogue Dataset），这是一个新的基准，包含从PubMed文章中衍生的干预点的模拟研究讨论对话。实验结果表明，PULI在干预精度和协作任务效用方面显著优于现有基准，突显了主动LLMs作为智能科学助手的潜力。

cs.CL / 8 / 2604.15589

LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance

不同微调策略和模型规模下的大型语言模型归因分析用于自动化代码合规

Shi, Jack Wei Lun, Dang, Minghao, Solihin, Wawan, Yeoh, Justin K. W.

Abstract

Existing research on large language models (LLMs) for automated code compliance has primarily focused on performance, treating the models as black boxes and overlooking how training decisions affect their interpretive behavior. This paper addresses this gap by employing a perturbation-based attribution analysis to compare the interpretive behaviors of LLMs across different fine-tuning strategies such as full fine-tuning (FFT), low-rank adaptation (LoRA) and quantized LoRA fine-tuning, as well as the impact of model scales which include varying LLM parameter sizes. Our results show that FFT produces attribution patterns that are statistically different and more focused than those from parameter-efficient fine-tuning methods. Furthermore, we found that as model scale increases, LLMs develop specific interpretive strategies such as prioritizing numerical constraints and rule identifiers in the building text, albeit with performance gains in semantic similarity of the generated and reference computer-processable rules plateauing for models larger than 7B. This paper provides crucial insights into the explainability of these models, taking a step toward building more transparent LLMs for critical, regulation-based tasks in the Architecture, Engineering, and Construction industry.

Chinese Translation

现有关于大型语言模型（LLMs）在自动化代码合规方面的研究主要集中于性能，通常将模型视为黑箱，忽视了训练决策如何影响其可解释行为。本文通过采用基于扰动的归因分析，填补了这一空白，比较了不同微调策略（如全微调（FFT）、低秩适应（LoRA）和量化LoRA微调）下LLMs的可解释行为，以及模型规模（包括不同LLM参数大小）的影响。我们的结果表明，FFT产生的归因模式在统计上与参数高效微调方法的模式显著不同且更为集中。此外，我们发现随着模型规模的增加，LLMs发展出特定的可解释策略，例如在构建文本中优先考虑数值约束和规则标识符，尽管生成的计算机可处理规则与参考规则的语义相似性在超过70亿参数的模型中趋于平稳。本文为这些模型的可解释性提供了重要见解，朝着为建筑、工程和施工行业中的关键法规任务构建更透明的LLMs迈出了重要一步。

cs.CL / 9 / 2604.15593

DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation

DALM：一种通过三阶段结构化生成的领域代数语言模型

Li, Chao

Abstract

Large language models compress heterogeneous knowledge into a single parameter space, allowing facts from different domains to interfere during generation. We propose DALM, a Domain-Algebraic Language Model that replaces unconstrained token generation with structured denoising over a domain lattice. DALM follows a three-phase generation path: it first resolves domain uncertainty, then relation uncertainty, and finally concept uncertainty, so each stage operates under explicit algebraic constraints. The framework requires only three ingredients: a lattice of domains with computable meet, join, and implication; a typing function over relations that controls inheritance across domains; and a fiber partition that localizes knowledge to domain-specific subsets. Given these ingredients, DALM yields a three-phase encoder-decoder architecture in which generation is confined to a domain fiber, cross-domain contamination is structurally prevented in closed-vocabulary mode and auditably bounded in open-vocabulary mode, and a single query can produce a domain-indexed multi-perspective answer space. We instantiate the framework with the CDC knowledge representation system and outline training and evaluation on validated domain-annotated crystal libraries. DALM reframes language generation as algebraically constrained structured denoising rather than unconstrained decoding over a flat token space.

Chinese Translation

大型语言模型将异构知识压缩到单一的参数空间中，使得来自不同领域的事实在生成过程中可能相互干扰。我们提出了DALM（Domain-Algebraic Language Model），它用结构化去噪代替了不受限制的标记生成，基于领域格进行操作。DALM遵循三阶段生成路径：首先解决领域不确定性，然后是关系不确定性，最后是概念不确定性，因此每个阶段都在明确的代数约束下进行。该框架仅需三个要素：一个具有可计算的交、并和蕴含的领域格；一个控制领域间继承的关系类型函数；以及一个将知识局限于领域特定子集的纤维划分。在这些要素的基础上，DALM产生了一个三阶段的编码-解码架构，其中生成被限制在一个领域纤维内，跨领域污染在封闭词汇模式下被结构性地防止，而在开放词汇模式下则被可审计地界定，并且单个查询可以生成一个领域索引的多视角答案空间。我们用CDC知识表示系统实例化该框架，并概述了在经过验证的领域注释晶体库上的训练和评估。DALM将语言生成重新框架为代数约束的结构化去噪，而不是在平坦的标记空间上进行不受限制的解码。

cs.CL / 10 / 2604.15597

LLMs Corrupt Your Documents When You Delegate

当你委托时，大型语言模型会破坏你的文档

Laban, Philippe, Schnabel, Tobias, Neville, Jennifer

Abstract

Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust - the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.

Chinese Translation

大型语言模型（LLMs）有可能颠覆知识工作，委托工作作为一种新的交互范式（例如，氛围编码）应运而生。委托工作需要信任——即期望LLM能够忠实地执行任务，而不会在文档中引入错误。我们引入了DELEGATE-52来研究AI系统在委托工作流程中的准备情况。DELEGATE-52模拟了需要深入文档编辑的长期委托工作流程，涵盖52个专业领域，如编码、晶体学和音乐记谱。我们对19个LLM的大规模实验表明，当前模型在委托过程中会降低文档质量：即使是前沿模型（Gemini 3.1 Pro、Claude 4.6 Opus、GPT 5.4）在长时间工作流程结束时也会平均破坏25%的文档内容，其他模型的失败情况更为严重。额外实验表明，工具的自主使用并未改善DELEGATE-52的表现，并且文档大小、交互时长或干扰文件的存在会加剧降级的严重性。我们的分析表明，当前的LLM是不可靠的委托者：它们会引入稀疏但严重的错误，默默地破坏文档，并在长时间交互中不断加剧。

cs.CL / 11 / 2604.15602

GroupDPO: Memory efficient Group-wise Direct Preference Optimization

GroupDPO：内存高效的组内直接偏好优化

Leng, Jixuan, Si, Si, Yu, Hsiang-Fu, Raman, Vinod, Dhillon, Inderjit S.

Abstract

Preference optimization is widely used to align Large Language Models (LLMs) with preference feedback. However, most existing methods train on a single positive-negative pair per prompt, discarding additional supervision available in preference datasets that typically contain multiple candidate responses. Motivated by this limitation, recent work explores group-wise preference optimization, which jointly contrasts multiple responses for the same prompt, but its empirical behavior and scalability remain underexplored due to the memory overhead of group-coupled objectives. In this work, we introduce a memory-efficient group-wise preference optimization algorithm that preserves gradients while decoupling samples during backpropagation, substantially reducing peak memory usage, which enables scalable training with larger group sizes. Across both offline and online alignment settings, we show that leveraging multiple responses consistently outperforms single-pair training. Furthermore, incorporating a negative log-likelihood (NLL) term on positive responses is critical for both performance gains and training stability.

Chinese Translation

偏好优化广泛用于将大型语言模型（LLMs）与偏好反馈对齐。然而，大多数现有方法在每个提示上仅训练一个正负对，忽略了偏好数据集中通常包含的多个候选响应所提供的额外监督。受到这一限制的启发，最近的研究探索了组内偏好优化，该方法针对同一提示联合对比多个响应，但由于组耦合目标的内存开销，其经验表现和可扩展性仍然未得到充分探讨。在本研究中，我们提出了一种内存高效的组内偏好优化算法，该算法在反向传播过程中解耦样本的同时保留梯度，显著降低了峰值内存使用，从而实现了更大组规模的可扩展训练。在离线和在线对齐设置中，我们展示了利用多个响应的效果始终优于单对训练。此外，在正响应上加入负对数似然（NLL）项对于性能提升和训练稳定性至关重要。

cs.CL / 12 / 2604.15607

Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies

不完全合作的人类-人工智能互动：比较人类与人工智能属性在模拟与用户研究中的影响

Cohen, Myke C., Zheng, Mingqian, Bhandari, Neel, Kao, Hsien-Te, Zhou, Xuhui, Nguyen, Daniel, Cassani, Laura, Sap, Maarten, Volkova, Svitlana

Abstract

AI design characteristics and human personality traits each impact the quality and outcomes of human-AI interactions. However, their relative and joint impacts are underexplored in imperfectly cooperative scenarios, where people and AI only have partially aligned goals and objectives. This study compares a purely simulated dataset comprising 2,000 simulations and a parallel human subjects experiment involving 290 human participants to investigate these effects across two scenario categories: (1) hiring negotiations between human job candidates and AI hiring agents; and (2) human-AI transactions wherein AI agents may conceal information to maximize internal goals. We examine user Extraversion and Agreeableness alongside AI design characteristics, including Adaptability, Expertise, and chain-of-thought Transparency. Our causal discovery analysis extends performance-focused evaluations by integrating scenario-based outcomes, communication analysis, and questionnaire measures. Results reveal divergences between purely simulated and human study datasets, and between scenario types. In simulation experiments, personality traits and AI attributes were comparatively influential. Yet, with actual human subjects, AI attributes -- particularly transparency -- were much more impactful. We discuss how these divergences vary across different interaction contexts, offering crucial insights for the future of human-centered AI agents.

Chinese Translation

人工智能设计特征和人类个性特征各自影响人类与人工智能互动的质量和结果。然而，在不完全合作的场景中，即人类与人工智能的目标和目的仅部分一致，它们的相对和联合影响尚未得到充分探讨。本研究比较了一个包含2000次模拟的纯模拟数据集和一个涉及290名人类参与者的平行人类实验，以调查这些影响在两个场景类别中的表现：（1）人类求职者与人工智能招聘代理之间的招聘谈判；（2）人类与人工智能的交易，其中人工智能代理可能会隐瞒信息以最大化内部目标。我们考察了用户的外向性和宜人性，以及人工智能设计特征，包括适应性、专业知识和思维链透明度。我们的因果发现分析通过整合基于场景的结果、沟通分析和问卷测量，扩展了以绩效为中心的评估。结果显示，纯模拟数据集与人类研究数据集之间，以及不同场景类型之间存在差异。在模拟实验中，个性特征和人工智能属性的影响相对较大。然而，在实际人类参与者中，人工智能属性——特别是透明度——的影响则更为显著。我们讨论了这些差异如何在不同的互动背景中变化，为未来以人为中心的人工智能代理提供了重要的见解。

cs.CL / 13 / 2604.15646

FD-NL2SQL: Feedback-Driven Clinical NL2SQL that Improves with Use

FD-NL2SQL：一种随着使用而改进的反馈驱动临床 NL2SQL

Chowdhury, Suparno Roy, Anvekar, Tejas, Choudhury, Manan Roy, Khan, Muhammad Ali, Khakwani, Kaneez Zahra Rubab, Sonbol, Mohamad Bassam, Riaz, Irbaz Bin, Gupta, Vivek

Abstract

Clinicians exploring oncology trial repositories often need ad-hoc, multi-constraint queries over biomarkers, endpoints, interventions, and time, yet writing SQL requires schema expertise. We demo FD-NL2SQL, a feedback-driven clinical NL2SQL assistant for SQLite-based oncology databases. Given a natural-language question, a schema-aware LLM decomposes it into predicate-level sub-questions, retrieves semantically similar expert-verified NL2SQL exemplars via sentence embeddings, and synthesizes executable SQL conditioned on the decomposition, retrieved exemplars, and schema, with post-processing validity checks. To improve with use, FD-NL2SQL incorporates two update signals: (i) clinician edits of generated SQL are approved and added to the exemplar bank; and (ii) lightweight logic-based SQL augmentation applies a single atomic mutation (e.g., operator or column change), retaining variants only if they return non-empty results. A second LLM generates the corresponding natural-language question and predicate decomposition for accepted variants, automatically expanding the exemplar bank without additional annotation. The demo interface exposes decomposition, retrieval, synthesis, and execution results to support interactive refinement and continuous improvement.
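The second update signal (apply one atomic mutation, keep the variant only if it returns rows) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `trials` table, its columns, and the helper names `mutate_operator` and `keep_nonempty_variants` are hypothetical, and treating a failing query as a rejected variant is an assumption.

```python
import sqlite3


def mutate_operator(sql, old_op, new_op):
    """Apply a single atomic mutation: swap the first occurrence of one operator."""
    return sql.replace(old_op, new_op, 1)


def keep_nonempty_variants(conn, variants):
    """Return only the variants that execute and yield at least one row."""
    kept = []
    for sql in variants:
        try:
            row = conn.execute(sql).fetchone()
        except sqlite3.Error:
            # Assumption: a variant that fails to execute is simply discarded.
            continue
        if row is not None:
            kept.append(sql)
    return kept


# Hypothetical toy schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trials (id INTEGER, phase INTEGER, biomarker TEXT)")
conn.executemany(
    "INSERT INTO trials VALUES (?, ?, ?)",
    [(1, 2, "HER2"), (2, 3, "EGFR")],
)

seed = "SELECT id FROM trials WHERE phase >= 2"
variants = [
    mutate_operator(seed, ">=", ">"),    # operator change, matches one row
    mutate_operator(seed, ">=", "<"),    # operator change, matches no rows
    seed.replace("phase", "stage", 1),   # column change to a column that does not exist
]
kept = keep_nonempty_variants(conn, variants)
```

In the described system a second LLM would then write the natural-language question for each kept variant; that step is omitted here.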

Chinese Translation

临床医生在探索肿瘤试验数据库时，常常需要针对生物标志物、终点、干预措施和时间进行临时的多约束查询，但编写 SQL 需要对数据库模式有一定的专业知识。我们展示了 FD-NL2SQL，这是一种针对基于 SQLite 的肿瘤数据库的反馈驱动临床 NL2SQL 助手。给定一个自然语言问题，具有模式感知能力的大型语言模型（LLM）将其分解为谓词级子问题，通过句子嵌入检索语义相似的专家验证 NL2SQL 示例，并基于分解、检索的示例和模式合成可执行的 SQL，同时进行后处理有效性检查。为了随着使用而改进，FD-NL2SQL 纳入了两个更新信号：（i）临床医生对生成的 SQL 进行编辑后被批准并添加到示例库中；（ii）轻量级基于逻辑的 SQL 增强应用单一原子变更（例如，操作符或列的更改），仅在返回非空结果时保留变体。第二个 LLM 为接受的变体生成相应的自然语言问题和谓词分解，自动扩展示例库而无需额外注释。演示界面展示了分解、检索、合成和执行结果，以支持交互式细化和持续改进。

cs.CL / 14 / 2604.15647

CIG: Measuring Conversational Information Gain in Deliberative Dialogues with Semantic Memory Dynamics

CIG：通过语义记忆动态测量深思对话中的对话信息增益

Chen, Ming-Bin, Lau, Jey Han, Frermann, Lea

Abstract

Measuring the quality of public deliberation requires evaluating not only civility or argument structure, but also the informational progress of a conversation. We introduce a framework for Conversational Information Gain (CIG) that evaluates each utterance in terms of how it advances collective understanding of the target topic. To operationalize CIG, we model an evolving semantic memory of the discussion: the system extracts atomic claims from utterances and incrementally consolidates them into a structured memory state. Using this memory, we score each utterance along three interpretable dimensions: Novelty, Relevance, and Implication Scope. We annotate 80 segments from two moderated deliberative settings (TV debates and community discussions) with these dimensions and show that memory-derived dynamics (e.g., the number of claim updates) correlate more strongly with human-perceived CIG than traditional heuristics such as utterance length or TF-IDF. We also develop effective LLM-based CIG predictors, paving the way for information-focused analysis of conversation quality and deliberative success in dialogues.

Chinese Translation

测量公共深思的质量不仅需要评估礼貌性或论证结构，还需要评估对话的信息进展。我们提出了一种对话信息增益（Conversational Information Gain, CIG）框架，该框架评估每个发言在多大程度上促进了对目标主题的集体理解。为了实现CIG，我们对讨论的语义记忆进行建模：系统从发言中提取原子主张，并将其逐步整合到结构化的记忆状态中。利用这一记忆，我们从三个可解释的维度对每个发言进行评分：新颖性（Novelty）、相关性（Relevance）和影响范围（Implication Scope）。我们对来自两个受控深思环境（电视辩论和社区讨论）的80个片段进行了这些维度的标注，并显示出基于记忆的动态（例如，主张更新的数量）与人类感知的CIG之间的相关性比传统启发式方法（如发言长度或TF-IDF）更强。我们开发了有效的基于大型语言模型（LLM）的CIG预测器，为对话中的信息聚焦质量分析和深思成功奠定了基础。

cs.CL / 15 / 2604.15648

HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning

HyperGVL：在超图理解与推理中对大型视觉-语言模型的基准测试与改进

Wei, Yanbin, Kang, Chun, Li, Siwei, Che, Haoxuan, Chen, Yang, Liu, Hua, Liu, Jian, Liu, Zhuang, Ouyang, Can, Xing, Fei, Sha, Lei, Liu, Rui, Zhang, Yu, Kwok, James

Abstract

Large Vision-Language Models (LVLMs) consistently require new arenas to guide their expanding boundaries, yet their capabilities with hypergraphs remain unexplored. In the real world, hypergraphs have significant practical applications in areas such as life sciences and social communities. Recent advancements in LVLMs have shown promise in understanding complex topologies, yet there remains a lack of a benchmark to delineate the capabilities of LVLMs with hypergraphs, leaving the boundaries of their abilities unclear. To fill this gap, in this paper, we introduce $\texttt{HyperGVL}$, the first benchmark to evaluate the proficiency of LVLMs in hypergraph understanding and reasoning. $\texttt{HyperGVL}$ provides a comprehensive assessment of 12 advanced LVLMs across 84,000 vision-language question-answering (QA) samples spanning 12 tasks, ranging from basic component counting to complex NP-hard problem reasoning. The involved hypergraphs contain multiscale synthetic structures and real-world citation and protein networks. Moreover, we examine the effects of 12 textual and visual hypergraph representations and introduce a generalizable router $\texttt{WiseHyGR}$ that improves LVLMs in hypergraph via learning adaptive representations. We believe that this work is a step forward in connecting hypergraphs with LVLMs.

Chinese Translation

大型视觉-语言模型（LVLMs）持续需要新的领域来引导其扩展边界，但它们在超图方面的能力仍未得到探索。在现实世界中，超图在生命科学和社会社区等领域具有重要的实际应用。最近在LVLMs方面的进展显示出理解复杂拓扑的潜力，但缺乏一个基准来明确LVLMs在超图方面的能力，使得它们的能力边界不清晰。为填补这一空白，本文介绍了$\texttt{HyperGVL}$，这是第一个评估LVLMs在超图理解与推理能力的基准。$\texttt{HyperGVL}$对12个先进的LVLMs进行了全面评估，涵盖了84,000个视觉-语言问答（QA）样本，涉及从基本组件计数到复杂NP难题推理的12个任务。所涉及的超图包含多尺度的合成结构以及现实世界的引用和蛋白质网络。此外，我们研究了12种文本和视觉超图表示的效果，并引入了一个可推广的路由器$\texttt{WiseHyGR}$，通过学习自适应表示来改善LVLMs在超图方面的表现。我们相信，这项工作是将超图与LVLMs连接的一个重要进展。

cs.CL / 16 / 2604.15675

C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment

C-Mining：通过几何错位进行文化数据合成的无监督种子发现

Zeng, Pufan, Liu, Yilun, Dai, Mingchen, Piao, Mengyao, Zhao, Chunguang, Miao, Lingqi, Tao, Shimin, Meng, Weibin, He, Minggui, Liu, Chenxin, Qin, Zhenzhen, Zhang, Li, Ma, Hongxia, Chen, Boxing, Wei, Daimeng

Abstract

Achieving cultural alignment in Large Language Models (LLMs) increasingly depends on synthetic data generation. For such synthesis, the most vital initial step is seed curation; however, current methods lack quantifiable standards for selecting these seeds. Existing approaches rely on unscalable manual curation or bias-prone LLM extraction, treating cultural specificity as an abstract concept rather than a measurable signal. In this paper, we address this "quantification gap" by proposing C-Mining, an unsupervised framework that transforms the discovery of cultural seeds from a subjective selection process into a computable data mining formulation. Our approach exploits a novel geometric insight, leveraging the cross-lingual misalignment of cultural concepts within pre-trained embedding spaces as a quantifiable discovery signal. By systematically identifying these regions characterized by pronounced linguistic exclusivity and geometric isolation, while actively filtering out noise, C-Mining automatically extracts high-fidelity Culture Points (CPs) from raw multilingual corpora without reliance on human or LLM supervision, reducing preparation costs by more than 150-fold. We further leverage the mined knowledge to steer the synthesis of diverse instruction-tuning datasets. Extensive experiments demonstrate that this seed-centric approach significantly enhances cultural understanding and reasoning capabilities, achieving a +6.03 point improvement on CulturalBench-Hard and surpassing state-of-the-art baselines, providing a scalable, quantifiable solution for high-quality cultural data synthesis.

Chinese Translation

在大型语言模型（LLMs）中实现文化对齐日益依赖于合成数据生成。对于这种合成，最重要的初步步骤是种子策划；然而，当前的方法缺乏可量化的标准来选择这些种子。现有的方法依赖于不可扩展的手动策划或易受偏见影响的LLM提取，将文化特异性视为一个抽象概念，而不是一个可测量的信号。本文通过提出C-Mining，解决了这一“量化缺口”，将文化种子的发现从一个主观选择过程转变为一个可计算的数据挖掘公式。我们的方法利用了一种新颖的几何洞察，利用预训练嵌入空间中文化概念的跨语言错位作为可量化的发现信号。通过系统地识别这些以显著语言排他性和几何隔离为特征的区域，同时主动过滤噪声，C-Mining能够从原始多语言语料库中自动提取高保真文化点（Culture Points, CPs），而无需依赖人类或LLM的监督，将准备成本降低了150倍以上。我们进一步利用挖掘的知识来引导多样化的指令调优数据集的合成。大量实验表明，这种以种子为中心的方法显著增强了文化理解和推理能力，在CulturalBench-Hard上实现了+6.03的提升，并超越了最先进的基准，提供了一种可扩展、可量化的高质量文化数据合成解决方案。

cs.CL / 17 / 2604.15687

Preference Estimation via Opponent Modeling in Multi-Agent Negotiation

通过对手建模进行多智能体谈判中的偏好估计

Konishi, Yuta, Yamamoto, Kento, Sonomoto, Eisuke, Takeda, Rikuho, Furukawa, Ryo, Muraki, Yusuke, Shimizu, Takafumi, Fukumura, Kazuma, Kanemoto, Yuya, Ito, Takayuki, Ding, Shiyao

Abstract

Automated negotiation in complex, multi-party and multi-issue settings critically depends on accurate opponent modeling. However, conventional numerical-only approaches fail to capture the qualitative information embedded in natural language interactions, resulting in unstable and incomplete preference estimation. Although Large Language Models (LLMs) enable rich semantic understanding of utterances, it remains challenging to quantitatively incorporate such information into a consistent opponent model. To tackle this issue, we propose a novel preference estimation method integrating natural language information into a structured Bayesian opponent modeling framework. Our approach leverages LLMs to extract qualitative cues from utterances and converts them into probabilistic formats for dynamic belief tracking. Experimental results on a multi-party benchmark demonstrate that our framework improves the full agreement rate and preference estimation accuracy by integrating probabilistic reasoning with natural language understanding.
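The idea of turning a qualitative cue into a probabilistic update can be illustrated with a plain Bayes step. This is only a sketch under stated assumptions: the abstract does not specify the hypothesis space or the likelihood format, so the two hypotheses ("price" or "delivery" is the opponent's top priority), the likelihood values, and the function name `bayes_update` are all hypothetical. In the described framework an LLM would produce the cue; here it is hard-coded.

```python
def bayes_update(prior, likelihood):
    """Posterior over hypotheses, proportional to prior times cue likelihood."""
    unnormalised = {h: prior[h] * likelihood.get(h, 0.0) for h in prior}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("cue has zero likelihood under every hypothesis")
    return {h: p / total for h, p in unnormalised.items()}


# Hypothetical belief over which issue the opponent cares about most.
prior = {"price": 0.5, "delivery": 0.5}

# Hypothetical likelihood of the utterance "we cannot move on cost"
# under each hypothesis (stand-in for an LLM-extracted cue).
cue = {"price": 0.8, "delivery": 0.2}

posterior = bayes_update(prior, cue)
```

Repeating this step after every utterance gives a simple form of dynamic belief tracking, with each posterior serving as the next prior.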

Chinese Translation

在复杂的多方和多议题环境中，自动化谈判在很大程度上依赖于准确的对手建模。然而，传统的仅基于数值的方法未能捕捉自然语言交互中蕴含的定性信息，导致偏好估计的不稳定和不完整。尽管大型语言模型（LLMs）能够对话语进行丰富的语义理解，但将这些信息定量地纳入一致的对手建模仍然具有挑战性。为了解决这一问题，我们提出了一种新颖的偏好估计方法，将自然语言信息整合到结构化的贝叶斯对手建模框架中。我们的方法利用LLMs从话语中提取定性线索，并将其转换为动态信念跟踪的概率格式。在多方基准测试中的实验结果表明，通过将概率推理与自然语言理解相结合，我们的框架提高了完全一致率和偏好估计的准确性。

cs.CL / 18 / 2604.15701

Improving Reasoning Capabilities in Small Models through Mixture-of-Layers Distillation with Stepwise Attention on Key Information

通过关键性信息的逐步注意力混合层蒸馏提升小模型的推理能力

Chen, Yao, Sheng, Jiawei, Zhang, Wenyuan, Liu, Tingwen

Abstract

The significant computational demands of large language models have increased interest in distilling reasoning abilities into smaller models via Chain-of-Thought (CoT) distillation. Current CoT distillation methods mainly focus on transferring teacher-generated rationales for complex reasoning to student models. However, they do not adequately explore teachers' dynamic attention toward critical information during reasoning. We find that language models exhibit progressive attention shifts towards key information during reasoning, which implies essential clues for drawing conclusions. Building on this observation and analysis, we introduce a novel CoT distillation framework that transfers the teacher's stepwise attention on key information to the student model. This establishes structured guidance for the student's progressive concentration on key information during reasoning. More importantly, we develop a Mixture of Layers module enabling dynamic alignment that adapts to different layers between the teacher and student. Our method achieves consistent performance improvements across multiple mathematical and commonsense reasoning datasets. To our knowledge, it is the first method to leverage stepwise attention within CoT distillation to improve small model reasoning.

Chinese Translation

大型语言模型的显著计算需求引发了对通过思维链（Chain-of-Thought, CoT）蒸馏将推理能力转移到小模型的兴趣。目前的CoT蒸馏方法主要集中在将教师生成的复杂推理理由转移给学生模型。然而，它们并未充分探讨教师在推理过程中对关键性信息的动态注意力。我们发现，语言模型在推理过程中对关键性信息表现出逐步的注意力转移，这暗示了得出结论的重要线索。基于这一观察和分析，我们提出了一种新颖的CoT蒸馏框架，将教师对关键性信息的逐步注意力转移到学生模型。这为学生在推理过程中对关键性信息的逐步集中提供了结构化指导。更重要的是，我们开发了一个混合层模块，使得教师和学生之间的不同层能够动态对齐。我们的方法在多个数学和常识推理数据集上实现了一致的性能提升。据我们所知，这是第一个在CoT蒸馏中利用逐步注意力来改善小模型推理的方法。

cs.CL / 19 / 2604.15702

The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring

元认知监测电池：大型语言模型自我监测的跨领域基准

Cacioli, Jon-Paul

Abstract

We introduce a cross-domain behavioural assay of monitoring-control coupling in LLMs, grounded in the Nelson and Narens (1990) metacognitive framework and applying human psychometric methodology to LLM evaluation. The battery comprises 524 items across six cognitive domains (learning, metacognitive calibration, social cognition, attention, executive function, prospective regulation), each grounded in an established experimental paradigm. Tasks T1-T5 were pre-registered on OSF prior to data collection; T6 was added as an exploratory extension. After every forced-choice response, dual probes adapted from Koriat and Goldsmith (1996) ask the model to KEEP or WITHDRAW its answer and to BET or decline. The critical metric is the withdraw delta: the difference in withdrawal rate between incorrect and correct items. Applied to 20 frontier LLMs (10,480 evaluations), the battery discriminates three profiles consistent with the Nelson-Narens architecture: blanket confidence, blanket withdrawal, and selective sensitivity. Accuracy rank and metacognitive sensitivity rank are largely inverted. Retrospective monitoring and prospective regulation appear dissociable (r = .17, 95% CI wide given n=20; exemplar-based evidence is the primary support). Scaling on metacognitive calibration is architecture-dependent: monotonically decreasing (Qwen), monotonically increasing (GPT-5.4), or flat (Gemma). Behavioural findings converge structurally with an independent Type-2 SDT approach, providing preliminary cross-method construct validity. All items, data, and code: https://github.com/synthiumjp/metacognitive-monitoring-battery.
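The withdraw delta described above is simple arithmetic, sketched below. The sign convention (incorrect minus correct, so that a larger value means more selective withdrawal) is an assumption, and the `Trial` record and toy data are hypothetical, not taken from the battery.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    correct: bool    # was the forced-choice answer correct?
    withdrew: bool   # did the model answer WITHDRAW on the follow-up probe?


def withdraw_delta(trials):
    """Withdrawal rate on incorrect items minus withdrawal rate on correct items."""
    wrong = [t for t in trials if not t.correct]
    right = [t for t in trials if t.correct]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect item")
    rate_wrong = sum(t.withdrew for t in wrong) / len(wrong)
    rate_right = sum(t.withdrew for t in right) / len(right)
    return rate_wrong - rate_right


# Hypothetical toy profiles.
selective = (
    [Trial(True, False)] * 6      # keeps all correct answers
    + [Trial(False, True)] * 3    # withdraws three of four errors
    + [Trial(False, False)]
)
blanket_confidence = [Trial(True, False)] * 6 + [Trial(False, False)] * 4

selective_delta = withdraw_delta(selective)
confident_delta = withdraw_delta(blanket_confidence)
```

Under this convention a model that never withdraws and a model that always withdraws both score zero, which is why the metric separates selective sensitivity from the two blanket profiles.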

Chinese Translation

我们介绍了一种基于Nelson和Narens（1990）元认知框架的跨领域监测-控制耦合行为测评，应用人类心理测量方法对大型语言模型（LLMs）进行评估。该电池包含524个项目，涵盖六个认知领域（学习、元认知校准、社会认知、注意力、执行功能、前瞻性调节），每个领域均基于已建立的实验范式。在数据收集之前，任务T1-T5已在开放科学框架（OSF）上预注册；T6作为探索性扩展被添加。在每个强制选择响应后，采用Koriat和Goldsmith（1996）的方法的双重探针询问模型是否保持（KEEP）或撤回（WITHDRAW）其答案，并进行下注（BET）或拒绝（decline）。关键指标是撤回差值（withdraw delta）：错误项与正确项之间的撤回率差异。应用于20个前沿LLMs（10,480次评估），该电池区分出与Nelson-Narens架构一致的三种特征：普遍信心（blanket confidence）、普遍撤回（blanket withdrawal）和选择性敏感性（selective sensitivity）。准确性排名和元认知敏感性排名大致相反。回顾性监测和前瞻性调节似乎是可分离的（r = .17，95%置信区间较宽，样本量n=20；基于示例的证据是主要支持）。元认知校准的规模依赖于架构：单调递减（Qwen）、单调递增（GPT-5.4）或平坦（Gemma）。行为发现与独立的二型信号检测理论（Type-2 SDT）方法在结构上趋于一致，提供了初步的跨方法构念效度。所有项目、数据和代码可在此访问：https://github.com/synthiumjp/metacognitive-monitoring-battery。

cs.CL / 20 / 2604.15706

Target-Oriented Pretraining Data Selection via Neuron-Activated Graph

基于神经元激活图的目标导向预训练数据选择

Wang, Zijun, Tu, Haoqin, Zhou, Weidong, Zhou, Yiyang, Zhou, Xiaohuan, Zhang, Bingni, Feng, Weiguo, Wang, Taifeng, Xie, Cihang, Liu, Fengze

Abstract

Everyday tasks come with a target, and pretraining models around this target is what turns them into experts. In this paper, we study target-oriented language model (LM) pretraining by introducing Neuron-Activated Graph Ranking (NAG-based Ranking), a training-free and interpretable framework for target pretraining data selection. Rather than using black-box representations, our approach directly characterizes each target input by a sparse set of high-impact neurons in any off-the-shelf LLM. Concretely, we quantify neuron impact and select the most influential neurons across layers into a compact Neuron-Activated Graph (NAG), and rank candidate data by NAG similarity to target examples. We conduct experiments across six benchmarks, where our NAG-based Ranking improves target-oriented pretraining by 4.9% on average over random sampling, and also outperforms state-of-the-art baselines by 5.3% accuracy on HellaSwag. It also remains effective under a more applicable multi-target setting, where our best setup surpasses two baselines by 1.1% and 4.1%, respectively. Furthermore, we provide a comprehensive analysis of why and how our NAG works, e.g., deactivating NAG-selected neurons (only 0.12% of all) causes a 23.5% performance collapse, and restricting NAG to the final layer incurs a 4.1% average drop, indicating that NAG captures a sparse "functional backbone" for learning target features. We release the code at https://github.com/asillycat/NAG.
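The select-then-rank pipeline can be sketched in a few lines. This is an illustration under loud assumptions, not the released NAG code: neuron impact scores are taken as given, each input is reduced to a set of `(layer, neuron_index)` pairs rather than a graph, and Jaccard overlap stands in for the paper's NAG similarity, which the abstract does not define.

```python
def top_neurons(impact, k):
    """Keep the k highest-impact neurons; impact maps (layer, index) -> score."""
    ranked = sorted(impact, key=lambda key: impact[key], reverse=True)
    return frozenset(ranked[:k])


def jaccard(a, b):
    """Set overlap in [0, 1]; a stand-in for NAG similarity (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(target_sets, candidates):
    """Order candidate names by mean similarity to the target examples."""
    scores = {
        name: sum(jaccard(neurons, t) for t in target_sets) / len(target_sets)
        for name, neurons in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical impact scores for one target example.
target = top_neurons({(0, 1): 0.9, (1, 4): 0.8, (2, 7): 0.1}, k=2)

# Hypothetical candidate documents, already reduced to neuron sets.
candidates = {
    "doc_b": frozenset({(0, 1), (2, 2)}),
    "doc_c": frozenset({(3, 3), (2, 2)}),
    "doc_a": frozenset({(0, 1), (1, 4)}),
}
ranking = rank_candidates([target], candidates)
```

Candidates at the top of `ranking` would be the ones selected for target-oriented pretraining.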

Chinese Translation

日常任务都有一个目标，而围绕这一目标进行模型预训练则使其成为专家。本文研究了目标导向语言模型（LM）预训练，提出了一种基于神经元激活图排名（NAG-based Ranking）的训练无关且可解释的目标预训练数据选择框架。我们的方法并不使用黑箱表示，而是通过一组稀疏的高影响力神经元直接表征每个目标输入，这些神经元来自任何现成的语言模型（LLMs）。具体而言，我们量化神经元的影响力，并在各层中选择最具影响力的神经元构建一个紧凑的神经元激活图（NAG），并通过NAG与目标示例的相似度对候选数据进行排序。我们在六个基准测试中进行了实验，结果显示我们的NAG-based Ranking在目标导向预训练上平均比随机采样提高了4.9%，并且在HellaSwag上比最先进的基线提高了5.3%的准确率。在更具适用性的多目标设置下，我们的最佳配置分别超越了两个基线1.1%和4.1%。此外，我们提供了关于NAG为何及如何有效的全面分析，例如，停用NAG选择的神经元（仅占所有神经元的0.12%）会导致23.5%的性能崩溃，而将NAG限制在最后一层会导致4.1%的平均下降，表明NAG捕捉到了用于学习目标特征的稀疏“功能骨架”。我们在https://github.com/asillycat/NAG上发布了代码。

cs.CL / 21 / 2604.15715

GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

GTA-2：从原子工具使用到开放式工作流程的通用工具代理基准测试

Wang, Jize, Liu, Xuanxuan, Li, Yining, Zhang, Songyang, Wang, Yijun, Shan, Zifei, Le, Xinyi, Chen, Cailian, Guan, Xinping, Tao, Dacheng

Abstract

The development of general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows. However, current tool-use benchmarks remain misaligned with real-world requirements, relying on AI-generated queries, dummy tools, and limited system-level coordination. To address this, we propose GTA-2, a hierarchical benchmark for General Tool Agents (GTA) spanning atomic tool use and open-ended workflows. Built on real-world authenticity, it leverages real user queries, deployed tools, and multimodal contexts. (i) GTA-Atomic, inherited from our prior GTA benchmark, evaluates short-horizon, closed-ended tool-use precision. (ii) GTA-Workflow introduces long-horizon, open-ended tasks for realistic end-to-end completion. To evaluate open-ended deliverables, we propose a recursive checkpoint-based evaluation mechanism that decomposes objectives into verifiable sub-goals, enabling unified evaluation of both model capabilities and agent execution frameworks (i.e., execution harnesses). Experiments reveal a pronounced capability cliff: while frontier models already struggle on atomic tasks (below 50%), they largely fail on workflows, with top models achieving only 14.39% success. Further analysis shows that checkpoint-guided feedback improves performance, while advanced frameworks such as Manus and OpenClaw substantially enhance workflow completion, highlighting the importance of execution harness design beyond the underlying model capacity. These findings provide guidance for developing reliable personal and professional assistants. Dataset and code will be available at https://github.com/open-compass/GTA.

Chinese Translation

通用代理的发展需要从执行简单指令转向完成复杂的现实生产力工作流程。然而，目前的工具使用基准测试与现实世界需求仍然不匹配，依赖于人工智能生成的查询、虚拟工具和有限的系统级协调。为了解决这一问题，我们提出了GTA-2，这是一个针对通用工具代理（General Tool Agents, GTA）的分层基准，涵盖原子工具使用和开放式工作流程。GTA-2基于现实世界的真实性，利用真实用户查询、已部署工具和多模态上下文。(i) GTA-Atomic，继承自我们之前的GTA基准，评估短期、封闭式工具使用的精确度。(ii) GTA-Workflow引入长期、开放式任务，以实现现实的端到端完成。为了评估开放式交付成果，我们提出了一种基于递归检查点的评估机制，将目标分解为可验证的子目标，从而实现对模型能力和代理执行框架（即执行工具）的统一评估。实验结果揭示了显著的能力悬崖：尽管前沿模型在原子任务上已经面临困难（成功率低于50%），但在工作流程上几乎完全失败，顶级模型的成功率仅为14.39%。进一步分析表明，检查点引导的反馈可以提高性能，而诸如Manus和OpenClaw等先进框架显著增强了工作流程的完成度，突显了执行工具设计的重要性，超越了基础模型的能力。这些发现为开发可靠的个人和专业助手提供了指导。数据集和代码将发布在https://github.com/open-compass/GTA。

cs.CL / 22 / 2604.15741

Learning Uncertainty from Sequential Internal Dispersion in Large Language Models

从大型语言模型的序列内部离散性中学习不确定性

Srey, Ponhvoan, Wu, Xiaobao, Nguyen, Cong-Duy, Luu, Anh Tuan

Abstract

Uncertainty estimation is a promising approach to detect hallucinations in large language models (LLMs). Recent approaches commonly depend on model internal states to estimate uncertainty. However, they suffer from strict assumptions on how hidden states should evolve across layers, and from information loss by solely focusing on last or mean tokens. To address these issues, we present Sequential Internal Variance Representation (SIVR), a supervised hallucination detection framework that leverages token-wise, layer-wise features derived from hidden states. SIVR adopts a more basic assumption that uncertainty manifests in the degree of dispersion or variance of internal representations across layers, rather than relying on specific assumptions, which makes the method model and task agnostic. It additionally aggregates the full sequence of per-token variance features, learning temporal patterns indicative of factual errors and thereby preventing information loss. Experimental results demonstrate SIVR consistently outperforms strong baselines. Most importantly, SIVR enjoys stronger generalisation and avoids relying on large training sets, highlighting the potential for practical deployment. Our code repository is available online at https://github.com/ponhvoan/internal-variance.
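The core feature, per-token dispersion of hidden states across layers, can be sketched without any model. The snippet assumes hidden states arrive as a nested list indexed `hidden[layer][token][dim]` and averages the across-layer variance over dimensions; both the layout and the averaging are assumptions, and the supervised classifier that SIVR trains on the resulting sequence is not shown.

```python
from statistics import pvariance


def per_token_layer_variance(hidden):
    """For each token, average over dimensions the variance taken across layers."""
    n_layers = len(hidden)
    n_tokens = len(hidden[0])
    n_dims = len(hidden[0][0])
    features = []
    for t in range(n_tokens):
        dim_vars = [
            pvariance([hidden[layer][t][d] for layer in range(n_layers)])
            for d in range(n_dims)
        ]
        features.append(sum(dim_vars) / n_dims)
    return features


# Hypothetical toy states: 3 layers, 2 tokens, 2 dimensions.
hidden = [
    [[1.0, 2.0], [0.0, 1.0]],   # layer 0
    [[1.0, 2.0], [3.0, 1.0]],   # layer 1
    [[1.0, 2.0], [6.0, 1.0]],   # layer 2
]
features = per_token_layer_variance(hidden)
```

Token 0 is identical in every layer and gets zero dispersion, while token 1 drifts along one dimension and gets a positive value; the full sequence of such values is what a downstream detector would consume.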

Chinese Translation

不确定性估计是一种有前景的方法，用于检测大型语言模型（LLMs）中的幻觉。近期的方法通常依赖于模型内部状态来估计不确定性。然而，它们受到关于隐藏状态在各层之间如何演变的严格假设的限制，并且仅关注最后一个或平均标记导致信息损失。为了解决这些问题，我们提出了序列内部方差表示（SIVR），这是一个利用从隐藏状态中提取的逐标记、逐层特征的监督幻觉检测框架。SIVR采用了一个更基本的假设，即不确定性表现为各层内部表示的离散程度或方差，而不是依赖于特定的假设，这使得该方法与模型和任务无关。此外，它还聚合了每个标记方差特征的完整序列，学习指示事实错误的时间模式，从而防止信息损失。实验结果表明，SIVR在性能上始终优于强基线。最重要的是，SIVR具有更强的泛化能力，并且不依赖于大量训练集，突显了其实际应用的潜力。我们的代码库可在线访问，网址为 https://github.com/ponhvoan/internal-variance.

cs.CL / 23 / 2604.15744

Language, Place, and Social Media: Geographic Dialect Alignment in New Zealand

语言、地点与社交媒体：新西兰的地理方言对齐

Wong, Sidney

Abstract

This thesis investigates geographic dialect alignment in place-informed social media communities, focussing on New Zealand-related Reddit communities. By integrating qualitative analyses of user perceptions with computational methods, the study examines how language use reflects place identity and patterns of language variation and change based on user-informed lexical, morphosyntactic, and semantic variables. The findings show that users generally associate language with place, and place-related communities form a contiguous speech community, though alignment between geographic dialect communities and place-related communities remains complex. Advanced language modelling, including static and diachronic Word2Vec language embeddings, revealed semantic variation across place-based communities and meaningful semantic shifts within New Zealand English. The research involved the creation of a corpus containing 4.26 billion unprocessed words, which offers a valuable resource for future study. Overall, the results highlight the potential of social media as a natural laboratory for sociolinguistic inquiry.

Chinese Translation

本论文研究了在以地点为导向的社交媒体社区中地理方言的对齐，重点关注与新西兰相关的Reddit社区。通过将用户感知的定性分析与计算方法相结合，本研究考察了语言使用如何反映地点身份以及基于用户提供的词汇、形态句法和语义变量的语言变异和变化模式。研究结果表明，用户通常将语言与地点联系在一起，地点相关的社区形成了一个连续的言语社区，尽管地理方言社区与地点相关社区之间的对齐仍然复杂。先进的语言建模，包括静态和历时的Word2Vec语言嵌入，揭示了基于地点的社区之间的语义变异以及新西兰英语内部的有意义的语义转变。该研究涉及创建一个包含42.6亿个未处理单词的语料库，为未来的研究提供了宝贵的资源。总体而言，结果突显了社交媒体作为社会语言学研究自然实验室的潜力。

cs.CL / 24 / 2604.15756

TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models

TTL：基于预训练视觉-语言模型的测试时文本学习用于OOD检测

Ye, Jinlun, Liao, Jiang, Lai, Runhe, Lu, Xinhua, Zhuang, Jiaxin, Gan, Zhiyong, Wang, Ruixuan

Abstract

Vision-language models (VLMs) such as CLIP exhibit strong Out-of-distribution (OOD) detection capabilities by aligning visual and textual representations. Recent CLIP-based test-time adaptation methods further improve detection performance by incorporating external OOD labels. However, such labels are finite and fixed, while the real OOD semantic space is inherently open-ended. Consequently, fixed labels fail to represent the diverse and evolving OOD semantics encountered in test streams. To address this limitation, we introduce Test-time Textual Learning (TTL), a framework that dynamically learns OOD textual semantics from unlabeled test streams, without relying on external OOD labels. TTL updates learnable prompts using pseudo-labeled test samples to capture emerging OOD knowledge. To suppress noise introduced by pseudo-labels, we introduce an OOD knowledge purification strategy that selects reliable OOD samples for adaptation while suppressing noise. In addition, TTL maintains an OOD Textual Knowledge Bank that stores high-quality textual features, providing stable score calibration across batches. Extensive experiments on two standard benchmarks with nine OOD datasets demonstrate that TTL consistently achieves state-of-the-art performance, highlighting the value of textual adaptation for robust test-time OOD detection. Our code is available at https://github.com/figec/TTL.

Chinese Translation

视觉-语言模型（VLMs），如CLIP，通过对齐视觉和文本表示，展现出强大的分布外（OOD）检测能力。最近基于CLIP的测试时适应方法通过结合外部OOD标签进一步提高了检测性能。然而，这些标签是有限且固定的，而真实的OOD语义空间本质上是开放的。因此，固定标签无法代表在测试流中遇到的多样化和不断演变的OOD语义。为了解决这一局限性，我们提出了测试时文本学习（TTL）框架，该框架能够从未标记的测试流中动态学习OOD文本语义，而无需依赖外部OOD标签。TTL通过使用伪标记的测试样本更新可学习的提示，以捕捉新兴的OOD知识。为了抑制伪标签引入的噪声，我们引入了一种OOD知识净化策略，该策略选择可靠的OOD样本进行适应，同时抑制噪声。此外，TTL维护一个OOD文本知识库，存储高质量的文本特征，为各批次提供稳定的分数校准。在两个标准基准和九个OOD数据集上的大量实验表明，TTL始终实现了最先进的性能，突显了文本适应在稳健的测试时OOD检测中的价值。我们的代码可在 https://github.com/figec/TTL 获取。


cs.CL / 25 / 2604.15771

Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing

Skill-RAG：通过隐状态探测和技能路由的故障状态感知检索增强

Wei, Kai, Li, Raymond, Zhu, Xi, Xue, Zhaoqian, Han, Jiaojiao, Niu, Jingcheng, Yang, Fan

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models in external knowledge. While adaptive retrieval mechanisms have improved retrieval efficiency, existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose -- leaving the structural causes of query-evidence misalignment unaddressed. We observe that a significant portion of persistent retrieval failures stem not from the absence of relevant evidence but from an alignment gap between the query and the evidence space. We propose Skill-RAG, a failure-aware RAG framework that couples a lightweight hidden-state prober with a prompt-based skill router. The prober gates retrieval at two pipeline stages; upon detecting a failure state, the skill router diagnoses the underlying cause and selects among four retrieval skills -- query rewriting, question decomposition, evidence focusing, and an exit skill for truly irreducible cases -- to correct misalignment before the next generation attempt. Experiments across multiple open-domain QA and complex reasoning benchmarks show that Skill-RAG substantially improves accuracy on hard cases persisting after multi-turn retrieval, with particularly strong gains on out-of-distribution datasets. Representation-space analyses further reveal that the proposed skills occupy structured, separable regions of the failure state space, supporting the view that query-evidence misalignment is a typed rather than monolithic phenomenon.

Chinese Translation

检索增强生成（RAG）已成为将大型语言模型与外部知识结合的基础范式。尽管自适应检索机制提高了检索效率，但现有方法将后检索失败视为重试的信号，而非诊断的依据，从而未能解决查询与证据不匹配的结构性原因。我们观察到，持久性检索失败的一个重要部分并非源于相关证据的缺失，而是源于查询与证据空间之间的对齐差距。我们提出了Skill-RAG，一个故障感知的RAG框架，它将轻量级的隐状态探测器与基于提示的技能路由器相结合。探测器在两个管道阶段对检索进行控制；在检测到故障状态时，技能路由器诊断潜在原因，并在四种检索技能中进行选择——查询重写、问题分解、证据聚焦，以及针对真正不可简化情况的退出技能——以在下一次生成尝试之前纠正不对齐。针对多个开放域问答和复杂推理基准的实验表明，Skill-RAG显著提高了在多轮检索后仍然存在的困难案例的准确性，尤其在分布外数据集上表现出强劲的提升。表示空间分析进一步揭示，所提出的技能占据了故障状态空间中结构化、可分离的区域，支持了查询与证据不匹配是一种类型化而非单一现象的观点。
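
The control flow described in the Skill-RAG abstract (a prober that gates retrieval at two stages, then a router that picks one of four skills) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say where exactly the two gates sit, so placing one before retrieval and one after each generation attempt is a guess, and every callable (`generate`, `retrieve`, `probe_failure`, `route`, the `skills` table) is a hypothetical stand-in rather than the authors' interface.

```python
def skill_rag(query, generate, retrieve, probe_failure, route, skills, max_rounds=3):
    """Answer `query`, applying a routed skill whenever the prober flags a failure.

    All callables are hypothetical stand-ins supplied by the caller:
      generate(query, evidence) -> str
      retrieve(query) -> list[str]
      probe_failure(query, evidence, draft) -> bool
      route(query, evidence) -> skill name ("rewrite", "decompose", "focus", "exit")
      skills[name](query, evidence) -> (new_query, new_evidence)
    """
    draft = generate(query, [])
    if not probe_failure(query, [], draft):            # assumed gate 1: no retrieval needed
        return draft
    evidence = retrieve(query)
    for attempt in range(max_rounds):
        draft = generate(query, evidence)
        if not probe_failure(query, evidence, draft):  # assumed gate 2: retrieval was enough
            return draft
        skill = route(query, evidence)                 # diagnose the failure type
        if skill == "exit" or attempt == max_rounds - 1:
            break                                      # irreducible case or budget exhausted
        query, evidence = skills[skill](query, evidence)
    return None


# Toy stand-ins, for illustration only.
DOCS = {"capital of France": ["Paris is the capital of France."]}

def toy_generate(q, ev):
    return "Paris" if ev else "unknown"

def toy_retrieve(q):
    return DOCS.get(q, [])

def toy_probe(q, ev, draft):
    return draft == "unknown"

toy_skills = {"rewrite": lambda q, ev: ("capital of France", toy_retrieve("capital of France"))}

rewritten = skill_rag("What's France's capital city?", toy_generate, toy_retrieve,
                      toy_probe, lambda q, ev: "rewrite", toy_skills)
gave_up = skill_rag("What's France's capital city?", toy_generate, toy_retrieve,
                    toy_probe, lambda q, ev: "exit", toy_skills)
```

In the toy run, the first retrieval finds nothing, the router picks the rewrite skill, and the second attempt succeeds; with a router that always returns the exit skill, the loop stops instead of retrying.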


cs.CL / 26 / 2604.15774

MemEvoBench: Benchmarking Memory MisEvolution in LLM Agents

MemEvoBench：大规模语言模型代理中的记忆误演变基准测试

Xie, Weiwei, Guo, Shaoxiong, Zhang, Fan, Xia, Tian, Yang, Xue, Ma, Lizhuang, Yan, Junchi, Ren, Qibing

Abstract

Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevolution. This phenomenon refers to the gradual behavioral drift resulting from repeated exposure to misleading information. To address this gap, we introduce MemEvoBench, the first benchmark evaluating long-horizon memory safety in LLM agents against adversarial memory injection, noisy tool outputs, and biased feedback. The framework consists of QA-style tasks across 7 domains and 36 risk types, complemented by workflow-style tasks adapted from 20 Agent-SafetyBench environments with noisy tool returns. Both settings employ mixed benign and misleading memory pools within multi-round interactions to simulate memory evolution. Experiments on representative models reveal substantial safety degradation under biased memory updates. Our analysis suggests that memory evolution is a significant contributor to these failures. Furthermore, static prompt-based defenses prove insufficient, underscoring the urgency of securing memory evolution in LLM agents.

Chinese Translation

为大规模语言模型（LLMs）配备持久性记忆可以增强交互的连续性和个性化，但也引入了新的安全风险。具体而言，受污染或偏见的记忆积累可能会引发异常的代理行为。现有的评估方法尚未建立一个标准化框架来衡量记忆误演变。该现象指的是由于反复接触误导性信息而导致的行为逐渐漂移。为了解决这一问题，我们提出了MemEvoBench，这是第一个评估LLM代理在面对对抗性记忆注入、噪声工具输出和偏见反馈时的长期记忆安全性的基准。该框架包含跨越7个领域和36种风险类型的问答风格任务，并结合从20个Agent-SafetyBench环境中改编的工作流风格任务，后者涉及噪声工具返回。在多轮交互中，两种设置都采用混合的良性和误导性记忆池，以模拟记忆演变。对代表性模型的实验表明，在偏见记忆更新下，安全性显著下降。我们的分析表明，记忆演变是导致这些失败的重要因素。此外，基于静态提示的防御措施被证明不足，突显了确保LLM代理记忆演变安全的紧迫性。


cs.CL / 27 / 2604.15776

PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection

PIIBench：一个统一的多源基准语料库用于个人可识别信息检测

Jha, Pritesh

Abstract

We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and financial-domain annotated text, yielding a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. We develop a principled normalization pipeline that maps 80+ source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of near-absent entity types, and produces stratified 80/10/10 train/validation/test splits preserving source distribution. To establish baseline difficulty, we evaluate eight published systems spanning rule-based engines (Microsoft Presidio), general-purpose NER models (spaCy, BERT-base NER, XLM-RoBERTa NER, SpanMarker mBERT, SpanMarker BERT), a PII-specific model (Piiranha DeBERTa), and a financial NER specialist (XtremeDistil FiNER). All systems achieve span-level F1 below 0.14, with the best system (Presidio, F1=0.1385) still producing zero recall on most entity types. These results directly quantify the domain-silo problem and demonstrate that PIIBench presents a substantially harder and more comprehensive evaluation challenge than any existing single-source PII dataset. The dataset construction pipeline and benchmark evaluation code are publicly available at https://github.com/pritesh-2711/pii-bench.

Chinese Translation

我们提出了PIIBench，这是一个用于自然语言文本中个人可识别信息（PII）检测的统一基准语料库。现有的PII检测资源在领域特定的语料库中碎片化，且注释方案互不兼容，阻碍了检测系统的系统性比较。我们整合了十个公开可用的数据集，涵盖了合成PII语料库、多语言命名实体识别（NER）基准和金融领域注释文本，形成了一个包含2,369,883个注释序列和335万个实体提及的语料库，涵盖48种标准PII实体类型。我们开发了一个原则性的标准化管道，将80多个源特定标签变体映射到标准化的BIO标记方案，应用基于频率的近缺失实体类型抑制，并生成保持源分布的分层80/10/10训练/验证/测试划分。为了建立基线难度，我们评估了八个已发布的系统，包括基于规则的引擎（Microsoft Presidio）、通用NER模型（spaCy、BERT-base NER、XLM-RoBERTa NER、SpanMarker mBERT、SpanMarker BERT）、一个特定于PII的模型（Piiranha DeBERTa）和一个金融NER专家（XtremeDistil FiNER）。所有系统的跨度级F1值均低于0.14，其中最佳系统（Presidio，F1=0.1385）在大多数实体类型上仍然产生零召回。这些结果直接量化了领域孤岛问题，并证明PIIBench提供了比任何现有单一源PII数据集更具挑战性和更全面的评估挑战。数据集构建管道和基准评估代码已公开可用，网址为https://github.com/pritesh-2711/pii-bench。
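
To make the normalization and scoring steps in the PIIBench abstract concrete, here is a minimal sketch. The label map is hypothetical (the actual 80+ variants and 48 canonical types are not listed in the abstract), mapping unknown labels to `O` is only a stand-in for the frequency-based suppression step, and exact-match span scoring is a common convention that is assumed here rather than confirmed for the benchmark.

```python
# Hypothetical label map: the real pipeline covers 80+ variants and 48 canonical types.
LABEL_MAP = {"PER": "PERSON", "GIVENNAME": "PERSON",
             "EMAIL_ADDRESS": "EMAIL", "email": "EMAIL"}

def normalize(tags):
    """Rewrite BIO tags onto canonical types; unmapped types fall back to 'O' (assumption)."""
    out = []
    for tag in tags:
        prefix, _, label = tag.partition("-")
        canon = LABEL_MAP.get(label)
        out.append(f"{prefix}-{canon}" if canon else "O")
    return out

def spans(tags):
    """Extract (start, end_exclusive, type) spans from a BIO sequence."""
    found, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):      # sentinel closes a trailing span
        inside = tag.startswith("I-") and kind == tag[2:]
        if not inside:
            if kind is not None:
                found.add((start, i, kind))
            start, kind = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return found

def span_f1(gold_tags, pred_tags):
    """Exact-match span-level F1 between two BIO sequences."""
    gold, pred = spans(gold_tags), spans(pred_tags)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = normalize(["B-PER", "I-PER", "O", "B-EMAIL_ADDRESS"])
pred = normalize(["B-GIVENNAME", "I-GIVENNAME", "O", "O"])
score = span_f1(gold, pred)
```

Here two differently labeled sources agree on the person span once normalized, while the missed e-mail span halves recall, which is the kind of per-type recall gap the abstract reports.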


cs.CL / 28 / 2604.15789

A Systematic Study of Training-Free Methods for Trustworthy Large Language Models

无训练方法在可信大型语言模型中的系统研究

Si, Wai Man, Li, Mingjie, Backes, Michael, Zhang, Yang

Abstract

As Large Language Models (LLMs) receive increasing attention and are being deployed across various domains, their potential risks, including generating harmful or biased content, producing unsupported claims, and exhibiting vulnerabilities to adversarial attacks, have drawn significant attention. To enable quick and low-cost adaptation, training-free methods have recently emerged as cost-effective alternatives to post-training alignment techniques. Despite their promising results, these methods are evaluated inconsistently across the literature, cover limited dimensions of trustworthiness, and can introduce undesirable side effects, such as utility degradation and increased brittleness. To fully assess the impacts of these training-free methods, we take a step back and systematically re-evaluate the effectiveness of existing training-free methods against various trustworthy settings and their influence on utility, robustness, and computational overhead. We also categorize these methods into three levels (input, internal, and output) based on where they intervene in the model's information flow during inference. Using this taxonomy, we conduct a comprehensive analysis of various representative and effective methods from each level across different LLM families and sizes. Our analysis highlights several trade-offs and unresolved challenges in current approaches. We summarize key findings and limitations in the existing literature, and propose practical recommendations for balancing trustworthiness, utility, and robustness in LLMs without the need for additional training.

Chinese Translation

随着大型语言模型（LLMs）受到越来越多的关注并在各个领域得到应用，它们潜在的风险，包括生成有害或偏见内容、产生不支持的主张以及对对抗攻击的脆弱性，已引起了广泛关注。为了实现快速且低成本的适应，最近出现的无训练方法作为后训练对齐技术的经济有效替代方案。尽管这些方法展现出良好的结果，但在文献中的评估却不一致，涵盖的可信度维度有限，并且可能引入不良副作用，如效用下降和脆弱性增加。为了全面评估这些无训练方法的影响，我们退后一步，系统地重新评估现有无训练方法在各种可信设置下的有效性及其对效用、鲁棒性和计算开销的影响。我们还根据这些方法在推理过程中干预模型信息流的位置，将其分类为三个层次（输入、内部和输出）。利用这一分类法，我们对来自不同LLM家族和规模的各种代表性和有效方法进行了全面分析。我们的分析突出了当前方法中的几个权衡和未解决的挑战。我们总结了现有文献中的关键发现和局限性，并提出了在无需额外训练的情况下平衡LLMs的可信度、效用和鲁棒性的实用建议。


cs.CL / 29 / 2604.15802

CHOP: Chunkwise Context-Preserving Framework for RAG on Multi Documents

CHOP：面向多文档的逐块上下文保持框架用于检索增强生成

Park, Hyunseok, Kim, Jihyeon, Kim, Jongeun, Yoon, Dongsik

Abstract

Retrieval-Augmented Generation (RAG) systems lose retrieval accuracy when similar documents coexist in the vector database, causing unnecessary information, hallucinations, and factual errors. To alleviate this issue, we propose CHOP, a framework that iteratively evaluates chunk relevance with Large Language Models (LLMs) and progressively reconstructs documents by determining their association with specific topics or query types. CHOP integrates two key components: the CNM-Extractor, which generates compact per-chunk signatures capturing categories, key nouns, and model names, and the Continuity Decision Module, which preserves contextual coherence by deciding whether consecutive chunks belong to the same document flow. By prefixing each chunk with context-aware metadata, CHOP reduces semantic conflicts among similar documents and enhances retriever discrimination. Experiments on benchmark datasets show that CHOP alleviates retrieval confusion and provides a scalable approach for building high-quality knowledge bases, achieving a Top-1 Hit Rate of 90.77% and notable gains in ranking quality metrics.

Chinese Translation

检索增强生成（RAG）系统在向量数据库中存在相似文档时会失去检索准确性，从而导致不必要的信息、幻觉和事实错误。为了解决这个问题，我们提出了CHOP，一个框架，通过与大型语言模型（LLMs）迭代评估块的相关性，并通过确定它们与特定主题或查询类型的关联来逐步重建文档。CHOP集成了两个关键组件：CNM-Extractor，它生成紧凑的每块签名，捕捉类别、关键名词和模型名称；以及连续性决策模块，它通过决定连续块是否属于同一文档流来保持上下文的一致性。通过在每个块前添加上下文感知的元数据，CHOP减少了相似文档之间的语义冲突，并增强了检索器的区分能力。在基准数据集上的实验表明，CHOP缓解了检索混淆，并提供了一种可扩展的方法来构建高质量知识库，达到了90.77%的Top-1命中率，并在排名质量指标上取得了显著提升。
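
The metadata-prefixing idea in the CHOP abstract can be illustrated with a small sketch. Everything concrete here is an assumption: the signature format, the function names, and in particular the rule that a chunk judged to continue the previous one inherits the previous signature. The abstract only says that signatures capture categories, key nouns, and model names, and that a continuity decision is made between consecutive chunks.

```python
def chunk_signature(category, key_nouns, model_names):
    """Build a compact one-line signature (hypothetical format) from extracted fields."""
    return (f"[category={category} | nouns={', '.join(key_nouns)} "
            f"| models={', '.join(model_names)}]")

def prefix_chunks(chunks, signatures, continues_previous):
    """Prefix each chunk with a signature before it is embedded and indexed.

    `continues_previous[i]` is the (assumed) output of a continuity decision:
    True means chunk i belongs to the same document flow as chunk i-1, in which
    case it reuses the signature currently in force.
    """
    out, current = [], None
    for chunk, sig, cont in zip(chunks, signatures, continues_previous):
        if not (cont and current is not None):
            current = sig
        out.append(f"{current}\n{chunk}")
    return out

sig_a = chunk_signature("manual", ["battery", "charger"], ["X100"])
sig_b = chunk_signature("manual", ["battery", "charger"], ["X200"])
indexed = prefix_chunks(
    ["Charge for two hours.", "Then unplug the charger.", "Charge for three hours."],
    [sig_a, sig_a, sig_b],
    [False, True, False],
)
```

The two manuals contain near-identical chunk text, so the prefixed model name is what lets a retriever tell them apart.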


cs.CL / 30 / 2604.15804

Qwen3.5-Omni Technical Report

Qwen3.5-Omni技术报告

Qwen Team

Abstract

In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.

Chinese Translation

在本研究中，我们介绍了Qwen3.5-Omni，这是Qwen-Omni模型系列的最新进展。Qwen3.5-Omni在其前身的基础上实现了显著的演变，参数规模达到数千亿，并支持256k的上下文长度。通过利用包含异构文本-视觉对和超过1亿小时音视频内容的大规模数据集，该模型展示了强大的全模态能力。Qwen3.5-Omni-plus在215个音频和音视频理解、推理和交互子任务及基准测试中取得了SOTA（最先进技术）结果，在关键音频任务上超越了Gemini-3.1 Pro，并在全面的音视频理解上与其持平。在架构上，Qwen3.5-Omni采用了混合注意力专家混合（Hybrid Attention Mixture-of-Experts, MoE）框架，适用于Thinker和Talker，从而实现高效的长序列推理。该模型支持复杂的交互，能够理解超过10小时的音频和400秒的720P视频（以1 FPS的速度）。为了解决流式语音合成中固有的不稳定性和不自然性（通常由文本和语音标记器之间的编码效率差异引起），我们引入了ARIA。ARIA动态对齐文本和语音单元，显著提升了对话语音的稳定性和韵律，同时对延迟影响最小。此外，Qwen3.5-Omni扩展了语言边界，支持10种语言的多语言理解和语音生成，具有人类般的情感细腻度。最后，Qwen3.5-Omni展现了卓越的音视频定位能力，生成具有精确时间同步和自动场景分割的脚本级结构化字幕。值得注意的是，我们观察到全模态模型中出现了一种新能力：基于音视频指令直接编写代码，我们称之为音视频氛围编码（Audio-Visual Vibe Coding）。


cs.CL / 31 / 2604.15840

CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

CoEvolve：通过代理-数据互进化训练大语言模型代理

Yang, Shidong, Ma, Ziyu, Huang, Tongwen, Hu, Yiming, Wang, Yong, Chu, Xiangxiang

Abstract

Reinforcement learning for LLM agents is typically conducted on a static data distribution, which fails to adapt to the agent's evolving behavior and leads to poor coverage of complex environment interactions. To address these challenges, we propose CoEvolve, an agent-data mutual evolution framework that enables LLM agents to improve through closed-loop, interaction-driven training. Specifically, CoEvolve extracts feedback signals such as forgetting and uncertainty from rollout trajectories to identify failure-prone interaction patterns, and utilizes them to guide LLM-based task synthesis. The synthesized tasks are validated through environment interaction and utilized to update the data distribution, enabling joint adaptation of the agent and its data. Extensive experiments on AppWorld and BFCL across Qwen2.5-7B, Qwen3-4B, and Qwen3-30B-A3B demonstrate consistent and significant improvements over strong base models, yielding absolute gains of 19.43%, 15.58%, and 18.14%, respectively.

Chinese Translation

大语言模型（LLM）代理的强化学习通常是在静态数据分布上进行的，这无法适应代理不断演变的行为，导致对复杂环境交互的覆盖不足。为了解决这些挑战，我们提出了CoEvolve，一种代理-数据互进化框架，使得LLM代理能够通过闭环的、基于交互的训练进行改进。具体而言，CoEvolve从回放轨迹中提取诸如遗忘和不确定性等反馈信号，以识别易出错的交互模式，并利用这些模式指导基于LLM的任务合成。合成的任务通过环境交互进行验证，并用于更新数据分布，从而实现代理及其数据的联合适应。在Qwen2.5-7B、Qwen3-4B和Qwen3-30B-A3B上的AppWorld和BFCL的广泛实验表明，与强基线模型相比，CoEvolve在性能上实现了一致且显著的提升，绝对增益分别为19.43%、15.58%和18.14%。


cs.CL / 32 / 2604.15841

Exploring the Capability Boundaries of LLMs in Mastering of Chinese Chouxiang Language

探索大型语言模型在掌握中文抽象语言方面的能力边界

Lin, Dianqing, Lan, Tian, Zhu, Jiali, Li, Jiang, Chen, Wei, Liu, Xu, Aruukhan, Su, Xiangdong, Hou, Hongxu, Gao, Guanglai

Abstract

While large language models (LLMs) have achieved remarkable success in general language tasks, their performance on Chouxiang Language, a representative subcultural language in the Chinese internet context, remains largely unexplored. In this paper, we introduce Mouse, a specialized benchmark designed to evaluate the capabilities of LLMs on NLP tasks involving Chouxiang Language across six tasks. Experimental results show that current state-of-the-art (SOTA) LLMs exhibit clear limitations on multiple tasks, while performing well on tasks that involve contextual semantic understanding. In addition, we further discuss the reasons behind the generally low performance of SOTA LLMs on Chouxiang Language, examine whether the LLM-as-a-judge approach adopted for translation tasks aligns with human judgments and values, and analyze the key factors that influence Chouxiang translation. Our study aims to promote further research in the NLP community on multicultural integration and the dynamics of evolving internet languages. Our code and data are publicly available.

Chinese Translation

尽管大型语言模型（LLMs）在一般语言任务中取得了显著成功，但它们在中文抽象语言这一代表性亚文化语言上的表现在很大程度上仍未被探索。本文介绍了 Mouse，一个专门设计的基准，用于评估 LLMs 在涉及中文抽象语言的自然语言处理（NLP）任务中的能力，共包含六个任务。实验结果表明，当前的最先进（SOTA）LLMs 在多个任务上表现出明显的局限性，而在涉及上下文语义理解的任务上表现良好。此外，我们进一步讨论了 SOTA LLMs 在中文抽象语言上普遍低表现的原因，考察了用于翻译任务的 LLM-as-a-judge 方法是否与人类的判断和价值观相一致，并分析了影响中文抽象语言翻译的关键因素。我们的研究旨在促进 NLP 社区在多文化融合和不断演变的互联网语言动态方面的进一步研究。我们的代码和数据已公开可用。


cs.CL / 33 / 2604.15842

Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms

解构大型语言模型中的数学推理：内部机制的方法论研究

Baeumel, Tanja, van Genabith, Josef, Ostermann, Simon

Abstract

Large language models (LLMs) have demonstrated impressive capabilities, yet their internal mechanisms for handling reasoning-intensive tasks remain underexplored. To advance the understanding of model-internal processing mechanisms, we present an investigation of how LLMs perform arithmetic operations by examining internal mechanisms during task execution. Using early decoding, we trace how next-token predictions are constructed across layers. Our experiments reveal that while the models recognize arithmetic tasks early, correct result generation occurs only in the final layers. Notably, models proficient in arithmetic exhibit a clear division of labor between attention and MLP modules, where attention propagates input information and MLP modules aggregate it. This division is absent in less proficient models. Furthermore, successful models appear to process more challenging arithmetic tasks functionally, suggesting reasoning capabilities beyond factual recall.

Chinese Translation

大型语言模型（LLMs）展现了令人印象深刻的能力，但它们在处理推理密集型任务时的内部机制仍然未被充分探讨。为了推动对模型内部处理机制的理解，我们对LLMs如何执行算术运算进行了调查，重点考察任务执行过程中的内部机制。通过早期解码，我们追踪了下一词预测在各层之间的构建过程。实验结果显示，尽管模型在早期就能识别算术任务，但正确结果的生成仅发生在最后几层。值得注意的是，擅长算术的模型在注意力和多层感知器（MLP）模块之间展现出明确的分工，其中注意力模块传播输入信息，而MLP模块则对其进行汇总。这种分工在能力较弱的模型中并不存在。此外，成功的模型似乎在功能上处理更具挑战性的算术任务，表明其推理能力超越了简单的事实回忆。
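
The early-decoding analysis mentioned in the abstract above can be pictured with a toy sketch: each layer's hidden state is projected through the output (unembedding) matrix to see which token the model would predict at that depth. This is a simplified illustration and not the paper's procedure; the three-token vocabulary, the 2-dimensional hidden states, and the omission of the final normalization layer are all assumptions made to keep the example self-contained.

```python
def early_decode(hidden_states, unembed, vocab):
    """Return the top-scoring token at every layer.

    hidden_states: one vector per layer (last-position residual stream, assumed)
    unembed:       one row of weights per vocabulary entry
    vocab:         token strings aligned with the rows of `unembed`
    """
    predictions = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        best = max(range(len(logits)), key=logits.__getitem__)
        predictions.append(vocab[best])
    return predictions

# Toy values: the prompt "3 + 4 =" should end in "7".
vocab = ["+", "7", "8"]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden_states = [[1.0, 0.2], [0.6, 0.5], [0.1, 0.9]]   # layers 0, 1, 2
per_layer = early_decode(hidden_states, unembed, vocab)
```

In this toy run the early layers still decode to the operator token and only the last layer decodes to the result, mirroring the abstract's observation that the task is recognized early while the correct answer appears only in the final layers.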


cs.CL / 34 / 2604.15847

CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization

CiPO：通过迭代偏好优化实现大型推理模型的反事实遗忘

Li, Junyi, Chen, Yongqiang, Ding, Ningning

Abstract

Machine unlearning has gained increasing attention in recent years as a promising technique to selectively remove unwanted private or copyrighted information from Large Language Models that are trained on massive amounts of human data. However, the emergence of Large Reasoning Models (LRMs), which emphasize long chain-of-thought (CoT) reasoning to address complex questions, presents a dilemma to unlearning: existing methods either struggle to completely eliminate undesired knowledge from the CoT traces or degrade reasoning performance by interfering with the reasoning process. To this end, we introduce Counterfactual Unlearning through iterative Preference Optimization (CiPO), a novel framework that redefines unlearning as the targeted intervention of the CoT reasoning in LRMs. More specifically, given a desired unlearning target answer, CiPO instructs LRMs to generate a logically valid counterfactual reasoning trace for preference tuning. As the LRM adjusts to the counterfactual trace, CiPO iteratively updates the preference learning data to increase the discrepancy from the original model. This iterative loop ensures both desirable unlearning and smooth optimization, effectively mitigating the dilemma. Experiments on challenging benchmarks demonstrate that CiPO excels at unlearning, completely removing knowledge from both the intermediate CoT steps and the final answer, while preserving the reasoning abilities of LRMs.

Chinese Translation

近年来，机器遗忘受到越来越多的关注，作为一种有前景的技术，可以从在大量人类数据上训练的大型语言模型中选择性地移除不需要的隐私或版权信息。然而，大型推理模型（LRMs）的出现，强调长链思维（CoT）推理以解决复杂问题，给遗忘带来了困境：现有方法要么难以完全消除CoT痕迹中的不必要知识，要么由于干扰推理过程而降低推理性能。为此，我们提出了通过迭代偏好优化的反事实遗忘（CiPO），这是一个新颖的框架，将遗忘重新定义为对LRMs中CoT推理的有针对性的干预。更具体地说，给定一个期望的遗忘目标答案，CiPO指导LRMs生成一个逻辑有效的反事实推理痕迹以进行偏好调优。当LRM调整到反事实痕迹时，CiPO迭代更新偏好学习数据，以增加与原始模型的差异。这个迭代循环确保了既能实现理想的遗忘，又能平滑优化，有效缓解了困境。在具有挑战性的基准测试中的实验表明，CiPO在遗忘方面表现出色，能够完全移除中间CoT步骤和最终答案中的知识，同时保留LRMs的推理能力。


cs.CL / 35 / 2604.15866

DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition

DiZiNER：通过试点标注模拟引导分歧的指令优化用于零样本命名实体识别

Kim, Siun, Yoon, Hyung-Jin

Abstract

Large language models (LLMs) have advanced information extraction (IE) by enabling zero-shot and few-shot named entity recognition (NER), yet their generative outputs still show persistent and systematic errors. Despite progress through instruction fine-tuning, zero-shot NER still lags far behind supervised systems. These recurring errors mirror inconsistencies observed in early-stage human annotation processes that resolve disagreements through pilot annotation. Motivated by this analogy, we introduce DiZiNER (Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition), a framework that simulates the pilot annotation process, employing LLMs to act as both annotators and supervisors. Multiple heterogeneous LLMs annotate shared texts, and a supervisor model analyzes inter-model disagreements to refine task instructions. Across 18 benchmarks, DiZiNER achieves zero-shot SOTA results on 14 datasets, improving prior bests by +8.0 F1 and reducing the zero-shot to supervised gap by over 11 points. It also consistently outperforms its supervisor, GPT-5 mini, indicating that improvements stem from disagreement-guided instruction refinement rather than model capacity. Pairwise agreement between models shows a strong correlation with NER performance, further supporting this finding.

Chinese Translation

大型语言模型（LLMs）通过实现零样本和少样本命名实体识别（NER）推动了信息提取（IE）的进展，但它们的生成输出仍然表现出持续的系统性错误。尽管通过指令微调取得了一定进展，零样本NER仍远远落后于监督系统。这些反复出现的错误反映了早期人类标注过程中观察到的不一致性，这些不一致性通过试点标注得以解决。受到这一类比的启发，我们提出了DiZiNER（Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition），一个模拟试点标注过程的框架，利用LLMs作为标注者和监督者。多个异构LLMs对共享文本进行标注，监督模型分析模型间的分歧，以优化任务指令。在18个基准测试中，DiZiNER在14个数据集上实现了零样本SOTA（state-of-the-art）结果，F1值提高了+8.0，并将零样本与监督之间的差距缩小了超过11个点。它还始终优于其监督者GPT-5 mini，表明改进源于引导分歧的指令优化，而非模型能力。模型间的成对一致性与NER性能之间显示出强相关性，进一步支持了这一发现。
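
The inter-model disagreement signal described in the DiZiNER abstract can be sketched as follows. This is an assumption-laden illustration: the abstract does not define how pairwise agreement is measured, so the sketch uses F1 between two annotators' entity sets (equivalent to the Dice coefficient), and the annotator names and entity tuples are invented for the example.

```python
from itertools import combinations

def pair_f1(a, b):
    """F1 between two sets of (mention, type) annotations; 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def pairwise_agreement(annotations):
    """Map each unordered annotator pair to its agreement score."""
    return {
        (x, y): pair_f1(annotations[x], annotations[y])
        for x, y in combinations(sorted(annotations), 2)
    }

def disputed(annotations):
    """Annotations proposed by at least one annotator but not by all of them."""
    sets = list(annotations.values())
    return set.union(*sets) - set.intersection(*sets)

# Hypothetical outputs from three LLM annotators on one sentence.
annotations = {
    "model_a": {("Paris", "LOC"), ("Apple", "ORG")},
    "model_b": {("Paris", "LOC"), ("Apple", "MISC")},
    "model_c": {("Paris", "LOC")},
}
agreement = pairwise_agreement(annotations)
to_review = disputed(annotations)
```

In a pipeline like the one the abstract describes, the disputed items are what a supervisor model would inspect when rewriting the task instructions; here the disagreement is about whether and how to type "Apple".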


cs.CL / 36 / 2604.15873

How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models

你的大型语言模型法官有多虚伪？大型语言模型的语用能力中的听者-说话者不对称性

Sieker, Judith, Zarrieß, Sina

Abstract

Large language models (LLMs) are increasingly studied as repositories of linguistic knowledge. In this line of work, models are commonly evaluated both as generators of language and as judges of linguistic output, yet these two roles are rarely examined in direct relation to one another. As a result, it remains unclear whether success in one role aligns with success in the other. In this paper, we address this question for pragmatic competence by comparing LLMs' performance as pragmatic listeners, judging the appropriateness of linguistic outputs, and as pragmatic speakers, generating pragmatically appropriate language. We evaluate multiple open-weight and proprietary LLMs across three pragmatic settings. We find a robust asymmetry between pragmatic evaluation and pragmatic generation: many models perform substantially better as listeners than as speakers. Our results suggest that pragmatic judging and pragmatic generation are only weakly aligned in current LLMs, calling for more integrated evaluation practices.

Chinese Translation

大型语言模型（LLMs）越来越多地被研究作为语言知识的存储库。在这方面，模型通常被评估为语言生成者和语言输出的评判者，但这两种角色之间的直接关系很少被研究。因此，目前尚不清楚在一个角色中的成功是否与另一个角色中的成功相一致。本文通过比较LLMs作为语用听者（评判语言输出的适当性）和作为语用说话者（生成语用适当的语言）的表现，来探讨这一问题。我们在三个语用情境中评估了多种开放权重和专有的LLMs。我们发现语用评估与语用生成之间存在显著的不对称性：许多模型作为听者的表现明显优于作为说话者的表现。我们的结果表明，当前LLMs中的语用判断和语用生成之间的关联性较弱，这呼吁更为综合的评估实践。


cs.CL / 37 / 2604.15929

MUSCAT: MUltilingual, SCientific ConversATion Benchmark

MUSCAT：多语言科学对话基准

Sinhamahapatra, Supriti, Nguyen, Thai-Binh, Oğuz, Yiğit, Ugan, Enes, Niehues, Jan, Waibel, Alexander

Abstract

The goal of multilingual speech technology is to facilitate seamless communication between individuals speaking different languages, creating the experience as though everyone were a multilingual speaker. To create this experience, speech technology needs to address several challenges: handling mixed multilingual input, specific vocabulary, and code-switching. However, there is currently no dataset benchmarking this situation. We propose a new benchmark to evaluate whether current Automatic Speech Recognition (ASR) systems are able to handle these challenges. The benchmark consists of bilingual discussions on scientific papers between multiple speakers, each conversing in a different language. We provide a standard evaluation framework that goes beyond Word Error Rate (WER), enabling consistent comparison of ASR performance across languages. Experimental results demonstrate that the proposed dataset is still an open challenge for state-of-the-art ASR systems. The dataset is available at https://huggingface.co/datasets/goodpiku/muscat-eval.

Keywords: multilingual, speech recognition, audio segmentation, speaker diarization

Chinese Translation

多语言语音技术的目标是促进不同语言使用者之间的无缝沟通，创造出每个人都像是多语言使用者的体验。为了实现这一体验，语音技术需要解决几个挑战：处理混合的多语言输入、特定词汇和代码切换。然而，目前尚无数据集对这种情况进行基准测试。我们提出了一个新的基准，以评估当前的自动语音识别（ASR）系统是否能够应对这些挑战。该基准由多个说话者之间关于科学论文的双语讨论组成，每位说话者使用不同的语言进行交流。我们提供了一个标准评估框架，超越了字错误率（WER），使得不同语言的ASR性能能够进行一致的比较。实验结果表明，所提出的数据集仍然是当前最先进的ASR系统面临的一个开放挑战。数据集可在 https://huggingface.co/datasets/goodpiku/muscat-eval 获取。

关键词：多语言，语音识别，音频分割，发言者分离
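
For reference, the Word Error Rate baseline that the MUSCAT abstract says its framework goes beyond is the word-level edit distance between hypothesis and reference, divided by the reference length. The sketch below is a plain implementation of that standard definition; whitespace tokenization is an assumption (it does not suit every language in a multilingual benchmark), and nothing here reflects the benchmark's additional metrics.

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

substitution = wer("the model reads papers", "the model read papers")
deletion = wer("a b c", "a c")
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason a single WER figure is hard to compare across languages and segmentations.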


cs.CL / 38 / 2604.15945

RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

RAGognizer：通过检测头集成实现的幻觉感知微调

Ridder, Fabian, Lessel, Laurin, Schilling, Malte

Abstract

Retrieval-Augmented Generation (RAG) is widely used to augment the input to Large Language Models (LLMs) with external information, such as recent or domain-specific knowledge. Nonetheless, current models still produce closed-domain hallucinations and generate content that is unsupported by the retrieved context. Current detection approaches typically treat hallucination as a post-hoc problem, relying on black-box consistency checks or probes over frozen internal representations. In this work, we demonstrate that hallucination detection based on internal state representation can also serve as a direct training signal. We introduce RAGognize, a dataset of naturally occurring closed-domain hallucinations with token-level annotations, and RAGognizer, a hallucination-aware fine-tuning approach that integrates a lightweight detection head into an LLM, allowing for the joint optimization of language modeling and hallucination detection. This joint objective forces the model to improve the separability of its internal states regarding hallucinations while simultaneously learning to generate well-formed and meaningful responses. Across multiple benchmarks, RAGognizer achieves state-of-the-art token-level hallucination detection while substantially reducing hallucination rates during generation, without degrading language quality or relevance.

Chinese Translation

检索增强生成（Retrieval-Augmented Generation，RAG）广泛用于通过外部信息增强大型语言模型（Large Language Models，LLMs）的输入，例如最近的或特定领域的知识。然而，当前模型仍然会产生封闭域幻觉，并生成与检索上下文不符的内容。目前的检测方法通常将幻觉视为事后问题，依赖于黑箱一致性检查或对冻结内部表示的探测。在本研究中，我们证明了基于内部状态表示的幻觉检测也可以作为直接的训练信号。我们引入了RAGognize，这是一个自然发生的封闭域幻觉数据集，具有令牌级注释，以及RAGognizer，一种幻觉感知微调方法，它将轻量级检测头集成到LLM中，从而允许语言建模和幻觉检测的联合优化。这一联合目标迫使模型改善其内部状态在幻觉方面的可分离性，同时学习生成结构良好且有意义的响应。在多个基准测试中，RAGognizer实现了最先进的令牌级幻觉检测，同时在生成过程中显著降低了幻觉率，而不降低语言质量或相关性。
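
The joint optimization of language modeling and hallucination detection described in the RAGognizer abstract can be written down as a toy objective. The additive form and the `weight` parameter are assumptions made for illustration (the abstract does not state how the two losses are combined), and the inputs are plain probability lists rather than model outputs.

```python
import math

def lm_loss(token_probs):
    """Mean negative log-likelihood of the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def detection_loss(halluc_probs, labels):
    """Mean binary cross-entropy of a per-token hallucination head.

    halluc_probs[i] is the head's probability that token i is hallucinated;
    labels[i] is 1 for annotated hallucinated tokens and 0 otherwise.
    """
    total = sum(math.log(p if y else 1.0 - p) for p, y in zip(halluc_probs, labels))
    return -total / len(labels)

def joint_loss(token_probs, halluc_probs, labels, weight=1.0):
    """Assumed combination: language-modeling loss plus a weighted detection loss."""
    return lm_loss(token_probs) + weight * detection_loss(halluc_probs, labels)

# Two tokens, the second annotated as hallucinated; the head is maximally unsure.
loss = joint_loss(token_probs=[0.5, 0.5], halluc_probs=[0.5, 0.5], labels=[0, 1])
```

With both components sharing the same hidden states, gradients from the detection term are what would push those states to separate supported from unsupported tokens, which is the effect the abstract attributes to the joint objective.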


cs.CL / 39 / 2604.15998

SCHK-HTC: Sibling Contrastive Learning with Hierarchical Knowledge-Aware Prompt Tuning for Hierarchical Text Classification

SCHK-HTC：基于层次知识感知提示调优的兄弟对比学习用于层次文本分类

Xiong, Ke, Wu, Qian, Gan, Wangjie, Li, Yuke, Zhang, Xuhong

Abstract

Few-shot Hierarchical Text Classification (few-shot HTC) is a challenging task that involves mapping texts to a predefined tree-structured label hierarchy under data-scarce conditions. While current approaches utilize structural constraints from the label hierarchy to maintain parent-child prediction consistency, they face a critical bottleneck: the difficulty of distinguishing semantically similar sibling classes due to insufficient domain knowledge. We introduce an innovative method named Sibling Contrastive Learning with Hierarchical Knowledge-aware Prompt Tuning for few-shot HTC tasks (SCHK-HTC). Our work enhances the model's perception of subtle differences between sibling classes at deeper levels, rather than just enforcing hierarchical rules. Specifically, we propose a novel framework featuring two core components: a hierarchical knowledge extraction module and a sibling contrastive learning mechanism. This design guides the model to encode discriminative features at each hierarchy level, thus improving the separability of confusable classes. Our approach achieves superior performance across three benchmark datasets, surpassing existing state-of-the-art methods in most cases. Our code is available at https://github.com/happywinder/SCHK-HTC.

Chinese Translation

少样本层次文本分类（few-shot HTC）是一项具有挑战性的任务，涉及在数据稀缺条件下将文本映射到预定义的树状标签层次结构。当前的方法利用标签层次结构中的结构约束来保持父子预测的一致性，但面临一个关键瓶颈，即由于领域知识不足，难以区分语义相似的兄弟类别。我们提出了一种创新的方法，称为基于层次知识感知提示调优的兄弟对比学习（SCHK-HTC），用于少样本HTC任务。我们的工作增强了模型对兄弟类别在更深层次上细微差异的感知，而不仅仅是强制执行层次规则。具体而言，我们提出了一个新颖的框架，具有两个核心组件：一个层次知识提取模块和一个兄弟对比学习机制。该设计引导模型在每个层次级别编码区分特征，从而提高混淆类别的可分性。我们的方法在三个基准数据集上实现了优越的性能，在大多数情况下超越了现有的最先进方法。我们的代码可在 https://github.com/happywinder/SCHK-HTC 获取。


cs.CL / 40 / 2604.16004

AgentV-RL: Scaling Reward Modeling with Agentic Verifier

AgentV-RL：通过代理验证器扩展奖励建模

Zhang, Jiazheng, Fu, Ziche, Xi, Zhiheng, Jing, Wenqing, Chai, Mingxu, He, Wei, Zhang, Guoqiang, Fan, Chenghao, An, Chenxin, Chen, Wenxiang, Liu, Zhicheng, Pan, Haojie, Zhu, Dingwei, Gui, Tao, Zhang, Qi, Huang, Xuanjing

Abstract

Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while a lack of external grounding makes verifiers unreliable on computation- or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool-use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning it as a promising paradigm for agentic reward modeling.

cs.CL / 41 / 2604.16027

Where does output diversity collapse in post-training?

Karouzos, Constantinos, Tan, Xingwei, Aletras, Nikolaos

Abstract

Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3: Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.

cs.CL / 42 / 2604.16029

Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

Bi, Jiaxi, Luo, Tongxu, Du, Wenyu, Tang, Zhengyang, Wang, Benyou

Abstract

Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, path pruning at the prefix level is essential, yet existing research remains fragmented without a standardized framework. In this work, we propose the first systematic taxonomy of path pruning, categorizing methods by their signal source (internal vs. external) and learnability (learnable vs. non-learnable). This classification reveals the unexplored potential of learnable internal methods, motivating our proposal of STOP (Super TOken for Pruning). Extensive evaluations across LRMs ranging from 1.5B to 20B parameters demonstrate that STOP achieves superior effectiveness and efficiency compared to existing baselines. Furthermore, we rigorously validate the scalability of STOP under varying compute budgets; for instance, it boosts GPT-OSS-20B accuracy on AIME25 from 84% to nearly 90% under a fixed compute budget. Finally, we distill our findings into formalized empirical guidelines to facilitate optimal real-world deployment. Code, data and models are available at https://bijiaxihh.github.io/STOP.

cs.CL / 43 / 2604.16037

Stochasticity in Tokenisation Improves Robustness

Steger, Sophie, Li, Rui, Ennadir, Sofiane, Sims, Anya, Solin, Arno, Pernkopf, Franz, Trapp, Martin

Abstract

The widespread adoption of large language models (LLMs) has increased concerns about their robustness. Vulnerability to perturbations of the input tokenisation indicates that models trained with a deterministic canonical tokenisation can be brittle to adversarial attacks. Recent studies suggest that stochastic tokenisation can deliver internal representations that are less sensitive to perturbations. In this paper, we analyse how stochastic tokenisations affect robustness to adversarial attacks and random perturbations. We systematically study this over a range of learning regimes (pre-training, supervised fine-tuning, and in-context learning), data sets, and model architectures. We show that pre-training and fine-tuning with uniformly sampled stochastic tokenisations improve robustness to random and adversarial perturbations. Evaluating on uniformly sampled non-canonical tokenisations reduces the accuracy of a canonically trained Llama-1b model by 29.8%. We find that training with stochastic tokenisation preserves accuracy without increasing inference cost.

cs.CL / 44 / 2604.16042

Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures

Gao, Yutong, Meng, Qinglin, Zhou, Yuan, Pan, Liangming

Abstract

While Large Language Models (LLMs) have achieved strong performance across many NLP tasks, their opaque internal mechanisms hinder trustworthiness and safe deployment. Existing surveys in explainable AI largely focus on post-hoc explanation methods that interpret trained models through external approximations. In contrast, intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative. This paper presents a systematic review of the recent advances in intrinsic interpretability for LLMs, categorizing existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. We further discuss open challenges and outline future research directions in this emerging field. The paper list is available at: https://github.com/PKU-PILLAR-Group/Survey-Intrinsic-Interpretability-of-LLMs.

cs.CL / 45 / 2604.16132

Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors

Zhu, Jessica H., Stringfield, Shayla, Zaprosyan, Vahe, Wagner, Michael, Cukier, Michel, Richardson Jr, Joseph B.

Abstract

Firearm violence is a pressing public health issue, yet research into survivors' lived experiences remains underfunded and difficult to scale. Qualitative research, including in-depth interviews, is a valuable tool for understanding the personal and societal consequences of community firearm violence and designing effective interventions. However, manually analyzing these narratives through thematic analysis and inductive coding is time-consuming and labor-intensive. Recent advancements in large language models (LLMs) have opened the door to automating this process, though concerns remain about whether these models can accurately and ethically capture the experiences of vulnerable populations. In this study, we assess the use of open-source LLMs to inductively code interviews with 21 Black men who have survived community firearm violence. Our results demonstrate that while some configurations of LLMs can identify important codes, overall relevance remains low and is highly sensitive to data processing. Furthermore, LLM guardrails lead to substantial narrative erasure. These findings highlight both the potential and limitations of LLM-assisted qualitative coding and underscore the ethical challenges of applying AI in research involving marginalized communities.

cs.CL / 46 / 2604.16138

Sentiment Analysis of German Sign Language Fairy Tales

Nunnari, Fabrizio, Jain, Siddhant, Gebhard, Patrick

Abstract

We present a dataset and a model for sentiment analysis of German sign language (DGS) fairy tales. First, we perform sentiment analysis for three levels of valence (negative, neutral, positive) on German fairy-tale text segments using four large language models (LLMs) and majority voting, reaching an inter-annotator agreement of 0.781 (Krippendorff's alpha). Second, we extract face and body motion features from each corresponding DGS video segment using MediaPipe. Finally, we train an explainable model (based on XGBoost) to predict negative, neutral or positive sentiment from video features. Results show an average balanced accuracy of 0.631. A thorough analysis of the most important features reveals that, in addition to eyebrow and mouth motion on the face, the motion of the hips, elbows, and shoulders also contributes considerably to discriminating the conveyed sentiment, indicating an equal importance of face and body for sentiment communication in sign language.
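
As an illustration of the majority-voting step mentioned in the abstract, a minimal sketch follows. The example labels and the tie-breaking rule (a tie falls back to neutral) are assumptions made here for illustration; the abstract does not say how ties among the four models are resolved.

```python
from collections import Counter


def majority_vote(labels, tie_label="neutral"):
    """Return the most common label; fall back to tie_label on a tie (assumed rule)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]


# Hypothetical labels from four annotator models for three text segments.
segment_labels = [
    ["negative", "negative", "neutral", "negative"],
    ["positive", "neutral", "positive", "neutral"],
    ["neutral", "neutral", "neutral", "positive"],
]
voted = [majority_vote(labels) for labels in segment_labels]
print(voted)
```

With an even number of voters, a 2-2 split is possible, so any real pipeline of this kind needs some explicit tie rule; the neutral fallback above is only one plausible choice.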

cs.CL / 47 / 2604.16146

On the Rejection Criterion for Proxy-based Test-time Alignment

Hammal, Ayoub, Zweigenbaum, Pierre, Corro, Caio

Abstract

Recent work has proposed test-time alignment methods that rely on a small aligned model as a proxy that guides the generation of a larger base (unaligned) model. The implicit reward approach skews the large model distribution, whereas the nudging approach defers the generation of the next token to the small aligned model when the large base model lacks confidence in its outcome. In this work, we first show that both approaches can be reduced to sampling from similar graphical models, where they differ only in the definition of a rejection criterion (or distribution). Moreover, we argue that the confidence criterion is ill-motivated due to linguistic phenomena like ambiguous phrasing. We propose a novel rejection criterion based on a conservative confidence bet. Experimentally, our novel approach outperforms previous work on several datasets.
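
The confidence-based deferral ("nudging") rule described in the abstract can be sketched roughly as below. Greedy argmax decoding, the 0.5 threshold, and the toy next-token distributions are all assumptions for illustration; the conservative-bet criterion the paper proposes is not reproduced here.

```python
def next_token(base_probs, proxy_probs, confidence_threshold=0.5):
    """Keep the base model's top token unless its probability is below the threshold.

    base_probs / proxy_probs are hypothetical next-token distributions
    (token -> probability) from the large base model and the small aligned proxy.
    """
    base_token = max(base_probs, key=base_probs.get)
    if base_probs[base_token] >= confidence_threshold:
        return base_token, "base"
    # The base model is "rejected" at this step: defer to the aligned proxy.
    proxy_token = max(proxy_probs, key=proxy_probs.get)
    return proxy_token, "proxy"


confident = next_token({"Paris": 0.9, "Lyon": 0.1}, {"Paris": 0.6, "Sure": 0.4})
unsure = next_token({"I": 0.35, "Sorry": 0.33, "The": 0.32}, {"Sorry": 0.8, "I": 0.2})
print(confident, unsure)
```

In this framing the only part that changes between methods is the test inside the `if`, which is the rejection criterion the abstract refers to.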

cs.CL / 48 / 2604.16158

AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency

Höth, Max Henning, Kersting, Kristian, Deiseroth, Björn, Parcalabescu, Letitia

Abstract

Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both contributes to and faithfully reflects the processes underlying the model's final answer, rather than merely accompanying it, remains challenging. We introduce AtManRL, a method that leverages differentiable attention manipulation to learn more faithful reasoning through reinforcement learning. By training an additive attention mask that identifies tokens in the CoT crucial for producing correct answers, we derive a saliency reward signal that encourages the model to generate reasoning traces that genuinely influence its final predictions. We integrate this saliency reward with outcome-based rewards within the GRPO framework to jointly optimize for correctness and interpretability. Experiments on GSM8K and MMLU with Llama-3.2-3B-Instruct demonstrate that our approach can identify influential reasoning tokens and enable training more transparent reasoning models.

cs.CL / 49 / 2604.16217

Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations

Wang, Yanli, Kuang, Peng, Han, Xiaoyu, Xu, Kaidi, Wang, Haohan

Abstract

Large language models are increasingly deployed in settings where reliability matters, yet output-level uncertainty signals such as token probabilities, entropy, and self-consistency can become brittle under calibration-deployment mismatch. Conformal prediction provides finite-sample validity under exchangeability, but its practical usefulness depends on the quality of the nonconformity score. We propose a conformal framework for LLM question answering that uses internal representations rather than output-facing statistics: specifically, we introduce Layer-Wise Information (LI) scores, which measure how conditioning on the input reshapes predictive entropy across model depth, and use them as nonconformity scores within a standard split conformal pipeline. Across closed-ended and open-domain QA benchmarks, with the clearest gains under cross-domain shift, our method achieves a better validity-efficiency trade-off than strong text-level baselines while maintaining competitive in-domain reliability at the same nominal risk level. These results suggest that internal representations can provide more informative conformal scores when surface-level uncertainty is unstable under distribution shift.
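
For readers unfamiliar with the "standard split conformal pipeline" the abstract refers to, a minimal sketch follows. The nonconformity scores here are made-up numbers standing in for any score function; they are not the paper's LI scores, and the helper names are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points: keep every candidate
    return sorted(calibration_scores)[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate answer whose nonconformity score is within the threshold."""
    return [answer for answer, score in candidate_scores.items() if score <= threshold]


# Hypothetical nonconformity scores of the true answers on a held-out calibration split.
calibration = [0.10, 0.25, 0.30, 0.42, 0.55, 0.61, 0.70, 0.83, 0.90]
threshold = conformal_threshold(calibration, alpha=0.2)

# Hypothetical scores for three candidate answers to a new question.
candidates = {"A": 0.15, "B": 0.65, "C": 0.95}
answers = prediction_set(candidates, threshold)
print(threshold, answers)
```

The coverage guarantee comes from the quantile step alone; the choice of score function only affects how small the resulting prediction sets are, which is the efficiency side of the trade-off discussed above.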

cs.CL / 50 / 2604.16235

Optimizing Korean-Centric LLMs via Token Pruning

Kim, Hoyeol, Kim, Hyeonwoo

Abstract

This paper presents a systematic benchmark of state-of-the-art multilingual large language models (LLMs) adapted via token pruning, a compression technique that eliminates tokens and embedding parameters corresponding to languages irrelevant to the target application. Focusing on Korean-centric natural language processing (NLP) tasks, we evaluate architectures including Qwen3, Gemma-3, Llama-3, and Aya across three vocabulary configurations: Original, English-Korean (EnKo), and English-Korean-Chinese (EnKoZh). Performance is assessed using established benchmarks for general aptitude, cultural literacy, instruction following, and machine translation. Our findings indicate that token pruning significantly improves generation stability by eliminating language confusion, and in the case of machine translation, frequently enhances performance on Korean-specific tasks. While instruction-following capabilities display architecture-dependent variance linked to latent cross-lingual representations, the significant reduction in vocabulary size validates token pruning as a highly effective optimization strategy for memory-constrained, domain-specific deployments, despite modest gains in inference latency.
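
The idea of token pruning described in the abstract can be illustrated with a toy sketch: drop vocabulary entries outside the target languages and keep only the matching embedding rows. The five-token vocabulary, the two-dimensional embeddings, and the character-range language filter are assumptions for illustration; a real tokenizer would also need its merge rules and special tokens handled, which this sketch ignores.

```python
def prune_vocabulary(vocab, embeddings, keep_token):
    """Drop tokens rejected by keep_token and re-index the embedding rows."""
    kept = [
        (token, old_idx)
        for token, old_idx in sorted(vocab.items(), key=lambda kv: kv[1])
        if keep_token(token)
    ]
    new_vocab = {token: new_idx for new_idx, (token, _) in enumerate(kept)}
    new_embeddings = [embeddings[old_idx] for _, old_idx in kept]
    return new_vocab, new_embeddings


def is_english_or_korean(token):
    """Toy filter: ASCII characters or Hangul syllables (U+AC00 to U+D7A3)."""
    return all(ch.isascii() or "\uac00" <= ch <= "\ud7a3" for ch in token)


# Hypothetical vocabulary (token -> row index) and embedding table.
vocab = {"hello": 0, "안녕": 1, "你好": 2, "world": 3, "こんにちは": 4}
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]

new_vocab, new_embeddings = prune_vocabulary(vocab, embeddings, is_english_or_korean)
print(new_vocab, new_embeddings)
```

Because the embedding table shrinks with the vocabulary, the saving is mainly in memory, which is consistent with the abstract's note that latency gains are modest.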

cs.CL / 51 / 2604.16241

BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

Shen, Jiacheng, Hagiwara, Masato, Alizadeh, Milad, Gilsenan-McMahon, Ellen, Miron, Marius, Robinson, David, Chemla, Emmanuel, Keen, Sara, Narula, Gagan, Laurière, Mathieu, Geist, Matthieu, Pietquin, Olivier

Abstract

Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures animal-related knowledge of models without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.

cs.CL / 52 / 2604.16262

SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation

Sumanathilaka, Deshan, Micallef, Nicholas, Hough, Julian, Jayasinghe, Saman

Abstract

Recent advances in language models have substantially improved Natural Language Understanding (NLU). Although widely used benchmarks suggest that Large Language Models (LLMs) can disambiguate word senses effectively, their practical applicability in real-world narrative contexts remains underexplored. SemEval-2026 Task 5 addresses this gap by introducing a task that predicts the human-perceived plausibility of a word sense within a short story. In this work, we propose an LLM-based framework for plausibility scoring of homonymous word senses in narrative texts using a structured reasoning mechanism. We examine the impact of fine-tuning low-parameter LLMs with diverse reasoning strategies, alongside dynamic few-shot prompting for large-parameter models, on accurate sense identification and plausibility estimation. Our results show that commercial large-parameter LLMs with dynamic few-shot prompting closely replicate human-like plausibility judgments. Furthermore, model ensembling slightly improves performance, better simulating the agreement patterns of five human annotators compared to single-model predictions.

cs.CL / 53 / 2604.16270

From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text

Le, Van-Truong

Abstract

The complexity of Vietnam's legal texts presents a significant barrier to public access to justice. While Large Language Models offer a promising solution for legal text simplification, evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics. This paper introduces a comprehensive dual-aspect evaluation framework to address this need. First, we establish a performance benchmark for four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) across three key dimensions: Accuracy, Readability, and Consistency. Second, to understand the "why" behind these performance scores, we conduct a large-scale error analysis on a curated dataset of 60 complex Vietnamese legal articles, using a novel, expert-validated error typology. Our results reveal a crucial trade-off: models like Grok-1 excel in Readability and Consistency but compromise on fine-grained legal Accuracy, while models like Claude 3 Opus achieve high Accuracy scores that mask a significant number of subtle but critical reasoning errors. The error analysis pinpoints Incorrect Example and Misinterpretation as the most prevalent failures, confirming that the primary challenge for current LLMs is not summarization but controlled, accurate legal reasoning. By integrating a quantitative benchmark with a qualitative deep dive, our work provides a holistic and actionable assessment of LLMs for legal applications.

cs.CL / 54 / 2604.16275

No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus

Mehta, Hitesh, Saxena, Arjit, Chhikara, Garima, Kumar, Rohit

Abstract

This paper explores the response of Large Language Models (LLMs) to user prompts with different degrees of politeness and impoliteness. The Politeness Theory by Brown and Levinson and the Impoliteness Framework by Culpeper form the basis of experiments conducted across three languages (English, Hindi, Spanish), five models (Gemini-Pro, GPT-4o Mini, Claude 3.7 Sonnet, DeepSeek-Chat, and Llama 3), and three interaction histories between users (raw, polite, and impolite). Our sample consists of 22,500 pairs of prompts and responses of various types, evaluated across five levels of politeness using an eight-factor assessment framework: coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, and readability. The findings show that model performance is highly influenced by tone, dialogue history, and language. While polite prompts enhance the average response quality by up to ~11% and impolite tones worsen it, these effects are neither consistent nor universal across languages and models. English is best served by courteous or direct tones, Hindi by deferential and indirect tones, and Spanish by assertive tones. Among the models, Llama is the most tone-sensitive (11.5% range), whereas GPT is more robust to adversarial tone. These results indicate that politeness is a quantifiable computational variable that affects LLM behaviour, though its impact is language- and model-dependent rather than universal. To support reproducibility and future work, we additionally release PLUM (Politeness Levels in Utterances, Multilingual), a publicly available corpus of 1,500 human-validated prompts across three languages and five politeness categories, and provide a formal supplementary analysis of six falsifiable hypotheses derived from politeness theory, empirically assessed against the dataset.

arXiv Papers

Foundation Models in Robotics: A Comprehensive Review of Methods, Models, Datasets, Challenges and Future Research Directions

Iterated Invariant EKF for Quadruped Robot Odometry

One-Shot Cross-Geometry Skill Transfer through Part Decomposition

NeuroMesh: A Unified Neural Inference Framework for Decentralized Multi-Robot Collaboration

Trajectory Planning for Safe Dual Control with Active Exploration

ShapeGen: Robotic Data Generation for Category-Level Manipulation

GaussianFlow SLAM: Monocular Gaussian Splatting SLAM Guided by GaussianFlow

Factor Graph-Based Shape Estimation for Continuum Robots via Magnus Expansion

Contact-Aware Planning and Control of Continuum Robots in Highly Constrained Environments

Long-Term Memory for VLA-based Agents in Open-World Task Execution

Fuzzy Logic Theory-based Adaptive Reward Shaping for Robust Reinforcement Learning (FARS)

From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation

Limits of Lamarckian Evolution Under Pressure of Morphological Novelty

Environment-Adaptive Solid-State LiDAR-Inertial Odometry

DTEA: A Dual-Topology Elastic Actuator Enabling Real-Time Switching Between Series and Parallel Compliance

Robust Fleet Sizing for Multi-UAV Inspection Missions under Synchronized Replacement Demand

A Reconfigurable Pneumatic Joint Enabling Localized Selective Stiffening and Shape Locking in Vine-Inspired Robots

VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation

DENALI: A Dataset Enabling Non-Line-of-Sight Spatial Reasoning with Low-Cost LiDARs

Semantic Area Graph Reasoning for Multi-Robot Language-Guided Search

Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines

Weak-to-Strong Knowledge Distillation Accelerates Visual Learning

(1D) Ordered Tokens Enable Efficient Test-Time Search

Frequency-Aware Flow Matching for High-Quality Image Generation

UA-Net: Uncertainty-Aware Network for TRISO Image Semantic Segmentation

CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification

CLIMB: Controllable Longitudinal Brain Image Generation using Mamba-based Latent Diffusion Model and Gaussian-aligned Autoencoder

AdaVFM: Adaptive Vision Foundation Models for Edge Intelligence via LLM-Guided Execution

SIMMER: Cross-Modal Food Image-Recipe Retrieval via MLLM-Based Embedding

Causal Bootstrapped Alignment for Unsupervised Video-Based Visible-Infrared Person Re-Identification

SPLIT: Self-supervised Partitioning for Learned Inversion in Nonlinear Tomography

Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline

From Zero to Detail: A Progressive Spectral Decoupling Paradigm for UHD Image Restoration with New Benchmark

CPU Optimization of a Monocular 3D Biomechanics Pipeline for Low-Resource Deployment

PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

HyCal: A Training-Free Prototype Calibration Method for Cross-Discipline Few-Shot Class-Incremental Learning

Self-Supervised Angular Deblurring in Photoacoustic Reconstruction via Noisier2Inverse

P3T: Prototypical Point-level Prompt Tuning with Enhanced Generalization for 3D Vision-Language Models

LP$^{2}$DH: A Locality-Preserving Pixel-Difference Hashing Framework for Dynamic Texture Recognition

APC: Transferable and Efficient Adversarial Point Counterattack for Robust 3D Point Cloud Recognition

SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification

NeuroLip: An Event-driven Spatiotemporal Learning Framework for Cross-Scene Lip-Motion-based Visual Speaker Recognition

Diffusion Autoencoder for Unsupervised Artifact Restoration in Handheld Fundus Images

MambaBack: Bridging Local Features and Global Contexts in Whole Slide Image Analysis

Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

Concept-wise Attention for Fine-grained Concept Bottleneck Models

PLAF: Pixel-wise Language-Aligned Feature Extraction for Efficient 3D Scene Understanding

SegMix: Shuffle-based Feedback Learning for Semantic Segmentation of Pathology Images

Fed3D: Federated 3D Object Detection

Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI

Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow

Continual Hand-Eye Calibration for Open-world Robotic Manipulation

Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions

SSFT: A Lightweight Spectral-Spatial Fusion Transformer for Generic Hyperspectral Classification

Beyond Text Prompts: Precise Concept Erasure through Text-Image Collaboration

Learning to Look before Learning to Like: Incorporating Human Visual Cognition into Aesthetic Quality Assessment

Robust Multispectral Semantic Segmentation under Missing or Full Modalities via Structured Latent Projection

AHS: Adaptive Head Synthesis via Synthetic Data Augmentations

Splats in Splats++: Robust and Generalizable 3D Gaussian Splatting Steganography

UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

CLOTH-HUGS: Cloth Aware Human Gaussian Splatting

PolarMAE: Efficient Fetal Ultrasound Pre-training via Semantic Screening and Polar-Guided Masking

AeroDeshadow: Physics-Guided Shadow Synthesis and Penumbra-Aware Deshadowing for Aerospace Imagery

Efficient Video Diffusion Models: Advancements and Challenges

Making Image Editing Easier via Adaptive Task Reformulation with Agentic Executions

Neural Gabor Splatting: Enhanced Gaussian Splatting with Neural Gabor for High-frequency Surface Reconstruction

SENSE: Stereo OpEN Vocabulary SEmantic Segmentation

From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance

MMGait: Towards Multi-Modal Gait Recognition

IA-CLAHE: Image-Adaptive Clip Limit Estimation for CLAHE

Breakout-picker: Reducing false positives in deep learning-based borehole breakout characterization from acoustic image logs

Ranking XAI Methods for Head and Neck Cancer Outcome Prediction

Elucidating the SNR-t Bias of Diffusion Probabilistic Models

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

TableSeq: Unified Generation of Structure, Content, and Layout

The Amazing Stability of Flow Matching

Early Detection of Acute Myeloid Leukemia (AML) Using YOLOv12 Deep Learning Model