arXiv Daily Digest

630

Papers

Haptic Rendering of Fractional-Order Viscoelasticity: Passivity and Rendering Fidelity

分数阶粘弹性的触觉渲染：无源性与渲染保真度

Gemalmaz, Gorkem, Tolasa, Harun, Patoglu, Volkan

Abstract

Haptic rendering of viscoelastic materials that exhibit creep and stress relaxation is crucial for many applications, such as medical training with realistic biological tissue models. Fractional-order viscoelastic models provide an effective means of describing intrinsically time-dependent dynamics with few parameters, as these models can naturally capture memory effects. In this study, we present analyses of passivity and rendering performance for fractional-order viscoelastic models under finite-memory discretization. We derive closed-form expressions to ensure the passivity of haptic rendering with a fractional-order (FO) standard linear solid (SLS) model based on the Grünwald-Letnikov derivative under short-memory discretization. We also provide symbolic expressions for the effective stiffness and damping of such FO-SLS models. The resulting passivity conditions constitute a unified framework that generalizes previously reported results for integer-order Kelvin-Voigt, Maxwell, and SLS models, since these results are special cases of the newly derived condition. Furthermore, we provide experimental validations of the theoretical passivity bounds and human-subject evaluations of perceived realism of FO-SLS models. Overall, this study establishes a unified theoretical framework and experimental evaluations for FO viscoelastic rendering under short-memory discretization.

Chinese Translation

触觉渲染具有蠕变和应力松弛特性的粘弹性材料对于许多应用至关重要，例如使用真实生物组织模型进行的医学培训。分数阶粘弹性模型提供了一种有效的手段，以少量参数描述内在的时间依赖动态，因为这些模型能够自然捕捉记忆效应。在本研究中，我们对有限记忆离散化下的分数阶粘弹性模型进行了无源性和渲染性能的分析。我们推导出闭合形式的表达式，以确保基于Grunwald-Letnikov导数的分数阶（FO）标准线性固体（SLS）模型在短记忆离散化下的触觉渲染无源性。我们还提供了此类FO-SLS模型的有效刚度和阻尼的符号表达式。所得的无源性条件构成了一个统一框架，推广了先前报告的整数阶Kelvin-Voigt、Maxwell和SLS模型的结果，因为这些结果是新推导条件的特例。此外，我们提供了理论无源性界限的实验验证以及对FO-SLS模型感知真实感的人体实验评估。总体而言，本研究为短记忆离散化下的FO粘弹性渲染建立了统一的理论框架和实验评估。
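As a rough illustration of the discretization the abstract refers to, the sketch below computes the standard Grünwald-Letnikov weights by recurrence and applies them over a truncated (short-memory) window. The function names, the window length, and the sample values are assumptions made for this example; it does not reproduce the paper's passivity conditions or its FO-SLS model.

```python
def gl_weights(alpha, memory_length):
    """Grünwald-Letnikov weights w_0..w_L for a derivative of order alpha."""
    weights = [1.0]
    for k in range(1, memory_length + 1):
        # Standard recurrence: w_k = w_{k-1} * (1 - (alpha + 1) / k)
        weights.append(weights[-1] * (1.0 - (alpha + 1.0) / k))
    return weights


def gl_derivative(samples, alpha, dt, memory_length):
    """Approximate the order-alpha derivative at the newest sample.

    samples: signal history, oldest first and newest last.
    Only the most recent memory_length + 1 samples contribute (short memory).
    """
    weights = gl_weights(alpha, memory_length)
    newest_first = samples[::-1][: memory_length + 1]
    total = sum(w * x for w, x in zip(weights, newest_first))
    return total / dt ** alpha


if __name__ == "__main__":
    history = [0.0, 0.1, 0.3]  # hypothetical position samples
    # With alpha = 1 the weights reduce to a backward difference.
    print(gl_derivative(history, alpha=1.0, dt=0.1, memory_length=2))
    # A fractional order blends spring-like and damper-like behaviour.
    print(gl_derivative(history, alpha=0.5, dt=0.1, memory_length=2))
```

Truncating the window bounds the per-sample cost, which is presumably why the abstract analyses passivity under this finite-memory approximation rather than for the full history.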

cs.RO / 2 / 2605.16395

OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence

OrbiSim：作为具身智能的可微物理引擎的世界模型

Li, Jiajian, Huang, Jingyuan, Gong, Junru, Wang, Qi, Yang, Xiaokang, Wang, Yunbo

Abstract

We present OrbiSim, a novel robotic simulation paradigm that redefines world models as a fully differentiable physics engine for embodied intelligence. Unlike prior world models that focus on unconstrained imagination in latent or visual domains, OrbiSim establishes a unified, physically-grounded pathway that bridges structured scene assets, neural dynamics, and downstream reinforcement learning. By enabling end-to-end differentiability throughout the entire simulation loop -- spanning from explicit state transitions to visual observation generation -- OrbiSim supports tasks traditionally intractable for classical simulators, such as differentiable contact modeling, gradient-based policy optimization under sparse rewards, and intuitive physical inference. Empirical results demonstrate that OrbiSim significantly outperforms state-of-the-art world models in both predictive fidelity and control performance. Furthermore, its consistent responsiveness to asset configurations and physical parameters suggests its potential as a differentiable tool for enhancing robot simulation and policy training.

Chinese Translation

我们提出了OrbiSim，一种新颖的机器人仿真范式，它将世界模型重新定义为一个完全可微的物理引擎，以支持具身智能。与以往专注于潜在或视觉领域中不受限制想象的世界模型不同，OrbiSim建立了一个统一的、基于物理的路径，连接了结构化场景资产、神经动态和下游强化学习。通过在整个仿真循环中实现端到端的可微性——涵盖从显式状态转移到视觉观察生成——OrbiSim支持传统上对于经典仿真器难以处理的任务，例如可微接触建模、稀疏奖励下的基于梯度的策略优化，以及直观的物理推理。实证结果表明，OrbiSim在预测精度和控制性能上显著优于最先进的世界模型。此外，它对资产配置和物理参数的一致响应表明其作为可微工具在增强机器人仿真和策略训练方面的潜力。

cs.RO / 3 / 2605.16398

Support-Safe Variational Hybrid Filtering for Contact-Mode and Sparse-Law Recovery

支持安全的变分混合过滤用于接触模式和稀疏法则恢复

Papamichalis, Marios, Ruane, Regina

Abstract

Contact-rich robot dynamics are hybrid: a single observation can match several latent states and contact regimes (free, impact, stick-slip). A standard amortized filter that places no probability on a feasible contact transition will permanently lose the branch the robot actually follows. We introduce VHYDRO, a variational hybrid dynamics learner that prevents this branch loss. At each step, VHYDRO mixes the learned proposal with a feasible transition law before sampling and importance weighting, ensuring that every transition retained by the model-feasible carrier remains covered. VHYDRO jointly infers a continuous latent state and a discrete contact mode, and fits a sparse port-Hamiltonian law to each recovered regime. On top of this, three guarantees connect: support coverage stabilizes filtering, the stabilized filter concentrates the discrete contact posterior on coherent regimes, and mode-pure segments admit sparse port-Hamiltonian recovery. The recovery error separates cleanly into filtering, derivative, mode-impurity, and physics-residual parts. Three empirical findings track the same mechanism. Under heavy occlusion, the support-safe filter stays usable while a non-defensive proposal collapses. On ManiSkill demonstrations and on four Sawyer/BridgeData task families, the discrete state forms temporally coherent contact-regime segments and yields a stronger joint profile across ARI, change-point F1, and segment purity than post-hoc and mode-free baselines. On hybrid systems with known equations, the mode-conditioned sparse fit recovers the active physical terms; purely predictive baselines do not.

Chinese Translation

接触丰富的机器人动力学是混合的：单一观察可以匹配多个潜在状态和接触模式（自由、冲击、粘滑）。标准的摊销滤波器对可行的接触转移不赋予概率，这将永久性地丢失机器人实际遵循的分支。我们引入了VHYDRO，一种变分混合动力学学习器，防止这种分支丢失。在每一步中，VHYDRO将学习到的提议与可行的转移法则混合，然后进行采样和重要性加权，确保模型可行载体保留的每个转移都得到覆盖。VHYDRO共同推断连续的潜在状态和离散的接触模式，并为每个恢复的模式拟合稀疏的端口哈密顿法则。此外，三个保证相互关联：支持覆盖稳定了滤波，稳定的滤波器将离散接触后验集中在一致的模式上，而模式纯净的片段允许稀疏的端口哈密顿恢复。恢复误差清晰地分为滤波、导数、模式不纯和物理残差部分。三个实证发现跟踪相同的机制。在严重遮挡下，支持安全滤波器仍然可用，而非防御性提议则崩溃。在ManiSkill演示和四个Sawyer/BridgeData任务系列中，离散状态形成时间上连贯的接触模式片段，离散状态在ARI、变更点F1和片段纯度上产生比事后和无模式基线更强的联合轮廓。在具有已知方程的混合系统中，模式条件的稀疏拟合恢复了活跃的物理项；纯预测基线则无法做到。
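The proposal-mixing step described in the abstract resembles generic defensive importance sampling, which can be sketched as follows. The mode names come from the abstract; the probabilities, the mixing weight `epsilon`, and both function names are hypothetical, and this is not claimed to be VHYDRO's actual update.

```python
def defensive_mixture(learned, feasible, epsilon):
    """Mix a learned proposal with a feasible transition law.

    Every mode with nonzero feasible probability keeps at least
    epsilon * feasible[mode] of proposal mass, so no branch is lost.
    """
    if not 0.0 < epsilon <= 1.0:
        raise ValueError("epsilon must lie in (0, 1]")
    return {
        mode: (1.0 - epsilon) * learned.get(mode, 0.0) + epsilon * feasible[mode]
        for mode in feasible
    }


def importance_weight(mode, feasible, likelihood, proposal):
    """Unnormalized weight: transition prior times likelihood over proposal."""
    return feasible[mode] * likelihood[mode] / proposal[mode]


if __name__ == "__main__":
    # Hypothetical numbers: the learned proposal gives zero mass to "impact".
    learned = {"free": 0.7, "impact": 0.0, "stick_slip": 0.3}
    feasible = {"free": 0.5, "impact": 0.2, "stick_slip": 0.3}
    likelihood = {"free": 0.1, "impact": 0.8, "stick_slip": 0.1}
    proposal = defensive_mixture(learned, feasible, epsilon=0.1)
    for mode in feasible:
        print(mode, proposal[mode],
              importance_weight(mode, feasible, likelihood, proposal))
```

Because the mixture is never smaller than `epsilon * feasible[mode]`, the ratio of the feasible law to the proposal is bounded by `1 / epsilon`, which keeps the importance weights finite on every feasible branch.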

cs.RO / 4 / 2605.16412

SCAR: Self-Supervised Continuous Action Representation Learning

SCAR：自监督连续动作表征学习

Liu, Hongjia, Feng, Fan, Fu, Minghao, Wang, Xinyue, Lu, Haofei, Huang, Biwei

Abstract

Despite the central role of action in embodied intelligence, learning transferable action representations from visual transitions remains a fundamental challenge, particularly when world models must generalize across embodiments under limited data. We argue that action is not merely an auxiliary conditioning signal, but a distinct representational factor that decouples the controllable change from embodiment-specific actuation. In this work, we propose SCAR, a joint inverse-forward dynamics framework for learning unified action representations across embodiments from visual transitions. Built on a pretrained generative backbone, SCAR uses an inverse dynamics model (IDM) to infer latent actions from latent observation pairs and a forward dynamics model (FDM) to predict future dynamics conditioned on them. To make the latent space transferable rather than a generic visual bottleneck, we regularize the latent action posterior toward a standard Gaussian prior to limit arbitrary visual encoding, and introduce adversarial invariance to suppress embodiment- and environment-specific nuisance factors. Experiments on the Procgen and Robotwin datasets show that the learned unified latent action representation serves as a stronger conditioning interface for world modeling than embodiment-specific raw actions, yielding improved cross-embodiment low-data adaptation and cross-task transfer. Taken together, these results suggest that action can be learned as a shared representation of controllable change across embodiments, providing an interface for more transferable and generalizable world models.

Chinese Translation

尽管动作在具身智能中扮演着核心角色，但从视觉转变中学习可转移的动作表征仍然是一个基本挑战，特别是在世界模型必须在有限数据下跨具身体进行泛化时。我们认为，动作不仅仅是一个辅助条件信号，而是一个独特的表征因素，它将可控变化与具身特定的执行解耦。在本研究中，我们提出了SCAR，一个联合逆向-前向动力学框架，用于从视觉转变中学习跨具身体的统一动作表征。SCAR建立在一个预训练的生成骨干网络上，使用逆向动力学模型（IDM）从潜在观察对中推断潜在动作，并使用前向动力学模型（FDM）基于这些潜在动作预测未来的动态。为了使潜在空间可转移，而不是一个通用的视觉瓶颈，我们对潜在动作后验进行正则化，使其朝向标准高斯先验，以限制任意视觉编码，并引入对抗不变性以抑制具身体和环境特定的干扰因素。在Procgen和Robotwin数据集上的实验表明，学习到的统一潜在动作表征作为世界建模的更强条件接口，优于具身特定的原始动作，从而实现了跨具身体的低数据适应和跨任务转移。综合来看，这些结果表明，动作可以作为跨具身体的可控变化的共享表征进行学习，为更可转移和更具泛化能力的世界模型提供了接口。
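One common way to regularize a latent posterior toward a standard Gaussian prior is the closed-form KL term used in variational autoencoders. The sketch below assumes a diagonal Gaussian posterior parameterized by a mean and a log-variance; the abstract does not state the exact form of SCAR's regularizer, so this is an illustration only and the function name is hypothetical.

```python
import math


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) in closed form.

    Per dimension: 0.5 * (mean^2 + var - 1 - log(var)).
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mean, log_var)
    )


if __name__ == "__main__":
    # A posterior equal to the prior carries no penalty.
    print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
    # Shifting the mean away from zero is penalized quadratically.
    print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
```

A penalty of this kind limits how much information the latent action can carry, which matches the abstract's stated goal of preventing the latent from encoding arbitrary visual detail.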

cs.RO / 5 / 2605.16432

MR-SLAM: Immersive Spatial Supervision for Multi-Robot Mapping via Mixed Reality

MR-SLAM：通过混合现实实现多机器人映射的沉浸式空间监督

Aryan, Prakash, Erdogdu, Cem, Kumarchokkappan, Kavinaya, Kehrer, Timo, Panichella, Sebastiano

Abstract

Operating a multi-robot fleet for simultaneous localization and mapping (SLAM) in applications such as building inspection or warehouse-aisle monitoring requires the operator to maintain spatial awareness of each robot's position and mapping state, a task that scales poorly on conventional 2D interfaces. We present MR-SLAM, a mixed reality (MR) system in which an operator wearing a Meta Quest 3 headset teleoperates three simulated TurtleBot3 robots through a passthrough view with real-world occlusion, while spatially anchored dashboard panels report mapping progress in situ. Each robot runs an independent SLAM Toolbox instance whose occupancy grid is merged in real time on a Robot Operating System 2 (ROS 2) back end. Across five 9-minute evaluation sessions, the system delivered scans at 8.83 +/- 0.16 Hz, mapped 17.9 +/- 0.8 m^2 of merged occupancy, and reached 94.7 +/- 0.5% cross-instance occupancy consistency across robot pairs. An additional session recorded 6.3 ms median transform jitter and 26.7 m^2 coverage of a 41 m^2 grid. We position MR-SLAM as a reference implementation for combining passthrough mixed reality supervision with multi-robot SLAM on consumer hardware.

Chinese Translation

在建筑检查或仓库通道监控等应用中，操作多机器人队伍进行同时定位与地图构建（SLAM）要求操作员保持对每个机器人的位置和映射状态的空间意识，而这一任务在传统的二维界面上扩展性较差。我们提出了MR-SLAM，一个混合现实（MR）系统，操作员佩戴Meta Quest 3头显，通过具有现实世界遮挡的透视视图远程操控三台模拟的TurtleBot3机器人，同时空间锚定的仪表板面板实时报告映射进度。每个机器人运行一个独立的SLAM Toolbox实例，其占用网格在Robot Operating System 2（ROS 2）后端实时合并。在五个9分钟的评估会话中，该系统以8.83 +/- 0.16 Hz的频率提供扫描，映射了17.9 +/- 0.8 m^2的合并占用，并在机器人对之间达到了94.7 +/- 0.5%的跨实例占用一致性。另一个会话记录了6.3毫秒的中位变换抖动和26.7 m^2的41 m^2网格覆盖率。我们将MR-SLAM定位为在消费硬件上结合透视混合现实监督与多机器人SLAM的参考实现。

cs.RO / 6 / 2605.16442

Hierarchical Two-Stage Framework for Environment-Aware Long-Horizon Vessel Trajectory Prediction

环境感知长时间航行器轨迹预测的分层两阶段框架

Gnanavel, Ganeshaaraj, Fernando, Tharindu, Sridharan, Sridha, Fookes, Clinton

Abstract

Long-horizon vessel trajectory forecasting under real ocean conditions is critical for collision avoidance, traffic management, and route planning. However, achieving accurate predictions is challenging due to long-range temporal dependencies and dynamic environmental factors such as currents, wind, and waves. To address these issues, we propose a hierarchical two-stage framework that combines a coarse long-term predictor with a grid-aware short-term predictor through a hierarchical fusion mechanism. The short-term branch leverages a Spatio-Temporal Graph Transformer on discretized maritime cells to capture localized dynamics, while the long-term branch encodes overarching navigational intent. An integrated environmental module incorporates oceanographic parameters, including surface currents, wind vectors, and significant wave height, using cross-modal attention and feature-wise modulation for adaptive response to varying sea conditions. Additionally, a learnable Savitzky-Golay smoothing layer enhances temporal coherence in fused trajectories. We evaluate our approach on Australian Craft Tracking System (CTS) data from the North West region, aligned with Copernicus Marine Service products, using a 3-hour input and a 10-hour prediction horizon. Experimental results show that our framework outperforms the state-of-the-art by 25% in Average Displacement Error (ADE) and 17% in Final Displacement Error (FDE). Ablation studies further validate the contribution of each component.

Chinese Translation

在真实海洋条件下进行长时间航行器轨迹预测对于避免碰撞、交通管理和航线规划至关重要。然而，由于长时间的时间依赖性和动态环境因素（如洋流、风和波浪），实现准确的预测面临挑战。为了解决这些问题，我们提出了一种分层两阶段框架，通过分层融合机制将粗略的长期预测器与网格感知的短期预测器结合起来。短期分支利用时空图变换器（Spatio-Temporal Graph Transformer）在离散化的海洋单元上捕捉局部动态，而长期分支则编码整体的导航意图。集成的环境模块结合了海洋学参数，包括表面洋流、风矢量和显著波高，使用跨模态注意力和特征调制以适应不同海况的响应。此外，一个可学习的Savitzky-Golay平滑层增强了融合轨迹的时间一致性。我们在与Copernicus Marine Service产品对齐的西北地区澳大利亚船舶追踪系统（CTS）数据上评估了我们的方法，使用3小时的输入和10小时的预测范围。实验结果表明，我们的框架在平均位移误差（ADE）上比最先进的方法提高了25%，在最终位移误差（FDE）上提高了17%。消融研究进一步验证了各个组件的贡献。
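The two reported metrics have standard definitions: ADE averages the point-wise distance between predicted and true positions over the horizon, and FDE is the distance at the final step. A minimal sketch follows, assuming planar coordinates in a common unit; real vessel tracks are geographic, so the projection or great-circle distance the authors used is not known from the abstract, and the function name is hypothetical.

```python
import math


def displacement_errors(predicted, actual):
    """Return (ADE, FDE) for two equally long lists of (x, y) points."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    distances = [
        math.hypot(px - ax, py - ay)
        for (px, py), (ax, ay) in zip(predicted, actual)
    ]
    average_error = sum(distances) / len(distances)
    final_error = distances[-1]
    return average_error, final_error


if __name__ == "__main__":
    # Hypothetical two-step trajectories in arbitrary planar units.
    predicted = [(0.0, 0.0), (3.0, 4.0)]
    actual = [(0.0, 0.0), (0.0, 0.0)]
    print(displacement_errors(predicted, actual))
```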

cs.RO / 7 / 2605.16514

No Plan, Yet Human: A Reactive Robotics Model Predicts Human Planning Failures on a Clinical Task

没有计划，但仍然是人类：一个反应式机器人模型预测临床任务中的人类规划失败

Migacev, Michael, Mengers, Vito, Köngeter, Antonia, Brock, Oliver

Abstract

Understanding why some sequential planning problems are harder than others requires models that go beyond average performance. They should capture the specific pattern of which problems are hard, and ideally fail in the same way people do when planning capacity is reduced. We apply AICON, a reactive gradient-descent framework developed for robotic manipulation, to the Tower of London test, a cognitive test used to assess planning in Parkinson's disease, mild cognitive impairment, and stroke. Without any lookahead planning or knowledge of human cognition, AICON reproduces the fine-grained human difficulty ordering across 24 problems better than structural task parameters and generalizes to held-out problems in a leave-two-out evaluation. Crucially, AICON outperforms a planning baseline for groups with reduced planning capacity while the planning baseline better captures healthy controls. This dissociation was predicted by the original AICON paper, which noted that the model's failure modes resemble those of Parkinson's patients who struggle with goal hierarchies but not move counts. This suggests that as planning capacity is reduced, human behavior shifts toward the reactive mode AICON models. The finding extends a broader pattern: AICON, originally built for robotics, now captures aspects of biological behavior across perception, eye movements, and sequential planning, suggesting its core abstraction reflects something real about how biological systems are organized.

Chinese Translation

理解为什么某些顺序规划问题比其他问题更困难，需要超越平均表现的模型。这些模型应捕捉哪些问题是困难的特定模式，并且理想情况下在规划能力降低时以与人类相同的方式失败。我们将为机器人操作开发的反应式梯度下降框架 AICON 应用于伦敦塔测试，这是一种用于评估帕金森病、轻度认知障碍和中风患者规划能力的认知测试。在没有任何前瞻性规划或人类认知知识的情况下，AICON 在 24 个问题中再现了细致的人类困难排序，优于结构任务参数，并且在留二法评估中能够推广到未见过的问题。至关重要的是，AICON 在规划能力降低的群体中优于规划基线，而规划基线更好地捕捉健康对照组的表现。这种分离是原始 AICON 论文所预测的，该论文指出模型的失败模式类似于在目标层次结构上挣扎但不在移动计数上挣扎的帕金森患者。这表明，随着规划能力的降低，人类行为向 AICON 模型的反应模式转变。该发现扩展了一个更广泛的模式：最初为机器人构建的 AICON 现在捕捉到生物行为在感知、眼动和顺序规划等方面的特征，表明其核心抽象反映了生物系统组织的某种真实情况。

cs.RO / 8 / 2605.16522

A Mechanistic Model for Collective Motion from Sensorimotor Regularities

基于感觉运动规律的集体运动机制模型

Mengers, Vito, Cao, Bao Duc, Brock, Oliver

Abstract

Collective behavior in animals has long been modeled through self-propelled particle models, which reproduce striking group-level phenomena through abstract interaction forces. Yet these models are fundamentally descriptive: they leave open the question of how collective behavior is actually produced. Recent empirical work makes this gap concrete: locusts do not align with neighbors; sensory and cognitive mechanisms mediate interaction instead. A mechanistic model must therefore operate at the sensorimotor level, grounded in what individual organisms can actually perceive, estimate, and physically execute. We present such a model based on a modeling framework from robotics, extended here to collective motion. Each agent perceives neighbors through bearing and apparent-size cues within a limited field of view, maintains uncertain internal state estimates, and selects actions through gradient descent on a desired social distance -- without any prescribed interaction forces. This simple model produces diverse collective behaviors including polarized motion, milling, ring formations, and subgroup fragmentation. A global sensitivity analysis shows that behavioral transitions are governed by sensorimotor parameters corresponding to measurable biological quantities: field-of-view geometry, sensory noise, turning agility, and memory. Collective behavior can therefore be understood as the emergent outcome of interacting sensorimotor regularities, and differences across species as the emergent outcome of differences in embodiment and environment.

Chinese Translation

动物的集体行为长期以来通过自驱动粒子模型进行建模，这些模型通过抽象的相互作用力再现了显著的群体现象。然而，这些模型本质上是描述性的：它们未能解答集体行为是如何产生的。最近的实证研究使这一空白变得具体：蝗虫并不与邻居对齐，而是通过感觉和认知机制进行互动。因此，机制模型必须在感觉运动层面上运作，基于个体生物体实际能够感知、估计和物理执行的内容。我们提出了这样一个模型，基于机器人学的建模框架，并在此基础上扩展到集体运动。每个代理通过有限视野内的方位和表观大小线索感知邻居，保持不确定的内部状态估计，并通过对期望社交距离的梯度下降选择行动——而不依赖任何规定的相互作用力。这个简单的模型产生了多样的集体行为，包括极化运动、旋转、环形形成和子群体分裂。全球敏感性分析表明，行为转变受传感运动参数的支配，这些参数对应于可测量的生物量：视野几何、感觉噪声、转向灵活性和记忆。因此，集体行为可以被理解为相互作用的感觉运动规律的涌现结果，而物种之间的差异则是体现和环境差异的涌现结果。

cs.RO / 9 / 2605.16537

Nori Bot: A Sub-$1,000 Floor-to-Counter Mobile Manipulator

Nori Bot：一款低于$1,000的地面到柜台移动操纵器

Li, Antonio, Park, Sungjoon, Chew, Wen Ni

Abstract

Open-source mobile manipulators have reached $660 (XLeRobot), but every sub-$1,000 platform shares three limitations: a fixed-height workspace, reactive-only control, and no protection against the stall-induced burn-out that destroys cheap Feetech servos. We present Nori Bot, a 17-DoF dual-arm mobile manipulator at $947 (about 3% of the cost of comparable commercial platforms) that addresses all three: (1) a 600 mm Z-axis lift on the existing servo bus for floor-to-counter reach; (2) a thin-client Raspberry Pi 4 paired with the OpenClaw proactive agent runtime so cron jobs and hooks trigger physical tasks autonomously; and (3) a software safety stack with sensorless grip-force feedback via motor current on a soft TPU finger. Code, CAD, and the skill manifest will be released.

Chinese Translation

开源移动操纵器的价格已降至$660（XLeRobot），但每个低于$1,000的平台都存在三个限制：固定高度的工作空间、仅反应式控制，以及缺乏防止因堵转引起的烧毁保护，这会损坏廉价的Feetech伺服电机。我们提出了Nori Bot，这是一款17自由度的双臂移动操纵器，售价为$947（约为同类商业平台成本的3%），解决了这三个问题：（1）在现有伺服总线上提供600mm的Z轴升降，以实现地面到柜台的可达性；（2）配备OpenClaw主动代理运行时的薄客户端Raspberry Pi 4，使得定时任务和钩子能够自主触发物理任务；（3）通过软TPU手指的电机电流实现无传感器的抓握力反馈的软件安全堆栈。代码、CAD和技能清单将会发布。

cs.RO / 10 / 2605.16588

Policy Library CBF: Finite-Horizon Safety at Runtime via Parallel Rollouts

策略库控制屏障函数：通过并行回滚实现有限时域的实时安全

Kim, Taekyung, Okamoto, Hideki, Hoxha, Bardh, Fainekos, Georgios, Panagou, Dimitra

Abstract

Safety-critical autonomy in unstructured environments poses significant challenges for online safety certification under evolving constraints. We propose Policy Library Control Barrier Function (PL-CBF), a runtime safety filter that evaluates a library of fallback policies via parallel finite-horizon rollouts, selects the least invasive safe mode, and enforces safety by solving a quadratic program that minimally modifies a nominal policy. We provide a theoretical analysis based on a finite-horizon language metric over closed-loop behaviors, characterizing policy-library coverage requirements for certifying finite-horizon safety. Simulations on a planar double-integrator (4 states), highway driving with abrupt friction changes using a realistic nonlinear vehicle model (8 states), and 3D quadrotor navigation in crowded dynamic environments (12 states) demonstrate improved safety coverage over single-policy safety filters while retaining millisecond-level runtime.

Chinese Translation

在非结构化环境中，安全关键的自主系统在不断变化的约束下进行在线安全认证面临重大挑战。我们提出了策略库控制屏障函数（Policy Library Control Barrier Function，PL-CBF），这是一种实时安全过滤器，通过并行有限时域回滚评估一组后备策略，选择最不具侵入性的安全模式，并通过求解一个最小修改名义策略的二次规划来强制执行安全性。我们基于闭环行为的有限时域语言度量提供了理论分析，表征了认证有限时域安全所需的策略库覆盖要求。在平面双积分器（4个状态）、使用现实非线性车辆模型的高速公路驾驶（急剧摩擦变化，8个状态）以及在拥挤动态环境中进行的3D四旋翼导航（12个状态）的仿真中，展示了相较于单一策略安全过滤器的改进安全覆盖，同时保持了毫秒级的运行时间。
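The final step named in the abstract, a quadratic program that minimally modifies a nominal control, has a closed-form solution when there is a single affine safety constraint. The sketch below covers only that generic CBF-QP step and leaves out the policy library and the rollouts. It assumes a single-integrator robot and a circular obstacle, neither of which comes from the paper, and all names are hypothetical.

```python
def cbf_filter(u_nom, a, b):
    """Solve min ||u - u_nom||^2 subject to a.u + b >= 0 in closed form.

    This is the Euclidean projection of u_nom onto a half-space.
    """
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    if slack >= 0.0:
        return list(u_nom)  # nominal control already satisfies the constraint
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint is infeasible for every control input")
    return [ui - slack * ai / norm_sq for ui, ai in zip(u_nom, a)]


def obstacle_constraint(x, center, radius, gamma):
    """CBF condition for x' = u with h(x) = ||x - center||^2 - radius^2.

    Requiring dh/dt + gamma * h >= 0 gives a = 2 (x - center), b = gamma * h.
    """
    offset = [xi - ci for xi, ci in zip(x, center)]
    h = sum(o * o for o in offset) - radius ** 2
    return [2.0 * o for o in offset], gamma * h


if __name__ == "__main__":
    a, b = obstacle_constraint(x=[2.0, 0.0], center=[0.0, 0.0],
                               radius=1.0, gamma=1.0)
    # The nominal control drives straight at the obstacle and gets slowed.
    print(cbf_filter([-2.0, 0.0], a, b))
```

With several simultaneous constraints the closed form no longer applies and a numerical QP solver would be needed.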

cs.RO / 11 / 2605.16673

Bayesian Networks for Path-Based Sensors: Gathering Information and Path Planning in Communication Denied Environments

基于路径的传感器的贝叶斯网络：在通信受限环境中收集信息和路径规划

Srivastava, Alkesh K., Kontoudis, George P., Sofge, Donald, Otte, Michael

Abstract

A "path-based sensor" produces a single observation along a continuous path. For example, a boolean path-based sensor returns a single "1" if an event of interest is detected at any point along the path and a "0" otherwise. Notably, a "1" provides no direct information about where along the path the event(s) may have occurred. Previous work has demonstrated that observations from multiple path-based sensors can be fused to create a Bayesian belief map over the spatial locations of the underlying event or phenomenon. Moreover, path planning can employ Shannon information theory to accelerate the rate of convergence of the belief map. In this paper, we present a new method to update the belief map based on a path-based sensor observation, and then plan paths to increase information gain. In contrast to prior work that approximates the posterior by averaging over the alternative event histories, we introduce a Bayesian Network (BN) formulation that models the probabilistic relationships between the latent variables and path-based sensor measurements, enabling a more principled Bayesian belief update. We consider static hazard detection in a communication-denied environment as a representative problem setting. The event of a robot returning from its path corresponds to a path-based hazard sensor reading of "0" (hazard not detected), while a robot failing to return corresponds to a reading of "1" (hazard detected). We consider false positives and false negatives. We find that the new method leads to quicker convergence of the belief map than prior work in both single- and multi-robot cases.

Chinese Translation

“基于路径的传感器”沿着连续路径产生单一观测。例如，当在路径上的任何一点检测到感兴趣的事件时，布尔型基于路径的传感器返回一个“1”，否则返回“0”。值得注意的是，“1”并未直接提供事件发生在路径上的具体位置。先前的研究表明，来自多个基于路径的传感器的观测可以融合，以创建一个关于潜在事件或现象空间位置的贝叶斯信念图。此外，路径规划可以利用香农信息理论来加速信念图的收敛速度。本文提出了一种新的方法，通过基于路径的传感器观测更新信念图，然后规划路径以增加信息增益。与之前通过对替代事件历史进行平均来近似后验的工作不同，我们引入了一种贝叶斯网络（Bayesian Network, BN）模型，描述潜在变量与基于路径的传感器测量之间的概率关系，从而实现更为系统的贝叶斯信念更新。我们将静态危险检测在通信受限环境中作为代表性问题设置。机器人从路径返回的事件对应于基于路径的危险传感器读数“0”（未检测到危险），而机器人未能返回则对应于读数“1”（检测到危险）。我们考虑了假阳性和假阴性。我们发现，该新方法在单机器人和多机器人情况下都能比之前的工作更快地收敛信念图。
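To make the boolean path-based sensor concrete, the sketch below performs an exact Bayes update of per-cell marginals under strong simplifying assumptions: cell priors are independent, and the false-positive and false-negative rates apply to the path reading as a whole. This is neither the paper's Bayesian Network formulation nor the prior averaging approach, and the function and parameter names are hypothetical.

```python
def update_belief(belief, path, reading, false_pos=0.0, false_neg=0.0):
    """Update per-cell event probabilities after one path observation.

    belief: dict mapping cell -> prior probability of an event in that cell.
    path: cells traversed; reading: 1 (event detected) or 0 (not detected).
    """
    cells = set(path)
    no_event = 1.0
    for cell in cells:
        no_event *= 1.0 - belief[cell]
    any_event = 1.0 - no_event
    if reading == 1:
        like_event, like_clear = 1.0 - false_neg, false_pos
    else:
        like_event, like_clear = false_neg, 1.0 - false_pos
    evidence = like_event * any_event + like_clear * no_event
    if evidence == 0.0:
        raise ValueError("reading has zero probability under this model")
    posterior = dict(belief)  # cells off the path keep their prior
    for cell in cells:
        # If this cell holds an event, the path contains an event.
        posterior[cell] = like_event * belief[cell] / evidence
    return posterior


if __name__ == "__main__":
    prior = {"a": 0.5, "b": 0.5, "c": 0.5}
    print(update_belief(prior, ["a", "b"], reading=1))
    print(update_belief(prior, ["a", "b"], reading=0))
```

A reading of "1" raises the probability of every cell on the path without singling one out, which reflects the abstract's point that a detection carries no direct location information. The returned values are marginals only; after an update the cells are no longer independent, which is one reason a structured model such as a Bayesian Network is attractive.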

cs.RO / 12 / 2605.16737

DriveSafer: End-to-End Autonomous Driving with Safety Guidance

DriveSafer：具有安全指导的端到端自动驾驶

Sural, Shounak, Rajkumar, Raj

Abstract

End-to-End (E2E) autonomous driving models have shown growing capability in recent years, with performance improving on increasingly challenging benchmarks. However, modern generative E2E planners still suffer from a substantial number of catastrophic failures in safety-critical scenarios. We find that many such failures arise from violations of physical constraints and safety requirements, leading to unsafe behavior. Motivated by this finding, in this paper, we focus on improving safety outcomes in generative end-to-end driving with a targeted reduction of catastrophic planning failures, instead of enhancing average planning quality. Towards this end, we propose DriveSafer, a failure-aware safety framework for end-to-end planners. DriveSafer explicitly steers generative planners towards safe behaviors leveraging both training-time safety constraints and inference-time safety guidance. Compared to the state-of-the-art DiffusionDrive model, on the NAVSIM benchmark, DriveSafer reduces the number of catastrophic failures (PDMS=0) by 48%, with over 65% reduction in drivable-area compliance failures.

Chinese Translation

近年来，端到端（E2E）自动驾驶模型的能力不断增强，在日益具有挑战性的基准测试中表现出色。然而，现代生成式E2E规划器在安全关键场景中仍然面临大量灾难性失败。我们发现，许多此类失败源于对物理约束和安全要求的违反，导致不安全的行为。基于这一发现，本文重点改善生成式端到端驾驶中的安全结果，旨在有针对性地减少灾难性规划失败，而不是提升平均规划质量。为此，我们提出了DriveSafer，一个针对失败的安全框架，旨在为端到端规划器提供支持。DriveSafer通过利用训练时的安全约束和推理时的安全指导，明确引导生成式规划器朝向安全行为。与最先进的DiffusionDrive模型相比，在NAVSIM基准测试中，DriveSafer将灾难性失败数量（PDMS=0）减少了48%，可驾驶区域合规性失败减少超过65%。

cs.RO / 13 / 2605.16743

LACE: Latent Visual Representation for Cross-Embodiment Learning

LACE：用于跨体态学习的潜在视觉表示

Jang, Yoo Sung, Ranasinghe, Kanchana, Mata, Cristina, Zhang, Yichi, Mendez-Mendez, Jorge, Ryoo, Michael S.

Abstract

Cross-embodiment learning from human demonstrations is hindered by the visual gap between human and robot embodiments. While self-supervised learning (SSL) backbones encode rich inter-class semantics of general objects, we show they fail to establish correspondence between human and robot hands. We propose LACE, a framework that aligns human and robot visual representations in the latent space of these backbones by leveraging correspondences between shared body parts across embodiments as sparse supervision. These annotations can be automatically obtained via forward kinematics, and a single robot demonstration is sufficient to train the model. Our semantic alignment loss matches distributions incurred by corresponding features, lifting patch-level supervision to semantic-level alignment, while a Gram loss preserves pretrained feature quality. This alignment enables robot policies to leverage abundant human data when robot demonstrations are scarce: in zero-shot transfer, policies using LACE-DINO outperform those using DINO by a large margin (65%), with consistent gains in low-data regimes and out-of-distribution environments.

Chinese Translation

从人类示范中进行跨体态学习受到人类与机器人体态之间视觉差距的阻碍。虽然自监督学习（SSL）骨干网络能够编码一般物体的丰富类间语义，但我们发现它们未能建立人类与机器人手部之间的对应关系。我们提出了LACE，一个通过利用跨体态共享身体部位之间的对应关系作为稀疏监督，在这些骨干网络的潜在空间中对齐人类和机器人视觉表示的框架。这些注释可以通过正向运动学自动获得，单个机器人示范足以训练模型。我们的语义对齐损失匹配由对应特征引起的分布，将补丁级监督提升到语义级对齐，而Gram损失则保持预训练特征的质量。这种对齐使得机器人策略在机器人示范稀缺时能够利用丰富的人类数据：在零样本迁移中，使用LACE-DINO的策略相比使用DINO的策略有显著提升（65%），并在低数据环境和分布外环境中持续获得收益。

cs.RO / 14 / 2605.16816

"I'm Not Mad, Just Focused": Understanding Human Emotions in Human-Robot Collaboration

“我不是生气，只是专注”：理解人机协作中的人类情感

Hong, Seung Chan, Kulić, Dana, Tian, Leimin

Abstract

Human-robot collaboration (HRC) can benefit from robots' abilities to interpret human emotional states. However, current emotion recognition (ER) models in HRC often fall short, particularly due to their reliance on acted datasets and single-modality inputs like facial expressions. We propose a novel vision language model (VLM)-based ER system that leverages contextual understanding to improve emotion interpretation in HRC. We first evaluate the VLM-ER system by assessing its semantic and sentiment similarity with human annotations on an existing HRC dataset. Then, in a user study with a service robot in a collaborative delivery task, we evaluate the effects of modulating the robot's behaviour based on the user's emotional state inferred by the VLM-ER system. The results show that the proposed VLM-ER system achieves higher semantic similarity and positive sentiment alignment with human annotations compared to a baseline convolutional neural network-based system. Further, participants in the user study preferred emotion-adaptive robot behaviour facilitated by the VLM-ER system.

Chinese Translation

人机协作（HRC）可以从机器人解读人类情感状态的能力中受益。然而，目前HRC中的情感识别（ER）模型往往存在不足，特别是由于它们依赖于表演数据集和单一模态输入（如面部表情）。我们提出了一种基于新型视觉语言模型（VLM）的情感识别系统，该系统利用上下文理解来改善HRC中的情感解读。我们首先通过评估VLM-ER系统与现有HRC数据集的人类注释之间的语义和情感相似性来进行评估。然后，在与服务机器人进行协作送货任务的用户研究中，我们评估了根据VLM-ER系统推断的用户情感状态调节机器人行为的效果。结果表明，所提出的VLM-ER系统在与人类注释的语义相似性和积极情感一致性方面，优于基线卷积神经网络系统。此外，用户研究中的参与者更倾向于选择由VLM-ER系统促进的情感适应型机器人行为。

cs.RO / 15 / 2605.16858

Pedestrian-Aware LLM-Driven Behavioral Planning for Autonomous Vehicles

行人感知的基于大型语言模型的自主车辆行为规划

Baimbetova, Aidana, Yonekura, Haruki, Rizk, Hamada, Yamaguchi, Hirozumi

Abstract

Autonomous Vehicles (AVs) must make reliable decisions in dense urban environments where pedestrian behavior is variable, sometimes abnormal, and often unseen during training. Reinforcement learning (RL)-based AV control systems perform well in structured traffic but struggle to generalize to unpredictable pedestrian interactions and out-of-distribution scenarios. Their reliance on handcrafted rewards and opaque decisions further limits their suitability for safety-critical, pedestrian-rich environments. To address these limitations, we introduce a Large Language Model (LLM)-based decision-making framework for pedestrian-aware behavioral planning. The system converts structured scene observations into natural-language reasoning prompts, enabling the LLM to infer pedestrian intent, anticipate risk, and generate cautious tactical driving decisions. These decisions are executed by a motion planner that ensures smooth, kinematically feasible control. We evaluate the framework in SUMO across multiple pedestrian-interaction scenarios, including unexpected jaywalking, turn-back crossing, hesitation, and bidirectional crossing. In zero-shot evaluation, the LLM-based agent achieves a 68% collision-free success rate, substantially outperforming deep RL baselines (17.7%). With few-shot episodic memory in a single-pedestrian scenario, performance increases to 96.0%, exceeding a custom DQN controller (82.0%). Cross-behavior evaluation further shows that memory derived from turn-back interactions transfers to unseen hesitation and bidirectional crossing scenarios, achieving 82.0% and 90.0% success, respectively. The system consistently initiates earlier responses, maintains wider safety buffers, and produces interpretable, human-aligned decisions.

Chinese Translation

自主车辆（AVs）必须在行人行为多变、时常异常且在训练期间常常未被观察到的密集城市环境中做出可靠决策。基于强化学习（RL）的自主车辆控制系统在结构化交通中表现良好，但在不可预测的行人互动和分布外场景中难以泛化。它们对手工设计奖励和不透明决策的依赖进一步限制了其在安全关键、行人密集环境中的适用性。为了解决这些局限性，我们提出了一种基于大型语言模型（LLM）的决策框架，用于行人感知的行为规划。该系统将结构化场景观察转换为自然语言推理提示，使LLM能够推断行人意图、预判风险并生成谨慎的战术驾驶决策。这些决策由运动规划器执行，确保平滑且运动学上可行的控制。我们在SUMO中评估该框架，涵盖多种行人互动场景，包括意外的闯红灯、回头过马路、犹豫和双向过马路。在零样本评估中，基于LLM的代理实现了68%的无碰撞成功率，显著优于深度强化学习基线（17.7%）。在单行人场景中，采用少量样本的情节记忆，性能提升至96.0%，超过定制的DQN控制器（82.0%）。跨行为评估进一步表明，来自回头互动的记忆可以转移到未见的犹豫和双向过马路场景，分别实现82.0%和90.0%的成功率。该系统始终能够更早地启动响应，保持更宽的安全缓冲区，并产生可解释的、与人类对齐的决策。

cs.RO / 16 / 2605.16863

Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning

先规划，后扩散：用于长时间规划的外部图引导

Hassidof, Yaniv, Morgan, Adir, Du, Yilun, Solovey, Kiril

Abstract

Compositional diffusion models offer a promising route to long-horizon planning by denoising multiple overlapping sub-trajectories while ensuring that together they constitute a global solution. However, enforcing local behavior over long chains is often insufficient for a coherent global structure to emerge. Recent works tackle this limitation through intrinsic search, which explores multiple paths during the denoising process. While intrinsic search improves global coherence, it comes at the cost of repeated evaluations of an already compute-heavy model. In this work, we argue that extrinsic search, performed outside the denoising process, offers a more effective mode of exploration for long-horizon planning while naturally enabling the use of classical algorithms to solve unseen combinatorial tasks at test time. Our eXtrinsic search-guided Diffuser (XDiffuser) first computes a plan over a state-space graph -- serving as a lightweight local connectivity oracle for the diffusion model. The plan is then used to guide denoising for a single trajectory, effectively offloading the burden of exploration. XDiffuser outperforms diffusion-based baselines on long-horizon tasks, with particularly large gains in the low-quality data regime and on unseen tasks beyond goal-reaching, including multi-agent coordination and TSP-style reasoning. Project website: https://yanivhass.github.io/XDiffuser-site/

Chinese Translation

组合扩散模型通过去噪多个重叠的子轨迹，为长时间规划提供了一条有前景的途径，同时确保它们共同构成一个全局解决方案。然而，在长链上强制局部行为往往不足以使一致的全局结构出现。最近的研究通过内在搜索来解决这一限制，该方法在去噪过程中探索多个路径。虽然内在搜索改善了全局一致性，但代价是对一个已经计算密集型的模型进行重复评估。在本研究中，我们认为外部搜索在去噪过程之外进行，为长时间规划提供了一种更有效的探索模式，同时自然地允许使用经典算法在测试时解决未见的组合任务。我们的外部搜索引导扩散器（XDiffuser）首先在状态空间图上计算一个计划——作为扩散模型的轻量级局部连通性神谕。然后，该计划用于引导单个轨迹的去噪，有效地减轻了探索的负担。XDiffuser在长时间任务上优于基于扩散的基线，尤其在低质量数据环境和未见任务（超出目标达成，包括多智能体协调和旅行商问题（TSP）风格推理）上获得了显著提升。项目网站：https://yanivhass.github.io/XDiffuser-site/


cs.RO / 17 / 2605.16870

SSTL: Self-Sensing Tendon Loop for Hysteresis Modeling and Compensation in Tendon-Sheath Mechanisms

SSTL：用于腱鞘机制中滞后建模和补偿的自感知腱环

Park, Myeongbo, Park, Junhyun, Ullah, Ihsan, An, Chunggil, Hwang, Minho

Abstract

Flexible endoscopic robots enable minimally invasive access through natural orifices, but their control accuracy is limited by configuration-dependent hysteresis in the tendon-sheath mechanisms (TSMs). Tendon-sheath friction and tendon elasticity induce a systematic discrepancy between the proximal actuation input and distal output, and this discrepancy varies with the insertion tube configuration. To address this challenge, this paper proposes the Self-Sensing Tendon Loop (SSTL), a double-pass tendon loop that is routed through the insertion tube, wrapped around a distal pulley, and returned to the proximal end. The loop structure allows both the input and output tensions of the SSTL to be measured proximally, thereby providing an input-output tension profile without requiring distal force or fiber-optic sensors. Because the SSTL shares the same routing path as the actuation TSM, the two TSMs exhibit strongly correlated hysteresis behaviors. From the SSTL tension profile, a learning-based mapping estimates the configuration-dependent hysteresis parameters of the actuation TSM, which are then used by a feedforward controller to compensate for actuation hysteresis. We validate the proposed method by tracking actuation tendon tension under three different insertion tube configurations. Across sinusoidal and random trajectories, the proposed method reduces average RMSE by 88.1% compared with the uncompensated baseline, achieving 97.8% of the performance of direct identification, which requires direct measurement of the input and output tension profile of the actuation TSM.

Chinese Translation

柔性内窥镜机器人通过自然腔道实现微创访问，但其控制精度受到腱鞘机制（TSMs）中依赖配置的滞后影响。腱鞘摩擦和腱的弹性导致近端驱动输入与远端输出之间存在系统性差异，并且这种差异随着插入管配置的变化而变化。为了解决这一挑战，本文提出了自感知腱环（SSTL），一种通过插入管布置并绕过远端滑轮的双通道腱环，最终返回至近端。该环结构允许在近端测量SSTL的输入和输出张力，从而提供输入-输出张力曲线，而无需远端力或光纤传感器。由于SSTL与驱动TSM共享相同的布线路径，因此这两个TSM表现出强相关的滞后行为。通过SSTL张力曲线，基于学习的映射估计驱动TSM的依赖配置的滞后参数，然后由前馈控制器使用这些参数来补偿驱动滞后。我们通过在三种不同插入管配置下跟踪驱动腱张力来验证所提方法。在正弦和随机轨迹下，所提方法相比于未补偿基线将平均均方根误差（RMSE）降低了88.1%，达到了直接识别的97.8%性能，后者需要直接测量驱动TSM的输入和输出张力曲线。


cs.RO / 18 / 2605.16871

SADP: Subgoal-Aware Diffusion Policy for Explainable Robots Learned from Foundation Model Generated Demonstrations

SADP：基于子目标的扩散策略用于从基础模型生成的演示中学习的可解释机器人

Hu, Site, Horii, Takato

Abstract

Explainable robots require not only successful task execution but also the ability to expose their internal decision-making process in a user-friendly manner. However, most imitation learning methods are trained solely on task-level demonstrations, without explicitly modeling subgoal structure or execution progress. This limitation is further exacerbated by the scarcity of subgoal-level supervision in standard robot learning datasets, which restricts the development of robots that can convey the subtasks they are executing during long-horizon manipulation. To address this issue, this paper proposes Subgoal-Aware Diffusion Policy (SADP), a framework that leverages foundation models to autonomously generate subgoal-annotated demonstrations and trains diffusion policies on these datasets. SADP structures policy execution around human-interpretable subgoals by conditioning action generation on both task-level and subgoal-level descriptions. A lightweight auxiliary head further predicts subgoal completion states, allowing the robot to expose its current execution stage and monitor subgoal progression. Experiments in RLBench simulations and real-world evaluations on a UR5e robot demonstrate that SADP achieves higher task success rates than strong task-conditioned diffusion baselines, while providing subgoal-level execution signals for monitoring progress and diagnosing failures. These results highlight that built-in, rather than post-hoc, interpretability can coexist with high task performance.

Chinese Translation

可解释机器人不仅需要成功执行任务，还需要能够以用户友好的方式揭示内部决策过程。然而，大多数模仿学习方法仅基于任务级演示进行训练，并未明确建模子目标结构或执行进度。这一局限性在标准机器人学习数据集中子目标级监督的稀缺性下愈加明显，这限制了能够在长时间操作中传达其正在执行的子任务的机器人的发展。为了解决这一问题，本文提出了基于子目标的扩散策略（SADP），该框架利用基础模型自主生成带有子目标注释的演示，并在这些数据集上训练扩散策略。SADP通过将动作生成条件化于任务级和子目标级描述，围绕人类可解释的子目标构建策略执行。一个轻量级的辅助头进一步预测子目标完成状态，使机器人能够揭示其当前执行阶段并监控子目标进展。在RLBench模拟和UR5e机器人上的实际评估实验表明，SADP的任务成功率高于强任务条件的扩散基线，同时提供了用于监控进展和诊断失败的子目标级执行信号。这些结果突显了内置的可解释性与高任务性能可以共存，而非事后解释。


cs.RO / 19 / 2605.16894

Beyond Safety Filtering: Control Barrier Function-Informed Reinforcement Learning for Connected and Automated Vehicles

超越安全过滤：基于控制障碍函数的强化学习在连接与自动驾驶车辆中的应用

Xu, Jianye, Alrifaee, Bassam

Abstract

Reinforcement Learning (RL) uses rewards to guide learning, yet reward design is typically hand-crafted using heuristics that can be difficult to tune. We propose a Control Barrier Function (CBF)-informed reward design for Multi-Agent RL (MARL) that converts CBF constraint values under joint MARL actions into a reward signal that explicitly guides safe learning. We compare against two heuristic reward baselines in a four-way multi-lane intersection with connected and automated vehicles. Results show that our method achieves the highest task performance and is less sensitive to reward hyperparameters, yielding consistently strong performance across the tested hyperparameter range. Code for reproducing the experimental results and a video demonstration are available at https://github.com/bassamlab/SigmaRL.

Chinese Translation

强化学习（RL）通过奖励来指导学习，但奖励设计通常依赖于难以调整的启发式方法。我们提出了一种基于控制障碍函数（CBF）的奖励设计，用于多智能体强化学习（MARL），该设计将联合MARL行动下的CBF约束值转换为明确指导安全学习的奖励信号。我们在一个四路多车道交叉口中与两种启发式奖励基线进行了比较，交叉口中有连接与自动驾驶车辆。结果表明，我们的方法实现了最高的任务性能，并且对奖励超参数的敏感性较低，在测试的超参数范围内始终表现出强劲的性能。用于重现实验结果的代码和视频演示可在 https://github.com/bassamlab/SigmaRL 获取。
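
The abstract states that CBF constraint values are converted into a reward signal but does not give the formula. The following is a minimal sketch of one way this could look, assuming a scalar barrier function h (positive when safe), its time derivative h_dot under the chosen joint action, and the common CBF condition h_dot + alpha * h >= 0. The function names, the `alpha`, `scale`, and `clip` parameters, and the clipping itself are illustrative assumptions, not the authors' design.

```python
def cbf_constraint_value(h, h_dot, alpha=1.0):
    """Left-hand side of the CBF condition h_dot + alpha * h >= 0.

    h      -- barrier value (assumed positive when the state is safe)
    h_dot  -- time derivative of h under the evaluated joint action
    alpha  -- class-K gain (hypothetical tuning parameter)
    """
    return h_dot + alpha * h


def cbf_reward(h, h_dot, alpha=1.0, scale=1.0, clip=1.0):
    """Map the constraint value to a bounded reward (illustrative choice).

    Negative values mean the CBF condition is violated, so the agent is
    penalized; positive values mean the condition holds with margin.
    """
    value = scale * cbf_constraint_value(h, h_dot, alpha)
    return max(-clip, min(clip, value))


if __name__ == "__main__":
    # Example: gap to the safety distance is 1.0 m and shrinking at 3.0 m/s.
    print(cbf_reward(h=1.0, h_dot=-3.0, alpha=1.0))
```

In this sketch the reward depends only on quantities a CBF safety filter would already compute, which is one plausible reading of "CBF-informed" reward design.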


cs.RO / 20 / 2605.16932

MORN: Metacognitive Object-Goal Regulation for Resource-Rational Long-Horizon Navigation

MORN：面向资源理性长时间导航的元认知目标调节

Lin, Xi, Li, Jiayi, Wu, Kangyi, Tang, Jiaqiao, He, Qingrong, Zhao, Lin

Abstract

Robots deployed in unstructured human environments must frequently execute long-horizon missions, such as "find the mug, then the chair, then the printer", under strict operational constraints. While contemporary zero-shot Object Navigation (ObjectNav) agents leverage Vision-Language Models (VLMs) to effectively localize semantic targets, they operate as purely reactive systems that inherently lack global resource awareness. Consequently, because of partial observability, these agents inadvertently exhaust critical budgets, including time and battery, on infeasible subgoals, failing to balance local exploration with global mission viability. To bridge this gap by injecting resource-rationality into the navigation loop, we present MORN (Metacognitive Object-goal Regulation Navigation), an executive architecture inspired by Dual-Process Theory in cognitive science. MORN augments frozen navigation backbones with a System 2 meta-controller that continuously monitors the System 1 locomotor. By formalizing three neuro-cognitive states, Potentiality Index, Persistence Gating, and Evidence Accumulation, MORN dynamically regulates the mission schedule based on online estimates of progress velocity and perceptual uncertainty. This mechanism effectively neutralizes the Sunk Cost Fallacy, enabling agents to abort zombie goals early and decisively commit to achievable ones. Extensive experiments on the HM3D dataset demonstrate that MORN improves Goal Completion Rate (CR) from 0.23 to 0.30 and reduces Wasted Step Fraction (WSF) from 0.90 to 0.70, establishing that in resource-constrained autonomy, the metacognitive awareness of global resources is as critical as the reactive ability to navigate.

Chinese Translation

在非结构化的人类环境中部署的机器人必须经常执行长时间的任务，例如在严格的操作约束下依次找到杯子、椅子和打印机。尽管当代的零样本目标导航（Object Navigation，ObjectNav）代理利用视觉-语言模型（Vision-Language Models，VLMs）有效地定位语义目标，但它们作为纯粹的反应系统，固有地缺乏全球资源意识。因此，这些代理由于部分可观察性而无意中耗尽了关键预算，包括时间和电池，未能在局部探索与全球任务可行性之间取得平衡。为了通过将资源理性注入导航循环来弥补这一差距，我们提出了MORN（元认知目标调节导航），这是一种受到认知科学中双过程理论启发的执行架构。MORN通过一个系统2元控制器增强了冻结的导航骨干，该控制器持续监控系统1的运动。通过形式化三种神经认知状态：潜力指数（Potentiality Index）、持久性门控（Persistence Gating）和证据积累（Evidence Accumulation），MORN根据对进展速度和感知不确定性的在线估计动态调节任务计划。该机制有效中和了沉没成本谬误，使代理能够早期放弃无效目标，并果断承诺于可实现的目标。在HM3D数据集上的广泛实验表明，MORN将目标完成率（Goal Completion Rate，CR）从0.23提高到0.30，并将浪费步骤比例（Wasted Step Fraction，WSF）从0.90降低到0.70，确立了在资源受限的自主性中，全球资源的元认知意识与反应能力同样重要。


cs.RO / 21 / 2605.16979

NORM-Nav: Zero-Shot Mobile Robot Navigation with Natural Language Behavioral Constraints

NORM-Nav：具有自然语言行为约束的零-shot移动机器人导航

Huo, Dongjie, Wang, Junhui, Gao, Chao, Qiao, Yan, Zhang, Dong, Zhou, Guyue

Abstract

Mobile robots operating in human-centered environments must generate not only collision-free paths but also trajectories that follow local behavioral conventions. Conventional costmap-based navigation emphasizes geometric feasibility and often overlooks such requirements, which can result in socially inappropriate behaviors. This paper presents NORM-Nav, a zero-shot framework that integrates natural language behavioral constraints into costmap-based planning. An LLM parses each instruction into structured constraints and grounds them using real-time vision--LiDAR perception. These constraints are encoded as multi-layer costmaps that represent geometric, semantic, directional, and velocity cues and are directly compatible with standard grid-based planners. Simulation and real-world experiments indicate that NORM-Nav improves task success rates and produces trajectories closer to human references than representative baselines. The project website is available at https://ei-nav.github.io/NORM-Nav.

Chinese Translation

在以人为中心的环境中运行的移动机器人不仅必须生成无碰撞路径，还必须遵循当地行为规范的轨迹。传统的基于代价图的导航强调几何可行性，往往忽视这些要求，这可能导致社会不当行为。本文提出了NORM-Nav，一个将自然语言行为约束整合到基于代价图的规划中的零-shot框架。一个大型语言模型（LLM）将每个指令解析为结构化约束，并通过实时视觉-激光雷达（LiDAR）感知进行基础。将这些约束编码为多层代价图，表示几何、语义、方向和速度线索，并与标准网格规划器直接兼容。仿真和现实世界实验表明，NORM-Nav提高了任务成功率，并生成的轨迹比代表性基线更接近人类参考。项目网站可访问 https://ei-nav.github.io/NORM-Nav。


cs.RO / 22 / 2605.17033

Generalizable and Actionable Parts Pose Estimation with Symmetry Annotation-Free Learning Strategy

可推广且可操作的部件姿态估计：无对称注释学习策略

Chen, Wenxiao, Yuan, Xueyu, Liu, Liu, Wu, Di, Guo, Dan

Abstract

Generalizable robot-object interaction and manipulation, which is urgently needed, requires high-quality cross-category object perception. As a pioneering direction in this area, Generalizable and Actionable Parts (GAParts) understanding has attracted increasing attention from researchers. However, most recent works either handle the symmetry issue inadequately or require rich symmetry annotation, which severely impedes precise GAPart pose estimation in data-scarce scenarios. In this paper, we propose SAFAG, a novel Symmetry Annotation-Free framework for Generalizable and Actionable Parts Pose Estimation. Specifically, we propose a two-stage, stepwise-refinement framework for candidate-to-final quaternion regression, and treat symmetry prediction as a probability distribution problem solved with a self-supervised learning strategy. The experimental results demonstrate the superior performance and robustness of SAFAG. We believe that our work has great potential to be applied in many areas of embodied AI systems.

Chinese Translation

迫切需要可推广的机器人物体交互和操作，这要求高质量的跨类别物体感知。作为该领域的先驱，可推广且可操作的部件（Generalizable and Actionable Parts, GAParts）理解引起了相关研究者的越来越多关注。然而，最近的大多数研究要么在对称性问题的设计上不足，要么需要丰富的对称注释，这严重阻碍了在数据匮乏场景下精确的GAPart姿态估计。本文提出了一种新颖的无对称注释框架（SAFAG），用于可推广且可操作的部件姿态估计。具体而言，我们建议一种逐步细化的两阶段框架用于候选到最终四元数回归，并将对称性预测视为一个概率分布问题，通过自监督学习策略进行处理。实验结果证明了我们SAFAG的卓越性能和鲁棒性。我们相信我们的工作在许多具身人工智能系统的应用领域具有巨大的潜力。


cs.RO / 23 / 2605.17077

How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning

如何指导您的机器人：密集语言注释推动机器人策略学习

Kim, Bosung, Wang, Ruiyi, Acuna, David, Jung, Jaehun, Trevithick, Alexander, Cui, Brandon, Choi, Yejin, Ammanabrolu, Prithviraj

Abstract

Scaling robot policy learning is bottlenecked by the cost of collecting demonstrations, while language annotations for existing demonstrations are comparatively cheap. We study language density as a lever for extracting more signal from a fixed robot or egocentric-video corpus. We introduce DeMiAn (Dense Multi-aspect Annotation), a two-stage approach that first re-labels demonstration segments with VLM-generated annotations along four complementary aspects: physical motion, scene composition, arm pose, and reasoning. A learned instructor then maps a task description and initial scene snapshot to a task-appropriate annotation at deployment, running asynchronously so generation latency is hidden behind policy execution. Across over 1M robot manipulation clips and 50K EgoVerse human-egocentric videos, DeMiAn improves both a vision-language-action policy and a video-based world-action model without collecting new demonstrations. On RoboCasa, the instructor raises success by 5 points over a task-only baseline and comes within 3 points of a per-task oracle. No fixed annotation aspect dominates across tasks, showing that selecting the right dense language matters. DeMiAn also improves composite-task and out-of-distribution performance, and shifts the compute-performance frontier in both mid-training and post-training after accounting for annotation-generation FLOPs. These results position dense re-annotation as a practical scaling lever for robot policy learning.

Chinese Translation

机器人策略学习的规模化受到收集示范成本的制约，而现有示范的语言注释相对便宜。我们研究语言密度作为从固定机器人或自我中心视频语料库中提取更多信号的杠杆。我们提出了DeMiAn（密集多方面注释），这是一种两阶段的方法，首先用VLM生成的注释对示范片段进行重新标注，涵盖四个互补方面：物理运动、场景组成、手臂姿势和推理。一个学习的指导者随后将任务描述和初始场景快照映射到部署时适当的任务注释，异步运行以隐藏生成延迟在策略执行之后。在超过100万个机器人操作片段和50,000个EgoVerse人类自我中心视频中，DeMiAn在不收集新示范的情况下改善了视觉-语言-动作策略和基于视频的世界-动作模型。在RoboCasa上，指导者的成功率比仅基于任务的基线提高了5个百分点，并且距离每个任务的oracle仅差3个百分点。没有固定的注释方面在各任务中占主导地位，显示出选择正确的密集语言是重要的。DeMiAn还改善了复合任务和分布外性能，并在考虑注释生成的FLOPs后，推动了中期训练和后期训练的计算性能边界。这些结果将密集重新注释定位为机器人策略学习的一个实用规模化杠杆。


cs.RO / 24 / 2605.17144

Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States

对比概念激活引导 (COAST)：通过隐藏状态解锁视觉-语言-行动模型

Miao, Miranda Muqing, Kim, Subin, Yang, Brandon, Ungar, Lyle

Abstract

Vision-Language-Action (VLA) models leverage powerful perceptual priors from web-scale Vision-Language Model (VLM) pre-training, yet they remain surprisingly brittle in practice, frequently failing at simple robotic tasks. To mitigate this, we propose Contrastive Conceptor Activation Steering (COAST). COAST builds on the notion of a "conceptor", a linear operator that soft-projects data into the principal components of a target distribution. COAST uses conceptors to identify success-critical subspaces for a target robotic task from a few examples of success and failure rollouts. At inference time, it steers VLA latents into these identified success subspaces to improve task outcomes. Across three architecturally distinct neural policies (flow-matching VLA, autoregressive VLA, and Diffusion Policy), COAST improves absolute mean simulation and real-robot task success rate by over 20 and 40% respectively. The activation subspace geometry reveals that failure modes share substantial structure across tasks while success representations remain largely task-specific. When tasks share similar failure modes, this structure enables previously fitted conceptors to improve performance on new tasks without refitting. Ultimately, our results suggest that current VLAs retain substantial task-relevant knowledge in their latent representations, and that the action expert's decoding bottleneck could be mitigated by steering its residual stream toward task-relevant subspaces. COAST provides a lightweight, training-free path to unlocking these latent capabilities by steering the model towards its own "success" distributions.

Chinese Translation

视觉-语言-行动 (VLA) 模型利用来自网络规模视觉-语言模型 (VLM) 预训练的强大感知先验，然而它们在实践中仍然表现得相当脆弱，常常在简单的机器人任务中失败。为了解决这个问题，我们提出了对比概念激活引导 (COAST)。COAST 基于“概念器”的概念，这是一种将数据软投影到目标分布主成分的线性算子。COAST 使用概念器从少量成功和失败的示例中识别目标机器人任务的成功关键子空间。在推理时，它将 VLA 潜变量引导到这些识别出的成功子空间，以改善任务结果。在三种结构上不同的神经策略（流匹配 VLA、自回归 VLA 和扩散策略）中，COAST 分别提高了绝对均值模拟和真实机器人任务的成功率超过 20% 和 40%。激活子空间几何揭示了失败模式在任务间共享显著结构，而成功表示则在很大程度上保持任务特异性。当任务共享相似的失败模式时，这种结构使得之前拟合的概念器能够在新任务上提高性能而无需重新拟合。最终，我们的结果表明，当前的 VLA 在其潜在表示中保留了大量与任务相关的知识，并且行动专家的解码瓶颈可以通过将其残差流引导到与任务相关的子空间来缓解。COAST 提供了一种轻量级、无训练的路径，通过引导模型朝向其自身的“成功”分布来解锁这些潜在能力。
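
As a rough illustration of the "conceptor" that the abstract describes as a linear operator soft-projecting data onto the principal components of a target distribution, the standard definition from the conceptor literature can be written as below. The symbols $x_i$ (hidden-state samples), $R$ (their correlation matrix), and $\alpha$ (the aperture) are notation introduced here rather than taken from the abstract, and the abstract does not specify how COAST contrasts success and failure conceptors, so that step is not shown.

```latex
\begin{align}
R &= \frac{1}{N}\sum_{i=1}^{N} x_i x_i^{\top}, \\
C &= R\,\bigl(R + \alpha^{-2} I\bigr)^{-1}, \\
\text{and if } R &= U\,\mathrm{diag}(\sigma_1,\dots,\sigma_d)\,U^{\top}
\text{, then }
C = U\,\mathrm{diag}\!\left(\frac{\sigma_j}{\sigma_j + \alpha^{-2}}\right) U^{\top}.
\end{align}
```

Each gain $\sigma_j / (\sigma_j + \alpha^{-2})$ lies in $[0, 1)$, so applying $C$ to a hidden state attenuates low-variance directions smoothly instead of cutting them off as a hard projection would.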


cs.RO / 25 / 2605.17204

Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies

基于事件的稀疏自编码器用于视觉-语言-动作策略

Jin, Xinchen, Chatterjee, Aditya, Kumar, Pranav, Paleja, Rohan

Abstract

Vision-Language-Action (VLA) policies translate language and visual inputs into robot actions, where their hidden representations directly shape closed-loop behavior. However, mechanistic interpretability tools from language and vision-language models do not transfer cleanly to VLAs: outputs are robot actions rather than human-readable tokens, and interventions can only be tested via expensive closed-loop rollouts. We propose an event-grounded interpretability pipeline that anchors SAE feature analysis to behavioral events rather than text contexts. End-effector keyframes are clustered within each task using visual, state, and temporal cues, linking SAE features to behaviorally salient events and, via optional VLM annotations, to semantic context. To our knowledge, our pipeline is among the first to ground SAE-based VLA analysis in closed-loop behavioral events. Across two simulation architectures and a real-robot study, event-grounded ranking yields the strongest causal effects on OpenVLA and transfers to the continuous action chunks of $\pi_{0.5}$. SAE is a sparse but imperfect intervention basis: usability varies with architecture and intervention site, and aggressive intervention reveals safety and interpretability limits. Overall, event-grounded SAE analysis emerges as a practical starting point for behavior-anchored VLA interpretability, motivating future work on SAE features beyond action-aligned coordinates, finer-grained closed-loop evaluation, and safe interventions for high-stakes VLA deployments. Code is available at https://github.com/xc-j/Event-SAE.

Chinese Translation

视觉-语言-动作（VLA）策略将语言和视觉输入转化为机器人动作，其隐含表示直接影响闭环行为。然而，来自语言和视觉-语言模型的机械可解释性工具并不能直接应用于VLA：输出是机器人动作而非人类可读的符号，并且干预只能通过昂贵的闭环回滚测试。我们提出了一种基于事件的可解释性流程，将稀疏自编码器（SAE）特征分析锚定在行为事件上，而非文本上下文。通过视觉、状态和时间线索对每个任务中的末端执行器关键帧进行聚类，将SAE特征与行为显著事件联系起来，并通过可选的视觉-语言模型（VLM）注释链接到语义上下文。据我们所知，我们的流程是首批将基于SAE的VLA分析与闭环行为事件相结合的研究之一。在两个仿真架构和一个真实机器人研究中，基于事件的排名对OpenVLA产生了最强的因果效应，并且可以转移到$\pi_{0.5}$的连续动作块上。SAE是一个稀疏但不完美的干预基础：可用性因架构和干预位置而异，激进的干预揭示了安全性和可解释性的局限性。总体而言，基于事件的SAE分析作为行为锚定的VLA可解释性的实用起点，激励未来在超越动作对齐坐标的SAE特征、细粒度闭环评估以及高风险VLA部署的安全干预方面的研究。代码可在 https://github.com/xc-j/Event-SAE 获取。


cs.RO / 26 / 2605.17229

Generating Realistic Safety-Critical Scenarios for Vehicle-Pedestrian Interactions

生成现实的安全关键场景以应对车辆与行人互动

Pu, Qingwen, Xie, Kun, Zhu, Yuan, Zhai, Guocong

Abstract

Automated driving system deployment requires rigorous validation across safety-critical vehicle-pedestrian interactions, yet real-world datasets rarely capture high-risk scenarios while simulation platforms lack realistic behavior. In response, this study proposes a three-stage framework that combines real-world grounding with adaptive simulation to generate behaviorally realistic safety-critical scenarios at scale. Stage 1 pre-trains multi-agent state-space Transformer-enhanced DDPG (MA-SST-DDPG) agents on real-world safety-critical data to learn human-like interactive evasive behaviors through data-driven learning. Stage 2 deploys the pre-trained agents in CARLA for online reinforcement learning to generalize across diverse scenarios, integrating real-world knowledge with simulation experience to produce a refined MA-SST-DDPG model. Stage 3 uses CARLA with the refined model to generate over 198,000 high-resolution interaction episodes from eight intersection scenarios, culminating in the Vehicle-Pedestrian Safety-Critical Interaction (VPSCI) dataset. The refined MA-SST-DDPG model outperformed baseline methods in reproducing realistic evasive behaviors, achieving the lowest trajectory errors (ADE = 0.072 m, FDE = 0.142 m). Statistical comparison confirmed distributional equivalence between the generated and real-world data in both conflict severity and behavioral response. A Turing test confirmed that the evasive behaviors generated by the three-stage framework were indistinguishable from real-world interactions. These results demonstrate the framework's effectiveness in producing high-fidelity safety-critical data, offering valuable sources for the development of ADS and simulation-based safety evaluations.

Chinese Translation

自动驾驶系统的部署需要在安全关键的车辆与行人互动中进行严格的验证，然而现实世界的数据集很少捕捉到高风险场景，而模拟平台则缺乏真实的行为。为此，本研究提出了一个三阶段框架，将现实世界的基础与自适应模拟相结合，以大规模生成行为上真实的安全关键场景。第一阶段在现实世界的安全关键数据上预训练多智能体状态空间增强型DDPG（MA-SST-DDPG）智能体，通过数据驱动学习来学习类人互动的规避行为。第二阶段在CARLA中部署预训练的多智能体进行在线强化学习，以在多样化场景中进行泛化，将现实世界知识与模拟经验相结合，生成改进的MA-SST-DDPG模型。第三阶段使用改进模型的CARLA生成来自八个交叉口场景的超过198,000个高分辨率互动剧集，最终形成车辆-行人安全关键互动（VPSCI）数据集。改进的MA-SST-DDPG模型在再现真实的规避行为方面优于基线方法，达到了最低的轨迹误差（ADE = 0.072 m，FDE = 0.142 m）。统计比较确认生成数据与现实世界数据在冲突严重性和行为反应上的分布等价性。图灵测试确认三阶段框架生成的规避行为与现实世界互动无可区分。这些结果证明了该框架在生成高保真安全关键数据方面的有效性，为自动驾驶系统的开发和基于模拟的安全评估提供了宝贵的资源。
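
The trajectory errors quoted above (ADE and FDE) are reported without definition. Under the usual convention, ADE is the mean Euclidean distance between predicted and reference positions over all time steps, and FDE is that distance at the final step. The sketch below assumes this convention and equal-length trajectories of 2D points; the function name `ade_fde` is illustrative, and the abstract does not say how the authors aggregate over agents or episodes.

```python
import math


def ade_fde(predicted, reference):
    """Return (ADE, FDE) for two trajectories given as lists of (x, y) points.

    ADE: average displacement error over all time steps.
    FDE: displacement error at the last time step.
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("trajectories must be non-empty and of equal length")
    # Per-step Euclidean distance between matching points.
    errors = [math.dist(p, r) for p, r in zip(predicted, reference)]
    return sum(errors) / len(errors), errors[-1]


if __name__ == "__main__":
    pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    ref = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    ade, fde = ade_fde(pred, ref)
    print(f"ADE = {ade:.3f} m, FDE = {fde:.3f} m")
```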


cs.RO / 27 / 2605.17249

SEDualVLN: A Spatially-Enhanced Dual-System for Vision-Language Navigation

SEDualVLN：一种空间增强的双系统视觉-语言导航框架

Huang, Jingzhi, Huang, Junkai, Song, Wenxuan, Yang, Haoyang, Huang, Hailong, Li, Haoang, Wang, Yi

Abstract

Vision-Language Navigation (VLN) approaches currently follow two primary paradigms: the end-to-end Vision-Language Model (VLM) policy fine-tuned on navigation trajectories to directly predict actions, and the zero-shot modular pipeline integrating a pre-trained Multimodal Large Language Model (MLLM) for training-free generalization to unseen environments. However, end-to-end methods struggle with long-horizon navigation and lack dynamic reasoning, whereas zero-shot methods are constrained by limited spatial grounding for reliable planning and also require substantial reasoning time. To bridge this gap, we introduce SEDualVLN, a spatially-enhanced dual-system VLN framework. System 1 is a VLM enhanced with both global and local spatial awareness, used for action generation. System 2 integrates a general MLLM with a mapping module, wherein the MLLM plans waypoints by leveraging top-down views of the real-time 3D map alongside streams of rendered path images. Both systems leverage different forms of spatial enhancement to cultivate the agent's sense of direction in VLN tasks. Ultimately, they cooperate to complete the navigation task through a fast-slow coordinated approach. SEDualVLN achieves state-of-the-art performance on VLN-CE benchmarks, and further ablation studies demonstrate the effectiveness of each system and module.

Chinese Translation

视觉-语言导航（VLN）方法目前主要遵循两种基本范式：一种是基于导航轨迹微调的端到端视觉-语言模型（VLM）策略，直接预测动作；另一种是集成预训练多模态大型语言模型（MLLM）的零-shot模块化管道，实现对未见环境的无训练泛化。然而，端到端方法在长距离导航中表现不佳，且缺乏动态推理能力，而零-shot方法则受到有限空间基础的限制，难以进行可靠规划，并且需要大量推理时间。为了解决这一问题，我们提出了SEDualVLN，一个空间增强的双系统VLN框架。系统1是一个增强了全局和局部空间意识的VLM模型，用于动作生成。系统2则集成了一个通用的MLLM和一个映射模块，其中MLLM通过利用实时3D地图的自上而下视图及渲染路径图像流来规划路径点。两个系统利用不同形式的空间增强来培养代理在VLN任务中的方向感。最终，它们通过快慢协调的方法合作完成导航任务。SEDualVLN在VLN-CE基准测试中实现了最先进的性能，进一步的消融研究证明了每个系统和模块的有效性。


cs.RO / 28 / 2605.17264

Stretch-ICP: A Continuous-Trajectory Registration and Deskewing Algorithm in Scenarios of Aggressive Motions

Stretch-ICP：一种在激烈运动场景下的连续轨迹配准与去倾斜算法

Deschênes, Simon-Pierre, Vannini, Veronica, Giguère, Philippe, Pomerleau, François

Abstract

Robust robotic autonomy remains challenging in complex environments, where loss of stability on uneven or slippery terrain can induce extreme accelerations and angular velocities. Such motions corrupt sensor measurements and degrade state estimation, motivating the need for improved algorithmic robustness. To investigate this issue, we introduce the Tumbling-Induced Gyroscope Saturation (TIGS) dataset, which consists of recordings from a mechanical lidar and an Inertial Measurement Unit (IMU) tumbling down a hill. The dataset contains angular speeds up to four times higher than those in similar datasets and is publicly available. We then propose two complementary methods to improve Simultaneous Localization And Mapping (SLAM) robustness and evaluate them on TIGS. First, Saturation-Aware Angular Velocity Estimation (SAAVE) estimates angular velocities when gyroscope measurements become saturated during aggressive motions, reducing angular speed estimation error by 83.4%. Second, Stretch-ICP, a novel registration and deskewing algorithm, enables reconstruction of smoother 6-Degrees Of Freedom (DOF) trajectories under aggressive motions compared to classical Iterative Closest Point (ICP). Stretch-ICP reduces linear and angular velocity errors by 95.2% and 94.8%, respectively, at scan boundaries. Together, these contributions improve the robustness and consistency of lidar-inertial state estimation under aggressive motions.

Chinese Translation

在复杂环境中，稳健的机器人自主性仍然面临挑战，尤其是在不平坦或滑溜的地形上，失去稳定性可能导致极端的加速度和角速度。这种运动会破坏传感器测量并降低状态估计的精度，因此需要改进算法的鲁棒性。为此，我们引入了翻滚引起的陀螺仪饱和（Tumbling-Induced Gyroscope Saturation, TIGS）数据集，该数据集包含了机械激光雷达和惯性测量单元（Inertial Measurement Unit, IMU）在下坡过程中记录的数据。该数据集的角速度最高可达类似数据集中四倍，并且是公开可用的。接着，我们提出了两种互补的方法来提高同时定位与地图构建（Simultaneous Localization And Mapping, SLAM）的鲁棒性，并在TIGS上进行了评估。首先，饱和感知角速度估计（Saturation-Aware Angular Velocity Estimation, SAAVE）在激烈运动中陀螺仪测量饱和时估计角速度，将角速度估计误差降低了83.4%。其次，Stretch-ICP是一种新颖的配准与去倾斜算法，使得在激烈运动下重建的6自由度（6-Degrees Of Freedom, DOF）轨迹比经典的迭代最近点（Iterative Closest Point, ICP）更加平滑。Stretch-ICP在扫描边界处将线速度和角速度误差分别降低了95.2%和94.8%。这些贡献共同提高了激烈运动下激光雷达-惯性状态估计的鲁棒性和一致性。


cs.RO / 29 / 2605.17293

Task Capability Improvement Algorithm for Collaborative Manipulators

协作机械手的任务能力提升算法

Patra, Keshab, Sinha, Arpita, Guha, Anirban

Abstract

This work introduces a method for improving cooperative task capability by exploiting additional moments. The manipulators apply forces at the object's grasp points. Applying a force at a point other than the object's center of gravity produces a moment that is usually regarded as undesired. Here, this moment is used as an additional moment that improves the capability of each individual manipulator and, hence, of the entire collaborative group. Any improvement in task capability adds directly to the object and transportation capability. The group's enhanced capability also helps achieve optimal capability, optimal resource allocation, and maximum fault tolerance in object manipulation. Our simulation results show a capability improvement of 5.86% compared with the case in which no moment is used to enhance the capability of the manipulators.

Chinese Translation

本研究提出了一种利用附加力矩的协作任务能力提升方法。机械手在物体的抓取点施加力。在物体重心以外的点施加力会产生不期望的力矩。这种不期望的力矩作为附加力矩，提升了单个机械手的能力，从而提升了整个协作组的能力。任务能力的任何提升都会直接增加物体和运输能力。该组的增强能力还帮助实现最佳能力、最佳资源分配和最大故障容忍度在物体操作中的应用。我们的仿真结果显示，与未使用力矩提升机械手能力的情况相比，能力提升了5.86%。
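
The physical fact the abstract relies on is that a force applied away from the center of gravity produces a moment equal to the cross product of the lever arm and the force. The sketch below only illustrates that relation; the function names and the example numbers are made up for illustration, and the abstract does not describe how the authors' algorithm allocates these moments across manipulators.

```python
def cross(a, b):
    """Cross product of two 3D vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def moment_about_cog(grasp_point, center_of_gravity, force):
    """Moment (N*m) that `force` applied at `grasp_point` exerts about the CoG."""
    lever_arm = tuple(g - c for g, c in zip(grasp_point, center_of_gravity))
    return cross(lever_arm, force)


if __name__ == "__main__":
    # A 10 N downward force applied 1 m from the CoG along the x axis.
    print(moment_about_cog((1, 0, 0), (0, 0, 0), (0, 0, -10)))
```

When the grasp point coincides with the center of gravity, the lever arm is zero and so is the moment, which is the baseline case the abstract compares against.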


cs.RO / 30 / 2605.17300

HCLM: A Hierarchical Framework for Cooperative Loco-Manipulation with Dual Quadrupeds

HCLM：一种用于双四足机器人的协作运动操控的分层框架

Li, Qixuan, Le, Chen, Yu, Jincheng, Chen, Xinlei

Abstract

We introduce HCLM, a hierarchical framework for general-purpose cooperative loco-manipulation with dual quadrupedal systems. Coordinating multi-robot collaborative manipulation across floating bases is highly challenging due to the conflicting demands of spatial coordination, robust locomotion, and closed-chain physical interactions. To resolve this, our architecture systematically decouples high-level collaborative reasoning from low-level robust motion execution. At the high level, a centralized Joint Diffusion Policy leverages an SE(3)-invariant task-space representation to learn coordinate-agnostic spatial coordination patterns. To translate these frame-agnostic references into physical motion, a task-centric hybrid Whole-Body Controller synergizes a proactive kinematic Model Predictive Control for collision-free velocity distribution with a reactive execution layer. Crucially, this reactive layer guarantees rapid responsiveness for precise end-effector tracking, while concurrently integrating active force regulation via a cooperative admittance scheme to safely resolve kinematic conflicts and strictly regulate internal stresses during closed-chain interactions. We validate the framework across progressively challenging simulated scenarios, including cooperative carrying, packing and handovers, and successfully deploy the latter in the real world. The results demonstrate reliable task execution, strict configuration agnosticism, and exceptional resilience against severe physical perturbations, offering a highly robust pathway for multi-robot embodied coordination.

Chinese Translation

我们介绍了HCLM，一个用于双四足系统的通用协作运动操控的分层框架。在浮动基座上协调多机器人协作操控面临着空间协调、稳健运动和闭链物理交互之间相互冲突的需求，这一挑战极为复杂。为了解决这一问题，我们的架构系统性地将高层次的协作推理与低层次的稳健运动执行解耦。在高层次上，一个集中式的联合扩散策略（Joint Diffusion Policy）利用SE(3)不变的任务空间表示法来学习与坐标无关的空间协调模式。为了将这些与框架无关的参考转换为物理运动，一个以任务为中心的混合全身控制器（Whole-Body Controller）将主动的运动学模型预测控制（Model Predictive Control）与反应式执行层相结合，以实现无碰撞的速度分配。至关重要的是，这一反应层确保了对精确末端执行器跟踪的快速响应，同时通过协作导纳方案（cooperative admittance scheme）整合主动力调节，以安全地解决运动学冲突并严格调控闭链交互过程中的内部应力。我们在逐步挑战的模拟场景中验证了该框架，包括协作搬运、打包和交接，并成功地在现实世界中部署了后者。结果表明，该框架能够可靠地执行任务，严格保持配置无关性，并在严重的物理扰动下表现出卓越的韧性，为多机器人具身协调提供了一条高度稳健的路径。


cs.RO / 31 / 2605.17302

Beyond Geometry: Efficient Topologically-Grounded Navigation in Complex 3D Environments

超越几何：在复杂三维环境中高效的拓扑基础导航

Du, Yifan, Zhang, Chengwei, Liao, Siyu, Wang, Zhongfeng

Abstract

Ground robot navigation in complex 3D environments is often hindered by geometric ambiguity, where non-traversable structures such as furniture share local geometric properties with navigable ground. Furthermore, the computational cost of searching massive voxel spaces remains a significant challenge. To address these issues, we present a surface extraction framework that constructs a reduced state space of physically reachable standing positions by enforcing ground support, overhead clearance, and seed-based connectivity constraints. Evaluation across five Matterport3D indoor scenes and three PCT benchmark scenes demonstrates over 80% state space reduction and sub-millisecond A* search on the Matterport3D scenes, with 100% planning success across all 300 tested queries.

Chinese Translation

在复杂的三维环境中，地面机器人导航常常受到几何模糊的阻碍，其中不可通行的结构（如家具）与可导航地面共享局部几何特性。此外，搜索庞大的体素空间的计算成本仍然是一个重大挑战。为了解决这些问题，我们提出了一种表面提取框架，通过强制执行地面支撑、头顶间隙和基于种子的连通性约束，构建了一个可物理到达的站立位置的简化状态空间。在五个Matterport3D室内场景和三个PCT基准场景的评估中，显示出超过80%的状态空间减少，并且在Matterport3D场景上实现了亚毫秒级的A*搜索，所有300个测试查询的规划成功率达到100%。
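
The three constraints named in the abstract (ground support, overhead clearance, seed-based connectivity) can be sketched on a toy voxel grid as follows. This is an illustrative reading rather than the authors' implementation: the voxel-set representation, the `clearance` parameter, the 4-neighbour connectivity, and the one-voxel step height are all assumptions.

```python
from collections import deque


def reachable_standing_cells(occupied, seed, clearance=2):
    """Return the free voxels a robot could stand in, reachable from `seed`.

    occupied  -- set of (x, y, z) solid voxels
    seed      -- a known standing voxel (x, y, z)
    clearance -- number of free voxels required from the standing cell upward
    """
    def standable(cell):
        x, y, z = cell
        if (x, y, z - 1) not in occupied:          # ground support
            return False
        return all((x, y, z + k) not in occupied   # overhead clearance
                   for k in range(clearance))

    if not standable(seed):
        return set()
    reached = {seed}
    queue = deque([seed])
    while queue:                                   # seed-based connectivity (BFS)
        x, y, z = queue.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            for dz in (-1, 0, 1):                  # allow a one-voxel step up or down
                nxt = (x + dx, y + dy, z + dz)
                if nxt not in reached and standable(nxt):
                    reached.add(nxt)
                    queue.append(nxt)
    return reached


if __name__ == "__main__":
    floor = {(x, y, 0) for x in range(3) for y in range(3)}
    tabletop = {(1, 1, 2)}                         # furniture above the floor centre
    cells = reachable_standing_cells(floor | tabletop, seed=(0, 0, 1))
    print(len(cells), "reachable standing cells")
```

In the toy scene the tabletop is geometrically a flat supported surface, yet it is excluded because it cannot be reached from the seed, and the cell beneath it is excluded for lack of headroom. A planner such as A* would then search only the returned set.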


cs.RO / 32 / 2605.17327

Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model

基于前馈3D模型的单目视觉惯性系统高效无特征初始化

Zhang, Yuantai, Yang, Jiaqi, Zeng, Huajian, Chen, Changhao, Li, Haoang, Li, Liang, Song, Dezhen, Zuo, Xingxing

Abstract

Fast and reliable initialization is critical for monocular visual-inertial navigation systems (VINS), as it establishes the starting conditions for subsequent state estimation. Despite steady progress, most existing methods heavily rely on visual feature correspondences and require 3-4 seconds of sensory data for successful initialization, which limits their applicability and efficiency. With the advent of feed-forward 3D models that can directly predict point clouds from images, we revisit the visual-inertial initialization problem from a concise perspective. In this work, we propose a feature-free initialization framework that leverages up-to-scale point clouds predicted by a feed-forward 3D model, thereby obviating the need for visual feature tracking and estimation. This design substantially reduces system complexity and improves the reliability of initialization. Experiments on public datasets demonstrate that the proposed feature-free initialization method achieves the highest success rate, exceeding 90%, and significantly reduces the data duration required for successful initialization, typically to under 1.2 s. We further validate our method on a self-collected dataset covering various indoor and outdoor scenarios, demonstrating robust performance, particularly in visually degraded environments where existing methods often fail. The code and dataset are available at https://github.com/Yuantai-Z/FF-VIO-Init.

Chinese Translation

快速可靠的初始化对于单目视觉惯性导航系统（VINS）至关重要，因为它为后续状态估计建立了起始条件。尽管已有稳步进展，但大多数现有方法严重依赖于视觉特征对应关系，并需要3-4秒的传感器数据才能成功初始化，这限制了它们的适用性和效率。随着能够直接从图像预测点云的前馈3D模型的出现，我们从一个简洁的角度重新审视视觉惯性初始化问题。在本研究中，我们提出了一种无特征初始化框架，该框架利用前馈3D模型预测的尺度未定（up-to-scale）的点云，从而消除了对视觉特征跟踪和估计的需求。这一设计显著降低了系统复杂性，并提高了初始化的可靠性。在公共数据集上的实验表明，所提出的无特征初始化方法实现了超过90%的最高成功率，并显著减少了成功初始化所需的数据持续时间，通常在1.2秒以内。我们进一步在自收集的数据集上验证了我们的方法，该数据集涵盖了各种室内和室外场景，展示了稳健的性能，特别是在现有方法常常失效的视觉退化环境中。代码和数据集可在 https://github.com/Yuantai-Z/FF-VIO-Init 获取。

cs.RO / 33 / 2605.17336

Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms

基于触觉的多模态融合在具身智能中的应用：视觉、语言和接触驱动范式的综述

Cao, Zhixiang, Tian, Di, Guan, Runwei, Mu, Yanzhou, Sun, Xiaolou, Liang, Shaofeng, Liu, Daizong, Huang, Tao, Yue, Yutao, Ding, Henghui, Fang, Bin, Zhou, Alex, Han, Qing-Long, Xiong, Hui

Abstract

Tactile sensing is a fundamental modality for embodied intelligence, offering unique and direct feedback on contact geometry, material properties, and interaction dynamics that remote sensors cannot replace. However, unimodal tactile perception is inherently limited by its sparse spatial coverage and lack of global semantic context. With the recent explosion in deep learning and large language models, integrating tactile sensing with vision and language has become essential to bridge physical interaction with semantic reasoning, leading to the emergence of Multimodal Tactile Fusion. Despite rapid progress, existing research remains fragmented across disparate datasets, sensing modalities, and tasks, and lacks a unified theoretical framework. To address this gap, this paper provides a comprehensive survey of multimodal tactile fusion research up to the first quarter of 2026. We propose a hierarchical taxonomy that organizes the field along two primary dimensions: multimodal datasets and multimodal methods. On the data side, we categorize resources into Tactile-Vision datasets, Tactile-Language datasets, Tactile-Vision-Language datasets, and Tactile-Vision-Other datasets. On the method side, we structure prior work into three core pillars: (1) Multimodal Perception and Recognition, which focuses on object understanding and grasp prediction; (2) Cross-Modal Generation, which focuses on bidirectional translation between tactile, vision, and text; and (3) Multimodal Interaction, which emphasizes feedback control and language-guided manipulation. Furthermore, we summarize representative tactile sensing hardware, review commonly used evaluation metrics and benchmark settings, and discuss current challenges and promising future directions.

Chinese Translation

触觉感知是具身智能的一个基本模态，提供了关于接触几何、材料属性和交互动态的独特且直接的反馈，这是远程传感器无法替代的。然而，单一的触觉感知本质上受到稀疏空间覆盖和缺乏全局语义上下文的限制。随着深度学习和大型语言模型的快速发展，将触觉与视觉和语言结合已成为弥合物理交互与语义推理之间的重要途径，从而催生了多模态触觉融合。尽管取得了快速进展，现有研究仍然在不同数据集、感知模态和任务之间碎片化，缺乏统一的理论框架。为了解决这一问题，本文提供了截至2026年第一季度的多模态触觉融合研究的全面综述。我们提出了一个层次化的分类法，将该领域组织为两个主要维度：多模态数据集和多模态方法。在数据方面，我们将资源分类为触觉-视觉数据集、触觉-语言数据集、触觉-视觉-语言数据集和触觉-视觉-其他数据集。在方法方面，我们将先前的工作结构化为三个核心支柱：（1）多模态感知与识别，重点关注物体理解和抓取预测；（2）跨模态生成，关注触觉、视觉和文本之间的双向翻译；（3）多模态交互，强调反馈控制和语言引导的操作。此外，我们总结了代表性的触觉感知硬件，回顾了常用的评估指标和基准设置，并讨论了当前的挑战和有前景的未来方向。

cs.RO / 34 / 2605.17421

MUSE: Multimodal Uncertainty Quantification of State Estimation

MUSE：状态估计的多模态不确定性量化

Kim, Minkyung, Che, Henry, Chandaka, Bhargav, Pramuanpornsatid, Bhumsitt, Yang, Chengyu, Cheng, Sheng, Wang, Xiaofeng, Hovakimyan, Naira, Wang, Shenlong

Abstract

Accurate visual state estimation has been a central topic in robotics with a wide range of applications in robot navigation, autonomous driving, and autonomous flight. Recent advances in robot perception have led to significant improvements in the accuracy and robustness of state estimation, yet a fundamental challenge remains in how to quantify and calibrate its precision, i.e., how confident we are in an estimate and whether failures can be detected. This issue is particularly pronounced in visual-inertial odometry (VIO), where the heteroscedastic and multimodal nature of the problem makes uncertainty quantification especially difficult. This paper introduces MUSE (Multimodal Uncertainty Quantification of State Estimation), a novel real-time learning-based framework that leverages the strong and efficient sequential modeling capacity of Mamba to estimate localization uncertainty from multiple asynchronous sensor streams. Experiments on both public and in-house datasets demonstrate that MUSE achieves superior reliability and robustness compared to existing uncertainty quantification methods, and ablation studies justify the benefits of its key design choices.

Chinese Translation

准确的视觉状态估计一直是机器人领域的核心主题，广泛应用于机器人导航、自动驾驶和自主飞行等领域。近年来，机器人感知的进展显著提高了状态估计的准确性和鲁棒性，但在如何量化和校准其精度方面仍然存在一个根本性挑战，即我们对估计的信心程度以及是否能够检测到失败。这个问题在视觉惯性里程计（VIO）中尤为明显，因为该问题的异方差性和多模态特性使得不确定性量化变得特别困难。本文介绍了MUSE（状态估计的多模态不确定性量化），这是一种新颖的实时基于学习的框架，利用Mamba强大而高效的序列建模能力，从多个异步传感器流中估计定位不确定性。在公共和内部数据集上的实验表明，MUSE在可靠性和鲁棒性方面优于现有的不确定性量化方法，消融研究也证明了其关键设计选择的优势。

cs.RO / 35 / 2605.17477

Rapid Vibration Suppression and Trajectory Tracking of a Serial Manipulator with Multi-Flexible Links

具有多柔性连杆的串联机械手的快速振动抑制与轨迹跟踪

Wang, Chengyi, Huang, Yilong, Wang, Ji

Abstract

Flexible robotic manipulators (FRMs) offer advantages in lightweight design and large workspace, but their structural flexibility induces vibrations, accelerates fatigue, degrades tracking performance, and limits operational speed. These challenges are further amplified in multi-link serial manipulators, where increased overall length leads to greater structural flexibility. This article presents a backstepping output-feedback framework for fast vibration suppression and tip tracking of an n-degree-of-freedom serial flexible manipulator robot (nDSFMR), with a DeepONet-based approximation for practical deployment. Each link-joint is modeled as a Timoshenko beam coupled with an ODE and transformed into a canonical hyperbolic PDE with boundary dynamics. A backstepping-based boundary controller at the joint is developed to equivalently inject distributed damping along the beam, enabling rapid vibration suppression and trajectory tracking, only using available boundary measurements. To enable real-time implementation and scalability, a DeepONet neural operator is introduced to approximate the backstepping kernels, significantly reducing computational cost and facilitating fast controller updates under varying operating conditions. Experiments on a two-link flexible manipulator demonstrate faster vibration suppression and convergence of the end-effector to the desired trajectory, compared with a linear quadratic regulator (LQR) with feedforward control.

Chinese Translation

柔性机器人机械臂（FRMs）在轻量化设计和大工作空间方面具有优势，但其结构柔性会引发振动，加速疲劳，降低跟踪性能，并限制操作速度。这些挑战在多连杆串联机械手中进一步加剧，因为整体长度的增加导致更大的结构柔性。本文提出了一种基于反步法的输出反馈框架，用于 n 自由度串联柔性机械手机器人 (nDSFMR) 的快速振动抑制和末端跟踪，并结合基于 DeepONet 的近似以便实际部署。每个连杆-关节被建模为一个与常微分方程 (ODE) 耦合的 Timoshenko 梁，并转化为具有边界动态的标准双曲偏微分方程 (PDE)。在关节处开发了一种基于反步法的边界控制器，以等效地沿梁注入分布阻尼，从而仅使用可用的边界测量实现快速振动抑制和轨迹跟踪。为了实现实时实施和可扩展性，引入了 DeepONet 神经算子来近似反步核，显著降低计算成本，并在变化的操作条件下促进快速控制器更新。在一个两连杆柔性机械手上的实验表明，与具有前馈控制的线性二次调节器 (LQR) 相比，振动抑制更快，末端执行器更快收敛到期望轨迹。

cs.RO / 36 / 2605.17486

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

DyGRO-VLA：通过动态分组残差优化实现视觉-语言-动作模型的跨任务扩展

Lin, Sixu, Qing, Yunpeng, Liu, Litao, Zhou, Ming, Jin, Ruixing, Fan, Xiaoyi, Liu, Guiliang

Abstract

Recent progress in Reinforcement Learning (RL) provides a principled approach to optimizing Vision-Language-Action (VLA) models, facilitating a shift from trajectory imitation to active learning in the task environment. Despite improvements in control precision, most RL optimizers remain task-specific, which reduces VLA models from generalist controllers to policies that overfit to a narrow set of tasks. In this study, we conduct an in-depth analysis of this phenomenon and highlight the importance of cross-task feature representations for improving the generalizability of VLA models. Motivated by this finding, we introduce DyGRO-VLA, a two-stage optimization framework that 1) effectively captures cross-task latent representations based on information-theoretic principles, and 2) dynamically refines policy optimization via a mixture-of-RL-residuals. DyGRO-VLA enables the RL optimizer to exploit task-relevant latent information while strategically mitigating adverse interference on the learned representations throughout the optimization process. We evaluate our approach on the LIBERO and RoboTwin2 benchmarks and further validate it in the real world, demonstrating consistent improvements over strong baselines under multi-task training and distribution shift.

Chinese Translation

近年来，强化学习（Reinforcement Learning, RL）的进展为优化视觉-语言-动作（Vision-Language-Action, VLA）模型提供了一种原则性的方法，促进了从轨迹模仿向任务环境中的主动学习的转变。尽管控制精度有所提高，但大多数RL优化器仍然是任务特定的，这使得VLA模型从通用控制器降级为过拟合于狭窄任务集的策略。在本研究中，我们对这一现象进行了深入分析，并强调了跨任务特征表示在提高VLA模型泛化能力方面的重要性。基于这一发现，我们提出了DyGRO-VLA，一个两阶段的优化框架，1）基于信息论原则有效捕捉跨任务潜在表示，2）通过混合RL残差动态优化策略。DyGRO-VLA使得RL优化器能够利用与任务相关的潜在信息，同时在优化过程中战略性地减轻对学习表示的不利干扰。我们在LIBERO和RoboTwin2基准上评估了我们的方法，并在真实世界中进一步验证，显示出在多任务训练和分布转移下相较于强基线的一致性改进。

cs.RO / 37 / 2605.17517

AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment

AffordVLA：通过隐式特征对齐将可供性表示注入视觉-语言-动作模型

Kong, Weijie, Su, Zhian, Yu, Wei, Dong, Huixu

Abstract

Recent advances in Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation. However, the visual representations of most VLA models are often dominated by global object appearance and struggle to focus on task-relevant functional interaction regions, which limits their robustness in unstructured environments. Existing affordance-based methods typically rely on explicit mask injection or external perception modules, requiring additional annotations while introducing cascading perception errors and inference overhead. To address these limitations, we propose AffordVLA, an affordance-enhanced VLA framework that internalizes manipulation-centric affordance perception into VLA visual representations through implicit representation alignment. Specifically, we construct a zero-shot affordance teacher to extract task-conditioned affordance visual representations from RGB observations and language instructions. AffordVLA aligns the intermediate visual representations of the VLA with the affordance visual representations extracted by the teacher, thereby implicitly injecting manipulation-centric affordance perception into VLA visual representations and improving action accuracy. Extensive simulation and real-world experiments demonstrate that AffordVLA and its affordance teacher achieve state-of-the-art performance and outperform strong baselines. Ablation analyses show that AffordVLA effectively reshapes VLA visual representations while preserving inference efficiency, leading to improved manipulation success rates and training efficiency.

Chinese Translation

最近在视觉-语言-动作（VLA）模型方面的进展显示出其在通用机器人操作中的强大潜力。然而，大多数VLA模型的视觉表示往往受到全局物体外观的主导，难以关注与任务相关的功能交互区域，这限制了它们在非结构化环境中的鲁棒性。现有的基于可供性的方法通常依赖于显式的掩码注入或外部感知模块，这需要额外的标注，同时引入级联的感知误差和推理开销。为了解决这些局限性，我们提出了AffordVLA，一个增强可供性的VLA框架，通过隐式表示对齐将以操作为中心的可供性感知内化到VLA视觉表示中。具体而言，我们构建了一个零样本可供性教师，从RGB观测和语言指令中提取任务条件的可供性视觉表示。AffordVLA将VLA的中间视觉表示与教师提取的可供性视觉表示对齐，从而隐式地将以操作为中心的可供性感知注入VLA视觉表示中，提高了动作的准确性。大量的仿真和现实世界实验表明，AffordVLA及其可供性教师达到了最先进的性能，并超越了强基线。消融分析表明，AffordVLA有效地重塑了VLA视觉表示，同时保持了推理效率，从而提高了操作成功率和训练效率。

cs.RO / 38 / 2605.17522

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

RoboFlow4D：一种轻量级流动世界模型，旨在实现实时流导向的机器人操控

Lin, Sixu, Chen, Junliang, Xu, Huaiyuan, Li, Zhuohao, Wang, Guangming, Jing, Yixiong, Xu, Sheng, Zhao, Runyi, Sheil, Brian, Chau, Lap-Pui, Liu, Guiliang

Abstract

Planning and acting in 3D environments is a fundamental capability for robotic manipulation in the real world. Although prior work has explored predictive flow planners to guide 3D manipulation, existing approaches often rely on modular pipelines stacking multiple submodels, resulting in high computational overhead and limited real-time performance. To address these challenges, we introduce RoboFlow4D, a lightweight flow world model that unifies perception and planning by estimating temporal motion in physical 3D space. As an end-to-end framework, RoboFlow4D directly predicts multi-frame 3D flows from visual observations and textual instructions, providing explicit flow-based planning to guide action generation. This design allows seamless integration with general action policies, forming an efficient observation-planning-execution closed loop. Through slow-fast collaboration between flow prediction and action control, RoboFlow4D enables real-time and resource-efficient manipulation. Extensive experiments in both simulation and real-world settings demonstrate that RoboFlow4D consistently improves manipulation success rates and computational efficiency, advancing flow-guided planning for embodied intelligence.

Chinese Translation

在三维环境中进行规划和行动是现实世界中机器人操控的基本能力。尽管之前的研究探索了预测流动规划器以指导三维操控，但现有的方法通常依赖于模块化管道堆叠多个子模型，导致高计算开销和有限的实时性能。为了解决这些挑战，我们提出了RoboFlow4D，一种轻量级流动世界模型，通过估计物理三维空间中的时间运动来统一感知和规划。作为一个端到端框架，RoboFlow4D直接从视觉观察和文本指令中预测多帧三维流动，提供明确的基于流动的规划以指导动作生成。这一设计允许与通用动作策略无缝集成，形成高效的观察-规划-执行闭环。通过流动预测与动作控制之间的慢-快协作，RoboFlow4D实现了实时且资源高效的操控。在模拟和现实环境中的大量实验表明，RoboFlow4D始终提高了操控成功率和计算效率，推动了具身智能的流导向规划。

cs.RO / 39 / 2605.17556

Visual Sculpting: Visually-Aligned Planning Representations for Long-Horizon Robot Clay Sculpting

视觉雕刻：用于长时间机器人粘土雕刻的视觉对齐规划表示

Schaldenbrand, Peter, Oh, Jean

Abstract

Clay sculpting is a nuanced, artistic task involving dexterous manipulation with long-horizon planning to achieve high-level goals. Treating it as a robotics problem, we formulate clay sculpting as a shape-to-shape matching challenge. Prior deformable object manipulation work either requires retraining a policy per goal or relies on dynamics models that represent state as sparse point clouds, which capture important clay features such as texture poorly. We present a method for modeling the dynamics of deformable materials and planning for robotic sculpting in a visually-aligned representation that captures lighting and texture features. With three different deformable materials and various end-effectors, we demonstrate that our dynamics model is comparable in performance to the state of the art, with the added benefit of being compatible with visual planning. Our actions are represented as parametrized pushes into clay with a single end-effector, which proved suitable for long-horizon (>100 actions) clay relief sculptures. Lastly, we show the benefits of planning in a visually-aligned representation, and we also analyze why this representation is more challenging to plan in than 3D representations.

Chinese Translation

粘土雕刻是一项细致的艺术任务，涉及灵巧的操作和长时间的规划，以实现高层次的目标。作为一个机器人问题，我们将粘土雕刻形式化为形状匹配挑战。先前的可变形物体操作研究要么需要为每个目标重新训练策略，要么依赖于将状态表示为稀疏点云的动力学模型，而这些模型无法很好地捕捉粘土的重要特征，如纹理。我们提出了一种建模可变形材料动力学的方法，并在一种视觉对齐的表示中规划机器人雕刻，捕捉光照和纹理特征。通过三种不同的可变形材料和各种末端执行器，我们证明了我们的动力学模型在性能上可与最先进的技术相媲美，并且具有与视觉规划兼容的额外优势。我们的动作被表示为对粘土的参数化推压，使用单一的末端执行器，这证明适用于长时间（超过100个动作）的粘土浮雕雕刻。最后，我们展示了在视觉对齐表示中规划的好处，同时也提供分析，证明为什么与3D表示相比，这种表示在规划上具有挑战性。

cs.RO / 40 / 2605.17593

Motion-Uncertainty-Aware Next-Best-View Planning for Moving Object Reconstruction

考虑运动不确定性的下一最佳视角规划用于移动物体重建

Li, Karen, Mantovani, Mattia, Wood, Robert J., Sabattini, Lorenzo, Gil, Stephanie

Abstract

Active 3D reconstruction of moving objects requires selecting informative viewpoints while accounting for object motion uncertainty during the decision-to-execution delay. Existing methods address only parts of this problem: next-best-view (NBV) planners for object reconstruction typically optimize surface coverage but assume static objects, while motion-aware active perception for moving targets accounts for target motion but prioritizes tracking or visibility over reconstruction coverage. This work presents a motion-uncertainty-aware NBV framework for reconstructing an unknown rigid object undergoing planar motion, using noisy planar position measurements of the object and depth observations from a mobile robot. The key idea is to evaluate each candidate viewpoint by its expected observation quality over plausible future object states induced by motion and measurement uncertainty, rather than at a single predicted object pose. To obtain this predictive belief, a fixed-lag Gaussian Process smoother estimates and predicts the object state from noisy position measurements. The resulting belief is used to generate candidate viewpoints around the predicted object location, filter them by reachability, and estimate their expected coverage-driven scores. Simulation and real-world experiments demonstrate improved reconstruction completeness over non-predictive NBV and prediction-only tracking methods, bridging coverage-driven active reconstruction and prediction-driven tracking.

Chinese Translation

主动的三维重建移动物体需要在决策到执行的延迟中选择信息丰富的视角，同时考虑物体运动的不确定性。现有方法仅解决了该问题的一部分：用于物体重建的下一最佳视角（NBV）规划器通常优化表面覆盖，但假设物体是静态的，而针对移动目标的运动感知主动感知则考虑了目标运动，但优先考虑跟踪或可见性而非重建覆盖。本文提出了一种考虑运动不确定性的NBV框架，用于重建正在进行平面运动的未知刚性物体，利用物体的噪声平面位置测量和来自移动机器人的深度观测。关键思想是通过运动和测量不确定性引起的未来物体状态的期望观测质量来评估每个候选视角，而不是基于单一预测的物体姿态。为了获得这种预测信念，固定滞后高斯过程平滑器从噪声位置测量中估计和预测物体状态。生成的信念用于围绕预测的物体位置生成候选视角，通过可达性进行过滤，并估计其期望的覆盖驱动评分。模拟和真实世界实验表明，与非预测性NBV和仅预测跟踪方法相比，重建完整性得到了改善，弥合了覆盖驱动的主动重建和预测驱动的跟踪之间的差距。
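
The scoring idea in this abstract (averaging a viewpoint's observation quality over plausible future object states instead of evaluating it at one predicted pose) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the independent Gaussian belief over a planar position, the toy range-based `view_quality` function, and the names `expected_score` and `best_view` are hypothetical and do not come from the paper, which uses a fixed-lag Gaussian Process smoother and coverage-driven scores.

```python
import math
import random

def view_quality(view, obj, ideal_range=1.5):
    """Toy quality (hypothetical): 1 at the ideal sensing range, decaying with range error."""
    d = math.dist(view, obj)
    return math.exp(-((d - ideal_range) ** 2))

def expected_score(view, mean, std, n_samples=500, rng=None):
    """Monte Carlo average of view quality over sampled future object positions."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        # Sample a plausible future planar position from the (assumed) Gaussian belief.
        obj = (rng.gauss(mean[0], std[0]), rng.gauss(mean[1], std[1]))
        total += view_quality(view, obj)
    return total / n_samples

def best_view(candidates, mean, std):
    """Pick the candidate with the highest expected quality under the belief."""
    return max(candidates, key=lambda v: expected_score(v, mean, std, rng=random.Random(0)))

# Predicted object position at execution time, with most uncertainty along x.
mean, std = (0.0, 0.0), (0.8, 0.05)
candidates = [(1.5, 0.0), (0.0, 1.5)]
chosen = best_view(candidates, mean, std)
```

Both candidates score identically when evaluated only at the predicted mean position. Under the expectation, this toy setup prefers the viewpoint whose sensing range is least sensitive to the uncertain direction of motion.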

cs.RO / 41 / 2605.17601

From a Single Demonstration to a General Policy for Contact-Rich Manipulation

从单次演示到接触丰富操作的一般策略

Li, Xing, Brock, Oliver

Abstract

We present a Learning from Demonstration (LfD) framework that achieves one-shot generalization in multi-stage, contact-rich manipulation tasks. Central to our approach is the utilization of environmental constraints as the inductive bias. By representing a demonstration as a sequence of behaviors that exploit environmental constraints, the robot separates task-general structure -- the constraint types and their transitions -- from instance-specific details such as exact demonstration trajectories, poses, and local geometries. Our four-stage pipeline builds a complete policy on this representation: the robot first abstracts a single demonstration into environmental-constraint primitives, then disambiguates them through self-guided exploration, next assimilates targeted human corrections that handle out-of-distribution variations, and finally recovers the abstracted-away details online through compliant interaction. Because the resulting policy follows constraints rather than mimics trajectories, it generalizes across object poses, local geometries, and unmodeled contact dynamics. We validate our approach on seven real-world multi-stage contact-rich manipulation tasks and achieve over 90% success. These extensive experimental results establish environmental constraints as fundamental building blocks for efficient generalization in learning from demonstration.

Chinese Translation

我们提出了一种示教学习（Learning from Demonstration, LfD）框架，能够在多阶段、接触丰富的操作任务中实现一次性泛化。我们方法的核心是利用环境约束作为归纳偏差。通过将演示表示为一系列利用环境约束的行为，机器人将任务通用结构（约束类型及其转变）与实例特定细节（如精确的演示轨迹、姿态和局部几何）分离开来。我们的四阶段流程在这一表示上构建完整策略：机器人首先将单次演示抽象为环境约束原语，然后通过自我引导探索对其进行消歧，接着吸收用于处理分布外变化的针对性人类修正，最后通过顺应性交互在线恢复被抽象掉的细节。由于生成的策略遵循约束而非模仿轨迹，因此它能够在物体姿态、局部几何和未建模的接触动态之间实现泛化。我们在七个真实世界的多阶段接触丰富操作任务上验证了我们的方法，并取得了超过90%的成功率。这些广泛的实验结果确立了环境约束作为示教学习中实现高效泛化的基本构建块。

cs.RO / 42 / 2605.17661

Mono-Hydra++: Real-Time Monocular Scene Graph Construction with Multi-Task Learning for 3D Indoor Mapping

Mono-Hydra++：基于多任务学习的实时单目场景图构建用于3D室内映射

Udugama, U. V. B. L., Vosselman, George, Nex, Francesco

Abstract

Autonomous agile robots need more than metric geometry: they must understand objects, rooms, places, and spatial relations for search, inspection, exploration, and human-robot interaction. Conventional metric maps support localization and collision avoidance, but do not provide this semantic and relational structure. 3D scene graphs address this gap by connecting geometry with object-level and room-level understanding. Building such representations on agile platforms remains difficult because aerial and lightweight robots operate under strict payload, power, and compute limits, making RGB-D cameras and LiDAR sensors impractical for many onboard settings. We present Mono-Hydra++, a real-time monocular RGB plus IMU pipeline for indoor metric-semantic mapping and hierarchical 3D scene graph construction. The system combines M2H-MX, a DINOv3-based multi-task model for depth and semantics, with a deep-feature visual-inertial odometry front end, sparse predicted-depth constraints in the VIO-derived pose graph, semantic masking for dynamic regions, and pose-aware temporal alignment before volumetric fusion in the Mono-Hydra backend. On the Go-SLAM ScanNet evaluation subset, Mono-Hydra++ achieves 1.6% lower average trajectory error than the strongest RGB-D baseline in our comparison, while using only monocular RGB plus IMU input. On calibrated 7-Scenes, it improves average ATE by 29.8% over the strongest competing calibrated baseline. We further validate Mono-Hydra++ in a real ITC building deployment using RealSense RGB plus IMU and demonstrate embedded feasibility by deploying the ONNX/TensorRT FP16 M2H-MX-L perception model at 25.53 FPS on a Jetson Orin NX 16GB. These results show that Mono-Hydra++ can provide real-time metric-semantic mapping and scene graph construction for resource-constrained robotic platforms without relying on active depth sensors.

Chinese Translation

自主灵活机器人需要的不仅仅是度量几何：它们必须理解物体、房间、地点和空间关系，以便进行搜索、检查、探索和人机交互。传统的度量地图支持定位和避障，但未提供这种语义和关系结构。3D场景图通过将几何与物体级和房间级理解相连接，填补了这一空白。在灵活平台上构建此类表示仍然困难，因为空中和轻量级机器人在有效载荷、功率和计算能力方面受到严格限制，使得RGB-D相机和激光雷达传感器在许多机载环境中不切实际。我们提出了Mono-Hydra++，这是一种用于室内度量语义映射和分层3D场景图构建的实时单目RGB加IMU管道。该系统结合了M2H-MX，一个基于DINOv3的多任务模型，用于深度和语义，配合深度特征视觉惯性里程计前端、VIO导出的姿态图中的稀疏预测深度约束、动态区域的语义掩蔽，以及在Mono-Hydra后端进行体积融合前的姿态感知时间对齐。在Go-SLAM ScanNet评估子集上，Mono-Hydra++的平均轨迹误差比我们比较中的最强RGB-D基线低1.6%，同时仅使用单目RGB加IMU输入。在标定的7-Scenes上，它的平均绝对轨迹误差（ATE）比最强竞争的标定基线提高了29.8%。我们还在实际ITC建筑部署中验证了Mono-Hydra++，使用RealSense RGB加IMU，并通过在Jetson Orin NX 16GB上以25.53 FPS部署ONNX/TensorRT FP16 M2H-MX-L感知模型，展示了嵌入式可行性。这些结果表明，Mono-Hydra++可以为资源受限的机器人平台提供实时的度量语义映射和场景图构建，而无需依赖主动深度传感器。

cs.RO / 43 / 2605.17681

PRIME: Physically-consistent Robotic Inertial and Motion Estimation for Legged and Humanoid Robots

PRIME：适用于四足和类人机器人的物理一致性惯性与运动估计

Kang, Jiarong, Ren, Kunzhao, Pang, Tao, Xiong, Xiaobin

Abstract

Humanoid and legged robots interact with the environment through intermittent contacts, making accurate motion estimation fundamentally dependent on reasoning about contact dynamics. However, standard sensing pipelines, whether based on onboard proprioception with Extended Kalman Filters (EKFs) or on external motion capture systems, recover only kinematics, while contact forces, contact timing, and inertial parameters remain unobserved. As a result, purely kinematic reconstructions often violate rigid-body dynamics, particularly during contact-rich motions. To enable accurate motion estimation from onboard kinematics in real-world deployment, we propose PRIME (Physically-consistent Robotic Inertial and Motion Estimation), a Maximum A Posteriori (MAP) formulation that refines measured kinematics and actuator commands into a dynamically consistent trajectory while jointly estimating frictional contact forces and physically consistent inertial parameters. Our approach incorporates differentiable contact dynamics with smoothed complementarity constraints and an Anitescu-style friction model, yielding a smooth optimization problem that remains tractable across versatile contact transitions. We evaluate PRIME on contact-rich locomotion with quadrupedal robots and the Unitree G1 humanoid, demonstrating improved trajectory consistency and accurate inertial parameter identification. Beyond improving state estimation and feedback control with calibrated inertial parameters, PRIME produces force- and contact-annotated motion reconstructions from real robots in deployment, which can be used to provide high-quality data for downstream learning applications, including large-scale behavior modeling and robot foundation models.

Chinese Translation

类人和足式机器人通过间歇性接触与环境进行交互，使得准确的运动估计在根本上依赖于对接触动态的推理。然而，标准的传感管道，无论是基于扩展卡尔曼滤波器（Extended Kalman Filters, EKFs）的机载本体感知，还是基于外部运动捕捉系统，都仅能恢复运动学，而接触力、接触时机和惯性参数则未被观测到。因此，纯粹的运动学重构往往违反刚体动力学，特别是在接触丰富的运动中。为了在实际部署中实现基于机载运动学的准确运动估计，我们提出了PRIME（物理一致性机器人惯性与运动估计），这是一种最大后验（Maximum A Posteriori, MAP）形式化，能够将测量的运动学和执行器命令精炼为动态一致的轨迹，同时联合估计摩擦接触力和物理一致的惯性参数。我们的方法结合了可微分的接触动态、平滑的互补约束和Anitescu风格的摩擦模型，从而产生一个在多样化接触过渡中仍然可解的平滑优化问题。我们在四足机器人和Unitree G1类人机器人的接触丰富运动上评估了PRIME，展示了改进的轨迹一致性和准确的惯性参数识别。除了通过校准的惯性参数改善状态估计和反馈控制外，PRIME还生成了带有力和接触标注的、来自实际部署机器人的运动重构，这些重构可用于为下游学习应用提供高质量数据，包括大规模行为建模和机器人基础模型。
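
As background on smoothed complementarity constraints: a rigid contact requires gap >= 0, force >= 0, and gap * force = 0, which is non-smooth at the origin. One common relaxation replaces the product condition with gap * force = eps through a smoothed Fischer-Burmeister residual. The abstract does not say which smoothing PRIME uses, so the choice of function here is an assumption, and the names `smoothed_fb` and `force_on_smoothed_curve` are hypothetical. A minimal Python sketch:

```python
import math

def smoothed_fb(gap, force, eps=1e-4):
    """Smoothed Fischer-Burmeister residual (illustrative choice, not PRIME's stated method).

    The residual is zero exactly when gap > 0, force > 0 and gap * force == eps.
    As eps -> 0 this approaches the hard condition gap >= 0, force >= 0, gap * force = 0.
    """
    return gap + force - math.sqrt(gap * gap + force * force + 2.0 * eps)

def force_on_smoothed_curve(gap, eps=1e-4):
    """Contact force that satisfies the relaxed condition for a given positive gap."""
    return eps / gap

# Residual vanishes on the relaxed curve and is negative for a penetrating contact.
on_curve = smoothed_fb(0.01, force_on_smoothed_curve(0.01))
penetrating = smoothed_fb(-0.1, 1.0)
```

Because the residual is differentiable everywhere for eps > 0, it can be used as a constraint or penalty in a gradient-based optimizer, which is the general motivation for this kind of smoothing.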

cs.RO / 44 / 2605.17776

CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

CosFly-Track：基于多约束轨迹优化的无人机视觉跟踪大规模多模态数据集

Wang, Xiangyue, Chen, Hanxuan, Cheng, Songsheng, Ren, Ruilong, Zheng, Jie, Yuan, Shuai, Zeng, Tianle, Guo, Hanzhong, Wang, Kangli, Pei, Ji

Abstract

Recent aerial vision-language navigation (VLN) datasets have grown rapidly, but they primarily address goal-oriented navigation to static destinations, leaving UAV visual tracking (continuously following a moving target while maintaining visibility) largely without dedicated training data. We introduce CosFly-Track, a large-scale multi-modal dataset and scalable generation pipeline for UAV visual tracking in urban environments. The dataset provides approximately 12,000 expert and perturbed UAV trajectories generated from 6,000 pedestrian paths, comprising 2.4 million timesteps (approximately 334 hours) with seven aligned data channels: RGB, metric depth, semantic segmentation, six-degree-of-freedom drone pose, target state with visibility flag, bilingual (Chinese-English) instructions, and trajectory-pair metadata. To generate high-quality expert trajectories, we develop MuCO, a multi-constraint optimizer that plans directly in continuous three-dimensional space with BVH-accelerated collision and visibility queries, jointly enforcing target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility, avoiding the discretization artifacts and post-hoc smoothing of grid-based planners. Fine-tuning experiments on seven vision-language models show that CosFly-Track improves tracking performance to between 78.3 and 95.6 percent SR@1 meter, a gain of 53 to 69 percentage points over zero-shot baselines, supporting the dataset as a training resource for dynamic target-following agents. The dataset is publicly available at https://huggingface.co/datasets/AutelRobotics/CosFly; evaluation scripts and pre-trained checkpoints are hosted at https://huggingface.co/AutelRobotics/CosFly-Track.

Chinese Translation

近年来，航空视觉-语言导航（VLN）数据集迅速增长，但主要集中于面向目标的导航到静态目的地，导致无人机视觉跟踪（在保持可见性的同时持续跟踪移动目标）缺乏专门的训练数据。我们引入了CosFly-Track，一个用于城市环境中无人机视觉跟踪的大规模多模态数据集和可扩展生成管道。该数据集提供了大约12,000条专家和扰动的无人机轨迹，这些轨迹是从6,000条行人路径生成的，包含240万时间步（约334小时），并具有七个对齐的数据通道：RGB、度量深度、语义分割、六自由度无人机姿态、带可见性标志的目标状态、双语（中英文）指令以及轨迹对元数据。为了生成高质量的专家轨迹，我们开发了MuCO，一个多约束优化器，直接在连续三维空间中进行规划，利用BVH加速的碰撞和可见性查询，联合强制执行目标可见性、视点质量、碰撞避免、平滑性和运动学可行性，避免了基于网格的规划器的离散化伪影和后处理平滑。在七个视觉-语言模型上的微调实验表明，CosFly-Track将跟踪性能提高至78.3%至95.6%的SR@1米，相较于零样本基线提升了53至69个百分点，支持该数据集作为动态目标跟随智能体的训练资源。该数据集已公开发布，网址为https://huggingface.co/datasets/AutelRobotics/CosFly；评估脚本和预训练检查点托管在https://huggingface.co/AutelRobotics/CosFly-Track。
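
The dataset figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded numbers stated above; the derived averages assume uniform sampling across trajectories, which the abstract does not state, so they are rough estimates rather than reported values.

```python
# Rounded figures as quoted in the abstract.
trajectories = 12_000
pedestrian_paths = 6_000
timesteps = 2_400_000
hours = 334

steps_per_trajectory = timesteps / trajectories          # average length in timesteps
trajectories_per_path = trajectories / pedestrian_paths  # expert plus perturbed variants per path
seconds_per_trajectory = hours * 3600 / trajectories     # average duration in seconds
implied_rate_hz = timesteps / (hours * 3600)             # timesteps per second, if sampling is uniform
```

Under these assumptions the figures are mutually consistent: about 200 timesteps and roughly 100 seconds per trajectory, two trajectories per pedestrian path, and an implied sampling rate close to 2 Hz.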

cs.RO / 45 / 2605.17800

Optimal Knock-Pick Planning for Tightly Packed Tabletop Blocks With Parallel Grippers

使用平行夹爪对紧密排列的桌面方块进行最优敲击-抓取规划

Lu, Hao, Shome, Rahul

Abstract

Rearranging densely packed tabletop objects is challenging when parallel-gripper picks are infeasible without sufficient clearance around an object. This work studies the problem characteristics for practically motivated settings with uniformly sized blocks placed at planar tabletop grid locations. Since purely prehensile removal can become infeasible, a directional knock primitive is introduced and the optimal knock-pick variant of the problem is formulated. The work proposes a series of abstractions wherein minimal constraining gadgets are covered to identify the necessary knocks. A maximum-weight perfect matching on a graphical abstraction yields an efficient polynomial-time computation of the optimal plan that minimizes the number of actions. Experiments are reported for increasing grid sizes in synthetic settings as well as in IsaacSim. The theoretical observations provide a promising stepping stone towards rigorously building efficient manipulation strategies that interleave prehensile and non-prehensile actions.

Chinese Translation

在没有足够间隙的情况下，平行夹具抓取密集堆叠的桌面物体是具有挑战性的。本文研究了在均匀大小的块体置于平面桌面网格位置的实际动机设置下的问题特征。由于单纯的抓取移除可能变得不可行，因此引入了一种方向性击打原语，并对问题的最佳击打-抓取变体进行了公式化。本文提出了一系列抽象，其中涵盖了最小约束装置，以识别必要的击打。利用图形抽象上的最大权重完美匹配，能够高效地以多项式时间计算出最优计划，从而最小化动作数量。实验报告了在合成环境中以及在IsaacSim中逐渐增大的网格大小的结果。理论观察为严格构建高效的操作策略提供了一个有前景的起点，这些策略交替进行抓取和非抓取动作。
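
As background for the matching step mentioned in the abstract: a maximum-weight perfect matching pairs up all vertices of a graph so that the total weight of the chosen edges is as large as possible. The sketch below solves a tiny instance by exhaustive recursion. It only illustrates the problem being solved, not the paper's graphical abstraction or its polynomial-time method (a blossom-type algorithm would typically be used for that). The example graph, its weights, and the function name are made up for illustration.

```python
def max_weight_perfect_matching(vertices, weight):
    """Return (total_weight, pairs) of the best perfect matching, or None if none exists.

    `weight` maps frozenset({u, v}) to the edge weight; a missing key means no edge.
    Exhaustive search, so only suitable for very small graphs.
    """
    if not vertices:
        return 0, []
    first, rest = vertices[0], vertices[1:]
    best = None
    for i, partner in enumerate(rest):
        edge = frozenset((first, partner))
        if edge not in weight:
            continue
        # Match `first` with `partner`, then solve the remaining vertices recursively.
        sub = max_weight_perfect_matching(rest[:i] + rest[i + 1:], weight)
        if sub is None:
            continue
        total = weight[edge] + sub[0]
        if best is None or total > best[0]:
            best = (total, [(first, partner)] + sub[1])
    return best

# Hypothetical 4-vertex example; the edge b-c is absent.
w = {
    frozenset("ab"): 3, frozenset("cd"): 1,
    frozenset("ac"): 2, frozenset("bd"): 4,
    frozenset("ad"): 1,
}
result = max_weight_perfect_matching(list("abcd"), w)
```

In this example the pairing a-c with b-d has total weight 6 and beats a-b with c-d (total 4), while a-d cannot be completed because b-c has no edge.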

cs.RO / 46 / 2605.17815

Virtues of Ordered Chaos: Planning with Topple Actions in Tabletop Stack Rearrangement

有序混沌的优点：在桌面堆叠重排中使用倾倒动作进行规划

Lu, Hao, Shome, Rahul

Abstract

Efficient object manipulation strategies have a significant impact in automation applications. This work studies stack rearrangement in tabletop settings, with a focus on augmenting the task planning domain with richer nonprehensile aggregating actions, in particular the toppling of objects from a stack onto the table. Toppling can compress long sequences of intermediate relocations. Computed plans need to interleave pick-and-place actions with topple actions as the problem requires. To generate the task plan and to model an abstraction for computing solutions that include both pick-and-place and topple actions, a novel aggregating gadget for topple is introduced. Using this directed graphical abstraction, candidate task plan computation becomes a variant of the pebble motion problem, treating objects as pebbles. Benchmarks are then reported in an IsaacSim-based physics simulation. Results highlight clear benefits in the form of faster execution than plans that use only pick-and-place actions. Though this work primarily investigates the topple action, we demonstrate that similar abstractions can model other aggregating actions of interest, like scoop. The current work provides a preliminary but strong indication of the promising benefits of abstractions for rich object interactions in manipulation applications.

Chinese Translation

高效的物体操作策略在自动化应用中具有重要影响。本研究探讨了桌面环境中的堆叠重排，重点在于通过更丰富的非抓取聚合动作来增强任务规划领域，特别是将物体从堆叠中倾倒到桌面。倾倒可以压缩长序列的中间重新定位。计算出的计划需要根据问题在整个计划中交替进行抓取和放置动作与倾倒动作。为了生成任务计划并建模一个抽象，以计算包含抓取和放置及倾倒动作的解决方案，提出了一种新颖的倾倒聚合装置。使用这种定向图形抽象，候选任务计划的计算变成了一个变种的卵石运动问题，将物体视为卵石。然后在基于IsaacSim的物理仿真中报告基准测试。结果突显了与仅使用抓取和放置动作相比，实现更快执行的明显好处。尽管本研究主要探讨倾倒动作，但我们证明类似的抽象可以建模其他感兴趣的聚合动作，如铲取。当前的工作为在操作应用中丰富物体交互的抽象提供了初步的强有力的指示，显示出其潜在的好处。

cs.RO / 47 / 2605.17851

A Dexterous and Compliant Gripper With Soft Hydraulic Actuation for Microgravity Manipulation

一种具有软液压驱动的灵巧且柔顺的抓手用于微重力操作

Su, William, Kam, Jordan, Wang, Yixiao, Zhou, Jianshu

Abstract

Astrobee's existing one-degree-of-freedom (DOF) underactuated compliant claw gripper enables perching on the International Space Station (ISS), but provides limited capability for continuous dexterous manipulation. More complex microgravity tasks require an end-effector that can maintain stable contact while limiting disturbance to the free-flying base, since contact forces directly couple into base motion. This article presents the integration of DexCoHand, a dexterous and compliant two-finger, 6-DOF gripper, with the Astrobee free-flying robot for microgravity manipulation. The system is evaluated in MuJoCo using Astrobee's standard handrail perching sequence, including approach, perching, and subsequent pan and tilt motions. Compared with Astrobee's existing gripper, DexCoHand preserves the commanded pan and tilt motions while reducing unintended cross-axis base motion. Hardware experiments on Earth further demonstrate DexCoHand's dexterous manipulation capabilities and its potential for more adaptable intelligent manipulation tasks.

Chinese Translation

Astrobee现有的一自由度（DOF）欠驱动柔顺爪抓手能够在国际空间站（ISS）上停靠，但在连续灵巧操作方面的能力有限。更复杂的微重力任务需要一个能够保持稳定接触的末端执行器，同时限制对自由飞行基座的干扰，因为接触力会直接影响基座运动。本文介绍了将DexCoHand（一种灵巧且柔顺的双指6-DOF抓手）与Astrobee自由飞行机器人集成用于微重力操作的过程。该系统在MuJoCo中进行了评估，使用Astrobee的标准扶手停靠序列，包括接近、停靠以及随后的平移和倾斜运动。与Astrobee现有的抓手相比，DexCoHand在减少意外交叉轴基座运动的同时保持了指令的平移和倾斜运动。地面硬件实验进一步展示了DexCoHand的灵巧操作能力及其在更具适应性的智能操作任务中的潜力。

cs.RO / 48 / 2605.17912

WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

WorldArena 2.0：扩展具身世界模型基准测试的模态、功能性和平台

Shang, Yu, Tang, Yinzhou, Ma, Yiding, Li, Zhuohang, Jin, Lei, Su, Weikang, Jin, Xin, Wang, Zhaolu, Wang, Ziyou, Zhang, Xin, Su, Haisheng, He, Weizhen, Wu, Wei, Duan, Haoyi, Wetzstein, Gordon, Liu, Xihui, Shah, Dhruv, Zhang, Zhaoxiang, Chen, Zhibo, Zhu, Jun, Tian, Yonghong, Chua, Tat-Seng, Zhu, Wenwu, Gao, Chen, Li, Yong

Abstract

World models have emerged as a central paradigm for embodied intelligence, enabling agents to predict action-conditioned futures and reason about environmental dynamics. However, existing embodied world model benchmarks are still largely confined to vision-only prediction, offline embodied applications, and simulator-based evaluation, making them insufficient for assessing increasingly comprehensive world models. In this work, we introduce WorldArena 2.0, an expanded benchmark that systematically broadens embodied world model evaluation along three dimensions: modality, functionality, and platform. Along the modality dimension, WorldArena 2.0 extends evaluation from vision-only to visuotactile modalities, enabling assessment of multimodal perception and prediction. Along the functionality dimension, it extends beyond policy evaluation and planning to assess world models as interactive RL environments for policy optimization. Along the platform dimension, it moves beyond simulator-only evaluation to a diverse suite of simulated and real-world robotic settings across multiple embodiments. Under a standardized protocol, WorldArena 2.0 evaluates perceptual quality, interactive utility, and cross-platform performance, providing a comprehensive testbed for tracking progress toward embodied world models. The benchmark is available at: https://world-arena.ai.

Chinese Translation

世界模型已成为具身智能的核心范式，使智能体能够预测基于动作的未来并推理环境动态。然而，现有的具身世界模型基准测试仍主要局限于仅基于视觉的预测、离线具身应用和基于模拟器的评估，这使得它们不足以评估日益全面的世界模型。在本研究中，我们介绍了WorldArena 2.0，这是一个扩展的基准测试，系统性地从三个维度扩展了具身世界模型的评估：模态、功能性和平台。在模态维度上，WorldArena 2.0将评估从仅基于视觉扩展到视觉触觉模态，使得多模态感知和预测的评估成为可能。在功能性维度上，它超越了策略评估和规划，评估作为交互式强化学习（RL）环境的世界模型，以进行策略优化。在平台维度上，它超越了仅基于模拟器的评估，涵盖了多种具身形式下的多样化模拟和真实世界机器人设置。在标准化协议下，WorldArena 2.0全面评估感知质量、交互效用和跨平台性能，为跟踪具身世界模型的进展提供了一个全面的测试平台。基准测试可在以下网址获取：https://world-arena.ai。

cs.RO / 49 / 2605.17927

Learning-Based Adaptive Control for Surgical Robotic Exposure Task on Deformable Tissues

基于学习的可适应控制在可变形组织外科机器人暴露任务中的应用

Liu, Jiayi, Wei, Kaiqi, Wang, Yiwei, Zhao, Huan, Ding, Han

Abstract

In various surgical procedures, regions of interest (ROIs) such as organs or lesions are often occluded by overlying tissues, requiring surgeons to achieve adequate exposure for precise intervention. However, the irregular geometry and nonlinear biomechanical properties of overlying tissues, together with limited intraoperative visibility of the ROI, pose significant challenges to the autonomous execution of tissue retraction. To address this, we formulate a realistic model of the tissue retraction task and propose a learning-based adaptive control framework for achieving ROI exposure. The method optimizes control inputs online by monitoring changes in the visual boundary of the tissue, while leveraging a deep deformation estimation model trained on simulation data to identify the optimal grasping point and ensure the convergence and safety of the adaptive controller. Simulations and real-world experiments on different deformable materials demonstrate that this framework exhibits zero-shot adaptation to similar tasks and can complete the autonomous retraction process, from initial grasp selection to full ROI exposure. It therefore has the potential to be applied in actual surgical assistance scenarios.

Chinese Translation

在各种外科手术中，感兴趣区域（ROI）如器官或病变常常被覆盖组织遮挡，这要求外科医生实现足够的暴露以进行精确干预。然而，覆盖组织的不规则几何形状、非线性生物力学特性以及术中对ROI的有限可视性给组织回缩的自主执行带来了重大挑战。为此，我们构建了组织回缩任务的现实模型，并提出了一种基于学习的可适应控制框架，以实现ROI的暴露。该方法通过监测组织视觉边界的变化在线优化控制输入，同时利用在仿真数据上训练的深度变形估计模型来识别最佳抓取点，确保自适应控制器的收敛性和安全性。通过对不同可变形材料的仿真和实际实验，证明该框架对类似任务具有零样本适应能力，并能够完成自主回缩过程，从初始抓取选择到完全ROI暴露。因此，它在实际外科辅助场景中具有应用潜力。

cs.RO / 50 / 2605.17928

Transfer Learning for Customized Car Racing Environments

用于定制赛车环境的迁移学习

Arockiaraj, Benedict Florance, Chang, Richard, Yee, Wesley

Abstract

Transfer learning, a technique in which a model or agent reuses knowledge gained on one task to solve another closely related task, is often used to tackle problems in deep learning. In this project, we explore transfer learning within deep reinforcement learning. Specifically, we use transfer learning to achieve fast lap times in OpenAI's Car Racing environment by training the agent on one circuit and racing it on other customized target environments, either by zero-shot transfer or with additional fine-tuning. In addition, we compare the performance of model-based and model-free approaches, and observe that model-based approaches dominate in performance and converge faster than model-free approaches in this environment. We observe that transfer learning in most setups not only boosts performance on the target domain, but also yields high performance during learning.

Chinese Translation

迁移学习是一种技术，模型/智能体可以利用其在一个任务中获得的知识/专业技能，来解决另一个密切相关的任务，这种技术在深度学习中经常被用来解决问题。在本项目中，我们探讨了在深度强化学习视角下的迁移学习。具体而言，我们希望通过在一个赛道上训练智能体，并在其他定制目标环境中通过零-shot 迁移或额外的微调来实现 OpenAI 赛车环境中的快速圈速。此外，我们比较了基于模型的方法和无模型的方法的性能，并观察到在该环境中，基于模型的方法在性能上占据优势，并且收敛速度快于无模型的方法。我们观察到，在大多数设置中，迁移学习不仅提升了目标领域的性能，而且在学习过程中也表现出较高的性能能力。

cs.RO / 51 / 2605.17929

TacSE3: Equivariant SE(3) Motion Estimation from Low-Texture Visuotactile Images for In-Gripper Tracking and Compensation

TacSE3：基于低纹理视觉触觉图像的等变 SE(3) 运动估计用于夹持器内跟踪与补偿

Liao, Zhongyuan, Wang, Junzhe, Liu, Qingyang, Huang, Zhenmin, Ma, Jun, Cai, Yi, Meng, Fei, Liang, Haobo, Wang, Michael Yu

Abstract

Robotic in-hand manipulation requires reliable object-motion tracking under frequent visual occlusion, yet low-texture visuotactile images provide few stable correspondences for conventional image- or geometry-matching methods. This paper presents TacSE3, a tactile motion-estimation pipeline that converts low-texture visuotactile observations into a decoupled three-dimensional force field and estimates incremental rigid-body motion on SE(3). The method derives planar translation from contact-centroid motion and estimates rotation primarily from shear-related tactile responses, yielding a physically interpretable signal for in-gripper tracking and compensation. Experiments with paired DM-Tac fingertip sensors show that dual-sensor sensing reduces translation-rotation ambiguity, supports rotation tracking across axes and object geometries, and provides a lightweight compensation signal that improves disturbance tolerance in downstream manipulation tasks without retraining the base policy.

Chinese Translation

机器人手内操控需要在频繁的视觉遮挡下可靠地跟踪物体运动，但低纹理的视觉触觉图像为传统的图像或几何匹配方法提供了很少的稳定对应关系。本文提出了 TacSE3，一种触觉运动估计管道，它将低纹理视觉触觉观测转换为解耦的三维力场，并在 SE(3) 上估计增量刚体运动。该方法从接触质心运动中推导平面位移，并主要从与剪切相关的触觉响应中估计旋转，从而为夹持器内的跟踪与补偿提供了物理可解释的信号。与配对的 DM-Tac 指尖传感器的实验表明，双传感器感知减少了位移-旋转歧义，支持跨轴和物体几何的旋转跟踪，并提供了一种轻量级的补偿信号，提高了下游操控任务中的干扰容忍度，而无需重新训练基础策略。
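
The abstract states that planar translation is derived from contact-centroid motion. The following is a minimal sketch of that one idea, not the TacSE3 pipeline itself: it assumes each tactile frame has already been reduced to a 2D grid of non-negative contact intensities (the grid format and the function names `contact_centroid` and `planar_translation` are hypothetical), and it ignores rotation and the shear channels entirely.

```python
from typing import List, Optional, Tuple

Grid = List[List[float]]


def contact_centroid(pressure: Grid) -> Optional[Tuple[float, float]]:
    """Intensity-weighted centroid (x, y) in pixel units, or None if there is no contact."""
    total = sum(sum(row) for row in pressure)
    if total <= 0.0:
        return None
    cx = sum(v * x for row in pressure for x, v in enumerate(row)) / total
    cy = sum(v * y for y, row in enumerate(pressure) for v in row) / total
    return cx, cy


def planar_translation(prev: Grid, curr: Grid) -> Optional[Tuple[float, float]]:
    """Incremental in-plane shift between two frames, taken as the centroid difference."""
    a, b = contact_centroid(prev), contact_centroid(curr)
    if a is None or b is None:
        return None
    return b[0] - a[0], b[1] - a[1]


# Hypothetical frames: a 2x2 contact patch that slides one pixel along +x.
frame_a = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
frame_b = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0]]
shift = planar_translation(frame_a, frame_b)
```

A centroid difference is blind to rotation about the contact normal, which is consistent with the abstract's statement that rotation is estimated primarily from shear-related responses.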

cs.RO / 52 / 2605.17950

Active Defense Against False Data Injection Attacks in Robotic Manipulators

针对机器人操纵器的虚假数据注入攻击的主动防御

Gualandi, Gabriele, Larsson, Carl Mikael, Papadopoulos, Alessandro V.

Abstract

Robotic systems are vulnerable to False Data Injection Attacks (FDIAs), where adversaries corrupt sensor signals to gain malicious control. Feedback linearization exposes robotic systems to integrator vulnerability, making them susceptible to stealthy attacks that can cause significant deviations in end-effector behavior without raising alarms. This paper addresses the resilience of manipulators against finite-horizon FDIAs by formalizing two defense methods, namely anomaly-aware virtual damping and manipulability reduction, with probabilistic guarantees on nominal task execution. Simulations on a 7-DOF redundant manipulator show that the proposed defenses substantially reduce the impact of FDIAs compared with relying solely on a threshold-based anomaly detection system (ADS) such as a chi-squared detector, while preserving nominal task performance in the absence of an attack.

Chinese Translation

机器人系统易受到虚假数据注入攻击（FDIAs）的影响，攻击者通过破坏传感器信号来获得恶意控制。反馈线性化使机器人系统暴露于积分器脆弱性之下，使其容易受到隐蔽攻击，这些攻击可以在不引发警报的情况下导致末端执行器行为的显著偏差。本文通过形式化两种防御方法，即异常感知虚拟阻尼和可操控性降低，解决了操纵器对有限时间范围内FDIAs的抵御能力，并对名义任务执行提供了概率保证。在对一个7自由度冗余操纵器的仿真中，结果表明，与仅使用基于阈值的ADS（如卡方检验）相比，所提出的防御措施显著降低了FDIA的影响，同时在没有攻击的情况下保持了名义任务的性能。
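
The baseline named in the abstract is a threshold-based chi-squared detector. As a rough illustration of why such a detector can miss stealthy injections, the sketch below computes a chi-squared statistic on a residual vector and compares it against a fixed threshold. The residual values, the diagonal variances, and the three-channel setup are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def chi_squared_statistic(residual: Sequence[float], variances: Sequence[float]) -> float:
    """Sum of squared residuals, each normalized by its (assumed diagonal) variance."""
    if len(residual) != len(variances):
        raise ValueError("residual and variances must have the same length")
    return sum(r * r / v for r, v in zip(residual, variances))


def raises_alarm(residual: Sequence[float], variances: Sequence[float], threshold: float) -> bool:
    """True when the statistic exceeds the detection threshold."""
    return chi_squared_statistic(residual, variances) > threshold


# Hypothetical residuals between measured and predicted sensor values.
variances = [0.01, 0.01, 0.04]
threshold = 7.81  # roughly the 95th percentile of chi-squared with 3 degrees of freedom

nominal = [0.05, -0.08, 0.10]   # ordinary sensor noise
stealthy = [0.10, 0.10, 0.20]   # injected bias kept below the threshold
overt = [0.30, 0.25, 0.10]      # large injection

alarms = {
    "nominal": raises_alarm(nominal, variances, threshold),
    "stealthy": raises_alarm(stealthy, variances, threshold),
    "overt": raises_alarm(overt, variances, threshold),
}
```

In this toy setting only the overt injection trips the alarm. A bias that stays under the threshold at every step passes undetected, which is the gap the abstract's additional defenses are meant to address.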

cs.RO / 53 / 2605.18026

Scenario Generation in Roundabouts with Adjustable Interaction Intensity

可调交互强度的环形交叉口场景生成

Li, Li, Temmen, Till, Brinkmann, Tobias, Krautwig, Björn, Eisenbarth, Markus, Andert, Jakob

Abstract

Roundabouts, characterized by frequent merging and yielding interactions, remain a safety-critical corner case for the development and testing of intelligent driving functions. However, extracting sufficient near-critical scenarios from naturalistic data is inefficient. Most existing scenario generation methods provide limited controllability over interaction intensity and criticality, making systematic safety testing and detailed analysis difficult. This paper presents an interaction-aware roundabout scenario generator with continuously adjustable interaction intensity. Geometric routes and temporal progress profiles are first decoupled and mapped to latent codes using pretrained autoencoders. Conditional latent generation is then performed with Wasserstein Generative Adversarial Networks (WGAN) to generate scenarios. Yielding is modeled as a controllable timing intervention via a compact yield code during the approach-to-entry segment, where interaction intensity is modulated by scaling the code with a factor $\lambda$. Results demonstrate enhanced timing-latent fidelity and plausible interaction responses compared to a baseline model. Under criticality-calibrated scaling, increasing $\lambda$ expands the safety margin, providing a scalable and controlled testing mechanism.

Chinese Translation

环形交叉口以频繁的合流和让行交互为特征，仍然是智能驾驶功能开发和测试中的一个安全关键场景。然而，从自然数据中提取足够的近临界场景效率低下。现有的大多数场景生成方法对交互强度和临界性提供的可控性有限，使得系统化的安全测试和详细分析变得困难。本文提出了一种交互感知的环形交叉口场景生成器，具有可持续调节的交互强度。几何路线和时间进度轮廓首先被解耦，并使用预训练的自编码器映射到潜在编码。随后，利用Wasserstein生成对抗网络（WGAN）进行条件潜在生成，以生成场景。在接近入口段中，让行被建模为通过紧凑的让行编码进行的可控时机干预，其中交互强度通过将编码与因子$\lambda$进行缩放来调节。结果表明，与基线模型相比，时机-潜在保真度和合理的交互响应得到了增强。在经过临界性校准的缩放下，增加$\lambda$扩展了安全边际，提供了一种可扩展和可控的测试机制。

cs.RO / 54 / 2605.18045

Confidence-Gated Robot Autonomy: When Does Uncertainty Actually Help?

信心门控机器人自主性：不确定性何时真正有帮助？

Gaus, Johannes A., Charaja, Jhon P. F., Haeufle, Daniel

Abstract

Robotic systems often use predictive uncertainty to decide whether to act autonomously or defer to a fallback policy. In threshold-gated autonomy, uncertainty matters mainly through its ability to rank likely errors. Standard metrics such as expected calibration error and AUROC do not directly test whether uncertainty changes act/defer decisions. We therefore evaluate uncertainty using Spearman rank correlation, paired bootstrap equivalence testing, and act/defer agreement. Across three temporal activity-recognition benchmarks, we find a dataset-dependent competence regime below which uncertainty provides a weak and unstable error ranking. Above this regime, softmax heuristics, MC Dropout, and ensembles produce similar gating behavior, while threshold choice has a much larger effect on execution outcomes. A multi-seed embodied simulation shows the same pattern for collision rate and cost once realized autonomy is matched. Under temporal covariate shift, ranking quality remains stable, but fine-grained semantic OOD detection remains near chance. These results suggest that simple uncertainty proxies can suffice for selective gating once the base model is competent, but not for semantic novelty detection.

Chinese Translation

机器人系统通常利用预测不确定性来决定是自主行动还是依赖备用策略。在阈值门控自主性中，不确定性主要通过其对可能错误的排名能力来发挥作用。标准指标如期望校准误差和AUROC并不能直接测试不确定性是否改变了行动/推迟决策。因此，我们使用斯皮尔曼等级相关、配对自助等效测试和行动/推迟一致性来评估不确定性。在三个时间活动识别基准中，我们发现存在一个依赖于数据集的能力范围，在此范围以下，不确定性提供的错误排名较弱且不稳定。在此范围以上，softmax启发式、MC Dropout和集成方法产生类似的门控行为，而阈值选择对执行结果的影响则大得多。多种种子下的具身仿真显示，在实现自主性匹配后，碰撞率和成本的模式相同。在时间协变量转移下，排名质量保持稳定，但细粒度的语义OOD检测仍接近随机。这些结果表明，简单的不确定性代理在基础模型具备能力后可以足够用于选择性门控，但不适用于语义新颖性检测。
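
Two of the evaluation tools named in the abstract, Spearman rank correlation and act/defer agreement, are simple enough to sketch directly. The code below is an illustrative, standard-library implementation under stated assumptions: the uncertainty scores, error labels, and the 0.5 threshold are made up, and the paired bootstrap equivalence test is not shown.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def gate(uncertainty: Sequence[float], threshold: float) -> List[bool]:
    """True means act autonomously; False means defer to the fallback policy."""
    return [u <= threshold for u in uncertainty]


def agreement(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Fraction of samples on which two gating rules make the same act/defer decision."""
    return sum(p == q for p, q in zip(a, b)) / len(a)


# Hypothetical per-sample errors (1 = wrong prediction) and two uncertainty estimates.
errors = [0, 0, 1, 0, 1, 1]
softmax_unc = [0.10, 0.20, 0.70, 0.30, 0.60, 0.90]
ensemble_unc = [0.05, 0.25, 0.65, 0.20, 0.80, 0.85]

rank_quality = spearman(softmax_unc, errors)
decision_agreement = agreement(gate(softmax_unc, 0.5), gate(ensemble_unc, 0.5))
```

In this toy data the two estimators order the samples differently yet make identical act/defer decisions at the chosen threshold, which mirrors the abstract's point that gating behavior, rather than the raw scores, is what matters for execution.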

cs.RO / 55 / 2605.18047

FUSE: A Framework for Unified State Estimation in Robotic SLAM Systems

FUSE：一种用于机器人SLAM系统的统一状态估计框架

Wu, Wei, Chen, Honglin, Cao, Wenhan, Lyu, Yao, Li, Jiangtao, Zhang, Tao, Li, Shengbo Eben

Abstract

Tightly coupled SLAM formulations under mixed-rate sensing often bind temporal processing, local geometric association, estimator formulation, and map-update policy into method-specific designs. Such binding makes it difficult to vary one design choice without re-engineering the rest of the state-estimation process. This paper presents FUSE, a framework for unified state estimation in robotic SLAM systems. FUSE organizes the state-estimation interface around observation ingestion, propagation, update, and state query, and uses this interface to separate temporal processing, residual-ready local geometric association, estimator formulation, and map-update policy. A LiDAR-IMU instantiation is developed to examine the framework under mixed-rate sensing and directional degeneracy, where high-rate inertial propagation, LiDAR-triggered geometric update, residual screening, and degeneracy-aware correction operate through the same interface boundaries. On a 418 m loop-corridor sequence, the instantiation reports a 1.626 m end-to-end trajectory error, corresponding to a 7.9% relative error reduction compared with Faster-LIO, the lowest-error baseline on this sequence. The results support FUSE as a framework for organizing state-estimation design choices and show how the evaluated instantiation regularizes updates along weakly observable directions.

Chinese Translation

在混合速率传感下，紧耦合的SLAM公式通常将时间处理、局部几何关联、估计器构造和地图更新策略绑定到特定方法的设计中。这种绑定使得在不重新设计其余状态估计过程的情况下，难以改变某一设计选择。本文提出了FUSE，一种用于机器人SLAM系统的统一状态估计框架。FUSE围绕观察数据的摄取、传播、更新和状态查询组织状态估计接口，并利用该接口将时间处理、残差准备的局部几何关联、估计器构造和地图更新策略分离。我们开发了一个LiDAR-IMU实例，以在混合速率传感和方向退化的情况下检验该框架，其中高频率的惯性传播、LiDAR触发的几何更新、残差筛选和考虑退化的校正通过相同的接口边界操作。在一个418米的环形走廊序列中，该实例报告了1.626米的端到端轨迹误差，相较于Faster-LIO（该序列上的最低误差基线）实现了7.9%的相对误差减少。结果支持FUSE作为组织状态估计设计选择的框架，并展示了所评估的实例如何在弱可观测方向上规范化更新。
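
The abstract describes an interface built around four operations: observation ingestion, propagation, update, and state query. The sketch below shows one way such an interface could look, with a toy one-dimensional filter behind it. Every name here (`Observation`, `StateEstimator`, `Toy1DEstimator`, the `"imu"` and `"lidar"` labels, the noise values) is hypothetical and chosen for illustration; none of it is FUSE's actual API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Observation:
    timestamp: float
    sensor: str   # "imu" (high-rate velocity) or "lidar" (low-rate position); hypothetical labels
    value: float


class StateEstimator(ABC):
    """Four-operation interface: ingest, propagate, update, query."""

    @abstractmethod
    def ingest(self, obs: Observation) -> None: ...

    @abstractmethod
    def propagate(self, t: float) -> None: ...

    @abstractmethod
    def update(self) -> None: ...

    @abstractmethod
    def query(self) -> Tuple[float, float]: ...


class Toy1DEstimator(StateEstimator):
    """Scalar Kalman-style filter: velocity drives propagation, position triggers updates."""

    def __init__(self, process_noise: float = 0.01, meas_noise: float = 0.04) -> None:
        self.t, self.x, self.p = 0.0, 0.0, 1.0
        self.velocity = 0.0
        self.pending: Optional[float] = None
        self.q, self.r = process_noise, meas_noise

    def ingest(self, obs: Observation) -> None:
        self.propagate(obs.timestamp)      # bring the state to the observation time
        if obs.sensor == "imu":
            self.velocity = obs.value      # used for subsequent propagation
        elif obs.sensor == "lidar":
            self.pending = obs.value       # a position fix triggers an update
            self.update()

    def propagate(self, t: float) -> None:
        dt = t - self.t
        if dt <= 0.0:
            return
        self.x += self.velocity * dt
        self.p += self.q * dt
        self.t = t

    def update(self) -> None:
        if self.pending is None:
            return
        gain = self.p / (self.p + self.r)
        self.x += gain * (self.pending - self.x)
        self.p *= 1.0 - gain
        self.pending = None

    def query(self) -> Tuple[float, float]:
        return self.x, self.p


est = Toy1DEstimator()
for obs in [Observation(0.0, "imu", 1.0), Observation(0.1, "imu", 1.0), Observation(0.2, "lidar", 0.25)]:
    est.ingest(obs)
position, variance = est.query()
```

Because callers only touch the four methods, a different estimator formulation could be swapped in behind the same boundary, which is the kind of separation the abstract argues for.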

cs.RO / 56 / 2605.18059

Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations

Bench2Drive-Robust：在部署扰动下闭环自主驾驶的基准测试

Zhang, Zhiyuan, Jin, Zhenghao, Peng, Yanlun, Guo, Xianda, Liu, Haoran, Zhang, Shaofeng, Ma, Xingjun, Wu, Zuxuan, Yan, Junchi, Jia, Xiaosong, Jiang, Yu-Gang

Abstract

Robustness is a critical requirement for deploying autonomous driving systems in the real world. Existing robustness benchmarks for autonomous driving have made important progress in studying the effects of image-level corruptions, such as adverse weather or camera degradation, on perception modules and open-loop planning outputs. However, deployment can also involve system-level imperfections, such as inference latency and ego-state estimation errors, which remain less studied in closed-loop end-to-end autonomous driving (E2E-AD) evaluation. These imperfections can accumulate through the feedback loop and destabilize control. In this work, we present Bench2Drive-Robust, to our knowledge the first device-centric robustness benchmark for closed-loop end-to-end autonomous driving under realistic deployment perturbations. We systematically evaluate deployment-oriented perturbations arising from three major sources: camera-stream failures (frame drop, partial observation), ego-state estimation errors (GPS noise, and speed or odometry errors), and compute-induced control delay (model inference delay). We evaluate representative end-to-end driving methods and analyze their robustness under different perturbation severities. Our results show that these deployment-related perturbations can substantially degrade closed-loop driving performance, revealing robustness challenges that are not fully captured by conventional image-level corruption evaluations. By establishing a closed-loop evaluation protocol and demonstrating the substantial impact of these deployment-oriented perturbations, Bench2Drive-Robust defines practical robustness problems for end-to-end autonomous driving and encourages further research on deployment-aware robust driving systems.

Chinese Translation

鲁棒性是将自主驾驶系统部署到现实世界中的关键要求。现有的自主驾驶鲁棒性基准在研究图像级损坏（如恶劣天气或摄像头退化）对感知模块和开环规划输出的影响方面取得了重要进展。然而，部署还可能涉及系统级的不完美，例如推理延迟和自我状态估计误差，这在闭环端到端自主驾驶评估中仍然较少被研究。这些不完美可能通过反馈循环累积并使控制不稳定。在本研究中，我们提出了Bench2Drive-Robust，作为我们所知的首个针对现实部署扰动的闭环端到端自主驾驶的设备中心鲁棒性基准。我们系统地评估了来自三个主要来源的面向部署的扰动：摄像头流故障（帧丢失、部分观察）、自我状态估计误差（GPS噪声、速度或里程计误差）以及计算引起的控制延迟（模型推理延迟）。我们评估了具有代表性的端到端驾驶方法，并分析了它们在不同扰动强度下的鲁棒性。我们的结果表明，这些与部署相关的扰动可以显著降低闭环驾驶性能，揭示了常规图像级损坏评估未能充分捕捉的鲁棒性挑战。通过建立闭环评估协议并展示这些面向部署的扰动的重大影响，Bench2Drive-Robust为端到端自主驾驶定义了实际的鲁棒性问题，并鼓励进一步研究面向部署的鲁棒驾驶系统。
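
The three perturbation sources listed in the abstract (camera-stream failures, ego-state errors, and control delay) can each be mimicked with a few lines of code. The sketch below is a guess at what such injectors might look like in a generic simulation loop. The function and class names, the stale-frame policy for dropped frames, and the Gaussian noise model are all assumptions, not the benchmark's implementation.

```python
import random
from collections import deque
from typing import Deque, List, Sequence, Tuple


class DelayedController:
    """Applies each command `delay_steps` ticks after it was computed."""

    def __init__(self, delay_steps: int, initial_command: float = 0.0) -> None:
        self.buffer: Deque[float] = deque([initial_command] * delay_steps)

    def apply(self, command: float) -> float:
        self.buffer.append(command)
        return self.buffer.popleft()


def drop_frames(frames: Sequence[int], drop_prob: float, rng: random.Random) -> List[int]:
    """Replace dropped frames with the last delivered one (stale observation)."""
    delivered: List[int] = []
    last = None
    for frame in frames:
        # The first frame is always delivered so that a stale frame exists to repeat.
        if last is None or rng.random() >= drop_prob:
            last = frame
        delivered.append(last)
    return delivered


def add_gps_noise(position: Tuple[float, float], sigma: float, rng: random.Random) -> Tuple[float, float]:
    """Add zero-mean Gaussian noise to each coordinate of an ego position."""
    return (position[0] + rng.gauss(0.0, sigma), position[1] + rng.gauss(0.0, sigma))


rng = random.Random(0)
controller = DelayedController(delay_steps=2)
applied = [controller.apply(c) for c in [1.0, 2.0, 3.0, 4.0]]
stale = drop_frames([10, 11, 12, 13], drop_prob=1.0, rng=rng)
noisy = add_gps_noise((5.0, -3.0), sigma=0.5, rng=rng)
```

In a closed loop, the delayed command acts on a state that has already moved on, so errors of this kind can compound over time in a way that open-loop evaluation would not reveal.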

cs.RO / 57 / 2605.18074

4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving

4DLidarOpen：一个用于运动感知自动驾驶的开放4D FMCW激光雷达数据集

Qian, Kane, Zhao, Xin, Shi, Yining, Yan, Rujun, Pan, Zhengqing, Zhu, Kaojin, Yang, Mengmeng, Sun, Kai, Yang, Diange, Jiang, Kun

Abstract

We present 4DLidarOpen, a large-scale open multi-modal dataset for autonomous driving, centered on 4D frequency-modulated continuous-wave (FMCW) Lidar sensing. Unlike conventional time-of-flight Lidar datasets that mainly provide geometric measurements, 4DLidarOpen includes point-wise radial velocity measurements from a forward-facing 4D FMCW Lidar, together with multiple Lidars of different types, including rotating, solid-state, and blind-spot variants, surround-view cameras, and 6-DOF ego-vehicle poses. The dataset was collected in complex urban environments in Beijing and covers dense pedestrian interactions, congested traffic, high-speed driving, and unprotected maneuvers. 4DLidarOpen provides synchronized multi-sensor data and 3D bounding-box annotations with persistent track IDs across five object categories. A hybrid annotation strategy is adopted, where large-scale auto-labeled data support scalable training and human experts refine annotations for the human-annotated training and validation sets. Based on this dataset, we establish benchmarks for 3D object detection, bird's-eye view (BEV) segmentation and flow prediction, and motion forecasting with planning. Extensive experiments show that direct velocity measurements from 4D FMCW Lidar provide complementary motion cues for dynamic-scene understanding. Compared with geometric-only sensing, the velocity-aware representation improves motion-related perception and downstream forecasting and planning, especially in scenarios involving vulnerable road users and fast-moving objects. These results indicate that 4D FMCW Lidar is a promising sensing modality for motion-aware autonomous driving. The dataset and evaluation toolkit are publicly released to support research on 4D scene understanding, multi-Lidar fusion, and velocity-aware perception and planning.

Chinese Translation

我们提出了4DLidarOpen，这是一个大规模开放的多模态数据集，专注于4D频率调制连续波（FMCW）激光雷达感知。与主要提供几何测量的传统飞行时间激光雷达数据集不同，4DLidarOpen包含来自前向4D FMCW激光雷达的逐点径向速度测量，以及多种类型的激光雷达，包括旋转式、固态和盲区变体，环视摄像头，以及6自由度自车姿态。该数据集在北京复杂的城市环境中收集，涵盖了密集的行人互动、拥堵的交通、高速驾驶和无保护的机动行为。4DLidarOpen提供了同步的多传感器数据和跨五个物体类别的带有持续跟踪ID的3D边界框注释。采用了一种混合注释策略，其中大规模自动标注的数据支持可扩展的训练，而人类专家则对人类标注的训练和验证集进行注释的细化。基于该数据集，我们建立了3D物体检测、鸟瞰视图（BEV）分割和流动预测、以及运动预测与规划的基准。大量实验表明，来自4D FMCW激光雷达的直接速度测量为动态场景理解提供了互补的运动线索。与仅依赖几何感知相比，速度感知表示改善了与运动相关的感知和下游预测与规划，尤其是在涉及易受伤害的道路使用者和快速移动物体的场景中。这些结果表明，4D FMCW激光雷达是一种有前景的运动感知自动驾驶传感方式。该数据集和评估工具包已公开发布，以支持4D场景理解、多激光雷达融合以及速度感知感知和规划的研究。

cs.RO / 58 / 2605.18184

Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation

固定外部摄像头作为主动3D场景图生成的共同先验地图

Modi, Giorgia, Buoso, Davide, Averta, Giuseppe, De Martini, Daniele

Abstract

Commonly available prior information, such as BIM models, floor plans, and remote sensing images, can provide valuable geometric and semantic context for autonomous robotic systems. In this paper, we treat observations from fixed external RGB cameras as Common Prior Maps (CPMs): wide-field views of the environment that initialize a semantic and geometric scene prior before any robot motion begins. We present an RGB-only framework for active, incremental 3D scene graph (3DSG) generation that seamlessly fuses observations from both onboard robot cameras and fixed external cameras within a single hardware-agnostic pipeline. By relying solely on RGB observations processed by a feed-forward 3D reconstruction model, the system treats all cameras, onboard or external, identically and requires no hardware modifications. A graph-based active semantic exploration framework then directly leverages the partial scene graph to guide the robot toward regions of high semantic uncertainty, progressively completing and refining the prior. Experiments demonstrate that bootstrapping the scene graph with even a single external camera increases initial object recall by up to 79%, and that the richer context of the prior significantly improves the efficiency of subsequent active exploration.

Chinese Translation

常见的先验信息，如建筑信息模型（BIM）、平面图和遥感图像，可以为自主机器人系统提供有价值的几何和语义上下文。本文将固定外部RGB摄像头的观测视为共同先验地图（Common Prior Maps, CPMs）：在任何机器人运动开始之前，提供环境的广域视图，以初始化语义和几何场景先验。我们提出了一种仅基于RGB的框架，用于主动增量3D场景图（3DSG）生成，该框架在一个不依赖于硬件的管道中无缝融合来自机器人机载摄像头和固定外部摄像头的观测。该系统仅依赖于通过前馈3D重建模型处理的RGB观测，将所有摄像头（无论是机载还是外部）视为相同，且无需硬件修改。基于图的主动语义探索框架随后直接利用部分场景图，引导机器人前往高语义不确定性区域，逐步完成和细化先验。实验表明，即使仅使用单个外部摄像头引导场景图的引导，也能将初始物体召回率提高多达79%，而且先验的丰富上下文显著提高了后续主动探索的效率。

cs.RO / 59 / 2605.18197

RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots

仅基于RGB的室内移动机器人主动3D场景图生成

Modi, Giorgia, Buoso, Davide, Averta, Giuseppe, De Martini, Daniele

Abstract

Current approaches to 3D scene graph generation rely on dedicated depth sensors, such as LiDAR or RGB-D cameras, for metric 3D reconstruction. This limits deployment to specialized robotic platforms and excludes settings where only RGB cameras are available, such as fixed external infrastructure. Existing pipelines also typically operate on passively collected observation trajectories, rather than selecting viewpoints based on the partially built scene representation, and therefore fail to effectively exploit the semantic and spatial information encoded within the graph during exploration. This paper presents a fully visual framework for the active, incremental construction of 3D scene graphs from RGB input only, addressing both limitations. The proposed approach unifies perception and planning around a shared structured representation that captures object semantics, 3D geometry, relational context, and information from multiple viewpoints. Because the framework is hardware-agnostic and relies only on RGB observations, it can incorporate inputs from both onboard robot cameras and fixed external cameras within the same representation. Experiments on the Replica dataset show that the RGB-only pipeline achieves F1-score parity with baselines using ground-truth depth. Active exploration experiments on ReplicaCAD further show that semantic-driven viewpoint selection detects more than twice as many objects as a geometric frontier-based baseline under the same exploration budget. Finally, the external-camera setting demonstrates that complementary RGB views can effectively bootstrap the scene graph and improve contextual understanding at no additional exploration cost.

Chinese Translation

当前的3D场景图生成方法依赖于专用的深度传感器，如激光雷达（LiDAR）或RGB-D相机，以进行度量级3D重建。这限制了其在专用机器人平台上的部署，并排除了仅有RGB相机可用的环境，例如固定的外部基础设施。现有的处理流程通常在被动收集的观察轨迹上运行，而不是基于部分构建的场景表示选择视点，因此在探索过程中未能有效利用图中编码的语义和空间信息。本文提出了一种完全视觉化的框架，用于仅从RGB输入主动、增量地构建3D场景图，解决了这两方面的局限。所提出的方法将感知与规划统一在一个共享的结构化表示中，该表示捕捉了物体语义、3D几何、关系上下文以及来自多个视点的信息。由于该框架与硬件无关，仅依赖于RGB观察，因此可以在同一表示中结合来自机器人机载相机和固定外部相机的输入。在Replica数据集上的实验表明，RGB-only处理流程在F1-score上与使用真实深度的基线达到了相同水平。在ReplicaCAD上的主动探索实验进一步表明，基于语义的视点选择在相同探索预算下检测到的物体数量是基于几何前沿的基线的两倍以上。最后，外部相机设置表明，互补的RGB视图可以有效地引导场景图的构建，并在没有额外探索成本的情况下改善上下文理解。

cs.RO / 60 / 2605.18262

On Improving Multimodal Pedestrian Trajectory Prediction with CVAE: A Study on Benchmark and Robot Data

基于条件变分自编码器的多模态行人轨迹预测改进研究：基准数据与机器人数据的分析

Liu, Yuzhou, Olaverri-Monreal, Cristina

Abstract

Accurate pedestrian trajectory prediction is crucial for autonomous systems operating in complex environments, such as modular buses and delivery robots in suburban or semi-structured areas. Social Spatio-Temporal Graph Convolutional Neural Networks (Social-STGCNN) have shown strong performance by modeling social interactions; however, producing diverse and well-calibrated future trajectories remains challenging. In this work, we build on a Social-STGCNN backbone and introduce a Conditional Variational Autoencoder (CVAE)-based probabilistic formulation to explicitly model multimodal future trajectories. We evaluate the method on the ETH and UCY pedestrian trajectory datasets as well as on a real-world pedestrian dataset collected by a mobile robot. Results show moderate gains on public benchmarks, but more consistent endpoint accuracy and improved trajectory diversity across different crowd configurations. Evaluation on robot-collected data further demonstrates the approach's effectiveness beyond curated benchmarks and supports its applicability in practical deployments.

Chinese Translation

准确的行人轨迹预测对于在复杂环境中运行的自主系统至关重要，例如在郊区或半结构化区域内的模块化公交车和送货机器人。社会时空图卷积神经网络（Social-STGCNN）通过建模社会互动展现了强大的性能；然而，生成多样且良好校准的未来轨迹仍然具有挑战性。在本研究中，我们基于Social-STGCNN骨干网络，引入了一种基于条件变分自编码器（CVAE）的概率模型，以显式建模多模态未来轨迹。我们在ETH和UCY行人轨迹数据集以及一个由移动机器人收集的真实行人数据集上评估该方法。结果显示，在公共基准上取得了适度的提升，但在不同人群配置下的终点准确性更为一致，并且轨迹多样性有所改善。对机器人收集数据的评估进一步证明了该方法在策划基准之外的有效性，并支持其在实际部署中的适用性。

cs.RO / 61 / 2605.18295

Assessing Localization Technologies for Pedestrian Collision Avoidance

评估行人碰撞避免的定位技术

Varughese, Joshua, Gorospe, Joseba, Certad, Novel, Olaverri-Monreal, Cristina

Abstract

Robust pedestrian safety is crucial to the next generation of intelligent transportation systems. Such systems rely on active pedestrian localization and predictive collision alerts. Pedestrian localization can be supported by Ultra-Wideband technology and Bluetooth 6.0, which offer high-precision ranging and low-latency communication, making them promising candidates for vehicular collision warning systems. This paper assesses the localization accuracy of these technologies for pedestrian alerting and benchmarks their performance against Global Navigation Satellite Systems. The experimental evaluations focus on key performance metrics, including localization accuracy and robustness to environmental conditions. Preliminary results suggest that Ultra-Wideband and Bluetooth 6.0 can serve as viable alternatives or complements to Global Navigation Satellite Systems in certain scenarios, improving situational awareness and enabling timely pedestrian alerts.

Chinese Translation

稳健的行人安全对下一代智能交通系统至关重要。这类系统依赖于主动的行人定位和预测性碰撞警报。行人定位可以通过超宽带（Ultra-Wideband）技术和蓝牙6.0（Bluetooth 6.0）来支持，这些技术提供高精度的测距和低延迟的通信，使其成为车辆碰撞警告系统的有希望的候选者。本文评估了这些技术在行人警报中的定位准确性，并将其性能与全球导航卫星系统（Global Navigation Satellite Systems）进行了基准比较。本文中的实验评估集中在关键性能指标上，包括定位准确性和对环境条件的鲁棒性。初步结果表明，超宽带和蓝牙6.0在某些场景中可以作为全球导航卫星系统的可行替代方案或补充，从而提高情境意识并实现及时的行人警报。

cs.RO / 62 / 2605.18373

Dynamic robotic cloth folding with efficient Koopman operator-based model predictive control

基于高效Koopman算子的动态机器人布料折叠模型预测控制

Caldarelli, Edoardo, Coltraro, Franco, Colomé, Adrià, Rosasco, Lorenzo, Torras, Carme

Abstract

Robotic cloth folding is a challenging task, particularly for dynamic folding, which aims to fold cloth with fast motions that leverage its dynamics. Under such fast motions, the complexity of cloth dynamics hinders both system identification and planning of folding trajectories, making simulation-to-reality transfer difficult when using physical models of cloth. Compared to the dexterity that humans exhibit when performing folding tasks, robotic approaches usually employ small garments with quite rigid dynamics, and are either too slow, or fast but imprecise, requiring several attempts to achieve a reasonably good fold. In this paper, we tackle these challenges by generating fast folding trajectories with a novel model predictive controller, integrating physics-based simulation of cloth dynamics and efficient, kernel-based Koopman operator regression. Koopman operator regression, an increasingly popular machine learning technique for nonlinear system identification, is used to obtain a linear model for the cloth being folded. Such a surrogate model, trained with data from a high-fidelity, physics-based cloth simulator, can then be employed within a suitable model predictive control algorithm, in place of the costly, nonlinear one, to efficiently generate folding trajectories to be executed by a robotic manipulator. In both simulated and real-robot experiments, we show how the linearization supplied by the Koopman operator-based model can be employed to efficiently generate fast folding trajectories to unseen poses, without sacrificing folding accuracy.

Chinese Translation

机器人布料折叠是一项具有挑战性的任务，尤其是在考虑动态折叠任务时，这种任务旨在通过快速运动利用布料的动态特性进行折叠。当面临如此快速的运动时，布料动态的复杂性阻碍了系统识别和折叠轨迹的规划，导致在使用布料的物理模型时，仿真到现实的转移变得困难。与人类在执行折叠任务时展现出的灵活性相比，机器人方法通常采用小型服装，其动态特性相对刚性，且要么速度过慢，要么速度快但不精确，往往需要多次尝试才能实现合理的折叠。在本文中，我们通过生成快速折叠轨迹来应对这些挑战，采用了一种新颖的模型预测控制器，整合了基于物理的布料动态仿真和高效的基于核的Koopman算子回归。Koopman算子回归是一种日益流行的非线性系统识别机器学习技术，用于获取被折叠布料的线性模型。这样的替代模型通过高保真度的基于物理的布料仿真器的数据进行训练，随后可以在合适的模型预测控制算法中使用，替代成本高昂的非线性模型，从而有效生成由机器人操纵器执行的折叠轨迹。在仿真和真实机器人实验中，我们展示了如何利用Koopman算子模型提供的线性化来高效生成快速折叠轨迹，以应对未见的姿态，而不牺牲折叠精度。
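
The core idea in the abstract is to replace a nonlinear simulator with a linear surrogate obtained by Koopman operator regression. The sketch below illustrates that idea on a toy two-state system, using a fixed dictionary of observables and ordinary least squares. This is a deliberate simplification: the paper describes a kernel-based regression on cloth simulation data, whereas the dynamics in `step`, the dictionary in `lift`, and all numeric values here are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def step(state: Vector) -> Vector:
    """Toy nonlinear dynamics standing in for an expensive simulator."""
    x1, x2 = state
    return [0.9 * x1, 0.5 * x2 + 0.2 * x1 * x1]


def lift(state: Vector) -> Vector:
    """Hypothetical dictionary of observables: [x1, x2, x1^2]."""
    x1, x2 = state
    return [x1, x2, x1 * x1]


def solve(g: Matrix, a: Matrix) -> Matrix:
    """Solve g @ k = a for k by Gauss-Jordan elimination with partial pivoting."""
    n = len(g)
    aug = [g[i][:] + a[i][:] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_koopman(snapshots: List[Vector]) -> Matrix:
    """Least-squares K such that lift(step(x)) is approximately lift(x) @ K."""
    n = len(lift(snapshots[0]))
    g = [[0.0] * n for _ in range(n)]
    a = [[0.0] * n for _ in range(n)]
    for x in snapshots:
        psi, psi_next = lift(x), lift(step(x))
        for i in range(n):
            for j in range(n):
                g[i][j] += psi[i] * psi[j]        # Gram matrix of lifted states
                a[i][j] += psi[i] * psi_next[j]   # cross term with lifted successors
    return solve(g, a)


def predict(k: Matrix, state: Vector, horizon: int) -> Vector:
    """Roll the linear surrogate forward in lifted space and read back the state."""
    z = lift(state)
    for _ in range(horizon):
        z = [sum(z[i] * k[i][j] for i in range(len(z))) for j in range(len(z))]
    return z[:2]


training = [[x1, x2] for x1 in (-1.0, -0.5, 0.5, 1.0) for x2 in (-1.0, 0.0, 1.0)]
K = fit_koopman(training)

truth = [1.0, 0.5]
for _ in range(5):
    truth = step(truth)
surrogate = predict(K, [1.0, 0.5], horizon=5)
```

For this toy system the chosen observables happen to close under the dynamics, so the linear surrogate reproduces the nonlinear rollout up to numerical error. For cloth, no finite dictionary is expected to be exact, which is one motivation for kernel-based variants. Once the model is linear in the lifted space, it can be used inside a model predictive controller in place of the nonlinear simulator.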

cs.RO / 63 / 2605.18385

Towards Ubiquitous Mapping and Localization for Dynamic Indoor Environments

面向动态室内环境的普适映射与定位

Djerroud, Halim, Steyn, Nico, Rabreau, Olivier, Bonnin, Patrick, Benali, Abderraouf

Abstract

We present UbiSLAM, an innovative solution for real-time mapping and localization in dynamic indoor environments. By deploying a network of fixed RGB-D cameras strategically throughout the workspace, UbiSLAM addresses limitations commonly encountered in traditional SLAM systems, such as sensitivity to environmental changes and reliance on mobile unit sensors. This fixed-sensor approach enables real-time, comprehensive mapping, enhancing the localization accuracy and responsiveness of robots operating within the environment. The centralized map generated by UbiSLAM is continuously updated, providing robots with an accurate global view, which improves navigation, minimizes collisions, and facilitates smoother human-robot interactions in shared spaces. Beyond its advantages, UbiSLAM faces challenges, particularly in ensuring complete spatial coverage and managing blind spots, which necessitate data integration from the robots themselves. In this paper we discuss potential solutions, such as automatic calibration for optimal camera placement and orientation, along with enhanced communication protocols for real-time data sharing. The proposed model reduces the computational load on individual robotic units, allowing less complex robotic platforms to operate effectively while enhancing the robustness of the overall system.

Chinese Translation

我们提出了UbiSLAM，这是一种用于动态室内环境的实时映射与定位的创新解决方案。通过在工作空间内战略性地部署一网络固定的RGB-D摄像头，UbiSLAM解决了传统SLAM系统中常见的局限性，如对环境变化的敏感性和对移动单元传感器的依赖。这种固定传感器的方法使得实时、全面的映射成为可能，提高了在环境中操作的机器人的定位精度和响应能力。UbiSLAM生成的集中式地图会不断更新，为机器人提供准确的全局视图，从而改善导航，减少碰撞，并促进共享空间中人机交互的顺畅性。除了其优势外，UbiSLAM还面临挑战，特别是在确保空间覆盖的完整性和管理盲区方面，这需要来自机器人自身的数据集成。本文讨论了潜在的解决方案，例如自动校准以优化摄像头的放置和方向，以及增强的通信协议以实现实时数据共享。所提出的模型减少了单个机器人单元的计算负担，使得较为简单的机器人平台能够有效运行，同时增强了整体系统的鲁棒性。

cs.RO / 64 / 2605.18423

REBAR: Reference Ethical Benchmark for Autonomy Readiness

REBAR：自主准备的参考伦理基准

Diller, Jonathan, Barnes, David, Bogdanoff, Rebekah, Collier, Rhett, Collins, Roddy, Fieldhouse, Keith, Gefen, Yonatan, Johnson, Cameron, Kodali, Anuriha, Kriel, Brad, Murali, Varun, Niehaus, James, Sukharev, Mish, VanPelt, Joseph, Hoogs, Anthony, Kumar, Vijay, Basharat, Arslan

Abstract

As autonomous systems grow more advanced, objective metrics to evaluate their ethical and legal compliance are critical for informing end users of their limitations and ensuring accountability of those who misuse them. Current ethical embodied AI frameworks remain mostly qualitative, focusing on system design (through safety guardrails or targeted red teaming), and the realized guardrails often directly disallow unsafe behavior without providing the user with an override or interpretable reason. Instead, there is a need for computable metrics through rigorous testing that allow a user to determine the applicability of the system to the task. To address this gap, we introduce the Reference Ethical Benchmark for Autonomy Readiness (REBAR), a quantitative test and evaluation framework for autonomous systems. REBAR maps operating metrics into a computable Autonomy Readiness Level (ARL) rubric that can quantify ethical performance. Key innovations of the framework include a neuro-symbolic Large Language Model (LLM) approach to calculate and explain the ethical difficulty of scenarios, LLM-driven at-scale generation of test instances, and a versatile, photorealistic simulation environment. By evaluating white-box autonomy solutions through this rigorous testing pipeline, REBAR delivers an objective and repeatable benchmark score, bridging the gap between abstract principles and verifiable, accountable autonomy.

Chinese Translation

随着自主系统的不断发展，评估其伦理和法律合规性的客观指标对于告知最终用户其局限性以及确保对滥用行为的问责至关重要。目前的伦理具身人工智能框架大多仍然是定性的，主要集中在系统设计（通过安全防护措施或针对性的红队测试），而实现的防护措施往往直接禁止不安全行为，而未向用户提供覆盖或可解释的理由。因此，需要通过严格测试提供可计算的指标，使用户能够确定系统在特定任务中的适用性。为了解决这一问题，我们提出了自主准备的参考伦理基准（REBAR），这是一个用于自主系统的定量测试和评估框架。REBAR将操作指标映射到可计算的自主准备水平（Autonomy Readiness Level, ARL）标准中，以量化伦理表现。该框架的关键创新包括使用神经符号大型语言模型（Large Language Model, LLM）来计算和解释场景的伦理难度，基于LLM的大规模测试实例生成，以及一个多功能的、照片级真实感的仿真环境。通过这一严格的测试流程评估白盒自主解决方案，REBAR提供了一个客观且可重复的基准评分，弥合了抽象原则与可验证、可问责的自主性之间的差距。

cs.RO / 65 / 2605.18441

REACT: Environment-Adaptive Architecture for Continuous Formation Navigation of Wheeled Mobile Robots

REACT：适应环境的轮式移动机器人连续编队导航架构

Dong, Jianghong, Zhang, Yifeng, Wang, Jiawei, Cai, Mengchi, Li, Keqiang, Sartoretti, Guillaume

Abstract

Formation control of wheeled mobile robots (WMRs) has been extensively studied due to its broad applications in fields such as logistics transportation, environmental monitoring, and search and rescue. However, most existing works mainly focus on tracking predefined formations, which limits their adaptability to complex real-world environments. To address this, we propose REACT (Real-time Environment-Adaptive architecture for Continuous formation navigaTion), a hierarchical architecture integrating centralized formation generation and distributed formation maintenance. Specifically, our upper layer generates new environment-adaptive formations when necessary and uses our proposed TCF-R2T (Trajectory-Conflict-Free Robot-to-Target assignment) algorithm to compute conflict-free WMR-to-target assignments in polynomial time, enabling timely formation transitions without trajectory conflicts. At the lower layer, each WMR executes our developed JSTP (Joint Spatio-Temporal trajectory Planning) method to maintain the generated formation by simultaneously optimizing spatial positions and temporal durations, thereby enhancing coordination among WMRs and enabling continuous navigation in obstacle-rich environments and dynamic-obstacle scenarios. Both simulation and real-world experiments validate the effectiveness and practical applicability of REACT. Experimental videos are available on our project website: https://dongjh20.github.io/REACT-website.

Chinese Translation

轮式移动机器人的编队控制因其在物流运输、环境监测和搜索救援等领域的广泛应用而受到广泛研究。然而，现有大多数研究主要集中于跟踪预定义的编队，这限制了它们在复杂现实环境中的适应性。为了解决这一问题，我们提出了REACT（实时适应环境的连续编队导航架构），这是一种集成了集中式编队生成和分布式编队维护的分层架构。具体而言，我们的上层在必要时生成新的适应环境的编队，并使用我们提出的TCF-R2T（无冲突轨迹的机器人到目标分配）算法以多项式时间计算无冲突的WMR到目标的分配，从而实现及时的编队转换而不产生轨迹冲突。在下层，每个WMR执行我们开发的JSTP（联合时空轨迹规划）方法，通过同时优化空间位置和时间持续时间来维持生成的编队，从而增强WMR之间的协调能力，并实现对障碍物丰富环境和动态障碍场景的连续导航。仿真和真实世界实验均验证了REACT的有效性和实际应用性。实验视频可在我们的项目网站上查看：https://dongjh20.github.io/REACT-website。

cs.RO / 66 / 2605.18482

Bidirectional Optical sensors for Actuation Tracking (BOAT) in soft lattice systems

用于软格子系统的双向光学传感器（BOAT）用于驱动跟踪

Trunin, Petr, Gay, Carolina, Nardin, Anderson Brazil, Exley, Trevor, Cafiso, Diana, Beccai, Lucia

Abstract

The growing adoption of lattice-based structures in soft robotics creates a need for advanced sensing solutions capable of monitoring their global deformation, particularly compression and extension. In this work, we address this challenge by introducing a novel optical sensor based on two patterned waveguides arranged in an ellipsoidal geometry. This Bidirectional Optical sensor for Actuation Tracking (BOAT) is seamlessly co-printed with a lattice structure actuated by an embedded pneumatic artificial muscle (PAM), and its performance is assessed. During PAM elongation or contraction, the bending of the embedded BOAT waveguides induces output signal variations that enable a clear discrimination between compression and extension states. The designs of both the individual waveguide structures (by surface patterning) and the sensorized lattice-based unit embedding two BOATs are supported by numerical simulations. Experimental calibration over 100 consecutive pressure cycles ranging from +50 kPa to -40 kPa demonstrates a highly repeatable response, allowing a reliable distinction between extension and compression. Finally, sensor feedback is used to implement a digital shadow, enabling continuous synchronization between the whole sensorized unit and its virtual counterpart. These results establish BOAT as a powerful and reliable approach for deformation monitoring in soft lattice-based robotic systems.

Chinese Translation

随着基于格子的结构在软机器人中的日益普及，迫切需要先进的传感解决方案来监测其整体变形，特别是压缩和伸展。在本研究中，我们通过引入一种新型光学传感器来应对这一挑战，该传感器基于两条以椭球几何排列的图案化波导。该双向光学传感器（BOAT）与由嵌入式气动人工肌肉（PAM）驱动的格子结构无缝共打印，并对其性能进行了评估。在PAM伸长或收缩过程中，嵌入的BOAT波导的弯曲引起输出信号的变化，从而能够清晰地区分压缩和伸展状态。每个特定波导结构的设计（通过表面图案化）以及嵌入两个BOAT的传感器化格子单元的设计均得到了数值模拟的支持。在+50 kPa到-40 kPa范围内进行的100个连续压力循环的实验校准显示出高度可重复的响应，使得能够可靠地区分伸展和压缩。最后，传感器反馈用于实现数字影像，能够在整个传感器化单元与其虚拟对应物之间实现持续同步。这些结果确立了BOAT作为一种强大且可靠的变形监测方法，适用于软格子结构的机器人系统。

cs.RO / 67 / 2605.18543

Geometry-Aware Surrogate for Real-Time Hydrodynamics Estimation of Autonomous Ground Vehicles in Amphibious Environments

面向几何的代理模型用于自主地面车辆在两栖环境中的实时水动力学估计

Waheed, Ammar, Gallantree, Luke, Hasnain, Zohaib

Abstract

Autonomous ground vehicles operating in shallow water or flood-prone terrains require dynamic models that account for hydrodynamic forces. However, the simulation and planning tools currently available either lack the physical fidelity or are too computationally expensive to run in real time. This work presents a per-surface neural network surrogate that bridges this gap by predicting geometry-resolved hydrodynamic forces at real-time rates, trained entirely on high-fidelity CFD data from two geometrically distinct vehicles. A vehicle-specific Signed Distance Field (SDF) provides per-surface submergence inputs, allowing the model to resolve how loading varies with vehicle geometry, depth, and flow direction. On held-out CFD data, the surrogate achieves a longitudinal-force symmetric MAPE (sMAPE) of 13% and a vertical-force sMAPE of 3-12%, with inference running under 0.9 ms per sample. To evaluate the model under real-world conditions, water wading trials of a full-scale vehicle at different submersion depths are used. Motion-capture-derived kinematics serve as the surrogate inputs, and the resulting predictions are tested to reproduce known physical relationships between force, speed, and depth. The predicted drag follows quadratic speed scaling ($R^2 \geq 0.97$) and the buoyancy intercepts scale linearly with depth ($R^2 = 0.973$). Neither relationship is encoded in the model training loss; both emerge from the per-surface architecture summing individually predicted surface forces. The resulting framework provides a pathway for embedding physically grounded hydrodynamics into the simulation and planning loops that autonomous ground vehicles depend on in amphibious environments.
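The two evaluation quantities mentioned in this abstract, sMAPE and the $R^2$ of a quadratic speed fit, can be illustrated with a minimal sketch. The abstract does not state which sMAPE variant or which fitting procedure the authors use, so the formula below (the common form with $|a| + |p|$ in the denominator) and the one-parameter least-squares fit $F = c\,v^2$ are assumptions, and the function names are hypothetical.

```python
def smape(actual, predicted):
    """Symmetric MAPE in percent, using 2|p - a| / (|a| + |p|) per sample.

    Assumption: this is one common sMAPE variant; the paper's exact
    definition is not given in the abstract.
    """
    terms = []
    for a, p in zip(actual, predicted):
        denom = abs(a) + abs(p)
        # Define the term as 0 when both values are exactly zero.
        terms.append(0.0 if denom == 0 else 2.0 * abs(p - a) / denom)
    return 100.0 * sum(terms) / len(terms)


def fit_quadratic_drag(speeds, forces):
    """Least-squares fit of F = c * v^2 and the resulting R^2.

    Hypothetical helper: checks how well drag follows quadratic speed scaling.
    """
    c = sum(f * v ** 2 for v, f in zip(speeds, forces)) / sum(v ** 4 for v in speeds)
    mean_f = sum(forces) / len(forces)
    ss_res = sum((f - c * v ** 2) ** 2 for v, f in zip(speeds, forces))
    ss_tot = sum((f - mean_f) ** 2 for f in forces)
    return c, 1.0 - ss_res / ss_tot
```

An $R^2$ close to 1 from `fit_quadratic_drag` indicates that the forces are well described by quadratic speed scaling, which is the kind of check the abstract reports for the predicted drag.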

Chinese Translation

在浅水或易洪水地形中运行的自主地面车辆需要考虑水动力作用的动态模型。然而，目前可用的仿真和规划工具要么缺乏物理真实性，要么计算成本过高，无法实时运行。本研究提出了一种基于每个表面的神经网络代理模型，通过预测几何分辨的水动力作用，以实时速率填补这一空白，该模型完全基于来自两种几何特征不同的车辆的高保真计算流体动力学（CFD）数据进行训练。特定于车辆的有符号距离场（Signed Distance Field, SDF）提供每个表面的浸没输入，使模型能够解析载荷如何随车辆几何形状、深度和流向变化。在保留的CFD数据上，该代理模型实现了纵向力的对称平均绝对百分比误差（sMAPE）为13%，垂直力的sMAPE为3-12%，推理时间每个样本低于0.9毫秒。为了在现实条件下评估该模型，进行了不同浸没深度的全尺度车辆水中涉水试验。运动捕捉获得的运动学作为代理输入，结果预测被测试以重现已知的力、速度和深度之间的物理关系。预测的阻力遵循二次速度缩放（$R^2 \geq 0.97$），而浮力截距与深度线性相关（$R^2 = 0.973$）。这两种关系均未在模型训练损失中编码，而是通过每个表面架构的个别预测表面力的总和而产生。该框架为将物理基础的水动力学嵌入自主地面车辆在两栖环境中所依赖的仿真和规划循环提供了一条途径。

cs.RO / 68 / 2605.18556

Key-Gram: Extensible World Knowledge for Embodied Manipulation

关键语法：用于具身操控的可扩展世界知识

Fan, Jingjing, Li, Siyuan, Ren, Botao, Deng, Zhidong

Abstract

Embodied control increasingly requires models to follow compositional language instructions while reasoning over dynamic visual states. However, current vision-language-action policies and world-action models often couple linguistic knowledge with visual computation in a shared backbone or conditioning pathway, leading to modality competition and making knowledge extension dependent on backbone updates. In this paper, we introduce Key-Gram, a conditional-memory framework that separates language-derived world knowledge from visual-state reasoning for embodied control. At its core is a memory module that decomposes an instruction into task-specific key-grams, retrieves static linguistic priors through deterministic hashed lookup, and injects the retrieved entries into selected hidden layers through context-aware gating and lightweight convolutional fusion. This design allows the backbone to devote its main capacity to visual reasoning and action inference, while reusable instruction knowledge is stored in an extensible external memory. The logical memory table can be conveniently partitioned during training and, due to its $O(1)$ lookup pattern, efficiently placed on host memory during inference. Across RoboTwin2.0, LIBERO/LIBERO-Plus, and real-world dual-arm manipulation, Key-Gram consistently improves both $\pi_{0}$ and $\pi_{0.5}$ backbones, with average relative gains of $29.5\%/9.9\%$ on RoboTwin2.0, $35.8\%/4.5\%$ on LIBERO-Plus transfer without target-domain fine-tuning, and $15.4\%/8.1\%$ on real-world long-horizon tasks. These results demonstrate that externalized linguistic memory provides an effective and extensible mechanism for improving compositional grounding, transfer, and real-world manipulation.
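The "deterministic hashed lookup" with an $O(1)$ access pattern described in this abstract can be pictured with a minimal sketch. The abstract does not specify how key-grams are extracted, which hash is used, or how the table is laid out, so everything below (word n-grams, SHA-256, a flat list as the memory table, and all function names) is a hypothetical illustration rather than the authors' design.

```python
import hashlib


def key_grams(instruction, n=2):
    """Split an instruction into lowercase word n-grams (assumed key-gram form)."""
    words = instruction.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def slot_for(gram, table_size):
    """Map a gram to a table slot with a stable hash, so lookups are repeatable."""
    digest = hashlib.sha256(gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % table_size


def lookup(instruction, table, n=2):
    """Fetch one stored entry per gram; each fetch is a single index operation."""
    return [table[slot_for(g, len(table))] for g in key_grams(instruction, n)]
```

Because each gram maps to a fixed slot independent of table contents, such a table could in principle live in host memory and be partitioned by slot range, which matches the placement properties the abstract mentions.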

Chinese Translation

具身控制越来越需要模型在推理动态视觉状态的同时遵循组合语言指令。然而，当前的视觉-语言-动作策略和世界-动作模型通常将语言知识与视觉计算耦合在共享的主干或条件路径中，导致模态竞争，并使知识扩展依赖于主干更新。在本文中，我们介绍了关键语法（Key-Gram），一种条件记忆框架，它将源自语言的世界知识与视觉状态推理分离，以实现具身控制。其核心是一个记忆模块，它将指令分解为任务特定的关键语法，通过确定性哈希查找检索静态语言先验，并通过上下文感知门控和轻量级卷积融合将检索到的条目注入选定的隐藏层。这一设计使得主干能够将其主要能力专注于视觉推理和动作推断，而可重用的指令知识则存储在可扩展的外部记忆中。逻辑记忆表在训练过程中可以方便地进行分区，并且由于其 $O(1)$ 的查找模式，在推理时能够高效地放置在主机内存中。在RoboTwin2.0、LIBERO/LIBERO-Plus以及真实世界的双臂操控中，关键语法始终提升了 $\pi_{0}$ 和 $\pi_{0.5}$ 主干，RoboTwin2.0 上的平均相对增益为 $29.5\%/9.9\%$，LIBERO-Plus 转移（无目标领域微调）上的增益为 $35.8\%/4.5\%$，以及在真实世界的长时间任务中的增益为 $15.4\%/8.1\%$。这些结果表明，外部化的语言记忆为改善组合基础、迁移和真实世界操控提供了一种有效且可扩展的机制。

cs.RO / 69 / 2605.18611

Unified Walking, Running, and Recovery for Humanoids via State-Dependent Adversarial Motion Priors

通过状态依赖对抗运动先验实现类人机器人统一行走、奔跑和恢复

Lu, Yidan, Zhong, Yichao, Zhao, Liu, Li, Wanyue, Lu, Peng

Abstract

We propose a unified reinforcement learning framework that enables a single policy to perform walking, running, and fall recovery on the Unitree G1 humanoid robot, validated on physical hardware without any explicit mode-switching command at deployment. The framework extends Adversarial Motion Priors (AMP) by replacing the conventional global reference distribution with a state-dependent gate that routes each training transition to one of two discriminators: a dedicated recovery discriminator and a velocity-conditioned locomotion discriminator that jointly covers walking and running. The gate is defined by a single fixed threshold on projected gravity: the recovery discriminator is activated when body tilt exceeds approximately $37^\circ$ from vertical ($|g_z+1|>0.6$); otherwise the locomotion discriminator is used, with the normalized commanded velocity serving as a condition that selects the appropriate reference trajectory between walk and run clips. Only three LAFAN1 reference clips are required to regularize the complete behavior set. At deployment, a single frozen ONNX policy executes at 50 Hz with no runtime mode logic; hardware experiments demonstrate successful recovery from both prone and supine falls and smooth walk-to-run transitions under the same controller.
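The gating rule stated in this abstract reduces to a single inequality on the projected-gravity component $g_z$. A minimal sketch of that rule follows; it assumes a unit-normalized gravity vector expressed in the body frame with $g_z = -1$ when the robot is upright (which is what makes $|g_z + 1|$ vanish in the upright pose), and the function and constant names are hypothetical.

```python
RECOVERY_THRESHOLD = 0.6  # threshold on |g_z + 1| quoted in the abstract


def select_discriminator(g_z, threshold=RECOVERY_THRESHOLD):
    """Route a training transition to one of the two discriminators.

    Assumption: g_z is the z-component of the unit gravity vector in the
    body frame, equal to -1.0 when the robot stands upright.
    """
    if abs(g_z + 1.0) > threshold:
        return "recovery"
    return "locomotion"
```

For example, an upright transition (`g_z = -1.0`) is routed to the locomotion discriminator, while a transition recorded with the torso horizontal (`g_z = 0.0`) is routed to the recovery discriminator. According to the abstract this routing applies to training transitions only; the deployed policy carries no mode logic.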

Chinese Translation

我们提出了一个统一的强化学习框架，使得单一策略能够在Unitree G1类人机器人上执行行走、奔跑和跌倒恢复，并在部署时无需任何显式的模式切换指令。该框架通过用状态依赖的门替代传统的全局参考分布，扩展了对抗运动先验（Adversarial Motion Priors, AMP），将每个训练过渡路由到两个判别器之一：一个专用的恢复判别器和一个基于速度的运动判别器，后者共同覆盖行走和奔跑。该门由投影重力上的单一固定阈值定义：当身体倾斜超过约 $37^\circ$（$|g_z+1|>0.6$）时，激活恢复判别器；否则使用运动判别器，归一化的指令速度作为条件选择行走和奔跑片段之间的适当参考轨迹。仅需三个LAFAN1参考片段即可规范化完整的行为集。在部署时，单个冻结的ONNX策略以50 Hz的频率执行，无需运行时模式逻辑；硬件实验表明成功从俯卧和仰卧跌倒中恢复，并在同一控制器下实现平滑的行走到奔跑的过渡。

cs.RO / 70 / 2605.18617

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

ManiSoft：面向软连续机器人视觉-语言操作的研究

Wei, Ziyu, Wang, Luting, Gao, Chen, Wen, Li, Liu, Si

Abstract

Most existing vision-language manipulation research targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft robotic arms offer an appealing alternative due to their deformability, but confront challenges such as unreliable proprioception and distributed low-level actuation. To investigate these challenges, we introduce ManiSoft, a benchmark for vision-language manipulation with soft arms. ManiSoft features a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions via an elastic force constraint. On this basis, ManiSoft defines four tasks, each highlighting distinct aspects of deformable control, from basic end-effector coordination to obstacle avoidance. To support policy training and evaluation, ManiSoft includes an automated pipeline that generates 6,300 diverse scenes and corresponding expert trajectories. To produce high-quality trajectories at scale, we first employ a high-level planner to decompose each task into a sequence of waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track waypoints. Benchmarking three representative policy models shows relatively promising results in clean scenes but a substantial performance drop under randomization. Visualization analysis indicates that failures stem primarily from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoidance. We anticipate ManiSoft to serve as a valuable testbed, bridging the gap between rigid and soft arms in the context of vision-language manipulation. Our code and datasets are released at https://buaa-colalab.github.io/ManiSoft.

Chinese Translation

现有的大多数视觉-语言操作研究主要针对刚性机器人手臂，其固定的形态限制了在杂乱或狭小空间中的适应性。软机器人手臂因其可变形性而成为一种有吸引力的替代方案，但面临着诸如不可靠的本体感知和分布式低级驱动等挑战。为了解决这些挑战，我们提出了ManiSoft，一个用于软手臂视觉-语言操作的基准。ManiSoft具有一个定制的模拟器，将逼真的软体动力学与通过弹性力约束的丰富接触交互相结合。在此基础上，ManiSoft定义了四个任务，每个任务突出了可变形控制的不同方面，从基本的末端执行器协调到障碍物规避。为了支持策略训练和评估，ManiSoft包括一个自动化管道，生成6,300个多样化场景及其对应的专家轨迹。为了大规模生成高质量轨迹，我们首先使用高层规划器将每个任务分解为一系列路标点，然后使用低层强化学习策略生成扭矩命令以跟踪路标点。对三种代表性策略模型的基准测试显示，在干净场景中取得了相对令人满意的结果，但在随机化情况下性能显著下降。可视化分析表明，失败主要源于对本体状态的视觉估计不准确以及对可变形性的有限利用以适应性地规避障碍物。我们期待ManiSoft作为一个有价值的测试平台，在视觉-语言操作的背景下架起刚性与软性手臂之间的桥梁。我们的代码和数据集已发布在 https://buaa-colalab.github.io/ManiSoft。

cs.RO / 71 / 2605.18720

Data-Driven Dynamic Modeling of a Tendon-Actuated Continuum Robot

基于数据驱动的腱驱动连续机器人动态建模

Hansen, Harald Minde, Sæbø, Bjørn Kåre, Pettersen, Kristin Y., Gravdahl, Jan Tommy, Di Castro, Mario

Abstract

Developing dynamic models for tendon-driven continuum robots is challenging due to their nonlinear, high-dimensional, and friction-dominated dynamics. This paper presents a comparative study of data-driven system identification methods, including N4SID, ARX, and SINDYc, for modeling a tendon-actuated continuum robot with rolling joints developed at CERN. Despite the high number of joints of the robot, experimental analysis reveals that a two-degree-of-freedom dynamic model can accurately capture the system dynamics, owing to strong kinematic dependencies between the joints. The models are validated against experimental data, and used in the design of a model predictive controller, demonstrating their feasibility for real-time control.

Chinese Translation

为腱驱动的连续机器人开发动态模型具有挑战性，因为其动力学表现出非线性、高维度和以摩擦为主的特性。本文对数据驱动的系统识别方法进行了比较研究，包括 N4SID、ARX 和 SINDYc，旨在对在欧洲核子研究组织（CERN）开发的具有滚动关节的腱驱动连续机器人进行建模。尽管该机器人关节数量众多，实验分析表明，由于关节之间存在强烈的运动学依赖关系，一个两自由度的动态模型能够准确捕捉系统动态。所建立的模型经过实验数据验证，并用于模型预测控制器的设计，展示了其在实时控制中的可行性。

cs.RO / 72 / 2605.18722

Dexora: Open-source VLA for High-DoF Bimanual Dexterity

Dexora：面向高自由度双手灵巧性的开源视觉-语言-动作系统

Zhang, Zongzheng, Pang, Jingrui, Yang, Zhuo, Li, Kun, Liao, Minwen, Zhang, Saining, Chi, Guoxuan, Guo, Jinbang, Gao, Huan-ang, Shi, Modi, Ge, Dongyun, Mu, Yao, Gu, Jiayuan, Chen, Rui, Dong, Hao, Xu, Huazhe, Yi, Li, Zhu, Yixin, Zhao, Hang, Wang, Pengwei, Zhang, Shanghang, Yao, Guocai, Chen, Jianyu, Li, Hongyang, Zhao, Hao

Abstract

Vision-Language-Action (VLA) models have recently become a central direction in embodied AI, but current systems are restricted to either dual-gripper control or single-arm dexterous hand manipulation. While low-dimensional gripper control can often be handled with simpler methods, high-dimensional dexterous hand control benefits greatly from full end-to-end VLA learning. In this work, we introduce Dexora, the first open-source VLA system that natively targets dual-arm, dual-hand high-DoF manipulation. We design a hybrid teleoperation pipeline that decouples gross arm kinematics (captured with a custom exoskeleton backpack) from fine finger motion (markerless hand tracking via Apple Vision Pro), and that drives both a physical dual-arm dual-hand platform and an identical MuJoCo digital twin. Using that interface, we assemble a large training corpus: an embodiment-matched synthetic corpus (100K simulated trajectories, 6.5M frames) and a real-world dataset of 10K teleoperated episodes (2.92M frames). To mitigate noisy teleoperation demonstrations, we propose a data-quality-aware training recipe: an offline discriminator provides clip-level weights for diffusion-transformer policy training, down-weighting low-quality demonstrations. Empirically, Dexora outperforms competitive VLA baselines on both basic and dexterous benchmarks (e.g., average dexterous success 66.7% vs. 51.7%), attains 90% success on basic tasks, and shows robust out-of-distribution and cross-embodiment generalization. Ablations confirm the importance of real data and the discriminator for dexterity.
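The data-quality-aware recipe in this abstract down-weights low-quality demonstrations using clip-level weights from an offline discriminator. One simple way to picture this is a weighted average of per-clip losses, sketched below. The abstract does not say how the weights enter the diffusion-transformer objective, so the normalization by the weight sum and the function name are assumptions.

```python
def weighted_clip_loss(clip_losses, clip_weights):
    """Combine per-clip losses using discriminator-provided quality weights.

    Assumption: weights are non-negative and the batch loss is the
    weight-normalized average; the paper may use a different scheme.
    """
    if len(clip_losses) != len(clip_weights):
        raise ValueError("each clip needs exactly one weight")
    total_weight = sum(clip_weights)
    if total_weight <= 0:
        raise ValueError("at least one clip must have a positive weight")
    weighted_sum = sum(w * l for w, l in zip(clip_weights, clip_losses))
    return weighted_sum / total_weight
```

With equal weights this reduces to the ordinary mean loss, and a clip with weight zero has no influence on the result.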

Chinese Translation

视觉-语言-动作（VLA）模型最近成为具身人工智能的一个核心方向，但当前系统仅限于双抓手控制或单臂灵巧手操控。虽然低维抓手控制通常可以通过更简单的方法来处理，但高维灵巧手控制在全端到端的VLA学习中受益匪浅。在本研究中，我们介绍了Dexora，这是第一个原生针对双臂、双手高自由度操控的开源VLA系统。我们设计了一个混合遥操作管道，将粗大臂运动学（通过定制的外骨骼背包捕获）与精细手指运动（通过Apple Vision Pro的无标记手部追踪）解耦，并驱动一个物理的双臂双手平台和一个相同的MuJoCo数字孪生。通过该接口，我们组建了一个大型训练语料库：一个与具身匹配的合成语料库（10万条模拟轨迹，650万帧）和一个包含1万条遥操作片段的真实世界数据集（292万帧）。为了减轻噪声遥操作演示的影响，我们提出了一种数据质量感知的训练方案：一个离线判别器为扩散变换器策略训练提供片段级权重，降低低质量演示的权重。实证结果表明，Dexora在基本和灵巧基准测试中优于竞争性的VLA基线（例如，平均灵巧成功率为66.7%对比51.7%），在基本任务上达到90%的成功率，并展示了强大的分布外和跨具身泛化能力。消融实验确认了真实数据和判别器在灵巧性中的重要性。

cs.RO / 73 / 2605.18727

DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

DexHoldem：使用灵巧的具身系统玩德州扑克

Chen, Feng, Chu, Tianzhe, Sun, Li, Zhou, Pei, Xu, Zhuxiu, Gao, Shenghua, Zhai, Yuexiang, Yang, Yanchao, Ma, Yi

Abstract

Evaluating embodied systems on real dexterous hardware requires more than isolated primitive skills: an agent must perceive a changing tabletop scene, choose a context-appropriate action, execute it with a dexterous hand, and leave the scene usable for later decisions. We introduce DexHoldem, a real-world system-level benchmark built around Texas Hold'em dexterous manipulation with a ShadowHand. DexHoldem provides 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark that tests whether agents can recover the structured game state needed for embodied decision making. On primitive execution, $\pi_{0.5}$ obtains the highest task completion rate ($61.2\%$), while $\pi_{0.5}$ and $\pi_0$ tie on scene-preserving success rate ($47.5\%$). On agentic perception, Opus 4.7 obtains the best strict problem-level accuracy ($34.3\%$), while GPT 5.5 obtains the best average field-wise accuracy ($66.8\%$), exposing a gap between isolated visual sub-capabilities and complete routing-relevant state recovery. Finally, we instantiate the full embodied-agent loop in three case studies, where waiting, recovery dispatches, human-help requests, and repeated primitive execution reveal how perception and policy errors accumulate during closed-loop deployment. DexHoldem therefore evaluates dexterous tabletop execution, agentic perception, and embodied decision routing in a shared physical setting. Project page: https://dexholdem.github.io/Dexholdem/.

Chinese Translation

在真实的灵巧硬件上评估具身系统需要的不仅仅是孤立的基本技能：代理必须感知变化的桌面场景，选择适合上下文的动作，使用灵巧的手执行该动作，并使场景在后续决策中可用。我们介绍了DexHoldem，这是一个围绕德州扑克灵巧操作构建的真实世界系统级基准，使用了ShadowHand。DexHoldem提供了1470个跨越14种德州扑克操作原语的遥控演示，一个标准化的物理策略基准，以及一个代理感知基准，用于测试代理是否能够恢复具身决策所需的结构化游戏状态。在基本执行方面，$\pi_{0.5}$获得了最高的任务完成率（$61.2\%$），而$\pi_{0.5}$和$\pi_0$在场景保持成功率上平局（$47.5\%$）。在代理感知方面，Opus 4.7获得了最佳严格问题级准确率（$34.3\%$），而GPT 5.5获得了最佳平均领域准确率（$66.8\%$），揭示了孤立视觉子能力与完整路由相关状态恢复之间的差距。最后，我们在三个案例研究中实例化了完整的具身代理循环，其中等待、恢复调度、人类帮助请求和重复的基本执行揭示了感知和策略错误在闭环部署中如何累积。因此，DexHoldem评估了灵巧的桌面执行、代理感知和具身决策路由在共享物理环境中的表现。项目页面：https://dexholdem.github.io/Dexholdem/

cs.RO / 74 / 2605.18729

Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction

Robo-Cortex：通过双粒度认知记忆和自主知识诱导的自我进化具身智能体

Chan, Nga Teng, Zhang, Yi, Liu, Yechi, Cui, Renwen, Zeng, Fanhu, Ding, Zeyuan, Ren, Xiancong, Zhang, Zhang, Chen, Qifeng, Liu, Jian, Dai, Yong, Ju, Xiaozhu

Abstract

The ability to navigate and interact with complex environments is central to real-world embodied agents, yet navigation in unseen environments remains challenging due to "experiential amnesia," where existing trajectory-driven or reactive policies fail to synthesize generalizable strategies from past interactions. We propose Robo-Cortex, a self-evolving framework that enables robots to autonomously induce navigation heuristics and refine cognitive strategies through a continuous reflection-adaptation loop. By abstracting success patterns and failure pitfalls into natural-language heuristics, Robo-Cortex enables a transition from passive execution to active strategy evolution. Our core innovation is an Autonomous Knowledge Induction (AKI) mechanism that distills multimodal trajectories into a structured Navigation Heuristic Library for knowledge generalization. The architecture further incorporates a Dual-Grain Cognitive Memory system, comprising a Short-term Reflective Memory (SRM) for real-time local progress analysis, and a Long-term Principle Memory (LPM) that abstracts past trajectories into reusable guiding and cautionary principles. To ensure robust decision-making, we introduce a multimodal Imagine-then-Verify loop, where a world model simulates potential outcomes and a VLM-based evaluator validates action plans. Extensive evaluations on IGNav, AR, and AEQA show that Robo-Cortex consistently outperforms strong baselines in both task success and exploration efficiency, with gains of up to +4.16% SPL over the strongest prior method and up to +15.30% SPL under heuristic transfer to unseen environments. Preliminary real-world robotic experiments further support the effectiveness of Robo-Cortex in physical settings.

Chinese Translation

在复杂环境中导航和互动的能力是现实世界具身智能体的核心，但在未知环境中的导航仍然具有挑战性，这主要是由于“经验性遗忘”，现有的基于轨迹驱动或反应式的策略无法从过去的互动中综合出可推广的策略。我们提出了Robo-Cortex，这是一种自我进化的框架，使机器人能够自主诱导导航启发式，并通过持续的反思-适应循环来完善认知策略。通过将成功模式和失败陷阱抽象为自然语言启发式，Robo-Cortex实现了从被动执行到主动策略进化的转变。我们的核心创新是自主知识诱导（Autonomous Knowledge Induction, AKI）机制，它将多模态轨迹提炼为结构化的导航启发式库，以实现知识的泛化。该架构进一步结合了双粒度认知记忆系统，包括用于实时本地进展分析的短期反思记忆（Short-term Reflective Memory, SRM），以及将过去轨迹抽象为可重复使用的指导和警示原则的长期原则记忆（Long-term Principle Memory, LPM）。为了确保稳健的决策，我们引入了多模态的想象-验证循环，其中世界模型模拟潜在结果，而基于视觉语言模型（VLM）的评估器验证行动计划。在IGNav、AR和AEQA上的广泛评估表明，Robo-Cortex在任务成功率和探索效率方面始终优于强基线，较最强的先前方法提升了高达+4.16%的成功路径长度（SPL），在向未知环境的启发式转移中提升了高达+15.30%的SPL。初步的现实世界机器人实验进一步支持了Robo-Cortex在物理环境中的有效性。

计算机视觉 (Computer Vision)

300

cs.CV / 1 / 2605.16317

Noise2Params: Unification and Parameter Determination from Noise via a Probabilistic Event Camera Model

Noise2Params：通过概率事件相机模型实现噪声的统一与参数确定

Root, Owen, Mujo, Julinda, Xu, Min

Abstract

Accurate, unified models for event cameras (ECs) remain elusive, hampering calibration and algorithm design. We develop a foundational probabilistic model for EC event detection, grounded in photon statistics, that unifies the description of static scene noise events and step response curves (S-curves) within a single analytical framework. Three formulations of the probability distributions are derived, spanning all intensity regimes: exact Poisson, saddle-point, and Gaussian. The model reveals the underlying connection between these otherwise disparate EC behaviors and clarifies the interpretation of S-curves, which we show is more nuanced than selecting a fixed probability threshold. Based on this model, we propose Noise2Params, a method for determining camera-specific values of the log-contrast threshold $B$, the lux-to-photon conversion factor $\alpha$, and the leakage term $\theta$ (found to be intensity dependent), via error minimization against observed noise-event distributions. Noise2Params requires only recordings of static, uniform scenes, offering an experimentally accessible alternative to approaches that demand specialized dynamic light sources. We further support the validity of the model by training convolutional neural networks (CNNs) on synthetic noise images generated from our distributions and evaluating their ability to reconstruct static scenes from experimental data. We also demonstrate the utility of our model by showing that CNNs incorporating synthetic data outperform those trained solely on experimental data. Our framework provides a quantitative foundation for EC calibration, noise-aware algorithm design, and applications in photon-limited regimes.

Chinese Translation

事件相机（EC）的准确统一模型仍然难以实现，这阻碍了校准和算法设计。我们开发了一种基于光子统计的事件相机事件检测基础概率模型，该模型在单一分析框架内统一了静态场景噪声事件和阶跃响应曲线（S-曲线）的描述。我们推导出了三种概率分布的公式，涵盖了所有强度范围：精确的泊松分布、鞍点分布和高斯分布。该模型揭示了这些原本不同的事件相机行为之间的内在联系，并澄清了S-曲线的解释，我们表明其比选择固定概率阈值更为复杂。基于该模型，我们提出了Noise2Params，一种通过对观察到的噪声事件分布进行误差最小化，确定特定相机的对数对比度阈值 $B$、光照到光子转换因子 $\alpha$ 和泄漏项 $\theta$（发现与强度相关）的方法。Noise2Params仅需静态均匀场景的录制，提供了一种实验上可行的替代方案，避免了对专用动态光源的需求。我们进一步通过在合成噪声图像上训练卷积神经网络（CNN）来支持该模型的有效性，并评估其从实验数据重建静态场景的能力。我们还展示了我们的模型的实用性，表明结合合成数据的CNN在性能上优于仅在实验数据上训练的CNN。我们的框架为事件相机的校准、噪声感知算法设计以及在光子限制条件下的应用提供了定量基础。

cs.CV / 2 / 2605.16353

StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

StrLoRA：面向多模态大语言模型的流式持续视觉指令调优

Che, Chang, Wang, Ziqi, Ma, Hui, Wang, Cheems, Shi, Zenglin

Abstract

Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models to incrementally acquire new abilities. However, existing CVIT methods operate under a restrictive task-incremental setting, where each training phase corresponds to a single, predefined task. This does not reflect real-world conditions, where data arrives as a continuous stream of interleaved and dynamically evolving tasks. To bridge this gap, we introduce Streaming CVIT (StrCVIT), a more general and realistic setting where models learn from a stream of data chunks containing a dynamic mixture of tasks. In StrCVIT, a model must simultaneously acquire new abilities, reinforce recurring abilities, and mitigate forgetting. Existing CVIT methods fail here as they cannot reliably distinguish or adapt to the heterogeneous task samples within each chunk. We therefore propose StrLoRA, a regularized two-stage expert routing framework. StrLoRA first performs task-aware expert selection using the textual instruction to activate a sparse subset of relevant experts, reducing cross-task interference. It then applies token-wise expert weighting within this subset, where contribution weights are computed via cross-modal attention between local visual tokens and the global instruction representation. To maintain stability across the non-stationary stream, a routing-stability regularization aligns current routing distributions with a historical exponential moving average reference. Extensive experiments on a newly developed StrCVIT benchmark show that StrLoRA substantially outperforms existing methods, effectively enhancing the model's abilities from continuously evolving data streams.
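The routing-stability regularization described in this abstract aligns the current routing distribution with an exponential moving average (EMA) of past routing. A minimal sketch of that idea follows. The abstract does not name the divergence or the decay rate, so the KL divergence, the default decay of 0.9, and the function names are assumptions made for illustration.

```python
import math


def ema_update(reference, current, decay=0.9):
    """Blend the historical routing reference with the current distribution."""
    return [decay * r + (1.0 - decay) * c for r, c in zip(reference, current)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two routing distributions over the same experts.

    Assumption: KL is used as the alignment penalty; eps avoids log(0).
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def routing_stability_penalty(current, reference):
    """Penalty that grows as current routing drifts from the EMA reference."""
    return kl_divergence(current, reference)
```

In a training loop under these assumptions, the penalty would be added to the task loss for each chunk, and `ema_update` would then refresh the reference so that it tracks routing slowly over the stream.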

Chinese Translation

持续视觉指令调优（CVIT）使多模态大语言模型能够逐步获取新能力。然而，现有的CVIT方法在一个限制性的任务增量设置下运行，其中每个训练阶段对应于一个单一的预定义任务。这并不反映现实世界的条件，在现实中，数据以连续的、交错的和动态演变的任务流的形式到达。为了解决这一问题，我们引入了流式CVIT（StrCVIT），这是一个更一般和现实的设置，模型从包含动态任务混合的数据块流中学习。在StrCVIT中，模型必须同时获取新能力、强化重复能力并减轻遗忘。现有的CVIT方法在这里失败，因为它们无法可靠地区分或适应每个数据块内的异构任务样本。因此，我们提出了StrLoRA，一个正则化的两阶段专家路由框架。StrLoRA首先使用文本指令进行任务感知的专家选择，以激活相关专家的稀疏子集，从而减少跨任务干扰。然后，在这个子集中应用逐标记的专家加权，其中贡献权重通过局部视觉标记和全局指令表示之间的跨模态注意力计算。为了在非平稳流中保持稳定性，路由稳定性正则化将当前路由分布与历史指数移动平均参考对齐。在新开发的StrCVIT基准上的大量实验表明，StrLoRA显著优于现有方法，有效提升了模型从不断演变的数据流中获取能力的效果。

cs.CV / 3 / 2605.16359

How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

多模态语言模型需要多少视觉标记？基于 F^3A 的视觉标记剪枝扩展

Huang, YiJie, Zhang, Yiqun, Jia, Zhuoyue, Yang, Xiaocui, Huang, Junzhao, Wang, Zihan, Feng, Shi, Wang, Daling, Zhang, Yifei, Liu, Yongkang

Abstract

Vision-language models improve perception by feeding increasingly long visual token sequences into language backbones, but the resulting inference cost raises a basic scaling question: as multimodal models grow, how many visual tokens are actually needed, and how should they be allocated under a fixed visual token budget? Existing training-free pruning methods typically answer this with one-shot proxies such as decoder attention, visual similarity, or conditional diversity. We argue that visual token pruning is better viewed as task-conditioned evidence search, especially under aggressive compression and across model scales. We propose F^3A, a training-free router for visual token pruning that operates before the language model consumes image tokens. F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a fixed vision token budget via coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions. It requires no model training and no extra LLM forward pass, and it preserves the original multimodal prompting and decoding pipeline.

Chinese Translation

视觉语言模型通过将越来越长的视觉标记序列输入语言主干来改善感知，但由此产生的推理成本引发了一个基本的扩展问题：随着多模态模型的增长，实际上需要多少视觉标记，以及在固定的视觉标记预算下应如何分配这些标记？现有的无训练剪枝方法通常通过一次性代理（如解码器注意力、视觉相似性或条件多样性）来回答这个问题。我们认为，视觉标记剪枝更应视为任务条件下的证据搜索，特别是在激进压缩和不同模型规模下。我们提出了 F^3A，这是一种无训练的视觉标记剪枝路由器，在语言模型处理图像标记之前进行操作。F^3A 构建轻量级的基于问题的提示，通过冻结的稀疏感知头将其与视觉网格标记匹配，并通过粗略证据定位、局部细化、覆盖保持竞争和恢复覆盖不足区域来分配固定的视觉标记预算。它不需要模型训练，不需要额外的 LLM 前向传递，并且保留了原始的多模态提示和解码管道。


cs.CV / 4 / 2605.16366

Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs

Fre-Res：用于高效视频多模态大语言模型的频率残差视频令牌压缩

Feng, Yigui, Wang, Qinglin, Liu, Yang, Liu, Jie

Abstract

Video MLLMs face a persistent tension between spatial fidelity and temporal coverage: preserving fine-grained visual details requires many spatial tokens, while capturing short-lived events requires dense temporal sampling. We propose Fre-Res, a budget-adaptive dual-track video-token compression framework that separates these two forms of evidence. Fre-Res preserves sparse high-fidelity spatial anchors and represents dense temporal evolution through compact residual-frequency tokens. Specifically, it applies temporal 1D-DCT to inter-frame residual trajectories in vision-latent space, where we observe strong low-frequency concentration. To align frequency-domain dynamics with native visual embeddings, Fre-Res introduces a Spatial-Guided Absorber that injects temporal residual information into spatially corresponding anchor tokens. Across fine-grained short-video and long-video reasoning benchmarks, Fre-Res achieves a favorable accuracy-efficiency trade-off, matching or approaching full-token performance while substantially reducing visual-token length. Extensive ablations further show that temporal-frequency residuals preserve causal transition cues, while spatial anchors remain essential for fine-grained object and layout reasoning.
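The temporal 1D-DCT step described above can be illustrated with a minimal sketch. It assumes a single scalar latent trajectory over time and an orthonormal DCT-II; the function names (`compress_residual_trajectory`, `reconstruct_trajectory`) and the `keep` parameter are hypothetical and are not taken from the paper, whose actual tokenization is not specified in the abstract.

```python
import math

def dct_ii(signal):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct_ii(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    out = []
    for i in range(n):
        s = coeffs[0] * math.sqrt(1.0 / n)
        s += sum(coeffs[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out

def compress_residual_trajectory(latents, keep):
    """Keep the `keep` lowest-frequency DCT coefficients of the inter-frame residuals.

    latents: values of one latent dimension over T frames (illustrative assumption).
    """
    residuals = [latents[t] - latents[t - 1] for t in range(1, len(latents))]
    return dct_ii(residuals)[:keep]

def reconstruct_trajectory(anchor, coeffs, num_residuals):
    """Zero-pad the kept coefficients, invert the DCT, and integrate from the anchor frame."""
    padded = list(coeffs) + [0.0] * (num_residuals - len(coeffs))
    trajectory = [anchor]
    for r in idct_ii(padded):
        trajectory.append(trajectory[-1] + r)
    return trajectory
```

When the residual energy really is concentrated at low frequencies, a small `keep` reconstructs the trajectory closely; a trajectory with constant frame-to-frame change is recovered exactly from the single DC coefficient.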

Chinese Translation

视频多模态大语言模型（MLLMs）面临空间保真度与时间覆盖之间的持续矛盾：保留细致的视觉细节需要大量空间令牌，而捕捉短暂事件则需要密集的时间采样。我们提出了Fre-Res，一种预算自适应的双轨视频令牌压缩框架，旨在分离这两种证据形式。Fre-Res保留稀疏的高保真空间锚点，并通过紧凑的残差频率令牌表示密集的时间演变。具体而言，它对视觉潜在空间中的帧间残差轨迹应用时间一维离散余弦变换（1D-DCT），在此我们观察到强烈的低频集中。为了将频域动态与原生视觉嵌入对齐，Fre-Res引入了一种空间引导吸收器，将时间残差信息注入空间对应的锚点令牌。在细粒度短视频和长视频推理基准测试中，Fre-Res实现了良好的准确性与效率的权衡，匹配或接近全令牌性能，同时显著减少视觉令牌长度。大量消融实验进一步表明，时间频率残差保留了因果转变线索，而空间锚点对于细粒度物体和布局推理仍然至关重要。


cs.CV / 5 / 2605.16371

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

GeoSym127K：可扩展的符号验证合成用于多模态几何推理

Jing, Jinhao, Ma, Zheng, Liang, Jinwei, Zhao, Qiannian, Chen, Shawn, Yang, Jing, Yee, Por Lip, Tiwari, Prayag, Bai, Jingjing, Wang, Benyou, Lu, Lewei, Su, Zhan

Abstract

Large Multimodal Models (LMMs) often struggle with geometric reasoning due to visual hallucinations and a lack of mathematically precise Chain-of-Thought (CoT) data. To address this, we propose the GeoSym Engine, an automated and scalable neuro-symbolic framework. By leveraging a type-conditional grammar and an analytic SymGT Solver, it derives exact symbolic ground truths and seamlessly integrates with a robust rendering pipeline to produce high-precision geometric diagrams. Using this engine, we construct GeoSym127K, a difficulty-stratified dataset featuring 51K high-resolution images, 127K questions with symbolic ground truths, and 55K answer-verified CoT QA pairs. We also introduce GeoSym-Bench, an expert-curated suite of 511 complex samples for rigorous evaluation. Through extensive supervised fine-tuning (SFT), we demonstrate that GeoSym drives concentrated improvements specifically on diagram-dependent and multi-step geometry tasks. Our Qwen3-VL-8B model gains an absolute +22.21% on the MathVerse Vision-Only subset and reaches 61.52% (+6.19% improvement) on WeMath, mitigating long-horizon logic fragmentation and outperforming advanced closed-source models like Doubao-1.8. Furthermore, applying Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO reveals that initializing from structural SFT checkpoints substantially elevates the performance ceiling over zero-shot RL. Driven by deterministic exact-match signals, this showcases the robust scaling potential of our verifiable reasoning synthesis. Datasets and code are available at https://huggingface.co/datasets/Tomie0506/GeoSym127K and https://github.com/Tomie56/GeoSym127K.

Chinese Translation

大型多模态模型（LMMs）在几何推理方面常常面临视觉幻觉和缺乏数学精确的思维链（Chain-of-Thought, CoT）数据的问题。为了解决这一问题，我们提出了GeoSym引擎，这是一种自动化且可扩展的神经符号框架。通过利用类型条件语法和分析性SymGT求解器，它能够推导出精确的符号基础真理，并与强大的渲染管道无缝集成，以生成高精度的几何图形。利用该引擎，我们构建了GeoSym127K，这是一个难度分层的数据集，包含51K张高分辨率图像、127K个具有符号基础真理的问题以及55K个经过答案验证的CoT问答对。我们还推出了GeoSym-Bench，这是一个由专家策划的511个复杂样本的评估套件。通过广泛的监督微调（SFT），我们证明GeoSym在依赖图形的多步骤几何任务上驱动了显著的改进。我们的Qwen3-VL-8B模型在MathVerse视觉仅子集上获得了绝对+22.21%的提升，并在WeMath上达到了61.52%（+6.19%的改进），减轻了长时间逻辑碎片化的问题，并超越了像Doubao-1.8这样的先进闭源模型。此外，通过GRPO应用可验证奖励的强化学习（Reinforcement Learning with Verifiable Rewards, RLVR）表明，从结构性SFT检查点初始化显著提高了性能上限，相较于零-shot RL表现更佳。受确定性精确匹配信号的驱动，这展示了我们可验证推理合成的强大扩展潜力。数据集和代码可在https://huggingface.co/datasets/Tomie0506/GeoSym127K和https://github.com/Tomie56/GeoSym127K获取。


cs.CV / 6 / 2605.16372

SwordBench: Evaluating Orthogonality of Steering Image Representations

SwordBench：评估图像表示的正交性

Zaigrajew, Vladimir, Pludowski, Dawid, Baniecki, Hubert, Biecek, Przemyslaw

Abstract

Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.
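As a rough illustration of what orthogonalization among concept activation vectors involves, the sketch below removes a concept direction from a representation by linear projection and orthogonalizes a list of concept vectors with Gram-Schmidt. This is a generic linear-algebra example under the assumption that each concept is a single direction; it is not SwordBench's protocol, and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(x, direction):
    """Remove from x its component along `direction` (a concept vector)."""
    coef = dot(x, direction) / dot(direction, direction)
    return [a - coef * b for a, b in zip(x, direction)]

def orthogonalize(cavs):
    """Gram-Schmidt: make each concept vector orthogonal to all earlier ones."""
    basis = []
    for v in cavs:
        w = list(v)
        for b in basis:
            w = project_out(w, b)
        basis.append(w)
    return basis

def cosine(u, v):
    """Cosine similarity, a simple way to check how entangled two concept vectors are."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
```

Steering a representation against one concept and then re-measuring how well another concept is still detected is the kind of second-order check the abstract calls cross-concept robustness.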

Chinese Translation

在推理时对模型表示进行引导或干预以修正预测，对于人工智能的可解释性和安全性至关重要，但现有的评估协议仅限于模糊的语言建模任务。为了解决这一问题，我们引入了SwordBench，这是一个针对视觉模型的图像表示引导的基准测试，涵盖多个骨干网络和概念移除任务。除了统一的基准测试套件外，我们还提出了新的评估概念，揭示了概念激活向量之间正交化的二阶效应，以实现实用的引导。具体而言，跨概念鲁棒性衡量了在与替代概念正交化的输入上，概念检测性能的稳定性，而附带损害则量化了引导是否无意中影响了缺乏偏见的输入在下游任务上的模型性能。我们的研究发现，尽管线性支持向量机表现出优越的可分离性和正交性，但它未能实现零附带损害，通常落后于稀疏自编码器。在更简单的情况下，标准基准和基于优化的方法都未能实现完美的引导。源代码将很快在GitHub上发布。


cs.CV / 7 / 2605.16373

Cross-Source Supervision for Bone Infection Segmentation in Dual-Modality PET-CT

双模态PET-CT中骨感染分割的跨源监督

Yang, Zonglin, Diao, Xiaolei, Chen, Jishizhan, Man, Xiaozhuang, Kong, Wei, Wen, Gen, Cheng, Pengfei, Shi, Daqian

Abstract

Early and accurate diagnosis and lesion localization of bone infections are crucial for clinical treatment. PET-CT integrates anatomical information from CT with metabolic information from PET, making it an important imaging modality for diagnosing bone infections. However, accurate lesion segmentation remains challenging due to indistinct lesion boundaries and inconsistencies in annotations generated by different experts or automated systems. In this work, we investigate multimodal segmentation of bone infections under annotation discrepancy. We develop a bimodal end-to-end segmentation framework that integrates PET metabolic signals and CT bone-window anatomy through an early-fusion multimodal representation. To mitigate performance inflation caused by inter-slice correlation in small datasets, this study discards traditional two-dimensional evaluation methods and implements a rigorous patient-level 3D volumetric evaluation and cross-validation. Furthermore, instead of forcing a singular consensus, we propose a decoupled dual-source learning framework where parallel models are trained on independent expert annotations driven by high-sensitivity and high-specificity clinical intents. Experimental results objectively report performance variations at the patient level (Mean + SD and Mean - SD), demonstrating the effectiveness of multimodal PET-CT fusion. The cross-evaluation matrix quantitatively reveals how models successfully internalize distinct expert diagnostic philosophies, providing a robust, diversity-preserving paradigm for clinical AI deployment in bone infection segmentation.

Chinese Translation

骨感染的早期准确诊断和病灶定位对临床治疗至关重要。PET-CT将CT的解剖信息与PET的代谢信息相结合，使其成为诊断骨感染的重要影像学模式。然而，由于病灶边界不清晰以及不同专家或自动化系统生成的注释不一致，准确的病灶分割仍然具有挑战性。在本研究中，我们探讨了在注释不一致情况下的骨感染多模态分割。我们开发了一个双模态端到端分割框架，通过早期融合多模态表示，整合了PET代谢信号和CT骨窗解剖信息。为了减轻小数据集中因切片间相关性导致的性能膨胀，本研究摒弃了传统的二维评估方法，实施了严格的患者级三维体积评估和交叉验证。此外，我们提出了一种解耦的双源学习框架，而不是强制达成单一共识，在独立专家注释的基础上训练并行模型，以高灵敏度和高特异性的临床意图为驱动。实验结果客观地报告了患者级别的性能变化（均值 + 标准差和均值 - 标准差），证明了多模态PET-CT融合的有效性。交叉评估矩阵定量揭示了模型如何成功内化不同专家的诊断理念，为骨感染分割中的临床人工智能部署提供了一个稳健的、多样性保留的范式。


cs.CV / 8 / 2605.16381

StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video

StreamPro：从反应式感知到主动决策的流媒体视频理解

Li, Ao, Xiao, Zihan, Yue, Zihao, Xu, Boshen, Yao, Linli, Li, Jiaze, Fu, Pei, Ju, Jianzhong, Luan, Jian, Jin, Qin

Abstract

Proactive streaming video understanding requires models to continuously process video streams and decide when to respond, rather than merely what to respond. This naturally introduces a decision-making problem under partial observations, where models must balance early prediction against sufficient evidence. However, existing benchmarks largely follow a "see-then-answer" paradigm, where responses are triggered only after explicit evidence appears, effectively reducing proactive reasoning to delayed perception. As a result, they fail to evaluate a model's ability to make timely and reliable decisions under incomplete observations. Moreover, training proactive models is inherently challenging due to the extreme imbalance between silence and response signals in streaming trajectories, as well as the need to jointly optimize response correctness and timing. To address these challenges, we introduce StreamPro-Bench, a new benchmark that evaluates streaming models from three complementary perspectives: Perception Understanding, Temporal Reasoning, and Proactive Agency, where the last measures a model's ability to make early yet reliable decisions under partial observations. We further propose StreamPro, a two-stage training framework for proactive learning. First, we introduce CB-Stream Loss to mitigate the severe supervision imbalance during supervised fine-tuning (SFT). Then, we apply Group Relative Policy Optimization (GRPO) with a multi-grained reward design that involves both turn-level and trajectory-level rewards. Experiments show that StreamPro significantly improves proactive performance. On StreamPro-Bench, it achieves 41.5, substantially outperforming the previous best (10.4), while also maintaining strong performance on real-time streaming benchmarks, achieving 78.9 on StreamingBench-RTVU.
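The group-relative normalization at the core of GRPO can be sketched in a few lines. The advantage formula (reward minus group mean, divided by group standard deviation) is the standard GRPO form; the way turn-level and trajectory-level rewards are mixed in `combine_rewards`, including the `turn_weight` parameter, is purely a hypothetical placeholder, since the abstract does not describe the actual reward design.

```python
import math

def combine_rewards(turn_rewards, trajectory_reward, turn_weight=0.5):
    """Hypothetical mix of the mean turn-level reward and one trajectory-level reward."""
    mean_turn = sum(turn_rewards) / len(turn_rewards)
    return turn_weight * mean_turn + (1.0 - turn_weight) * trajectory_reward

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group.

    Uses the population standard deviation; eps avoids division by zero
    when every response in the group receives the same reward.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

A group in which all responses score the same yields zero advantages and therefore no gradient signal, which is one reason reward designs with several granularities can help when most frames call for silence.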

Chinese Translation

主动流媒体视频理解要求模型持续处理视频流并决定何时响应，而不仅仅是决定响应什么。这自然引入了一个在部分观察下的决策问题，模型必须在早期预测与充分证据之间进行平衡。然而，现有基准大多遵循“先看后答”的范式，响应仅在明确证据出现后触发，实际上将主动推理简化为延迟感知。因此，它们未能评估模型在不完整观察下及时且可靠决策的能力。此外，由于流媒体轨迹中沉默与响应信号之间的极端不平衡，以及需要共同优化响应的正确性和时机，训练主动模型本质上具有挑战性。为了解决这些挑战，我们引入了StreamPro-Bench，一个新的基准，从三个互补的角度评估流媒体模型：感知理解、时间推理和主动代理，其中最后一个衡量模型在部分观察下做出早期且可靠决策的能力。我们进一步提出了StreamPro，一个用于主动学习的两阶段训练框架。首先，我们引入CB-Stream损失，以减轻监督微调（SFT）期间严重的监督不平衡。然后，我们应用了带有多粒度奖励设计的组相对策略优化（GRPO），该设计涉及转弯级和轨迹级奖励。实验表明，StreamPro显著提升了主动性能。在StreamPro-Bench上，它达到了41.5，显著超越了之前的最佳成绩（10.4），同时在实时流媒体基准上也保持了强劲的表现，在StreamingBench-RTVU上达到了78.9。


cs.CV / 9 / 2605.16383

A neurosymbolic Approach with Epistemic Deep Learning for Hierarchical Image Classification

一种结合知识符号方法与认知深度学习的层次图像分类研究

Kilicdere, Ezel, Manchingal, Shireen Kudukkil, Cuzzolin, Fabio

Abstract

Deep neural networks achieve high accuracy on image classification tasks. Yet, they often produce overconfident predictions that fail to express epistemic uncertainty, and frequently violate logical or structural constraints present in the data. These limitations are particularly pronounced in hierarchical classification, where predictions across fine and coarse levels must remain coherent. We propose, for the first time, a unified neurosymbolic and epistemic modelling framework that augments Swin Transformers with focal set reasoning and differentiable fuzzy logic. Rather than treating labels as isolated categories, our method induces data-driven focal sets within the learnt embedding space, which helps capture epistemic uncertainty over multiple plausible fine-grained classes. These focal sets form the basis of a belief-theoretic layer that uses fuzzy membership functions and t-norm conjunctions to encourage consistency between fine- and coarse-grained predictions. A learnable loss further balances calibration, mass regularisation, and logical consistency, allowing the model to adaptively trade off symbolic structure with data-driven evidence. In experiments on hierarchical image classification, our framework maintains accuracy on par with transformer baselines while providing more calibrated and interpretable predictions, reducing overconfidence and enforcing high logical consistency across hierarchical outputs. Our experimental results show that combining focal set reasoning with fuzzy logic provides a practical step toward deep learning models that are both accurate and epistemically aware.
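To make the idea of a t-norm conjunction between fine- and coarse-grained predictions concrete, here is a small sketch. It assumes plain class probabilities and a fixed fine-to-coarse `parent` map, and scores agreement by conjoining each fine-class probability with the probability of its parent. The three t-norms are standard; `hierarchy_consistency` and `consistency_loss` are hypothetical helpers and do not reproduce the paper's belief-theoretic layer or focal sets.

```python
def t_norm_product(a, b):
    """Product t-norm."""
    return a * b

def t_norm_godel(a, b):
    """Goedel (minimum) t-norm."""
    return min(a, b)

def t_norm_lukasiewicz(a, b):
    """Lukasiewicz t-norm."""
    return max(0.0, a + b - 1.0)

def hierarchy_consistency(fine_probs, coarse_probs, parent, t_norm=t_norm_product):
    """Degree in [0, 1] to which fine and coarse predictions agree.

    parent[c] is the index of the coarse class that fine class c belongs to.
    """
    score = sum(t_norm(p, coarse_probs[parent[c]]) for c, p in enumerate(fine_probs))
    return min(1.0, score)  # clip, since non-product t-norms can sum above 1

def consistency_loss(fine_probs, coarse_probs, parent, t_norm=t_norm_product):
    """Penalty that is zero when the two levels fully agree."""
    return 1.0 - hierarchy_consistency(fine_probs, coarse_probs, parent, t_norm)
```

With the product t-norm the score is differentiable in both probability vectors, which is why product-style conjunctions are a common choice when such a term is added to a training loss.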

Chinese Translation

深度神经网络在图像分类任务中实现了高准确率。然而，它们往往产生过于自信的预测，未能表达认知不确定性，并且常常违反数据中存在的逻辑或结构约束。这些局限性在层次分类中尤为明显，因为细粒度和粗粒度的预测必须保持一致。我们首次提出了一个统一的知识符号与认知建模框架，该框架通过聚焦集推理和可微分模糊逻辑增强了Swin Transformers。我们的方法并不将标签视为孤立的类别，而是在学习的嵌入空间中诱导数据驱动的聚焦集，这有助于捕捉多个可能的细粒度类别的认知不确定性。这些聚焦集构成了一个基于信念理论的层，该层使用模糊隶属函数和t-范数结合来鼓励细粒度和粗粒度预测之间的一致性。一个可学习的损失函数进一步平衡了校准、质量正则化和逻辑一致性，使模型能够自适应地在符号结构与数据驱动证据之间进行权衡。在层次图像分类的实验中，我们的框架在准确性上与变换器基线持平，同时提供了更为校准和可解释的预测，减少了过度自信，并在层次输出中强化了高逻辑一致性。我们的实验结果表明，将聚焦集推理与模糊逻辑结合为实现既准确又具认知意识的深度学习模型提供了一个实用的步骤。


cs.CV / 10 / 2605.16384

Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice

全局标记与补丁标记之间的相互增强：从理论到实践

Huang, Xiusheng, Jiang, Xin, Zhao, Jun, Liu, Kang, Wang, Yequan

Abstract

Accurate and effective discrete image tokenization is crucial for long image sequence processing. However, current methods rigidly compress all content at a fixed rate, ignoring the variable information density of images and leading to either redundancy or information loss. Inspired by information entropy, we propose TaTok, a Theoretically grounded adaptive image Tokenization framework. We rigorously identify two key drawbacks in existing methods: information insufficiency when reconstructing images with patch tokens alone, and information redundancy among patch tokens. To address these, we introduce global tokens that model mutual information across patch tokens, and a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy to eliminate redundancy. Experiments confirm TaTok's state-of-the-art performance, delivering a 1.3x gFID improvement and 8.7x inference speedup. By allocating tokens according to information richness, TaTok enables more compressed yet accurate image tokenization, offering valuable insights for future research.

Chinese Translation

准确有效的离散图像标记化对于长图像序列处理至关重要。然而，目前的方法在固定速率下僵化地压缩所有内容，忽视了图像的信息密度变化，导致冗余或信息丢失。受到信息熵的启发，我们提出了TaTok，一个理论基础的自适应图像标记化框架。我们严格识别出现有方法的两个关键缺陷：仅使用补丁标记重建图像时的信息不足，以及补丁标记之间的信息冗余。为了解决这些问题，我们引入了全局标记，以建模补丁标记之间的互信息，并基于累积条件熵的动态标记过滤（Dynamic Token Filtering, DTF）算法来消除冗余。实验结果证实了TaTok的最先进性能，实现了1.3倍的gFID提升和8.7倍的推理加速。通过根据信息丰富度分配标记，TaTok实现了更压缩但更准确的图像标记化，为未来研究提供了宝贵的见解。


cs.CV / 11 / 2605.16385

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Hilbert-Geo：通过神经符号推理解决固体几何问题

Xu, Ruoran, Cheng, Haoyu, Dong, Bin, Wang, Qiufeng

Abstract

Geometric problem solving, a typical multimodal reasoning problem, has attracted much attention and made great progress recently. However, most work focuses on plane geometry and usually fails on solid geometry because of 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose Parse2Reason, a two-step method that first parses and then reasons. In the parsing step, we use conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both the problem description (natural text) and the solid diagram (visual image). In the reasoning step, we leverage the formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions, and answers. Extensive experiments show that our method achieves state-of-the-art (SOTA) performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (a small subset of MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets will be publicly available.

Chinese Translation

几何问题求解作为一种典型的多模态推理问题，近年来引起了广泛关注并取得了显著进展。然而，大多数研究集中于平面几何，而在固体几何方面通常失败，原因在于三维空间图形和复杂的推理。为了解决这一问题，我们提出了Hilbert-Geo，这是第一个统一的固体几何形式语言框架，包括一个广泛的谓词库和一个专门的定理库。基于该框架，我们提出了一种Parse2Reason方法，包含两个步骤：首先解析，然后推理。在解析步骤中，我们利用条件描述语言（Conditional Description Language, CDL），这是一种由专门设计的谓词组成的形式化语言，用于构建几何条件，以表示问题描述（自然文本）和固体图形（视觉图像）。在推理步骤中，我们利用这些形式化的CDL和定理库进行关系推理和代数计算，生成严格正确、可验证且人类可读的推理过程。值得注意的是，我们提出的Hilbert-Geo同样适用于平面几何。为了推动几何推理的发展，我们整理了两个专家注释的数据集SolidFGeo2k和PlaneFGeo3k，这些数据集配备了几何形式语言注释、解决方案和答案。大量实验表明，我们提出的方法在SolidFGeo2k上达到了77.3%的最新性能（SOTA），在MathVerse-Solid（MathVerse中专门针对固体几何的小子集）上达到了84.1%，显著超越了领先的多模态大语言模型（MLLM），如Gemini-2.5-pro（在SolidFGeo2k上为54.2%）和GPT-5（在MathVerse-Solid上为62.9%）。此外，我们的方法在PlaneFGeo3k上达到了80.2%的最新准确率，展示了Hilbert-Geo在几何推理中的普适性。我们的代码和数据集将公开发布。


cs.CV / 12 / 2605.16386

Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

审计多模态大语言模型评分者：临床序数评分中的中心倾向偏差

Zhang, Jiaqing, Elluri, Sandeep, Cherukuvada, Bhanu, Joffe, Yonah, Sena, Jessica, Contreras, Miguel, Siegel, Scott, Nerella, Subhash, Price, Catherine, Rashidi, Parisa

Abstract

Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical settings, yet their scoring behavior on ordinal clinical scales remains poorly understood. We benchmark three frontier LLM families against supervised deep learning models for scoring Clock Drawing Test (CDT) images on two public datasets using the Shulman rubric. While fully fine-tuned Vision Transformers achieve the best calibration (MAE 0.52, within-1 accuracy 91%), zero-shot LLMs remain competitive on tolerance-based agreement (GPT-5 MAE 0.67, within-1 accuracy 92%) despite higher absolute error. However, per-score analysis reveals that all three LLM families exhibit a pronounced central tendency effect (systematic endpoint compression): predictions are systematically compressed toward the middle of the scale, with over-prediction at the low end (score 0 to 1) and under-prediction at the high end (score 5 to 4). This effect disproportionately affects the clinically critical extremes where accurate scoring most impacts screening decisions for cognitive impairment. Targeted ablations show that neither few-shot exemplars spanning the full score range nor removing clinical terminology from the prompt eliminates the effect. Our findings extend the LLM-as-a-judge bias literature from NLP evaluation to clinical assessment, and highlight the need for calibration-aware evaluation and post-hoc calibration before deploying LLM-based raters in high-stakes screening workflows.
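The metrics named above (MAE, within-1 accuracy, and a per-score view of the error) are simple to compute, and a sketch may help show why aggregate agreement can hide endpoint compression. The function names are illustrative; the per-score statistic used here, mean signed error grouped by the true score, is one reasonable choice and is an assumption rather than the paper's exact analysis.

```python
from collections import defaultdict

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ordinal scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions that land within one point of the true score."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)

def bias_by_score(y_true, y_pred):
    """Mean signed error (prediction minus truth) for each true score.

    Positive values at the low end and negative values at the high end
    indicate compression toward the middle of the scale.
    """
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(p - t)
    return {score: sum(v) / len(v) for score, v in sorted(errors.items())}
```

A rater that always moves the extremes one point inward can still score perfectly on within-1 accuracy, which is why the per-score breakdown is the more informative diagnostic here.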

Chinese Translation

多模态大语言模型（LLMs）在临床环境中作为自动评估工具的探索日益增多，但它们在序数临床量表上的评分行为仍然知之甚少。我们对三种前沿LLM家族进行了基准测试，与监督深度学习模型在两个公共数据集上使用Shulman评分标准对时钟绘制测试（CDT）图像进行评分。尽管完全微调的视觉变换器（Vision Transformers）实现了最佳的校准（平均绝对误差MAE 0.52，误差在1以内的准确率91%），但零-shot LLM在基于容忍度的协议上仍然具有竞争力（GPT-5 MAE 0.67，误差在1以内的准确率92%），尽管其绝对误差较高。然而，逐分数分析显示，所有三种LLM家族都表现出明显的中心倾向效应（系统性终点压缩）：预测系统性地向量表中间压缩，在低端（得分0到1）出现过度预测，而在高端（得分5到4）则出现低估。这一效应对临床关键的极端值影响尤为显著，因为准确评分对认知障碍筛查决策的影响最大。针对性的消融实验表明，无论是跨越整个得分范围的少量示例，还是从提示中移除临床术语，都无法消除该效应。我们的研究将LLM作为评判者的偏见文献从自然语言处理（NLP）评估扩展到临床评估，并强调在高风险筛查工作流程中部署基于LLM的评分者之前，进行校准意识评估和事后校准的必要性。


cs.CV / 13 / 2605.16387

Stabilizing Temporal Inference Dynamics for Online Surgical Phase Recognition

稳定在线外科阶段识别的时间推理动态

Liu, Yang, Zhu, Ning, Peng, Jingjing, Chen, Xiwu, Granados, Alejandro, Wang, Guotai, Ourselin, Sebastien

Abstract

Online Surgical Phase Recognition (SPR) models can reach high frame-wise accuracy, yet their predictions often lack temporal stability, fragmenting workflow understanding and reducing the reliability of downstream assistance. We show that this instability is not random noise but arises from two mechanisms: early misclassifications corrupt temporal feature states and propagate forward to form error cascades, and phase transitions follow evidence-accumulation dynamics whereas most online SPR systems rely on memoryless frame-wise decisions, making them sensitive to transient confidence fluctuations. We propose a unified Train-Inference-Evaluation framework that explicitly stabilizes temporal inference dynamics using model-agnostic, plug-and-play components. For training, the Temporal Error-Cascade (TEC) loss suppresses error onset and mitigates forward error propagation by stabilizing temporal feature evolution. For inference, the Evidence-Gated Transition Predictor (EGTP) enforces evidence-driven state transitions, allowing phase changes only when accumulated evidence exceeds a confidence boundary. For evaluation, we introduce the Temporal Fragmentation Index (TFI), a reliability-aware metric that quantifies instability-induced temporal disagreement beyond conventional frame-wise and token-based measures. Experiments on Cholec80 and AutoLaparo across three representative backbones show that the proposed framework substantially improves temporal stability and reduces prediction fragmentation, while maintaining or modestly improving frame-wise performance.
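The contrast between memoryless frame-wise decisions and evidence-driven transitions can be sketched with a CUSUM-style gate. This is an illustrative stand-in, not the paper's EGTP: it assumes per-frame phase probabilities are available, accumulates the log-likelihood ratio of each candidate phase against the current one, and switches only when the accumulated evidence exceeds a threshold. The function name and the default `threshold` are hypothetical.

```python
import math

def evidence_gated_decode(frame_probs, threshold=3.0, eps=1e-12):
    """Decode a phase sequence online, changing phase only on accumulated evidence.

    frame_probs: list of per-frame probability lists, one entry per phase.
    """
    num_phases = len(frame_probs[0])
    current = max(range(num_phases), key=lambda k: frame_probs[0][k])
    evidence = [0.0] * num_phases
    decoded = []
    for probs in frame_probs:
        for k in range(num_phases):
            if k == current:
                continue
            # Log-likelihood ratio of candidate phase k against the current phase.
            llr = math.log(probs[k] + eps) - math.log(probs[current] + eps)
            # Floor at zero so evidence for staying does not bank up indefinitely.
            evidence[k] = max(0.0, evidence[k] + llr)
        best = max(range(num_phases), key=lambda k: evidence[k])
        if evidence[best] > threshold:
            current = best
            evidence = [0.0] * num_phases
        decoded.append(current)
    return decoded
```

A single low-confidence flicker toward another phase is absorbed, while a sustained run of confident frames triggers the transition after a short delay; the threshold sets the trade-off between stability and latency.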

Chinese Translation

在线外科阶段识别（SPR）模型可以达到高帧级准确率，但它们的预测往往缺乏时间稳定性，导致工作流程理解的碎片化，并降低下游辅助的可靠性。我们表明，这种不稳定性并非随机噪声，而是由两种机制引起的：早期的错误分类破坏了时间特征状态，并向前传播形成错误级联；而阶段转换遵循证据积累动态，而大多数在线SPR系统依赖于无记忆的帧级决策，使其对瞬时置信度波动敏感。我们提出了一个统一的训练-推理-评估框架，明确地使用与模型无关的即插即用组件来稳定时间推理动态。在训练中，时间错误级联（TEC）损失抑制错误的发生，并通过稳定时间特征演变来减轻向前错误传播。在推理中，证据门控转换预测器（EGTP）强制执行基于证据的状态转换，仅在累积证据超过置信边界时允许阶段变化。在评估中，我们引入了时间碎片化指数（TFI），这是一个关注可靠性的度量，量化了超出传统帧级和基于标记的度量的由不稳定性引起的时间不一致性。在Cholec80和AutoLaparo上的三种代表性骨干网络的实验表明，所提出的框架显著提高了时间稳定性并减少了预测碎片化，同时保持或适度提高了帧级性能。


cs.CV / 14 / 2605.16388

ChronoSC: Task-Oriented Semantic Communication via Temporal-to-Color Encoding

ChronoSC：通过时间到颜色编码的任务导向语义通信

Nguyen, Phuc H., Nguyen, Trung T., Duong, Quy N., Nguyen, Van-Dinh

Abstract

Semantic communication (SC) aims to reduce transmission overhead by conveying task-relevant information rather than raw data. However, existing SC approaches for video largely focus on pixel-level reconstruction or rely on complex spatiotemporal pipelines, leading to excessive bandwidth usage and latency that are unsuitable for low-resource deployments. In this paper, we propose ChronoSC, a task-oriented semantic communication framework for Video Question Answering (VideoQA). ChronoSC introduces Chrono-Color Stacking, a lightweight and lossless projection scheme that encodes temporal video dynamics into a single static image, enabling extreme temporal compression before transmission. This compact semantic representation is transmitted using a lightweight Deep Joint Source-Channel Coding (DeepJSCC) transceiver and explicitly reconstructed at the receiver. Unlike latent-space methods, explicit visual reconstruction enables the direct reuse of pre-trained vision-language models; specifically, a pre-trained BLIP model is employed to infer answers from noisy, reconstructed chrono-images. Experiments on the CLEVRER dataset show that ChronoSC achieves up to 192 times bandwidth reduction compared to raw video transmission while maintaining high VideoQA accuracy.

Chinese Translation

语义通信（SC）旨在通过传递与任务相关的信息而非原始数据来减少传输开销。然而，现有的视频SC方法主要集中在像素级重建或依赖复杂的时空处理流程，导致带宽使用过高和延迟过大，不适合低资源部署。在本文中，我们提出了ChronoSC，一种针对视频问答（Video Question Answering，VideoQA）的任务导向语义通信框架。ChronoSC引入了Chrono-Color Stacking，这是一种轻量级且无损的投影方案，将视频的时间动态编码为单一静态图像，从而在传输前实现极端的时间压缩。这种紧凑的语义表示通过轻量级的深度联合源信道编码（Deep Joint Source-Channel Coding，DeepJSCC）收发器进行传输，并在接收端显式重建。与潜在空间方法不同，显式视觉重建使得可以直接重用预训练的视觉-语言模型；具体而言，采用预训练的BLIP模型从噪声重建的时间图像中推断答案。在CLEVRER数据集上的实验表明，ChronoSC在保持高VideoQA准确率的同时，相较于原始视频传输实现了高达192倍的带宽减少。


cs.CV / 15 / 2605.16390

Inducing Spatial Locality in Vision Transformers through the Training Protocol

通过训练协议诱导视觉变换器中的空间局部性

Toledo, Eduardo Santiago, Martínez, Asael Fabian

Abstract

We investigate whether the training protocol can induce spatial locality in the early layers of a Vision Transformer (ViT) trained from scratch, without large-scale pretraining. Keeping the architecture and optimization procedure fixed, we compare a Baseline protocol with a Modern protocol (AutoAugment/ColorJitter, CutMix, and Label Smoothing) on CIFAR-10, CIFAR-100, and Tiny-ImageNet, characterizing each attention head via Mean Attention Distance (MAD) and normalized entropy. Across all three datasets, the Modern protocol produces more local and more concentrated attention in early layers; on CIFAR-100, the minimum MAD drops from 0.316 (Baseline) to 0.008 (Modern). To identify the source of this effect, we conduct an ablation study on CIFAR-100 by adding or removing each component individually. The results identify CutMix as the determining component within our experiments: all conditions with CutMix exhibit MAD 0.024, while all conditions without CutMix remain at MAD 0.210. AutoAugment and Label Smoothing show no independent effect on locality. Taken together, these findings suggest that the pressure to classify from partial image regions, induced by CutMix, can promote the emergence of local attention in Vision Transformers.
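For readers unfamiliar with the two head-level statistics, the sketch below computes them for one attention head over a square patch grid. It assumes a row-stochastic patch-to-patch attention matrix with no class token, and it normalizes distances by the grid diagonal so that MAD lies in [0, 1]; that normalization is an assumption made here, since the abstract does not state the paper's convention.

```python
import math

def mean_attention_distance(attn, grid):
    """Attention-weighted mean distance between query and key patches.

    attn: N x N list of lists with rows summing to 1, where N = grid * grid.
    Distances are divided by the grid diagonal (assumed normalization).
    """
    n = grid * grid
    diagonal = math.hypot(grid - 1, grid - 1)
    total = 0.0
    for i in range(n):
        yi, xi = divmod(i, grid)
        for j in range(n):
            yj, xj = divmod(j, grid)
            total += attn[i][j] * math.hypot(yi - yj, xi - xj)
    return total / (n * diagonal)

def normalized_entropy(attn):
    """Mean row entropy divided by log(N): 0 is fully concentrated, 1 is uniform."""
    n = len(attn)
    total = 0.0
    for row in attn:
        total += -sum(p * math.log(p) for p in row if p > 0.0)
    return total / (n * math.log(n))
```

A head that attends only to its own patch scores 0 on both measures, while a uniformly attending head has normalized entropy 1 and a MAD set purely by the grid geometry.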

Chinese Translation

我们研究训练协议是否可以在从头开始训练的视觉变换器（Vision Transformer, ViT）的早期层中诱导空间局部性，而无需大规模的预训练。在固定架构和优化过程的情况下，我们在CIFAR-10、CIFAR-100和Tiny-ImageNet上比较了基线协议（Baseline）与现代协议（Modern，包含AutoAugment/ColorJitter、CutMix和Label Smoothing），并通过平均注意距离（Mean Attention Distance, MAD）和归一化熵来表征每个注意力头。在所有三个数据集中，现代协议在早期层中产生了更局部和更集中注意力；在CIFAR-100上，最小MAD从0.316（基线）降至0.008（现代）。为了识别这一效应的来源，我们在CIFAR-100上进行了消融研究，逐个添加或移除每个组件。结果表明，CutMix是我们实验中的决定性组件：所有包含CutMix的条件的MAD为0.024，而所有不包含CutMix的条件的MAD保持在0.210。AutoAugment和Label Smoothing对局部性没有独立影响。综合来看，这些发现表明，CutMix诱导的对部分图像区域进行分类的压力可以促进视觉变换器中局部注意力的出现。


cs.CV / 16 / 2605.16393

Vision Transformer-Conditioned UNet for Domain-Adaptive Semantic Segmentation

基于视觉变换器条件的 UNet 用于领域自适应语义分割

Ortega, Joel Valdivia, Peng, Tingying, Jasnin, Marion

Abstract

Semantic segmentation is essential for analysing anatomical features in biomedical research, yet a performance gap remains for Vision Transformers (ViTs) in the field, particularly for sparse, fine-structured, and low signal-to-noise targets. We attribute this challenge in part to the lightweight pixel decoders commonly used in promptable ViT models, which may lack the local inductive bias needed for high-precision biomedical masks. We bridge this gap by introducing ViTC-UNet, which conditions a UNet on frozen pre-trained ViT representations through learnable tokens and a two-way attention decoder. This combines ViT global visual priors with the local inductive bias and high-resolution decoding capacity of UNets, while avoiding end-to-end ViT fine-tuning even in cross-domain settings. ViTC-UNet outperforms baseline results in semantic segmentation tasks across MRI and CT modalities, demonstrating that structure-conditioned UNet decoding can efficiently adapt large-scale visual priors to high-complexity biomedical segmentation.

Chinese Translation

语义分割在生物医学研究中对于分析解剖特征至关重要，但在该领域，视觉变换器（Vision Transformers, ViTs）的性能仍存在差距，特别是在稀疏、细结构和低信噪比目标的情况下。我们将这一挑战部分归因于在可提示的 ViT 模型中常用的轻量级像素解码器，这些解码器可能缺乏高精度生物医学掩膜所需的局部归纳偏置。我们通过引入 ViTC-UNet 来弥补这一差距，该模型通过可学习的标记和双向注意解码器将 UNet 条件化于冻结的预训练 ViT 表示。这种方法将 ViT 的全局视觉先验与 UNet 的局部归纳偏置和高分辨率解码能力相结合，同时在跨领域设置中避免了端到端的 ViT 微调。ViTC-UNet 在 MRI 和 CT 模态的语义分割任务中超越了基线结果，证明了结构条件的 UNet 解码能够有效地将大规模视觉先验适应于高复杂度的生物医学分割。


cs.CV / 17 / 2605.16396

Beyond MMSE: Enhancing PnP Restoration with ProxiMAP

超越最小均方误差：通过 ProxiMAP 增强 PnP 恢复

Vert, Kenta, Meanti, Giacomo, Pesme, Scott, Arbel, Michael, Mairal, Julien

Abstract

Plug-and-Play (PnP) methods have become standard tools for solving imaging inverse problems by replacing the intractable maximum a posteriori (MAP) denoiser with the MMSE one. While this mismatch has been widely treated as unavoidable, recent works have sought to close this gap by targeting the MAP with diffusion-model scores. We show this is problematic in practice: learned scores do not match the true ones, so MAP-targeting iterations converge to cartoon-like images rather than realistic ones, and better results are obtained by stopping short of convergence. We turn this observation into a design principle and introduce ProxiMAP, an iterative MAP approximation whose noise schedule keeps the iterate's residual noise matched to the denoiser's training noise. This keeps the denoiser in-distribution where its score is reliable, and yields implicit early stopping that avoids the failure mode above. ProxiMAP is a modular drop-in replacement for MMSE denoisers in standard PnP algorithms and consistently sharpens reconstructions across deblurring, inpainting, super-resolution, and phase retrieval. Building on the same principle, we propose a hybrid variant that applies ProxiMAP only in the late iterations of PnP, where the denoiser is most reliable -- matching or exceeding the full-replacement variant at a fraction of the cost.

Chinese Translation

插拔式（PnP）方法已成为解决成像逆问题的标准工具，通过将难以处理的最大后验（MAP）去噪器替换为最小均方误差（MMSE）去噪器。尽管这种不匹配被广泛视为不可避免，但最近的研究试图通过针对扩散模型得分来缩小这一差距。我们发现这在实践中是有问题的：学习到的得分与真实得分不匹配，因此针对 MAP 的迭代收敛到卡通般的图像而非现实图像，并且通过在收敛前停止可以获得更好的结果。我们将这一观察转化为设计原则，提出了 ProxiMAP，一种迭代的 MAP 近似，其噪声调度保持迭代的残余噪声与去噪器的训练噪声匹配。这使得去噪器保持在其得分可靠的分布内，并产生隐式的提前停止，避免了上述失败模式。ProxiMAP 是标准 PnP 算法中 MMSE 去噪器的模块化替代品，并在去模糊、修复、超分辨率和相位恢复等任务中持续提高重建质量。基于相同的原则，我们提出了一种混合变体，仅在 PnP 的后期迭代中应用 ProxiMAP，此时去噪器最为可靠——在成本的很小一部分下，匹配或超过完全替代变体的效果。


cs.CV / 18 / 2605.16397

Trajectory-Aware Adaptive Inference in Object Detection Models

基于轨迹感知的目标检测模型自适应推理

Papanikolaou, Grigorios, Kontopoulos, Ioannis, Spiliopoulos, Giannis, Zissis, Dimitris, Tserpes, Konstantinos

Abstract

The increasing integration of sensors in autonomous maritime navigation has led to large-scale multimodal datasets, raising challenges in achieving efficient real-time perception. In such systems, object detection and trajectory perception of nearby vessels are tightly coupled, particularly in dynamic environments such as maritime navigation. However, the efficiency of object detection models during inference remains an often-overlooked aspect. To this end, we build upon an existing object detection framework by incorporating GPS trajectory data into the inference process to enable input-adaptive computation. Specifically, we introduce an early-exit mechanism in a YOLOv8-based detector that incorporates motion cues - such as inter-vessel distances. Frames of vessels that are separated by short distances, converging with high speed, are processed using the full model, while only a subset of the network's architecture is activated otherwise. The difficulty degree (or scene complexity) of a frame or set of frames per second is evaluated by leveraging inter-object distance and the rate at which the distance between them decreases. Experimental results demonstrate that this strategy maintains satisfactory detection performance while significantly reducing inference time and computational cost, thus enabling a flexible trade-off between accuracy and efficiency compared to full-model inference.
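The gating rule described above (full model for vessels that are close and converging quickly, a shallower path otherwise) can be sketched as follows. The sketch assumes vessel positions have already been projected from GPS to planar coordinates in metres, and the thresholds `max_distance_m` and `min_closing_mps` are hypothetical values chosen only for illustration; how the early exit is wired into the YOLOv8-based detector is not covered here.

```python
import math

def distance(p, q):
    """Euclidean distance between two (x, y) positions in metres."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def closing_speed(a_prev, b_prev, a_now, b_now, dt):
    """Rate at which the gap between two vessels shrinks, in metres per second.

    Positive when the vessels are converging, negative when they are separating.
    """
    return (distance(a_prev, b_prev) - distance(a_now, b_now)) / dt

def choose_inference_path(a_prev, b_prev, a_now, b_now, dt,
                          max_distance_m=200.0, min_closing_mps=5.0):
    """Return 'full' for close, fast-converging vessels and 'early_exit' otherwise."""
    near = distance(a_now, b_now) <= max_distance_m
    converging = closing_speed(a_prev, b_prev, a_now, b_now, dt) >= min_closing_mps
    return "full" if near and converging else "early_exit"
```

Because the decision uses only trajectory data, it can be made before any image is passed through the detector, so the gate itself adds negligible inference cost.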

Chinese Translation

随着传感器在自主海洋导航中的日益集成，大规模多模态数据集的出现带来了实现高效实时感知的挑战。在此类系统中，目标检测与附近船只的轨迹感知紧密相关，尤其是在动态环境如海洋导航中。然而，目标检测模型在推理过程中的效率仍然是一个常被忽视的方面。为此，我们在现有的目标检测框架基础上，结合GPS轨迹数据，改进推理过程以实现输入自适应计算。具体而言，我们在基于YOLOv8的检测器中引入了一种早期退出机制，该机制结合了运动线索——例如船只间的距离。对于那些相距较近且高速汇聚的船只帧，使用完整模型进行处理，而在其他情况下仅激活网络架构的一部分。通过利用物体间距离及其减少速率来评估每秒一帧或一组帧的难度程度（或场景复杂性）。实验结果表明，该策略在显著减少推理时间和计算成本的同时，保持了令人满意的检测性能，从而实现了与完整模型推理相比在准确性和效率之间的灵活权衡。
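
The gating rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 500 m proximity threshold, and the 5 m/s closing-speed threshold are all hypothetical, and the abstract does not say how the two cues are combined, so a simple AND is assumed here.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two GPS fixes."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def use_full_model(prev_fix, curr_fix, dt_s, near_m=500.0, closing_mps=5.0):
    """Decide whether a frame needs the full detector.

    prev_fix / curr_fix: ((lat, lon) of own vessel, (lat, lon) of other vessel).
    near_m and closing_mps are hypothetical thresholds, not values from the paper.
    """
    d_prev = haversine_m(*prev_fix[0], *prev_fix[1])
    d_curr = haversine_m(*curr_fix[0], *curr_fix[1])
    closing_rate = (d_prev - d_curr) / dt_s  # positive when the vessels converge
    # Assumed rule: full model only when vessels are both close and converging fast.
    return d_curr < near_m and closing_rate > closing_mps
```

When `use_full_model` returns `False`, a detector with early-exit heads would stop at an intermediate layer; which layers are skipped is a detail the abstract does not specify.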

cs.CV / 19 / 2605.16399

Stable and Near-Reversible Diffusion ODE Solvers for Image Editing

稳定且近可逆的扩散常微分方程求解器用于图像编辑

Barancikova, Barbora, Shmelev, Daniil, Salvi, Cristopher

Abstract

The inversion of diffusion models plays a central role in image editing. Algebraically reversible ODE solvers provide an appealing approach to diffusion inversion for text-guided image editing, by eliminating the inversion error inherent in DDIM-based editing pipelines. However, empirical results indicate that reversibility alone is insufficient. As edits require larger semantic or visual changes, reversible diffusion solvers often exhibit instabilities and suffer sharp drops in output quality. In this paper, we show that the trade-off between exact reversibility and numerical stability manifests empirically as a trade-off between background preservation and prompt alignment in image editing. We then investigate the use of near-reversible Runge-Kutta methods as a more stable alternative to exactly reversible diffusion schemes. When combined with a vector-field smoothing strategy, the resulting approach improves edit fidelity, remains stable under large edits, and largely retains the background-preservation benefits of reversible solvers.

Chinese Translation

扩散模型的反演在图像编辑中发挥着核心作用。代数可逆的常微分方程求解器为基于文本的图像编辑提供了一种吸引人的扩散反演方法，通过消除基于DDIM（去噪扩散隐式模型）的编辑流程中固有的反演误差。然而，实证结果表明，仅仅具备可逆性是不够的。当编辑需要更大的语义或视觉变化时，可逆扩散求解器往往表现出不稳定性，并且输出质量急剧下降。在本文中，我们展示了精确可逆性与数值稳定性之间的权衡在图像编辑中表现为背景保留与提示对齐之间的权衡。随后，我们探讨了使用近可逆的龙格-库塔方法作为一种比精确可逆扩散方案更稳定的替代方案。当与向量场平滑策略相结合时，所提出的方法提高了编辑的保真度，在大幅编辑下保持稳定，并在很大程度上保留了可逆求解器的背景保留优势。

cs.CV / 20 / 2605.16401

CADS: Conformal Adaptive Decision System for Cost-Efficient Image Classification

CADS：用于成本效益图像分类的保形自适应决策系统

Turkoglu, Mikael, Bary, Tim, Thielens, Vincent, Dausort, Manon, Macq, Benoît

Abstract

While high-capacity AI models have advanced state-of-the-art performance, their practical deployment is often hindered by high inference costs, environmental impact, and a "one-size-fits-all" approach that ignores varying sample complexity. In clinical settings for instance, the waste of computational resources on routine cases is a significant barrier to sustainable AI. In this paper, we introduce the Conformal Adaptive Decision System (CADS), a sequential multi-model algorithm designed to optimize resource allocation by efficiently sampling models based on the estimated data complexity. CADS leverages conformal prediction to quantify image uncertainty at runtime. CADS provides a mathematically grounded framework for balancing the cost-accuracy dilemma that dynamically routes samples through a model cascade, ranging from lightweight "Scout" models to high-capacity "Oracle" architectures. Validated on two datasets, CADS demonstrated superior efficiency and accuracy at a computational cost that can be up to 12 times lower than heavy-model inference. By accurately routing samples based on real-time complexity, CADS ensures high diagnostic reliability while drastically reducing the economic and environmental footprint of AI.

Chinese Translation

尽管高容量人工智能模型在技术上取得了先进的性能，但其实际部署常常受到高推理成本、环境影响以及忽视样本复杂性差异的“一刀切”方法的阻碍。例如，在临床环境中，常规案例上计算资源的浪费是可持续人工智能的一个重大障碍。本文介绍了保形自适应决策系统（Conformal Adaptive Decision System，CADS），这是一种顺序多模型算法，旨在通过基于估计数据复杂性有效采样模型来优化资源分配。CADS利用保形预测在运行时量化图像的不确定性。CADS提供了一个数学基础框架，用于平衡成本与准确性之间的困境，动态地将样本路由通过从轻量级“侦察者”（Scout）模型到高容量“神谕”（Oracle）架构的模型级联。在两个数据集上的验证显示，CADS展现出卓越的效率和准确性，其计算成本最多可降至重模型推理的十二分之一。通过基于实时复杂性准确路由样本，CADS确保了高诊断可靠性，同时大幅降低了人工智能的经济和环境足迹。
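
One way to picture a conformal cascade of this kind is the sketch below. It uses standard split conformal prediction with the score "1 minus the probability of the true class"; the routing rule (stop as soon as the prediction set is a singleton) is an assumption made for illustration, since the abstract does not state CADS's actual stopping criterion, and all function names are hypothetical.

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile of nonconformity scores (1 - prob. of true class)."""
    n = len(cal_scores)
    # 1-indexed rank of the quantile; clamped to n (strictly, the threshold is
    # infinite when the rank exceeds n, which only happens for very small n).
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, q):
    """All classes whose nonconformity score is within the calibrated threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= q]

def cascade_predict(x, models, thresholds):
    """models: callables x -> class probabilities, ordered cheapest first.

    Assumed routing rule: accept the first model whose prediction set is a
    singleton; the last model's set is returned regardless of its size.
    """
    for level, (model, q) in enumerate(zip(models, thresholds)):
        pset = prediction_set(model(x), q)
        if len(pset) == 1 or level == len(models) - 1:
            return pset, level
```

Under this rule, easy samples exit at the cheap model and only ambiguous ones pay for the expensive one, which is the cost-accuracy trade-off the abstract describes.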

cs.CV / 21 / 2605.16402

WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments

WinDeskGround：复杂多窗口桌面环境中稳健GUI定位的基准测试

Zhao, Haoren, Chen, Tianyi, Wang, Zhen

Abstract

Multimodal Large Language Models (MLLMs) have revolutionized GUI automation, yet their efficacy is largely established on idealized, single-layer interfaces. This paper identifies a critical reliability gap: state-of-the-art agents face distinct robustness challenges in real-world desktop environments characterized by multi-window stacking, occlusion, and visual clutter. To address this, we introduce WinDeskGround, a novel benchmark and synthesis framework tailored for evaluating GUI grounding robustness. Unlike static datasets, our framework parametrically generates complex desktop scenarios by controlling window occlusion, layout density, and semantic similarity, thereby simulating the distribution shifts of authentic workflows. We construct a diverse meta-dataset of 1,356 high-fidelity instruction-target pairs and conduct comprehensive evaluations of five leading MLLMs. Our results demonstrate that while top-tier agents excel in simplified settings, their accuracy declines under partial occlusion. WinDeskGround provides a valuable benchmark to facilitate the assessment and advancement of GUI agent robustness in realistic environments. The code is available at https://github.com/ZZZhr-1/WinDeskGround.

Chinese Translation

多模态大型语言模型（MLLMs）已彻底改变了GUI自动化，但它们的有效性主要建立在理想化的单层界面上。本文识别出一个关键的可靠性缺口：最先进的智能体在真实桌面环境中面临明显的稳健性挑战，这些环境的特点是多窗口堆叠、遮挡和视觉杂乱。为了解决这一问题，我们引入了WinDeskGround，一个新颖的基准测试和合成框架，旨在评估GUI定位的稳健性。与静态数据集不同，我们的框架通过控制窗口遮挡、布局密度和语义相似性参数化生成复杂的桌面场景，从而模拟真实工作流程的分布变化。我们构建了一个包含1,356对高保真指令-目标对的多样化元数据集，并对五个领先的MLLMs进行了全面评估。我们的结果表明，尽管顶尖智能体在简化环境中表现出色，但在部分遮挡情况下其准确性下降。WinDeskGround为评估和提升GUI智能体在现实环境中的稳健性提供了一个有价值的基准。代码可在 https://github.com/ZZZhr-1/WinDeskGround 获取。

cs.CV / 22 / 2605.16403

When Vision Speaks for Sound

当视觉为声音发声

Wen, Xiaofei, Mo, Wenjie Jacky, Fu, Xingyu, Cai, Rui, Zhu, Tinghui, Li, Wendi, Xie, Yanan, Chen, Muhao, Qi, Peng

Abstract

Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information, rather than verifying the audio stream. This issue appears across both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect, in which models appear (falsely) audio-grounded, but actually exploit visual-acoustic correlations without verifying whether the audio and visual streams are truly aligned. To systematically study this behavior, we introduce Thud, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we further study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points, while slightly improving performance on general video and audio-visual QA benchmarks.

Chinese Translation

尽管具备视频能力的多模态大语言模型（MLLMs）取得了快速进展，我们发现它们在视频中的音频理解往往是由视觉驱动的：模型依赖视觉线索来推断或虚构声学信息，而不是验证音频流。这一问题在最先进的开源全能模型和来自 Google 和 OpenAI 等提供商的领先闭源模型中均有所体现。我们将这种失效模式描述为音频-视觉的聪明汉斯效应，其中模型看似（错误地）基于音频，但实际上利用视觉-声学的相关性，而未验证音频和视觉流是否真正对齐。为了系统地研究这种行为，我们引入了Thud，一个基于三种反事实音频编辑的干预驱动探测框架：Shift，用于测试时间同步；Mute，用于测试声音的存在；Swap，用于测试音频-视觉一致性。除了诊断，我们进一步研究了一种两阶段的对齐方法：干预衍生的偏好对用于教导音频验证，而事件级别的一般视频偏好则对模型进行正则化，以防止过度专业化。我们最佳的10K样本方案在三个干预维度上的平均性能提高了28个百分点，同时在一般视频和音频-视觉问答基准上略有提升。
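
The three counterfactual edits named in the abstract (Shift, Mute, Swap) can be sketched on a raw sample list as follows. This is only an illustration of the idea; the abstract does not say how Thud implements the edits, so the padding and trimming choices below are assumptions.

```python
def shift(audio, offset):
    """Delay the audio by `offset` samples, padding the start with silence."""
    pad = min(max(offset, 0), len(audio))
    return [0.0] * pad + list(audio[:len(audio) - pad])

def mute(audio):
    """Replace the track with silence of the same length."""
    return [0.0] * len(audio)

def swap(audio, other):
    """Substitute audio from another clip, trimmed or zero-padded to match length."""
    other = list(other[:len(audio)])
    return other + [0.0] * (len(audio) - len(other))
```

A model that truly checks the audio stream should change its answer under each edit; a model relying on visual cues alone would answer as if nothing had changed.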

cs.CV / 23 / 2605.16404

Hybrid Quantum-MambaVision: A Quantum-enhanced State Space Model for Calibrated Mixed-type Wafer Defect Detection

混合量子-MambaVision：一种用于校准混合类型晶圆缺陷检测的量子增强状态空间模型

Sahoo, Satwik Sai Prakash, Sahoo, Jyoti Prakash, Wang, Ting, Mondal, Subrota Kumar

Abstract

Extracting actionable knowledge from industrial visual data is fundamentally bottlenecked by extreme class imbalance and the prohibitive computational complexity of modern foundation models. In semiconductor manufacturing, identifying multi-label wafer defects is a complex spatial data mining task where overlapping patterns obscure critical root-cause signals. While Vision Transformers (ViTs) excel at global dependency extraction, their quadratic scaling renders them inefficient for high-throughput, real-time anomaly detection. To overcome these computational barriers, this paper introduces Hybrid Quantum-MambaVision, a highly efficient architecture tailored for spatial knowledge discovery. We integrate a linear-complexity State-Space Model (SSM) backbone with a Parameterized Quantum Context Adapter (QCA) and Low-Rank Adaptation (LoRA). The Mamba backbone efficiently captures long-range spatial dependencies, while the quantum adapter maps compressed latent features into a high-dimensional Hilbert space to disentangle complex, overlapping signatures. On the highly imbalanced MixedWM38 dataset, Hybrid Quantum-MambaVision achieves exceptional multi-label classification performance, significantly reducing the error rate on complex multi-defect topologies compared to classical baselines. The quantum regularizer acts as a profound uncertainty calibrator, substantially reducing Maximum Calibration Error (MCE) and minimizing expected false-positive costs. This work establishes a scalable Quantum-Classical hybrid paradigm for efficient representation learning in industrial data mining.

Chinese Translation

从工业视觉数据中提取可操作知识的过程受到极端类别不平衡和现代基础模型计算复杂性的根本瓶颈。在半导体制造中，识别多标签晶圆缺陷是一项复杂的空间数据挖掘任务，其中重叠模式掩盖了关键的根本原因信号。尽管视觉变换器（Vision Transformers, ViTs）在全局依赖性提取方面表现出色，但其二次扩展性使其在高吞吐量、实时异常检测中效率低下。为克服这些计算障碍，本文提出了混合量子-MambaVision，这是一种专门为空间知识发现量身定制的高效架构。我们将线性复杂度的状态空间模型（State-Space Model, SSM）主干与参数化量子上下文适配器（Parameterized Quantum Context Adapter, QCA）和低秩适配（Low-Rank Adaptation, LoRA）相结合。Mamba主干有效捕捉长程空间依赖性，而量子适配器则将压缩的潜在特征映射到高维希尔伯特空间，以解开复杂的重叠特征。在高度不平衡的MixedWM38数据集上，混合量子-MambaVision实现了卓越的多标签分类性能，与经典基线相比，显著降低了复杂多缺陷拓扑的错误率。量子正则化器作为深刻的不确定性校准器，显著降低了最大校准误差（Maximum Calibration Error, MCE）并最小化了预期的假阳性成本。这项工作建立了一种可扩展的量子-经典混合范式，用于工业数据挖掘中的高效表示学习。
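
For readers unfamiliar with the calibration metric mentioned above, the following sketch computes Maximum Calibration Error in its common binned form: the largest gap between accuracy and mean confidence over non-empty confidence bins. The equal-width binning and the choice of 10 bins are assumptions; the abstract does not state how the authors bin multi-label predictions.

```python
def max_calibration_error(confidences, correct, n_bins=10):
    """Largest |accuracy - mean confidence| over non-empty equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    gaps = []
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(1 for _, ok in b if ok) / len(b)
            gaps.append(abs(acc - avg_conf))
    return max(gaps) if gaps else 0.0
```

Unlike Expected Calibration Error, which averages the gaps weighted by bin size, MCE reports the worst bin, so a single badly calibrated confidence range dominates the score.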

cs.CV / 24 / 2605.16405

Concepts Worth Having: Refining VLM-Guided Concept Bottleneck Models with Minimal Annotations

值得拥有的概念：通过最少注释精炼VLM引导的概念瓶颈模型

Debole, Nicola, Passerini, Andrea, Teso, Stefano, Pugnana, Andrea, Marconato, Emanuele

Abstract

Concept-bottleneck models (CBMs) are neural classifiers that compute predictions from high-level concepts extracted from the input. CBMs ensure stakeholders can understand the concepts -- and the predictions they entail -- by learning these from concept-level annotations, which are, however, seldom available. Recent CBM architectures work around this issue by obtaining annotations from Vision-Language Models (VLMs). While greatly broadening applicability, doing so can yield lower-quality concepts and therefore less interpretable models. We strike a middle ground by introducing Vision-plus-Human-guided CBM (VH-CBM), a hybrid approach that exploits both VLMs and a small amount of dense annotations. VH-CBM employs a Gaussian Process in the VLM's embedding space, which captures useful global information about the target domain, to propagate the expert's supervision to any target data point. Our empirical evaluation shows that VH-CBM predicts more accurate concepts than VLM-guided CBMs even when annotating as little as 1% of the data, while offering better concept calibration and supporting active learning.

Chinese Translation

概念瓶颈模型（CBMs）是一种神经分类器，通过从输入中提取的高层次概念来计算预测。CBMs确保利益相关者能够理解这些概念及其所蕴含的预测，然而，这些概念通常很少有概念级别的注释可供学习。最近的CBM架构通过从视觉-语言模型（VLMs）获取注释来解决这一问题。尽管这大大拓宽了适用性，但这样做可能会导致概念质量降低，从而使模型的可解释性下降。我们通过引入视觉加人类引导的CBM（VH-CBM）寻求折中，这是一种结合了VLM和少量密集注释的混合方法。VH-CBM在VLM的嵌入空间中采用高斯过程，捕捉目标领域的有用全局信息，以将专家的监督传播到任何目标数据点。我们的实证评估表明，即使在仅注释1%的数据时，VH-CBM也能预测出比VLM引导的CBMs更准确的概念，同时具有更好的概念校准能力，并支持主动学习。

cs.CV / 25 / 2605.16406

Contrastive-SDXL: Annotation-Preserving Night-Time Augmentation for Pedestrian Detection

对比-SDXL：用于行人检测的保留注释夜间增强

George, Franky, Khalid, Muhammad, Khan, Adil

Abstract

Night-time pedestrian detection remains challenging because labelled night-time data are limited and large illumination differences make daytime-only trained detectors unreliable. Latent diffusion models (LDMs) provide a powerful basis for image-to-image translation and cross-domain augmentation, but their effectiveness in safety-critical perception depends on whether detector-relevant objects and local semantic structure are preserved when translating between source and target domains. In this work, we present Contrastive-SDXL, a day-to-night augmentation framework for night-time pedestrian detection built on SDXL-Turbo and fine-tuned using Low-Rank Adaptation (LoRA). To preserve semantic correspondence between daytime inputs and translated night-time images, we introduce a patch-wise semantic contrastive loss guided by a pretrained DINOv2 encoder rather than generator encoder features. Multi-level DINOv2 self-attention maps enforce both local and global semantic consistency, while an object consistency loss explicitly encourages pedestrian preservation. Contrastive-SDXL produces realistic night-time images, achieving a Frechet Inception Distance (FID) of 22.5. Detectors trained with our synthetic images obtain a 6-7% reduction in miss rate compared with a daytime-only baseline, approaching the performance of detectors trained on real night-time data. These results demonstrate that consistency-driven diffusion augmentation can effectively support safety-critical night-time pedestrian detection.

Chinese Translation

夜间行人检测仍然面临挑战，因为标注的夜间数据有限，且昼间训练的检测器在大光照差异下表现不可靠。潜在扩散模型（Latent Diffusion Models, LDMs）为图像到图像的转换和跨领域增强提供了强大的基础，但其在安全关键感知中的有效性取决于在源域和目标域之间转换时是否保留了与检测器相关的对象和局部语义结构。在本研究中，我们提出了对比-SDXL，一个基于SDXL-Turbo的昼夜增强框架，并使用低秩适应（Low-Rank Adaptation, LoRA）进行了微调。为了保留昼间输入与转换后的夜间图像之间的语义对应关系，我们引入了一种由预训练的DINOv2编码器指导的基于补丁的语义对比损失，而不是生成器编码器特征。多层次的DINOv2自注意力图强制执行局部和全局的语义一致性，而对象一致性损失则明确鼓励行人的保留。对比-SDXL生成了逼真的夜间图像，达到了22.5的Frechet Inception Distance（FID）。与仅使用昼间数据训练的基线相比，使用我们合成图像训练的检测器的漏检率降低了6-7%，接近于在真实夜间数据上训练的检测器的性能。这些结果表明，基于一致性驱动的扩散增强可以有效支持安全关键的夜间行人检测。

cs.CV / 26 / 2605.16408

Visual Search Patterns in 3D Pancreatic Imaging: An Eye Tracking Study

三维胰腺成像中的视觉搜索模式：一项眼动追踪研究

Anikina, Anna, Khaertdinova, Leila, Balschmidt, Trine, Andersen, Michael B, Müller, Christoph F, Brandt, Erik GS, Thomsen, Henrik S, Mello-Thoms, Claudia, Ibragimov, Bulat

Abstract

Eye tracking has emerged as a powerful tool for examining visual perception and search strategies in various domains, including medicine. While it is relatively straightforward to apply in 2D settings, its use in 3D medical imaging remains challenging and not yet well explored. This gap is particularly relevant for radiology, where volumetric images such as computed tomography (CT) scans are routinely read by medical experts. Radiologists typically interpret these images by navigating through hundreds of 2D slices, most often viewed in the axial projection. A taxonomy of eye movement data during navigation through a CT volume could be valuable to understand how radiologists approach diagnostic tasks. As an example of the derived taxonomy, we asked two radiologists to search abdominal CTs of the pancreas. We collect eye tracking data and align eye gaze movements with slice navigation to visualize the representation of the pancreas through volume and analyze clinicians' gaze behavior in both space and time.

Chinese Translation

眼动追踪已成为一种强有力的工具，用于研究视觉感知和搜索策略在各个领域的应用，包括医学。虽然在二维环境中应用相对简单，但在三维医学成像中的使用仍然具有挑战性，且尚未得到充分探索。这一空白在放射学中尤为重要，因为医学专家通常需要解读体积图像，如计算机断层扫描（CT）图像。放射科医生通常通过浏览数百个二维切片来解释这些图像，最常以轴向投影的方式查看。在浏览CT体积时，眼动数据的分类法可能对理解放射科医生如何进行诊断任务具有重要价值。作为所衍生分类法的一个示例，我们请两位放射科医生搜索腹部CT中的胰腺。我们收集眼动追踪数据，并将眼动轨迹与切片导航对齐，以可视化胰腺在体积中的表现，并分析临床医生在空间和时间上的注视行为。

cs.CV / 27 / 2605.16409

Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

多语言OCR感知微调与提示引导的思维链推理用于多模态大型语言模型

Xu, Qinwu, Liu, Xin, Jiang, Yifan, Ren, Haoyu

Abstract

Optical character recognition (OCR) and multilingual text understanding remain major failure modes of multimodal large language models (MLLMs), particularly in real-world images containing cluttered layouts, small fonts, blur, occlusion, and complex typography. We present an OCR-aware multilingual multimodal training framework that combines (i) large-scale synthetic OCR-to-translation data generation, (ii) OCR-aware supervised fine-tuning (SFT) with LoRA adaptation, and (iii) structured visual chain-of-thought (CoT) prompting for reasoning under uncertain visual conditions. Using a LLaMA-based multimodal architecture, the proposed framework substantially improves OCR completeness, multilingual translation accuracy, and robustness under degraded visual conditions. Experimental results on multilingual receipts, menus, posters, signs, handwritten text, and document images demonstrate significantly improved visual-text grounding compared with the baseline model. In particular, the proposed OCR-aware post-training framework improves extraction of small, blurred, spatially scattered, and partially occluded text while reducing reliance on language priors under uncertain OCR conditions. Qualitative comparisons with frontier multimodal systems, including GPT-5-class and Gemini-family models, further suggest improved OCR grounding and reduced hallucination under noisy and visually ambiguous OCR scenarios. Overall, the results indicate that data-centric OCR-aware multimodal post-training provides an effective and scalable direction for improving multilingual OCR and OCR-based visual question answering systems.

Chinese Translation

光学字符识别（OCR）和多语言文本理解仍然是多模态大型语言模型（MLLMs）的主要失败模式，特别是在包含杂乱布局、小字体、模糊、遮挡和复杂排版的真实世界图像中。我们提出了一种OCR感知的多语言多模态训练框架，该框架结合了（i）大规模合成OCR到翻译数据生成，（ii）带有LoRA适应的OCR感知监督微调（SFT），以及（iii）在不确定视觉条件下进行推理的结构化视觉思维链（CoT）提示。使用基于LLaMA的多模态架构，所提出的框架显著提高了OCR的完整性、多语言翻译的准确性以及在退化视觉条件下的鲁棒性。在多语言收据、菜单、海报、标志、手写文本和文档图像上的实验结果表明，与基线模型相比，视觉-文本对齐显著改善。特别是，所提出的OCR感知后训练框架提高了对小型、模糊、空间分散和部分遮挡文本的提取，同时在不确定的OCR条件下减少了对语言先验的依赖。与前沿多模态系统（包括GPT-5类和Gemini系列模型）的定性比较进一步表明，在嘈杂和视觉模糊的OCR场景下，OCR对齐得到了改善，幻觉现象得到了减少。总体而言，结果表明，以数据为中心的OCR感知多模态后训练为改善多语言OCR和基于OCR的视觉问答系统提供了一种有效且可扩展的方向。

cs.CV / 28 / 2605.16410

Test-Time Hinting for Black-Box Vision-Language Models

黑箱视觉-语言模型的测试时提示

Hou, Kaihua, Mudunuri, Abhijith Varma, Qiu, Jiaxing, Daneshjou, Roxana, Hartvigsen, Thomas, Alaa, Ahmed

Abstract

Test-time scaling (TTS) methods have proven highly effective for LLMs, yet their application to vision-language models (VLMs) remains relatively underexplored. Existing VLM TTS methods largely require open-weight model access or expensive repeated sampling, and are evaluated primarily on multimodal mathematical and scientific reasoning benchmarks rather than general visual understanding tasks. In this paper, we propose Test-Time Hinting, a method that improves VLM performance with a single VLM call and only black-box API access, which makes it broadly applicable to frontier closed-weight models. Our method is motivated by the observation that VLM errors tend to cluster around recurring failure patterns. We therefore train a lightweight hint generator model to predict, for a given test input, which "hint" should be prepended to the prompt, providing targeted contextual or procedural guidance that steers the VLM away from its characteristic failure modes. We show that Test-Time Hinting improves the accuracy of multiple closed-weight VLMs on natural-image VQA benchmarks and that these gains generalize to unseen benchmarks and VLMs without retraining the hint generator.

Chinese Translation

测试时缩放（TTS）方法已被证明对大型语言模型（LLMs）非常有效，但其在视觉-语言模型（VLMs）中的应用仍然相对未被充分探索。现有的VLM TTS方法大多需要开放权重模型的访问或昂贵的重复采样，并且主要在多模态数学和科学推理基准上进行评估，而不是一般的视觉理解任务。在本文中，我们提出了测试时提示（Test-Time Hinting）方法，通过一次VLM调用来提高VLM的性能，仅需黑箱API访问，这使得该方法广泛适用于前沿的闭权重模型。我们的方法的动机是观察到VLM错误往往集中在重复的失败模式周围。因此，我们训练了一个轻量级提示生成模型，以预测对于给定的测试输入，应该在提示前添加哪个“提示”，提供有针对性的上下文或程序指导，从而引导VLM远离其特征性失败模式。我们展示了测试时提示在自然图像视觉问答（VQA）基准上提高了多个闭权重VLM的准确性，并且这些提升在未见过的基准和VLM上也具有普适性，而无需重新训练提示生成器。

cs.CV / 29 / 2605.16411

Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift

通过分阶段偏好优化在分布变化下减少视觉语言模型中的幻觉

Xu, Qinwu

Abstract

Hallucination remains a fundamental challenge in vision-language models (VLMs), where autoregressive generation may produce linguistically plausible yet physically inconsistent or visually ungrounded responses due to likelihood maximization under joint probabilistic modeling. We propose a stage-wise preference optimization framework for hallucination reduction through targeted multimodal data construction. Rather than directly optimizing on generic instruction-following data, our approach progressively constructs hallucination-focused preference pairs near known failure boundaries. The framework emphasizes ambiguous spatial orientation, object relationships, OCR uncertainty, and adversarial false-premise training. Hallucinated negatives are generated through minimally perturbed yet visually inconsistent alternatives, enabling Direct Preference Optimization (DPO) to better separate grounded reasoning from plausible hallucination. Experiments on open-source benchmarks and real-world multimodal evaluation scenarios demonstrate improved grounding consistency, reduced hallucination, and more informative grounded responses. Cross-model qualitative evaluation further shows that the proposed multimodal LLM DPO framework produces more visually grounded responses than several frontier proprietary VLMs, such as in ambiguous spatial reasoning and adversarial false-premise settings. The results suggest that hallucination may arise not only from limited model capacity, but also from inherent tendencies of autoregressive probabilistic generation to favor linguistically plausible continuations under weak visual grounding. Future work may explore physical consistency modeling, uncertainty-aware multimodal reasoning, and architectural alternatives beyond standard autoregressive decoding.

Chinese Translation

幻觉仍然是视觉语言模型（VLMs）中的一个基本挑战，其中自回归生成可能会产生在语言上看似合理但在物理上不一致或视觉上没有基础的响应，这主要是由于在联合概率建模下的似然最大化。我们提出了一种分阶段偏好优化框架，通过针对性的多模态数据构建来减少幻觉。我们的方法不是直接在通用的指令跟随数据上进行优化，而是逐步构建接近已知失败边界的以幻觉为重点的偏好对。该框架强调模糊的空间方向、物体关系、光学字符识别（OCR）不确定性和对抗性虚假前提训练。幻觉负样本通过最小扰动但视觉上不一致的替代品生成，使得直接偏好优化（DPO）能够更好地区分有根据的推理与看似合理的幻觉。在开放源代码基准和真实世界多模态评估场景中的实验表明，改进了基础一致性，减少了幻觉，并提供了更具信息性的有根据的响应。跨模型的定性评估进一步表明，所提出的多模态LLM DPO框架在模糊空间推理和对抗性虚假前提设置中产生了比几种前沿专有VLMs更具视觉基础的响应。结果表明，幻觉的产生不仅可能源于模型能力的有限性，还可能源于自回归概率生成固有的倾向，即在视觉基础薄弱的情况下偏向于语言上合理的延续。未来的工作可以探索物理一致性建模、不确定性感知的多模态推理以及超越标准自回归解码的架构替代方案。
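
As background for the Direct Preference Optimization step mentioned above, here is a minimal sketch of the standard per-pair DPO loss. The inputs are summed token log-probabilities of the grounded ("chosen") and hallucinated ("rejected") responses under the trained policy and a frozen reference model; `beta=0.1` is a commonly used default and not a value taken from this paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # Margin: how much more the policy prefers the chosen response than the
    # reference model does, measured in log-probability.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))
```

The loss is log 2 when the policy matches the reference and falls as the policy raises the chosen response relative to the rejected one, which is why near-miss negatives (minimally perturbed but visually inconsistent) give a sharper training signal than unrelated ones.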

cs.CV / 30 / 2605.16414

NERVE: A Neuromorphic Vision and Radar Ensemble for Multi-Sensor Fusion Research

NERVE：用于多传感器融合研究的神经形态视觉与雷达集成

Mansour, Omar, Martinello, Pietro, Milon, Ethan, Xu, YingFu, Sifalakis, Manolis, Tang, Guangzhi, Yousefzadeh, Amirreza

Abstract

We present NERVE (Neuromorphic Vision and Radar Ensemble), a multi-sensor dataset comprising 257 minutes of synchronized recordings from five sensors: two Dynamic Vision Sensors (DVS), an RGB-D camera, and two Radar units (24GHz and 77GHz). Captured across 12 measurement days in office environments, NERVE contains around 600GB of uncompressed temporally aligned data with around 914,000 frames and around 9.6 million RGB COCO-formatted annotations covering 16 relevant object categories. To evaluate multi-modal fusion, we construct a DVS+Radar subset for human detection and distance estimation. Baseline experiments using feed-forward and recurrent detectors show that combining DVS with 77GHz Radar consistently improves detection, with recurrent models achieving up to 47.5% mAP and mean absolute Radar distance errors below 1.8m against LiDAR ground truth.

Chinese Translation

我们提出了NERVE（神经形态视觉与雷达集成），这是一个多传感器数据集，包含来自五个传感器的257分钟同步录音：两个动态视觉传感器（DVS）、一个RGB-D相机和两个雷达单元（24GHz和77GHz）。该数据集在12天的办公环境测量中捕获，包含约600GB的未压缩时间对齐数据，约914,000帧，以及约960万条RGB COCO格式的注释，涵盖16个相关物体类别。为了评估多模态融合，我们构建了一个用于人类检测和距离估计的DVS+雷达子集。使用前馈和递归检测器的基线实验表明，结合DVS与77GHz雷达可以持续改善检测效果，递归模型在与LiDAR真实值对比时，达到最高47.5%的平均精度（mAP）和平均绝对雷达距离误差低于1.8米。

cs.CV / 31 / 2605.16415

Diffusion Models, Denoiser Architecture and Creativity

扩散模型、去噪器架构与创造力

Levine, Itamar, Weiss, Yair

Abstract

The creativity of diffusion models refers to their ability to generate highly realistic images that are different from their training data. Creativity is somewhat surprising since it is known that if the denoiser used in the diffusion model is the Bayes optimal denoiser for a given training set, then the model will simply copy the training samples. In this paper we present empirical and theoretical results that suggest that creativity in diffusion models is due to an interaction between the denoiser architecture and the target distribution. Theoretically, we give explicit forms for the distribution of generated samples as a function of the target distribution and the denoiser architecture for three different denoiser architectures (linear, polynomial, bottleneck). Empirically, we show that small changes in the popular UNET denoiser architecture lead to very different forms of creativity, and these small changes often yield samples that are highly nonrealistic. Taken together, our results show that diffusion models will only be successful if the inductive bias of the denoiser architecture is in strong alignment with the true target distribution.

Chinese Translation

扩散模型的创造力指的是它们生成与训练数据不同的高度真实图像的能力。创造力有些令人惊讶，因为已知如果在扩散模型中使用的去噪器是给定训练集的贝叶斯最优去噪器，那么模型将简单地复制训练样本。在本文中，我们提出了实证和理论结果，表明扩散模型中的创造力源于去噪器架构与目标分布之间的相互作用。在理论上，我们给出了生成样本分布的显式形式，该形式作为目标分布和去噪器架构的函数，适用于三种不同的去噪器架构（线性、多项式、瓶颈）。在实证方面，我们展示了流行的 UNET 去噪器架构中的小变化会导致非常不同的创造力形式，而这些小变化通常会产生高度不真实的样本。综合来看，我们的结果表明，扩散模型只有在去噪器架构的归纳偏差与真实目标分布强烈一致时，才能取得成功。
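
The copying behaviour of the Bayes-optimal denoiser can be seen in a few lines. The sketch assumes additive Gaussian noise, x_noisy = x0 + sigma * eps, with x0 drawn uniformly from a finite training set; under that assumption the posterior mean is a softmax-weighted average of the training points. The function name and the list-of-lists data layout are illustrative choices, not from the paper.

```python
import math

def bayes_optimal_denoise(x_noisy, train, sigma):
    """Posterior mean E[x0 | x_noisy] for x0 uniform over `train`, Gaussian noise."""
    sq = [sum((a - b) ** 2 for a, b in zip(x_noisy, x)) for x in train]
    m = min(sq)  # subtract the minimum distance for numerical stability
    w = [math.exp(-(d - m) / (2 * sigma ** 2)) for d in sq]
    z = sum(w)
    dim = len(x_noisy)
    # Weighted average of training points, weights proportional to likelihood.
    return [sum(wi * x[k] for wi, x in zip(w, train)) / z for k in range(dim)]
```

At small noise levels the weights collapse onto the nearest training sample, so the denoiser returns that sample almost exactly; at large noise levels it returns roughly the training mean. Anything a trained network generates beyond these outputs comes from its deviation from this optimum, which is where the architecture's inductive bias enters.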

cs.CV / 32 / 2605.16416

CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning

CAVE：一种用于碎片化视觉证据推理的结构化信用分配方法

Guo, Tengda, Leng, Jie, Li, Hanlei, Liang, Yaoyuan, Zhang, Qingyue, Yang, Dian, Zhang, Mingyu, Fu, Yuhua, Huang, Shao-Lun

Abstract

Vision-Language Models (VLMs) have achieved strong performance on general multimodal reasoning, yet remain challenged in integrating nonlocal visual information to support semantically underdetermined visual reasoning. We describe this challenge as Fragmented Visual Reasoning. To this end, we propose Credit Assignment for Visual Evidence (CAVE), a structured process-reward method based on GRPO for interleaved visual reasoning. Specifically, CAVE evaluates the contribution of intermediate steps at the action level via three complementary reasoning process signals: belief update, evidence acquisition, and adaptive focus control, thereby guiding the model to optimize each reasoning action and learn more reliable visual reasoning strategies. Meanwhile, we construct TRACER-Bench, which covers four nonlocal and semantically confusable reasoning dimensions and provides key intermediate evidence to supervise reasoning paths. Experiments demonstrate that CAVE substantially improves performance on tasks requiring fragmented visual evidence integration, covering both public benchmarks and our newly introduced TRACER-Bench, while retaining competitive performance on general multimodal evaluations. Further analyses reveal that CAVE effectively improves the visual reasoning capacity and exhibits stronger robustness under longer-range and deeper cross-region dependencies.

Chinese Translation

视觉-语言模型（VLMs）在一般多模态推理中取得了强劲的表现，但在整合非局部视觉信息以支持语义不确定的视觉推理方面仍面临挑战。我们将这一挑战描述为碎片化视觉推理。为此，我们提出了视觉证据信用分配（CAVE），这是一种基于GRPO的结构化过程-奖励方法，用于交错的视觉推理。具体而言，CAVE通过三个互补的推理过程信号：信念更新、证据获取和自适应聚焦控制，评估中间步骤在行动层面的贡献，从而引导模型优化每个推理行动并学习更可靠的视觉推理策略。同时，我们构建了TRACER-Bench，涵盖四个非局部和语义混淆的推理维度，并提供关键的中间证据以监督推理路径。实验表明，CAVE显著提高了在需要碎片化视觉证据整合的任务上的表现，涵盖了公共基准和我们新引入的TRACER-Bench，同时在一般多模态评估中保持了竞争力的表现。进一步的分析表明，CAVE有效提升了视觉推理能力，并在长距离和深层跨区域依赖下展现出更强的鲁棒性。
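
Since CAVE is described as building on GRPO, the group-relative normalization at the core of GRPO is sketched below for context. Each sampled response's reward is standardized against the other responses drawn for the same prompt. How CAVE combines its three process signals (belief update, evidence acquisition, adaptive focus control) into a reward is not stated in the abstract, so the sketch takes a single scalar reward per response as given; the use of the population standard deviation and the `eps` term are common implementation choices assumed here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division finite when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are relative to the group, no learned value function is needed; a response is reinforced only to the extent that it beats its siblings for the same input.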

cs.CV / 33 / 2605.16418

Neural Visual Decoding via Cognitive guided Adaptive Blurring and Information Constrained Alignment

通过认知引导的自适应模糊与信息约束对齐实现神经视觉解码

Yin, Fan, Zheng, Chuhang, Gong, Peiliang, Guan, Donghai, Zhu, Qi

Abstract

EEG-based visual decoding aims to establish a mapping between neural signals and visual semantics. However, it remains constrained by the dual challenges of severe information granularity mismatch and the low signal-to-noise ratio (SNR) of EEG signals. Existing approaches typically treat static visual features, ignoring the dynamic selectivity of human vision and the frequency specificity of neural oscillations. To bridge this gap, we propose CAIA, a Cognitive-guided Adaptive blurring with Information-Constrained Alignment framework for Neural-Visual decoding. On the visual side, it simulates selective attention to adaptively reduce redundancy. Meanwhile, on the EEG side, it leverages neural oscillation priors and the information bottleneck mechanism to enhance SNR. Specifically, we devise a cognitive-dynamics-based adaptive blurring mechanism that dynamically integrates center-biased and saliency-guided visual cues via cross-modal attention. Furthermore, we introduce a distribution-aware boundary calibration loss to robustly rectify alignment bias caused by outlier samples. Moreover, a cognitively-guided information-screening method is proposed to select task-relevant EEG oscillations. Extensive experiments demonstrate that CAIA improves both subject-dependent and subject-independent average Top-1 and Top-5 accuracy in zero-shot brain-to-image retrieval, significantly outperforming prior methods. Our work validates that optimizing visual information density to match neural granularity offers a more interpretable and robust pathway for neural decoding.

Chinese Translation

基于脑电图（EEG）的视觉解码旨在建立神经信号与视觉语义之间的映射。然而，它仍然受到严重信息粒度不匹配和脑电图信号低信噪比（SNR）这两大挑战的限制。现有的方法通常将静态视觉特征视为重点，忽略了人类视觉的动态选择性和神经振荡的频率特异性。为了解决这一问题，我们提出了CAIA（Cognitive-guided Adaptive blurring with Information-Constrained Alignment），一个用于神经视觉解码的框架。在视觉方面，它模拟选择性注意力以自适应地减少冗余。同时，在EEG方面，它利用神经振荡先验和信息瓶颈机制来增强信噪比。具体而言，我们设计了一种基于认知动态的自适应模糊机制，通过跨模态注意力动态整合中心偏置和显著性引导的视觉线索。此外，我们引入了一种分布感知的边界校准损失，以稳健地修正由异常样本引起的对齐偏差。此外，我们提出了一种认知引导的信息筛选方法，以选择与任务相关的EEG振荡。大量实验表明，CAIA在零样本脑图像检索中提高了受试者依赖和非依赖的平均Top-1和Top-5准确率，显著优于先前的方法。我们的工作验证了优化视觉信息密度以匹配神经粒度为神经解码提供了更具可解释性和稳健性的途径。

cs.CV / 34 / 2605.16419

Agentic Pipeline for Self-Synchronized Multiview Joint Angle Monitoring in Uncalibrated Environments

用于未校准环境中自同步多视角关节角监测的智能管道

Yu, Juncheng, A, Lusi, Xie, Haoxuan, Wang, Weiming

Abstract

Kinematic monitoring plays a critical role in long-term rehabilitation for patients with spinal cord injury (SCI), where multi-view markerless motion capture methods have shown significant potential. However, owing to the reliance on calibration and the difficulty of achieving multi-view synchronization, their deployment in patient self-deployed environments remains challenging. In this work, we propose an agentic pipeline for self-synchronized multi-view joint angle monitoring in uncalibrated environments using two cameras without hardware triggers. Multimodal large language models enable automatic video synchronization and agent-driven self-verification. State-of-the-art monocular 2D pose estimation models are employed to extract candidate poses, and an agent-based selection mechanism is then applied to automatically identify and track the target subject, thereby producing consistent 2D poses in the presence of multiple individuals and occlusions. These 2D poses are optimized to estimate joint angles from uncalibrated multi-view pose sequences, ensuring interpretability through explicit geometric modeling. Validation against a Vicon system demonstrated strong performance, achieving an MAE of $5.97^\circ \pm 2.36^\circ$ and a Pearson correlation coefficient of $0.962 \pm 0.014$. The proposed method is expected to provide a practical, patient self-deployable system for daily kinematic monitoring in uncalibrated home environments.

Chinese Translation

运动学监测在脊髓损伤（SCI）患者的长期康复中发挥着关键作用，其中多视角无标记运动捕捉方法显示出显著的潜力。然而，由于依赖于校准以及实现多视角同步的困难，这些方法在患者自我部署环境中的应用仍然具有挑战性。在本研究中，我们提出了一种用于未校准环境中自同步多视角关节角监测的智能管道，该方法使用两台没有硬件触发器的相机。多模态大型语言模型使得自动视频同步和基于代理的自我验证成为可能。采用最先进的单目2D姿态估计模型提取候选姿态，然后应用基于代理的选择机制自动识别和跟踪目标对象，从而在多个个体和遮挡的情况下生成一致的2D姿态。这些2D姿态经过优化，以从未校准的多视角姿态序列中估计关节角度，并通过明确的几何建模确保可解释性。与Vicon系统的验证表明该方法表现优异，达到平均绝对误差（MAE）为$5.97^\circ \pm 2.36^\circ$，皮尔逊相关系数为$0.962 \pm 0.014$。所提出的方法有望提供一个实用的、患者自我可部署的系统，以在未校准的家庭环境中进行日常运动学监测。
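
The joint angles and the MAE reported in this abstract reduce to simple geometry once keypoints are available. The following is a minimal sketch, assuming three already-reconstructed keypoints per joint (for example hip, knee, ankle); the function names and the sample coordinates are illustrative and are not taken from the paper.

```python
import math


def joint_angle_deg(a, b, c):
    """Angle at joint b, in degrees, between segments b->a and b->c."""
    u = [ai - bi for ai, bi in zip(a, b)]
    v = [ci - bi for ci, bi in zip(c, b)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("degenerate segment: two keypoints coincide")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


def mean_absolute_error(predicted, reference):
    """MAE between predicted angles and a reference system such as Vicon."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical hip, knee, and ankle positions (metres) for one frame.
knee_angle = joint_angle_deg((0.0, 1.0, 0.0), (0.0, 0.5, 0.0), (0.3, 0.1, 0.0))
```

The clamp before `acos` is a defensive choice: normalised dot products of nearly collinear segments can land slightly outside the valid range in floating point.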

cs.CV / 35 / 2605.16420

Video Reconstruction using Diffusion-based Image-to-Video Generation with Trajectory Guidance

基于扩散的图像到视频生成与轨迹引导的视频重建

Bompai, Stelio, Kontopoulos, Ioannis, Spiliopoulos, Giannis, Zissis, Dimitris, Tserpes, Konstantinos

Abstract

This paper addresses the problem of reconstructing missing or dropped frames in top-down drone video of autonomous surface vehicles performing structured maritime manoeuvres. We propose a pipeline that converts raw GPS telemetry and a single reference frame into a trajectory-guided video sequence using a pre-trained image-to-video diffusion model, requiring no domain-specific fine-tuning. GPS coordinates from onboard telemetry logs are projected into image space via an equirectangular mapping, producing per-vessel motion cues that condition the SG-I2V diffusion model. The generated frames are evaluated against ground-truth video using perceptual, temporal and trajectory-based metrics, and benchmarked against optical flow extrapolation and RIFE interpolation baselines. SG-I2V produces the most natural-looking frames among all methods (BRISQUE 25.52, closest to the ground-truth value of 23.64), the most realistic motion magnitude (temporal smoothness 1.14 vs. 1.42 for ground truth), and the strongest GPS trajectory adherence (9.31px vs. 28.70px for ground truth, the latter reflecting approximate temporal alignment between footage and GPS logs rather than generation error), demonstrating that trajectory-guided diffusion synthesis is a viable approach to maritime video reconstruction under challenging low-texture, small-object conditions.

Chinese Translation

本文解决了在自主水面车辆执行结构化海上机动时，顶视无人机视频中缺失或丢失帧的重建问题。我们提出了一种管道，将原始GPS遥测数据和单一参考帧转换为轨迹引导的视频序列，使用预训练的图像到视频扩散模型，无需领域特定的微调。通过等距矩形映射将来自机载遥测日志的GPS坐标投影到图像空间，生成每艘船的运动线索，以条件化SG-I2V扩散模型。生成的帧通过感知、时间和轨迹基础的指标与真实视频进行评估，并与光流外推和RIFE插值基线进行基准测试。SG-I2V在所有方法中生成了最自然的帧（BRISQUE 25.52，最接近真实值23.64），运动幅度最为真实（时间平滑度1.14对比真实值1.42），以及GPS轨迹遵循性最强（9.31px对比真实值28.70px，后者反映了视频与GPS日志之间的近似时间对齐，而非生成误差），证明了轨迹引导的扩散合成在低纹理、小物体条件下进行海洋视频重建的可行性。
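
The GPS-to-image projection mentioned in this abstract can be illustrated with a local equirectangular approximation. This is a sketch under stated assumptions, not the authors' implementation: it assumes a nadir-looking, north-up camera, a known ground sampling distance, and a reference coordinate at the image centre. The name `gps_to_pixel` and all parameters are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, adequate for small areas


def gps_to_pixel(lat, lon, ref_lat, ref_lon, metres_per_pixel, image_size):
    """Project a GPS fix to pixel coordinates around a reference point.

    Assumes the reference point maps to the image centre, north points up,
    and the scene is small enough for a flat-Earth approximation.
    """
    width, height = image_size
    # Equirectangular mapping: scale longitude by cos(reference latitude).
    east_m = (math.radians(lon - ref_lon)
              * math.cos(math.radians(ref_lat)) * EARTH_RADIUS_M)
    north_m = math.radians(lat - ref_lat) * EARTH_RADIUS_M
    px = width / 2.0 + east_m / metres_per_pixel
    py = height / 2.0 - north_m / metres_per_pixel  # image y grows downward
    return px, py


# Hypothetical telemetry: a vessel slightly north-east of the reference point.
track_point = gps_to_pixel(37.9001, 23.7001, 37.9, 23.7, 0.05, (1920, 1080))
```

A sequence of such points per vessel would form the kind of trajectory that can condition a generator; a real pipeline would also need to account for camera heading and altitude.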

cs.CV / 36 / 2605.16423

Nonlinear Bipolar Compensation: Handling Outliers in Post-Training Quantization

非线性双极补偿：处理后训练量化中的异常值

Sun, Peilin, Wu, Jianxin

Abstract

Network quantization has emerged as one of the most practical model compression techniques, which significantly reduces a model's memory and compute consumption by mapping floating-point numbers to low-bit representations. However, existing quantization methods typically suffer from the speed-accuracy tradeoff and limited generalization. To address these issues, recent compensation-based methods offer an efficient yet general solution by introducing additional lightweight linear layers into the quantized network. However, the accuracy of these methods suffers from their limited compensation capability and high sensitivity to outliers. In this paper, we propose Nonlinear Bipolar Compensation (NBC), a post-training quantization approach that introduces nonlinear compensation to reduce the effect of outliers. We further design Bipolar Logarithmic Transformation (BLT), which compresses outliers by mapping both the quantized input and the quantization error into a transformed space. A simple linear layer is then applied for compensation in the transformed space, preserving the efficiency of our method. Extensive experiments across various tasks, models, and quantization methods confirm the effectiveness, efficiency, robustness, and generality of our NBC approach.

Chinese Translation

网络量化已成为最实用的模型压缩技术之一，通过将浮点数映射到低位表示，显著降低了模型的内存和计算消耗。然而，现有的量化方法通常面临速度与准确性之间的权衡以及有限的泛化能力。为了解决这些问题，近期基于补偿的方法通过在量化网络中引入额外的轻量级线性层，提供了一种高效而通用的解决方案。然而，这些方法的准确性受到其有限补偿能力和对异常值的高度敏感性的影响。本文提出了非线性双极补偿（Nonlinear Bipolar Compensation, NBC），这是一种后训练量化方法，通过引入非线性补偿来减少异常值的影响。我们进一步设计了双极对数变换（Bipolar Logarithmic Transformation, BLT），该变换通过将量化输入和量化误差映射到一个变换空间中来压缩异常值。然后，在变换空间中应用一个简单的线性层进行补偿，从而保持我们方法的高效性。针对各种任务、模型和量化方法的广泛实验验证了我们NBC方法的有效性、效率、鲁棒性和通用性。
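
The abstract does not give the formula for the Bipolar Logarithmic Transformation. As an illustration only, one plausible reading is a sign-preserving logarithmic compression, sketched below; `bipolar_log` and `bipolar_log_inverse` are assumed names, and the actual BLT in the paper may differ.

```python
import math


def bipolar_log(x):
    """Sign-preserving log compression: sign(x) * log(1 + |x|).

    An assumed stand-in for the paper's BLT, chosen because it shrinks
    large-magnitude outliers on both the positive and negative side.
    """
    return math.copysign(math.log1p(abs(x)), x)


def bipolar_log_inverse(y):
    """Inverse mapping: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


# Hypothetical activations with two outliers of opposite sign.
activations = [0.1, -0.4, 0.7, 250.0, -900.0]
compressed = [bipolar_log(a) for a in activations]
```

In the transformed space the outliers sit within a few units of the ordinary values, which is the property that would let a single lightweight linear layer fit both regimes.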

cs.CV / 37 / 2605.16427

EAGT: Echocardiography Augmentation for Generalisability and Transferability

EAGT：心脏超声增强以提高泛化性和可转移性

Elyasi, Soroush, Adibzadeh, Sara, Serej, Nasim Dadashi, Wall, Julie, Zolgharni, Massoud

Abstract

Deep learning models for echocardiography segmentation often struggle to generalise across institutions, scanners, and patient populations, where collecting large, consistently annotated datasets is infeasible. Data augmentation is widely used to improve the robustness of deep learning models; however, its role in enhancing cross-dataset generalisability in echocardiography remains insufficiently understood. This study presents a large-scale multi-dataset evaluation of 29 data augmentation techniques and their pairwise combinations for 2D left ventricular segmentation using a U-Net trained on Unity, CAMUS, and EchoNet Dynamic datasets. Each augmentation was explored under several hyperparameter settings and assessed through repeated runs using Dice and IoU in both in-domain and cross-dataset scenarios, with statistical significance quantified via independent t-tests. Results show that anatomically plausible geometric transformations, particularly affine, shift-scale-rotate, perspective, and random horizontal flip, substantially improve cross-dataset performance, whereas aggressive intensity- or artefact-based augmentations often degrade generalisability. Pairwise augmentation combinations outperform individual augmentations and show that moderate flip-centric combinations, especially random horizontal flip with affine, yield consistent gains across most transfer scenarios. These findings provide empirically grounded guidance for designing augmentation policies that enhance the robustness and transferability of echocardiography segmentation models.

Chinese Translation

心脏超声分割的深度学习模型在不同机构、扫描仪和患者群体之间的泛化能力常常面临挑战，因为收集大规模、一致标注的数据集是不可行的。数据增强被广泛用于提高深度学习模型的鲁棒性；然而，其在增强心脏超声跨数据集泛化性方面的作用仍然理解不足。本研究对29种数据增强技术及其成对组合进行了大规模多数据集评估，针对使用在Unity、CAMUS和EchoNet Dynamic数据集上训练的U-Net进行的二维左心室分割。每种增强在多个超参数设置下进行了探索，并通过在领域内和跨数据集场景中的重复运行使用Dice和IoU进行评估，统计显著性通过独立t检验量化。结果表明，解剖上合理的几何变换，特别是仿射变换、平移-缩放-旋转、透视变换和随机水平翻转，显著提高了跨数据集的性能，而激进的强度或伪影基础增强往往会降低泛化能力。成对的增强组合优于单独的增强，并显示出适度的以翻转为中心的组合，特别是随机水平翻转与仿射变换的结合，在大多数转移场景中均能带来一致的提升。这些发现为设计增强策略提供了实证基础指导，以增强心脏超声分割模型的鲁棒性和可转移性。
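
Dice and IoU, the two metrics used in this evaluation, can be computed directly from binary masks. The sketch below works on flattened masks represented as lists of 0/1 values; the function name and the convention for two empty masks are choices made here, not details from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks empty: treated here as perfect agreement by convention.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Hypothetical 2x2 masks, flattened row by row.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

The two metrics are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank single predictions identically but average differently across a dataset.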

cs.CV / 38 / 2605.16431

CT-DegradBench: A Physics-Informed Benchmark for CT Degradation Detection and Severity Estimation

CT-DegradBench：一种基于物理的CT降质检测与严重性估计基准

Taifour, Yousra Nabila, Tliba, Marouane, Ming, Zuheng, Luong, Marie, Aburaed, Nour, Chetouani, Aladine, Durak, Gorkem, Bruno, Alessandro, Cheikh, Faouzi Alaya, Zaidi, Habib, Bagci, Ulas, Beghdadi, Azeddine

Abstract

Computed tomography (CT) images are frequently degraded by acquisition artifacts, including noise, blur, streaking, aliasing, and metal artifacts. Yet CT enhancement is still largely evaluated using image quality metrics with limited perceptual and clinical validity, while existing datasets remain focused on isolated restoration tasks, hindering unified benchmarking across diverse degradation types. We present CT-DegradBench, a dataset and benchmark for CT degradation detection and severity estimation under controlled single- and mixed-artifact settings. CT-DegradBench enables systematic evaluation across multiple degradation families and severity levels within a common experimental framework. We further propose SeSpeCT (Semantic-Spectral CT degradation estimation), a framework that combines semantic priors from medical vision-language models with complementary frequency-domain cues for artifact analysis. SeSpeCT constructs a training-free semantic quality axis in the multimodal embedding space using radiology-informed text prompts, without task-specific fine-tuning, and combines it with spectral features that capture degradation-specific frequency patterns. The resulting representation enables joint prediction of artifact type and severity. Experimental results show that SeSpeCT consistently outperforms the evaluated baselines under both single- and mixed-degradation settings. The framework is available at https://github.com/yousranb/CT-DEGRADBENCH.

Chinese Translation

计算机断层扫描（CT）图像常常受到获取伪影的影响，包括噪声、模糊、条纹、混叠和金属伪影。然而，CT增强的评估仍主要依赖于具有有限感知和临床有效性的图像质量指标，而现有数据集则集中于孤立的修复任务，阻碍了不同降质类型之间的统一基准测试。我们提出了CT-DegradBench，这是一个用于在受控的单一和混合伪影环境下进行CT降质检测和严重性估计的数据集和基准。CT-DegradBench使得在一个共同的实验框架内对多种降质类别和严重性水平进行系统评估成为可能。我们进一步提出了SeSpeCT（语义-频谱CT降质估计），这是一个将医学视觉-语言模型中的语义先验与补充的频域线索相结合的框架，用于伪影分析。SeSpeCT在多模态嵌入空间中构建了一个无训练的语义质量轴，利用放射学信息的文本提示，而无需特定任务的微调，并将其与捕捉降质特定频率模式的频谱特征相结合。由此产生的表示能够联合预测伪影类型和严重性。实验结果表明，SeSpeCT在单一和混合降质环境下均优于评估的基准。该框架可在 https://github.com/yousranb/CT-DEGRADBENCH 获取。

cs.CV / 39 / 2605.16439

KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy

KVCapsule：针对具有不对称冗余的视觉语言模型的高效序列KV缓存压缩

Huang, Yingbing, Srikrishnan, Tharun Adithya, Reinhardt, Steven K., Chen, Deming

Abstract

Vision-Language Models (VLMs) have emerged as a critical and fast-growing extension of Large Language Models (LLMs) that enable multimodal reasoning through both text and image inputs. Although VLMs enrich the capabilities of language models, they also inherit and amplify a key computational bottleneck: the memory overhead caused by the large key-value (KV) cache during autoregressive decoding. This challenge is particularly severe in VLMs, where images produce longer token sequences and denser feature representations compared to text. Moreover, the spatial and information-rich nature of vision tokens introduces structured attention patterns that make many LLM-oriented KV cache compression techniques ineffective when applied directly to VLMs. In this work, we conduct a detailed empirical analysis of the behavior of vision tokens, highlighting the critical differences from purely text-based models. Based on these insights, we propose KVCapsule, a novel KV cache compression framework for vision tokens. KVCapsule keeps the pretrained VLM backbone frozen, requires no modification to the attention computation modules, and can be integrated into existing VLMs through lightweight compression and reconstruction components. We evaluate KVCapsule on multiple VLMs and benchmark tasks, demonstrating up to 2x improvement in TPS and 2.4x reduction in KV cache memory at a 60% compression ratio, with negligible degradation in accuracy or response quality. Our findings offer practical pathways to scale VLM inference under constrained memory budgets and inspire further research into structure-aware cache compression for multimodal models.

Chinese Translation

视觉语言模型（VLMs）作为大型语言模型（LLMs）的重要且快速发展的扩展，通过文本和图像输入实现多模态推理。尽管VLMs丰富了语言模型的能力，但它们也继承并放大了关键的计算瓶颈：自回归解码过程中由大型键值（KV）缓存引起的内存开销。这个挑战在VLMs中尤为严重，因为与文本相比，图像生成的标记序列更长且特征表示更密集。此外，视觉标记的空间和信息丰富特性引入了结构化注意力模式，使得许多面向LLM的KV缓存压缩技术在直接应用于VLMs时效果不佳。在本研究中，我们对视觉标记的行为进行了详细的实证分析，突出了其与纯文本模型的关键差异。基于这些洞察，我们提出了KVCapsule，一种针对视觉标记的新型KV缓存压缩框架。KVCapsule保持预训练的VLM主干不变，不需要修改注意力计算模块，并且可以通过轻量级的压缩和重构组件集成到现有的VLMs中。我们在多个VLMs和基准任务上评估了KVCapsule，结果显示在60%的压缩比下，TPS提高了最多2倍，KV缓存内存减少了2.4倍，且准确性或响应质量几乎没有下降。我们的研究结果为在受限内存预算下扩展VLM推理提供了实用路径，并激发了对多模态模型结构感知缓存压缩的进一步研究。

cs.CV / 40 / 2605.16440

Semantic Smoothing via Novel View Synthesis for Robust SAR Image Classification

通过新视图合成实现语义平滑以增强SAR图像分类的鲁棒性

Brignac, Daniel, Tian, Fengwei, Latibari, Banafsheh, Mahalanobis, Abhijit, Tandon, Ravi

Abstract

Deep neural networks are vulnerable to adversarial perturbations, limiting deployment in safety-critical applications such as synthetic aperture radar (SAR) automatic target recognition (ATR). Randomized smoothing improves robustness by averaging predictions over noisy inputs, but isotropic noise often fails to preserve the semantic structure of SAR imagery. We propose semantic smoothing, a defense that replaces noise-based perturbations with structured randomized transformations generated by a novel view synthesis model. For SAR, we condition on acquisition geometry to synthesize multiple plausible radar views. Predictions across generated randomized views are aggregated to form a robust classifier. Experiments show that semantic smoothing improves robustness against standard attacks, such as FGSM and PGD, and SAR-specific attacks, such as OTSA and SMGAA, while also increasing clean classification accuracy. These results demonstrate that randomized smoothing via semantically preserving geometric transformations is a promising alternative to isotropic noise for adversarial defense in structured sensing domains.

Chinese Translation

深度神经网络对对抗性扰动敏感，这限制了其在合成孔径雷达（SAR）自动目标识别（ATR）等安全关键应用中的部署。随机平滑通过对噪声输入的预测进行平均来提高鲁棒性，但各向同性噪声往往无法保持SAR图像的语义结构。我们提出了语义平滑，这是一种防御机制，它用由新视图合成模型生成的结构化随机变换替代基于噪声的扰动。对于SAR，我们根据获取几何条件合成多个合理的雷达视图。通过生成的随机视图聚合预测，以形成一个鲁棒的分类器。实验表明，语义平滑在对抗标准攻击（如FGSM和PGD）以及SAR特定攻击（如OTSA和SMGAA）时提高了鲁棒性，同时也提高了干净分类的准确性。这些结果表明，通过语义保持的几何变换实现的随机平滑是对抗性防御在结构化感知领域中对各向同性噪声的有前景的替代方案。
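
The aggregation step described in this abstract, predicting on several randomized views and combining the results, can be sketched as a majority vote. This is an illustration under assumptions: `classify` and `synthesize_view` are hypothetical callables standing in for the base classifier and the view synthesis model, and majority voting is one common aggregation rule that the paper may or may not use.

```python
import random
from collections import Counter


def smoothed_predict(image, classify, synthesize_view, num_views=16, seed=0):
    """Classify several randomized views and return the majority label.

    Returns (label, vote_fraction). `synthesize_view(image, rng)` is assumed
    to return one randomized, semantically consistent view of the input.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(num_views):
        view = synthesize_view(image, rng)
        votes[classify(view)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / num_views


# Toy stand-ins: the "image" is a number and a "view" jitters it slightly.
def toy_view(image, rng):
    return image + rng.uniform(-1.0, 1.0)


def toy_classifier(view):
    return "tank" if view >= 0.0 else "truck"


label, confidence = smoothed_predict(5.0, toy_classifier, toy_view)
```

The vote fraction gives a rough indication of how stable the prediction is across views; a small adversarial change must flip many views, not one input, to alter the final label.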

cs.CV / 41 / 2605.16444

Diffusion Attention Expert Model for Predicting and Semi-automatic Localizing STAS in Lung Cancer Histopathological Images

用于预测和半自动定位肺癌组织病理图像中气道内扩散（STAS）的扩散注意力专家模型

Pan, Liangrui, Luo, Jiadi, Xiao, Yuxuan, Nie, Chenchen, Wu, Xiaoshuai, Fan, Songqing, Chu, Ling, Li, Manqiu, He, Rongfang, Zhao, Zhenyu, Wang, Ruixing, Liu, Shulin, Liang, Yiyi, Wang, Xiang, Liang, Qingchun, Peng, Shaoliang

Abstract

Accurate intraoperative and postoperative diagnosis of spread through air spaces (STAS) is essential for guiding surgical decisions and postoperative management in lung cancer. However, histopathological assessment is labor-intensive and is prone to missed or incorrect diagnoses. We propose a Diffusion Attention Expert Model (DAEM) to detect STAS in frozen sections (FSs) and paraffin sections (PSs). Its diffusion attention expert module leverages full attention aggregation to learn multi-scale features from histopathological images, while a dual-branch architecture strengthens multi-scale feature representation. On an internal dataset, DAEM achieves AUCs of 0.8946 for FSs and 0.9112 for PSs. Validation on external multi-center datasets from eight institutions demonstrates strong generalizability and interpretability. Using tumor microenvironment (TME) features in PSs, we further enable semi-automatic measurement of STAS location and its distance from the primary tumor. Several quantitative TME metrics are identified as potential biomarkers for STAS, including micropapillary-type STAS. Overall, DAEM offers a clinically actionable framework for STAS assessment by enabling accurate and interpretable detection on FSs and PSs, supporting postoperative risk stratification through quantitative TME-based analysis.

Chinese Translation

在肺癌的手术过程中和术后，准确诊断气道内扩散（STAS）对于指导外科决策和术后管理至关重要。然而，组织病理评估劳动强度大，容易出现漏诊或误诊。我们提出了一种扩散注意力专家模型（Diffusion Attention Expert Model, DAEM）用于检测冷冻切片（Frozen Sections, FSs）和石蜡切片（Paraffin Sections, PSs）中的STAS。其扩散注意力专家模块利用全注意力聚合来学习组织病理图像中的多尺度特征，同时双分支架构增强了多尺度特征的表示能力。在内部数据集上，DAEM在FSs上获得了0.8946的AUC，在PSs上获得了0.9112的AUC。在来自八个机构的外部多中心数据集上的验证显示了其强大的泛化能力和可解释性。利用PSs中的肿瘤微环境（Tumor Microenvironment, TME）特征，我们进一步实现了STAS位置及其与原发肿瘤距离的半自动测量。多个定量TME指标被确定为STAS的潜在生物标志物，包括微乳头状类型的STAS。总体而言，DAEM为STAS评估提供了一个临床可操作的框架，通过在FSs和PSs上实现准确且可解释的检测，支持基于定量TME分析的术后风险分层。

cs.CV / 42 / 2605.16456

Multi-hop Relational Contrastive Learning: Extending Spatial Contrastive Pre-training Beyond Pairwise Relations

多跳关系对比学习：超越成对关系的空间对比预训练

Ahmed, Sheikh Tanvir, Raihan, Md. Tanvir

Abstract

Understanding how objects relate to each other in space is fundamental to scene understanding, yet most contrastive pre-training approaches only model pairwise relationships, leaving richer compositional and multi-hop interactions largely unexplored. We introduce Multi-Hop Relational Contrastive Learning (MRCL), a framework that extends spatial contrastive learning to graph-structured scene representations. By tracing k-hop paths through scene graphs built from detected objects, MRCL captures implicit spatial dependencies that go well beyond what direct object pairs can express. We define a multi-level contrastive objective spanning nodes, edges, and multi-hop paths, encouraging embeddings that remain stable across object semantics while staying responsive to spatial layout. On a GQA subset, MRCL produces spatially-aware representations that improve content-based graph retrieval (NDCG@5 = 0.748) and consistently benefit downstream tasks, including spatial relationship recognition and graph-based question answering. Together, these results suggest that multi-hop relational supervision offers substantially richer structural guidance than pairwise-only methods, leading to visual representations that are more robust, compositional, and geometry-aware.

Chinese Translation

理解物体在空间中如何相互关联是场景理解的基础，然而大多数对比预训练方法仅建模成对关系，导致更丰富的组合和多跳交互尚未得到充分探索。我们提出了多跳关系对比学习（Multi-Hop Relational Contrastive Learning, MRCL），这是一个将空间对比学习扩展到图结构场景表示的框架。通过追踪从检测到的物体构建的场景图中的k跳路径，MRCL捕捉到超越直接物体对所能表达的隐含空间依赖关系。我们定义了一个跨越节点、边和多跳路径的多层次对比目标，鼓励嵌入在保持对象语义稳定的同时，对空间布局保持敏感。在GQA子集上，MRCL生成的空间感知表示提高了基于内容的图检索（NDCG@5 = 0.748），并持续促进下游任务的表现，包括空间关系识别和基于图的问答。这些结果表明，多跳关系监督提供了比仅基于成对的方法更丰富的结构指导，从而导致更强健、组合性更强且更具几何意识的视觉表示。
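
Tracing k-hop paths through a scene graph, as this abstract describes, amounts to enumerating walks of a fixed length between object nodes. Below is a minimal sketch that assumes a directed adjacency-list graph and restricts itself to simple paths (no repeated nodes); the graph contents and the function name are illustrative.

```python
def k_hop_paths(graph, start, k):
    """Return all simple paths with exactly k edges that begin at `start`."""
    paths = []

    def extend(path):
        if len(path) - 1 == k:
            paths.append(tuple(path))
            return
        for neighbour in graph.get(path[-1], ()):
            if neighbour not in path:  # keep paths simple
                extend(path + [neighbour])

    extend([start])
    return paths


# Hypothetical scene graph: an edge means "is spatially related to".
scene_graph = {
    "cup": ["table"],
    "table": ["floor", "chair"],
    "chair": ["floor"],
    "floor": [],
}
two_hop = k_hop_paths(scene_graph, "cup", 2)
three_hop = k_hop_paths(scene_graph, "cup", 3)
```

Here the 2-hop paths link the cup to the floor and the chair even though no direct edge exists, which is the kind of implicit dependency a pairwise-only objective would miss.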

cs.CV / 43 / 2605.16458

Conservative AI for Safety-Sensitive Medical Image Restoration: Residual-Bounded CT-CTA Enhancement for Intracranial Aneurysm-Relevant Signal Recovery

安全敏感医疗图像恢复的保守人工智能：针对颅内动脉瘤相关信号恢复的残差约束CT-CTA增强

Ma, Weijun

Abstract

Image restoration models are increasingly applied to degraded medical scans, but in safety-sensitive settings they must improve image quality without uncontrolled modification of clinically important regions. This is especially relevant for intracranial CT and CT angiography (CTA), where small vessels and aneurysm-relevant cues lie near high-contrast anatomical boundaries. We frame medical image restoration as a conservative AI problem and present a residual-bounded 2.5D restoration framework trained on synthetically degraded CT/CTA inputs. The model adds a learned residual to the original center slice through an edit-control map that limits the magnitude and spatial extent of modification. We evaluate the framework using an aneurysm-relevant image-recovery matrix, paired comparison against a Gaussian baseline, Monte Carlo stability testing, anatomical localization of meaningful edits, and external evaluation on low-dose CT. On 50 out-of-distribution CT-CTA cases, the bounded model achieved a mean target gain of 0.0635, a mean PSNR of 37.51 dB, and an iatrogenic-edit rate of 4.0%. Across 1,000 Monte Carlo runs, it remained net positive in 85.4% of runs with no stably negative cases. On external low-dose CT, the model was directionally beneficial and produced a substantially smaller modification footprint than the baseline. Meaningful edits concentrated in brain and skull regions while unrelated anatomy showed negligible change. These findings provide preliminary computational evidence that residual-bounded restoration is feasible in boundary-sensitive vascular imaging, but they do not establish clinical diagnostic performance and require expert review and prospective validation before clinical use.

Chinese Translation

图像恢复模型在退化的医疗扫描中应用越来越广泛，但在安全敏感的环境中，它们必须在不对临床重要区域进行不受控制修改的情况下提高图像质量。这对于颅内CT和CT血管造影（CTA）尤其重要，因为小血管和动脉瘤相关线索位于高对比度解剖边界附近。我们将医疗图像恢复框架视为一个保守人工智能问题，并提出了一种基于残差约束的2.5D恢复框架，该框架在合成退化的CT/CTA输入上进行训练。该模型通过编辑控制图将学习到的残差添加到原始中心切片上，限制了修改的幅度和空间范围。我们使用与动脉瘤相关的图像恢复矩阵进行框架评估，并与高斯基线进行配对比较，进行蒙特卡洛稳定性测试，进行有意义编辑的解剖定位，以及在低剂量CT上的外部评估。在50个分布外的CT-CTA案例中，约束模型实现了0.0635的平均目标增益，37.51 dB的平均PSNR，以及4.0%的医源性编辑率。在1000次蒙特卡洛运行中，模型在85.4%的运行中保持净正值，且没有稳定的负值案例。在外部低剂量CT上，该模型在方向上是有益的，并且产生的修改足迹显著小于基线。有效的编辑集中在大脑和颅骨区域，而无关的解剖结构几乎没有变化。这些发现提供了初步的计算证据，表明在边界敏感的血管成像中，残差约束恢复是可行的，但并未确立临床诊断性能，需在临床使用前进行专家审查和前瞻性验证。
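
The core idea of a residual added through an edit-control map with bounded magnitude can be written in a few lines. The sketch below is an assumption about the general form (output = input + map × clamped residual), not the paper's exact formulation; `bounded_restore`, `max_edit`, and the sample values are hypothetical.

```python
def bounded_restore(pixels, residual, edit_map, max_edit):
    """Apply a learned residual while limiting how much each pixel may change.

    pixels, residual, edit_map: equal-length sequences of floats.
    edit_map values are clamped to [0, 1] and gate where edits are allowed;
    the residual is clamped to [-max_edit, max_edit] to cap their magnitude.
    """
    restored = []
    for x, r, m in zip(pixels, residual, edit_map):
        r_clamped = max(-max_edit, min(max_edit, r))
        m_clamped = max(0.0, min(1.0, m))
        restored.append(x + m_clamped * r_clamped)
    return restored


# Hypothetical values: large residuals are capped, masked pixels are untouched.
original = [0.5, 0.5, 0.5]
output = bounded_restore(original, [10.0, -10.0, 0.02], [1.0, 0.5, 0.0], 0.1)
```

By construction no pixel can move by more than `max_edit`, regardless of what the network proposes, which is what makes the modification footprint auditable.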

cs.CV / 44 / 2605.16460

REC-RL: Referring expression counting via Gaussian and range-based reward optimization

REC-RL：通过高斯和基于范围的奖励优化进行指称表达计数

Liu, Hui, Teng, Yunlai, Bai, Kunlong, Qi, Pengfei, Yan, Haotian, Li, Liang, Feng, Junlan

Abstract

Referring expression counting (REC) is an intention-driven task that requires context-aware visual reasoning. While recent vision-language models incorporate language for visual understanding, most existing REC methods rely on rule-based reinforcement learning with rewards focused primarily on final accuracy, overlooking the quality of intermediate reasoning. We propose REC-RL, a reinforcement learning framework that introduces a think-range-answer paradigm to explicitly optimize the visual reasoning process. REC-RL employs Group Relative Policy Optimization and two lightweight rewards: an accuracy reward that combines range-based interval supervision with Gaussian-based precision guidance, and a format reward that enforces structured outputs. By modeling intermediate focus prediction as internal decision-making, REC-RL avoids additional annotations and better aligns with human perception. Extensive experiments demonstrate consistent improvements over strong baselines and robust generalization across benchmarks.

Chinese Translation

指称表达计数（REC）是一项以意图驱动的任务，要求具备上下文感知的视觉推理能力。尽管近期的视觉-语言模型将语言融入视觉理解中，但大多数现有的REC方法依赖于基于规则的强化学习，其奖励主要集中在最终准确性上，忽视了中间推理的质量。我们提出了REC-RL，一个强化学习框架，引入了思考-范围-回答的范式，以明确优化视觉推理过程。REC-RL采用了组相对策略优化和两种轻量级奖励：一种结合基于范围的区间监督与基于高斯的精度指导的准确性奖励，以及一种强制结构化输出的格式奖励。通过将中间聚焦预测建模为内部决策过程，REC-RL避免了额外的标注，并更好地与人类感知对齐。大量实验表明，REC-RL在强基线之上实现了一致的改进，并在基准测试中表现出强大的泛化能力。
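
The abstract names a range-based term and a Gaussian term for the accuracy reward but does not give their formulas. The sketch below shows one way such a reward might be combined, together with the group-relative advantage normalisation commonly associated with Group Relative Policy Optimization. The weighting, `sigma`, and all function names are assumptions made for illustration.

```python
import math


def accuracy_reward(pred_count, pred_range, true_count,
                    sigma=1.0, range_weight=0.5):
    """Blend an interval check with a Gaussian closeness score (assumed form)."""
    low, high = pred_range
    # Range term: did the predicted interval contain the true count?
    in_range = 1.0 if low <= true_count <= high else 0.0
    # Gaussian term: smooth credit that decays with the counting error.
    gaussian = math.exp(-((pred_count - true_count) ** 2) / (2.0 * sigma ** 2))
    return range_weight * in_range + (1.0 - range_weight) * gaussian


def group_relative_advantages(rewards):
    """Normalise rewards within one sampled group: (r - mean) / std."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    if std == 0.0:
        return [0.0] * len(rewards)  # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]


# Hypothetical group of two sampled answers for an image with 5 objects.
rewards = [accuracy_reward(5, (4, 6), 5), accuracy_reward(9, (8, 10), 5)]
advantages = group_relative_advantages(rewards)
```

The Gaussian term gives partial credit for near misses, while the range term rewards a sensible intermediate estimate even when the final count is slightly off.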

cs.CV / 45 / 2605.16464

MHMamba: Multi-Head Mamba for 3D Brain Tumor Segmentation

MHMamba：用于三维脑肿瘤分割的多头Mamba

Tao, Hanjun, Wang, Hua, Zhang, Fan

Abstract

Brain tumors exhibit high heterogeneity in morphology and multimodal contrast, making manual slice-by-slice delineation time-consuming and experience-dependent, thus necessitating efficient and stable automated segmentation methods. To address the limitations of CNNs in modeling long-range dependencies, and the heavy computational and memory overhead and inter-block contextual incoherence of Transformers in 3D MRI, this paper proposes Multi-Head Mamba (MHMamba). This method combines a U-shaped architecture with a multi-head state-space model (Mamba), splitting the channel dimension into parallel SSM heads and aggregating them with residuals. This enhances long-range representation and improves the stability of multimodal training while maintaining linear complexity. To further align statistics and enhance lesion response, we designed a channel-space calibration module for multi-head outputs and introduced an adaptive fusion mechanism at skip connections to dynamically connect global semantics with local details, thereby improving boundary consistency and the detection of small-volume lesions. We conducted experiments and ablations on BraTS2021 and BraTS2023. The results showed that MHMamba achieved stable and significant improvements in overall accuracy, boundary smoothness, and sensitivity to tumor core and small-volume enhancement areas, while preserving the linear-complexity advantage of Mamba-based modeling, thus verifying the effectiveness and versatility of the method.

Chinese Translation

脑肿瘤在形态和多模态对比方面表现出高度异质性，使得手动逐片描绘耗时且依赖经验，因此迫切需要高效且稳定的自动分割方法。为了解决卷积神经网络（CNN）在建模长距离依赖方面的局限性，以及变换器（Transformers）在三维MRI中的高计算和内存开销及块间上下文不一致性，本文提出了多头Mamba（MHMamba）。该方法结合了U型架构与多头状态空间模型（Mamba），将通道维度拆分为并行的状态空间模型头，并通过残差进行聚合。这增强了长距离表示能力，并在保持线性复杂度的同时提高了多模态训练的稳定性。为了进一步对齐统计信息并增强病灶响应，我们为多头输出设计了通道空间校准模块，并在跳跃连接处引入了自适应融合机制，以动态连接全局语义与局部细节，从而改善边界一致性和小体积病灶的检测。我们在BraTS2021和BraTS2023上进行了实验和消融研究。结果表明，MHMamba在整体准确性、边界平滑度以及对肿瘤核心和小体积增强区域的敏感性方面实现了稳定且显著的提升，同时保持了基于Mamba建模的线性复杂度优势，从而验证了该方法的有效性和通用性。

cs.CV / 46 / 2605.16468

Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex

机制可解释的神经编码揭示人类视觉皮层的细粒度功能选择性

Grosbard, Idan Daniel, Geva, Mor, Yovel, Galit

Abstract

A central goal in understanding human vision is to uncover the visual features that drive neuronal activity. A growing body of work has used artificial neural networks as encoding models to predict cortical responses to natural images, revealing the visual content that activates category-selective regions. However, existing approaches are largely correlational and treat the encoder as a black box, leaving open which image features drive each voxel's response. We introduce Mechanistically Interpretable Neural Encoding (MINE), a framework that opens this black box by applying mechanistic-interpretability tools to localize the features within natural images that drive millimeter-scale (voxel-level) activity. MINE predicts each voxel's response using language-aligned image representations, and produces semantically interpretable descriptions of the features critical for the voxel's activation. We further generalize these per-image features into per-voxel functional profiles. To validate the per-image descriptions, we show they are sufficient to generate images that elicit voxel responses matching the responses to the original images, more accurately than images generated from random or low-attribution controls. Moreover, counterfactually inserting or removing the predicted features from images shifts activation in the expected direction, providing causal evidence. Counterfactual editing guided by the per-voxel activation profiles produces even stronger activation shifts, indicating that the profiles faithfully capture each voxel's selectivity. Finally, we apply MINE to well-studied category-selective brain regions, showing it recovers their known categorical preferences while revealing fine-grained unique voxel structure within each region. Overall, our results establish mechanistic interpretability as a path to discover and causally validate fine-grained hypotheses about neural function.

Chinese Translation

理解人类视觉的一个核心目标是揭示驱动神经元活动的视觉特征。越来越多的研究使用人工神经网络作为编码模型来预测皮层对自然图像的响应，揭示激活类别选择性区域的视觉内容。然而，现有的方法主要是相关性的，并将编码器视为黑箱，未能明确哪些图像特征驱动每个体素的响应。我们引入了机制可解释的神经编码（Mechanistically Interpretable Neural Encoding, MINE），这一框架通过应用机制可解释性工具来定位驱动毫米级（体素级）活动的自然图像特征，从而打开了这一黑箱。MINE使用与语言对齐的图像表示来预测每个体素的响应，并生成对体素激活至关重要的特征的语义可解释描述。我们进一步将这些每幅图像的特征推广为每个体素的功能特征概况。为了验证每幅图像的描述，我们展示了这些描述足以生成引发与原始图像相匹配的体素响应的图像，其准确性优于从随机或低归因控制生成的图像。此外，从图像中反事实地插入或移除预测特征会使激活朝预期方向移动，提供了因果证据。基于每个体素激活特征概况的反事实编辑产生了更强的激活变化，表明这些特征概况准确捕捉了每个体素的选择性。最后，我们将MINE应用于研究良好的类别选择性脑区，显示它恢复了已知的类别偏好，同时揭示了每个区域内细粒度的独特体素结构。总体而言，我们的结果确立了机制可解释性作为发现和因果验证关于神经功能的细粒度假设的一条路径。

cs.CV / 47 / 2605.16481

Visual Agentic Memory: Enabling Online Long Video Understanding via Online Indexing, Hierarchical Memory, and Agentic Retrieval

视觉代理记忆：通过在线索引、层次记忆和代理检索实现在线长视频理解

Li, Aiden Yiliu, Numan, Nels, Steed, Anthony

Abstract

Long video understanding requires more than large context windows. It also needs a memory mechanism that decides what visual evidence to retain, keeps it searchable over long horizons, and grounds later reasoning in recoverable observations rather than compressed latent state alone. We propose Visual Agentic Memory (VAM), a training-free framework with three components. Online Indexing supports selective evidence retention under streaming constraints. Hierarchical Memory organises retained evidence in a Parallel Representation that aligns temporal context with spatial observations. Agentic Retrieval searches, inspects, and verifies candidate evidence before producing a grounded answer. On OVO-Bench, VAM achieves the highest RT+BT average (68.41) across all reported baselines, improving over end-to-end use of the same underlying MLLM (Gemini 3 Flash, 67.46). On the month-scale split of MM-Lifelong train@month (105.6 hours over 51 days), VAM reaches 17.11%, second only to ReMA with GPT-5 (17.62%). These results suggest that long-horizon video understanding benefits from treating visual memory as an explicit, inspectable, and queryable substrate. Code is available at https://github.com/yiliu-li/Visual-Agentic-Memory.

Chinese Translation

长视频理解不仅需要大范围的上下文窗口，还需要一种记忆机制，以决定保留哪些视觉证据，并在长时间范围内保持可搜索性，同时使后续推理基于可恢复的观察而非仅仅是压缩的潜在状态。我们提出了视觉代理记忆（Visual Agentic Memory, VAM），这是一个无训练框架，包含三个组成部分。在线索引（Online Indexing）支持在流媒体约束下的选择性证据保留。层次记忆（Hierarchical Memory）将保留的证据组织在一个平行表示（Parallel Representation）中，使时间上下文与空间观察相一致。代理检索（Agentic Retrieval）在生成有根据的答案之前，搜索、检查并验证候选证据。在OVO-Bench上，VAM在所有报告的基准中实现了最高的RT+BT平均值（68.41），相比于同一基础MLLM（Gemini 3 Flash，67.46）的端到端使用有所提升。在MM-Lifelong train@month的月度划分（105.6小时，51天）中，VAM达到了17.11%，仅次于使用GPT-5的ReMA（17.62%）。这些结果表明，长时间范围的视频理解受益于将视觉记忆视为一种明确的、可检查的和可查询的基础。代码可在https://github.com/yiliu-li/Visual-Agentic-Memory获取。

cs.CV / 48 / 2605.16515

SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability

SeamCam：通过多线索视觉可检测性量化无缝伪装

Monsefi, Amin Karimi, Meyarian, Abolfazl, Khurana, Mridul, Wang, Shuheng, Navard, Pouyan, Zhang, Cheng, Karpatne, Anuj, Chao, Wei-Lun, Ramnath, Rajiv

Abstract

Animals are described as effectively camouflaged when they blend seamlessly with their surroundings, yet no standardized quantitative measure of this seamlessness exists. We address this gap by framing camouflage evaluation as a visual localization problem: a well-camouflaged animal is one that remains difficult to detect even when its category is known. We introduce SeamCam (Seamless Camouflage), a metric that quantifies how detectable an animal is from the available visual evidence. Given an image and a target species, SeamCam generates category-conditioned detection proposals, extracts segmentation masks, and identifies the subset whose collective union yields the highest IoU with the ground-truth mask. The SeamCam score is one minus this maximum recoverable localization signal, where a higher score indicates stronger camouflage (i.e., lower detectability). In a human two-alternative forced-choice study with 94 participants and 2,390 comparisons, SeamCam achieves 78.82% agreement with human camouflage difficulty judgments, outperforming the state of the art by about 25%. We then demonstrate SeamCam's utility as a preference signal for Direct Preference Optimization (DPO) to fine-tune a diffusion-based inpainting model for camouflage generation. This offers an affordable training approach with an objective explicitly suited for camouflage generation, unlike typical diffusion models. To support rigorous benchmarking, we further introduce CamFG-1.5k, a curated dataset of 1,521 high-resolution images in which animals are fully visible prior to camouflage generation, enabling unbiased evaluation by controlling for occlusion artifacts present in existing datasets. Project page: https://7amin.github.io/SeamCam/

Chinese Translation

当动物与其周围环境无缝融合时，通常被描述为有效伪装，然而，目前尚无标准化的无缝度定量测量方法。我们通过将伪装评估框架设定为视觉定位问题来填补这一空白：一个伪装良好的动物是指即使已知其类别，仍然难以被检测到的动物。我们引入了SeamCam（无缝伪装），这一指标量化了从可用视觉证据中动物的可检测性。给定一张图像和一个目标物种，SeamCam生成类别条件的检测提议，提取分割掩码，并识别其集合的子集，该子集的联合体与真实掩码的IoU值最高。SeamCam分数为1减去此最大可恢复定位信号，分数越高表示伪装越强（即可检测性越低）。在一项包含94名参与者和2390次比较的人类两项选择强迫选择研究中，SeamCam与人类伪装难度判断的达成率为78.82%，比最先进的方法提高了约25%。随后，我们展示了SeamCam作为直接偏好优化（DPO）中的偏好信号的实用性，以微调基于扩散的修复模型用于伪装生成。这提供了一种经济的训练方法，其目标明确适用于伪装生成，而不同于典型的扩散模型。为了支持严格的基准测试，我们进一步引入了CamFG-1.5k，一个经过精心策划的数据集，包含1521张高分辨率图像，其中动物在伪装生成之前完全可见，从而通过控制现有数据集中存在的遮挡伪影，实现无偏评估。
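
The scoring rule stated in this abstract (one minus the best IoU achievable by the union of any subset of proposal masks) can be sketched directly. The version below represents masks as sets of pixel coordinates and searches subsets by brute force, which is exponential in the number of proposals and is only meant to make the definition concrete; the paper's actual subset search is not described in the abstract, and the names here are hypothetical.

```python
from itertools import combinations


def iou(mask_a, mask_b):
    """Intersection over union for masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 0.0
    return len(mask_a & mask_b) / len(union)


def seamcam_score(proposal_masks, gt_mask):
    """1 - max IoU over unions of proposal subsets; higher = better camouflage."""
    best = 0.0
    for size in range(1, len(proposal_masks) + 1):
        for subset in combinations(proposal_masks, size):
            merged = set().union(*subset)
            best = max(best, iou(merged, gt_mask))
    return 1.0 - best


# Hypothetical 2x2 animal mask and three detector proposals.
gt = {(0, 0), (0, 1), (1, 0), (1, 1)}
proposals = [{(0, 0), (0, 1)}, {(1, 0), (1, 1)}, {(5, 5)}]
score = seamcam_score(proposals, gt)
```

In this toy case two proposals together recover the whole animal, so the score is at its minimum; an image on which no combination of proposals overlaps the animal would score 1.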

cs.CV / 49 / 2605.16519

DepthPolyp: Pseudo-Depth Guided Lightweight Segmentation for Real-Time Colonoscopy

DepthPolyp：基于伪深度引导的轻量级实时结肠镜分割

Wu, Zhuoyu, Ou, Wenhui, Zhang, Lexi, Tan, Pei-Sze, Wu, Dongjun, Zhao, Junhe, Fang, Wenqi, Phan, Raphaël C. -W.

Abstract

Accurate polyp segmentation in colonoscopy is essential for early colorectal cancer detection, yet real-world clinical environments pose persistent challenges such as motion blur, specular reflections, and illumination instability. Most existing methods are optimized on clean benchmark images and suffer noticeable performance degradation when deployed in authentic surgical scenarios. We propose DepthPolyp, a lightweight and robust segmentation framework based on pseudo-depth-guided multi-task learning and efficient feature modulation. The architecture combines hierarchical Ghost factorization for compact feature generation, Interleaved Shuffle Fusion for low-cost cross-scale interaction, and Dynamic Group Gating for adaptive group-wise feature weighting. Extensive experiments demonstrate that DepthPolyp achieves strong cross-dataset generalization when trained on degraded data and evaluated on both clean and noisy target domains, consistently outperforming lightweight baselines and remaining competitive with substantially larger models. In real surgical video evaluation on PolypGen, DepthPolyp achieves better segmentation performance than models up to $20\times$ larger while preserving real-time inference speed. With only 3.57M parameters and 0.86 GMACs, the proposed method runs at over 180 FPS on mobile devices, making it well suited for real-time deployment in resource-constrained clinical environments. Code and pretrained weights are available at: https://github.com/ReaganWu/DepthPolyp/

Chinese Translation

在结肠镜检查中，准确的息肉分割对于早期结直肠癌的检测至关重要，但现实临床环境中存在运动模糊、镜面反射和光照不稳定等持续挑战。大多数现有方法是在干净的基准图像上进行优化，在真实手术场景中部署时性能显著下降。我们提出了DepthPolyp，这是一种基于伪深度引导的多任务学习和高效特征调制的轻量级鲁棒分割框架。该架构结合了分层Ghost因子分解以生成紧凑特征、交错洗牌融合以实现低成本跨尺度交互，以及动态组门控以适应性地进行组特征加权。大量实验表明，DepthPolyp在使用退化数据训练并在干净和噪声目标领域评估时，能够实现强大的跨数据集泛化，始终优于轻量级基线，并与显著更大的模型保持竞争力。在PolypGen的真实手术视频评估中，DepthPolyp的分割性能优于高达20倍更大的模型，同时保持实时推理速度。该方法仅需3.57M参数和0.86 GMACs，在移动设备上运行速度超过180 FPS，非常适合在资源受限的临床环境中进行实时部署。代码和预训练权重可在以下链接获取：https://github.com/ReaganWu/DepthPolyp/

cs.CV / 50 / 2605.16530

SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation

SWoMo：用于白内障手术模拟的神经符号世界模型

Sivakumar, Ssharvien Kumar, Johnson, Akwele, Dhingra, Anirudh, Frisch, Yannik, Ghazaei, Ghazal, Mukhopadhyay, Anirban

Abstract

Realistic surgical simulation plays a crucial role in training novice surgeons and in the development of autonomous agents. World models can scale such simulation environments to realistic and diverse procedures by predicting future patient states conditioned on current observations and surgical actions. However, current state-of-the-art approaches often fail to satisfy key criteria required for clinical applicability, including visual realism, physically grounded interactions, and the ability to simulate scenarios beyond the training distribution. Hence, we introduce SWoMo, a neuro-symbolic world model for cataract surgery simulation that decouples motion generation from visual realism. The symbolic component, consisting of a rule-based simulator and scene graph representations, models motion dynamics and tool-tissue interactions, while a diffusion model produces realistic visual appearance, including textures and tissue deformations. We propose an inverse pairing strategy that reconstructs real surgical videos in the simulator to obtain paired simulated and real videos, which are then used to train our video diffusion model for the reverse objective of sim-to-real translation. Our experiments show both qualitative and quantitative improvements over prior work. We demonstrate that our simulator further satisfies the key criteria, including generalisation to unseen interaction geometries, improvements in downstream phase detection, and unsupervised video style transfer. The code, data, and model weights are available at: https://ssharvienkumar.github.io/SWoMo/

Chinese Translation

现实的手术模拟在培训新手外科医生和自主智能体的发展中起着至关重要的作用。世界模型可以通过预测基于当前观察和手术动作的未来患者状态，将此类模拟环境扩展到现实和多样化的程序。然而，目前的最先进方法往往未能满足临床应用所需的关键标准，包括视觉真实感、物理基础的交互以及模拟超出训练分布的场景的能力。因此，我们提出了SWoMo，一个用于白内障手术模拟的神经符号世界模型，它将运动生成与视觉真实感解耦。符号组件由基于规则的模拟器和场景图表示组成，建模运动动态和工具-组织交互，而扩散模型则生成真实的视觉外观，包括纹理和组织变形。我们提出了一种逆配对策略，在模拟器中重建真实手术视频，以获得配对的模拟视频和真实视频，然后用于训练我们的视频扩散模型，以实现从模拟到真实转换这一反向目标。我们的实验显示出相较于之前工作的定性和定量改进。我们证明我们的模拟器进一步满足关键标准，包括对未见交互几何的泛化、下游手术阶段检测的改善以及无监督视频风格迁移。代码、数据和模型权重可在以下网址获取：https://ssharvienkumar.github.io/SWoMo/

cs.CV / 51 / 2605.16550

Attention-Aware Transformer-Based Aggregation Network for Video Periocular Recognition

基于注意力机制的Transformer聚合网络用于视频周眼识别

Carreira, Luiz G F, Mariano, Breno A, de Melo, Victor H C, Menotti, David, Schwartz, William Robson

Abstract

Video periocular recognition is the task of recognizing an individual's identity based on the region around that individual's eyes. The periocular area is one of the most discriminative regions of the human face, making it suitable for recognition tasks. Its use as a biometric modality has emerged as an alternative, especially in surveillance scenarios where conventional biometric modalities such as face or iris recognition become infeasible due to unconstrained acquisition conditions. This paper proposes an attention-aware approach for video-based periocular recognition in surveillance environments. The framework consists of two main modules: feature embedding and aggregation. The feature embedding module is a deep convolutional neural network that maps periocular data to feature vectors. The aggregation module is an encoder-only transformer that adaptively learns to aggregate frame-level features into a single video representation and a feature vector for the still reference image. Experiments on the publicly available COX Face dataset show that the proposed method is robust and consistently outperforms naive aggregation schemes. In the best scenario, the approach achieves a TPR@$10^{-1}$ of $99.8\%$ and a Rank-5 score of $96.6\%$.

Chinese Translation

视频周眼识别是基于个体眼睛周围区域识别个体身份的任务。周眼区域是人脸中最具辨别力的区域之一，适合用于识别任务。作为生物特征的一种替代方式，它在监控场景中得到了应用，尤其是在传统的生物特征如面部或虹膜识别因无约束采集条件而变得不可行的情况下。本文提出了一种针对监控环境中基于视频的周眼识别的注意力机制方法。该框架由两个主要模块组成：特征嵌入和聚合。特征嵌入模块是一个深度卷积神经网络，将周眼数据映射到特征向量。聚合模块是一个仅包含编码器的Transformer，能够自适应地学习将帧级特征聚合为单一视频表示和静态参考图像的特征向量。在公开可用的COX Face数据集上的实验表明，所提方法具有鲁棒性，并始终优于简单的聚合方案。在最佳情况下，该方法在TPR@$10^{-1}$上达到了$99.8\%$，在Rank-5上达到了$96.6\%$。

cs.CV / 52 / 2605.16572

TriALS: Triphasic-Aided Liver Lesion Segmentation Benchmark in Non-Contrast CT

TriALS：非对比CT下三相辅助肝病灶分割基准

Elbatel, Marawan, Ghonim, Mohamed, Mao, Jiaji, Lin, Zhuosheng, Eckstein, Katharina, Mora, Andrés Martínez, Deissler, Jonathan, Rokuss, Maximilian, Ulrich, Constantin, Marinov, Zdravko, Deng, Wenhui, Li, Baoxun, Hu, Huijun, Shen, Jun, Ghonim, Mohanad, Nassar, Khadiga Omar, Elbakry, Mariam, Dyab, Menna, Salem, Amr Muhammad Abdo, Elghitany, Nouran, Elghitany, Noha, Qin, Yi, Huang, Xuanqi, Wang, Haonan, Yen, Shao-Woo, Saba, Ahmed Elghamry, Ahmad, Salma, Fang, Xinyan, Zhang, Jiahao, Wang, Xiaodi, Ma, Xinghua, Luo, Gongning, Delmoral, Jessica C., Tavares, João Manuel R. S., Deria, Ankan, Dukre, Adinath, Xie, Yutong, Razzak, Imran, Kim, Dongwook, Choi, Matthew, Zhang, Hanxiao, Zhang, Minghui, You, Xin, Qayyum, Abdul, Niederer, Steven A., Mazher, Moona, Hamadache, Rachika E., Montoya-del-Angel, Ricardo, Martí, Robert, Lladó, Xavier, Musah, Toufiq, Ayivor, Livingstone Eli, Almar-Munoz, Enrique, Mayr, Agnes, Mouheb, Kaouther, Bron, Esther E., Klein, Stefan, Abouelhoda, Ahmed, Adel, Amira, Ali, Susan Adil, Stiefelhagen, Rainer, Maier-Hein, Klaus H., Isensee, Fabian, Yassin, Aya, Li, Xiaomeng

Abstract

Automated segmentation of liver lesions on non-contrast computed tomography (NCCT) is clinically important but fundamentally challenging, particularly in low-resource settings across Africa and Asia where contrast agents are frequently unavailable. Progress has been limited by the absence of annotated NCCT benchmarks. Here we describe the TriALS challenge for automated liver lesion segmentation under contrast-limited conditions, supported by a multi-centre dataset of 150 cases with four-phase CT acquisitions (600 volumes) from Egyptian and Chinese institutions. Algorithms were evaluated on 70 cases from three institutions, including an independent external cohort. The top-performing method achieved a mean venous-phase Dice of 0.754, consistent with human-level performance, yet dropped to 0.57 on NCCT. On external validation, the leading method outperformed off-the-shelf models by up to 28% in Dice on NCCT. Algorithm performance was most strongly predicted by training data scale and pre-training strategy. A cross-year comparison exposed a persistent perceptual barrier on NCCT that scaling pre-training alone cannot overcome. Data, annotations, and code are available at https://github.com/xmed-lab/TriALS.

Chinese Translation

在非对比计算机断层扫描（NCCT）上自动分割肝病灶在临床上具有重要意义，但在根本上具有挑战性，尤其是在非洲和亚洲的低资源环境中，那里对比剂通常不可用。由于缺乏经过注释的NCCT基准，进展受到限制。在此，我们描述了TriALS挑战，旨在在对比受限条件下进行自动肝病灶分割，支持的数据集来自埃及和中国机构，包含150个病例的四相CT采集（600个体积）。算法在来自三个机构的70个病例上进行了评估，包括一个独立的外部队列。表现最佳的方法在静脉相阶段的Dice系数达到了0.754，与人类水平的表现一致，但在NCCT上降至0.57。在外部验证中，领先的方法在NCCT上比现成模型的Dice系数提高了多达28%。算法性能最强的预测因素是训练数据规模和预训练策略。跨年度比较揭示了在NCCT上存在持久的感知障碍，仅仅扩大预训练无法克服。数据、注释和代码可在https://github.com/xmed-lab/TriALS获取。

cs.CV / 53 / 2605.16579

Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion

局部关注，线性记忆：线性注意力作为自回归视频扩散的跨帧记忆

Li, Kunyang, Shah, Mubarak, Shang, Yuzhang

Abstract

Autoregressive (AR) video diffusion is a powerful paradigm for streaming and interactive video generation. However, its reliance on softmax self-attention leads to compute that grows quadratically with sequence length and to memory usage that grows with the key-value cache, which limits its scalability to long video horizons. Existing remedies (e.g., sparse attention and KV-cache compression) reduce per-step cost but still rely on a linearly growing cache or irreversibly discard past context, and thus fail to address linear memory growth and streaming context management. To address this scalability bottleneck, we propose ARL2 (Attend Locally, Remember Linearly), a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. We decompose self-attention into two branches: an intra-frame softmax branch for spatial detail and local dependencies, and an inter-frame gated recurrent linear branch that maintains a fixed-size state for streaming context. Our key insight is that softmax attention captures fine-grained local interactions, while a recurrent state provides controllable long-range memory. This design achieves linear-time scaling with constant memory while improving temporal consistency over the full-softmax model. To prevent noisy intermediate states from corrupting memory, we update the recurrent state only after the denoised pass. To avoid within-frame information asymmetry, all tokens share the same pre-update state rather than receiving sequential updates. To the best of our knowledge, this is the first work to convert a pretrained AR video diffusion model into a hybrid linear attention architecture, using an efficient two-stage training scheme for AR video. With 75% of layers replaced by hybrid linear attention, the model achieves up to a 2.26x wall-clock speedup and a 54% memory reduction, while maintaining comparable quality and improving temporal consistency.
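
The inter-frame branch described in this abstract can be illustrated with a small sketch of a gated recurrent linear-attention memory. It is not the ARL2 implementation: the scalar `gate`, the absence of normalization, and the function names are assumptions, and the intra-frame softmax branch and the diffusion denoising loop are omitted. The sketch only shows the two properties the abstract names: every token in a frame reads the same pre-update state, and the state keeps a fixed size however many frames are processed.

```python
def read_state(state, q):
    """Linear-attention readout: o = q^T S, with S of shape (d_k, d_v)."""
    d_v = len(state[0])
    return [sum(q[i] * state[i][j] for i in range(len(q))) for j in range(d_v)]


def update_state(state, keys, values, gate):
    """Decay the old state by `gate` and add the frame's key-value outer products."""
    new_state = [[gate * s for s in row] for row in state]
    for k, v in zip(keys, values):
        for i, k_i in enumerate(k):
            for j, v_j in enumerate(v):
                new_state[i][j] += k_i * v_j
    return new_state


def process_frame(state, queries, keys, values, gate=0.9):
    """All tokens of the frame read the same pre-update state; the state is
    updated once, after the readout."""
    outputs = [read_state(state, q) for q in queries]
    return outputs, update_state(state, keys, values, gate)


# Two identical toy frames with two tokens each, d_k = d_v = 2.
state = [[0.0, 0.0], [0.0, 0.0]]
frame = dict(queries=[[1.0, 0.0], [0.0, 1.0]],
             keys=[[1.0, 0.0], [0.0, 1.0]],
             values=[[1.0, 2.0], [3.0, 4.0]])
out1, state = process_frame(state, **frame)  # first frame sees an empty memory
out2, state = process_frame(state, **frame)  # second frame reads frame 1's memory
```

Here `out1` is all zeros because the memory starts empty, while `out2` recovers the values written by the first frame. The state remains a 2x2 matrix throughout, in contrast to a key-value cache that grows with every frame.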

Chinese Translation

自回归（AR）视频扩散是一种强大的流媒体和交互式视频生成范式。然而，它对 softmax 自注意力的依赖导致计算量随序列长度呈平方增长，且内存占用随键值缓存不断增长，这限制了其在长视频时间范围内的可扩展性。现有的解决方案（例如，稀疏注意力和 KV-cache 压缩）虽然降低了每步的成本，但仍然依赖于线性增长的缓存或不可逆地丢弃过去的上下文，因此未能解决线性内存增长和流媒体上下文管理的问题。为了解决这一可扩展性瓶颈，我们提出了 ARL2（局部关注，线性记忆），这是一种混合注意力模块，用固定大小的递归状态替代了平方跨帧注意力。我们将自注意力分解为两个分支：一个用于空间细节和局部依赖的帧内 softmax 分支，以及一个为流媒体上下文维护固定大小状态的帧间门控递归线性分支。我们的关键见解是，softmax 注意力捕捉细粒度的局部交互，而递归状态提供可控的长程记忆。这一设计在保持恒定内存的同时实现了线性时间扩展，并提高了相较于全 softmax 模型的时间一致性。为了防止噪声中间状态破坏记忆，我们仅在去噪通过后更新递归状态。为了避免帧内信息的不对称，所有标记共享相同的预更新状态，而不是顺序更新。根据我们所知，这是第一项将预训练的 AR 视频扩散模型转换为混合线性注意力架构的工作，并为此采用了面向 AR 视频的高效两阶段训练方案。用混合线性注意力替换 75% 的层后，该模型实现了高达 2.26 倍的实际速度提升和 54% 的内存减少，同时保持了相当的质量并提高了时间一致性。

cs.CV / 54 / 2605.16582

ArtMesh: Part-Aware Articulated Mesh Fields with Motion-Consistent Dynamics

ArtMesh：具有运动一致性动态的部件感知关节网格场

Yuan, Sylvia, Wang, Dan, Ramamoorthi, Ravi, Cui, Xinrui

Abstract

We present ArtMesh, a mesh-native method for reconstructing articulated objects explicitly as connected triangle meshes with per-part rigid motion from multi-view images in start and end states. Existing 3D Gaussian Splatting pipelines for articulated reconstruction inherit the unstructured point-based geometry of their splatting base, which provides no surface topology for reasoning about part boundaries or enforcing motion consistency along the object's connectivity. ArtMesh instead builds on a mesh-based differentiable rendering backbone, enabling part-aware dynamics to act directly on the structured topology. To make the topology compatible with articulation, we introduce part-aware restricted Delaunay remeshing, producing connected submeshes whose triangles do not cross semantic part boundaries. The dynamic mesh field then optimizes articulation using bidirectional Vertex-wise Motion Consistency on transported mesh vertices and Pixel-wise Motion Consistency on rendered RGB-D observations. We introduce Articulate-100, a new benchmark of 100 articulated objects spanning 16 PartNet-Mobility categories. On this benchmark, ArtMesh outperforms prior 3DGS-based pipelines in joint parameter estimation and part-level geometric reconstruction, with the largest gains on objects with many movable parts.

Chinese Translation

我们提出了ArtMesh，这是一种网格原生方法，用于从多视角图像中的起始和结束状态显式重建关节对象，作为具有每个部件刚性运动的连接三角网格。现有的用于关节重建的3D高斯点云投影管道继承了其点云基础的非结构化点几何，这种几何没有提供表面拓扑以推理部件边界或强制沿对象连通性的一致运动。ArtMesh则基于网格可微渲染骨架，允许部件感知动态直接作用于结构化拓扑。为了使拓扑与关节运动兼容，我们引入了部件感知限制的德劳内重网格化，生成连接的子网格，其三角形不跨越语义部件边界。动态网格场随后利用双向顶点运动一致性（Vertex-wise Motion Consistency）对传输的网格顶点和像素运动一致性（Pixel-wise Motion Consistency）对渲染的RGB-D观测进行关节优化。我们引入了Articulate-100，这是一个新的基准，包含100个跨越16个PartNet-Mobility类别的关节对象。在这个基准上，ArtMesh在关节参数估计和部件级几何重建方面优于先前基于3DGS的管道，尤其在具有多个可动部件的对象上取得了最大的提升。

cs.CV / 55 / 2605.16603

Controlla: Learning Controllability via Graph-Constrained Latent Geometry

Controlla：通过图约束潜在几何学习可控性

Murthy, Jamuna S., Monsefi, Amin Karimi, Ramnath, Rajiv

Abstract

Controllable multimodal generation is commonly formulated as an inference-time conditioning problem using prompts, guidance, or auxiliary modules. While effective, such approaches do not explicitly structure how semantic attributes evolve, which can lead to identity drift and inconsistent cross-modal behavior. We propose Controlla, a modular factorized-control framework that treats controllability as a property of structured latent geometry. Controlla learns identity and attribute factors from multimodal inputs and aligns them with graph priors using graph-constrained optimal transport, encouraging attributes to follow graph-consistent trajectories while preserving reference identity. To evaluate this setting, we construct AffectHuman-43K, a leakage-aware multimodal benchmark for reference-grounded affective control, and introduce geometry-aware metrics for trajectory consistency and latent disentanglement. Experiments show consistent improvements in controllability, identity preservation, and cross-modal alignment, with additional analyses on graph sensitivity, extensibility, and robustness.

Chinese Translation

可控的多模态生成通常被表述为一种在推理时使用提示、引导或辅助模块的条件化问题。尽管这种方法有效，但并未明确结构化语义属性的演变，这可能导致身份漂移和跨模态行为不一致。我们提出了Controlla，一个模块化的因子控制框架，将可控性视为结构化潜在几何的一个属性。Controlla从多模态输入中学习身份和属性因子，并使用图约束的最优传输将其与图先验对齐，鼓励属性遵循图一致的轨迹，同时保持参考身份。为了评估这一设置，我们构建了AffectHuman-43K，一个关注泄漏的多模态基准，用于参考基础的情感控制，并引入了关注几何的轨迹一致性和潜在解耦的度量。实验表明，在可控性、身份保留和跨模态对齐方面均有一致的改善，并对图的敏感性、可扩展性和鲁棒性进行了额外分析。

cs.CV / 56 / 2605.16628

SCARED-C: Corrected Camera Poses for Endoscopic Depth Estimation

SCARED-C：用于内窥镜深度估计的校正相机姿态

Han, John J., Schmidt, Adam, Allan, Max, Wu, Jie Ying, Mohareri, Omid

Abstract

The SCARED dataset is a widely used benchmark for endoscopic depth estimation, offering ground-truth 3D reconstructions captured with a structured light sensor. However, the depth maps for non-keyframe images rely on robot kinematics that introduce substantial pose errors, limiting the reliably labeled portion of the dataset to 35 keyframes. We present SCARED-C, a corrected version of the SCARED dataset that expands the number of reliable RGB-D pairs from 35 to 17,135. Our pipeline applies COLMAP, a Structure-from-Motion system, to re-estimate camera poses for all frames, followed by a scale recovery step that aligns the resulting reconstructions to metric space using the ground-truth keyframe depth maps. We validate the corrected poses through (1) stereo disparity evaluation and (2) monocular depth estimation experiments. The corrected dataset and code are publicly released to the community.
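
The scale recovery step mentioned in this abstract can be sketched as below. The abstract does not say how the scale is estimated, so the median-of-ratios estimator is an assumption chosen for its robustness to outliers. The function names, the flattened-list depth format, and the millimetre unit in the toy data are likewise hypothetical.

```python
from statistics import median


def recover_scale(sfm_depths, gt_depths):
    """Estimate the metric scale as the median ratio gt / sfm over valid pixels.

    Pixels with a non-positive depth in either map are treated as invalid.
    """
    ratios = [g / s for s, g in zip(sfm_depths, gt_depths) if s > 0 and g > 0]
    if not ratios:
        raise ValueError("no valid depth pairs to estimate scale from")
    return median(ratios)


def to_metric(sfm_depths, scale):
    """Apply the recovered scale to an up-to-scale depth map (flattened list)."""
    return [d * scale for d in sfm_depths]


# Toy keyframe: SfM depths in arbitrary units, ground truth in millimetres.
sfm = [1.0, 2.0, 0.0, 4.0]        # 0.0 marks a pixel without a reconstruction
gt = [50.0, 100.0, 80.0, 210.0]   # the last pixel is slightly noisy
scale = recover_scale(sfm, gt)
metric = to_metric(sfm, scale)
```

The valid ratios are 50, 50, and 52.5, so the median gives a scale of 50 and the noisy pixel does not bias the estimate.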

Chinese Translation

SCARED 数据集是内窥镜深度估计的广泛使用基准，提供了通过结构光传感器捕获的真实3D重建。然而，非关键帧图像的深度图依赖于机器人运动学，这引入了显著的姿态误差，限制了数据集中可靠标注部分仅为35个关键帧。我们提出了SCARED-C，这是SCARED数据集的校正版本，将可靠的RGB-D配对数量从35扩展到17,135。我们的流程应用了COLMAP，一个运动重建系统，重新估计所有帧的相机姿态，随后进行尺度恢复步骤，将生成的重建与基于真实关键帧深度图对齐到度量空间。我们通过（1）立体视差评估和（2）单目深度估计实验验证了校正后的姿态。校正后的数据集和代码已公开发布给社区。

cs.CV / 57 / 2605.16649

AtlasVid: Efficient Ultra-High-Resolution Long Video Generation via Decoupled Global-Local Modeling

AtlasVid：通过解耦的全局-局部建模高效生成超高分辨率长视频

Mai, Ziyang, Zhang, Yuyao, Tai, Yu-Wing

Abstract

Recent diffusion-based video generators have achieved remarkable visual fidelity and prompt controllability, yet scaling them to ultra-high-resolution (UHR) long videos remains prohibitively expensive. The difficulty is especially pronounced for long single-shot generation, where a continuous scene must preserve global temporal coherence and fine-grained spatial details without relying on clip transitions or autoregressive shot stitching. In this work, we revisit this challenge from the perspective of decoupled modeling. We argue that existing video diffusion models already encode strong local visual priors, while the main bottleneck lies in efficiently extending global spatiotemporal modeling as resolution and duration increase. Based on this insight, we propose AtlasVid, a decoupled global-local framework for efficient UHR long video generation. AtlasVid first generates a low-resolution and low-FPS global semantic proxy via temporally scaled RoPE, thereby extending the temporal horizon without increasing the training token count. Guided by this proxy, a high-resolution detail branch performs joint denoising with hierarchical locality-preserving attention. Reordered spatiotemporal windows preserve geometric locality, and asymmetric global-local attention injects aligned semantic guidance while preserving the model's pretrained ability. This design enables resolution-agnostic training: the model is trained only at 720P with lightweight LoRA adaptation, yet generalizes directly to 4K and beyond for longer (>10s) video synthesis. Experiments show that AtlasVid substantially improves the efficiency of UHR long video generation, achieving a 60.9x speedup, significantly lower training cost, and even better performance than native 4K video generators.

Chinese Translation

最近基于扩散的视频生成器在视觉保真度和提示可控性方面取得了显著进展，但将其扩展到超高分辨率（UHR）长视频仍然成本高昂。对于长时间单次生成而言，这一困难尤为明显，因为连续场景必须在不依赖剪辑过渡或自回归镜头拼接的情况下保持全局时间一致性和细粒度空间细节。在本研究中，我们从解耦建模的角度重新审视这一挑战。我们认为，现有的视频扩散模型已经编码了强大的局部视觉先验，而主要瓶颈在于如何随着分辨率和时长的增加高效地扩展全局时空建模。基于这一见解，我们提出了AtlasVid，一个用于高效UHR长视频生成的解耦全局-局部框架。AtlasVid首先通过时间缩放的RoPE生成一个低分辨率和低帧率的全局语义代理，从而在不增加训练标记数量的情况下扩展时间范围。在该代理的指导下，高分辨率细节分支通过分层保持局部性的注意力进行联合去噪。重新排序的时空窗口保持几何局部性，非对称的全局-局部注意力注入对齐的语义指导，并保持模型的预训练能力。这一设计实现了与分辨率无关的训练：模型仅在720P下进行轻量级LoRA适应训练，但可以直接推广到4K及更高分辨率的长视频合成（>10秒）。实验表明，AtlasVid显著提高了超高分辨率长视频生成的效率，速度提升达60.9倍，训练成本显著降低，且性能优于原生4K视频生成器。

cs.CV / 58 / 2605.16651

Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations

正确的预测，误导性的解释：视觉-语言模型解释的脆弱性

Babadi, Narges, Karimipour, Hadis

Abstract

Explanation mechanisms are increasingly used to support transparency and trust in vision-language models (VLMs), particularly in settings where model decisions require human oversight. However, the robustness of these explanations remains insufficiently understood. In this work, we investigate whether explanation heatmaps in VLMs, particularly CLIP-based models, faithfully reflect model reasoning under adversarial conditions. We show that explanation maps can be systematically manipulated while preserving the model's original prediction, revealing a disconnect between predictive behavior and explanation faithfulness. To study this vulnerability, we introduce X-Shift, a novel grey-box attack that perturbs patch-level visual representations to redirect explanation heatmaps toward semantically irrelevant regions without altering the predicted output. Unlike conventional adversarial attacks that aim to induce misclassification, X-Shift specifically targets the integrity of the explanation process itself. The attack operates without modifying model parameters and generalizes across multiple CLIP architectures and explanation methods. We evaluate the proposed approach on ImageNet-1k, MS-COCO, and Flickr30K, demonstrating consistent degradation in explanation alignment under imperceptible perturbations while maintaining prediction stability. Furthermore, standard prediction-oriented adversarial attacks fail to reproduce the same explanation-shifting behavior even under substantially larger perturbation budgets. Our findings highlight a fundamental limitation of current explanation mechanisms in VLMs and raise concerns about their use as reliable indicators of model trustworthiness in high-impact applications.

Chinese Translation

解释机制越来越多地被用于支持视觉-语言模型（VLMs）的透明性和可信度，尤其是在模型决策需要人类监督的场景中。然而，这些解释的稳健性仍然不够明确。在本研究中，我们探讨了VLMs中的解释热图，特别是基于CLIP的模型，是否在对抗条件下真实反映了模型推理。我们表明，解释图可以在保持模型原始预测的同时被系统性地操控，揭示了预测行为与解释可信度之间的脱节。为了研究这种脆弱性，我们引入了X-Shift，一种新颖的灰箱攻击，它通过扰动补丁级别的视觉表示，将解释热图引导至语义无关的区域，而不改变预测输出。与旨在引发错误分类的传统对抗攻击不同，X-Shift专门针对解释过程本身的完整性。该攻击在不修改模型参数的情况下运行，并在多个CLIP架构和解释方法中具有广泛的适用性。我们在ImageNet-1k、MS-COCO和Flickr30K上评估了所提出的方法，展示了在不可察觉的扰动下，解释一致性的一致下降，同时保持预测的稳定性。此外，标准的以预测为导向的对抗攻击即使在显著更大的扰动预算下也未能重现相同的解释转移行为。我们的发现突显了当前VLMs中解释机制的基本局限性，并对其在高影响力应用中作为可靠模型可信度指标的使用提出了担忧。

cs.CV / 59 / 2605.16672

Multi-Object Tracking Consistently Improves Wildlife Inference

多目标跟踪持续改善野生动物推断

Muthivhi, Mufhumudzi, Huo, Jiahao, Gustafsson, Fredrik, van Zyl, Terence L.

Abstract

Camera traps have become a common tool for wildlife monitoring in ecological research and biodiversity conservation. Wildlife classification models have benefited from the increase in wildlife visual data. These models reach high levels of accuracy on curated, high-quality datasets. However, their performance remains sensitive to real-world environmental constraints. They often produce inconsistent predictions when performing inference on temporally coherent sequences: the predicted label for a single individual shifts rapidly between frames. This study exploits the temporal nature of camera-trap data to augment inferred predictions from a wildlife classification model. Specifically, we adopt several standard Multi-Object Tracking (MOT) models to link detections across consecutive frames. The curated trajectories are used to fuse the softmax class probabilities. The fused probability score produces a single consensus class label estimate that overrides misclassifications caused by noise. The experimental results show that our proposed strategy improves on a standalone classifier on all datasets and for each metric. Specifically, the best-performing MOT models improve the weighted F1-score by 5.1%, 3.1%, and 2.0% over the classifier across three MOT datasets.
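
The fusion step described in this abstract can be sketched as follows, assuming the tracker has already grouped the detections of one animal into a track. Averaging the softmax vectors is an assumption (the abstract does not name the fusion rule), and the class names and function names are hypothetical.

```python
def fuse_track(prob_rows):
    """Average per-frame softmax vectors along one track (mean fusion is an assumption)."""
    n = len(prob_rows)
    return [sum(col) / n for col in zip(*prob_rows)]


def consensus_label(prob_rows, class_names):
    """Return the single label assigned to every detection in the track."""
    fused = fuse_track(prob_rows)
    best = max(range(len(fused)), key=fused.__getitem__)
    return class_names[best], fused


classes = ["impala", "kudu", "zebra"]   # hypothetical label set
track = [
    [0.70, 0.20, 0.10],   # frame 1: impala
    [0.30, 0.60, 0.10],   # frame 2: a noisy frame flips to kudu
    [0.80, 0.10, 0.10],   # frame 3: impala
]
per_frame = [classes[max(range(3), key=row.__getitem__)] for row in track]
label, fused = consensus_label(track, classes)
```

Frame-by-frame classification labels the second frame as kudu, whereas the fused probabilities (about 0.6, 0.3, 0.1) give impala for the whole track.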

Chinese Translation

相机捕捉器已成为生态研究和生物多样性保护中野生动物监测工作的常用工具。野生动物分类模型受益于野生动物视觉数据的增加。这些模型在经过精心策划的高质量数据集上达到了高水平的准确性。然而，它们的表现仍然对现实环境约束敏感。在对时间上连贯的序列进行推断时，它们往往产生不一致的预测。单个个体的预测标签在帧之间迅速变化。本研究利用相机捕捉数据的时间特性来增强来自野生动物分类模型的推断预测。具体而言，我们采用几种标准的多目标跟踪（Multi-Object Tracking, MOT）模型来链接连续帧之间的检测。策划的轨迹用于融合softmax类别概率。融合后的概率得分生成一个单一的共识类别标签估计，覆盖由噪声引起的错误分类。实验结果分析表明，我们提出的策略在所有数据集和每个指标上均优于独立分类器。具体而言，表现最佳的MOT模型在三个MOT数据集上相较于分类器获得了加权F1分数分别提高了5.1%、3.1%和2.0%。

cs.CV / 60 / 2605.16696

Face inpainting with Identity Preserving Latent Diffusion Models

基于身份保留的潜在扩散模型的人脸修复

Santos, João, Santiago, Carlos, Marques, Manuel

Abstract

Face inpainting techniques recover missing or occluded facial regions in a visually realistic manner, but preserving the identity in the final output remains a fundamental challenge. Identity consistency is crucial for downstream applications such as face recognition, digital forensics, and human-computer interaction, where even subtle identity distortions can significantly degrade performance or trust. Although diffusion-based generative models have recently achieved remarkable progress in image inpainting, they often struggle to faithfully retain individual-specific facial characteristics. On the other hand, existing identity-aware methods typically rely on costly fine-tuning or auxiliary supervision, or exhibit limited robustness to diverse occlusions, poses, and facial variations. To address these limitations, we propose ID-ControlNet, an identity-preserving face inpainting framework built upon latent diffusion models. Building on the ControlNet architecture, our approach conditions the diffusion process on facial identity embeddings extracted from a pretrained face recognition network. This design enables reconstruction of occluded facial regions while maintaining global facial coherence and identity fidelity. Furthermore, we introduce a training strategy with identity consistency and triplet losses that explicitly enforces alignment between the generated face and the target identity representation. Extensive experiments on CelebA-HQ, FFHQ, and a new E-Mask dataset demonstrate that ID-ControlNet significantly improves identity preservation over standard diffusion-based inpainting methods, achieving performance comparable to state-of-the-art identity-aware approaches.
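
The triplet term mentioned in this abstract can be illustrated with the standard triplet loss on identity embeddings. The specifics are assumptions, since the abstract does not give them: cosine distance, a margin of 0.2, and the roles of the three embeddings (anchor = embedding of the inpainted face, positive = target identity, negative = a different identity).

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin)


target_id = [1.0, 0.0]   # toy 2-D embeddings
other_id = [0.0, 1.0]
faithful = triplet_loss([1.0, 0.0], target_id, other_id)   # identity preserved
drifted = triplet_loss([0.0, 1.0], target_id, other_id)    # identity lost
```

A generated face whose embedding matches the target incurs no loss, while one that has drifted onto the other identity incurs the maximum penalty for these orthogonal toy embeddings (1.2).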

Chinese Translation

人脸修复技术以视觉真实的方式恢复缺失或被遮挡的人脸区域，但在最终输出中保持身份一致性仍然是一个基本挑战。身份一致性对于人脸识别、数字取证和人机交互等下游应用至关重要，因为即使是微小的身份扭曲也会显著降低性能或信任度。尽管基于扩散的生成模型最近在图像修复方面取得了显著进展，但它们往往难以忠实保留个体特征的人脸特征。另一方面，现有的身份感知方法通常依赖于昂贵的微调、辅助监督，或对多样化的遮挡、姿态和面部变化表现出有限的鲁棒性。为了解决这些局限性，我们提出了ID-ControlNet，一个基于潜在扩散模型的身份保留人脸修复框架。基于ControlNet架构，我们的方法将扩散过程条件化于从预训练人脸识别网络提取的人脸身份嵌入。这一设计使得在保持全局面部一致性和身份保真度的同时重建被遮挡的人脸区域。此外，我们引入了一种身份一致性和三元组损失训练策略，明确强制生成的人脸与目标身份表示之间的对齐。在CelebA-HQ、FFHQ和新的E-Mask数据集上的大量实验表明，ID-ControlNet在身份保留方面显著优于标准的基于扩散的人脸修复方法，其性能可与最先进的身份感知方法相媲美。

cs.CV / 61 / 2605.16713

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

GeoWorld-VLM：来自世界模型的几何信息用于视觉-语言模型

Gu, Renjie, Zhou, Kaichen, Luo, Yan, Wang, Mengyu

Abstract

Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static visual input into a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM. Since the language model remains frozen, GeoWorld-VLM preserves the original model's linguistic capabilities while attributing spatial improvements to the enhanced visual pathway. To evaluate the effectiveness and generality of the proposed method, we apply GeoWorld-VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld-VLM improves performance by approximately 4 percent on both the What'sUp and VSR benchmarks, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.

Chinese Translation

现代视觉-语言模型（VLMs）在语义识别方面表现出色，但在基本空间关系（如左侧、上方、后方和之间）上仍然显得脆弱。这种失败的一个原因在于语言推理开始之前：在特征提取过程中，视觉路径可能会压缩或丢弃关键的三维结构线索，从而使语言模型接收到的图像表示已经不足以进行可靠的空间判断。我们提出了GeoWorld-VLM，这是一个VLM侧的蒸馏框架，将冻结的相机条件视频世界模型中的几何结构转移到VLM中。GeoWorld-VLM仅微调图像编码器和多模态投影器，将投影后图像特征与中间世界模型表示对齐，同时保持主骨干网络不变。在给定图像、提示和采样的相机轨迹的情况下，世界模型教师将静态视觉输入转换为合成的多视角空间信号。训练结合了空间答案监督、教师-学生特征对齐和对原始VLM的保留锚点。由于语言模型保持冻结，GeoWorld-VLM保留了原始模型的语言能力，同时将空间改进归因于增强的视觉路径。为了评估所提方法的有效性和普适性，我们将GeoWorld-VLM应用于两种不同的VLM架构，并观察到在两种骨干网络上均有一致的改进。GeoWorld-VLM在What'sUp和VSR基准测试中提高了约4%的性能，这表明世界模型引导的视觉对齐在模型结构和空间推理数据集之间具有广泛的适用性。

cs.CV / 62 / 2605.16716

MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation

MAVEN：一个多代理框架用于多文化文本到视频生成

Li, Shuowei, Zhao, Yuming, Bhalerao, Parth, Ignat, Oana

Abstract

Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within a single prompt remains underexplored. We introduce MAVEN, a multi-agent prompt refinement framework designed to improve cultural fidelity in both mono-cultural and cross-cultural T2V generation. MAVEN decomposes prompts into person, action, and location dimensions, handled by specialized agents operating in parallel or sequentially. To support systematic evaluation, we contribute a new benchmark of 243 culturally grounded prompts and 972 corresponding videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono-cultural and cross-cultural scenarios. Evaluations combining CLIP-based metrics, VLM-as-judge assessments, and video quality measures show that multi-agent refinement, particularly parallel specialization, significantly improves cultural relevance while preserving visual quality and temporal consistency. The dataset and code are available at https://github.com/AIM-SCU/CRAFT

Chinese Translation

文本到视频（T2V）生成在视觉逼真度方面迅速发展，但在单一提示中忠实呈现多种文化的能力仍然未得到充分探索。我们介绍了MAVEN，一个多代理提示优化框架，旨在提高单文化和跨文化T2V生成中的文化忠实度。MAVEN将提示分解为人物、动作和地点三个维度，由专门的代理并行或顺序处理。为了支持系统评估，我们贡献了一个新的基准，包括243个文化基础的提示和972个相应的视频，涵盖三种文化（中国、美国、罗马尼亚）、三种动作类别，以及单文化和跨文化场景。结合基于CLIP的指标、VLM作为评判的评估和视频质量测量的评估结果表明，多代理优化，特别是并行专业化，显著提高了文化相关性，同时保持了视觉质量和时间一致性。数据集和代码可在https://github.com/AIM-SCU/CRAFT获取。

cs.CV / 63 / 2605.16720

Compositional Adversarial Training for Robust Visual Watermarking

用于鲁棒视觉水印的组合对抗训练

Satheesh, Anirudh, Panaitescu-Liess, Michael-Andrei, Xu, Andrew, Milis, Georgios, Huang, Heng, Cai, Zikui, Huang, Furong

Abstract

Robust watermarking is typically trained with random post-processing augmentation, but random sampling under-covers the combinatorial space of realistic attack pipelines and rarely encounters the rare compositions that actually break detection. This leads to unstable training and poor sample efficiency. We instead formulate watermark robustness as a min-max problem over a structured space of compositional transformations. We propose Compositional Adversarial Training (CAT), a plug-in framework that learns a sequential differentiable adversary that observes the current watermarked image and selects an attack family at each step to maximally disrupt message recovery. CAT combines a straight-through Gumbel-Softmax attack selection with entropy regularization, allowing the backward pass to be end-to-end differentiable and to aggregate gradient information across attack families, yielding faster, smoother convergence without collapsing to a single attack mode. We evaluate CAT on post-generation watermarks VideoSeal 0.0, VideoSeal 1.0, and PixelSeal and in-generation WMAR under both single-step and two-step attack suites, on in-distribution and multiple out-of-distribution (OOD) image and video benchmarks. CAT consistently outperforms random-augmentation baselines trained with the same augmentation budget, with the largest gains on hard composed attacks and OOD evaluations. It improves overall watermark capacity by up to $63.5\%$ in the single-step attack setting and $13.0\%$ in the compositional setting. In the autoregressive setting, CAT improves the TPR@FPR$=1\%$ by $12\%$ on average on difficult geometric transformations. These results show that robust visual watermarking benefits from training against adaptive compositional adversaries rather than independent random corruptions.
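
The attack-selection step described in this abstract can be sketched in plain Python. The sketch covers only the forward pass of a Gumbel-Softmax choice and the entropy term; the straight-through gradient needs an autograd framework and is indicated in a comment. The attack names, the temperature, and the function names are hypothetical and not taken from the CAT code.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_select(logits, tau, rng):
    """Sample relaxed and one-hot attack choices from the adversary's logits."""
    # Gumbel(0, 1) noise; the clamp avoids log(0).
    noise = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    soft = softmax([(logit + g) / tau for logit, g in zip(logits, noise)])
    winner = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    # In an autograd framework the straight-through output would be
    # hard + soft - stop_gradient(soft): one-hot forward, soft gradients backward.
    return hard, soft


def entropy(probs):
    """Shannon entropy of the attack policy; rewarding it discourages collapse."""
    return -sum(p * math.log(p) for p in probs if p > 0)


attacks = ["jpeg", "crop", "blur", "rotate"]   # hypothetical attack families
logits = [0.0, 0.0, 0.0, 0.0]
hard, soft = gumbel_softmax_select(logits, tau=0.5, rng=random.Random(0))
chosen = attacks[hard.index(1.0)]
policy_entropy = entropy(softmax(logits))
```

With uniform logits the policy entropy is at its maximum, log 4. A policy that always picks one attack has entropy 0, which is the collapse the regularizer is meant to prevent.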

Chinese Translation

鲁棒水印通常通过随机后处理增强进行训练，但随机采样无法覆盖现实攻击流程的组合空间，且很少遇到实际破坏检测的稀有组合。这导致训练不稳定和样本效率低下。我们将水印鲁棒性形式化为一个在结构化组合变换空间上的最小-最大问题。我们提出了组合对抗训练（Compositional Adversarial Training, CAT），这是一个插件框架，学习一个顺序可微的对手，该对手观察当前的水印图像，并在每一步选择一个攻击族，以最大程度地干扰信息恢复。CAT结合了直通的Gumbel-Softmax攻击选择与熵正则化，使得反向传播可以实现端到端可微分，并在攻击族之间聚合梯度信息，从而实现更快、更平滑的收敛，而不至于崩溃到单一攻击模式。我们在后生成水印VideoSeal 0.0、VideoSeal 1.0和PixelSeal，以及在生成WMAR的单步和双步攻击套件上，对CAT进行了评估，涵盖了分布内和多个分布外的图像和视频基准。CAT在与相同增强预算训练的随机增强基线相比，始终表现更佳，在困难的组合攻击和OOD评估中取得了最大的提升；在单步攻击设置中，整体水印容量提高了高达63.5%，在组合设置中提高了13.0%。在自回归设置中，CAT在困难几何变换上平均提高了TPR@FPR=1%的12%。这些结果表明，鲁棒视觉水印从针对自适应组合对手的训练中受益，而不是独立的随机损坏。
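
Editor's illustration (not from the paper): the straight-through Gumbel-Softmax selection named in the abstract can be sketched in plain Python as below. The attack names, the temperature `tau`, and the function names are assumptions made for illustration. Only the forward pass is shown; in an autograd framework the hard one-hot choice would be used in the forward pass while gradients flow through the soft probabilities.

```python
import math
import random

ATTACKS = ["jpeg", "crop", "blur"]  # hypothetical attack families


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; a regularizer could reward higher values."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gumbel_softmax_select(logits, tau=1.0, rng=random):
    """Return (hard one-hot choice, soft relaxed probabilities)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)          # clamp away from 0 before the logs
        gumbel = -math.log(-math.log(u))      # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    soft = softmax(noisy)
    k = max(range(len(soft)), key=lambda i: soft[i])
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    # Straight-through idea: forward uses `hard`, backward uses d(soft)/d(logits).
    return hard, soft


if __name__ == "__main__":
    hard, soft = gumbel_softmax_select([0.1, 2.0, -1.0], tau=0.5)
    print(ATTACKS[hard.index(1.0)], soft, entropy(softmax([0.1, 2.0, -1.0])))
```

Lower `tau` pushes the soft probabilities toward the one-hot choice; an entropy bonus on the selection distribution is one way to discourage collapse onto a single attack.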

cs.CV / 64 / 2605.16732

DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers

DiRotQ：面向旋转的4位扩散变换器量化

Sharify, Sayeh, Salmani, Mahsa, Mostafa, Hesham

Abstract

Diffusion Transformers (DiTs) achieve state-of-the-art image generation quality but incur substantial memory and computational costs at inference. While aggressive Post-Training Quantization (PTQ) to 4-bit precision offers significant efficiency gains, it typically results in severe quality degradation. Existing approaches, including smoothing-based methods, mixed-precision schemes, rotation techniques, and low-rank residual methods, partially mitigate this issue but still leave a noticeable gap to FP16/BF16 performance. In this work, we introduce DiRotQ, a W4A4 PTQ framework that mitigates this degradation through rotation-aware activation quantization. DiRotQ identifies a low-rank subspace capturing dominant activation variance via Principal Component Analysis (PCA), preserving coefficients in this subspace at higher precision while quantizing the remaining components to 4-bit. Activations are rotated into the PCA basis at inference time using calibration-derived orthogonal transformations, while the inverse rotation is fused into the layer weights offline. Combined with GPTQ-based weight quantization, DiRotQ achieves an FID (lower is better) of 15.9 and PSNR (higher is better) of 19.1 dB on PixArt-Σ over the MJHQ-30K dataset, outperforming the prior state-of-the-art SVDQuant (FID 18.9, PSNR 17.6) under the same INT W4A4 setting. Beyond standard metrics, we introduce a VLM-as-a-Judge evaluation protocol for diffusion model quantization, the first such evaluation in this setting, providing a more holistic assessment of perceptual quality and prompt alignment under aggressive compression. On the systems side, we implement a Triton-based custom kernel to enable efficient end-to-end inference, reducing memory usage of the 12B FLUX.1-dev model by 2.1x and delivering a 2.3x speedup over the BF16 baseline on a 24 GB RTX 4090 GPU.

Chinese Translation

扩散变换器（DiTs）在图像生成质量方面达到了最先进的水平，但在推理时会产生大量的内存和计算成本。虽然对4位精度进行激进的后训练量化（PTQ）可以显著提高效率，但通常会导致严重的质量下降。现有方法，包括基于平滑的方法、混合精度方案、旋转技术和低秩残差方法，部分缓解了这一问题，但与FP16/BF16的性能相比仍然存在明显差距。在本研究中，我们提出了DiRotQ，一个W4A4 PTQ框架，通过面向旋转的激活量化来减轻这种降级。DiRotQ通过主成分分析（PCA）识别捕捉主导激活方差的低秩子空间，在该子空间中以更高的精度保留系数，同时将其余组件量化为4位。在推理时，激活通过校准导出的正交变换旋转到PCA基底，而逆旋转则离线融合到层权重中。结合基于GPTQ的权重量化，DiRotQ在MJHQ-30K数据集上的PixArt-Σ上实现了15.9的FID（越低越好）和19.1 dB的PSNR（越高越好），在相同的INT W4A4设置下超越了之前的最先进方法SVDQuant（FID 18.9，PSNR 17.6）。除了标准指标外，我们还引入了一种VLM-as-a-Judge评估协议用于扩散模型量化，这是该设置下的首次评估，提供了对感知质量和在激进压缩下的提示对齐的更全面评估。在系统方面，我们实现了一个基于Triton的自定义内核，以实现高效的端到端推理，将12B FLUX.1-dev模型的内存使用减少了2.1倍，并在24 GB RTX 4090 GPU上实现了相较于BF16基线的2.3倍加速。
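
Editor's illustration (not from the paper): a toy sketch of the two ideas the abstract describes, namely rotating activations into an orthogonal basis while fusing the inverse rotation into the weights, and keeping the leading coefficients at full precision while quantizing the rest to a 4-bit grid. A fixed 2D rotation stands in for the PCA basis here, and all function names are hypothetical.

```python
import math


def quantize_int4(values):
    """Symmetric fake quantization onto the signed 4-bit grid (-8..7)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / 7.0
    return [max(-8, min(7, round(v / scale))) * scale for v in values]


def rotation(theta):
    """2x2 orthogonal matrix; stands in for a calibration-derived PCA basis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def rotate_and_quantize(x, q, keep):
    """Rotate x into the basis q; keep the first `keep` coefficients exact."""
    z = matvec(transpose(q), x)
    return z[:keep] + quantize_int4(z[keep:])


if __name__ == "__main__":
    w = [[1.0, 2.0], [0.5, -1.0]]
    q = rotation(0.6)
    fused_w = matmul(w, q)                 # inverse rotation folded in offline
    z = rotate_and_quantize([0.7, -1.2], q, keep=1)
    print(matvec(fused_w, z), matvec(w, [0.7, -1.2]))
```

Because `q` is orthogonal, `fused_w @ (q^T x)` equals `w @ x` exactly when nothing is quantized, so any output error comes only from the 4-bit coefficients.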

cs.CV / 65 / 2605.16736

CAB: Accelerating Flow and Diffusion Sampling via Rectification and Corrected Adams-Bashforth

CAB：通过整流和修正的亚当斯-巴什福斯加速流动和扩散采样

Roy, Anuska, Nair, Pravin

Abstract

Flow and diffusion models achieve high-fidelity, high-resolution image synthesis, but often require many function evaluations (NFEs) at sampling time. Existing acceleration methods either require additional training through distillation or rely on training-free high-order solvers, and both can degrade sample quality at low NFE budgets. We propose CAB (Corrected Adams-Bashforth), a training-free sampler that accelerates both flow and diffusion models. CAB first transforms the sampling dynamics to a common rectified coordinate system, and then applies a multistep Adams-Bashforth predictor augmented with a simple correction term based on past velocity evaluations and therefore incurs no additional NFEs. The resulting method is simple, has the same algorithmic form across model classes, and has at least third-order local truncation error and second-order global error. Experiments on pretrained flow and diffusion models, including class-conditional and large-scale text-to-image benchmarks, show that CAB improves quality-NFE trade-offs in the low-step regime of 6-20 NFEs. It also remains competitive with strong training-free samplers at higher step counts across most tested models. The official implementation is available at https://github.com/Anuska-Roy/CAB.

Chinese Translation

流动和扩散模型实现了高保真度、高分辨率的图像合成，但在采样时通常需要进行大量的函数评估（NFE）。现有的加速方法要么需要通过蒸馏进行额外的训练，要么依赖于无训练的高阶求解器，而这两者在低NFE预算下都可能降低样本质量。我们提出了CAB（修正的亚当斯-巴什福斯），这是一种无训练的采样器，能够加速流动和扩散模型。CAB首先将采样动态转换为一个共同的整流坐标系统，然后应用一个多步的亚当斯-巴什福斯预测器，并根据过去的速度评估增加一个简单的修正项，因此不会产生额外的NFE。所提出的方法简单，跨模型类别具有相同的算法形式，并且具有至少三阶的局部截断误差和二阶的全局误差。在预训练的流动和扩散模型上进行的实验，包括类条件和大规模文本到图像基准测试，表明CAB在6-20 NFE的低步数范围内改善了质量-NFE权衡。它在大多数测试模型中，在较高步数下仍与强大的无训练采样器保持竞争力。官方实现可在 https://github.com/Anuska-Roy/CAB 获取。
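
Editor's illustration (not from the paper): the abstract builds on a multistep Adams-Bashforth predictor that reuses past velocity evaluations. The sketch below shows the textbook two-step Adams-Bashforth update next to Euler on a toy ODE, with one function evaluation per step in both cases. The rectified coordinate transform and CAB's correction term are not reproduced, and the function names are hypothetical.

```python
import math


def euler(f, x0, t0, t1, steps):
    """Explicit Euler: one velocity evaluation per step."""
    h = (t1 - t0) / steps
    t, x = t0, x0
    for _ in range(steps):
        x = x + h * f(t, x)
        t += h
    return x


def adams_bashforth2(f, x0, t0, t1, steps):
    """Two-step Adams-Bashforth, bootstrapped with a single Euler step."""
    h = (t1 - t0) / steps
    t, x = t0, x0
    f_prev = f(t, x)
    x = x + h * f_prev                      # first step has no history yet
    t += h
    for _ in range(steps - 1):
        f_curr = f(t, x)                    # the only new evaluation this step
        x = x + h * (1.5 * f_curr - 0.5 * f_prev)
        f_prev = f_curr                     # reuse it as history next step
        t += h
    return x


if __name__ == "__main__":
    decay = lambda t, x: -x                 # toy velocity field, exact solution exp(-t)
    print(euler(decay, 1.0, 0.0, 1.0, 10),
          adams_bashforth2(decay, 1.0, 0.0, 1.0, 10),
          math.exp(-1.0))
```

Both solvers spend the same number of function evaluations; the multistep update gains accuracy only by reusing the stored previous velocity.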

cs.CV / 66 / 2605.16740

TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation

TRACE：基于证据引导的多视频事件理解与声明生成

Yan, Pengyu, Gorugantu, Akhil, Bhosale, Mahesh, Wasi, Abdul, Trivedi, Vishvesh, Doermann, David

Abstract

Multi-video event understanding demands models that can locate and attribute query-relevant evidence scattered across long, heterogeneous video corpora. Existing large vision-language models (LVLMs) often underperform in this regime because they quickly exhaust their context budget and struggle to precisely localize evidentially important segments, frequently missing dense informational cues such as broadcast graphics, subtitles, and scoreboards. We introduce TRACE, an evidence grounding-guided framework that follows a ground-before-reasoning strategy for multi-video event reasoning. Our approach first builds a structured, text-searchable timeline for each video using OCR and object detection. A text-only LLM then conducts query-aware evidence localization, selecting relevant moments prior to any downstream visual reasoning. The retrieved frames and their grounding summaries are subsequently used to steer LVLM-based claim generation and cross-video citation consolidation. Experiments on MAGMaR 2026 and WikiVideo demonstrate that structured grounding markedly boosts factual completeness and attribution fidelity. On the MAGMaR validation split, TRACE raises macro-average MiRAGE F1 from 0.705 to 0.811 compared to an unguided Qwen3-VL-30B baseline, with especially strong improvements in citation recall from 0.440 to 0.628. The method also attains state-of-the-art results on the official MAGMaR 2026 leaderboard.

Chinese Translation

多视频事件理解需要能够定位和归因于分散在长篇异构视频语料库中的查询相关证据的模型。现有的大型视觉语言模型（LVLMs）在这一领域通常表现不佳，因为它们快速耗尽上下文预算，并且难以精确定位具有证据重要性的片段，常常错过密集的信息线索，如广播图形、字幕和记分牌。我们提出了TRACE，一个基于证据引导的框架，采用先定位后推理的策略进行多视频事件推理。我们的方法首先使用光学字符识别（OCR）和目标检测为每个视频构建一个结构化的、可文本搜索的时间线。然后，文本专用的语言模型（LLM）进行查询感知的证据定位，在任何下游视觉推理之前选择相关时刻。检索到的帧及其引导摘要随后用于引导基于LVLM的声明生成和跨视频引用整合。在MAGMaR 2026和WikiVideo上的实验表明，结构化的引导显著提高了事实完整性和归因准确性。在MAGMaR验证集上，TRACE将宏平均MiRAGE F1从0.705提升至0.811，相较于无引导的Qwen3-VL-30B基线，尤其在引用召回率方面，从0.440提升至0.628。该方法在官方MAGMaR 2026排行榜上也达到了最先进的结果。
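
Editor's illustration (not from the paper): the ground-before-reasoning idea can be sketched as a text search over per-video timelines built from OCR and detector output. In the sketch below a simple keyword-overlap score stands in for the text-only LLM that the abstract describes, and the `TimelineEntry` structure and all names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class TimelineEntry:
    video_id: str
    time_s: float
    text: str  # OCR text or detected object labels at this moment


def tokenize(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def localize_evidence(query, timeline, top_k=2):
    """Rank timeline entries by token overlap with the query (LLM stand-in)."""
    q = tokenize(query)
    scored = []
    for entry in timeline:
        overlap = len(q & tokenize(entry.text))
        if overlap:
            scored.append((overlap, entry))
    # Highest overlap first; ties broken by video id, then timestamp.
    scored.sort(key=lambda p: (-p[0], p[1].video_id, p[1].time_s))
    return [entry for _, entry in scored[:top_k]]


if __name__ == "__main__":
    timeline = [
        TimelineEntry("vidA", 12.0, "scoreboard: Lakers 98 Celtics 95 final"),
        TimelineEntry("vidA", 40.0, "crowd leaving stadium"),
        TimelineEntry("vidB", 5.5, "subtitle: the final score was close"),
    ]
    for hit in localize_evidence("What was the final score for the Lakers?", timeline):
        print(hit.video_id, hit.time_s, hit.text)
```

Only the frames at the selected timestamps would then be passed to the vision-language model, which is what keeps the visual context budget small.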

cs.CV / 67 / 2605.16742

Diffeomorphic Cortical Alignment via Direct Warping of Streamline Endpoints

通过直接扭曲流线端点实现的微分同胚皮层对齐

Xiang, Yang, Cole, Martin, Zhang, Zhengwu

Abstract

Cortical surface registration is often driven by local geometric descriptors (e.g., sulcal depth and curvature). While this approach achieves geometric correspondence, it neglects the long-range wiring constraints imposed by white-matter anatomy. Diffusion MRI tractography offers these crucial constraints; however, prior connectivity-informed pipelines typically align precomputed connectivity matrices, making the optimization highly sensitive to connectivity estimation and its resolution. In this paper, we introduce a novel connectivity-based surface registration method that aligns cortical surfaces by operating directly on white-matter fiber-tract endpoints. We model tract endpoints as a point cloud on the product manifold $\Omega \times \Omega$, where $\Omega$ represents the spherical domain of the inflated cortical hemispheres. Our alignment method iteratively (i) computes a small diffeomorphic warp for $\Omega$ by minimizing connectivity mismatch, and (ii) updates the endpoints based on this warp. The method relies on a geometric framework that ensures output warps are diffeomorphisms and has a final goal that optimizes the matching of well-known fiber bundles. Experiments on Human Connectome Project (HCP) data demonstrate improved tract-level correspondence, achieving higher connectivity-level overlap coefficients on major fiber bundles and stronger robustness across grid resolutions for $\Omega$ compared to state-of-the-art methods such as ENCORE and MSMAll.

Chinese Translation

皮层表面配准通常依赖于局部几何描述符（例如，沟深和曲率）。虽然这种方法实现了几何对应，但忽视了白质解剖所施加的长程连接约束。扩散MRI轨迹追踪提供了这些重要的约束；然而，先前的连接性信息驱动的流程通常对预计算的连接矩阵进行对齐，使得优化对连接估计及其分辨率高度敏感。本文提出了一种新颖的基于连接性的表面配准方法，通过直接操作白质纤维束端点来对齐皮层表面。我们将纤维束端点建模为乘积流形 $\Omega \times \Omega$ 上的点云，其中 $\Omega$ 表示膨胀皮层半球的球面域。我们的对齐方法迭代地（i）通过最小化连接不匹配计算 $\Omega$ 的小微分同胚扭曲，并（ii）基于该扭曲更新端点。该方法依赖于一个几何框架，确保输出的扭曲是微分同胚，并且最终目标是优化著名纤维束的匹配。在人类连接组计划（HCP）数据上的实验表明，纤维级对应性得到了改善，在主要纤维束上实现了更高的连接级重叠系数，并且在与ENCORE和MSMAll等最先进方法相比，$\Omega$ 的网格分辨率下具有更强的鲁棒性。

cs.CV / 68 / 2605.16745

EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

EVA01：通过混合变换器实现统一的原生3D理解与生成

Yang, Zongyuan, Yi, Mingjing, Ma, Wanli, Fan, Chenzhuo, Li, Bocheng, Liu, Baolin, Lou, Yuke, Song, Yingde, Xiong, Yongping, Guo, Zhengdong, Wang, Shimu

Abstract

This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understanding from geometric reasoning, operating as stateless reconstructors conditioned on dense 2D pixel priors. Recent MLLM-based methods treat the 3D modality as an external output rather than a native component of the multimodal sequence, making incremental adaptations without a systematic analysis of how geometric manifolds align with MLLM feature spaces. We introduce EVA01, a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing. Built upon a Mixture-of-Transformers (MoT) architecture, EVA01 decouples the model into a pre-trained Understanding Expert ($E_{\mathrm{und}}$) and a structurally mirrored Generation Expert ($E_{\mathrm{gen}}$), coupled through shared global self-attention with hard modality routing. This design aligns the semantic latent space of the MLLM backbone with the geometric manifold, enabling direct transfer of multimodal priors without intermediate 2D representations. Results show that EVA01 achieves state-of-the-art native text-to-3D generation fidelity and unlocks robust long-context multi-turn geometric editing with identity preservation, a capability fundamentally inaccessible to stateless reconstruction pipelines. Our findings further offer architectural insights for integrating 2D foundation models with 3D tasks, informing the design of 3D-native multimodal systems. Project Page: https://www.seeles.ai/research/pages/EVA01

Chinese Translation

本文解决了将3D网格作为多模态大语言模型（MLLMs）中的原生模态整合的挑战。基于扩散的大型重建模型将语义理解与几何推理解耦，作为无状态重构器在稠密的2D像素先验条件下操作。近期基于MLLM的方法将3D模态视为外部输出，而非多模态序列的原生组成部分，进行增量适配而未对几何流形如何与MLLM特征空间对齐进行系统分析。我们提出了EVA01，一个统一框架，扩展了MLLM的模态边界，以原生方式整合3D网格的理解、生成和上下文感知编辑。EVA01基于混合变换器（Mixture-of-Transformers, MoT）架构，将模型解耦为一个预训练的理解专家（Understanding Expert, $E_{\mathrm{und}}$）和一个结构上镜像的生成专家（Generation Expert, $E_{\mathrm{gen}}$），通过共享的全局自注意力与硬模态路由相耦合。该设计将MLLM主干的语义潜在空间与几何流形对齐，使得多模态先验能够直接转移，而无需中间的2D表示。结果表明，EVA01在原生文本到3D生成的保真度上达到了最先进水平，并解锁了具有身份保留的强健长上下文多轮几何编辑能力，这一能力在无状态重构管道中是根本无法实现的。我们的研究结果进一步为将2D基础模型与3D任务集成提供了架构见解，为3D原生多模态系统的设计提供了参考。项目页面：https://www.seeles.ai/research/pages/EVA01

cs.CV / 69 / 2605.16764

Synthetic Aperture Radar Image Change Detection Based on Global Dynamic Context-Aware Network

基于全球动态上下文感知网络的合成孔径雷达图像变化检测

Huan, Baogui, Gong, Chuanzheng, Chen, Dezhong, Gao, Feng, Dong, Junyu, Du, Qian

Abstract

Convolutional neural networks (CNNs) have been extensively and successfully applied to the task of synthetic aperture radar (SAR) image change detection. However, conventional convolutional layers are inherently limited by their local receptive fields, which mainly capture spatially localized patterns while neglecting the global context that is often crucial for accurately distinguishing subtle or large-scale changes in SAR imagery. To address these limitations, we propose a novel Global Dynamic Context-Aware Network (GDNet) specifically tailored for SAR image change detection. At the core of our approach lies a novel global dynamic convolution module, which adaptively modulates convolution kernel weights according to the global semantic information extracted from the input features. By dynamically incorporating long-range dependencies, this mechanism enables the network to integrate both local detail and global context, thus improving its ability to detect diverse change patterns. In addition, we introduce a carefully designed two-stage Mixup strategy for model training. Unlike conventional single-stage Mixup, our two-stage design generates more diverse and informative training samples, effectively regularizing the model and yielding more stable and reliable classification results even under limited data scenarios. Extensive experiments on three SAR datasets demonstrate the superiority of the proposed GDNet compared to other state-of-the-art methods. These findings highlight the potential of global dynamic modeling and advanced data augmentation strategies for advancing SAR image interpretation. Source codes are available at https://github.com/oucailab/GDNet.

Chinese Translation

卷积神经网络（CNN）已广泛且成功地应用于合成孔径雷达（SAR）图像变化检测任务。然而，传统的卷积层本质上受到其局部感受野的限制，主要捕捉空间局部模式，而忽视了通常对于准确区分SAR图像中微小或大规模变化至关重要的全球上下文。为了解决这些局限性，我们提出了一种新颖的全球动态上下文感知网络（GDNet），专门针对SAR图像变化检测进行定制。我们方法的核心是一个新颖的全球动态卷积模块，该模块根据从输入特征中提取的全球语义信息自适应地调节卷积核权重。通过动态地结合长距离依赖关系，该机制使网络能够整合局部细节和全球上下文，从而提高其检测多样变化模式的能力。此外，我们引入了一种精心设计的两阶段Mixup策略用于模型训练。与传统的单阶段Mixup不同，我们的两阶段设计生成了更多样化和信息丰富的训练样本，有效地对模型进行正则化，并在数据有限的情况下产生更稳定和可靠的分类结果。在三个SAR数据集上的大量实验表明，所提出的GDNet相较于其他最先进的方法具有优越性。这些发现突显了全球动态建模和先进数据增强策略在推动SAR图像解读方面的潜力。源代码可在 https://github.com/oucailab/GDNet 获取。
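
Editor's illustration (not from the paper): for reference, the conventional single-stage Mixup that the abstract contrasts against blends two samples and their one-hot labels with a Beta-distributed weight. The paper's two-stage variant is not specified in the abstract and is not reproduced here; the function name and the default `alpha` are assumptions.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


if __name__ == "__main__":
    changed_patch = [0.9, 0.8, 0.7]      # toy feature vectors, not SAR data
    unchanged_patch = [0.1, 0.2, 0.1]
    x, y, lam = mixup(changed_patch, [1.0, 0.0], unchanged_patch, [0.0, 1.0])
    print(lam, x, y)
```

A small `alpha` concentrates `lam` near 0 or 1, so most mixed samples stay close to one of the two originals.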

cs.CV / 70 / 2605.16768

Axial-Relation Guided Fusion State Space Model for Optical-Elevation Sensing Image Segmentation

轴向关系引导的光学-高程感知图像分割融合状态空间模型

Gao, Feng, Jin, Zhilin, Gan, Yanhai, Dong, Junyu, Du, Qian

Abstract

Semantic segmentation of multi-source remote sensing images is a fundamental task for Earth observation applications. Existing methods often struggle with insufficient multi-scale context modeling and suboptimal cross-modal feature fusion, limiting their performance in complex high-resolution scenes. To this end, we propose Axial-Relation Guided Fusion Mamba (ARG-Mamba), a state space model-based framework for optical-elevation remote sensing image segmentation. Specifically, we introduce a Multi-Scale State Space Module to capture both fine-grained local details and global contextual dependencies with linear computational complexity. Moreover, an Axial-Relation Guided Fusion Module is designed to explicitly model global cross-modal correlations along horizontal and vertical axes, enabling efficient feature fusion between optical and elevation modalities. Extensive experiments conducted on the ISPRS Vaihingen and Potsdam datasets demonstrate that our ARG-Mamba consistently outperforms state-of-the-art methods while maintaining favorable computational efficiency. The code will be made publicly available at https://github.com/oucailab/ARG-Mamba.

Chinese Translation

多源遥感图像的语义分割是地球观测应用中的一项基础任务。现有方法往往在多尺度上下文建模和跨模态特征融合方面存在不足，限制了它们在复杂高分辨率场景中的表现。为此，我们提出了轴向关系引导的融合 Mamba（ARG-Mamba），这是一个基于状态空间模型的光学-高程遥感图像分割框架。具体而言，我们引入了一个多尺度状态空间模块，以线性计算复杂度捕捉细致的局部细节和全局上下文依赖。此外，设计了一个轴向关系引导的融合模块，以显式建模沿水平和垂直轴的全局跨模态相关性，从而实现光学和高程模态之间的高效特征融合。在ISPRS Vaihingen和波茨坦数据集上进行的广泛实验表明，我们的ARG-Mamba在保持良好计算效率的同时，始终优于最先进的方法。代码将公开发布在 https://github.com/oucailab/ARG-Mamba。

cs.CV / 71 / 2605.16769

GLT-PEFT: Gated Lie-Tucker Parameter-Efficient Fine-Tuning for Alzheimer's Disease Diagnosis with Hippocampal Segmentation Pretraining

GLT-PEFT：用于阿尔茨海默病诊断的门控Lie-Tucker参数高效微调与海马分割预训练

He, Guanghua, Zhu, Hancan, Yu, Gaohang, Zhang, An

Abstract

Parameter-efficient fine-tuning (PEFT) has emerged as a promising paradigm for adapting pretrained models under limited data conditions. However, most existing PEFT methods are designed for matrix-structured parameters and are not well suited for high-dimensional convolutional kernels in medical imaging models. Moreover, they typically rely on additive updates and lack mechanisms to preserve the geometric structure of pretrained parameters, while multiplicative (geometry-aware) updates are difficult to integrate within a unified framework. To address this issue, this paper proposes GLT-PEFT, a gated Lie-Tucker parameter-efficient fine-tuning framework for Alzheimer's disease (AD) diagnosis. The proposed approach transfers a hippocampal segmentation pretrained model to a downstream classification task. Tucker decomposition enables tensor-aware low-rank adaptation of 3D convolutional kernels, while Lie group-based transformations provide structure-preserving multiplicative updates. A gating mechanism further reconciles additive and multiplicative update forms, resulting in a unified and more stable fine-tuning strategy. Extensive experiments demonstrate that GLT-PEFT achieves effective cross-task transfer while significantly reducing trainable parameters, highlighting its effectiveness for efficient and robust adaptation in medical imaging models.

Chinese Translation

参数高效微调（PEFT）已成为在有限数据条件下适应预训练模型的有前景的范式。然而，大多数现有的PEFT方法是为矩阵结构参数设计的，并不适合医学影像模型中的高维卷积核。此外，它们通常依赖于加性更新，缺乏保持预训练参数几何结构的机制，而乘性（几何感知）更新在统一框架内难以整合。为了解决这个问题，本文提出了GLT-PEFT，一个用于阿尔茨海默病（AD）诊断的门控Lie-Tucker参数高效微调框架。该方法将一个海马分割的预训练模型转移到下游分类任务中。Tucker分解使得3D卷积核的张量感知低秩适应成为可能，而基于Lie群的变换提供了结构保持的乘性更新。门控机制进一步调和了加性和乘性更新形式，从而形成一个统一且更稳定的微调策略。大量实验表明，GLT-PEFT在实现有效的跨任务迁移的同时显著减少了可训练参数，突显了其在医学影像模型中高效且稳健适应的有效性。

cs.CV / 72 / 2605.16774

CANSURF: An ASV-View Can Dataset and Benchmark for Detection and Tracking of Surface-Level Debris

CANSURF：用于表面级垃圾检测与跟踪的ASV视图罐数据集与基准

Aljundi, Zaid, Rahmatullah, Zahra F., Elemam, Mostafa, Moosa, Abdullah

Abstract

Surface-level marine debris remains a practical bottleneck for autonomous clean-up, where small, reflective targets (e.g., aluminum cans) must be detected at distance under glare, ripples, and partial submersion. This paper presents an ASV vision system and a new surface-can dataset. The dataset comprises ~7.3k raw images extracted from videos and annotated with bounding boxes, expanded via ten augmentation types to ~57k training/validation images spanning diverse lighting and water states. A family of detector and detector-tracker pipelines tailored to surface operations was benchmarked. Training YOLOv11 on CANSURF boosts performance 12x over generic datasets, highlighting the dataset's value. Experiments show that YOLOv11+ByteTrack yields the most stable tracks (fewer identity switches) and stronger multi-object accuracy, while YOLOv11+SAHI increases recall on far-field cans at the cost of lower precision in full-context inputs. Given the mission profile (single-can pickup with approach and grab), YOLOv11+SAHI proves better for detecting the maximum number of cans. No prior open dataset targets aluminum cans on water from a surface-level viewpoint; this dataset fills this gap and supports reproducible evaluation.

Chinese Translation

表面级海洋垃圾仍然是自主清理的实际瓶颈，其中小型反射目标（例如铝罐）必须在耀眼、波纹和部分浸没的情况下远距离检测。本文提出了一种ASV视觉系统和一个新的表面罐数据集。该数据集包含约7.3千张从视频中提取的原始图像，并进行了边界框标注，通过十种增强类型扩展至约57千张训练/验证图像，涵盖了多样的光照和水面状态。针对表面操作定制的一系列检测器和检测-跟踪管道进行了基准测试。在CANSURF上训练YOLOv11使性能提升了12倍，相较于通用数据集，突显了该数据集的价值。实验表明，YOLOv11+ByteTrack在跟踪稳定性（较少的身份切换）和多目标准确性方面表现最佳，而YOLOv11+SAHI在远距离铝罐检测上提高了召回率，但在全景输入下精度较低。考虑到任务特征，单罐拾取的接近与抓取，YOLOv11+SAHI在检测最大数量的罐子方面表现更佳。此前没有公开数据集从表面级视角针对水面上的铝罐进行研究；该数据集填补了这一空白，并支持可重复的评估。
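
Editor's illustration (not from the paper): sliced inference of the kind SAHI performs runs a detector on overlapping tiles so that small far-field objects occupy more pixels per input. The sketch below only computes the tile windows and maps a tile-local box back to image coordinates; it does not use the SAHI library, and the tile size, overlap ratio, and function names are assumptions.

```python
def slice_windows(width, height, tile=640, overlap=0.2):
    """Return (x0, y0, x1, y1) tiles that cover the image with overlap."""
    step = max(1, int(tile * (1.0 - overlap)))

    def starts(size):
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, step))
        positions.append(size - tile)      # final tile sits flush with the border
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


def to_full_image(box, window):
    """Shift a tile-local (x0, y0, x1, y1) box into full-image coordinates."""
    off_x, off_y = window[0], window[1]
    return (box[0] + off_x, box[1] + off_y, box[2] + off_x, box[3] + off_y)


if __name__ == "__main__":
    for window in slice_windows(1920, 1080):
        print(window)
```

Detections from all tiles would still need merging (for example with non-maximum suppression), which is also where the extra false positives noted in the abstract could arise.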

cs.CV / 73 / 2605.16775

VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment

VolTA-3D：基于3D体积标记对齐的脑MRI自监督学习

Makawana, Amy, Parida, Abhijeet, Linguraru, Marius George, Ive, Julia, Anwar, Syed Muhammad

Abstract

Self-supervised learning (SSL) has advanced medical image analysis by enabling learning from large unlabelled data. However, in brain magnetic resonance imaging (MRI), most 3D models remain specialized for either segmentation or classification, limiting their ability to generalize across datasets, imaging protocols, and downstream tasks. This lack of transferability constrains the clinical utility of 3D MRI models, despite the availability of unlabeled volumetric data. We present VolTA-3D, a self-supervised 3D Vision Transformer framework designed to learn transferable volumetric representations. VolTA-3D jointly aligns global class-style tokens and local patch tokens within a student-teacher paradigm and enforces fine-grained structural reconstruction. This combined global-local alignment addresses the limited semantic diversity and subtle anatomical characteristics of brain MRI, which challenge existing SSL approaches. We evaluate VolTA-3D on multiple out-of-distribution downstream tasks, including hippocampal segmentation and classification of sex and Alzheimer's disease versus healthy controls. Across all tasks, representations learned by VolTA-3D outperform randomly initialized baselines, demonstrating improved transferability and robustness under domain shift. Hence, jointly enforcing global semantic consistency and local structural learning during pretraining enables broader concept learning from unlabeled brain MRI data. Overall, VolTA-3D supports effective multi-task downstream performance with task-specific pretraining, a step towards generalizable and clinically viable 3D models.

Chinese Translation

自监督学习（SSL）通过利用大量未标记数据推动了医学图像分析的进展。然而，在脑磁共振成像（MRI）中，大多数3D模型仍然专注于分割或分类，限制了它们在不同数据集、成像协议和下游任务中的泛化能力。这种可迁移性的缺乏限制了3D MRI模型的临床实用性，尽管存在未标记的体积数据。我们提出了Volta-3D，一个旨在学习可迁移体积表示的自监督3D视觉变换器框架。Volta-3D在学生-教师范式下共同对齐全局类别样式标记和局部补丁标记，并强制进行细粒度结构重建。这种全局-局部对齐结合解决了脑MRI的有限语义多样性和微妙解剖特征，这对现有的SSL方法构成了挑战。我们在多个分布外下游任务上评估了Volta-3D，包括海马体分割和性别及阿尔茨海默病与健康对照的分类。在所有任务中，Volta-3D学习的表示优于随机初始化的基线，展示了在领域转移下的可迁移性和鲁棒性。因此，在预训练期间共同强制全局语义一致性和局部结构学习，使得从未标记的脑MRI数据中能够进行更广泛的概念学习。总体而言，VolTA-3D支持有效的多任务下游性能，并通过任务特定的微调向可泛化和临床可行的3D模型迈进了一步。

cs.CV / 74 / 2605.16779

A Holistic Method for Superquadric Fitting Using Unsupervised Clustering Analysis

一种基于无监督聚类分析的超二次体拟合整体方法

Zhao, Mingyang, Ruan, Sipu, Jia, Xiaohong

Abstract

This work presents a novel method for fitting superquadrics to point clouds under the contamination of noise and outliers, which has many applications for shape modeling across diverse fields. Unlike prior approaches that either exclusively focus on fitting rigid or deformable superquadrics, or suffer from robustness and numerical instability issues, our method redefines the problem from a new unsupervised clustering perspective, enabling the holistic fitting of both rigid and deformable superquadrics within a unified framework. Central to our approach is a stable optimization function inspired by unsupervised clustering analysis, where we formulate the point cloud data and samples from the potential parametric surface as clustering members and centroids, respectively. Then, the clustering process with dynamic updates to centroid locations serves as a direct proxy for optimizing superquadric parameters, establishing a principled link between geometric fitting and clustering dynamics. We further derive the relationship between pairwise computations of clustering centroids and clustering members to orthogonal distances, effectively eliminating the need for the time-consuming surface sampling process. Moreover, our formulation provides closed-form analytical solutions for both the fuzzy membership degree vector and the covariance matrix, ensuring efficient iteration optimization and enabling more effective handling of geometric deformations. In addition, we provide a theoretical certificate of convergence analysis and demonstrate that the clustering-inspired fitting method can escape local minima by inherently increasing the convexity of the objective function. The implementation is publicly available at https://github.com/zikai1/SuperquadricFitting.

Chinese Translation

本研究提出了一种新颖的方法，用于在噪声和异常值污染下将超二次体拟合到点云中，该方法在多个领域的形状建模中具有广泛应用。与之前的方法不同，后者要么专注于拟合刚性或可变形超二次体，要么面临鲁棒性和数值不稳定性的问题，我们的方法从新的无监督聚类视角重新定义了问题，使得在统一框架内能够整体拟合刚性和可变形超二次体。我们方法的核心是一个受无监督聚类分析启发的稳定优化函数，其中我们将点云数据和潜在参数表面的样本分别视为聚类成员和质心。然后，动态更新质心位置的聚类过程作为优化超二次体参数的直接代理，建立了几何拟合与聚类动态之间的原则性联系。我们进一步推导了聚类质心和聚类成员之间的成对计算与正交距离的关系，有效地消除了耗时的表面采样过程。此外，我们的公式为模糊隶属度向量和协方差矩阵提供了封闭形式的解析解，确保了高效的迭代优化，并能够更有效地处理几何变形。此外，我们提供了收敛分析的理论证明，并展示了受聚类启发的拟合方法能够通过内在增加目标函数的凸性来逃避局部最小值。该实现已公开发布，网址为 https://github.com/zikai1/SuperquadricFitting。
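
Editor's illustration (not from the paper): the shape being fitted can be made concrete with the standard superquadric inside-outside function, which equals 1 on the surface, is below 1 inside, and above 1 outside. The sketch assumes a canonical pose (no rotation, translation, or deformation), uses hypothetical parameter names, and does not reproduce the paper's clustering-based objective.

```python
def superquadric_F(point, a, b, c, e1, e2):
    """Inside-outside function of a superquadric in its canonical frame.

    a, b, c are the semi-axis lengths; e1 and e2 are the shape exponents.
    """
    x, y, z = point
    xy_term = (abs(x / a) ** (2.0 / e2) + abs(y / b) ** (2.0 / e2)) ** (e2 / e1)
    z_term = abs(z / c) ** (2.0 / e1)
    return xy_term + z_term


def classify(point, a, b, c, e1, e2, tol=1e-9):
    """Label a point as 'inside', 'surface', or 'outside'."""
    value = superquadric_F(point, a, b, c, e1, e2)
    if abs(value - 1.0) <= tol:
        return "surface"
    return "inside" if value < 1.0 else "outside"


if __name__ == "__main__":
    # e1 = e2 = 1 gives an ellipsoid; small exponents approach a box.
    print(classify((0.6, 0.0, 0.8), 1.0, 1.0, 1.0, 1.0, 1.0))
    print(classify((0.9, 0.9, 0.9), 1.0, 1.0, 1.0, 0.1, 0.1))
```

With small exponents the corner point `(0.9, 0.9, 0.9)` falls inside the box-like shape, while the same point lies outside the unit sphere.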

cs.CV / 75 / 2605.16785

Encoding Robust Topological Signatures for Hyperdimensional Computing

为超高维计算编码鲁棒的拓扑特征

Kusari, Arpan

Abstract

Hyperdimensional (HD) computing offers an attractive alternative to deep networks for edge learning due to its simplicity, fast prototype-based inference, and compatibility with online updates. However, standard pixel-based HD encoders are brittle: small distribution shifts such as rotation, noise, or occlusion can drastically reduce accuracy. We extract discrete topological primitives, most notably holes, from binarized shapes and pair them with rotation/translation/scale (RTS)-invariant shape signatures. Our method constructs RTS-stable descriptors for (i) the outer shape using a spatial-pyramid variant of Zernike moments and (ii) each hole using an intrinsic Fourier descriptor of its radial signature together with RTS-canonical relative geometry. Each primitive is mapped to a bipolar hypervector via randomized projection and role binding, and variable-cardinality hole sets are aggregated by permutation-invariant bundling to form a single image hypervector. To avoid over-weighting any cue, we learn nonnegative reliability weights for the Zernike and hole channels on a validation set via late fusion of cosine similarities. Experiments on MNIST and EMNIST under controlled corruptions (rotation, Gaussian noise, salt-and-pepper, cutout, zoom) show that Topology-guided HD computing substantially improves robustness compared with a naive HD baseline, maintaining high accuracy across multiple corruption families and benefiting from lightweight online training. Compared with a compact CNN trained on clean data, our method achieves competitive clean accuracy while offering markedly stronger robustness to several pixel-level corruptions, demonstrating that explicit topological structure is a practical route to robust HD representations. The code is provided at https://github.com/arpan-kusari/Topological-HDC.

Chinese Translation

超高维（HD）计算由于其简单性、快速的基于原型的推理以及与在线更新的兼容性，成为边缘学习的一个有吸引力的替代方案。然而，标准的基于像素的HD编码器较为脆弱：小的分布变化，如旋转、噪声或遮挡，可能会显著降低准确性。我们从二值化形状中提取离散的拓扑原语，最显著的是孔，并将其与旋转/平移/缩放（RTS）不变的形状特征配对。我们的方法为（i）外部形状构建基于Zernike矩的空间金字塔变体的RTS稳定描述符，以及（ii）每个孔使用其径向特征的内在傅里叶描述符与RTS规范相对几何结合。每个原语通过随机投影和角色绑定映射到一个双极超向量，并通过置换不变的捆绑将可变基数的孔集聚合形成一个单一的图像超向量。为了避免对任何线索的过度加权，我们通过对余弦相似度的后期融合，在验证集上学习Zernike和孔通道的非负可靠性权重。在控制腐蚀（旋转、高斯噪声、盐和胡椒、切割、缩放）下对MNIST和EMNIST的实验表明，与简单的HD基线相比，基于拓扑的HD计算显著提高了鲁棒性，在多个腐蚀类别中保持高准确性，并受益于轻量级的在线训练。与在干净数据上训练的紧凑型卷积神经网络（CNN）相比，我们的方法在干净准确性上具有竞争力，同时对多种像素级腐蚀表现出明显更强的鲁棒性，证明了显式的拓扑结构是实现鲁棒HD表示的可行途径。代码可在 https://github.com/arpan-kusari/Topological-HDC 获取。
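
Editor's illustration (not from the paper): the hypervector operations named in the abstract (bipolar vectors, role binding, permutation-invariant bundling, cosine similarity) can be sketched as below. Random hypervectors stand in for the projected Zernike and hole descriptors, and the dimension and all names are assumptions.

```python
import random


def random_hv(dim, rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bind(a, b):
    """Role binding by elementwise product; applying it twice undoes it."""
    return [x * y for x, y in zip(a, b)]


def bundle(vectors):
    """Elementwise majority vote; the order of the inputs does not matter."""
    return [1 if sum(column) >= 0 else -1 for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; every bipolar vector has norm sqrt(dim)."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 2000
    role_outer, role_hole = random_hv(dim, rng), random_hv(dim, rng)
    outer, hole_1, hole_2 = (random_hv(dim, rng) for _ in range(3))
    image_hv = bundle([bind(role_outer, outer),
                       bind(role_hole, hole_1),
                       bind(role_hole, hole_2)])
    print(cosine(image_hv, bind(role_hole, hole_1)),
          cosine(image_hv, random_hv(dim, rng)))
```

The bundled image vector stays similar to each bound member and nearly orthogonal to unrelated vectors, which is what makes nearest-prototype classification possible.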

cs.CV / 76 / 2605.16789

Accelerating Rectified Flow Models via Trajectory-Aware Caching

通过轨迹感知缓存加速修正流模型

Liu, Xiao, Liu, Kai, Guan, Naiyang, Lu, Hongliang, Wang, Zhixin, Chen, Zhikai, Pei, Renjing, Zhang, Yulun

Abstract

Diffusion and rectified flow (RF) models generate high-fidelity images and videos, but their iterative velocity-field evaluations are computationally expensive. Existing caching methods accelerate sampling by skipping timesteps, yet their coarse approximations introduce accumulated errors over long skip intervals and degrade quality under aggressive acceleration. We propose TACache (Trajectory-Aware Cache), a training-free acceleration framework following a skip-then-compensate paradigm. TACache performs an orthogonal decomposition of discrete velocity acceleration along the RF trajectory into a parallel component and an orthogonal residual, isolating the magnitude and directional sources of per-step approximation error. The framework operates in two stages: offline, cumulative variation thresholds on the magnitude and direction indicators yield the skip schedule and bound how far each skip interval may extend; online, at each skipped step the offline statistics are combined with the sample's historical orthogonal direction to reconstruct the skipped velocity without additional model evaluations. Experiments on BAGEL, FLUX.1-dev, and Wan2.1-1.3B show that TACache achieves up to 4.14x speedup on text-to-image generation and 2.11x speedup on text-to-video generation, with consistent improvements over prior cache-based methods on all reference-based fidelity metrics. Code will be released soon.

Chinese Translation

扩散和修正流（RF）模型生成高保真图像和视频，但其迭代速度场评估计算成本高昂。现有的缓存方法通过跳过时间步来加速采样，但其粗略的近似会在长跳跃间隔中引入累积误差，并在激进加速下降低质量。我们提出了TACache（轨迹感知缓存），这是一个无训练的加速框架，遵循跳过-再补偿的范式。TACache对沿RF轨迹的离散速度加速进行正交分解，将其分为平行分量和正交残差，从而隔离每步近似误差的大小和方向来源。该框架分为两个阶段：离线阶段，累积变化阈值在大小和方向指标上生成跳跃计划，并限制每个跳跃间隔的最大延伸距离；在线阶段，在每个跳过的步骤中，离线统计与样本的历史正交方向结合，以重构跳过的速度，而无需额外的模型评估。在BAGEL、FLUX.1-dev和Wan2.1-1.3B上的实验表明，TACache在文本到图像生成上实现了最高4.14倍的加速，在文本到视频生成上实现了2.11倍的加速，并在所有基于参考的保真度指标上对比以往的基于缓存的方法均有一致的改进。代码将很快发布。
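
Editor's illustration (not from the paper): the orthogonal decomposition described in the abstract can be sketched as projecting the discrete velocity acceleration onto a reference direction and keeping the remainder as the orthogonal residual. Taking the previous velocity as the reference direction is an assumption of this sketch, as are the function names.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decompose(vec, ref):
    """Split vec into a component parallel to ref and an orthogonal residual."""
    ref_sq = dot(ref, ref)
    if ref_sq == 0.0:
        return [0.0] * len(vec), list(vec)
    coef = dot(vec, ref) / ref_sq
    parallel = [coef * r for r in ref]
    residual = [v - p for v, p in zip(vec, parallel)]
    return parallel, residual


def velocity_acceleration(v_prev, v_curr):
    """Discrete acceleration between two consecutive velocity evaluations."""
    return [c - p for p, c in zip(v_prev, v_curr)]


if __name__ == "__main__":
    v_prev = [1.0, 0.0, 0.0]
    v_curr = [1.5, 0.2, -0.1]
    acc = velocity_acceleration(v_prev, v_curr)
    parallel, residual = decompose(acc, v_prev)
    print(parallel, residual)
```

The parallel part captures how the velocity magnitude changes along the trajectory, and the residual captures the change in direction.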

cs.CV / 77 / 2605.16795

3DPhysVideo: Consistency-Guided Flow SDE for Video Generation via 3D Scene Reconstruction and Physical Simulation

3DPhysVideo：通过3D场景重建和物理模拟的基于一致性引导的流动SDE视频生成

Kim, Hwidong, Kim, Yunho, Kim, Tae-Kyun

Abstract

Video generative models have made remarkable progress, yet they often yield visual artifacts that violate grounding in physical dynamics. Recent works such as PhysGen3D tackle single image-to-3D physics through mesh reconstruction and Physically-Based Rendering, but challenges remain in modeling fluid dynamics, multi-object interactions and photorealism. This work introduces 3DPhysVideo, a novel training-free pipeline that generates physically realistic videos from a single image. We repurpose an off-the-shelf video model for two stages. First, we use it as a novel view synthesizer to reconstruct complete 360-degree 3D scene geometry by guiding the image-to-video (I2V) flow model with rendered point clouds. Second, after applying physics solvers to this geometry, the physically simulated point cloud is used to guide the same I2V flow model to synthesize final, high-quality videos. Consistency-Guided Flow SDE, which decomposes the predicted velocity of the I2V flow model into denoising and consistency bias, enforces consistency to the conditional inputs, allowing us to effectively repurpose the model for both 3D reconstruction and simulation-guided video generation. In diverse experiments, including multi-object and fluid interaction scenes, our method successfully bridges the gap from single images to physically plausible videos, while remaining efficient to run on a single consumer GPU. It outperforms state-of-the-art baselines on GPT-based scores, the VideoPhy benchmark, and human evaluation.

Chinese Translation

视频生成模型取得了显著进展，但它们常常产生违反物理动态基础的视觉伪影。近期的研究如PhysGen3D通过网格重建和基于物理的渲染（Physically-Based Rendering）解决了单幅图像到3D物理的转换，但在流体动力学、多物体交互和照片真实感建模方面仍面临挑战。本研究提出了3DPhysVideo，一种新颖的无训练管道，可以从单幅图像生成物理真实的视频。我们将现成的视频模型重新利用为两个阶段。首先，我们将其用作新视角合成器，通过用渲染的点云引导图像到视频（I2V）流模型，重建完整的360度3D场景几何。其次，在对该几何应用物理求解器后，物理模拟的点云被用来引导同一I2V流模型合成最终的高质量视频。基于一致性引导的流动SDE（Consistency-Guided Flow SDE）将I2V流模型预测的速度分解为去噪和一致性偏差，强制条件输入的一致性，使我们能够有效地重新利用该模型进行3D重建和模拟引导的视频生成。在包括多物体和流体交互场景的多样实验中，我们的方法成功地弥合了从单幅图像到物理可信视频的差距，同时在单个消费级GPU上保持高效运行。在基于GPT的评分、VideoPhy基准和人类评估中，它超越了最先进的基线。


cs.CV / 78 / 2605.16797

EgoKit: Towards Unified Low-Cost Egocentric Data Collection with Heterogeneous Devices

EgoKit：面向异构设备的统一低成本自我中心数据收集

Yu, Liuchuan, Murat, Erdem, Wang, Beichen, Zeng, Yan, Luo, Tingting, Zhou, Huizhen, Li, Shanghao, Feng, Huining, Zhao, Zhigen, Yang, Ning, Jing, Ke, Liu, Yunhao, Sheng, Ruoya

Abstract

Egocentric video is increasingly used as a data source for robot learning, activity understanding, and embodied AI research, but collecting it at scale remains fragmented in practice: each candidate host device, such as an Android phone, iPhone, iPad, smart glasses, or extended reality (XR) headset, exposes a different SDK, a different policy on raw camera access, and different limitations on external USB cameras and on-device tracking. Synchronized ego-view and wrist-view capture is therefore typically obtained by either committing to a single proprietary platform or building one-off rigs that do not transfer across devices. To address this gap, we present EgoKit, a toolkit that exposes the same egocentric recording workflow across six heterogeneous host devices. Across all supported devices, EgoKit presents the same recording interaction and produces locally stored video with a uniform log format; on XR headsets, it additionally logs head pose and OpenXR-standard 26-joint hand tracking aligned to the video streams. The companion accessories, including two wrist cameras with mounts, a head strap, and a USB-C hub, add wrist-view capture to any supported host without custom hardware fabrication. EgoKit is available at https://egokit.chuange.org/.

Chinese Translation

自我中心视频越来越多地被用作机器人学习、活动理解和具身人工智能研究的数据源，但在实际收集过程中仍然存在碎片化的问题：每个候选主机设备，如安卓手机、iPhone、iPad、智能眼镜或扩展现实（XR）头戴设备，暴露出不同的SDK、不同的原始相机访问政策，以及对外部USB相机和设备内追踪的不同限制。因此，同步的自我视角和手腕视角捕获通常需要承诺于单一专有平台，或构建无法跨设备转移的一次性设备。为了解决这一问题，我们提出了EgoKit，一个在六种异构主机设备上提供相同自我中心录制工作流程的工具包。在所有支持的设备上，EgoKit提供相同的录制交互，并生成具有统一日志格式的本地存储视频；在XR头戴设备上，它还记录与视频流对齐的头部姿态和OpenXR标准的26关节手部追踪。配套配件包括两个带支架的手腕相机、一个头带和一个USB-C集线器，使任何支持的主机都能添加手腕视角捕获，而无需定制硬件制造。EgoKit可在 https://egokit.chuange.org/ 上获取。


cs.CV / 79 / 2605.16805

NeuroLiDAR: Adaptive Frame Rate Depth Sensing via Neuromorphic Event-LiDAR Fusion

NeuroLiDAR：通过神经形态事件-激光雷达融合实现自适应帧率深度感知

Rathnayake, Darshana, Weerakoon, Dulanga, Radhakrishnan, Meera, Misra, Archan

Abstract

LiDARs are widely used for 3D depth reconstruction, but their performance is often limited by inherent hardware constraints that impose trade-offs between range, spatial resolution, and frame rate. Many LiDAR systems operate at low frame rates (e.g., 5-10 Hz), prioritizing long-range sensing over responsiveness to rapid scene changes. We present NeuroLiDAR, an adaptive depth sensing framework that achieves effective frame rates of up to approximately 66 Hz by fusing temporally sparse LiDAR data with temporally dense inputs from neuromorphic event cameras. NeuroLiDAR integrates two components, event-based keyframe detection and event-guided depth extrapolation, to dynamically adjust the sensing rate in response to scene dynamics. To evaluate our approach, we introduce ELiDAR, a dataset spanning outdoor and indoor scenarios, and show that NeuroLiDAR reduces depth reconstruction error by approximately 29% in RMSE while achieving adaptive frame rates between 27.8 and 47.3 Hz. Our code and dataset are available at https://github.com/darshanakgr/neurolidar.

Chinese Translation

激光雷达（LiDAR）广泛用于三维深度重建，但其性能常常受到固有硬件限制的制约，这导致在探测范围、空间分辨率和帧率之间存在权衡。许多激光雷达系统通常以低帧率（例如，5-10 Hz）运行，优先考虑长距离探测，而忽视对快速场景变化的响应。我们提出了NeuroLiDAR，这是一种自适应深度感知框架，通过将时间稀疏的激光雷达数据与来自神经形态事件相机的时间密集输入融合，实现了高达约66 Hz的有效帧率。NeuroLiDAR集成了两个组件：基于事件的关键帧检测和事件引导的深度外推，以动态调整感知速率以响应场景动态。为了评估我们的方法，我们引入了ELiDAR，一个涵盖室外和室内场景的数据集，并展示了NeuroLiDAR在均方根误差（RMSE）方面将深度重建误差降低了约29%，同时实现了27.8-47.3 Hz之间的自适应帧率。我们的代码和数据集可在https://github.com/darshanakgr/neurolidar获取。


cs.CV / 80 / 2605.16807

DecoRec: Decomposed 3D Scene Reconstruction from Single-View Images via Object-Level Diffusion

DecoRec：通过对象级扩散从单视图图像分解的3D场景重建

Ping, Yuhan, Liu, Yuan, Long, Xiaoxiao, Wang, Peng, Hou, Junhui, Zheng, Jianyi, Pan, Jia, Li, Xin, Lin, Cheng

Abstract

In this paper, we introduce DecoRec, a novel system designed to elevate single-view 2D images to a decomposed 3D scene mesh. Current methods for single-view scene reconstruction typically rely on object retrieval or the regression of coarse 3D voxels or surfaces, leading to inaccuracies in capturing the appearance and geometry of the input image. The lack of high-quality large-scale scene-level datasets further complicates direct 3D scene generation from single-view images. To achieve high-quality 3D scene generation from a single-view image, DecoRec takes advantage of recent diffusion-based single-view object reconstruction methods to reconstruct individual objects separately. Subsequently, a refinement pipeline is proposed to effectively merge these reconstructed objects, enhancing appearance and geometry through a differentiable rendering technique and diffusion-guided refinement. Our results demonstrate that DecoRec facilitates high-quality single-view scene reconstruction in both geometry and novel synthesis, offering significant benefits for downstream applications like room interior design.

Chinese Translation

本文介绍了DecoRec，一个新颖的系统，旨在将单视图2D图像提升为分解的3D场景网格。目前的单视图场景重建方法通常依赖于对象检索或粗略3D体素或表面的回归，这导致在捕捉输入图像的外观和几何形状时存在不准确性。缺乏高质量的大规模场景级数据集进一步复杂化了从单视图图像直接生成3D场景的过程。为了实现从单视图图像生成高质量的3D场景，DecoRec利用了近期基于扩散的单视图对象重建方法，分别重建个体对象。随后，提出了一种精细化管道，有效地合并这些重建的对象，通过可微渲染技术和扩散引导的精细化增强外观和几何形状。我们的结果表明，DecoRec在几何和新颖合成方面促进了高质量的单视图场景重建，为室内设计等下游应用提供了显著的好处。


cs.CV / 81 / 2605.16810

Training-Free Occluded Text Rendering via Glyph Priors and Attention-Guided Semantic Blending

基于字形先验和注意力引导的语义混合的无训练遮挡文本渲染

Hou, Jingqi, Wang, Hongtian

Abstract

We present a training-free framework for occluded text rendering with a pretrained FLUX.1-dev backbone. The task requires a model to render recognizable typography and place an occluding object over the intended text region. This setting remains difficult for existing text-to-image generators: the occluder often drifts away from the text, while the text may be distorted or appear to float on top of the occluding object. To address this problem, we propose a restarted dual-stream inference framework that decouples text-layout preservation from occluder insertion. A Base Stream provides a clean typographic reference and same-step key/value (K/V) features, while the Edit Stream is conditioned on the occlusion prompt. We further adopt the spectral glyph-prior idea from FreeText and adapt it to stabilize the target text structure during early-to-mid denoising. In the reasoning pass, our method localizes the target text, estimates a text-band region from token-conditioned attention and glyph support, and derives an anchor-aware hard fusion mask for the occluder. In the final edit pass, generation restarts from the same initial noise and applies hard mask-guided image-token K/V replacement at selected attention sites, preserving the Base layout outside the mask while injecting the occluder appearance from the Edit Stream inside the mask. Experiments on representative occluded text scenarios demonstrate substantially improved text readability and competitive occlusion alignment, yielding more stable object-on-text compositions without any model fine-tuning.

Chinese Translation

我们提出了一种无训练的遮挡文本渲染框架，使用预训练的 FLUX.1-dev 主干网络。该任务要求模型渲染可识别的排版，并在预定文本区域上放置一个遮挡物。这一设置对于现有的文本到图像生成器仍然具有挑战性：遮挡物常常偏离文本，而文本可能会失真或看起来漂浮在遮挡物之上。为了解决这个问题，我们提出了一种重启的双流推理框架，将文本布局保持与遮挡物插入解耦。基础流（Base Stream）提供一个干净的排版参考和相同步骤的关键/值（K/V）特征，而编辑流（Edit Stream）则以遮挡提示为条件。我们进一步采用了来自 FreeText 的光谱字形先验思想，并将其调整以在早期到中期去噪过程中稳定目标文本结构。在推理过程中，我们的方法定位目标文本，从基于标记的注意力和字形支持中估计文本带区域，并为遮挡物推导出一个锚点感知的硬融合掩码。在最终编辑过程中，生成从相同的初始噪声重新开始，并在选定的注意力位置应用硬掩码引导的图像标记 K/V 替换，在掩码外部保持基础布局，同时在掩码内部注入来自编辑流的遮挡物外观。在具有代表性的遮挡文本场景上的实验表明，文本可读性显著提高，遮挡对齐表现出竞争力，生成的对象与文本组合更加稳定，无需任何模型微调。


cs.CV / 82 / 2605.16818

Observation-Aligned Mask Priors for Learning Physical Dynamics from Authentic Occlusions

观察对齐的掩码先验用于从真实遮挡中学习物理动态

Ma, Chiyuan, Zhou, Zihan, Yu, Tianshu

Abstract

Learning physical dynamics directly from incomplete observations is challenging because authentic occlusions are structured, sample-dependent, and often missing not at random, whereas existing methods typically rely on heuristic masking rules or predefined mask distributions. We propose Observation-Aligned Mask Priors, a framework that learns the distribution of authentic observation masks and uses it to construct context-query partitions for training from incomplete data. Specifically, we pretrain a Bayesian Flow Network (BFN) on binary observation masks to capture real occlusion topologies, then guide BFN sampling with a globally normalized cross-entropy objective to generate sample-specific masks aligned with each sparse observation. The intersection between the guided mask and the observed mask defines the context, and the remaining observed entries become query targets for a diffusion-based reconstruction model. We show that this intersection-based partitioning gives every valid observed dimension a strictly positive probability of being queried, preventing zero-query dead zones and local generative collapse. Experiments on three real-world oceanographic datasets with authentic satellite occlusions, across resolutions up to 256×256, show consistent improvements over strong diffusion baselines in MSE and PSNR. These results demonstrate that learning mask priors from authentic occlusions is an effective alternative to heuristic masking for learning from incomplete physical observations without access to fully observed fields.
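The intersection-based context/query partition described in this abstract can be illustrated with a minimal sketch. It assumes flat binary masks (1 = observed or kept, 0 = missing); the function name `partition_masks` is hypothetical, and the BFN that would produce the guided mask is not modeled here.

```python
def partition_masks(observed, guided):
    """Split observed entries into context and query sets.

    observed: list of 0/1 flags, 1 where the field was actually observed.
    guided:   list of 0/1 flags, a sampled mask (assumed given here).
    Context = observed AND guided; query = observed AND NOT guided.
    """
    if len(observed) != len(guided):
        raise ValueError("masks must have the same length")
    context = [o & g for o, g in zip(observed, guided)]
    query = [o & (1 - g) for o, g in zip(observed, guided)]
    return context, query


observed_mask = [1, 1, 0, 1]
guided_mask = [1, 0, 1, 0]
context_mask, query_mask = partition_masks(observed_mask, guided_mask)
```

Every observed entry lands in exactly one of the two sets, and unobserved entries are never used as targets, which is the property the abstract relies on.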

Chinese Translation

直接从不完整观察中学习物理动态具有挑战性，因为真实遮挡是结构化的、依赖样本的，并且通常不是随机缺失的，而现有方法通常依赖于启发式掩码规则或预定义的掩码分布。我们提出了观察对齐的掩码先验（Observation-Aligned Mask Priors），这是一个学习真实观察掩码分布的框架，并利用该分布构建用于从不完整数据中训练的上下文查询分区。具体而言，我们在二进制观察掩码上预训练一个贝叶斯流网络（Bayesian Flow Network, BFN），以捕捉真实遮挡拓扑，然后通过全局归一化的交叉熵目标指导BFN采样，生成与每个稀疏观察对齐的样本特定掩码。引导掩码与观察掩码的交集定义了上下文，而剩余的观察条目则成为基于扩散的重建模型的查询目标。我们展示了这种基于交集的分区使每个有效观察维度都有严格正的被查询概率，从而防止零查询死区和局部生成崩溃。在三个具有真实卫星遮挡的海洋学数据集上的实验，分辨率高达256×256，显示出在均方误差（MSE）和峰值信噪比（PSNR）方面相较于强扩散基线的一致性改善。这些结果表明，从真实遮挡中学习掩码先验是学习不完整物理观察的有效替代方案，而无需访问完全观察的场。


cs.CV / 83 / 2605.16832

Coarse Semantic Injection for LLM-Conditioned Structured Indoor Prediction

粗略语义注入用于大语言模型条件下的结构化室内预测

Zhu, Shuliang, Adey, Tomiwa, Zhou, Jinjia

Abstract

Large language models (LLMs) have recently been used as structured decoders for indoor understanding from 3D point-token inputs. However, point cloud encoders often under-represent thin structural elements such as doors and windows after voxelization and sparse pooling, and may miss individual furniture instances in cluttered scenes. We propose an interface-preserving semantic augmentation for LLM-conditioned structured decoding. The key idea is to associate semantic evidence with the point-cloud representation, reduce it to a coarse four-group code (furniture, walls, openings, and others), and encode it as an RGBB point interface: red for furniture, green for walls, blue for openings, and black for others, where RGBB denotes four semantic color states represented in three RGB channels rather than an additional fourth channel. This semantic color code is appended to the original raw point attributes before tokenization, so geometry and semantics share the same sparse tokenization path while the downstream language model decoder and output serialization remain unchanged. We further introduce a lightweight routed semantic shift module, with an auxiliary head used only for training-time ratio/budget regularization and analysis, to strengthen semantic cues after sparse pooling. The overall pipeline can use RGB-derived semantic evidence. Under these controlled semantic-source settings, the reported metrics improve across Structured3D, the SpatialLM dataset, and ARKitScenes, especially for opening localization and per-instance furniture detection in cluttered scenes. Ablations clarify the roles of semantic source, color coding, token fusion, and shift injection, while also showing that color/entropy effects remain nontrivial.
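As a rough illustration of the RGBB interface described in this abstract, the sketch below maps the four coarse groups to three-channel colors and appends them to a point's raw attributes. The color scale (0.0 to 1.0), the attribute layout, and the names `SEMANTIC_COLORS` and `append_semantic_color` are assumptions for illustration, not details from the paper.

```python
# Four semantic states encoded in three RGB channels (no fourth channel).
SEMANTIC_COLORS = {
    "furniture": (1.0, 0.0, 0.0),  # red
    "walls": (0.0, 1.0, 0.0),      # green
    "openings": (0.0, 0.0, 1.0),   # blue
    "others": (0.0, 0.0, 0.0),     # black
}


def append_semantic_color(point_attributes, group):
    """Return the raw point attributes followed by the group's color code.

    Unknown group names fall back to "others" (an assumption of this sketch).
    """
    color = SEMANTIC_COLORS.get(group, SEMANTIC_COLORS["others"])
    return tuple(point_attributes) + color


door_point = append_semantic_color((0.5, 1.0, 2.0), "openings")
unknown_point = append_semantic_color((0.0, 0.0, 0.0), "ceiling lamp")
```

Because the code is appended before tokenization, geometry and semantics would pass through the same tokenizer without changes to the decoder.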

Chinese Translation

大型语言模型（LLMs）最近被用作从3D点令牌输入中进行室内理解的结构化解码器。然而，点云编码器在体素化和稀疏池化后常常对薄结构元素（如门和窗）表现不足，可能会在杂乱场景中遗漏单个家具实例。我们提出了一种保持接口的语义增强方法，用于LLM条件下的结构化解码。关键思想是将语义证据与点云表示关联，将其简化为粗略的四组代码（家具、墙壁、开口和其他），并将其编码为RGBB点接口：红色表示家具，绿色表示墙壁，蓝色表示开口，黑色表示其他，其中RGBB表示在三个RGB通道中表示的四种语义颜色状态，而不是额外的第四个通道。该语义颜色代码在令牌化之前附加到原始点属性上，因此几何和语义共享相同的稀疏令牌化路径，而下游语言模型解码器和输出序列化保持不变。我们进一步引入了一种轻量级的路由语义转移模块，辅助头仅用于训练时的比例/预算正则化和分析，以增强稀疏池化后的语义线索。整体流程可以利用基于RGB的语义证据。在这些受控的语义源设置下，报告的指标在Structured3D、SpatialLM数据集和ARKitScenes上有所改善，特别是在杂乱场景中的开口定位和每个实例家具检测方面。消融实验阐明了语义源、颜色编码、令牌融合和转移注入的作用，同时也表明颜色/熵效应仍然非平凡。


cs.CV / 84 / 2605.16834

Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data

在有限数据下学习细粒度多模态对齐的相对表示

Kim, Shiwon, Park, Yu Rang

Abstract

Multimodal pre-training demonstrates strong generalization performance, but this paradigm is often impractical in domains where paired data are scarce. A promising alternative is post-hoc multimodal alignment, which aligns separately pre-trained unimodal encoders using a limited number of paired examples. However, existing methods focus primarily on aligning global representations, missing patch-token relations. This may hinder transfer to tasks that require fine-grained cross-modal matching beyond coarse sample-level semantics. To address this issue, we propose a post-hoc alignment method that learns token-level cross-modal structure using relative representations. Specifically, we represent images and texts through their token-level similarities to a set of learnable anchors in each modality space, which are trained to induce consistent cross-modal similarity patterns for matched pairs. Despite learning only the anchors without heavy projection layers, our approach consistently outperforms existing methods in zero-shot classification, cross-modal retrieval, and zero-shot segmentation by a substantial margin. This highlights the importance of modeling fine-grained cross-modal structure for effective post-hoc multimodal alignment with limited paired data.
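A relative representation, as described in this abstract, re-expresses each token by its similarities to a set of anchors. The sketch below assumes cosine similarity and fixed anchors; in the paper the anchors are learned, and the similarity function and the name `relative_representation` are assumptions of this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_representation(tokens, anchors):
    """Map each token embedding to its vector of similarities to the anchors."""
    return [[cosine_similarity(t, a) for a in anchors] for t in tokens]


image_tokens = [[1.0, 0.0], [3.0, 4.0]]
image_anchors = [[1.0, 0.0], [0.0, 1.0]]
relative = relative_representation(image_tokens, image_anchors)
```

Each modality would use its own anchors, so image and text tokens end up in spaces of the same dimension (the number of anchors) even when the encoders' embedding sizes differ.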

Chinese Translation

多模态预训练展现出强大的泛化性能，但在配对数据稀缺的领域，这种范式往往不切实际。一种有前景的替代方案是事后多模态对齐，它利用有限数量的配对示例分别对预训练的单模态编码器进行对齐。然而，现有方法主要集中在对齐全局表示，忽视了补丁-标记之间的关系。这可能会阻碍转移到需要细粒度跨模态匹配的任务，超越粗略的样本级语义。为了解决这个问题，我们提出了一种事后对齐方法，通过相对表示学习标记级跨模态结构。具体而言，我们通过每个模态空间中与一组可学习锚点的标记级相似性来表示图像和文本，这些锚点经过训练以诱导匹配对的一致跨模态相似性模式。尽管仅学习锚点而不使用复杂的投影层，我们的方法在零样本分类、跨模态检索和零样本分割中始终显著优于现有方法。这突显了在有限配对数据下，建模细粒度跨模态结构对于有效的事后多模态对齐的重要性。


cs.CV / 85 / 2605.16848

Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

通过模式思维：通过模式归纳打破视觉规划中的感知瓶颈

Jian, Yichang, Xiao, Boyuan, Huang, Zhenyuan, Peng, Yifei, Ding, Yao-Xiang

Abstract

Planning from raw visual input remains a significant challenge for current Vision-Language Models (VLMs), when the complexity of input is beyond their one-step perception capability. Motivated by recent advances in Thinking with Images (TWI), a reasonable solution is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, even though current VLMs are well-trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To tackle this challenge, we formulate TWI as a tool to gradually build and reflect an accurate internal world model. We find that the resulting training-free planning strategy enables VLMs to solve tasks that are far beyond their initial capabilities, at the cost that too many TWI operations would significantly increase the computational overhead. To further improve efficiency, we propose Pattern Inference, a novel TWI strategy enabling VLMs to actively recognize known visual patterns in the new tasks and directly infer local world model structures. To obtain these patterns, we propose Pattern Induction, an online inductive learning strategy treating visual patterns as composite and reusable experts, which are autonomously discovered and optimized from experience. Experimental evaluations in FrozenLake, Crafter and CubeBench domains show that our approaches achieve a desirable balance between accuracy and efficiency.

Chinese Translation

从原始视觉输入进行规划仍然是当前视觉语言模型（VLMs）面临的一项重大挑战，尤其是当输入的复杂性超出其一步感知能力时。受最近在图像思维（Thinking with Images, TWI）领域的进展启发，一个合理的解决方案是通过迭代获取和整合局部视觉证据，将感知过程分解为更简单的步骤。然而，尽管当前的VLMs在一般TWI能力上训练良好，但它们在规划领域的感知瓶颈依然存在。为了解决这一挑战，我们将TWI视为一种工具，以逐步构建和反映准确的内部世界模型。我们发现，结果是无训练的规划策略使得VLMs能够解决远超其初始能力的任务，但代价是过多的TWI操作会显著增加计算开销。为了进一步提高效率，我们提出了模式推理（Pattern Inference），这是一种新颖的TWI策略，使VLMs能够主动识别新任务中的已知视觉模式，并直接推断局部世界模型结构。为了获取这些模式，我们提出了模式归纳（Pattern Induction），这是一种在线归纳学习策略，将视觉模式视为复合和可重用的专家，这些专家从经验中自主发现和优化。在FrozenLake、Crafter和CubeBench领域的实验评估表明，我们的方法在准确性和效率之间达到了理想的平衡。


cs.CV / 86 / 2605.16859

VGGT-CD: Training-Free Robust Registration for 3D Change Detection

VGGT-CD：无训练的鲁棒性注册用于三维变化检测

Zhang, Wei, Li, Songhua, Wu, Yihang, Li, Qiang, Wang, Qi

Abstract

3D change detection from multi-view images is essential for urban monitoring, disaster assessment, and autonomous driving. However, existing methods predominantly operate in the 2D domain, where viewpoint variations are mistaken for physical changes and depth is unavailable. While visual geometry foundation models like VGGT rapidly produce dense point clouds from unposed images, independent per-epoch reconstruction encounters fundamental obstacles: unpredictable inter-epoch scale ambiguity, registration-change paradox where scene changes corrupt alignment, and pervasive edge-flying noise. To address these challenges, we present VGGT-CD, a training-free pipeline decoupling cross-temporal registration from dynamic-change interference. In the Coarse Stage, sparse keyframe joint inference establishes a unified metric space and yields an initial Sim(3) prior. In the Fine Stage, dense reconstructions are purified by isolating static-background correspondences. A closed-form centroid alignment refines the translation while locking scale and rotation, using a residual self-check to mathematically guarantee non-degradation. Evaluated on an 11-scene benchmark from the World Across Time dataset, VGGT-CD reduces Absolute Trajectory Error by 44% outdoors and 59% indoors. It completes registration over 6 times faster, producing high-purity 3D change maps without task-specific training.
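The closed-form centroid alignment with a residual self-check mentioned in this abstract might look roughly like the sketch below. It assumes scale and rotation have already been applied to the source points, that only static-background correspondences are passed in, and that the check compares mean Euclidean residuals; the abstract does not specify the exact check, and the function names are hypothetical.

```python
import math


def mean_residual(source, target, translation):
    """Mean Euclidean distance between translated source points and targets."""
    total = 0.0
    for s, d in zip(source, target):
        moved = [s[i] + translation[i] for i in range(3)]
        total += math.dist(moved, d)
    return total / len(source)


def refine_translation(source, target, prior_translation):
    """Closed-form translation from centroid difference, kept only if not worse."""
    n = len(source)
    candidate = tuple(
        sum(d[i] - s[i] for s, d in zip(source, target)) / n for i in range(3)
    )
    # Self-check: fall back to the prior if the candidate raises the residual.
    if mean_residual(source, target, candidate) <= mean_residual(
        source, target, prior_translation
    ):
        return candidate
    return prior_translation


static_source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
static_target = [(1.0, 2.0, 3.0), (2.0, 2.0, 3.0)]
refined = refine_translation(static_source, static_target, (0.0, 0.0, 0.0))
```

Returning the prior whenever the candidate is worse is what makes the step non-degrading under the chosen residual.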

Chinese Translation

从多视角图像中进行三维变化检测对于城市监测、灾害评估和自动驾驶至关重要。然而，现有方法主要在二维领域中操作，在此领域中，视角变化被误认为是物理变化，并且深度信息不可用。虽然视觉几何基础模型如VGGT能够快速从未定姿态的图像中生成密集点云，但独立的每个训练周期重建面临根本性障碍：不可预测的跨周期尺度模糊、注册变化悖论（即场景变化破坏对齐）以及普遍存在的边缘噪声。为了解决这些挑战，我们提出了VGGT-CD，一个无训练的管道，将跨时间注册与动态变化干扰解耦。在粗略阶段，稀疏关键帧联合推理建立了统一的度量空间，并生成初始的Sim(3)先验。在精细阶段，通过隔离静态背景对应关系来净化密集重建。闭式形式的质心对齐在锁定尺度和旋转的同时优化平移，并使用残差自检在数学上保证无降级。在来自“时间跨越世界”数据集的11个场景基准上进行评估，VGGT-CD在户外将绝对轨迹误差降低了44%，在室内降低了59%。它的注册速度超过6倍，生成高纯度的三维变化图，而无需特定任务的训练。


cs.CV / 87 / 2605.16861

Prefix-Adaptive Block Diffusion for Efficient Document Recognition

前缀自适应块扩散模型用于高效文档识别

Chai, Mingxu, Shen, Ziyu, Liu, Chenyu, Zhang, Kaidi, Zhang, Jiazheng, Zhu, Dingwei, Xi, Zhiheng, Chen, Ruoyu, Long, Jun, Kang, Jihua, Gui, Tao, Zhang, Qi

Abstract

Block Diffusion Models (BDMs) support parallel generation, flexible-length output, and KV caching, making them promising for efficient document parsing. However, existing BDMs bind denoising and cache commitment to fixed block boundaries: parallelism shrinks during intra-block denoising, while generated tokens cannot be cached until the whole block is completed. Moreover, intra-block bidirectional denoising conflicts with inter-block autoregression, creating inconsistent information flow that can challenge structure-sensitive recognition. We propose the Prefix-Adaptive Block Diffusion Model (PA-BDM), which replaces intra-block bidirectional denoising with causal denoising from prefix to suffix and treats the block size as a maximum candidate range rather than a fixed commitment unit. PA-BDM uses Confidence-gated Structural Loss (CSL) to build low-entropy prefixes before extending training to longer continuations. During inference, Progressive Prefix Commitment (PPC) then dynamically commits the longest reliable prefix into the KV cache and resets the next candidate range from the updated prefix, restoring a large parallel decoding space at each step. Experiments show that the 3B PA-BDM achieves higher recognition scores on several benchmarks and improves inference throughput by 71.6% over the 2.5B MinerU-Diffusion.
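The idea of committing the longest reliable prefix can be sketched as follows. The reliability rule used here (per-token confidence at or above a fixed threshold, with at least one token committed per step so decoding always advances) is an assumption of this example; the abstract does not state the actual criterion, and `longest_reliable_prefix` is a hypothetical name.

```python
def longest_reliable_prefix(confidences, threshold=0.9, min_commit=1):
    """Return how many leading tokens of the candidate range to commit.

    Counts tokens from the left until the first one whose confidence
    falls below the threshold, then enforces a minimum commit length.
    """
    count = 0
    for confidence in confidences:
        if confidence < threshold:
            break
        count += 1
    if not confidences:
        return 0
    return max(count, min_commit)


committed = longest_reliable_prefix([0.99, 0.95, 0.50, 0.97])
forced = longest_reliable_prefix([0.20, 0.99])
```

After each commit, the next candidate range would start right after the committed prefix, so a high-confidence token that follows a low-confidence one (the 0.97 above) is re-decoded rather than cached.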

Chinese Translation

块扩散模型（Block Diffusion Models, BDMs）支持并行生成、灵活长度输出和KV缓存，使其在高效文档解析中具有良好的前景。然而，现有的BDMs将去噪和缓存承诺绑定到固定的块边界：在块内去噪过程中并行性减小，而生成的标记在整个块完成之前无法缓存。此外，块内的双向去噪与块间自回归相冲突，导致信息流不一致，这可能会挑战结构敏感的识别。我们提出了前缀自适应块扩散模型（Prefix-Adaptive Block Diffusion Model, PA-BDM），该模型用从前缀到后缀的因果去噪替代了块内的双向去噪，并将块大小视为最大候选范围，而不是固定的承诺单位。PA-BDM使用置信门控结构损失（Confidence-gated Structural Loss, CSL）在扩展训练到更长的延续之前构建低熵前缀。在推理过程中，渐进前缀承诺（Progressive Prefix Commitment, PPC）动态地将最长的可靠前缀提交到KV缓存中，并根据更新的前缀重置下一个候选范围，从而在每一步恢复较大的并行解码空间。实验表明，3B PA-BDM在多个基准测试中实现了更高的识别分数，并且在推理吞吐量上比2.5B MinerU-Diffusion提高了71.6%。


cs.CV / 88 / 2605.16864

Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks

基于度量引导的视觉基础模型特征融合用于分割任务

Guo, Yachan, Zurita, JoseLuis Gomez, Xue, Danna, Xiao, Yi, Pena, AntonioManuel Lopez

Abstract

Although large-scale visual foundation models (VFMs) achieve remarkable performance in semantic understanding, they still underperform in instance-aware dense prediction tasks. They exhibit different biases in representation: for instance, promptable segmentation models (e.g., SAM2) focus on fine-grained region boundaries, while self-supervised models (e.g., DINOv3) emphasize object-level structure. This observation highlights the potential of combining complementary features from different VFMs to enhance downstream dense prediction tasks. However, naive multi-VFM fusion seldom leads to reliable gains, and interpretable principles for leveraging their complementary features are still underexplored. In this work, we propose a metric-guided approach that effectively selects and aggregates complementary features from different VFMs based on explicit assessment scores. Specifically, we design a suite of label-free metrics in feature space across two aspects, Structural Coherence and Edge Fidelity, to assess features of VFM encoders. Guided by these scores, we identify complementary edge-strong and structure-strong encoder pairs, and integrate them via a master-auxiliary fusion scheme. This feature fusion requires no complex architectural changes and is trained only in a single stage. Our model shows consistent performance gains across multiple dense prediction tasks compared with the baselines, with better object-level semantics and more accurately localized boundaries. The code is available at https://github.com/gyc-code/metric-guided-fusion.

Chinese Translation

尽管大规模视觉基础模型（VFM）在语义理解方面取得了显著的性能，但在实例感知的密集预测任务中仍然表现不佳。这些模型在表示上存在不同的偏差：例如，可提示的分割模型（如 SAM2）关注于细粒度的区域边界，而自监督模型（如 DINOv3）则强调对象级结构。这一观察结果突显了结合来自不同 VFM 的互补特征以增强下游密集预测任务的潜力。然而，简单的多 VFM 融合往往难以带来可靠的提升，而利用其互补特征的可解释原则仍然未被充分探索。在本研究中，我们提出了一种度量引导的方法，该方法基于显式评估分数有效地选择和聚合来自不同 VFM 的互补特征。具体而言，我们设计了一套在特征空间中无标签的度量，从结构一致性和边缘保真度两个方面评估 VFM 编码器的特征。在这些分数的引导下，我们识别出互补的边缘强和结构强编码器对，并通过主-辅助融合方案将它们整合。这种特征融合不需要复杂的架构变化，并且仅在单个阶段进行训练。与基线相比，我们的模型在多个密集预测任务中显示出一致的性能提升，具有更好的对象级语义和更准确的边界定位。代码可在 https://github.com/gyc-code/metric-guided-fusion 获取。


cs.CV / 89 / 2605.16873

HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction

HAD：用于三维重建的幻觉感知扩散先验

Liu, Xi, Sun, Weiwei, Ren, Zhou, Broaddus, Chris, Huang, Siyu, Guigues, Laurent

Abstract

Diffusion priors have recently demonstrated strong capability in enhancing the quality of sparse-view 3D reconstruction by augmenting training views at novel viewpoints, but they inevitably introduce hallucinated content (artifacts inconsistent with the input views) into the final 3D model. To address this challenge, we propose Hallucination-Aware Diffusion prior (HAD), which estimates pixel-wise hallucination score maps for augmented images by leveraging multi-view reasoning capabilities from a feedforward novel view synthesis (NVS) network pre-trained on large-scale 3D data. These hallucination scores enable selective masking of unreliable pixels during the progressive 3D reconstruction procedure, preventing the introduction of non-existent artifacts into the 3D model. To further enhance performance, we create multiple versions of augmented images at each novel view by conditioning the diffusion prior on different input views, which are then fused into a final image that leverages the broader context across all input views. We show that our method substantially reduces hallucination artifacts in diffusion-assisted 3D reconstruction, thereby achieving state-of-the-art performance across multiple benchmarks on novel view synthesis. Our project is publicly available at https://xiliu8006.github.io/HAD-Project-website/.
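Selective masking by hallucination score can be sketched as a masked photometric loss. The sketch assumes flattened grayscale pixels, scores in [0, 1] where higher means less reliable, a hard threshold, and an L1 penalty; none of these choices are stated in the abstract, and `masked_l1_loss` is a hypothetical name.

```python
def masked_l1_loss(rendered, augmented, hallucination_scores, threshold=0.5):
    """Mean absolute error over pixels whose hallucination score is below threshold.

    Pixels at or above the threshold are treated as unreliable and ignored,
    so they cannot pull the 3D model toward hallucinated content.
    """
    total = 0.0
    kept = 0
    for r, a, score in zip(rendered, augmented, hallucination_scores):
        if score < threshold:
            total += abs(r - a)
            kept += 1
    return total / kept if kept else 0.0


rendered_pixels = [0.2, 0.8, 0.5]
augmented_pixels = [0.4, 0.1, 0.5]
scores = [0.1, 0.9, 0.2]
loss = masked_l1_loss(rendered_pixels, augmented_pixels, scores)
```

In this example the middle pixel has the largest disagreement but is excluded because its score marks it as likely hallucinated.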

Chinese Translation

扩散先验最近在通过在新视点增强训练视图方面展示了强大的能力，从而提升稀疏视图三维重建的质量，但它们不可避免地会在最终的三维模型中引入幻觉内容——与输入视图不一致的伪影。为了解决这一挑战，我们提出了幻觉感知扩散先验（HAD），该方法通过利用在大规模三维数据上预训练的前馈新视点合成（NVS）网络的多视图推理能力，为增强图像估计逐像素的幻觉评分图。这些幻觉评分使得在渐进式三维重建过程中能够选择性地屏蔽不可靠的像素，从而防止将不存在的伪影引入三维模型。为了进一步提升性能，我们通过对不同输入视图进行条件化，在每个新视点创建多个版本的增强图像，然后将这些图像融合为一个最终图像，以利用所有输入视图的更广泛上下文。我们展示了我们的方法显著减少了扩散辅助三维重建中的幻觉伪影，从而在多个新视点合成基准上实现了最先进的性能。我们的项目在项目网站上公开可用。


cs.CV / 90 / 2605.16877

Zero-Shot Faithful Textual Explanations via Directional-Derivative Influence on Predictions

通过方向导数对预测的影响实现零样本可信文本解释

Yamauchi, Toshinori, Kera, Hiroshi, Kawamoto, Kazuhiko

Abstract

Zero-shot textual explanations aim to make image classifiers more transparent by probing their internal representations, without relying on task-specific supervision or LVLMs. However, existing methods often miss the features that truly drive the prediction, resulting in limited faithfulness to the evidence underlying the model's decision. To address this, we propose FaithTrace. Motivated by the idea that faithful explanations should describe concepts that strongly influence the prediction, FaithTrace directly measures how much the representation induced by the explanation changes the class logit. We introduce an influence score, computed as the directional derivative of the class logit along the text-induced direction in the classifier's feature space, and use it as a proxy for faithfulness. Moreover, we extend this influence score into quantitative evaluation metrics, helping fill the gap in faithfulness evaluation for textual explanations. Experiments show that FaithTrace yields more faithful explanations than baselines, facilitating a more accurate understanding of the model. The code will be publicly released.
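The influence score in this abstract is a directional derivative of the class logit. A minimal numerical sketch, assuming the logit is available as a plain function of the feature vector and approximating the derivative with a central finite difference (the paper may well use automatic differentiation instead); `directional_derivative` and the toy linear head are illustrative assumptions.

```python
import math


def directional_derivative(logit_fn, feature, direction, eps=1e-4):
    """Central-difference estimate of d(logit)/d(direction) at `feature`.

    The direction is normalized first, so the score reflects sensitivity
    per unit step in feature space rather than the direction's length.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    plus = [f + eps * u for f, u in zip(feature, unit)]
    minus = [f - eps * u for f, u in zip(feature, unit)]
    return (logit_fn(plus) - logit_fn(minus)) / (2.0 * eps)


def toy_class_logit(feature):
    """Hypothetical linear head: logit = 2*f0 - 1*f1."""
    return 2.0 * feature[0] - 1.0 * feature[1]


score_aligned = directional_derivative(toy_class_logit, [0.3, 0.7], [5.0, 0.0])
score_opposed = directional_derivative(toy_class_logit, [0.3, 0.7], [0.0, 2.0])
```

For a linear head the result equals the dot product of the weight vector with the unit direction, so a text direction aligned with the first feature scores about 2 and one aligned with the second scores about -1.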

Chinese Translation

零样本文本解释旨在通过探测图像分类器的内部表示，使其更加透明，而不依赖于特定任务的监督或大型视觉语言模型（LVLMs）。然而，现有的方法往往忽视了真正驱动预测的特征，导致对模型决策背后证据的可信性有限。为了解决这个问题，我们提出了 FaithTrace。FaithTrace 的提出基于这样一个理念：可信的解释应描述对预测有强烈影响的概念。FaithTrace 直接测量由解释引发的表示如何改变类别对数几率。我们引入了一种影响评分，该评分计算为分类器特征空间中沿文本引导方向的类别对数几率的方向导数，并将其作为可信性的代理。此外，我们将这一影响评分扩展为定量评估指标，以帮助填补文本解释可信性评估的空白。实验表明，FaithTrace 提供的解释比基线方法更具可信性，从而促进了对模型的更准确理解。代码将公开发布。


cs.CV / 91 / 2605.16879

Towards Generalized Image Manipulation Localization via Score-based Model

基于评分模型的通用图像篡改定位研究

Wang, Yunfei, Du, Bo, Yang, Zhe, Liu, Xin, Lin, Zhiyu, Xu, Tianxin, Zhou, Ji-Zhe

Abstract

With the rapid evolution of synthetic media, Image Manipulation Localization (IML) has emerged as a critical component in multimedia forensics for ensuring the integrity of digital content. However, generalization remains a core challenge, as existing discriminative methods typically learn a fixed decision boundary that tends to overfit to specific training artifacts and fails to adapt to unseen manipulation types. To address this, we propose DiffIML, a novel framework that introduces score-based generative modeling to IML. Diverging from the direct estimation of hard boundaries, DiffIML approximates the score function, the gradient of the log-likelihood, to capture the intrinsic geometric topology of mask distributions. This paradigm leverages structural priors to iteratively recover coherent masks from noise, thereby circumventing the brittleness associated with discriminative models. Under this formulation, diffusion models serve as an effective numerical solver for the learned score function. To ensure practicality, we address the efficiency and stability bottlenecks of standard diffusion, respectively, by (1) using a lightweight mask-specific VAE for fast latent-space processing together with a decoupled architecture with a lightweight denoising UNet, and (2) applying edge supervision and an error prior to mitigate error accumulation during sampling. Extensive experiments under two distinct protocols on eight non-generative and three generative benchmarks demonstrate that DiffIML consistently outperforms state-of-the-art methods, yielding remarkable generalization improvements on diverse unseen datasets. The code will be publicly available.

Chinese Translation

随着合成媒体的快速发展，图像篡改定位（IML）已成为多媒体取证中确保数字内容完整性的关键组成部分。然而，泛化能力仍然是一个核心挑战，因为现有的判别方法通常学习固定的决策边界，容易过拟合特定的训练伪影，并且无法适应未见过的篡改类型。为了解决这一问题，我们提出了DiffIML，一个将基于评分的生成建模引入IML的新框架。DiffIML不同于直接估计硬边界，而是近似评分函数，即对数似然的梯度，以捕捉掩模分布的内在几何拓扑。该范式利用结构先验，从噪声中迭代恢复一致的掩模，从而规避了与判别模型相关的脆弱性。在这一框架下，扩散模型作为学习到的评分函数的有效数值求解器。为确保实用性，我们分别通过以下方式解决了标准扩散的效率和稳定性瓶颈：(1) 利用轻量级掩模特定变分自编码器（VAE）进行快速潜在空间处理，并采用轻量级去噪UNet的解耦架构，(2) 边缘监督和误差先验以减轻采样过程中的误差累积。在八个非生成性和三个生成性基准上的两种不同协议的广泛实验表明，DiffIML始终优于最先进的方法，在多样的未见数据集上实现了显著的泛化改进。代码将公开发布。


cs.CV / 92 / 2605.16887

Mind the Gap: Learning Modality-Agnostic Representations with a Cross-Modality UNet

注意差距：通过跨模态 UNet 学习模态无关表示

Niu, Xin, Li, Enyi, Liu, Jinchao, Wang, Yan, Osadchy, Margarita, Fang, Yongchun

Abstract

Cross-modality recognition has many important applications in science, law enforcement and entertainment. Popular methods to bridge the modality gap include reducing the distributional differences between representations of different modalities, learning indistinguishable representations, and explicit modality transfer. The first two approaches lose discriminant information while removing modality-specific variations. The third relies heavily on successful modality transfer and can suffer a catastrophic performance drop when explicit modality transfer is impossible or difficult. To tackle this problem, we propose a compact encoder-decoder neural module (cmUNet) to learn modality-agnostic representations while retaining identity-related information. This is achieved through cross-modality transformation and in-modality reconstruction, enhanced by an adversarial/perceptual loss which encourages indistinguishability of representations in the original sample space. For cross-modality matching, we propose MarrNet, in which cmUNet is connected to a standard feature extraction network that takes the modality-agnostic representations as input and outputs similarity scores for matching. We validated our method on five challenging tasks, namely Raman-infrared spectrum matching, cross-modality person re-identification and heterogeneous (photo-sketch, visible-near infrared and visible-thermal) face recognition, where MarrNet showed superior performance compared to state-of-the-art methods. Furthermore, we observe that a cross-modality matching method can be biased toward extracting discriminant information from partial or even wrong regions because it cannot handle the modality gap, which subsequently leads to poor generalization. We show that robustness to occlusions can be an indicator of whether a method bridges the modality gap well.

Chinese Translation

跨模态识别在科学、执法和娱乐等领域有着重要的应用。弥合模态差距的常用方法包括减少不同模态表示的分布差异、学习不可区分的表示或显式模态转移。前两种方法在去除特定模态变异时会损失判别信息，而第三种方法则严重依赖于成功的模态转移，当显式模态转移不可行或困难时，可能会面临灾难性的性能下降。为了解决这个问题，我们提出了一种紧凑的编码-解码神经模块（cmUNet），以学习模态无关的表示，同时保留与身份相关的信息。这是通过跨模态转换和模态内重建实现的，并通过对抗/感知损失增强，鼓励在原始样本空间中表示的不可区分性。对于跨模态匹配，我们提出了 MarrNet，其中 cmUNet 连接到一个标准特征提取网络，该网络以模态无关的表示为输入，并输出匹配的相似度分数。我们在五个具有挑战性的任务上验证了我们的方法，即拉曼-红外光谱匹配、跨模态人重新识别和异构（照片-素描、可见光-近红外和可见光-热成像）人脸识别，其中 MarrNet 的性能优于最先进的方法。此外，观察到跨模态匹配方法可能会偏向于从部分甚至错误的区域提取判别信息，这是由于处理模态差距的能力不足，随后导致较差的泛化能力。我们展示了对遮挡的鲁棒性可以作为判断一种方法是否能够很好地弥合模态差距的指标。

cs.CV / 93 / 2605.16889

Controlling Decision Drift in Multimodal Sentiment Analysis with Missing Modalities

在缺失模态下控制多模态情感分析中的决策漂移

Chen, Chenglizhao, Cao, Yuchen, Liu, Xinyu, Song, Mengke, Zhang, Guisheng, Yu, Xiaomin

Abstract

Multimodal sentiment analysis relies on textual, acoustic, and visual signals, yet real-world data often suffer from missing modalities and quality imbalance. Existing methods generate features for missing modalities from the available ones, but differences in expression mechanisms and sentiment dynamics across modalities may cause the generated features to deviate from true distributions and mislead prediction. In addition, unreliable modalities may dominate fusion, resulting in representation shift across modality combinations and unstable sentiment representations. To address these challenges, we propose a two-level reference alignment framework. The framework introduces stable references at the feature representation and sentiment decision levels to improve robustness when modalities are missing. First-level reference alignment leverages complete-modality samples to constrain representations and align different modality combinations into a shared sentiment space. Second-level reference alignment enforces cross-modal consistency at the decision level by suppressing unreliable modalities through prototype retrieval and voting. As a result, the framework maintains stable and reliable sentiment predictions under diverse missing-modality patterns. Experiments on CMU-MOSI and CMU-MOSEI show consistent improvements across various missing-modality settings. Under full-modality input, the proposed method achieves state-of-the-art performance, with ACC of 86.28% and 85.88%, and F1 of 86.24% and 85.86%.

Chinese Translation

多模态情感分析依赖于文本、声学和视觉信号，但现实世界的数据常常面临模态缺失和质量不平衡的问题。现有方法通过利用可用模态生成缺失模态的特征，但不同模态之间的表达机制和情感动态的差异可能导致生成的特征偏离真实分布，从而误导预测。此外，不可靠的模态可能主导融合，导致模态组合之间的表示转移和情感表示的不稳定。为了解决这些挑战，我们提出了一种两级参考对齐框架。该框架在特征表示和情感决策层引入稳定的参考，以提高在模态缺失情况下的鲁棒性。第一层参考对齐利用完整模态样本来约束表示，并将不同模态组合对齐到共享的情感空间。第二层参考对齐通过原型检索和投票抑制不可靠模态，在决策层强制执行跨模态一致性。因此，该框架在多种缺失模态模式下保持稳定和可靠的情感预测。在CMU-MOSI和CMU-MOSEI上的实验显示，在各种缺失模态设置下均有一致的改进。在全模态输入下，所提方法实现了最先进的性能，ACC分别为86.28%和85.88%，F1分别为86.24%和85.86%。

cs.CV / 94 / 2605.16892

DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios

DriveSafe：一种在驾驶场景中进行风险检测和安全建议的框架

Artham, Sainithin, Gangisetty, Shankar, Dasgupta, Avijit, Jawahar, C. V.

Abstract

Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision-language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption-risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. Project page: https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe

Chinese Translation

全面的情境意识对于在安全关键环境中运行的自动驾驶车辆至关重要，因为它使潜在风险的识别和缓解成为可能。尽管近期的多模态大型语言模型（Multimodal Large Language Models, MLLMs）在一般视觉-语言任务上显示出良好的前景，但我们的研究发现，零-shot MLLMs在细粒度、空间基础的风险评估方面仍然不及领域特定的方法。为了解决这一问题，我们提出了DriveSafe，一个利用结构化自然语言描述进行风险感知场景理解的框架。具体而言，我们的方法首先生成富含多模态上下文的空间基础标题，包括运动、空间和深度线索。这些标题随后用于下游风险评估，明确识别危险物体、它们的位置以及它们所暗示的不安全行为，并提供可行的安全建议。为了进一步提高性能，我们采用标题-风险配对来微调一个轻量级适配器模块，有效地将领域特定知识注入基础LLM。通过将风险评估基于明确的语言场景表示，DriveSafe在零-shot MLLMs和之前的领域特定基准上实现了显著的提升。在DRAMA基准上的全面实验展示了最先进的性能，而消融研究验证了我们关键设计选择的有效性。项目页面：https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe

cs.CV / 95 / 2605.16899

LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map

LASAR：朝着具有潜在认知地图的时空推理迈进

Tang, Jinzhou, Liu, Sidi, Xiu, Waikit, Chen, Weixing, Wang, Keze

Abstract

A fundamental challenge in embodied AI is verifying whether agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This is critical because foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (like topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both episodic experiences and a semantic cognitive map. We then introduce Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive objective designed to train this architecture. ST-CRL leverages spatio-temporal cues from cognitive queries generated through annotated spatio-temporal context in simulation to build sample pairs, thereby forming the internal cognitive map from the agent's experiences. Experiments demonstrate that our method achieves 2%-3.5% gains in zero-shot generalization on both the standard VLN-CE and VSI-Bench benchmarks. We also demonstrate that our proposed cognitive map has high self-consistency.

Chinese Translation

在具身人工智能中，一个基本挑战是验证代理是否构建了空间结构的内部模型，或者仅仅学习模仿特定任务的专家轨迹。这一点至关重要，因为以动作为中心的基础方法（例如，VLN）和以推理为中心的任务（例如，EQA）往往存在一个共同的局限性：它们缺乏一种学习信号，迫使它们在长距离、碎片化的经验中编码细粒度的空间关系（如拓扑或距离）。为了解决这个问题，我们首先提出了LASAR，这是一种具有双重记忆系统的架构，旨在同时维护情节经验和语义认知地图。然后，我们引入了时空上下文表示学习（Spatio-temporal Contextual Representation Learning, ST-CRL），这是一种对比目标，旨在训练该架构。ST-CRL利用通过注释的时空上下文在仿真中生成的认知查询的时空线索来构建样本对，从而形成代理经验的内部认知地图。实验表明，我们的方法在标准的VLN-CE和VSI-Bench基准测试中实现了2%-3.5%的零样本泛化增益。我们还证明了我们提出的认知地图具有较高的自一致性。

cs.CV / 96 / 2605.16901

CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model

CAR-SAM：用于Segment Anything模型后训练量化的交叉注意力重建

Wen, Houji, Yu, Jiangyong, Li, Jun, Yang, Dawei

Abstract

Segment Anything Models (SAMs) are extensively used in computer vision for universal image segmentation, but deploying them on resource-constrained devices is challenging due to their high computational and memory demands. Post-Training Quantization (PTQ) is a widely used technique for model compression and acceleration. However, existing PTQ methods fail to consider the cross-attention architecture in the SAM decoder. The resulting degradation primarily stems from two challenges specific to SAMs: (1) Attention dissipation, where the attention information in the decoder, which is crucial for representing segmentation masks, collapses into a diffuse and non-semantic form under low-bit quantization; and (2) Reconstruction oscillation, where bidirectional coupling within the two-way transformer introduces cross-branch error interference and destabilizes convergence. To tackle these issues, we propose CAR-SAM, a unified quantization framework tailored for SAMs. First, to mitigate attention dissipation, we introduce a MatMul-Aware Compensation (MAC) mechanism that transfers activation-induced quantization errors from MatMul to the preceding linear weights. Second, to mitigate oscillation in decoder optimization, we develop a Joint Cross-Attention Reconstruction (JCAR) strategy that jointly reconstructs coupled attention branches, suppressing oscillatory behavior and promoting stable convergence. Extensive experiments show that CAR-SAM robustly quantizes SAM models down to 4-bit precision, surpassing existing methods by 14.6% and 6.6% mAP on SAM-B and SAM-L respectively.

Chinese Translation

Segment Anything模型（SAMs）在计算机视觉中广泛应用于通用图像分割，但由于其高计算和内存需求，在资源受限的设备上部署它们面临挑战。后训练量化（PTQ）是一种广泛使用的模型压缩和加速技术。然而，现有的PTQ方法未能考虑SAM解码器中的交叉注意力架构。这种降级主要源于SAM所带来的独特挑战：（1）注意力消散，在低位量化下，解码器中的注意力信息（对表示分割掩膜至关重要）崩溃为一种扩散且非语义的形式；（2）重建振荡，双向耦合的双向变换器引入了跨分支的错误干扰，导致收敛不稳定。为了解决这些问题，我们提出了CAR-SAM，一个针对SAM的统一量化框架。首先，为了减轻注意力消散，我们引入了MatMul感知补偿（MAC）机制，将激活引起的量化误差从MatMul转移到前面的线性权重。其次，为了减轻解码器优化中的振荡，我们开发了一种联合交叉注意力重建（JCAR）策略，联合重建耦合的注意力分支，抑制振荡行为并促进稳定收敛。大量实验表明，CAR-SAM能够将SAM模型稳健地量化至4位精度，分别在SAM-B和SAM-L上超越现有方法14.6%和6.6%的mAP。
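
As background for the 4-bit setting mentioned in the abstract above, the following is a minimal sketch of generic symmetric uniform quantization. It is not the CAR-SAM method (MAC and JCAR are not reproduced here); the function name `quantize_symmetric` and the sample weights are hypothetical and chosen only for illustration.

```python
def quantize_symmetric(values, num_bits=4):
    """Uniformly quantize floats to signed integer codes, then dequantize.

    Generic textbook scheme (an assumption for illustration), not CAR-SAM.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 7 for signed 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the signed range.
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    dequant = [c * scale for c in codes]
    return codes, dequant, scale


# Hypothetical weight values, for illustration only.
weights = [0.82, -0.41, 0.05, -1.40, 0.33]
codes, dequant, scale = quantize_symmetric(weights, num_bits=4)
max_err = max(abs(w - d) for w, d in zip(weights, dequant))
print(codes, round(scale, 3), round(max_err, 3))
```

With only 16 levels, the rounding error per value can reach half a quantization step, which gives a sense of why low-bit quantization of attention computations can be lossy.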

cs.CV / 97 / 2605.16903

WOW-Seg: A Word-free Open World Segmentation Model

WOW-Seg: 一种无词开放世界分割模型

Li, Danyang, Wu, Tianhao, Li, Bin, Chen, Zhenyuan, Zhang, Yang, Li, Yuxuan, Cheng, Ming-Ming, Li, Xiang

Abstract

Open world image segmentation aims to achieve precise segmentation and semantic understanding of targets within images by addressing the infinitely open set of object categories encountered in the real world. However, traditional closed-set segmentation approaches struggle to adapt to complex open world scenarios, while foundation segmentation models such as SAM exhibit notable discrepancies between their strong segmentation capabilities and relatively weaker semantic understanding. To bridge these discrepancies, we propose WOW-Seg, a Word-free Open World Segmentation model for segmenting and recognizing objects from open-set categories. Specifically, WOW-Seg introduces a novel visual prompt module, Mask2Token, which transforms image masks into visual tokens and ensures their alignment with the VLLM feature space. Moreover, we introduce the Cascade Attention Mask to decouple information across different instances. This approach mitigates inter-instance interference, leading to a significant improvement in model performance. We further construct an open world region recognition test benchmark: the Region Recognition Dataset (RR-7K). With 7,662 classes, it represents the most extensive category-rich region recognition dataset to date. WOW-Seg attains strong results on the LVIS dataset, achieving a semantic similarity of 89.7 and a semantic IoU of 82.4. This performance surpasses the previous SOTA while using only one-eighth the parameter count. These results underscore the strong open world generalization capabilities of WOW-Seg. The code and related resources are available at https://github.com/AAwcAA/WOW-Seg-Meta.

Chinese Translation

开放世界图像分割旨在通过解决现实世界中遇到的无限开放对象类别集合，实现对图像中目标的精确分割和语义理解。然而，传统的闭集分割方法在适应复杂的开放世界场景时面临困难，而基础分割模型如SAM在强大的分割能力与相对较弱的语义理解之间存在显著差异。为了解决这些差异，我们提出了WOW-Seg，一种无词开放世界分割模型，用于对开放集类别中的对象进行分割和识别。具体而言，WOW-Seg引入了一种新颖的视觉提示模块Mask2Token，该模块将图像掩膜转换为视觉标记，并确保它们与VLLM特征空间的对齐。此外，我们引入了级联注意力掩膜，以解耦不同实例之间的信息。这种方法减轻了实例间的干扰，从而显著提高了模型性能。我们进一步构建了一个开放世界区域识别测试基准：区域识别数据集（Region Recognition Dataset，RR-7K）。该数据集包含7,662个类别，是迄今为止最广泛的类别丰富区域识别数据集。WOW-Seg在LVIS数据集上取得了优异的结果，语义相似度达到89.7，语义IoU为82.4。这一表现超越了之前的SOTA，同时仅使用了八分之一的参数量。这些结果突显了WOW-Seg强大的开放世界泛化能力。代码及相关资源可在https://github.com/AAwcAA/WOW-Seg-Meta获取。

cs.CV / 98 / 2605.16911

VGGT-Occ: Geometry-Grounded and Density-Aware Gated Fusion for 3D Occupancy Prediction

VGGT-Occ：基于几何和密度感知的门控融合用于3D占用预测

Chen, Xun, Deng, Tianchen, Wang, Rui, Wang, Fangjinhua, Ma, Junyi, Shen, Hongming, Wang, Hesheng, Wang, Danwei

Abstract

3D semantic occupancy prediction requires accurate 2D-to-3D feature lifting, yet current methods restrict camera geometry to initial projections. Subsequent operations like offset learning, attention weighting, and cross-camera aggregation remain geometry-agnostic, ignoring essential physical constraints. We propose VGGT-Occ, a framework that embeds geometric tokens throughout the entire pipeline. We introduce Projection-Aware Deformable Attention (PA-DA) to inject geometry into all attention stages. PA-DA projects 3D offsets back to image planes and leverages the projection Jacobian as an additive bias to suppress unreliable observations. Features are then integrated through a view-quality semantic gate for cross-view consistency. To optimize both efficiency and performance, we employ a sequential coarse-to-fine decoder with gated fusion, where low-resolution features are refined into higher resolutions, allocating computation by information density while substantially reducing decoder cost. Extensive evaluations demonstrate the effectiveness and accuracy of our approach. On SurroundOcc-nuScenes, VGGT-Occ achieves 33.00% IoU and 21.08% mIoU (T=1), and 33.64% IoU and 21.43% mIoU with T=2 inference, outperforming existing methods, with only ~41M trainable parameters in the occupancy head. Code will be released publicly.

Chinese Translation

3D语义占用预测需要准确的2D到3D特征提升，然而当前的方法将相机几何限制在初始投影中。后续操作如偏移学习、注意力加权和跨相机聚合仍然与几何无关，忽视了重要的物理约束。我们提出了VGGT-Occ，一个在整个流程中嵌入几何标记的框架。我们引入了投影感知可变注意力（Projection-Aware Deformable Attention, PA-DA），将几何信息注入到所有注意力阶段。PA-DA将3D偏移投影回图像平面，并利用投影雅可比矩阵作为附加偏置，以抑制不可靠的观测。然后，通过视图质量语义门进行特征整合，以确保跨视图一致性。为了优化效率和性能，我们采用了一个顺序的粗到细解码器，结合门控融合，将低分辨率特征精炼为更高分辨率，根据信息密度分配计算，同时显著降低解码器成本。广泛的评估表明我们方法的有效性和准确性。在SurroundOcc-nuScenes数据集上，VGGT-Occ在T=1时达到了33.00%的IoU和21.08%的mIoU，在T=2推理时达到了33.64%的IoU和21.43%的mIoU，超越了现有方法，同时在占用头中仅有约41M的可训练参数。代码将公开发布。
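
The abstract above mentions projecting 3D offsets back to image planes and using the projection Jacobian. As a rough illustration of that ingredient only, the sketch below computes the Jacobian of a standard pinhole projection and uses it to map a small 3D offset to a pixel displacement. The pinhole model, the intrinsics, and all function names are assumptions for illustration; how PA-DA turns the Jacobian into an attention bias is not specified in the abstract and is not modeled here.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x, y, z) to pixels (u, v)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def projection_jacobian(point, fx, fy):
    """2x3 Jacobian d(u, v) / d(x, y, z) of the pinhole projection."""
    x, y, z = point
    return [
        [fx / z, 0.0, -fx * x / z ** 2],
        [0.0, fy / z, -fy * y / z ** 2],
    ]


def offset_to_pixels(point, offset, fx, fy):
    """First-order pixel displacement caused by a small 3D offset."""
    jac = projection_jacobian(point, fx, fy)
    return tuple(sum(jac[r][c] * offset[c] for c in range(3)) for r in range(2))


# Hypothetical intrinsics and geometry, for illustration only.
fx = fy = 800.0
cx, cy = 640.0, 360.0
point = (1.0, -0.5, 10.0)
offset = (0.02, 0.01, -0.05)

linear = offset_to_pixels(point, offset, fx, fy)

# Compare against the exact displacement from re-projecting the moved point.
moved = tuple(p + o for p, o in zip(point, offset))
u0, v0 = project(point, fx, fy, cx, cy)
u1, v1 = project(moved, fx, fy, cx, cy)
exact = (u1 - u0, v1 - v0)
print(linear, exact)
```

The Jacobian entries shrink with depth, so the same 3D offset moves a distant point by fewer pixels than a nearby one; this is one plausible reason a geometry-aware weighting could help, though the abstract does not state the exact formulation.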

cs.CV / 99 / 2605.16918

HighSync: High-Quality Lip Synchronization via Latent Diffusion Models

HighSync：基于潜在扩散模型的高质量唇部同步

Daghigh, Saeed Firouzi, Mobarekeh, Majid Iranpour, Alavi, Mostafa, Bagheri, Mehdi

Abstract

We present HighSync, an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. Existing approaches consistently struggle to reconcile image quality with synchronization accuracy, producing either visually degraded outputs or temporally inconsistent lip movements. HighSync addresses both challenges simultaneously and, to our knowledge, is the first lip sync model to operate natively at 512*512 resolution, positioning it as a viable solution for professional production environments such as the film and broadcast industries. Central to our approach is the identification and systematic elimination of a data leakage phenomenon that has silently undermined temporal modeling in prior work, preventing models from developing a genuine dependence on the audio signal. Comprehensive evaluations across both perceptual quality and synchronization accuracy metrics confirm that HighSync achieves state-of-the-art performance on both fronts. Source code, pre-trained models, and supplementary video results are publicly available at: https://github.com/saeed5959/high_sync

Chinese Translation

我们提出了HighSync，一个基于扩散的端到端框架，用于高保真唇部同步，能够生成与任意输入音频对齐的照片级真实感对话视频。现有方法在图像质量与同步精度之间始终难以取得平衡，导致生成的输出要么视觉效果不佳，要么唇部动作在时间上不一致。HighSync同时解决了这两个挑战，并且据我们所知，它是首个原生支持512*512分辨率的唇部同步模型，使其成为电影和广播等专业制作环境的可行解决方案。我们方法的核心在于识别并系统性消除一种数据泄漏现象，该现象在以往的工作中默默破坏了时间建模，阻碍了模型对音频信号的真实依赖。通过对感知质量和同步精度指标的全面评估，确认HighSync在这两个方面均达到了最先进的性能。源代码、预训练模型和补充视频结果已公开，网址为：https://github.com/saeed5959/high_sync

cs.CV / 100 / 2605.16922

Motion Cues from Image-based Point Tracking for LiDAR Scene Flow Estimation

基于图像点跟踪的运动线索用于激光雷达场景流估计

Jang, Youngdong, Oh, Gyeongrok, Kim, Jong Wook, Ryu, Hyunju, Chi, Hyung-gun, Kim, SeungHyeon, Kim, Seungryong, Choi, Jonghyun, Kim, Sangpil

Abstract

LiDAR scene flow estimation is essential for autonomous driving, as it provides 3D motion for each point. Self-supervised approaches use static-dynamic classification to mitigate the imbalance between static and dynamic points, deriving targeted supervision. However, existing methods rely on sparse geometric observations for this classification, making them vulnerable to data sparsity and occlusions. The resulting noisy labels provide incorrect motion guidance and degrade scene flow learning. To address this, we introduce TrackCue, a tracking-guided framework for improving dynamic object representation in LiDAR scene flow estimation. In particular, TrackCue repurposes point tracking to obtain dense image-space trajectories anchored to LiDAR points, providing motion cues beyond sparse geometric observations. Furthermore, we present a visually consistent motion compensation strategy that compares the tracked trajectories with ego-induced rigid trajectories in the image plane, effectively isolating true object motion from ego-induced apparent motion. To transfer these isolated motion cues back to the LiDAR domain, we perform visual motion cue lifting, which associates ego-compensated image trajectories with LiDAR points for static-dynamic label refinement. As a result, TrackCue produces more accurate static-dynamic classification and provides more reliable supervision for scene flow learning. Experimental results show that TrackCue significantly improves the precision and F1 score of dynamic labels, leading to performance gains in self-supervised scene flow estimation.

Chinese Translation

激光雷达场景流估计对于自动驾驶至关重要，因为它为每个点提供三维运动。自监督方法利用静态-动态分类来缓解静态点和动态点之间的不平衡，从而获得针对性的监督。然而，现有方法依赖于稀疏几何观测进行此分类，使其易受数据稀疏和遮挡的影响。由此产生的噪声标签提供了错误的运动指导，并降低了场景流学习的效果。为了解决这个问题，我们引入了TrackCue，一个跟踪引导的框架，用于改善激光雷达场景流估计中的动态物体表示。具体而言，TrackCue重新利用点跟踪以获得锚定于激光雷达点的稠密图像空间轨迹，提供超越稀疏几何观测的运动线索。此外，我们提出了一种视觉一致的运动补偿策略，该策略将跟踪的轨迹与图像平面中的自我引起的刚性轨迹进行比较，有效地将真实物体运动与自我引起的表观运动隔离。为了将这些隔离的运动线索转移回激光雷达领域，我们执行视觉运动线索提升，将自我补偿的图像轨迹与静态-动态标签细化的激光雷达点关联起来。因此，TrackCue产生了更准确的静态-动态分类，并为场景流学习提供了更可靠的监督。实验结果表明，TrackCue显著提高了动态标签的精度和F1分数，从而在自监督场景流估计中带来了性能提升。

cs.CV / 101 / 2605.16923

Neuroscience-inspired Staged Representation Learning with Disentangled Coarse- and Fine-Grained Semantics for EEG Visual Decoding

基于神经科学启发的分阶段表示学习：针对脑电图视觉解码的解耦粗细语义

Gao, Xiang, Tian, Hui, Zhu, Yanming, Yin, Xuefei, Liew, Alan Wee-Chung

Abstract

Decoding visual information from electroencephalography (EEG) signals remains a fundamental challenge in brain-computer interfaces and medical rehabilitation. Existing EEG visual decoding methods mainly focus on learning a single global EEG embedding for cross-modal alignment, but they largely overlook the staged and hierarchical characteristics of human visual processing. To address this limitation, we propose a neuroscience-inspired staged representation learning framework that reformulates EEG visual decoding as a stage-specific representation decomposition problem. The proposed framework organizes EEG representation learning into three complementary phases: low-level visual representation learning, high-level semantic representation learning, and integrative information fusion. To strengthen semantic modeling, we further introduce a multimodal dual-level semantic learning mechanism that separates coarse label-level semantics from fine image-level visual-semantic information. In addition, semantic latent channels are introduced as computational representation channels generated from observed visual EEG signals, expanding the channel-level semantic representation space for structured semantic abstraction and cross-modal alignment. Extensive experiments on the THINGS-EEG benchmark demonstrate that the proposed method achieves superior performance under subject-dependent zero-shot evaluation and improved exact retrieval under subject-independent zero-shot evaluation. Additional analyses, including layer-wise retrieval, temporal accumulation, expanded multi-image retrieval, and ablation studies, further support the effectiveness of staged decomposition and structured semantic modeling. These results suggest that explicitly modeling staged perceptual, semantic, and integrative representations provides an effective neuroscience-inspired framework for EEG-based visual decoding.

Chinese Translation

从脑电图（EEG）信号中解码视觉信息仍然是脑机接口和医疗康复中的一个基本挑战。现有的EEG视觉解码方法主要集中于学习单一的全局EEG嵌入以实现跨模态对齐，但在很大程度上忽视了人类视觉处理的分阶段和层次特征。为了解决这一局限性，我们提出了一种基于神经科学启发的分阶段表示学习框架，将EEG视觉解码重新表述为一个特定阶段的表示分解问题。该框架将EEG表示学习组织为三个互补的阶段：低级视觉表示学习、高级语义表示学习和综合信息融合。为了增强语义建模，我们进一步引入了一种多模态双层语义学习机制，将粗略标签级语义与细粒度图像级视觉-语义信息分离。此外，引入语义潜在通道作为从观察到的视觉EEG信号生成的计算表示通道，扩展了结构化语义抽象和跨模态对齐的通道级语义表示空间。在THINGS-EEG基准上的大量实验表明，所提方法在依赖被试的零样本评估中表现优越，并在不依赖被试的零样本评估中实现了更好的精确检索。进一步的分析，包括层级检索、时间累积、扩展多图像检索和消融研究，进一步支持了分阶段分解和结构化语义建模的有效性。这些结果表明，明确建模分阶段的感知、语义和综合表示为基于EEG的视觉解码提供了一种有效的神经科学启发框架。

cs.CV / 102 / 2605.16925

P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban Reconstruction

P2GS：基于物理先验指导的高斯点云用于光度一致的城市重建

Shimomura, Kota, Arai, Hidehisa, Takahashi, Tsubasa, Yamashita, Takayoshi, Fujiyoshi, Hironobu

Abstract

3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit representation enabling fast, high-fidelity rendering, making it a promising foundation for closed-loop simulators and perception models in autonomous driving. However, conventional 3DGS implicitly assumes consistent exposure and tone mapping across views. Real driving data violates this assumption due to heterogeneous camera pipelines and dynamic outdoor illumination, baking exposure discrepancies and sensor noise into the radiance field and producing artifacts and inconsistent illumination especially in static backgrounds crucial for realistic simulation. These issues are amplified in autonomous driving, where sparse viewpoints, varying exposures, and outdoor lighting interact, while prior work mainly targets dynamic-object reconstruction and overlooks cross-view photometric consistency. To address this limitation, we introduce P2GS, a physically consistent Gaussian Splatting framework that jointly decomposes a view-invariant linear HDR radiance field, per-view exposure scales, and tone-mapping functions from only LDR images without HDR supervision. P2GS employs a unified optimization strategy grounded in the physical image-formation process, enforcing relative-exposure consistency and HDR-domain radiance regularization. This yields a radiance field robust to inter-camera illumination differences while preserving the real-time efficiency of standard 3DGS. Experiments across real and simulated driving environments show that P2GS matches or surpasses prior methods in LDR reconstruction while providing substantially improved photometric consistency, reliable exposure normalization, and physically coherent illumination across diverse scenes.

Chinese Translation

三维高斯点云（3D Gaussian Splatting, 3DGS）最近作为一种强大的显式表示方法出现，能够实现快速且高保真的渲染，成为自动驾驶中闭环模拟器和感知模型的有希望基础。然而，传统的3DGS隐含假设视角之间的曝光和色调映射是一致的。由于异构相机管线和动态户外光照，真实驾驶数据违反了这一假设，导致曝光差异和传感器噪声嵌入辐射场中，尤其在静态背景下产生伪影和不一致的光照，这对于真实模拟至关重要。这些问题在自动驾驶中被放大，稀疏的视点、变化的曝光和户外光照相互作用，而以往的研究主要针对动态物体重建，忽视了视角间的光度一致性。为了解决这一限制，我们提出了P2GS，一个物理一致的高斯点云框架，该框架从仅有的低动态范围（LDR）图像中联合分解视角不变的线性高动态范围（HDR）辐射场、每视角的曝光尺度和色调映射函数，而无需HDR监督。P2GS采用基于物理图像形成过程的统一优化策略，强制执行相对曝光一致性和HDR域辐射的正则化。这使得辐射场对相机间光照差异具有鲁棒性，同时保持了标准3DGS的实时效率。在真实和模拟驾驶环境中的实验表明，P2GS在LDR重建中匹配或超越了以往的方法，同时在不同场景中提供了显著改善的光度一致性、可靠的曝光归一化和物理一致的光照。
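
To make the decomposition described in the abstract above more concrete, here is a minimal sketch of a simplified image-formation model: a view-invariant linear radiance is scaled by a per-view exposure and passed through a tone curve to give an LDR pixel. The fixed gamma curve, the exposure values, and the function names are assumptions for illustration; P2GS estimates exposure scales and tone-mapping functions jointly, which this sketch does not attempt.

```python
def render_ldr(radiance, exposure, gamma=2.2):
    """Map linear radiance to an LDR value in [0, 1] for one view.

    Assumes a simple gamma tone curve (hypothetical stand-in).
    """
    exposed = min(1.0, max(0.0, exposure * radiance))
    return exposed ** (1.0 / gamma)


def recover_radiance(ldr_value, exposure, gamma=2.2):
    """Invert the tone curve and the exposure scale (valid if unclipped)."""
    return (ldr_value ** gamma) / exposure


# One scene point seen by two cameras with different (hypothetical) exposures.
radiance = 0.30
exposures = {"front_cam": 1.0, "side_cam": 2.0}

ldr = {name: render_ldr(radiance, e) for name, e in exposures.items()}
recovered = {name: recover_radiance(ldr[name], e) for name, e in exposures.items()}
print(ldr, recovered)
```

The two LDR values differ even though the underlying radiance is identical, which is the kind of cross-view inconsistency that would be baked into a radiance field if exposure were ignored. Once exposure and the tone curve are accounted for, both views agree on the same radiance.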

cs.CV / 103 / 2605.16937

DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis

DEVIS-GRPO：释放动态极端视图合成中的GRPO

Zuo, Yi, Wu, Huimin, Li, Lingling, Liu, Fang, Jiao, Licheng, Li, Qing

Abstract

Trajectory-controlled video generation has become essential for controllable video generation. While current methods perform well under small-view camera motions, they degrade significantly with large-view motions. Existing solutions for extreme-view synthesis typically require dedicated video pairs, demanding substantial annotation effort. To address these limitations, we propose Dynamic Extreme VIew Synthesis-GRPO (DEVIS-GRPO), a GRPO-based framework for trajectory-controlled video generation, the first online policy gradient method for extreme view video generation. Central to our approach is a novel sampling strategy: Accumulative Dynamic Extreme VIew Synthesis (ADEVIS), which achieves large-view camera motions by progressively accumulating small-view increments. This method delivers two key advantages: 1) enhanced training efficiency, as it eliminates the need to warm-start the policy model by collecting expensive paired large-view videos, and 2) increased sampling diversity, achieved by flexibly varying trajectory configurations. Finally, we designed a multi-level consistency-quality reward function to select high-quality samples for model optimization. Experiments on the Kubric-4D, iPhone, and DL3DV datasets demonstrate our method's superiority. On Kubric-4D, we achieve relative improvements of 21.57% in PSNR and 7.31% in SSIM over the second-best method in non-occlusion areas. On iPhone, LPIPS is reduced by 18.56%.

Chinese Translation

轨迹控制的视频生成已成为可控视频生成的重要组成部分。尽管当前方法在小视角摄像机运动下表现良好，但在大视角运动下显著下降。现有的极端视图合成解决方案通常需要专门的视频对，且需要大量的标注工作。为了解决这些局限性，我们提出了动态极端视图合成-GRPO（DEVIS-GRPO），这是一个基于GRPO的轨迹控制视频生成框架，也是极端视图视频生成的首个在线策略梯度方法。我们方法的核心是一种新颖的采样策略：累积动态极端视图合成（ADEVIS），通过逐步累积小视角增量来实现大视角摄像机运动。该方法提供了两个关键优势：1）提高了训练效率，因为它消除了通过收集昂贵的配对大视角视频来热启动策略模型的需要；2）通过灵活变化轨迹配置实现了采样多样性的增加。最后，我们设计了一个多层次一致性-质量奖励函数，以选择高质量样本进行模型优化。在Kubric-4D、iPhone和DL3DV数据集上的实验结果证明了我们方法的优越性。在Kubric-4D上，我们在非遮挡区域相较于第二好的方法在PSNR上实现了21.57%的相对提升，在SSIM上提升了7.31%。在iPhone上，LPIPS减少了18.56%。
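
For readers unfamiliar with GRPO, which the abstract above builds on, the sketch below shows the group-relative advantage computation commonly associated with it: rewards for a group of samples drawn for the same condition are normalized by the group mean and standard deviation. The reward values are hypothetical, and the multi-level consistency-quality reward and the ADEVIS sampling strategy from the paper are not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by its group's mean and standard deviation.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead (this choice is an assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four videos sampled along the same trajectory.
rewards = [0.62, 0.71, 0.55, 0.80]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Samples with above-average reward receive positive advantages and are reinforced, while below-average samples are discouraged, without requiring a separate learned value function.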

cs.CV / 104 / 2605.16949

Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers

超越逐点匹配：加速扩散变换器的结构表示对齐

Xu, Shaodong, Wang, Zhendong, Gong, Litong, Li, Zexian, Zhou, Wengang, Ge, Tiezheng, Li, Houqiang

Abstract

Recent advances in Diffusion Transformers (DiTs) demonstrate that aligning noisy latent states with well-trained semantic features, as pioneered by Representation Alignment (REPA), can substantially accelerate training and improve generation fidelity. Subsequent analysis (e.g., iREPA) suggests that these gains arise primarily from transferring the spatial structure contained in pre-trained vision representations. However, most existing alignment methods employ point-wise matching objectives or rely on implicit architectural tweaks, which fail to explicitly model the spatial relational geometry inherent in vision foundation models. We argue that such element-wise supervision is insufficient to capture the rich spatial topology of visual representations, and that effective alignment for generation should instead be formulated as an explicit structural constraint. To this end, we propose sREPA, a structural REPresentation Alignment framework that enforces consistency in the relational geometry of feature maps, rather than merely matching individual feature points. By encouraging the model to internalize holistic spatial layouts and structural correlations from pre-trained features, sREPA achieves faster and more stable convergence, along with improved sample quality, compared to state-of-the-art alignment strategies. Our code and models will be released.

Chinese Translation

最近在扩散变换器（Diffusion Transformers, DiTs）方面的进展表明，将噪声潜在状态与经过良好训练的语义特征对齐——正如表示对齐（Representation Alignment, REPA）所开创的——可以显著加速训练并提高生成的保真度。后续分析（例如，iREPA）表明，这些收益主要源于转移预训练视觉表示中包含的空间结构。然而，现有的大多数对齐方法采用逐点匹配目标或依赖隐式的架构调整，这未能明确建模视觉基础模型中固有的空间关系几何。我们认为，这种逐元素的监督不足以捕捉视觉表示的丰富空间拓扑，而有效的生成对齐应当被表述为一种明确的结构约束。为此，我们提出了结构表示对齐框架（sREPA），以强制特征图的关系几何保持一致，而不仅仅是匹配单个特征点。通过鼓励模型内化来自预训练特征的整体空间布局和结构相关性，sREPA相比于最先进的对齐策略实现了更快和更稳定的收敛，同时提高了样本质量。我们的代码和模型将会发布。
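
The contrast drawn in the abstract above, between matching individual feature points and matching the relational geometry of a feature map, can be illustrated with a small sketch. The point-wise loss below compares each token with its counterpart, while the structural loss compares token-to-token cosine similarity matrices. Both loss definitions and the toy 2D tokens are assumptions for illustration and are not the actual sREPA objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pointwise_loss(student, teacher):
    """Mean (1 - cosine) between corresponding tokens."""
    return sum(1.0 - cosine(s, t) for s, t in zip(student, teacher)) / len(student)


def similarity_matrix(tokens):
    """Pairwise cosine similarities among the tokens of one feature map."""
    return [[cosine(a, b) for b in tokens] for a in tokens]


def structural_loss(student, teacher):
    """Mean squared difference between the two similarity matrices."""
    s_mat, t_mat = similarity_matrix(student), similarity_matrix(teacher)
    n = len(student)
    total = sum((s_mat[i][j] - t_mat[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)


# Toy tokens: the student equals the teacher rotated by 90 degrees.
teacher = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0)]
student = [(-y, x) for x, y in teacher]

print(pointwise_loss(student, teacher), structural_loss(student, teacher))
```

In this toy case the point-wise loss is maximal for orthogonal pairs even though the student preserves every pairwise relationship among tokens, so the structural loss is zero. A relational objective is therefore insensitive to a global rotation of the feature space and focuses only on the spatial layout.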

cs.CV / 105 / 2605.16951

Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing

Edit-GRPO：一种用于图像编辑的局部保持策略优化框架

Xu, Shaodong, Li, Zexian, Wang, Zhendong, Gong, Litong, Ge, Tiezheng, Zhou, Wengang, Zheng, Bo, Li, Houqiang

Abstract

A fundamental challenge in image editing lies in preserving spatial locality: edits should improve targeted content without inadvertently altering surrounding regions. However, most optimization-based editing approaches treat images as holistic entities, causing global policy updates that undermine locality and introduce undesired context changes. We observe that this issue stems from a mismatch between localized editing intent and globally applied optimization signals. Motivated by this insight, we propose Edit-GRPO, a locality-preserving policy optimization framework that explicitly decouples editing and preservation objectives. By assigning region-specific optimization signals to edit and non-edit areas, Edit-GRPO aligns policy updates with the spatial structure of editing tasks, enabling localized improvements while maintaining global visual coherence. This design effectively suppresses common artifacts such as context distortion and boundary inconsistency. Extensive experiments across diverse image editing scenarios demonstrate that Edit-GRPO significantly improves locality preservation while maintaining strong editing performance compared to existing optimization-based methods, validating the generality and effectiveness of the proposed framework.

Chinese Translation

图像编辑中的一个基本挑战在于保持空间局部性：编辑应改善目标内容，而不应无意中改变周围区域。然而，大多数基于优化的编辑方法将图像视为整体实体，导致全球策略更新，破坏局部性并引入不必要的上下文变化。我们观察到，这一问题源于局部编辑意图与全球应用优化信号之间的不匹配。基于这一见解，我们提出了Edit-GRPO，即在优化图像编辑的同时保持局部性的策略优化框架，明确解耦编辑和保持目标。通过为编辑和非编辑区域分配区域特定的优化信号，Edit-GRPO使策略更新与编辑任务的空间结构相一致，从而实现局部改进，同时保持全球视觉一致性。这一设计有效抑制了常见的伪影，如上下文扭曲和边界不一致。在多种图像编辑场景下的广泛实验表明，Edit-GRPO在保持局部性方面显著优于现有的基于优化的方法，同时保持强大的编辑性能，验证了所提框架的普遍性和有效性。

cs.CV / 106 / 2605.16961

Latent Action Control for Reasoning-Guided Unified Image Generation

用于推理引导的统一图像生成的潜在行动控制

Zhai, Fuxiang, Chen, Sixiang, Li, Yingjin, Li, Shuaibo, Lai, Jianyu, Huang, Tengjun, Zhu, Lei

Abstract

Unified multimodal models can encode visual understanding and image generation within a shared backbone, yet understanding does not automatically translate into control: models may infer objects, relations, or knowledge cues but fail to instantiate them in the generated image. We propose Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions inside a unified generator. Given a prompt, LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation, without producing reasoning tokens or intermediate images. Since such action trajectories are unobserved, LAC learns them through prior-guided variational latent action alignment from training-only rendered semantic priors, draft image features, and supervised halting signals, followed by Latent-Flow GRPO to align the latent-to-image rollout with terminal visual feedback. This provides a control path from inferred relations, bindings, and knowledge cues to the generation process. Instantiated on BAGEL-7B-MoT, LAC consistently improves compositional and knowledge-grounded generation across GenEval, WISE, and T2I-CompBench, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts. Ablations and latent interventions show that the learned action trajectory is consumed by the generator, suggesting that unified generation benefits when understanding is not only encoded, but made actionable during generation.

Chinese Translation

统一的多模态模型能够在共享的骨干网络中编码视觉理解和图像生成，但理解并不自动转化为控制：模型可能推断出物体、关系或知识线索，但未能在生成的图像中体现出来。我们提出了潜在行动控制（Latent Action Control, LAC），通过将推理表示为统一生成器内部的隐藏连续行动，使推理变得可操作。给定一个提示，LAC展开一个角色结构化的潜在轨迹用于规划、内部视觉草图、诊断和优化，并将这些行动注入条件流生成的隐藏流中，而不产生推理标记或中间图像。由于这些行动轨迹是不可观察的，LAC通过从仅训练渲染的语义先验、草图图像特征和监督停止信号中进行先验引导的变分潜在行动对齐来学习它们，随后通过潜在流（Latent-Flow）GRPO将潜在到图像的展开与终端视觉反馈对齐。这为从推断的关系、绑定和知识线索到生成过程提供了一条控制路径。在BAGEL-7B-MoT上实例化后，LAC在GenEval、WISE和T2I-CompBench上持续改善了组合和知识基础的生成，在空间关系、属性绑定和对世界知识敏感的提示上获得了最大的提升。消融实验和潜在干预表明，学习到的行动轨迹被生成器所利用，这表明统一生成在理解不仅被编码而且在生成过程中变得可操作时受益。

cs.CV / 107 / 2605.16962

OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics

OmniVL-Guard Pro：一种增强工具的全方位视觉-语言伪造检测代理

Shen, Jinjie, Huang, Zheng, Zhang, Yuchen, Wu, Yujiao, Wang, Yaxiong, Cheng, Lechao, Tang, Shengeng, Hui, Tianrui, Pu, Nan, Zhong, Zhun

Abstract

Existing vision-language forgery detection and grounding methods operate under a closed-world paradigm, assuming verification can be completed by the model alone. However, self-contained MLLMs are constrained by finite parametric knowledge, static training corpora, and limited perceptual resolution, creating a practical ceiling in dynamic open-world forensics -- particularly for real-time event verification requiring external clues and forgery segmentation demanding fine-grained scrutiny of local manipulations. To address these limitations, we shift from scaling up the self-contained model toward reaching beyond it. We propose OmniVL-Guard Pro, a tool-augmented agent that extends unified forensics from closed-world prediction to open-world clues-driven reasoning. OmniVL-Guard Pro integrates a tool environment spanning real-time event search, local cropping and zooming, edge-anomaly screening, face detection, video frame extraction, and SAM3-based segmentation. To generate high-quality tool-reasoning trajectories, we introduce Tree-Structured Self-Evolving Tool Trajectory Generation, which produces diverse trajectories through seed guidance, guider-free self-evolution, and weakly-hinted hard sample synthesis, yielding the Full-Spectrum Tool Reasoning (FSTR) dataset for training. We further propose Checker-Guided Agentic Reinforcement Learning (CGARL), which provides process-level supervision to penalize cases where the answer is correct but the reasoning is distorted. Extensive experiments demonstrate that OmniVL-Guard Pro achieves state-of-the-art performance across various tasks, and exhibits strong zero-shot generalization. The FSTR dataset and code for OmniVL-Guard Pro will be publicly released at https://github.com/shen8424/OmniVL-Guard-Pro.

Chinese Translation

现有的视觉-语言伪造检测和定位方法在封闭世界范式下运行，假设验证可以仅通过模型完成。然而，自包含的多模态语言模型（MLLMs）受到有限参数知识、静态训练语料库和有限感知分辨率的限制，在动态开放世界的取证中形成了实际的上限——特别是在需要外部线索的实时事件验证和要求对局部操控进行细致审查的伪造分割方面。为了解决这些局限性，我们从扩展自包含模型的规模转向超越它。我们提出了OmniVL-Guard Pro，一种增强工具的代理，它将统一取证从封闭世界预测扩展到开放世界线索驱动推理。OmniVL-Guard Pro集成了一个工具环境，涵盖实时事件搜索、局部裁剪和缩放、边缘异常筛查、人脸检测、视频帧提取和基于SAM3的分割。为了生成高质量的工具推理轨迹，我们引入了树结构自演化工具轨迹生成，通过种子引导、无引导自演化和弱提示硬样本合成生成多样化轨迹，从而产生用于训练的全谱工具推理（FSTR）数据集。我们进一步提出了检查器引导的代理强化学习（CGARL），它提供过程级监督，以惩罚答案正确但推理扭曲的情况。大量实验表明，OmniVL-Guard Pro在各种任务中实现了最先进的性能，并展现出强大的零样本泛化能力。FSTR数据集和OmniVL-Guard Pro的代码将公开发布于 https://github.com/shen8424/OmniVL-Guard-Pro。

cs.CV / 108 / 2605.16967

Expandable, Compressible, Mineable: Open-World Thermal Image Restoration

可扩展、可压缩、可挖掘：开放世界热成像恢复

Li, Pu, Li, Huafeng, Zhang, Yafei, Wang, Wen, Dong, Neng, Wen, Jie

Abstract

In open-world settings, thermal infrared (TIR) image degradations continuously emerge and evolve, while most existing all-in-one restoration methods are built on a closed-set assumption and struggle to continually adapt to novel degradations. To address this, we propose ECMRNet, an Expandable, Compressible, and Mineable Restoration Network for open-world TIR restoration from a continual learning perspective. Conceptually, ECMRNet unifies continual degradation learning as an "expand-compress-mine" closed-loop process, enabling sustained adaptation to new degradations with controllable evolution. Structurally, ECMRNet decomposes intermediate representations into group-isolated subspaces, and achieves strict parameter isolation and fast adaptation to new degradations by freezing historical groups and isomorphically expanding new ones. To curb model growth as tasks accumulate, we present Structural Entropy Pruning, which identifies and removes redundant channel groups via two-dimensional structural entropy minimization, achieving information contribution-driven adaptive compression. Moreover, we design a Sub-degradation Knowledge Mining Module that dynamically retrieves and recombines transferable components from historical representations to improve restoration under compound degradations. Experimental results demonstrate that ECMRNet achieves superior overall performance across diverse single and compound degradations while using fewer parameters and lower computational cost. The source code is available at https://github.com/Kust-lp/ECMRNet.

Chinese Translation

在开放世界环境中，热红外（TIR）图像的退化不断出现和演变，而现有的大多数一体化恢复方法都是基于封闭集假设，难以持续适应新出现的退化。为了解决这个问题，我们提出了ECMRNet，一种可扩展、可压缩和可挖掘的开放世界TIR恢复网络，从持续学习的角度出发。概念上，ECMRNet将持续退化学习统一为一个“扩展-压缩-挖掘”的闭环过程，使其能够在可控演变的情况下持续适应新退化。在结构上，ECMRNet将中间表示分解为组隔离的子空间，通过冻结历史组并同构地扩展新组，实现严格的参数隔离和对新退化的快速适应。为了在任务累积时抑制模型增长，我们提出了结构熵剪枝，通过二维结构熵最小化识别和移除冗余通道组，实现信息贡献驱动的自适应压缩。此外，我们设计了一个子退化知识挖掘模块，动态检索和重新组合历史表示中的可转移组件，以改善复合退化下的恢复效果。实验结果表明，ECMRNet在多种单一和复合退化下表现出优越的整体性能，同时使用更少的参数和更低的计算成本。源代码可在 https://github.com/Kust-lp/ECMRNet 获取。

cs.CV / 109 / 2605.16973

SHED: Style-Homogenized Embedding Alignment for Domain Generalization

SHED：用于领域泛化的风格均质嵌入对齐

Gan, Kai, Wei, Tong

Abstract

Domain generalization aims to enhance model robustness against unseen domains with embedding distribution shifts. While large-scale vision-language models like CLIP exhibit strong generalization, their direct image-text embedding alignment suffers from inherent information asymmetry: images encode both class semantics and domain-specific styles, whereas text prompts primarily convey basic class cues. This asymmetry hinders generalization to novel domains in realistic scenarios. To address this, we propose Style-Homogenized Embedding alignment for Domain-generalization (SHED), a novel CLIP-based method that aligns style-homogenized embeddings instead of raw representations from the CLIP encoders. During training, SHED removes domain-specific style centroids from image embeddings, computed per source domain, and from text embeddings, which are averaged across diverse prompt templates and stripped of a global centroid. For inference, considering the lack of target domain information, SHED projects diverse textual domain centroids into the visual space and aggregates predictions via membership weighting. Extensive experiments on five benchmarks show SHED achieves state-of-the-art performance, outperforming prior methods significantly (e.g., +4.0% on DomainNet vs. standard fine-tuning).

Chinese Translation

领域泛化旨在增强模型对未见领域的鲁棒性，以应对嵌入分布的变化。尽管像 CLIP 这样的大规模视觉-语言模型展现出强大的泛化能力，但它们直接的图像-文本嵌入对齐受到固有信息不对称的影响：图像同时编码类语义和领域特定风格，而文本提示主要传达基本的类线索。这种不对称性阻碍了在现实场景中对新领域的泛化。为了解决这个问题，我们提出了用于领域泛化的风格均质嵌入对齐方法（SHED），这是一种基于 CLIP 的新方法，它对齐风格均质的嵌入，而不是直接使用 CLIP 编码器的原始表示。在训练过程中，SHED 从每个源领域计算的图像嵌入和通过多样化提示模板平均得到的文本嵌入中去除领域特定的风格中心，并剔除全局中心。对于推理，考虑到缺乏目标领域信息，SHED 将多样化的文本领域中心投影到视觉空间，并通过成员权重聚合预测。在五个基准上的大量实验表明，SHED 达到了最先进的性能，显著超越了之前的方法（例如，在 DomainNet 上相比标准微调提高了 4.0%）。
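
The centroid-removal step described in the SHED abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the use of a plain mean as the "style centroid", and the L2 normalization are all assumptions made for illustration, and the inference-time projection and membership weighting are not covered.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def homogenize_image_embeddings(embeddings, domain_ids):
    """Subtract each source domain's mean embedding (assumed style centroid)."""
    out = np.empty_like(embeddings, dtype=float)
    for d in np.unique(domain_ids):
        mask = domain_ids == d
        out[mask] = embeddings[mask] - embeddings[mask].mean(axis=0, keepdims=True)
    return l2_normalize(out)

def homogenize_text_embeddings(text_emb):
    """text_emb has shape (templates, classes, dim).

    Average over prompt templates, then remove the global centroid
    taken over classes (an assumed reading of the abstract).
    """
    class_emb = text_emb.mean(axis=0)
    class_emb = class_emb - class_emb.mean(axis=0, keepdims=True)
    return l2_normalize(class_emb)

rng = np.random.default_rng(0)
# Two synthetic domains whose embeddings differ by a large style offset.
style = np.repeat(np.array([[5.0, 0, 0, 0], [0, 5.0, 0, 0]]), 4, axis=0)
img = rng.normal(size=(8, 4)) + style
domains = np.array([0, 0, 0, 0, 1, 1, 1, 1])

img_h = homogenize_image_embeddings(img, domains)
txt_h = homogenize_text_embeddings(rng.normal(size=(3, 5, 4)))
logits = img_h @ txt_h.T  # cosine-similarity logits, shape (images, classes)
```

Subtracting a per-domain mean removes whatever offset is shared by all images of that domain, which is the intuition behind treating the centroid as "style"; whether the paper computes centroids exactly this way is not stated in the abstract.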

cs.CV / 110 / 2605.16980

Statistical Hand Shape Modeling from Clinical CT Scans Using Deep Learning and Implicit Skinning

基于深度学习和隐式皮肤建模的临床CT扫描手形状统计建模

Guven, Gokce, Ates, Hasan Fehmi, Karasahin, Deniz, Erdogan, Kaan

Abstract

Accurate segmentation and statistical shape modeling of hand anatomy have significant implications for medical diagnostics, ergonomics, and biomechanics. This study proposes an AI-assisted reconstruction pipeline for segmenting and analyzing hand anatomy from 1,271 elbow-to-hand (e2h-CT) computed tomography scans. A Pix2Pix-based conditional generative adversarial network is first employed to remove plaster cast and background artifacts from CT volumes. The cleaned scans are then processed in 3D Slicer to extract skin and bone masks, which are converted into closed-surface mesh models. Segmented bone meshes are used to construct skeletal representations, enabling implicit skinning to align all hand models into a standardized anatomical configuration. Subsequently, non-rigid registration is performed on the hand skin surfaces using the Geodesic Based Coherent Point Drift++ (GBCPD++) algorithm to establish point-wise correspondence across subjects. Principal Component Analysis (PCA) is then applied to the registered models to quantify anatomical shape variability. The Pix2Pix preprocessing stage achieved a Dice coefficient of 0.9856 and an IoU of 0.9720 on the held-out test set. Statistical modeling was performed on a subset of 90 scans in which the fingers were fully visible and anatomically separated. The resulting statistical shape distributions demonstrate strong agreement with the U.S. Army Anthropometric Survey (ANSUR II), supporting the anatomical validity of the reconstructed models. The proposed methodology demonstrates significant potential for advancing biomechanical modeling, ergonomic optimization, prosthetic design, and precision medical diagnostics.

Chinese Translation

手部解剖结构的准确分割和统计形状建模对医学诊断、人体工程学和生物力学具有重要意义。本研究提出了一种AI辅助重建流程，用于从1271个肘部到手部（e2h-CT）计算机断层扫描中分割和分析手部解剖结构。首先采用基于Pix2Pix的条件生成对抗网络去除CT图像中的石膏铸件和背景伪影。清理后的扫描图像随后在3D Slicer中处理，以提取皮肤和骨骼掩膜，并将其转换为闭合表面网格模型。分割后的骨骼网格用于构建骨骼表示，利用隐式皮肤建模将所有手部模型对齐到标准化的解剖配置。随后，使用基于测地线的相干点漂移++（GBCPD++）算法对手部皮肤表面进行非刚性配准，以建立不同个体之间的点对点对应关系。然后对注册后的模型应用主成分分析（PCA），以量化解剖形状的变异性。Pix2Pix预处理阶段在保留的测试集上达到了0.9856的Dice系数和0.9720的IoU。统计建模在90个手指完全可见且解剖上分开的扫描子集中进行。结果显示的统计形状分布与美国陆军人类测量调查（ANSUR II）高度一致，支持重建模型的解剖有效性。所提出的方法展示了在推进生物力学建模、人体工程学优化、假肢设计和精准医学诊断方面的重大潜力。
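
For readers unfamiliar with the two overlap metrics reported for the Pix2Pix stage, here is a small sketch of how Dice and IoU are commonly computed on binary masks. The function name and the toy masks are illustrative assumptions; the paper's evaluation code is not described in the abstract.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    dice = 2.0 * inter / total if total else 1.0  # two empty masks count as a match
    iou = inter / union if union else 1.0
    return dice, iou

# Two 4x4 squares that overlap on three of their four rows.
pred = np.zeros((8, 8), dtype=bool)
pred[2:6, 2:6] = True
target = np.zeros((8, 8), dtype=bool)
target[3:7, 2:6] = True

dice, iou = dice_and_iou(pred, target)
```

For a single pair of masks the two metrics are tied by IoU = Dice / (2 - Dice). Scores averaged over a test set, such as the 0.9856 and 0.9720 reported above, need not satisfy that identity exactly.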

cs.CV / 111 / 2605.16981

Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction

重新思考长序列递归3D重建中的状态更新门

Ren, Kejun, Jin, Lei, Huang, Tianxin, Xu, Lianming, Wang, Li

Abstract

Streaming 3D reconstruction under a strict constant-memory budget hinges on how the recurrent state is updated as the stream evolves. We profile TTT3R-style per-token gates across five benchmarks and discover a structural bottleneck: the gate is intrinsically bounded in magnitude (median 0.31; never exceeding 0.6) and nearly frame-invariant, yielding an effective memory horizon of only ~3 frames per state token, which serves as the structural origin of long-sequence drift. We trace this to a missing axis: existing inference-time methods modulate updates only at the per-token, intra-frame level, while the orthogonal frame-level question of how strongly each frame should contribute to the state has been treated as content-independent. We close this gap with a scalar frame-level gate $\alpha_t \in (0, 1]$ derived in closed form from frame-to-frame changes of internal features -- a continuous relaxation of classical Simultaneous Localization and Mapping (SLAM) keyframe selection that requires no parameters, no training, and no extra forward pass. Across six benchmarks spanning camera pose, video depth, and 3D reconstruction at sequence lengths up to 4,541 frames, our gate cuts ATE by 51% on long TUM-RGBD pose sequences, reduces AbsRel by 12.8% on Bonn video depth, and on KITTI long-sequence pose estimation surpasses both LongStream and Keyframe-VO, while retaining strictly constant memory at zero training cost.

Chinese Translation

在严格的常量内存预算下，流式3D重建依赖于随着流的演变而更新的递归状态。我们在五个基准测试中分析了TTT3R风格的每个标记门，并发现了一个结构瓶颈：该门的幅度本质上是有限的（中位数为0.31；从未超过0.6），且几乎对帧不变，导致每个状态标记的有效内存视野仅为约3帧，这成为长序列漂移的结构根源。我们追溯到一个缺失的轴：现有的推理时间方法仅在每个标记的帧内水平调节更新，而正交的帧级问题——每帧应对状态的贡献程度——一直被视为与内容无关。我们通过一个标量帧级门$\alpha_t \in (0, 1]$来填补这一空白，该门是从内部特征的帧间变化中以封闭形式推导而来——这是经典同时定位与地图构建（SLAM）关键帧选择的连续松弛，不需要参数、训练或额外的前向传播。在涵盖相机姿态、视频深度和长达4,541帧的3D重建的六个基准测试中，我们的门在长TUM-RGBD姿态序列上将ATE降低了51%，在波恩视频深度上将AbsRel降低了12.8%，并且在KITTI长序列姿态估计中超越了LongStream和Keyframe-VO，同时在零训练成本下保持严格的常量内存。
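
One way to read the "~3 frames" figure is through an exponential-moving-average view of the state update: if a gate g blends new content into the state, older content decays as (1 - g)^k and the time constant is roughly 1/g. The sketch below works through that arithmetic and shows where a scalar frame-level gate could enter. The EMA reading, the cosine-distance gate, and all function names are assumptions for illustration; the abstract does not give the paper's closed-form expression.

```python
import numpy as np

def memory_horizon(gate):
    """Approximate number of frames remembered under an EMA-style update."""
    return 1.0 / gate

def frame_gate(prev_feat, cur_feat, floor=1e-3):
    """Hypothetical frame-level gate in (0, 1].

    Maps the cosine distance between consecutive frame features to a
    scalar: near-duplicate frames contribute little, novel frames more.
    """
    denom = np.linalg.norm(prev_feat) * np.linalg.norm(cur_feat) + 1e-12
    cos = np.dot(prev_feat, cur_feat) / denom
    change = 0.5 * (1.0 - cos)  # 0 for identical direction, 1 for opposite
    return float(np.clip(change, floor, 1.0))

def gated_update(state, candidate, token_gate, alpha):
    """Blend the candidate into the state, scaling the per-token gate by alpha."""
    g = alpha * token_gate
    return (1.0 - g) * state + g * candidate

horizon = memory_horizon(0.31)  # about 3.2 frames for the reported median gate

a = np.array([1.0, 2.0, 3.0])
gate_same = frame_gate(a, a)       # redundant frame, clipped to the floor
gate_opposite = frame_gate(a, -a)  # maximally different frame

state = gated_update(np.zeros(3), np.ones(3), token_gate=0.31, alpha=0.5)
```

Under this reading, lowering the effective gate for redundant frames lengthens the horizon without adding state, which is consistent with the constant-memory claim in the abstract.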

cs.CV / 112 / 2605.16990

DreamEdit3D: Personalization of Multi-View Diffusion Models for 3D Editing

DreamEdit3D：多视角扩散模型的个性化用于3D编辑

Ai, Jinxin, Nießner, Matthias, Erkoç, Ziya

Abstract

While 2D diffusion models have achieved remarkable success in identity-preserving personalization, extending this capability to 3D assets remains a significant challenge due to the complexities of multi-view consistency and spatial control. Inspired by these 2D advancements, we present a novel personalization method for text-guided 3D editing that enables compositional, object-level control through natural language. Given a 3D input, we render orthogonal views and extract object-level segmentation masks to isolate semantic components. We then learn distinct token embeddings for each component through a tailored two-phase optimization strategy: multi-view textual inversion with attention alignment, followed by full fine-tuning of multi-view diffusion model. During inference, these disentangled tokens seamlessly compose with editing prompts to generate multi-view consistent images, which are subsequently lifted into high-fidelity textured 3D meshes. Extensive evaluations across diverse editing scenarios demonstrate that our method successfully transfers the flexibility of 2D personalization to 3D, achieving state-of-the-art edit faithfulness and identity preservation compared to existing baselines.

Chinese Translation

尽管2D扩散模型在保持身份个性化方面取得了显著成功，但将这一能力扩展到3D资产仍然是一个重大挑战，因为它涉及多视角一致性和空间控制的复杂性。受到这些2D进展的启发，我们提出了一种新颖的个性化方法，用于文本引导的3D编辑，能够通过自然语言实现组合的对象级控制。给定一个3D输入，我们渲染正交视图并提取对象级分割掩膜，以隔离语义组件。然后，我们通过量身定制的两阶段优化策略学习每个组件的不同标记嵌入：首先进行带有注意力对齐的多视角文本反演，然后对多视角扩散模型进行全面微调。在推理过程中，这些解耦的标记与编辑提示无缝组合，以生成多视角一致的图像，随后提升为高保真纹理3D网格。对多种编辑场景的广泛评估表明，我们的方法成功地将2D个性化的灵活性转移到3D，实现了与现有基线相比的最先进的编辑忠实度和身份保持。

cs.CV / 113 / 2605.17014

RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos

RHINO：从单目视频重建与新物体的人类交互

Xue, Lixin, Zheng, Chengwei, Paschalidis, Georgios, Guo, Chen, Kaufmann, Manuel, Zarate, Juan, Tzionas, Dimitrios

Abstract

Reconstructing people, objects, and their interactions in 3D is a long-standing goal for intelligent systems. Often the input is RGB video from a moving camera, making the task ill-posed; depth is ambiguous, humans and objects occlude each other, and camera and object motion entangle to create apparent motion. Most prior work addresses humans or objects in isolation, ignoring their interplay, or assumes known 3D shapes or cameras, which is impractical for real-world applications. We develop RHINO (Reconstructing Human Interactions with Novel Objects), a three-step framework that recovers in 3D a human, novel (unseen) manipulated object, and static scene in a common world frame from a monocular RGB video. First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and apparent motion of a manipulated object from foreground pixels, and a coarse scene shape and camera motion from background pixels. Second, we estimate a human in the camera frame via an off-the-shelf method, and subtract the camera motion from apparent motion to extract the object motion; this registers the human, object, and coarse scene shapes into a common world frame. Third, we refine shapes using a compositional neural field with per-component signed-distance fields. The latter further enables differentiable contact priors that attract surfaces while penalizing interpenetration, improving the physical plausibility of the final reconstruction. For evaluation, we capture a new dataset of handheld monocular videos synchronized with a volumetric 4D capture stage, providing ground-truth shape and camera motion. RHINO outperforms state-of-the-art baselines on novel-view synthesis and 4D reconstruction. Ablations show that each stage contributes substantially. Code and data are available at https://lxxue.github.io/RHINO.

Chinese Translation

在三维空间中重建人类、物体及其交互一直是智能系统的一个长期目标。输入通常是来自移动相机的RGB视频，这使得任务变得不确定；深度信息模糊，人类与物体相互遮挡，且相机与物体的运动交织在一起，造成表观运动。大多数先前的研究要么孤立地处理人类或物体，忽略它们之间的相互作用，要么假设已知的三维形状或相机，这在现实世界应用中并不实用。我们开发了RHINO（Reconstructing Human Interactions with Novel Objects），这是一个三步框架，能够从单目RGB视频中在一个共同的世界坐标系下重建人类、新的（未见过的）被操控物体和静态场景。首先，我们利用3D感知基础模型获取线索，以稳定结构从运动（Structure-from-Motion，SfM），即使在低纹理区域也能实现；这从前景像素中获得被操控物体的粗略形状和表观运动，从背景像素中获得粗略的场景形状和相机运动。其次，我们通过现成的方法估计相机框架中的人类，并从表观运动中减去相机运动以提取物体运动；这将人类、物体和粗略场景形状注册到一个共同的世界坐标系中。第三，我们使用具有每个组件有符号距离场的组合神经场来细化形状。后者进一步实现了可微分的接触先验，吸引表面同时惩罚相互穿透，从而提高最终重建的物理合理性。为了评估，我们捕获了一个新的手持单目视频数据集，并与一个体积4D捕获阶段同步，提供了真实的形状和相机运动。RHINO在新视图合成和4D重建方面超越了最先进的基准。消融实验表明每个阶段都做出了重要贡献。代码和数据可在https://lxxue.github.io/RHINO获取。
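
The second RHINO step, removing camera motion from the apparent object motion, can be pictured as a composition of rigid transforms. The sketch below assumes 4x4 homogeneous matrices, a camera-to-world pose per frame (`T_world_cam`), and an object pose expressed in the camera frame (`T_cam_obj`); these names and conventions are illustrative and are not taken from the paper.

```python
import numpy as np

def make_pose(yaw, translation):
    """Build a 4x4 rigid transform from a yaw angle (radians) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def object_motion_in_world(T_world_cam, T_cam_obj):
    """Compose per-frame camera poses with camera-frame object poses."""
    return [Twc @ Tco for Twc, Tco in zip(T_world_cam, T_cam_obj)]

# A static object observed by a moving camera appears to move in the camera frame.
T_world_obj_true = make_pose(0.3, [1.0, 2.0, 0.5])
cams = [make_pose(0.1 * k, [0.2 * k, 0.0, 0.0]) for k in range(4)]
apparent = [np.linalg.inv(Twc) @ T_world_obj_true for Twc in cams]

recovered = object_motion_in_world(cams, apparent)
```

In this toy case the recovered world-frame pose is the same in every frame, showing that all of the apparent motion was due to the camera.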

cs.CV / 114 / 2605.17019

StreamingEffect: Real-Time Human-Centric Video Effect Generation

StreamingEffect：实时人本视频效果生成

Song, Yiren, Liu, Cheng, Jiang, Yuxin, Shou, Mike Zheng

Abstract

Streaming video effect generation is highly desirable for live human-centric applications such as e-commerce streaming, entertainment, and vlogging, yet remains difficult due to the lack of suitable data and deployable editing models. Unlike generic video generation, this task requires real-time video-to-video editing that adds expressive effects while preserving human identity, background content, and temporal consistency. Existing acceleration efforts mainly focus on text-to-video generation, while efficient distillation for video editing remains largely underexplored. In this paper, we present StreamingEffect, a real-time human-centric streaming video effect framework. We adopt an in-context video editing architecture and train a high-quality bidirectional teacher, then distill it into a causal autoregressive student and further reduce sampling from 50 steps to 4 steps. We also introduce keyframe control, allowing reference effect frames to be injected online and propagated through the stream for interactive editing. To address the data bottleneck, we construct VideoEffect-130K, to our knowledge the largest human-centric video effect dataset, containing 70K effect videos and 60K editing videos across 600 effect categories curated from short-video and editing platforms. Experiments show that our method enables real-time, high-quality 720p video editing on a single H200 GPU.

Chinese Translation

流媒体视频效果生成在电子商务直播、娱乐和视频博客等以人为中心的实时应用中极为重要，但由于缺乏合适的数据和可部署的编辑模型，这一任务仍然面临挑战。与通用视频生成不同，该任务需要实时的视频到视频编辑，既要添加表现性的效果，又要保持人类身份、背景内容和时间一致性。现有的加速研究主要集中在文本到视频生成上，而视频编辑的高效蒸馏仍然未得到充分探索。本文提出了StreamingEffect，一个实时人本流媒体视频效果框架。我们采用了上下文视频编辑架构，并训练了一个高质量的双向教师模型，然后将其蒸馏为一个因果自回归学生模型，并进一步将采样步骤从50步减少到4步。我们还引入了关键帧控制，允许参考效果帧在线注入并在流中传播，以实现交互式编辑。为了解决数据瓶颈，我们构建了VideoEffect-130K，据我们所知，这是最大的以人为中心的视频效果数据集，包含70K个效果视频和60K个编辑视频，涵盖600个效果类别，均来自短视频和编辑平台。实验表明，我们的方法能够在单个H200 GPU上实现实时、高质量的720p视频编辑。

cs.CV / 115 / 2605.17042

Thermal-Only Crowd Counting with Deployment-Time Privacy Protection

仅基于热成像的人群计数与部署时隐私保护

Qian, Yifei, Guo, Zhongliang, Lei, Chun Tong, Deng, Bowen, Lau, Chun Pong, Hong, Xiaopeng, Pound, Michael P.

Abstract

While RGB-Thermal crowd counting has shown promise, the paradigm faces critical limitations: RGB data raises privacy concerns in public surveillance, and multi-modal misalignment degrades fusion performance. We propose the first thermal-only framework specifically designed for privacy-conscious crowd counting, eliminating RGB dependency at inference time and substantially reducing the privacy exposure associated with continuous RGB capture in public surveillance deployments. To mitigate thermal ambiguity, we leverage depth-to-RGB diffusion models as a cross-modal bridge, extracting discriminative features that enhance thermal representations. Critically, we demonstrate that single-step LCM denoising yields features most faithful to the structural content of the depth conditioning signal, while multi-step approaches progressively decouple features from the conditioning input and accumulate errors that degrade counting accuracy. Experiments on RGBT-CC and DroneRGBT datasets show our method achieves competitive performance against state-of-the-art RGB-T fusion methods, while requiring only thermal input during inference, eliminating the need for continuous RGB capture that constitutes the primary privacy concern in real-world surveillance deployment. The code will be made publicly available.

Chinese Translation

尽管RGB-热成像人群计数显示出良好的前景，但该范式面临着关键限制：RGB数据在公共监控中引发隐私问题，而多模态不对齐则降低了融合性能。我们提出了第一个专门为注重隐私的人群计数设计的仅热成像框架，消除了推理时对RGB的依赖，并显著减少了与公共监控部署中连续RGB捕获相关的隐私暴露。为了减轻热成像的模糊性，我们利用深度到RGB的扩散模型作为跨模态桥梁，提取有助于增强热成像表示的判别特征。关键的是，我们证明单步LCM去噪能够产生与深度条件信号的结构内容最为一致的特征，而多步方法则逐步将特征与条件输入解耦，并累积导致计数精度下降的错误。在RGBT-CC和DroneRGBT数据集上的实验表明，我们的方法在性能上与最先进的RGB-T融合方法具有竞争力，同时在推理过程中仅需热成像输入，消除了在现实世界监控部署中构成主要隐私问题的连续RGB捕获的需求。代码将公开发布。

cs.CV / 116 / 2605.17070

EPIC-Bench: A Perception-Centric Benchmark for Fine-Grained Embodied Visual Grounding in Vision-Language Models

EPIC-Bench：面向感知的细粒度具身视觉定位基准测试

Shan, Haozhe, Ren, Xiancong, Dong, Han, Shi, Haoyuan, Zhang, Yingji, Hu, Jiayu, Zhang, Yi, Dai, Yong, Shen, Bin, Qu, Lizhen, Xu, Zenglin, Ju, Xiaozhu

Abstract

While large vision-language models (VLMs) are increasingly adopted as the perceptual backbone for embodied agents, existing benchmarks often rely on question-answering or multiple-choice formats. These protocols allow models to exploit linguistic priors rather than demonstrating genuine visual grounding. To address this, we present EPIC-Bench, Embodied PerceptIon BenChmark, a fine-grained grounding benchmark designed to systematically evaluate the visual perceptual capabilities of VLMs in real-world embodied environments. Comprising 6.6k meticulously annotated tuples (Image, Text, Mask), EPIC-Bench spans 23 fine-grained tasks across three core stages of the embodied interaction pipeline: Target Localization, Navigation, and Manipulation. Extensive evaluations of over 89 leading VLMs reveal that while advanced reasoning models show promise, current VLMs universally struggle with complex visual-text alignment for physical interactions. Specifically, models exhibit critical bottlenecks in multi-target counting, part-whole relationship understanding, and affordance region detection. EPIC-Bench provides a robust foundation and actionable insights for advancing the next generation of vision-driven embodied models.

Chinese Translation

虽然大型视觉语言模型（VLMs）越来越多地被作为具身代理的感知基础，但现有基准测试通常依赖于问答或多项选择格式。这些协议使模型能够利用语言先验，而不是展示真正的视觉定位。为了解决这个问题，我们提出了EPIC-Bench（具身感知基准），这是一个旨在系统评估VLMs在真实世界具身环境中视觉感知能力的细粒度定位基准。EPIC-Bench包含6600个经过精心注释的元组（图像、文本、掩码），涵盖了具身交互流程三个核心阶段的23个细粒度任务：目标定位、导航和操作。对89个领先VLM的广泛评估表明，尽管先进的推理模型显示出潜力，但当前的VLM在物理交互中普遍面临复杂视觉-文本对齐的挑战。具体而言，模型在多目标计数、部分与整体关系理解以及可用区域检测方面存在关键瓶颈。EPIC-Bench为推动下一代视觉驱动的具身模型提供了坚实的基础和可操作的见解。

cs.CV / 117 / 2605.17087

The Learnability Gap in Medical Latent Diffusion

医学潜在扩散中的可学习性差距

Dombrowski, Mischa, Nützel, Felix, Kainz, Bernhard

Abstract

Generative data augmentation with latent diffusion models is a promising strategy for addressing class imbalance in medical imaging, yet current approaches focus on perceptual fidelity and domain-specific autoencoder fine-tuning while neglecting a more fundamental bottleneck. We identify and formalize the learnability gap: large-scale pretrained autoencoders faithfully encode discriminative features for medical classification, as evidenced by near-lossless performance in reconstruction space, yet their latent representations are structured in ways that are difficult for classifiers to learn from. Across five autoencoder families and four medical benchmarks spanning chest radiography, dermatoscopy, computed tomography, and echocardiography, we show that this gap persists regardless of architecture, initialization strategy, or hyperparameter tuning, and that medical-domain fine-tuning of the autoencoder does not close it. To probe and partially narrow the gap, we develop noise-conditioned latent classifiers with FiLM layers and image-space distillation that offer 64x throughput and 120x memory gains over image-space models while serving as diagnostic tools for latent space quality. Our analysis provides a new framework for evaluating autoencoder latent spaces and identifies their structure, rather than their fidelity or domain specificity, as the primary obstacle to closing the performance gap between real and synthetic medical training data.

Chinese Translation

使用潜在扩散模型的生成数据增强是一种有前景的策略，用于解决医学影像中的类别不平衡问题，但当前的方法主要关注感知保真度和特定领域的自编码器微调，而忽视了一个更根本的瓶颈。我们识别并形式化了可学习性差距：大规模预训练的自编码器能够忠实地编码用于医学分类的判别特征，这在重构空间中表现出近乎无损的性能，但它们的潜在表示结构却使得分类器难以学习。通过五种自编码器家族和四个医学基准（涵盖胸部X光、皮肤镜检查、计算机断层扫描和超声心动图），我们展示了这一差距在架构、初始化策略或超参数调优的情况下依然存在，并且医学领域的自编码器微调并未缩小这一差距。为了探测并部分缩小这一差距，我们开发了带有FiLM层的噪声条件潜在分类器和图像空间蒸馏，这些方法在吞吐量上比图像空间模型提高了64倍，在内存使用上提高了120倍，同时作为潜在空间质量的诊断工具。我们的分析提供了评估自编码器潜在空间的新框架，并确定其结构，而非保真度或领域特异性，作为缩小真实与合成医学训练数据之间性能差距的主要障碍。
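
The abstract mentions noise-conditioned latent classifiers with FiLM layers. FiLM (feature-wise linear modulation) scales and shifts each feature channel using parameters predicted from a conditioning vector. The NumPy sketch below shows that mechanism with a noise level as the condition. The class layout, the sinusoidal noise embedding, and the near-identity initialization are assumptions for illustration, not details from the paper.

```python
import numpy as np

class FiLM:
    """Feature-wise linear modulation: out = gamma(cond) * x + beta(cond)."""

    def __init__(self, cond_dim, num_channels, rng):
        # A single linear map predicts both gamma and beta (hypothetical layout).
        self.W = rng.normal(scale=0.02, size=(cond_dim, 2 * num_channels))
        self.b = np.zeros(2 * num_channels)

    def __call__(self, features, cond):
        # features: (batch, channels, H, W); cond: (batch, cond_dim)
        params = cond @ self.W + self.b
        gamma, beta = np.split(params, 2, axis=1)
        gamma = 1.0 + gamma  # start close to the identity mapping
        return gamma[:, :, None, None] * features + beta[:, :, None, None]

def noise_embedding(t, dim=8):
    """Sinusoidal embedding of a noise level t with shape (batch,)."""
    freqs = 2.0 ** np.arange(dim // 2)
    angles = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

rng = np.random.default_rng(0)
film = FiLM(cond_dim=8, num_channels=3, rng=rng)
latents = rng.normal(size=(2, 3, 4, 4))
cond = noise_embedding(np.array([0.1, 0.9]))
modulated = film(latents, cond)
```

Conditioning on the noise level lets one classifier operate on latents at different diffusion timesteps, which is presumably why the authors pair FiLM with noisy latents; the abstract does not spell out the architecture further.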

cs.CV / 118 / 2605.17093

HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation

HEED：用于混合视觉-语言模型蒸馏的密度加权残差对齐

Liang, Yihao, Jha, Niraj K.

Abstract

Distilling vision-language models into faster hybrid architectures, such as 3:1 Mamba-2/attention mixes, is now standard practice for making inference efficient. Aggregate benchmarks suggest that this works, but they hide selective failures. When we distill Qwen3-VL-8B-Instruct into a 3:1 Mamba-2/attention hybrid, the student model stays within 2 points of the teacher across visual reasoning benchmarks like MMStar, MMBench, and MMMU-Pro, while dropping 13 points on optical-character-recognition and document tasks. The student can still understand the scene but loses the fine-grained text needed to answer. We localize much of the failure to a specific kind of position. In a high-resolution image, most patches are sky, wall, or smooth texture, while a small fraction carries text, edges, object boundaries, or other local details. In a token-level diagnostic, the top 10% highest-density patches have 3.6× larger residual drift than the bottom 10% lowest-density patches and 3.5× larger teacher-masking answer contribution. Uniform weighting devotes many loss terms to low-information background patches, whereas sparse answer-bearing patches receive no special protection. The required intervention is minimal: we replace uniform residual alignment with density-weighted residual alignment, using patch self-dissimilarity as a training-free proxy for position importance. We call this HEED. Compared with normal end-to-end distillation, HEED increases performance by 8.7 points on OCRBench v2 and 5.13 points on a 10-benchmark average. The gain is realized on different teacher models and hybrid architectures. After standard post-training, the student reaches teacher-level performance on the 10-benchmark average with a 4.12× throughput and a 68% memory saving at 128k context, with no additional parameters and no inference-time cost.

Chinese Translation

将视觉-语言模型蒸馏为更快的混合架构，如3:1 Mamba-2/注意力混合，现已成为提高推理效率的标准做法。综合基准测试表明这种方法有效，但隐藏了选择性失败。当我们将Qwen3-VL-8B-Instruct蒸馏为3:1 Mamba-2/注意力混合时，学生模型在视觉推理基准（如MMStar、MMBench和MMMU-Pro）上的表现与教师模型相差不超过2分，但在光学字符识别和文档任务上下降了13分。学生模型仍能理解场景，但失去了回答所需的细粒度文本。我们将大部分失败归因于特定类型的位置。在高分辨率图像中，大多数补丁为天空、墙壁或平滑纹理，而只有一小部分包含文本、边缘、物体边界或其他局部细节。在标记级诊断中，密度最高的前10%补丁的残差漂移是密度最低的10%补丁的3.6倍，教师掩蔽答案贡献是后者的3.5倍。均匀加权将许多损失项分配给低信息背景补丁，而稀疏的答案承载补丁则未受到特殊保护。所需的干预最小：我们用密度加权残差对齐替换均匀残差对齐，使用补丁自相异性作为位置重要性的无训练代理。我们称之为HEED。与正常的端到端蒸馏相比，HEED在OCRBench v2上提高了8.7分，在10个基准的平均上提高了5.13分。该增益在不同的教师模型和混合架构上均得以实现。经过标准的后训练，学生模型在10个基准的平均上达到了教师级别的性能，具有4.12倍的吞吐量和在128k上下文下68%的内存节省，且没有额外参数和推理时间成本。
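
The switch from uniform to density-weighted residual alignment can be sketched in a few lines of NumPy. The definition of self-dissimilarity used here (one minus a patch's mean cosine similarity to the other patches), the mean-one weight normalization, and the squared-error loss are assumptions chosen for illustration; the abstract does not specify the exact formulas used by HEED.

```python
import numpy as np

def self_dissimilarity(patches, eps=1e-12):
    """One minus each patch's mean cosine similarity to all other patches."""
    unit = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + eps)
    sim = unit @ unit.T
    n = len(unit)
    mean_sim_to_others = (sim.sum(axis=1) - np.diag(sim)) / (n - 1)
    return 1.0 - mean_sim_to_others

def density_weights(patches, floor=1e-6):
    """Hypothetical weighting: dissimilarity rescaled to have mean 1."""
    d = np.maximum(self_dissimilarity(patches), floor)
    return d / d.mean()

def residual_alignment_loss(student, teacher, weights=None):
    """Mean squared residual mismatch per patch, optionally weighted."""
    per_patch = np.sum((student - teacher) ** 2, axis=1)
    if weights is None:
        return per_patch.mean()
    return np.mean(weights * per_patch)

# Nine identical "background" patches and one distinct "text" patch.
patches = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (10, 1))
patches[-1] = [0.0, 1.0, 0.0, 0.0]
weights = density_weights(patches)

teacher = np.zeros((10, 4))
student = np.zeros((10, 4))
student[-1, 0] = 1.0  # the student drifts only on the text patch

uniform_loss = residual_alignment_loss(student, teacher)
weighted_loss = residual_alignment_loss(student, teacher, weights)
```

In this toy setup the distinct patch receives a weight of 5 while each background patch receives about 0.56, so an error confined to the rare patch is penalized five times more than under uniform weighting. The weights depend only on the input patches, which matches the "training-free" description in the abstract.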

cs.CV / 119 / 2605.17095

Visual Timelines of Police Encounters in Body-Worn Camera Footage: Operational Context and Activity Cataloging for Training and Analysis in OpenBWC

执法记录仪视频中的警察接触可视化时间线：开放式BWC中的操作背景与活动分类用于培训和分析

Srbinovska, Angela, Homan, Christopher, Martin, Adrian, Fokoué, Ernest

Abstract

Law enforcement agencies are accumulating vast amounts of body-worn camera (BWC) footage. However, this remains operationally opaque. That is, analysts and trainers still have to invest considerable time watching full-length videos to pinpoint the start of key encounters and identify the points where activity shifts to something more physically intense. We present an approach to process BWC video into a time-aligned sequence of fixed-length 10-second windows, processed and labeled using a privacy-conscious protocol. Each window is labeled with two dimensions of information: (i) the operational context of the window and (ii) the level of motion intensity within the window, with low-evidence labels for windows for which insufficient evidence exists due to darkness, blur or occlusion. We train models to classify windows based on these two axes using frames sampled from each window encoded using CLIP model and aggregated into a window-level representation. We extract dense optical flow statistics for each window to capture motion intensity. On test windows the best context model achieves 78.75% accuracy, and the best-accuracy activity model achieves 88.33%. We also included integrity audits to show the results and how the visual timeline representations support faster incident review and make the officer training workflow more practical.

Chinese Translation

执法机构正在积累大量的执法记录仪（BWC）视频。然而，这在操作上仍然不够透明。也就是说，分析师和培训师仍需花费大量时间观看完整视频，以确定关键接触的开始时间，并识别活动转变为更高强度的时刻。我们提出了一种将BWC视频处理为时间对齐的固定长度10秒窗口的方法，采用隐私保护协议进行处理和标注。每个窗口被标注为两个维度的信息：（i）窗口的操作背景和（ii）窗口内的运动强度水平，对于由于黑暗、模糊或遮挡而证据不足的窗口，使用低证据标签。我们训练模型基于这两个轴对窗口进行分类，使用从每个窗口中采样的帧，通过CLIP模型编码并聚合为窗口级表示。我们提取每个窗口的稠密光流统计数据以捕捉运动强度。在测试窗口中，最佳背景模型的准确率达到78.75%，而最佳活动模型的准确率达到88.33%。我们还包括了完整性审计，以展示结果以及可视化时间线表示如何支持更快的事件审查，并使警员培训工作流程更为实用。
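
The fixed-length windowing and the resulting timeline can be sketched as follows. The label strings, the choice to drop a trailing partial window, and the function names are assumptions for illustration; the abstract only states that windows are 10 seconds long and carry a context label and a motion-intensity label.

```python
def fixed_windows(duration_s, window_s=10.0):
    """Return (start, end) times for each full window; a trailing partial window is dropped."""
    n_full = int(duration_s // window_s)
    return [(i * window_s, (i + 1) * window_s) for i in range(n_full)]

def label_transitions(labels, window_s=10.0):
    """Return (start_time_s, previous_label, new_label) wherever the label changes."""
    changes = []
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            changes.append((i * window_s, labels[i - 1], labels[i]))
    return changes

windows = fixed_windows(65.0)  # six full windows; the last 5 seconds are dropped
# Hypothetical per-window motion-intensity labels from a classifier.
activity = ["low", "low", "low_evidence", "high", "high", "low"]
timeline = label_transitions(activity)
```

A reviewer could jump straight to the timestamps in `timeline` instead of watching the full recording, which is the kind of faster incident review the abstract describes.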

cs.CV / 120 / 2605.17120

Markerless Motion Capture for Biomechanical Whole-Body Kinematic Estimation in Infants

无标记运动捕捉在婴儿生物力学全身运动学估计中的应用

Joshi, Divya, Peiffer, J. D., Peyton, Colleen, Cotton, R. James

Abstract

Early identification of motor impairment in infancy relies on expert visual assessment of spontaneous movement, motivating the development of automated, objective alternatives. One promising approach is using computer vision, which benefits from high quality pose estimation from video. In this study, we systematically evaluated three state-of-the-art pose estimation frameworks (MeTRAbs-ACAE, SAM 3D Body, and Sapiens) on 100 videos over 13 sessions of 8 infants recorded with a multi-view markerless motion capture system. We quantified keypoint detection accuracy using reprojection error, geometric consistency, and Procrustes-aligned 3D position error, and demonstrated proof-of-concept for fitting an inverse kinematic framework to infant data. While Sapiens achieved the lowest reprojection error and highest geometric consistency of the methods evaluated (22.8 pixels and 0.82, respectively), SAM 3D Body provided the most comprehensive 3D information for kinematic reconstruction with Procrustes-aligned position errors of 19 to 28 mm. We demonstrate in a case comparison example that biomechanical models fit to SAM 3D estimates distinguish representative movement patterns in infants related to motor development, as identified by a clinical expert. Together, these findings highlight both the promise and current limitations of 3D pose estimation for infant biomechanics and establish preliminary groundwork for scalable, video-based assessment of early motor development.

Chinese Translation

早期识别婴儿运动障碍依赖于专家对自发运动的视觉评估，这促使了自动化、客观替代方案的发展。一种有前景的方法是使用计算机视觉，该方法利用视频中的高质量姿态估计。在本研究中，我们系统地评估了三种最先进的姿态估计框架（MeTRAbs-ACAE、SAM 3D Body 和 Sapiens），这些框架在100个视频中进行了评估，这些视频是在多视角无标记运动捕捉系统下记录的8名婴儿的13个会话中获得的。我们通过重投影误差、几何一致性和Procrustes对齐的3D位置误差量化了关键点检测的准确性，并展示了将逆运动学框架拟合到婴儿数据的概念验证。虽然Sapiens在评估的方法中实现了最低的重投影误差和最高的几何一致性（分别为22.8像素和0.82），但SAM 3D Body提供了最全面的3D信息用于运动学重建，其Procrustes对齐的位置误差为19至28毫米。我们在一个案例比较示例中展示了拟合于SAM 3D估计的生物力学模型能够区分与运动发展相关的婴儿代表性运动模式，这一点得到了临床专家的确认。这些发现共同突显了3D姿态估计在婴儿生物力学中的潜力和当前局限性，并为基于视频的早期运动发展评估奠定了初步基础。


cs.CV / 121 / 2605.17125

Principal Component Analysis for Lunar Crater Detection

用于月球陨石坑检测的主成分分析

Driver, Travis, Christian, John A.

Abstract

Optical navigation is a critical component for lunar orbiter and lander missions. Image-based crater identification has emerged as a promising technology for optical navigation due to the abundance of craters on the lunar surface and the availability of extensive crater catalogs. Moreover, due to the relative morphological homogeneity among lunar craters, template matching has been identified as a promising approach for identification. In this paper, we propose EigenCrater, an automated crater template generation method based on principal component analysis of crater digital elevation maps (DEMs). We demonstrate superior detection and position estimation performance relative to hand-picked templates on simulated lunar imagery.

Chinese Translation

光学导航是月球轨道器和着陆器任务中的关键组成部分。基于图像的陨石坑识别由于月球表面陨石坑的丰富性以及广泛的陨石坑目录的可用性，已成为光学导航的一项有前景的技术。此外，由于月球陨石坑之间相对形态的同质性，模板匹配被认为是一种有前景的识别方法。在本文中，我们提出了EigenCrater，这是一种基于陨石坑数字高程模型（DEMs）主成分分析的自动化陨石坑模板生成方法。我们在模拟的月球图像中展示了相较于手动选择模板的优越检测和位置估计性能。
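As a rough illustration of the idea in the abstract above, the sketch below runs principal component analysis on a stack of crater elevation patches and returns the mean patch plus the leading components. It is not the EigenCrater implementation: the abstract does not describe how templates are rendered or matched, and `eigen_templates` and `synthetic_crater` are hypothetical names, with the toy crater shape standing in for real DEM data.

```python
import numpy as np

def eigen_templates(dem_patches, n_components=3):
    """PCA over DEM patches of shape (num_craters, H, W)."""
    n, h, w = dem_patches.shape
    flat = dem_patches.reshape(n, h * w).astype(float)
    mean = flat.mean(axis=0)
    # Rows of vt are the principal directions of the centered patch matrix.
    _, sing, vt = np.linalg.svd(flat - mean, full_matrices=False)
    explained = sing ** 2 / np.sum(sing ** 2)
    comps = vt[:n_components].reshape(n_components, h, w)
    return mean.reshape(h, w), comps, explained[:n_components]

def synthetic_crater(size, radius, depth):
    """Toy bowl-shaped depression with a raised rim (assumed shape, not real data)."""
    ax = np.linspace(-1.0, 1.0, size)
    xx, yy = np.meshgrid(ax, ax)
    r = np.hypot(xx, yy) / radius
    bowl = -depth * np.clip(1.0 - r ** 2, 0.0, None)
    rim = 0.2 * depth * np.exp(-((r - 1.0) / 0.15) ** 2)
    return bowl + rim

patches = np.stack([synthetic_crater(32, r, d)
                    for r in (0.4, 0.5, 0.6, 0.7)
                    for d in (0.5, 1.0, 1.5)])
mean_patch, components, variance_ratio = eigen_templates(patches, n_components=2)
```

The mean patch plays the role of an average crater, and each component captures one dominant mode of variation (for example size or depth) that a hand-picked template would have to fix in advance.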


cs.CV / 122 / 2605.17131

A Systematic Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation

点云分类与分割的深度学习架构系统性调查

Kamal, Minhas, Kumar, Hiranya Garbha, Prabhakaran, Balakrishnan

Abstract

Point clouds are the most widely adopted format for representing 3D shapes and scenes due to their simplicity and geometric fidelity. However, their inherently unordered and irregular nature, exacerbated by sensor noise and occlusions, introduces unique challenges for machine-learning-based methodologies. To address these issues, diverse strategies have been developed, including conversion to an ordered format, extraction of local geometry, and permutation-invariant or self-attention-based processing. In this paper, we focus on deep learning models for three fundamental tasks in 3D vision: point cloud classification, part segmentation, and semantic segmentation. We begin by formally defining point cloud data, followed by an in-depth discussion of its structural characteristics. Then, we categorize notable works based on their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions for 3D point cloud understanding.

Chinese Translation

点云因其简单性和几何保真性而成为表示三维形状和场景的最广泛采用格式。然而，其固有的无序和不规则特性，加上传感器噪声和遮挡，给基于机器学习的方法带来了独特的挑战。为应对这些问题，已经开发出多种策略，包括转换为有序格式、提取局部几何特征，以及基于置换不变性或自注意力的处理。在本文中，我们的重点是针对三维视觉中的三个基本任务：点云分类、部件分割和语义分割的深度学习模型。我们首先正式定义点云数据，随后深入讨论其结构特征。接着，我们根据其主干结构对显著的研究成果进行分类，并评估其在流行基准上的表现。除了经验比较，我们还提供了对架构创新和局限性的见解。最后，我们概述了开放挑战和三维点云理解的有前景的未来方向。


cs.CV / 123 / 2605.17133

CAM-VFD: Cross-Attention Multimodal Video Forgery Detection

CAM-VFD：跨注意力多模态视频伪造检测

Elkhodary, Hoda Osama, Youssef, Sherin Mostafa, Elshenawy, Marwa, Sobhy, Dalia

Abstract

The rapid advancement of Deepfake technologies and video manipulation tools poses a critical challenge to multimedia forensics, judicial evidence integrity, and information authenticity. Current detectors rely on single-modality signals, treating appearance, geometry, and motion independently. However, advanced generators maintain within-modality consistency while producing cross-modal contradictions, which are forensically discriminative but invisible to any single-modal detector. We propose CAM-VFD, a Cross-Attention Multimodal Video Forgery Detection framework that models cross-modal contradiction as a directional forensic signal. The framework uses a cross-attention fusion mechanism in which CLIP-based appearance representations serve as queries against VideoMAE motion features and MiDaS depth features, enabling the identification of contradictions between visual, temporal, and geometric evidence. We examine this design through cross-modal attention discrepancy analysis, observing statistically separable real and fake distributions ($p<0.001$, Cohen's $d=0.68$). Experimental results on two generative video benchmarks indicate consistent performance, with 95.31% Top-1 accuracy on GenVidBench and 93.43% accuracy, 90.63% F1-score, and 96.56% AUROC on GenVideo. Moreover, CAM-VFD demonstrates stable performance under compression, noise, blur, and adversarial perturbations, suggesting that cross-modal reasoning may improve robustness in media forensics. The code is publicly available at https://github.com/Hoda-Osama/CAM-VFD/tree/main.

Chinese Translation

深度伪造技术和视频处理工具的快速发展对多媒体取证、司法证据完整性和信息真实性构成了重大挑战。目前的检测器依赖于单一模态信号，独立处理外观、几何和运动。然而，先进的生成器在模态内部保持一致性，同时产生跨模态矛盾，这些矛盾在法医学上具有辨别性，但对任何单一模态检测器来说都是不可见的。我们提出了CAM-VFD，一个跨注意力多模态视频伪造检测框架，将跨模态矛盾建模为方向性法医学信号。该框架使用跨注意力融合机制，其中基于CLIP的外观表示作为查询，针对VideoMAE运动特征和MiDaS深度特征，从而识别视觉、时间和几何证据之间的矛盾。我们通过跨模态注意力差异分析来检验这一设计，观察到真实和伪造分布在统计上是可分的（$p<0.001$，Cohen's $d=0.68$）。在两个生成视频基准上的实验结果表明，性能一致，在GenVidBench上达到95.31%的Top-1准确率，在GenVideo上达到93.43%的准确率、90.63%的F1-score和96.56%的AUROC。此外，CAM-VFD在压缩、噪声、模糊和对抗扰动下表现出稳定的性能，表明跨模态推理可能提高媒体取证的鲁棒性。代码已公开，地址为 https://github.com/Hoda-Osama/CAM-VFD/tree/main 。
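The fusion step described in the abstract above, where appearance features act as queries against motion and depth features, can be sketched with plain single-head cross-attention. This is only an illustration under stated assumptions: the random arrays stand in for CLIP, VideoMAE, and MiDaS features, the projection matrices are untrained, and `cross_attention`, `init_proj`, and `fuse` are hypothetical names rather than functions from the CAM-VFD repository.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, proj):
    """Single-head attention: queries attend over tokens from another modality."""
    q, k, v = queries @ proj["q"], context @ proj["k"], context @ proj["v"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v, attn

def init_proj(rng, d):
    """Random (untrained) projection matrices, for illustration only."""
    return {name: rng.normal(scale=d ** -0.5, size=(d, d)) for name in ("q", "k", "v")}

def fuse(appearance, motion, depth, proj_motion, proj_depth):
    """Appearance tokens query motion and depth tokens; results are concatenated."""
    from_motion, attn_motion = cross_attention(appearance, motion, proj_motion)
    from_depth, attn_depth = cross_attention(appearance, depth, proj_depth)
    fused = np.concatenate([appearance, from_motion, from_depth], axis=-1)
    return fused, attn_motion, attn_depth

rng = np.random.default_rng(0)
d = 16
appearance = rng.normal(size=(8, d))  # stand-in for per-frame appearance tokens
motion = rng.normal(size=(6, d))      # stand-in for motion tokens
depth = rng.normal(size=(8, d))       # stand-in for depth tokens
fused, attn_motion, attn_depth = fuse(
    appearance, motion, depth, init_proj(rng, d), init_proj(rng, d))
```

Returning the attention maps alongside the fused features makes it possible to compare how real and fake clips distribute attention across modalities, which is the kind of discrepancy analysis the abstract reports.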


cs.CV / 124 / 2605.17135

Collaborative Learning for Semi-Supervised LiDAR Semantic Segmentation

基于协作学习的半监督LiDAR语义分割

Yang, Bin, Condurache, Alexandru Paul

Abstract

Annotating large-scale LiDAR point clouds for 3D semantic segmentation is costly and time-consuming, which motivates the use of semi-supervised learning (SemiSL). Standard LiDAR SemiSL methods typically adopt a two-step training paradigm, where pseudo-labels are separately generated from a single distillation source, either from the same or another LiDAR representation. Such supervision relies on a unique source of pseudo-labels, which can reinforce confirmation bias and propagate errors during training, ultimately limiting performance. To address this challenge, we introduce CoLLiS, a novel framework that leverages Collaborative Learning for LiDAR Semi-supervised segmentation. Unlike prior paradigms with decoupled pseudo-labeling and training phases, CoLLiS trains multiple representations collaboratively in a single step by treating them as coequal students. Each student is adaptively distilled from multiple representations, while inter-student disparities are monitored online to resolve contradictory supervision and effectively mitigate confirmation bias. Extensive experiments on three datasets demonstrate that CoLLiS consistently outperforms state-of-the-art LiDAR SemiSL methods, with particularly strong gains in low-label regimes.

Chinese Translation

为3D语义分割标注大规模LiDAR点云既费时又昂贵，这促使了半监督学习（SemiSL）的应用。标准的LiDAR半监督学习方法通常采用两步训练范式，其中伪标签是从单一蒸馏源生成的，可能来自相同或不同的LiDAR表示。这种监督依赖于唯一的伪标签源，可能会强化确认偏差并在训练过程中传播错误，最终限制性能。为了解决这一挑战，我们提出了CoLLiS，一个利用协作学习进行LiDAR半监督分割的新框架。与之前的伪标签生成和训练阶段解耦的范式不同，CoLLiS通过将多个表示视为平等的学生，在单一步骤中协同训练。每个学生都从多个表示中自适应蒸馏，同时在线监控学生之间的差异，以解决矛盾的监督并有效减轻确认偏差。在三个数据集上的大量实验表明，CoLLiS在性能上始终优于最先进的LiDAR半监督学习方法，尤其在低标签情况下表现出显著的提升。


cs.CV / 125 / 2605.17140

UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation

UCSF-PDGM-VQA：用于脑肿瘤MRI解读的视觉问答数据集

Ghosh, Shiv, Lateef, Junayd, Chih-Hua, Liu, Yu, Yannan, Rauschecker, Andreas M., Sushil, Madhumita

Abstract

Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process requires advanced neuro-radiology training, poses substantial cognitive load, and is highly time-consuming. Despite increasing demands in radiology, this expertise is difficult to scale, straining the current health systems. Vision-Language Models (VLMs) provide an opportunity to reduce this burden through a semi-automated, interactive interpretation of complex brain MRIs. However, they are currently underutilized in neuro-oncology due to a lack of specialized benchmarks for evaluating them. We introduce a clinically relevant visual question answering (VQA) benchmark -- the UCSF-PDGM-VQA dataset -- consisting of 2,387 QA pairs from 473 glioma-related MRI studies in the public UCSF-PDGM dataset. We further establish a performance baseline for six state-of-the-art vision-language models (VLMs) and one large language model on this dataset. We find that current models are incapable of effectively processing multi-sequence, 3-dimensional MRI scans, thus resulting in a suppression of visual features and over-reliance on language priors, causing modality collapse. These findings underscore a critical deficiency in current model reliability and safety within clinical settings, necessitating the development of robust, domain-specific VLMs.

Chinese Translation

脑肿瘤的诊断在很大程度上依赖于磁共振成像（MRI）评估，这要求放射科医生综合分析多个3D序列和纵向研究中的数千幅图像。这个过程需要高级神经放射学培训，带来了巨大的认知负担，并且非常耗时。尽管放射学的需求日益增加，但这种专业知识难以扩展，给当前的医疗系统带来了压力。视觉-语言模型（VLMs）提供了一个通过半自动化、交互式解读复杂脑MRI来减轻这种负担的机会。然而，由于缺乏专门的基准来评估它们，目前在神经肿瘤学中尚未得到充分利用。我们引入了一个临床相关的视觉问答（VQA）基准——UCSF-PDGM-VQA数据集，该数据集由来自公共UCSF-PDGM数据集的473个与胶质瘤相关的MRI研究中的2,387对问答（QA）组成。我们进一步在该数据集上为六个最先进的视觉-语言模型（VLMs）和一个大型语言模型建立了性能基线。我们发现当前模型无法有效处理多序列、三维MRI扫描，从而导致视觉特征的抑制和对语言先验的过度依赖，造成模态崩溃。这些发现强调了当前模型在临床环境中的可靠性和安全性存在严重不足，迫切需要开发稳健的、特定领域的VLMs。


cs.CV / 126 / 2605.17165

Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives

视频JEPA的因子化潜在动态：辅助目标的实证研究

Premi, Santosh

Abstract

Joint-Embedding Predictive Architectures (JEPA) are a promising framework for self-supervised video representation learning, yet the behavior of auxiliary objectives in small-scale Video-JEPA training is not well characterized. We report a small-scale empirical study of 18 auxiliary objective variants for Video-JEPA across two pretraining regimes: single-dataset (UCF-101) and mixed-dataset (UCF-101 + Something-Something V2 + ImageNet-100). We evaluate frozen representations on three complementary benchmarks: Diving-48 (fine-grained motion), Something-Something V2 (temporal reasoning), and ImageNet-100 (appearance). Our experiments suggest that many auxiliary objectives exhibit capacity trade-offs: gains on one downstream capability often coincide with degradation on another. We then study FWM-HW-LD (Factorized World-Model with Hard-Region-Weighted Latent Dynamics), a training-time objective that separates the latent representation into appearance and dynamics subspaces and applies hard-region weighting to both JEPA prediction errors and latent dynamics errors. In our mixed-dataset setting, FWM-HW-LD improves ImageNet-100 by +5.92 and SSv2 by +3.21 percentage points relative to the reference baseline, while remaining within 0.30 percentage points on Diving-48. These results indicate that latent factorization is a useful direction for studying auxiliary-objective trade-offs in Video-JEPA.

Chinese Translation

联合嵌入预测架构（JEPA）是自监督视频表示学习的一个有前景的框架，但在小规模视频JEPA训练中，辅助目标的行为尚未得到充分表征。我们报告了在两个预训练模式下对18种视频JEPA辅助目标变体的小规模实证研究：单数据集（UCF-101）和混合数据集（UCF-101 + Something-Something V2 + ImageNet-100）。我们在三个互补基准上评估了冻结表示：Diving-48（细粒度运动）、Something-Something V2（时间推理）和ImageNet-100（外观）。我们的实验表明，许多辅助目标表现出容量权衡：在一个下游能力上的提升往往伴随着另一个能力的下降。接着，我们研究了FWM-HW-LD（因子化世界模型与硬区域加权潜在动态），这是一个训练时目标，它将潜在表示分离为外观和动态子空间，并对JEPA预测误差和潜在动态误差应用硬区域加权。在我们的混合数据集设置中，FWM-HW-LD相较于参考基线在ImageNet-100上提高了+5.92，在SSv2上提高了+3.21个百分点，同时在Diving-48上保持在0.30个百分点以内。这些结果表明，潜在因子化是研究视频JEPA中辅助目标权衡的一个有用方向。


cs.CV / 127 / 2605.17179

iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning

iMiGUE-3K：用于微手势分析的自监督学习大规模基准

Wang, Chengyan, Chen, Haoyu, Wei, Hui, Yang, Yueyi, Chen, Yunquan, Zhao, Guoying

Abstract

Emotion understanding is a fundamental challenge in affective computing and artificial intelligence. While existing approaches predominantly focus on facial expressions and speech, they often overlook the rich emotional cues conveyed through body language. Recently, micro-gestures (MGs), unintentional, subconscious movements driven by inner feelings, have attracted increasing attention as an alternative to other cues. However, no existing large-scale dataset supports the pre-training of an MG foundation model. To advance MG research, we present a new benchmark for micro-gesture-based emotion understanding, whose key contributions are a novel dataset (iMiGUE-3K) and a series of foundation models for different tasks. Using a model-based crowd-sourcing data collection strategy, we construct iMiGUE-3K, the largest MG dataset to date. It comprises video recordings from 332 distinct professional tennis players' public press interviews over the past seven years, totaling more than 3.4K long video clips and 37 million frames. The dataset includes 32 micro-gesture classes with rich descriptive annotations, making it the first large-scale, in-the-wild video dataset for fine-grained gesture-based emotion analysis. Built on iMiGUE-3K, we propose MG-FMs, a discriminative foundation model for transferable gesture representation learning. Based on the foundation model, we establish five comprehensive evaluation tasks: MG recognition (unsupervised, semi-supervised, supervised), MG retrieval, and MG emotion recognition. Our systematic evaluation of representative methods demonstrates that micro-gesture-based analysis significantly improves emotion understanding. We hope this work can provide comprehensive tools for MG analysis and set a solid foundation for future research in psychological diagnostics, affective computing, and advanced human-computer interaction.

Chinese Translation

情感理解是情感计算和人工智能中的一个基本挑战。尽管现有的方法主要集中在面部表情和语音上，但它们往往忽视了通过身体语言传达的丰富情感线索。最近，微手势（Micro-Gestures，MG）作为一种替代其他线索的方式，因其是由内心感受驱动的无意、潜意识的运动而受到越来越多的关注。然而，目前尚无支持MG基础模型预训练的大规模数据集。为了推动MG研究，我们提出了一个基于微手势的情感理解新基准，具有关键贡献，包括一个新数据集（iMiGUE-3K）和一系列针对不同任务的基础模型。通过基于模型的众包数据收集策略，我们构建了iMiGUE-3K，这是迄今为止最大的MG数据集。该数据集包括过去七年中332名不同职业网球运动员的公开新闻采访视频录制，总计超过3400个长视频片段和3700万帧。数据集包含32个微手势类别，并附有丰富的描述性注释，使其成为第一个大规模、真实场景下的细粒度手势情感分析视频数据集。基于iMiGUE-3K，我们提出了MG-FMs，一个用于可迁移手势表示学习的判别基础模型。基于该基础模型，我们建立了五个全面的评估任务：MG识别（无监督、半监督、监督）、MG检索和MG情感识别。我们对代表性方法的系统评估表明，基于微手势的分析显著提高了情感理解。我们希望这项工作能够为MG分析提供全面的工具，并为未来在心理诊断、情感计算和先进人机交互领域的研究奠定坚实基础。


cs.CV / 128 / 2605.17236

Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization, Statistical Validation, and Clinical Interpretability

视觉变换器在自动化宫颈癌分类中的系统评估：优化、统计验证与临床可解释性

Albzour, Nisreen, Lam, Sarah S.

Abstract

Manual Pap smear analysis for cervical cancer screening is limited by inter-observer variability, time constraints, and restricted expert availability. Although convolutional neural networks (CNNs) have automated cervical cell classification, they remain limited in modeling long-range spatial dependencies and often lack clinical interpretability. In this study, Vision Transformer (ViT) architectures were systematically optimized to enhance automated cervical cancer screening, which resulted in improved interpretability. The Herlev dataset (917 images: 242 normal, 675 abnormal) was used to optimize ViT-Tiny, a lightweight Vision Transformer architecture designed for reduced computational complexity, through a comprehensive evaluation of augmentation strategies, class weighting, and hyperparameters. The optimal configuration achieved 94.9% to 95.2% cross-validation accuracy, with random horizontal flipping and class weighting (0.7 x 1.3) identified as most effective. Gradient-weighted Class Activation Mapping (Grad-CAM) analysis confirmed that model attention corresponded to clinically relevant morphological features, including nuclear regions, cell boundaries, and chromatin texture, that align with cytopathological criteria. These findings indicate that Vision Transformers can deliver accurate and interpretable decision support for cervical cancer screening, fulfilling both the clinical performance and transparency requirements essential for medical AI deployment.

Chinese Translation

手动宫颈癌筛查的巴氏涂片分析受到观察者间变异性、时间限制和专家可用性不足的限制。尽管卷积神经网络（CNNs）已实现宫颈细胞分类的自动化，但在建模长距离空间依赖性方面仍然有限，且往往缺乏临床可解释性。在本研究中，系统优化了视觉变换器（Vision Transformer, ViT）架构，以增强自动化宫颈癌筛查，从而提高了可解释性。利用Herlev数据集（917张图像：242张正常，675张异常）对ViT-Tiny进行优化，该架构是一种为减少计算复杂度而设计的轻量级视觉变换器，通过对增强策略、类别加权和超参数的全面评估。最佳配置达到了94.9%-95.2%的交叉验证准确率，其中随机水平翻转和类别加权（0.7 x 1.3）被确定为最有效的策略。梯度加权类别激活映射（Gradient-weighted Class Activation Mapping, Grad-CAM）分析确认模型注意力与临床相关的形态特征相对应，包括核区域、细胞边界和染色质纹理，这些特征符合细胞病理标准。这些发现表明，视觉变换器能够为宫颈癌筛查提供准确且可解释的决策支持，满足医疗人工智能部署所需的临床性能和透明度要求。
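Class weighting of the kind the abstract above mentions is commonly implemented as a weighted cross-entropy loss. The sketch below assumes the two reported values (0.7 and 1.3) are per-class loss weights and that the larger weight goes to the minority "normal" class; the abstract states neither, so both the assignment and the function name `weighted_cross_entropy` are assumptions for illustration.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """Cross-entropy where each sample is scaled by the weight of its true class.

    The sum is normalized by the total weight, which mirrors the weighted-mean
    reduction used by common deep learning frameworks.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(labels)), labels]
    weights = class_weights[labels]
    return (weights * per_sample).sum() / weights.sum()

# Assumed mapping: index 0 = normal (minority, weight 1.3), index 1 = abnormal (weight 0.7).
class_weights = np.array([1.3, 0.7])
logits = np.array([[0.0, 2.0], [0.0, 2.0]])  # model favors "abnormal" for both samples
labels = np.array([0, 1])                    # the first sample is actually normal
loss = weighted_cross_entropy(logits, labels, class_weights)
```

With these weights, a misclassified normal cell contributes more to the loss than a misclassified abnormal cell, which counteracts the roughly 1:2.8 class imbalance in the Herlev counts quoted above.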


cs.CV / 129 / 2605.17248

Image-to-Video Diffusion: From Foundations to Open Frontiers

图像到视频扩散：从基础到开放前沿

Wang, Xianlong, Pan, Wenbo, Zhou, Shijia, Li, Ke, Wang, Yuqi, Ye, Zeyu, Zhang, Hangtao, Zhang, Leo Yu, Jia, Xiaohua

Abstract

Diffusion-based image-to-video (I2V) generation has become a central direction in generative models by turning a reference image, with optional conditions, into a temporally coherent video. Compared with broader video generation settings, this task places stricter demands on content consistency, identity preservation, and motion coherence. Although the literature grows rapidly, existing works mostly discuss I2V generation within broader topics and still lack a dedicated taxonomy together with a systematic analysis centered on this field. This work addresses that gap by treating diffusion I2V generation as a standalone subject. It first reviews the task formulation, model architectures, datasets, and evaluation metrics, and then organizes existing methods through a taxonomy based on architecture and training paradigm. It further distills four core designs, namely condition encoding, temporal modeling, noise prior design, and spatial-temporal upsampling, and discusses representative application scenarios together with major open challenges.

Chinese Translation

基于扩散的图像到视频（I2V）生成已成为生成模型中的一个核心方向，通过将参考图像（可选条件）转化为时间一致的视频。与更广泛的视频生成设置相比，该任务对内容一致性、身份保持和运动一致性提出了更严格的要求。尽管相关文献迅速增长，现有研究大多在更广泛的主题中讨论I2V生成，仍然缺乏专门的分类法以及围绕该领域的系统分析。本研究通过将扩散I2V生成视为一个独立主题来填补这一空白。首先回顾了任务的表述、模型架构、数据集和评估指标，然后通过基于架构和训练范式的分类法组织现有方法。进一步提炼出四个核心设计，即条件编码、时间建模、噪声先验设计和时空上采样，并讨论了代表性的应用场景以及主要的开放挑战。


cs.CV / 130 / 2605.17252

Monocular Depth Perception Enhancement Based on Joint Shading/Contrast Model and Motion Parallax (JSM)

基于联合阴影/对比模型和运动视差的单目深度感知增强 (JSM)

Ryu, Seungchul, Yoo, Hyunjin, Akhavan, Tara

Abstract

Stereoscopic 3D displays rely on binocular depth cues to provide depth perception. However, users must be equipped with expensive special devices to perceive depth from these binocular cues, and visual fatigue induced by stereoscopic displays remains a challenging open problem. To overcome these limitations, this paper proposes a novel framework, JSM, that enhances monocular depth perception, significantly improving both depth volume perception and depth range perception. The proposed framework not only provides enhanced depth perception on any conventional 2D display device but is also applicable to 3D display devices, since it is complementary to binocular depth cues. The qualitative evaluation, ablation study, and subjective user evaluation demonstrate the advantages and practicability of the proposed framework.

Chinese Translation

立体3D显示器采用双眼深度线索来提供深度感知。然而，用户需要配备昂贵的特殊设备才能欣赏基于双眼深度线索的深度感知。此外，立体显示引起的视觉疲劳仍然是一个具有挑战性的开放问题。为了克服这一限制，本文提出了一种新颖的框架JSM，以增强单目深度感知，显著改善深度体积感知和深度范围感知。该框架不仅可以在任何传统的2D显示设备上提供增强的深度感知，还可以适用于3D显示设备，因为它与双眼深度线索是互补的。定性评估、消融研究和主观用户评估证明了该框架的优势和实用性。


cs.CV / 131 / 2605.17260

LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

LiteFrame：高效视觉编码器解锁视频大型语言模型中的帧缩放

Kim, Jihwan, Parthasarathy, Nikhil, Qin, Danfeng, Hur, Junhwa, Sun, Deqing, Han, Bohyung, Yang, Ming-Hsuan, Gong, Boqing

Abstract

The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies in managing the explosion of visual-token context length. Existing strategies predominantly focus on "post-hoc" token reduction -- reducing visual tokens after feature extraction to alleviate the LLM's computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder. To address this, we introduce LiteFrame, a strong, yet highly efficient video encoder backbone for Video LLMs. To train LiteFrame, we propose Compressed Token Distillation (CTD), a novel training framework that teaches a compact student vision encoder to directly predict information-dense, spatio-temporally compressed representations produced by a large teacher vision model, effectively bypassing redundant computation. When coupled with further Language Model Adaptation (LMA), this approach results in a new latency-accuracy Pareto frontier -- compared with InternVL3-8B, LiteFrame provides a 35% reduction in end-to-end latency while processing 8$\times$ more frames and improves average video understanding accuracy across multiple benchmarks. Our results demonstrate a new potential path to unlocking longer-form video understanding under fixed compute budgets.

Chinese Translation

将视频大型语言模型（Video LLMs）扩展到长视频的根本挑战在于管理视觉标记上下文长度的爆炸性增长。现有策略主要集中在“事后”标记减少——在特征提取后减少视觉标记，以减轻LLM的计算负担。尽管这些方法有效地减少了视觉标记的数量，但我们观察到主要的延迟瓶颈随后从LLM转移到了视觉编码器的昂贵逐帧处理。为了解决这个问题，我们提出了LiteFrame，一个强大而高效的视频编码器骨干网，专为视频LLMs设计。为了训练LiteFrame，我们提出了压缩标记蒸馏（Compressed Token Distillation, CTD），这是一种新颖的训练框架，旨在教导一个紧凑的学生视觉编码器直接预测由大型教师视觉模型生成的信息密集型、时空压缩的表示，从而有效地绕过冗余计算。当与进一步的语言模型适应（Language Model Adaptation, LMA）结合时，这种方法产生了一个新的延迟-准确性帕累托前沿——与InternVL3-8B相比，LiteFrame在处理8倍帧数的同时，提供了35%的端到端延迟减少，并提高了多个基准测试中的平均视频理解准确性。我们的结果展示了一条在固定计算预算下解锁长视频理解的新潜在路径。


cs.CV / 132 / 2605.17262

EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

EgoIntrospect：一个以自我为中心的数据集和用户中心内部状态推理的基准

Wang, Zeyu, Liu, Chang, Tjitrahardja, Eduardus, Wang, Yuntao, Pavlov, Borislav, Gou, Fangfei, Davila, Jose Manuel, Shi, Dai, Xu, Ran, Pan, Yue, Tan, Jiayi, Chang, Shuting, Wang, Qi, Li, Jinzhao, Hua, Jiacheng, Huang, Yifei, Sun, Jingwei, Zhang, Yu, Zhang, Liuxin, Yao, Guocai, Jia, Jia, Li, Yin, Wang, Qianying, Shi, Yuanchun, Liu, Miao

Abstract

Despite extensive efforts on egocentric video datasets and benchmarks, understanding users' internal states, which is crucial for enabling seamless AI assistant experiences, remains largely overlooked. In this work, we introduce EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations that explicitly reveal users' interactive intentions with AI assistants. EgoIntrospect was collected using a cross-device setup, providing synchronized video, audio, gaze, motion, and physiological signals. It consists of 180 hours of recordings from 60 subjects, with an average recording duration of 3 hours per subject. Leveraging EgoIntrospect, we formalize a suite of tasks centered on user internal states, including affective experience, interactive intent, and cognitive memory. We further process the annotations to construct benchmarks that evaluate the ability of modern multimodal large language models to reason about users' internal states from egocentric observations. Experiments on our benchmark suggest that existing multimodal large language models struggle to effectively leverage multimodal signals to infer users' subjective internal states. The dataset and annotations will be made publicly available to advance research in egocentric vision and wearable AI assistants. Project page: https://ego-introspect.github.io/

Chinese Translation

尽管在以自我为中心的视频数据集和基准方面进行了广泛的努力，但理解用户的内部状态仍然在很大程度上被忽视，而这对于实现无缝的人工智能助手体验至关重要。在本研究中，我们介绍了EgoIntrospect，这是第一个在用户驱动场景中捕获的以自我为中心的数据集，具有自我注释，明确揭示用户与人工智能助手的交互意图。EgoIntrospect采用跨设备设置收集，提供同步的视频、音频、注视、运动和生理信号。该数据集由60名受试者的180小时录音组成，每名受试者的平均录音时长为3小时。利用EgoIntrospect，我们正式化了一系列以用户内部状态为中心的任务，包括情感体验、交互意图和认知记忆。我们进一步处理注释，以构建基准，评估现代多模态大型语言模型从以自我为中心的观察中推理用户内部状态的能力。在我们的基准测试中的实验表明，现有的多模态大型语言模型在有效利用多模态信号推断用户主观内部状态方面存在困难。该数据集和注释将公开发布，以推动以自我为中心的视觉和可穿戴人工智能助手的研究。项目页面：https://ego-introspect.github.io/


cs.CV / 133 / 2605.17270

Beyond Detection: A Structure-Aware Framework for Scene Text Tracking

超越检测：一种结构感知的场景文本跟踪框架

Yu, Chenmin, Yu, Liu, Wu, Daiqing, Li, Gengluo, Chen, Zeyu, Zhou, Yu

Abstract

Modern visual object trackers show impressive results on general targets, yet their performance drops substantially when dealing with scene text. Although currently underexplored, tracking text in videos is essential for dynamic text manipulations such as segmentation, removal, and editing. To fill this gap, this paper formalizes this specific task as Scene Text Tracking and presents the first systematic work for it. We identify three primary challenges in this task: 1) severe geometric distortions from perspective shifts, 2) high visual ambiguity across different instances, and 3) high sensitivity to fine-grained structural details. To address these issues, we propose SymTrack, a unified detection-free framework with synergistic dual-branch design. It integrates a Cross-Expert Calibration mechanism to reduce semantic bias, along with a Predictive Token Rectification mechanism to correct structural imbalances, complemented by an Adaptive Inference Engine that stabilizes predictions under motion constraints. Considering the lack of dedicated benchmarks for this task, we utilize three datasets from video text spotting to construct a benchmark with high-quality annotations. Extensive experiments demonstrate that SymTrack sets the new state-of-the-art on all three benchmarks, outperforming previous best trackers by up to 11.97% AUC on $\text{BOVText}_{\text{SOT}}$. Overall, our work promotes efficient and thorough text tracking, paving the way toward more generalized video text manipulation.

Chinese Translation

现代视觉目标跟踪器在一般目标上表现出色，但在处理场景文本时，其性能显著下降。尽管这一领域目前尚未得到充分探索，但在视频中跟踪文本对于动态文本操作（如分割、去除和编辑）至关重要。为填补这一空白，本文将这一特定任务形式化为场景文本跟踪，并提出了首个系统性研究。我们识别出该任务中的三个主要挑战：1）由于视角变化造成的严重几何失真，2）不同实例之间的高度视觉模糊，以及3）对细粒度结构细节的高度敏感性。为了解决这些问题，我们提出了SymTrack，一个统一的无检测框架，采用协同的双分支设计。它集成了跨专家校准机制以减少语义偏差，以及预测性标记校正机制以纠正结构不平衡，辅以自适应推理引擎以在运动约束下稳定预测。考虑到该任务缺乏专门的基准，我们利用来自视频文本检测的三个数据集构建了一个具有高质量注释的基准。大量实验表明，SymTrack在所有三个基准上设立了新的最先进水平，在$\text{BOVText}_{\text{SOT}}$上超越了之前最佳跟踪器高达11.97%的AUC。总体而言，我们的工作促进了高效而全面的文本跟踪，为更广泛的视频文本操作铺平了道路。


cs.CV / 134 / 2605.17284

CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving

CLAP：用于端到端自主驾驶的对比潜在空间提示优化

Zhu, Ruiyang, He, Yuehan, Zheng, Boyuan, Zhao, Zesen, Chalhoub, Ahmad, Zhang, Qingzhao, Mao, Z. Morley

Abstract

End-to-end autonomous driving systems powered by Vision-Language-Action (VLA) models achieve strong performance on common driving scenarios, yet remain brittle in rare but safety-critical long-tail situations such as active construction zones and complex yielding geometries. In this paper, we present a method that addresses the long-tail challenging scenes beyond data scaling and model training. We introduce CLAP (Contrastive Latent-space Prompt optimization), a location-aware adaptation framework that augments a frozen VLA driving model with per-roadblock soft prompts, optimized from crowdsourced data and retrieved on demand via Vehicle-to-Everything (V2X) communication. Our approach rests on two observations from VLAs' latent space: (i) at the VLA's hidden-state layer, scenarios from the same roadblock cluster tightly and occupy compact regions of the latent space; and (ii) within a single roadblock, long-tail and normal frames are heavily intermixed in the latent representation, making it difficult to improve one without disturbing the other. CLAP addresses this via a two-stage pipeline: supervised contrastive learning to discover a roadblock-specific hard-scene direction, followed by directionally regularized prompt optimization that selectively improves challenging frames while preserving normal frame performance. On the NAVSIM benchmark with various state-of-the-art VLA backbones, CLAP reduces challenging scenario planning error by 24% with no regression on normal frames, significantly improving planning performance.

Chinese Translation

由视觉-语言-动作（VLA）模型驱动的端到端自主驾驶系统在常见驾驶场景中表现出色，但在一些罕见但安全关键的长尾场景（如活跃的施工区域和复杂的让行几何形状）中仍然脆弱。本文提出了一种方法，旨在解决超出数据扩展和模型训练的长尾挑战场景。我们引入了CLAP（对比潜在空间提示优化），这是一种位置感知的适应框架，通过每个路障的软提示增强一个冻结的VLA驾驶模型，这些提示是从众包数据中优化而来，并通过车与万物（V2X）通信按需检索。我们的方法基于对VLA潜在空间的两个观察：（i）在VLA的隐藏状态层中，来自同一路障的场景紧密聚集并占据潜在空间的紧凑区域；（ii）在单个路障内，长尾帧和正常帧在潜在表示中高度交织，使得在不干扰另一者的情况下改善一个变得困难。CLAP通过一个两阶段的流程解决了这一问题：首先进行监督对比学习，以发现特定路障的困难场景方向，然后进行方向性正则化的提示优化，选择性地改善具有挑战性的帧，同时保持正常帧的性能。在NAVSIM基准测试中，结合多种最先进的VLA骨干网络，CLAP将具有挑战性的场景规划误差降低了24%，而正常帧没有出现回归，显著提高了规划性能。


cs.CV / 135 / 2605.17286

HyperVision: A Channel-Adaptive Ground-Based Hyperspectral Vision Pre-trained Backbone

HyperVision：一种通道自适应的地面高光谱视觉预训练骨干网络

Fu, Guanyiman, Li, Jingtao, Cheng, Zihang, Li, Zhuanfeng, Chen, Diqi, Xu, Yan, Xiong, Fengchao, Lu, Jianfeng, Zhou, Jun

Abstract

While hyperspectral imaging provides rich spatial-spectral information across hundreds of narrow wavelength bands for precise material identification, ground-based hyperspectral pre-trained backbones remain absent, constrained by varying spectral configurations across sensors, the scarcity and inconsistency of labels, and the limited scale and scene diversity of existing datasets. To address these challenges and enable universal perception, we propose HyperVision, the first ground-based hyperspectral pre-trained backbone. First, to handle varying spectral configurations, HyperVision adopts a channel-adaptive dynamic embedding mechanism to map heterogeneous inputs into a unified token space. Second, to address the scarcity and inconsistency of labels, we introduce a multi-source pseudo-labeling method that fuses semantic representations from both spatial structures generated by SAM2 and fine-grained spectral material information extracted by HyperFree. Third, to compensate for limited dataset scale and enrich scene diversity, a cross-modal knowledge distillation mechanism is utilized to transfer rich semantic representations from a pre-trained RGB vision model to our hyperspectral backbone. Pre-trained on a collection of 15k images from 26 diverse ground-based datasets, HyperVision demonstrates exceptional generalization. Requiring only efficient head-only adaptation without adjusting backbone parameters, it achieves state-of-the-art performance compared to task-specific methods across three downstream tasks under varying sensor configurations, yielding up to a 16.3% relative improvement in hyperspectral semantic segmentation $\mathrm{Acc}_{\mathrm{M}}$, a 2.1% relative gain in object tracking AUC, and a 35.5% reduction in salient object detection MAE. The source code and pre-trained model will be publicly available at https://github.com/lronkitty/HyperVision .

Chinese Translation

高光谱成像在数百个窄波段中提供丰富的空间-光谱信息，以实现精确的材料识别，但地面高光谱预训练骨干网络仍然缺乏，受限于传感器之间的光谱配置差异、标签的稀缺性和不一致性，以及现有数据集的规模和场景多样性有限。为了解决这些挑战并实现通用感知，我们提出了HyperVision，这是第一个地面高光谱预训练骨干网络。首先，为了处理不同的光谱配置，HyperVision采用了一种通道自适应动态嵌入机制，将异构输入映射到统一的标记空间。其次，为了解决标签的稀缺性和不一致性，我们引入了一种多源伪标签方法，融合了由SAM2生成的空间结构的语义表示和由HyperFree提取的细粒度光谱材料信息。第三，为了弥补数据集规模的限制并丰富场景多样性，利用跨模态知识蒸馏机制，将来自预训练RGB视觉模型的丰富语义表示转移到我们的高光谱骨干网络上。HyperVision在来自26个多样化地面数据集的15,000张图像的集合上进行预训练，展示了卓越的泛化能力。只需高效的头部适应，而无需调整骨干参数，它在三项下游任务中相较于特定任务的方法实现了最先进的性能，在不同传感器配置下，高光谱语义分割的相对提升达到16.3%（$\mathrm{Acc}_{\mathrm{M}}$），目标跟踪AUC相对提高2.1%，显著目标检测MAE减少35.5%。源代码和预训练模型将在 https://github.com/lronkitty/HyperVision 公开发布。


cs.CV / 136 / 2605.17287

LISA: Language-guided Interference-aware Spatial-Frequency Attention for Driver Gaze Estimation

LISA：基于语言引导的干扰感知空间频率注意力框架用于驾驶员注视估计

Ma, Jun, Yang, Zhenye, Zhou, Ruichen, Zhang, Pei, Li, Huan, Chen, Jinpeng

Abstract

Driver gaze estimation serves as a fundamental metric for evaluating driver attentiveness in modern monitoring systems. Beyond being vulnerable to sudden lighting changes and sensor noise, spatial-domain models struggle to disentangle authentic gaze cues from irrelevant visual attributes. In this paper, we propose LISA, a Language-guided Interference-aware Spatial-Frequency Attention framework that combines frequency-domain priors with vision-language knowledge. Observing that the amplitude spectrum remains relatively stable even under spatial perturbations, we design a dual-domain fusion mechanism. It integrates stable low-frequency semantics into high-frequency details, employing spatial attention to precisely target ocular regions. To reduce semantic ambiguity, we also introduce a training-time disentanglement strategy. Using a frozen CLIP encoder and orthogonal regularization, we explicitly separate gaze features from appearance interference. Experiments on two benchmarks show that LISA achieves state-of-the-art performance, with significantly improved robustness against occlusions and lighting variations. The code repository is available at https://github.com/Mason-bupt/LISA.

Chinese Translation

驾驶员注视估计作为评估现代监控系统中驾驶员注意力的基本指标，面临着突发光照变化和传感器噪声的脆弱性。此外，空间域模型在分离真实注视线索与无关视觉属性方面也存在困难。在本文中，我们提出了LISA，一个Language-guided Interference-aware Spatial-Frequency Attention框架，结合了频域先验与视觉-语言知识。我们观察到，即使在空间扰动下，幅度谱仍然相对稳定，因此设计了一个双域融合机制。该机制将稳定的低频语义与高频细节相结合，利用空间注意力精确定位眼部区域。为了减少语义模糊性，我们还引入了一种训练时的解耦策略。通过使用冻结的CLIP编码器和正交正则化，我们明确地将注视特征与外观干扰分离。对两个基准的实验表明，LISA实现了最先进的性能，并在遮挡和光照变化方面显著提高了鲁棒性。代码库可在 https://github.com/Mason-bupt/LISA 获取。
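
The abstract's observation that the amplitude spectrum stays stable under spatial perturbations can be illustrated in its simplest form: a circular shift of a signal changes only the phase of its discrete Fourier transform, not the magnitude. The following is a minimal sketch with a hand-written 1D DFT. The helper names (`dft`, `amplitude`) and the sample values are hypothetical, and the sketch does not reproduce LISA's dual-domain fusion or any part of its released code.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def amplitude(signal):
    """Amplitude spectrum: magnitude of each DFT coefficient."""
    return [abs(c) for c in dft(signal)]


# Illustrative 1D "row of pixels" and a circularly shifted copy of it.
row = [0.0, 1.0, 4.0, 2.0, 0.5, 0.0, 3.0, 1.5]
shifted = row[3:] + row[:3]

amp_row = amplitude(row)
amp_shifted = amplitude(shifted)

# The shift only rotates the phase, so the magnitudes agree up to rounding.
max_gap = max(abs(a - b) for a, b in zip(amp_row, amp_shifted))
```

Real perturbations such as lighting changes are not pure shifts, so this only shows why amplitude is a natural candidate for a stable cue, not how stable it is in practice.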

cs.CV / 137 / 2605.17294

HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing

HierEdit：区域感知的层次扩散用于高效高分辨率编辑

Zhang, Yuyao, Huang-Menders, Alexander, Tai, Yu-Wing

Abstract

High-resolution image editing is essential for professional and creative applications, yet existing multimodal diffusion-based editors remain computationally inefficient and constrained to relatively low resolutions. Current approaches redundantly process the entire image canvas or rely on large-scale high-resolution datasets, resulting in substantial training and inference costs. We introduce HierEdit, a region-aware hierarchical diffusion framework designed for efficient and scalable high-resolution image editing. Our method first performs edits on a low-resolution proxy using an off-the-shelf editing model to generate a reference and to localize the modified regions. A hierarchical local-window diffusion model (Local-Window MMDiT) then refines only the edited regions within the original high-resolution image, while reusing the unaltered regions as conditioning inputs. The low-resolution proxy further provides structural guidance and intermediate denoising supervision (Inference Acceleration), ensuring consistent global semantics and stable generation without the need for full-resolution attention computation. This targeted and hierarchical design enables fast, high-fidelity editing of images up to 4K resolution without any specialized high-resolution training data. Extensive experiments demonstrate that HierEdit achieves competitive visual quality on commodity-resolution datasets while significantly accelerating inference and extending seamlessly to ultra-high-resolution 4K editing. Project page: https://peteryyzhang.github.io/HierEdit-page/

Chinese Translation

高分辨率图像编辑对于专业和创意应用至关重要，但现有的基于多模态扩散的编辑器在计算效率上仍然不足，并且受限于相对较低的分辨率。当前的方法往往冗余地处理整个图像画布或依赖于大规模的高分辨率数据集，导致训练和推理成本高昂。我们提出了HierEdit，一种区域感知的层次扩散框架，旨在实现高效且可扩展的高分辨率图像编辑。我们的方法首先使用现成的编辑模型在低分辨率代理图像上进行编辑，以生成参考图像并定位修改区域。接着，采用层次局部窗口扩散模型（Local-Window MMDiT），仅对原始高分辨率图像中的编辑区域进行细化，同时将未更改区域作为条件输入进行重用。低分辨率代理图像进一步提供结构指导和中间去噪监督（Inference Acceleration），确保全局语义的一致性和稳定生成，而无需进行全分辨率的注意力计算。这种针对性和层次化的设计使得在没有任何专门高分辨率训练数据的情况下，实现高达4K分辨率的快速高保真图像编辑。大量实验表明，HierEdit在商品分辨率数据集上实现了具有竞争力的视觉质量，同时显著加速了推理，并无缝扩展到超高分辨率的4K编辑。请查看我们的项目页面：https://peteryyzhang.github.io/HierEdit-page/

cs.CV / 138 / 2605.17303

LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos

LongDPM：基于长单目视频的重叠感知四维重建

Xu, Chenyi, Wu, Yihao, Yan, Liqi, Yang, Chao, Zhang, Jianhui, Guan, Fangli, Li, Pan

Abstract

Recovering a dynamic 3D scene from a long monocular video requires dense geometry, camera motion, and temporal correspondence to remain consistent in a shared coordinate system. Existing methods face two key challenges: (1) feed-forward reconstruction models provide accurate local predictions but are limited to short clips, and (2) long-range trackers preserve correspondences without producing dense sequence-level reconstruction. This paper presents LongDPM, a novel overlap-aware framework for scalable long-range monocular dynamic reconstruction. First, LongDPM processes long videos in overlapping chunks, keeping inference memory bounded by the chunk length. Second, it connects chunk-local coordinate systems through confidence-weighted registration with static-aware overlap abstraction. Third, it associates dynamic identities across chunk boundaries and fuses matched trajectories to recover coherent long-range 3D motion. Experimental results demonstrate that LongDPM achieves superior long-range reconstruction and tracking performance, reducing dense tracking EPE over V-DPM on PointOdyssey, Kubric-F, and Kubric-G, while obtaining the best TUM-dynamics ATE for camera pose estimation.

Chinese Translation

从长单目视频中恢复动态三维场景对于保持密集几何、相机运动和时间对应关系在共享坐标系统中的一致性至关重要。现有方法面临两个主要挑战：（1）前馈重建模型提供准确的局部预测，但仅限于短片段；（2）长距离跟踪器在不产生密集序列级重建的情况下保持对应关系。本文提出了LongDPM，一种新颖的重叠感知框架，用于可扩展的长距离单目动态重建。首先，LongDPM以重叠块的方式处理长视频，将推理内存限制在块长度内。其次，通过静态感知重叠抽象的置信加权配准，连接块局部坐标系统。第三，它在块边界之间关联动态身份，并融合匹配轨迹以恢复一致的长距离三维运动。实验结果表明，LongDPM在长距离重建和跟踪性能上优于现有方法，在PointOdyssey、Kubric-F和Kubric-G上减少了相较于V-DPM的密集跟踪EPE，同时在相机姿态估计中获得了最佳的TUM-dynamics ATE。
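
The first step in the abstract, processing a long video in overlapping chunks so that memory is bounded by the chunk length, can be sketched as a simple index computation. This is an illustrative assumption of a fixed-stride split; the function name `overlapping_chunks` and its parameters are hypothetical and not taken from LongDPM.

```python
def overlapping_chunks(num_frames, chunk_len, overlap):
    """Return (start, end) frame ranges, end exclusive, that cover the video.

    Consecutive chunks share `overlap` frames (assumption: fixed stride).
    The last chunk may be shorter than `chunk_len`.
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    if num_frames <= 0:
        return []
    stride = chunk_len - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_len, num_frames)
        chunks.append((start, end))
        if end == num_frames:
            break
        start += stride
    return chunks


# Ten frames, chunks of four frames, one shared frame between neighbours.
example = overlapping_chunks(10, 4, 1)
```

The shared frames are what a registration step would use to align one chunk's coordinate system with the next; that alignment is not shown here.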

cs.CV / 139 / 2605.17309

StyleText: A Large-Scale Dataset and Benchmark for Stylized Scene Text Inpainting

StyleText：一个用于风格化场景文本修复的大规模数据集和基准

Simonyan, Aleksandr, Jindal, Nipun

Abstract

We present StyleText, a large-scale dataset and benchmark for localized scene-text inpainting with style preservation. StyleText contains 28,518 image-mask-prompt triplets grouped into 9,932 scene families, enabling controlled evaluation of text legibility and visual consistency under shared scene context. We construct the dataset with an automated pipeline that combines LLM prompt templating, Flux-based source generation with key-value (KV) cache injection, OCR-based semantic filtering, polygon mask extraction, and mask-conditioned FluxFill augmentation. We define a reproducible evaluation protocol using normalized OCR metrics (word accuracy and character error rate) and CLIP image-image similarity with explicit preprocessing. A FluxFill+LoRA baseline trained on StyleText improves OCR accuracy substantially over initialization while maintaining scene style consistency, establishing a strong reference point for future comparisons.

Chinese Translation

我们提出了StyleText，这是一个用于局部场景文本修复并保持风格的大规模数据集和基准。StyleText包含28,518个图像-掩码-提示三元组，分为9,932个场景家族，使得在共享场景上下文下对文本可读性和视觉一致性的控制评估成为可能。我们通过一个自动化管道构建该数据集，该管道结合了LLM提示模板化、基于Flux的源生成与键值（KV）缓存注入、基于OCR的语义过滤、多边形掩码提取以及掩码条件的FluxFill增强。我们定义了一个可重复的评估协议，使用标准化的OCR指标（单词准确率和字符错误率）以及CLIP图像-图像相似性，并进行了明确的预处理。在StyleText上训练的FluxFill+LoRA基线在初始化时显著提高了OCR准确率，同时保持了场景风格的一致性，为未来的比较建立了一个强有力的参考点。
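
The two OCR metrics named in the abstract, word accuracy and character error rate, have standard definitions that can be sketched in a few lines. The normalization below (lowercasing and collapsing whitespace) is an assumption for illustration; the abstract does not specify StyleText's exact normalization, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(text):
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def character_error_rate(predicted, reference):
    """Edit distance divided by the length of the normalized reference."""
    ref = normalize(reference)
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(normalize(predicted), ref) / len(ref)


def word_accuracy(pairs):
    """Fraction of (predicted, reference) pairs that match after normalization."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    matches = sum(normalize(p) == normalize(r) for p, r in pairs)
    return matches / len(pairs)
```

For example, an OCR reading of "opan" against the reference "open" has one substitution over four characters, so its character error rate is 0.25.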

cs.CV / 140 / 2605.17310

Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models

注意力劫持：视觉-语言模型中跨查询的响应操控

Wang, Zhiqiang, Liu, Dongrui, Li, Yan, Ying, Zonghao, Xue, Wei, Luo, Wenhan, Guo, Yike

Abstract

Existing adversarial attacks on vision-language models (VLMs) can steer model outputs toward attacker-specified target responses, but their effectiveness often degrades when the same perturbed input is paired with different textual queries. This paper studies cross-query response manipulation, where a single adversarial example is expected to remain effective across diverse user queries. We first analyze the limitations of existing attacks and find that successful transfer is closely associated with preserving an image-dominant attention pattern during response generation. Motivated by this observation, we propose Attention Hijacking, a novel adversarial attack that explicitly steers internal attention distributions toward a persistent image-dominant pattern. By amplifying the influence of visual tokens on target response tokens while suppressing the competing influence of textual tokens, our method reduces the dependence of the manipulated output on the specific wording of the query. Extensive experiments on widely used VLMs show that Attention Hijacking substantially improves cross-query transferability across diverse target responses and unseen queries. The method also extends effectively to multiple attack scenarios, offering new insights into the role of attention stability in transferable response manipulation for VLMs.

Chinese Translation

现有的针对视觉-语言模型（VLMs）的对抗攻击能够将模型输出引导至攻击者指定的目标响应，但当相同的扰动输入与不同的文本查询配对时，其有效性往往会降低。本文研究了跨查询响应操控，其中单个对抗示例期望在多样化的用户查询中保持有效。我们首先分析了现有攻击的局限性，发现成功的迁移与在响应生成过程中保持图像主导的注意力模式密切相关。基于这一观察，我们提出了注意力劫持（Attention Hijacking），一种新颖的对抗攻击，明确地将内部注意力分布引导至持久的图像主导模式。通过增强视觉标记对目标响应标记的影响，同时抑制文本标记的竞争影响，我们的方法减少了操控输出对查询特定措辞的依赖。对广泛使用的VLMs进行的广泛实验表明，注意力劫持显著提高了在多样化目标响应和未见查询中的跨查询迁移能力。该方法还有效扩展到多种攻击场景，为VLMs中可转移响应操控的注意力稳定性角色提供了新的见解。

cs.CV / 141 / 2605.17311

SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection

SpecSem-Net：结合光谱和语义特征以增强AI生成视频检测的鲁棒性

Wei, Zixi, Zhang, Huixuaun, Wan, Xiaojun

Abstract

The remarkable visual fidelity of recent commercial video generative models, such as Sora and Veo, renders robust AI-generated video detection increasingly essential to prevent synthetic content from being indistinguishable from real videos and exploited for disinformation. However, existing detectors often fail due to an over-reliance on increasingly realistic semantic features, neglecting subtle spectral artifacts. In this paper, we propose SpecSem-Net, the first framework to introduce a semantic-guided spectral denoising mechanism specifically for high-fidelity AI-generated video detection. Specifically, we design a spectral module to extract high-frequency features via Fourier-Transform based filtering. Furthermore, to reduce misjudgments arising from spectral noise, we employ a Gated Merging Mechanism to adaptively fuse semantic context, effectively mitigating spectral noise. Additionally, to evaluate detector performance on the latest top-tier generative models, we construct a comprehensive benchmark comprising 5 SOTA commercial generators. Extensive experiments demonstrate that SpecSem-Net outperforms existing methods, achieving accuracies of 87.25% and 95.59% on our benchmark and public datasets, respectively.

Chinese Translation

近期商业视频生成模型（如Sora和Veo）所展现出的卓越视觉逼真度，使得强有力的AI生成视频检测变得愈发重要，以防止合成内容与真实视频难以区分并被用于虚假信息传播。然而，现有检测器常常由于过度依赖日益逼真的语义特征而失效，忽视了微妙的光谱伪影。本文提出了SpecSem-Net，这是第一个引入语义引导光谱去噪机制的框架，专门用于高保真AI生成视频检测。具体而言，我们设计了一个光谱模块，通过基于傅里叶变换的滤波提取高频特征。此外，为了减少因光谱噪声引起的误判，我们采用了门控融合机制，以自适应地融合语义上下文，有效缓解光谱噪声。此外，为了评估检测器在最新顶尖生成模型上的性能，我们构建了一个包含5个最先进商业生成器的综合基准。大量实验表明，SpecSem-Net在我们的基准和公共数据集上分别达到了87.25%和95.59%的准确率，优于现有方法。
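
The spectral module is described as extracting high-frequency features through Fourier-transform-based filtering. A minimal 1D sketch of that general idea follows: transform, zero the low-frequency bins, and transform back. The helper names, the hard cutoff, and the 1D setting are all assumptions for illustration; SpecSem-Net's actual filter design is not given in the abstract.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) DFT; with inverse=True, applies the 1/n-scaled inverse."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [
        sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def high_pass(signal, cutoff):
    """Remove all frequency components whose index is below `cutoff`."""
    n = len(signal)
    spectrum = dft(signal)
    for k in range(n):
        # Bins k and n - k carry the same frequency; mask them together
        # so the filtered signal stays real-valued.
        if min(k, n - k) < cutoff:
            spectrum[k] = 0
    return [v.real for v in dft(spectrum, inverse=True)]


flat = high_pass([2.0] * 8, cutoff=1)               # constant input: only bin 0
alternating = high_pass([1.0, -1.0] * 4, cutoff=2)  # highest-frequency input
```

A constant signal lives entirely in the zero-frequency bin and is removed, while the alternating signal sits at the highest frequency and passes through unchanged.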

cs.CV / 142 / 2605.17312

VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers

VISTA：基于扩散变换器的三元组监督视频风格迁移

Song, Yiren, Yao, Wangzi, Wang, Haofan, Shou, Mike Zheng

Abstract

Video style transfer aims to render videos in a target artistic style while preserving content, structure, and motion. While image stylization has advanced rapidly, video stylization remains challenging due to temporal inconsistency. Most existing methods stylize frames or keyframes and enforce consistency via heuristic temporal propagation, which is brittle under occlusions, disocclusions, and long-term motion, leading to drift and flickering artifacts. We argue that a fundamental bottleneck lies in the lack of large-scale triplet data and a principled training paradigm that jointly models and disentangles style, content, and motion. To address this, we introduce VISTA-1000, a synthetic dataset with 1,000 styles and motion-aligned triplets of style reference, clean video, and stylized video, and propose a diffusion-transformer-based in-context video style transfer framework with a lightweight style adapter for robust style extraction. Extensive experiments demonstrate state-of-the-art performance in style fidelity, temporal consistency, and content preservation.

Chinese Translation

视频风格迁移旨在将视频渲染为目标艺术风格，同时保留内容、结构和运动。尽管图像风格化已经取得了快速进展，但由于时间一致性的问题，视频风格化仍然具有挑战性。现有的大多数方法对帧或关键帧进行风格化，并通过启发式时间传播来强制一致性，但在遮挡、去遮挡和长期运动的情况下，这种方法较为脆弱，导致漂移和闪烁伪影。我们认为，根本瓶颈在于缺乏大规模的三元组数据和一个原则性的训练范式，该范式能够共同建模和解耦风格、内容和运动。为了解决这个问题，我们引入了VISTA-1000，这是一个包含1,000种风格和运动对齐的三元组（风格参考、干净视频和风格化视频）的合成数据集，并提出了一种基于扩散变换器的上下文视频风格迁移框架，配备轻量级风格适配器以实现稳健的风格提取。大量实验表明，该方法在风格保真度、时间一致性和内容保留方面达到了最先进的性能。

cs.CV / 143 / 2605.17341

Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

针对视觉-语言模型的单样本黑箱成员推断攻击：通过跨模态语义对齐

Li, Jiaqing, Lu, Yajuan, Shi, Xiaochuan, Wu, Gang, Wang, ZhongYuan, Liang, Chao

Abstract

Vision-Language Models (VLMs) have achieved remarkable success, yet their reliance on massive datasets and unintended memorization of training data raise significant data security risks. Membership Inference Attacks (MIAs) aim to assess these risks by determining whether a data sample was included in a model's training set. However, existing MIA methods against VLMs face critical bottlenecks: gray-box methods rely on internal logits that are typically restricted in real-world Application Programming Interfaces (APIs), while black-box methods depend on large-scale statistical distributions and therefore struggle in single-sample scenarios. To this end, we investigate MIAs from the perspective of cross-modal semantic alignment, and observe that member images exhibit significantly stronger image-caption alignment due to training memorization, whereas generated captions for non-members may deviate from the original visual content. Leveraging this insight, we propose a novel MIA framework designed for the strict black-box, single-sample setting that quantifies such alignment within a joint embedding space, thereby bypassing these unrealistic assumptions. We conducted extensive experiments on three open-source and two closed-source VLMs. On the VL-MIA/Flicker dataset, our method achieves an AUC of 0.821 against LLaVA-1.5, significantly outperforming existing baselines. Furthermore, it remains robust under diverse image perturbations, highlighting its practicality.

Chinese Translation

视觉-语言模型（VLMs）已取得显著成功，但其对大规模数据集的依赖以及对训练数据的无意记忆带来了显著的数据安全风险。成员推断攻击（MIAs）旨在通过确定数据样本是否包含在模型的训练集中来评估这些风险。然而，现有的针对VLMs的MIA方法面临关键瓶颈：灰箱方法依赖于通常在现实世界应用程序接口（APIs）中受限的内部logits，而黑箱方法则依赖于大规模统计分布，这在单样本场景中难以奏效。为此，我们从跨模态语义对齐的角度研究MIAs，并观察到成员图像由于训练记忆而表现出显著更强的图像-标题对齐，而非成员的生成标题可能偏离原始视觉内容。基于这一洞察，我们提出了一种新颖的MIA框架，旨在严格的黑箱和单样本设置下量化这种对齐关系，进而绕过这些不切实际的假设。我们在三个开源和两个闭源的VLMs上进行了广泛的实验。在VL-MIA/Flicker数据集上，我们的方法在LLaVA-1.5上实现了0.821的AUC，显著优于现有基线。此外，它在多种图像扰动下仍保持稳健，突显了其实用性。

cs.CV / 144 / 2605.17343

GraphMAR: Geometry-Aware Graph Learning Framework for Spatially Adaptive CT Metal Artifact Reduction

GraphMAR：一种几何感知的图学习框架用于空间自适应CT金属伪影去除

Li, Zilong, Ma, Chenglong, Lei, Yiming, Li, Yuanlin, Han, Jing, Liu, Jiannan, Xie, Huidong, Zhang, Junping, Zhang, Yi, Shan, Hongming

Abstract

Computed tomography (CT) metal artifact reduction (MAR) aims to reduce the severe streaking artifacts induced by metallic implants and other high-density objects. Effective MAR generally requires both accurate artifact localization and artifact removal. Sinogram-domain methods can exploit explicit geometric cues, such as metal traces, to identify metal-corrupted measurements, while requiring raw projection data, which is often unavailable in clinical and practical scenarios. Image-domain methods are more flexible and widely applicable, yet they usually lack comparable geometric guidance, limiting their ability to localize artifacts and leading to suboptimal results. To address this limitation, we propose GraphMAR, a geometry-aware learning framework for explicit artifact identification and spatially adaptive MAR in the image domain. The key idea is to introduce graph-based geometric modeling as an image-domain analogue of sinogram metal traces. Specifically, we first construct a geometric graph from the metal mask and derive a geometric density graph that coarsely localizes artifact-prone regions according to inter-implant geometry. We then design GraphMoE, a graph-routed mixture-of-experts module that builds a polar-coordinate artifact graph in feature space and adaptively routes different experts to different spatial regions for MAR. By aligning the learned routing maps with the geometric density graph, GraphMAR provides explicit and interpretable artifact localization while enabling region-adaptive artifact reduction. Experiments on both simulated and real-world datasets demonstrate that GraphMAR achieves superior MAR performance compared with existing methods. To the best of our knowledge, this is the first work to introduce graph-based modeling for CT MAR and to enable explicit artifact identification in the image domain, improving both restoration quality and interpretability.

Chinese Translation

计算机断层扫描（CT）金属伪影去除（MAR）旨在减少由金属植入物和其他高密度物体引起的严重条纹伪影。有效的MAR通常需要准确的伪影定位和伪影去除。正弦图域方法可以利用明确的几何线索，如金属痕迹，来识别受金属污染的测量值，但这需要原始投影数据，而原始投影数据在临床和实际场景中往往不可用。图像域方法更加灵活且广泛适用，但通常缺乏可比的几何指导，限制了它们定位伪影的能力，导致次优结果。为了解决这一局限性，我们提出了GraphMAR，一种用于图像域中显式伪影识别和空间自适应MAR的几何感知学习框架。其关键思想是引入基于图的几何建模，作为正弦图金属痕迹的图像域类比。具体而言，我们首先从金属掩模构建几何图，并根据植入物之间的几何关系推导出几何密度图，以粗略定位易受伪影影响的区域。然后，我们设计了GraphMoE，一个图路由混合专家模块，在特征空间中构建极坐标伪影图，并自适应地将不同的专家路由到不同的空间区域进行MAR。通过将学习到的路由图与几何密度图对齐，GraphMAR提供了明确且可解释的伪影定位，同时实现区域自适应的伪影去除。在模拟和真实世界数据集上的实验表明，GraphMAR相比现有方法实现了更优的MAR性能。据我们所知，这是首次将基于图的建模引入CT MAR，并在图像域中实现显式伪影识别，从而提高了恢复质量和可解释性。

cs.CV / 145 / 2605.17345

VoxShield: Protecting 3D Medical Datasets from Unauthorized Training via Frequency-Aware Inter-Slice Disruption

VoxShield：通过频率感知的层间干扰保护3D医学数据集免受未经授权的训练

Liu, Xinyao, Deng, Zhipeng, Jiang, Wenhan, Wang, Haolin, Lin, Xun, Ou, Yafei, Zheng, Yefeng

Abstract

The release of public 3D medical image segmentation (MIS) datasets accelerates clinical research but simultaneously heightens risks of unauthorized AI model training. While Unlearnable Examples (UE) offer protection by injecting imperceptible perturbations to prevent effective model learning, existing methods primarily target 2D scenarios. They neglect the volumetric spatial correlations and inter-slice anatomical consistency inherent in 3D medical volumes, which serve as critical learning priors for 3D segmentation networks. To bridge this gap, we propose VoxShield, a UE framework that explicitly targets the volumetric inductive biases of 3D networks. Our core insight is that by systematically dismantling the cross-slice continuity that 3D architectures rely on, we can fundamentally impair their spatial aggregation process. Specifically, we introduce an Inter-Slice Frequency Consistency Disruption mechanism that maximizes the spectral divergence between adjacent slices, injecting structural incoherence along the $z$-axis. Complementing this structural attack, a Semantic Prediction Disruption module is incorporated. By maximizing the $\ell_1$ divergence between clean and perturbed logits, it forces the injected noise to penetrate the entire network and corrupt the final semantic mapping. Experiments on BraTS19 and FLARE21 demonstrate that VoxShield successfully degrades 3D segmentation performance, reducing the DSC from 80.0% to near 0.0% and from 88.6% to 6.8%, respectively. All protections are achieved with minimal perturbation ($\epsilon=4/255$) to preserve high visual fidelity. The code is available at https://github.com/KK266299/VoxShield.

Chinese Translation

公共3D医学图像分割（MIS）数据集的发布加速了临床研究，但同时增加了未经授权的人工智能模型训练的风险。尽管不可学习示例（UE）通过注入不可察觉的扰动来防止有效模型学习，从而提供保护，但现有方法主要针对2D场景，忽视了3D医学体积中固有的体积空间相关性和层间解剖一致性，这些是3D分割网络的重要学习先验。为了解决这一问题，我们提出了VoxShield，一个明确针对3D网络体积归纳偏差的UE框架。我们的核心见解是，通过系统性地拆解3D架构所依赖的层间连续性，我们可以从根本上削弱其空间聚合过程。具体而言，我们引入了一种层间频率一致性干扰机制，最大化相邻切片之间的谱差异，在$z$轴上注入结构不一致性。作为对这种结构攻击的补充，我们还引入了语义预测干扰模块。通过最大化干净和扰动logits之间的$\ell_1$差异，它迫使注入的噪声渗透整个网络并破坏最终的语义映射。在BraTS19和FLARE21上的实验表明，VoxShield成功降低了3D分割性能，将DSC从80.0%降低到接近0.0%，从88.6%降低到6.8%。所有保护措施均在保持高视觉保真度的前提下，以最小的扰动（$\epsilon=4/255$）实现。代码可在https://github.com/KK266299/VoxShield获取。
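
The perturbation budget $\epsilon=4/255$ quoted in the abstract is typically enforced by projecting each perturbed voxel back into a small band around its clean value. The sketch below assumes an $\ell_\infty$ bound and intensities scaled to [0, 1]; both are assumptions, since the abstract states only the value of $\epsilon$. The function name `project_perturbation` is hypothetical and unrelated to the VoxShield repository.

```python
EPSILON = 4 / 255  # budget quoted in the abstract


def project_perturbation(clean, perturbed, epsilon=EPSILON):
    """Clip each perturbed value to within epsilon of the clean value.

    Assumes an L-infinity budget and intensities normalized to [0, 1].
    """
    projected = []
    for c, p in zip(clean, perturbed):
        delta = max(-epsilon, min(epsilon, p - c))      # bound the change
        projected.append(max(0.0, min(1.0, c + delta)))  # keep a valid intensity
    return projected


clean_voxels = [0.0, 0.5, 1.0]
noisy_voxels = [0.2, 0.49, 0.9]
protected = project_perturbation(clean_voxels, noisy_voxels)
```

After projection no voxel differs from its clean value by more than 4/255, which is about 1.6% of the intensity range and is why such perturbations are described as visually imperceptible.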

cs.CV / 146 / 2605.17354

GeoHand: Unlocking Prior Geometry Knowledge for Monocular 3D Hand Reconstruction

GeoHand：解锁单目3D手部重建的先验几何知识

Lin, Weiquan, Hu, Yaoqing, Dai, Liangchen, Tang, Xu, Chen, Xingyu

Abstract

Monocular 3D hand reconstruction is intrinsically a geometric problem, yet RGB appearance features alone often struggle to resolve severe ambiguities caused by self-occlusions and hand-object interactions. While introducing depth can explicitly provide spatial cues, raw sensor-captured depth maps are extensively noisy and incomplete, limiting their usefulness for fine-grained hand reconstruction. To bridge this gap, we propose GeoHand, a novel framework that unlocks high-quality geometric priors from a frozen foundational monocular geometry estimator (MoGe2). Recognizing that these priors are oriented toward general scenes, we introduce a map-level GeoAdapter to recalibrate the spatial features, specifically adapting them for detailed hand reconstruction. Furthermore, to systematically integrate these adapted priors without overwhelming intrinsic RGB appearance cues, we employ a gated cross-modal token fusion strategy. Finally, to secure precise local articulation, we design a Keypoint-Queried Iterative Refiner (KQIR) that uses projected joint locations to query geometry-aware image features for spatial correction. By combining global geometric disambiguation with local refinement in a unified pipeline, GeoHand achieves state-of-the-art performance on FreiHAND, DexYCB, and HO3Dv3, especially under severe occlusions and hand-object interactions.

Chinese Translation

单目3D手部重建本质上是一个几何问题，但仅依靠RGB外观特征往往难以解决因自遮挡和手物体交互所造成的严重歧义。尽管引入深度信息可以明确提供空间线索，但原始传感器捕获的深度图通常噪声较大且不完整，限制了其在细粒度手部重建中的有效性。为了解决这一问题，我们提出了GeoHand，一个新颖的框架，能够从冻结的基础单目几何估计器（MoGe2）中解锁高质量的几何先验。我们认识到这些先验主要针对一般场景，因此引入了一个地图级别的GeoAdapter来重新校准空间特征，特别是将其适应于详细的手部重建。此外，为了系统性地整合这些适应后的先验而不淹没内在的RGB外观线索，我们采用了一种门控跨模态令牌融合策略。最后，为了确保精确的局部关节运动，我们设计了一个关键点查询迭代细化器（KQIR），利用投影的关节位置查询几何感知的图像特征以进行空间校正。通过在统一的流程中结合全局几何消歧和局部细化，GeoHand在FreiHAND、DexYCB和HO3Dv3上实现了最先进的性能，尤其是在严重遮挡和手物体交互的情况下。

cs.CV / 147 / 2605.17356

UniPPTBench: A Unified Benchmark for Presentation Generation Across Diverse Input Settings

UniPPTBench：一个针对多样输入设置的演示生成统一基准

Zhao, Bo, Pang, Maosheng, Zhang, Chen, Yang, Huan, Cao, Yixin, Ji, Wei

Abstract

Existing works typically focus on presentation generation under isolated input settings, whereas real-world use cases span diverse scenarios, including vague user prompts, long documents, multimodal materials, and multiple heterogeneous sources. Moreover, current evaluations are often insufficiently scenario-specific. They mainly rely on generic presentation-quality criteria, such as visual appeal, layout quality, and overall coherence, but fail to assess the core capabilities required by different input settings, including grounded compression, visual-text alignment, and cross-source synthesis. Consequently, the field lacks a unified benchmark and a scenario-aware evaluation framework for faithfully diagnosing presentation-generation systems across diverse real-world settings. We present UniPPTBench, a unified benchmark for presentation generation across four representative input settings: vague-prompt, long-document, multimodal-document, and multi-source generation. We further introduce UniPPTEval, a scenario-aware evaluation protocol that combines shared metrics for cross-setting comparison with scenario-specific metrics tailored to the core requirements of each setting. We also provide transparent reference baselines to support reproducible comparison. Experiments on UniPPTBench reveal substantial performance variation across settings and recurring failure modes in content grounding, multimodal integration, and cross-source synthesis. In particular, strong performance on generic presentation-quality metrics does not necessarily imply strong task fulfillment in grounded scenarios. Together, UniPPTBench and UniPPTEval provide a faithful and diagnostic foundation for evaluating presentation generation across diverse real-world scenarios. Code and data will be publicly available.

Chinese Translation

现有研究通常集中于孤立输入设置下的演示生成，而现实世界的使用案例涵盖了多种场景，包括模糊的用户提示、长文档、多模态材料和多个异构来源。此外，目前的评估往往缺乏足够的场景特异性。它们主要依赖于通用的演示质量标准，如视觉吸引力、布局质量和整体连贯性，但未能评估不同输入设置所需的核心能力，包括基于内容的压缩、视觉-文本对齐和跨源合成。因此，该领域缺乏一个统一的基准和一个场景感知的评估框架，以真实地诊断在多样现实世界设置下的演示生成系统。我们提出了UniPPTBench，这是一个针对四种代表性输入设置的演示生成统一基准：模糊提示、长文档、多模态文档和多源生成。我们进一步介绍了UniPPTEval，这是一个场景感知的评估协议，结合了用于跨设置比较的共享指标和针对每个设置核心要求量身定制的场景特异性指标。我们还提供了透明的参考基线，以支持可重复的比较。在UniPPTBench上的实验揭示了不同设置之间显著的性能差异以及内容基础、跨模态整合和跨源合成中的反复失败模式。特别是，在通用演示质量指标上表现强劲并不一定意味着在基于内容的场景中任务的强大完成度。总之，UniPPTBench和UniPPTEval为评估多样现实世界场景下的演示生成提供了一个真实且具有诊断性的基础。代码和数据将公开发布。

cs.CV / 148 / 2605.17360

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

Omni-DuplexEval：评估实时双向全模态交互

He, Chaoqun, Xiang, Mingyang, Xu, Yingjing, Xu, Bokai, Cui, Junbo, Zhou, Jie, Yao, Yuan, Wen, Lijie

Abstract

Real-time duplex interaction is essential for multimodal AI systems operating in real-world scenarios, where models must continuously process streaming inputs and respond at appropriate moments. However, most existing multimodal large language models (MLLMs) are evaluated in offline settings, where the entire video input is processed before any response is generated. While recent work has started to explore real-time duplex MLLMs, there is still no comprehensive benchmark or automatic evaluation method for this setting. To address this gap, we propose Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction. The benchmark consists of two complementary scenarios: (1) Real-Time Description, which evaluates the ability to generate continuous, time-aligned responses that track evolving multimodal inputs, and (2) Proactive Reminder, which evaluates the ability to identify salient events and respond at appropriate moments. Omni-DuplexEval contains 660 videos with fine-grained, human-annotated labels and precise temporal metadata, spanning 9 tasks grounded in real-world scenarios, where all questions are formulated as open-ended queries. We further introduce an automatic evaluation framework based on LLM-as-a-Judge, which enables systematic assessment by jointly evaluating response-content alignment and response timing through timestamp-aware and sequential reasoning, achieving strong alignment with human judgments. Experiments on state-of-the-art duplex MLLMs reveal substantial limitations. The best-performing model achieves only 39.6% overall, while scoring only 20.0% on Proactive Reminder. Our analysis identifies two key challenges: models struggle to balance timely responses with coherent, holistic content generation, and they often fail to determine both when to respond and what to produce. We hope our work facilitates further progress in MLLMs.

Chinese Translation

实时双向交互对于在现实场景中运行的多模态人工智能系统至关重要，这些系统必须持续处理流媒体输入并在适当时刻做出响应。然而，目前大多数现有的多模态大型语言模型（MLLMs）是在离线环境中进行评估的，在这种情况下，整个视频输入在生成任何响应之前都会被处理。尽管近期的研究开始探索实时双向MLLMs，但仍然缺乏全面的基准或自动评估方法来适应这一设置。为了解决这一空白，我们提出了Omni-DuplexEval，这是一个系统评估实时双向交互的基准。该基准由两个互补的场景组成：（1）实时描述，评估生成连续、时间对齐的响应以跟踪不断变化的多模态输入的能力；（2）主动提醒，评估识别显著事件并在适当时刻做出响应的能力。Omni-DuplexEval包含660个视频，配有细粒度的人类标注标签和精确的时间元数据，涵盖9个基于现实场景的任务，所有问题均以开放式查询的形式提出。我们进一步引入了基于LLM-as-a-Judge的自动评估框架，通过时间戳感知和顺序推理，联合评估响应内容的一致性和响应时机，从而实现与人类判断的强一致性。对最先进的双向MLLMs的实验揭示了显著的局限性。表现最佳的模型整体得分仅为39.6%，而在主动提醒任务上的得分仅为20.0%。我们的分析识别出两个关键挑战：模型在及时响应与连贯、整体内容生成之间难以平衡，并且通常无法确定何时响应以及生成什么内容。我们希望我们的工作能够促进多模态大型语言模型的进一步发展。

cs.CV / 149 / 2605.17365

Memory-Augmented Query Intent Understanding for Efficient Chat-based Image Retrieval

基于记忆增强的查询意图理解用于高效的聊天式图像检索

Chen, Xianke, Liu, Daizong, Lou, Yushuo, Tan, Xin, Yang, Xun, Wang, Shuhui, Wang, Xun, Dong, Jianfeng

Abstract

Different from traditional text-to-image retrieval tasks, chat-based image retrieval allows the human-interactive system to iteratively clarify and refine user intent through multi-round dialogue, thereby achieving more fine-grained retrieval results. The key challenge in this task lies in dynamically understanding and updating the user's query intent across dialogue rounds. Although existing works have achieved great performance on this new task, they simply handle history query information either by directly concatenating all previous queries into a long textual sequence or by relying on large language models to reconstruct the current query from history. Such strategies are computationally redundant and easily lead to inconsistent intent representations as the dialogue progresses. To alleviate these issues, this paper proposes a novel and efficient memory-based user intent updating framework for the chat-based image retrieval task, called Memory-Augmented Query Intent Understanding (MAQIU). It introduces a lightweight memorization module that dynamically aggregates and evolves the semantic representation of query intent across dialogues, while a memory recall mechanism is further employed to prevent intent forgetting and enhance long-term semantic integrity. In addition, MAQIU also integrates historical image retrieval results as visual guidance, allowing the model to strengthen cross-round correlations and refine current visual understanding. Extensive experiments demonstrate that MAQIU achieves substantial performance gains while maintaining high computational efficiency, reducing dialogue encoding FLOPs by 86.4% compared with the prior baseline ChatIR. Source code is available at https://github.com/HuiGuanLab/MAQIU.

Chinese Translation

与传统的文本到图像检索任务不同，聊天式图像检索允许人机交互系统通过多轮对话迭代地澄清和细化用户意图，从而实现更细粒度的检索结果。该任务的关键挑战在于动态理解和更新用户在对话轮次中的查询意图。尽管现有研究在这一新任务上取得了良好的性能，但它们简单地通过将所有先前的查询直接连接成一个长文本序列，或依赖大型语言模型从历史中重构当前查询来处理历史查询信息。这些策略在计算上是冗余的，并且在对话进行时容易导致意图表示的不一致。为了解决这些问题，本文提出了一种新颖且高效的基于记忆的用户意图更新框架，称为记忆增强查询意图理解（Memory-Augmented Query Intent Understanding，MAQIU）。该框架引入了一个轻量级的记忆模块，动态聚合和演变对话中的查询意图的语义表示，同时进一步采用记忆回忆机制以防止意图遗忘并增强长期语义完整性。此外，MAQIU还将历史图像检索结果整合为视觉指导，使模型能够加强跨轮次的关联性并细化当前的视觉理解。大量实验表明，MAQIU在保持高计算效率的同时实现了显著的性能提升，与之前的基线ChatIR相比，减少了86.4%的对话编码FLOPs。源代码可在 https://github.com/HuiGuanLab/MAQIU 获取。

cs.CV / 150 / 2605.17367

Bridging Data Trials and Task Barriers: A Unified Framework for Sketch Biometric Identification

弥合数据试验与任务障碍：一个统一的素描生物识别框架

Liu, Decheng, Hu, Bin, Gao, Xinbo, Zhou, Dawei, Peng, Chunlei, Wang, Nannan, Hu, Ruimin

Abstract

Different from existing cross-modality identification tasks (e.g., heterogeneous face recognition, sketch re-identification, etc.), we introduce a novel yet practical setting for these related identification tasks, named sketch biometric identification, which aims to continually train a unified model across different data domains and even across diverse identification tasks. Sketch biometric identification faces challenges, including scarce real sketch data, high annotation costs, privacy risks, and insufficient generalization ability of cross-task models. Existing methods usually rely on limited real data or single-task optimization, making it difficult to effectively address the joint challenges of cross-modality and cross-task learning. This paper proposes a unified framework that integrates efficient synthetic sketch generation and task-sequential continual learning. First, we design an efficient pipeline to generate large-scale, high-quality synthetic person and face sketch data, which significantly reduces costs and avoids privacy risks. Meanwhile, we enhance the model's robustness by fusing real data. Second, we construct a universal unified framework for sketch biometric identification, which adopts a task-sequential training strategy: the model first completes sketch person re-identification learning on the person dataset; subsequently, it maintains the acquired person recognition capability through a trusted sample replay technique and seamlessly performs incremental training on the face dataset. This enables a single model to handle multiple sketch biometric identification tasks simultaneously. To support the study of sketch biometric identification, we built a new large-scale benchmark, SketchUnified-BioID, with several practical evaluation protocols.

Chinese Translation

与现有的跨模态识别任务（例如，异构人脸识别、素描重识别等）不同，我们引入了一种新颖且实用的设置，称为素描生物识别，旨在跨不同数据领域甚至多样化的识别任务中持续训练一个统一模型。素描生物识别面临诸多挑战，包括真实素描数据稀缺、高昂的标注成本、隐私风险以及跨任务模型的泛化能力不足。现有方法通常依赖有限的真实数据或单任务优化，难以有效应对跨模态和跨任务的联合挑战。本文提出了一个统一框架，整合了高效的合成素描生成和任务顺序持续学习。首先，我们设计了一个高效的流程，以生成大规模高质量的合成人物和人脸素描数据，显著降低成本并避免隐私风险。同时，通过融合真实数据增强模型的鲁棒性。其次，我们构建了一个通用的素描生物识别统一框架，采用任务顺序训练策略：模型首先在人物数据集上完成素描人物重识别学习；随后，通过可信样本重放技术保持获得的人物识别能力，并在面部数据集上无缝进行增量训练。这使得单一模型能够同时处理多个素描生物识别任务的跨任务能力。为了支持上述素描生物识别的研究，我们建立了一个新的大规模基准数据集SketchUnified-BioID，并设定了多个实用的评估协议。

cs.CV / 151 / 2605.17368

RadGenome-Anatomy: A Large-Scale Anatomy-Labeled Chest Radiograph Dataset via Physically Grounded Volumetric Projection

RadGenome-Anatomy：通过物理基础的体积投影构建的大规模解剖标注胸部放射影像数据集

Ye, Shuchang, Meng, Mingyuan, Wang, Hao, Naseem, Usman, Kim, Jinman

Abstract

Anatomical structure labels for chest radiographs are essential for medical image segmentation and a broad range of downstream diagnostic tasks. However, annotating anatomy directly on 2D chest radiographs is labor-intensive and intrinsically ambiguous, as 3D anatomical structures are projected onto a single 2D plane where boundaries may overlap, be occluded, or appear only partially visible. Consequently, existing anatomy-labeled chest radiograph datasets remain limited in scale, anatomy coverage, and label reliability. To address these limitations, we introduce RadGenome-Anatomy, the largest anatomy-labeled chest radiograph dataset, containing over 10 million segmentation masks across 210 anatomical structures in 25,692 studies. It is constructed by projecting large-scale 3D anatomical masks from CT volumes into 2D radiographic space through canonical radiographic geometry. This shifts annotation from directly tracing uncertain 2D boundaries to defining anatomy in volumetric space, where structures that overlap or become partially invisible in radiographs remain spatially separable. As a result, each 2D mask represents the physically grounded projected footprint of a volumetrically defined structure. The scale and broad anatomical coverage of RadGenome-Anatomy, including structures that are overlapping, partially visible, or difficult to delineate directly, enable research on geometric measurements as explicit evidence for chest radiograph interpretation. We demonstrate this by training XAnatomy to predict structure-specific masks and derive clinically relevant measurements, achieving diagnostic accuracies of 96.4%, 95.6%, and 89.2% for cardiomegaly, kyphosis, and scoliosis, respectively.

Chinese Translation

胸部放射影像的解剖结构标签对于医学图像分割和广泛的下游诊断任务至关重要。然而，直接在二维胸部放射影像上进行解剖标注劳动强度大且本质上存在模糊性，因为三维解剖结构投影到单一的二维平面上，边界可能重叠、被遮挡或仅部分可见。因此，现有的解剖标注胸部放射影像数据集在规模、解剖覆盖和标签可靠性方面仍然有限。为了解决这些限制，我们推出了RadGenome-Anatomy，这是最大的解剖标注胸部放射影像数据集，包含超过1000万个分割掩膜，涵盖210个解剖结构，涉及25,692个研究。该数据集通过将大规模的三维解剖掩膜从CT体积投影到二维放射影像空间中，采用规范的放射几何构建。这一方法将标注从直接描绘不确定的二维边界转变为在体积空间中定义解剖结构，在这里，重叠或在放射影像中部分不可见的结构仍然可以在空间上分离。因此，每个二维掩膜代表了一个在体积上定义的结构的物理基础投影足迹。RadGenome-Anatomy的规模和广泛的解剖覆盖，包括重叠、部分可见或难以直接描绘的结构，使得对几何测量的研究成为可能，作为胸部放射影像解读的明确证据。我们通过训练XAnatomy来预测特定结构的掩膜并推导临床相关的测量，达到了心脏肥大、驼背和脊柱侧弯的诊断准确率分别为96.4%、95.6%和89.2%。
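
The projection idea in this abstract can be illustrated with a minimal sketch. It assumes a simple parallel (orthographic) projection along the depth axis, whereas the abstract refers to canonical radiographic geometry, so this is a simplification and not the dataset's pipeline. The toy volumes `heart` and `spine` and the helper `project_mask` are hypothetical names used only for illustration.

```python
def project_mask(volume):
    """Orthographic projection of a binary 3D mask indexed as [depth][row][col].

    A pixel in the 2D footprint is 1 if any voxel along the depth axis is 1.
    """
    depth = len(volume)
    rows = len(volume[0])
    cols = len(volume[0][0])
    return [
        [int(any(volume[d][r][c] for d in range(depth))) for c in range(cols)]
        for r in range(rows)
    ]


# Two toy structures at different depths whose footprints share one pixel.
heart = [
    [[0, 0, 0], [0, 1, 1], [0, 0, 0]],  # depth 0
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],  # depth 1
]
spine = [
    [[0, 0, 0], [0, 0, 0], [0, 0, 0]],  # depth 0
    [[0, 1, 0], [0, 1, 0], [0, 1, 0]],  # depth 1
]

heart_2d = project_mask(heart)
spine_2d = project_mask(spine)

# Count shared voxels in 3D and shared pixels in 2D.
overlap_3d = sum(
    heart[d][r][c] and spine[d][r][c]
    for d in range(2) for r in range(3) for c in range(3)
)
overlap_2d = sum(
    heart_2d[r][c] and spine_2d[r][c] for r in range(3) for c in range(3)
)
print(overlap_3d, overlap_2d)
```

The two structures share no voxel in the volume, yet their projected footprints share a pixel, which is the kind of overlap that makes direct 2D annotation ambiguous.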

cs.CV / 152 / 2605.17423

Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration

Soap2Soap：通过多智能体协作进行长篇电影视频重制

Song, Yiren, Zhong, Huilin, Lin, Kevin Qinghong, Wang, Haofan, Shou, Mike Zheng

Abstract

We study series-level cinematic remaking, a long-horizon video-to-video generation problem that localizes full episodes or films via stylization or actor replacement while strictly preserving narrative structure, motion choreography, and character identity across hundreds of shots. Existing video generation and editing pipelines often break down in this regime due to compounding identity drift, background mutation, and semantic erosion under large camera motions and viewpoint changes. We propose Soap2Soap, a multi-agent framework that enforces long-term language-visual consistency through a Dual-Bridge Consistency mechanism: a scene-aware JSON screenplay serving as a persistent semantic backbone, and dynamically allocated visual reference anchors at both scene and shot levels. To suppress drift before video synthesis, we introduce batch keyframe consistency, jointly generating multiple keyframes in a shared latent context via a grid-based formulation. A closed-loop verification agent further audits identity, stability, and alignment to trigger selective regeneration. Experiments on SoapBench demonstrate strong improvements over commercial video generation APIs in long-term consistency and narrative fidelity.

Chinese Translation

我们研究系列级别的电影重制，这是一种长时间跨度的视频到视频生成问题，通过风格化或演员替换来定位完整的剧集或电影，同时严格保持叙事结构、动作编排和角色身份在数百个镜头中的一致性。现有的视频生成和编辑流程在这一领域常常出现问题，原因在于身份漂移、背景变异以及在大幅度相机运动和视角变化下的语义侵蚀。我们提出了Soap2Soap，一个通过双桥一致性机制来强制执行长期语言-视觉一致性的多智能体框架：一个场景感知的JSON剧本作为持久的语义支撑，以及在场景和镜头层面动态分配的视觉参考锚点。为了在视频合成之前抑制漂移，我们引入了批量关键帧一致性，通过基于网格的公式在共享潜在上下文中联合生成多个关键帧。一个闭环验证智能体进一步审计身份、稳定性和对齐，以触发选择性再生。在SoapBench上的实验表明，在长期一致性和叙事保真度方面，相较于商业视频生成API有显著提升。

cs.CV / 153 / 2605.17433

VISTA: Variance-Gated Inter-Sequence Test-Time Adaptation for Multi-Sequence MRI Segmentation

VISTA：用于多序列MRI分割的方差门控序列间测试时适应

Deng, Zhipeng, Zhou, Jiale, Jiang, Wenhan, Wang, Haolin, Lin, Xun, Ou, Yafei, Zheng, Yefeng

Abstract

Deploying multi-sequence magnetic resonance imaging (MRI) segmentation models to new clinical environments is challenging due to variations in scanners and acquisition protocols. Although existing TTA methods handle basic per-modality shifts, they often fail under a fundamental dual-shift problem, as their adaptation signals fail to capture modality-interaction shifts that disrupt inter-sequence consistency. To address this, we propose Variance-gated Inter-Sequence Test-time Adaptation (VISTA), a source-free framework that tackles modality-interaction shifts. First, we design an Inter-Sequence Intervention Generator (ISIG) that generates a set of consistency probes by swapping low-frequency spectra and entropy-localized patches across sequences, preserving anatomical semantics while challenging inter-sequence dependencies. Second, we introduce Cross-View Disagreement-Aware Pseudo Labeling (CDPL), which establishes a voxel-wise reliability metric using cross-view disagreement variance to dynamically gate self-training and enforce interventional consistency, encouraging the network to rely on robust anatomical semantics. Extensive experiments adapting from standard adult MRI (BraTS-GLI-Pre) to African low-field (BraTS-SSA) and pediatric (BraTS-PED) cohorts show improved performance over competing methods under clinical shifts, achieving absolute Dice improvements of +1.89% (SSA) and +2.82% (PED) over the source model. The code is available at https://github.com/dzp2095/VISTA.

Chinese Translation

将多序列磁共振成像（MRI）分割模型部署到新的临床环境中面临挑战，原因在于扫描仪和采集协议的变化。虽然现有的测试时适应（TTA）方法能够处理基本的每种模态的变化，但在根本的双重变化问题下，它们往往失败，因为其适应信号无法捕捉到干扰序列间一致性的模态交互变化。为了解决这一问题，我们提出了方差门控序列间测试时适应（VISTA），这是一个无源框架，旨在应对模态交互变化。首先，我们设计了一个序列间干预生成器（ISIG），通过在序列间交换低频谱和熵局部化补丁，生成一组一致性探针，保留解剖语义，同时挑战序列间的依赖关系。其次，我们引入了跨视图不一致感知伪标注（CDPL），该方法利用跨视图不一致方差建立体素级可靠性指标，以动态门控自我训练并强制实施干预一致性，鼓励网络依赖于稳健的解剖语义。大量实验表明，从标准成人MRI（BraTS-GLI-Pre）适应到非洲低场（BraTS-SSA）和儿童（BraTS-PED）队列，在临床变化下，相较于竞争方法，性能得到了提升，绝对Dice系数提高了+1.89%（SSA）和+2.82%（PED）。代码可在 https://github.com/dzp2095/VISTA 获取。
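
A rough sketch of variance-gated pseudo labeling, as described in this abstract, might look as follows. The abstract does not give the exact reliability metric, so the population variance across views, the thresholds `max_variance` and `threshold`, and the function name `gate_pseudo_labels` are all assumptions made for illustration.

```python
from statistics import mean, pvariance


def gate_pseudo_labels(view_probs, max_variance=0.01, threshold=0.5):
    """Return (pseudo_label, is_reliable) for each voxel.

    view_probs holds, per voxel, the foreground probabilities predicted
    for several intervened views of the same input.
    """
    results = []
    for probs in view_probs:
        label = int(mean(probs) >= threshold)
        # Low disagreement across views marks the voxel as reliable.
        reliable = pvariance(probs) <= max_variance
        results.append((label, reliable))
    return results


voxels = [
    [0.92, 0.90, 0.94],  # views agree on foreground
    [0.10, 0.85, 0.40],  # views disagree
    [0.05, 0.08, 0.03],  # views agree on background
]
gated = gate_pseudo_labels(voxels)
print(gated)
```

In a self-training loop, only voxels flagged as reliable would contribute to the pseudo-label loss; the second voxel above would be masked out.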

cs.CV / 154 / 2605.17436

Medical Context Distorts Decisions in Clinical Vision Language Models

医学背景扭曲临床视觉语言模型中的决策

Restrepo, David, Ktena, Ira, Vakalopoulou, Maria, Christodoulidis, Stergios, Ferrante, Enzo

Abstract

Vision-language models (VLMs) are increasingly proposed for clinical decision support, yet their reliability in real-world scenarios that require integrating both visual and textual context from medical records remains poorly characterized. This paper identifies three failure modes: (1) modality over-reliance on text over images, (2) spurious reliance on irrelevant clinical history, and (3) prompt sensitivity across semantically equivalent inputs. We evaluate a diverse set of general-domain and medically-tuned open and closed VLMs on chest x-ray tasks using MIMIC-CXR. By systematically manipulating image-text alignment, clinical history, and prompt formulations, we found that VLM decisions are dominated by the text modality, even when visual evidence is available. Moreover, we observed that VLMs are heavily influenced by irrelevant reports, while minor prompt changes can reverse correct image-based predictions. Our findings underscore the need for explicit safeguards and stress-testing before considering the use of these models in clinical practice.

Chinese Translation

视觉语言模型（VLMs）越来越多地被提议用于临床决策支持，但它们在需要整合来自医疗记录的视觉和文本背景的真实场景中的可靠性仍然缺乏充分的表征。本文识别了三种失效模式：（1）对文本的过度依赖而忽视图像，（2）对无关临床历史的虚假依赖，以及（3）在语义等价输入之间的提示敏感性。我们在胸部X光任务中使用MIMIC-CXR评估了一组多样的通用领域和医学调优的开放和封闭VLMs。通过系统地操控图像-文本对齐、临床历史和提示表述，我们发现即使在有视觉证据的情况下，VLM的决策仍然受到文本模态的主导。此外，我们观察到VLMs受到无关报告的严重影响，而微小的提示变化可能会逆转基于图像的正确预测。我们的发现强调了在考虑将这些模型应用于临床实践之前，明确的保护措施和压力测试的必要性。

cs.CV / 155 / 2605.17447

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

FastOCR：通过 KV 缓存剪枝实现动态视觉注视以提高文档解析效率

Tang, Zihan, Shen, Leqi, Chen, Hui, Wang, Ao, Wan, Ben, Feng, Yan, Zhang, Ke, Zhao, Sicheng, Liu, Tongxuan, Ding, Guiguang

Abstract

Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0×.

Chinese Translation

视觉-语言模型（VLMs）在光学字符识别（OCR）方面展现出强大的潜力，但编码密集文档所需的视觉标记数量庞大，导致推理成本高昂。现有的剪枝方法依赖于物理驱逐，例如在预填充阶段永久丢弃视觉标记。尽管这种策略对自然图像有效，但在OCR中基本失效，因为几乎每个视觉标记都可能对应一个字符或结构元素，任何不可逆的损失都会导致灾难性的准确度下降。我们观察到，尽管文档图像在全局上看起来密集且似乎无法剪枝，但模型对它们的注意力实际上是时间上稀疏的：在每个解码步骤中，它集中于一个小区域，该区域在步骤之间逐渐移动，类似于人类读者逐字注视，而不是一次性感知整页内容。受到这一动态视觉注视现象的启发，我们将难以处理的全局剪枝问题重新表述为可处理的局部动态问题，并提出了FastOCR，一个无需训练的框架，包含两个互补模块。具体而言，Focal-Guided Pruning识别一小组焦点层，并在每个步骤中从中选择最相关的视觉标记，而Cross-Step Fixation Reuse利用注视的逐步移动，从上一个步骤热启动每个步骤。通过动态调整关注的标记而不是从缓存中驱逐任何标记，FastOCR避免了永久性信息损失。大量实验表明，FastOCR作为一个即插即用的加速模块，在五种不同大小和架构的VLMs中始终保持一致的泛化能力。在Qwen2.5-VL上，FastOCR在每个解码步骤中仅关注5%的视觉标记，同时保留了未剪枝模型98%的准确度，将注意力延迟减少了3.0倍。
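
The contrast between evicting tokens and selecting them per step can be sketched as below. This is a toy illustration under stated assumptions, not FastOCR itself: the attention scores are made up, `select_focus` is a hypothetical helper, and the focal-layer selection and warm-start reuse described in the abstract are not modeled.

```python
def select_focus(attn_scores, keep_ratio=0.05):
    """Return sorted indices of the highest-scoring visual tokens for one step."""
    k = max(1, round(len(attn_scores) * keep_ratio))
    ranked = sorted(
        range(len(attn_scores)), key=lambda i: attn_scores[i], reverse=True
    )
    return sorted(ranked[:k])


# Stand-in for cached keys/values of 40 visual tokens; nothing is evicted.
kv_cache = [f"token_{i}" for i in range(40)]

# Made-up attention scores whose peak drifts by one token between steps.
scores_step1 = [0.0] * 40
scores_step1[10], scores_step1[11] = 0.6, 0.3
scores_step2 = [0.0] * 40
scores_step2[11], scores_step2[12] = 0.5, 0.4

focus1 = select_focus(scores_step1)
focus2 = select_focus(scores_step2)
print(focus1, focus2, len(kv_cache))
```

Because the full cache stays in place, a token that is skipped at one step (such as token 12 at the first step) remains available when the fixation moves onto it.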

cs.CV / 156 / 2605.17449

Spatial Blindness in Whole-Slide Multiple Instance Learning

全幻灯片多实例学习中的空间盲性

Li, Xiangyu, Su, Ran

Abstract

Whole-slide MIL models are often called context-aware once graphs, Transformers, or state-space modules are placed above patch embeddings. We show that this label can be deceptive. On pathology tasks where tissue architecture is part of the diagnostic signal, several strong MIL baselines retain nearly unchanged slide-level AUC after patch coordinates are permuted. Their predictions are accurate, but largely compositional. We refer to this failure mode as spatial blindness. Our explanation is optimization-based: dense appearance statistics are learned early under slide-level supervision, leaving weak gradients for sparse spatial relations. ResTopoMIL addresses the issue by first fitting a permutation-invariant prototype histogram and then freezing it while a lightweight graph branch learns the residual under a coordinate-shuffling constraint. The architecture is simple by design; the intervention is in how the spatial branch is trained. Across 9 public WSI benchmarks, ResTopoMIL improves classification and survival prediction with 1.15M parameters, restores sensitivity to coordinate perturbation, and gives stronger localization evidence on CAMELYON-16.

Chinese Translation

一旦在补丁嵌入之上加入图、Transformer或状态空间模块，全幻灯片多实例学习（MIL）模型往往就被称为具有上下文感知能力。我们表明，这一标签可能具有误导性。在组织结构属于诊断信号一部分的病理学任务中，多个强大的MIL基线在补丁坐标被置换后，幻灯片级别的AUC几乎保持不变。它们的预测是准确的，但在很大程度上是组合性的。我们将这种失败模式称为空间盲性。我们的解释基于优化：在幻灯片级别的监督下，模型会较早学到密集的外观统计信息，留给稀疏空间关系的梯度较弱。ResTopoMIL首先拟合一个置换不变的原型直方图，然后将其冻结，同时让一个轻量级图分支在坐标打乱约束下学习残差，从而解决这一问题。该架构设计简单；干预在于空间分支的训练方式。在9个公共全幻灯片图像（WSI）基准测试中，ResTopoMIL以115万参数提高了分类和生存预测的性能，恢复了对坐标扰动的敏感性，并在CAMELYON-16上提供了更强的定位证据。
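
The coordinate-permutation check behind the term spatial blindness can be sketched with two toy scoring functions. Both scorers, the 1D patch grid, and the helper `permute_coordinates` are hypothetical stand-ins chosen for illustration; they are not the models or the protocol evaluated in the paper.

```python
def compositional_score(patches):
    """Ignores coordinates: mean patch feature, so it is permutation-invariant."""
    return sum(feature for _, feature in patches) / len(patches)


def spatial_score(patches):
    """Uses coordinates: mean absolute feature difference between neighbours."""
    ordered = [feature for _, feature in sorted(patches)]
    diffs = [abs(a - b) for a, b in zip(ordered, ordered[1:])]
    return sum(diffs) / len(diffs)


def permute_coordinates(patches, new_coords):
    """Reassign coordinates while keeping every patch feature unchanged."""
    return [(c, feature) for c, (_, feature) in zip(new_coords, patches)]


# (coordinate, feature) pairs: three contiguous "tumour" patches, then normal.
slide = [(0, 1.0), (1, 1.0), (2, 1.0), (3, 0.0), (4, 0.0), (5, 0.0)]
shuffled = permute_coordinates(slide, [0, 2, 4, 1, 3, 5])

print(compositional_score(slide), compositional_score(shuffled))
print(spatial_score(slide), spatial_score(shuffled))
```

A model whose output behaves like `compositional_score` passes unchanged through the permutation, which is the signature the abstract describes; a model that actually reads tissue layout should move, as `spatial_score` does here.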

cs.CV / 157 / 2605.17451

DeTrack: A Benchmark and Altitude-Aware Dual World Model for Drone-embodied Tracking

DeTrack：一种基准测试和高度感知的双世界模型用于无人机主体跟踪

Hu, Guyue, Liu, Haoming, Song, Siyuan, Li, Chenglong, Chen, Feng, Tang, Jin

Abstract

Aerial object tracking has broad applications in public safety, emergency rescue, wildlife monitoring, and related fields. However, existing aerial tracking benchmarks are mainly based on passive 2D video sequences captured from fixed camera locations or predefined flight paths, where drones are treated as passive cameras rather than embodied agents that actively perceive, interact, and control their motion in dynamic 3D scenes. In this paper, we define a new drone-embodied tracking task, termed DeTrack, which requires a drone to track a target in interactive 3D environments using online egocentric observations and active flight control in a closed loop. We build a large-scale benchmark containing 11,368 target trajectories across diverse scenes, rendering conditions, semantic regions, and moving distractors, together with evaluation metrics for target visibility, tracking accuracy, and trajectory success. We further propose AaDWorlds, an altitude-aware dual world model framework for drone-embodied tracking. AaDWorlds consists of an altitude-aware perception module and dual world models that imagine future states under both high- and low-altitude regimes. By combining pseudo altitude-aware observations and imagined future states, AaDWorlds alleviates the intrinsic altitude-mediated contradiction between target visibility and flight safety. Experiments on the DeTrack benchmark demonstrate that AaDWorlds improves closed-loop tracking performance across all evaluation metrics.

Chinese Translation

空中物体跟踪在公共安全、紧急救援、野生动物监测及相关领域具有广泛的应用。然而，现有的空中跟踪基准主要基于从固定摄像头位置或预定义飞行路径捕获的被动2D视频序列，其中无人机被视为被动摄像机，而非在动态3D场景中主动感知、互动和控制其运动的主体代理。在本文中，我们定义了一项新的无人机主体跟踪任务，称为DeTrack，该任务要求无人机在互动3D环境中使用在线自我中心观察和闭环主动飞行控制来跟踪目标。我们构建了一个大规模基准，包含11,368个目标轨迹，涵盖多样的场景、渲染条件、语义区域和移动干扰物，并提供了目标可见性、跟踪精度和轨迹成功率的评估指标。我们进一步提出了AaDWorlds，一个高度感知的双世界模型框架用于无人机主体跟踪。AaDWorlds由一个高度感知的感知模块和双世界模型组成，能够在高空和低空状态下想象未来状态。通过结合伪高度感知观察和想象的未来状态，AaDWorlds缓解了目标可见性与飞行安全之间的内在高度介导矛盾。在DeTrack基准上的实验表明，AaDWorlds在所有评估指标上都提高了闭环跟踪性能。

cs.CV / 158 / 2605.17456

GCE-MIL: Faithful and Recoverable Evidence for Multiple Instance Learning in Whole-Slide Imaging

GCE-MIL：全幻灯片成像中多实例学习的可靠且可恢复的证据

Li, Xiangyu, Su, Ran

Abstract

Multiple instance learning (MIL) is the standard approach for whole-slide image (WSI) classification and survival prediction, where attention-based models aggregate patch features into slide-level predictions. These models treat attention weights as evidence for their predictions, but attention is optimized for classification, not for identifying which patches actually support the diagnosis. This conflation leads to three failures: selected patches are insufficient (keeping them alone drops Macro-F1 by 0.078), unnecessary (removing them barely changes the prediction), and unrecoverable (continuous attention scores disagree with discrete patch subsets used at inference). The central premise is that evidence quality should be optimized directly through explicit criteria (Sufficiency, Necessity, and Recoverability; S/N/R) rather than inherited as a byproduct of classification. GCE-MIL is a backbone-agnostic wrapper implemented through three injection modes and three evidence components: a grounding mechanism that aligns selection with domain-specific concepts, noisy-OR coverage that acts as a differentiable proxy for interventional evidence search, and threshold-plus-repair recovery that converts continuous selectors into discrete subsets through marginal-guided repair. Across 9 backbones and 9 datasets (81 configurations), GCE-MIL improves average Macro-F1 by 0.024 and C-index by 0.014, reduces the continuous-discrete gap by 4-7, and increases complement degradation by 2-4. With optional tile prefiltering after discrete recovery, inference runs up to 5× faster while retaining 0.989 full-bag utility.

Chinese Translation

多实例学习（MIL）是全幻灯片图像（WSI）分类和生存预测的标准方法，其中基于注意力的模型将补丁特征聚合为幻灯片级预测。这些模型将注意力权重视为其预测的证据，但注意力是针对分类进行优化的，而不是用于识别哪些补丁实际上支持诊断。这种混淆导致了三个问题：所选补丁不足（仅保留它们会使Macro-F1下降0.078）、不必要（移除它们几乎不改变预测）以及不可恢复（连续注意力分数与推理时使用的离散补丁子集不一致）。核心前提是，证据质量应通过明确的标准——充分性（Sufficiency）、必要性（Necessity）和可恢复性（Recoverability）（S/N/R）——直接优化，而不是作为分类的副产品继承。GCE-MIL是一个与骨干网络无关的包装器，通过三种注入模式和三个证据组件实现：一个将选择与领域特定概念对齐的基础机制、作为干预证据搜索的可微代理的噪声-或覆盖，以及通过边际引导修复将连续选择器转换为离散子集的阈值加修复恢复。在9个骨干网络和9个数据集（81种配置）中，GCE-MIL将平均Macro-F1提高了0.024，C-index提高了0.014，减少了连续-离散差距4-7，并增加了补充降解2-4。通过在离散恢复后可选的瓦片预过滤，推理速度提高了最多5倍，同时保持0.989的完整包效用。
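
The noisy-OR coverage and the thresholding step named in this abstract can be sketched as follows. This is only the textbook noisy-OR formula plus a plain threshold; how GCE-MIL defines per-patch probabilities and performs marginal-guided repair is not stated in the abstract, so the function names, inputs, and the value of `tau` are assumptions.

```python
from math import prod


def noisy_or(probs):
    """Probability that at least one selected patch expresses the concept."""
    return 1.0 - prod(1.0 - p for p in probs)


def threshold_select(selector, tau=0.5):
    """Convert continuous selector scores into a discrete patch subset."""
    return [i for i, score in enumerate(selector) if score >= tau]


coverage_two_weak = noisy_or([0.5, 0.5])
coverage_one_strong = noisy_or([0.9, 0.1, 0.0])
subset = threshold_select([0.9, 0.2, 0.6])
print(coverage_two_weak, coverage_one_strong, subset)
```

Noisy-OR is smooth in every input probability, which is why it can serve as a differentiable stand-in for asking whether any patch in the set supports a concept.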

cs.CV / 159 / 2605.17470

EchoSR: Efficient Context Harnessing for Lightweight Image Super-Resolution

EchoSR：轻量级图像超分辨率的高效上下文利用

Zhao, Hanli, Wang, Binhao, Zhao, Shihao, Wang, Tao, Zhang, Kaihao

Abstract

Image super-resolution (SR) aims to reconstruct high-quality, high-resolution (HR) images from low-resolution (LR) inputs and plays a critical role in various downstream applications. Despite recent advancements, balancing reconstruction fidelity and computational efficiency remains a fundamental challenge, particularly in resource-constrained scenarios. While existing lightweight methods attempt to expand receptive fields, many of them either incur substantial computational overhead, naively scale up kernel sizes, or lack mechanisms for coherent multi-scale integration, limiting their overall effectiveness and scalability. To address these limitations, we propose EchoSR, an efficient context-harnessing framework for lightweight image super-resolution, which unifies multi-scale receptive field modeling and hierarchical context fusion. EchoSR decouples feature learning into disentangled local, multi-scale, and global modeling stages through an efficient context-harnessing strategy, and further promotes seamless cross-scale integration via a cross-scale overlapping fusion mechanism. Extensive experiments show that EchoSR consistently outperforms state-of-the-art lightweight super-resolution methods across multiple benchmarks, while also running about 2× faster. The source code is available at https://github.com/funnyWang-Echoes/EchoSR.

Chinese Translation

图像超分辨率（SR）旨在从低分辨率（LR）输入重建高质量、高分辨率（HR）图像，并在各种下游应用中发挥着关键作用。尽管最近取得了一些进展，但在资源受限的场景中，平衡重建保真度和计算效率仍然是一个基本挑战。现有的轻量级方法虽然试图扩展感受野，但许多方法要么导致显著的计算开销，要么简单地扩大卷积核尺寸，或者缺乏一致的多尺度融合机制，限制了它们的整体有效性和可扩展性。为了解决这些局限性，我们提出了EchoSR，一个高效的上下文利用框架，用于轻量级图像超分辨率，它统一了多尺度感受野建模和分层上下文融合。EchoSR通过高效的上下文利用策略，将特征学习解耦为局部、多尺度和全局建模阶段，并通过跨尺度重叠融合机制进一步促进无缝的跨尺度集成。大量实验表明，EchoSR在多个基准测试中始终优于最先进的轻量级超分辨率方法，同时实现了更快的速度（约2倍）。源代码可在 https://github.com/funnyWang-Echoes/EchoSR 获取。

cs.CV / 160 / 2605.17472

Weighted Reverse Convolution for Feature Upsampling

加权反卷积用于特征上采样

Li, Wentong, Qi, Zhiyuan, Zhao, Zichen, Zhang, Kai, Zhang, Lei

Abstract

Pre-trained vision foundation models (VFMs) provide strong semantic representations, yet their patch-level features are inherently coarse, limiting their effectiveness on tasks requiring fine-grained localization, dense prediction, and point-wise correspondence. In this work, we revisit feature upsampling for VFMs from the perspective of inverse problems and propose Weighted Reverse Convolution (WRC), a spatially adaptive inverse operator for densifying high-level visual descriptors. Specifically, we formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location. This allows WRC to adapt the reconstruction to spatially varying feature characteristics, thereby preserving critical structures while mitigating over-smoothing. Moreover, WRC retains an efficient, fully differentiable closed-form FFT solution, making it a practical drop-in upsampling operator. Integrated into a lightweight self-supervised densification framework, WRC consistently improves dense feature quality across various downstream benchmarks, including segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence, while maintaining high computational efficiency.

Chinese Translation

预训练的视觉基础模型（VFM）提供了强大的语义表示，然而它们的块级特征本质上较为粗糙，限制了它们在需要细粒度定位、密集预测和逐点对应的任务中的有效性。在本研究中，我们从逆问题的角度重新审视VFM的特征上采样，并提出了加权反卷积（WRC），这是一种空间自适应的逆算子，用于增强高级视觉描述符的密度。具体而言，我们将特征上采样公式化为一个加权的Tikhonov正则化最小二乘问题，其中空间变化的权重在每个空间位置调节数据保真度和先验强度。这使得WRC能够根据空间变化的特征特性调整重建，从而在减轻过平滑的同时保留关键结构。此外，WRC保留了高效、完全可微的封闭形式FFT解，使其成为一个实用的即插即用上采样算子。集成到一个轻量级的自监督密集化框架中，WRC在多个下游基准测试中持续提高了密集特征的质量，包括分割、深度估计、视频目标分割、目标发现和关键点对应，同时保持了高计算效率。
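
For readers unfamiliar with the formulation named in this abstract, a generic weighted Tikhonov-regularized least-squares problem and its normal equations are written out below. The symbols are assumptions introduced here for illustration and are not taken from the paper: A is a linear degradation operator, W a positive diagonal weight matrix on the data term, x_0 a prior estimate, and λ > 0 the regularization strength. The abstract states that the weights also modulate the prior term, which this simplified form does not show.

```latex
\hat{x} \;=\; \arg\min_{x}\; \bigl\| W^{1/2}(A x - y) \bigr\|_2^2 \;+\; \lambda \,\| x - x_0 \|_2^2
\quad\Longrightarrow\quad
\bigl( A^{\top} W A + \lambda I \bigr)\,\hat{x} \;=\; A^{\top} W y + \lambda x_0 .

% Special case: W = I and A a circular convolution with kernel k.
% The system is then diagonal in the Fourier domain:
\hat{x} \;=\; \mathcal{F}^{-1}\!\left[
  \frac{\overline{\mathcal{F}(k)} \odot \mathcal{F}(y) + \lambda\, \mathcal{F}(x_0)}
       {\lvert \mathcal{F}(k) \rvert^{2} + \lambda}
\right].
```

The second expression shows why an FFT-based closed form is available in the unweighted convolutional case; how the paper keeps a closed-form FFT solution with spatially varying weights and an upsampling operator is not recoverable from the abstract.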

cs.CV / 161 / 2605.17478

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

Mamba-VGGT：通过外部滑动窗口Mamba记忆实现的持久长序列视频几何基础变换器

Deng, Tianchen, Xiong, Zhenxiang, Wang, Nailin, Wang, Fangjinhua, Liu, Jiuming, Yang, Jianfei, Wang, Hesheng

Abstract

Visual Geometry Grounded Transformers (VGGT) have set new benchmarks in high-fidelity 3D scene reconstruction. However, as the sequence length increases, these models suffer from catastrophic geometric forgetting and accumulation drift, primarily due to the quadratic complexity of global attention which necessitates truncated temporal windows. To overcome the resulting geometric drift, we present Mamba-VGGT, an enhanced VGGT framework capable of persistent long-range reasoning. Our key contribution is a Sliding Window Mamba (SWM) memory module that maintains an explicit external memory token across temporal windows. This module leverages selective state-space modeling to distill and propagate global geometric priors, effectively bypassing the memory constraints of traditional transformers. To integrate these long-term temporal cues without disrupting the highly optimized spatial features of the pre-trained VGGT, we propose a Zero-Init Spatial Memory Injector. Utilizing zero-convolutional layers, this injector adaptively fuses persistent memory into the patch token stream, ensuring structural stability and seamless feature alignment. Extensive experiments demonstrate that our approach significantly outperforms existing VGGT-based methods in maintaining spatial consistency and reducing trajectory accumulation errors. Our work provides a scalable, linear-complexity solution for geometry-grounded world modeling in extensive 3D environments.

Chinese Translation

视觉几何基础变换器（VGGT）在高保真3D场景重建中设立了新的基准。然而，随着序列长度的增加，这些模型遭遇了灾难性的几何遗忘和累积漂移，这主要是由于全局注意力的二次复杂性，导致必须使用截断的时间窗口。为了解决由此产生的几何漂移问题，我们提出了Mamba-VGGT，这是一种增强的VGGT框架，能够进行持久的长范围推理。我们的关键贡献是一个滑动窗口Mamba（SWM）记忆模块，它在时间窗口之间维护一个显式的外部记忆标记。该模块利用选择性状态空间建模来提炼和传播全局几何先验，有效绕过传统变换器的记忆限制。为了在不破坏预训练VGGT高度优化的空间特征的情况下整合这些长期时间线索，我们提出了一种零初始化空间记忆注入器。该注入器利用零卷积层，自适应地将持久记忆融合到补丁标记流中，确保结构稳定性和无缝特征对齐。大量实验表明，我们的方法在维持空间一致性和减少轨迹累积误差方面显著优于现有的基于VGGT的方法。我们的工作为在广泛的3D环境中进行几何基础世界建模提供了一种可扩展的线性复杂度解决方案。
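
The effect of zero initialization in a memory injector can be shown with a very small sketch. It assumes a zero-initialized linear projection as a stand-in for the zero-convolutional layers mentioned in the abstract; the helper names, dimensions, and the hand-edited weight are hypothetical.

```python
def make_zero_projection(dim_in, dim_out):
    """Weight matrix of shape (dim_out, dim_in) filled with zeros."""
    return [[0.0] * dim_in for _ in range(dim_out)]


def inject_memory(patch_tokens, memory_token, weight):
    """Add a linear projection of the memory token to every patch token."""
    projected = [
        sum(w * m for w, m in zip(row, memory_token)) for row in weight
    ]
    return [[t + p for t, p in zip(token, projected)] for token in patch_tokens]


tokens = [[0.5, -1.0], [2.0, 0.25]]   # toy patch tokens, dimension 2
memory = [3.0, 4.0, 5.0]              # toy external memory token, dimension 3

weight = make_zero_projection(dim_in=3, dim_out=2)
out_init = inject_memory(tokens, memory, weight)

weight[0][0] = 0.1  # stands in for a weight change learned during training
out_trained = inject_memory(tokens, memory, weight)
print(out_init, out_trained)
```

At initialization the injector is an identity on the patch tokens, so the pre-trained features are untouched; memory only starts to influence them as the projection weights move away from zero.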

cs.CV / 162 / 2605.17483

On Applicability of Synthetic Datasets for Facial Expression Recognition

合成数据集在面部表情识别中的适用性研究

Azmoudeh, Ali, Sarıtaş, Erdi, Yıldırım, Ömer, Ekenel, Hazım Kemal

Abstract

Facial Expression Recognition faces two core challenges. The first is class imbalance in public datasets, which skews the learning process and weakens generalization. The second is related to privacy and data collection constraints, which limit the sharing of facial images and restrict the creation of large, balanced datasets. To address these issues, we examine three complementary strategies for constructing privacy-preserving FER datasets in the standard seven discrete facial expression classes setting. Our strategies are: (i) pseudo-labeling large unlabeled face collections with a teacher model under a confidence-thresholding scheme, (ii) prompt-driven synthesis using diffusion models conditioned on demographic attributes, and (iii) task-aware GAN-based expression editing that modifies facial expression while preserving identity and realism. For training and evaluation, we employed widely adopted datasets, including AffectNet, RAF-DB, and FER2013. We utilized the synthetic datasets DigiFace, DCFace, and EmoNet-Face BIG as unlabeled sources for pseudo-labeling. Additionally, we utilized the FFHQ dataset as the source for generative synthesis. The main experiments are conducted using a classic CNN backbone, IR50, and we also explore a more complex architecture, POSTERv1, to assess its feasibility and robustness. Using cross-dataset evaluations, we analyze the trade-offs each strategy presents in curated datasets. The findings demonstrate how synthetic data can effectively substitute or be combined with real datasets to mitigate imbalance and privacy limitations. Code and generated datasets: https://www.github.com/AliAZ98/SyntFER

Chinese Translation

面部表情识别面临两个核心挑战。第一个是公共数据集中类别不平衡，这扭曲了学习过程并削弱了泛化能力。第二个与隐私和数据收集限制相关，这限制了面部图像的共享并阻碍了大型平衡数据集的创建。为了解决这些问题，我们研究了三种互补策略，以在标准的七个离散面部表情类别设置中构建隐私保护的面部表情识别（FER）数据集。我们的策略包括：(i) 在置信度阈值方案下，使用教师模型对大量未标记的面部集合进行伪标记；(ii) 基于人口统计属性的提示驱动合成，使用扩散模型；以及 (iii) 任务感知的基于生成对抗网络（GAN）的表情编辑，修改面部表情的同时保留身份和真实感。为了进行训练和评估，我们采用了广泛使用的数据集，包括 AffectNet、RAF-DB 和 FER2013。我们利用合成数据集 DigiFace、DCFace 和 EmoNet-Face BIG 作为伪标记的未标记来源。此外，我们还利用 FFHQ 数据集作为生成合成的来源。主要实验使用经典的卷积神经网络（CNN）骨干网络 IR50 进行，我们还探索了一种更复杂的架构 POSTERv1，以评估其可行性和鲁棒性。通过跨数据集评估，我们分析了每种策略在策划数据集中的权衡。研究结果表明，合成数据可以有效替代或与真实数据集结合，以缓解不平衡和隐私限制。代码和生成的数据集：https://www.github.com/AliAZ98/SyntFER
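
Strategy (i), pseudo-labeling under a confidence threshold, can be sketched in a few lines. The class list below is the commonly used seven-class set and is an assumption, as are the threshold of 0.8, the function name `pseudo_label`, and the made-up teacher probabilities; the paper's actual settings are not given in the abstract.

```python
from collections import Counter

# Assumed seven-class label set; the abstract does not enumerate the classes.
CLASSES = ["neutral", "happy", "sad", "surprise", "fear", "disgust", "anger"]


def pseudo_label(teacher_probs, threshold=0.8):
    """Keep (sample_index, class_name) where the teacher is confident enough."""
    kept = []
    for idx, probs in enumerate(teacher_probs):
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            kept.append((idx, CLASSES[best]))
    return kept


teacher_probs = [
    [0.05, 0.85, 0.02, 0.02, 0.02, 0.02, 0.02],  # confident: happy
    [0.30, 0.30, 0.10, 0.10, 0.10, 0.05, 0.05],  # ambiguous: dropped
    [0.01, 0.01, 0.90, 0.02, 0.02, 0.02, 0.02],  # confident: sad
]
labels = pseudo_label(teacher_probs)
class_counts = Counter(name for _, name in labels)
print(labels, class_counts)
```

Counting the retained labels per class, as `class_counts` does, is one simple way to check whether the pseudo-labeled set actually reduces the imbalance it is meant to address.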

cs.CV / 163 / 2605.17488

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

Omni-Customizer：端到端多模态定制的联合音视频生成

Chen, Yuheng, He, Qingdong, Hu, Teng, Wang, Yuji, Wang, Yabiao, Ma, Lizhuang, Zhang, Jiangning

Abstract

The landscape of joint audio and video generation has been fundamentally transformed by the advent of powerful foundation models. Despite these strides, achieving cohesive multimodal customization for the simultaneous preservation of visual identities and vocal timbres across multiple interacting subjects remains largely underexplored. To bridge this gap, we present Omni-Customizer, an end-to-end framework targeted at the precise binding and seamless fusion of multimodal identity information. Specifically, we introduce an Omni-Context Fusion (OCF) module that effectively enriches the base textual prompt with dense, multimodal identity cues, along with a Masked TTS Cross-Attention (MTP-CA) mechanism explicitly designed to prevent the severe "speech leakage" problem. Within this architecture, we propose Semantic-Anchored Multimodal RoPE (SA-MRoPE) to anchor visual and audio reference tokens, along with TTS embeddings, to their corresponding semantic descriptions, enabling structured multimodal fusion and robust identity binding. Furthermore, we devise a comprehensive training strategy that incorporates interleaved audio-video scheduling to rapidly adapt the audio branch to multilingual scenarios without degrading foundational priors, and a progressive in-pair to cross-pair curriculum to facilitate the learning of high-level and robust identity features. Extensive experiments demonstrate that Omni-Customizer achieves state-of-the-art performance in dual-modal customized generation, excelling across visual identity similarity, timbre consistency, precise audio-video synchronization, and overall video-audio fidelity.

Chinese Translation

强大的基础模型的出现从根本上改变了联合音频和视频生成的格局。尽管取得了这些进展，实现多交互主体之间视觉身份和声音音色的同时保留的连贯多模态定制仍然在很大程度上未被深入探索。为了解决这一问题，我们提出了Omni-Customizer，一个旨在精确绑定和无缝融合多模态身份信息的端到端框架。具体而言，我们引入了一个Omni-Context Fusion (OCF)模块，有效地用密集的多模态身份线索丰富基础文本提示，并设计了一个Masked TTS Cross-Attention (MTP-CA)机制，专门用于防止严重的“语音泄漏”问题。在该架构中，我们提出了语义锚定多模态RoPE (SA-MRoPE)，将视觉和音频参考标记以及TTS嵌入与其对应的语义描述锚定，从而实现结构化的多模态融合和稳健的身份绑定。此外，我们设计了一种综合训练策略，结合交错的音视频调度，以快速适应多语言场景而不降低基础先验，并采用渐进式的对内到对外课程，以促进高层次和稳健身份特征的学习。大量实验表明，Omni-Customizer在双模态定制生成中实现了最先进的性能，在视觉身份相似性、音色一致性、精确的音视频同步和整体视频音频保真度等方面表现优异。

cs.CV / 164 / 2605.17489

Employing Vision-Language Models for Face Image Quality Assessment

利用视觉-语言模型进行人脸图像质量评估

Sarıtaş, Erdi, Onaran, Eren, Štruc, Vitomir, Ekenel, Hazım Kemal

Abstract

Face Image Quality Assessment (FIQA) is a crucial control step in biometric pipelines. It ensures only reliable samples are processed to maintain system accuracy. State-of-the-art FIQA methods achieve high utility but typically operate as "black boxes." They produce scalar scores without human-interpretable justifications. This lack of transparency limits their effectiveness in human-in-the-loop scenarios, such as automated border control, where actionable feedback is essential. In this paper, we investigate the potential of off-the-shelf Vision-Language Models (VLMs) to bridge this gap by performing FIQA in a zero-shot setting. We present a comprehensive evaluation framework for assessing VLM performance. This involves benchmarking traditional FIQA methods through error-versus-reject curves. Additionally, using a diverse set of datasets, ranging from surveillance-oriented to synthetically generated, we analyzed their interpretability, consistency, and robustness to prompt changes. Our results show biometric utility performance depends significantly on architecture, not merely on parameter count. Most VLMs' outputs align with those of traditional methods. We also find that VLM ranking performance and the generated scores may vary across prompts. Our synthetic ablation study shows that while increasing the parameter count can improve internal consistency, it yields worse degradation-detection performance than smaller models. These findings suggest that zero-shot FIQA score estimation using VLMs is promising and could effectively complement conventional FIQA pipelines as an interpretability module. The codes are available at https://github.com/ThEnded32/VLM4FIQA.git.

Chinese Translation

人脸图像质量评估（FIQA）是生物识别流程中的一个关键控制步骤。它确保只有可靠的样本被处理，以维持系统的准确性。最先进的 FIQA 方法实现了高效用性，但通常作为“黑箱”操作。它们生成的标量分数缺乏人类可解释的依据。这种透明度的缺失限制了它们在需要可操作反馈的人机交互场景中的有效性，例如自动边境控制。在本文中，我们探讨了现成的视觉-语言模型（VLMs）在零样本设置下进行 FIQA 的潜力。我们提出了一个全面的评估框架来评估 VLM 的性能。这包括通过错误与拒绝曲线对传统 FIQA 方法进行基准测试。此外，利用从监控导向到合成生成的多样化数据集，我们分析了它们的可解释性、一致性和对提示变化的鲁棒性。我们的结果表明，生物识别效用性能显著依赖于架构，而不仅仅是参数数量。大多数 VLM 的输出与传统方法一致。我们还发现 VLM 的排名性能和生成的分数可能因提示而异。我们的合成消融研究表明，尽管增加参数数量可以提高内部一致性，但其降级检测性能却低于较小的模型。这些发现表明，使用 VLM 进行零样本 FIQA 分数估计是有前景的，并可以有效补充传统 FIQA 流程作为可解释性模块。代码可在 https://github.com/ThEnded32/VLM4FIQA.git 获取。
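
An error-versus-reject curve, the evaluation tool mentioned in this abstract, can be sketched as below. The sketch assumes each genuine comparison carries a single quality score and that the error is the false non-match rate at a fixed similarity threshold; the data, the threshold, and the function name are made up for illustration, and rejection fractions are assumed to be below 1.

```python
def error_vs_reject(pairs, threshold, reject_fractions):
    """Return [(fraction_rejected, false_non_match_rate)].

    pairs holds (quality, genuine_similarity) tuples. The lowest-quality
    pairs are rejected first; the error is measured on what remains.
    """
    ranked = sorted(pairs, key=lambda p: p[0])  # lowest quality first
    curve = []
    for frac in reject_fractions:
        kept = ranked[round(len(ranked) * frac):]
        errors = sum(1 for _, sim in kept if sim < threshold)
        curve.append((frac, errors / len(kept)))
    return curve


pairs = [
    (0.10, 0.20), (0.20, 0.35), (0.50, 0.70), (0.60, 0.30), (0.80, 0.90),
    (0.90, 0.95), (0.70, 0.80), (0.40, 0.60), (0.30, 0.45), (0.95, 0.85),
]
curve = error_vs_reject(pairs, threshold=0.5, reject_fractions=[0.0, 0.2, 0.4])
print(curve)
```

A useful quality measure makes the error fall quickly as more low-quality samples are rejected; a flat curve would indicate that the quality scores carry little information about recognition errors.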


cs.CV / 165 / 2605.17504

A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle

视觉机械解释性的分布视角：KL-最小软约束原则

Zhou, Guancheng, Luo, Yisi, He, Zhengfu, Jin, Zhenyu, Ge, Xuyang, Shu, Wentao, Meng, Deyu, Qiu, Xipeng

Abstract

Most current paradigms in visual mechanistic interpretability (MI) remain confined to interpreting internal units of the vision model via heuristic methods (e.g., top-$K$ activation retrieval or optimization with regularization). In this work, we establish a theoretical distributional view for visual MI, which models the influence of a feature activation on the natural image distribution, thereby formulating a Kullback-Leibler (KL)-minimal optimization problem to model the MI task. Under this framework, statistical biases are identified within previous MI paradigms, which reveal that they may either be perceptually uninterpretable to humans (i.e., deviate from the natural image distribution), or mechanistically unfaithful to the vision models (i.e., unable to activate model features). To resolve the biases under the distributional view, we propose a model with a KL-minimal soft-constraint principle for visual MI that theoretically balances interpretability and faithfulness. We realize this principle via energy-guided diffusion posterior sampling. Extensive experiments validate the theoretical soundness of the proposed distributional view and demonstrate the practical effectiveness of our paradigm on the DINOv3 vision model.

Chinese Translation

目前大多数视觉机械解释性（MI）范式仍然局限于通过启发式方法（例如，top-$K$ 激活检索或带正则化的优化）来解释视觉模型的内部单元。在本研究中，我们建立了一种理论性的分布视角，用于视觉 MI，该视角建模特征激活对自然图像分布的影响，从而将 MI 任务表述为一个 Kullback-Leibler (KL)-最小化优化问题。在这一框架下，识别出之前 MI 范式中的统计偏差，揭示它们可能对人类感知上不可解释（即，偏离自然图像分布），或在机制上对视觉模型不忠实（即，无法激活模型特征）。为了在分布视角下解决这些偏差，我们提出了一种具有 KL-最小软约束原则的视觉 MI 模型，该模型在理论上平衡了解释性和忠实性。我们通过能量引导的扩散后验采样实现这一原则。大量实验验证了所提分布视角的理论合理性，并展示了我们范式在 DINOv3 视觉模型上的实际有效性。


cs.CV / 166 / 2605.17506

Degradation Frequency Curve: An Explicit Frequency-Quantified Representation for All-in-One Image Restoration

降解频率曲线：一种用于一体化图像恢复的显式频率量化表示

Huang, Xinghua, Yang, Zhixiong, Wu, Chen, Li, Shengxi, Zhi, Shuaifeng, Zhang, Yue, Hou, Qibin, Deng, Xin, Xia, Jingyuan

Abstract

A fundamental difficulty in all-in-one blind image restoration is that degradation is usually treated as an implicit factor hidden in degraded-to-clean mapping, rather than as an explicit object that can be measured and manipulated. This limitation becomes more pronounced under mixed, compound, or unseen degradation conditions, where degradation effects are hard to assign to predefined labels or task-specific parameters. We propose the Degradation Frequency Curve (DFC), a structured spectral representation that quantifies degradation responses by measuring band-wise residual-to-degraded energy ratios in the frequency domain. DFC converts visually entangled and hard-to-describe degradation effects into a measurable degradation coordinate space. Moreover, DFC can be adaptively decomposed into band-wise spectral tokens, allowing local degradation responses to be represented as reusable restoration priors. Based on this representation, we develop the DFC-guided Image Restorer (DFC-IR), a token-conditioned multi-scale framework that progressively estimates DFCs from intermediate restorations and uses the resulting spectral tokens to guide degradation-aware restoration in a coarse-to-fine manner. Extensive experiments on standard, composite, unseen, and real-world degradation benchmarks show that DFC provides an effective representation basis for all-in-one restoration, leading to state-of-the-art performance and improved generalization under complex degradation profiles.

Chinese Translation

一体化盲图像恢复的一个基本困难在于，降解通常被视为隐藏在降级到清晰映射中的隐含因素，而不是可以被测量和操作的显式对象。在混合、复合或未见降解条件下，这一限制变得更加明显，因为降解效应难以分配给预定义标签或特定任务参数。我们提出了降解频率曲线（Degradation Frequency Curve, DFC），这是一种结构化的频谱表示，通过测量频域中的带宽残余与降解能量比来量化降解响应。DFC将视觉上纠缠且难以描述的降解效应转化为可测量的降解坐标空间。此外，DFC可以自适应地分解为带宽频谱标记，使局部降解响应能够作为可重用的恢复先验进行表示。基于这一表示，我们开发了DFC引导的图像恢复器（DFC-guided Image Restorer, DFC-IR），这是一个基于标记条件的多尺度框架，逐步从中间恢复中估计DFC，并利用得到的频谱标记以粗到细的方式指导降解感知恢复。在标准、复合、未见和真实世界降解基准上的广泛实验表明，DFC为一体化恢复提供了有效的表示基础，导致了最先进的性能，并在复杂降解特征下改善了泛化能力。
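
Editor's sketch: the abstract describes the DFC as band-wise residual-to-degraded energy ratios in the frequency domain. The snippet below is one plausible reading of that description, not the authors' implementation. It assumes concentric radial bands of equal width, a residual defined as degraded minus clean, and a known clean image (the paper estimates the curve from intermediate restorations instead). The name `band_energy_ratio` is hypothetical.

```python
import numpy as np

def band_energy_ratio(degraded, clean, num_bands=8):
    """Per-band ratio of residual energy to degraded-image energy (radial bands)."""
    residual = degraded - clean
    res_power = np.abs(np.fft.fftshift(np.fft.fft2(residual))) ** 2
    deg_power = np.abs(np.fft.fftshift(np.fft.fft2(degraded))) ** 2
    h, w = degraded.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)   # distance from the DC bin
    edges = np.linspace(0.0, radius.max() + 1e-9, num_bands + 1)
    band_index = np.clip(np.digitize(radius, edges) - 1, 0, num_bands - 1)
    curve = np.zeros(num_bands)
    for b in range(num_bands):
        mask = band_index == b
        curve[b] = res_power[mask].sum() / (deg_power[mask].sum() + 1e-12)
    return curve

# Toy example (assumption): a smooth image corrupted by white noise.
rng = np.random.default_rng(0)
x = np.arange(32)
clean = 1.0 + np.cos(2 * np.pi * x / 32)[None, :] * np.ones((32, 1))
degraded = clean + 0.1 * rng.standard_normal((32, 32))
dfc_like = band_energy_ratio(degraded, clean)
```

In this toy case the ratio is close to zero in the lowest band, where the clean signal dominates, and close to one in the highest band, where almost all energy comes from the noise.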


cs.CV / 167 / 2605.17527

Designing streetscapes from street-view imagery using diffusion models

利用扩散模型从街景图像设计街道景观

Chen, Yuzhou, Liang, Yuebing, Hu, Lingqian, Sun, Kailai, Song, Qingqi, Zhao, Chang, Wang, Shenhao

Abstract

Street-view imagery (SVI) is widely used to quantify key indicators of the urban environment, such as greenery, sky, or road view indices. However, existing studies largely focus on measuring current streetscapes and rarely support the generation of alternative and non-existing urban scenarios, which is a core task in geospatial disciplines such as urban planning and design. To address this gap, we propose a generative multimodal AI framework that synthesizes alternative streetscapes conditioned on targeted visual metrics, enabling direct visual exploration of urban scenarios. We first construct a multimodal dataset that aligns SVIs with textual descriptions, segmentation maps, road masks, and quantitative metrics of visual elements in Chicago and Orlando. Using this dataset, we demonstrate that diffusion models can produce realistic and semantically consistent streetscape imagery while responding to both textual and imagery controls. Our quantitative evaluations show that incorporating visual controls can improve semantic consistency, reducing the LPIPS index by approximately 6% while maintaining global visual realism. In addition, overall semantic consistency increases by 23.7% in Orlando and 46.4% in Chicago, as measured by the mIoU index, with class-wise gains exceeding 100% for building view indices. Streetscape generation can be controlled in a fine-grained manner by both visual and textual prompts, and when textual and visual controls conflict, imagery controls consistently dominate, indicating a clear control hierarchy and the importance of further developing visual controls for urban scene generation. Overall, this work establishes an important benchmark for streetscape generation using SVIs and diffusion models, and illustrates how generative AI can serve as a practical, scalable, and controllable approach for urban scenario exploration.

Chinese Translation

街景图像（SVI）广泛用于量化城市环境的关键指标，如绿化、天空或道路视图指数。然而，现有研究主要集中在测量当前街道景观，鲜有支持生成替代和不存在的城市场景，这在城市规划和设计等地理空间学科中是一个核心任务。为了解决这一问题，我们提出了一种生成性多模态人工智能框架，该框架基于目标视觉指标合成替代街道景观，从而实现城市场景的直接视觉探索。我们首先构建了一个多模态数据集，将芝加哥和奥兰多的SVI与文本描述、分割图、道路掩膜和视觉元素的定量指标对齐。利用该数据集，我们展示了扩散模型能够生成现实且语义一致的街道景观图像，同时响应文本和图像的控制。我们的定量评估表明，结合视觉控制可以提高语义一致性，LPIPS指数降低约6%，同时保持全球视觉现实感。此外，基于mIoU指数的测量，奥兰多的整体语义一致性提高了23.7%，芝加哥提高了46.4%，在建筑视图指数上，类别级别的增益甚至超过了100%的改善。街道景观生成可以通过视觉和文本提示进行细粒度控制，当文本和视觉控制发生冲突时，图像控制始终占主导地位，表明了明确的控制层级以及进一步发展城市场景生成的视觉控制的重要性。总体而言，这项工作为利用SVI和扩散模型生成街道景观建立了一个重要的基准，并展示了生成性人工智能如何作为一种实用、可扩展且可控的方法来探索城市场景。


cs.CV / 168 / 2605.17531

$\textit{Don't Guess, Just Ask}$: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification

别猜了，直接问：通过多轮澄清解决指代分割中的模糊性

Yang, Yuting, Jiang, Haichao, Liang, Tianming, Zhang, Quan, Hu, Jian-Fang

Abstract

Referring segmentation aims to segment the target objects in images or videos based on the textual query. Despite remarkable progress over the past years, existing works always assume that the user-provided queries are already precise and clear. However, this assumption is impractical. In real-world scenarios, it is unrealistic to expect all users to thoroughly review their visual content and carefully ensure their queries are unique and unambiguous. When encountering such cases, existing segmentation models tend to arbitrarily guess the user preferences, often resulting in undesired outcomes. To address this limitation, we propose \textbf{IC-Seg}, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmentation. To effectively incentivize this capability, we further introduce \textbf{Hi-GRPO}, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish \textbf{Ambi-RVOS}, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at \url{https://github.com/iSEE-Laboratory/IC-Seg}.

Chinese Translation

指代分割旨在根据文本查询对图像或视频中的目标对象进行分割。尽管近年来取得了显著进展，但现有研究通常假设用户提供的查询已经精确且明确。然而，这一假设并不切实际。在现实场景中，期望所有用户都能彻底审查他们的视觉内容并仔细确保他们的查询是独特且无歧义的，这并不现实。当遇到此类情况时，现有的分割模型往往会随意猜测用户的偏好，常常导致不理想的结果。为了解决这一局限性，我们提出了 \textbf{IC-Seg}，一种新颖的代理框架，通过多轮对话主动澄清用户意图，然后再进行分割。为了有效激励这一能力，我们进一步引入了 \textbf{Hi-GRPO}，一种新的层次优化策略，在轨迹、轮次和步骤级别注入密集且信息丰富的监督信号。这一策略鼓励高效的意图澄清，有效消除冗余交互并提高整体对话质量。为了评估，我们建立了 \textbf{Ambi-RVOS}，一个具有模糊用户查询的指代视频对象分割基准。大量实验表明，IC-Seg不仅在解决模糊查询方面大幅超越现有方法，而且在标准推理分割基准上保持了最先进的性能。代码和数据将发布在 \url{https://github.com/iSEE-Laboratory/IC-Seg}。


cs.CV / 169 / 2605.17543

HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos

HL-OutPaint：高分辨率长视频的粗到细视频外绘

Park, Jeongeun, Han, Janghyeok, Kim, Geonung, Lee, Hyun-Seung, Choi, Kyuha, Han, Youngseok, Cho, Sunghyun

Abstract

Video outpainting generates plausible visual content beyond the original spatial extent of a video, playing a key role in adapting videos to diverse display formats. To support such use cases, it must enable large spatial extrapolation over long sequences. However, most existing methods address only one of these challenges or lack explicit mechanisms for ensuring global spatio-temporal consistency, leading to notable limitations. In this paper, we propose HL-OutPaint, a high-resolution video outpainting framework for long sequences. Our approach follows a coarse-to-fine strategy with a two-stage pipeline. We first construct Global Coarse Guidance (GCG), a low-resolution representation that captures global structure and dominant motion across the video. Unlike naive downsampling, GCG is built via a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows and exchanges information during sampling. This enables GCG to encode both long-term structural consistency and short-term temporal dynamics in a unified representation. Guided by this representation, HL-OutPaint then performs high-resolution outpainting to generate spatially detailed and temporally consistent content. By separating global structure modeling from fine-grained synthesis, our framework achieves stable, coherent generation for large spatial expansion and long video sequences. Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios involving wide spatial extrapolation and long video sequences.

Chinese Translation

视频外绘生成超出视频原始空间范围的合理视觉内容，在将视频适配到多种显示格式中发挥着关键作用。为了支持此类应用，它必须能够在长序列中进行大范围的空间外推。然而，大多数现有方法仅解决了这些挑战中的一个，或缺乏确保全局时空一致性的明确机制，导致显著的局限性。本文提出了HL-OutPaint，一个针对长序列的高分辨率视频外绘框架。我们的方法遵循粗到细的策略，采用两阶段的管道。我们首先构建全局粗略引导（Global Coarse Guidance, GCG），这是一个低分辨率表示，捕捉视频中的全局结构和主导运动。与简单的下采样不同，GCG通过一种新颖的全局-局部帧交换机制构建，该机制将稀疏的全局关键帧与局部时间窗口相结合，并在采样过程中交换信息。这使得GCG能够在统一表示中编码长期结构一致性和短期时间动态。在此表示的指导下，HL-OutPaint随后执行高分辨率外绘，以生成空间细节丰富且时间一致的内容。通过将全局结构建模与细粒度合成分离，我们的框架实现了对大空间扩展和长视频序列的稳定、一致的生成。大量实验表明，HL-OutPaint在涉及广泛空间外推和长视频序列的挑战性场景中优于现有方法。


cs.CV / 170 / 2605.17564

A Conditional U-Net Pipeline with Pre- and Post-Processing for Aerial RGB-to-Thermal Image Translation

带有前处理和后处理的条件U-Net管道用于航空RGB到热图像的转换

Sherpa, Tseten, Ali, Sikandar, Parab, Shubham, Feng, Haoyun, Dennis, Matthew, Gibbons, Keenan, Otiende, Verrah, Siwo, Geoffrey H.

Abstract

Paired RGB-thermal data has shown significant utility across a range of applications, including image fusion, object tracking, and anomaly detection; however, its broader adoption is constrained by the limited availability of aligned RGB-thermal image pairs. RGB-to-thermal (and vice versa) image translation has emerged as a practical solution to this challenge. Prior approaches including conditional generative adversarial networks (cGANs) such as ThermalGAN and Scalable Interpolant Transformer (SiT)-based architectures such as ThermalGen have demonstrated strong potential for aerial-to-thermal image translation. In this work, we explore alternative architectures that prioritize simplicity while maintaining performance. Specifically, we propose a conditional U-Net that incorporates weather data at the bottleneck layer, complemented by targeted preprocessing and post-processing techniques applied within the Pix2Pix GAN architecture. We utilize a training set of 612 paired RGB and thermal images, and evaluate over 5-fold cross-validation, ultimately testing on a held-out test set. Our conditional U-Net model performed best, with a peak signal-to-noise ratio (PSNR) of 14.5485, structural similarity index measure (SSIM) of 0.8095, and learned perceptual image patch similarity (LPIPS) of 0.1666. These results outperformed the base ThermalGen model, which attained PSNR, SSIM, and LPIPS scores of 7.56, 0.2444, and 0.6317 respectively. We find that while saturation boost and contrast enhancement for preprocessing and Gaussian blur for post-processing provide observable improvements, the incorporation of conditioning data was most effective. Our findings cement the potential of integrating auxiliary metadata into thermal image generation, suggesting that such information can serve as a proxy for environmental conditions critical to accurate thermal reconstruction.

Chinese Translation

配对的RGB-热数据在图像融合、目标跟踪和异常检测等多个应用中显示出显著的效用；然而，由于对齐的RGB-热图像对的有限可用性，其更广泛的应用受到限制。RGB到热图像（及反向）转换已成为解决这一挑战的实用方案。之前的研究方法包括条件生成对抗网络（cGANs），如ThermalGAN，以及基于可扩展插值变换器（SiT）架构的ThermalGen，已展示出在航空到热图像转换中的强大潜力。在本研究中，我们探索了优先考虑简单性的替代架构，同时保持性能。具体而言，我们提出了一种条件U-Net，该网络在瓶颈层中结合了天气数据，并在Pix2Pix GAN架构中应用了针对性的前处理和后处理技术。我们利用612对配对的RGB和热图像作为训练集，并进行了5折交叉验证，最终在一个保留的测试集上进行测试。我们的条件U-Net模型表现最佳，峰值信噪比（PSNR）为14.5485，结构相似性指数（SSIM）为0.8095，学习的感知图像块相似性（LPIPS）为0.1666。这些结果超过了基础的ThermalGen模型，其PSNR、SSIM和LPIPS得分分别为7.56、0.2444和0.6317。我们发现，尽管前处理中的饱和度提升和对比度增强以及后处理中的高斯模糊提供了可观察的改进，但条件数据的整合是最有效的。我们的研究结果巩固了将辅助元数据整合到热图像生成中的潜力，表明此类信息可以作为准确热重建所需环境条件的代理。
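
Editor's sketch: the abstract reports PSNR among its metrics. For readers comparing the quoted numbers, the snippet below shows the standard definition, PSNR = 10 log10(MAX^2 / MSE), conventionally expressed in decibels. The 8-bit peak value of 255 is an assumption; the abstract does not state the dynamic range of the thermal images.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB; max_value is the assumed peak intensity."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# A uniform error of 10% of the peak value gives MSE = (0.1 * MAX)^2, i.e. 20 dB.
reference = np.full((4, 4), 100.0)
estimate = reference + 25.5
example_psnr = psnr(reference, estimate)
```

Because the scale is logarithmic, each additional 10 dB corresponds to a tenfold reduction in mean squared error.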


cs.CV / 171 / 2605.17566

Rethinking Point Clouds as Sequences: A Causal Next-Token Predictive Learning Framework

重新思考点云作为序列：一种因果下一个标记预测学习框架

Yao, Yumeng, Dong, Jingzhi, Gu, Haowen, Chen, Tao, Wu, Zonghan, Huang, Xiaoshui, Yao, Yazhou

Abstract

With the rapid progress of multimodal foundation models and predictive pre-training, an important open question is how to equip 3D point clouds with a pre-training paradigm that is better aligned with next-token and next-embedding learning. Existing point-cloud self-supervised methods are largely built on masked reconstruction or explicit geometric generation, and thus remain tied to input recovery rather than predictive dependency modeling. In this paper, we introduce PointNTP, which reformulates point cloud pre-training as a fully causal, decoder-free latent Next-Token Prediction problem. Specifically, each point cloud is first partitioned into local patches and serialized into a structured 3D token sequence according to patch-center geometry. The resulting sequence is then modeled by a causal Transformer under prefix-only conditioning, and trained with a shift-based prediction objective stabilized by stop-gradient targets. This design enables the model to learn structural dependencies directly in latent space, without reconstruction decoders or explicit geometric recovery. Extensive experiments demonstrate that the proposed PointNTP is highly competitive across multiple downstream tasks: it achieves 93.8%(+0.5%), 92.6%(+0.3%), and 89.3%(+1.1%) on OBJ_BG, OBJ_ONLY, and PB_T50_RS of ScanObjectNN, respectively; obtains 85.0%(+0.1%) in Cls.mIoU on ShapeNetPart; and reaches 71.1% mAcc on S3DIS Area 5. Overall, decoder-free causal latent prediction provides a simple, scalable, and potentially modality-agnostic paradigm for point-cloud self-supervised learning, offering a new 3D perspective on foundation-style predictive learning for 3D data.

Chinese Translation

随着多模态基础模型和预测预训练的快速发展，一个重要的开放性问题是如何为3D点云提供一种更好地与下一个标记和下一个嵌入学习对齐的预训练范式。现有的点云自监督方法主要基于掩码重建或显式几何生成，因此仍然与输入恢复紧密相连，而不是预测依赖建模。在本文中，我们引入了PointNTP，它将点云预训练重新表述为一个完全因果的、无解码器的潜在下一个标记预测问题。具体而言，每个点云首先被划分为局部补丁，并根据补丁中心几何结构序列化为结构化的3D标记序列。然后，生成的序列在仅前缀条件下由因果Transformer建模，并通过基于位移的预测目标进行训练，该目标由停止梯度目标稳定。该设计使模型能够直接在潜在空间中学习结构依赖，而无需重建解码器或显式几何恢复。大量实验表明，所提出的PointNTP在多个下游任务中具有很强的竞争力：在ScanObjectNN的OBJ_BG、OBJ_ONLY和PB_T50_RS上分别达到了93.8%(+0.5%)、92.6%(+0.3%)和89.3%(+1.1%)的成绩；在ShapeNetPart的Cls.mIoU上获得了85.0%(+0.1%)；在S3DIS Area 5上达到了71.1%的mAcc。总体而言，无解码器的因果潜在预测为点云自监督学习提供了一种简单、可扩展且潜在上不依赖于模态的范式，为3D数据的基础风格预测学习提供了新的3D视角。
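
Editor's sketch: the abstract describes prefix-only conditioning with a shift-based prediction objective. The snippet below illustrates only those two generic ingredients on a toy sequence, under the assumption that the patch tokens are already serialized. It does not reproduce the paper's patch ordering, Transformer, or loss. In an autodiff framework the targets would additionally be wrapped in a stop-gradient, which plain NumPy cannot express.

```python
import numpy as np

def shift_pairs(tokens):
    """Pair every token embedding with its successor: position t predicts t + 1."""
    return tokens[:-1], tokens[1:]

def causal_mask(length):
    """True where position i (row) may attend to position j (column), i.e. j <= i."""
    return np.tril(np.ones((length, length), dtype=bool))

# Toy sequence (assumption): 6 serialized patch tokens with 2-dim embeddings.
tokens = np.arange(12, dtype=float).reshape(6, 2)
inputs, targets = shift_pairs(tokens)
mask = causal_mask(len(inputs))
```

The lower-triangular mask is what makes the prediction at each position depend only on the prefix that precedes it.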


cs.CV / 172 / 2605.17571

Stable Routing for Mixture-of-Experts in Class-Incremental Learning

增量学习中的专家混合模型的稳定路由

Guo, Zirui, Cheng, Quan, Zhou, Da-Wei, Zhang, Lijun

Abstract

Class-incremental learning (CIL) requires models to learn new classes sequentially while preserving prior knowledge. Recently, approaches that combine pre-trained models with mixture-of-experts (MoE) have received increasing attention in CIL: they typically expand experts during learning and employ a router to assign weights across experts. However, existing MoE methods often overlook routing drift induced by expert expansion. Once new experts are introduced, the router may reassign samples from earlier classes to newly added experts, thereby perturbing previously established expert compositions and causing interference even when old experts remain frozen. We argue that expandable MoE in CIL requires two complementary properties: stable old-class routing for knowledge preservation and sufficient capacity utilization for new-class adaptation. To this end, we propose Stable Routing for MoE (StaR-MoE), a routing-level framework for expandable MoE in CIL. By incorporating sensitivity-aware routing alignment, StaR-MoE aligns current old-class routing behavior with historical routing distributions through sensitivity-guided constraints. Complementarily, StaR-MoE introduces asymmetric capacity regularization to encourage effective utilization of the expanded expert pool without compromising class-specific routing specialization. Extensive experiments across four standard CIL benchmarks demonstrate that StaR-MoE consistently improves both average and last accuracy over state-of-the-art methods, highlighting the importance of stable routing.

Chinese Translation

增量学习（CIL）要求模型顺序学习新类别，同时保持先前知识。最近，将预训练模型与专家混合模型（MoE）相结合的方法在增量学习中受到越来越多的关注：它们通常在学习过程中扩展专家，并使用路由器在专家之间分配权重。然而，现有的MoE方法往往忽视了由专家扩展引起的路由漂移。一旦引入新专家，路由器可能会将早期类别的样本重新分配给新添加的专家，从而扰乱先前建立的专家组合，并导致干扰，即使旧专家保持不变。我们认为，在增量学习中，扩展的MoE需要两个互补的特性：稳定的旧类别路由以保持知识和足够的容量利用以适应新类别。为此，我们提出了MoE的稳定路由（StaR-MoE），这是一个用于增量学习中可扩展MoE的路由级框架。通过结合敏感性感知的路由对齐，StaR-MoE通过敏感性引导的约束将当前的旧类别路由行为与历史路由分布对齐。作为补充，StaR-MoE引入了不对称容量正则化，以鼓励有效利用扩展的专家池，而不损害类别特定的路由专业化。在四个标准增量学习基准上的广泛实验表明，StaR-MoE在平均准确率和最后准确率上始终优于最先进的方法，突显了稳定路由的重要性。
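
Editor's sketch: the abstract defines routing drift as old-class samples being reassigned to newly added experts. One way to quantify this, assumed here purely for illustration, is the KL divergence between the routing distribution recorded before expansion and the current one. The function `routing_drift` and the toy logits are hypothetical; the paper's sensitivity-guided constraint is not reproduced.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def routing_drift(old_probs, new_probs, eps=1e-12):
    """Mean KL(old || new); the old distribution has zero mass on new experts."""
    n_old = old_probs.shape[-1]
    new_on_old = new_probs[..., :n_old]   # current weights on pre-existing experts
    kl = np.sum(old_probs * (np.log(old_probs + eps) - np.log(new_on_old + eps)), axis=-1)
    return float(kl.mean())

# One old-class sample, two original experts, one expert added later.
old = softmax(np.array([[2.0, 0.0]]))
stable_new = softmax(np.array([[2.0, 0.0, -5.0]]))   # new expert mostly ignored
drifted_new = softmax(np.array([[2.0, 0.0, 4.0]]))   # new expert takes over
stable = routing_drift(old, stable_new)
drifted = routing_drift(old, drifted_new)
```

The drift stays near zero when the new expert receives almost no weight and grows sharply once it captures most of the routing mass for the old-class sample.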


cs.CV / 173 / 2605.17573

Deepfake Detection in Social Media: A Temporal Artifact Analysis Using 3D Convolutional Neural Networks

社交媒体中的深伪检测：基于3D卷积神经网络的时间伪影分析

Rashidi, Mohammadreza, Ali, Raja Hashim, Rahman, Sami Ur

Abstract

Synthetic facial videos have proliferated across social media faster than platform moderation can respond, raising the cost of disinformation and identity-based attacks. Frame-level deepfake detectors degrade sharply as generator quality increases; high-quality 128x128 GAN output cuts spatial-only accuracy by five percentage points while leaving temporal inconsistencies largely intact. We address this gap with a 3D Convolutional Neural Network detector based on R3D-18, trained with a composite loss that combines binary cross-entropy with a temporal-consistency regularizer. The model processes 16-frame clips from the DeepfakeTIMIT dataset and is initialized from Kinetics-400 action-recognition weights. We report 92.8% accuracy on intra-dataset evaluation at 128x128 resolution; cross-dataset transfer to FaceForensics++ without fine-tuning reaches 76.4%, rising after minimal fine-tuning. Ablation studies show that transfer learning contributes 7.2 percentage points and face tracking adds 3.5 points, while temporal consistency regularization provides additional gains on high-quality fakes. The results establish that temporal artifacts generalize more broadly than spatial ones, providing a detection signal that survives social-media re-encoding.

Chinese Translation

合成面部视频在社交媒体上的传播速度超过了平台的审核响应能力，增加了虚假信息和基于身份的攻击的成本。随着生成器质量的提高，帧级深伪检测器的性能急剧下降；高质量的128x128 GAN输出使得仅基于空间的准确率降低了五个百分点，同时时间不一致性基本保持不变。我们通过基于R3D-18的3D卷积神经网络检测器来解决这一问题，该检测器使用复合损失进行训练，结合了二元交叉熵和时间一致性正则化。该模型处理来自DeepfakeTIMIT数据集的16帧片段，并从Kinetics-400动作识别权重初始化。我们在128x128分辨率下的内部数据集评估中报告了92.8%的准确率；在不进行微调的情况下，跨数据集迁移到FaceForensics++的准确率达到76.4%，并在进行最小微调后有所提升。消融研究表明，迁移学习贡献了7.2个百分点，面部跟踪增加了3.5个百分点，而时间一致性正则化在高质量伪造视频上提供了额外的提升。结果表明，时间伪影的泛化能力优于空间伪影，提供了一种在社交媒体重新编码后仍能存活的检测信号。
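
Editor's sketch: the abstract mentions a composite loss that adds a temporal-consistency regularizer to binary cross-entropy but does not give its form. The snippet below uses a squared difference between consecutive frame features as a stand-in regularizer. That choice, the weight of 0.1, and the name `composite_loss` are assumptions made only to show how two such terms combine.

```python
import numpy as np

def composite_loss(logit, label, frame_features, weight=0.1):
    """Binary cross-entropy plus a weighted frame-to-frame smoothness penalty."""
    prob = 1.0 / (1.0 + np.exp(-logit))
    bce = -(label * np.log(prob) + (1 - label) * np.log(1 - prob))
    frame_deltas = np.diff(frame_features, axis=0)      # change between consecutive frames
    temporal_penalty = float(np.mean(frame_deltas ** 2))
    return float(bce + weight * temporal_penalty)

# 16-frame clips with 8-dim features per frame (toy values).
steady = np.ones((16, 8))
rng = np.random.default_rng(0)
jittery = steady + rng.standard_normal((16, 8))
loss_steady = composite_loss(0.0, 1, steady)
loss_jittery = composite_loss(0.0, 1, jittery)
```

With constant features the penalty vanishes and the loss reduces to the cross-entropy term alone, which is ln 2 for a logit of zero.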


cs.CV / 174 / 2605.17577

TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models

TAME：通过专家混合模型进行视觉-语言模型的测试时对抗提示调优

Wang, Xin, Wang, Yixu, Zhang, Jiaming, Wang, Ruofan, Yu, Jiaqi, Chen, Kai, Chen, Jingjing, Ma, Xingjun, Jiang, Yu-Gang

Abstract

Large-scale pre-trained Vision-Language models (VLMs), such as CLIP, exhibit strong zero-shot generalization, yet remain highly vulnerable to imperceptible adversarial perturbations, raising serious safety concerns for open-world deployment. To enhance robustness without requiring downstream task-specific retraining, we propose TAME, a novel test-time defense. Building upon our prior Test-Time Adversarial Prompt Tuning (TAPT), TAME introduces an architectural reformulation by replacing TAPT's single adaptive prompt with an input-conditioned Mixture-of-Experts (MoE) framework, enabling more expressive and adaptive defense. Specifically, TAME maintains a bank of learnable expert prompts and employs an input-dependent routing mechanism to aggregate a customized prompt mixture for each unlabeled test sample at inference time. This test-time defense mechanism is driven by three unsupervised objectives: (1) multi-view prediction entropy minimization, (2) layer-wise alignment of visual token statistics to precomputed clean and adversarial reference distributions, and (3) MoE regularization for balanced expert utilization and prompt diversity. We evaluated TAME on 11 benchmark datasets, including ImageNet and 10 additional zero-shot datasets. The results show that TAME improves the zero-shot adversarial robustness of the original CLIP by at least 49.1% under AutoAttack while largely preserving generalization on clean samples. TAME also consistently outperforms existing adversarial prompt tuning methods across multiple prompt designs, yielding an average robustness gain of at least 30.2%.

Chinese Translation

大规模预训练的视觉-语言模型（VLMs），如CLIP，展现出强大的零-shot泛化能力，但对不可察觉的对抗扰动高度脆弱，这对开放世界部署提出了严重的安全隐患。为了增强模型的鲁棒性而无需针对下游任务进行特定的再训练，我们提出了TAME，一种新颖的测试时防御方法。在我们之前的测试时对抗提示调优（TAPT）的基础上，TAME通过将TAPT的单一自适应提示替换为输入条件的专家混合（Mixture-of-Experts, MoE）框架，引入了架构上的重构，从而实现了更具表现力和适应性的防御。具体而言，TAME维护一个可学习的专家提示库，并采用输入依赖的路由机制，在推理时为每个未标记的测试样本聚合定制的提示混合。该测试时防御机制由三个无监督目标驱动：（1）多视图预测熵最小化，（2）视觉标记统计与预计算的干净和对抗参考分布的层级对齐，以及（3）MoE正则化以平衡专家的利用和提示的多样性。我们在11个基准数据集上评估了TAME，包括ImageNet和另外10个零-shot数据集。结果表明，TAME在AutoAttack下提高了原始CLIP的零-shot对抗鲁棒性至少49.1%，同时在干净样本上大幅保持了泛化能力。TAME还在多个提示设计中始终优于现有的对抗提示调优方法，平均鲁棒性提升至少30.2%。
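
Editor's sketch: two ingredients named in the abstract are an input-dependent mixture over expert prompts and multi-view prediction entropy minimization. The snippet below illustrates both in isolation. It assumes the entropy is taken over the prediction averaged across augmented views and that the mixture is a softmax-weighted sum; neither detail is confirmed by the abstract, and all names are hypothetical.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def marginal_entropy(view_logits):
    """Entropy of the class distribution averaged over augmented views."""
    mean_probs = softmax(np.asarray(view_logits, dtype=float)).mean(axis=0)
    return float(-(mean_probs * np.log(mean_probs + 1e-12)).sum())

def mix_prompts(gate_logits, expert_prompts):
    """Softmax-weighted combination of expert prompt vectors."""
    weights = softmax(np.asarray(gate_logits, dtype=float))
    return weights @ np.asarray(expert_prompts, dtype=float)

agree = marginal_entropy([[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
disagree = marginal_entropy([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
mixed = mix_prompts([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Views that agree on a confident prediction give low entropy, while conflicting views give high entropy, so minimizing this quantity pushes the prompt toward predictions that are consistent across augmentations.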


cs.CV / 175 / 2605.17583

AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech

AgentSteerTTS：一个用于复合指令文本到语音的多智能体闭环框架

Kang, Bin, Wen, Shaoguo, Fan, Yang, Wu, Shunlong, Wang, Junjie, Li, Yulin, Zhao, Junzhi, Wang, Junle, Tian, Zhuotao

Abstract

While existing text-to-speech (TTS) models exhibit high expressiveness, fine-grained control over composite instructions remains challenging due to the structural mismatch between discrete textual intents and continuous acoustic realizations. Inspired by human cognitive decoupling, we introduce AgentSteerTTS, a multi-agent closed-loop framework designed for intent-faithful expressive control of composite instructions. First, in our framework, an adversarial disentanglement agent mitigates speaker-emotion leakage by learning separable identity and emotion-prosody subspaces with leakage-suppressing regularization. Next, a Dual-Stream Anchoring Controller grounds abstract intents using a large-scale acoustic prototype library: a Retrieval Agent selects expressive anchors, while a Synthesis Agent fuses them into continuous control vectors via gated attention. Finally, a Fast-Slow Feedback Agent refines output intensity through latent gradient correction and resolves semantic-acoustic mismatches using high-level perceptual critique. Experiments on a composite-instruction benchmark and public test sets show that AgentSteerTTS yields consistent and significant improvements to the baselines, demonstrating the effectiveness of the proposed method.

Chinese Translation

尽管现有的文本到语音（TTS）模型表现出较高的表现力，但由于离散文本意图与连续声学实现之间的结构不匹配，细粒度控制复合指令仍然具有挑战性。受到人类认知解耦的启发，我们提出了AgentSteerTTS，一个旨在实现意图忠实表达控制复合指令的多智能体闭环框架。首先，在我们的框架中，一个对抗性解耦智能体通过学习可分离的身份和情感-韵律子空间，并采用抑制泄漏的正则化，来减轻说话者情感泄漏。接下来，一个双流锚定控制器利用大规模声学原型库为抽象意图提供基础：检索智能体选择表现力锚点，而合成智能体通过门控注意力将其融合为连续控制向量。最后，一个快慢反馈智能体通过潜在梯度校正来细化输出强度，并利用高层次的感知批评解决语义-声学不匹配。对复合指令基准和公共测试集的实验表明，AgentSteerTTS在基线模型上实现了一致且显著的改进，证明了所提方法的有效性。


cs.CV / 176 / 2605.17584

VVitCutLER: Towards Unsupervised Object Detection and Segmentation in Videos

VVitCutLER：朝着无监督视频目标检测与分割的方向发展

Lu, Zhijing, Hashmi, Khurram Azeem, Stricker, Didier, Afzal, Muhammad Zeshan

Abstract

Unsupervised pixel-level video understanding remains challenging in real-world scenarios, where motion blur, occlusion, and fast object dynamics often cause temporal drift and flickering pseudo-labels. We propose VVitCutLER, an unsupervised framework for video object detection and instance segmentation, which improves the quality of pseudo-labels through temporal consistency. Our core contribution is VitCut, a temporally stable pseudo-label generator that reduces error accumulation during field degradation through cross-frame region consistency. Meanwhile, VitCut uses a distillation decoder to achieve effective instance mask prediction. Then, based on VitCut, VVitCutLER further integrates cross-frame feature aggregation to enhance video-level robustness. Extensive experiments on standard video benchmarks demonstrate that VVitCutLER significantly improves detection and segmentation performance while reducing temporal instability. These results highlight the importance of temporally consistent supervision for robust pixel-level video understanding.

Chinese Translation

在现实场景中，无监督的像素级视频理解仍然面临挑战，运动模糊、遮挡和快速物体动态常常导致时间漂移和闪烁的伪标签。我们提出了VVitCutLER，一个用于视频目标检测和实例分割的无监督框架，通过时间一致性来提高伪标签的质量。我们的核心贡献是VitCut，一个暂时稳定的伪标签生成器，通过跨帧区域一致性减少在场景退化过程中的错误累积。同时，VitCut使用蒸馏解码器实现有效的实例掩码预测。基于VitCut，VVitCutLER进一步整合了跨帧特征聚合，以增强视频级的鲁棒性。在标准视频基准上的大量实验表明，VVitCutLER显著提高了检测和分割性能，同时减少了时间不稳定性。这些结果突显了时间一致性监督在鲁棒的像素级视频理解中的重要性。


cs.CV / 177 / 2605.17588

MSIQ: Moment-based Scale-Invariant Quality Measure for Single Image Super-Resolution

MSIQ：基于矩的尺度不变质量度量用于单幅图像超分辨率

Bedratyuk, Leonid

Abstract

Assessing the quality of single image super-resolution (SISR) results remains an open methodological problem. Common full-reference metrics (PSNR, SSIM, LPIPS) do not explicitly evaluate the preservation of the geometric structure of images, which is critical for the correctness of scale-based reconstruction. In addition, they require the forced alignment of images to the same size (\textit{forced resizing}), which introduces an external interpolation error into the evaluation process. This paper proposes a diagnostic scale-invariant quality measure, MSIQ (\textit{Moment-based Scale-Invariant Quality}), based on the comparison of normalized central geometric moments of two images. MSIQ enables direct comparison of images with different spatial resolutions without resizing, is mathematically deterministic (\textit{model-free}), and has an analytical form. To provide a theoretical basis for the approach, we introduce a conceptual distinction between the ability of metrics to monotonically track degradation (\textit{tracking ability}) and their geometric selectivity (\textit{geometric specificity}). The experimental validation confirmed the stability of MSIQ under uniform scaling and, at the same time, revealed the high sensitivity of traditional metrics to the choice of interpolation method. The results show that MSIQ has pronounced geometric selectivity: the proposed measure effectively separates geometric deformations from non-geometric artifacts, in particular JPEG compression, unlike pixel-based and perceptual metrics. It is also shown that the response of MSIQ to structural perturbations remains stable across different classes of SR algorithms, including DNN models with different architectures. The proposed measure is a complementary diagnostic tool for domains where geometric fidelity has priority, in particular medical imaging and remote sensing.

Chinese Translation

评估单幅图像超分辨率（SISR）结果的质量仍然是一个开放的方法论问题。常见的全参考指标（PSNR、SSIM、LPIPS）并未明确评估图像几何结构的保留，而这对于基于尺度的重建的正确性至关重要。此外，它们要求将图像强制对齐到相同大小（\textit{forced resizing}），这在评估过程中引入了外部插值误差。本文提出了一种基于两个图像的归一化中心几何矩比较的诊断尺度不变质量度量MSIQ（\textit{Moment-based Scale-Invariant Quality}）。MSIQ能够在不进行调整大小的情况下直接比较不同空间分辨率的图像，具有数学确定性（\textit{model-free}）且具有解析形式。为了为该方法提供理论基础，我们引入了指标在单调追踪退化能力（\textit{tracking ability}）与其几何选择性（\textit{geometric specificity}）之间的概念区分。实验验证确认了MSIQ在均匀缩放下的稳定性，同时揭示了传统指标对插值方法选择的高度敏感性。结果表明，MSIQ具有明显的几何选择性：该度量有效地区分几何变形与非几何伪影，特别是JPEG压缩，而不像基于像素和感知的指标。还表明，MSIQ对结构扰动的响应在不同类别的超分辨率算法中保持稳定，包括具有不同架构的深度神经网络（DNN）模型。所提出的度量是一个补充的诊断工具，适用于几何保真度优先的领域，特别是医学成像和遥感。
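
Editor's sketch: the abstract builds MSIQ on normalized central geometric moments, which can be compared across resolutions without resizing. The snippet below computes the classical normalized central moment eta_pq = mu_pq / mu_00^(1 + (p + q) / 2) and checks that it barely changes under 2x nearest-neighbour upscaling. It is not the MSIQ measure itself; how the paper combines moments into a score is not stated in the abstract, and the toy image is an assumption.

```python
import numpy as np

def normalized_central_moment(image, p, q):
    """eta_pq = mu_pq / mu_00 ** (1 + (p + q) / 2), with x along columns."""
    image = np.asarray(image, dtype=float)
    yy, xx = np.indices(image.shape)
    m00 = image.sum()
    x_bar = (xx * image).sum() / m00      # intensity centroid
    y_bar = (yy * image).sum() / m00
    mu_pq = (((xx - x_bar) ** p) * ((yy - y_bar) ** q) * image).sum()
    return float(mu_pq / m00 ** (1 + (p + q) / 2))

# Toy image (assumption): an anisotropic Gaussian blob, then a 2x upscaled copy.
yy, xx = np.indices((32, 32))
blob = np.exp(-((xx - 14) ** 2 / (2 * 4.0 ** 2) + (yy - 17) ** 2 / (2 * 3.0 ** 2)))
blob_2x = np.kron(blob, np.ones((2, 2)))  # nearest-neighbour 2x upscaling
eta_small = normalized_central_moment(blob, 2, 0)
eta_large = normalized_central_moment(blob_2x, 2, 0)
```

The two values differ only by a small discretization term, which is why a moment-based comparison can sidestep the interpolation step that full-reference metrics require.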


cs.CV / 178 / 2605.17591

Error-Decomposed Class-Conditional Fusion for Statistically Guaranteed Hard-Category Robust Perception

基于误差分解的类别条件融合用于统计保证的困难类别鲁棒感知

Luo, Guowei, Shi, Ziqi, Xie, Zhao

Abstract

Aggregate object detection metrics inherently mask catastrophic and repeatable failures in operationally critical, long-tail minority classes. This paper formally defines this pervasive vulnerability as the Hard-Category Reliability Problem (HCRP): the fundamental architectural challenge of strictly rectifying vulnerable categories without compromising the performance boundaries of stable classes under stringent protocols. To systematically dismantle this limitation, we propose Error-Decomposed Class-Conditional Fusion (ED-CCF), an elegant decision-layer inference framework. Diverging from heuristic global post-processing, ED-CCF projects predictions into a sophisticated quad-state error taxonomy, dynamically activating calibration pathways exclusively upon rigorous empirical justification. On a highly constrained 600-image validation benchmark, isolating cz as the critical vulnerability (HCEC=0.86, BSR=0.14), our framework achieves a targeted breakthrough: it elevates cz mAP50 from 0.089343 to 0.109353 (a massive +22.4% relative surge) while flawlessly preserving the Pareto optimality of global stability (raising all mAP50 from 0.581925 to 0.584864). Backed by exhaustive validation across 50 paired subset trials demonstrating an overwhelming 96% win rate and strict Bonferroni-corrected Wilcoxon significance (p<0.05), this work fundamentally redefines output-level fusion as an auditable, statistically guaranteed paradigm for safety-critical visual perception.

Chinese Translation

聚合对象检测指标本质上掩盖了在操作关键的长尾少数类别中发生的灾难性和可重复性失败。本文正式定义了这一普遍脆弱性为困难类别可靠性问题（Hard-Category Reliability Problem, HCRP）：在严格协议下，严格纠正脆弱类别而不妥协稳定类别性能边界的基本架构挑战。为了系统性地拆解这一限制，我们提出了基于误差分解的类别条件融合（Error-Decomposed Class-Conditional Fusion, ED-CCF），这是一种优雅的决策层推理框架。与启发式全局后处理不同，ED-CCF将预测投影到一个复杂的四状态误差分类法中，仅在严格的实证证明下动态激活校准路径。在一个高度受限的600图像验证基准上，孤立cz作为关键脆弱性（HCEC=0.86, BSR=0.14），我们的框架实现了目标突破：将cz的mAP50从0.089343提升至0.109353（相对增长22.4%），同时完美保持全球稳定性的帕累托最优性（将所有mAP50从0.581925提升至0.584864）。通过在50对子集试验中进行的详尽验证，展示了压倒性的96%胜率和严格的Bonferroni校正Wilcoxon显著性（p<0.05），这项工作从根本上重新定义了输出级融合，成为一种可审计的、统计保证的安全关键视觉感知范式。

cs.CV / 179 / 2605.17610

SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening

SafeLens：具有快速与慢速筛选的高效视频护栏

Nahin, Shahriar Kabir, Askari, Hadi, Chen, Muhao, Chhabra, Anshuman

Abstract

The rapid growth of online video platforms and AI-generated content has made reliable video guardrails a key challenge for safety and real-world deployment. While most videos can be screened through fast pattern recognition, a small subset requires deeper reasoning over temporally complex content and nuanced policy constraints. Existing approaches typically rely on large vision-language models applied uniformly across all inputs, resulting in high inference costs and inefficient allocation of computation. We propose SafeLens, a video guardrail framework that introduces a fast-and-slow inference architecture for efficient and accurate content moderation with variable computational cost across inputs. Additionally, we construct a high-quality dataset by applying influence-guided filtering to the SafeWatch Dataset, retaining only 2.4% of the original data. To further address limitations of training-time scaling, we enable test-time reasoning by augmenting the filtered data with structured Chain-of-Thought traces. Across real-world and AI-generated video benchmarks, SafeLens achieves state-of-the-art performance, outperforming strong open-source video guardrails (e.g., SafeWatch-8B, OmniGuard-7B) and closed-source models (e.g., GPT-5.4, Gemini-3.1-pro) while significantly reducing inference cost, demonstrating that efficient design can be more effective than scaling data or model size alone.

Chinese Translation

在线视频平台和人工智能生成内容的快速增长使得可靠的视频护栏成为安全和实际部署的关键挑战。虽然大多数视频可以通过快速模式识别进行筛选，但少数视频需要对时间复杂内容和细致的政策约束进行更深入的推理。现有方法通常依赖于在所有输入上均匀应用的大型视觉-语言模型，导致高推理成本和计算资源的低效分配。我们提出了SafeLens，一个视频护栏框架，引入了一种快速与慢速的推理架构，以实现高效且准确的内容审核，并在不同输入之间实现可变的计算成本。此外，我们通过对SafeWatch数据集应用影响引导过滤构建了一个高质量的数据集，仅保留了原始数据的2.4%。为了进一步解决训练时间扩展的局限性，我们通过用结构化的思维链（Chain-of-Thought）痕迹增强过滤后的数据，启用了测试时推理。在真实世界和人工智能生成的视频基准测试中，SafeLens实现了最先进的性能，超越了强大的开源视频护栏（例如，SafeWatch-8B，OmniGuard-7B）和闭源模型（例如，GPT-5.4，Gemini-3.1-pro），同时显著降低了推理成本，证明了高效设计比单纯扩展数据或模型规模更为有效。
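
The fast-and-slow idea in the abstract above can be pictured as a two-stage cascade: a cheap screener handles every input, and only uncertain cases are escalated to an expensive reasoner. The abstract does not say how SafeLens decides when to escalate, so the confidence threshold, the function names, and the toy models below are assumptions made purely for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Verdict = Tuple[str, float]  # (label, confidence in [0, 1])

def screen(
    videos: Sequence[str],
    fast_model: Callable[[str], Verdict],
    slow_model: Callable[[str], Verdict],
    confidence_threshold: float = 0.9,   # assumed escalation rule
) -> Tuple[List[str], int]:
    """Return final labels and the number of inputs that needed the slow path."""
    labels, slow_calls = [], 0
    for video in videos:
        label, confidence = fast_model(video)
        if confidence < confidence_threshold:
            # Fast screener is unsure: pay for deliberate reasoning.
            label, confidence = slow_model(video)
            slow_calls += 1
        labels.append(label)
    return labels, slow_calls

# Toy stand-ins (hypothetical); real models would consume video frames.
def toy_fast(video: str) -> Verdict:
    return ("unsafe", 0.55) if "ambiguous" in video else ("safe", 0.99)

def toy_slow(video: str) -> Verdict:
    return ("safe", 0.95)

labels, slow_calls = screen(["cat.mp4", "ambiguous_clip.mp4", "dog.mp4"], toy_fast, toy_slow)
```

Compute cost then varies per input: in this toy run only one of three videos reaches the slow model.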

cs.CV / 180 / 2605.17620

SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing

SynVA：一种用于血管生成和动脉瘤编辑的模块化工具包

Finck, Marten J., Koser, Niklas C., Mahfuz, Sarker M., Jahangir, Tameem, Wilhelm, Jon E., Behme, Daniel, Larsen, Naomi, Palubicki, Wojtek, Saalfeld, Sylvia, Pirk, Sören

Abstract

Intracranial aneurysms (IAs), characterized by unpredictable growth and risk of rupture, are a major cause of stroke and can lead to life-threatening hemorrhages with high mortality and long-term disability. With aging populations, the incidence and overall burden of cerebrovascular diseases are expected to increase, highlighting the need for scalable approaches to analyze complex medical data and improve population-level understanding of these conditions. While digital twins and deep learning offer promising avenues for improving diagnosis, prognosis, and treatment, their effectiveness is limited by the scarcity of large-scale, high-quality medical data and corresponding labels. We present Synthetic VAsculature (SynVA), a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. SynVA combines novel flow-matching-based methods for generating healthy vessel meshes with learning-based approaches for anatomy-conditioned aneurysm mesh generation - aneurysms are computed from pre-existing vascular geometries rather than being generated in isolation. In addition, we introduce the SynVA procedural model for vascular and aneurysm synthesis based solely on physiological principles and statistical priors, which enables the generation of large-scale datasets (e.g., for the training of mesh-based generative models). To this end, we release a dataset of 50,000 fully labeled mesh samples for a variety of downstream vision tasks, such as semantic segmentation. Extensive quantitative and qualitative evaluations demonstrate that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Specifically, our experiments indicate that some methods produce aneurysm shapes more aligned with expert human perception while others perform better on quantitative similarity metrics with reconstructions of real aneurysms.

Chinese Translation

颅内动脉瘤（IAs）以其不可预测的生长和破裂风险为特征，是中风的主要原因，可能导致危及生命的出血，伴随高死亡率和长期残疾。随着人口老龄化，脑血管疾病的发生率和整体负担预计将增加，这突显了分析复杂医学数据和改善对这些疾病的群体级理解的可扩展方法的必要性。尽管数字双胞胎和深度学习为改善诊断、预后和治疗提供了有希望的途径，但其有效性受到大规模高质量医学数据及相应标签稀缺的限制。我们提出了合成血管（Synthetic VAsculature，SynVA），这是一种用于血管网生成和解剖一致的动脉瘤合成的模块化工具包。SynVA结合了基于流匹配的新颖方法来生成健康的血管网，并结合基于学习的方法进行解剖条件下的动脉瘤网生成——动脉瘤是从现有的血管几何形状计算得出的，而不是孤立生成。此外，我们引入了SynVA程序模型，用于基于生理原理和统计先验进行血管和动脉瘤合成，这使得生成大规模数据集成为可能（例如，用于基于网格的生成模型的训练）。为此，我们发布了一个包含50,000个完全标注的网格样本的数据集，适用于各种下游视觉任务，如语义分割。广泛的定量和定性评估表明，SynVA生成了逼真的血管几何形状和解剖上合理的动脉瘤。具体而言，我们的实验表明，一些方法生成的动脉瘤形状更符合专家的人类感知，而另一些方法在与真实动脉瘤重建的定量相似性指标上表现更佳。

cs.CV / 181 / 2605.17624

Multi-task learning on partially labeled datasets via invariant/equivariant semi-supervised learning

通过不变/等变半监督学习在部分标记数据集上进行多任务学习

Rabadán, Miquel Martí i, Pieropan, Alessandro, Azizpour, Hossein, Maki, Atsuto

Abstract

We investigate the potential of invariant and equivariant semi-supervised learning for addressing the challenges of training multi-task models on partially labeled datasets with differently structured output tasks. Specifically, we use the popular FixMatch method for invariant semi-supervised learning and its equivariant extension Dense FixMatch. We evaluate their performance on the Cityscapes and BDD100K datasets in the context of the prevalent object detection and semantic segmentation tasks in computer vision. We consider varying sizes of the subsets annotated for each task and different overlaps among them. Our results for both invariant and equivariant semi-supervised learning outperform supervised baselines in most situations, with the most significant improvements observed when fewer labeled samples are available for a task and generally better results for the latter approach. Our study suggests that invariant/equivariant learning is a promising general direction for multi-task learning from limited labeled data.

Chinese Translation

我们研究了不变和等变半监督学习在解决在部分标记数据集上训练多任务模型时面临的挑战的潜力，这些数据集具有不同结构的输出任务。具体而言，我们使用流行的 FixMatch 方法进行不变半监督学习及其等变扩展 Dense FixMatch。我们在 Cityscapes 和 BDD100K 数据集上评估它们在计算机视觉中普遍存在的目标检测和语义分割任务中的表现。我们考虑了为每个任务标注的子集的不同大小以及它们之间的不同重叠情况。我们的结果表明，在大多数情况下，不变和等变半监督学习的表现优于监督基线，尤其是在任务可用的标记样本较少时，且后者方法通常表现更好。我们的研究表明，不变/等变学习是从有限标记数据进行多任务学习的一个有前景的总体方向。

cs.CV / 182 / 2605.17630

SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation

SegRAG：无训练的检索增强语义分割

Boudiaf, Abderrahmene, Hussain, Irfan, Javed, Sajid

Abstract

Open-vocabulary segmentation models such as SAM3 achieve strong performance through concept-level text prompting, yet degrade when the target class is visually underrepresented in pretraining data or when its appearance departs from canonical depictions. Text prompts provide no spatial signal to resolve such ambiguity. We present SegRAG, a training-free retrieval-augmented segmentation framework that grounds SAM3 with spatially precise, class-specific point prompts derived from a curated DINOv3 feature bank. During an offline stage, patch-level descriptors are extracted from annotated reference images using a frozen DINOv3 ViT-L/16 backbone and filtered by Intra-Class Cohesion Distillation (ICCD), retaining only prototypes that reliably retrieve within-class foreground. At inference, Topographic Similarity Grounding (TSG) computes a cosine-similarity landscape between the query image and retrieved prototypes, identifies spatially coherent high-confidence regions via connected-component analysis, and extracts peak locations through non-maximum suppression. These point prompts are delivered to SAM3 alongside the class-name text in a single joint grounding pass, enabling the mask decoder to resolve semantic intent and spatial evidence together. SegRAG requires no task-specific training and no synthetic data. On four open-vocabulary benchmarks it achieves consistent gains over the SAM3 text-only baseline, with improvements of up to +3.92 mIoU on LVIS. On AgML agricultural benchmarks representing a zero-shot domain transfer setting, it raises mean IoU from 25.27 to 59.24 (+33.97) and recovers individual classes from zero to over 95 mIoU. Ablation studies confirm that ICCD, TSG, and joint prompting each contribute independently and compound when combined. Code is available at https://github.com/boudiafA/SegRAG.

Chinese Translation

开放词汇分割模型如SAM3通过概念级文本提示实现了强大的性能，但在目标类别在预训练数据中视觉上表现不足或其外观偏离典型描述时性能会下降。文本提示未能提供空间信号以解决这种模糊性。我们提出了SegRAG，一种无训练的检索增强分割框架，它通过来自策划的DINOv3特征库的空间精确、类别特定的点提示来增强SAM3。在离线阶段，使用冻结的DINOv3 ViT-L/16主干从带注释的参考图像中提取补丁级描述符，并通过类内一致性蒸馏（Intra-Class Cohesion Distillation, ICCD）进行过滤，仅保留能够可靠检索类内前景的原型。在推理阶段，地形相似性定位（Topographic Similarity Grounding, TSG）计算查询图像与检索到的原型之间的余弦相似性景观，通过连通组件分析识别空间一致的高置信度区域，并通过非极大值抑制提取峰值位置。这些点提示与类别名称文本一起以单次联合定位的方式传递给SAM3，使得掩膜解码器能够同时解析语义意图和空间证据。SegRAG不需要特定任务的训练和合成数据。在四个开放词汇基准测试中，它在SAM3仅文本基线的基础上实现了一致的提升，在LVIS上提高了多达+3.92 mIoU。在代表零样本领域迁移设置的AgML农业基准测试中，它将平均IoU从25.27提高到59.24（+33.97），并使个别类别的mIoU从零恢复到超过95。消融研究确认ICCD、TSG和联合提示各自独立贡献，并在组合时相互增强。代码可在https://github.com/boudiafA/SegRAG获取。
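
The inference step described above (cosine-similarity landscape, connected-component analysis, peak extraction) can be sketched in a few lines of NumPy. This is a simplified illustration, not the SegRAG code: the threshold, minimum region area, and function names are assumptions, and the non-maximum suppression mentioned in the abstract is replaced here by taking a single peak per connected region.

```python
from collections import deque
import numpy as np

def similarity_map(patch_feats, prototypes):
    """Max cosine similarity of each query patch (H, W, D) to any prototype (K, D)."""
    f = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    return (f @ p.T).max(axis=-1)

def point_prompts(sim, threshold=0.8, min_area=2):
    """One (row, col) peak per 4-connected region of patches above the threshold."""
    h, w = sim.shape
    mask = sim >= threshold
    seen = np.zeros((h, w), dtype=bool)
    points = []
    for r0 in range(h):
        for c0 in range(w):
            if not mask[r0, c0] or seen[r0, c0]:
                continue
            queue, region = deque([(r0, c0)]), []
            seen[r0, c0] = True
            while queue:                      # breadth-first flood fill
                r, c = queue.popleft()
                region.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w and mask[rr, cc] and not seen[rr, cc]:
                        seen[rr, cc] = True
                        queue.append((rr, cc))
            if len(region) >= min_area:       # drop isolated, likely spurious patches
                points.append(max(region, key=lambda rc: sim[rc]))
    return points

# Toy 6x6 feature grid: background along axis 0, target class along axis 1.
feats = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (6, 6, 1))
feats[1:3, 1:3] = [0.1, 1.0, 0.0, 0.0]
feats[1, 1] = [0.0, 1.0, 0.0, 0.0]            # peak of the first region
feats[4, 3:5] = [0.2, 1.0, 0.0, 0.0]
feats[4, 4] = [0.0, 1.0, 0.0, 0.0]            # peak of the second region
feats[0, 5] = [0.0, 1.0, 0.0, 0.0]            # isolated patch, filtered by min_area
prototypes = np.array([[0.0, 1.0, 0.0, 0.0]])

prompts = point_prompts(similarity_map(feats, prototypes))
```

The resulting coordinates are in patch units; a real pipeline would still need to map them to pixel coordinates before passing them to a promptable segmenter.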

cs.CV / 183 / 2605.17633

SparseSAM: Structured Sparsification of Activations in Segment Anything Models

SparseSAM：在Segment Anything模型中对激活进行结构化稀疏化

Tran, Hoai-Chau, Nguyen, Chi H., Nguyen, Duy M. H., Niepert, Mathias, Lai, Fan, Doan, Khoa D.

Abstract

The Segment Anything Model (SAM) achieves strong open-vocabulary segmentation, but its ViT-based image encoders dominate inference latency and memory. Existing activation compression methods, such as token merging, reduce the token length to process, yet introduce non-trivial runtime overhead and encounter catastrophic quality drop under high compression. Other methods applying Sparse Attention focus on attention alone, leaving the MLP fully dense and capping achievable speedup. We propose SparseSAM, a (i) training-free structured sparsification framework that jointly accelerates attention and MLP layers while preserving token identity. SparseSAM introduces (ii) Stripe-Sort Attention, which uses a deterministic Z-order permutation to transform dense attention into static hardware-friendly sparse patterns, eliminating dynamic masking overhead. SparseSAM further introduces a (iii) Residual-Consistency MLP that routes only informative tokens through the MLP while propagating remaining tokens through the residual pathway. Across four segmentation benchmarks, SparseSAM loses only 0.004 mIoU at a 0.4 density and 0.021 mIoU at 0.3, a 2.10x reduction in accuracy loss versus token merging advances, while achieving 2x faster inference and 2.8x memory reduction.

Chinese Translation

Segment Anything模型（SAM）实现了强大的开放词汇分割，但其基于ViT的图像编码器在推理延迟和内存方面占据主导地位。现有的激活压缩方法，如令牌合并，减少了处理的令牌长度，但引入了非平凡的运行时开销，并在高压缩下遇到了灾难性的质量下降。其他应用稀疏注意力的方法仅关注注意力，导致多层感知器（MLP）保持完全稠密，从而限制了可实现的加速。我们提出了SparseSAM，这是一种（i）无训练的结构化稀疏化框架，能够在保持令牌身份的同时加速注意力和MLP层。SparseSAM引入了（ii）条带排序注意力（Stripe-Sort Attention），它使用确定性的Z-order排列将稠密注意力转换为静态硬件友好的稀疏模式，从而消除动态掩蔽开销。SparseSAM进一步引入了（iii）残差一致性MLP（Residual-Consistency MLP），该MLP仅通过MLP路由信息丰富的令牌，同时通过残差路径传播其余令牌。在四个分割基准测试中，SparseSAM在0.4密度下仅损失0.004 mIoU，在0.3密度下损失0.021 mIoU，相比于令牌合并方法减少了2.10倍的准确性损失，同时实现了2倍的推理速度和2.8倍的内存减少。
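
The deterministic Z-order permutation mentioned above reorders a 2D token grid so that spatially neighbouring tokens end up close together in the 1D sequence. The sketch below shows only the standard Morton (Z-order) index computation; how Stripe-Sort Attention turns this ordering into sparse attention patterns is not described in the abstract, and the function names and bit-interleaving convention here are assumptions.

```python
def morton_code(row: int, col: int, bits: int = 16) -> int:
    """Interleave the bits of col (even positions) and row (odd positions)."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code

def z_order_permutation(height: int, width: int) -> list:
    """Row-major token indices, sorted by their Morton code."""
    tokens = [(r, c) for r in range(height) for c in range(width)]
    return sorted(range(len(tokens)), key=lambda i: morton_code(*tokens[i]))

def inverse_permutation(perm: list) -> list:
    """Permutation that restores the original row-major order."""
    inv = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inv[old_pos] = new_pos
    return inv

perm = z_order_permutation(4, 4)
inv = inverse_permutation(perm)
```

Because the permutation depends only on the grid size, it can be computed once and reused for every image, which fits the abstract's point about avoiding dynamic masking overhead.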

cs.CV / 184 / 2605.17638

TouchMap-OR: Multi-View 3D Mapping of Hand-Surface Contacts

TouchMap-OR：手部与表面接触的多视角三维映射

Ktistakis, Sophokles, Wang, Rui, Grande, Bastian, Sax, Hugo

Abstract

Hand-surface interactions between clinicians, patients, and medical equipment play a central role in pathogen transmission during medical procedures. However, these interactions remain largely unobserved, as current infection-prevention practices rely on manual observation and cannot reconstruct detailed contact histories. In this work we formulate the problem of identity-resolved hand-surface interaction reconstruction in operating rooms and introduce TouchMap-OR, a multi-view RGB-D vision system that models clinicians, articulated hand geometry, and the semantic structure of the clinical environment to infer when and where contacts occur. The system reconstructs globally consistent multi-person 3D skeleton tracks across cameras while estimating articulated MANO hand meshes from RGB observations aligned to depth data. Multi-view hand reconstructions are fused and associated with tracked clinicians to obtain consistent left and right hand trajectories. A semantic 3D model of the operating room is built from multi-view segmentation and depth fusion, enabling reconstructed hand trajectories to be mapped to specific surfaces, including medical equipment, movable objects, and patient body sites. Temporal hand-surface proximity is used to infer contact episodes describing which clinician touched which surface and when. We evaluate TouchMap-OR on recordings from three real anesthesia inductions with manually annotated contact events. TouchMap-OR achieves 0.75 binary contact F1, outperforming tracking-based baselines while maintaining comparable multi-person tracking accuracy and achieving 0.96 identity attribution accuracy.

Chinese Translation

临床医生、患者与医疗设备之间的手部与表面交互在医疗程序中对病原体传播起着核心作用。然而，这些交互仍然在很大程度上未被观察到，因为当前的感染预防措施依赖于人工观察，无法重建详细的接触历史。在本研究中，我们提出了在手术室中解决身份分辨的手部与表面交互重建问题，并引入了TouchMap-OR，一个多视角RGB-D视觉系统，该系统建模临床医生、关节手部几何形状以及临床环境的语义结构，以推断接触发生的时间和地点。该系统在摄像机之间重建全局一致的多人的三维骨架轨迹，同时从与深度数据对齐的RGB观测中估计关节MANO手网格。多视角手部重建被融合并与跟踪的临床医生关联，以获得一致的左右手轨迹。通过多视角分割和深度融合构建手术室的语义三维模型，使得重建的手部轨迹能够映射到特定表面，包括医疗设备、可移动物体和患者身体部位。时间上的手部与表面接近性被用来推断接触事件，描述哪个临床医生在何时触摸了哪个表面。我们在三次真实麻醉诱导的录音上评估了TouchMap-OR，并手动标注了接触事件。TouchMap-OR达到了0.75的二元接触F1分数，超越了基于跟踪的基线，同时保持了可比的多人人员跟踪精度，并实现了0.96的身份归属精度。
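
The last stage described above, turning temporal hand-surface proximity into contact episodes, can be illustrated with a simple thresholding pass over a per-frame distance series. The distance threshold, minimum duration, and function name below are assumptions for illustration; the abstract does not state the values or the exact rule TouchMap-OR uses.

```python
def contact_episodes(distances_m, fps, touch_threshold_m=0.02, min_duration_s=0.2):
    """Group consecutive near-surface frames into (start_s, end_s) episodes.

    distances_m: per-frame distance between one hand and one surface, in metres.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    episodes, start = [], None
    # A trailing sentinel closes any episode still open at the end.
    for i, d in enumerate(list(distances_m) + [float("inf")]):
        touching = d <= touch_threshold_m
        if touching and start is None:
            start = i
        elif not touching and start is not None:
            if (i - start) / fps >= min_duration_s:   # ignore very brief flickers
                episodes.append((start / fps, i / fps))
            start = None
    return episodes

# Toy 10 fps trace: a 3-frame touch, a 1-frame flicker, then a 2-frame touch.
trace = [0.5, 0.01, 0.01, 0.01, 0.3, 0.01, 0.4, 0.015, 0.015]
episodes = contact_episodes(trace, fps=10)
```

Running this per (clinician hand, surface) pair would yield the "who touched what, and when" records the abstract refers to.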

cs.CV / 185 / 2605.17668

Deep learning-based compression of giga-resolution whole slide images

基于深度学习的千兆分辨率全幻灯片图像压缩

Høibø, Maren, Gaucher, Etienne, Reinertsen, Ingerid, Valla, Marit, Smistad, Erik

Abstract

Implementation of digital pathology leads to an increased number of whole slide images (WSIs). The large size of WSIs is challenging. Today, WSIs are compressed with codecs like JPEG resulting in several gigabytes per WSI, and large amounts of space are wasted storing glass. In this study, deep learning-based tissue segmentation for glass removal, and deep learning compression methods were explored and compared with JPEG, JPEG-2000 and JPEG-XL. Image pyramids (N=21) with intact glass, glass replaced by single-colored pixels, and glass replaced by zero-byte tiles were created and compressed with JPEG, JPEG-XL and a deep learning model. Additionally, several compression models were evaluated on a tissue patch dataset and compared with JPEG, JPEG-2000 and JPEG-XL. Removing glass reduced file sizes considerably for JPEG and JPEG-XL. Deep learning-based image compression reduced the WSI size by 43-72% compared to JPEG compression, whereas deep learning-based glass removal reduced the WSI size by 0.3-33%, and 6-62% using only single-colored pixels and removing all-glass tiles, respectively. Combining the two gave a small improvement to a 44-80% total size reduction which indicates that deep learning-based image compression is able to efficiently compress glass tiles, whereas JPEG is not. On the tissue patch dataset, the best deep learning-based compression models saved on average ~35-40% per patch compared to JPEG, while keeping an average SSIM above 0.95, whereas JPEG-XL and JPEG-2000 saved 17% and 14%, respectively while keeping an SSIM of 0.96. However, the deep learning models had higher decompression times than JPEG and JPEG-XL.

Chinese Translation

数字病理学的实施导致全幻灯片图像（WSIs）数量的增加。WSIs 的大尺寸带来了挑战。目前，WSIs 通常使用 JPEG 等编解码器进行压缩，导致每个 WSI 的大小达到数千兆字节，并且存储玻璃的空间浪费严重。本研究探讨并比较了基于深度学习的组织分割以去除玻璃和深度学习压缩方法与 JPEG、JPEG-2000 和 JPEG-XL 的效果。创建了包含完整玻璃、用单色像素替代玻璃以及用零字节瓦片替代玻璃的图像金字塔（N=21），并使用 JPEG、JPEG-XL 和深度学习模型进行压缩。此外，还在组织补丁数据集上评估了几种压缩模型，并与 JPEG、JPEG-2000 和 JPEG-XL 进行了比较。去除玻璃显著减少了 JPEG 和 JPEG-XL 的文件大小。与 JPEG 压缩相比，基于深度学习的图像压缩将 WSI 大小减少了 43-72%，而基于深度学习的去玻璃处理将 WSI 大小减少了 0.3-33%，使用单色像素和去除全玻璃瓦片时分别为 6-62%。将两者结合起来，整体大小减少幅度小幅提升至 44-80%，这表明基于深度学习的图像压缩能够有效压缩玻璃瓦片，而 JPEG 则无法实现。在组织补丁数据集上，最佳的基于深度学习的压缩模型相比 JPEG 平均节省了 ~35-40% 的补丁大小，同时保持了平均 SSIM 高于 0.95，而 JPEG-XL 和 JPEG-2000 分别节省了 17% 和 14%，同时保持 SSIM 为 0.96。然而，深度学习模型的解压缩时间高于 JPEG 和 JPEG-XL。

cs.CV / 186 / 2605.17673

A simple approach for biometrics: Finger-knuckle prints recognition based on a Sobel filter and similarity measures

一种简单的生物识别方法：基于索贝尔滤波器和相似性度量的指关节指纹识别

Rodrigues, E. O., Porcino, T. M., Conci, Aura, Silva, Aristofanes C.

Abstract

The objective of this work is to propose a novel methodology for recognizing finger-knuckle prints, which are essentially digital photos of the finger-knuckle region. We employ very simple visual computing concepts, such as a filter based on the Sobel operator for finding edges and a simple noise reduction algorithm. These operations are exceptionally fast and produce binary images, which are very efficient to process and to store. Alongside this preprocessing, several similarity measures were also considered and evaluated for the task. After preprocessing, an input finger image is compared with every finger image in the dataset, one by one. We have obtained up to 17.02% of successful recognitions (true positive rate) with a large dataset.

Chinese Translation

本研究的目的是提出一种新颖的指关节指纹识别方法，该方法本质上是指关节区域的数字照片。我们采用了非常简单的视觉计算概念，例如基于索贝尔算子的边缘检测滤波器和简单的噪声减少算法。这些操作速度极快，并生成二值图像，便于处理和存储。此外，在这一预处理过程中，我们还考虑并评估了一些相似性度量用于该任务。在对输入指纹进行预处理后，它将逐一与数据集中所有指纹图像进行比较。我们在一个大型数据集中获得了高达17.02%的成功识别率（真正率）。
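
The pipeline described above (Sobel edge filter, binarization, then a similarity measure between binary images) can be sketched with plain NumPy. The abstract does not say which similarity measures or thresholds were used, so the Jaccard index, the threshold value, and the function names below are assumptions; the noise reduction step is omitted.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Slide a 3x3 kernel over the image (edge-replicated borders)."""
    img = np.asarray(img, dtype=float)
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dr in range(3):
        for dc in range(3):
            out += kernel[dr, dc] * padded[dr:dr + img.shape[0], dc:dc + img.shape[1]]
    return out

def binary_edges(img, threshold=1.0):
    """Binary edge map from the Sobel gradient magnitude (threshold is assumed)."""
    gx, gy = filter3x3(img, SOBEL_X), filter3x3(img, SOBEL_Y)
    return np.hypot(gx, gy) >= threshold

def jaccard(a, b):
    """Overlap of two binary maps; one possible similarity measure."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

# Toy images: a vertical step edge, and the same edge shifted by one pixel.
step_a = np.zeros((8, 8)); step_a[:, 4:] = 1.0
step_b = np.zeros((8, 8)); step_b[:, 5:] = 1.0
edges_a, edges_b = binary_edges(step_a), binary_edges(step_b)
score_same = jaccard(edges_a, edges_a)
score_shifted = jaccard(edges_a, edges_b)
```

Matching a query against a dataset then amounts to computing such a score against every stored binary map and ranking the results, as the abstract's one-by-one comparison suggests.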

cs.CV / 187 / 2605.17682

GEM: Gaussian Evolution Model for Occupancy Forecasting and Motion Planning

GEM：用于占用预测和运动规划的高斯演化模型

Chen, Cheng, Huang, Hao, Bagchi, Saurabh

Abstract

Future 3D semantic occupancy forecasting and motion planning are central to autonomous driving, as they require models to reason about how surrounding scenes evolve and how the ego vehicle should act. Existing occupancy world models commonly discretize scenes into latent embeddings, volumetric features, or quantized tokens, and forecast future states through fixed-step autoregressive generation. This limits temporal flexibility, obscures scene evolution, accumulates errors over long horizons, and poorly matches the continuous-time dynamics of real driving scenes. We propose GEM, a Gaussian Evolution Model for non-autoregressive occupancy world modeling, where driving scenes are represented as explicit continuous 4D Gaussian primitives with learned dynamics. Instead of rolling out future occupancy states step by step, GEM directly queries the Gaussian world representation at arbitrary timestamps and splats the corresponding conditional 3D Gaussians into semantic occupancy volumes. This enables efficient forecasting over the full horizon while retaining a compact and interpretable scene representation. By decoupling spatial geometry, temporal support, and primitive motion, GEM makes the predicted world easier to inspect, as each primitive's evolution can be followed continuously over time. The same representation also supports motion planning by predicting future ego trajectories from the learned Gaussian world. Extensive experiments show that GEM achieves state-of-the-art future semantic occupancy forecasting and strong motion planning performance, while providing flexible temporal querying.

Chinese Translation

未来的三维语义占用预测和运动规划是自动驾驶的核心，因为它们需要模型推理周围场景的演变以及自我车辆应如何行动。现有的占用世界模型通常将场景离散化为潜在嵌入、体积特征或量化标记，并通过固定步长的自回归生成来预测未来状态。这限制了时间灵活性，模糊了场景演变，导致长期预测中的误差累积，并且与真实驾驶场景的连续时间动态匹配不佳。我们提出了GEM，一种用于非自回归占用世界建模的高斯演化模型，其中驾驶场景被表示为具有学习动态的显式连续四维高斯原语。GEM不是逐步展开未来的占用状态，而是直接在任意时间戳查询高斯世界表示，并将相应的条件三维高斯分布投影到语义占用体积中。这使得在整个预测范围内高效预测成为可能，同时保持紧凑且易于解释的场景表示。通过解耦空间几何、时间支持和原语运动，GEM使得预测的世界更易于检查，因为每个原语的演变可以在时间上连续跟踪。相同的表示还支持运动规划，通过从学习的高斯世界中预测未来的自我轨迹。大量实验表明，GEM在未来语义占用预测和强运动规划性能方面达到了最先进的水平，同时提供灵活的时间查询。
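
Querying a 4D (x, y, z, t) Gaussian primitive at an arbitrary timestamp, as described above, can be illustrated with the standard formula for conditioning a multivariate Gaussian on one coordinate. This is a generic sketch rather than GEM's parameterization: the function name is hypothetical, and using the temporal marginal as a per-primitive weight is an assumption made here for illustration.

```python
import numpy as np

def condition_on_time(mean4, cov4, t):
    """Slice a 4D Gaussian at time t.

    Returns the conditional 3D mean and covariance, plus an (assumed) temporal
    weight proportional to the marginal density of t.
    """
    mean4, cov4 = np.asarray(mean4, dtype=float), np.asarray(cov4, dtype=float)
    mu_s, mu_t = mean4[:3], mean4[3]
    cov_ss, cov_st, var_t = cov4[:3, :3], cov4[:3, 3], cov4[3, 3]
    mean3 = mu_s + cov_st * (t - mu_t) / var_t           # centre drifts with time
    cov3 = cov_ss - np.outer(cov_st, cov_st) / var_t     # Schur complement
    weight = np.exp(-0.5 * (t - mu_t) ** 2 / var_t)      # fades away from mu_t
    return mean3, cov3, weight

# Toy primitive: unit spatial covariance, x correlated with t (moves along +x).
mean4 = np.array([0.0, 0.0, 0.0, 1.0])
cov4 = np.eye(4)
cov4[0, 3] = cov4[3, 0] = 0.5
mean3, cov3, weight = condition_on_time(mean4, cov4, t=3.0)
```

Since `t` is a continuous argument, the same primitive can be evaluated at any timestamp without stepping through intermediate states, which is the non-autoregressive property the abstract emphasizes.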

cs.CV / 188 / 2605.17685

Attention-Guided Fusion of 1D and 2D CNNs for Robust ECG-Based Biometric Recognition

基于注意力引导的1D和2D卷积神经网络融合用于稳健的心电图生物识别

Arioua, Islameddine, Benzaoui, Amir, Zeroual, Abdelhafid, Houam, Lotfi

Abstract

Electrocardiogram (ECG)-based biometric recognition has emerged as a promising solution for secure authentication and liveness detection. However, most existing methods rely on unimodal deep learning architectures that independently process either one-dimensional (1D) temporal signals or two-dimensional (2D) time-frequency representations, limiting robustness and generalization. To address this issue, this paper proposes a hybrid framework integrating 1D and 2D convolutional neural networks (CNNs) within a unified end-to-end architecture. The 1D branch extracts temporal and morphological features from raw ECG signals, while the 2D branch captures discriminative spectral information from time-frequency representations. An attention-guided fusion mechanism dynamically weights both modalities according to input characteristics, overcoming the limitations of conventional static fusion strategies. The framework was evaluated on three benchmark datasets (ECG-ID, MIT-BIH, and PTB), including healthy subjects and patients with cardiac pathologies, achieving identification accuracies of 99.56%, 100.00%, and 99.89%, respectively. To assess long-term biometric permanence, experiments were also conducted on the multi-session Heartprint dataset spanning ten years. The proposed approach achieved same-session accuracies of 98.54% (S1), 99.09% (S2), 94.93% (S3R), and 96.08% (S3L), while cross-session evaluations reached 56.33% (S1-S2) and 53.27% (S2-S3R), demonstrating the ability to capture stable biometric signatures over time. The optimal configuration combines InceptionTime for 1D processing, ResNet-34 for 2D analysis, and attention-based fusion. Ablation studies confirm that the proposed attention mechanism consistently outperforms conventional fusion approaches. Overall, the proposed framework provides a robust, scalable, and high-performance solution for ECG biometric recognition.

Chinese Translation

基于心电图（ECG）的生物识别已成为安全认证和活体检测的有前景的解决方案。然而，大多数现有方法依赖于单模态深度学习架构，独立处理一维（1D）时间信号或二维（2D）时频表示，限制了其稳健性和泛化能力。为了解决这一问题，本文提出了一种混合框架，将1D和2D卷积神经网络（CNNs）集成在统一的端到端架构中。1D分支从原始ECG信号中提取时间和形态特征，而2D分支则从时频表示中捕获判别性谱信息。注意力引导的融合机制根据输入特征动态加权两种模态，克服了传统静态融合策略的局限性。该框架在三个基准数据集（ECG-ID、MIT-BIH和PTB）上进行了评估，包括健康受试者和心脏病患者，分别达到了99.56%、100.00%和99.89%的识别准确率。为了评估长期生物识别的持久性，还对跨越十年的多会话Heartprint数据集进行了实验。所提方法在同一会话中的准确率为98.54%（S1）、99.09%（S2）、94.93%（S3R）和96.08%（S3L），而跨会话评估的准确率为56.33%（S1-S2）和53.27%（S2-S3R），展示了捕获稳定生物特征的能力。最佳配置结合了InceptionTime进行1D处理，ResNet-34进行2D分析，以及基于注意力的融合。消融研究证实，所提出的注意力机制始终优于传统融合方法。总体而言，所提出的框架为ECG生物识别提供了一种稳健、可扩展且高性能的解决方案。

cs.CV / 189 / 2605.17686

Brain-inspired spike-timing plasticity for reliable label-efficient event-camera vision

基于大脑启发的脉冲时序可塑性用于可靠的标签高效事件相机视觉

Sadoun, Mohamad Yazan, Sharif, Sarah, Banad, Yaser Mike

Abstract

Deploying event-camera object detectors is constrained by per-frame labeling requirements and GPU compute demands. This work introduces three local spike-timing-dependent plasticity (STDP) modules, including sequence, candidate, and tube-reliability modules, that operate on a single CPU thread without GPU support. On the FRED drone benchmark, the proposed framework spans three label-efficient supervision tiers. A strict zero-label detector achieves 53.8% mAP@30, approximately 26 train-derived bits achieve 76.9% mAP@30, and an STDP candidate-reliability gate achieves 78.60 +/- 0.42% mAP@30. Under acquisition-order drift, the cohort gate outperforms streaming k-means by 2.03 +/- 0.58 percentage points across 20 of 20 positive trials, while a no-drift control falsifies the effect. STDP reduces single-model variance by 6.6 times, and one trained gate matches a 44-seed ensemble bound. The gate transfers to Intel Lava with 89% top-2 agreement. On the EVUAV benchmark, a tube-level STDP layer reduces false alarms from 454 to 331e-4 at Pd >= 88%. Dense gradient-trained detectors cannot provide this combination of gradient training, dense matrix multiplication, and local plasticity-free operation by construction.

Chinese Translation

事件相机物体检测器的部署受到每帧标注要求和GPU计算需求的限制。本研究引入了三个局部脉冲时序依赖可塑性（STDP）模块，包括序列模块、候选模块和管道可靠性模块，这些模块在没有GPU支持的情况下在单个CPU线程上运行。在FRED无人机基准测试中，所提出的框架涵盖了三个标签高效的监督层级。严格的零标签检测器在mAP@30上达到了53.8%，大约26个训练衍生位达到了76.9%的mAP@30，而STDP候选可靠性门达到了78.60 +/- 0.42%的mAP@30。在获取顺序漂移的情况下，群体门在20次正试验中比流式k均值提高了2.03 +/- 0.58个百分点，而无漂移控制则否定了这一效果。STDP将单模型方差降低了6.6倍，一个训练好的门与44种种子集成界限相匹配。该门在Intel Lava上的转移达到了89%的前2名一致性。在EVUAV基准测试中，管道级STDP层将假警报从454减少到331e-4，且在Pd >= 88%时有效。密集梯度训练的检测器无法通过构造提供这种梯度训练、密集矩阵乘法和局部无可塑性操作的组合。
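
For readers unfamiliar with spike-timing-dependent plasticity, the sketch below shows the textbook pair-based STDP rule: a synapse is strengthened when the presynaptic spike precedes the postsynaptic one and weakened otherwise, using only locally available spike times. The abstract does not specify the rule used by its sequence, candidate, or tube-reliability modules, so the constants and function names here are assumptions and should not be read as the paper's method.

```python
import math

def stdp_delta(dt_ms, a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Weight change for one spike pair, with dt_ms = t_post - t_pre.

    Constants are illustrative assumptions, not values from the paper.
    """
    if dt_ms > 0:    # pre before post: potentiation
        return a_plus * math.exp(-dt_ms / tau_plus)
    if dt_ms < 0:    # post before pre: depression
        return -a_minus * math.exp(dt_ms / tau_minus)
    return 0.0

def apply_stdp(weight, pre_spikes_ms, post_spikes_ms, w_min=0.0, w_max=1.0):
    """Accumulate all pairwise updates, then clip the weight to [w_min, w_max]."""
    for t_pre in pre_spikes_ms:
        for t_post in post_spikes_ms:
            weight += stdp_delta(t_post - t_pre)
    return min(w_max, max(w_min, weight))

w_causal = apply_stdp(0.5, pre_spikes_ms=[0.0], post_spikes_ms=[5.0])
w_acausal = apply_stdp(0.5, pre_spikes_ms=[5.0], post_spikes_ms=[0.0])
```

No gradients or matrix multiplications are involved, which is consistent with the abstract's claim that such modules can run on a single CPU thread.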

cs.CV / 190 / 2605.17719

Patch-MoE Mamba: A Patch-Ordered Mixture-of-Experts State Space Architecture for Medical Image Segmentation

Patch-MoE Mamba：一种用于医学图像分割的有序补丁混合专家状态空间架构

Adame, Diego, Vazquez, Fabian, Nunez, Jose A., Li, Huimin, Yang, Jinghao, Enriquez, Erik, Kim, DongChul, Tang, Haoteng, Fu, Bin, Gu, Pengfei

Abstract

CNN- and Transformer-based architectures have achieved strong performance in medical image segmentation, but CNNs are limited in modeling long-range dependencies, while Transformers often suffer from quadratic computational and memory complexity. State space models, especially Mamba-based networks, offer an efficient alternative with linear sequence complexity. However, existing Mamba segmentation models still face two limitations: pixel-wise directional scanning can disrupt local 2D spatial structure, and simple summation-based fusion of scan directions cannot adapt well to diverse object sizes, shapes, and boundaries. To address these issues, we propose Patch-MoE Mamba, a patch-ordered mixture-of-experts state space architecture for medical image segmentation. It introduces a hierarchical patch-ordered scanning mechanism that preserves local spatial neighborhoods while capturing multi-scale context, and an MoE-based directional fusion module that adaptively combines multiple Mamba scanner outputs using four directional experts, a learnable concatenation expert, and residual directional aggregation. Experiments on five public polyp segmentation benchmarks and the ISIC 2017/2018 skin lesion segmentation datasets demonstrate the effectiveness and generality of Patch-MoE Mamba.

Chinese Translation

基于卷积神经网络（CNN）和变换器（Transformer）的架构在医学图像分割中取得了良好的性能，但CNN在建模长距离依赖性方面存在局限，而变换器通常面临二次计算和内存复杂性的问题。状态空间模型，特别是基于Mamba的网络，提供了一种具有线性序列复杂度的高效替代方案。然而，现有的Mamba分割模型仍然面临两个限制：逐像素的方向扫描可能会破坏局部二维空间结构，而基于简单求和的扫描方向融合无法很好地适应不同的物体大小、形状和边界。为了解决这些问题，我们提出了Patch-MoE Mamba，一种用于医学图像分割的有序补丁混合专家状态空间架构。它引入了一种分层的有序补丁扫描机制，能够在捕捉多尺度上下文的同时保持局部空间邻域，并且采用基于混合专家（MoE）的方向融合模块，利用四个方向专家、一个可学习的连接专家和残差方向聚合自适应地组合多个Mamba扫描器的输出。在五个公共息肉分割基准和ISIC 2017/2018皮肤病变分割数据集上的实验表明，Patch-MoE Mamba的有效性和通用性。

cs.CV / 191 / 2605.17727

GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations

GraSP-VL：作为视觉-语言表示的语义粒度接口的长度

Li, Zesheng, Pan, Chengchang, Qi, Honggang

Abstract

Frozen vision-language embeddings contain signals at multiple semantic resolutions, from object identity to attributes, relations, and full-caption meaning, but they expose these signals through a fixed-length vector interface. We study whether embedding length can be turned into a controllable semantic access interface. We propose GraSP-VL, which learns a shared near-orthogonal prefix transform over frozen VLM embeddings. GraSP-VL instantiates a Semantic Matryoshka interface: short prefixes are assigned coarse semantic roles, while longer prefixes progressively expose finer language-grounded distinctions. Because the transform is shared across image and text embeddings and preserves full-dimensional geometry, prefix behavior changes without rewriting the original VLM space. On a 20,147-example COCO/Flickr30K annotation pool, GraSP-VL reaches a staircase score of 53.01 and hard-negative selectivity of 89.76, while keeping full-space drift below $10^{-6}$. It also transfers to SugarCrepe-clean with 86.03 object accuracy and 11.96 mean external emergence, and preserves full-dimensional zero-shot CIFAR-100 accuracy. These results show that frozen VLM embeddings can be reorganized into a truncatable semantic prefix interface rather than merely compressed.

Chinese Translation

冻结的视觉-语言嵌入包含多种语义分辨率的信号，从对象身份到属性、关系和完整标题的意义，但它们通过固定长度的向量接口暴露这些信号。我们研究了嵌入长度是否可以转变为可控的语义访问接口。我们提出了GraSP-VL，它学习在冻结的视觉-语言模型（VLM）嵌入上共享近正交的前缀变换。GraSP-VL 实现了一个语义套娃（Semantic Matryoshka）接口：短前缀被分配为粗略的语义角色，而较长的前缀逐步揭示更细致的语言基础区分。由于该变换在图像和文本嵌入之间共享，并保持全维几何，因此前缀行为的变化不会重写原始的 VLM 空间。在一个包含 20,147 个示例的 COCO/Flickr30K 注释池中，GraSP-VL 达到了 53.01 的阶梯得分和 89.76 的困难负样本选择性，同时保持全空间漂移低于 $10^{-6}$。它还在 SugarCrepe-clean 数据集上转移至 86.03 的对象准确率和 11.96 的平均外部出现率，并保持全维零-shot CIFAR-100 准确率。这些结果表明，冻结的 VLM 嵌入可以重组为一个可截断的语义前缀接口，而不仅仅是压缩。
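
The claim above that a shared orthogonal transform changes prefix behaviour while leaving full-dimensional geometry intact follows from a basic property of orthogonal matrices: they preserve inner products, but truncating the rotated vectors does not. The sketch below demonstrates this with a random orthogonal matrix; GraSP-VL learns its transform, so the random matrix, the dimensions, and the variable names are stand-in assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Random orthogonal matrix via QR (a stand-in for the learned transform).
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

image_emb = rng.standard_normal(dim)   # hypothetical frozen image embedding
text_emb = rng.standard_normal(dim)    # hypothetical frozen text embedding

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The same transform is applied to both modalities.
img_t, txt_t = Q @ image_emb, Q @ text_emb

full_before = cosine(image_emb, text_emb)
full_after = cosine(img_t, txt_t)
# Similarity seen through prefixes of increasing length.
prefix_sims = {k: cosine(img_t[:k], txt_t[:k]) for k in (4, 8, 16)}
```

The full-length similarity is unchanged up to floating-point error, while the shorter prefixes give different values that depend on how the transform distributes information across leading dimensions; training the transform is what would make those prefixes semantically meaningful.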

cs.CV / 192 / 2605.17729

Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis

面向疫情韧性的胸部X光分析的领域增量学习

Kim, Danu

Abstract

Deep learning models have achieved high accuracy in pneumonia detection from chest X-rays. However, their generalization across clinical domains remains limited due to variations in imaging devices, acquisition protocols, and institutional conditions. This study introduces a replay-based domain-incremental continual learning method designed to enable continual adaptation to cross-domain variations without catastrophic forgetting. The proposed method incorporates a class-aware balanced replay to maintain balanced class representation within a constrained memory and a class-aware loss to dynamically reweight class imbalance during training. Experiments conducted on a domain-shifted PneumoniaMNIST dataset consisting of five simulated domains demonstrate that the proposed method achieves an average accuracy of 88.66%, outperforming Experience Replay, Fine-Tuning, and Joint Training baselines. These findings highlight the efficacy of the proposed approach in achieving robust and consistent pneumonia detection across clinical environment variations.

Chinese Translation

深度学习模型在肺炎检测方面已实现高准确率。然而，由于成像设备、采集协议和机构条件的变化，它们在临床领域的泛化能力仍然有限。本研究提出了一种基于重放的领域增量持续学习方法，旨在实现对跨领域变化的持续适应，而不发生灾难性遗忘。所提出的方法结合了类感知平衡重放，以在受限内存中保持平衡的类表示，并采用类感知损失在训练过程中动态重新加权类不平衡。在包含五个模拟领域的领域转移肺炎MNIST数据集上进行的实验表明，所提出的方法实现了88.66%的平均准确率，优于经验重放、微调和联合训练基线。这些发现突显了所提方法在实现跨临床环境变化的稳健和一致的肺炎检测中的有效性。
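
A class-aware balanced replay memory, as mentioned above, keeps a fixed-size buffer from becoming dominated by the majority class. The sketch below shows one simple way to do that, evicting a random sample from the currently largest class whenever the buffer overflows. The eviction policy, class name, and method names are assumptions; the abstract does not describe the paper's exact mechanism, and the class-aware loss is not shown.

```python
import random
from collections import defaultdict

class BalancedReplayBuffer:
    """Fixed-capacity memory that evicts from the largest class first (assumed policy)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.by_class = defaultdict(list)
        self.rng = random.Random(seed)

    def __len__(self):
        return sum(len(samples) for samples in self.by_class.values())

    def add(self, sample, label):
        self.by_class[label].append(sample)
        if len(self) > self.capacity:
            # Remove a random sample from whichever class currently holds the most.
            largest = max(self.by_class, key=lambda c: len(self.by_class[c]))
            bucket = self.by_class[largest]
            bucket.pop(self.rng.randrange(len(bucket)))

    def class_counts(self):
        return {label: len(samples) for label, samples in self.by_class.items()}

# Toy imbalanced stream: 90 "normal" scans followed by 10 "pneumonia" scans.
buffer = BalancedReplayBuffer(capacity=10)
for i in range(90):
    buffer.add(f"normal_{i}", "normal")
for i in range(10):
    buffer.add(f"pneumonia_{i}", "pneumonia")
counts = buffer.class_counts()
```

Even though the stream is 9:1 imbalanced, the memory ends up evenly split, so samples replayed in later domains represent both classes equally.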

cs.CV / 193 / 2605.17742

UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation

UST-Hand：一种基于不确定性感知的时空点云交互网络用于3D自监督手势估计

Han, Tianhao, Zhang, Haoyang, Xie, Liang, Chang, Haochen, Gao, Kun, Cheng, Yuan, Ren, Pengfei, Yin, Erwei

Abstract

Manually annotating accurate 3D hand poses is extremely time-consuming and labor-intensive. Existing self-supervised hand pose estimation methods leverage the discrepancy between input images and rendered outputs, or multi-view consistency constraints, as the driving force to optimize networks and progressively refine pose accuracy. However, these methods are highly susceptible to noisy pseudo-labels and overlook the importance of fully exploiting fine-grained spatial correlations, which undermines the stability of model training. To address these issues, we propose UST-Hand, a self-supervised learning framework that estimates the uncertainty distribution of the hand pose and constructs a probabilistic point cloud feature space, which enables modeling of complex spatiotemporal relationships. UST-Hand employs a conditional normalizing flow model to capture hand pose distributions and sample diverse hypotheses, facilitating robust and more stable learning under noisy pseudo-label supervision. These multiple hypotheses are mapped to a unified probabilistic 3D point cloud space for multi-view and temporal feature interaction, comprehensively exploring hand motion patterns and fine-grained spatial correlations. Extensive experiments on three challenging datasets demonstrate that UST-Hand achieves state-of-the-art performance, outperforming existing self-supervised methods by up to 37.8% in Mean Per Vertex Position Error (MPVPE).

Chinese Translation

手动标注准确的3D手势极为耗时且劳动密集。现有的自监督手势估计方法利用输入图像与渲染输出之间的差异，或多视角一致性约束，作为优化网络和逐步提高姿态准确性的驱动力。然而，这些方法对噪声伪标签高度敏感，并忽视了充分利用细粒度空间关联的重要性，这削弱了模型训练的稳定性。为了解决这些问题，我们提出了UST-Hand，一种自监督学习框架，能够估计手势的不确定性分布并构建概率点云特征空间，从而实现复杂的时空关系建模。UST-Hand采用条件归一化流模型来捕捉手势分布并采样多样化假设，促进在噪声伪标签监督下的稳健学习，增强稳定性。这些多假设被映射到统一的概率3D点云空间中，以实现多视角和时间特征的交互，全面探索手部运动模式和细粒度空间关联。在三个具有挑战性的数据集上的大量实验表明，UST-Hand达到了最先进的性能，在每个顶点位置误差（MPVPE）上比现有自监督方法提高了多达37.8%。

cs.CV / 194 / 2605.17743

MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time Adaptation

MoASE++：具有领域自适应在线蒸馏的激活稀疏专家混合模型用于持续测试时间适应

Zhang, Ronyu, Cheng, Aosong, Dai, Gaole, Luo, Yulin, Liu, Jiaming, Du, Li, Yang, Huanrui, Wang, Dan, Fang, Leyuan, Du, Yuan, Zhang, Shanghang

Abstract

Continual test-time adaptation adapts a source-pretrained model to non-stationary, unlabeled target streams while retaining past competence, yet texture-biased backbones risk error accumulation and catastrophic forgetting. Drawing inspiration from the process of decoupling shape and texture in the human visual system, we introduce MoASE, a plug-in mixture-of-experts that disentangles domain-agnostic structure from domain-specific texture using Activation Sparsity Experts with Spatial Differentiable Dropout, forming complementary high- and low-activation pathways, while high- and low-rank bottlenecks diversify representations. The Activation Sparsity Gate produces input-adaptive SDD thresholds for precise token selection, and the Domain-Aware Router assigns per-sample expert weights using texture-sensitive cues. To curb confirmation bias on unlabeled streams and stabilize supervision, we then introduce Domain-Adaptive On-Policy Distillation to constitute MoASE++, with an EMA-anchored on-policy reverse KL distillation and an augmentation policy conditioned on entropy and confidence that aligns predictions across the same views and improves the robustness-plasticity balance. Extensive experiments on classification (CIFAR-10/100-C, ImageNet-C) and semantic segmentation (Cityscapes->ACDC) demonstrate consistent state-of-the-art performance, offering a principled, controllable approach to continual adaptation in dynamic visual environments.

Chinese Translation

持续测试时间适应旨在将源预训练模型适应于非平稳的、未标记的目标流，同时保持过去的能力。然而，受纹理偏见影响的骨干网络存在错误累积和灾难性遗忘的风险。我们受到人类视觉系统中形状与纹理解耦过程的启发，提出了MoASE，这是一种插件式的专家混合模型，利用激活稀疏专家（Activation Sparsity Experts）与空间可微分丢弃（Spatial Differentiable Dropout）将领域无关的结构与领域特定的纹理解耦，形成互补的高激活和低激活通路，同时高秩和低秩瓶颈多样化表示。激活稀疏门（Activation Sparsity Gate）生成输入自适应的SDD阈值以实现精确的标记选择，而领域感知路由器（Domain-Aware Router）则使用对纹理敏感的线索为每个样本分配专家权重。为了抑制对未标记流的确认偏见并稳定监督，我们进一步引入领域自适应在线蒸馏（Domain-Adaptive On-Policy Distillation），构成MoASE++，该方法结合了基于EMA的在线反向KL蒸馏和基于熵与置信度的增强策略，以对齐相同视图下的预测并改善鲁棒性与可塑性之间的平衡。在分类（CIFAR-10/100-C，ImageNet-C）和语义分割（Cityscapes->ACDC）上的大量实验表明，该方法在动态视觉环境中的持续适应中提供了一种原则性、可控的方法，并实现了一致的最先进性能。
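The two ingredients named in the abstract, an EMA-anchored teacher and a reverse KL distillation term, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the momentum value, and the use of plain logits are assumptions made for illustration, and the entropy- and confidence-conditioned augmentation policy is omitted.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reverse_kl(student_logits, teacher_logits, eps=1e-12):
    # Reverse KL, i.e. KL(student || teacher): the expectation is taken
    # under the student's own predictions, averaged over the batch.
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    per_sample = np.sum(p_s * (np.log(p_s + eps) - np.log(p_t + eps)), axis=-1)
    return float(np.mean(per_sample))

def ema_update(teacher_params, student_params, momentum=0.999):
    # Hypothetical EMA anchor: the teacher slowly tracks the student.
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

Because the expectation is taken under the student distribution, the reverse form penalizes the student for placing mass where the teacher places little, which is one common reason to prefer it when supervision is noisy.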

cs.CV / 195 / 2605.17748

Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive Interaction

释放视觉变换器在图像质量评估中的潜力：通过全局-局部自适应交互

Li, Yu, Zhou, Puchao, Mi, Yachun, Wu, Yanfeng, Wang, Xiaoming, Liu, Shaohui

Abstract

In the field of Blind Image Quality Assessment (BIQA), accurately predicting the perceptual quality of authentically distorted images remains highly challenging due to the diverse and complex distortions present in natural environments. Although existing methods have achieved notable accuracy, their scalability is often constrained by the high cost of subjective annotation and the limited size of available datasets. Recent advances in large-scale pre-trained vision models have introduced powerful semantic and representational capabilities, yet their application to IQA tasks is hindered by substantial computational demands and suboptimal fine-tuning efficiency. To overcome these limitations, we introduce the Global-Local Interaction Adapter (GLIA), a novel framework that effectively harnesses pre-trained Vision Transformers through a dual-stream feature extraction mechanism coupled with interactive global-local fusion. By jointly retaining global semantic information and fine-grained local details, our approach delivers superior prediction accuracy and robustness while requiring significantly fewer trainable parameters. Extensive experiments on multiple benchmarks validate the effectiveness and superiority of our approach.

Chinese Translation

在盲图像质量评估（BIQA）领域，准确预测真实失真图像的感知质量仍然面临巨大挑战，因为自然环境中存在多样且复杂的失真。尽管现有方法已取得显著的准确性，但其可扩展性常常受到主观标注高成本和可用数据集规模有限的制约。近期大规模预训练视觉模型的进展引入了强大的语义和表征能力，然而其在图像质量评估（IQA）任务中的应用受到显著计算需求和亚优化微调效率的阻碍。为克服这些限制，我们提出了全局-局部交互适配器（GLIA），这是一个新颖的框架，能够通过双流特征提取机制有效利用预训练的视觉变换器，并结合交互式全局-局部融合。通过共同保留全局语义信息和细粒度局部细节，我们的方法在预测准确性和鲁棒性方面表现出色，同时所需的可训练参数显著减少。在多个基准测试上的广泛实验验证了我们方法的有效性和优越性。

cs.CV / 196 / 2605.17759

FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion

FrequencyBooster：高保真像素扩散的全频建模

Ma, Lichen, Guo, Zipeng, He, Yu, Fu, Xiaolong, Liu, Luohang, Fu, Jingling, Huang, Junshi, Li, Yan

Abstract

To circumvent the inherent fidelity bottlenecks and optimization misalignment of VAE-based latent diffusion, pixel-space diffusion models have emerged as a compelling end-to-end paradigm. However, existing pixel diffusion models often struggle to balance computational efficiency with the preservation of high-frequency details. They frequently resort to patch-based compression or restricted local decoding, leading to a "spectral compromise" where high-frequency and fine-grained pixel information are suppressed. To address these challenges, we propose \textbf{FrequencyBooster}, a novel framework designed to empower pixel diffusion with full-frequency modeling capabilities without prohibitive overhead. The core of our method is a high-capacity decoder that specializes in extracting exhaustive high-frequency details and low-frequency semantics, the latter of which is derived from a Diffusion Transformer (DiT) backbone. Unlike prior works that sacrifice global context for local refinement, FrequencyBooster leverages high-dimensional feature representations to maintain global structural integrity while achieving superior pixel-level precision. Extensive experiments on ImageNet demonstrate the effectiveness of our approach: our model achieves a state-of-the-art FID of \textbf{1.60} at $256 \times 256$ resolution within only 320 epochs. Furthermore, at $512 \times 512$ resolution, FrequencyBooster attains an FID of \textbf{1.69}, significantly outperforming existing pixel-space and latent-space generative models.

Chinese Translation

为了解决基于变分自编码器（VAE）的潜在扩散模型固有的保真度瓶颈和优化不一致性，像素空间扩散模型作为一种引人注目的端到端范式应运而生。然而，现有的像素扩散模型往往难以在计算效率与高频细节的保留之间取得平衡。它们常常依赖于基于补丁的压缩或受限的局部解码，导致出现“频谱妥协”，即高频和细粒度的像素信息受到抑制。为了解决这些挑战，我们提出了 \textbf{FrequencyBooster}，一个旨在赋予像素扩散全频建模能力的新框架，而不会带来过高的开销。我们方法的核心是一个高容量解码器，专门用于提取详尽的高频细节和低频语义，后者源自扩散变换器（Diffusion Transformer, DiT）主干。与之前牺牲全局上下文以进行局部细化的工作不同，FrequencyBooster利用高维特征表示来保持全局结构的完整性，同时实现卓越的像素级精度。在ImageNet上的大量实验表明我们方法的有效性：我们的模型在仅320个周期内，在$256 \times 256$分辨率下达到了 \textbf{1.60} 的最先进FID。此外，在$512 \times 512$分辨率下，FrequencyBooster达到了 \textbf{1.69} 的FID，显著优于现有的像素空间和潜在空间生成模型。

cs.CV / 197 / 2605.17766

LatentUMM: Dual Latent Alignment for Unified Multimodal Models

LatentUMM：统一多模态模型的双重潜在对齐

Luo, Yinyi, Wang, Wenwen, Bai, Hayes, Savvides, Marios, Wang, Jindong

Abstract

Unified multimodal models (UMMs) achieve strong performance in both understanding and generation by learning a shared latent space, yet they often exhibit functional inconsistency between these two capabilities. We observe that this issue does not stem from a lack of shared representations, but from the absence of explicit alignment between the transformations that map into and out of the latent space. As a result, generation and re-encoding can follow inconsistent trajectories, leading to semantic drift under modality transitions. In this work, we propose LatentUMM, a framework that constructs an enhanced shared latent space to explicitly align these transformations and improve cross-modal consistency. LatentUMM consists of two stages. First, dual latent alignment enforces consistency at both the modality and capacity levels: cross-modal alignment uses a stronger embedding model to impose structured cross-modal semantics, while dual capacity alignment enforces bidirectional consistency under generation and re-encoding. Second, latent dynamics stabilization improves robustness via stochastic latent rollouts and preference optimization, favoring trajectories that better preserve semantic consistency. Experiments show that LatentUMM consistently improves multimodal consistency across diverse architectures. Code is available at: https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/LatentUMM.

Chinese Translation

统一多模态模型（UMMs）通过学习共享的潜在空间，在理解和生成方面取得了强大的性能，然而它们在这两种能力之间往往表现出功能不一致。我们观察到，这一问题并非源于缺乏共享表示，而是由于缺乏将映射到潜在空间内外的变换之间的显式对齐。因此，生成和重新编码可能遵循不一致的轨迹，导致在模态转换下的语义漂移。在本研究中，我们提出了LatentUMM，一个构建增强共享潜在空间的框架，以显式对齐这些变换并改善跨模态一致性。LatentUMM包括两个阶段。首先，双重潜在对齐在模态和容量层面上强制一致性：跨模态对齐使用更强的嵌入模型来施加结构化的跨模态语义，而双重容量对齐则在生成和重新编码下强制双向一致性。其次，潜在动态稳定化通过随机潜在展开和偏好优化提高了鲁棒性，倾向于保留更好的语义一致性的轨迹。实验表明，LatentUMM在多种架构中始终改善了多模态一致性。代码可在以下链接获取：https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/LatentUMM.

cs.CV / 198 / 2605.17772

Towards Universal Physical Adversarial Attacks via a Joint Multi-Objective and Multi-Model Optimization Framework

通过联合多目标和多模型优化框架实现通用物理对抗攻击

Liu, Ziyang, Wang, Hongyuan, Wang, Zijian, Lu, Yinxi, Zang, Yunzhao, Yan, Zhiqiang, Ning, Qianhao

Abstract

Physical adversarial attacks often overfit single surrogate models and optimization objectives. While ensemble attacks can mitigate this, existing methods struggle with severe gradient conflicts within restricted physical texture spaces, significantly degrading cross-model transferability. To bridge this gap, this paper proposes a Joint Multi-Objective and Multi-Model Optimization Framework (JMOF) that leverages quantitative similarity analysis to select the optimal surrogate model ensemble. Within JMOF, a dual-level mechanism jointly suppresses prediction outputs and flattens intermediate feature distributions, balancing attack efficiency with deep generalization. Additionally, an Orthogonal Gradient Alignment (OGA) strategy resolves cross-model gradient conflicts, transforming mutually repulsive gradients into synergistic optimization directions. Extensive simulated and real-world experiments demonstrate that JMOF outperforms state-of-the-art baselines against diverse black-box detectors. Crucially, JMOF exhibits substantial cross-vision-task generalization, generating attacks capable of simultaneously deceiving object detection and semantic segmentation or monocular depth estimation models. This research advances the generalization limits of physical adversarial attacks, providing a robust framework for evaluating visual AI vulnerabilities in real-world deployments.

Chinese Translation

物理对抗攻击通常会过拟合单一的替代模型和优化目标。虽然集成攻击可以缓解这一问题，但现有方法在受限的物理纹理空间内面临严重的梯度冲突，显著降低了跨模型的可迁移性。为了解决这一问题，本文提出了一种联合多目标和多模型优化框架（Joint Multi-Objective and Multi-Model Optimization Framework, JMOF），该框架利用定量相似性分析选择最佳的替代模型集成。在JMOF中，双层机制共同抑制预测输出并平滑中间特征分布，平衡攻击效率与深度泛化。此外，正交梯度对齐（Orthogonal Gradient Alignment, OGA）策略解决了跨模型的梯度冲突，将相互排斥的梯度转化为协同优化方向。大量的模拟和真实世界实验表明，JMOF在对抗各种黑箱检测器时优于最先进的基线。重要的是，JMOF表现出显著的跨视觉任务泛化能力，生成能够同时欺骗目标检测、语义分割或单目深度估计模型的攻击。这项研究推动了物理对抗攻击的泛化极限，为评估现实世界部署中的视觉人工智能脆弱性提供了一个强大的框架。
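The abstract does not spell out how Orthogonal Gradient Alignment works, so the sketch below swaps in a well-known stand-in, a PCGrad-style projection, only to illustrate what "resolving cross-model gradient conflicts" can mean. The function name and the averaging step are assumptions, not the paper's method.

```python
import numpy as np

def resolve_conflicts(grads):
    # grads: list of flattened per-model gradients for the same texture.
    # Stand-in technique (PCGrad-style), NOT the paper's OGA: whenever
    # two gradients point in opposing directions (negative dot product),
    # remove from one the component that lies along the other.
    grads = [np.asarray(g, dtype=float) for g in grads]
    adjusted = []
    for i, g in enumerate(grads):
        g_new = g.copy()
        for j, other in enumerate(grads):
            if i == j:
                continue
            dot = float(g_new @ other)
            if dot < 0.0:
                g_new -= dot / (float(other @ other) + 1e-12) * other
        adjusted.append(g_new)
    # Average the de-conflicted gradients into one update direction.
    return np.mean(adjusted, axis=0)
```

With the conflicting pair `[1, 0]` and `[-1, 1]`, the plain average `[0, 0.5]` makes no progress for the first model, whereas the projected average has a non-negative dot product with both inputs.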

cs.CV / 199 / 2605.17773

PlantPose: Universal Plant Skeleton Estimation via Tree-constrained Graph Generation

PlantPose：通过树约束图生成进行通用植物骨架估计

Liu, Xinpeng, Santo, Hiroaki, Toda, Yosuke, Okura, Fumio

Abstract

Accurate estimation of plant skeletal structures (e.g., branching structures) from images is essential for smart agriculture and plant science. Unlike human skeletons with fixed topology, plant skeleton estimation presents a unique challenge, i.e., estimating arbitrary tree graphs from images. To address this problem, we introduce PlantPose, a universal plant skeleton estimator via tree-constrained graph generation. PlantPose combines learning-based graph generation with traditional graph algorithms to enforce tree constraints during the training loop. To enhance the model's generalization capability, we curate a large and diverse dataset comprising real-world and synthetic plant images, along with simplified representations (e.g., sketches and abstract drawings). This dataset enables the generalized model to adapt to diverse input styles and categories of plant images while preserving topological consistency. Our approach demonstrates robust and accurate plant skeleton estimation across multiple domains, including previously unseen out-of-domain scenarios. Further analyses highlight the method's strengths and limitations in handling complex, heterogeneous data distributions. All implementations and datasets are available at https://github.com/huntorochi/PlantPose/.

Chinese Translation

从图像中准确估计植物骨架结构（例如，分枝结构）对于智能农业和植物科学至关重要。与固定拓扑的人体骨架不同，植物骨架估计面临独特的挑战，即从图像中估计任意树图。为了解决这个问题，我们提出了PlantPose，一种通过树约束图生成进行的通用植物骨架估计器。PlantPose结合了基于学习的图生成与传统图算法，以在训练过程中施加树约束。为了增强模型的泛化能力，我们整理了一个大型多样化的数据集，包含真实世界和合成的植物图像，以及简化表示（例如，草图和抽象图）。该数据集使得通用模型能够适应多样的输入风格和植物图像类别，同时保持拓扑一致性。我们的方法在多个领域展示了强大而准确的植物骨架估计能力，包括以前未见的域外场景。进一步的分析突出了该方法在处理复杂异构数据分布时的优势和局限性。所有实现和数据集可在 https://github.com/huntorochi/PlantPose/ 获取。
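One way a traditional graph algorithm can enforce a tree constraint on predicted edges is a maximum spanning tree over edge scores. The abstract does not name the algorithm PlantPose uses, so the Kruskal-style sketch below is an assumption for illustration; `scored_edges` stands for hypothetical `(node_u, node_v, score)` tuples produced by some graph generator.

```python
def max_spanning_tree(num_nodes, scored_edges):
    # Kruskal's algorithm with union-find: visit edges from highest to
    # lowest score and keep an edge only if it joins two components,
    # so the result is acyclic by construction.
    parent = list(range(num_nodes))

    def find(x):
        # Find the component root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for u, v, _score in sorted(scored_edges, key=lambda e: e[2], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((u, v))
    return tree
```

If the candidate edges connect all nodes, the output has exactly `num_nodes - 1` edges, which is the defining property of a tree.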

cs.CV / 200 / 2605.17777

Efficient Sparse-to-Dense Visual Localization via Compact Gaussian Scene Representation and Accelerated Dense Pose Estimation

通过紧凑的高斯场景表示和加速的密集姿态估计实现高效的稀疏到密集视觉定位

Li, Zizhuo, Deng, Songchu, Tang, Linfeng, Ma, Jiayi

Abstract

This letter presents LiteLoc, a novel and efficient localizer built on 3D Gaussian Splatting (3DGS). The previous state-of-the-art (SoTA) sparse-to-dense localizer, STDLoc, has shown remarkable localization capability but suffers from severe storage redundancy and computational latency. By revisiting its design decisions, we derive two simple yet highly effective improvements that cumulatively make LiteLoc much more efficient in both memory and computation, while also being easier to train. One key observation is that the color field, inherited directly from Feature 3DGS, is functionally useless for localization. Yet, its reconstruction of high-frequency photometric details necessitates excessive Gaussian primitives, resulting in a tightly coupled color-feature representation with significant memory overhead and sub-optimal feature field optimization. To resolve this, we propose a color-free decoupled feature field that constructs a compact Gaussian scene representation by retaining only task-essential feature attributes, thereby eliminating approximately 94% of redundant storage with no loss of localization-relevant information. We further find that the primary computational bottleneck lies in the dense Perspective-n-Point (PnP) solver, where most matches contribute saturated geometric constraints with diminishing accuracy gains. Accordingly, we propose a condensing strategy that distills dense matches into a subset of 5% representative matches, enabling a nearly 19-fold speedup in robust estimation with negligible performance drop. Extensive experiments show that LiteLoc surpasses STDLoc in multiple scenes with considerable efficiency benefits, opening up exciting prospects for latency-sensitive visual localization.

Chinese Translation

本文提出了LiteLoc，一种基于3D高斯点云（3D Gaussian Splatting, 3DGS）的新型高效定位器。之前的最先进稀疏到密集定位器STDLoc展示了卓越的定位能力，但在存储冗余和计算延迟方面存在严重问题。通过重新审视其设计决策，我们提出了两个简单但极为有效的改进，使LiteLoc在内存和计算效率上都大幅提升，同时训练过程也更加简便。一个关键观察是，直接继承自特征3DGS的颜色场在定位中功能上是无用的。然而，其对高频光度细节的重建需要过多的高斯原语，导致颜色特征表示紧密耦合，造成显著的内存开销和次优的特征场优化。为了解决这个问题，我们提出了一种无色解耦特征场，通过仅保留任务必需的特征属性来构建紧凑的高斯场景表示，从而消除约94%的冗余存储而不损失与定位相关的信息。我们进一步发现，主要的计算瓶颈在于密集的透视n点（Perspective-n-Point, PnP）求解器，其中大多数匹配贡献饱和的几何约束，且准确性提升逐渐减小。因此，我们提出了一种浓缩策略，将密集匹配提炼为5%的代表性匹配，从而在稳健估计中实现近19倍的加速，且性能下降微乎其微。大量实验表明，LiteLoc在多个场景中超越了STDLoc，带来了显著的效率提升，为对延迟敏感的视觉定位开辟了令人兴奋的前景。

cs.CV / 201 / 2605.17780

Network Knowledge Prior Guided Learning for Data-Efficient Surface Defect Detection

基于网络知识先验引导的高效数据表面缺陷检测学习

Dong, Hang-Cheng, Liu, Guodong, Ye, Dong, Liu, Bingguo

Abstract

Deep learning-based methods have become the de facto standard for industrial defect detection. However, their data-hungry nature and inherent "black-box" characteristics often lead to performance bottlenecks and limited trustworthiness in real-world applications. To address these challenges, this paper proposes a novel knowledge-guided loss function that seamlessly integrates model interpretability into the training process without incurring any additional inference cost. Our method operates in two phases: first, a primary classification network is trained, and its explanations, in the form of saliency maps, are generated as prior knowledge. Second, a multi-task learning framework is established, where the main task performs classification, and an auxiliary task imposes consistency between the saliency maps of the final model and the primary model. This consistency is enforced by a dedicated knowledge-guided loss term, effectively acting as a powerful regularizer to steer the model towards robust feature representations. Extensive experiments on multiple public defect datasets demonstrate that our approach consistently enhances the performance of baseline models in terms of accuracy and AP. Moreover, visual analysis reveals that the proposed method yields more concentrated and human-intelligible saliency maps. This work presents a simple yet effective paradigm for bridging the gap between model performance and interpretability, paving the way for more reliable and high-performing vision systems in industrial quality inspection.

Chinese Translation

基于深度学习的方法已成为工业缺陷检测的事实标准。然而，它们对数据的高需求和固有的“黑箱”特性常常导致性能瓶颈以及在实际应用中的可信度有限。为了解决这些挑战，本文提出了一种新颖的知识引导损失函数，该函数无缝地将模型可解释性整合到训练过程中，而不增加任何额外的推理成本。我们的方法分为两个阶段：首先，训练一个主要分类网络，并生成以显著性图形式呈现的解释作为先验知识。其次，建立一个多任务学习框架，其中主要任务执行分类，而辅助任务则强制要求最终模型与主要模型的显著性图之间的一致性。这种一致性通过一个专门的知识引导损失项来强制执行，有效地充当强大的正则化器，引导模型朝着稳健的特征表示发展。在多个公共缺陷数据集上的广泛实验表明，我们的方法在准确性和平均精度（AP）方面始终增强了基线模型的性能。此外，视觉分析表明，所提出的方法产生了更集中且易于人类理解的显著性图。这项工作展示了一种简单而有效的范式，弥合了模型性能与可解释性之间的差距，为工业质量检测中更可靠和高性能的视觉系统铺平了道路。

cs.CV / 202 / 2605.17799

Is Complex Training Necessary for Long-Tailed OOD Detection? A Re-think from Feature Geometry

复杂训练对于长尾分布外（OOD）检测是否必要？从特征几何的角度重新思考

Peng, Ningkang, Chen, Xuanming, Gu, Yanhui

Abstract

Long-tailed out-of-distribution (LT-OOD) detection is often addressed with specialized training, including auxiliary out-of-distribution (OOD) data, abstention heads, contrastive objectives, energy losses, or gradient-conflict control. We show that these training mechanisms can obscure a simpler issue: frozen long-tailed representations may already contain useful OOD evidence, but raw Mahalanobis distance is distorted by frequency-coupled feature radius and poorly supported tail covariance. We propose Hyperspherical Pooled Mahalanobis (HPM), a post-hoc detector that normalizes features onto the unit sphere and replaces class-specific covariance with a pooled, ridge-regularized metric while keeping class means as semantic anchors. In CIFAR-LT experiments and an ImageNet-100-LT near-OOD boundary analysis, HPM improves raw Mahalanobis scoring; for Prior-Calibrated ERM (PC-ERM), it raises AUROC from 46.49 to 85.67 on CIFAR-10-LT and from 50.40 to 78.35 on CIFAR-100-LT. This simple PC-ERM+HPM pipeline also achieves the best Log Efficiency Score (LES; 3.08) on CIFAR-100-LT, retaining roughly 95% of the best CIFAR-100-LT AUROC observed among the compared post-hoc scores at substantially lower training-time cost. These results argue for evaluating representation quality, detector geometry, and training complexity as separate factors in LT-OOD detection.

Chinese Translation

长尾分布外（LT-OOD）检测通常通过专门的训练方法来解决，包括辅助的分布外（OOD）数据、弃权头、对比目标、能量损失或梯度冲突控制。我们表明，这些训练机制可能掩盖了一个更简单的问题：冻结的长尾表示可能已经包含有用的OOD证据，但原始的马哈拉诺比斯距离受到频率耦合特征半径和支持不足的尾部协方差的扭曲。我们提出了超球面池化马哈拉诺比斯（Hyperspherical Pooled Mahalanobis, HPM），这是一种后处理检测器，它将特征归一化到单位球面，并用池化的、岭正则化的度量替换类特定协方差，同时保持类均值作为语义锚。在CIFAR-LT实验和ImageNet-100-LT近OOD边界分析中，HPM改善了原始马哈拉诺比斯评分；对于先验校准的经验风险最小化（Prior-Calibrated ERM, PC-ERM），它在CIFAR-10-LT上将AUROC从46.49提高到85.67，在CIFAR-100-LT上从50.40提高到78.35。这个简单的PC-ERM+HPM流程在CIFAR-100-LT上也达到了最佳的对数效率得分（Log Efficiency Score, LES；3.08），保留了在比较的后处理得分中观察到的最佳CIFAR-100-LT AUROC的约95%，同时训练时间成本显著降低。这些结果表明，在LT-OOD检测中，应将表示质量、检测器几何和训练复杂性作为独立因素进行评估。
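The HPM recipe in the abstract (unit-sphere normalization, class means as anchors, one pooled ridge-regularized covariance) can be sketched in NumPy as follows. The ridge value, the use of the minimum class distance as the OOD score, and the function names are assumptions for illustration rather than details confirmed by the source.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Project each feature vector onto the unit sphere.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def fit_hpm(features, labels, ridge=1e-2):
    # Class means act as semantic anchors; the covariance is pooled
    # over all classes and ridge-regularized (ridge value is assumed).
    z = l2_normalize(np.asarray(features, dtype=float))
    labels = np.asarray(labels)
    classes = np.unique(labels)
    means = np.stack([z[labels == c].mean(axis=0) for c in classes])
    centered = z - means[np.searchsorted(classes, labels)]
    pooled_cov = centered.T @ centered / len(z)
    precision = np.linalg.inv(pooled_cov + ridge * np.eye(z.shape[1]))
    return means, precision

def hpm_score(features, means, precision):
    # Squared Mahalanobis distance to the nearest class mean under the
    # pooled metric; a larger value is treated as more OOD-like.
    z = l2_normalize(np.asarray(features, dtype=float))
    diff = z[:, None, :] - means[None, :, :]
    d2 = np.einsum("ncd,de,nce->nc", diff, precision, diff)
    return d2.min(axis=1)
```

Pooling means a tail class with only a handful of samples borrows its metric from the whole training set instead of relying on its own poorly supported covariance estimate.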

cs.CV / 203 / 2605.17807

Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation

课程组策略优化：自适应采样以释放文本到图像生成的潜力

Li, Baoteng, Zang, Xianghao, Wang, Xinran, Na, Xiangyu, He, Zhixiang, Sun, Hao, Zhang, Chi, He, Zhongjiang, Cao, Tianwei, Liang, Kongming, Ma, Zhanyu

Abstract

Text-to-Image (T2I) generation has achieved remarkable progress in recent years. Meanwhile, reinforcement learning methods, particularly those based on Group Relative Policy Optimization (GRPO), have attracted widespread attention and been successfully applied to T2I tasks. However, the uniform sampling strategy commonly used during training often ignores the match between sample difficulty and the model's current learning capability, leading to low training efficiency. We argue that improving training efficiency requires continuously prioritizing prompts that match the model's evolving capability and remain actively learnable. To this end, we propose Curriculum Group Policy Optimization (CGPO), an adaptive curriculum training framework. During training, each prompt produces a group of images scored by a reward model. We use the variance of group rewards as an online proxy for prompt inconsistency. A higher variance suggests that the model has partially captured the prompt requirements but has not yet achieved stable mastery. Such prompts are more likely to provide useful learning signals, so we increase their sampling probabilities accordingly. Additionally, to address data imbalance in multi-category datasets, we design a category calibration method based on proportional fairness optimization, which balances training difficulty across categories. Experiments on GenEval, T2I-CompBench++, and DPG Bench demonstrate that our framework effectively improves generation performance.

Chinese Translation

近年来，文本到图像（T2I）生成取得了显著进展。同时，基于组相对策略优化（GRPO）的强化学习方法引起了广泛关注，并成功应用于T2I任务。然而，训练过程中常用的统一采样策略往往忽视了样本难度与模型当前学习能力之间的匹配，导致训练效率低下。我们认为，提高训练效率需要持续优先考虑与模型不断演变的能力相匹配并保持积极可学习的提示。为此，我们提出了课程组策略优化（CGPO），一种自适应课程训练框架。在训练过程中，每个提示生成一组由奖励模型评分的图像。我们使用组奖励的方差作为提示不一致性的在线代理。较高的方差表明模型部分捕捉了提示要求，但尚未实现稳定掌握。这类提示更有可能提供有用的学习信号，因此我们相应地提高它们的采样概率。此外，为了解决多类别数据集中的数据不平衡问题，我们设计了一种基于比例公平优化的类别校准方法，以平衡各类别之间的训练难度。在GenEval、T2I-CompBench++和DPG Bench上的实验表明，我们的框架有效提高了生成性能。
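The variance-as-proxy idea can be sketched in a few lines of NumPy. The mapping from variance to sampling probability below (a floor plus a temperature exponent) is an assumption, since the abstract only says that higher-variance prompts are sampled more often; the category calibration step is omitted.

```python
import numpy as np

def sampling_probs(group_rewards, temperature=1.0, floor=1e-3):
    # group_rewards: one list of reward-model scores per prompt, with
    # one score for each image generated from that prompt.
    variances = np.array([np.var(r) for r in group_rewards])
    # The floor keeps zero-variance prompts reachable; both the floor
    # and the temperature are hypothetical knobs for this sketch.
    weights = (variances + floor) ** (1.0 / temperature)
    return weights / weights.sum()
```

A prompt whose images all score the same, whether uniformly good or uniformly bad, receives only the floor weight, while a prompt with mixed outcomes is sampled more often.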

cs.CV / 204 / 2605.17818

Evidence-Guided Unknown Rejection for High-Confidence Near-Known Unknowns

基于证据的未知拒绝方法用于高置信度近已知未知样本

Chen, Xi, Xiao, Yingjun, Fang, Gang

Abstract

Open-set recognition systems face a neglected failure mode: high-confidence near-known unknowns, which lie outside the known label set but are close enough to known classes that a closed-set classifier accepts them with high confidence. We show that this failure is widespread across scalar-threshold methods, including recent post-hoc detectors, and that stronger encoders can amplify rather than remove the risk. We propose EGUR-A, which changes the decision from "is this sample's score high enough?" to "does this predicted known class have sufficient evidence to accept this sample?" EGUR-A combines class-conditional local acceptance evidence with global residual evidence, and selects their relative weight from known-sample statistics without unknown validation data. Across CUB, FGVC-Aircraft, and ImageNet-hard, EGUR-A substantially reduces high-confidence false known acceptance at matched known-rejection operating points. The result is not a stronger threshold; it is a different question: whether a known class is entitled to accept a sample.

Chinese Translation

开放集识别系统面临一种被忽视的失败模式：高置信度的近已知未知样本，这些样本位于已知标签集之外，但与已知类别的距离足够近，以至于闭集分类器以高置信度接受它们。我们展示了这种失败在标量阈值方法中普遍存在，包括最近的后处理检测器，并且更强的编码器可能会放大而不是消除这种风险。我们提出了EGUR-A，它将决策从“这个样本的得分是否足够高？”转变为“这个预测的已知类别是否有足够的证据来接受这个样本？”EGUR-A结合了类别条件的局部接受证据与全局残差证据，并从已知样本统计中选择它们的相对权重，而无需未知验证数据。在CUB、FGVC-Aircraft和ImageNet-hard数据集上，EGUR-A在匹配的已知拒绝操作点上显著减少了高置信度的错误已知接受。其结果并不是一个更强的阈值，而是一个不同的问题：一个已知类别是否有权接受一个样本。

cs.CV / 205 / 2605.17822

Unleashing the Representational Power of Fourier Shapes for Attacking Infrared Object Detection

释放傅里叶形状的表征能力以攻击红外物体检测

Yong, Yixing, Wang, Jian, Lei, Ming, He, Lijun, Li, Fan

Abstract

Infrared object detection is crucial for perception in autonomous driving and surveillance but remains vulnerable to physical adversarial attacks. Unlike in the RGB domain, where attacks rely on color texture, infrared attacks must manipulate thermal signatures, making the geometric shape of heat-blocking materials the primary adversarial information carrier. Current shape-based methods suffer from a fundamental trade-off between representational capability and optimization power, limiting their attack effectiveness. In this work, we overcome this dilemma by introducing learnable Fourier shapes to the infrared domain. We utilize an end-to-end differentiable framework where a compact set of Fourier coefficients, defining the shape boundary, is analytically mapped to a pixel-space mask via the winding number theorem. This enables efficient gradient-based optimization to generate potent shapes that cause human targets to evade detection. Extensive digital and physical experiments provide a comprehensive evaluation and validate our superior performance. Our resulting physical patch achieves striking robustness, successfully evading detectors across diverse distances, angles, poses, and individuals, and achieves over 88% attack success rate at distances greater than 25m (conf.=0.5). Code is available at https://github.com/Yongyx99/Fourier-shape-attack.

Chinese Translation

红外物体检测对于自动驾驶和监控中的感知至关重要，但仍然容易受到物理对抗攻击。与RGB领域中的攻击依赖于颜色纹理不同，红外攻击必须操控热签名，使得热阻材料的几何形状成为主要的对抗信息载体。目前的基于形状的方法在表征能力和优化能力之间存在根本性的权衡，限制了其攻击效果。在本研究中，我们通过引入可学习的傅里叶形状到红外领域来克服这一困境。我们利用一个端到端可微分的框架，其中一组紧凑的傅里叶系数定义了形状边界，并通过绕数定理解析映射到像素空间掩膜。这使得高效的基于梯度的优化成为可能，从而生成强大的形状，使人类目标逃避检测。大量的数字和物理实验提供了全面的评估，并验证了我们的优越性能。我们得到的物理补丁表现出显著的鲁棒性，成功地在不同距离、角度、姿势和个体中逃避检测，并在超过25米的距离下实现了超过88%的攻击成功率（置信度=0.5）。代码可在 https://github.com/Yongyx99/Fourier-shape-attack 获取。
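The geometric idea of turning Fourier coefficients into a pixel mask through winding numbers can be sketched in NumPy. This version produces a hard, non-differentiable mask and is not the authors' framework; the complex-coefficient parameterization, the sampling density, and the function names are assumptions for illustration.

```python
import numpy as np

def fourier_boundary(coeffs, num_points=256):
    # coeffs: dict {k: complex c_k}; boundary z(t) = sum_k c_k * exp(i*k*t).
    t = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    return sum(c * np.exp(1j * k * t) for k, c in coeffs.items())

def winding_mask(boundary, height, width):
    # For each pixel center, sum the signed angle swept by the boundary
    # as seen from that pixel; a nonzero winding number means "inside".
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = (xs + 0.5) + 1j * (ys + 0.5)
    rel = boundary[None, None, :] - pixels[:, :, None]
    nxt = np.roll(rel, -1, axis=2)
    dtheta = np.angle(nxt * np.conj(rel))
    winding = dtheta.sum(axis=2) / (2.0 * np.pi)
    return np.abs(np.rint(winding)) > 0
```

For example, `winding_mask(fourier_boundary({0: 16 + 16j, 1: 8.0}), 32, 32)` yields a filled disk of radius 8 centered in a 32 by 32 grid. Adding higher-order coefficients deforms the disk into more complex shapes while the mask computation stays unchanged.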

cs.CV / 206 / 2605.17823

Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding

我们为何注视我们所注视的：一种最大化场景理解的具有人类特征的聚焦视觉语言模型的自发注视行为

Murlidaran, Shravan, Wen, Ziqi, Shehabi, Sana, Eckstein, Miguel P.

Abstract

When humans view scenes without a specific task (free-viewing), they initially direct their eye movements toward the scene center and then fixate on people, text, objects being gazed at or grasped, and semantically meaningful regions. What these signature fixation patterns reflect and whether they optimize an underlying perceptual task remain unknown. We show that a computational agent with simulated foveation, trained to optimize scene comprehension, exhibits emergent human fixation signature patterns. In contrast, versions of the agent trained to search or classify scenes, or equipped with peripheral vision that was better or worse than human vision, predicted human fixation patterns less accurately. Thus, human free-viewing fixation patterns may emerge as a functional byproduct of optimizing scene comprehension under the biological constraints of foveated vision.

Chinese Translation

当人类在没有特定任务的情况下观看场景（自由观看）时，他们最初会将眼动指向场景中心，然后注视于人、文本、被凝视或抓取的物体以及语义上有意义的区域。这些特征性注视模式所反映的内容以及它们是否优化了潜在的感知任务仍然未知。我们展示了一种具有模拟聚焦能力的计算代理，经过训练以优化场景理解，表现出自发的人类注视特征模式。相比之下，经过训练以搜索或分类场景的代理版本，或具有人类视觉更好或更差的周边视觉的代理，预测人类注视模式的准确性较低。因此，人类在自由观看时的注视模式可能是优化场景理解在生物聚焦视觉约束下的一个功能性副产品。

cs.CV / 207 / 2605.17826

CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models

CounterCount：一种用于诊断视觉语言模型计数偏差的框架

Alzahrani, Reem, Alshanqiti, Hassan, Hemid, Bushra Bin, Alyafeai, Zaid, Eldesokey, Abdelrahman, Ghanem, Bernard

Abstract

Vision-Language Models (VLMs) excel at multimodal reasoning, yet it remains unclear whether their answers are grounded in visual evidence or driven by learned language and world priors. Counting provides a precise testbed: when visual evidence conflicts with canonical object knowledge, a model must rely on the image rather than a prototypical count. We introduce CounterCount, a diagnostic framework for counterfactual counting in VLMs, consisting of paired factual and counterfactual images with edited count-relevant attributes, verified answers, and localized evidence annotations. Evaluating recent VLMs, we find strong performance on factual images but consistent degradation under counterfactual attribute changes, indicating reliance on object-level priors even when contradictory visual evidence is present. Using localized annotations, we show that these failures are not solely due to missing or ambiguous visual evidence, but to models underweighting attention to count-relevant visual tokens. We introduce a unified inference-time attention modulation strategy that reweights selected visual tokens, improving counterfactual counting accuracy by up to 8% across multiple VLMs. Overall, CounterCount exposes prior-driven counting failures and provides diagnostic insights for designing future VLMs.

Chinese Translation

视觉语言模型（VLMs）在多模态推理方面表现出色，但尚不清楚它们的答案是否基于视觉证据，还是受到学习到的语言和世界先验的驱动。计数提供了一个精确的测试平台：当视觉证据与典型物体知识相冲突时，模型必须依赖图像而不是原型计数。我们提出了CounterCount，一个用于VLMs的反事实计数的诊断框架，包含配对的事实和反事实图像，具有编辑过的与计数相关的属性、经过验证的答案和局部证据注释。对近期VLMs的评估显示，在事实图像上表现良好，但在反事实属性变化下表现一致下降，表明即使在存在矛盾视觉证据的情况下，模型仍依赖于物体级先验。通过使用局部注释，我们表明这些失败并不仅仅是由于缺失或模糊的视觉证据，而是模型对与计数相关的视觉标记的关注不足。我们引入了一种统一的推理时注意力调制策略，重新加权选定的视觉标记，提高了多个VLMs的反事实计数准确性，提升幅度高达8%。总体而言，CounterCount揭示了基于先验的计数失败，并为未来VLMs的设计提供了诊断性见解。
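Inference-time attention modulation of the kind the abstract describes can be illustrated by adding a bias to the attention logits of selected visual tokens before the softmax. The additive-bias form, the `boost` parameter, and the function name are assumptions; the abstract does not specify how the reweighting is applied or which layers it targets.

```python
import numpy as np

def reweight_attention(attn_logits, token_mask, boost=1.0):
    # attn_logits: (queries, tokens) pre-softmax attention scores.
    # token_mask: boolean (tokens,) marking count-relevant visual tokens.
    # Adding `boost` to a logit multiplies that token's unnormalized
    # weight by exp(boost); the softmax then renormalizes each row.
    logits = attn_logits + boost * token_mask.astype(float)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)
```

With uniform logits over four tokens and `boost = log(3)` on the first token, that token's share rises from 0.25 to 0.5 and the remaining three share the rest equally.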

cs.CV / 208 / 2605.17834

Stabilizing, Scaling & Enhancing MeanFlow for Large-scale Diffusion Distillation

稳定、扩展与增强大规模扩散蒸馏中的MeanFlow

He, Xiao, Li, Yang, Zhang, Peizhen, Liu, Songtao, Zhong, Zhao, Wang, Nannan

Abstract

Diffusion models exhibit remarkable generative capability, but their high latency limits practical deployment. Many studies have attempted to reduce sampling steps to accelerate inference. Among them, MeanFlow has attracted considerable attention due to its concise formulation and remarkable performance. Nevertheless, the instability of its optimization objective and the "mean-seeking bias" have limited its applicability to distilling large-scale industrial models. To stabilize MeanFlow for distilling large-scale models, we first introduce a warm-up technique, in which the original differential solution of MeanFlow is replaced by a discrete solution. This design avoids training collapse caused by the MeanFlow target containing a stop-gradient term from an undertrained model. Once the model acquires a preliminary ability to fit the average velocity field, we switch the optimization objective back to the differential solution, enabling further refinement. Meanwhile, to alleviate the "mean-seeking bias" of MeanFlow under extremely few-step inference with complex target distributions, we incorporate trajectory distribution alignment as an auxiliary objective, encouraging the student model's trajectory distribution to align more closely with that of the teacher model. Our proposed distillation framework achieves superior performance compared to existing distillation approaches when applied to the text-to-image (T2I) model FLUX.1-dev (up to 12B parameters). Furthermore, when extended to the 80B-parameter state-of-the-art (SOTA) T2I model HunyuanImage 3.0, our method continues to demonstrate robust generalization and strong performance.

Chinese Translation

扩散模型展现出卓越的生成能力，但其高延迟限制了实际部署。许多研究尝试减少采样步骤以加速推理。其中，MeanFlow因其简洁的公式和显著的性能而受到广泛关注。然而，其优化目标的不稳定性和“均值寻求偏差”限制了其在大规模工业模型蒸馏中的适用性。为了稳定MeanFlow以蒸馏大规模模型，我们首先引入了一种预热技术，在该技术中，MeanFlow的原始微分解被离散解所替代。这一设计避免了由于MeanFlow目标包含来自未充分训练模型的停止梯度项而导致的训练崩溃。一旦模型获得了初步拟合平均速度场的能力，我们便将优化目标切换回微分解，以便进行进一步的细化。同时，为了缓解MeanFlow在极少步推理与复杂目标分布下的“均值寻求偏差”，我们将轨迹分布对齐作为辅助目标，鼓励学生模型的轨迹分布与教师模型的轨迹分布更紧密地对齐。我们提出的蒸馏框架在应用于文本到图像（T2I）模型FLUX.1-dev（最多12B参数）时，表现出优于现有蒸馏方法的性能。此外，当扩展到80B参数的最先进（SOTA）T2I模型HunyuanImage 3.0时，我们的方法继续展现出强大的泛化能力和优异的性能。

cs.CV / 209 / 2605.17837

Temporal Aware Pruning for Efficient Diffusion-based Video Generation

基于时间感知的剪枝方法用于高效的扩散视频生成

Li, Sheng, Sui, Yang, Ran, Junhao, Yuan, Bo, Dai, Yue, Tang, Xulong

Abstract

Video diffusion models have recently enabled high-quality video generation with ViT-based architectures, but remain computationally intensive because generation requires attention computation over long spatiotemporal sequences. Token pruning has proven effective for ViTs and VLMs. However, most prior pruning methods are attention-based and operate per frame, failing to ensure the vital temporal coherence across frames in video generation tasks. In practice, naively adopting attention-only pruning causes noticeable degradation due to worsened background consistency, flickering, and reduced image quality. To address this, we propose TAPE, a training-free Temporal Aware Pruning for Efficient diffusion-based video generation. TAPE (i) applies temporal smoothing to align token importance across adjacent frames and suppress selection jitter; (ii) performs token reselection in selected layers to align token pruning with the layers' diverse semantic focus and avoid error accumulation in specific areas; and (iii) adopts a timestep-level budget schedule that prunes aggressively at early noisy steps and relaxes pruning during fidelity-critical refinement. The experimental results show that TAPE delivers significant speedups while preserving high visual fidelity, outperforming prior token reduction approaches.
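The abstract names temporal smoothing of token importance and a timestep-level pruning budget without giving formulas. The following is a minimal sketch of how such pieces could fit together, not the authors' implementation: the neighbour-averaging rule, the linear budget, and the names `smooth_importance`, `keep_ratio`, `select_tokens`, `alpha`, `lo`, and `hi` are all assumptions made for illustration.

```python
def smooth_importance(scores, alpha=0.5):
    """Blend each frame's token scores with the mean of its adjacent frames.

    scores: list of frames, each a list of per-token importance values.
    alpha:  weight kept on the frame's own scores (hypothetical knob).
    """
    smoothed = []
    for f, frame in enumerate(scores):
        # Adjacent frames that actually exist (first and last frame have one).
        neighbours = [scores[g] for g in (f - 1, f + 1) if 0 <= g < len(scores)]
        row = []
        for i, own in enumerate(frame):
            if neighbours:
                mean_nb = sum(nb[i] for nb in neighbours) / len(neighbours)
                row.append(alpha * own + (1 - alpha) * mean_nb)
            else:
                row.append(own)
        smoothed.append(row)
    return smoothed


def keep_ratio(step, num_steps, lo=0.4, hi=1.0):
    """Assumed linear budget: keep few tokens early (noisy), more near the end."""
    if num_steps <= 1:
        return hi
    return lo + (hi - lo) * step / (num_steps - 1)


def select_tokens(frame_scores, ratio):
    """Return the sorted indices of the top `ratio` fraction of tokens."""
    k = max(1, round(ratio * len(frame_scores)))
    ranked = sorted(range(len(frame_scores)),
                    key=lambda i: frame_scores[i], reverse=True)
    return sorted(ranked[:k])
```

Smoothing before selection is what reduces frame-to-frame jitter in this sketch: a token that scores high in only one frame is pulled toward its neighbours' scores, so the kept set changes less abruptly between frames.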

Chinese Translation

视频扩散模型最近使得基于ViT架构的高质量视频生成成为可能，但由于生成过程需要对长时空序列进行注意力计算，因此仍然计算密集。令牌剪枝已被证明对ViTs和VLMs有效。然而，大多数先前的剪枝方法基于注意力，并且按帧操作，未能确保视频生成任务中帧间的重要时间一致性。在实践中，简单地采用仅基于注意力的剪枝会导致背景一致性下降、闪烁现象明显以及图像质量降低。为了解决这个问题，我们提出了TAPE，一种无训练的基于时间感知的剪枝方法，用于高效的扩散视频生成。TAPE (i) 通过时间平滑对齐相邻帧的令牌重要性，抑制选择抖动；(ii) 在选定层中进行令牌重新选择，以使令牌剪枝与层的多样语义焦点对齐，并避免特定区域的错误积累；(iii) 采用时间步级预算调度，在早期噪声步骤中进行激进剪枝，并在关键保真度的细化过程中放松剪枝。实验结果表明，TAPE在保持高视觉保真度的同时显著加快了速度，超越了之前的令牌减少方法。

cs.CV / 210 / 2605.17865

Imaging Hidden Objects with Consumer LiDAR via Motion Induced Sampling

通过运动诱导采样使用消费级LiDAR成像隐藏物体

Somasundaram, Siddharth, Young, Aaron, Dave, Akshat, Pediredla, Adithya, Raskar, Ramesh

Abstract

LiDARs are being increasingly deployed for consumer imaging in handheld, wearable, and robotic applications. These sensors can capture the time-of-flight of light at picosecond resolution, which, in principle, enables them to capture information about objects hidden from their field of view. While such non-line-of-sight (NLOS) imaging capabilities have been shown on research-grade LiDARs, they are challenging to achieve on consumer devices due to poor signal quality resulting from low laser power, low spatial resolution, and object and camera motion. Inspired by burst photography and synthetic aperture radar, we propose a multi-frame fusion strategy to overcome these challenges and demonstrate NLOS imaging on consumer LiDAR. We first introduce the motion-induced aperture sampling model to unify the effects of object shape, object motion, and camera motion under a single measurement model. Using this model, we demonstrate several NLOS capabilities on a smartphone-grade LiDAR: (1) 3D reconstruction, (2) single and multi-object tracking, and (3) camera localization using hidden objects. Previously, NLOS imaging capabilities were largely restricted to bulky and expensive research-grade hardware that requires extensive setup and calibration. Our results represent a shift towards plug-and-play NLOS imaging, where anyone can image hidden objects with off-the-shelf hardware (under $100) and no additional setup. We believe that democratization of such capabilities will advance consumer applications of NLOS imaging.
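To give a sense of what picosecond timing means in distance terms, here is a small worked calculation. It uses only the speed of light and is not taken from the paper; the function names are illustrative. For a multi-bounce NLOS path the measured time corresponds to the total optical path length, while for a direct return the range is half of that.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact SI value


def path_length_m(time_of_flight_s):
    """Total distance travelled by light in the given time."""
    return SPEED_OF_LIGHT_M_PER_S * time_of_flight_s


def round_trip_range_m(time_of_flight_s):
    """Range to a directly visible surface: light travels out and back."""
    return path_length_m(time_of_flight_s) / 2.0
```

One picosecond of timing resolution corresponds to roughly 0.3 mm of optical path, and one nanosecond to roughly 30 cm.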

Chinese Translation

LiDAR（激光雷达）在手持、可穿戴和机器人应用中越来越多地被用于消费级成像。这些传感器能够以皮秒分辨率捕获光的飞行时间，从理论上讲，使它们能够捕获视野外隐藏物体的信息。尽管这种非视线（NLOS）成像能力在研究级LiDAR上得到了验证，但由于激光功率低、空间分辨率低以及物体和相机运动等因素导致的信号质量差，使得在消费级设备上实现这一能力面临挑战。受到爆发摄影和合成孔径雷达的启发，我们提出了一种多帧融合策略，以克服这些挑战，并在消费级LiDAR上展示NLOS成像。我们首先引入运动诱导孔径采样模型，以统一物体形状、物体运动和相机运动在单一测量模型下的影响。利用该模型，我们在智能手机级LiDAR上展示了几种NLOS能力：（1）三维重建，（2）单个和多个物体跟踪，以及（3）利用隐藏物体进行相机定位。之前，NLOS成像能力主要局限于笨重且昂贵的研究级硬件，这些硬件需要大量的设置和校准。我们的结果代表了向即插即用NLOS成像的转变，任何人都可以使用现成的硬件（$<100）成像隐藏物体，而无需额外的设置。我们相信，这种能力的民主化将推动NLOS成像在消费级应用中的发展。

cs.CV / 211 / 2605.17869

PySIFT: GPU-Resident Deterministic SIFT for Deep Learning Vision Pipelines

PySIFT：用于深度学习视觉管道的GPU驻留确定性SIFT

S., Sivakumar K., Rahman, Mohammad Daniyalur, Matta, Gopi Raju

Abstract

A widespread assumption in local feature research holds that classical handcrafted descriptors are accuracy-limited relics best replaced by learned alternatives. We show this is wrong. Through an 8-configuration ablation spanning four benchmarks (HPatches, ROxford5K, IMC Phototourism, MegaDepth), we demonstrate that classical SIFT with DSP multi-scale pooling outperforms neural descriptor and orientation replacements (HardNet, OriNet) on every accuracy metric while running 2-18× faster, and that learned matchers (LightGlue) complement rather than supersede classical features. The conclusion reframes a decade of work: not "replace SIFT" but "compose with SIFT," classical extraction paired with learned matching only where geometric context demands it. This finding was invisible because no prior GPU SIFT kept the complete pipeline in VRAM or offered modularity for controlled classical-vs-learned ablations. We present PySIFT, the first fully GPU-resident SIFT, implemented in CuPy/Numba CUDA kernels with DLPack zero-copy handoff to downstream DL frameworks (a submillisecond O(1) metadata swap regardless of keypoint count). On a laptop-grade NVIDIA RTX 3050 (4 GB VRAM), PySIFT achieves: (i) higher Mean Matching Accuracy (MMA) than OpenCV SIFT on HPatches, (ii) 383 ms faster per pair on high-resolution MegaDepth, (iii) higher geometric accuracy on cross-dataset benchmarks (+5.6 pp AUC@10° on MegaDepth, more inliers on IMC Phototourism), and (iv) bitwise deterministic output: identical keypoints and descriptors across runs, with detection reproducing identically even across GPU architectures, a guarantee that learned extractors cannot match without significant performance sacrifice, and cannot achieve at all across GPU architectures due to cuDNN's architecture-dependent algorithm selection. PySIFT is open-source, requiring no C++ compilation.

Chinese Translation

在局部特征研究中，一个普遍的假设认为经典的手工设计描述符是精度受限的遗物，最好用学习到的替代品来替代。我们证明这一观点是错误的。通过在四个基准（HPatches、ROxford5K、IMC Phototourism、MegaDepth）上进行的8种配置消融实验，我们展示了经典的SIFT结合DSP多尺度池化在每个精度指标上都优于神经描述符和方向替代品（HardNet、OriNet），同时运行速度快2到18倍，并且学习到的匹配器（LightGlue）是对经典特征的补充，而不是取代。结论重新框定了十年的研究：不是“替代SIFT”，而是“与SIFT组合”，经典提取与学习匹配仅在几何上下文需要时结合。这一发现之前未被注意，因为没有先前的GPU SIFT能够将完整的管道保存在VRAM中或提供可控的经典与学习消融的模块化。我们提出了PySIFT，这是第一个完全驻留于GPU的SIFT，使用CuPy/Numba CUDA内核实现，并通过DLPack实现零拷贝交接到下游深度学习框架，无论关键点数量如何，均可实现亚毫秒的O(1)元数据交换。在一台配备NVIDIA RTX 3050（4 GB VRAM）的笔记本电脑上，PySIFT实现了：（i）在HPatches上比OpenCV SIFT更高的平均匹配精度（MMA），（ii）在高分辨率MegaDepth上每对快383毫秒，（iii）在跨数据集基准上更高的几何精度（在MegaDepth上AUC@10°提高5.6个百分点，在IMC Phototourism上获得更多内点），以及（iv）逐位确定性输出：在不同运行中关键点和描述符完全相同，检测结果在不同GPU架构间也完全一致。这是学习型提取器在不显著牺牲性能的情况下无法匹配的保证，且由于cuDNN的架构依赖算法选择，在不同GPU架构间完全无法实现。PySIFT是开源的，无需C++编译。

cs.CV / 212 / 2605.17875

HexagonalWarriorMamba: Superior Threshold-Dependent Multi-label Classification of 12-Lead ECG Cardiac Abnormalities

六边形战士曼巴：优越的阈值依赖多标签12导联心电图心脏异常分类

Jiang, Huawei, Mutahira, Husna, Wei, Shibo, Li, Jiahang, Shin, Vladimir, Yi, Juneho, Ryu, Dongryeol, Park, Wonyoung, Muhammad, Mannan Saeed

Abstract

The accurate automated diagnosis of cardiac abnormalities from 12-lead electrocardiograms (ECGs) is critical for managing cardiovascular disease. However, detecting concurrent conditions remains a challenge for traditional deep learning models, which often have limited ability to model the long-range dependencies inherent in ECG signals. This manuscript proposes HexagonalWarriorMamba (HWMamba), a framework built on the Mamba architecture that processes 12-lead ECGs as single-channel 2D images rather than conventional 1D time series. By integrating a hierarchical architecture with a 2D Selective Scan mechanism, HWMamba is designed to model global context and complex spatial relationships within the data. The model is evaluated on the PhysioNet/Computing in Cardiology Challenge 2021 dataset, which includes 26 diagnostic labels and comprises recordings collected from seven institutions across four countries and three continents. Results demonstrate that HWMamba outperforms current state-of-the-art (SOTA) methods across five key threshold-dependent metrics, including Challenge Score and Subset Accuracy. These improvements provide a balance between strong discriminative capability and effective threshold selection derived from the training data, while maintaining near-SOTA performance in Macro AUROC. This Hexagonal Warrior performance, reflecting consistent performance across multiple evaluation dimensions, positions HWMamba as a robust and versatile approach for multi-label ECG classification.
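Subset Accuracy is one of the threshold-dependent metrics the abstract mentions. A minimal sketch of why the choice of per-label thresholds matters for it follows. The function names and the `>=` decision rule are assumptions for illustration, and this is not the Challenge's official scoring code.

```python
def binarize(probabilities, thresholds):
    """Turn per-label probabilities into 0/1 predictions, one threshold per label."""
    return [int(p >= t) for p, t in zip(probabilities, thresholds)]


def subset_accuracy(y_true, y_prob, thresholds):
    """Fraction of recordings whose full predicted label set matches exactly."""
    exact_matches = sum(
        binarize(probs, thresholds) == list(labels)
        for labels, probs in zip(y_true, y_prob)
    )
    return exact_matches / len(y_true)
```

Because a recording only counts when every label is right, moving a single threshold can change the score even though the underlying probabilities (and therefore AUROC) stay the same.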

Chinese Translation

从12导联心电图（ECG）中准确自动诊断心脏异常对于心血管疾病的管理至关重要。然而，检测并发症对于传统深度学习模型仍然是一个挑战，这些模型通常在建模心电信号固有的长程依赖性方面能力有限。本文提出了六边形战士曼巴（HexagonalWarriorMamba，HWMamba），这是一个基于Mamba架构的框架，将12导联心电图处理为单通道二维图像，而不是传统的一维时间序列。通过整合分层架构与二维选择性扫描机制，HWMamba旨在建模数据中的全局上下文和复杂空间关系。该模型在PhysioNet/Computing in Cardiology Challenge 2021数据集上进行了评估，该数据集包括26个诊断标签，包含来自四个国家和三个大洲的七个机构收集的记录。结果表明，HWMamba在五个关键的阈值依赖指标上优于当前的最先进（SOTA）方法，包括挑战得分和子集准确率。这些改进在强大的区分能力与从训练数据中获得的有效阈值选择之间提供了平衡，同时在宏观AUROC上保持接近SOTA的表现。这种六边形战士的表现反映了在多个评估维度上持续的表现，使HWMamba成为一种强大且多功能的多标签心电图分类方法。

cs.CV / 213 / 2605.17904

Beyond Euclidean Prototypes: Spectral Disentanglement and Geodesic Matching for Few-Shot Medical Image Segmentation

超越欧几里得原型：用于少样本医学图像分割的谱解耦和测地匹配

Jia, Penghao, Huang, Zhiyong, Hou, Mingyang, Yu, Zhi, Miao, Shuai, Wang, Jiahong, Yan, Yan

Abstract

Few-Shot Medical Image Segmentation (FSMIS) aims to delineate novel anatomical targets from one or a few annotated support images, addressing the annotation scarcity in medical imaging. Notwithstanding recent advancements, current prototype-based methods are bottlenecked by two coupled limitations: 1) cue entanglement, where a single spatial-domain prototype is forced to summarise organ silhouette, parenchymal texture and boundary appearance simultaneously, so any support-query mismatch on one cue propagates indiscriminately to the others; and 2) topology-blind matching, where cosine similarity measures distance in the ambient Euclidean space and ignores the connectivity of the underlying feature manifold, causing fragmented activations inside low-contrast organs and leakage into neighbouring tissues. To this end, we propose Spectral-Geodesic Prototype Network (SGP-Net), built around a Spectral-Geodesic Prototype Module with two coupled components. A Spectral Prototype Bank (SPB) decomposes support and query features into low-, mid- and high-frequency bands via learnable radial Fourier filters, yielding three disentangled prototypes per class that separately encode shape, texture and boundary cues. A Geodesic Matcher (GM) then replaces cosine similarity with a differentiable heat-diffusion approximation of geodesic distance, propagating matching signals along a feature affinity graph so that on-manifold pixels accumulate consistent responses while off-manifold look-alikes are suppressed. Experiments on three public FSMIS benchmarks demonstrate that SGP-Net achieves competitive performance against recent state-of-the-art methods.

Chinese Translation

少样本医学图像分割（FSMIS）旨在从一张或几张标注的支持图像中描绘出新的解剖目标，以应对医学成像中的标注稀缺问题。尽管近期取得了一些进展，当前基于原型的方法仍受到两个相互关联的限制：1）线索纠缠，其中单一的空间域原型被迫同时总结器官轮廓、实质纹理和边界外观，因此任何在某一线索上的支持-查询不匹配都会无差别地传播到其他线索；2）拓扑盲匹配，其中余弦相似度在环境欧几里得空间中测量距离，而忽略了底层特征流形的连通性，导致低对比度器官内的激活碎片化，并渗漏到邻近组织中。为此，我们提出了谱-测地原型网络（SGP-Net），其核心是一个具有两个耦合组件的谱-测地原型模块。谱原型库（SPB）通过可学习的径向傅里叶滤波器将支持和查询特征分解为低、中和高频带，从而为每个类别生成三个解耦的原型，分别编码形状、纹理和边界线索。接着，测地匹配器（GM）用可微分的热扩散近似测地距离替代余弦相似度，沿特征亲和图传播匹配信号，使得在流形上的像素积累一致的响应，而流形外的相似物则被抑制。在三个公共FSMIS基准上的实验表明，SGP-Net在与近期最先进的方法相比时表现出竞争力。

cs.CV / 214 / 2605.17907

One Model to Translate Them All: Universal Any-to-Any Translation for Heterogeneous Collaborative Perception

一个模型翻译所有：异构协作感知的通用任意到任意翻译

Li, Yang, Li, Weize, Yuan, Quan, Shao, Congzhang, Luo, Guiyang, Ba, Yunqi, Zhu, Xuanhan, Ding, Xinyuan, Fu, Xiaoyuan, Li, Jinglin

Abstract

By sharing intermediate features, collaborative perception extends each agent's sensing beyond standalone limits, but real-world feature modality heterogeneity remains a key barrier to effective fusion. Most existing methods, including direct adaption and protocol-based transformation, typically rely on training adapters for newly emerging feature modalities and often require additional retraining or fine-tuning. Such repeated training is costly and is often infeasible across manufacturers due to model and data privacy constraints, limiting real-world scalability. To address this issue, we propose UniTrans, a universal any-to-any feature modality translation model that instantiates translators on the fly for arbitrary modalities. UniTrans pretrains a bank of translator expert parameters and learns their combination coefficients as a function of source-to-target modality mapping. The mapping is measured in a modality-intrinsic latent space, where an intrinsic encoder extracts modality-specific yet scene-invariant codes from single-frame intermediate features, enabling UniTrans to instantiate translators in a zero-shot manner. Experiments on OPV2V-H and DAIR-V2X demonstrate that UniTrans consistently outperforms state-of-the-art methods in both simulated and real-world settings, enabling efficient any-to-any translation through a universal model. The code is available at https://github.com/CheeryLeeyy/UniTrans.
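The idea of instantiating a translator as a weighted combination of pretrained expert parameters can be sketched as below. This is a toy illustration under stated assumptions, not the UniTrans code: experts are plain weight matrices, the routing logits are passed in by hand (in the paper the coefficients are learned from the source-to-target modality mapping), and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_bank, coefficients):
    """Weighted sum of same-shaped expert weight matrices."""
    rows, cols = len(expert_bank[0]), len(expert_bank[0][0])
    mixed = [[0.0] * cols for _ in range(rows)]
    for weights, coeff in zip(expert_bank, coefficients):
        for r in range(rows):
            for c in range(cols):
                mixed[r][c] += coeff * weights[r][c]
    return mixed


def instantiate_translator(expert_bank, routing_logits):
    """Build one translator matrix from the bank without any retraining."""
    return mix_experts(expert_bank, softmax(routing_logits))


def translate(feature, translator):
    """Apply the instantiated translator (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, feature)) for row in translator]
```

The point of the design is that a new modality pair needs only new coefficients, which are cheap to compute, while the expert bank itself stays fixed.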

Chinese Translation

通过共享中间特征，协作感知将每个代理的感知能力扩展到独立限制之外，但现实世界中特征模态的异质性仍然是有效融合的关键障碍。大多数现有方法，包括直接适应和基于协议的转换，通常依赖于为新出现的特征模态训练适配器，并且通常需要额外的重新训练或微调。这种重复训练成本高昂，并且由于模型和数据隐私限制，在制造商之间往往不可行，从而限制了现实世界的可扩展性。为了解决这个问题，我们提出了UniTrans，一个通用的任意到任意特征模态翻译模型，它能够即时为任意模态实例化翻译器。UniTrans预训练了一组翻译器专家参数，并学习它们的组合系数作为源到目标模态映射的函数。该映射在模态内在潜在空间中进行测量，其中内在编码器从单帧中间特征中提取模态特定但场景不变的编码，使UniTrans能够以零样本的方式实例化翻译器。在OPV2V-H和DAIR-V2X上的实验表明，UniTrans在模拟和现实世界环境中始终优于最先进的方法，通过一个通用模型实现高效的任意到任意翻译。代码可在 https://github.com/CheeryLeeyy/UniTrans 获取。

cs.CV / 215 / 2605.17915

SurgLQA: Scalable Long-Horizon Surgical Video Question Answering

SurgLQA：可扩展的长时间段外科视频问答

Guo, Diandian, Yang, Xikai, Li, Ruiyang, Pei, Jialun, Heng, Pheng-Ann

Abstract

Surgical Video Question Answering (VideoQA) provides a promising paradigm for dynamic intraoperative interpretation, enabling real-time decision support and context-aware retrieval in clinical environments. Nevertheless, existing approaches are predominantly restricted to images or short clips, limiting their ability to model long-range procedural dynamics and causal dependencies across extended surgical workflows. To address this challenge, we propose SurgLQA, a unified long-horizon VideoQA framework for scalable surgical reasoning. This framework incorporates Faithful Temporal Consolidation (FTC), which leverages intrinsic temporal cues to construct compact long-range representations while preserving fine-grained temporal fidelity. Further, we develop Temporally-Grounded Multi-Policy Scaling (TMS), an adaptive test-time inference paradigm that strategically adjusts policy-level reasoning capacity within temporally grounded contexts. To facilitate systematic evaluation, we restructured a long-duration colonoscopy VideoQA benchmark, Colon-LQA, and conducted extensive experiments on Colon-LQA and REAL-Colon-VQA. Experimental results demonstrate that our approach achieves consistent performance gains in long-range reasoning with temporally grounded inference. Code link: https://github.com/RascalGdd/SurgLQA.

Chinese Translation

外科视频问答（VideoQA）为动态的术中解释提供了一种有前景的范式，使得在临床环境中能够实现实时决策支持和上下文感知检索。然而，现有的方法主要局限于图像或短片段，限制了它们在建模长时间程序动态和跨越扩展外科工作流程的因果依赖方面的能力。为了解决这一挑战，我们提出了SurgLQA，一个统一的长时间段VideoQA框架，用于可扩展的外科推理。该框架结合了忠实时间整合（Faithful Temporal Consolidation, FTC），利用内在的时间线索构建紧凑的长时间段表示，同时保持细粒度的时间保真度。此外，我们开发了时间基础的多策略扩展（Temporally-Grounded Multi-Policy Scaling, TMS），这是一种自适应的测试时推理范式，能够在时间基础的上下文中战略性地调整策略级推理能力。为了便于系统评估，我们重构了一个长时间段的结肠镜检查VideoQA基准，Colon-LQA，并在Colon-LQA和REAL-Colon-VQA上进行了广泛的实验。实验结果表明，我们的方法在时间基础推理的长时间段推理中实现了一致的性能提升。代码链接：https://github.com/RascalGdd/SurgLQA。

cs.CV / 216 / 2605.17916

PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis

PanoWorld：一种用于一致性全屋全景合成的生成空间世界模型

Jia, Jinrang, Li, Zhenjia, Hu, Yijiang, Shi, Yifeng

Abstract

Generating a consistent whole-house VR tour from a floorplan and style reference requires both photorealistic panoramas and cross-view spatial coherence. Pure 2D generators produce appealing single panoramas but re-imagine geometry and materials when the viewpoint changes, whereas monolithic 3D generation becomes expensive and loses fine texture at multi-room scale. We introduce PanoWorld, a generative spatial world model that treats whole-house synthesis as autoregressive generation of node-based 360-degree panoramas, matching the discrete navigation used by real VR tour products. PanoWorld uses a floorplan-derived 3D shell as a global geometric proxy and a dynamic 3D Gaussian Splatting cache as renderable spatial memory. A feed-forward panoramic LRM designed for metric-scale multi-room 360-degree inputs lifts generated panoramas into local 3DGS updates, while Room-aware Group Attention suppresses cross-room feature interference. A topology-aware progressive caching strategy fuses these local updates without repeatedly reconstructing the full history. By decoupling shell-based geometry guidance from cache-rendered visual memory, PanoWorld preserves high-frequency 2D synthesis quality while improving cross-node layout and material consistency. The project link is https://jjrcn.github.io/PanoWorld-project-home/

Chinese Translation

从平面图和风格参考生成一致的全屋虚拟现实（VR）导览需要同时具备照片级真实感的全景图和跨视角的空间一致性。纯2D生成器能够生成吸引人的单一全景图，但在视角变化时会重新构想几何形状和材料，而单一的3D生成则成本高昂，并在多房间规模上失去细致纹理。我们提出了PanoWorld，这是一种生成空间世界模型，将全屋合成视为基于节点的360度全景的自回归生成，匹配真实VR导览产品所使用的离散导航。PanoWorld使用基于平面图的3D外壳作为全局几何代理，并使用动态3D高斯点缓存作为可渲染的空间记忆。为度量尺度的多房间360度输入设计的前馈全景LRM将生成的全景图提升为局部3DGS更新，而房间感知的组注意力机制抑制了跨房间特征干扰。一种拓扑感知的渐进缓存策略融合了这些局部更新，而无需重复重建完整历史。通过将基于外壳的几何引导与缓存渲染的视觉记忆解耦，PanoWorld在提高跨节点布局和材料一致性的同时保持了高频2D合成质量。项目链接为 https://jjrcn.github.io/PanoWorld-project-home/

cs.CV / 217 / 2605.17921

An Efficient Streaming Video Understanding Framework with Agentic Control

一种具有智能体控制的高效流式视频理解框架

Liu, Jinming, Huang, Jianguo, Jia, Zhaoyang, Li, Jiahao, Zhang, Xiaoyi, Guo, Zongyu, Li, Bin, Zeng, Wenjun, Lu, Yan, Jin, Xin

Abstract

Streaming video requires handling dynamic information density under strict latency budgets. Yet, existing methods typically employ static strategies, such as fixed memory compression or reliance on a single model, forcing a trade-off: fast models fail on complex queries, while always-on heavy models violate real-time constraints and overcomplicate simple queries. Rather than fixing these decisions upfront, we propose R3-Streaming (Remember, Respond, Reason), which formulates streaming video understanding as a cascaded control problem: for each query, the system compresses memory, judges response readiness, and routes computation sequentially, so that each downstream decision builds on progressively refined information states. To optimize this pipeline, we introduce an age-aware forgetting policy for memory compression, as aggressively compressing historical frames can yield substantial performance gains. For compute routing, we propose TB-GRPO, a target-balanced reinforcement learning objective that routes hard queries to a stronger model while preventing mode collapse. Extensive evaluations demonstrate that R3-Streaming achieves state-of-the-art results among streaming MLLMs, reaching 57.92 on OVO-Bench and 76.36 on StreamingBench, while reducing visual token usage by 95 to 96 percent.
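The abstract does not specify the form of the age-aware forgetting policy. One plausible reading, sketched here purely as an assumption, is a per-frame token budget that decays with the frame's age so that recent frames keep more detail. The exponential schedule, `half_life`, `floor`, and the "keep the leading tokens" rule are all hypothetical stand-ins.

```python
def tokens_to_keep(age, full_tokens, half_life=8, floor=1):
    """Assumed exponential budget: halve the token count every `half_life` frames."""
    budget = full_tokens * 0.5 ** (age / half_life)
    return max(floor, int(budget))


def compress_memory(frames, full_tokens, half_life=8):
    """Trim each stored frame to its age-dependent budget.

    frames: token lists ordered oldest first; the last entry is the newest.
    Keeping the leading tokens is a placeholder for any importance ranking.
    """
    newest = len(frames) - 1
    return [
        frame[:tokens_to_keep(newest - index, full_tokens, half_life)]
        for index, frame in enumerate(frames)
    ]
```

With `full_tokens=64` and the default half-life, the newest frame keeps all 64 tokens, a frame 8 steps old keeps 32, and very old frames shrink to the single-token floor.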

Chinese Translation

流媒体视频需要在严格的延迟预算下处理动态信息密度。然而，现有方法通常采用静态策略，例如固定的内存压缩或依赖单一模型，这迫使我们做出权衡：快速模型在复杂查询上表现不佳，而始终在线的重型模型则违反实时约束并使简单查询过于复杂。我们没有预先固定这些决策，而是提出R3-Streaming（记忆、响应、推理），将流媒体视频理解形式化为一个级联控制问题：对于每个查询，系统压缩内存，判断响应准备情况，并顺序路由计算，使得每个下游决策基于逐步精炼的信息状态。为了优化这一流程，我们引入了一种年龄感知的遗忘策略用于内存压缩，因为激进地压缩历史帧可以带来显著的性能提升。对于计算路由，我们提出了TB-GRPO，这是一种目标平衡的强化学习目标，能够将困难查询路由到更强的模型，同时防止模式崩溃。大量评估表明，R3-Streaming在流式多模态大语言模型（MLLMs）中达到了最先进的结果，在OVO-Bench上得分57.92，在StreamingBench上得分76.36，同时将视觉标记的使用减少了95%到96%。

cs.CV / 218 / 2605.17933

AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

AtlasVA：无教师的自演化视觉技能记忆框架用于视觉语言模型代理

Wang, Pan, Hu, Yihao, Liu, Xiujin, Yang, Jingchu, Wang, Hang, Wen, Zhihao

Abstract

Vision-language model (VLM) agents increasingly rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, yet most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design is poorly matched to spatial decision making: geometric priors are compressed into lossy language, and sparse interaction is often supervised through delayed textual feedback rather than dense visually grounded signals. We argue that reusable experience for VLM agents should remain visually grounded. Based on this insight, we propose AtlasVA, a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. AtlasVA further evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, and reuses these self-evolving atlases as potential-based shaping rewards for reinforcement learning. This unifies perception, memory, and optimization without external LLM supervision. Experiments on Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation benchmarks show that AtlasVA consistently outperforms text-centric memory baselines and competitive VLM agents, with especially strong gains on spatially intensive tasks. Homepage: https://wangpan-ustc.github.io/AtlasvaWeb
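Potential-based shaping adds γΦ(s′) − Φ(s) to the environment reward, a form known to leave the optimal policy unchanged. The sketch below shows that standard formula applied to grid atlases. Defining the potential as affinity minus danger is an assumption made here for illustration, since the abstract does not say how the two atlases are combined.

```python
def potential(cell, affinity, danger):
    """Hypothetical potential: attractive cells score high, dangerous cells low."""
    row, col = cell
    return affinity[row][col] - danger[row][col]


def shaped_reward(env_reward, state, next_state, affinity, danger, gamma=0.99):
    """Environment reward plus the potential-based shaping term."""
    shaping = (gamma * potential(next_state, affinity, danger)
               - potential(state, affinity, danger))
    return env_reward + shaping
```

Stepping toward a high-affinity cell yields a positive bonus and stepping into a high-danger cell yields a penalty, which gives a dense signal on every move even when the environment reward itself is sparse.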

Chinese Translation

视觉语言模型（VLM）代理越来越依赖于增强记忆的强化学习，以便在长时间跨度的任务中重用经验，然而大多数现有框架将记忆存储为文本，并依赖专有的教师模型来总结或精炼这些记忆。这种设计与空间决策制定不匹配：几何先验被压缩为有损的语言，稀疏的交互通常通过延迟的文本反馈而非密集的视觉基础信号进行监督。我们认为，VLM代理的可重用经验应保持视觉基础。基于这一见解，我们提出了AtlasVA，一个无教师的视觉技能记忆框架，将记忆组织为三个互补层次：空间热图、视觉示例和符号文本技能。AtlasVA进一步直接从轨迹统计和轻量级网格启发式演化出危险和亲和力图谱，并将这些自演化的图谱作为基于潜力的塑形奖励用于强化学习。这一方法在没有外部大型语言模型（LLM）监督的情况下统一了感知、记忆和优化。在Sokoban、FrozenLake、3D具身导航和3D机器人操控基准上的实验表明，AtlasVA始终优于以文本为中心的记忆基线和有竞争力的VLM代理，尤其在空间密集型任务上取得了显著的提升。主页：https://wangpan-ustc.github.io/AtlasvaWeb

cs.CV / 219 / 2605.17942

UAVFF3D: A Geometry-Aware Benchmark for Feed-Forward UAV 3D Reconstruction

UAVFF3D：一种面向几何的前馈无人机三维重建基准

Yang, Xiang, Wang, Yongli, Li, HaiFeng, Zhang, Yunsheng

Abstract

Feed-forward 3D reconstruction has recently demonstrated strong generalization across diverse scenes, yet its performance in UAV imagery remains underexplored due to distinctive acquisition geometries, large viewpoint variations, and ambiguity between horizontal field of view and flight height. We present UAVFF3D, a geometry-aware benchmark for feed-forward UAV 3D reconstruction, comprising over 170K real UAV images and more than 370K high-quality synthetic images. The benchmark also includes a challenging diagnostic test subset designed to analyze model behavior under UAV-specific geometric ambiguities. Building on UAVFF3D, we propose an evaluation protocol that jointly assesses camera-geometry estimation and reconstruction accuracy, addressing limitations of existing evaluations that rely on separate alignments. Experiments on four representative feed-forward reconstruction models show that UAV-domain adaptation substantially improves performance, reducing Ray Error by up to 84.2%, Pose ATE by up to 76.0%, and Chamfer Distance by up to 41.1%. Further analysis reveals that domain adaptation mitigates rotation-estimation degradation in oblique-view scenes and improves robustness under horizontal-field-of-view/height ambiguity. Incorporating camera priors further enhances reconstruction performance under UAV-specific acquisition geometries.
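Chamfer Distance, one of the reported metrics, measures how far each point in one cloud is from its nearest neighbour in the other. A brute-force sketch of one common variant follows (mean nearest-neighbour Euclidean distance, summed over both directions). Other variants use squared distances or average the two directions, and the abstract does not say which one the benchmark uses.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Brute force, O(len(a) * len(b)); fine for small illustrative inputs.
    """
    def one_way(source, target):
        # Mean distance from each source point to its nearest target point.
        return sum(
            min(math.dist(p, q) for q in target) for p in source
        ) / len(source)

    return one_way(points_a, points_b) + one_way(points_b, points_a)
```

Identical point sets give a distance of zero, and the value grows as the reconstructed surface drifts away from the reference.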

Chinese Translation

前馈三维重建最近在多样场景中展示了强大的泛化能力，但由于独特的获取几何、较大的视角变化以及水平视场与飞行高度之间的模糊性，其在无人机影像中的表现仍未得到充分探索。我们提出了UAVFF3D，这是一种面向几何的前馈无人机三维重建基准，包含超过17万张真实无人机图像和超过37万张高质量合成图像。该基准还包括一个具有挑战性的诊断测试子集，旨在分析模型在无人机特定几何模糊下的行为。在UAVFF3D的基础上，我们提出了一种评估协议，联合评估相机几何估计和重建精度，解决了现有评估依赖于单独对齐的局限性。在四个代表性的前馈重建模型上的实验表明，无人机领域适应显著提高了性能，Ray误差降低了最多84.2%，姿态绝对轨迹误差（Pose ATE）降低了最多76.0%，Chamfer距离降低了最多41.1%。进一步分析表明，领域适应减轻了斜视场景中的旋转估计退化，并在水平视场/高度模糊下提高了鲁棒性。结合相机先验进一步增强了在无人机特定获取几何下的重建性能。

cs.CV / 220 / 2605.17949

SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

SkyNative：一种用于遥感视觉证据推理的原生多模态框架

Yang, Xiao, Fu, Ronghao, Lin, Zhiwen, Duan, Zhuoran, Zhu, Jiashun, Hu, Jiasen, Sun, Lang, Zhang, Weipeng, Liu, Jiaqi, Na, Xu, Liu, Haoran, Zhang, Weijie, Yang, Bo

Abstract

Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.

Chinese Translation

遥感视觉语言模型通常依赖于预训练的视觉编码器，将图像转换为语义特征，然后进行语言模型推理。尽管这种方法在场景级理解上有效，但可能会过早地压缩局部视觉证据，使得细粒度空间推理容易受到语言先验的影响，尤其是在超高分辨率的遥感图像中。我们提出了SkyNative，一个用于遥感的原生多模态框架，采用无编码器架构，移除了预训练的视觉主干，直接将图像表示为语言模型令牌空间中的原始补丁令牌。为了将低级视觉补丁与文本令牌进行协调，SkyNative引入了一种模态感知解耦机制，在统一的自回归主干中使用特定于模态的参数。我们进一步引入了一种视觉依赖基准，诊断模型是否通过渐进的视觉降解和误导性的文本提示将其答案与图像证据相结合。在标准的遥感理解任务和大规模空间推理评估中，SkyNative表现出更强的图像基础感知能力，并且在提示引发的语言先验方面具有更好的鲁棒性。这些结果表明，原生补丁级多模态建模是可靠的遥感视觉语言推理的一个有前景的方向。

cs.CV / 221 / 2605.17952

Counting Machine Parts

计数机器零件

Arockiaraj, Benedict Florance, Dinella, Elizabeth, Billa, Ankit, Anand, Ajay

Abstract

Counting objects in an image is a task applicable across many domains. For instance, crowd counting, inventory counting, and cell counting have been the focus of recent research. The major challenges in estimating the count of objects include overlapping objects, object scale issues, occlusions, and varying lighting conditions. In this report, we explore the problem of counting machine washer parts. Our technique is an extension of FamNet with an additional loss component, trained on the given dataset. We compare to three baseline methods: a traditional image processing pipeline, instance segmentation, and density map estimation. We evaluate the performance of these algorithms by computing the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) between the true object counts and the model outputs. Our approach achieves a performance of 1.96 MAE.
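The two evaluation metrics named in the abstract have standard definitions, shown here as a short sketch. The function names are illustrative, and the counts in the usage note are made up and are not results from the report.

```python
import math


def mean_absolute_error(true_counts, predicted_counts):
    """Average absolute difference between true and predicted counts."""
    if len(true_counts) != len(predicted_counts) or not true_counts:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_counts, predicted_counts))
    return total / len(true_counts)


def root_mean_squared_error(true_counts, predicted_counts):
    """Square root of the average squared difference; penalises large misses more."""
    if len(true_counts) != len(predicted_counts) or not true_counts:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((t - p) ** 2 for t, p in zip(true_counts, predicted_counts))
    return math.sqrt(total / len(true_counts))
```

For true counts `[10, 20, 30]` and predictions `[12, 18, 30]`, the absolute errors are 2, 2, and 0, so MAE is 4/3, while RMSE is the square root of 8/3.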

Chinese Translation

在图像中计数物体是一个适用于多个领域的任务。例如，人群计数、库存计数和细胞计数一直是近期研究的重点。估计物体数量的主要挑战包括物体重叠、物体尺度问题、遮挡以及变化的光照条件。在本报告中，我们探讨了计数机器垫圈零件的问题。我们的方法是对FamNet的扩展，增加了一个额外的损失组件，并在给定数据集上进行训练。我们与三种基线方法进行了比较：传统的图像处理流程、实例分割和密度图估计。我们通过计算真实物体计数与模型输出之间的平均绝对误差（Mean Absolute Error, MAE）和均方根误差（Root Mean Squared Error, RMSE）来评估这些算法的性能。我们的方法达到了1.96的MAE性能。

cs.CV / 222 / 2605.17954

A More Word-like Image Tokenization for MLLMs

更类似于词的图像标记化方法用于多模态大语言模型

Lee, Hyun, Jeong, Hyemin, Kim, Yejin, Choi, Hyungwook, Cho, Hyunsoo, Kim, Soo Kyung, Lee, Joonseok

Abstract

Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps the pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy-compute trade-off modifying neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with significantly fewer visual tokens, demonstrating robustness under limited token budgets, significantly reducing memory cost and latency while making visual inputs more compatible with LLMs. Our code is available at https://github.com/snuviplab/DiVT.

Chinese Translation

现代多模态大语言模型（MLLMs）通常保持语言模型不变，并训练一个视觉投影器，将像素映射到其嵌入空间中的一系列标记，从而使图像可以以与文本基本相同的形式呈现。然而，语言模型已被优化为在离散的、语义上有意义的标记上运行，而现有的视觉投影器则将图像转换为一长串连续且高度相关的嵌入。这导致视觉标记的行为与大语言模型（LLMs）最初训练理解的类似词单元有所不同。我们提出了一种新颖的解耦视觉标记化方法（Disentangled Visual Tokenization, DiVT），该方法将补丁嵌入聚类为一致的语义单元，使每个标记对应于一个独特的视觉概念，而不是一个固定的网格单元。DiVT 进一步根据图像复杂性调整其标记预算，提供明确的准确性与计算的权衡，而不修改视觉编码器或语言模型。在多种多模态基准测试中，DiVT 以显著更少的视觉标记匹配或超越基线，展示了在有限标记预算下的鲁棒性，显著降低了内存成本和延迟，同时使视觉输入与 LLMs 更加兼容。我们的代码可在 https://github.com/snuviplab/DiVT 获取。

cs.CV / 223 / 2605.17969

Generation Navigator: A State-Aware Agentic Framework for Image Generation

生成导航器：一种状态感知的图像生成代理框架

Liu, Jinming, Feng, Ruoyu, Wang, Yuqi, Zeng, Wenjun, Jin, Xin

Abstract

Despite rapid advances in text-to-image generation, faithfully realizing user intent remains challenging, often requiring manual multi-turn trial and error. To automate this process, existing systems rely on either simple prompt rewriting or closed-loop agents driven by hand-crafted rules, rather than learning to adapt actions to the evolving generation process. In this paper, we reformulate image generation as a state-conditioned action-making problem and propose Generation Navigator, a multi-turn T2I agent that learns to dynamically steer the generation trajectory and output the next action. However, training this agent via reinforcement learning introduces a critical credit assignment challenge: naively rewarding a trajectory based solely on a single state assigns equal credit to all actions in the rollout, ignores the quality dynamics across turns, and fails to distinguish actions that improve the trajectory from those that degrade it or waste turns without progress. We resolve this with PRE-GRPO (Peak-Retention-Efficiency Group Relative Policy Optimization), a trajectory-level reinforcement learning objective that explicitly rewards discovering a high-quality image (Peak), avoiding subsequent quality degradation across turns (Retention), and minimizing unnecessary turns (Efficiency). Experiments show substantial improvements across benchmarks, reaching a WISE score of 0.90 and 79.06% reasoning accuracy on T2I-ReasonBench.

Chinese Translation

尽管文本到图像生成技术迅速发展，但忠实实现用户意图仍然具有挑战性，通常需要手动进行多轮试错。为了自动化这一过程，现有系统依赖于简单的提示重写或由手工规则驱动的闭环代理，而不是学习根据不断变化的生成过程调整行动。在本文中，我们将图像生成重新表述为一个状态条件下的行动决策问题，并提出了生成导航器（Generation Navigator），这是一种多轮文本到图像（T2I）代理，能够学习动态引导生成轨迹并输出下一个行动。然而，通过强化学习训练该代理引入了一个关键的信用分配挑战：仅基于单一状态对轨迹进行简单奖励，会对回放中的所有行动分配相同的信用，忽视了各轮之间的质量动态，且无法区分改善轨迹的行动与那些使轨迹恶化或在没有进展的情况下浪费轮次的行动。我们通过PRE-GRPO（峰值保留效率组相对策略优化）解决了这一问题，这是一种轨迹级别的强化学习目标，明确奖励发现高质量图像（峰值）、避免随后的质量下降（保留）以及最小化不必要的轮次（效率）。实验结果显示，在基准测试中有显著改进，WISE得分达到0.90，T2I-ReasonBench上的推理准确率为79.06%。

cs.CV / 224 / 2605.17980

Learning to Balance: Decoupled Siamese Diffusion Transformer for Reference-Based Remote Sensing Image Super-Resolution

学习平衡：解耦的孪生扩散变换器用于基于参考的遥感图像超分辨率

Luo, Bin, Dong, Runmin, Luo, Zhaoyang, Zhang, Jinxiao, Zhao, Jiyao, Wei, Fan, Fu, Haohuan

Abstract

Diffusion-based methods demonstrate significant potential for remote sensing image super-resolution at large scaling factors, particularly in reference-based super-resolution (RefSR) where high-resolution reference images provide critical fine-grained texture priors. However, existing methods often suffer from a trade-off between over-reliance on reference information, which leads to texture artifacts, and underutilization, which results in insufficient detail recovery. To address these issues, we propose DS-DiT, a Decoupled Siamese Diffusion Transformer method that decouples low-resolution and reference interactions at the attention level. By enabling low-resolution structural priors and reference texture information to interact independently with the noisy latent, the framework effectively mitigates inter-source competition. Furthermore, to compensate for the limited local modeling ability of global attention, we introduce a Patch-Level Weights (PLW) module that adaptively modulates the fusion of conditional sources. In addition, this siamese architecture facilitates an autoguidance strategy during inference, which enhances reconstruction by exploiting the prediction discrepancy between strong and weak reference conditions. This approach boosts generation quality without additional training. Experimental results across multiple datasets and scaling factors demonstrate that DS-DiT outperforms existing methods in both quantitative metrics and visual fidelity.
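The autoguidance idea mentioned above, using the discrepancy between a strong and a weak prediction, is commonly written as a linear extrapolation. The sketch below shows that generic form on plain lists of floats. It assumes the two predictions come from the strong-reference and weak-reference conditions, and the function name and default scale are illustrative, not taken from the paper.

```python
def autoguide(strong_pred, weak_pred, scale=1.5):
    """Extrapolate from the weak prediction through the strong one.

    scale = 1.0 returns the strong prediction unchanged;
    scale > 1.0 pushes further along the (strong - weak) direction.
    """
    return [w + scale * (s - w) for s, w in zip(strong_pred, weak_pred)]

guided = autoguide([1.0, 2.0], [0.5, 2.0], scale=2.0)
```

Because the correction is applied only at inference time, a rule of this shape needs no additional training, which matches the claim in the abstract.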

Chinese Translation

基于扩散的方法在大尺度因子下显示出在遥感图像超分辨率方面的显著潜力，特别是在基于参考的超分辨率（RefSR）中，高分辨率参考图像提供了关键的细粒度纹理先验。然而，现有方法往往面临对参考信息的过度依赖，导致纹理伪影，以及对其的不足利用，造成细节恢复不足的权衡。为了解决这些问题，我们提出了DS-DiT，一种解耦的孪生扩散变换器方法，在注意力层面上解耦低分辨率和参考交互。通过使低分辨率结构先验和参考纹理信息能够独立地与含噪潜变量进行交互，该框架有效地减轻了源间竞争。此外，为了弥补全局注意力的有限局部建模能力，我们引入了一个自适应调制条件源融合的补丁级权重（PLW）模块。此外，这种孪生架构在推理过程中促进了一种自我引导策略，通过利用强弱参考条件之间的预测差异来增强重建。该方法在不增加额外训练的情况下提升了生成质量。在多个数据集和尺度因子上的实验结果表明，DS-DiT在定量指标和视觉保真度方面均优于现有方法。

cs.CV / 225 / 2605.17990

Low Latency Gaze Tracking via Latent Optical Sensing

通过潜在光学传感实现低延迟注视跟踪

Zheng, Yidan, Souza, Matheus, Kang, Kaizhang, Fu, Qiang, Amata, Hadi, Heidrich, Wolfgang

Abstract

We present a real-time gaze tracking system that directly acquires task-relevant latent features using a fully passive optical encoder. Instead of forming and processing full-resolution images, our approach leverages a microlens array with a co-designed binary chromium mask to perform spatially multiplexed optical encoding, producing a compact set of measurements sufficient for gaze estimation. By integrating sensing and feature extraction in the optical domain, the proposed system eliminates the need for high-bandwidth image readout and substantially reduces computational overhead. The encoded measurements are captured by a 4 x 4 phototransistor array and mapped to gaze direction using a lightweight neural network. Our proof-of-concept prototype enables an end-to-end sensing-to-inference latency of 3.4 ms, outperforming published research systems. We demonstrate the effectiveness of our approach on both simulated and real-world data, achieving competitive gaze estimation accuracy while significantly improving latency and energy efficiency compared to conventional camera-based pipelines. This work highlights the potential of task-driven optical sensing for ultra-low-latency, computationally efficient human-computer interaction systems.

Chinese Translation

我们提出了一种实时注视跟踪系统，该系统通过完全被动的光学编码器直接获取与任务相关的潜在特征。我们的方案不再形成和处理全分辨率图像，而是利用微透镜阵列和共设计的二进制铬掩模进行空间复用光学编码，生成一组紧凑的测量数据，足以进行注视估计。通过在光学域中集成传感和特征提取，所提议的系统消除了对高带宽图像读取的需求，并显著降低了计算开销。编码测量由4 x 4光电晶体管阵列捕获，并通过轻量级神经网络映射到注视方向。我们的概念验证原型实现了3.4毫秒的端到端传感到推理延迟，超越了已发布的研究系统。我们在模拟和真实世界数据上展示了我们方法的有效性，取得了具有竞争力的注视估计精度，同时相比传统基于相机的管道显著提高了延迟和能效。本研究突显了任务驱动光学传感在超低延迟、计算高效的人机交互系统中的潜力。

cs.CV / 226 / 2605.18010

Functionalization via Structure Completion and Motion Rectification

通过结构补全和运动校正实现功能化

Zhao, Mingrui, Perla, Sai Raj Kishore, Wang, Kai, Nag, Sauradip, Nguyen, Duc Anh, Peng, Jiayi, Wang, Ruiqi, Chang, Angel X., Savva, Manolis, Mahdavi-Amiri, Ali, Zhang, Hao

Abstract

Acquisition and creation of 3D assets have been largely view- or appearance-driven. As a result, existing digital 3D models often lack the requisite structural components to function as intended, such as joints, supports, interiors, or interaction elements. At the same time, even human-annotated motions are frequently error-prone, leading to physically implausible behavior. We introduce object functionalization, a novel task aimed at transforming visually plausible but non-functional 3D models into functional and physically operable ones. We formulate functionalization as a graph completion problem over a new functional graph representation, where labeled nodes represent object parts, labeled edges encode functional and contact relations, and movable nodes carry motion attributes, so that structural functional deficiencies manifest as missing nodes or incorrect edges. We develop a neural Graph Functionalizer (GraFu) to complete an incomplete graph representing a non-functional 3D object. The completed graph then drives a geometry realization stage that instantiates predicted connectors and structural elements in 3D, with the compelling side effect of rectifying erroneous human-annotated and predicted motions. To support training and evaluation, focusing on furniture as a rich and challenging target category, we introduce FurFun-233, a dataset of 233 paired non-functional and functionalized furniture models. On PartNet-Mobility ("zero-shot") and HSSD test sets, our method matches state-of-the-art methods in motion prediction accuracy while substantially improving functionality in terms of collision and connectivity.

Chinese Translation

3D资产的获取和创建在很大程度上是以视图或外观为驱动的。因此，现有的数字3D模型往往缺乏作为预期功能所需的结构组件，如关节、支撑、内部结构或交互元素。同时，即使是人工标注的运动也常常容易出错，导致物理上不合理的行为。我们提出了对象功能化（object functionalization），这是一项新任务，旨在将视觉上合理但非功能性的3D模型转变为功能性和物理可操作的模型。我们将功能化形式化为一个图补全问题，基于一种新的功能图表示，其中标记节点表示对象部件，标记边编码功能和接触关系，而可移动节点携带运动属性，从而使结构功能缺陷表现为缺失节点或不正确的边。我们开发了一种神经图功能化器（Graph Functionalizer，GraFu），用于补全表示非功能性3D对象的不完整图。补全后的图随后驱动几何实现阶段，在3D中实例化预测的连接器和结构元素，并有效地校正错误的人工标注和预测运动。为了支持训练和评估，专注于家具这一丰富且具有挑战性的目标类别，我们引入了FurFun-233，这是一个包含233对非功能性和功能化家具模型的数据集。在PartNet-Mobility（“零样本”）和HSSD测试集上，我们的方法在运动预测准确性上与最先进的方法相匹配，同时在碰撞和连通性方面显著提高了功能性。

cs.CV / 227 / 2605.18012

SAS: Semantic-aware Sampling for Generative Dataset Distillation

SAS：面向语义的生成数据集蒸馏采样

Li, Mingzhuo, Li, Guang, Ye, Linfeng, Mao, Jiafeng, Ogawa, Takahiro, Plataniotis, Konstantinos N., Haseyama, Miki

Abstract

Deep neural networks have achieved impressive performance across a wide range of tasks, but this success often comes with substantial computational and storage costs due to large-scale training data. Dataset distillation addresses this challenge by constructing compact yet informative datasets that enable efficient model training while maintaining downstream performance. However, most existing approaches primarily emphasize matching data distributions or downstream training statistics, with limited attention to preserving high-level semantic information in the distilled data. In this work, we introduce a semantic-aware perspective for dataset distillation by leveraging Contrastive Language-Image Pretraining (CLIP) as a semantic prior for post-sampling. Our goal is to obtain distilled datasets that are not only compact but also semantically class-discriminative and diverse. To this end, we design three semantic scoring functions that quantify class relevance, inter-class separability, and intra-set diversity in a pretrained semantic space. Based on image pools generated by existing distillation methods, we further develop a two-stage strategy for effective sampling: the first stage filters semantically discriminative samples to form a reliable candidate set, and the second stage performs a dynamic diversity-aware selection to reduce redundancy while preserving semantic coverage. Extensive experiments across multiple datasets, image pools, and downstream models demonstrate consistent performance gains, highlighting the effectiveness of incorporating semantic information into dataset distillation.
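The two-stage strategy described above can be sketched on toy embeddings. Everything concrete here is an assumption: the scoring function (class relevance minus the best rival-class similarity), the greedy diversity rule, and all names are illustrative stand-ins for the three semantic scoring functions the abstract mentions, and the vectors stand in for CLIP image and text embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_samples(image_embs, class_embs, target, keep, budget):
    """Return indices of `budget` images chosen for class `target`."""
    def discriminative_score(emb):
        relevance = cosine(emb, class_embs[target])
        rival = max(cosine(emb, c) for i, c in enumerate(class_embs) if i != target)
        return relevance - rival

    # Stage 1: keep the `keep` most class-discriminative images as candidates.
    ranked = sorted(range(len(image_embs)),
                    key=lambda i: discriminative_score(image_embs[i]), reverse=True)
    candidates = ranked[:keep]

    # Stage 2: greedily add the candidate least similar to what is already chosen.
    chosen = [candidates[0]]
    while len(chosen) < min(budget, len(candidates)):
        remaining = [i for i in candidates if i not in chosen]
        chosen.append(min(
            remaining,
            key=lambda i: max(cosine(image_embs[i], image_embs[j]) for j in chosen)))
    return chosen

class_embs = [[1.0, 0.0], [0.0, 1.0]]
image_embs = [[1.0, 0.0], [0.99, 0.1], [0.7, 0.3], [0.1, 1.0]]
picked = select_samples(image_embs, class_embs, target=0, keep=3, budget=2)
```

In this toy pool, stage 1 discards the image that belongs to the other class, and stage 2 skips the near-duplicate of the first pick in favor of a more distinct candidate.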

Chinese Translation

深度神经网络在广泛的任务中取得了令人印象深刻的性能，但这种成功往往伴随着由于大规模训练数据所带来的巨大的计算和存储成本。数据集蒸馏通过构建紧凑而又信息丰富的数据集来应对这一挑战，从而实现高效的模型训练，同时保持下游性能。然而，现有的大多数方法主要强调匹配数据分布或下游训练统计，而对保留蒸馏数据中的高层语义信息关注较少。在本研究中，我们引入了一种面向语义的数据集蒸馏视角，利用对比语言-图像预训练（Contrastive Language-Image Pretraining，CLIP）作为后采样的语义先验。我们的目标是获得不仅紧凑而且在语义上具有类别区分性和多样性的蒸馏数据集。为此，我们设计了三个语义评分函数，以量化类别相关性、类间可分性和类内多样性，这些函数在预训练的语义空间中进行评估。基于现有蒸馏方法生成的图像池，我们进一步开发了一种有效采样的两阶段策略：第一阶段筛选出语义上具有区分性的样本，以形成可靠的候选集，第二阶段则进行动态的多样性感知选择，以减少冗余，同时保留语义覆盖。针对多个数据集、图像池和下游模型的广泛实验表明，性能一致提升，突显了将语义信息纳入数据集蒸馏的有效性。

cs.CV / 228 / 2605.18013

TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model

TinySAM 2：高效的任何跟踪模型的极限内存压缩

Ding, Zhaoyuan, Yang, Yijing, Shu, Han, Chen, Xinghao

Abstract

Segment Anything Model 2 (SAM 2) serves as a core foundation model in the field of video segmentation. Building upon the original SAM model, it introduces a memory bank mechanism and demonstrates outstanding performance in tasks such as semi-supervised video object segmentation and tracking anything. However, the computational complexity of SAM 2's multi-stage image encoder and memory module has raised the barrier to deploying the model in practical applications. To address this issue, we propose TinySAM 2, a lightweight video segmentation model that balances performance and efficiency. First, a memory quality management mechanism is introduced to select and retain highly informative historical frames as the memory. In addition, a joint spatial-temporal token compression scheme is proposed to reduce memory storage and computational cost. Specifically, average pooling is first employed to compress redundant tokens in the spatial domain. In the temporal domain, informative tokens are selected across frames in the memory bank based on a token-level similarity measurement. We also adopt RepViT as the lightweight image encoder, which further reduces the model parameters. Extensive experiments on challenging datasets such as DAVIS and SA-V demonstrate that TinySAM 2 achieves 90% of the performance of SAM 2.1 with only 7% of the memory tokens and 3% of the training data. This study effectively alleviates the bottlenecks in parameter count, computational load, and deployment cost associated with SAM 2, providing a resource-efficient solution for the widespread application of video segmentation models on devices.
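The two compression steps named in the abstract, spatial average pooling followed by similarity-based selection across frames, can be sketched as follows. The 2x2 pooling window, the cosine threshold, and the rule of keeping a token only when it is dissimilar to every token already kept are assumptions for illustration. The abstract does not state how the similarity measurement is turned into a selection.

```python
import math

def avg_pool_2x2(grid):
    """Average each 2x2 block of token vectors; grid[r][c] is a vector, H and W are even."""
    pooled = []
    for r in range(0, len(grid), 2):
        row = []
        for c in range(0, len(grid[0]), 2):
            block = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            row.append([sum(values) / 4.0 for values in zip(*block)])
        pooled.append(row)
    return pooled

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_across_frames(frames_tokens, max_sim=0.95):
    """Walk frames in order and keep a token only if no kept token is too similar to it."""
    kept = []
    for tokens in frames_tokens:
        for token in tokens:
            if all(cosine(token, k) < max_sim for k in kept):
                kept.append(token)
    return kept

pooled = avg_pool_2x2([[[1.0, 0.0], [3.0, 0.0]],
                       [[1.0, 2.0], [3.0, 2.0]]])
memory = select_across_frames([[[1.0, 0.0], [0.0, 1.0]],
                               [[1.0, 0.01], [-1.0, 0.0]]])
```

Pooling shrinks each frame by a fixed factor, while the cross-frame step removes tokens that repeat information already stored, so the two reductions compound.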

Chinese Translation

分割任何模型 2（Segment Anything Model 2，SAM 2）作为视频分割领域的核心基础模型。在原始 SAM 模型的基础上，它引入了一种内存库机制，并在半监督视频对象分割和任何跟踪等任务中表现出色。然而，SAM 2 的多阶段图像编码器和内存模块的复杂计算特性提高了该模型在实际应用中的部署门槛。为了解决这个问题，我们提出了 TinySAM 2，这是一种轻量级视频分割模型，平衡了性能和效率。首先，引入了一种内存质量管理机制，以选择和保留高信息量的历史帧作为内存。此外，提出了一种联合时空令牌压缩方法，减少了内存存储和计算成本。具体而言，首先在空间域中采用平均池化来压缩冗余令牌。在时间域中，根据令牌级相似性测量，从内存库中选择信息量丰富的令牌。此外，我们采用 RepViT 作为轻量级图像编码器，进一步减少模型参数。在 DAVIS 和 SA-V 等挑战性数据集上进行的大量实验表明，TinySAM 2 实现了 SAM 2.1 性能的 90%，仅使用 7% 的内存令牌和 3% 的训练数据。本研究有效缓解了与 SAM 2 相关的参数数量、计算负载和部署成本的瓶颈，为视频分割模型在设备上的广泛应用提供了一种资源高效的解决方案。

cs.CV / 229 / 2605.18018

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

看我所指：对齐视觉和语言表征以实现视频细粒度对象理解

Sun, Boyuan, Yin, Bowen, Li, Yuanming, Wei, Xihan, Hou, Qibin

Abstract

We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.

Chinese Translation

我们提出了SWIM（看我所指），这是一种新颖的训练策略，旨在对齐视觉和语言表征，从而仅通过文本提示实现细粒度对象理解。与现有方法需要显式的视觉提示（如掩码或点）不同，SWIM在训练期间仅利用掩码监督来引导跨模态注意力，使模型在推理时能够自动关注用户指定的对象。我们对预训练的多模态大型语言模型（MLLMs）的跨注意力分析揭示了一个系统性差异：属性词在视觉模态中产生尖锐、局部化的激活，而对象名词则由于语义参考偏差和分布式高层表征而产生弥散和分散的模式。为了解决这种不对齐问题，我们构建了NL-Refer，这是一个丰富的数据集，其中每个对象掩码都与精确的自然语言指称表达配对。SWIM从对象名词中提取多层跨注意力图，并强制与真实掩码保持空间一致性。实验结果表明，SWIM显著改善了文本与视觉的对齐，并在细粒度对象理解基准测试中优于基于视觉提示的方法。代码和数据可在 https://github.com/HumanMLLM/SWIM 获取。

cs.CV / 230 / 2605.18023

DSAA: Dual-Stage Attribute Activation for Fine-grained Open Vocabulary Detection

DSAA：双阶段属性激活用于细粒度开放词汇检测

Jiang, Donghong, Lin, Endian, Liu, Hanqing, Liu, Mingjie, Cui, Luoping, Yang, Zhao, Zhu, Chuang

Abstract

Open-Vocabulary Object Detection (OVD) models break the limitations of closed-set detection, enabling the identification of unseen categories through natural language prompts. However, they exhibit notable limitations in fine-grained detection tasks involving attributes like color, material, and texture. We attribute this performance bottleneck in OVD models to a core issue: when category signals dominate, OVD models tend to marginalize attribute information during inference. This leads to incorrect binding between attributes and target objects. To address this, we propose the Dual-Stage Attribute Activation (DSAA) framework, which enhances fine-grained detection capabilities by strengthening attribute semantics at two critical stages. In the text embedding stage, we employ an Attribute Prefix Adapter (APA) module to generate attribute prefixes that inject explicit attribute priors. To further amplify the influence of these attributes, our Key/Value (K/V) Modulator module then intervenes during the BERT encoding phase, selectively enhancing the Key and Value vectors of the corresponding attribute tokens. In addition, we introduce an attribute-aware contrastive loss to improve discrimination among same-category instances with different attributes during training. Experimental results on the FG-OVD benchmark demonstrate the effectiveness of our method across various mainstream open-vocabulary models.

Chinese Translation

开放词汇物体检测（OVD）模型突破了封闭集检测的局限性，通过自然语言提示实现对未见类别的识别。然而，在涉及颜色、材料和纹理等属性的细粒度检测任务中，它们表现出显著的局限性。我们将OVD模型在性能上的瓶颈归因于一个核心问题：当类别信号占主导时，OVD模型在推理过程中往往会边缘化属性信息。这导致属性与目标物体之间的错误绑定。为了解决这一问题，我们提出了双阶段属性激活（DSAA）框架，通过在两个关键阶段增强属性语义来提升细粒度检测能力。在文本嵌入阶段，我们采用属性前缀适配器（Attribute Prefix Adapter, APA）模块生成属性前缀，以注入显式的属性先验。为了进一步放大这些属性的影响，我们的键/值调制器（Key/Value Modulator, K/V Modulator）模块在BERT编码阶段进行干预，选择性地增强相应属性标记的键和值向量。此外，我们引入了一种属性感知对比损失，以提高训练过程中不同属性的同类实例之间的区分度。在FG-OVD基准测试上的实验结果证明了我们方法在各种主流开放词汇模型中的有效性。

cs.CV / 231 / 2605.18029

What Matters for Grocery Product Retrieval with Open Source Vision Language Models

开源视觉语言模型在杂货产品检索中的关键因素

Maminta, Emmanuel G., Atienza, Rowel O.

Abstract

Multimodal product retrieval (MPR) underpins checkout-free retail and automated inventory systems, yet it demands fine-grained SKU discrimination that standard vision-language benchmarks fail to capture. We present the first systematic zero-shot evaluation of 190 open-source VLMs on the MPR task of the GroceryVision Challenge, isolating pre-training data, architecture, and input resolution. Our analysis yields three actionable findings. (1) Data quality trumps scale. Switching from raw web-scrapes to filtered datasets delivers up to 16.6% accuracy gains, exceeding the benefit of doubling model parameters. (2) Efficient models can win. MobileCLIP-B (150M parameters) outperforms 351M counterparts trained on noisy data. We introduce semantic power density ($\phi$), an efficiency metric that penalizes sub-threshold accuracy. (3) A precision gap persists. State-of-the-art models achieve 94.5% Recall@5 but suffer a 17.5% drop at Recall@1, revealing that contrastive embeddings cluster categories effectively but fail to rank visually similar SKUs. Code and evaluation scripts are available at https://github.com/upeee/openmpr.
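The gap between Recall@5 and Recall@1 reported above depends on how Recall@k is computed. A minimal sketch follows, assuming each query has exactly one correct SKU and that the retriever returns a ranked list of candidate IDs. The function name and the toy data are illustrative.

```python
def recall_at_k(ranked_lists, ground_truth, k):
    """Fraction of queries whose correct item appears in the top-k results."""
    hits = sum(1 for ranked, truth in zip(ranked_lists, ground_truth)
               if truth in ranked[:k])
    return hits / len(ground_truth)

# Three queries, all with correct SKU "a"; only the first ranks it on top.
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truth = ["a", "a", "a"]
r1 = recall_at_k(ranked, truth, 1)
r3 = recall_at_k(ranked, truth, 3)
```

A model can score well at larger k by placing the right product somewhere among visually similar neighbors while still losing at k = 1, which is the ranking failure the abstract describes.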

Chinese Translation

多模态产品检索（MPR）是无结账零售和自动化库存系统的基础，但它需要细粒度的SKU区分，而标准的视觉-语言基准无法捕捉到这一点。我们首次对190个开源视觉语言模型（VLMs）在杂货视觉挑战（GroceryVision Challenge）中的MPR任务进行了系统的零样本评估，重点分析了预训练数据、架构和输入分辨率。我们的分析得出了三项可行的发现。（1）数据质量胜过规模。从原始网络抓取数据切换到过滤后的数据集可提高多达16.6%的准确率，超过了将模型参数翻倍所带来的好处。（2）高效模型可以胜出。MobileCLIP-B（150M参数）优于在噪声数据上训练的351M同类模型。我们引入了语义功率密度（$\phi$），这是一种惩罚低于阈值准确率的效率指标。（3）精度差距依然存在。尖端模型在Recall@5上达到了94.5%，但在Recall@1上下降了17.5%，这表明对比嵌入有效地聚类了类别，但未能对视觉上相似的SKU进行有效排序。代码和评估脚本可在 https://github.com/upeee/openmpr 获取。

cs.CV / 232 / 2605.18038

Patch Ensembles for Robust Salmon Re-Identification with Weak Trajectory Labels

基于补丁集的鲁棒性鲑鱼重新识别方法与弱轨迹标签

Høgstedt, Espen Uri, Schellewald, Christian, Stahl, Annette, Mester, Rudolf

Abstract

Salmon re-identification in commercial net-pens is challenging due to large populations, which impose strict accuracy requirements and make large-scale labeled data acquisition infeasible. Trajectory IDs can be used as proxy labels, but this introduces trajectory-ID bias. To address these challenges, we propose a patch-based re-identification framework that fuses patch-level predictions into a salmon identity decision. A key component is the prediction of the salmon's lateral line, enabling extraction of texture-anchored patches and patch slices. To enable realistic evaluation, we introduce an experimental setup using multiple cameras placed 6 m apart, allowing the same fish to be recorded in different trajectories. This enables the construction of a cross-camera test set through manual match confirmation. Our ensemble approach outperforms the full-image baseline in same-trajectory validation (0.932 to 0.965 mAP) and cross-camera testing (0.609 to 0.860 mAP). The substantial improvements in the cross-camera setting demonstrate improved generalizability and robustness. Code and data: https://github.com/espenbh/salmon-reid-patch-ensemble.

Chinese Translation

在商业网箱中，鲑鱼重新识别面临挑战，因为庞大的鱼群对准确性提出了严格要求，并且使得大规模标注数据的获取变得不可行。轨迹ID可以作为代理标签使用，但这引入了轨迹ID偏差。为了解决这些挑战，我们提出了一种基于补丁的重新识别框架，将补丁级别的预测融合为鲑鱼身份决策。一个关键组件是对鲑鱼侧线的预测，这使得能够提取基于纹理的补丁和补丁切片。为了实现现实的评估，我们引入了一种实验设置，使用多个相距6米的摄像头，允许同一条鱼在不同轨迹中被记录。这使得通过人工匹配确认构建跨摄像头测试集成为可能。我们的集成方法在同轨迹验证中优于全图基线（mAP从0.932提高到0.965），在跨摄像头测试中（mAP从0.609提高到0.860）。在跨摄像头设置中显著的改进展示了更好的泛化能力和鲁棒性。代码和数据：https://github.com/espenbh/salmon-reid-patch-ensemble。

cs.CV / 233 / 2605.18039

SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals

SGSoft：通过模板引导的软信号学习融合语义-几何特征以实现3D形状对应

Yoon, Soyeon, Seo, Chang Wook, Shim, Hyunjung

Abstract

Learning dense correspondences across deformable 3D shapes remains a long-standing challenge due to structural variability, non-isometric deformation, and inconsistent topology. Existing methods typically trade off generalization, geometric fidelity, and efficiency. We address this by proposing SGSoft, a unified intrinsic pipeline that (i) constructs a geodesic correspondence field on a canonical template, (ii) learns multimodal dense descriptors guided by pretrained semantic priors under the supervision of this geodesic correspondence field, and (iii) retrieves dense correspondences in a single feed-forward pass via nearest-neighbor search in descriptor space. This formulation enables stable and topology-invariant supervision under large pose variation, structural differences, and remeshing. SGSoft achieves state-of-the-art inter-category generalization while offering the best accuracy-efficiency trade-off among prior methods. It also achieves near real-time inference without pre-alignment, pairwise optimization, or post-refinement. Learned descriptors can be transferred effectively to downstream tasks such as semantic segmentation and deformation transfer, establishing a scalable and deployment-ready paradigm for dense 3D correspondence.
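The final retrieval step, nearest-neighbor search in descriptor space, can be sketched with a brute-force search over per-vertex descriptors. The function name and the toy two-dimensional descriptors are assumptions, and a real pipeline would likely use an accelerated index rather than this quadratic loop.

```python
def match_vertices(source_desc, target_desc):
    """For each source vertex, return the index of the closest target descriptor."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(target_desc)),
            key=lambda j: squared_distance(desc, target_desc[j]))
        for desc in source_desc
    ]

# Two source vertices matched against two target vertices.
correspondence = match_vertices([[0.0, 0.0], [1.0, 1.0]],
                                [[1.0, 0.9], [0.1, 0.0]])
```

Since matching only compares descriptors, it needs no pre-alignment or pairwise optimization between the two shapes, consistent with the single feed-forward pass described above.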

Chinese Translation

在可变形3D形状之间学习稠密对应仍然是一个长期存在的挑战，原因在于结构变异、非等距变形和拓扑不一致。现有方法通常在泛化能力、几何保真度和效率之间进行权衡。我们通过提出SGSoft来解决这一问题，SGSoft是一个统一的内在管道，(i) 在标准模板上构建测地对应场，(ii) 在该测地对应场的监督下，学习由预训练语义先验引导的多模态稠密描述符，(iii) 通过在描述符空间中的最近邻搜索，在单次前馈传递中检索稠密对应。这一构造使得在大姿态变化、结构差异和重网格化的情况下，能够实现稳定且拓扑不变的监督。SGSoft在类别间泛化方面达到了最先进的水平，同时在现有方法中提供了最佳的准确性与效率的权衡。它还实现了近实时推理，无需预对齐、成对优化或后期精炼。学习到的描述符可以有效地转移到下游任务，如语义分割和变形转移，建立了一种可扩展且适合部署的稠密3D对应范式。

cs.CV / 234 / 2605.18041

OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models

OmniSelect：用于高效全模态大型语言模型的动态模态感知令牌压缩

Yang, Morunliu, Xu, Ruotao, Li, Le, Wang, Yue, Zhang, Jianxin, Li, Juntao, Lou, Yihang, Feng, Siwei, Li, Peifeng

Abstract

Omnimodal large language models (OmniLLMs) have recently gained increasing attention for unified audio-video understanding. However, processing long multimodal token sequences introduces substantial computational overhead, making efficient token compression crucial. Existing methods typically rely on fixed, modality-specific guidance, which fails to account for the varying importance of modalities across different queries. To address this limitation, we propose OmniSelect, a training-free, modality-adaptive token pruning framework that dynamically selects appropriate compression strategies for multimodal inputs. Specifically, we leverage a lightweight AudioCLIP model to estimate cross-modal relevance and categorize each input into three pruning regimes: Audio-Centric, Video-Centric, and Uniform pruning. Based on these relevance scores, OmniSelect further performs fine-grained token pruning within each temporal group, adaptively allocating pruning ratios to preserve informative tokens across modalities. By explicitly modeling modality preference and enabling dynamic strategy selection, OmniSelect effectively avoids the pitfalls of one-size-fits-all compression. Extensive experiments demonstrate that our method achieves efficient multimodal token reduction while maintaining strong performance, without requiring any additional training.
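The regime selection described above can be illustrated with a simple margin rule on two relevance scores. The margin, the 75/25 split, and the function names are assumptions. The abstract states only that relevance scores drive a three-way categorization and an adaptive allocation of pruning ratios.

```python
def choose_regime(audio_relevance, video_relevance, margin=0.1):
    """Pick a pruning regime from query-to-modality relevance scores."""
    if audio_relevance - video_relevance > margin:
        return "audio-centric"
    if video_relevance - audio_relevance > margin:
        return "video-centric"
    return "uniform"

def keep_budgets(regime, total_keep, favored_share=0.75):
    """Split the number of tokens to keep between the two modalities."""
    if regime == "uniform":
        audio = total_keep // 2
    elif regime == "audio-centric":
        audio = round(total_keep * favored_share)
    else:  # video-centric: the favored share goes to video instead
        audio = total_keep - round(total_keep * favored_share)
    return {"audio": audio, "video": total_keep - audio}

regime = choose_regime(audio_relevance=0.3, video_relevance=0.8)
budgets = keep_budgets(regime, total_keep=100)
```

A query about what is shown on screen would land in the video-centric regime here and keep most of its budget for video tokens, whereas a fixed per-modality ratio would prune both streams identically.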

Chinese Translation

全模态大型语言模型（OmniLLMs）近年来因其在统一音视频理解方面的潜力而受到越来越多的关注。然而，处理长的多模态令牌序列会引入大量的计算开销，因此高效的令牌压缩显得尤为重要。现有方法通常依赖于固定的、特定模态的指导，这未能考虑不同查询中模态的重要性差异。为了解决这一局限性，我们提出了OmniSelect，这是一种无训练的模态自适应令牌修剪框架，能够动态选择适合多模态输入的压缩策略。具体而言，我们利用轻量级的AudioCLIP模型来估计跨模态相关性，并将每个输入分类为三种修剪模式：以音频为中心的修剪、以视频为中心的修剪和均匀修剪。基于这些相关性得分，OmniSelect进一步在每个时间组内进行细粒度的令牌修剪，自适应地分配修剪比例，以保留跨模态的信息性令牌。通过明确建模模态偏好并实现动态策略选择，OmniSelect有效避免了一刀切的压缩陷阱。大量实验表明，我们的方法在保持强大性能的同时，实现了高效的多模态令牌减少，且无需任何额外的训练。

cs.CV / 235 / 2605.18052

Efficient 3D Content Reconstruction and Generation

高效的三维内容重建与生成

Li, Jiahao

Abstract

Automatic 3D content creation seeks to replace labor-intensive modeling and scanning pipelines with systems that can synthesize or recover 3D assets directly from text or images. Its applications span video games, virtual reality, robotics, and simulation, enabling rapid asset prototyping, diverse interactive world generation, and efficient 3D data collection for training foundation models. Contemporary solutions largely follow two complementary paradigms: (i) text- or image-to-3D generation, which learns priors over 3D geometry and appearance to create novel assets from natural language or a single-view image; and (ii) 3D reconstruction, which estimates camera poses and geometry from RGB images. This thesis advances both directions. On the generation side, I introduce Instant3D, which combines multi-view diffusion with feed-forward sparse-view 3D reconstruction to produce high-quality assets in 5-20 seconds. On the reconstruction side, I develop FastMap, a structure-from-motion pipeline that achieves up to 10x speedup over the prior state of the art through extensive use of first-order optimization with fused GPU kernels, while maintaining comparable pose accuracy and downstream novel view synthesis quality.

Chinese Translation

自动化三维内容创作旨在用能够直接从文本或图像合成或恢复三维资产的系统替代劳动密集型建模和扫描流程。其应用涵盖视频游戏、虚拟现实、机器人技术和仿真，能够快速原型制作资产、多样化互动世界生成，以及高效收集用于训练基础模型的三维数据。当代解决方案主要遵循两种互补的范式：（i）文本或图像到三维生成，该方法通过学习三维几何和外观的先验知识，从自然语言或单视图图像中创建新颖的资产；（ii）三维重建，该方法从RGB图像中估计相机姿态和几何形状。本论文在这两个方向上都有所推进。在生成方面，我介绍了Instant3D，它结合了多视图扩散与前馈稀疏视图三维重建，在5到20秒内生成高质量资产。在重建方面，我开发了FastMap，这是一种运动结构重建流程，通过广泛使用一阶优化与融合的GPU内核，实现了比先前最先进技术快10倍的速度，同时保持了相似的姿态精度和下游新视图合成质量。

cs.CV / 236 / 2605.18058

Threats to Arabic Handwriting Recognition: Investigating Black-Box Adversarial Attacks on embedded ConvNet models

阿拉伯手写识别的威胁：对嵌入式卷积网络模型进行黑箱对抗攻击的研究

Khayati, Mohsine EL, Semma, Abdelillah, Courr, Abdelaziz, Elouahbi, Rachid

Abstract

Arabic handwriting recognition (AHR) has made significant progress with deep learning models. AHR research has largely focused on performance, with security receiving little attention. This study opens what appears to be a new line of inquiry by demonstrating the vulnerability of high-performing models to adversarial black-box attacks. The focus on black-box attacks reflects real-world scenarios in which the attacker has no prior knowledge of the model architecture. Extensive experiments were conducted on two benchmark AHR datasets containing Arabic handwritten characters. Results demonstrated the effectiveness of the attacks, with the Pixle attack achieving an attack success rate of 99-100% on most models. Other, less aggressive attacks achieved success rates of 50-96% across most experiments. Despite these high success rates, the attacks maintain the structural integrity of the characters, rendering the perturbations almost imperceptible to the human eye. The findings indicate that the studied models are highly vulnerable to adversarial manipulation. This underscores the need to strengthen efforts to secure these models and ensure their reliability in real-world AHR applications.

Chinese Translation

阿拉伯手写识别（AHR）在深度学习模型的推动下取得了显著进展。AHR研究主要集中在性能上，而对安全性的关注则相对较少。本研究通过展示高性能模型对对抗性黑箱攻击的脆弱性，提供了一条新的研究思路。对黑箱攻击的关注反映了现实世界中的场景，攻击者对模型架构没有先前的知识。我们在两个包含阿拉伯手写字符的基准AHR数据集上进行了广泛的实验。结果表明，这些攻击的有效性，其中Pixle攻击在大多数模型上达到了99-100%的攻击成功率。其他一些攻击的成功率在大多数实验中达到了50-96%。尽管攻击成功率较高，但这些攻击保持了字符的结构完整性，使其对人眼几乎不可察觉。研究结果表明，所研究模型对对抗性操控的脆弱性更高。这强调了加强对这些模型的安全性保障和确保其在AHR实际应用中可靠性的必要性。

cs.CV / 237 / 2605.18060

Embedded ConvNet Ensembles: A Lightweight Approach to Recognize Arabic Handwritten Characters

嵌入式卷积网络集成：一种轻量级的阿拉伯手写字符识别方法

Khayati, Mohsine El, Elouahbi, Rachid, Semma, Abdelillah

Abstract

Arabic Handwritten Character Recognition (AHCR) has recently advanced significantly with deep Convolutional Neural Networks (ConvNets). However, many models in the literature are deep and computationally expensive in terms of parameters and FLOPs, limiting their deployment on resource-constrained devices, which are increasingly common. This study addresses this gap by proposing a combination of lightweight embedded ConvNet models and ensemble learning techniques. Extensive experiments were conducted to identify best practices in AHCR, considering training hyperparameters, learning strategies, model choices, and ensemble methods. Results show that embedded models can achieve accuracy comparable to, or even surpassing, heavier architectures. Ensemble learning further enhances performance with only modest computational overhead, particularly under challenging training scenarios. Among the ensembling strategies, soft voting yielded the best overall results.
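Soft voting, the ensembling strategy reported above as the best overall, averages the class probabilities of the member models before taking the argmax, whereas hard voting counts each model's top choice. The sketch below contrasts the two on made-up probabilities. The function names and numbers are illustrative and not taken from the paper.

```python
from collections import Counter

def soft_vote(prob_lists):
    """Average per-class probabilities across models, then pick the best class."""
    n_models = len(prob_lists)
    averaged = [sum(column) / n_models for column in zip(*prob_lists)]
    return max(range(len(averaged)), key=averaged.__getitem__)

def hard_vote(prob_lists):
    """Let each model vote for its top class and return the majority class."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in prob_lists]
    return Counter(votes).most_common(1)[0][0]

# One confident model for class 0, two weakly leaning models for class 1.
probs = [[0.9, 0.1], [0.4, 0.6], [0.4, 0.6]]
soft_choice = soft_vote(probs)
hard_choice = hard_vote(probs)
```

The two rules disagree on this input because soft voting lets a confident model outweigh two uncertain ones, and the extra cost over a single model is only one forward pass per ensemble member plus an average.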

Chinese Translation

阿拉伯手写字符识别（AHCR）近年来在深度卷积神经网络（ConvNets）的推动下取得了显著进展。然而，文献中的许多模型都较为深且在参数和浮点运算（FLOPs）方面计算开销较大，这限制了它们在资源受限设备上的应用，而这类设备日益普遍。本研究通过提出轻量级嵌入式ConvNet模型与集成学习技术的结合，来填补这一空白。我们进行了广泛的实验，以识别AHCR中的最佳实践，考虑了训练超参数、学习策略、模型选择和集成方法。结果表明，嵌入式模型能够实现与更重架构相当甚至更高的准确率。集成学习在计算开销仅为适度的情况下进一步提升了性能，尤其是在具有挑战性的训练场景中。在所有集成策略中，软投票法获得了最佳的整体结果。

cs.CV / 238 / 2605.18063

The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting

MixCount 数据集：弥补开放词汇物体计数的数据缺口

Dumery, Corentin, Amini-Naieni, Niki, Naini, Shervin, Fua, Pascal

Abstract

Object counting is a foundational vision task with over a decade of dedicated research, yet state-of-the-art models still fail systematically in the mixed-object setting that dominates real-world applications such as industrial inspection and product sorting. We show that this gap is strongly driven by limitations in existing training and evaluation data: real counting datasets are prohibitively expensive to annotate and suffer from labeling noise, while existing synthetic alternatives lack diversity and realism. We address this with MixCount, a dataset and benchmark for mixed-object counting designed to target the failure modes of current counting models. To overcome the high cost of constructing and labeling such data, we develop an automatic generation pipeline that synthesizes images, fine-grained textual descriptions, and pixel-perfect counting annotations at scale, eliminating the labeling ambiguity that plagues prior datasets. Evaluating state-of-the-art counting models on MixCount exposes severe degradation in the mixed-object setting. More importantly, training these models on our synthesized data yields substantial gains on real-world benchmarks, reducing MAE by 20.14% on FSC-147 and by 18.3% on PairTally. These results establish MixCount as both a benchmark and a training dataset for fine-grained counting, and demonstrate that our pipeline, which produces effectively unlimited labeled data, helps address a long-standing bottleneck in counting models.

Chinese Translation

物体计数是一项基础的视觉任务，已有超过十年的专门研究，但最先进的模型在主导现实应用（如工业检测和产品分类）的混合物体环境中仍然系统性地失败。我们表明，这一差距主要源于现有训练和评估数据的局限性：真实计数数据集的标注成本高昂且存在标注噪声，而现有的合成替代品缺乏多样性和真实性。为此，我们提出了 MixCount，一个针对混合物体计数的数据集和基准，旨在解决当前计数模型的失效模式。为了克服构建和标注此类数据的高成本，我们开发了一种自动生成管道，能够大规模合成图像、细粒度文本描述和像素完美的计数标注，消除了困扰先前数据集的标注模糊性。在 MixCount 上评估最先进的计数模型暴露了混合物体环境中的严重性能下降。更重要的是，在我们的合成数据上训练这些模型在真实世界基准上取得了显著提升，FSC-147 上的平均绝对误差（MAE）降低了 20.14%，PairTally 上降低了 18.3%。这些结果确立了 MixCount 作为细粒度计数的基准和训练数据集，并证明我们的管道能够有效产生无限的标注数据，有助于解决计数模型中的长期瓶颈问题。

cs.CV / 239 / 2605.18101

SENSE: Satellite-based ENergy Synthesis for Sustainable Environment

SENSE：基于卫星的能源合成以实现可持续环境

Sun, Kailai, He, Mingyi, Huang, Heye, Rong, Can, Prakash, Alok, Guo, Baoshen, Wang, Shenhao, Zhao, Jinhua

Abstract

Urban Building Energy Modeling (UBEM) plays a critical role in achieving the United Nations' Sustainable Development Goals 7 and 11. Although existing studies based on satellite imagery and deep learning have achieved remarkable progress, three challenges remain: first, most existing studies are inherently predictive and fail to reflect the generative nature of urban planning; second, although generative AI and diffusion models have seen explosive growth in satellite imagery, they do not generate urban functional layers (e.g., an energy layer); third, high-quality, high-resolution building energy data aligned with satellite imagery is scarce. Here we propose SENSE (Satellite-based ENergy Synthesis for Sustainable Environment), a unified generative UBEM framework that jointly synthesizes realistic urban satellite imagery and aligned high-quality building energy consumption and height maps. Conditioned on road networks and urban density metrics, SENSE builds on a controllable diffusion model and leverages the knowledge learned by large vision models to generate urban building energy consumption and height information (annotations) in the latent space. Experiments across four cities (New York City, Boston, Lyon, Busan) demonstrate that SENSE achieves high visual fidelity and strong physical consistency, satisfying the ASHRAE standard metric. They also show that SENSE can generate enough annotated synthetic data using less than 20% labeled energy data, boosting downstream prediction performance by 10% IoU. Compared to SOTA urban energy prediction methods, SENSE significantly reduces prediction error (by 3%-11% NMBE and 1%-9% CVRMSE). This study offers an energy-efficient urban planning and physical generation solution for urban science, energy science, and building science. The dataset and code: https://huggingface.co/datasets/skl24/MUSE and https://github.com/kailaisun/GenAI4Urban-Energy/.
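NMBE and CVRMSE, the two error metrics quoted above, are commonly defined as in ASHRAE Guideline 14. The sketch below follows that common form. The sign convention (measured minus predicted) and the degrees-of-freedom adjustment `p` are assumptions about which variant applies, since the abstract does not say.

```python
import math

def nmbe(measured, predicted, p=1):
    """Normalized mean bias error in percent."""
    n = len(measured)
    mean_measured = sum(measured) / n
    bias = sum(m - q for m, q in zip(measured, predicted))
    return 100.0 * bias / ((n - p) * mean_measured)

def cvrmse(measured, predicted, p=1):
    """Coefficient of variation of the root-mean-square error in percent."""
    n = len(measured)
    mean_measured = sum(measured) / n
    squared_error = sum((m - q) ** 2 for m, q in zip(measured, predicted))
    return 100.0 * math.sqrt(squared_error / (n - p)) / mean_measured

measured = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
bias_pct = nmbe(measured, predicted)
spread_pct = cvrmse(measured, predicted)
```

NMBE lets over- and under-predictions cancel, so it measures systematic bias, while CVRMSE penalizes every deviation. Reporting both, as the abstract does, guards against a model that looks unbiased only because its errors offset each other.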

Chinese Translation

城市建筑能源建模在实现联合国可持续发展目标7和11中发挥着至关重要的作用。尽管基于卫星图像和深度学习的现有研究已取得显著进展，但仍面临许多挑战：大多数现有研究本质上是预测性的，未能反映城市规划的生成特性；尽管生成性人工智能和扩散模型在卫星图像中取得了爆炸性增长，但它们缺乏城市功能生成（例如，能源层）；第三，与卫星图像对齐的高质量高分辨率建筑能源数据有限且稀缺。在此，我们提出了SENSE（基于卫星的能源合成以实现可持续环境），这是一个统一的生成性城市建筑能源建模（UBEM）框架，能够共同合成真实的城市卫星图像以及对齐的高质量建筑能源消耗和高度图。通过对道路网络和城市密度指标进行条件设置，SENSE基于可控的扩散模型，利用大型视觉模型学习到的知识，在潜在空间中生成城市建筑能源消耗和高度信息（注释）。在四个城市（纽约市、波士顿、里昂、釜山）的实验表明，SENSE实现了高视觉保真度和强物理一致性，符合ASHRAE标准指标。实验表明，SENSE能够使用不到20%的标记能源数据生成足够的注释合成数据，提升下游预测性能10%的IoU。与最先进的城市能源预测方法相比，SENSE显著降低了预测误差（减少3%-11%的NMBE和1%-9%的CVRMSE）。本研究为城市科学、能源科学和建筑科学提供了一种能源效率城市规划和物理生成解决方案。数据集和代码： https://huggingface.co/datasets/skl24/MUSE 和 https://github.com/kailaisun/GenAI4Urban-Energy/.
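The NMBE and CVRMSE figures quoted in this abstract can be illustrated with a minimal sketch. It assumes the commonly used definitions with an `n - p` denominator and `p = 1`; the abstract does not say which variant or sign convention the authors use, and the sample values are made up for illustration.

```python
import math

def nmbe(measured, simulated, p=1):
    """Normalized mean bias error in percent (assumed convention: measured - simulated)."""
    n = len(measured)
    mean_m = sum(measured) / n
    bias = sum(m - s for m, s in zip(measured, simulated))
    return 100.0 * bias / ((n - p) * mean_m)

def cvrmse(measured, simulated, p=1):
    """Coefficient of variation of the RMSE in percent."""
    n = len(measured)
    mean_m = sum(measured) / n
    squared_error = sum((m - s) ** 2 for m, s in zip(measured, simulated))
    return 100.0 * math.sqrt(squared_error / (n - p)) / mean_m

# Hypothetical energy readings, not taken from the paper.
measured = [100.0, 120.0, 80.0, 100.0]
simulated = [90.0, 120.0, 80.0, 100.0]

nmbe_value = nmbe(measured, simulated)
cvrmse_value = cvrmse(measured, simulated)
print(round(nmbe_value, 3), round(cvrmse_value, 3))
```

NMBE keeps the sign of the error, so over- and under-predictions can cancel; CVRMSE does not, which is why the two are usually reported together.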


cs.CV / 240 / 2605.18102

DanceHMR: Hand-Aware Whole-Body Human Mesh Recovery from Monocular Videos

DanceHMR：基于单目视频的手部感知全身人类网格恢复

Shen, Wenhao, Zhou, Ming, Zhang, Hengyuan, Bian, Siyuan, Xu, Youjiang, Lin, Xi

Abstract

Monocular video human mesh recovery is essential for digital humans, avatar animation, and embodied simulation, where both temporal stability and expressive whole-body motion are required. Existing video HMR methods produce coherent body motion but often overlook detailed hand articulation, while image-based whole-body methods recover SMPL-X meshes independently per frame, often leading to jittery and inaccurate hand motion. We present a temporally coherent whole-body HMR framework for challenging in-the-wild monocular videos. Our model unifies body context and part-specific hand observations through residual body-hand fusion, enabling stable body motion and detailed hand recovery within a single temporal architecture. We further introduce close-up-aware augmentation to improve robustness under upper-body framing. Experiments on whole-body and body-only benchmarks demonstrate improved hand reconstruction and competitive body accuracy. Our method also produces temporally stable and 2D-consistent SMPL-X motion in challenging real-world videos.

Chinese Translation

单目视频的人类网格恢复对于数字人类、虚拟形象动画和具身模拟至关重要，其中需要同时具备时间稳定性和表现力丰富的全身运动。现有的视频人类网格恢复（HMR）方法能够产生连贯的身体运动，但往往忽视了手部的细节关节运动，而基于图像的全身方法则独立地恢复每帧的SMPL-X网格，常常导致手部运动抖动和不准确。我们提出了一种针对具有挑战性的野外单目视频的时间一致性全身HMR框架。我们的模型通过残差身体-手融合统一了身体上下文和特定部位的手部观测，从而在单一时间架构内实现稳定的身体运动和详细的手部恢复。我们进一步引入了近距离感知增强，以提高在上半身框架下的鲁棒性。在全身和仅身体基准测试上的实验表明，我们的方法提升了手部重建效果，并取得了具有竞争力的身体准确性。我们的方法在具有挑战性的真实世界视频中也能产生时间稳定且2D一致的SMPL-X运动。


cs.CV / 241 / 2605.18115

WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

WinTok：通过可转移的令牌分解视觉理解与生成的双赢混合分词器

Guo, Yiwei, Zhuang, Shaobin, Huang, Zhipeng, Fu, Canmiao, Li, Chen, Lyu, Jing, Wang, Yali

Abstract

Building a unified visual tokenizer is essential for bridging the gap between visual understanding and generation. Yet existing approaches struggle with the inherent conflict between these tasks, as a single token space is forced to support both high-level semantic abstraction and low-level pixel reconstruction. We propose WinTok, a concise hybrid tokenizer that achieves a win-win performance by explicitly decoupling the two objectives. WinTok supplements pixel tokens with a set of learnable semantic tokens, effectively mitigating cross-task interference without incurring the computational overhead of dual tokenizers. To further enhance understanding capability, we introduce an asymmetric token distillation mechanism: the semantic tokens are guided by pretrained semantic embeddings from any visual foundation model, enabling them to inherit strong discriminative power while maintaining flexibility. Across 10 challenging benchmarks, WinTok delivers consistent improvements in reconstruction, understanding, and generation. Trained on only 50M open-source data, WinTok surpasses the strong baseline UniTok by 11.2% in classification accuracy and achieves a competitive reconstruction rFID of 0.41, despite using substantially less training data. Code is released at https://github.com/markywg/WinTok.

Chinese Translation

构建一个统一的视觉分词器对于弥合视觉理解与生成之间的差距至关重要。然而，现有的方法在这两项任务之间存在固有的冲突，因为单一的令牌空间被迫同时支持高层次的语义抽象和低层次的像素重建。我们提出了WinTok，这是一种简洁的混合分词器，通过明确解耦这两个目标实现双赢的性能。WinTok通过一组可学习的语义令牌来补充像素令牌，有效减轻了跨任务干扰，而无需承担双重分词器的计算开销。为了进一步增强理解能力，我们引入了一种不对称的令牌蒸馏机制：语义令牌由任何视觉基础模型的预训练语义嵌入指导，使其能够继承强大的区分能力，同时保持灵活性。在10个具有挑战性的基准测试中，WinTok在重建、理解和生成方面均实现了一致的改进。尽管使用的训练数据显著较少，WinTok在仅使用5000万开放数据的情况下，分类准确率超越了强基线UniTok 11.2%，并实现了竞争力的重建rFID为0.41。代码已发布在https://github.com/markywg/WinTok。


cs.CV / 242 / 2605.18130

Rad-VLSM: A Cross-Modal Framework with Semantics-Assisted Prompting for Medical Segmentation and Diagnosis

Rad-VLSM：一种基于语义辅助提示的跨模态框架用于医学分割与诊断

Zhang, Fengyi, Zeng, Xujie, Liu, Mohan, Wang, Zengyi, Jiang, Yalong

Abstract

Medical image segmentation is more clinically valuable when it supports diagnosis rather than merely producing lesion masks. However, diagnostically relevant lesion cues are often subtle and localized, while existing models may be distracted by background tissues, acoustic artifacts, and irrelevant visual correlations. To address this problem, we propose Rad-VLSM, a two-stage cross-modal framework for semantics-assisted lesion focusing, robust segmentation, and visually grounded diagnosis. In the first stage, a BLIP-2-based vision-language alignment module identifies lesion-related candidate regions under semantic guidance and converts them into box prompts. In the second stage, these prompts are fed into a SAM-based multitask network, where a multi-candidate region aggregation strategy improves prompt stability and guides lesion segmentation. The predicted masks are then used as spatial priors for diagnosis, and a visual-radiomics fusion head integrates lesion-aware visual features with selected radiomics descriptors. By using semantic information for localization rather than direct prediction, Rad-VLSM reduces text-to-diagnosis dependence and grounds diagnosis in lesion-level evidence. Experiments on a private clinical breast ultrasound dataset and public benchmarks show that Rad-VLSM achieves strong segmentation and diagnostic performance with favorable generalization.

Chinese Translation

医学图像分割在支持诊断时更具临床价值，而不仅仅是生成病灶掩膜。然而，诊断相关的病灶线索通常是微妙且局部的，而现有模型可能会受到背景组织、声学伪影和无关视觉相关性的干扰。为了解决这个问题，我们提出了Rad-VLSM，一种用于语义辅助病灶聚焦、稳健分割和视觉基础诊断的两阶段跨模态框架。在第一阶段，基于BLIP-2的视觉-语言对齐模块在语义指导下识别与病灶相关的候选区域，并将其转换为框提示。在第二阶段，这些提示被输入到基于SAM的多任务网络中，其中多候选区域聚合策略提高了提示的稳定性并指导病灶分割。然后，预测的掩膜被用作诊断的空间先验，而视觉-放射组学融合头则将关注病灶的视觉特征与选定的放射组学描述符整合在一起。通过使用语义信息进行定位而非直接预测，Rad-VLSM减少了文本到诊断的依赖，并将诊断建立在病灶级证据之上。在一个私有临床乳腺超声数据集和公共基准上的实验表明，Rad-VLSM在分割和诊断性能上表现出色，并具有良好的泛化能力。


cs.CV / 243 / 2605.18132

Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models

谁生成了这个3D资产？生成性3D模型的来源归属学习

Ma, Sihan, Liang, Siyuan, Tao, Dacheng

Abstract

Generative 3D models are deployed in gaming, robotics, and immersive creation, making source attribution critical: given a 3D asset, can we identify whether and which generative model created it? This problem faces two core challenges: dispersed attribution signals, where 3D fingerprints are distributed across multi-view, geometric, and frequency-domain cues; and realistic deployment constraints, where scarce labels, degraded prompts, and mixed real/synthetic assets undermine attribution reliability. To systematically study this problem, we construct, to the best of our knowledge, the first passive source attribution benchmark for modern generated assets, covering 22 representative 3D generators under standard, few-shot, and realistic deployment protocols. Based on this benchmark, we find that generative 3D models leave two types of stable fingerprints: cross-view inconsistency and structural artifacts reflected in geometric statistics and frequency-domain cues. To capture these dispersed signals, we propose a hierarchical multi-view multi-modal Transformer that fuses appearance, geometric, and frequency-domain features within each view and models global relationships across views. Extensive experiments demonstrate strong performance, achieving 97.22% accuracy under full supervision and 77.17% accuracy with only 1% training data, corresponding to fewer than five samples per generator. These results show that modern 3D generators leave stable and attributable fingerprints, establishing a new benchmark and methodological foundation for trustworthy 3D content provenance.

Chinese Translation

生成性3D模型广泛应用于游戏、机器人和沉浸式创作，因此来源归属变得至关重要：给定一个3D资产，我们能否识别出是哪个生成模型创建了它？这个问题面临两个核心挑战：分散的归属信号，即3D指纹分布在多视角、几何和频域线索中；以及现实部署限制，即稀缺的标签、降级的提示和混合的真实/合成资产削弱了归属的可靠性。为了系统地研究这个问题，我们构建了迄今为止我们所知的第一个现代生成资产的被动来源归属基准，涵盖22种代表性的3D生成器，采用标准、少样本和现实部署协议。基于这个基准，我们发现生成性3D模型留下了两种类型的稳定指纹：跨视角不一致性和反映在几何统计和频域线索中的结构伪影。为了捕捉这些分散的信号，我们提出了一种层次化的多视角多模态Transformer，它在每个视角内融合外观、几何和频域特征，并建模视角之间的全局关系。大量实验表明了强大的性能，在全监督下达到97.22%的准确率，而仅使用1%的训练数据时，准确率为77.17%，对应于每个生成器少于五个样本。这些结果表明现代3D生成器留下了稳定且可归属的指纹，为可信的3D内容来源建立了新的基准和方法论基础。


cs.CV / 244 / 2605.18137

Xiaomi EV World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving

小米电动车世界模型：一种集重建与生成于一体的自主驾驶联合世界模型

Zhou, Lijun, Luo, Hongcheng, Zhu, Zhenxin, Chi, Cheng, Tu, Mingfei, Xiong, Kaixin, Gong, Lei, Wu, Zhanqian, Zhang, Zehan, Li, Fangzhen, Li, Hao, Shen, Yingying, He, Jiale, Zhu, Haohui, Zhao, Shan, Wang, Kai, Zhan, Zhiwei, Pu, Yuechuan, Tan, Kaiyuan, Yang, Ruiling, Wang, Xianqi, Yan, Tianyi, Zhou, Jiawei, Zhang, Lei, Zhao, Jingyang, Zhou, Xi, Sun, Chitian, Wu, Chenming, Deng, Jiong, Xie, Hongwei, Lu, Ming, Ma, Kun, Chen, Long, Chen, Guang, Ye, Hangjun, Wang, Bing, Sun, Haiyang

Abstract

This report presents a unified technical system addressing the two core capabilities of world models for autonomous driving: world representation and world generation. For world representation, we propose WorldRec, a feed-forward reconstruction architecture driven by sparse scene queries. WorldRec initializes structured queries in 3D space, leveraging them to aggregate cross-view, cross-temporal features, thereby naturally enforcing spatial consistency across frames and yielding compact yet high-fidelity 3D Gaussian scene representations. For world generation, we propose WorldGen, a two-stage training framework of bidirectional pretraining followed by causal fine-tuning through three progressive stages (Teacher Forcing, ODE distillation, and DMD), enabling high-quality online causal video generation in as few as 4 denoising steps. Building on both modules, we further introduce the Joint World Model (JWM), which deeply integrates WorldRec and WorldGen to achieve synergistic gains in generation stability, cross-frame consistency, and visual fidelity, providing a solid foundation for closed-loop simulation, data synthesis, and end-to-end training in autonomous driving.

Chinese Translation

本报告提出了一个统一的技术系统，解决自主驾驶世界模型的两个核心能力：世界表示和世界生成。在世界表示方面，我们提出了WorldRec，一种由稀疏场景查询驱动的前馈重建架构。WorldRec在三维空间中初始化结构化查询，利用这些查询聚合跨视角和跨时间特征，从而自然地强制实现帧间的空间一致性，并生成紧凑且高保真的三维高斯场景表示。在世界生成方面，我们提出了WorldGen，一个双向预训练后接因果微调的两阶段训练框架，通过三个渐进阶段（教师强迫、常微分方程蒸馏和DMD），实现高质量的在线因果视频生成，所需的去噪步骤少至4步。在这两个模块的基础上，我们进一步引入了JWM，它深度整合了WorldRec和WorldGen，以实现生成稳定性、跨帧一致性和视觉保真度的协同增益，为自主驾驶中的闭环仿真、数据合成和端到端训练提供了坚实的基础。


cs.CV / 245 / 2605.18156

Semi-LAR: Semi-supervised Contrastive Learning with Linear Attention for Removal of Nighttime Flares

Semi-LAR：一种基于线性注意力的半监督对比学习框架，用于夜间光晕去除

Zhu, Xiyu, Wang, Wei, Jiang, Kui, Li, Zhengguo

Abstract

Lens flare removal is challenging due to the large spatial extent of flare artifacts and their entanglement with scene structures, while existing methods heavily rely on large-scale paired data. We propose a semi-supervised flare removal framework that enables stable learning from unlabeled images by jointly addressing pseudo-label reliability and representation discrimination. We propose an adaptive pseudo-label repository that progressively refines pseudo supervision through no-reference quality assessment, momentum-based updates, and invalid label filtering, effectively mitigating error accumulation. Moreover, we propose a flare-aware contrastive loss that explicitly treats flare-contaminated inputs as negatives and performs patch-level contrastive learning, encouraging representations that are discriminative against flare patterns while remaining consistent with reliable pseudo targets. Extensive experiments on multiple flare benchmarks demonstrate that the proposed framework is model-agnostic and consistently improves performance and robustness.

Chinese Translation

镜头光晕去除具有挑战性，因为光晕伪影的空间范围广泛且与场景结构纠缠在一起，而现有方法严重依赖于大规模配对数据。我们提出了一种半监督光晕去除框架，通过共同解决伪标签的可靠性和表示的区分性，使得从未标记图像中进行稳定学习成为可能。我们提出了一种自适应伪标签库，通过无参考质量评估、基于动量的更新和无效标签过滤，逐步优化伪监督，有效减轻错误累积。此外，我们提出了一种光晕感知对比损失，明确将光晕污染的输入视为负样本，并进行补丁级对比学习，鼓励对光晕模式具有区分性的表示，同时与可靠的伪目标保持一致。在多个光晕基准上的广泛实验表明，所提出的框架是模型无关的，并且始终提高了性能和鲁棒性。


cs.CV / 246 / 2605.18160

Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

视觉推理变换器：在多模态大型语言模型中维持视觉一致性

Dong, Xinpeng, Zhang, Min, Han, Kairong, Tan, Xu, Wu, Fei, Kuang, Kun

Abstract

In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily attributed to effective paradigms for integrating visual and textual information. The dominant connector-based paradigm projects visual features into the textual sequence, enabling unified multimodal alignment and reasoning within a generative architecture. However, our experiments reveal two key limitations: (1) Although visual information serves as the core evidential modality in MLLMs, it is treated on par with textual tokens, diminishing the unique contribution of the visual modality; (2) As generation length increases, particularly within a limited context window, the model's dependence on visual information progressively weakens, resulting in deteriorated vision-language alignment and reduced consistency between generated content and visual semantics. To address these challenges, we propose the Vision Inference Former (VIF), a lightweight architectural module that establishes a direct bridge between pure visual representations and the model's output space. Specifically, VIF continuously injects visual semantics throughout the decoding phase of the inference process, ensuring that the model remains firmly grounded in visual content during generation. We conduct experiments on 14 benchmark tasks covering general reasoning, OCR, table understanding, vision-centric evaluation, and hallucination. Experimental results show that VIF consistently improves model performance across diverse architectures while introducing minimal additional overhead. The code for this work is available at https://github.com/Dong-Xinpeng/VIF.

Chinese Translation

近年来，多模态大型语言模型（MLLMs）取得了显著进展，这主要归功于有效的视觉与文本信息整合范式。当前主流的基于连接器的范式将视觉特征投影到文本序列中，从而实现统一的多模态对齐和推理，适用于生成架构。然而，我们的实验揭示了两个关键限制：（1）尽管视觉信息在MLLMs中作为核心证据模态，但它与文本标记同等对待，削弱了视觉模态的独特贡献；（2）随着生成长度的增加，特别是在有限的上下文窗口内，模型对视觉信息的依赖逐渐减弱，导致视觉-语言对齐恶化以及生成内容与视觉语义之间的一致性降低。为了解决这些挑战，我们提出了视觉推理变换器（Vision Inference Former, VIF），这是一个轻量级的架构模块，建立了纯视觉表示与模型输出空间之间的直接桥梁。具体而言，VIF在推理过程的解码阶段持续注入视觉语义，确保模型在生成过程中始终与视觉内容紧密结合。我们在14个基准任务上进行了实验，涵盖一般推理、光学字符识别（OCR）、表格理解、视觉中心评估和幻觉等领域。实验结果表明，VIF在不同架构中始终提高了模型性能，同时引入的额外开销极小。该工作的代码可在 https://github.com/Dong-Xinpeng/VIF 获取。


cs.CV / 247 / 2605.18162

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

通过几何逻辑一致性实现视觉语言模型中的自我演化空间推理

Liu, Junming, Li, Yuqi, Sun, Yifei, Wang, Maonan, Koniusz, Piotr, Chen, Yirong, Wang, Ding

Abstract

Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness and robust spatial reasoning. To address this, we propose Spatial Alignment via Geometric Evolution (SAGE), a self-evolving framework that enforces logical consistency in VLMs through geometric and linguistic duality operations. SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs. A dynamic operation pool continuously probes for inconsistencies, promoting challenging operations and retiring mastered ones, so that training focuses on the most informative signals. SAGE is model-agnostic, data-efficient compared to prior GRPO methods, and can be applied as a lightweight post-training stage to any existing VLM. Experiments on video and spatial reasoning benchmarks demonstrate consistent improvements over strong baselines and enhanced generalization to unseen data.

Chinese Translation

视觉语言模型（VLMs）取得了显著进展，但其空间推理仍然脆弱：能够正确回答原始输入的模型在进行具有可预测答案映射的配对变换时仍可能失败，这揭示了实例级正确性与稳健空间推理之间的差距。为了解决这一问题，我们提出了通过几何演化实现空间对齐（Spatial Alignment via Geometric Evolution, SAGE），这是一个自我演化框架，通过几何和语言的双重操作在VLM中强制执行逻辑一致性。SAGE将双重一致性作为辅助奖励纳入GRPO训练中，鼓励模型在原始和变换输入之间生成逻辑上连贯的答案。动态操作池不断探测不一致性，促进具有挑战性的操作并淘汰已掌握的操作，从而使训练集中于最具信息量的信号。SAGE是模型无关的，与之前的GRPO方法相比数据效率更高，并且可以作为轻量级的后训练阶段应用于任何现有的VLM。对视频和空间推理基准的实验表明，相较于强基线，SAGE在性能上有一致的提升，并增强了对未见数据的泛化能力。
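The idea of a duality-consistency reward can be sketched as follows. This is a toy illustration, not the authors' reward: it assumes a single hypothetical operation (a horizontal flip that swaps "left" and "right"), and the function names and the `weight` value are invented for the example.

```python
# Hypothetical answer mapping for one geometric operation: a horizontal flip.
FLIP_ANSWER_MAP = {"left": "right", "right": "left"}

def expected_after_flip(answer):
    """Answer the model should give on the flipped input, given its original answer."""
    return FLIP_ANSWER_MAP.get(answer, answer)

def consistency_reward(answer_original, answer_flipped):
    """1.0 when the two answers agree under the known mapping, else 0.0."""
    return 1.0 if answer_flipped == expected_after_flip(answer_original) else 0.0

def total_reward(answer_original, answer_flipped, gold, weight=0.5):
    """Task correctness plus a weighted auxiliary consistency term (weight is an assumption)."""
    correctness = 1.0 if answer_original == gold else 0.0
    return correctness + weight * consistency_reward(answer_original, answer_flipped)

# Correct and consistent: the model says "left", then "right" on the flipped image.
print(total_reward("left", "right", gold="left"))
# Correct on the original but inconsistent under the flip.
print(total_reward("left", "left", gold="left"))
```

The second call shows the gap the abstract describes: the original answer is right, yet the paired transformation exposes that the model is not reasoning about the geometry.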


cs.CV / 248 / 2605.18173

Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting

您需要文本校正吗？用于无校正场景文本检测的软注意力掩码嵌入

Colombo, Antonio, Bianchi, Giovanni

Abstract

End-to-end scene text spotting, which unifies text detection and recognition within a single framework, has witnessed remarkable progress driven by deep learning advances. However, most existing approaches still suffer from incomplete mask proposals caused by multi-scale variation, arbitrary text shapes, and complex background interference, thereby degrading recognition accuracy. In this paper, we propose a novel Soft Attention Mask Embedding module (SAME) that leverages the global receptive field of Transformer encoders to encode high-level features and compute soft attention weights, which are then hierarchically embedded with predicted masks to generate refined text-boundary-aware masks that effectively suppress background noise. Building upon this module, we present SAME-Net, a robust end-to-end text spotting framework that requires neither character-level annotations nor auxiliary text rectification modules. Since the soft attention mechanism is fully differentiable, recognition loss gradients can be back-propagated through the SAME module to the detection branch, enabling joint optimization of detection and recognition objectives. Extensive experiments on challenging benchmarks demonstrate the effectiveness of our approach: SAME-Net achieves 84.02% end-to-end H-mean on the arbitrarily-shaped Total-Text dataset, surpassing the previous state-of-the-art GLASS by 1.02% in full-lexicon accuracy without additional training data, and obtains competitive 83.4% strong-lexicon results on the multi-oriented ICDAR 2015 dataset.

Chinese Translation

端到端场景文本检测将文本检测和识别统一在一个框架内，得益于深度学习的进步，取得了显著的进展。然而，大多数现有方法仍然受到多尺度变化、任意文本形状和复杂背景干扰导致的不完整掩码提议的影响，从而降低了识别准确性。本文提出了一种新颖的软注意力掩码嵌入模块（Soft Attention Mask Embedding，SAME），该模块利用Transformer编码器的全局感受野来编码高级特征并计算软注意力权重，然后将这些权重与预测掩码进行分层嵌入，以生成精细的文本边界感知掩码，有效抑制背景噪声。在此模块的基础上，我们提出了SAME-Net，一个强大的端到端文本检测框架，既不需要字符级注释，也不需要辅助文本校正模块。由于软注意力机制是完全可微的，识别损失梯度可以通过SAME模块反向传播到检测分支，从而实现检测和识别目标的联合优化。在具有挑战性的基准测试上进行的广泛实验证明了我们方法的有效性：SAME-Net在任意形状的Total-Text数据集上达到了84.02%的端到端H均值，在全词汇准确性上超越了之前的最先进方法GLASS 1.02%，且在多方向ICDAR 2015数据集上获得了竞争性的83.4%强词汇结果。


cs.CV / 249 / 2605.18176

MARS: Technical Report for the CASTLE Challenge at EgoVis 2026

MARS：EgoVis 2026 CASTLE挑战的技术报告

Zhang, Haoyu, Chu, Qiaohui, Feng, Yisen, Liu, Meng, Guan, Weili, Wang, Yaowei, Nie, Liqiang

Abstract

This report presents MARS, short for Multimodal Agentic Reasoning with Source selection, our system for the CASTLE Challenge at EgoVis 2026. Participants must answer 185 closed-form questions over the CASTLE 2024 dataset. In contrast to prior single-video egocentric benchmarks, CASTLE requires reasoning over four days of activity, 15 synchronized perspectives, official transcripts, and multiple auxiliary modalities, including personal photos, auxiliary videos, gaze, thermal imagery, and heart-rate measurements. MARS therefore treats the task as an agentic evidence-selection problem over multimodal sources rather than a purely text-only pipeline. MARS first follows the official CASTLE directory organization to build evidence memories from two primary sources, videos and transcripts, and four auxiliary sources, gaze, heart rate, photos, and thermal imagery. Long videos are converted into captions and DeepSeek-based summaries only because CASTLE videos are too long to fit directly into the model context for every question; this step compresses temporal evidence while keeping photos and other auxiliary media available as source-specific evidence. At inference time, a GPT-5.4 decision agent repeatedly chooses whether to continue reasoning, request a specific missing modality, produce an answer, or fall back to a random option when the evidence remains insufficient. The resulting system achieved second place on the final CASTLE Challenge leaderboard. Our code is available at https://github.com/Hyu-Zhang/MARS.

Chinese Translation

本报告介绍了MARS（多模态代理推理与源选择），这是我们为EgoVis 2026 CASTLE挑战开发的系统。参与者必须在CASTLE 2024数据集上回答185个封闭式问题。与以往的单视频自我中心基准相比，CASTLE要求对四天的活动、15个同步视角、官方转录本以及包括个人照片、辅助视频、注视、热成像和心率测量在内的多种辅助模态进行推理。因此，MARS将该任务视为一个多模态源的代理证据选择问题，而非纯文本管道。MARS首先遵循官方CASTLE目录结构，从两个主要源（视频和转录本）以及四个辅助源（注视、心率、照片和热成像）构建证据记忆。由于CASTLE视频过长，无法直接适应模型上下文，因此长视频被转换为字幕和基于DeepSeek的摘要；此步骤在保留照片和其他辅助媒体作为源特定证据的同时压缩了时间证据。在推理时，GPT-5.4决策代理反复选择是继续推理、请求特定缺失模态、生成答案，还是在证据不足时回退到随机选项。最终，该系统在CASTLE挑战的最终排行榜上获得第二名。我们的代码可在https://github.com/Hyu-Zhang/MARS获取。


cs.CV / 250 / 2605.18177

Token-Space Mask Prediction for Efficient Vision Transformer Segmentation

高效视觉变换器分割的令牌空间掩码预测

Galagain, Calvin, Poreba, Martyna, Goulette, François

Abstract

Query-based Vision Transformer segmentation models typically reconstruct dense spatial feature maps to predict masks, inheriting design patterns from convolutional architectures. We show that this explicit image-space reconstruction is not required. We introduce TokenMask, a token-space mask head that computes mask logits directly from query-token affinities and performs interpolation in logit space rather than feature space. This reformulation preserves the original linear scoring mechanism while simplifying the computational structure. Across diverse ViT backbones, datasets and segmentation tasks, TokenMask consistently improves efficiency over prior approaches by reducing computational and memory requirements while maintaining competitive accuracy, leading to tangible speedups on NVIDIA Jetson AGX Orin using TensorRT FP16 inference. Overall, TokenMask yields a simpler and more deployment-friendly design for embedded vision systems.

Chinese Translation

基于查询的视觉变换器分割模型通常重建密集的空间特征图以预测掩码，继承了卷积架构的设计模式。我们表明，这种显式的图像空间重建并不是必需的。我们引入了TokenMask，一个令牌空间掩码头，它直接从查询令牌的亲和性计算掩码逻辑，并在逻辑空间而非特征空间中执行插值。这种重新表述保留了原始的线性评分机制，同时简化了计算结构。在不同的ViT骨干网络、数据集和分割任务中，TokenMask始终通过减少计算和内存需求来提高效率，同时保持竞争力的准确性，从而在使用TensorRT FP16推理的NVIDIA Jetson AGX Orin上实现了显著的加速。总体而言，TokenMask为嵌入式视觉系统提供了更简单且更易于部署的设计。
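The claim that interpolation can move from feature space to logit space rests on linearity: a dot-product score commutes with bilinear interpolation. The sketch below checks this numerically on a tiny grid. It is not the TokenMask head itself, whose details the abstract does not give; the tokens, the query, and the helper names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bilinear(grid, out_h, out_w):
    """Corner-aligned bilinear resize of an H x W grid of equal-length vectors."""
    h, w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        y = i * (h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = min(int(y), h - 1)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = min(int(x), w - 1)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            corners = zip(grid[y0][x0], grid[y0][x1], grid[y1][x0], grid[y1][x1])
            row.append([
                (1 - fy) * ((1 - fx) * a + fx * b) + fy * ((1 - fx) * c + fx * d)
                for a, b, c, d in corners
            ])
        out.append(row)
    return out

# Hypothetical 2 x 2 grid of 3-dim patch tokens and one 3-dim mask query.
tokens = [[[1.0, 0.0, 2.0], [0.5, 1.0, -1.0]],
          [[-1.0, 2.0, 0.0], [3.0, 0.5, 1.0]]]
query = [0.2, -0.4, 1.0]

# Token-space path: score each token first, then upsample the scalar logits.
low_res_logits = [[[dot(query, t)] for t in row] for row in tokens]
token_space = [[cell[0] for cell in row] for row in bilinear(low_res_logits, 4, 4)]

# Image-space path: upsample the features first, then score every pixel.
image_space = [[dot(query, f) for f in row] for row in bilinear(tokens, 4, 4)]

max_gap = max(abs(a - b)
              for row_a, row_b in zip(token_space, image_space)
              for a, b in zip(row_a, row_b))
print(max_gap)
```

Both paths give the same 4 x 4 logit map up to floating-point error, but the token-space path interpolates one scalar per location instead of a full feature vector, which is where a memory and compute saving would come from.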


cs.CV / 251 / 2605.18192

View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification

视角感知的语义对齐用于空中-地面人员重识别

Zhang, Quan, Cai, Zeqiang, Zhao, Peiming, Wu, Jingze, Wu, Cailun, Chen, Hongbo, Lai, Jianhuang

Abstract

Aerial-Ground Person Re-Identification (AGPReID) remains highly challenging due to drastic viewpoint variations between drones and fixed cameras. Existing methods typically follow a view-invariant paradigm, aligning shared features across views to achieve robustness. However, view invariance inherently enforces part-level alignment, which ignores view-specific cues and discriminative identity information. To this end, this work proposes ViSA (View-aware Semantic Alignment), a view-aware framework that achieves cross-view semantic consistency through an Expert-driven Token Generation Module (ETGM) and a Dual-branch Local Fusion Module (DLFM). Technically, the former constructs a set of view-aware experts to generate adaptive semantic queries that perceive viewpoint-specific patterns, while the latter leverages graph reasoning to extract and align local regions responsive to different experts. Extensive experiments on three AGPReID benchmarks including AG-ReID.v2, CARGO and LAGPeR demonstrate that ViSA consistently achieves superior performance, with a notable 10.06% mAP improvement on the challenging CARGO cross-view protocol. The code is available at https://github.com/Cat-Zero/ViSA.

Chinese Translation

空中-地面人员重识别（AGPReID）由于无人机与固定摄像头之间的视角变化而面临极大的挑战。现有方法通常遵循视角不变的范式，通过对齐不同视角下的共享特征以实现鲁棒性。然而，视角不变本质上强制执行部分级别的对齐，忽视了视角特定的线索和具有区分性的身份信息。为此，本文提出了ViSA（视角感知语义对齐），这是一个视角感知框架，包含一个专家驱动的标记生成模块（ETGM）和一个双分支局部融合模块（DLFM），以实现跨视角的语义一致性。从技术上讲，前者构建了一组视角感知的专家，以生成适应性的语义查询，感知视角特定的模式，而后者利用图推理提取和对齐响应于不同专家的局部区域。在包括AG-ReID.v2、CARGO和LAGPeR在内的三个AGPReID基准上的广泛实验表明，ViSA始终实现了卓越的性能，在具有挑战性的CARGO跨视角协议上显著提高了10.06%的mAP。代码可在 https://github.com/Cat-Zero/ViSA 获取。


cs.CV / 252 / 2605.18193

Best Segmentation Buddies for Image-Shape Correspondence

最佳分割伙伴用于图像-形状对应

Lang, Itai, Lyu, Dongwei, Decatur, Dale, Hanocka, Rana

Abstract

Finding correspondences is a fundamental and extensively researched problem in computer vision and graphics. In this work, we examine the underexplored task of estimating segmentation-to-segmentation correspondence between images in the wild and untextured 3D shapes. This task is highly challenging due to substantial differences in appearance, geometry, and viewpoint. Our approach bridges the cross-modality gap by linking pixels in the image segment to vertices in the corresponding semantic part of the 3D shape. To achieve this, we first distill deep visual features from a 2D vision model onto the 3D shape surface, allowing for the computation of feature similarity between image pixels and shape vertices. Then, we identify Best Segmentation Buddies, vertices whose most similar image pixel lies within the image segmentation region, enabling the reliable discovery of vertices in semantically corresponding shape parts. Finally, we leverage distilled 3D features from the 2D image segmentation model to segment the shape directly in 3D, bootstrapping the correspondence process. We demonstrate the generality and robustness of our approach across a wide range of image-shape pairs, showcasing accurate and semantically meaningful correspondences. Our project page is at https://threedle.github.io/bsb/.

Chinese Translation

寻找对应关系是计算机视觉和图形学中的一个基本且广泛研究的问题。在本研究中，我们考察了一个尚未充分探索的任务，即在自然图像与无纹理三维形状之间估计分割到分割的对应关系。由于外观、几何形状和视角之间存在显著差异，这一任务极具挑战性。我们的方法通过将图像分割中的像素与三维形状对应语义部分的顶点联系起来，弥合了跨模态的差距。为此，我们首先从二维视觉模型中提取深度视觉特征到三维形状表面，从而计算图像像素与形状顶点之间的特征相似性。接着，我们识别最佳分割伙伴（Best Segmentation Buddies），即其最相似的图像像素位于图像分割区域内的顶点，从而可靠地发现语义上对应的形状部分的顶点。最后，我们利用从二维图像分割模型中提取的三维特征直接在三维空间中对形状进行分割，启动对应关系的过程。我们展示了我们的方法在广泛的图像-形状对中的通用性和鲁棒性，展现了准确且具有语义意义的对应关系。我们的项目页面位于 https://threedle.github.io/bsb/。
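The selection rule for Best Segmentation Buddies, as the abstract words it, can be sketched in a few lines: keep a vertex if its most similar image pixel falls inside the segmentation mask. The sketch assumes cosine similarity and uses tiny made-up 2-dim features; the actual similarity measure and feature dimensions are not stated in the abstract.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_segmentation_buddies(vertex_feats, pixel_feats, pixel_in_segment):
    """Indices of vertices whose nearest pixel (by feature similarity) lies in the segment."""
    buddies = []
    for v_idx, v_feat in enumerate(vertex_feats):
        best_pixel = max(range(len(pixel_feats)),
                         key=lambda p: cosine(v_feat, pixel_feats[p]))
        if pixel_in_segment[best_pixel]:
            buddies.append(v_idx)
    return buddies

# Hypothetical features: three pixels (only the first is inside the segment) and three vertices.
pixel_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pixel_in_segment = [True, False, False]
vertex_feats = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]

buddies = best_segmentation_buddies(vertex_feats, pixel_feats, pixel_in_segment)
print(buddies)
```

Vertices 0 and 2 are selected because their closest pixel is the in-segment one; vertex 1 is dropped because its closest pixel lies outside the mask.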


cs.CV / 253 / 2605.18209

SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning

SPATIOROUTE：用于零样本空间推理的动态提示路由

Chunhachatrachai, Pawat, Faure, Gueter Josmy, Su, Hung-Ting, Hsu, Winston H.

Abstract

Spatial question answering over egocentric video is a challenging task that requires Vision-Language Models (VLMs) to reason about 3D object positions, scene affordances, and directional relationships, particularly in the zero-shot setting where no task-specific fine-tuning is available. We introduce SpatioRoute, a dynamic prompt generation approach that routes each incoming question to a semantically tailored prompt template -- without any additional training, fine-tuning, or 3D sensor input. SpatioRoute operates in two complementary modes: SpatioRoute-R, a rule-based router that deterministically maps question typologies (e.g., What, Is, How, Can, Which) to specialized prompt templates; and SpatioRoute-L, an LLM-driven approach that generates task-specific prompts from the question and situational context alone, with no video input at routing time. We evaluate SpatioRoute on the SQA3D benchmark across VLMs spanning model families. SpatioRoute achieves consistent overall accuracy gains up to 5% over fixed prompt baselines, establishing a new state-of-the-art for zero-shot video-only spatial VQA without requiring 3D point-cloud inputs. As an additional finding, we observe that Chain-of-Thought (CoT) prompting, implemented via the Think it Twice architecture, consistently degrades performance in this setting on Qwen series models, confirming that question-aware routing is more effective than uniform reasoning instructions for spatial video understanding.

Chinese Translation

基于自我中心视频的空间问答是一项具有挑战性的任务，要求视觉-语言模型（VLMs）推理3D物体位置、场景可供性和方向关系，特别是在没有特定任务微调的零样本设置中。我们提出了SpatioRoute，一种动态提示生成方法，它将每个输入问题路由到一个语义定制的提示模板——无需任何额外的训练、微调或3D传感器输入。SpatioRoute以两种互补模式运行：SpatioRoute-R，一种基于规则的路由器，确定性地将问题类型（例如，What、Is、How、Can、Which）映射到专门的提示模板；以及SpatioRoute-L，一种基于大型语言模型（LLM）的驱动方法，仅从问题和情境上下文生成特定任务的提示，而在路由时不需要视频输入。我们在跨越不同模型系列的VLMs上评估了SpatioRoute在SQA3D基准上的表现。SpatioRoute在固定提示基线之上实现了高达5%的整体准确率提升，确立了零样本视频空间VQA的新最先进水平，而无需3D点云输入。作为额外发现，我们观察到通过Think it Twice架构实现的思维链（Chain-of-Thought，CoT）提示在Qwen系列模型的此设置中持续降低性能，确认了问题感知路由在空间视频理解中比统一推理指令更有效。
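A rule-based router of the kind described for SpatioRoute-R can be sketched as a lookup on the leading question word. The template wording, the fallback template, and the `route` function are all hypothetical; the abstract names only the question typologies (What, Is, How, Can, Which), not the prompts.

```python
# Hypothetical prompt templates keyed by the first word of the question.
PROMPT_TEMPLATES = {
    "what": "Identify the object or attribute asked about. Situation: {situation} Question: {question}",
    "is": "Answer yes or no after checking the described state. Situation: {situation} Question: {question}",
    "how": "Count or estimate the requested quantity. Situation: {situation} Question: {question}",
    "can": "Decide whether the action is feasible from the described position. Situation: {situation} Question: {question}",
    "which": "Pick the direction or object relative to the described viewpoint. Situation: {situation} Question: {question}",
}
# Assumed fallback for question types outside the table.
DEFAULT_TEMPLATE = "Answer the question about the scene. Situation: {situation} Question: {question}"

def route(question, situation):
    """Deterministically map a question to a filled-in prompt; no video is needed at this step."""
    words = question.strip().split()
    key = words[0].lower() if words else ""
    template = PROMPT_TEMPLATES.get(key, DEFAULT_TEMPLATE)
    return template.format(situation=situation.strip(), question=question.strip())

prompt = route("Is the door on my left open?", "I am sitting at the desk facing the window.")
print(prompt)
```

Because the mapping is a plain dictionary lookup, routing adds no training and negligible latency, which matches the zero-shot, training-free setting the abstract describes.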


cs.CV / 254 / 2605.18214

EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation

EgoInteract：用于交互理解和预测的合成自我中心视频生成

Leonardi, Rosario, Ragusa, Francesco, Materia, Daniele, Passanisi, Alessandro, Fort, James, Engel, Jakob, Farinella, Giovanni Maria

Abstract

Collecting large-scale egocentric video datasets with dense spatial and temporal annotations is costly, slow, and often constrained by environmental biases, privacy constraints, and limited coverage of interaction patterns. While synthetic data has shown strong potential in several vision domains, its use for egocentric perception remains relatively underexplored, especially for tasks requiring temporally coherent human-object interactions. In this work, we introduce EgoInteract, a controllable simulator for egocentric video generation designed to model fine-grained egocentric interactions and their temporal dynamics. The simulator enables precise control over camera, human body and hand motion, object manipulation, and scene composition across diverse environments. Building on this framework, we generate a synthetic egocentric video dataset with dense spatial and temporal annotations for temporal action segmentation, next-active object detection, interaction anticipation, and hand-object interaction detection. We evaluate models trained with simulated data on multiple real-world egocentric benchmarks spanning diverse environments, object categories, and interaction patterns. Results show consistent improvements over strong baselines across tasks and datasets, demonstrating the effectiveness and transferability of our simulation-based approach.

Chinese Translation

收集大规模自我中心视频数据集并进行密集的空间和时间标注成本高昂、耗时缓慢，并且常常受到环境偏见、隐私限制和交互模式覆盖有限的制约。尽管合成数据在多个视觉领域显示出强大的潜力，但其在自我中心感知中的应用仍然相对未被充分探索，尤其是在需要时间一致的人-物交互任务中。在本研究中，我们介绍了EgoInteract，一个可控的自我中心视频生成模拟器，旨在建模细粒度的自我中心交互及其时间动态。该模拟器能够精确控制相机、人体和手部运动、物体操控以及在多样环境中的场景构成。在此框架的基础上，我们生成了一个合成自我中心视频数据集，包含密集的空间和时间标注，用于时间动作分割、下一个活跃物体检测、交互预测和手-物交互检测。我们在多个真实世界的自我中心基准上评估了使用模拟数据训练的模型，这些基准涵盖了多样的环境、物体类别和交互模式。结果显示，在各项任务和数据集上，相较于强基线模型，表现出一致的提升，证明了我们基于模拟的方法的有效性和可迁移性。


cs.CV / 255 / 2605.18233

Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

增强无训练无限帧生成以实现一致的长视频

Feng, X., Zhu, J., Wu, M., Chen, C., Mao, F., Guo, H., Wu, J., Chu, X., Huang, K.

Abstract

Without incurring significant computational overhead, train-free long video generation aims to enable foundation video generation models to produce longer videos. Frame-level autoregressive frameworks, e.g., FIFO-diffusion, offer the advantage of generating infinitely long videos with constant memory consumption. However, the mismatch between training and inference, coupled with the challenge of maintaining long-term consistency, limits the effective utilization of foundation models. To mitigate these concerns, we propose MIGA, a novel infinite-frame long video generation method. Firstly, we propose an effective two-stage alignment mechanism that mitigates the training-inference gap by reducing the excessive noise span fed to the model. We then introduce an innovative dual consistency enhancement mechanism, where the self-reflection approach corrects early high-noise frames and the long-range frame guidance approach leverages later low-noise frames with broad coverage to steer generation, jointly improving temporal consistency. Extensive experiments on VBench and NarrLV demonstrate the state-of-the-art performance of MIGA. Our project page is available at https://xiaokunfeng.github.io/miga_homepage/.

Chinese Translation

无显著计算开销的无训练长视频生成旨在使基础视频生成模型能够生成更长的视频。帧级自回归框架，例如 FIFO-diffusion，具有以恒定内存消耗生成无限长视频的优势。然而，训练与推理之间的不匹配，加上保持长期一致性的挑战，限制了基础模型的有效利用。为了解决这些问题，我们提出了 MIGA，一种新颖的无限帧长视频生成方法。首先，我们提出了一种有效的两阶段对齐机制，通过减少输入模型的过量噪声范围来减轻训练-推理差距。然后，我们引入了一种创新的双重一致性增强机制，其中自反射方法纠正早期高噪声帧，而长距离帧引导方法利用后期低噪声帧进行广泛覆盖以引导生成，共同改善时间一致性。在 VBench 和 NarrLV 上的广泛实验表明 MIGA 达到了最先进的性能。我们的项目页面可在 https://xiaokunfeng.github.io/miga_homepage/ 查阅。

cs.CV / 256 / 2605.18238

Non-Colliding Biometric Identities for Digital Entities: Geometry, Capacity, and Million-Scale Virtual Identity Provisioning

数字实体的非冲突生物识别身份：几何、容量与百万级虚拟身份提供

Ji, Yuyang, Shen, Yixuan, Jain, Anil, Liu, Xiaoming, Liu, Feng

Abstract

Digital entities such as AI agents and humanoid robots increasingly operate alongside real humans, yet their identity infrastructure is based on credentials rather than embodied biometric identity. We introduce Biometric Identity Provisioning (BIP), a new problem and solution framework that addresses: given an enrollment gallery of real human identities, provision virtual identities that are non-colliding with every enrolled identity, maintain sufficient inter-class separability, and are realizable as high-fidelity face images. The key geometric insight is that real face identities occupy a low-dimensional subspace of the embedding hypersphere, leaving no residual subspace for virtual identities. Hence, virtual identities must instead be allocated as unclaimed gaps within the real face manifold itself. BIP is therefore a constrained packing problem: available gaps vastly exceed any foreseeable enrollment scale, and provisioned identities remain non-colliding even as new real identities are subsequently enrolled. Grounded in this geometry, our repulsion-based allocation is not bounded by any fixed provisioning count; we demonstrate 10M non-colliding virtual identity embeddings against a gallery of 360K real identities. Realizing these embeddings as face images requires a generator that operates outside the training distribution of real face images; we introduce GapGen, a gap-aware generator trained with a curriculum that progressively extends synthesis into non-colliding regions, validated at 1M photorealistic virtual face images. We further construct v-LFW, a virtual counterpart to LFW face dataset, with protocols for virtual face verification, cross-reality matching, real-vs-virtual detection, and unified recognition and detection.

Chinese Translation

数字实体如人工智能代理和类人机器人日益与真实人类并行操作，但它们的身份基础设施基于凭证而非具身的生物识别身份。我们提出了生物识别身份提供（Biometric Identity Provisioning, BIP），这是一个新的问题和解决框架，旨在解决以下问题：在给定真实人类身份的注册库的情况下，提供与每个注册身份非冲突的虚拟身份，保持足够的类间可分离性，并能够实现为高保真面部图像。关键的几何洞察是，真实面部身份占据了嵌入超球体的低维子空间，未留有任何虚拟身份的剩余子空间。因此，虚拟身份必须在真实面部流形内部分配为未被占用的间隙。因此，BIP 是一个受限的打包问题：可用的间隙远远超过任何可预见的注册规模，并且即使在后续注册新的真实身份时，提供的身份仍然保持非冲突。在这一几何基础上，我们的排斥式分配不受任何固定提供数量的限制；我们展示了与360K真实身份库相比的1000万个非冲突虚拟身份嵌入。将这些嵌入实现为面部图像需要一个在真实面部图像训练分布之外操作的生成器；我们引入了GapGen，一个关注间隙的生成器，通过逐步扩展合成到非冲突区域的课程进行训练，在100万张逼真的虚拟面部图像中进行了验证。我们进一步构建了v-LFW，这是LFW面部数据集的虚拟对应物，包含虚拟面部验证、跨现实匹配、真实与虚拟检测以及统一识别和检测的协议。

cs.CV / 257 / 2605.18252

GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance

GaussianZoom：具有几何和语义指导的渐进放大生成3D高斯点云

Shi, Jiale, Hu, Jiarui, Yang, Zesong, Luan, Kaixuan, Bao, Hujun, Cui, Zhaopeng

Abstract

We introduce GaussianZoom, a generative zoom-in 3D reconstruction system with an iterative progressive framework that combines geometry-consistent scene modeling and multi-scale semantic reasoning to enable high-fidelity extreme zoom-in rendering from low-resolution inputs. To achieve this, we develop a novel multi-view consistent super-resolution module with depth-based feature warping and VLM-driven detail synthesis, ensuring accurate multi-view correspondence while enriching fine-scale appearance beyond the observed resolution. To support zooming across large magnification ranges, we further introduce a new expandable continuous Level-of-Detail hierarchy that dynamically modulates Gaussian visibility for smooth, alias-free cross-scale rendering. Experiments on Mip-NeRF360 and Tanks&Temples demonstrate that GaussianZoom achieves superior perceptual quality, multi-view consistency, and robustness under extreme magnification, establishing a strong baseline for generative zoom-in 3D scene reconstruction.

Chinese Translation

我们介绍了GaussianZoom，一种生成性放大3D重建系统，采用迭代渐进框架，结合几何一致的场景建模和多尺度语义推理，从低分辨率输入实现高保真度的极限放大渲染。为此，我们开发了一种新颖的多视图一致超分辨率模块，结合基于深度的特征扭曲和VLM驱动的细节合成，确保准确的多视图对应，同时丰富超出观察分辨率的细微外观。为了支持在大放大范围内的放大，我们进一步引入了一种新的可扩展连续细节层次结构，动态调节高斯可见性，实现平滑且无别名的跨尺度渲染。在Mip-NeRF360和Tanks&Temples上的实验表明，GaussianZoom在极限放大下实现了优越的感知质量、多视图一致性和鲁棒性，为生成性放大3D场景重建建立了强有力的基线。

cs.CV / 258 / 2605.18257

CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook

CodeBind：具有统一组合代码本的多模态对齐的解耦表示学习

Chen, Zeyu, Li, Jie, Han, Kai

Abstract

Multimodal representation alignment is pivotal for large language models and robotics. Traditional methods are often hindered by cross-modal information discrepancies and data scarcity, leading to suboptimal alignment spaces that overlook modality-unique features. We propose CodeBind, a framework that optimizes multimodal representation spaces through a modality-shared-specific codebook design. By incrementally aligning target and bridging modalities, CodeBind bypasses the need for fully paired data. Unlike traditional hard alignment, CodeBind decomposes features into shared components for semantic consistency and specific components for modality-unique details. This design utilizes a compositional vector quantization scheme, where a shared codebook bridges modality gaps and modality-specific codebooks mitigate representation bias by preventing dominant modalities from overshadowing others. Validated across nine modalities (text, image, video, audio, depth, thermal, tactile, 3D point cloud, EEG), CodeBind achieves state-of-the-art performance in multimodal classification and retrieval tasks.

Chinese Translation

多模态表示对齐对于大语言模型和机器人技术至关重要。传统方法常常受到跨模态信息差异和数据稀缺的限制，导致次优的对齐空间忽视了模态特有的特征。我们提出了CodeBind，一个通过模态共享特定代码本设计来优化多模态表示空间的框架。通过逐步对齐目标模态和桥接模态，CodeBind绕过了对完全配对数据的需求。与传统的硬对齐不同，CodeBind将特征分解为共享组件以保持语义一致性，以及特定组件以捕捉模态独特细节。该设计利用了一种组合向量量化方案，其中共享代码本弥合了模态间的差距，而模态特定代码本通过防止主导模态掩盖其他模态来减轻表示偏差。在九种模态（文本、图像、视频、音频、深度、热成像、触觉、3D点云、脑电图）上进行验证，CodeBind在多模态分类和检索任务中实现了最先进的性能。

cs.CV / 259 / 2605.18263

RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting

RT-Splatting：联合反射-透射建模与高斯点云技术

Shi, Ji, Ying, Xianghua, Xing, Bowei, Guo, Ruohao, Yue, Wenzhen

Abstract

3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual quality. However, existing methods struggle with semi-transparent specular surfaces that exhibit both complex reflections and clear transmission, often producing blurry reflections or overly occluded transmission. To address this, we present RT-Splatting, a framework that disentangles each Gaussian's geometric occupancy from its optical opacity. This factorization yields a unified surface-volume scene representation with a single set of Gaussian primitives. Our hybrid renderer interprets this representation both as a surface to capture high-frequency reflections and as a volume to preserve clear transmission. To mitigate the ambiguity in jointly optimizing reflection and transmission, we introduce Specular-Aware Gradient Gating, which suppresses misleading gradients from highly specular regions into the transmission branch, effectively reducing distracting floaters. Experiments on challenging semi-transparent scenes show that RT-Splatting achieves state-of-the-art performance, delivering high-fidelity reflections and clear transmission with real-time rendering. Moreover, our factorization naturally enables flexible scene editing. The project page is available at https://sjj118.github.io/RT-Splatting.

Chinese Translation

三维高斯点云技术（3D Gaussian Splatting, 3DGS）能够实现高视觉质量的实时新视角合成。然而，现有方法在处理半透明镜面表面时面临挑战，这类表面既具有复杂的反射又有清晰的透射，常常导致反射模糊或透射过度遮挡。为了解决这个问题，我们提出了RT-Splatting，一个将每个高斯的几何占用与其光学不透明度解耦的框架。这种因子分解产生了一个统一的表面-体积场景表示，使用一组高斯原语。我们的混合渲染器将该表示解读为一个表面，以捕捉高频反射，同时也作为一个体积，以保持清晰的透射。为了减少在联合优化反射和透射时的模糊性，我们引入了镜面感知梯度门控（Specular-Aware Gradient Gating），该方法抑制来自高度镜面区域的误导性梯度进入透射分支，有效减少了干扰浮动物。在具有挑战性的半透明场景上的实验表明，RT-Splatting实现了最先进的性能，提供了高保真度的反射和清晰的透射，并支持实时渲染。此外，我们的因子分解自然地支持灵活的场景编辑。项目页面可访问 https://sjj118.github.io/RT-Splatting。

cs.CV / 260 / 2605.18267

SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation

SRC-Flow：紧凑的语义表示使得归一化流在图像生成中的应用

Jiang, Longtao, Bao, Jiangmin, Wang, Zhendong, Tao, Xin, Wan, Pengfei, Li, Zhihui, Chang, Xiaojun

Abstract

Normalizing flows (NFs) provide exact likelihoods and deterministic invertible sampling, but have historically lagged behind diffusion models for large-scale image generation. We identify a key obstacle: NFs are required to learn a single invertible transport over the full ambient space, making them highly sensitive to high-dimensional representations. This leads to a semantic-capacity mismatch in modern visual representation spaces, where semantic information is compact but encoded in overcomplete features. We propose SRC-Flow, which introduces a Semantic Representation Compressor (SRC) to compact high-dimensional RAE features into a low-dimensional semantic space before flow modeling and preserve reconstruction through the frozen RAE decoder. This compact space reduces the modeling burden of NFs and enables effective likelihood-based generation in semantic representation space. We further adopt constant noise regularization tailored to the fixed unconditional bijection learned by flows. On ImageNet $256 \times 256$ and $512 \times 512$, SRC-Flow achieves state-of-the-art generation quality among normalizing flow methods, with gFID scores of 1.65 and 2.07 under classifier-free guidance, while retaining exact likelihood computation in the compact semantic representation space and deterministic invertible sampling at the flow level. Codes and models will be available at https://github.com/longtaojiang/SRC-Flow.

Chinese Translation

归一化流（Normalizing Flows, NFs）提供了精确的似然性和确定性可逆采样，但在大规模图像生成方面历来落后于扩散模型。我们识别出一个关键障碍：NFs 需要在整个环境空间上学习一个单一的可逆传输，这使得它们对高维表示高度敏感。这导致了现代视觉表示空间中的语义容量不匹配，其中语义信息是紧凑的，但以过完备特征进行编码。我们提出了 SRC-Flow，引入了语义表示压缩器（Semantic Representation Compressor, SRC），在流建模之前将高维 RAE 特征压缩到低维语义空间，并通过冻结的 RAE 解码器保留重构。这一紧凑空间减少了 NFs 的建模负担，并使得在语义表示空间中有效的基于似然的生成成为可能。我们进一步采用了针对流学习的固定无条件双射量身定制的恒定噪声正则化。在 ImageNet $256 \times 256$ 和 $512 \times 512$ 数据集上，SRC-Flow 在归一化流方法中实现了最先进的生成质量，在无分类器引导下的 gFID 分数分别为 1.65 和 2.07，同时在紧凑的语义表示空间中保留了精确的似然计算和流级别的确定性可逆采样。代码和模型将发布在 https://github.com/longtaojiang/SRC-Flow。

cs.CV / 261 / 2605.18287

StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

StableVLA：朝着无需额外数据的稳健视觉-语言-动作模型迈进

Fu, Yiyang, Zhang, Chubin, Gong, Shukai, Deng, Yufan, Sun, Kaiwei, Min, Qiyang, Hou, Qibin, Tang, Yansong, Wang, Jianan, Zhou, Daquan

Abstract

It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study based on recent state-of-the-art VLA models and reveal a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30%, while adding fewer than 10M parameters, demonstrating notable efficiency and effectiveness. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. With negligible parameter overhead (<10M), our approach maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.

Chinese Translation

在训练数据集中涵盖所有可能的干扰是不可行的。这引发了一个关键问题，即视觉-语言-动作（VLA）模型在遇到未见过的现实世界视觉干扰时的稳健性，尤其是在不完美的视觉条件下。在本研究中，我们基于最近的最先进VLA模型进行了系统研究，并揭示了当引入训练数据中缺失的视觉干扰时，性能显著下降。为了解决这一问题，我们提出了一种基于信息理论的轻量级适配器模块，称为信息瓶颈适配器（IB-Adapter），该模块选择性地过滤视觉输入中的潜在噪声。在不需要任何额外数据或增强策略的情况下，IB-Adapter的性能平均提高了30%，同时增加的参数少于1000万，展示了显著的效率和有效性。此外，即使在使用14倍更小的主干网络（0.5B参数）且未在Open X-Embodiment数据集上进行预训练的情况下，我们的模型StableVLA在稳健性方面与7B规模的最先进VLA竞争。我们的方案在参数开销几乎可以忽略不计（<1000万）的情况下，保持了长时间任务的准确性，并在合成和物理视觉干扰下超越了OpenPi。

cs.CV / 262 / 2605.18288

Collision-Resistant Single-Pass Method for Unsupervised Fine-Grained Image Hashing

抗碰撞的单通道无监督细粒度图像哈希方法

Duong, Anh-Kiet, Gomez-Krämer, Petra, Carozza, Jean-Michel

Abstract

Unsupervised fine-grained image hashing aims to learn compact binary codes that preserve subtle visual differences among highly similar instances without manual annotations. However, most existing methods neglect collision resistance, leading to identical hash codes for slightly semantically different samples. In this paper, we propose Collision-Resistant Single-Pass Self-Supervised Semantic Hashing (CS3H), a collision-resistant framework that directly optimizes Hamming-space similarity via a single-pass normalized Hamming distance loss to produce well-separated binary representations. We further introduce a collision-sensitive attention module to emphasize rare and discriminative local patterns, reducing hash collisions and improving fine-grained discrimination. Experiments on multiple benchmarks show that CS3H consistently outperforms state-of-the-art methods in retrieval accuracy while achieving superior collision resistance with minimal computational overhead.

Chinese Translation

无监督细粒度图像哈希旨在学习紧凑的二进制编码，以保留高度相似实例之间的微妙视觉差异，而无需人工标注。然而，现有的大多数方法忽视了抗碰撞性，导致在语义上略有不同的样本产生相同的哈希码。本文提出了一种抗碰撞的单通道自监督语义哈希框架（Collision-Resistant Single-Pass Self-Supervised Semantic Hashing，CS3H），该框架通过单通道归一化汉明距离损失直接优化汉明空间相似性，以生成良好分离的二进制表示。我们进一步引入了一种对碰撞敏感的注意力模块，以强调稀有和具有区分性的局部模式，从而减少哈希碰撞并提高细粒度区分能力。在多个基准测试上的实验表明，CS3H在检索精度上始终优于最先进的方法，同时以最小的计算开销实现了卓越的抗碰撞性。
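
Illustrative note (not from the paper): the abstract refers to normalized Hamming distance and to hash collisions between distinct samples. The sketch below only shows what those two quantities mean for a small set of binary codes. The function names and the example codes are hypothetical, and this is not the CS3H loss itself.

```python
from itertools import combinations


def normalized_hamming(a, b):
    """Fraction of bit positions in which two equal-length binary codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def collision_rate(codes):
    """Fraction of distinct sample pairs that were assigned an identical code."""
    pairs = list(combinations(codes, 2))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Made-up 4-bit codes for four samples; the first two collide.
codes = [(0, 1, 1, 0), (0, 1, 1, 0), (1, 1, 0, 0), (1, 0, 0, 1)]
half_apart = normalized_hamming(codes[0], codes[2])
rate = collision_rate(codes)
```

A collision-resistant method would aim to push the collision rate toward zero while keeping the distance between similar samples small.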

cs.CV / 263 / 2605.18313

Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering

用于可靠医学视觉问答的瓦瑟斯坦均衡解码

Hagen, Luca, Müller, Johanna P., Zhang, Weitong, Qiao, Mengyun, Kainz, Bernhard

Abstract

Small vision-language models (2-8B) are well-suited for clinical deployment due to privacy constraints, limited connectivity, and low-latency requirements favouring on-device or on-premise inference. However, their limited capacity exacerbates the generation of plausible but incorrect outputs. We extend game-theoretic decoding, previously restricted to text-only, closed-ended NLP tasks, to vision-language models for open-ended Medical VQA. We introduce a semantically aware Wasserstein stopping criterion that replaces lexical order matching, enabling convergence based on semantic consensus among near-synonymous candidate answers and avoiding unnecessary iterations caused by clinically equivalent ranking swaps. On VQA-RAD and PathVQA, we obtain consistent, statistically significant improvements over greedy and discriminative baselines. On VQA-RAD, we improve Qwen3-VL-2B by +3.5 percentage points (p < 0.01), surpassing the greedy 4B model, with similar trends at larger scales. On PathVQA, Gemma-3-4B with BDG matches MedGemma-4B under greedy decoding despite no domain-specific fine-tuning. At accuracy parity with classic BDG, the Wasserstein criterion reduces average convergence iterations by approximately 20%, improving inference efficiency while preserving the game-theoretic equilibrium behaviour. Code is available at https://github.com/luca-hagen/Wasserstein-BDG-medical-VQA.

Chinese Translation

由于隐私限制、连接性有限和低延迟要求，适合在临床部署的小型视觉语言模型（2-8B）更倾向于在设备上或本地进行推理。然而，它们的有限能力加剧了生成看似合理但不正确的输出。我们将游戏理论解码的扩展应用于视觉语言模型，以应对开放式医学视觉问答（Medical VQA），该方法之前仅限于文本类的封闭式自然语言处理任务。我们引入了一种语义感知的瓦瑟斯坦停止准则，替代了词汇顺序匹配，使得基于近义候选答案之间的语义共识实现收敛，避免了由于临床等效排名交换而导致的不必要迭代。在 VQA-RAD 和 PathVQA 数据集上，我们在贪婪和判别基线之上获得了一致且具有统计显著性的改进。在 VQA-RAD 上，我们使 Qwen3-VL-2B 提升了 3.5 个百分点（p < 0.01），超越了贪婪的 4B 模型，在更大规模上也呈现出类似趋势。在 PathVQA 上，Gemma-3-4B 在贪婪解码下与 MedGemma-4B 相匹配，尽管没有进行特定领域的微调。在与经典 BDG 达到相同准确率的情况下，瓦瑟斯坦准则将平均收敛迭代次数减少了约 20%，提高了推理效率，同时保持了游戏理论的均衡行为。代码可在 https://github.com/luca-hagen/Wasserstein-BDG-medical-VQA 获取。

cs.CV / 264 / 2605.18324

Improved Baselines with Representation Autoencoders

改进的基线与表示自编码器

Singh, Jaskirat, Zheng, Boyang, Wu, Zongze, Zhang, Richard, Shechtman, Eli, Xie, Saining

Abstract

Representation Autoencoders (RAE) replace traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and find three insights which simplify and improve RAE. First, we study a generalized formulation where the representation is defined as the sum of the last k encoder layers rather than solely the final layer. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we study the prevalent assumption that RAE (using pretrained representation as encoder) replaces representation alignment (REPA), which distills the same representation to intermediate layers instead. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit complementary working mechanisms, allowing the same representation to be used as both encoder and target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in RAE latent space. By simply re-parameterizing the output of the DiT model, it can provide guidance for "free". Overall, RAEv2 leads to more than 10x faster convergence over the original RAE, achieving a state-of-the-art gFID of 1.06 in just 80 epochs on ImageNet-256. On FDr^k, RAEv2 achieves a state-of-the-art 2.17 at just 80 epochs compared to the previous best 3.26 (800 epochs) without any post-training. This motivates EP_FID@k (epochs to reach unguided gFID <= k) as a measure of training efficiency. RAEv2 attains an EP_FID@2 of 35 epochs, versus 177 for the original RAE. We also validate our approach across diverse settings for text-to-image generation and navigation world models, showing consistent improvements. Code is available at https://raev2.github.io.

Chinese Translation

表示自编码器（Representation Autoencoders, RAE）用预训练的视觉编码器替代了传统的变分自编码器（VAE）。在本文中，我们系统地研究了几种设计选择，并发现了三条简化和改进RAE的见解。首先，我们研究了一种广义的公式，其中表示被定义为最后k个编码器层的总和，而不仅仅是最终层。这一简单的变化在没有对编码器进行微调或使用专门数据（例如文本、面孔）的情况下，显著改善了重建效果。其次，我们研究了RAE（使用预训练表示作为编码器）替代表示对齐（Representation Alignment, REPA）的普遍假设，后者将相同的表示提炼到中间层。通过大规模的实证分析，我们发现了一个令人惊讶的发现：RAE和REPA表现出互补的工作机制，使得相同的表示可以作为编码器和中间扩散层的目标同时使用。最后，原始RAE在无分类器引导（Classifier-Free Guidance, CFG）方面存在困难，并需要训练一个第二个较弱的扩散模型用于自动引导（AutoGuidance, AG）。我们展示了REPA本身可以被视为RAE潜在空间中的x预测。通过简单地重新参数化DiT模型的输出，可以“免费”提供引导。总体而言，RAEv2在原始RAE的基础上实现了超过10倍的收敛速度，在ImageNet-256上仅用80个周期就达到了1.06的最先进的gFID。在FDr^k上，RAEv2在仅80个周期内达到了2.17的最先进水平，而之前的最佳结果为3.26（800个周期），且没有任何后训练。这促使我们将EP_FID@k（达到无引导gFID <= k所需的周期数）作为训练效率的衡量标准。RAEv2在EP_FID@2上达到了35个周期，而原始RAE为177个周期。我们还在多种文本到图像生成和导航世界模型的设置中验证了我们的方法，显示出一致的改进。代码可在https://raev2.github.io获取。
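
Illustrative note (not from the paper): the abstract defines EP_FID@k as the number of epochs needed to reach an unguided gFID of at most k. A minimal sketch of that definition follows, assuming the training log is available as (epoch, gFID) pairs. The function name and the curve values are hypothetical and are not results from the paper.

```python
def ep_fid_at_k(fid_by_epoch, k):
    """Return the earliest logged epoch whose unguided gFID is <= k.

    fid_by_epoch: iterable of (epoch, gfid) pairs, in any order.
    Returns None if the threshold is never reached.
    """
    for epoch, gfid in sorted(fid_by_epoch):
        if gfid <= k:
            return epoch
    return None


# Made-up training curve for demonstration only.
curve = [(10, 6.4), (20, 3.1), (30, 2.2), (40, 1.9), (50, 1.7)]
epochs_to_fid_2 = ep_fid_at_k(curve, 2)
never_reached = ep_fid_at_k(curve, 1.0)
```

Because the result depends on how often gFID is evaluated, a coarse evaluation schedule can only overestimate the value.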

cs.CV / 265 / 2605.18328

CineMatte: Background Matting for Virtual Production and Beyond

CineMatte：用于虚拟制作及其他领域的背景抠图

He, Yuanjian, Zhang, Chen, Chen, Fasheng, Cao, Jiangbo

Abstract

LED Virtual Production (VP) uses large LED volumes to render backgrounds in real time, enabling in-camera visual effects but making post-shot changes labor-intensive. We address this with CineMatte, a robust background matting framework for VP and beyond. CineMatte employs a cross-attention-conditioned design. Instead of concatenating the background with the input, CineMatte employs a Siamese, frozen DINOv3 Vision Transformer with shared weights to encode the input frame and the captured background separately. A cross-attention module compares the two streams to predict the foreground, preserving pretrained semantics and improving robustness to background shifts. Previous ViT-based matting models use a parallel convolutional "detail branch" to recover fine details, which can cause boundary artifacts in real-world samples due to semantic misalignment with the backbone. We instead replace it with a pretrained, image-guided feature upsampler, which largely mitigates the problem. We also introduce CineMatte-4K, a 4K HDR image-video dataset captured on a professional LED VP stage. To the best of our knowledge, the image subset is the first dataset for VP matting and is non-synthetic, obtained via green-screen insertion; the video subset includes camera motion with tracked trajectories so that arbitrary backgrounds can be rendered later with correct parallax. Across CineMatte-4K and public benchmarks (VideoMatte240K, YouTubeMatte), CineMatte not only excels in VP but also generalizes robustly to real-world footage.

Chinese Translation

LED虚拟制作（VP）利用大型LED显示屏实时渲染背景，实现了机内视觉效果，但使得后期修改变得劳动密集。我们通过CineMatte解决了这一问题，CineMatte是一个针对VP及其他应用的强大背景抠图框架。CineMatte采用了交叉注意力条件设计。与其将背景与输入拼接，CineMatte使用一个共享权重的Siamese冻结DINOv3视觉变换器分别编码输入帧和捕获的背景。交叉注意力模块比较这两个流以预测前景，保留预训练语义并提高对背景变化的鲁棒性。之前基于ViT的抠图模型使用并行卷积的“细节分支”来恢复细节，这可能由于与主干的语义不对齐而在真实样本中造成边界伪影。我们则用一个预训练的图像引导特征上采样器替代它，极大地缓解了这一问题。我们还引入了CineMatte-4K，这是一个在专业LED VP舞台上捕获的4K HDR图像-视频数据集。据我们所知，该图像子集是第一个用于VP抠图的非合成数据集，通过绿幕插入获得；视频子集包括带有跟踪轨迹的相机运动，以便后续可以正确视差渲染任意背景。在CineMatte-4K和公共基准（VideoMatte240K，YouTubeMatte）中，CineMatte不仅在VP中表现出色，还能有效地推广到真实世界的影像中。

cs.CV / 266 / 2605.18329

Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation

迷失在折叠中：当交叉验证不是不确定性估计的深度集成时

Kirscher, Tristan, Bujotzek, Markus, Kirchhoff, Yannick, Rokuss, Maximilian, Isensee, Fabian, Kahl, Kim-Celine, Kovacs, Balint, Maier-Hein, Klaus

Abstract

Ensemble disagreement is widely used as a proxy for epistemic uncertainty in medical image segmentation. In practice, many studies form ensembles via K-fold cross-validation (CV), yet refer to them as "deep ensembles" (DE). Because CV members are trained on different data subsets, their disagreement mixes seed-driven variability with data-exposure effects, which can change how uncertainty should be interpreted. We audit recent segmentation uncertainty studies and find that terminology-implementation mismatches are common. We then compare a standard 5-fold CV ensemble to a 5-member DE (fixed training set, different random seeds) under otherwise identical configurations on three multi-rater segmentation datasets spanning three modalities. We evaluate uncertainty for calibration, failure detection, ambiguity modeling, and robustness under distribution shift. DE match segmentation accuracy while improving calibration and failure detection, whereas CV ensembles sometimes correlate more strongly with inter-rater variability on the studied datasets. Thus, ensemble construction should be chosen to match the research question: DE for reliability-oriented use (e.g., selective referral/failure detection) and CV ensembles as a proxy for ambiguity. We provide a lightweight nnU-Net modification enabling DE training within the default pipeline.

Chinese Translation

集成不一致性被广泛用作医学图像分割中表征认知不确定性的代理。在实践中，许多研究通过K折交叉验证（CV）形成集成，但却称之为“深度集成”（DE）。由于CV成员是在不同的数据子集上训练的，它们的不一致性将种子驱动的变异性与数据暴露效应混合在一起，这可能改变对不确定性的解释方式。我们审查了近期的分割不确定性研究，发现术语与实施之间的不匹配现象普遍存在。然后，我们在三个跨越三种模态的多评估者分割数据集上，将标准的5折CV集成与5个成员的DE（固定训练集，不同随机种子）在其他配置相同的情况下进行比较。我们评估了不确定性在校准、故障检测、模糊建模和在分布变化下的鲁棒性方面的表现。DE在提高校准和故障检测的同时与分割准确性相匹配，而CV集成在研究的数据集中有时与评估者之间的变异性相关性更强。因此，集成构建应根据研究问题进行选择：DE用于可靠性导向的应用（例如，选择性转诊/故障检测），而CV集成则作为模糊性的代理。我们提供了一种轻量级的nnU-Net修改，使得在默认管道中能够进行DE训练。
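
Illustrative note (not from the paper): the distinction the abstract draws is about how ensemble members are constructed. The sketch below contrasts the two recipes using sample indices only. The function names and the dictionary layout are hypothetical, and no nnU-Net code is involved.

```python
def cv_ensemble_members(n_samples, k=5):
    """K-fold CV ensemble: member i trains on every sample except fold i.

    Both the seed and the training subset differ between members.
    """
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]
    members = []
    for i, held_out in enumerate(folds):
        train = [j for j in indices if j not in held_out]
        members.append({"seed": i, "train": train})
    return members


def deep_ensemble_members(n_samples, m=5):
    """Deep ensemble: every member trains on the same fixed set; only seeds differ."""
    indices = list(range(n_samples))
    return [{"seed": s, "train": list(indices)} for s in range(m)]


cv = cv_ensemble_members(10)
de = deep_ensemble_members(10)
```

In the CV case, disagreement between members can come from the seed or from the samples a member never saw. In the deep-ensemble case only the seed varies.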

cs.CV / 267 / 2605.18334

3D Skew Gaussian Splatting with Any Camera Trajectory Visualization Engine

具有任意相机轨迹可视化引擎的3D偏斜高斯点云渲染

Zhao, Beizhen, Zhou, Yifan, Song, Gaochao, Yin, Ziran, Wang, Hao

Abstract

While 3D Gaussian Splatting (3DGS) has revolutionized real-time photorealistic view synthesis, its fundamental reliance on symmetric Gaussian distributions introduces visual artifacts that hinder accurate spatial data exploration. Specifically, symmetric kernels struggle to capture shape and color discontinuities, which cause blurriness and primitive redundancy that mislead human perception during visual analysis. To address these visualization barriers, we introduce 3D Skew Gaussian Splatting (3DSGS), a novel framework that significantly enhances the structural fidelity and compactness of explicit scene representations. Our key insight lies in extending the standard primitive to a general Skew Gaussian counterpart. This generalized primitive inherits the highly efficient rasterization properties of standard Gaussians while gaining intrinsic asymmetric modeling capabilities. We couple this with an enhanced opacity representation to better handle complex transparency, alongside a depth-aware densification strategy that intelligently manages primitive allocation. Furthermore, to make these advancements actionable for real-world visual analytics, we re-derive the CUDA rasterization pipeline to universally support both symmetric and skew Gaussians, integrating it into a decoupled, free-camera interactive visualization engine. Extensive experiments demonstrate that 3DSGS achieves superior rendering quality and structural compactness, particularly in regions with intricate details, while maintaining the real-time frame rates necessary for fluid interactive exploration. Supplementary derivations and visual results are available at https://3d-skew-gs.github.io/.

Chinese Translation

尽管3D高斯点云渲染（3DGS）在实时照片级视图合成方面取得了革命性进展，但其对对称高斯分布的基本依赖引入了视觉伪影，妨碍了准确的空间数据探索。具体而言，对称核难以捕捉形状和颜色的不连续性，这导致模糊和原始冗余，从而在视觉分析中误导人类感知。为了解决这些可视化障碍，我们提出了3D偏斜高斯点云渲染（3DSGS），这是一个新颖的框架，显著增强了显式场景表示的结构保真度和紧凑性。我们的关键见解在于将标准原始体扩展为一般的偏斜高斯对应物。这种广义原始体继承了标准高斯的高效光栅化特性，同时获得了内在的非对称建模能力。我们将其与增强的不透明度表示相结合，以更好地处理复杂的透明度，并采用深度感知的密集化策略，智能管理原始体的分配。此外，为了使这些进展在现实世界的视觉分析中可操作，我们重新推导了CUDA光栅化管道，以普遍支持对称和偏斜高斯，并将其集成到一个解耦的自由相机交互可视化引擎中。大量实验表明，3DSGS在复杂细节区域实现了优越的渲染质量和结构紧凑性，同时保持了流畅交互探索所需的实时帧率。补充推导和视觉结果可在 https://3d-skew-gs.github.io/ 获取。

cs.CV / 268 / 2605.18346

Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

聚焦强制：基于内容的逐帧 KV 选择以实现高效自回归视频扩散

Cai, Peiliang, Zhang, Evelyn, Liu, Jiacheng, Lin, Hao, Zhang, Ruiqi, Mo, Weile, Ma, Yue, Zheng, Shikang, Huang, Jiehang, Liu, Dongrui, Zhang, Linfeng

Abstract

Recent advances in autoregressive video diffusion have enabled sequential and streaming video generation. However, long-horizon generation requires increasingly large KV caches, making efficient compression without sacrificing quality challenging. Existing methods mostly select historical frames based on attention scores, but their context decisions remain coarse. When multiple frames are generated in the same chunk, these methods often apply a shared history selection to the whole chunk, score historical frames solely by attention, and assign head-wise budgets either uniformly or by attention-pattern heuristics rather than explicit head-importance estimation. We show that frames within the same generated chunk can depend on distinct historical frames, that the same historical frame can receive different attention scores as its relative temporal distance to the current frames changes, and that masking different heads induces unequal generation degradation. Motivated by these findings, we propose Focused Forcing, a training-free KV selection method that focuses cached history along both generated-frame and head dimensions. For each generated frame, Focused Forcing preserves the most relevant and distinctive historical frames by combining attention scores with diversity scores of historical frames, while assigning larger budgets to heads with higher estimated importance. Across multiple autoregressive generation paradigms, Focused Forcing achieves up to $1.48\times$ end-to-end acceleration without training, while improving visual quality and text alignment. Our code will be released on GitHub.

Chinese Translation

近期自回归视频扩散的进展使得顺序和流媒体视频生成成为可能。然而，长时间生成需要越来越大的 KV 缓存，这使得在不牺牲质量的情况下实现高效压缩变得具有挑战性。现有方法主要基于注意力分数选择历史帧，但其上下文决策仍然较为粗糙。当在同一块中生成多个帧时，这些方法通常对整个块应用共享历史选择，仅通过注意力评分历史帧，并以均匀或基于注意力模式启发式的方法分配头部预算，而不是显式的头部重要性估计。我们表明，同一生成块中的帧可以依赖不同的历史帧，相同的历史帧可以根据其与当前帧的相对时间距离而获得不同的注意力分数，并且对不同头部的掩蔽会导致不均等的生成退化。基于这些发现，我们提出了聚焦强制（Focused Forcing），一种无训练的 KV 选择方法，沿着生成帧和头部维度聚焦缓存历史。对于每个生成帧，聚焦强制通过结合历史帧的注意力分数和多样性分数，保留最相关和独特的历史帧，同时对估计重要性较高的头部分配更大的预算。在多种自回归生成范式中，聚焦强制实现了高达 $1.48\times$ 的端到端加速且无需训练，同时提高了视觉质量和文本对齐度。我们的代码将在 GitHub 上发布。
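
Illustrative note (not from the paper): the abstract says each generated frame keeps historical frames chosen by combining attention scores with diversity scores. The abstract does not give the exact rule, so the sketch below assumes a simple greedy scheme in which diversity is the distance to the nearest already-selected frame. The function name, the `diversity_weight` parameter, and all numbers are hypothetical.

```python
def select_history(attn, feats, budget, diversity_weight=0.5):
    """Greedily pick up to `budget` historical frames for one generated frame.

    attn:  attention score of each historical frame (higher = more relevant).
    feats: one feature vector per historical frame, used only for diversity.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    selected = []
    candidates = set(range(len(attn)))
    while candidates and len(selected) < budget:
        def gain(i):
            # Assumed diversity term: distance to the closest frame already kept.
            div = min((dist(feats[i], feats[j]) for j in selected), default=0.0)
            return attn[i] + diversity_weight * div

        best = max(sorted(candidates), key=gain)  # ties go to the lowest index
        selected.append(best)
        candidates.remove(best)
    return selected


# Made-up scores: frames 0 and 1 are both relevant but nearly identical.
attn = [0.9, 0.85, 0.3, 0.2]
feats = [(0.0, 0.0), (0.1, 0.0), (3.0, 0.0), (0.0, 0.2)]
with_diversity = select_history(attn, feats, budget=2)
attention_only = select_history(attn, feats, budget=2, diversity_weight=0.0)
```

With the diversity term, the second slot goes to the distinct frame 2 rather than the near-duplicate frame 1. Attention alone keeps frames 0 and 1.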

cs.CV / 269 / 2605.18349

Optimising CSRNet with parameter-free attention mechanisms for crowd counting in public transport

利用无参数注意机制优化CSRNet以实现公共交通中的人群计数

Rostamza, Aida, Del Re, Enrico, Varughese, Joshua Cherian, Olaverri-Monreal, Cristina

Abstract

Occupancy estimation and crowd counting are critical tasks in designing smart and efficient public transport vehicles. Given that public transport loading can vary from sparse to crowded, classical models for occupancy estimation must be adapted to suit this purpose. Attention mechanisms have shown remarkable capability in enhancing the representational power of deep neural networks for crowd counting in congested scenes with occlusion, complex backgrounds, and perspective distortion. However, conventional approaches, often implemented as parameterized sub-networks within convolutional layers, inevitably increase model size and computational cost, limiting deployment on resource-constrained edge devices. This paper investigates the effectiveness of state-of-the-art parameter-free attention mechanisms for crowd counting and density map estimation in highly congested scenes. We evaluate channel-wise (PFCA), spatial-wise (SA), and 3-D (SimAM) modules and compare their performance with parameterized attention modules constrained to introduce no more than 1% additional parameters. Furthermore, we present a novel combination of attention mechanisms that combines the strengths of PFCA and SA (PFCASA) customized for analyzing video streams onboard public transport systems. Using CSRNet as the backbone, experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy without introducing additional model parameters. A detailed performance analysis further reveals that PFCASA outperforms other attention modules in scenes with fewer than 40 individuals, while PFCA shows greater effectiveness as crowd density increases, underscoring their potential applicability for integration into smart public transport modalities.

Chinese Translation

占用估计和人群计数是设计智能高效公共交通工具的关键任务。考虑到公共交通的载客量可能从稀疏到拥挤，传统的占用估计模型必须进行相应的调整。注意机制在提高深度神经网络在拥挤场景中进行人群计数的表现能力方面显示出了显著的效果，尤其是在存在遮挡、复杂背景和透视失真的情况下。然而，传统方法通常作为卷积层内的参数化子网络实现，必然增加模型的大小和计算成本，限制了在资源受限的边缘设备上的部署。本文研究了最先进的无参数注意机制在高度拥挤场景中的人群计数和密度图估计的有效性。我们评估了通道级（PFCA）、空间级（SA）和三维（SimAM）模块，并将它们的性能与引入不超过1%额外参数的参数化注意模块进行了比较。此外，我们提出了一种新颖的注意机制组合，结合了PFCA和SA的优势（PFCASA），专门用于分析公共交通系统上的视频流。以CSRNet为骨干，上海科技数据集上的实验表明，无参数注意机制在不引入额外模型参数的情况下实现了可比或更优的准确性。详细的性能分析进一步揭示，PFCASA在人数少于40的场景中优于其他注意模块，而PFCA在拥挤度增加时表现出更大的有效性，突显了它们在智能公共交通模式中集成的潜在应用价值。
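
As a rough illustration of what "parameter-free" means for one of the modules named in the abstract, the sketch below applies a SimAM-style gate to a single feature-map channel. It follows the commonly published SimAM formulation (squared deviation from the channel mean, scaled by the channel variance plus a small constant); the function name `simam_channel` and the default `e_lambda` are assumptions for illustration, and this is not the paper's PFCA, SA, or PFCASA module.

```python
import math


def simam_channel(feature_map, e_lambda=1e-4):
    """Apply a SimAM-style parameter-free gate to one channel (2-D list of floats).

    No weights are learned: the gate depends only on the channel's own statistics.
    """
    values = [v for row in feature_map for v in row]
    n = len(values) - 1  # number of "other" activations in the channel
    if n < 1:
        raise ValueError("need at least two activations per channel")

    mean = sum(values) / len(values)
    sq_dev = [[(v - mean) ** 2 for v in row] for row in feature_map]
    variance = sum(d for row in sq_dev for d in row) / n

    out = []
    for row, dev_row in zip(feature_map, sq_dev):
        out_row = []
        for v, d in zip(row, dev_row):
            # Activations far from the channel mean get a larger gate value.
            inv_energy = d / (4.0 * (variance + e_lambda)) + 0.5
            gate = 1.0 / (1.0 + math.exp(-inv_energy))
            out_row.append(v * gate)
        out.append(out_row)
    return out
```

Because the gate is computed from per-channel statistics alone, adding it to a backbone such as CSRNet would leave the parameter count unchanged, which is the property the abstract compares against parameterized attention.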

cs.CV / 270 / 2605.18359

RAVE: Re-Allocating Visual Attention in Large Multimodal Models

RAVE：在大型多模态模型中重新分配视觉注意力

Leng, Xi, Ma, Xinhong, Dong, Ziqiang, Zhang, Feng, Tang, Xiaoying, Yang, Yang, Jiang, Guanjun

Abstract

Large multimodal models (LMMs) inherit the self-attention mechanism of pretrained language backbones, yet standard attention can exhibit suboptimal allocation, including cross-modal misallocation between textual and visual evidence and intra-visual imbalance among visual tokens. We propose RAVE (Re-Allocating Visual Attention), a lightweight pair-gating mechanism that adds a learned query-key bias to pre-softmax attention scores over visual keys, derived from pre-RoPE query and key features. RAVE requires no architectural modification to the backbone and can be trained end-to-end with the rest of the model. Across a suite of multimodal benchmarks, RAVE improves over standard attention by an average of 3 points, with the largest gains on perception-intensive tasks where accurate visual grounding is critical, including multilingual OCR, chart understanding, document VQA, and scene text VQA.

Chinese Translation

大型多模态模型（LMMs）继承了预训练语言骨干网络的自注意力机制，但标准注意力可能表现出次优的分配，包括文本与视觉证据之间的跨模态错误分配以及视觉标记之间的内部视觉不平衡。我们提出了RAVE（Re-Allocating Visual Attention），这是一种轻量级的配对门控机制，它在视觉键的预-softmax注意力分数上添加了一个学习的查询-键偏置，该偏置源自预-RoPE查询和键特征。RAVE不需要对骨干网络进行架构修改，并且可以与模型的其他部分进行端到端训练。在一系列多模态基准测试中，RAVE在标准注意力的基础上平均提高了3个点，在感知密集型任务上获得了最大的提升——包括多语言OCR、图表理解、文档视觉问答（VQA）和场景文本视觉问答（VQA），这些任务中准确的视觉定位至关重要。
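
The core operation the abstract describes, adding a bias to pre-softmax attention scores on visual keys only, can be sketched for a single query as follows. The abstract does not say how the bias is computed from pre-RoPE features, so here it is simply passed in as numbers; `biased_attention`, `visual_mask`, and `pair_bias` are hypothetical names, and this is a minimal sketch rather than the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(scores, visual_mask, pair_bias):
    """Attention weights for one query after biasing the visual keys.

    scores      -- pre-softmax query-key scores, one per key
    visual_mask -- True where the key is a visual token
    pair_bias   -- bias per key (assumed given; ignored for non-visual keys)
    """
    if not (len(scores) == len(visual_mask) == len(pair_bias)):
        raise ValueError("scores, visual_mask and pair_bias must have equal length")
    adjusted = [
        s + (b if is_visual else 0.0)
        for s, is_visual, b in zip(scores, visual_mask, pair_bias)
    ]
    return softmax(adjusted)
```

With equal raw scores, a positive bias on the visual keys shifts attention mass toward them while the weights still sum to one; text keys are only affected through the renormalization.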

cs.CV / 271 / 2605.18365

GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

GeoFlow：在视频生成中强制隐式几何一致性

Ackermann, Jan, Cai, Shengqu, Deng, Boyang, Kuang, Zhengfei, Peng, Songyou, Wetzstein, Gordon

Abstract

Generating geometrically consistent videos remains an open challenge: text-to-video diffusion models trained on web-scale data treat geometry only implicitly, leading to object deformation, texture drift, and non-rigid backgrounds under camera motion. Existing solutions improve consistency as a byproduct, apply only to static scenes, or realign the latent space of the model completely. We introduce a geometry-consistency reward that directly measures whether motion in a generated video is compatible with a coherent scene. Our key insight is that in physically consistent videos, background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories. We operationalize this using optical flow, depth-pose predictions, and feature-based correspondence to separate rigid and dynamic regions and evaluate their respective consistency. Integrating this reward with reinforcement fine-tuning transforms geometric consistency from an emergent property into an explicit optimization objective for video generators. The approach is model agnostic and applies to diverse dynamic scenes containing both camera and object motion. Experiments show substantial reductions in temporal geometric artifacts over strong baselines while preserving perceptual quality. Code and model weights are published.

Chinese Translation

生成几何一致的视频仍然是一个未解决的挑战：基于网络规模数据训练的文本到视频扩散模型仅隐式地处理几何，导致物体变形、纹理漂移以及在相机运动下的非刚性背景。现有解决方案要么将一致性作为副产品进行改进，要么仅适用于静态场景，或者完全重新对齐模型的潜在空间。我们引入了一种几何一致性奖励，直接测量生成视频中的运动是否与一致的场景兼容。我们的关键见解是，在物理一致的视频中，背景运动应可通过刚性相机引起的流动来解释，而独立移动的物体在运动轨迹上应保持外观身份。我们利用光流、深度-姿态预测和基于特征的对应关系来操作化这一点，以分离刚性和动态区域并评估它们各自的一致性。将这一奖励与强化微调相结合，将几何一致性从一种突现特性转变为视频生成器的显性优化目标。该方法与模型无关，适用于包含相机和物体运动的多样动态场景。实验表明，在保持感知质量的同时，时间上的几何伪影显著减少，超越了强基线。代码和模型权重已发布。
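
The notion of "rigid camera-induced flow" in the abstract can be made concrete with standard pinhole geometry: back-project a pixel using its depth, apply the relative camera pose, and re-project. The sketch below assumes a pinhole camera and a pose that maps frame-1 camera coordinates to frame-2 camera coordinates; the function names and the residual used as a consistency signal are illustrative assumptions, not the paper's reward.

```python
import math


def rigid_flow(u, v, depth, intrinsics, rotation, translation):
    """Flow (du, dv) that a static scene point would show under camera motion.

    intrinsics  -- (fx, fy, cx, cy) of an assumed pinhole camera
    rotation    -- 3x3 nested list, frame-1 to frame-2 camera coordinates
    translation -- length-3 list, same convention
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a 3-D point in the first camera frame.
    point = [(u - cx) * depth / fx, (v - cy) * depth / fy, depth]
    # Move the point into the second camera frame.
    moved = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if moved[2] <= 0:
        raise ValueError("point is behind the second camera")
    # Re-project and take the pixel displacement.
    u2 = fx * moved[0] / moved[2] + cx
    v2 = fy * moved[1] / moved[2] + cy
    return (u2 - u, v2 - v)


def flow_residual(observed, predicted):
    """Euclidean gap between observed optical flow and the rigid prediction."""
    return math.hypot(observed[0] - predicted[0], observed[1] - predicted[1])
```

A small residual on background pixels indicates motion explainable by the camera alone; a large one suggests either an independently moving object or a geometric artifact.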

cs.CV / 272 / 2605.18390

Vision Foundation Models as Generalist Tokenizers for Image Generation

视觉基础模型作为图像生成的通用标记器

Zheng, Anlin, Han, Qi, Wen, Xin, Ma, Chuofan, Gong, Lanxi, Yu, Gang, Zhang, Xiangyu, Qi, Xiaojuan

Abstract

In this work, we explore the largely unexplored direction of building a generalist image tokenizer directly on top of a frozen vision foundation model (VFM). To build this tokenizer, we utilize a frozen VFM as the encoder and introduce two key innovations: (1) a region-adaptive quantization framework to eliminate spatial redundancy in standard 2D grid features, and (2) a semantic reconstruction objective that aligns the decoded outputs with the VFM's representations to preserve semantic fidelity. Grounded in these designs, we propose VFMTok, a generalist visual tokenizer capable of operating seamlessly in both discrete and continuous latent spaces. VFMTok achieves substantial improvements in synthesis quality while drastically enhancing token efficiency. For discrete autoregressive (AR) generation, it accelerates model convergence by 3 times and achieves a state-of-the-art gFID of 1.36 on ImageNet class-conditional synthesis. Similarly, for continuous-space generation, integrating VFMTok with a denoising model yields an exceptional gFID of 1.25. Furthermore, because the latent space inherently captures rich spatial semantics, VFMTok enables high-fidelity class-conditional synthesis without classifier-free guidance (w/o CFG) across both generative paradigms, significantly accelerating inference speed. Beyond these remarkable empirical results, we systematically investigate the underlying mechanisms of our approach. We discover that the specific self-supervised learning objectives utilized during VFM pre-training dictate its effectiveness as a tokenizer. Specifically, a VFM jointly optimized with global contrastive learning and latent masked image modeling provides the optimal representations for image tokenization. These insights establish a strong foundation and offer valuable guidance for the design of future image tokenizers.

Chinese Translation

在本研究中，我们探索了在冻结的视觉基础模型（VFM）之上直接构建通用图像标记器这一尚未充分探索的方向。为了构建这个标记器，我们利用冻结的VFM作为编码器，并引入了两个关键创新：（1）区域自适应量化框架，以消除标准二维网格特征中的空间冗余；（2）语义重构目标，使解码输出与VFM的表示对齐，以保持语义的真实性。在这些设计的基础上，我们提出了VFMTok，这是一种能够在离散和连续潜在空间中无缝操作的通用视觉标记器。VFMTok在合成质量上取得了显著提升，同时大幅提高了标记效率。对于离散自回归（AR）生成，它加速了模型收敛速度，提升了3倍，并在ImageNet类别条件合成中达到了1.36的最先进gFID。同样，对于连续空间生成，将VFMTok与去噪模型结合，获得了1.25的卓越gFID。此外，由于潜在空间本质上捕捉了丰富的空间语义，VFMTok在两个生成范式中都能实现高保真度的类别条件合成，而无需分类器自由引导（w/o CFG），显著加快了推理速度。除了这些显著的实证结果外，我们还系统地研究了我们方法的潜在机制。我们发现，在VFM预训练期间使用的特定自监督学习目标决定了其作为标记器的有效性。具体而言，与全局对比学习和潜在掩蔽图像建模共同优化的VFM提供了图像标记化的最佳表示。这些见解为未来图像标记器的设计奠定了坚实的基础，并提供了宝贵的指导。

cs.CV / 273 / 2605.18396

NEWTON: Agentic Planning for Physically Grounded Video Generation

NEWTON：面向物理基础的视频生成的自主规划

Feng, Yuxiang, Wang, Juncheng, Xu, Chao, Qian, Yijie, Wang, Huihan, Hou, Wenlong, Liu, Yang, Sun, Baigui, Liu, Yong, Wang, Shujun

Abstract

Video generation models produce visually compelling results but systematically violate physical commonsense: on VideoPhy-2, the best model achieves only 32.6% joint accuracy. We identify a specification bottleneck. Text prompts are a lossy compression of the physical world, omitting the parameters that fully determine dynamics, and no amount of model scaling can recover what was never specified. From this diagnosis we derive three properties that physics conditioning must satisfy (sufficiency, dynamism, and verifiability) and show that no existing approach satisfies all three. We present NEWTON, in which video generation is demoted from the system output to one action inside an agent's toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning, and a verifier closes the loop for iterative re-planning. The planner is the sole trainable component, optimized on-policy via Flow-GRPO inside the live multi-turn loop. On VideoPhy-2, NEWTON improves joint accuracy from 21.4% to 29.7% on LTX-Video and from 30.7% to 37.4% on Veo-3.1, without modifying either generator. Our project page: https://Newton026.github.io/newton

Chinese Translation

视频生成模型产生了视觉上引人注目的结果，但系统性地违反了物理常识——在 VideoPhy-2 数据集上，最佳模型的联合准确率仅为 32.6%。我们识别出一个规范瓶颈：文本提示是对物理世界的有损压缩，省略了完全决定动态的参数，而无论模型规模如何扩大，都无法恢复那些从未被指定的内容。基于这一诊断，我们推导出物理条件必须满足的三个属性——充分性、动态性和可验证性——并展示了现有方法均无法同时满足这三者。我们提出了 NEWTON，其中视频生成被降级为代理工具箱中的一个动作：一个学习的规划器协调物理感知工具（关键帧生成、科学计算、提示优化）以构建丰富的条件，而一个验证器则为迭代重新规划闭合回路。规划器是唯一可训练的组件，通过 Flow-GRPO 在实时多轮循环中按策略优化。在 VideoPhy-2 上，NEWTON 将 LTX-Video 的联合准确率从 21.4% 提高到 29.7%，将 Veo-3.1 的准确率从 30.7% 提高到 37.4%，而未修改任何生成器。我们的项目页面：https://Newton026.github.io/newton

cs.CV / 274 / 2605.18408

Historical Knowledge Graphs for Global Maritime Estimated Time of Arrival

全球海洋预计到达时间的历史知识图谱

Dimitriou, Neofytos

Abstract

Accurate vessel estimated-time-of-arrival forecasts are critical for port operations and decarbonization, yet global-scale travel-time prediction remains difficult without costly contextual data. Herein, I present a methodology for constructing a historical maritime knowledge graph using only Automatic Identification System (AIS) data. First, segmented trajectories are extracted from noisy AIS data using a Gaussian-mixture-model-based preprocessing pipeline. The graph is then constructed by iteratively processing the trajectories and storing speed distributions stratified by vessel type, time of travel, and direction of travel; the resulting global graph comprises 5,433 geohash-3 nodes and 12,334 edges. The graph can be queried to retrieve travel-time predictions between any two locations via a hierarchical, priority-based system that uses historical statistics with principled fallback. On a temporally held-out test set, median RMSE is 22.75 min (segment-level) and 30.90 min (trajectory-level), with 69.1% of trajectories within 20% of actual arrival time. On a second external test set, median RMSE is 27.36 min (segment-level) and 37.46 min (trajectory-level), with 62.1% of trajectories within 20%. These results corroborate the promise of the method, enabling global travel-time prediction and providing a strong foundation for just-in-time arrival planning and emissions reduction.

Chinese Translation

准确的船舶预计到达时间预测对于港口运营和脱碳至关重要，但在缺乏昂贵的上下文数据的情况下，全球范围内的旅行时间预测仍然困难。在此，我提出了一种仅使用自动识别系统（Automatic Identification System, AIS）数据构建历史海洋知识图谱的方法。首先，通过基于高斯混合模型的预处理管道从嘈杂的AIS数据中提取分段轨迹。然后，通过迭代处理轨迹并存储按船舶类型、旅行时间和旅行方向分层的速度分布来构建图谱；最终生成的全球图谱包含5,433个geohash-3节点和12,334条边。该图谱可以通过一个分层的、基于优先级的系统进行查询，以检索任意两个位置之间的旅行时间预测，该系统使用历史统计数据并具有原则性的回退。在一个时间上保持的测试集上，中位数均方根误差（RMSE）为22.75分钟（分段级别）和30.90分钟（轨迹级别），69.1%的轨迹在实际到达时间的20%范围内。在第二个外部测试集上，中位数RMSE为27.36分钟（分段级别）和37.46分钟（轨迹级别），62.1%的轨迹在20%范围内。这些结果证实了我们方法的潜力，使全球旅行时间预测成为可能，并为及时到达规划和减排提供了坚实的基础。
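
One plausible reading of a "hierarchical, priority-based system ... with principled fallback" is a lookup that tries the most specific stratum first and relaxes it step by step. The sketch below is only that reading: the key layout, the fallback order, the use of a single median speed per stratum, and the function names are all assumptions, since the abstract does not spell them out.

```python
KM_PER_NAUTICAL_MILE = 1.852  # exact by definition


def lookup_speed(stats, edge, vessel_type, time_bucket, direction):
    """Return (speed_knots, fallback_level) from the most specific stratum available.

    stats maps (edge, vessel_type, time_bucket, direction) -> median speed in knots;
    None in a key position means "aggregated over that attribute" (hypothetical layout).
    """
    candidates = [
        (edge, vessel_type, time_bucket, direction),  # level 0: fully stratified
        (edge, vessel_type, None, direction),         # level 1: drop time of travel
        (edge, None, None, direction),                # level 2: drop vessel type
        (edge, None, None, None),                     # level 3: edge-wide statistic
    ]
    for level, key in enumerate(candidates):
        if key in stats:
            return stats[key], level
    return None, None


def travel_time_hours(distance_km, speed_knots):
    """Convert an edge length and a speed in knots to hours of travel."""
    return distance_km / (speed_knots * KM_PER_NAUTICAL_MILE)
```

Returning the fallback level alongside the speed makes it possible to report how often a prediction relied on coarse statistics, which is useful when interpreting error on sparsely travelled edges.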

cs.CV / 275 / 2605.18413

Cracks in the Foundation: A Civil Infrastructure Dataset to Challenge Vision Foundation Models

基础中的裂缝：一个挑战视觉基础模型的土木基础设施数据集

Farronato, Nicola, Avogaro, Niccolo, Frick, Thomas, Rigotti, Mattia, Khan, Rizwan Ullah, Magno, Michele, Schindler, Konrad, Malossi, Cristiano, Scheidegger, Florian

Abstract

Automated structural health monitoring is essential to prevent catastrophic infrastructure failures. Precise, pixel-level defect segmentation is needed to accurately assess structural integrity, but progress in defect segmentation for civil infrastructures has been held back by an extreme scarcity of data, which requires costly expert annotation. The need for data is accentuated by algorithmic hurdles intrinsic to the problem, including center-bias and the need to rely more on shape when inspecting nearly textureless building materials. To remove the bottleneck, we introduce Cracks in the Foundation (CiF), the largest and most detailed civil infrastructure (instance) segmentation dataset to date, comprising approximately 150,000 high-resolution images meticulously curated over five years in collaboration with civil engineering experts. With the help of this unprecedented data source, we expose a blind spot of current visual AI: despite the advent of promptable Foundation Models (FMs) and Vision Language Models (VLMs), and despite the impressive abilities of today's specialised segmentation models, it turns out that dense image understanding in the built environment is nowhere near solved. Our evaluations indicate that even the most recent zero-shot FMs face significant challenges when deployed on real-world infrastructure and even the performance of specialised models with domain-specific supervision plateaus at approximately 25% mAP. CiF establishes inspection of civil infrastructure, an elementary and seemingly easy perceptual task, as an open challenge that reveals fundamental weaknesses of present-day models trained predominantly on internet images, literally and figuratively highlighting cracks in the current foundation model paradigm.

Chinese Translation

自动化结构健康监测对于防止灾难性的基础设施故障至关重要。精确的像素级缺陷分割对于准确评估结构完整性是必要的，但由于缺乏数据，尤其是需要昂贵的专家标注，土木基础设施的缺陷分割进展受到限制。数据需求因问题固有的算法障碍而加剧，包括中心偏差以及在检查几乎无纹理的建筑材料时更依赖形状的需要。为了解决这一瓶颈，我们推出了“基础中的裂缝”（Cracks in the Foundation, CiF），这是迄今为止最大、最详细的土木基础设施（实例）分割数据集，包含约150,000张高分辨率图像，这些图像经过五年的精心策划，并与土木工程专家合作完成。借助这一前所未有的数据源，我们揭示了当前视觉人工智能的盲点：尽管可提示的基础模型（Foundation Models, FMs）和视觉语言模型（Vision Language Models, VLMs）的出现，以及当今专业分割模型的令人印象深刻的能力，但事实证明，在建筑环境中进行密集图像理解仍远未解决。我们的评估表明，即使是最新的零-shot FMs在实际基础设施中部署时也面临重大挑战，甚至在领域特定监督下的专业模型的性能也停滞在约25%的mAP。CiF将土木基础设施的检查这一基本且看似简单的感知任务确立为一个开放挑战，揭示了当今主要基于互联网图像训练的模型的根本弱点，字面和比喻上突显了当前基础模型范式中的裂缝。

cs.CV / 276 / 2605.18419

Geometry-Aware Uncertainty Coresets for Robust Visual In-Context Learning in Histopathology

面向几何的鲁棒视觉上下文学习的不确定性核心集在组织病理学中的应用

Erick, Franciskus Xaverius, Müller, Johanna Paula, Kainz, Bernhard

Abstract

Vision-language models (VLMs) can couple visual perception with open-ended clinical reasoning, making them attractive for computational histopathology. However, fine-tuning billions of parameters on scarce, expert-annotated pathology data is prohibitive, while in-context learning (ICL), which conditions the VLM on demonstrative image-text pairs without parameter updates, suffers from high sensitivity to which examples are selected and how the query is phrased, producing unreliable diagnostics. Existing selection strategies rely on query-dependent nearest-neighbour retrieval that ignores global data structure, require costly parameter updates, or disregard the joint vision-text embedding geometry of VLMs. We propose GAUC, a training-free coreset selection method operating directly in the pre-trained multimodal embedding space. GAUC jointly optimises three objectives: (1) a Maximum Mean Discrepancy term enforcing distributional fidelity between coreset and full dataset, (2) an Effective Mutual Information Difference regulariser bounding performance degradation under prompt paraphrases by exploiting the VLM's joint vision-text alignment, and (3) a predictive-variance penalty suppressing overconfident, unstable outputs. On CRC-100K and MHIST across multiple open-source VLM architectures, GAUC consistently improves accuracy, calibration, and prompt robustness over recent ICL selection methods and dataset-distillation baselines, all without a single gradient update.

Chinese Translation

视觉-语言模型（VLMs）能够将视觉感知与开放式临床推理相结合，使其在计算组织病理学中具有吸引力。然而，在稀缺的专家注释病理数据上微调数十亿个参数是不可行的，而上下文学习（ICL）则是在不更新参数的情况下，将VLM与示例图像-文本对进行条件化，这种方法对所选择的示例和查询的措辞高度敏感，导致诊断结果不可靠。现有的选择策略依赖于查询依赖的最近邻检索，忽略了全局数据结构，要求昂贵的参数更新，或忽视了VLM的联合视觉-文本嵌入几何。我们提出了GAUC，一种在预训练多模态嵌入空间中直接操作的无训练核心集选择方法。GAUC联合优化三个目标：（1）最大均值差异（Maximum Mean Discrepancy）项，强制核心集与完整数据集之间的分布一致性；（2）有效互信息差（Effective Mutual Information Difference）正则项，通过利用VLM的联合视觉-文本对齐，限制在提示改写下的性能下降；（3）预测方差惩罚，抑制过于自信和不稳定的输出。在CRC-100K和MHIST数据集上，GAUC在多个开源VLM架构中始终提高了准确性、校准性和提示鲁棒性，优于近期的ICL选择方法和数据集蒸馏基线，且无需进行任何梯度更新。
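
The first of the three objectives, a Maximum Mean Discrepancy term between coreset and full dataset, can be illustrated with the standard biased (V-statistic) estimator under an RBF kernel. The kernel choice, the bandwidth, and the function names are assumptions for this sketch; the abstract does not state which kernel or estimator GAUC uses, and the other two objectives are not covered here.

```python
import math


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length embedding vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(coreset, dataset, bandwidth=1.0):
    """Biased estimate of squared MMD between a coreset and the full dataset."""

    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(coreset, coreset)
        + mean_kernel(dataset, dataset)
        - 2.0 * mean_kernel(coreset, dataset)
    )
```

A coreset that covers every cluster of the embedding space scores lower than one drawn from a single cluster, which is the sense in which the term enforces distributional fidelity.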

cs.CV / 277 / 2605.18431

Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models

共同观察：基于多模态大型语言模型的多机器人协作自我中心空间推理

Peng, Kunyu, Zhou, Zhikun, Yang, Kailun, Wen, Di, Liu, Ruiping, Chen, Yufan, Zheng, Junwei, Shi, Hao, Zhou, Yi, Sarfraz, M. Saquib, Paudel, Danda Pani, Van Gool, Luc

Abstract

Multimodal Large Language Models (MLLMs) have made substantial progress in egocentric video understanding, but their ability to reason cooperatively from multiple embodied viewpoints remains largely unexplored. We study this problem through multi-robot cooperative dynamic spatial reasoning, where a model must answer spatial, temporal, visibility, and coordination questions by integrating synchronized egocentric videos from a team of moving robots. To support this setting, we introduce CoopSR, the first benchmark for this task, together with EgoTeam, a multi-robot egocentric QA dataset. EgoTeam contains 114,227 QA pairs spanning 19 question types, four difficulty tiers, and three team sizes in Habitat and iGibson, along with a real-world test set of around 2,326 QAs collected using two quadruped robots. We further propose SP-CoR (Spectral and Physics-Informed Cooperative Reasoner), an MLLM framework for fine-grained cooperative spatial reasoning. SP-CoR combines dynamics-aware multi-robot frame sampling, spectral- and physics-guided view fusion, and physics-aligned prompt distillation, enabling the model to benefit from privileged robot-pose supervision during training while requiring only egocentric videos at test time. Across 22 MLLM baselines, SP-CoR consistently improves cooperative reasoning, outperforming the strongest fine-tuned baseline by +3.87% on Habitat and +7.12% on iGibson. It also shows stronger generalization to unseen team sizes and real-world robot tests. Code can be found at https://github.com/KPeng9510/seeing-together.git.

Chinese Translation

多模态大型语言模型（MLLMs）在自我中心视频理解方面取得了显著进展，但它们从多个具身视角进行协作推理的能力仍然很大程度上未被探索。我们通过多机器人协作动态空间推理研究这一问题，其中模型必须通过整合来自移动机器人团队的同步自我中心视频来回答空间、时间、可见性和协调问题。为支持这一设置，我们引入了CoopSR，这是该任务的第一个基准，以及EgoTeam，一个多机器人自我中心问答数据集。EgoTeam包含114,227个问答对，涵盖19种问题类型、四个难度等级和三种团队规模，数据来源于Habitat和iGibson，并附有约2,326个使用两只四足机器人收集的真实世界测试集。我们进一步提出了SP-CoR（谱和物理信息协作推理器），这是一个用于细粒度协作空间推理的MLLM框架。SP-CoR结合了动态感知的多机器人帧采样、谱和物理引导的视图融合，以及物理对齐的提示蒸馏，使模型在训练期间能够受益于特权的机器人姿态监督，而在测试时仅需自我中心视频。在22个MLLM基准中，SP-CoR始终提高了协作推理性能，在Habitat上比最强的微调基准提高了3.87%，在iGibson上提高了7.12%。它还在未见的团队规模和真实世界机器人测试中表现出更强的泛化能力。代码可在https://github.com/KPeng9510/seeing-together.git找到。

cs.CV / 278 / 2605.18436

A Dataset for the Recognition of Historical and Handwritten Music Scores in Western Notation

用于识别西方乐谱中的历史和手写音乐的数据库

Torras, Pau, Mayer, Jiří, Badal, Carles, Dvořáková, Martina, Vlková, Markéta Herzanová, Asbert, Gerard, Dvořák, Vojtěch, Šomorjai, Samuel, Hajič jr., Jan, Fornés, Alicia

Abstract

A large amount of musical heritage has been digitised by memory institutions: libraries, museums, and archives. Nevertheless, the field of Optical Music Recognition (OMR) has struggled with making this music machine-readable, despite advances in deep learning, mostly because no datasets for training systems in realistic conditions were available. The MusiCorpus dataset aims to remedy this situation by providing 1,309 pages of historical sheet music, primarily handwritten, with MusicXML transcriptions and symbol annotations. It is the largest dataset of handwritten music to date and the first dataset containing a realistic and representative sample of musical document collections from memory institutions, suitable for training and evaluating both end-to-end and object detection-based OMR systems and comparing their performance.

Chinese Translation

大量的音乐遗产已经被记忆机构（如图书馆、博物馆和档案馆）数字化。然而，尽管深度学习取得了进展，光学音乐识别（OMR）领域在将这些音乐转化为机器可读格式方面仍然面临挑战，主要原因是缺乏在现实条件下训练系统的数据集。MusiCorpus 数据集旨在通过提供 1,309 页历史乐谱（主要为手写形式），以及 MusicXML 转录和符号注释，来改善这一状况。该数据集是迄今为止最大的手写音乐数据集，也是第一个包含来自记忆机构的音乐文档集合的真实且具有代表性的样本的数据集，适用于训练和评估基于端到端和基于目标检测的 OMR 系统，并比较其性能。

cs.CV / 279 / 2605.18445

What is Holding Back Latent Visual Reasoning?

是什么阻碍了潜在视觉推理？

Viveiros, André G., Gonçalves, Nuno, Martins, André F. T., Lindemann, Matthias

Abstract

Humans can approach complex visual problems by mentally simulating intermediate visual steps, rather than reasoning through language alone. Inspired by this, several works on Vision-Language Models have recently explored chain-of-thought reasoning with continuous latent tokens as intermediate visual imagination steps. In this work, we investigate how recent models leverage such latent tokens. Surprisingly, we find that model accuracy is unaffected when latent tokens are replaced by uninformative "dummy" tokens. This indicates that latent tokens play a minimal causal role in the model's final prediction. To better understand this phenomenon, we analyze both the training signal provided by oracle latent representations and the quality of the latent tokens generated at inference time. Our experiments reveal two crucial issues holding back latent visual reasoning. First, in most existing datasets, oracle latent tokens provide limited additional information beyond the original image and do not substantially simplify the task, leading models to ignore them during training and effectively bypass them at inference time. When fine-tuned on a diagnostic dataset, in which latent tokens provide sufficient support for the final prediction, we show that models can causally rely on them. Second, the latent tokens produced at inference time deviate from their corresponding oracle representations, collapsing to a narrow region and preventing benefits even when the model relies on them. Overall, our findings suggest that future progress in latent visual reasoning depends on two key pillars: high-quality datasets with informative intermediate steps and more precise latent token prediction.

Chinese Translation

人类可以通过心理模拟中间视觉步骤来处理复杂的视觉问题，而不仅仅依赖语言推理。受到此启发，最近一些关于视觉-语言模型的研究探索了使用连续潜在标记作为中间视觉想象步骤的链式推理。在本研究中，我们调查了最近的模型如何利用这些潜在标记。令人惊讶的是，我们发现当潜在标记被无信息的“虚拟”标记替代时，模型的准确性并未受到影响。这表明潜在标记在模型最终预测中起到的因果作用极小。为了更好地理解这一现象，我们分析了由oracle潜在表示提供的训练信号以及推理时生成的潜在标记的质量。我们的实验揭示了阻碍潜在视觉推理的两个关键问题：首先，在大多数现有数据集中，oracle潜在标记提供的额外信息有限，无法显著简化任务，导致模型在训练过程中忽视它们，并在推理时有效绕过它们。当在一个诊断数据集上进行微调时，该数据集中的潜在标记为最终预测提供了足够的支持，我们展示了模型可以因果地依赖它们。其次，推理时生成的潜在标记与其对应的oracle表示偏离，收敛到一个狭窄的区域，即使模型依赖它们也无法获得好处。总体而言，我们的发现表明，潜在视觉推理的未来进展依赖于两个关键支柱：具有信息丰富的中间步骤的高质量数据集和更精确的潜在标记预测。

cs.CV / 280 / 2605.18447

NeRF-based Spacecraft Reconstruction from Close-Range Monocular Imagery Under Illumination Variability and Pose Uncertainty

基于NeRF的近距离单目图像下航天器重建：应对照明变化和姿态不确定性

Legrand, Antoine, Detry, Renaud, De Vleeschouwer, Christophe

Abstract

Autonomous rendezvous and proximity operations around uncooperative, unknown spacecraft are critical for active debris removal and on-orbit servicing missions. A key component of such operations is the offline reconstruction of a 3D model of the target from a set of 2D images. This task is challenging due to two main factors. First, in-orbit illumination conditions exhibit considerable variability and change rapidly over time. Second, the inaccuracy of pose information in the images results in 3D reconstruction uncertainty. To overcome these challenges, we propose to extend Neural Radiance Fields (NeRF) with per-image degrees of freedom: a learnable appearance embedding that captures the illumination conditions specific to each image, and an image-specific pose correction term that refines its noisy pose label to increase 3D consistency across images. These parameters add minimal complexity, as they are learned jointly with the NeRF, yet they substantially improve robustness to illumination variability and pose inaccuracies. We validate our approach on three image sets representative of in-orbit operations, demonstrating its effectiveness for offline reconstruction and highlighting its suitability for online reconstruction, an open problem in the field.

Chinese Translation

自主对接和在不合作、未知航天器周围的近距离操作对于主动去除太空垃圾和在轨服务任务至关重要。这类操作的一个关键组成部分是从一组二维图像中离线重建目标的三维模型。由于两个主要因素，这项任务具有挑战性。首先，在轨道上的照明条件变化显著，并且随时间迅速变化。其次，图像中姿态信息的不准确性导致三维重建的不确定性。为了克服这些挑战，我们提出扩展神经辐射场（Neural Radiance Fields, NeRF），引入每幅图像的自由度：一个可学习的外观嵌入，用于捕捉特定于每幅图像的照明条件，以及一个图像特定的姿态修正项，用于精炼其噪声姿态标签，以提高图像间的三维一致性。这些参数增加的复杂性很小，因为它们与NeRF共同学习，但它们显著提高了对照明变化和姿态不准确性的鲁棒性。我们在三组代表在轨操作的图像集上验证了我们的方法，展示了其在离线重建中的有效性，并强调了其在在线重建中的适用性，这是该领域的一个开放问题。

cs.CV / 281 / 2605.18451

Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis

代码即房间：通过代理代码合成从俯视图图像生成3D房间

Yang, Yixuan, Luo, Zhen, Gan, Wanshui, Hao, Jinkun, Lu, Junru, Yan, Jinghao, Lyu, Zhaoyang, Xu, Xudong

Abstract

Designing realistic and functional 3D indoor rooms is essential for a wide range of applications, including interior design, virtual reality, gaming, and embodied AI. While recent MLLM-based approaches have shown great potential for 3D room synthesis from textual descriptions or reference images, text-based methods struggle to capture precise spatial information, and existing image-conditioned agents suffer from instability and infinite looping when tasked with holistic room generation from top-down views. To address these limitations, we propose Code-as-Room, an MLLM-based agentic framework equipped with a structured execution harness, which represents 3D rooms as Blender code. Given a top-down room image, the framework parses the reference image to extract scene elements and their spatial relationships, and synthesizes executable Blender code for geometry, materials, and lighting in a principled, multi-stage pipeline. A cross-stage memory module is maintained throughout to mitigate the context forgetting inherent to existing agent-based frameworks. We further introduce a dedicated benchmark for code-based 3D room synthesis, encompassing various evaluation protocols. Based on our benchmark, comprehensive comparisons against existing agent-based methods are conducted to validate the effectiveness of our proposed execution harness.

Chinese Translation

设计逼真且功能齐全的3D室内房间对于室内设计、虚拟现实、游戏和具身人工智能等广泛应用至关重要。尽管最近基于多模态大语言模型（MLLM）的方法在从文本描述或参考图像合成3D房间方面显示出巨大潜力，但基于文本的方法在捕捉精确空间信息方面存在困难，而现有的图像条件代理在从俯视图整体生成房间时则面临不稳定性和无限循环的问题。为了解决这些局限性，我们提出了代码即房间（Code-as-Room），这是一个基于MLLM的代理框架，配备了结构化执行工具，使用Blender代码表示3D房间。给定一幅俯视房间图像，该框架解析参考图像以提取场景元素及其空间关系，并在一个原则性、多阶段的流程中合成可执行的Blender代码，用于几何体、材料和照明。整个过程中维护一个跨阶段记忆模块，以减轻现有基于代理框架固有的上下文遗忘问题。我们进一步引入了一个专门的基准，用于基于代码的3D房间合成，涵盖各种评估协议。基于我们的基准，进行了与现有基于代理的方法的全面比较，以验证我们提出的执行工具的有效性。

cs.CV / 282 / 2605.18464

PERL: Parameter Efficient Reasoning in CLIP Latent Space

PERL：CLIP潜在空间中的参数高效推理

Carnemolla, Simone, Calcagno, Salvatore, Giordano, Daniela, Spampinato, Concetto, Pennisi, Matteo

Abstract

Contrastively trained vision-language models such as CLIP provide strong zero-shot transfer by aligning images and text in a shared embedding space. However, adapting these models to downstream tasks without degrading their open-vocabulary generalization remains challenging. Existing parameter-efficient adaptation methods typically improve task specialization through learned prompts, adapters, or multimodal transformations, where adaptation capacity is primarily expressed through additional trainable parameters. Inspired by recent latent reasoning methods in language models, we investigate a complementary perspective: can adaptation emerge from iterative reasoning on latent representations rather than from increasing parameter count alone? We introduce PERL (Parameter-Efficient Reasoning in CLIP Latent Space), a lightweight adaptation framework that augments a frozen CLIP model with a compact shared reasoning module applied recurrently across refinement steps. At each step, PERL generates a latent reasoning token conditioned on the current representation and injects it into an intermediate encoder layer, progressively refining higher-level semantic representations while preserving CLIP's pretrained multimodal structure. Across 15 benchmarks spanning base-to-novel generalization, cross-dataset transfer, and out-of-distribution ImageNet variants, PERL achieves the best parameter-performance trade-off among the compared methods under a fast-adaptation few-shot setting, combining strong novel-class accuracy and competitive transfer performance with only about 6K trainable parameters, up to 817x fewer than the largest compared approach. Overall, our results suggest that iterative latent reasoning provides a complementary adaptation mechanism to parameter scaling in discriminative vision-language models.

Chinese Translation

对比训练的视觉-语言模型，如CLIP，通过在共享嵌入空间中对齐图像和文本，提供了强大的零样本迁移能力。然而，在不降低其开放词汇泛化能力的情况下，将这些模型适应于下游任务仍然具有挑战性。现有的参数高效适应方法通常通过学习的提示、适配器或多模态变换来提高任务专业化，其中适应能力主要通过额外的可训练参数来表达。受到最近语言模型中潜在推理方法的启发，我们探讨了一种互补的视角：适应是否可以通过对潜在表示的迭代推理而不是单纯增加参数数量来实现？我们提出了PERL（CLIP潜在空间中的参数高效推理），这是一个轻量级适应框架，通过在每次细化步骤中反复应用的紧凑共享推理模块来增强一个冻结的CLIP模型。在每一步中，PERL生成一个条件于当前表示的潜在推理令牌，并将其注入到中间编码器层中，逐步细化更高层次的语义表示，同时保留CLIP的预训练多模态结构。在涵盖基础到新颖泛化、跨数据集迁移和超出分布的ImageNet变体的15个基准测试中，PERL在快速适应的少样本设置下，在比较的方法中实现了最佳的参数-性能权衡，结合了强大的新类别准确性和具有竞争力的迁移性能，仅使用约6K的可训练参数，比最大的比较方法少多达817倍。总体而言，我们的结果表明，迭代潜在推理为判别性视觉-语言模型提供了一种互补的适应机制，而不是单纯依赖参数扩展。

cs.CV / 283 / 2605.18466

Speech-Guided Multimodal Learning for Vocal Tract Segmentation in Real-Time MRI

基于语音引导的多模态学习用于实时MRI中的声道分割

Liu, Daiqi, Mulzer, Lukas, Hasan, Md, de Castro, Nyvenn, Xing, Fangxu, Kang, Xingjian, Ye, Chengze, Mei, Siyuan, Sun, Yipeng, Arias-Vergara, Tomás, Hutter, Jana, Woo, Jonghye, Maier, Andreas, Pérez-Toro, Paula Andrea

Abstract

Segmenting vocal tract articulators in real-time MRI (rtMRI) is a challenging dynamic image segmentation problem characterized by low contrast, rapid motion, and limited spatial resolution. However, while rtMRI acquisitions may provide synchronized acoustic signals, existing methods discard this information, and the few multimodal approaches that incorporate audio cannot be deployed when audio is unavailable. We propose a three-stage framework that leverages acoustic and phonological supervision during training while requiring only the rtMRI image at inference: phonological representations are converted into spatial bounding-box priors for articulator localization, visual and acoustic encoders are aligned via dual-level cross-modal contrastive pretraining, and the learned representations are fused through a cross-attention decoder, effectively transferring multimodal knowledge into a single-modality inference pipeline. Evaluated on 75-Speaker Annot-16 and USC-TIMIT datasets, our method outperforms existing unimodal and multimodal methods, demonstrating that multimodal supervision provides transferable benefits for precise and clinically deployable vocal tract segmentation.

Chinese Translation

在实时MRI（rtMRI）中分割声道发音器是一项具有挑战性的动态图像分割问题，其特点是低对比度、快速运动和有限的空间分辨率。然而，尽管rtMRI采集可能提供同步的声学信号，现有方法却忽略了这一信息，而少数结合音频的多模态方法在音频不可用时无法部署。我们提出了一个三阶段框架，在训练过程中利用声学和音韵监督，同时在推理时仅需rtMRI图像：音韵表示被转换为用于发音器定位的空间边界框先验，通过双层跨模态对比预训练对视觉和声学编码器进行对齐，学习到的表示通过跨注意力解码器融合，有效地将多模态知识转移到单模态推理管道中。在75-Speaker~Annot-16和USC-TIMIT数据集上的评估表明，我们的方法优于现有的单模态和多模态方法，证明了多模态监督为精确且可临床部署的声道分割提供了可转移的优势。

cs.CV / 284 / 2605.18467

InstructAV2AV: Instruction-Guided Audio-Video Joint Editing

InstructAV2AV：基于指令的音视频联合编辑

Zheng, Haojie, Yang, Yixin, Yang, Siqi, Weng, Shuchen, Shi, Boxin

Abstract

Recent diffusion-based methods have achieved impressive progress in video content manipulation. However, they typically ignore the accompanying audio, leaving the audio disjointed from the edited results. In this paper, we propose InstructAV2AV, the first end-to-end framework for instruction-guided audio-video joint editing. We first develop a scalable data synthesis pipeline and construct InsAVE-80K, the first large-scale audio-video editing dataset with high-quality source-to-target pairs. With this data foundation, we adapt an audio-video generation backbone to leverage its robust priors. We concatenate the audio-video input with noisy latent codes to anchor the source context, propose the source-instruction gated attention to improve instruction following and content preservation, and introduce a two-stage training strategy to effectively transfer these pre-trained priors. Extensive experiments demonstrate that InstructAV2AV outperforms state-of-the-art methods across 11 metrics spanning three aspects on two evaluation sets, highlighting its potential for controllable content creation. Project page: https://hjzheng.net/projects/InstructAV2AV/.

Chinese Translation

近年来，基于扩散的方法在视频内容操作方面取得了显著进展。然而，这些方法通常忽视了伴随的音频，导致音频与编辑结果脱节。本文提出了InstructAV2AV，这是第一个用于基于指令的音视频联合编辑的端到端框架。我们首先开发了一个可扩展的数据合成管道，并构建了InsAVE-80K，这是第一个具有高质量源到目标对的大规模音视频编辑数据集。在此数据基础上，我们调整了音视频生成主干网络，以利用其强大的先验知识。我们将音视频输入与噪声潜在编码连接，以锚定源上下文，提出了源-指令门控注意力机制，以改善指令遵循和内容保留，并引入了两阶段训练策略，以有效转移这些预训练的先验知识。大量实验表明，InstructAV2AV在两个评估集的三个方面的11个指标上均超越了最先进的方法，突显了其在可控内容创作中的潜力。项目页面：https://hjzheng.net/projects/InstructAV2AV/


cs.CV / 285 / 2605.18491

Benchmarking transferability of SSL pretraining to same and different modality segmentation tasks

评估自监督学习（SSL）预训练在同一和不同模态分割任务中的可转移性

Jiang, Jue, Veeraraghavan, Harini

Abstract

Methods: Nine SSL methods spanning four pretext-task families were pretrained from scratch using the same 10,412 3D CT scans (1.89M 2D axial slices) covering varied disease sites. The pretrained Swin Transformer encoder from each method was integrated into a SwinUNETR-style segmentation network (Swin encoder with a 3D CNN decoder and skip connections) and fine-tuned on nine public segmentation tasks of varying complexity, including large abdominal organs, head-and-neck structures, and tumors from CT and MRI. Performance was assessed using the Dice similarity coefficient (DSC). Fine-tuning convergence speed, transferability across modalities (CT-to-MRI), and feature-reuse patterns between few- and many-shot fine-tuning were further analyzed using centered kernel alignment. Results: Self-distilled masked image transformer (SMIT), which combines masked image modeling (MIM) with local and global self-distillation, achieved the highest overall segmentation accuracy across the nine tasks, the fastest fine-tuning convergence, and the smallest few-shot-to-many-shot performance gap, indicating the strongest data efficiency. SMIT also showed the most consistent feature-reuse patterns between few- and many-shot fine-tuning. MIM-based SimMIM and self-distillation methods (DINO, iBOT) outperformed contrastive learning and rotation prediction, which rely on image-level global representations. Differences between SSL methods were largest in the few-shot setting and narrowed as the size of the labeled fine-tuning dataset increased, indicating that the choice of SSL pretraining matters most under limited annotation budgets.

Chinese Translation

方法：使用相同的10,412个3D CT扫描（1.89M 2D轴切片），涵盖不同疾病部位，从零开始预训练了九种SSL方法，涉及四个预文本任务家族。每种方法的预训练Swin Transformer编码器被集成到SwinUNETR风格的分割网络中（Swin编码器与3D CNN解码器及跳跃连接），并在九个复杂性各异的公共分割任务上进行了微调，这些任务包括大腹部器官、头颈结构以及来自CT和MRI的肿瘤。使用Dice相似系数（DSC）评估性能。进一步分析了微调收敛速度、跨模态转移能力（CT到MRI）以及在少量样本和大量样本微调之间的特征重用模式，采用中心核对齐方法。结果：自蒸馏掩蔽图像变换器（SMIT）结合了掩蔽图像建模（MIM）与局部和全局自蒸馏，在九个任务中实现了最高的整体分割准确率，最快的微调收敛速度，以及最小的少量样本与大量样本性能差距，表明其数据效率最强。SMIT在少量样本与大量样本微调之间也展现了最一致的特征重用模式。基于MIM的SimMIM和自蒸馏方法（DINO, iBOT）优于依赖于图像级全局表示的对比学习和旋转预测。在少量样本设置中，SSL方法之间的差异最大，随着标注微调数据集规模的增加而缩小，表明在有限的标注预算下，SSL预训练的选择尤为重要。
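The two measurements named in this abstract, the Dice similarity coefficient and centered kernel alignment (CKA), can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names are hypothetical, and the linear form of CKA is an assumption, since the abstract does not say which kernel was used.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps keeps the ratio defined when both masks are empty.
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_samples, n_features).

    Assumption: the linear-kernel variant; the two matrices may have
    different feature widths but must share the same samples (rows).
    """
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)
```

A value of linear CKA near 1 between the same layer's features before and after fine-tuning would indicate that the pretrained features were largely reused; values near 0 would indicate that they were overwritten.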


cs.CV / 286 / 2605.18507

Weakly Supervised Cross-Modal Learning for 4D Radar Scene Flow Estimation

弱监督跨模态学习用于4D雷达场景流估计

Fu, Jingyun, Xiang, Zhiyu, Zhao, Na

Abstract

Due to the difficulty of obtaining ground-truth data for 4D radar scene flow estimation, previous methods typically rely on either self-supervised losses or cross-modal supervision using 3D LiDAR data, 2D images, and odometry. However, self-supervised approaches often yield suboptimal results due to radar's inherently low-fidelity measurements, while existing cross-modal supervised methods introduce complex multi-task architectures and require costly LiDAR sensors to generate pseudo radar scene flow labels from pretrained 3D tracking models. To overcome these limitations, we propose a task-specific iterative framework for weakly supervised radar scene flow learning, using only images and odometry for auxiliary supervision during training. Specifically, we establish two novel instance-aware self-supervised losses by exploiting off-the-shelf 2D tracking and segmentation algorithms to obtain tracked instance masks, which are back-projected into 3D space to provide instance-level semantic guidance; for static regions, we integrate vehicle odometry with radar's intrinsic motion cues to construct a rigid static loss. Extensive experiments on the real-world View-of-Delft (VoD) dataset demonstrate that our method not only surpasses state-of-the-art cross-modal supervised approaches that rely on 3D multi-object tracking on dense LiDAR point clouds but also outperforms existing fully supervised scene flow estimation methods. The code is open-sourced at https://github.com/FuJingyun/IterFlow.

Chinese Translation

由于获取4D雷达场景流估计的真实数据的困难，之前的方法通常依赖于自监督损失或使用3D LiDAR数据、2D图像和里程计的跨模态监督。然而，自监督方法由于雷达固有的低保真度测量，往往产生次优结果，而现有的跨模态监督方法则引入复杂的多任务架构，并需要昂贵的LiDAR传感器从预训练的3D跟踪模型生成伪雷达场景流标签。为了克服这些限制，我们提出了一种特定任务的迭代框架，用于弱监督雷达场景流学习，在训练过程中仅使用图像和里程计作为辅助监督。具体而言，我们通过利用现成的2D跟踪和分割算法获得跟踪实例掩膜，建立了两个新颖的实例感知自监督损失，这些掩膜被反投影到3D空间中以提供实例级语义指导；对于静态区域，我们将车辆里程计与雷达的内在运动线索结合，构建了一个刚性静态损失。在真实世界的Delft视图（View-of-Delft, VoD）数据集上的大量实验表明，我们的方法不仅超越了依赖于密集LiDAR点云的3D多目标跟踪的最先进的跨模态监督方法，而且在现有的全监督场景流估计方法中也表现优异。代码已开源，地址为 https://github.com/FuJingyun/IterFlow


cs.CV / 287 / 2605.18522

Beyond Morphology: Quantifying the Diagnostic Power of Color Features in Cancer Classification

超越形态学：量化颜色特征在癌症分类中的诊断能力

Kheiri, Farnaz, Rahnamayan, Shahryar, Makrehchi, Masoud

Abstract

In histopathology, human experts primarily rely on color as a means of enhancing contrast to interpret tissue morphology, whereas machine vision models process color as raw statistical information. This distinction raises a fundamental question: to what extent can pixel intensity alone, independent of structural and morphological cues, support cancer classification? To address this question, we systematically evaluated the standalone discriminative power of global color features while deliberately excluding all morphological information. Specifically, we extracted statistical color moments and discretized RGB and HSV color histograms, and assessed their performance across ten diverse experimental settings using classical machine learning classifiers. Our results demonstrate that color features alone can achieve strong performance in binary diagnostic tasks (e.g., benign versus malignant), with classification accuracies reaching up to 89%. This performance is likely attributable to global chromatic shifts associated with malignancy. Importantly, these simple color-based representations consistently outperformed random baselines by a substantial margin, indicating that raw color distributions encode a non-random and diagnostically relevant signal for cancer detection. Consequently, this study suggests that simple, computationally efficient color features can serve as an effective pre-screening tool. By identifying samples with strong chromatic indicators of malignancy, these lightweight models could function as a first-pass triage system, reducing the computational burden on complex deep learning architectures.

Chinese Translation

在组织病理学中，人类专家主要依赖颜色作为增强对比度的手段来解释组织形态，而机器视觉模型则将颜色视为原始统计信息。这一区别引发了一个基本问题：仅凭像素强度，独立于结构和形态线索，能够在多大程度上支持癌症分类？为了解决这个问题，我们系统地评估了全球颜色特征的独立区分能力，同时故意排除了所有形态信息。具体而言，我们提取了统计颜色矩，并对RGB和HSV颜色直方图进行了离散化，使用经典机器学习分类器评估了它们在十个不同实验设置中的表现。我们的结果表明，仅凭颜色特征在二元诊断任务中（例如良性与恶性）可以取得强劲的表现，分类准确率高达89%。这种表现可能归因于与恶性相关的全局色彩变化。重要的是，这些简单的基于颜色的表示在性能上始终显著优于随机基线，表明原始颜色分布编码了一个非随机且具有诊断相关性的信号，用于癌症检测。因此，本研究建议，简单且计算效率高的颜色特征可以作为有效的预筛选工具。通过识别具有强烈恶性色彩指示的样本，这些轻量级模型可以作为第一轮分诊系统，减轻复杂深度学习架构的计算负担。
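The global color descriptors described in this abstract (statistical color moments plus discretized per-channel histograms) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' pipeline: the function name, the choice of 8 bins, and the cube-root form of the third moment are hypothetical, and only RGB is covered (the HSV histograms mentioned in the abstract would need an additional color-space conversion).

```python
import numpy as np

def color_features(image, bins=8):
    """Global color descriptor for an H x W x 3 uint8 RGB image.

    Returns 9 color moments (mean, std, skew per channel) followed by
    3 * bins normalized histogram values. No spatial or morphological
    information is kept: pixels are pooled over the whole image.
    """
    pixels = np.asarray(image, dtype=np.float64).reshape(-1, 3)
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)
    # Cube root of the third central moment, so it stays in intensity units.
    skew = np.cbrt(((pixels - mean) ** 3).mean(axis=0))
    hists = []
    for channel in range(3):
        counts, _ = np.histogram(pixels[:, channel], bins=bins, range=(0, 256))
        hists.append(counts / counts.sum())  # normalize to a distribution
    return np.concatenate([mean, std, skew, *hists])
```

The resulting fixed-length vector (33 values with the default 8 bins) could then be passed to any classical classifier, which is the kind of lightweight first-pass model the abstract has in mind.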


cs.CV / 288 / 2605.18541

LESSViT: Robust Hyperspectral Representation Learning under Spectral Configuration Shift

LESSViT：在光谱配置变化下的鲁棒性高光谱表示学习

Si, Haozhe, Wan, Yuxuan, Wang, Yuqing, Do, Minh, Zhao, Han

Abstract

Modeling hyperspectral imagery (HSI) across different sensors presents a fundamental challenge due to variations in wavelength coverage, band sampling, and channel dimensionality. As a result, models trained under a fixed spectral configuration often fail to generalize to other sensors. Existing Vision Transformer (ViT) approaches either rely on implicit spectral modeling with fixed channel assumptions or adopt explicit spatial-spectral attention with prohibitive computational cost, leading to a fundamental trade-off between efficiency and expressiveness. In this work, we introduce Low-rank Efficient Spatial-Spectral ViT (LESSViT), a sensor-flexible architecture for cross-spectral generalization. LESSViT is built on LESS Attention, a structured low-rank factorization that models joint spatial-spectral interactions through separable spatial and spectral components, reducing the complexity of full spatial-spectral attention from $O(N^2 C^2)$ to $O(rNC)$, where $N$ is the number of spatial tokens, $C$ is the number of spectral channels, and $r$ is the rank of the low-rank approximation. We further incorporate channel-agnostic patch embedding and wavelength-aware positional encoding to support flexible spectral inputs. To enable efficient and robust pretraining, we introduce a hyperspectral masked autoencoder (HyperMAE) with decoupled spatial-spectral masking and hierarchical channel sampling. We evaluate LESSViT under a cross-spectral generalization setting that simulates cross-sensor variability. Experiments on the SpectralEarth benchmark demonstrate that LESSViT improves robustness under spectral shifts while remaining competitive in-distribution, and that explicit and efficient spatial-spectral modeling is essential for scalable and generalizable hyperspectral representation learning.

Chinese Translation

在不同传感器下建模高光谱图像（HSI）面临着基本挑战，因为波长覆盖、波段采样和通道维度存在差异。因此，在固定光谱配置下训练的模型往往无法推广到其他传感器。现有的视觉变换器（ViT）方法要么依赖于固定通道假设的隐式光谱建模，要么采用显式的空间-光谱注意机制，导致计算成本高昂，从而在效率和表现力之间存在根本的权衡。在本研究中，我们提出了低秩高效空间-光谱ViT（LESSViT），这是一种适用于跨光谱泛化的传感器灵活架构。LESSViT基于LESS注意机制构建，这是一种结构化的低秩分解，通过可分离的空间和光谱组件建模联合空间-光谱交互，将全空间-光谱注意的复杂度从$O(N^2 C^2)$降低到$O(rNC)$，其中$N$是空间标记的数量，$C$是光谱通道的数量，$r$是低秩近似的秩。我们进一步结合了与通道无关的补丁嵌入和波长感知的位置编码，以支持灵活的光谱输入。为了实现高效且鲁棒的预训练，我们引入了一种高光谱掩蔽自编码器（HyperMAE），采用解耦的空间-光谱掩蔽和分层通道采样。我们在模拟跨传感器变异的跨光谱泛化设置下评估LESSViT。在SpectralEarth基准测试上的实验表明，LESSViT在光谱变化下提高了鲁棒性，同时在分布内保持竞争力，显式且高效的空间-光谱建模对于可扩展和可泛化的高光谱表示学习至关重要。
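To give a feel for the complexity reduction stated in this abstract, the sketch below evaluates the two asymptotic terms, $N^2 C^2$ and $rNC$, for one configuration. The values of N, C, and r are hypothetical and chosen only for illustration; the counts are leading-order terms, not measured FLOPs.

```python
def attention_cost_terms(num_tokens, num_channels, rank):
    """Leading-order cost terms for full vs. low-rank spatial-spectral attention."""
    full = (num_tokens ** 2) * (num_channels ** 2)   # O(N^2 C^2)
    low_rank = rank * num_tokens * num_channels      # O(r N C)
    return full, low_rank

# Hypothetical setting: 14 x 14 spatial tokens, 200 spectral channels, rank 8.
full, low_rank = attention_cost_terms(num_tokens=196, num_channels=200, rank=8)
print(full, low_rank, full // low_rank)  # 1536640000 313600 4900
```

The ratio between the two terms is $NC/r$, so the gap widens as either the token count or the number of spectral channels grows.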


cs.CV / 289 / 2605.18553

StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video

StableHand：基于质量的流匹配用于从自我中心视频中估计世界空间双手运动

Zeng, Huajian, Yao, Chaohua, Zhang, Yuantai, Yang, Jiaqi, Potamias, Rolandos Alexandros, Zuo, Xingxing

Abstract

Recovering world space 4D motion of two interacting hands from egocentric video is a fundamental capability for supervising robot policy learning, where wrist trajectories track the end-effector and finger articulations specify the grasp pose. Two major challenges arise in this setting: hands frequently leave the camera view for extended periods due to head motion, and persistent hand-object interactions cause severe occlusions of one or both hands. Existing methods uniformly condition on noisy hand motion observations without accounting for their per-frame reliability, leading to substantial performance degradation. Our key insight is that accurate world space hand motion estimation is tightly coupled with the quality of per-frame hand observations. To this end, we decompose the quality of hand motion observations extracted from an off-the-shelf hand pose estimator into four channels: wrist global translation and finger articulations for both hands. We propose StableHand, a quality-aware flow-matching framework conditioned on these four-channel quality signals, which are predicted by a learned quality network. We naturally incorporate the quality signals into the flow-matching process through a per-channel forward schedule, a quality-adjusted velocity target, AdaLN modulation of the DiT denoiser, and a quality-aware ODE initialization. This unified generative process preserves high-quality observations while reconstructing unreliable ones using a learned bimanual motion prior. Experiments on HOT3D and ARCTIC, two egocentric benchmarks featuring long missing-hand spans and persistent hand-object occlusions, show that StableHand achieves state-of-the-art performance across all reported metrics, reducing W-MPJPE by 20-25% compared to the strongest baseline, with the largest gains on heavily occluded ARCTIC sequences.

Chinese Translation

从自我中心视频中恢复两只交互手的世界空间四维运动是监督机器人策略学习的基本能力，其中手腕轨迹跟踪末端执行器，手指关节指定抓取姿态。在这种情况下出现了两个主要挑战：由于头部运动，手部经常长时间离开摄像机视野，且持续的手物体交互导致一只或两只手的严重遮挡。现有方法在处理噪声手部运动观测时未考虑每帧的可靠性，导致性能显著下降。我们的关键见解是，准确的世界空间手部运动估计与每帧手部观测的质量紧密相关。为此，我们将从现成手势估计器提取的手部运动观测质量分解为四个通道：两只手的手腕全局位移和手指关节。我们提出了StableHand，一个基于这四个通道质量信号的质量感知流匹配框架，这些信号由学习的质量网络预测。我们通过每个通道的前向调度、质量调整的速度目标、DiT去噪器的AdaLN调制以及质量感知的常微分方程初始化，将质量信号自然地融入流匹配过程中。这个统一的生成过程在重建不可靠观测的同时，保留了高质量观测，使用学习的双手运动先验。在HOT3D和ARCTIC这两个自我中心基准上进行的实验显示，StableHand在所有报告的指标上都达到了最先进的性能，与最强基线相比，W-MPJPE降低了20-25%，在严重遮挡的ARCTIC序列中获得了最大的提升。


cs.CV / 290 / 2605.18577

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

OmniPro：一个全面的基准用于全主动流媒体视频理解

Zhao, Ruixiang, Yang, Jie, Xin, Zijie, Wang, Tianyi, Rao, Fengyun, LYU, Jing, Li, Xirong

Abstract

Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input. Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.

Chinese Translation

全主动流媒体视频理解，即从连续的音视频流中自主决定何时发言以及说什么，是全模态大型语言模型的一项新兴能力。现有基准在三个关键方面存在不足：它们主要依赖视觉信号，采用轮询或固定时间戳协议，而非真正的主动评估，并且仅涵盖有限范围的任务，阻碍了对全主动流媒体模型的可靠评估和区分。我们提出了OmniPro，这是第一个联合评估全模态感知、主动响应和多样化视频理解任务的基准。它包含2700个经过人工验证的样本，涵盖9个子任务和3个认知水平，涉及6种基本视频理解能力。值得注意的是，84%的样本需要音频信号（语言或非语言），每个样本都带有模态隔离标签，以便进行细粒度的多模态分析。我们进一步引入了一种双模式评估协议：探测模式通过在每个真实触发前后查询模型来评估内容理解，而在线模式则通过要求模型自主决定在流输入中何时响应来评估全面的主动能力。对11个代表性模型的评估揭示了三个关键发现：（1）音频提供了一致的增益，但在模型间的利用率高度可变；（2）性能随时间显著下降，表明长时间鲁棒性有限；（3）非语言音频感知仍然是最薄弱的维度。


cs.CV / 291 / 2605.18599

Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling

通过语义-空间解耦解决前馈新视图合成变换器中的表示歧义

Wu, Yihang, Sun, Yihang, Zhang, Shaofeng, Wu, Zuxuan, Yan, Junchi, Jia, Xiaosong, Jiang, Yu-gang

Abstract

Transformer-based models have advanced feedforward novel view synthesis (NVS). Current architectures such as GS-LRM and LVSM mix semantic information (e.g., RGB) and spatial information (e.g., Plücker rays) into a shared feature space. Since Plücker rays naturally carry lattice-like spatial structure, these designs can make the spatial bias interfere with appearance representation and degrade rendering fidelity. To this end, we propose to decouple the representation of feedforward NVS transformers into separate semantic and spatial tokens. The decoupled design keeps semantic and spatial information explicit in their branches while preserving cross-branch interaction through shared attention routing. Built on this design, we introduce optional categorized supervision and bidirectional modulation: the former provides branch-specific training signals, while the latter improves interaction between the two branches. Notably, the base decoupled design introduces virtually zero additional inference latency due to its architectural design. The proposed designs achieve consistent improvements, demonstrating effectiveness across decoder-only and encoder-decoder feedforward NVS models.

Chinese Translation

基于变换器的模型推动了前馈新视图合成（NVS）的发展。目前的架构如 GS-LRM 和 LVSM 将语义信息（例如 RGB）和空间信息（例如 Plücker 射线）混合到一个共享特征空间中。由于 Plücker 射线自然携带类似晶格的空间结构，这些设计可能导致空间偏差干扰外观表示，从而降低渲染保真度。为此，我们提出将前馈 NVS 变换器的表示解耦为独立的语义和空间标记。解耦设计在各自的分支中保持语义和空间信息的明确性，同时通过共享注意力路由保留跨分支交互。在此设计基础上，我们引入了可选的分类监督和双向调制：前者提供分支特定的训练信号，而后者改善了两个分支之间的交互。值得注意的是，基础解耦设计几乎不会引入额外的推理延迟，这得益于其架构设计。所提出的设计实现了一致的改进，证明了在仅解码器和编码-解码器前馈 NVS 模型中的有效性。


cs.CV / 292 / 2605.18601

Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models

咒语：自然语言作为多实体视频世界模型的动作接口

Zhu, Shangwen, Peng, Qianyu, Pu, Zhao, Shu, Zhilei, Ke, Xiangrui, Xing, Zhaohu, Tong, Zizhao, Wang, Zeqing, Cui, Xinyu, Wang, Huangji, Zhao, Jian, Jin, Yeying, Cheng, Fan, Feng, Ruili

Abstract

Modern interactive video world models have achieved impressive visual fidelity, yet lack fine-grained multi-entity control and cross-entity, cross-world generalization. We trace this gap to the action interface: standard control protocols (e.g. animation IDs, device inputs, scene-level captions) bind action semantics to specific entities or engines at design time. We propose natural language as the interface to unlock expressiveness that no prior interface can achieve, and we present Incantation, the first interactive video world model with per-latent-frame (0.25 s) natural-language conditioning that supports simultaneous multi-entity control and concept-level cross-entity transfer beyond any fixed rendering pipeline. We pair a pretrained bidirectional video backbone with frame-local text cross-attention, and enable real-time long-horizon streaming through ODE-initialized Self-Forcing distillation with a RoPE-decoupled sliding KV-cache. We surpass the Action-Index baseline on cross-entity transfer (89% vs. 43%) and out-of-vocabulary prompts (90% vs. 0%), and our 2-step student sustains 19.7 FPS at 480p with stable FVD over 2-hour rollouts. We further apply the same architecture and training recipe to The King of Fighters, changing only the per-entity action vocabulary slots. We have released a preview subset of the Incantation dataset at https://huggingface.co/datasets/zhush/incantation-elden-ring-scenes, containing manually collected Elden Ring player-boss combat clips with structured action-oriented metadata. Larger-scale Elden Ring and KOF data will be released with the full project.

Chinese Translation

现代交互式视频世界模型在视觉保真度方面取得了令人瞩目的成就，但在细粒度的多实体控制和跨实体、跨世界的泛化能力上仍显不足。我们将这一差距归因于动作接口：标准控制协议（例如动画ID、设备输入、场景级字幕）在设计时将动作语义绑定到特定的实体或引擎。我们提出自然语言作为接口，以解锁前所未有的表达能力，并呈现了咒语（Incantation），这是第一个具有每个潜在帧（0.25秒）自然语言条件的交互式视频世界模型，支持同时的多实体控制和超越任何固定渲染管道的概念级跨实体转移。我们将预训练的双向视频主干与帧局部文本交叉注意力相结合，并通过ODE初始化的自我强制蒸馏与去耦的滑动KV缓存实现实时长时域流媒体。我们在跨实体转移（89%对比43%）和超出词汇表的提示（90%对比0%）上超越了动作索引基线，并且我们的两步学生模型在480p下以稳定的FVD在2小时的回放中维持19.7 FPS。我们进一步将相同的架构和训练方案应用于《拳皇》（The King of Fighters），仅改变每个实体的动作词汇槽。我们已在https://huggingface.co/datasets/zhush/incantation-elden-ring-scenes发布了咒语数据集的预览子集，其中包含手动收集的《艾尔登法环》（Elden Ring）玩家与Boss战斗片段及结构化的动作导向元数据。更大规模的《艾尔登法环》和《拳皇》数据将在完整项目中发布。


cs.CV / 293 / 2605.18603

Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth

饥饿以感知：通过限制视觉带宽驯化视觉语言模型中的懒惰感知

Wu, Yuhuan, Wei, Cong, Lin, Fangzhen, Chen, Wenhu, Wang, Haozhe

Abstract

Vision-Language Models (VLMs) deployed as situated agents in high-resolution visual environments require active perception -- the ability to dynamically decide where to look through operations like zooming, cropping, and panning. However, current training paradigms produce models that mimic the surface form of such operations without functionally depending on their outputs, a phenomenon we term lazy perception. We trace this to a fundamental learning asymmetry: when coarse global views combined with language priors suffice for moderate accuracy, the model has no incentive to learn harder multi-step visual search. If a model can succeed without actively looking, it will never learn to look. This motivates Starve to Perceive, a training paradigm that constrains visual bandwidth -- restricting each observation to a tight token budget so that no single view suffices for task completion, making active perception the only viable strategy. Despite requiring no auxiliary losses, reward shaping, or architectural changes -- serving as a minimal, plug-in modification to standard post-training pipelines -- models trained under perceptual starvation achieve substantial gains, with a 5% average relative improvement across diverse benchmarks.

Chinese Translation

作为高分辨率视觉环境中的情境代理部署的视觉语言模型（VLMs）需要主动感知——通过缩放、裁剪和移动等操作动态决定观察位置的能力。然而，目前的训练范式产生的模型模仿了这些操作的表面形式，却在功能上并不依赖于其输出，这一现象我们称之为懒惰感知。我们将其追溯到一种基本的学习不对称性：当粗略的全局视图结合语言先验足以达到中等准确率时，模型没有动力去学习更复杂的多步骤视觉搜索。如果一个模型可以在不主动观察的情况下成功，它将永远不会学会观察。这促使我们提出“饥饿以感知”这一训练范式，限制视觉带宽——将每次观察限制在一个紧凑的标记预算内，以至于没有单一视图足以完成任务，使得主动感知成为唯一可行的策略。尽管不需要辅助损失、奖励塑造或架构更改——作为对标准后训练流程的最小插件修改——在感知饥饿下训练的模型在各种基准测试中实现了5%的平均相对提升。


cs.CV / 294 / 2605.18608

Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation through Dynamic Style Bridging

跨越变化的舞蹈：通过动态风格桥接实现前向促进的持续测试时间适应

Zhu, Zhilin, Wang, Yabin, Ma, Zhiheng, Song, Yaguang, Wang, Yaowei, Hong, Xiaopeng

Abstract

Continual Test-Time Adaptation (CTTA) aims to empower perception systems to handle dynamic distribution shifts encountered after deployment. Existing methods predominantly follow a backward-alignment paradigm, which rigidly aligns incoming data with supervisory surrogates derived from the source domain. Consequently, they struggle with unreliable supervision and evolving distribution shifts. To overcome these limitations, we introduce a novel forward-facilitation paradigm through a method termed Dynamic Style Bridging. Prior to deployment, we construct a compact knowledge base of generated class exemplars. During test time, to mitigate inherent generative bias and adapt these proxies to incoming data, we propose a multi-level bridging mechanism. This mechanism dynamically injects the proxies with incoming data styles at the input, statistical, and representation levels, while preserving the original semantics of the proxies. These high-fidelity proxies are then used to provide reliable, on-demand supervisory signals, enabling stable adaptation under continual shifts. Extensive experiments across standard CTTA benchmarks demonstrate that our method achieves consistent and substantial improvements over recent state-of-the-art approaches. Code is available at https://github.com/z1358/DAS.

Chinese Translation

持续测试时间适应（CTTA）旨在增强感知系统以应对部署后遇到的动态分布变化。现有方法主要遵循向后对齐范式，严格将输入数据与源领域派生的监督代理进行对齐。因此，它们在不可靠的监督和不断变化的分布变化面前表现不佳。为克服这些局限性，我们提出了一种新的前向促进范式，称为动态风格桥接。在部署之前，我们构建了一个紧凑的生成类示例知识库。在测试期间，为了减轻固有的生成偏差并将这些代理适应于输入数据，我们提出了一种多层次桥接机制。该机制在输入、统计和表示层面动态注入输入数据的风格，同时保持代理的原始语义。这些高保真代理随后被用于提供可靠的按需监督信号，从而实现稳定的适应以应对持续变化。针对标准CTTA基准的广泛实验表明，我们的方法在最近的最先进方法上实现了一致且显著的改进。代码可在 https://github.com/z1358/DAS 获取。


cs.CV / 295 / 2605.18610

CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic

CATA：通过冲突规避任务算术实现持续机器遗忘

Lin, Shen, Dong, Junhao, Chen, Rongjie, Zhang, Xiaoyu, Xu, Li, Chen, Xiaofeng

Abstract

Vision-language models (VLMs) have shown remarkable ability in aligning visual and textual representations, enabling a wide range of multimodal applications. However, their large-scale training data inevitably raises concerns about privacy, copyright, and undesirable content, creating a strong need for machine unlearning. While existing studies mainly focus on single-shot unlearning, practical VLM deployment often involves sequential removal requests over time, giving rise to continual machine unlearning. In this work, we make the first attempt to study continual unlearning for VLMs and identify three key challenges in this setting: effectiveness in removing target knowledge, fidelity in preserving retained model utility, and persistence in preventing knowledge re-emergence under sequential updates. To address these challenges, we propose CATA, a conflict-averse task arithmetic method that represents each forget request as an unlearning task vector. By maintaining historical task vectors and performing sign-aware conflict-averse aggregation, CATA suppresses conflicting update components that may weaken previous forgetting effects. Extensive experiments under both single-shot and continual settings show that CATA outperforms baselines in terms of forgetting effectiveness, model fidelity, and forgetting persistence.

Chinese Translation

视觉-语言模型（VLMs）在对齐视觉和文本表示方面展现了卓越的能力，使得多模态应用得以广泛实现。然而，它们的大规模训练数据不可避免地引发了关于隐私、版权和不良内容的担忧，迫切需要机器遗忘。现有研究主要集中于单次遗忘，而实际的VLM部署通常涉及随时间推移的连续删除请求，从而引发持续机器遗忘。在本研究中，我们首次尝试研究VLM的持续遗忘，并识别出这一设置中的三个关键挑战：有效去除目标知识、保留模型效用的保真度，以及在连续更新中防止知识重新出现的持久性。为了解决这些挑战，我们提出了CATA，一种冲突规避任务算术方法，将每个遗忘请求表示为一个遗忘任务向量。通过维护历史任务向量并执行符号感知的冲突规避聚合，CATA抑制可能削弱先前遗忘效果的冲突更新组件。在单次和持续设置下的广泛实验表明，CATA在遗忘有效性、模型保真度和遗忘持久性方面优于基线方法。
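The idea of sign-aware, conflict-averse aggregation of unlearning task vectors can be pictured with the toy sketch below. It is only an illustration of the general notion of muting sign-conflicting components, not CATA's actual aggregation rule, which the abstract does not spell out; the function name and the specific rule (zeroing coordinates of the new vector that oppose the accumulated history) are assumptions.

```python
import numpy as np

def conflict_averse_merge(history, new_vector):
    """Merge a new unlearning task vector with earlier ones, muting sign conflicts.

    history: list of earlier task vectors (each a 1-D array of weight deltas).
    new_vector: task vector for the latest forget request.
    Hypothetical rule: drop any coordinate of the new vector whose sign
    opposes the summed history, so it cannot undo earlier forgetting.
    """
    new_vector = np.asarray(new_vector, dtype=np.float64)
    if len(history) == 0:
        return new_vector.copy()
    past = np.sum(np.asarray(history, dtype=np.float64), axis=0)
    # A coordinate conflicts when the new update pushes against the past one.
    conflict = np.sign(past) * np.sign(new_vector) < 0
    safe_new = np.where(conflict, 0.0, new_vector)
    return past + safe_new

merged = conflict_averse_merge([[1.0, -1.0, 0.0]], [-2.0, -3.0, 4.0])
print(merged)  # first coordinate keeps the historical value; the others add up
```

In a task-arithmetic setting, the merged vector would then be added to the base model's weights; coordinates where the history is zero are never treated as conflicts under this rule.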


cs.CV / 296 / 2605.18621

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

CrossView Suite：利用多模态大语言模型的跨视角空间智能，结合数据集、模型和基准

Wang, Wei, Yuan, Yuqian, Lin, Tianwei, Zhang, Wenqiao, Tang, Siliang, Xiao, Jun, Zhuang, Yueting

Abstract

Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by three major gaps: the scarcity of large-scale well-annotated training data, the lack of comprehensive benchmarks for systematic evaluation, and the absence of explicit alignment mechanisms that establish object-level consistency across views. To address these gaps, we develop CrossView Suite, which comprises three coordinated components: CrossViewSet, CrossViewBench, and CrossViewer. First, we introduce a multi-agent data engine to curate a large-scale, high-quality cross-view instruction dataset, termed CrossViewSet, covering 17 fine-grained task types with 1.6M samples. Second, we create a scene-disjoint benchmark, CrossViewBench, to comprehensively assess the cross-view spatial understanding capability of MLLMs across various aspects. Finally, we propose CrossViewer, a progressive three-stage framework for cross-view spatial reasoning in MLLMs, following a Perception -> Alignment -> Reasoning paradigm. Our method uses an adaptive spatial region tokenizer to capture fine-grained object representations, explicitly aligns objects across views, and fuses the aligned features to boost the cross-view inference capacity of MLLMs. Extensive experiments and analyses show that large-scale training data, systematic evaluation, and explicit cross-view alignment are all critical for advancing MLLMs from single-view perception toward real-world spatial intelligence. The project page is available at https://github.com/Thinkirin/Crossview-Suite.

Chinese Translation

空间智能要求多模态大语言模型（MLLMs）超越单一视角感知，并在多个视角中对物体、可见性、几何形状和交互进行一致的推理。然而，跨视角推理的进展受到三个主要缺口的限制：大规模高质量标注训练数据的稀缺、缺乏系统评估的全面基准，以及缺乏建立视角间物体一致性的显式对齐机制。为了解决这些问题，我们全面开发了CrossView Suite，涵盖三个协调组件：CrossViewSet、CrossViewBench和CrossViewer。首先，我们引入了一个多智能体数据引擎，精心策划了一个大规模高质量的跨视角指令数据集，称为CrossViewSet，涵盖17种细粒度任务类型，共160万个样本。其次，我们精心创建了一个场景不重叠的CrossViewBench，以全面评估MLLM的跨视角空间理解能力，从多个方面进行评估。最后，我们提出了CrossViewer，一个渐进式的三阶段框架，用于MLLM中的跨视角空间推理，遵循感知 -> 对齐 -> 推理的范式。我们的方法配备了自适应空间区域标记器，以捕捉细粒度的物体表示，然后显式对齐多视角物体，从而融合对齐特征，以提升MLLM的跨视角推理能力。大量实验和分析表明，大规模训练数据、系统评估和显式跨视角对齐对于推动MLLM从单一视角感知向现实世界空间智能的进步至关重要。项目页面可访问 https://github.com/Thinkirin/Crossview-Suite。


cs.CV / 297 / 2605.18636

SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents

SPIKE：一种适应性双控制器框架，用于成本高效的长时间游戏代理

Jiang, Wencan, Zhang, Jiangning, Mei, Jianbiao, Liu, Jinzhuo, Yang, Yu, Hu, Xiaobin, Xue, Zhucun, Liu, Yong, Tao, Dacheng

Abstract

Long-horizon multimodal agents in open-world games must stay goal-directed across many low-level interactions under tight token and latency budgets. Existing approaches often trade off costly per-step reasoning against reactive execution that can drift, repeat failures, and recover poorly. Our key idea is to reuse strategic reasoning across locally stable segments and reinvoke it at event boundaries. We present SPIKE, an adaptive dual controller framework for cost-efficient long-horizon game control. Its Strategic Controller performs low-frequency global planning, failure analysis, and recovery, while its Reactive Controller handles fast local execution under a strict token budget. An Event Trigger monitors visual change, task progress, repeated actions, and failure signals to decide when control should stay reactive or escalate to strategic reasoning. Hierarchical Memory separates short-term experience reuse in the State-Action Memory Bank (SA-MB) from structured evidence in the State Action Knowledge Graph (SA-KG), allowing each controller to retrieve the context it needs. This design reuses strategic proposals over multiple reactive steps, supports local override when plans become stale, and reserves expensive reasoning for moments where extra deliberation is useful. On the Lite-100 split of StarDojo, SPIKE improves Lite-100 success rate (SR) by 5.0 percentage points (38.5% relative) over the strongest Lite-100 baseline and Budgeted SR by 9.3 points (75.6% relative) over the strongest budgeted baseline. It also reduces token consumption by 54.9% and latency by 40.8%. Ablations show that event triggering, reactive override, and heterogeneous memory each contribute to success and recovery, supporting selective reasoning rather than reasoning at every step.

Chinese Translation

开放世界游戏中的长时间多模态代理必须在严格的令牌和延迟预算下，保持目标导向并进行多次低级交互。现有方法通常在每步推理的高成本与可能漂移、重复失败和恢复不良的反应执行之间进行权衡。我们的关键思想是重用战略推理，在局部稳定的片段中并在事件边界重新调用它。我们提出了SPIKE，一种适应性双控制器框架，用于成本高效的长时间游戏控制。其战略控制器执行低频率的全局规划、失败分析和恢复，而反应控制器则在严格的令牌预算下处理快速的本地执行。事件触发器监控视觉变化、任务进展、重复动作和失败信号，以决定控制何时应保持反应性或升级到战略推理。分层记忆将短期经验的重用（在状态-动作记忆库（SA-MB）中）与结构化证据（在状态动作知识图（SA-KG）中）分开，使每个控制器能够检索所需的上下文。这种设计在多个反应步骤中重用战略提案，支持在计划变得过时时进行本地覆盖，并将高成本的推理保留给额外深思熟虑有用的时刻。在StarDojo的Lite-100拆分中，SPIKE将Lite-100成功率（SR）提高了5.0个百分点（相对提高38.5%），将预算成功率（Budgeted SR）提高了9.3个百分点（相对提高75.6%），同时减少了54.9%的令牌消耗和40.8%的延迟。消融实验表明，事件触发、反应覆盖和异构记忆各自对成功和恢复有所贡献，支持选择性推理而非每一步都进行推理。
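The Event Trigger described in this abstract, which decides whether control stays reactive or escalates to strategic reasoning, might look roughly like the sketch below. It is a hypothetical illustration only: the signal names follow the abstract (visual change, task progress, repeated actions, failure signals), but the data structure, thresholds, and decision rule are assumptions rather than SPIKE's actual design.

```python
from dataclasses import dataclass

@dataclass
class StepSignals:
    visual_change: float    # fraction of the frame that changed, in [0, 1]
    progress_delta: float   # change in task progress since the last step
    repeated_actions: int   # consecutive repeats of the most recent action
    failure: bool           # whether a failure signal was observed

def should_escalate(signals, change_threshold=0.5, repeat_limit=3):
    """Return True when the (expensive) strategic controller should be invoked."""
    if signals.failure:
        return True  # failures always warrant analysis and recovery
    if signals.visual_change >= change_threshold:
        return True  # large scene change: the current plan may be stale
    if signals.repeated_actions >= repeat_limit and signals.progress_delta <= 0.0:
        return True  # stuck in a loop without making progress
    return False     # locally stable segment: stay reactive and save tokens
```

Under a rule of this kind, the strategic plan is reused across all steps where the function returns False, which is where the token and latency savings reported in the abstract would come from.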


cs.CV / 298 / 2605.18641

Leveraging Latent Visual Reasoning in Silence

在沉默中利用潜在视觉推理

Zhu, Dongyao, Wang, Zhen, Xiao, Xi, Jiang, Han, Vahidian, Saeed, Chao, Wei-Lun, Berger-Wolf, Tanya, Su, Yu, Vatsavai, Raju, Gu, Jianyang

Abstract

Latent visual reasoning involves visual evidence more directly in multimodal reasoning by inserting continuous latent tokens before textual generation. However, the necessity of these latent tokens at inference remains ambiguous. We show that replacing latent tokens with random noise or removing them completely causes little performance degradation across spatial reasoning benchmarks. Reinforcement learning further diminishes the latent generation behavior after post-training. These observations raise a central question: Is latent visual reasoning still meaningful? We argue that its value should be measured by how effectively latent tokens guide learning, rather than whether they persist as an inference-time format. Our analysis shows that latent reasoning is unevenly favorable across question types, yet hard task-level routing for applying latent generation is brittle. Motivated by these findings, we propose an attention-based reward that encourages generated latent tokens to interact with later text tokens during RL. This reward promotes latent utilization when the latent mode is activated while preserving the flexibility to use pure-text reasoning. Experiments show that our method improves performance across perception and visual reasoning benchmarks, even when latent tokens are rarely generated after post-training. Our results highlight that, without explicit expression at inference, latent visual reasoning can shape better visual grounding and more accurate textual reasoning in silence. Our code and trained models are publicly available on GitHub (https://github.com/ddydyd32/silent-lvr/tree/master) and Hugging Face (https://huggingface.co/collections/cornuHGF/silent-lvr).

Chinese Translation

潜在视觉推理通过在文本生成之前插入连续的潜在标记，更直接地将视觉证据纳入多模态推理。然而，这些潜在标记在推理时的必要性仍然模糊不清。我们展示了用随机噪声替换潜在标记或完全去除它们在空间推理基准测试中几乎不会导致性能下降。强化学习在后训练后进一步减少了潜在生成行为。这些观察提出了一个核心问题：潜在视觉推理仍然有意义吗？我们认为其价值应通过潜在标记在指导学习方面的有效性来衡量，而不是它们是否在推理时仍然存在。我们的分析表明，潜在推理在不同问题类型中表现不均衡，但在应用潜在生成的困难任务级路由方面则显得脆弱。基于这些发现，我们提出了一种基于注意力的奖励，鼓励生成的潜在标记在强化学习过程中与后续文本标记进行交互。当潜在模式被激活时，该奖励促进潜在利用，同时保持使用纯文本推理的灵活性。实验表明，我们的方法在感知和视觉推理基准测试中提高了性能，即使在后训练后潜在标记很少生成。我们的结果强调，在推理时没有明确表达的情况下，潜在视觉推理可以在沉默中塑造更好的视觉基础和更准确的文本推理。我们的代码和训练模型已公开发布在 [GitHub](https://github.com/ddydyd32/silent-lvr/tree/master) 和 [Hugging Face](https://huggingface.co/collections/cornuHGF/silent-lvr) 上。

cs.CV / 299 / 2605.18645

Articulation in Prime: Primitive-Based Articulated Object Understanding from a Single Casual Video

Articulation in Prime：基于基元的关节物体理解，来自单段随手拍摄的视频

Artykov, Arslan, Ravaud, Tom, Violante-Grezzi, Nicolás, Lepetit, Vincent

Abstract

Retrieving the 3D kinematics of articulated objects from monocular video is a fundamental challenge in computer vision. Existing methods rely on complex video setups or cues such as long-term point tracking or wide-baseline matching, but are frequently brittle under severe occlusions, rapid camera ego-motion, or weak local features. Learning-based methods, meanwhile, struggle to generalize beyond their training categories. We propose a category-agnostic optimization framework that treats articulated object understanding as a primitive-fitting problem. Geometric primitives serve as a proxy representation that avoids the pitfalls of unstable point tracks; a novel mechanism organizes them into coherent parts constrained by revolute and prismatic joints. Our formulation jointly optimizes part segmentation and joint parameters, recovering complex kinematics from a single casually captured video. A visibility-aware procedure handles partial observations and occlusions inherent to real-world data. We also propose the AiP-synth and AiP-real benchmarks, featuring significant camera motion and heavy occlusions, and outperform existing methods. Project page: https://aartykov.github.io/Articulation-in-Prime/

Chinese Translation

从单目视频中提取关节物体的三维运动学是计算机视觉中的一个基本挑战。现有方法依赖于复杂的视频设置或线索，例如长期点跟踪或宽基线匹配，但在严重遮挡、快速相机自运动或弱局部特征下往往表现脆弱。基于学习的方法在训练类别之外的泛化能力也面临困难。我们提出了一种类别无关的优化框架，将关节物体理解视为一个基元拟合问题。几何基元作为一种代理表示，避免了不稳定点轨迹的陷阱；一种新颖的机制将其组织成受转动关节和滑动关节约束的连贯部件。我们的公式联合优化部件分割和关节参数，从单个随意捕获的视频中恢复复杂的运动学。一个考虑可见性的程序处理现实数据中固有的部分观测和遮挡。我们还提出了具有显著相机运动和严重遮挡的AiP-synth和AiP-real基准，并在其上超越了现有方法。项目页面：https://aartykov.github.io/Articulation-in-Prime/

cs.CV / 300 / 2605.18652

MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

MementoGUI：学习自主多模态记忆控制以支持长期GUI代理

Zeng, Ziyun, Hua, Hang, Zou, Bocheng, Cai, Mu, Feris, Rogerio, Luo, Jiebo

Abstract

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.

Chinese Translation

近期的GUI代理在视觉定位和动作预测方面取得了显著进展，但在需要跨多个界面转换保持任务状态的长期任务中仍然表现脆弱。现有的代理通常依赖于原始历史重放或仅文本记忆，这要么使模型被冗余的截图淹没，要么丢弃了未来决策所需的局部视觉证据。为了解决这些局限性，我们引入了MementoGUI，一个插件式自主记忆框架，赋予基于MLLM的GUI代理以MementoCore，一个用于在线记忆选择、压缩和检索的学习控制器。MementoGUI并不将交互历史视为固定上下文，而是将长期GUI控制形式化为一个在线记忆控制问题：工作记忆选择性地保留与任务相关的界面事件，并提供文本摘要和ROI级别的视觉证据，而情节记忆则通过学习的相关性选择检索可重用的过去轨迹。MementoCore将记忆控制模块化为专门的操作符，用于步骤处理、记忆压缩、情节写入和情节选择，从而实现插件式记忆增强，而无需微调GUI代理的主干。我们进一步开发了一个可扩展的数据策划管道，将计算机使用轨迹转换为记忆控制器训练数据，推出了MementoGUI-Bench以评估GUI代理的长期决策能力，并设计了基于MLLM的度量标准，用于语义动作匹配、任务进展和记忆一致性。在GUI-Odyssey、MM-Mind2Web和MementoGUI-Bench上的实验表明，MementoGUI在无历史、历史重放和仅文本记忆基线的情况下，始终改善了GUI代理的表现，且更大的MementoCore主干进一步增强了记忆增强的GUI控制能力。

人工智能 (Artificial Intelligence)

142

cs.AI / 1 / 2605.16265

AgentWall: A Runtime Safety Layer for Local AI Agents

AgentWall：本地AI代理的运行时安全层

Aravind, Ashwin

Abstract

The safety of autonomous AI agents is increasingly recognized as a critical open problem. As agents transition from passive text generators to active actors capable of executing shell commands, modifying files, calling APIs, and browsing the web, the consequences of unsafe or adversarially manipulated behavior become immediate and tangible. Existing AI safety work has focused primarily on model alignment and input filtering, but these approaches do not address what happens at the moment an agent's intent becomes a real action on a real machine. This gap is especially acute in local environments, where developers run agents against their own filesystems, credentials, and infrastructure with little runtime control. This paper introduces AgentWall, a runtime safety and observability layer for local AI agents. AgentWall intercepts every proposed agent action before it reaches the host environment, evaluates it against an explicit declarative policy, requires human approval for sensitive operations, and records a complete execution trail for audit and replay. It is implemented as a policy-enforcing MCP proxy and native OpenClaw plugin, working across Claude Desktop, Cursor, Windsurf, Claude Code, and OpenClaw with a single install command. We present the design, architecture, threat model, and policy model of AgentWall, and demonstrate 92.9% policy enforcement accuracy with sub-millisecond overhead across 14 benchmark tests. AgentWall is open-source at https://github.com/agentwall/Agentwall.
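
The intercept, evaluate, approve, and record flow described in the abstract can be pictured with a minimal Python sketch. This is not AgentWall's actual policy format or API: the `Rule` fields, the glob patterns, the first-match-wins ordering, and the default of escalating unmatched actions to a human are all assumptions made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    tool: str      # hypothetical tool name, e.g. "shell" or "write_file"
    pattern: str   # glob matched against the action's argument
    decision: str  # "allow", "deny", or "ask" (human approval)


# Hypothetical declarative policy; the first matching rule wins.
POLICY = [
    Rule("shell", "rm -rf *", "deny"),
    Rule("write_file", "*/.ssh/*", "ask"),
    Rule("shell", "ls*", "allow"),
]

AUDIT_LOG = []  # (tool, argument, decision) tuples kept for later replay


def evaluate(tool, argument, policy=POLICY):
    """Decide on a proposed action before it reaches the host, and log it."""
    decision = "ask"  # assumed default: escalate when no rule matches
    for rule in policy:
        if rule.tool == tool and fnmatchcase(argument, rule.pattern):
            decision = rule.decision
            break
    AUDIT_LOG.append((tool, argument, decision))
    return decision
```

In this sketch every call is logged whether or not it is allowed, which mirrors the abstract's emphasis on a complete execution trail.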

Chinese Translation

自主AI代理的安全性日益被认为是一个关键的开放问题。随着代理从被动的文本生成器转变为能够执行shell命令、修改文件、调用API和浏览网页的主动参与者，不安全或被对抗性操控行为的后果变得直接而明显。现有的AI安全研究主要集中在模型对齐和输入过滤上，但这些方法并未解决代理的意图在真实机器上转化为实际行动时会发生什么。这个缺口在本地环境中尤为明显，开发者在自己的文件系统、凭证和基础设施上运行代理时几乎没有运行时控制。本文介绍了AgentWall，一个用于本地AI代理的运行时安全和可观察性层。AgentWall在每个提议的代理操作到达主机环境之前进行拦截，依据明确的声明性政策进行评估，对敏感操作要求人工批准，并记录完整的执行轨迹以供审计和重放。它作为一个政策执行的MCP代理和原生的OpenClaw插件实现，能够通过单一安装命令在Claude Desktop、Cursor、Windsurf、Claude Code和OpenClaw上工作。我们展示了AgentWall的设计、架构、威胁模型和政策模型，并在14个基准测试中实现了92.9%的政策执行准确率，且延迟低于毫秒级。AgentWall的源代码可在https://github.com/agentwall/Agentwall获取。

cs.AI / 2 / 2605.16309

ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning

ANNEAL：通过受控符号补丁学习适应大型语言模型代理

Hakim, Safayat Bin, Guo, Keyan, Tan, Wenkai, Velasquez, Alvaro, Xu, Shouhuai, Song, Houbing Herbert

Abstract

LLM-based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge--operator schemas, preconditions, and constraints--remains unrepaired. Existing self-evolving approaches address this gap by updating prompts, memory, or model weights, but none directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees required for safe deployment. We introduce ANNEAL, a neuro-symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure-Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal via multi-dimensional scoring, symbolic guardrails, and canary testing before commit. Every accepted edit carries full provenance and deterministic rollback capability. Across four domains and 27 multi-seed runs, ANNEAL is the only evaluated system that commits persistent structural repairs--strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72-100% holdout failure rates on recurring faults, whereas ANNEAL reduces these to 0% in the tested recurring-failure settings. Ablation confirms that removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a complementary paradigm to weight-level and prompt-level adaptation for persistent fault elimination.
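
A governed edit with provenance and deterministic rollback might look roughly like the Python sketch below. The `ProcessKG` class, its dictionary layout, and the single `canary` callback are hypothetical stand-ins; the paper's multi-dimensional scoring and symbolic guardrails are not modeled here.

```python
import copy


class ProcessKG:
    """Toy process knowledge graph: operator name -> list of preconditions."""

    def __init__(self, operators):
        self.operators = operators
        self.history = []  # (provenance, snapshot taken before the edit)

    def commit(self, operator, new_preconditions, provenance, canary):
        """Apply a patch and keep it only if the canary check passes."""
        snapshot = copy.deepcopy(self.operators)
        self.operators[operator] = list(new_preconditions)
        if not canary(self.operators):
            self.operators = snapshot  # rejected: restore the previous state
            return False
        self.history.append((provenance, snapshot))
        return True

    def rollback(self):
        """Undo the most recent accepted edit and return its provenance."""
        provenance, snapshot = self.history.pop()
        self.operators = snapshot
        return provenance
```

Storing a full snapshot per edit is the simplest way to make rollback deterministic; a real system would more likely store diffs.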

Chinese Translation

基于大型语言模型（LLM）的代理能够从个体执行错误中恢复，但当基础过程知识——操作符模式、前提条件和约束——未得到修复时，它们在相同故障上反复失败。现有的自我演化方法通过更新提示、记忆或模型权重来解决这一问题，但没有一种方法直接修复编码任务执行方式的符号结构，且很少提供安全部署所需的治理保证。我们提出了ANNEAL，一种神经符号代理，它将重复出现的故障转化为对过程知识图的受控符号编辑，而不修改基础模型的权重。其核心机制，故障驱动知识获取（Failure-Driven Knowledge Acquisition, FDKA），定位负责的操作符，通过受限的LLM生成合成一个类型补丁，并通过多维评分、符号护栏和金丝雀测试在提交前验证提案。每个被接受的编辑都携带完整的来源信息和确定性的回滚能力。在四个领域和27次多种种子运行中，ANNEAL是唯一一个提交持久结构修复的评估系统——强基线如ReAct和Reflexion在高情节恢复方面表现良好，但在重复故障上仍保持72-100%的保留失败率，而ANNEAL在测试的重复故障设置中将这些比率降低至0%。消融实验确认，移除FDKA会消除所有结构修复，并使成功率下降多达26.7个百分点。这些结果表明，受控符号修复为持久故障消除提供了一种补充的范式，与权重级和提示级适应相辅相成。

cs.AI / 3 / 2605.16552

From Prompts to Protocols: An AI Agent for Laboratory Automation

从提示到协议：一种用于实验室自动化的人工智能代理

Angelopoulos, Angelos, Cahoon, James F., Alterovitz, Ron

Abstract

Automating science laboratories enables faster, safer, more accurate, and more reproducible execution of protocols, accelerating the discovery and testing of new materials, drugs, and more. However, setting up and running autonomous labs requires coordinating numerous instruments and robots, forcing scientists to write code, manage configuration files, and navigate complex software infrastructure. We present an AI agent architecture that integrates large language models with laboratory orchestration, enabling scientists to interactively create and monitor automated lab protocols using natural language. Integrated into the Experiment Orchestration System (EOS), the AI agent operates under an agentic loop with automated validation and error correction, and supports the complete experimental lifecycle: creating protocols, running and monitoring both protocols and closed-loop optimization campaigns, and analyzing results. A visual graph editor renders protocols as interactive node-based diagrams synchronized with the AI agent's protocol representation, enabling seamless alternation between AI-assisted and manual protocol construction. Evaluated on three simulated automated labs spanning chemistry, biology, and materials science, the AI agent achieves a 97% first-attempt protocol generation success rate and an order of magnitude reduction in required interface actions.

Chinese Translation

自动化科学实验室能够更快速、更安全、更准确且更可重复地执行协议，从而加速新材料、药物等的发现和测试。然而，建立和运行自主实验室需要协调众多仪器和机器人，迫使科学家编写代码、管理配置文件，并应对复杂的软件基础设施。我们提出了一种人工智能代理架构，将大型语言模型与实验室编排相结合，使科学家能够使用自然语言交互式地创建和监控自动化实验室协议。该AI代理集成于实验编排系统（Experiment Orchestration System, EOS）中，在具有自动验证和错误修正的代理循环下运行，并支持完整的实验生命周期：创建协议、运行和监控协议及闭环优化活动，并分析结果。一个可视化图形编辑器将协议呈现为与AI代理的协议表示同步的交互式节点图，能够无缝切换AI辅助和手动协议构建。在涵盖化学、生物学和材料科学的三个模拟自动化实验室中进行评估，该AI代理实现了97%的首次尝试协议生成成功率，并在所需接口操作上减少了一个数量级。

cs.AI / 4 / 2605.16565

Skim: Speculative Execution for Fast and Efficient Web Agents

Skim：快速高效的网络代理的推测执行

Wong, Mike, Hsieh, Kevin, Nath, Suman, Netravali, Ravi

Abstract

Skim is a speculative execution framework for web agents that exploits the predictable structure of purpose-built websites. Today's web-agent expense is not intrinsic to the tasks but a property of how agents are composed: frontier-model inference, browser rendering, and ReAct-style planning are applied to every step of every task regardless of complexity. Skim's key observation is that websites enforce stable URL patterns, answer formats, and task-to-trajectory mappings across queries of the same type, so most queries can bypass these heavyweight components entirely. An offline profiler captures these patterns once per site. At runtime, Skim matches each query to a template, synthesizes the destination URL, and extracts the answer with a small model. A lightweight verifier gates each fast-path output against the query and schema; rare misspeculations cascade to the full agent, warm-started by the fast path's final URL to preserve upstream trajectory progress. Across standard web-agent benchmarks paired with three backbone agents (WebVoyager, AgentOccam, BrowserUse), Skim reduces median per-task cost by 1.9x and latency by 33.4% with no accuracy loss.
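
The runtime path (template match, URL synthesis, small-model extraction, verification, fallback with a warm start) can be sketched in Python as follows. The template table, the `shop.example.com` URL, and the `extract`, `verify`, and `full_agent` callables are placeholders invented for this sketch, not Skim's real interfaces.

```python
import re

# Hypothetical per-site profile: query regex -> destination URL template.
TEMPLATES = [
    (re.compile(r"price of (?P<item>[\w ]+) on example-shop", re.I),
     "https://shop.example.com/search?q={item}"),
]


def full_agent(query, start_url=None):
    """Stand-in for the heavyweight agent; optionally warm-started at a URL."""
    return f"full-agent answer for {query!r} (start={start_url})"


def fast_path(query, extract, verify):
    """Speculate via a template; fall back to the full agent on any miss."""
    for pattern, url_template in TEMPLATES:
        match = pattern.search(query)
        if match is None:
            continue
        item = match.group("item").strip().replace(" ", "+")
        url = url_template.format(item=item)
        answer = extract(url)          # small-model extraction (placeholder)
        if verify(query, answer):      # lightweight gate on the output
            return answer, "fast"
        # Misspeculation: hand over to the full agent, starting at this URL.
        return full_agent(query, start_url=url), "fallback"
    return full_agent(query), "fallback"
```

The design point worth noting is that a failed speculation is not wasted: the synthesized URL is passed on so the full agent does not restart from the site's home page.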

Chinese Translation

Skim 是一个针对网络代理的推测执行框架，利用了专门构建的网站的可预测结构。如今，网络代理的开销并非源于任务本身，而是代理组合方式的特性：边界模型推理、浏览器渲染和 ReAct 风格的规划在每个任务的每个步骤中都被应用，无论其复杂性如何。Skim 的关键观察是，网站强制执行稳定的 URL 模式、答案格式以及相同类型查询的任务与轨迹映射，因此大多数查询可以完全绕过这些重量级组件。离线分析器每个网站仅捕获一次这些模式。在运行时，Skim 将每个查询匹配到一个模板，合成目标 URL，并使用一个小模型提取答案。一个轻量级验证器将每个快速路径输出与查询和模式进行比对；罕见的错误推测会级联到完整代理，并通过快速路径的最终 URL 进行热启动，以保留上游轨迹的进展。在与三种基础代理（WebVoyager、AgentOccam、BrowserUse）配对的标准网络代理基准测试中，Skim 将每个任务的中位数成本降低了 1.9 倍，延迟降低了 33.4%，且没有准确性损失。

cs.AI / 5 / 2605.16568

Scalable Uncertainty Reasoning in Knowledge Graphs

知识图谱中的可扩展不确定性推理

Wu, Jingcheng

Abstract

Knowledge Graphs are pivotal for semantic data integration. The real-world data they model is often inherently uncertain. Within knowledge graphs, uncertainty manifests in three distinct levels: imprecise attribute values, probabilistic triple existence, and incomplete schema knowledge. However, current Semantic Web standards lack native support for reasoning over such uncertainty, and naïve extensions often incur computational intractability. In this thesis, I aim to develop a modular framework that addresses each level through tailored techniques: (1) defining probabilistic literals and a corresponding query algebra for continuous attributes; (2) a compilation-based framework transforming SPARQL provenance into tractable probabilistic circuits for uncertain triples; and (3) topology-aware geometric embeddings for statistical schema reasoning. The central hypothesis is that specialized reasoning mechanisms, namely algebraic, logical, and geometric approaches, can reconcile semantic precision with computational tractability.

Chinese Translation

知识图谱在语义数据集成中至关重要。它们所建模的现实世界数据通常具有内在的不确定性。在知识图谱中，不确定性表现为三个不同的层次：不精确的属性值、概率三元组的存在性以及不完整的模式知识。然而，当前的语义网标准缺乏对这种不确定性进行推理的原生支持，而简单的扩展往往会导致计算上的不可处理性。在本论文中，我旨在开发一个模块化框架，通过量身定制的技术解决每个层次的问题：（1）为连续属性定义概率文字及相应的查询代数；（2）一个基于编译的框架，将SPARQL的来源转化为可处理的不确定三元组的概率电路；（3）用于统计模式推理的拓扑感知几何嵌入。核心假设是，专门的推理机制，即代数、逻辑和几何方法，可以调和语义精确性与计算可处理性。

cs.AI / 6 / 2605.16575

Counterparty Modeling is Not Strategy: The Limits of LLM Negotiators

对手建模并非策略：大型语言模型谈判者的局限性

Cosentino, Romain, Shekkizhar, Sarath, Earle, Adam, Savarese, Silvio

Abstract

Negotiation requires more than inferring what the other side wants: it requires using that information to make advantageous offers and counteroffers over multiple turns. We study whether large language model (LLM) agents do this in a controlled multi-attribute bargaining environment. We find that current LLM agents can model a counterparty's preferences, but do not reliably turn that knowledge into strategic bargaining. When given negotiating partner preference information, agents model it accurately and early in their reasoning traces, yet this does not reliably improve outcomes for the informed side. Turn-level analyses show why: agents often respond to what they believe the counterparty values, but do not consistently pair those moves with gains on their own high-value attributes. Sellers are more accommodating overall, and in asymmetric-information conditions, the informed side often makes the more weakly compensated concessions. Because agents fail to leverage this underlying utility structure for strategic advantage, their final agreements are heavily dictated by surface-level opening anchors rather than actual utility weights. Finally, requiring agents to explicitly state concession-for-reciprocity trades before making an offer makes individual turns look more strategic, but ultimately fails to improve the efficiency of the final agreements.

Chinese Translation

谈判不仅需要推断对方的需求，还需要利用这些信息在多个回合中提出有利的报价和反报价。我们研究了大型语言模型（LLM）代理在一个受控的多属性讨价还价环境中是否能够做到这一点。我们的研究发现，目前的LLM代理能够建模对手的偏好，但并不可靠地将这种知识转化为战略性谈判。当获得谈判伙伴的偏好信息时，代理能够准确地建模这些信息，并在推理过程中较早地使用它，然而这并未可靠地改善知情方的结果。逐回合分析揭示了原因：代理通常会根据他们认为对手重视的内容作出反应，但并未始终将这些举动与自身高价值属性的收益相结合。整体而言，卖方更为宽容，而在信息不对称的情况下，知情方往往做出补偿较弱的让步。由于代理未能利用这种潜在的效用结构来获得战略优势，他们的最终协议往往受到表面开盘锚点的强烈影响，而非实际的效用权重。最后，要求代理在提出报价之前明确说明让步与回报的交易，使得个别回合看起来更具战略性，但最终未能提高最终协议的效率。

cs.AI / 7 / 2605.16612

PRISMat: Policy-Driven, Permutation-Invariant Autoregressive Material Generation

PRISMat：基于政策驱动的置换不变自回归材料生成

Schlesinger, Claire, Hsu, Circe, Schindler, Peter, Walters, Robin

Abstract

Rapid identification of candidate materials with target properties has become a key task in materials science. Machine learning has emerged as an alternative to physics-based simulation, offering a faster and cheaper way to filter materials based on their stability and other target properties, reducing the number of candidates that reach the costly synthesis stage. Recently, Large Language Models (LLMs) have been applied to this role, but these models are parameter-heavy and computationally expensive both during training and at inference time, making them unsuitable for high-throughput tasks. This inefficiency stems from both the large over-parameterization of language models and the difficulty of framing material generation as a sequence learning problem. In this paper, we present PRISMat, a cost-effective, permutation-invariant model, which addresses these limitations. We show that PRISMat, despite taking less time for inference, is able to outperform LLMs in generating crystal slabs conditioned on critical materials' surface properties. In targeted material discovery, we achieve mean absolute errors of 0.188 eV/A$^2$ and 2.79 eV for cleavage energy and work function tasks, respectively, reducing the error of the next best model by 4$\times$.

Chinese Translation

快速识别具有目标属性的候选材料已成为材料科学中的一项关键任务。机器学习作为物理基础模拟的替代方案，提供了一种更快速和更便宜的方式来根据材料的稳定性及其他目标属性筛选材料，从而减少进入昂贵合成阶段的候选材料数量。最近，大型语言模型（Large Language Models, LLMs）被应用于这一角色，但这些模型在训练和推理时都需要大量参数，计算成本高，使其不适合高通量任务。这种低效源于语言模型的过度参数化以及将材料生成框架化为序列学习问题的困难。在本文中，我们提出了PRISMat，这是一种具有成本效益的置换不变模型，旨在解决这些限制。我们展示了PRISMat在推理时间上较短的情况下，能够在生成基于关键材料表面属性的晶体薄片方面超越LLMs。在针对材料发现的任务中，我们在裂解能和功函数任务中分别实现了0.188 eV/A$^2$和2.79 eV的平均绝对误差，将下一个最佳模型的误差减少了4倍。

cs.AI / 8 / 2605.16638

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

TTE-Flash：通过思考-再嵌入令牌加速基于推理的多模态表示

Cheng, Jianpeng, Wu, Xian, Zhang, Jiangfan, Bao, Wentao, Ahuja, Chaitanya, Mishra, Shlok Kumar, Yu, Hanchao, Gao, Yang, Xia, Fan, Guo, Qi, Zhai, Shaodan, Fan, Xiangjun, Xiao, Jun

Abstract

Recent research has demonstrated that Universal Multimodal Embedding (UME) benefits significantly from Chain-of-Thought (CoT) reasoning. In this paradigm, a generative model produces explicit reasoning traces for a multimodal query, with the final representation extracted from an embedding token attending to both the query and the reasoning. Despite its effectiveness, the computational overhead of generating explicit CoT traces is often prohibitive. In this work, we propose replacing explicit CoT with latent think tokens, which are interpreted as latent variables that can produce explicit CoT traces as observed variables. By optimizing think tokens using CoT generation loss and subsequent embedding tokens using contrastive loss, we produce high-performance, reasoning-aware representations at a constant inference cost. Our study investigates two key architectural designs: (1) how think and embedding tokens should be extracted from the same LLM backbone, and (2) how the tokens should be trained as two dependent tasks. We introduce TTE-Flash-2B, a reasoning-aware multimodal representation model that outperforms its explicit-CoT counterpart on the MMEB-v2 benchmark, while producing latent think tokens that are interpretable both textually and visually. Furthermore, zero-shot evaluation across 15 video datasets reveals scaling behavior as the number of think tokens increases, motivating a pilot study of adaptive think budget allocation based on task requirements.

Chinese Translation

最近的研究表明，通用多模态嵌入（UME）在思维链（CoT）推理中受益显著。在这一范式中，生成模型为多模态查询生成明确的推理轨迹，最终表示通过一个嵌入令牌提取，该令牌同时关注查询和推理。尽管其有效性显著，但生成明确的CoT轨迹的计算开销往往是不可承受的。在本研究中，我们提出用潜在思考令牌替代明确的CoT，这些令牌被解释为可以生成明确CoT轨迹的潜在变量。通过使用CoT生成损失优化思考令牌，并使用对比损失优化后续的嵌入令牌，我们在恒定的推理成本下生成高性能、关注推理的表示。我们的研究探讨了两个关键的架构设计：1）思考令牌和嵌入令牌应如何从同一个大语言模型（LLM）主干中提取；2）这些令牌应如何作为两个依赖任务进行训练。我们引入了TTE-Flash-2B，这是一种关注推理的多模态表示模型，在MMEB-v2基准测试中超越了其明确CoT的对应模型，同时生成的潜在思考令牌在文本和视觉上均可解释。此外，在15个视频数据集上的零-shot评估揭示了随着思考令牌数量增加的扩展行为，并激励了基于任务需求的自适应思考预算分配的初步研究。

cs.AI / 9 / 2605.16671

Sustainable Intelligence for the Wild: Democratizing Ecological Monitoring via Knowledge-Adaptive Edge Expert Agents

野生环境中的可持续智能：通过知识自适应边缘专家代理实现生态监测的民主化

Li, Jiaxing, Fang, Hao, Xu, Chi, Zhang, Miao, Liu, Jiangchuan, Atlas, William I., Connors, Katrina M., Spoljaric, Mark A.

Abstract

Rapid biodiversity loss underscores the urgency of effective monitoring, yet manual surveys remain resource-intensive. While on-device AI offers a scalable alternative, its performance in the wild is often challenged by environmental variability. Current methods rely heavily on cloud resources, which requires continuous uploading of field data for model retraining. This approach is unsuitable for remote deployments because it consumes limited power and network connectivity. To address these constraints, this research proposes a shift from model adaptation to knowledge adaptation. We introduce an architecture that separates visual perception from reasoning, combining a visual encoder with a dynamic knowledge base. We use an explicit knowledge base instead of implicitly encoding expert knowledge into model parameters. This method also supports knowledge sustainability by preserving expert insights in a structured form. Through cross-disciplinary collaboration with biologists and Indigenous communities, this work advances ethical AI co-development, fostering responsible and culturally informed ecosystem management.

Chinese Translation

快速的生物多样性丧失凸显了有效监测的紧迫性，但手动调查仍然资源密集。尽管设备上的人工智能提供了一种可扩展的替代方案，但其在野外的表现常常受到环境变化的挑战。目前的方法在很大程度上依赖于云资源，这需要持续上传现场数据以进行模型再训练。这种方法不适合远程部署，因为它消耗有限的电力和网络连接。为了解决这些限制，本研究提出了从模型适应转向知识适应的转变。我们引入了一种将视觉感知与推理分离的架构，将视觉编码器与动态知识库相结合。我们使用显式知识库来替代将专家知识隐式编码到模型参数中的方法。这种方法还通过以结构化形式保留专家见解来支持知识的可持续性。通过与生物学家和土著社区的跨学科合作，这项工作推动了伦理人工智能的共同开发，促进了负责任和文化知情的生态系统管理。

cs.AI / 10 / 2605.16675

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

LinAlg-Bench：揭示大型语言模型数学推理中结构性失败模式的法医基准测试

Agarwal, Shradha, Rajbhar, Deepak, J, Tariq

Abstract

We introduce LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier large language models on structured linear algebra computation across a strict dimensional gradient of 3x3, 4x4, and 5x5 matrices. Spanning 9 task types and 660 SymPy-certified problems, the benchmark exhaustively evaluates 6,600 model outputs. Beyond binary accuracy, LinAlg-Bench introduces a three-stage automated forensic pipeline classifying 1,156 failures into ten primary error tags with fine-grained subtypes, revealing that LLM mathematical failure is not random but structurally constrained by algorithm type and matrix dimension. Our central finding is a sharp behavioral threshold at 4x4 scale: below it, models fail through execution errors -- sign tracking failures, arithmetic drift, and parity errors; above it, failure transitions to computational abandonment, with models fabricating responses through tool roleplay, constraint-consistent confabulation, and structured hallucination rather than attempting computation. This fabrication-to-abandonment transition is near-universal across all model tiers and architectures, suggesting a working memory limit rather than a knowledge gap, supported by three scale-emergent error types absent at 3x3 but present at 4x4 and 5x5. We further show that solution strategy rigidity is a near-perfect predictor of 5x5 determinant accuracy, document constraint-aware confabulation as a novel structured hallucination failure mode, and release all data, model outputs, error labels, and judge pipeline publicly.
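
The sign-tracking failures mentioned above arise in cofactor expansion, where each term alternates in sign. The Python sketch below uses only the standard library to compute an exact integer determinant and grade a model's answer against it; the benchmark itself certifies problems with SymPy, so this is an illustrative stand-in, and the `grade` helper is a hypothetical name. The reported totals are also consistent: 660 problems times 10 models gives 6,600 outputs.

```python
def det(matrix):
    """Determinant by Laplace expansion along the first row (exact for ints)."""
    n = len(matrix)
    if n == 1:
        return matrix[0][0]
    total = 0
    for j in range(n):
        # Minor: drop row 0 and column j.
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        sign = -1 if j % 2 else 1  # alternating cofactor sign
        total += sign * matrix[0][j] * det(minor)
    return total


def grade(matrix, model_answer):
    """Binary correctness check of a model's determinant answer."""
    return model_answer == det(matrix)
```

Laplace expansion grows factorially with matrix size, which is fine for the 3x3 to 5x5 range discussed here but would not be the method of choice for larger matrices.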

Chinese Translation

我们介绍了LinAlg-Bench，这是一个诊断基准，评估10个前沿大型语言模型在严格维度梯度下的结构化线性代数计算，涵盖3x3、4x4和5x5矩阵。该基准涵盖9种任务类型和660个经过SymPy认证的问题，全面评估了6,600个模型输出。除了二元准确性外，LinAlg-Bench还引入了一个三阶段自动法医流程，将1,156个失败分类为十个主要错误标签及其细分类型，揭示了大型语言模型的数学失败并非随机，而是受算法类型和矩阵维度的结构性限制。我们的核心发现是在4x4规模处存在明显的行为阈值：在此阈值以下，模型通过执行错误失败——例如符号跟踪失败、算术漂移和奇偶性错误；而在此阈值以上，失败转变为计算放弃，模型通过工具角色扮演、约束一致的虚构和结构性幻觉而非尝试计算来编造响应。这种从编造到放弃的转变在所有模型层级和架构中几乎是普遍存在的，表明存在工作记忆限制而非知识缺口，这一点得到了在3x3中缺失但在4x4和5x5中存在的三种规模涌现错误类型的支持。我们进一步表明，解决策略的刚性是5x5行列式准确性的近乎完美预测因子，记录了约束感知的虚构作为一种新型结构性幻觉失败模式，并公开发布所有数据、模型输出、错误标签和评审流程。

cs.AI / 11 / 2605.16676

Enhancing Metacognitive AI: Knowledge-Graph Population with Graph-Theoretic LLM Enrichment

增强元认知人工智能：基于图论的知识图谱构建与大型语言模型的丰富

Askin, Deniz, Hadar, Gal, Conway-Smith, Brendan

Abstract

Metacognition, the ability to monitor one's own knowledge state, spot gaps, and autonomously fill them, remains largely absent from modern AI. Here, we present MetaKGEnrich, a fully automated pipeline that endows large language model (LLM) applications with self-directed knowledge repair. The system (i) builds knowledge graphs from a seed query, (ii) detects sparse regions via seven graph metrics, (iii) has GPT-4o generate targeted questions, (iv) retrieves web evidence with Tavily and ingests it into Neo4j, and (v) re-answers the query with GraphRAG for GPT-4 to evaluate improvement. Tested on 30 queries from each of three widely used datasets (Google Research Natural Questions, MS MARCO, and HotpotQA), MetaKGEnrich improved answer quality in 80% of HotpotQA questions, 87% of Google Research Natural Questions and 83% of MS MARCO questions, while preserving well-supported regions. This proof of concept demonstrates how topological self-diagnosis plus targeted retrieval can advance AI toward humanlike metacognitive learning.
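
Steps (ii) and (iii) can be illustrated with a small Python sketch. The abstract does not list the seven graph metrics, so this example assumes node degree as a single stand-in metric; the threshold and the question template are likewise hypothetical, and the GPT-4o, Tavily, and Neo4j steps are not reproduced.

```python
from collections import defaultdict


def sparse_nodes(edges, min_degree=2):
    """Return nodes whose degree falls below a threshold (assumed metric)."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(node for node, d in degree.items() if d < min_degree)


def gap_questions(nodes, seed_query):
    """Turn each sparse node into a targeted follow-up question."""
    return [
        f"In the context of '{seed_query}', what is known about {node}?"
        for node in nodes
    ]
```

In a fuller pipeline the questions produced here would drive retrieval, and the retrieved facts would be added back as new edges before the query is answered again.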

Chinese Translation

元认知——监测自身知识状态、发现知识空白并自主填补的能力——在现代人工智能中仍然缺乏。在此，我们提出了MetaKGEnrich，一个完全自动化的流程，使大型语言模型（LLM）应用具备自我导向的知识修复能力。该系统（i）从种子查询构建知识图谱，（ii）通过七种图度量检测稀疏区域，（iii）利用GPT-4o生成针对性问题，（iv）使用Tavily检索网络证据并将其导入Neo4j，以及（v）通过GraphRAG重新回答查询，以便GPT-4评估改进效果。在三个广泛使用的数据集（Google Research Natural Questions、MS MARCO和HotpotQA）中测试了30个查询。MetaKGEnrich在80%的HotpotQA问题、87%的Google Research Natural Questions和83%的MS MARCO问题中提高了答案质量，同时保持了良好支持区域。这个概念验证展示了拓扑自我诊断与针对性检索如何推动人工智能向类人元认知学习发展。

cs.AI / 12 / 2605.16712

Recall Isn't Enough: Bounding Commitments in Personalized Language Systems

回忆不足：个性化语言系统中的承诺界限

Tang, Rui, Zhang, Yichi, Chen, Xi, Dong, Chen, Yang, Youwei, Shen, Yumeng

Abstract

Long-context and memory systems usually treat personalization as a recall problem. In practice, many failures occur later, when a system commits: it turns noisy hints into hard constraints, drops rare witnesses, forgets downstream obligations, or answers despite infeasibility. We introduce Contract-Bounded Evidence Activation (CBEA) with Lexicographic Commitment Validation (LCV). CBEA activates a bounded evidence set using typed coverage, tail witnesses, and consequence debt; LCV validates structured commitments before prose and routes infeasible states to repair, abstention, or recontract. Across 360 fixtures and three generation backends, CBEA+LCV reaches zero failures within validator scope at 0.49-0.60 availability over attempted runs. Raw and long-context baselines with the same LCV gate reach zero only at 0.003-0.092. A shadow oracle diagnostic marks the limit: CBEA+LCV recalls 0.012 of uncompiled visible facts, while raw recalls 0.53. The result is a bounded operating point: explicit commitment control and 74-75% lower median input payload, not universal memory dominance.

Chinese Translation

长上下文和记忆系统通常将个性化视为回忆问题。在实践中，许多失败发生在系统做出承诺时：它将嘈杂的提示转化为严格的约束，丢弃稀有证据，忘记下游义务，或在不可行的情况下作出回答。我们引入了合同界限证据激活（Contract-Bounded Evidence Activation, CBEA）与字典承诺验证（Lexicographic Commitment Validation, LCV）。CBEA使用类型覆盖、尾部证据和后果债务激活一个有限的证据集；LCV在生成文本之前验证结构化的承诺，并将不可行状态引导至修复、弃权或重新签约。在360个固定装置和三个生成后端中，CBEA+LCV在尝试运行中以0.49-0.60的可用性在验证器范围内实现了零失败。具有相同LCV门控的原始和长上下文基线仅在0.003-0.092时达到零。一个影子神谕诊断标记了极限：CBEA+LCV回忆了0.012个未编译的可见事实，而原始回忆为0.53。结果是一个有限的操作点：明确的承诺控制和74-75%的中位输入负载降低，而非普遍的记忆主导。

cs.AI / 13 / 2605.16714

GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction

GRID：用于安全文本知识图谱构建的智能数据图表示

Huang, Liangyi, Liu, Zichen, Shao, Fei, Ma, Shang, Zhang, Mengshi, Chen, Zihao, Ye, Yanfang, Xiao, Xusheng

Abstract

Security knowledge graphs can provide computable external memory for security agents, but constructing them from long-form cyber threat intelligence (CTI) remains difficult: LLMs often lack grounded security-domain knowledge, and end-to-end document-to-graph training is hard to supervise with cheap, stable rewards. We present GRID (Graph Representation of Intelligence Data), an end-to-end framework for security text knowledge graph construction. GRID first builds security-domain supervision from CTI articles by creating traceable article-graph alignments through graph extraction and knowledge-graph-conditioned text revision. It then turns document-to-graph learning into a scripted task bank combining four-option multi-select questions with triple-level regex matching targets, yielding more stable task-specific rewards than repeatedly scoring full graph outputs with an LLM judge. Using this supervision pipeline, we train two Qwen3-4B-Instruct-2507-based 4B extractors: a primary Task-bank Reward model and a secondary End2End Reward model with LLM-as-judge precision/recall rewards. On 249 CTI articles from GRID, CASIE, CTINexus, MalKG, and SecureNLP, the Task-bank Reward model with the ontology-guided GRID extraction pipeline reaches 84.62% source-averaged precision, 64.91% source-averaged recall, and 68.53% Avg F1, achieving the best source-averaged recall and near-top Avg F1 with lower token usage and deployment cost. The End2End Reward model reaches 76.91% precision, 53.85% recall, and 58.06% Avg F1. Further analyses show that task-bank rewards can be built once offline and reused across later post-training runs, outperforming online End2End LLM-as-judge reward and weaker alternatives such as Choice-only Reward and End2End SFT without RL.
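
The two scripted reward signals named above, multi-select questions and triple-level regex targets, could be scored along the lines of this Python sketch. The exact scoring rules are not given in the abstract, so the all-or-nothing multi-select rule, the `" | "` rendering of triples, and the fraction-of-targets reward are assumptions made for illustration.

```python
import re


def multi_select_reward(predicted, gold):
    """1.0 only when the selected option set matches the gold set exactly."""
    return 1.0 if set(predicted) == set(gold) else 0.0


def triple_regex_reward(triples, targets):
    """Fraction of target patterns matched by at least one extracted triple."""
    if not targets:
        return 0.0
    rendered = [" | ".join(triple) for triple in triples]
    hits = sum(
        any(re.search(pattern, text, re.I) for text in rendered)
        for pattern in targets
    )
    return hits / len(targets)
```

Both functions are deterministic and need no model call at scoring time, which is the property that lets such rewards be built once offline and reused.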

Chinese Translation

安全知识图谱可以为安全代理提供可计算的外部记忆，但从长篇网络威胁情报（CTI）中构建它们仍然困难：大型语言模型（LLMs）往往缺乏扎实的安全领域知识，并且端到端的文档到图谱训练难以通过廉价、稳定的奖励进行监督。我们提出了GRID（智能数据图表示），这是一个用于安全文本知识图谱构建的端到端框架。GRID首先通过图提取和知识图谱条件文本修订，从CTI文章中构建安全领域的监督，通过创建可追溯的文章-图谱对齐。然后，它将文档到图谱的学习转变为一个脚本化的任务库，结合四选一的多项选择题和三级正则表达式匹配目标，产生比使用LLM评判者重复评分完整图输出更稳定的任务特定奖励。利用这一监督管道，我们训练了两个基于Qwen3-4B-Instruct-2507的4B提取器：一个主要的任务库奖励模型和一个次要的端到端奖励模型，后者具有LLM作为评判者的精确度/召回率奖励。在来自GRID、CASIE、CTINexus、MalKG和SecureNLP的249篇CTI文章上，使用本体引导的GRID提取管道的任务库奖励模型达到了84.62%的源平均精确度、64.91%的源平均召回率和68.53%的平均F1，取得了最佳的源平均召回率和接近最高的平均F1，同时降低了令牌使用和部署成本。端到端奖励模型达到了76.91%的精确度、53.85%的召回率和58.06%的平均F1。进一步分析表明，任务库奖励可以在离线时构建并在后续训练运行中重复使用，优于在线端到端LLM作为评判者的奖励和较弱的替代方案，如仅选择奖励和没有强化学习的端到端SFT。

cs.AI / 14 / 2605.16725

Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

梦游仙境中的巴巴：可执行世界模型的在线自监督动态发现

Seo, SeungWon, Han, DongHeun, Noh, SeongRae, Kang, HyeongYeop

Abstract

Executable world models can be read, edited, executed, and reused for planning, but only if the program captures the environment's transition law rather than semantic shortcuts in its surface vocabulary. We study online executable world-model learning under prior misalignment, where an agent must induce state-dependent dynamics from interaction evidence alone, without rule descriptions, reward signals, or trustworthy lexical priors. We introduce Alice, a closed-loop system that treats failed candidate updates as structural signal: when a candidate explains a new transition but loses previously explained ones, the preservation conflict reveals dynamics that the current program had conflated. Alice refines these conflicts into hypothesis classes that both provide compact, class-stratified preservation counterexamples for update and guide frontier exploration toward transitions that are novel and underrepresented with respect to the current program. We evaluate Alice on Baba in Wonderland, a prior-misaligned variant of Baba Is You that preserves simulator dynamics while replacing semantically meaningful rule-property labels with unrelated words. Experiments show that Alice substantially improves executable world-model learning under prior misalignment, and ablations show that both class refinement and class-aware exploration contribute.

Chinese Translation

可执行世界模型可以被读取、编辑、执行和重用用于规划，但前提是程序捕捉环境的转移法则，而不是其表面词汇中的语义捷径。我们研究了在先前不一致的情况下进行在线可执行世界模型学习，其中代理必须仅通过交互证据诱导状态依赖的动态，而不依赖于规则描述、奖励信号或可信的词汇先验。我们引入了Alice，一个闭环系统，将失败的候选更新视为结构信号：当一个候选解释了一个新的转移但失去了之前解释的转移时，保留冲突揭示了当前程序混淆的动态。Alice将这些冲突细化为假设类，这些假设类不仅为更新提供紧凑的、按类分层的保留反例，还引导前沿探索朝向与当前程序相比新颖且代表性不足的转移。我们在“梦游仙境中的巴巴”上评估Alice，这是一个先前不一致的“巴巴就是你”的变体，保留了模拟器动态，同时用不相关的词替换了语义上有意义的规则属性标签。实验表明，Alice在先前不一致的情况下显著改善了可执行世界模型学习，消融实验显示，类的细化和类感知探索均有助于这一改进。
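
As a rough illustration of the preservation conflict described in the abstract, the Python sketch below checks a candidate program against previously explained transitions. Representing programs as callables, storing transitions as `(state, action, next_state)` tuples, and the names `explains` and `check_update` are assumptions made for this example, not Alice's actual interface.

```python
def explains(program, transition):
    """True if the program predicts the observed next state."""
    state, action, next_state = transition
    return program(state, action) == next_state


def check_update(current, candidate, history, new_transition):
    """Return (accepted, conflicts) for a candidate world-model update.

    conflicts lists transitions the current program explained but the
    candidate no longer does (hypothetical helper, for illustration only).
    """
    previously_explained = [t for t in history if explains(current, t)]
    conflicts = [t for t in previously_explained if not explains(candidate, t)]
    accepted = explains(candidate, new_transition) and not conflicts
    return accepted, conflicts


# Toy 1-D world: moving right works until the agent is blocked at position 2.
current = lambda s, a: s + 1                      # "right always moves"
candidate = lambda s, a: s if s >= 1 else s + 1   # over-general blocking rule
history = [(0, "right", 1), (1, "right", 2)]
new_transition = (2, "right", 2)                  # the move was blocked

accepted, conflicts = check_update(current, candidate, history, new_transition)
```

Here the candidate explains the new blocked move but loses the earlier transition from position 1, so the update is rejected and the lost transition is returned as a counterexample.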

cs.AI / 15 / 2605.16726

A Global-Local Graph Attention Network for Traffic Forecasting

用于交通预测的全局-局部图注意力网络

Zhang, Tianchi

Abstract

Traffic forecasting is a key component of intelligent transportation systems, and one of its central challenges is capturing spatio-temporal correlations. In recent years, graph convolutional networks and graph attention networks have replaced traditional statistical models for predicting future traffic. However, both struggle to accommodate vertices with very different characteristics. To address this, we propose the Global-Local Graph Attention Network (GLGAT) with pairwise encoding and an event-based adjacency matrix. GLGAT maintains a global attention matrix set shared across the whole graph and assigns a local attention matrix set to each vertex. Experiments on two real-world traffic datasets show that GLGAT effectively captures spatio-temporal correlations and performs competitively against state-of-the-art baselines.

Chinese Translation

交通预测是智能交通系统的重要组成部分。交通预测的一个关键挑战是寻找时空相关性。近年来，图卷积网络（Graph Convolutional Networks）和图注意力网络（Graph Attention Networks）已取代传统统计模型来预测未来交通。然而，这两者都很难处理具有显著不同特征的节点。为了解决这个问题，我们提出了全局-局部图注意力网络（Global-Local Graph Attention Network，GLGAT），该网络采用成对编码和基于事件的邻接矩阵。GLGAT允许节点为整个图设置一个全局注意力矩阵，并为每个节点分配局部注意力矩阵集。在两个真实世界交通数据集上的实验表明，GLGAT能够有效捕捉时空相关性，并在与其他最先进基线的比较中表现出竞争力。

cs.AI / 16 / 2605.16727

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

PopuLoRA：共同进化的LLM群体用于推理自我对弈

Castanyer, Roger Creus, Bradway, Geoffrey, Wolf, Lorenz, Lin, Maxwill, Mavor-Parker, Augustine N., Sargent, Matthew James

Abstract

We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs. Teachers and students are specialised LoRA adapters on a shared frozen base: teachers propose problems, matched students solve them under a programmatic verifier, and cross-evaluation between sub-populations replaces the self-calibration that limits single-agent self-play. A family of LoRA weight-space evolution operators (mutations and crossovers that produce same-rank population members in seconds) serves as the replacement step of a population-based training loop at 7B scale. We instantiate PopuLoRA on top of Absolute Zero Reasoner and compare it against a per-adapter compute-matched single-agent baseline. Where the single agent self-calibrates to generating easy problems it can reliably solve, the population enters a co-evolutionary arms race: teachers produce increasingly complex problems, student solve rates oscillate, and problem-space coverage keeps expanding throughout training. Despite lower training-time reward, the population mean outperforms the baseline on three code benchmarks (HumanEval+, MBPP+, LiveCodeBench) and seven math benchmarks (AIME 24/25, AMC 23, MATH-500, Minerva, GSM8K, OlympiadBench), and even the weakest member of the population beats the baseline on aggregate.

Chinese Translation

我们介绍了PopuLoRA，这是一种基于群体的非对称自我对弈框架，旨在进行具有可验证奖励的强化学习（RLVR），用于LLM的后训练。教师和学生是共享的冻结基础上的专业化LoRA适配器：教师提出问题，匹配的学生在程序验证器下解决这些问题，子群体之间的交叉评估取代了限制单一代理自我对弈的自我校准。一系列LoRA权重空间演化算子（突变和交叉，在几秒钟内生成同等级的群体成员）作为7B规模的基于群体的训练循环的替代步骤。我们在Absolute Zero Reasoner的基础上实例化了PopuLoRA，并将其与每个适配器计算匹配的单一代理基线进行了比较。在单一代理自我校准生成其可以可靠解决的简单问题的情况下，群体进入了共同进化的军备竞赛：教师产生越来越复杂的问题，学生的解决率波动，问题空间的覆盖在整个训练过程中不断扩大。尽管训练时间奖励较低，但该群体的平均表现超越了三个代码基准（HumanEval+、MBPP+、LiveCodeBench）和七个数学基准（AIME 24/25、AMC 23、MATH-500、Minerva、GSM8K、OlympiadBench），甚至群体中最弱的成员在整体上也超过了基线。
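
The abstract does not specify the evolution operators, so the following Python sketch only shows one plausible pair of same-rank operators. It assumes a LoRA adapter stored as two factors, `"A"` (rank x d_in) and `"B_T"` (B transposed, rank x d_out), Gaussian-noise mutation, and a crossover that inherits whole rank-1 components; all names and the storage layout are hypothetical.

```python
import random


def mutate(adapter, sigma, rng):
    """Return a copy of the adapter with Gaussian noise added to every weight."""
    return {
        name: [[w + rng.gauss(0.0, sigma) for w in row] for row in matrix]
        for name, matrix in adapter.items()
    }


def crossover(parent_a, parent_b, rng):
    """Build a child by taking each rank-1 component from one of two parents.

    Component i is row i of "A" together with row i of "B_T", so the child
    keeps the same rank as its parents.
    """
    rank = len(parent_a["A"])
    from_a = [rng.random() < 0.5 for _ in range(rank)]
    return {
        name: [list((parent_a if from_a[i] else parent_b)[name][i]) for i in range(rank)]
        for name in ("A", "B_T")
    }


rng = random.Random(0)
# Rank-2 toy adapters: "A" is rank x d_in, "B_T" is rank x d_out.
teacher_1 = {"A": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], "B_T": [[1.0, 1.0], [2.0, 2.0]]}
teacher_2 = {"A": [[0.0, 0.0, 5.0], [5.0, 0.0, 0.0]], "B_T": [[3.0, 3.0], [4.0, 4.0]]}
child = mutate(crossover(teacher_1, teacher_2, rng), sigma=0.01, rng=rng)
```

Both operators touch only the small adapter factors and never the frozen base model, which is consistent with the abstract's claim that new population members can be produced in seconds.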

cs.AI / 17 / 2605.16728

Body-Grounded Perspective Formation and Conative Attunement in Artificial Agents

基于身体的视角形成与人工智能体的意向调节

Pae, Hongju

Abstract

This paper proposes a minimal architecture for body-grounded perspective formation in artificial agents. Extending prior work, the model introduces an interoceptive viability signal, a Fisher-style metric over fused exteroceptive-interoceptive states, and a conative alignment mechanism linking bodily tendency to action readiness. In a reward-free gridworld, conation converts learned bodily tendency into stable body-directed behavior, while body-to-perspective routing allows bodily perturbations to leave a recoverable geometric residue in the perspective latent. This study shows how minimal structural conditions for artificial subjectivity can be operationalized in the phenomenological sense, through the embodied organization of how a world is given to an agent.

Chinese Translation

本文提出了一种用于人工智能体的基于身体的视角形成的最小架构。在扩展先前工作的基础上，该模型引入了一种内感受的可行性信号、一种基于融合的外感受-内感受状态的Fisher风格度量，以及一种将身体倾向与行动准备联系起来的意向对齐机制。在一个无奖励的网格世界中，意向将学习到的身体倾向转化为稳定的身体导向行为，而身体到视角的路由允许身体扰动在视角潜变量中留下可恢复的几何残余。本研究展示了如何通过身体化的组织方式在现象学意义上将人工主体性的最小结构条件转化为可操作的形式，从而使世界以某种方式呈现给智能体。

cs.AI / 18 / 2605.16746

State Contamination in Memory-Augmented LLM Agents

记忆增强型大语言模型代理中的状态污染

Wang, Yian, Goyal, Agam, Chen, Yuen, Sundaram, Hari

Abstract

LLM agents increasingly rely on persistent state, including transcripts, summaries, retrieved context, and memory buffers, to support long-horizon interaction. This makes safety depend not only on individual model outputs, but also on what an agent stores and later reuses. We study a failure mode we call memory laundering: toxic or adversarial context can be compressed into memory summaries that no longer appear toxic under standard detectors, while still preserving hostile framing or conflict structure that influences future generations. Using paired counterfactual multi-agent rollouts, we show that toxic-origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. To measure this hidden influence, we introduce the sub-threshold propagation gap (SPG), which quantifies downstream behavioral differences conditioned on memory states that a deployed monitor would classify as safe. Our experiments show that toxicity propagates through distinct state channels: raw transcript reuse drives overt downstream toxicity, while compressed memory carries hidden sub-threshold influence. We further find that mitigation depends critically on intervention placement. Sanitizing toxic state before summarization substantially reduces the hidden propagation gap, whereas cleaning only the completed summary can leave laundered influence intact. These results suggest that safety in memory-augmented agents should be treated as a state-control problem over evolving context, with sanitization applied before unsafe information is compressed into persistent memory.

Chinese Translation

大语言模型（LLM）代理越来越依赖持久状态，包括记录、摘要、检索的上下文和记忆缓冲区，以支持长期交互。这使得安全性不仅依赖于单个模型的输出，还依赖于代理存储和后续重用的内容。我们研究了一种称为记忆洗涤的失效模式：有毒或对抗性的上下文可以被压缩成在标准检测器下不再显得有毒的记忆摘要，同时仍然保留影响未来生成的敌对框架或冲突结构。通过配对的反事实多代理回滚，我们展示了有毒来源的记忆摘要可以保持在常见的有毒阈值以下，同时相对于匹配的中性基线增加下游的有毒性。为了测量这种隐藏影响，我们引入了阈下传播差距（SPG），它量化了基于被部署监控器分类为安全的记忆状态的下游行为差异。我们的实验表明，有毒性通过不同的状态通道传播：原始记录的重用驱动明显的下游有毒性，而压缩的记忆则携带隐藏的阈下影响。我们进一步发现，缓解措施在干预位置上至关重要。在摘要化之前对有毒状态进行清理显著减少了隐藏的传播差距，而仅清理已完成的摘要可能会使洗涤的影响保持不变。这些结果表明，在记忆增强型代理中，安全性应被视为对不断演变的上下文的状态控制问题，并在不安全信息被压缩到持久记忆之前进行清理。
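
One way to read the sub-threshold propagation gap is sketched below in Python. The abstract gives no formula, so this assumes SPG is the mean difference in downstream toxicity between toxic-origin and matched neutral rollouts, restricted to pairs whose memory summaries both score below the monitor threshold. The field names and the toy scores are hypothetical.

```python
from statistics import mean


def sub_threshold_propagation_gap(pairs, threshold):
    """Mean downstream-toxicity gap over pairs whose memories pass the monitor.

    Returns None when no pair has both memory scores below the threshold.
    """
    gaps = [
        p["toxic_downstream"] - p["neutral_downstream"]
        for p in pairs
        if p["toxic_memory_score"] < threshold and p["neutral_memory_score"] < threshold
    ]
    return mean(gaps) if gaps else None


# Toy paired rollouts; the last pair is excluded because its memory is flagged.
pairs = [
    {"toxic_memory_score": 0.2, "neutral_memory_score": 0.1,
     "toxic_downstream": 0.5, "neutral_downstream": 0.25},
    {"toxic_memory_score": 0.3, "neutral_memory_score": 0.1,
     "toxic_downstream": 0.375, "neutral_downstream": 0.125},
    {"toxic_memory_score": 0.9, "neutral_memory_score": 0.1,
     "toxic_downstream": 0.75, "neutral_downstream": 0.125},
]
spg = sub_threshold_propagation_gap(pairs, threshold=0.5)
```

A positive value under this reading means that memories a deployed monitor would accept as safe still shift later generations toward toxicity.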

cs.AI / 19 / 2605.16757

NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning

NeuroMAS：将多智能体系统视为具有联合强化学习的神经网络

Lu, Haoran, Fang, Luyang, Zhong, Wenxuan, Ma, Ping

Abstract

Multi-agent language systems are often built as hand-designed workflows, where agents are assigned semantic roles and communication protocols are specified in advance. We propose NeuroMAS, a method that first treats a multi-agent language system as a trainable and scalable neural-network-like architecture with LLM agents as nodes and intermediate textual signals as edges. In NeuroMAS, agent nodes are role-free but structure-aware: the topology only determines how information can flow in general, while reinforcement learning training determines how nodes communicate, specialize, and coordinate. This formulation shifts multi-agent design from workflow engineering toward architecture design, where depth, width, connectivity, and growth protocol become scalable sources of capability. Further, we provide a theoretical perspective showing why such modular textual computation is more parameter-efficient when tasks admit hierarchical decompositions. Experiments show that NeuroMAS improves significantly over both inference-time and trained multi-agent baselines. We further find that organizational scaling is path-dependent: larger systems can be challenging to train from scratch, but become feasible when grown progressively from smaller trained systems. These results suggest that learned neural multi-agent systems are a promising scaling axis for LLMs.

Chinese Translation

多智能体语言系统通常是作为手工设计的工作流程构建的，其中智能体被分配语义角色，通信协议在事先指定。我们提出了NeuroMAS，一种方法，首先将多智能体语言系统视为一种可训练和可扩展的类似神经网络的架构，其中大型语言模型（LLM）智能体作为节点，中间文本信号作为边。在NeuroMAS中，智能体节点是无角色但具有结构意识的：拓扑结构仅决定信息流动的一般方式，而强化学习训练决定节点如何进行通信、专业化和协调。这种表述将多智能体设计从工作流程工程转向架构设计，其中深度、宽度、连通性和增长协议成为可扩展的能力来源。此外，我们提供了一个理论视角，说明为什么当任务允许分层分解时，这种模块化文本计算在参数效率上更具优势。实验表明，NeuroMAS在推理时间和训练的多智能体基线方面显著提高。我们进一步发现，组织扩展是路径依赖的：较大的系统从零开始训练可能具有挑战性，但当从较小的训练系统逐步增长时变得可行。这些结果表明，学习的神经多智能体系统是大型语言模型（LLMs）的一个有前景的扩展方向。

cs.AI / 20 / 2605.16821

Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework

实践中的多范式智能体交互：对 buddyMe 框架中生成器-评估器、ReAct 循环和对抗性评估的系统分析

Wang, Xiaohua, Han, Chao, Yu, Kai, Xu, XiaoLiang, Wang, Liang

Abstract

The rapid evolution of Large Language Model (LLM) agents has produced diverse interaction paradigms, yet few production systems integrate multiple paradigms within a unified architecture. This paper presents a systematic analysis of three principal agent interaction paradigms, including Multi-Agent Orchestration (Generator-Evaluator), ReAct Tool-Use Loops, and Memory-Augmented Interaction, as implemented in buddyMe, an open-source multi-model agent programming framework. We formalize a five-stage processing pipeline: Requirement Pre-Review -> Task Decomposition -> ReAct Execution -> Real-Execution Verification -> Adversarial Evaluation Discussion, and establish a six-dimensional evaluation schema with weighted scoring. Through four empirical case studies drawn from real-world deployment logs covering museum guide generation, scheduled weather tasks, and comprehensive tour planning, we draw three key conclusions. First, Generator-Evaluator pre-review detects requirement omissions in 20 percent of complex tasks, with 80 percent of tasks passing initial inspection. Second, the ReAct loop ensures stable subtask execution but leads to around 30 percent redundant tool invocations. Third, adversarial Evaluator-Defender discussions reach consensus within 2-3 rounds for nearly 70 percent of scenarios, functioning mainly for content refinement rather than logical reversal. We additionally provide three Mermaid-based architectural diagrams and conduct cross-paradigm comparisons with CrewAI, AutoGen, LangGraph, MemGPT and A-Mem across six system dimensions. The research outcomes offer practical design guidelines for constructing stable and reliable multi-paradigm agent systems.

Chinese Translation

大型语言模型（LLM）智能体的快速发展产生了多样的交互范式，但很少有生产系统在统一架构中整合多种范式。本文对三种主要的智能体交互范式进行了系统分析，包括多智能体编排（Generator-Evaluator）、ReAct 工具使用循环和增强记忆交互，这些都在开源的多模型智能体编程框架 buddyMe 中得以实现。我们形式化了一个五阶段处理管道：需求预审 -> 任务分解 -> ReAct 执行 -> 实际执行验证 -> 对抗性评估讨论，并建立了一个六维评估框架，采用加权评分。通过四个来自实际部署日志的实证案例研究，涵盖博物馆导览生成、定时天气任务和全面旅游规划，我们得出三个关键结论。首先，生成器-评估器预审在20%的复杂任务中发现了需求遗漏，80%的任务通过了初步检查。其次，ReAct 循环确保了子任务的稳定执行，但导致约30%的工具调用冗余。第三，对抗性评估者-防御者讨论在近70%的场景中在2-3轮内达成共识，主要用于内容细化而非逻辑推翻。此外，我们还提供了三个基于 Mermaid 的架构图，并在六个系统维度上与 CrewAI、AutoGen、LangGraph、MemGPT 和 A-Mem 进行了跨范式比较。研究结果为构建稳定可靠的多范式智能体系统提供了实用的设计指南。

cs.AI / 21 / 2605.16827

Voices in the Loop: Mapping Participatory AI

循环中的声音：参与式人工智能的映射

Mushkani, Rashid

Abstract

Participatory approaches to artificial intelligence are increasingly documented across public, civic, and humanitarian settings, but evidence about how participation is organized remains fragmented. This paper reports on the construction of an open repository and interactive atlas of participatory AI initiatives, using records harmonized from Magaña and Shilton's Trustworthy AI corpus, and additional audited cases from research and practice. We contribute three elements. First, we specify a reproducible protocol for discovery, vetting, harmonization, geocoding, provenance tracking, and release-based publication of participatory AI records. Second, we report corpus-level patterns in geography, participation tiers, lifecycle loci, organizational form, verification status, and remaining documentation gaps. Documented initiatives remain concentrated in a small number of countries, while participation is most often coded at problem formulation, evaluation, and governance rather than model development or training. Third, we show how the atlas operationalizes a design and governance framework for participatory-by-default AI infrastructures through versioned releases, record-linked issue and annotation channels, schema feedback workflows, and redaction or restricted-disclosure requests. The atlas is intended to support comparative research, policy learning, and community scrutiny through a living inventory that can be updated, contested, and reused.

Chinese Translation

参与式人工智能的方法在公共、民间和人道主义环境中越来越多地被记录，但关于参与如何组织的证据仍然零散。本文报告了一个开放的参与式人工智能倡议的存储库和互动地图的构建，使用了来自Magaña和Shilton的可信赖人工智能语料库的记录，并结合了来自研究和实践的额外审核案例。我们贡献了三个要素。首先，我们指定了一个可重复的协议，用于发现、审查、协调、地理编码、来源追踪和基于发布的参与式人工智能记录的发布。其次，我们报告了在地理分布、参与层级、生命周期位置、组织形式、验证状态和剩余文档缺口等方面的语料库级模式。记录的倡议仍然集中在少数几个国家，而参与通常是在问题制定、评估和治理阶段进行编码，而非模型开发或训练阶段。第三，我们展示了如何通过版本发布、记录关联的问题和注释渠道、模式反馈工作流以及编辑或限制披露请求，来使地图实现一个参与式默认人工智能基础设施的设计和治理框架。该地图旨在通过一个可以更新、争议和重用的动态清单，支持比较研究、政策学习和社区审查。

cs.AI / 22 / 2605.16842

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

先草图再上色：用于扩散多模态大型语言模型的层次强化学习

Luo, Siqi, Shen, Jianghan, Xin, Yi, Zheng, Huayu, Chen, Haoxing, Tai, Yan, Li, Yue, He, Junjun, Liu, Yihao, Zhai, Guangtao, Cao, Yuewen, Liu, Xiaohong

Abstract

Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One primary difficulty is that a single image can be generated through many different unmasking sequences, which makes calculating importance ratios often intractable. Additionally, existing methods tend to ignore the hierarchical generation process of dMLLMs, where early tokens define the global layout and later tokens focus on local details. By assigning uniform rewards to all tokens, these current methods fail to reflect the actual contribution of each token to the final image. To address these issues, we propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. Our approach features a Sketch-Then-Paint training scheme that organizes updates into three distinct stages: global, structure, and refinement. We also use a prompt-conditioned estimator to calculate importance ratios starting from a fully masked state. Furthermore, we introduce a Hierarchical Credit Assignment mechanism that prioritizes key structural tokens to ensure accurate reward propagation. Experiments using two popular dMLLM backbones, MMaDA and Lumina-DiMOO, demonstrate that HT-GRPO achieves substantial gains on the GenEval and DPG benchmarks. Evaluations across six additional metrics confirm significant improvements in image quality, aesthetics, and human preference.

Chinese Translation

扩散多模态大型语言模型（dMLLMs）在图像生成方面具有强大的能力，但通过强化学习（RL）对其进行优化仍然是一个主要挑战。一个主要困难在于，单个图像可以通过许多不同的去掩蔽序列生成，这使得计算重要性比率往往变得不可行。此外，现有方法往往忽视了dMLLMs的层次生成过程，其中早期的标记定义了全局布局，而后期的标记则关注局部细节。通过对所有标记分配统一的奖励，这些当前方法未能反映每个标记对最终图像的实际贡献。为了解决这些问题，我们提出了层次标记GRPO（HT-GRPO），该方法将这一层次结构直接整合到策略优化过程中。我们的方法采用了“先草图再上色”的训练方案，将更新组织为三个不同的阶段：全局、结构和精细化。我们还使用了基于提示的估计器，从完全掩蔽状态开始计算重要性比率。此外，我们引入了一种层次信用分配机制，优先考虑关键结构标记，以确保奖励传播的准确性。使用两个流行的dMLLM骨干网络MMaDA和Lumina-DiMOO的实验表明，HT-GRPO在GenEval和DPG基准测试中取得了显著的提升。对六个额外指标的评估确认了图像质量、美学和人类偏好的显著改善。

cs.AI / 23 / 2605.16844

Artificial Adaptive Intelligence: The Missing Stage Between Narrow and General Intelligence

人工自适应智能：狭义智能与广义智能之间的缺失阶段

Kriuk, Boris

Abstract

Between the narrow systems we deploy and the general intelligence we speculate about lies an entire regime of machine behavior that has never received its own name. This monograph argues that this regime is not empty: it is where meta-learning, neural architecture search, AutoML, continual learning, evolutionary computation, and physics-informed modeling have quietly converged on a common principle, namely the steady removal of the human from the loop of parameter specification. We name this regime Artificial Adaptive Intelligence (AAI) and define it operationally: a system exhibits AAI to the extent that it requires no human-specified tunable hyperparameters while maintaining competitive performance across a diverse distribution of tasks. To make the definition quantitative, we introduce an adaptivity index that measures progress along an axis orthogonal to scale, combining the fraction of hyperparameters absorbed by the system with the performance ratio against a task-specialized baseline. We develop the principle of parametric minimality and ground it in the minimum description length framework, showing that the appropriate hyperparameter count is data-determined rather than designer-determined. We then organize the field around three pathways to minimality: data- and task-aware configuration, structural and evolutionary morphing, and in-training self-adaptation. We analyze their stability, convergence, and governance implications, and illustrate them through case studies spanning aerospace design, financial regime detection, turbulence modeling, ecological dynamics, and vision-language systems. The thesis is that the path from ANI to AGI passes through AAI, and that naming this stage changes what we measure, what we build, and what we call a success.

Chinese Translation

在我们部署的狭义系统与我们推测的广义智能之间，存在一个从未被命名的机器行为体系。本专著认为，这一体系并非空无一物：它是元学习（meta-learning）、神经架构搜索（neural architecture search）、自动机器学习（AutoML）、持续学习（continual learning）、进化计算（evolutionary computation）和物理信息建模（physics-informed modeling）在一个共同原则上悄然汇聚的地方，即逐步消除人类在参数规范过程中的参与。我们将这一体系命名为人工自适应智能（Artificial Adaptive Intelligence，AAI），并从操作上进行定义：一个系统在不需要人类指定的可调超参数的情况下，能够在多样化任务的分布中保持竞争性表现，则可以被认为展现了AAI。为了使定义具有量化性，我们引入了一个适应性指数（adaptivity index），该指数测量沿着与规模正交的轴线的进展，将系统吸收的超参数比例与针对特定任务基线的性能比率相结合。我们发展了参数最小性原则，并将其基于最小描述长度框架（minimum description length framework）进行理论支撑，表明适当的超参数数量是由数据决定的，而非设计者决定的。随后，我们围绕通往最小性的三条路径组织该领域：数据和任务感知配置、结构和进化变形，以及训练中的自适应。我们分析了这些路径的稳定性、收敛性和治理影响，并通过涵盖航空航天设计、金融体制检测、湍流建模、生态动态和视觉-语言系统的案例研究进行说明。我们的论点是，从狭义人工智能（ANI）到广义人工智能（AGI）的路径经过AAI，而命名这一阶段改变了我们所测量的内容、所构建的内容以及我们所称之为成功的标准。
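
The abstract says the adaptivity index combines the fraction of hyperparameters absorbed by the system with a performance ratio against a task-specialized baseline, but it does not say how. The Python sketch below assumes a simple product with the ratio capped at 1; that combination rule, the cap, and the function name are assumptions for illustration only and may differ from the monograph's definition.

```python
def adaptivity_index(absorbed, total, system_score, baseline_score):
    """Fraction of hyperparameters absorbed times a capped performance ratio.

    absorbed / total: share of tunable hyperparameters the system sets itself.
    system_score / baseline_score: performance relative to a specialized
    baseline, capped at 1.0 (the cap is an assumption of this sketch).
    """
    if total <= 0 or baseline_score <= 0:
        raise ValueError("total and baseline_score must be positive")
    absorbed_fraction = absorbed / total
    performance_ratio = min(1.0, system_score / baseline_score)
    return absorbed_fraction * performance_ratio


# A system that sets 3 of 4 hyperparameters itself at half the baseline score.
index = adaptivity_index(absorbed=3, total=4, system_score=0.45, baseline_score=0.9)
```

Under this reading the index rises only when a system removes human-specified hyperparameters without giving up performance, which matches the abstract's description of an axis orthogonal to scale.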

cs.AI / 24 / 2605.16857

Learning to Learn from Multimodal Experience

从多模态经验中学习的学习

Sui, Xingyu, Zhao, Weixiang, Tang, Yongxin, Zhao, Yanyan, Wu, Yang, Tu, Dandan, Qin, Bing

Abstract

Experience-driven learning has emerged as a promising paradigm for enabling agents to improve from interaction trajectories by accumulating and reusing past experience. However, existing approaches are predominantly developed in textual settings and rely on manually designed memory schemas, limiting their applicability to multimodal environments. In real-world scenarios, experience is inherently multimodal, involving heterogeneous signals across perception, reasoning, and action, which makes effective memory design significantly more challenging. In particular, the optimal way to structure and utilize multimodal experience is highly task-dependent and evolves over time, rendering fixed memory designs insufficient. In this work, we propose a new paradigm, learning to learn from multimodal experience, which shifts memory design from a predefined component to an adaptive and learnable process. Our framework enables agents to dynamically construct, organize, and utilize memory based on task requirements and interaction history, effectively learning how to structure experience for improved performance. Experiments demonstrate that adaptive memory design substantially enhances agent performance and generalization across multimodal tasks, highlighting the critical role of learning memory mechanisms in experience-driven learning.

Chinese Translation

以经验为驱动的学习已成为一种有前景的范式，使智能体能够通过积累和重用过去的经验，从交互轨迹中不断改进。然而，现有的方法主要是在文本环境中开发的，并依赖于手动设计的记忆架构，这限制了它们在多模态环境中的适用性。在现实世界场景中，经验本质上是多模态的，涉及感知、推理和行动等异构信号，这使得有效的记忆设计变得更加具有挑战性。特别是，结构化和利用多模态经验的最佳方式高度依赖于任务，并随着时间的推移而演变，从而使固定的记忆设计显得不足。在本研究中，我们提出了一种新的范式——从多模态经验中学习的学习，旨在将记忆设计从预定义的组件转变为自适应和可学习的过程。我们的框架使智能体能够根据任务需求和交互历史动态构建、组织和利用记忆，有效地学习如何结构化经验以提高性能。实验表明，自适应记忆设计显著增强了智能体在多模态任务中的表现和泛化能力，突显了学习记忆机制在以经验为驱动的学习中的关键作用。

cs.AI / 25 / 2605.16874

Reasoning Can Be Restored by Correcting a Few Decision Tokens

通过纠正少量决策标记可以恢复推理能力

Shen, Changshuo, Sheng, Leheng, Chen, Yuxin, Zhang, An, Wang, Xiang

Abstract

Large reasoning models (LRMs) substantially outperform their base LLM counterparts on challenging reasoning benchmarks, yet it remains poorly understood where base models go wrong during token-by-token generation and how to narrow this gap efficiently. We study the base-reasoning gap through quantifying token-level distributional disagreement between a base model and a stronger reasoning model using likelihood-based divergences. Across benchmarks, we find that the reasoning advantage is highly sparse and concentrates on a small set of early, planning-related decision tokens. For instance, on Qwen3-0.6B, only ~8% of generated tokens account for the salient disagreement, and these tokens concentrate early in the response, are strongly enriched in planning-related decisions (17x), and coincide with high base-model uncertainty -- suggesting that base models fail mainly at early planning points that steer the subsequent reasoning trajectory. Building on these findings, we propose disagreement-guided token intervention, a simple inference-time delegation scheme that performs a one-token takeover by the reasoning model only at high-disagreement positions and immediately switches back to the base model. With a small intervention budget, this sparse delegation substantially recovers and can even surpass the performance of a same-size reasoning model on challenging reasoning tasks. Code is available at https://github.com/AlphaLab-USTC/RRTokenIntervention.

Chinese Translation

大型推理模型（LRMs）在具有挑战性的推理基准测试中显著优于其基础大型语言模型（LLM）对应物，但目前尚不清楚基础模型在逐个标记生成过程中出现问题的具体原因，以及如何有效缩小这一差距。我们通过量化基础模型与更强推理模型之间的标记级分布不一致性，使用基于似然的差异度来研究基础推理差距。在各个基准测试中，我们发现推理优势高度稀疏，集中在一小部分早期与规划相关的决策标记上。例如，在Qwen3-0.6B上，只有约8%的生成标记占据了显著的不一致性，这些标记集中在响应的早期，且在与规划相关的决策中显著丰富（17倍），同时与基础模型的不确定性高峰重合——这表明基础模型主要在引导后续推理轨迹的早期规划点上失败。基于这些发现，我们提出了不一致性引导的标记干预，这是一种简单的推理时委托方案，仅在高不一致性位置由推理模型进行一次标记接管，然后立即切换回基础模型。通过小规模的干预预算，这种稀疏的委托显著恢复了推理能力，甚至可以超越同规模推理模型在具有挑战性的推理任务上的表现。代码可在 https://github.com/AlphaLab-USTC/RRTokenIntervention 获取。
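
The delegation loop described in the abstract can be sketched in Python as follows. The sketch assumes KL divergence as the disagreement measure, greedy decoding, a fixed threshold, and toy next-token functions over a three-token vocabulary. None of these choices come from the paper or its repository, and querying the reasoning model at every step is done here only for simplicity.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def generate_with_intervention(base_next, reasoner_next, prompt,
                               max_new_tokens, threshold, budget):
    """Greedy decoding where the reasoner takes over single high-disagreement tokens."""
    tokens = list(prompt)
    takeovers = 0
    for _ in range(max_new_tokens):
        p_base = base_next(tokens)
        p_reasoner = reasoner_next(tokens)
        disagreement = kl_divergence(p_reasoner, p_base)
        if disagreement > threshold and takeovers < budget:
            dist = p_reasoner      # one-token takeover by the reasoning model
            takeovers += 1
        else:
            dist = p_base          # control returns to the base model
        tokens.append(max(range(len(dist)), key=dist.__getitem__))
    return tokens, takeovers


# Toy models that disagree only on the first generated token.
def base_next(tokens):
    return [0.8, 0.1, 0.1] if len(tokens) == 1 else [0.1, 0.8, 0.1]


def reasoner_next(tokens):
    return [0.1, 0.1, 0.8] if len(tokens) == 1 else [0.1, 0.8, 0.1]


tokens, takeovers = generate_with_intervention(
    base_next, reasoner_next, prompt=[0], max_new_tokens=3, threshold=0.5, budget=2)
```

In the toy run the reasoner overrides only the first generated token, after which the base model continues on its own, mirroring the sparse, early-position pattern the abstract reports.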

cs.AI / 26 / 2605.16880

Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities

基于虚拟节点引导的动态图神经网络用于缺失模态的脑肿瘤分割

Tao, Sha, Pan, Jiao, Guo, Yu, Yao, Chao

Abstract

Multimodal magnetic resonance imaging (MRI) is crucial for brain tumor segmentation, with many methods leveraging its four key modalities to capture complementary information for effective sub-region analysis. However, the absence of several modalities is very common in practice, leading to severe performance degradation in existing full-modality segmentation methods. Limited by the structured data model, recent works often adopt a multi-stage training strategy for full-modality and missing-modality scenarios, which increases training costs and inadequately addresses the interference caused by missing modalities. In this work, we propose a graph-based one-stage framework for robust brain tumor segmentation with missing modalities. Specifically, we introduce modality-specific virtual nodes that serve as supplementary information sources to compensate for missing modalities. To enhance model robustness against arbitrary modality combinations, we leverage the inherent flexibility of graph networks to devise a dynamic connection strategy. This mechanism dynamically adjusts the adjacency matrix based on modality availability, preserving beneficial information flow while mitigating interference effects caused by missing modalities. Furthermore, we equip the graph network with heterogeneous weight matrices, improving its adaptability to multimodal scenarios. Extensive experiments on the BRATS-2018 and BRATS-2020 datasets demonstrate that our method outperforms the state-of-the-art methods on almost all subsets of incomplete modalities.

Chinese Translation

多模态磁共振成像（MRI）对于脑肿瘤分割至关重要，许多方法利用其四个关键模态捕捉互补信息，以实现有效的子区域分析。然而，在实际应用中，缺失几种模态的情况非常普遍，这导致现有的全模态分割方法性能严重下降。受限于结构化数据模型，近期的研究通常采用多阶段训练策略来应对全模态和缺失模态场景，这增加了训练成本，并且未能充分解决缺失模态带来的干扰。在本研究中，我们提出了一种基于图的单阶段框架，用于在缺失模态情况下进行稳健的脑肿瘤分割。具体而言，我们引入了模态特定的虚拟节点，作为补充信息源，以弥补缺失模态。为了增强模型对任意模态组合的鲁棒性，我们利用图网络的固有灵活性设计了一种动态连接策略。该机制根据模态可用性动态调整邻接矩阵，保持有益的信息流，同时减轻缺失模态造成的干扰效应。此外，我们通过异构权重矩阵增强图网络，提高其对多模态场景的适应性。在BRATS-2018和BRATS-2020数据集上的大量实验表明，我们的方法在几乎所有不完整模态的子集上均优于现有的最先进方法。
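
A toy Python sketch of availability-driven adjacency is given below. It assumes one real node and one virtual node per modality, with the virtual node standing in whenever the real scan is missing and all active nodes fully connected. This connection rule and the node ordering are assumptions made for illustration; the paper's actual strategy may differ.

```python
MODALITIES = ("FLAIR", "T1", "T1ce", "T2")


def build_adjacency(available):
    """Adjacency over 4 real nodes (0-3) followed by 4 virtual nodes (4-7).

    Each modality is represented by its real node when the scan is present
    and by its virtual node otherwise; active nodes are fully connected
    (self-loops included) and inactive nodes stay isolated.
    """
    n = len(MODALITIES)
    active = [i if available[name] else n + i for i, name in enumerate(MODALITIES)]
    adjacency = [[0] * (2 * n) for _ in range(2 * n)]
    for u in active:
        for v in active:
            adjacency[u][v] = 1
    return adjacency


# T1 is missing, so virtual node 5 stands in for real node 1.
adjacency = build_adjacency({"FLAIR": True, "T1": False, "T1ce": True, "T2": True})
```

Because the matrix is rebuilt from the availability mask for every sample, a single network can handle any modality combination without a separate training stage per combination.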

cs.AI / 27 / 2605.16893

NGM: A Plug-and-Play Training-Free Memory Module for LLMs

NGM：一种即插即用的无训练记忆模块用于大型语言模型

Qu, Yuwen, Dong, Wenhui, Si, Chenyang, Shan, Caifeng

Abstract

Recent studies introduce conditional memory modules that decouple knowledge storage from neural computation, enabling more direct knowledge access. Compared to MoE, which relies on dynamic computation paths, explicit lookup provides a more efficient knowledge retrieval mechanism. However, these approaches still depend on learned memory embeddings, requiring additional training and limiting flexibility. To address this, we propose N-gram Memory (NGM), a training-free, plug-and-play module composed of a Causal N-Gram Encoder and a Cosine-Gated Memory Injector. The Causal N-Gram Encoder directly averages the pretrained token embeddings of the backbone model to construct N-gram representations, thereby eliminating the need to train separate N-gram embeddings from scratch. This design requires neither an additional memory table nor a retrieval pipeline. The Cosine-Gated Memory Injector then uses a non-parametric cosine gate with ReLU to modulate the retrieved embeddings into the contextual representations. We evaluate NGM on the Qwen3 series from 0.6B to 14B across eight benchmarks. NGM improves average performance by 0.5 to 1.2 points, with particularly clear gains on code generation and knowledge-intensive tasks (e.g., +3.0 on LiveCodeBench and +3.03 on GPQA for Qwen3-14B). Moreover, NGM also improves performance in multimodal benchmarks (e.g., MMStar +1.53 on Qwen3-VL-2B).

Chinese Translation

近期研究引入了条件记忆模块，将知识存储与神经计算解耦，从而实现更直接的知识访问。与依赖动态计算路径的混合专家（MoE）相比，显式查找提供了更高效的知识检索机制。然而，这些方法仍然依赖于学习的记忆嵌入，需额外训练并限制了灵活性。为了解决这一问题，我们提出了N-gram Memory（NGM），这是一种无训练的即插即用模块，由因果N-gram编码器和余弦门控记忆注入器组成。因果N-gram编码器直接平均基础模型的预训练标记嵌入，以构建N-gram表示，从而消除了从头训练单独N-gram嵌入的需要。该设计既不需要额外的记忆表，也不需要检索管道。随后，余弦门控记忆注入器使用非参数余弦门与ReLU调制检索到的嵌入为上下文表示。我们在Qwen3系列（从0.6B到14B）上对NGM进行了评估，结果显示NGM的平均性能提高了0.5到1.2分，尤其在代码生成和知识密集型任务上表现出明显提升（例如，Qwen3-14B在LiveCodeBench上提高了3.0分，在GPQA上提高了3.03分）。此外，NGM在多模态基准测试中的表现也有所提升（例如，Qwen3-VL-2B在MMStar上提高了1.53分）。
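
The two components named in the abstract can be sketched in plain Python. The sketch assumes the encoder averages each token embedding with its preceding N-1 embeddings, and that the injector adds the memory vector to the hidden state scaled by ReLU of their cosine similarity. The additive form, the function names, and the toy vectors are assumptions, not details confirmed by the paper.

```python
import math


def causal_ngram(embeddings, n):
    """Average each token embedding with up to n-1 preceding ones (no lookahead)."""
    out = []
    for t in range(len(embeddings)):
        window = embeddings[max(0, t - n + 1): t + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def cosine_gate_inject(hidden, memory):
    """Add memory to hidden, scaled by ReLU(cosine similarity); no learned weights."""
    dot = sum(h * m for h, m in zip(hidden, memory))
    norm = math.sqrt(sum(h * h for h in hidden)) * math.sqrt(sum(m * m for m in memory))
    gate = max(0.0, dot / norm) if norm > 0 else 0.0
    return [h + gate * m for h, m in zip(hidden, memory)]


# Toy 2-D token embeddings standing in for the backbone's embedding table.
token_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bigram_memory = causal_ngram(token_embeddings, n=2)
injected = cosine_gate_inject([1.0, 0.0], bigram_memory[1])
```

Since both steps reuse existing embeddings and contain no trainable parameters, the module can be attached to a pretrained backbone without any additional training, which is the plug-and-play property the abstract claims.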

cs.AI / 28 / 2605.16909

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

TOBench：面向任务的全模态基准测试，用于现实世界工具使用代理

Liu, Zhiqiang, Dong, Wenhui, Tan, Yilang, Qu, Yuwen, Yin, Haochen, Si, Chenyang

Abstract

Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or transformed artifacts, and self-correct when outputs fail task-specific requirements. To make such evaluation scalable and verifiable, MM-ToolBench couples MCP-based execution with task-specific grounded evaluators and a semi-automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that MM-ToolBench remains highly challenging: Claude Opus 4.6, commonly regarded as one of the strongest coding-agent models, achieves only 32.0% task success, far below the 94.0% human benchmark. We envision MM-ToolBench as a practical foundation for evaluating and advancing next-generation omni-modal tool-using agents through closed-loop multimodal verification.

Chinese Translation

工具使用代理越来越被期望能够在现实专业工作流程中操作，在这些流程中，它们必须解释多模态输入、协调外部工具、检查中间产物，并在生成最终结果之前修正其行为。然而，现有的基准测试往往孤立地评估工具使用、计算机使用和多模态推理，导致基准设置与现实世界中的端到端全模态工具使用之间存在差距。为了解决这一差距，我们引入了MM-ToolBench，这是一个面向任务的全模态工具使用的基准和评估框架。MM-ToolBench包含来自两个宏任务家族（客户服务和智能创作）的100个可执行任务，涵盖20个子类别切片，并由27个MCP服务器和324个工具支持。MM-ToolBench的核心设计是闭环多模态验证：代理必须执行工具、检查呈现或转换的产物，并在输出未满足特定任务要求时自我修正。为了使这种评估具有可扩展性和可验证性，MM-ToolBench将基于MCP的执行与任务特定的基础评估者相结合，并提供一个半自动化的构建管道，用于场景发现、任务实例化、评估者合成和人工审计。对15个当代代理模型的实验表明，MM-ToolBench依然具有很高的挑战性：Claude Opus 4.6，通常被认为是最强大的编码代理模型之一，仅实现了32.0%的任务成功率，远低于94.0%的人工基准。我们设想MM-ToolBench作为评估和推动下一代全模态工具使用代理的实际基础，通过闭环多模态验证实现进步。


cs.AI / 29 / 2605.16927

From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction

从静态风险到动态轨迹：朝向世界模型启发的临床预测

Feng, Pujun, Guo, Xiaoyu, Saffari, Seyed Ehsan, Lee, Min Hun, Lam, Siew-Kei, Cambria, Erik, Sun, Xibin, Zhou, Yangtao, Yang, Tong, Zhang, Xiaoyu, Tan, Tao, Sun, Yue, Cui, Bin

Abstract

Clinical decision-making is a feedback system where risk estimates influence treatment, which in turn changes disease trajectories, and both shape clinicians' measurement practices. Static prediction often fails clinically: models trained on observational care logs conflate disease biology with clinician behavior, particularly under treatment confounder feedback and irregular or informative observation. This Review focuses on intervention-aware disease trajectory modeling in clinical AI--methods estimating patient-specific longitudinal disease evolution and assessing trajectory changes under alternative treatments. We organize the field around six linked components: three decision tasks (factual forecasting, counterfactual estimation, policy evaluation) and three data-generating mechanisms (disease evolution, treatment assignment, observation process) that determine identifiability. We present the first unified framework bridging forecasting, counterfactual trajectories, and policy evaluation across discrete/continuous time, explicitly addressing treatment assignment, time-varying confounding, and observation bias. We synthesize key method families (multistate/joint models, temporal point-process, deep sequence architectures, longitudinal causal inference), map them to relevant components, and align evaluation with claim strength via overlap diagnostics, uncertainty quantification, off-policy robustness, and target-trial validation. This synthesis advances benchmark prediction to decision-grade clinical evidence, enabling treatment-sensitive individualized futures, pre-deployment policy stress-testing, and safer closed-loop learning health systems that adapt/abstain when evidence is insufficient.

Chinese Translation

临床决策制定是一个反馈系统，其中风险估计影响治疗，而治疗又改变疾病轨迹，二者共同塑造临床医生的测量实践。静态预测在临床上往往失败：基于观察性护理记录训练的模型将疾病生物学与临床医生行为混淆，特别是在治疗混杂反馈和不规则或信息性观察的情况下。本综述聚焦于临床人工智能中的干预意识疾病轨迹建模——评估患者特定的纵向疾病演变并在不同治疗下评估轨迹变化的方法。我们围绕六个相互关联的组成部分组织该领域：三个决策任务（事实预测、反事实估计、政策评估）和三个数据生成机制（疾病演变、治疗分配、观察过程），这些机制决定了可识别性。我们提出了第一个统一框架，连接离散/连续时间的预测、反事实轨迹和政策评估，明确处理治疗分配、时间变化的混杂和观察偏差。我们综合了关键方法家族（多状态/联合模型、时间点过程、深度序列架构、纵向因果推断），将其映射到相关组成部分，并通过重叠诊断、不确定性量化、离政策稳健性和目标试验验证对评估与主张强度进行对齐。这一综合推进了基准预测到决策级临床证据，使得治疗敏感的个性化未来、预部署政策压力测试以及在证据不足时能够适应/避免的更安全的闭环学习健康系统成为可能。


cs.AI / 30 / 2605.16953

How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study

人类如何处理AI生成的幻觉内容：一项神经成像研究

Zhu, Shuqi, Zhong, Yi, Ye, Ziyi, Du, Bangde, Zhou, Yujia, Ai, Qingyao, Liu, Yiqun

Abstract

While AI-generated hallucinations pose considerable risks, the underlying cognitive mechanisms by which humans can successfully recognize or be misled by these hallucinations remain unclear. To address this problem, this paper explores humans' neural dynamics to characterize how the brain processes hallucinated content. We record EEG signals from 27 participants while they are performing a verification task to judge the correctness of image descriptions generated by a multi-modal large language model (MLLM). Based on an averaged event-related potential (ERP) study, we reveal that multiple cognitive processes, e.g., semantic integration, inferential processing, memory retrieval, and cognitive load, exhibit distinct patterns when humans process hallucinated versus non-hallucinated content. Notably, neural responses to hallucinations that were misjudged versus correctly judged by human participants showed significant differences. This indicates that misjudged AI-generated hallucinations failed to trigger the standard neurocognitive fact verification pathway.

Chinese Translation

尽管AI生成的幻觉带来了相当大的风险，但人类成功识别或被这些幻觉误导的潜在认知机制仍不清楚。为了解决这个问题，本文探讨了人类的神经动态，以表征大脑如何处理幻觉内容。我们在27名参与者执行验证任务时记录了脑电图（EEG）信号，以判断由多模态大型语言模型（MLLM）生成的图像描述的正确性。基于平均事件相关电位（ERP）研究，我们揭示了多种认知过程，例如语义整合、推理处理、记忆检索和认知负荷，在人类处理幻觉内容与非幻觉内容时表现出不同的模式。值得注意的是，参与者对被误判与正确判断的幻觉的神经反应显示出显著差异。这表明，被误判的AI生成幻觉未能触发标准的神经认知事实验证路径。


cs.AI / 31 / 2605.16966

Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects

利用人工智能解决逆偏微分方程问题：过去、现在与展望

Tan, Zhentao, Hao, Yuze, Zou, Boyi, Long, Mingsheng, Yang, Yi, Bao, Gang

Abstract

Solving inverse partial differential equation (PDE) problems is a fundamental topic in scientific research with broad significance across real-world applications. Inverse PDE problems arise across medical imaging, geophysics, materials science, and aerodynamics, where the goal is to infer hidden causes, design structures, or control physical states. In this paper, we provide a comprehensive review of recent advances in solving inverse PDE problems using artificial intelligence (AI). We first introduce the basic formulation, key challenges, and traditional numerical foundations of inverse PDE problems, and then organize the field into three major categories: inverse problems, inverse design, and control problems. For each category, we further present the methodological paradigms and review representative state-of-the-art approaches from recent years. We then summarize representative applications across scientific and industrial domains, including mechanical systems, aerodynamic problems, thermal systems, full-waveform inversion, system identification, and medical imaging. Finally, we discuss open challenges and future prospects, such as physics-informed architectures, limited real-world data, uncertainty quantification, and inverse foundation models. This survey aims to provide the first unified and systematic perspective on AI for inverse PDE problems, demonstrating how modern learning-based methods are reshaping inverse problems, inverse design, and control problems in PDE-governed systems.

Chinese Translation

解决逆偏微分方程（PDE）问题是科学研究中的一个基础主题，因为它在广泛的现实应用中具有重要意义。逆PDE问题出现在医学成像、地球物理学、材料科学和空气动力学等领域，其目标是推断隐藏的原因、设计结构或控制物理状态。本文对利用人工智能（AI）解决逆PDE问题的最新进展进行了全面回顾。我们首先介绍了逆PDE问题的基本公式、关键挑战和传统数值基础，然后将其组织为三个主要类别：逆问题、逆设计和控制问题。对于每个类别，我们进一步呈现了方法论范式，并回顾了近年来的代表性最先进方法。接着，我们总结了在科学和工业领域的代表性应用，包括机械系统、空气动力学问题、热系统、全波形反演、系统识别和医学成像。最后，我们讨论了开放挑战和未来展望，如物理信息架构、有限的现实数据、不确定性量化和逆基础模型。本次调查旨在提供关于AI在逆PDE问题中的首次统一和系统的视角，展示现代基于学习的方法如何重塑PDE控制系统中的逆问题、逆设计和控制问题。


cs.AI / 32 / 2605.16969

Brain Vascular Age Prediction Using Cerebral Blood Flow Velocity and Machine Learning Algorithms

基于脑血流速度和机器学习算法的脑血管年龄预测

Zhao, Anni, Bateh, Alex, Baldridge, Tyler, Billinger, Sandra, Hu, Xiao

Abstract

Defining vascular age in terms of physiological function has become a focal point of the extensive research on categorizing and tracking chronological age. Transcranial Doppler (TCD) measures cerebral blood flow velocity along the major arteries feeding the human brain. This study uses features extracted from TCD to estimate chronological age and to assess accelerated aging in subjects with various brain diseases. We hypothesize that subjects with brain diseases will present with accelerated cerebrovascular aging when evaluated with regression models trained on healthy subjects. Bilateral TCD recordings of the middle cerebral artery from 168 healthy subjects and 277 diseased subjects were analyzed using the Morphological Analysis and Clustering of Intracranial Pressure (MOCAIP) algorithm. MOCAIP-generated features and heart rate variability features were used as inputs to regression models that predict brain vascular age. Specifically, 66 subjects with acute stroke, 27 post-stroke subjects, 26 subjects with Alzheimer's disease, 23 subjects with mild cognitive impairment, and 135 established subjects were evaluated with the trained model to assess accelerated cerebrovascular aging. On average, the trained model predicted healthy subjects' cerebrovascular age to be 3.69 years above their chronological age. Subjects with different disease conditions exhibited varying levels of age acceleration. The differences between the results for healthy and diseased subjects suggest that TCD-derived features may be relevant when evaluating accelerated cerebrovascular aging. Moreover, imbalanced datasets were observed to affect the performance of machine-learning-based brain age prediction models.

Chinese Translation

将血管年龄定义为生理功能已成为广泛研究的一个重点，以便对生物年龄进行分类和追踪。经颅多普勒（Transcranial Doppler, TCD）是一种测量供给人脑的主要动脉中脑血流速度的方法。本研究旨在利用从TCD提取的特征来估计生物年龄，并评估各种脑疾病患者的加速衰老情况。我们预测，接受各种回归模型测试的不同脑疾病患者在脑血管衰老方面会表现出加速现象，这些模型是基于健康个体训练的。我们分析了168名健康个体和277名患有双侧中脑动脉TCD记录的患者，采用了颅内压的形态分析与聚类（Morphological Analysis and Clustering of Intracranial Pressure, MOCAIP）算法。MOCAIP生成的特征和心率变异性特征被用作回归模型的输入特征，以预测脑血管年龄。我们测试了66名急性中风患者、27名中风后患者、26名阿尔茨海默病患者、23名轻度认知障碍患者和135名确诊患者，以评估其脑血管年龄的加速情况。经过训练的模型平均预测健康个体的脑血管年龄比其生物年龄高出3.69岁。不同疾病状态的患者表现出不同程度的年龄加速。健康与疾病患者表现的差异表明，使用TCD生成的特征在评估加速脑血管衰老时可能具有相关性。此外，观察到不平衡的数据集会影响基于机器学习的脑年龄预测模型的性能。


cs.AI / 33 / 2605.17021

A Conflict-aware Evidential Framework for Reliable Sleep Stage Classification

一种冲突感知的证据框架用于可靠的睡眠阶段分类

Tian, Yunzhi, Wang, Dekui, Bu, Qirong, Zhou, Wei, Hao, Xingxing, Feng, Jun

Abstract

Multi-view learning has been widely applied to sleep stage classification using multi-modal data. However, existing methods typically assume that different modalities are well-aligned, which is often unattainable in real-world scenarios, thereby compromising the reliability of the staging results. In this paper, we propose ConfSleepNet, a conflict-aware evidential framework that dynamically resolves inter-view conflicts. The framework consists of multi-view evidence extraction and conflict-aware aggregation. In the first phase, it learns category-related evidence from different modalities, which represents the degree of support for individual sleep stages. Considering the inherent characteristics of the different modalities, we propose hybrid category structures for different modalities to promote more reasonable evidence learning. In the second phase, view-specific opinions, including prediction results and uncertainty, are constructed from the learned evidence. Notably, we propose a novel conflict-aware aggregation method that resolves conflicts among these view-specific opinions and synthesizes them into a reliable joint decision. Both theoretical analysis and experimental results demonstrate the effectiveness of ConfSleepNet in sleep staging tasks. The code is available at https://github.com/By4te/ConfSleepNet_ICML2026/.
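To make the notion of a view-specific opinion concrete, the sketch below uses the standard subjective-logic construction common in evidential deep learning: non-negative evidence per class is turned into belief masses plus one uncertainty mass. This is an assumption about the general family of methods, not the ConfSleepNet implementation. The function name, the five-stage ordering, and the evidence values are hypothetical.

```python
def opinion_from_evidence(evidence):
    """Turn per-class evidence into (beliefs, uncertainty).

    Standard subjective-logic mapping (assumed here, not taken from the paper):
    alpha_k = e_k + 1, S = sum(alpha), b_k = e_k / S, u = K / S.
    The belief masses and the uncertainty mass always sum to 1.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes  # Dirichlet strength S
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


# Hypothetical evidence from one view for stages (W, N1, N2, N3, REM).
beliefs, uncertainty = opinion_from_evidence([0.0, 1.0, 12.0, 2.0, 0.0])

# A view with no evidence at all is maximally uncertain.
empty_beliefs, empty_uncertainty = opinion_from_evidence([0.0] * 5)
```

Under this mapping, a view that collects little evidence yields a high uncertainty mass, which is the kind of signal an aggregation step can use when views disagree.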

Chinese Translation

多视角学习已广泛应用于利用多模态数据进行睡眠阶段分类。然而，现有方法通常假设不同模态之间是良好对齐的，这在现实场景中往往难以实现，从而影响了分级结果的可靠性。本文提出了ConfSleepNet，一种动态解决视图间冲突的冲突感知证据框架。该框架由多视图证据提取和冲突感知聚合组成。在第一阶段，它从不同模态中学习与类别相关的证据，代表对各个睡眠阶段的支持程度。考虑到不同模态的固有特性，我们为不同模态提出了混合类别结构，以促进更合理的证据学习。在第二阶段，从学习到的证据中构建特定视图的意见，包括预测结果和不确定性。值得注意的是，我们提出了一种新颖的冲突感知聚合方法，将这些特定视图的意见整合为可靠的联合决策。该机制能够有效解决意见之间的冲突，并将其综合为可靠的联合决策。理论分析和实验结果均表明ConfSleepNet在睡眠分级任务中的有效性。代码可在 https://github.com/By4te/ConfSleepNet_ICML2026/ 获取。


cs.AI / 34 / 2605.17036

Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

自主人工智能代理在供应链管理中的可靠性与有效性

Long, Carol Xuan, Simchi-Levi, David, Zhu, Feng, Su, Huangyuan, Calmon, Andre P., Calmon, Flavio P.

Abstract

This paper studies autonomous generative AI agents in multi-echelon supply chains using the MIT Beer Game. We identify four inference-time levers that shape performance: model selection, policies and guardrails, centralized data sharing, and prompt engineering. Model capability is the dominant factor: an out-of-the-box reasoning model exceeds human-level performance, and optimized reasoning models reduce costs by up to 67% relative to human teams. However, strong average performance masks substantial reliability risks. We introduce the agent bullwhip effect, the amplification of decision unreliability across echelons, manifesting along two dimensions: decision variance increases both across facilities at the same point in time and within the same facility across time. We develop a mathematical framework showing that this phenomenon is inherent to multi-agent systems that involve coordination and information delays, and we demonstrate that repeated sampling fails to meaningfully reduce it. To address this limitation, we propose a Group Relative Policy Optimization (GRPO)-based reinforcement-learning post-training framework that trains a shared base LLM using system-level supply-chain rewards. GRPO post-training substantially reduces tail events, curtails agent bullwhip, and improves the reliability of autonomous supply-chain agents.
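As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes rewards within a group of sampled rollouts so that each rollout is scored relative to its peers. This is the commonly described form of the advantage computation. The abstract does not specify the paper's reward definition or training loop, so the cost values and the choice of negative total cost as a system-level reward are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical total supply-chain costs for four rollouts of the same scenario.
costs = [120.0, 95.0, 210.0, 101.0]
rewards = [-c for c in costs]  # assumed reward: lower system cost is better
advantages = group_relative_advantages(rewards)

# If every rollout has the same reward, no rollout is preferred.
flat = group_relative_advantages([-100.0, -100.0, -100.0])
```

The cheapest rollout gets the largest advantage and the costly outlier the most negative one, so a group-relative update would push the policy away from tail events.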

Chinese Translation

本文研究了在多层级供应链中使用MIT啤酒游戏的自主生成AI代理。我们识别出四个影响性能的推理时间杠杆：模型选择、政策与保护措施、集中数据共享以及提示工程。模型能力是主导因素：一个开箱即用的推理模型超越了人类水平的表现，而优化后的推理模型相较于人类团队可将成本降低多达67%。然而，强大的平均表现掩盖了显著的可靠性风险。我们引入了代理鞭子效应，即决策不可靠性在各层级间的放大，表现为两个维度：在同一时间点上，不同设施之间的决策方差增加，以及同一设施在不同时间点上的决策方差增加。我们开发了一个数学框架，表明这一现象是多代理系统中固有的，涉及协调和信息延迟，并且我们证明重复采样未能有效减少这一现象。为了解决这一局限性，我们提出了一种基于群体相对政策优化（Group Relative Policy Optimization, GRPO）的强化学习后训练框架，该框架利用系统级供应链奖励训练共享基础大型语言模型（LLM）。GRPO后训练显著减少了尾部事件，抑制了代理鞭子效应，并提高了自主供应链代理的可靠性。


cs.AI / 35 / 2605.17038

Evidential Information Fusion on Possibilistic Structure

基于可能性结构的证据信息融合

Zhou, Qianli, Cui, Ye, Li, Zhen, Pedrycz, Witold, Deng, Yong

Abstract

Dempster's rule is a fundamental tool for combining belief functions from distinct and reliable sources. However, its intersection-based semantics imposes strong structural restrictions, which limits its flexibility in handling complex source states and diverse information fusion scenarios. To overcome this limitation, we propose a reversible transformation, derived from the isopignistic principle, between belief functions and a possibilistic structure defined on the power set. In this transformation, the relationships among subsets are explicitly characterized by a belief evolution network, which provides a more flexible representation of evidential information beyond the conventional mass function structure. On this basis, we further introduce the triangular norm family to develop a general and adaptive evidential information fusion framework. Unlike fusion methods rooted in Dempster semantics, the proposed framework supports more flexible combination behaviors and exhibits advantages in non-distinct source fusion, conflict management, parametric combination design, and heterogeneous information fusion.
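For readers unfamiliar with the baseline being generalized, the sketch below implements the classical Dempster's rule of combination, whose intersection-based semantics the abstract refers to. It does not implement the proposed possibilistic transformation or the t-norm framework. The frame of discernment and the mass values are hypothetical.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule.

    Returns the normalized combined mass function and the conflict mass K.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b  # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    norm = 1.0 - conflict
    return {s: v / norm for s, v in combined.items()}, conflict


# Hypothetical frame {a, b} with two sources.
A, B, AB = frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})
m1 = {A: 0.6, AB: 0.4}
m2 = {B: 0.5, AB: 0.5}
fused, conflict = dempster_combine(m1, m2)
```

Mass only survives where focal elements intersect, and everything else is discarded through normalization. That is the structural restriction the abstract aims to relax.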

Chinese Translation

德姆斯特规则是结合来自不同可靠来源的信念函数的基本工具。然而，其基于交集的语义施加了强烈的结构限制，这限制了其在处理复杂源状态和多样化信息融合场景中的灵活性。为克服这一限制，我们提出了一种可逆变换，该变换源于等可能性原则，连接信念函数与定义在幂集上的可能性结构。在这一变换中，子集之间的关系通过信念演化网络被明确表征，这提供了一种超越传统质量函数结构的证据信息的更灵活表示。在此基础上，我们进一步引入三角范数族，以开发一个通用且自适应的证据信息融合框架。与根植于德姆斯特语义的融合方法不同，所提出的框架支持更灵活的组合行为，并在非独特源融合、冲突管理、参数组合设计和异构信息融合方面表现出优势。


cs.AI / 36 / 2605.17044

PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models

PersonaArena：用于评估和增强大型语言模型中个性化角色扮演的动态模拟

Shi, Wenlong, Lian, Jianxun, Wu, Mingqi, Qin, Haiming, Zhou, Mingyang, Xie, Xing, Chao, Naipeng, Liao, Hao

Abstract

Large language models (LLMs) increasingly serve as interactive social agents, yet their ability to maintain coherent and authentic persona-level role-playing remains limited, particularly in realistic social scenarios. Existing research predominantly focuses on character-level settings and relies on static evaluation formats, failing to capture the complexity of everyday social interactions. In this work, we present PersonaArena, a dynamic simulation framework for evaluating and improving persona-level role-playing in LLMs. PersonaArena leverages a large, filtered corpus of user-generated social content to construct a nuanced persona bank, and elicits multi-turn, context-rich interactions within simulated social environments. Our framework features a multi-agent debating judge for holistic and unbiased assessment. Through extensive experiments, we demonstrate that PersonaArena enables rigorous evaluation and enhancement of LLMs' role-playing capabilities, advancing the development of more authentic and socially adept AI agents.

Chinese Translation

大型语言模型（LLMs）越来越多地作为互动社交代理，但它们在保持连贯和真实的个性化角色扮演方面的能力仍然有限，尤其是在现实社交场景中。现有研究主要集中在角色级别的设置上，并依赖静态评估格式，未能捕捉日常社交互动的复杂性。在本研究中，我们提出了PersonaArena，一个用于评估和改善LLMs个性化角色扮演的动态模拟框架。PersonaArena利用大量经过筛选的用户生成社交内容构建细致的个性库，并在模拟社交环境中引导多轮、丰富上下文的互动。我们的框架具有多代理辩论评审功能，以实现全面和无偏见的评估。通过广泛的实验，我们证明了PersonaArena能够对LLMs的角色扮演能力进行严格的评估和增强，推动更真实和社交能力更强的人工智能代理的发展。


cs.AI / 37 / 2605.17064

Towards Human-Level Book-Writing Capability

迈向人类水平的书写能力

Zierstek, Jan, Batelic, Matteo, Medjad, Maya, Schönenberger, Tim

Abstract

Large language models optimized for instruction following and agentic tasks remain poorly aligned with the requirements of high-quality creative writing. Fiction frequently depends on behaviors that assistant-tuned models are explicitly trained to avoid, particularly deception, moral ambiguity, and unreliable narration. As a result, generated stories often appear structurally correct while remaining stylistically generic, overly explanatory, or weakly grounded in human literary behavior. We present a dataset construction and training framework for book-scale creative writing that reframes supervised fine-tuning as a prompt-to-book generation task grounded in human-authored fiction. Starting from public-domain novels, we derive a multi-resolution planning scaffold by summarizing each book at progressively finer levels, from a high-level premise to chapter- and scene-level structure. We then invert this hierarchy during training: the model learns to expand a prompt into increasingly detailed plans and finally into the original human-authored book text. This formulation preserves human prose as the final supervised target while using intermediate summaries to make book-scale generation learnable. We train a long-context language model on these prompt-to-book trajectories and study whether this objective shifts generation away from assistant-style prose and toward human literary writing.

Chinese Translation

针对指令遵循和代理任务优化的大型语言模型与高质量创意写作的要求仍然存在较大差距。虚构作品常常依赖于助手调优模型明确训练以避免的行为，特别是欺骗、道德模糊和不可靠叙述。因此，生成的故事通常在结构上看似正确，但在风格上却显得过于普通、解释性过强或与人类文学行为的联系较弱。我们提出了一种书籍规模创意写作的数据集构建和训练框架，将监督微调重新构建为基于人类创作的虚构作品的提示到书籍生成任务。从公共领域的小说出发，我们通过逐步细化的方式总结每本书，从高层次的前提到章节和场景级别的结构，推导出一个多分辨率的规划框架。然后，在训练过程中，我们反转这一层次结构：模型学习将提示扩展为越来越详细的计划，最终生成原始的人类创作书籍文本。这种表述保留了人类散文作为最终的监督目标，同时利用中间摘要使书籍规模的生成可学习。我们在这些提示到书籍的轨迹上训练了一个长上下文语言模型，并研究这一目标是否将生成从助手风格的散文转向人类文学写作。


cs.AI / 38 / 2605.17071

AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation

AnchorDiff：具有基于置信度重写的拓扑感知掩蔽扩散用于放射学报告生成

Yu, Shiying, Wang, Jielei, Lu, Guoming

Abstract

Radiology report generation (RRG) aims to automatically produce clinically accurate textual reports from medical images. Existing methods predominantly rely on autoregressive (AR) language models, whose causal dependency structure restricts generation to a unidirectional left-to-right process. This paradigm can induce sequence bias, where models tend to follow stereotypical token orders and high-frequency report templates rather than fully grounding generation in image-specific evidence. In this paper, we propose AnchorDiff, the first masked-diffusion framework for RRG that integrates knowledge-graph-derived clinical anchors into diffusion language modeling. By leveraging bidirectional context and iterative refinement, AnchorDiff mitigates the limitations of fixed-order autoregressive decoding. Specifically, we introduce a topology-aware training strategy that uses RadGraph-derived entity hierarchies to assign clinically important tokens differentiated masking protection and loss weights. We further design an inference-time rewriting strategy that detects unstable committed tokens through perturbation-based testing and selectively revises them during denoising. Extensive experiments on the MIMIC-CXR and MIMIC-RG4 benchmarks demonstrate that AnchorDiff achieves state-of-the-art (SOTA) performance, showing the effectiveness of clinically anchored masked diffusion for radiology report generation.

Chinese Translation

放射学报告生成（RRG）旨在从医学图像中自动生成临床准确的文本报告。现有方法主要依赖自回归（AR）语言模型，其因果依赖结构限制了生成过程为单向的从左到右。这种范式可能导致序列偏差，使得模型倾向于遵循刻板的标记顺序和高频报告模板，而不是完全基于图像特定证据进行生成。在本文中，我们提出了AnchorDiff，这是第一个将知识图谱衍生的临床锚点整合到扩散语言建模中的掩蔽扩散框架。通过利用双向上下文和迭代优化，AnchorDiff缓解了固定顺序自回归解码的局限性。具体而言，我们引入了一种拓扑感知的训练策略，使用RadGraph衍生的实体层次结构为临床重要标记分配差异化的掩蔽保护和损失权重。我们进一步设计了一种推理时重写策略，通过基于扰动的测试检测不稳定的已承诺标记，并在去噪过程中选择性地对其进行修正。在MIMIC-CXR和MIMIC-RG4基准上的广泛实验表明，AnchorDiff实现了最先进的（SOTA）性能，展示了临床锚定掩蔽扩散在放射学报告生成中的有效性。


cs.AI / 39 / 2605.17072

RAGA: Reading-And-Graph-building-Agent for Autonomous Knowledge Graph Construction and Retrieval-Augmented Generation

RAGA：用于自主知识图谱构建和检索增强生成的阅读与图谱构建代理

Han, Chengrui, Cheng, Zesheng

Abstract

Existing LLM-driven knowledge graph (KG) construction methods predominantly employ stateless batch processing pipelines, exhibiting structural deficiencies in cross-chunk semantic relation capture, entity disambiguation, and construction process interpretability. These limitations undermine KG quality, retrieval precision, and deployment trust in high-stakes domains. We propose RAGA (Reading And Graph-building Agent), an LLM-based autonomous KG construction and retrieval fusion framework. RAGA provides an atomic toolset supporting full KG lifecycle CRUD operations and embeds a Read-Search-Verify-Construct cognitive constraint into a ReAct tool loop. A KG-vector synchronization mechanism enables hybrid symbolic-vector retrieval, while evidence-anchored verification links every knowledge entry to its source text for auditable provenance. Preliminary experiments on a subset of the QASPER scientific QA dataset indicate that RAGA's fusion retrieval outperforms zero-shot baselines, with KG integration providing measurable gains in both answer and evidence quality. The framework design and experimental baseline serve as a reference for agent-driven autonomous KG construction.

Chinese Translation

现有的基于大型语言模型（LLM）的知识图谱（KG）构建方法主要采用无状态的批处理管道，表现出在跨块语义关系捕获、实体消歧和构建过程可解释性方面的结构缺陷。这些局限性削弱了知识图谱的质量、检索精度以及在高风险领域的部署信任。我们提出了RAGA（阅读与图谱构建代理），这是一个基于LLM的自主知识图谱构建与检索融合框架。RAGA提供了一套原子工具集，支持完整的知识图谱生命周期的CRUD操作，并将阅读-搜索-验证-构建的认知约束嵌入到ReAct工具循环中。知识图谱向量同步机制实现了混合符号-向量检索，而基于证据的验证将每个知识条目链接到其源文本，以便于可审计的来源追溯。在QASPER科学问答数据集的一个子集上的初步实验表明，RAGA的融合检索优于零-shot基线，知识图谱的集成在答案和证据质量上均提供了可测量的提升。该框架设计和实验基线为代理驱动的自主知识图谱构建提供了参考。


cs.AI / 40 / 2605.17104

Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics

丰富科学逻辑性的LLM推理方法论：物理学中的实践

Yu, Zhaoxin, Xu, Nan, Chen, Kun, Zhao, Jiahao, Wang, Lei, Mao, Wenji

Abstract

With the continuous advancement of reasoning abilities in Large Language Models (LLMs), their application to scientific reasoning tasks has gained significant research attention. Current research primarily emphasizes boosting LLMs' performance on scientific QA benchmarks by training on larger, more comprehensive datasets with extended reasoning chains. However, these approaches neglect the essence of the scientific reasoning process -- logicality, which is the rational foundation to ensure the validity of reasoning steps leading to reliable conclusions. In this work, we make the first systematic investigation into the internal logicality underlying LLM scientific reasoning, and develop a scientific logicality-enriched methodology, including a set of assessment criteria and data sampling methods for logicality-guided training, to improve the logical faithfulness as well as task performance. Further, we take physics, characterized by its diverse logical structures and formalisms, as an exemplar discipline to practise the above methodology. For data construction, we extract scientific problems from academic literature and sample a high-quality dataset exhibiting strong logicality. Experiments based on three different backbone LLMs reveal that: 1) the training data we constructed can effectively improve the scientific logicality in LLM reasoning; and 2) the enriched scientific logicality plays a critical role in solving scientific problems. Code is available at https://github.com/ScienceOne-AI/PhysLogic.

Chinese Translation

随着大型语言模型（LLMs）推理能力的持续提升，它们在科学推理任务中的应用引起了显著的研究关注。目前的研究主要强调通过在更大、更全面的数据集上进行训练，以延长推理链来提升LLMs在科学问答基准上的表现。然而，这些方法忽视了科学推理过程的本质——逻辑性，这是确保推理步骤有效性以得出可靠结论的理性基础。在本研究中，我们首次系统性地探讨了LLM科学推理中潜在的内部逻辑性，并开发了一种丰富科学逻辑性的研究方法论，包括一套评估标准和逻辑性引导训练的数据采样方法，以提高逻辑的可信度和任务表现。此外，我们以物理学为例，物理学以其多样的逻辑结构和形式主义为特征，来实践上述方法论。在数据构建方面，我们从学术文献中提取科学问题，并采样出一个展现强逻辑性的高质量数据集。基于三种不同主干LLM的实验表明：1）我们构建的训练数据可以有效提高LLM推理中的科学逻辑性；2）丰富的科学逻辑性在解决科学问题中发挥了关键作用。代码可在 https://github.com/ScienceOne-AI/PhysLogic 获取。


cs.AI / 41 / 2605.17110

Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

通过证据校准的查询聚类捕捉大型语言模型能力

Wu, Fangzhou, Silwal, Sandeep, Zhang, Qiuyi

Abstract

Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations demonstrate that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves effective in downstream tasks such as query routing.
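The Bradley-Terry parameterization mentioned above can be sketched as follows. Each cluster holds one log-strength score per model, and the probability that one model beats another is a logistic function of the score gap. The mixture over clusters shown here is one plausible reading of "trainable mixture weights" and is an assumption, as are the model names, scores, and weights. ECC's actual training objective is not given in the abstract.

```python
import math


def bt_win_probability(score_i, score_j):
    """Bradley-Terry: P(i beats j) = exp(s_i) / (exp(s_i) + exp(s_j))."""
    return 1.0 / (1.0 + math.exp(score_j - score_i))


def mixed_win_probability(weights, cluster_scores, model_i, model_j):
    """Assumed mixture form: sum_k w_k * P_k(model_i beats model_j)."""
    return sum(
        w * bt_win_probability(scores[model_i], scores[model_j])
        for w, scores in zip(weights, cluster_scores)
    )


# Hypothetical capability profiles for two clusters (e.g. "math" and "coding").
cluster_scores = [
    {"model_a": 1.0, "model_b": 0.0},  # model_a is stronger in cluster 0
    {"model_a": 0.0, "model_b": 2.0},  # model_b is stronger in cluster 1
]
query_weights = [0.5, 0.5]  # a query that draws equally on both clusters

p_cluster0 = bt_win_probability(1.0, 0.0)
p_mixed = mixed_win_probability(query_weights, cluster_scores, "model_a", "model_b")
```

Because the weights are per query, two queries that look similar on the surface can still receive different predicted rankings if they load on different clusters.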

Chinese Translation

查询聚类将查询组织成反映共享潜在能力需求的组，从而实现能力感知的大型语言模型（LLM）评估。现有的聚类方法主要依赖于语义分类法或嵌入，通常无法捕捉这种潜在能力需求，因为表层语义与实际模型性能之间存在不一致。我们提出了ECC（Evidence-Calibrated Clustering），一种通过有限的后验模型比较来校准先前的语义嵌入的算法，以弥合表层语义与潜在能力需求之间的差距。ECC通过一个由Bradley-Terry模型参数化的能力特征来表征每个聚类，并使用可训练的混合权重来适应具有混合能力需求的查询，共同学习一个灵活的、能力感知的聚类结构，以支持查询特定的LLM能力推断。大量的定量和定性评估表明，ECC显著提高了LLM能力排名的质量，分别比人工标注和基于嵌入的基准提高了平均17.64和18.02个百分点，并在查询路由等下游任务中证明了其有效性。


cs.AI / 42 / 2605.17115

F2IND-IT! -- Multimodal Fuzzy Fake Indian News Detection using Images and Text

F2IND-IT！-- 基于图像和文本的多模态模糊假印度新闻检测

Trivedi, Kushal, Shaikh, Murtuza, Singh, Khushi, S., Jeevaraj

Abstract

Biased manipulation of facts across regional and national media outlets complicates misinformation detection in diverse landscapes like India. This paper introduces a novel multimodal framework combining visual and textual modalities for enhanced fake news detection on Indian media. The architecture utilizes a ResNet-50 Convolutional Neural Network to extract visual features from news images, a DistilBERT encoder to obtain textual semantic embeddings, and an Adaptive Neuro-Fuzzy Inference System (ANFIS) to generate a fuzzy reliability score. A lightweight attention-based fusion module assigns learnable weights to each modality prior to classification. Evaluated on the IFND dataset, the proposed model is validated through an in-depth comparative analysis against previous research. Experimental results demonstrate superior performance across accuracy, precision, recall, and $F_1$-scores, confirming the efficacy of the architecture.
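A minimal sketch of what weighting modalities before classification could look like: one learnable logit per modality, a softmax to obtain weights, and a weighted sum of the modality embeddings. The abstract does not describe the fusion module in detail, so this is an assumed simplification. The two-dimensional vectors stand in for projected ResNet-50, DistilBERT, and ANFIS outputs and are purely illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(modality_vectors, modality_logits):
    """Weighted sum of same-sized modality vectors; weights = softmax(logits)."""
    dim = len(modality_vectors[0])
    if any(len(v) != dim for v in modality_vectors):
        raise ValueError("all modality vectors must share one dimension")
    weights = softmax(modality_logits)
    fused = [
        sum(w * vec[d] for w, vec in zip(weights, modality_vectors))
        for d in range(dim)
    ]
    return fused, weights


# Hypothetical projected features: visual, textual, fuzzy reliability.
visual, textual, fuzzy = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]
# Equal logits give equal weights; training would adjust these values.
fused, weights = fuse([visual, textual, fuzzy], [0.0, 0.0, 0.0])
```

In a trained model the logits would be parameters updated by backpropagation, so the classifier can lean on whichever modality proves more reliable.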

Chinese Translation

区域和国家媒体对事实的偏见操控使得在印度等多样化环境中检测虚假信息变得复杂。本文提出了一种新颖的多模态框架，结合视觉和文本模态，以增强对印度媒体上假新闻的检测。该架构利用ResNet-50卷积神经网络从新闻图像中提取视觉特征，使用DistilBERT编码器获取文本语义嵌入，并通过自适应神经模糊推理系统（Adaptive Neuro-Fuzzy Inference System, ANFIS）生成模糊可靠性评分。轻量级的基于注意力的融合模块在分类之前为每种模态分配可学习的权重。在IFND数据集上进行评估，所提出的模型通过与先前研究的深入比较分析得到了验证。实验结果表明，在准确率、精确率、召回率和$F_1$-分数等方面表现优越，确认了该架构的有效性。


cs.AI / 43 / 2605.17137

Latent Heuristic Search: Continuous Optimization for Automated Algorithm Design

潜在启发式搜索：用于自动化算法设计的连续优化

Ahmed, Cheikh, Mostajabdaveh, Mahdi, Zhou, Zirui

Abstract

The integration of Large Language Models (LLMs) into evolutionary frameworks has established a new paradigm for automated heuristic discovery. Despite their promise, these methods typically search in the discrete space of program syntax, relying on stochastic sampling to navigate a highly non-convex optimization landscape. This work proposes a continuous heuristic discovery framework that shifts optimization to a learned latent manifold. We employ an encoder to map discrete programs into continuous embeddings and train a differentiable surrogate model to predict performance, enabling gradient-based search. To regularize the optimization trajectory, an invertible normalizing flow maps these embeddings to a structured Gaussian prior, where we perform gradient ascent. The resulting optimized latent vectors are projected through a learned mapper into soft prompts, which condition a frozen LLM to synthesize novel executable heuristics. We evaluate the proposed method on the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP), the Knapsack Problem (KSP), and Online Bin Packing (OBP). Empirical results demonstrate that continuous latent-space optimization achieves performance competitive with state-of-the-art discrete evolutionary baselines while offering a complementary methodological alternative for automated algorithm design. The implementation code is available at https://github.com/cheikh025/LHS.

Chinese Translation

将大型语言模型（LLMs）融入进化框架中，建立了一种自动化启发式发现的新范式。尽管这些方法前景广阔，但它们通常在程序语法的离散空间中进行搜索，依赖随机采样来导航高度非凸的优化景观。本研究提出了一种连续启发式发现框架，将优化转移到学习的潜在流形上。我们采用编码器将离散程序映射到连续嵌入，并训练一个可微分的代理模型来预测性能，从而实现基于梯度的搜索。为了规范优化轨迹，一个可逆的归一化流将这些嵌入映射到结构化的高斯先验中，在此进行梯度上升。最终优化得到的潜在向量通过学习的映射器投影到软提示中，以此条件化一个冻结的LLM，从而合成新的可执行启发式算法。我们在旅行商问题（TSP）、容量限制车辆路径问题（CVRP）、背包问题（KSP）和在线装箱问题（OBP）上评估了所提方法。实证结果表明，连续潜在空间优化在性能上与最先进的离散进化基线相竞争，同时为自动化算法设计提供了一种互补的方法论替代方案。实现代码可在 https://github.com/cheikh025/LHS 获取。


cs.AI / 44 / 2605.17141

Dynamics of collective creativity in AI art competitions

人工智能艺术竞赛中的集体创造力动态

Youngblood, Mason, Nusz, Jeff, Simon, Joel

Abstract

Creativity is a fundamental aspect of how culture evolves, yet the mechanisms by which groups produce novelty are notoriously difficult to infer from the historical record. Iterated learning experiments have shown that cultural transmission reliably distorts artifacts toward the inductive biases of learners, but most of this work uses linear chains between human participants, leaving open how these dynamics play out in the networked, human-AI systems that increasingly shape cultural production. In this study, we leverage one such system, Artbreeder, which hosts daily "remix parties" where users iteratively build on each other's work from a single seed image, producing branching lineages of human-AI co-created images. We analyze a dataset of 130,882 images from 368 remix parties over 13 months and find that images become simpler and converge toward common thematic "attractors" (e.g., steampunk scenes, alien architecture). We also find that while more novel "parent" images produce more novel and complex "children" that attract more likes, users paradoxically prefer to remix images that are less novel and complex. Finally, larger remix parties produce more novelty at the cost of lower complexity.

Chinese Translation

创造力是文化演变的一个基本方面，但群体产生新颖性的机制在历史记录中往往难以推断。迭代学习实验表明，文化传播可靠地扭曲了工件，以适应学习者的归纳偏见，但大多数研究使用的是人类参与者之间的线性链条，尚不清楚这些动态在日益影响文化生产的人机网络系统中如何展开。在本研究中，我们利用了一个这样的系统——Artbreeder，该平台每天举办“混音派对”，用户从单一的种子图像出发，迭代地基于彼此的作品进行创作，产生人机共同创作的图像分支谱系。我们分析了来自368个混音派对的130,882幅图像的数据集，发现图像变得更简单，并趋向于共同的主题“吸引子”（例如，蒸汽朋克场景、外星建筑）。我们还发现，虽然更具新颖性的“父”图像会产生更新颖和复杂的“子”图像，并吸引更多的点赞，但用户却矛盾地更倾向于混音那些不那么新颖和复杂的图像。最后，较大的混音派对在降低复杂性的代价下产生了更多的新颖性。

cs.AI / 45 / 2605.17159

MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop

MADP：一种可持续文档处理的多智能体管道，结合人机协作

Gosmar, Diego, Zenezini, Giovanni

Abstract

Document processing automation remains a critical challenge in enterprise environments, where traditional manual approaches are labor-intensive and error-prone. We present MADP, a multi-agent architecture that addresses the challenge of automating document processing in enterprise settings by combining deep learning-based classification and parsing with large language model extraction, while maintaining accuracy through selective human validation. Our system integrates five specialized agents--Classificator, Splitter, Parser, Extraction, and Validator--with a Human-in-the-Loop (HITL) mechanism and a novel Prompt Fine Tuning with Feedback Inheritance (PFTFI) approach. The operational analysis on a production use-case scenario of 100,000 invoices per year indicates a potential reduction of Full-Time Equivalent (FTE) requirements by approximately 70%. Production deployment on 955 real-world documents processed through January 2026 achieves a 97.0% full-pipeline automation rate, with only 3% requiring non-AI fallback. Ablation evaluation on a stratified 100-document subset (5 documents for each of 20 supplier/document-type categories) demonstrates that the full MADP configuration with Human-in-the-Loop supervision attains 98.5% document-level accuracy. Additionally, we present a comprehensive sustainability analysis showing that our hybrid AI+HITL approach reduces CO2 emissions by 69%, energy consumption by 69%, and water usage by 63% compared to traditional manual processing. Benchmark comparisons of multiple LLM backends (Granite-Docling, Mistral-Small, DeepSeek-OCR) provide practical insights for deployment in production environments.

Chinese Translation

文档处理自动化在企业环境中仍然是一个关键挑战，传统的手动方法劳动密集且容易出错。我们提出了MADP，一种多智能体架构，通过结合基于深度学习的分类和解析与大型语言模型提取，解决了在企业环境中自动化文档处理的挑战，同时通过选择性的人类验证保持准确性。我们的系统集成了五个专业代理——分类器（Classificator）、分割器（Splitter）、解析器（Parser）、提取器（Extraction）和验证器（Validator），并采用了人机协作（Human-in-the-Loop, HITL）机制和一种新颖的反馈继承提示微调（Prompt Fine Tuning with Feedback Inheritance, PFTFI）方法。对每年处理100,000份发票的生产用例场景的操作分析表明，潜在的全职当量（Full-Time Equivalent, FTE）需求可减少约70%。在2026年1月之前处理的955份真实文档的生产部署实现了97.0%的全管道自动化率，仅有3%需要非AI回退。在一个分层的100文档子集（每种20个供应商/文档类型类别各5个文档）上的消融评估表明，完整的MADP配置在有HITL监督的情况下达到了98.5%的文档级准确率。此外，我们还进行了全面的可持续性分析，显示我们的混合AI+HITL方法相比传统手动处理减少了69%的二氧化碳排放、69%的能源消耗和63%的水使用。对多个大型语言模型（LLM）后端（Granite-Docling、Mistral-Small、DeepSeek-OCR）的基准比较为生产环境中的部署提供了实用的见解。

cs.AI / 46 / 2605.17162

From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

从模仿到互动：利用浅层强化学习掌握Schnapsen游戏

Klačan, Ján, Zhang, Sizhong

Abstract

This paper investigates whether shallow neural network agents can master the card game Schnapsen and challenge a strong search-based baseline, RdeepBot, which uses Monte Carlo sampling and lookahead search. Guided by a progressively more complex experimental design, we first evaluate a supervised learning agent (MLPBot) trained on replay data and then a reinforcement learning agent (RLBot) with the same shallow architecture trained through asynchronous Monte Carlo updates and experience replay. The results show that supervised imitation does not generalize well enough to defeat strong RdeepBot opponents, whereas reinforcement learning produces substantially stronger agents. In the setting that focuses on the depth parameter of RdeepBot, the best performance is achieved when the learned value function is combined with deeper lookahead during gameplay, allowing RLBot to achieve statistically significantly higher win rates against the strongest evaluated RdeepBot baseline. In the sample-based setting, the gains are more conditional: the strongest performance appears at a relatively lower training num_samples parameter rather than increasing uniformly with stronger sampling.

Chinese Translation

本文研究了浅层神经网络代理是否能够掌握纸牌游戏Schnapsen，并挑战一个强大的基于搜索的基线RdeepBot，该基线使用蒙特卡洛采样和前瞻搜索。在逐步复杂化的实验设计指导下，我们首先评估了一个基于回放数据训练的监督学习代理（MLPBot），然后评估了一个通过异步蒙特卡洛更新和经验回放训练的相同浅层架构的强化学习代理（RLBot）。结果表明，监督模仿的泛化能力不足以击败强大的RdeepBot对手，而强化学习则产生了显著更强的代理。在关注RdeepBot深度参数的设置中，当学习到的价值函数与游戏过程中更深的前瞻相结合时，表现最佳，使得RLBot在与最强评估的RdeepBot基线对抗时获得了统计显著更高的胜率。在基于样本的设置中，收益则更具条件性：最强的表现出现在相对较低的训练样本数参数下，而不是随着更强采样的增加而均匀提高。

cs.AI / 47 / 2605.17169

Responsible Agentic AI Requires Explicit Provenance

负责任的自主智能需要明确的来源追溯

Hu, Jinwei, Huang, Xinmiao, He, Qisong, Sun, Youcheng, Dong, Yi, Huang, Xiaowei

Abstract

Agentic AI is rapidly proliferating across diverse real-world domains such as software engineering, yet public trust has not kept pace. The central reason is that responsibility, despite being widely discussed, remains a subjective and unenforced concept, as no current agentic framework produces the quantifiable, traceable, and intervenable provenance needed to assign it when harm emerges from compositions no single party designed. We argue that what is missing is not better benchmark-level evaluation but $\textbf{explicit provenance}$ across the full agentic lifecycle, which is the only viable basis for making responsibility computable and actionable. We advance this agenda along four axes: establishing $\textit{why}$ such provenance is a structural necessity by identifying responsibility gaps across sociotechnical dimensions, formalizing $\textit{what}$ it must encode through a causal attribution function and responsibility tensor, discussing $\textit{how}$ it can be made computable across four lifecycle layers with preliminary experiments showing that provenance is estimable and intervenable online before irreversible harm accumulates, and examining $\textit{who}$ bears responsibility through a concrete agentic incident. Explicit provenance is not a discretionary refinement but the necessary condition for responsible agentic AI, and no stakeholder across its ecosystem can afford to treat it as optional.

Chinese Translation

自主智能（Agentic AI）正在迅速扩展到软件工程等多种现实领域，但公众信任并未同步增长。其主要原因在于，尽管责任被广泛讨论，但仍然是一个主观且未被强制执行的概念，因为目前没有任何自主框架能够提供量化、可追溯和可干预的来源追溯，以便在由多个未设计的单一方组合而成的情况下出现危害时进行责任分配。我们认为，缺失的不是更好的基准评估，而是贯穿整个自主生命周期的$\textbf{明确来源追溯}$，这是使责任可计算和可操作的唯一可行基础。我们从四个方面推进这一议程：首先，通过识别社会技术维度中的责任缺口，建立$\textit{为什么}$这种来源追溯是结构性必要的；其次，通过因果归因函数和责任张量形式化$\textit{它必须编码什么}$；然后，讨论$\textit{如何}$在四个生命周期层面上使其可计算，初步实验表明，来源追溯在不可逆危害累积之前是可以在线估计和干预的；最后，通过一个具体的自主事件审视$\textit{谁}$承担责任。明确的来源追溯不是一种可选的细化，而是负责任的自主智能的必要条件，生态系统中的任何利益相关者都无法将其视为可选。

cs.AI / 48 / 2605.17176

CAREBench: Evaluating LLMs' Emotion Understanding by Assessing Cognitive Appraisal Reasoning

CAREBench：通过评估认知评估推理来评估大型语言模型的情感理解

Sun, Zhaoyue, Xu, Hainiu, Uusberg, Andero, Gross, James J., Slovak, Petr, He, Yulan

Abstract

Emotion understanding is a core capability for LLMs to interact effectively with humans, yet existing evaluation paradigms rely on discrete emotion label prediction and fail to capture the cognitive processes underlying emotion generation. Grounded in appraisal theory, we introduce CAREBench, the first benchmark with complete inferential chain annotations from both first- and third-person perspectives on real-world narratives, spanning appraisal reasoning, appraisal ratings, and multi-label emotion annotation. We propose a process-level evaluation framework and conduct systematic experiments across six LLMs organized around four research questions. We find that stronger models match or surpass human observers on certain tasks, yet fall short on appraisal reasoning and positive emotion recognition; performance across chain steps and sensitivity to appraisal interventions exhibit dissociations across models; and current models have not internalized the mechanisms needed to capture human subjective heterogeneity. These findings suggest that downstream emotion prediction metrics may overestimate LLMs' true emotion understanding, and CAREBench provides a foundation for more diagnostically informative evaluation of LLMs' affective cognitive capabilities.

Chinese Translation

情感理解是大型语言模型（LLMs）与人类有效互动的核心能力，然而现有的评估范式依赖于离散情感标签的预测，未能捕捉情感生成背后的认知过程。基于评估理论，我们引入了CAREBench，这是第一个基准，提供了来自第一人称和第三人称视角的真实叙事的完整推理链注释，涵盖了评估推理、评估评分和多标签情感注释。我们提出了一个过程级评估框架，并围绕四个研究问题对六个LLMs进行了系统实验。我们发现，较强的模型在某些任务上与人类观察者的表现相匹配或超越，但在评估推理和积极情感识别方面仍显不足；不同模型在链步之间的表现和对评估干预的敏感性存在差异；当前模型尚未内化捕捉人类主观异质性所需的机制。这些发现表明，下游情感预测指标可能高估了LLMs的真实情感理解能力，而CAREBench为更具诊断性的信息化评估LLMs的情感认知能力提供了基础。

cs.AI / 49 / 2605.17214

ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

ChemVA：推进大型语言模型对化学反应图的理解

Rao, Mingyang, Feng, Kehua, Zhu, Zhihui, Fu, Jiangzhen, Yu, Hao, Ding, Keyan, Chen, Huajun

Abstract

While Large Language Models (LLMs) have revolutionized scientific text processing, they exhibit a significant capability gap when interpreting chemical reaction diagrams. We identify two fundamental bottlenecks restricting current systems: a Visual Deficit, where generic vision encoders struggle to resolve the strict topological connectivity of dense molecular graphs, and a Semantic Disconnect, where standard linear strings, such as SMILES, fail to effectively activate the model's latent chemical reasoning. To bridge these gaps, we propose the Chemical Visual Activation (ChemVA) framework, which employs a Visual Anchor mechanism to ground functional groups via hybrid-granularity detection, followed by a semantic alignment approach that translates visual features into entity names to maximize knowledge activation in LLMs. We evaluate our approach on OCRD-Bench, a newly constructed dataset featuring dense visual-semantic contexts and comprehensive reaction coverage to evaluate the full spectrum from recognition to reasoning. Extensive experiments on OCRD-Bench demonstrate that ChemVA achieves 92.0% structural recognition accuracy. By bridging visual and semantic bottlenecks, our framework delivers a consistent performance gain of approximately 20 percentage points across 9 diverse LLMs, enabling open-weight models to rival proprietary SOTA systems in complex chemical reasoning tasks.

Chinese Translation

尽管大型语言模型（LLMs）在科学文本处理方面取得了革命性进展，但在解释化学反应图时仍存在显著的能力差距。我们识别出限制当前系统的两个基本瓶颈：视觉缺陷（Visual Deficit），即通用视觉编码器难以解析密集分子图的严格拓扑连通性；以及语义脱节（Semantic Disconnect），即标准线性字符串（如SMILES）未能有效激活模型的潜在化学推理能力。为了解决这些问题，我们提出了化学视觉激活（Chemical Visual Activation, ChemVA）框架，该框架采用视觉锚机制，通过混合粒度检测来定位功能团，随后采用语义对齐方法将视觉特征转换为实体名称，以最大化LLMs中的知识激活。我们在OCRD-Bench上评估了我们的方法，该数据集新构建，具有密集的视觉-语义上下文和全面的反应覆盖，以评估从识别到推理的完整范围。在OCRD-Bench上的大量实验表明，ChemVA实现了92.0%的结构识别准确率。通过弥合视觉和语义瓶颈，我们的框架在9种不同的LLMs中实现了约20个百分点的一致性能提升，使得开放权重模型在复杂化学推理任务中能够与专有的SOTA系统相媲美。

cs.AI / 50 / 2605.17247

Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate

通过 TIDE 实现稳健的论证性论文理解：一个包含试验与辩论的互动框架

Yin, Zheqin, Ren, Yupei, Zhang, Yadong, Lu, Yujiang, Lan, Man

Abstract

Argumentative essays serve as a vital medium for assessing critical thinking and reasoning skills, yet there is limited work on accurately understanding and evaluating such texts via prompting. In this work, we propose TIDE, a novel framework designed to improve criteria-based prompt optimization for argument-related tasks by integrating TrIal and DEbate mechanisms. Our method addresses key limitations of criteria-based prompt optimization by mitigating the influence of noisy training data and enhancing optimization stability. We evaluate TIDE on three core tasks: Automated Essay Scoring, Argument Component Detection, and Argument Relation Identification. Results demonstrate that our framework improves performance across tasks. These findings underscore the potential of combining prompt-based methods for advanced argument understanding.

Chinese Translation

论证性论文作为评估批判性思维和推理能力的重要媒介，然而目前针对通过提示准确理解和评估此类文本的研究仍然有限。在本研究中，我们提出了 TIDE，一个新颖的框架，旨在通过整合试验（Trial）和辩论（Debate）机制来改善基于标准的提示优化，以应对与论证相关的任务。我们的方法通过减轻噪声训练数据的影响和增强优化稳定性，解决了基于标准的提示优化的关键局限性。我们在三个核心任务上评估了 TIDE：自动论文评分、论证成分检测和论证关系识别。结果表明，我们的框架在各个任务上均提高了性能。这些发现强调了结合基于提示的方法以实现高级论证理解的潜力。

cs.AI / 51 / 2605.17254

CatalyticMLLM: A Graph-Text Multimodal Large Language Model for Catalytic Materials

CatalyticMLLM：用于催化材料的图-文本多模态大语言模型

Li, Yanjie

Abstract

Property prediction and inverse structural design of catalytic materials are typically modeled as two independent tasks: the former predicts target properties from given structures, whereas the latter generates candidate structures according to desired properties. Although the decoupled paradigm facilitates the implementation of a "generation–evaluation–screening" workflow, the inconsistency between the generative model and the property prediction model in terms of representation spaces and training objectives can readily introduce data distribution shifts and evaluator bias, thereby limiting the stability of closed-loop optimization. In this work, we propose QE-Catalytic-V2, a unified graph-text multimodal large language model for catalytic materials, which integrates property prediction and inverse design within the same model and shared representation space. Under this unified framework, QE-Catalytic-V2 can not only perform reliable property prediction by leveraging three-dimensional structures and textual information, but also generate and screen physically feasible CIF candidates conditioned on target properties, thereby forming a closed-loop optimization workflow of "inverse design–prediction–screening–redesign." Experimental results demonstrate that this unified paradigm outperforms decoupled baselines on both catalytic relaxed-energy prediction and inverse design tasks, validating the effectiveness of jointly modeling property prediction and structure generation within a single multimodal model.

Chinese Translation

催化材料的性质预测和逆向结构设计通常被建模为两个独立的任务：前者从给定的结构中预测目标性质，而后者根据期望性质生成候选结构。尽管解耦的范式有助于实施“生成-评估-筛选”的工作流程，但生成模型和性质预测模型在表示空间和训练目标上的不一致性容易引入数据分布的变化和评估者偏差，从而限制闭环优化的稳定性。在本研究中，我们提出了QE-Catalytic-V2，一种用于催化材料的统一图-文本多模态大语言模型，它在同一模型和共享表示空间内整合了性质预测和逆向设计。在这一统一框架下，QE-Catalytic-V2不仅能够利用三维结构和文本信息进行可靠的性质预测，还能生成并筛选基于目标性质的物理可行CIF候选结构，从而形成“逆向设计-预测-筛选-重新设计”的闭环优化工作流程。实验结果表明，这一统一范式在催化放松能量预测和逆向设计任务上均优于解耦基线，验证了在单一多模态模型中联合建模性质预测和结构生成的有效性。

cs.AI / 52 / 2605.17255

CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean

CAM-Bench：一个用于Lean中的计算与应用数学的基准

Long, Wentao, Zhang, Yunfei, Li, Chenyi, Zhou, Li, Sun, Chumin, Wen, Zaiwen

Abstract

Formal theorem-proving benchmarks enable mechanically verifiable evaluation of mathematical reasoning in large language models. However, existing benchmarks mainly focus on Olympiad-style problems and algebraic domains, leaving computational and applied mathematics underrepresented. We introduce CAM-Bench, a Lean 4 theorem-proving benchmark of 1,000 Lean proof targets in computational and applied mathematics, with coverage spanning optimization, numerical linear algebra, and numerical analysis. These problems are adapted from textbook exercises and often depend on locally introduced definitions, notation, algorithms, and elementary results. To construct CAM-Bench, we develop a dependency-recovery pipeline that reconstructs the local textbook context needed to state each problem faithfully. It then normalizes each problem into a standalone informal theorem and translates it into a Lean target. We validate the resulting formal problems through Lean compilation and semantic review, checking both formal correctness and semantic alignment with the original exercises. For each problem, we release the raw exercise, recovered context, normalized informal theorem, and final Lean target. CAM-Bench complements existing formal mathematics benchmarks by targeting applied mathematics problems that rely on textbook concepts and elementary theorems, many of which are not directly available as standard Mathlib4 lemmas. We evaluate widely used large language models and formalization agents on CAM-Bench, and analyze common failure modes in tracking local assumptions, applying elementary results, decomposing proofs, and maintaining long-horizon control in Lean.

Chinese Translation

形式化定理证明基准能够对大型语言模型中的数学推理进行机械可验证的评估。然而，现有基准主要集中在奥林匹克风格的问题和代数领域，导致计算与应用数学的表现不足。我们介绍了CAM-Bench，这是一个Lean 4定理证明基准，包含1,000个计算与应用数学的Lean证明目标，涵盖优化、数值线性代数和数值分析等领域。这些问题改编自教科书习题，通常依赖于局部引入的定义、符号、算法和基本结果。为了构建CAM-Bench，我们开发了一个依赖恢复管道，重建每个问题所需的局部教科书背景，以忠实地陈述每个问题。然后，它将每个问题规范化为一个独立的非正式定理，并将其翻译为Lean目标。我们通过Lean编译和语义审查验证生成的形式化问题，检查形式正确性和与原始习题的语义一致性。对于每个问题，我们发布原始习题、恢复的背景、规范化的非正式定理和最终的Lean目标。CAM-Bench通过针对依赖于教科书概念和基本定理的应用数学问题，补充了现有的形式数学基准，其中许多问题并未作为标准Mathlib4引理直接提供。我们在CAM-Bench上评估广泛使用的大型语言模型和形式化代理，并分析在跟踪局部假设、应用基本结果、分解证明和在Lean中保持长期控制方面的常见失败模式。

cs.AI / 53 / 2605.17268

Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation

VLA推理是否可信？探讨因果链的安全性

Mayumu, Nicanor, Deng, Xiaoheng, Mukala, Patrick

Abstract

We present the first systematic study of faithfulness in Vision-Language-Action (VLA) driving models, analyzing 300 Alpamayo-R1-10B inferences across 100 diverse PhysicalAI-AV scenarios. Our main finding is that the natural-language rationales output alongside trajectories can be significantly unfaithful: (i) overall reasoning fidelity is only 42.5%, with Chain-of-Causation matching scene reality less than half the time; (ii) 94 missed pedestrians in one-third of pedestrian-relevant scenes; (iii) 97.7% trajectory fragility under mild visual perturbations; and (iv) only 48.3% mean reasoning-action consistency, with 53.3% of inferences exhibiting low consistency, including 37.9% of stop-claimed cases where the model continues instead. We formalize faithfulness information-theoretically, define entity and action fidelity with verification criteria, and outline a four-component safety architecture aligned with these results.
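
The idea of reasoning-action consistency (for example, a rationale that claims a stop while the trajectory keeps moving) can be sketched as a simple check. This is only a toy illustration under loud assumptions: the keyword-based `claims_stop`, the final-speed rule in `trajectory_stops`, and the 0.5 m/s threshold are hypothetical and are not the paper's verification criteria.

```python
STOP_KEYWORDS = ("stop", "brake", "yield")  # assumed, deliberately crude

def claims_stop(rationale):
    # Hypothetical text check: does the rationale mention stopping?
    text = rationale.lower()
    return any(word in text for word in STOP_KEYWORDS)

def trajectory_stops(speeds, stop_speed=0.5):
    # Hypothetical action check: the planned speed at the end of the
    # horizon (in m/s) is at or below an assumed stopping threshold.
    return speeds[-1] <= stop_speed

def is_consistent(rationale, speeds):
    # The rationale and the trajectory agree about whether the vehicle stops.
    return claims_stop(rationale) == trajectory_stops(speeds)

cases = [
    ("Pedestrian ahead, stop before the crosswalk", [8.0, 4.0, 0.2]),
    ("Pedestrian ahead, stop before the crosswalk", [8.0, 7.5, 7.0]),
    ("Road is clear, keep lane", [8.0, 8.0, 8.0]),
]
consistency_rate = sum(is_consistent(r, s) for r, s in cases) / len(cases)
print(round(consistency_rate, 3))
```

The second case is the kind of mismatch the abstract calls a stop-claimed case where the model continues. A keyword test like this would misread phrases such as "no need to stop", so a real check would need a more careful parse of the rationale.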

Chinese Translation

我们首次系统性地研究了视觉-语言-行动（VLA）驱动模型中的可信性，分析了100个多样化的PhysicalAI-AV场景中的300个Alpamayo-R1-10B推理。我们的主要发现是，输出的自然语言推理与轨迹可能显著不可信：（i）整体推理的保真度仅为42.5%，因果链与场景现实的匹配率不足一半；（ii）在三分之一与行人相关的场景中遗漏了94名行人；（iii）在轻微视觉扰动下，97.7%的轨迹表现出脆弱性；（iv）平均推理-行动一致性仅为48.3%，其中53.3%的推理表现出低一致性，包括37.9%声称停止的案例中模型却继续行动。我们从信息论的角度形式化可信性，定义了实体和行动的保真度及其验证标准，并概述了与这些结果相一致的四个组成部分的安全架构。

cs.AI / 54 / 2605.17278

A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

A2RBench：一种自动化的形式可验证抽象推理基准生成范式

Ma, Qingchuan, Ma, Yuexiao, Xie, Yongkang, Xie, Tianyu, Zheng, Xiawu, Ji, Rongrong

Abstract

Abstract reasoning ability reflects the intelligence and generalization capacity of LLMs to extract and apply abstract rules. However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive manual annotation, limiting their scale, or risk measuring memorization rather than genuine reasoning. To address this, we introduce an automated pipeline named A2RBench, encompassing generation, expansion, evaluation, and analysis. Specifically, in the generation stage, LLMs create diverse tasks demanding genuine reasoning; in the expansion stage, LLMs reuse validated rules and expand new input spaces to generate task variations, achieving scaling. However, such a process may cause hallucinations. To eliminate these hallucinations, we further establish a theoretical framework and prove that programmatic verification--testing whether the inverse operation perfectly reverses the forward operation (cycle consistency)--guarantees a unique solution. Through extensive evaluations on mainstream LLMs, we find: (1) Current LLMs exhibit fundamental deficiencies in abstract reasoning, with top models significantly underperforming humans on a representative subset (39.8% vs. 68.5%). (2) Current LLMs fall far short of 2D and 1D in the complexity of generated 3D tasks, revealing their lack of understanding of high-dimensional tasks. (3) Counterintuitively, inputs with higher information complexity can simplify the reasoning process.
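
The cycle-consistency test mentioned in this abstract (checking that the inverse operation exactly reverses the forward operation) can be sketched in a few lines. The rule pairs below (`rotate_left`/`rotate_right` and `drop_first`/`pad_front`) are hypothetical examples chosen for illustration, not tasks from the benchmark.

```python
def is_cycle_consistent(forward, inverse, inputs):
    # A rule passes only if inverse(forward(x)) returns x for every input.
    return all(inverse(forward(x)) == x for x in inputs)

def rotate_left(seq):
    # Forward rule: move the first element to the end.
    return seq[1:] + seq[:1]

def rotate_right(seq):
    # Candidate inverse: move the last element to the front.
    return seq[-1:] + seq[:-1]

def drop_first(seq):
    # Lossy forward rule: the first element is discarded.
    return seq[1:]

def pad_front(seq):
    # Attempted inverse that cannot recover the discarded element.
    return (0,) + seq

samples = [(1, 2, 3), (4, 5, 6, 7), (9,)]
print(is_cycle_consistent(rotate_left, rotate_right, samples))  # True
print(is_cycle_consistent(drop_first, pad_front, samples))      # False
```

A lossy rule such as `drop_first` fails because several inputs map to the same output, so no inverse can recover the input. That is the intuition behind using the round trip as a filter, although the paper's formal uniqueness guarantee is not reproduced here.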

Chinese Translation

抽象推理能力反映了大型语言模型（LLMs）提取和应用抽象规则的智能和概括能力。然而，准确测量这一能力仍然具有挑战性：现有基准要么依赖昂贵的人工标注，限制了其规模，要么存在测量记忆而非真正推理的风险。为了解决这个问题，我们引入了一个名为A2RBench的自动化流程，涵盖生成、扩展、评估和分析。具体而言，在生成阶段，LLMs创建需要真正推理的多样化任务；在扩展阶段，LLMs重用经过验证的规则并扩展新的输入空间以生成任务变体，从而实现规模化。然而，这一过程可能导致幻觉现象。为消除这种现象，我们进一步建立了一个理论框架，并证明程序验证——测试逆操作是否完美地逆转前向操作（循环一致性）——可以保证唯一解。通过对主流LLMs进行广泛评估，我们发现：（1）当前LLMs在抽象推理方面存在根本缺陷，顶尖模型在一个代表性子集上的表现显著低于人类（39.8%对68.5%）。（2）当前LLMs在生成的3D任务复杂度上远远落后于2D和1D，揭示了它们对高维任务理解的缺乏。（3）反直觉的是，信息复杂度较高的输入可以简化推理过程。

cs.AI / 55 / 2605.17292

MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation

MetaCogAgent：一个具有自我意识任务委派的元认知多智能体大语言模型框架

Wang, Chenyu, Shu, Yang

Abstract

Multi-agent large language model (LLM) systems have shown promise for solving complex tasks through agent collaboration. However, existing frameworks assign tasks based on predefined roles without considering whether an agent can accurately assess its own competence boundaries, leading to overconfident execution on tasks beyond its expertise. Inspired by metacognition theory from cognitive science, we propose MetaCogAgent, a multi-agent LLM framework where each agent is equipped with a Metacognitive Self-Assessment Unit that evaluates task-capability alignment before execution. The framework introduces three contributions: (1) a self-assessment mechanism that estimates per-task confidence by combining verbalized uncertainty with historical capability profiles; (2) an adaptive delegation protocol that routes low-confidence tasks to better-suited agents through cross-agent evaluation; and (3) a capability boundary learning module that iteratively refines each agent's competence model via cybernetic feedback. Experiments on our constructed MetaCog-Eval benchmark (700 tasks across 5 cognitive dimensions) demonstrate that MetaCogAgent achieves 82.4% task accuracy -- 8.7% above the best routing baseline -- while using 5% fewer API calls than AutoGen and 34% fewer than ensemble voting. Ablation studies confirm that each metacognitive component contributes to overall system performance.
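
Contributions (1) and (2) can be pictured with a small sketch that blends a verbalized confidence with a historical success rate and reroutes low-confidence tasks. The `Agent` class, the equal-weight blend, the 0.6 threshold, and the example numbers are all assumptions for illustration and do not come from the paper.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    history: dict  # hypothetical profile: task type -> past success rate

    def confidence(self, task_type, verbalized, weight=0.5):
        # Assumed blend of self-reported confidence and track record.
        past = self.history.get(task_type, 0.5)
        return weight * verbalized + (1 - weight) * past

def delegate(task_type, agents, verbalized, threshold=0.6):
    # `verbalized` maps agent name -> self-reported confidence in [0, 1].
    scores = {a.name: a.confidence(task_type, verbalized[a.name]) for a in agents}
    default = agents[0].name
    if scores[default] >= threshold:
        return default, scores          # default agent keeps the task
    best = max(scores, key=scores.get)  # otherwise route to the best-suited agent
    return best, scores

agents = [
    Agent("coder", {"math": 0.4, "code": 0.9}),
    Agent("analyst", {"math": 0.85, "code": 0.5}),
]
print(delegate("math", agents, {"coder": 0.5, "analyst": 0.9})[0])  # analyst
print(delegate("code", agents, {"coder": 0.8, "analyst": 0.4})[0])  # coder
```

Updating `history` after each task outcome would be one simple way to approximate the feedback-driven boundary learning in contribution (3), though the paper's actual update rule is not specified in the abstract.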

Chinese Translation

多智能体大语言模型（LLM）系统在通过智能体协作解决复杂任务方面展现了良好的前景。然而，现有框架基于预定义角色分配任务，而未考虑智能体是否能够准确评估自身能力边界，导致在超出其专业领域的任务上执行过于自信。受到认知科学中元认知理论的启发，我们提出了MetaCogAgent，一个多智能体LLM框架，其中每个智能体配备了一个元认知自我评估单元，在执行前评估任务能力的匹配程度。该框架提出了三项贡献：（1）一种自我评估机制，通过结合口头表达的不确定性与历史能力档案来估计每个任务的信心；（2）一种自适应委派协议，通过跨智能体评估将低信心任务路由到更合适的智能体；（3）一个能力边界学习模块，通过控制论反馈迭代细化每个智能体的能力模型。在我们构建的MetaCog-Eval基准（涵盖5个认知维度的700个任务）上的实验表明，MetaCogAgent实现了82.4%的任务准确率，比最佳路由基线高出8.7%，同时API调用次数比AutoGen少5%，比集成投票少34%。消融研究确认每个元认知组件对整体系统性能都有贡献。

cs.AI / 56 / 2605.17305

CyberCorrect: A Cybernetic Framework for Closed-Loop Self-Correction in Large Language Models

CyberCorrect：一个用于大语言模型闭环自我修正的控制论框架

Wu, Yuning, Liu, Yingmin, Shu, Yang

Abstract

Large language model (LLM) self-correction -- the ability to detect and fix errors in generated outputs -- remains largely ad hoc, relying on generic prompts such as "please reconsider your answer" without systematic error analysis or convergence guarantees. We propose CyberCorrect, a framework that formalizes LLM self-correction as a closed-loop control system grounded in cybernetic theory. The framework models the LLM generator as the plant and introduces a tri-modal Error Detector (combining self-consistency, verbalized confidence, and logic-chain verification) as the sensor. A type-directed Correction Controller generates targeted repair instructions based on diagnosed error categories, while a Convergence Judge determines iteration termination using stability criteria adapted from control theory. We further introduce three control-theoretic evaluation metrics -- convergence rate, overshoot rate, and oscillation rate -- that capture correction dynamics beyond final accuracy. Experiments on our constructed CyberCorrect-Bench (440 reasoning tasks with annotated error types and correction paths) show that CyberCorrect achieves 79.8% final accuracy, improving upon the best existing self-correction method by 6.2 percentage points, while reducing overshoot (erroneous over-correction) by 41% through its convergence control mechanism.
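
The three correction-dynamics metrics can be made concrete with a sketch over per-task correctness trajectories, where each entry records whether the answer was correct after one correction round. The definitions below are assumptions: the abstract only glosses overshoot as erroneous over-correction, so `converged`, `overshoots`, and `oscillates` are hypothetical readings, not the paper's formulas.

```python
def converged(trajectory):
    # Assumed definition: the final answer is correct.
    return trajectory[-1]

def overshoots(trajectory):
    # Assumed definition: a correct answer is later replaced by a wrong one.
    return any(a and not b for a, b in zip(trajectory, trajectory[1:]))

def oscillates(trajectory):
    # Assumed definition: correctness flips at least twice.
    flips = sum(a != b for a, b in zip(trajectory, trajectory[1:]))
    return flips >= 2

def rate(predicate, trajectories):
    # Fraction of trajectories for which the predicate holds.
    return sum(predicate(t) for t in trajectories) / len(trajectories)

# Hypothetical trajectories: True means correct after that round.
trajectories = [
    [False, True, True],
    [True, False, False],
    [False, True, False, True],
    [False, False, True],
]
print(rate(converged, trajectories))   # 0.75
print(rate(overshoots, trajectories))  # 0.5
print(rate(oscillates, trajectories))  # 0.25
```

Metrics of this kind separate a method that reaches the right answer and stays there from one that reaches it only after flipping back and forth, which final accuracy alone cannot show.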

Chinese Translation

大语言模型（LLM）自我修正——检测和修复生成输出中的错误的能力——在很大程度上仍然是临时性的，依赖于诸如“请重新考虑你的答案”等通用提示，而没有系统的错误分析或收敛保证。我们提出了CyberCorrect，一个将LLM自我修正形式化为基于控制论理论的闭环控制系统的框架。该框架将LLM生成器建模为植物，并引入三模态错误检测器（结合自一致性、口头信心和逻辑链验证）作为传感器。类型导向的修正控制器根据诊断的错误类别生成有针对性的修复指令，而收敛判断器使用从控制理论中调整的稳定性标准来确定迭代终止。我们进一步引入了三种控制理论评估指标——收敛率、超调率和振荡率——以捕捉超出最终准确性的修正动态。在我们构建的CyberCorrect-Bench（包含440个带注释错误类型和修正路径的推理任务）上的实验表明，CyberCorrect实现了79.8%的最终准确率，比现有最佳自我修正方法提高了6.2个百分点，同时通过其收敛控制机制将超调（错误的过度修正）降低了41%。

cs.AI / 57 / 2605.17308

Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification

诊断前推理：基于医生启发的心电图分类结构化思维

Wu, Yang, Yuan, Xiaoyan, Wong, Hau-San, Hu, Xiping

Abstract

Electrocardiogram (ECG) diagnosis in clinical practice relies on structured reasoning over multiple hierarchical aspects, including cardiac rhythm, conduction properties, waveform morphology, and overall diagnostic impression. However, most existing approaches predict labels directly from ECG signals without explicit clinical reasoning, resulting in opaque decisions that lack clinical alignment. To bridge this gap, we propose CardioThink, a physician-inspired multimodal large language model (MLLM) framework that explicitly models the diagnostic reasoning process through human-interpretable intermediate stages (rhythm, conduction, morphology, and impression) to derive final classification results. Furthermore, we introduce Structured Set Policy Optimization (SSPO) to jointly optimize adherence to this structured reasoning format and the accuracy of variable-size diagnostic sets, without requiring manually annotated reasoning traces. Extensive experiments on diverse ECG benchmarks demonstrate the significant superiority of our approach in diagnostic accuracy, while simultaneously providing interpretable clinical reasoning. Notably, reasoning quality evaluations confirm that SSPO substantially enhances the clinical validity of the generated rationales. These findings reveal that moving beyond direct label prediction toward structured reasoning offers a more clinically aligned direction for future ECG modeling.
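
A reward that jointly scores adherence to the four-stage reasoning format and the accuracy of a variable-size diagnostic set could look like the following sketch. This is not SSPO itself: the `stage:` text convention, the set-level F1 score, the 0.2 format weight, and the label names are all hypothetical choices made for illustration.

```python
STAGES = ("rhythm", "conduction", "morphology", "impression")

def follows_format(text):
    # Assumed convention: each stage appears as "stage:" and in this order.
    lowered = text.lower()
    positions = [lowered.find(stage + ":") for stage in STAGES]
    return all(p >= 0 for p in positions) and positions == sorted(positions)

def set_f1(predicted, reference):
    # F1 between predicted and reference label sets of any size.
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def reward(text, predicted, reference, format_weight=0.2):
    # Hypothetical weighted sum of format adherence and set accuracy.
    format_score = float(follows_format(text))
    return format_weight * format_score + (1 - format_weight) * set_f1(predicted, reference)

good = "Rhythm: sinus. Conduction: normal. Morphology: ST elevation. Impression: MI."
print(round(reward(good, {"MI", "SR"}, {"MI"}), 4))
```

Scoring the label set as a whole, rather than one label at a time, lets the reward penalize both missed and spurious diagnoses when the number of diagnoses varies between recordings.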

Chinese Translation

心电图（ECG）诊断在临床实践中依赖于对多个层次方面的结构化推理，包括心脏节律、传导特性、波形形态和整体诊断印象。然而，大多数现有方法直接从心电图信号预测标签，而没有明确的临床推理，导致决策不透明且缺乏临床一致性。为了解决这一问题，我们提出了CardioThink，一个基于医生启发的多模态大型语言模型（MLLM）框架，该框架通过人类可解释的中间阶段（节律、传导、形态和印象）明确建模诊断推理过程，以得出最终分类结果。此外，我们引入了结构化集合策略优化（SSPO），以联合优化对这种结构化推理格式的遵循和可变大小诊断集合的准确性，而无需手动标注推理轨迹。在多样化的心电图基准测试中进行的大量实验表明，我们的方法在诊断准确性上具有显著优势，同时提供了可解释的临床推理。值得注意的是，推理质量评估确认SSPO显著增强了生成推理的临床有效性。这些发现表明，从直接标签预测转向结构化推理为未来心电图建模提供了更符合临床的方向。

cs.AI / 58 / 2605.17355

HyperPersona: A Multi-Level Hypergraph Framework for Text-Based Automatic Personality Prediction

HyperPersona：一种基于多层超图的文本自动人格预测框架

Heydari, Sina, Ramezani, Majid

Abstract

As a modern commodity, language has become a vast repository of socially and psychologically significant traits and concepts, reflecting the ways people encode patterns of thought, behavior, and emotion into words. Text-based Automatic Personality Prediction (APP) seeks to infer personality from linguistic behavior, offering a scalable alternative to traditional psychometric assessments. Although text is inherently hierarchical, with the document-level capturing global features, the sentence-level encoding local semantics, and the word-level providing fine-grained lexical information, most existing approaches rely on shallow, sequential, or single-level representations that ignore the multi-level structure of written language. To address this, we propose HyperPersona, a framework that explicitly models the hierarchical organization of text (document, sentence, and word) through hypergraph structure, where a document and its sentences are represented as hyperedges, and the words are represented as nodes, enabling joint modeling of global, local, and lexical dependencies of text. This is followed by a transformer-based graph encoder that learns interactions within and across these linguistic layers, yielding context-sensitive and structurally grounded feature representations for personality prediction. Experiments on the Big Five personality dimensions show that, while relying solely on text, HyperPersona effectively integrates multi-level linguistic cues, achieving superior performance compared to state-of-the-art baselines. These findings underscore the critical role of textual hierarchy in advancing human-like personality inference from natural language.
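
The hypergraph layout described here (words as nodes, each sentence and the whole document as hyperedges) can be written down as a node-by-hyperedge incidence matrix. The sketch below is a minimal illustration under assumptions: `build_incidence`, the pre-tokenized toy document, and the binary weighting are hypothetical, and the transformer-based encoder is not shown.

```python
import numpy as np

def build_incidence(sentences):
    # Nodes are the unique words; there is one hyperedge per sentence
    # plus a final hyperedge that covers the whole document.
    vocab = sorted({word for sentence in sentences for word in sentence})
    index = {word: i for i, word in enumerate(vocab)}
    incidence = np.zeros((len(vocab), len(sentences) + 1), dtype=int)
    for j, sentence in enumerate(sentences):
        for word in sentence:
            incidence[index[word], j] = 1
    incidence[:, -1] = 1  # every word belongs to the document hyperedge
    return vocab, incidence

# Hypothetical two-sentence document, already tokenized.
doc = [
    ["i", "enjoy", "quiet", "evenings"],
    ["quiet", "evenings", "help", "me", "think"],
]
vocab, H = build_incidence(doc)
print(H.shape)        # (7, 3)
print(H.sum(axis=0))  # [4 5 7]
```

A word that occurs in both sentences, such as "quiet", has ones in both sentence columns and the document column, which is how a single node can participate in local and global context at once.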

Chinese Translation

作为一种现代商品，语言已成为一个庞大的社会和心理特征及概念的库，反映了人们如何将思维、行为和情感模式编码为文字。基于文本的自动人格预测（APP）旨在从语言行为中推断人格，为传统心理测量评估提供了一种可扩展的替代方案。尽管文本本质上是层次化的，文档级别捕捉全局特征，句子级别编码局部语义，而词汇级别提供细粒度的词汇信息，但大多数现有方法依赖于浅层、顺序或单层表示，忽视了书面语言的多层结构。为了解决这个问题，我们提出了HyperPersona，一个通过超图结构明确建模文本（文档、句子和单词）层次组织的框架，其中文档及其句子被表示为超边，单词被表示为节点，从而实现文本的全局、局部和词汇依赖的联合建模。随后，采用基于变换器的图编码器学习这些语言层次内外的交互，生成上下文敏感且结构化的特征表示用于人格预测。在对五大人格维度的实验中，结果表明，HyperPersona在仅依赖文本的情况下，能够有效整合多层语言线索，表现优于最先进的基线。这些发现强调了文本层次结构在从自然语言推进类人类人格推断中的关键作用。

cs.AI / 59 / 2605.17370

CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

CBT-Audio：评估音频语言模型在认知行为疗法会话录音中对患者痛苦强度的估计

Hu, Qixuan, Ye, Shuchang, Zhang, Xumou, Serafimovska, Anastasia, Suraev, Anastasia, Saha, Amit, Lin, Ping-hsiu, Su, Sydney, Naseem, Usman, Dunn, Adam G., Kim, Jinman

Abstract

Cognitive behavioural therapy is widely used to help patients understand and manage psychological distress. It is often delivered through spoken conversation, where therapists attend not only to what patients say, but also to how they say it, because these cues can help therapists decide how to respond and adapt treatment. Progress in building AI systems for CBT remains largely limited to text, partly because most available datasets are text-based and shareable spoken CBT data are scarce under ethical and privacy constraints. This creates a blind spot because text-based models and evaluations cannot capture the mismatch between the transcript and the patient's voice, even though therapists often rely on this mismatch to understand patient distress. We introduce CBT-Audio, a dataset for evaluating patient distress estimation from spoken CBT sessions with audio language models. CBT-Audio contains 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress labels validated on an expert-annotated subset. We evaluate 10 open source audio language models under three input conditions, where models receive only patient audio, only the transcript, or both audio and transcript. Our results show that audio can provide useful information beyond text, especially when combined with transcripts. Adding audio to transcript input improves distress estimation over using the transcript alone in 8 of 10 model families, with significant gains in 4, and case studies show the clearest benefit when verbal content and vocal delivery diverge. CBT-Audio makes spoken patient behaviour measurable for AI evaluation in CBT-related tasks and supports future work on audio language models for mental health interaction.

Chinese Translation

认知行为疗法广泛用于帮助患者理解和管理心理痛苦。它通常通过口头交流进行，治疗师不仅关注患者所说的内容，还关注他们的表达方式，因为这些线索可以帮助治疗师决定如何回应并调整治疗方案。构建用于认知行为疗法的人工智能系统的进展在很大程度上仍然局限于文本，部分原因是大多数可用的数据集是基于文本的，而在伦理和隐私限制下，可共享的口语认知行为疗法数据非常稀缺。这造成了一个盲点，因为基于文本的模型和评估无法捕捉到转录文本与患者声音之间的不匹配，尽管治疗师通常依赖这种不匹配来理解患者的痛苦。我们引入了CBT-Audio，一个用于评估从口语认知行为疗法会话中估计患者痛苦的音频语言模型的数据集。CBT-Audio包含来自96个公开可用的认知行为疗法录音的1,802个患者发言回合，回合级痛苦标签经过专家注释的子集验证。我们在三种输入条件下评估了10个开源音频语言模型，其中模型仅接收患者音频、仅接收转录文本或同时接收音频和转录文本。我们的结果表明，音频可以提供超越文本的有用信息，尤其是在与转录文本结合时。将音频添加到转录文本输入中，在10个模型家族中的8个中改善了痛苦估计，并在4个模型中取得了显著的提升，案例研究显示当口头内容与声音表达不一致时，效果最为明显。CBT-Audio使得口语患者行为在认知行为疗法相关任务中可测量，为人工智能评估提供支持，并促进未来在心理健康互动中音频语言模型的研究。

cs.AI / 60 / 2605.17380

ADR: An Agentic Detection System for Enterprise Agentic AI Security

ADR：企业代理人工智能安全的代理检测系统

Li, Chenning, Hu, Pan, Xu, Justin, Ozbas, Baris, Liu, Olivia, Van, Caroline, Li, Manxue, Zhou, Wei, Alizadeh, Mohammad, Zhang, Pengyu, Sriramadhesikan, KK, Zhang, Ming

Abstract

We present the Agentic AI Detection and Response (ADR) system, the first large-scale, production-proven enterprise framework for securing AI agents operating through the Model Context Protocol (MCP). We identify three persistent challenges in this domain: (1) limited observability -- existing Endpoint Detection and Response (EDR) tools see file writes but not the agent reasoning, prompts, or causal chains linking intent to execution; (2) insufficient robustness -- static defenses constrained by pre-defined rules fail to generalize across diverse attack techniques and enterprise contexts; and (3) high detection costs -- LLM-based inference is prohibitively expensive at scale. ADR addresses these challenges via three components: the ADR Sensor for high-fidelity agentic telemetry, the ADR Explorer for systematic pre-deployment red teaming and hard-example generation, and the ADR Detector for scalable, two-tier online detection combining fast triage with context-aware reasoning. Deployed at Uber for over ten months, ADR has sustained reliable detection in production with growing adoption reaching over 7,200 unique hosts and processing over 10,000 agent sessions daily, uncovering hundreds of credential exposures across 26 categories and enabling a shift-left prevention layer (97.2% precision, 206 detected credentials). To validate the approach and enable community adoption, we introduce ADR-Bench (302 tasks, 17 techniques, 133 MCP servers), where ADR achieves zero false positives while detecting 67% of attacks -- outperforming three state-of-the-art baselines (ALRPHFS, GuardAgent, LlamaFirewall) by 2--4x in F1-score. On AgentDojo (public prompt injection benchmark), ADR detects all attacks with only three false alarms out of 93 tasks.
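The ADR-Bench figures quoted above (zero false positives, 67% of attacks detected) imply an F1 score under the standard definitions. The following is a small arithmetic sketch, assuming the 67% detection rate is recall and that at least one alert was raised; the resulting number is derived here and is not a figure reported in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Zero false positives means every raised alert is a real attack,
# so precision is 1.0. Recall is taken to be the 67% detection rate.
adr_bench_f1 = f1_score(precision=1.0, recall=0.67)
print(round(adr_bench_f1, 3))
```

With perfect precision, F1 is limited only by recall, which is why a detector that misses a third of attacks can still score about 0.8.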

Chinese Translation

我们提出了代理人工智能检测与响应（ADR）系统，这是第一个经过大规模生产验证的企业框架，用于保护通过模型上下文协议（Model Context Protocol, MCP）运行的人工智能代理。我们在这一领域识别出三个持续存在的挑战：（1）有限的可观察性——现有的端点检测与响应（Endpoint Detection and Response, EDR）工具能够看到文件写入，但无法观察代理的推理、提示或将意图与执行联系起来的因果链；（2）不足的鲁棒性——受预定义规则限制的静态防御无法在多样化的攻击技术和企业环境中进行泛化；（3）高检测成本——基于大型语言模型（LLM）的推理在规模上代价过高。ADR通过三个组件解决这些挑战：ADR传感器用于高保真代理遥测，ADR探索器用于系统的预部署红队测试和困难示例生成，以及ADR检测器用于可扩展的两级在线检测，结合快速分类与上下文感知推理。在Uber部署超过十个月后，ADR在生产环境中持续可靠地进行检测，采用率不断增长，已覆盖超过7200个独特主机，每日处理超过10000个代理会话，揭示了26个类别中的数百个凭证暴露，并实现了向左预防层的转变（97.2%的精确率，206个检测到的凭证）。为了验证该方法并促进社区的采用，我们引入了ADR-Bench（302个任务，17种技术，133个MCP服务器），在该基准上，ADR实现了零假阳性，同时检测到67%的攻击——在F1分数上超越了三种最先进的基线（ALRPHFS、GuardAgent、LlamaFirewall）2至4倍。在AgentDojo（公共提示注入基准）上，ADR检测到所有攻击，仅在93个任务中产生三次误报。

cs.AI / 61 / 2605.17382

QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI

QQJ：量化定性判断以实现可扩展和人类对齐的生成性人工智能评估

Veysi, Marjan, Shamsinejadbabaki, Pirooz, Zare, Mohammad, Sabouri, Mohammad

Abstract

The rapid progress of generative artificial intelligence has exposed fundamental limitations in existing evaluation methodologies, particularly for open-ended, creative, and human-facing tasks. Traditional automatic metrics rely on surface-level statistical similarity and often fail to reflect human perceptions of quality, while purely human evaluation, although reliable, is costly, subjective, and difficult to scale. Recent approaches using large language models as evaluators offer improved scalability but frequently lack explicit grounding in human-defined evaluation principles, leading to bias and inconsistency. In this paper, we introduce Quantifying Qualitative Judgment (QQJ), a scalable and human-centric evaluation framework that explicitly bridges the gap between human judgment and automated assessment. QQJ separates the definition of quality from its execution by anchoring evaluation in expert-designed, multi-dimensional rubrics and calibrating large language model evaluators to align with expert reasoning using a small, high-quality annotation set. This design enables consistent, interpretable, and scalable evaluation across diverse generative tasks and modalities. Extensive experiments on text and image generation demonstrate that QQJ achieves substantially stronger alignment with human judgment than traditional automatic metrics and unconstrained LLM-based evaluators. Moreover, QQJ exhibits improved stability across repeated evaluations and superior diagnostic capability in identifying critical failure modes such as hallucination and intent mismatch. These results indicate that structured qualitative judgment can be operationalized at scale without sacrificing interpretability or human alignment, positioning QQJ as a practical foundation for reliable evaluation of modern generative AI systems.

Chinese Translation

生成性人工智能的快速进展暴露了现有评估方法的基本局限性，尤其是在开放式、创造性和面向人类的任务中。传统的自动化指标依赖于表层统计相似性，往往无法反映人类对质量的感知，而纯粹的人类评估虽然可靠，却成本高、主观且难以扩展。最近使用大型语言模型作为评估者的方法提供了更好的可扩展性，但常常缺乏明确的基于人类定义的评估原则，导致偏见和不一致。在本文中，我们介绍了量化定性判断（QQJ），一个可扩展且以人为中心的评估框架，明确弥合了人类判断与自动化评估之间的差距。QQJ通过将评估锚定在专家设计的多维评分标准上，并利用小规模高质量的标注集校准大型语言模型评估者，以使其与专家推理对齐，从而将质量的定义与其执行分离。这一设计使得在多样的生成任务和模式中实现一致、可解释和可扩展的评估成为可能。针对文本和图像生成的广泛实验表明，QQJ在与人类判断的一致性方面显著优于传统的自动化指标和不受限制的基于LLM的评估者。此外，QQJ在重复评估中的稳定性有所提升，并在识别关键失败模式（如幻觉和意图不匹配）方面表现出更强的诊断能力。这些结果表明，结构化的定性判断可以在不牺牲可解释性或人类对齐的情况下实现规模化操作，使QQJ成为现代生成性人工智能系统可靠评估的实用基础。

cs.AI / 62 / 2605.17393

Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

用于多智能体强化学习的异构信息瓶颈协调图

Duan, Wei, Xuan, Junyu, Yu, En, Yang, Xiaoyu, Lu, Jie

Abstract

Coordination graphs are a central abstraction in cooperative multi-agent reinforcement learning (MARL), yet existing sparse-graph learners lack a theoretically grounded mechanism to decide which edges should exist and how much information each edge should carry. Current methods rely on heuristic criteria that offer no formal guarantee on the learned topology, and no principled way to allocate different communication capacities to structurally different agent relationships. To address this, we propose Heterogeneous Information-Bottleneck Coordination Graphs (HIBCG), which learns a group-aware sparse graph in which both edge existence and message capacity are theoretically justified. With the graph information bottleneck (GIB) serving as the underlying tool, HIBCG first constructs a group-aligned block-diagonal prior that provides a closed-form criterion for edge retention -- determining which edges should exist and at what density per group block -- and then controls per-agent feature bandwidth on the resulting topology, compressing messages to retain only task-relevant content. We prove that the group-aligned prior strictly tightens the variational bound on topology learning, that the objective decomposes per group block, enabling differential edge control, and that capacity allocation follows a water-filling principle.
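The abstract states that capacity allocation follows a water-filling principle but does not give the objective. As a rough illustration only, here is the textbook water-filling rule, where each channel receives `max(0, level - 1/gain)` and the level is chosen so that the allocations sum to the budget. The function name, the `gains` values, and the reading of "gain" as edge or agent relevance are all assumptions, not HIBCG's actual formulation.

```python
def water_filling(gains, budget, iters=100):
    """Textbook water-filling: p_i = max(0, level - 1/g_i), sum(p_i) = budget."""
    floors = [1.0 / g for g in gains]            # "ground height" of each channel
    lo, hi = min(floors), max(floors) + budget   # the water level lies in this bracket
    for _ in range(iters):                       # bisection on the water level
        level = (lo + hi) / 2.0
        used = sum(max(0.0, level - f) for f in floors)
        if used > budget:
            hi = level
        else:
            lo = level
    level = (lo + hi) / 2.0
    return [max(0.0, level - f) for f in floors]


# Hypothetical gains: two useful channels and one weak one.
allocation = water_filling(gains=[2.0, 1.0, 0.1], budget=1.0)
print([round(p, 3) for p in allocation])
```

In this toy case the weak channel sits above the water level and gets no capacity, while the strongest channel gets the largest share (about 0.75 versus 0.25).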

Chinese Translation

协调图是合作多智能体强化学习（MARL）中的一个核心抽象，然而现有的稀疏图学习者缺乏理论基础机制来决定哪些边应当存在以及每条边应承载多少信息。目前的方法依赖于启发式标准，这些标准对学习到的拓扑结构没有正式保证，也没有原则性的方法来为结构上不同的智能体关系分配不同的通信能力。为了解决这一问题，我们提出了异构信息瓶颈协调图（HIBCG），该方法学习一个群体感知的稀疏图，其中边的存在和消息容量均有理论依据。以图信息瓶颈（GIB）作为基础工具，HIBCG首先构建一个群体对齐的块对角先验，为边的保留提供了闭式形式的标准——确定哪些边应当存在以及每个群体块的密度，然后在结果拓扑上控制每个智能体的特征带宽，压缩消息以保留仅与任务相关的内容。我们证明了群体对齐的先验严格收紧了拓扑学习的变分界限，目标在每个群体块上分解，从而实现差异化的边控制，并且容量分配遵循水填充原则。

cs.AI / 63 / 2605.17410

Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design

代币经济学中的计算挑战：连接经济理论与人工智能系统设计

Wu, Ou, Deng, Yingjun

Abstract

Token economics has emerged as a useful lens for understanding resource allocation, value creation, and pricing in large language model systems. While recent work has increasingly treated tokens as economic primitives, there remains a substantial gap between high-level economic theory and the computational realities of modern AI infrastructure. This paper identifies and analyzes the key computational challenges that arise when token-economic principles are implemented in real-time inference systems. We argue that computational feasibility is not merely one dimension of token economics, but its governing constraint: these challenges are driven by fundamental tensions among fine-grained valuation, low-latency execution, and allocation optimality under uncertainty. To structure this problem space, we introduce the notion of Computational Token Economics and propose the Token Economics Trilemma -- a conditional no-free-lunch principle that captures the inherent trade-offs among granularity, real-time performance, and optimality. We further categorize the main technical challenges into three areas: real-time value accounting, constrained resource allocation, and economic-aware system architecture. Rather than presenting a complete solution, this paper aims to define a research agenda for bridging token economics and AI system design, highlighting open problems at the intersection of computational economics, machine learning systems, and AI infrastructure.

Chinese Translation

代币经济学已成为理解大语言模型系统中资源分配、价值创造和定价的重要视角。尽管近期的研究越来越多地将代币视为经济原语，但高层经济理论与现代人工智能基础设施的计算现实之间仍存在显著差距。本文识别并分析了在实时推理系统中实施代币经济原则时所面临的关键计算挑战。我们认为，计算可行性不仅是代币经济学的一个维度，而是其主导约束：这些挑战源于细粒度评估、低延迟执行和不确定性下的最优分配之间的基本张力。为了构建这一问题空间，我们引入了计算代币经济学的概念，并提出了代币经济学三难困境——一种条件性无免费午餐原则，捕捉了粒度、实时性能和最优性之间的固有权衡。我们进一步将主要技术挑战分类为三个领域：实时价值核算、受限资源分配和经济意识系统架构。本文旨在定义一个研究议程，以连接代币经济学与人工智能系统设计，突出计算经济学、机器学习系统与人工智能基础设施交叉领域中的开放问题，而非提供完整解决方案。

cs.AI / 64 / 2605.17454

Multi-Party Multi-Objective Optimization as Consensus Search: Runtime Analysis of Cross-Party Recombination

作为共识搜索的多方多目标优化：跨党派重组的运行时间分析

Fang, Xiaolei, Xu, Peilan, Luo, Wenjian

Abstract

Multi-party multi-objective optimization problems (MPMOPs) require consensus among autonomous decision makers and therefore differ from flattened many-objective formulations. Existing runtime theory for multi-objective evolutionary algorithms is largely tailored to single-party Pareto-front approximation and does not directly explain common-solution search in MPMOPs. We investigate cross-party recombination in two representative settings. On MP-JCG, a pseudo-Boolean benchmark with an explicit gap region, we prove that a payoff-guided mutation baseline faces a gap-crossing bottleneck requiring $\Theta(n^2)$ expected fitness evaluations. In contrast, an analytical CPR-NSGA-II variant discovers both common Pareto-optimal solutions in $O(n\log n)$ expected evaluations by directly assembling complementary prefix and suffix templates distributed across party populations. Comparing this with the flattened four-objective formulation F-JCG, our full-front coverage analysis illustrates the additional coverage burden introduced by flattening. For BPBOMST, the bi-party, two-objective-per-party specialization of the multi-party multi-objective minimum spanning tree problem, we develop a layered support-cover analysis. For each common Pareto objective vector, the symmetric average projection induces an auxiliary bi-objective MST instance, and suitable support representatives yield a $2\lambda$-common approximation cover with $\lambda\in[1,2]$. We further derive an instance-parameterized expected runtime bound for a representative-pool CPR-NSGA-II variant using edge-union recombination and uniform repair. This bound separates the effects of local auxiliary-front filling, cross-party recombination shortcuts, and edge-union repair ambiguity.

Chinese Translation

多方多目标优化问题（MPMOPs）需要自主决策者之间达成共识，因此与扁平化的多目标形式有所不同。现有的多目标进化算法运行时间理论主要针对单方的帕累托前沿近似，并未直接解释在MPMOPs中的共通解搜索。我们在两个代表性设置中研究跨党派重组。在具有显式间隙区域的伪布尔基准MP-JCG上，我们证明了一个基于收益引导的变异基线面临一个需要$\Theta(n^2)$次期望适应度评估的跨越瓶颈。相比之下，一个分析性的CPR-NSGA-II变体通过直接组装分布在各党派种群中的互补前缀和后缀模板，在$O(n\log n)$的期望评估中发现了共同的帕累托最优解。与扁平化的四目标形式F-JCG相比，我们的全前沿覆盖分析展示了扁平化所引入的额外覆盖负担。对于BPBOMST，即多方多目标最小生成树问题的双方、每方两个目标的专业化，我们开发了分层支持覆盖分析。对于每个共同的帕累托目标向量，对称平均投影引入了一个辅助的双目标最小生成树实例，合适的支持代表生成了一个$2\lambda$-共同近似覆盖，其中$\lambda\in[1,2]$。我们进一步推导了一个实例参数化的期望运行时间界限，适用于使用边联合重组和均匀修复的代表池CPR-NSGA-II变体。该界限区分了局部辅助前沿填充、跨党派重组捷径和边联合修复模糊性的影响。

cs.AI / 65 / 2605.17480

The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure

能力悖论：更智能的审计员如何使多智能体系统的安全性降低

Liu, Qiqi, Holz, Thorsten, Ye, Shilin, Song, Runhan

Abstract

Multi-agent systems extend large language models (LLMs) by decomposing tasks among specialized agents, but their distributed decision process creates new attack surfaces. We identify semantic hijacking, an attack in which harmful requests are concealed within domain-specific narratives and propagated to a Manager through Worker reports, without any syntactic injection primitives. Across 42,000 adversarial trials over 12 Manager models and 7 Worker configurations, we uncover a capability paradox: as Worker capability increases, the mean system-level Attack Success Rate (ASR) increases from 18.4% to 63.9%, peaking at 94.4%. To explain this effect, we conduct multi-level mediation analysis on two independent datasets (47,807 interactions). This analysis shows that this paradox is driven by linguistic certainty: stronger Workers are more likely to interpret adversarial narratives as legitimate, convey their conclusions assertively, and thereby lead Managers to treat such confident endorsements as justification to execute. In our larger Worker-Only setting ($n_W$=14), certainty mediates 74% of the effect, with 95% confidence intervals (CI) excluding zero under both Monte Carlo and cluster bootstrap; the smaller Full-MAS setting ($n_W$=6) shows a directionally consistent indirect effect. Worker-side safety prompting does not reliably mitigate this failure. Building on the mediation finding, we propose heterogeneous ensemble verification, which pairs Workers of asymmetric domain competence so their complementary vulnerabilities break the certainty-to-execution chain, reducing ASR from 52.8% to 2.0% with negligible benign-task impact. Our results show that upgrading components to stronger models can actively degrade system security, and that effective defenses require exploiting, rather than eliminating, capability asymmetries between agents.

Chinese Translation

多智能体系统通过将任务分解给专业代理，扩展了大型语言模型（LLMs），但其分布式决策过程创造了新的攻击面。我们识别出语义劫持，这是一种攻击方式，其中有害请求隐藏在特定领域的叙述中，并通过工作者报告传播给管理者，而没有任何语法注入原语。在对12个管理者模型和7个工作者配置进行的42,000次对抗性试验中，我们发现了能力悖论：随着工作者能力的提高，系统级攻击成功率（ASR）的平均值从18.4%增加到63.9%，并在94.4%时达到峰值。为了解释这一现象，我们对两个独立数据集（47,807次交互）进行了多层次中介分析。该分析表明，这一悖论是由语言确定性驱动的：更强的工作者更可能将对抗性叙述解读为合法，果断地传达他们的结论，从而导致管理者将这种自信的支持视为执行的理由。在我们更大的仅工作者设置中（$n_W$=14），确定性中介了74%的效果，95%的置信区间（CI）在蒙特卡洛和聚类自助法下均不包括零；较小的全MAS设置（$n_W$=6）显示出方向一致的间接效应。工作者端的安全提示并不能可靠地缓解这一失败。基于中介发现，我们提出了异质集成验证，该方法将具有不对称领域能力的工作者配对，以便其互补的脆弱性打破确定性到执行的链条，将ASR从52.8%降低到2.0%，对良性任务的影响微乎其微。我们的结果表明，升级组件到更强的模型可能会主动降低系统安全性，而有效的防御需要利用而非消除代理之间的能力不对称。

cs.AI / 66 / 2605.17503

RAG-based EEG-to-Text Translation Using Deep Learning and LLMs

基于RAG的脑电图（EEG）到文本翻译的深度学习与大型语言模型（LLMs）应用

Collautti, Enrico, Mao, Xiaopeng, Tonin, Luca, Tortora, Stefano, Puthusserypady, Sadasivan

Abstract

The decoding of linguistic information from electroencephalography (EEG) signals remains an extremely challenging problem in brain-computer interface (BCI) research. In particular, sentence-level decoding from EEG is difficult due to the low signal-to-noise ratio of these recordings. Previous studies tackling this problem have typically failed to surpass random baseline performance unless teacher forcing is used during the inference phase. In this work, we propose a retrieval-augmented generation (RAG)-based sentence-level EEG-to-text decoding pipeline that combines an EEG encoder aligned with semantic sentence embeddings, a vector retrieval stage, and a large language model (LLM) to refine retrieved sentences into coherent output. Experiments are conducted on the Zurich Cognitive Language Processing Corpus (ZuCo) dataset, which contains single-trial EEG recordings collected during silent reading. To evaluate whether the system extracts meaningful information from these EEG signals, the results are compared with a random baseline. In nine subjects, the proposed pipeline outperforms the random baseline, achieving a mean cosine similarity of 0.181 ± 0.022 compared to 0.139 ± 0.029 for the baseline, corresponding to a relative improvement of 30.45%. Statistical analysis further confirms that this improvement is significant, following a strict evaluation workflow where inference is performed without access to ground-truth labels.
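The headline metric above is cosine similarity between sentence embeddings, summarized as a relative improvement over the random baseline. Here is a minimal sketch of both quantities using the standard definitions; the helper names are illustrative and the embedding model used in the paper is not assumed. Recomputing from the rounded means gives roughly 30.2%, so the reported 30.45% was presumably computed from unrounded values.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_improvement(new, base):
    """Percentage gain of `new` over `base`."""
    return (new - base) / base * 100.0


# Rounded means taken from the abstract.
gain = relative_improvement(0.181, 0.139)
print(round(gain, 1))
```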

Chinese Translation

从脑电图（EEG）信号中解码语言信息仍然是脑机接口（BCI）研究中的一个极具挑战性的问题。特别是，由于这些记录的低信噪比，EEG的句子级解码变得困难。以往研究通常未能超越随机基线性能，除非在推理阶段使用教师强制（teacher forcing）。在本研究中，我们提出了一种基于检索增强生成（RAG）的句子级EEG到文本解码管道，该管道结合了与语义句子嵌入对齐的EEG编码器、向量检索阶段和大型语言模型（LLM），以将检索到的句子精炼为连贯的输出。实验在苏黎世认知语言处理语料库（ZuCo）数据集上进行，该数据集包含在默读过程中收集的单次试验EEG记录。为了评估系统是否从这些EEG信号中提取了有意义的信息，结果与随机基线进行了比较。在九名受试者中，所提出的管道超越了随机基线，平均余弦相似度达到0.181 ± 0.022，而基线为0.139 ± 0.029，相对改善达30.45%。统计分析进一步确认了这一改善是显著的，遵循严格的评估工作流程，在此过程中推理不接触真实标签。

cs.AI / 67 / 2605.17537

Self-supervised Hierarchical Visual Reasoning with World Model

自监督层次视觉推理与世界模型

Xu, Yuanfei, Liu, Lin, Zhou, Wengang, Feng, Mingxiao, Li, Houqiang

Abstract

3D open-world environments with adversarial opponents remain a core challenge for reinforcement learning due to their vast state spaces. Effective reasoning representations are essential in such settings. While existing self-supervised visual foresight reasoning approaches often suffer from multi-step error accumulation, many recent studies resort to injecting domain-specific knowledge for more stable guidance. Our key insight is that the photorealistic fidelity of visual reasoning representations is secondary; what truly matters is providing informative, task-relevant signals. To this end, we propose ResDreamer, a hierarchical world model in which each higher-level layer is trained to reconstruct the residuals of the layer below. This design enables progressive abstraction of increasingly sophisticated world dynamics and fosters the emergence of richer latent representations. Drawing inspiration from the "Bitter Lesson", ResDreamer trains its reasoning representations in a purely self-supervised manner. The higher-level residual representations are used to modulate lower-level predictions, allowing the world model to scale effectively with only linearly increasing cross-layer communication costs. Experiments show that ResDreamer achieves state-of-the-art sample efficiency and parameter efficiency. This scalable hierarchical visual foresight reasoning architecture paves the way for more capable online RL agents in open-ended, dynamic environments. The code is accessible at https://github.com/XuYuanFei01/ResDreamer.

Chinese Translation

由于其庞大的状态空间，具有对抗性对手的3D开放世界环境仍然是强化学习的核心挑战。在这种环境中，有效的推理表示至关重要。尽管现有的自监督视觉前瞻推理方法往往遭受多步错误累积的困扰，但许多近期研究通过注入领域特定知识以获得更稳定的指导。我们的关键见解是，视觉推理表示的照片级真实感是次要的；真正重要的是提供信息丰富、与任务相关的信号。为此，我们提出了ResDreamer，这是一种层次化的世界模型，其中每个高层次的层被训练以重构下层的残差。这一设计使得对日益复杂的世界动态进行渐进抽象成为可能，并促进了更丰富的潜在表示的出现。受到“苦涩教训”的启发，ResDreamer以完全自监督的方式训练其推理表示。高层次的残差表示用于调节低层次的预测，使得世界模型能够有效扩展，同时仅需线性增加跨层通信成本。实验表明，ResDreamer在样本效率和参数效率方面达到了最先进的水平。这种可扩展的层次视觉前瞻推理架构为在开放式动态环境中更强大的在线强化学习代理铺平了道路。代码可在 https://github.com/XuYuanFei01/ResDreamer 获取。

cs.AI / 68 / 2605.17539

Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis

基于记忆引导的树搜索与跨分支知识转移用于大语言模型求解器合成

Haji, Fatemeh, Quiros, Javier Delarosa, Najafirad, Peyman

Abstract

Combinatorial optimization (CO) underlies decision-making from logistics to chip design, where infeasible solutions are operationally unusable and small quality gains translate into substantial economic value. Recent work uses large language models (LLMs) to automate solver synthesis: generating executable solver programs from natural-language specifications. However, existing tree-search and evolutionary agents refine candidate trajectories in parallel without explicit knowledge transfer, reintroducing the same constraint violations and converging on similar algorithm families. We introduce MEMOIR, a memory-guided tree-search framework with a two-level memory hierarchy: branch-local memory preserves execution-grounded refinement details within a branch as it iterates on a single algorithmic design, while global memory stores compressed algorithmic and failure-mode summaries across branches. A reflection step at branch termination distills these summaries, enabling cross-branch transfer without polluting future contexts with low-level debugging traces. Across seven CO problems spanning scheduling, routing, packing, and geometric design, MEMOIR achieves 96.7% solution validity (a 9.2 point gap over the strongest baseline) and improves the average normalized score by 7.3 points at matched per-method execution budget. Over three independent runs on four problems, MEMOIR's run-to-run validity standard deviation is more than an order of magnitude below that of every baseline we evaluated in this setting, suggesting that memory-guided exploration yields consistent improvements rather than reflecting sampling variance.
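The two-level memory hierarchy described above can be pictured as two small data structures with a reflection step between them. This is a rough sketch under stated assumptions: the class and field names are hypothetical, and the "distillation" here is a trivial placeholder (keep the distinct failure kinds) standing in for whatever LLM-based reflection MEMOIR actually performs.

```python
from dataclasses import dataclass, field


@dataclass
class BranchMemory:
    """Branch-local memory: detailed, execution-grounded notes for one design."""
    algorithm: str
    traces: list = field(default_factory=list)  # (failure_kind, detail) pairs

    def record(self, failure_kind: str, detail: str) -> None:
        self.traces.append((failure_kind, detail))


@dataclass
class GlobalMemory:
    """Global memory: compressed summaries shared across branches."""
    summaries: list = field(default_factory=list)

    def reflect(self, branch: BranchMemory) -> dict:
        # Placeholder distillation: keep the algorithm name and the distinct
        # failure kinds; drop low-level details so later contexts stay clean.
        summary = {
            "algorithm": branch.algorithm,
            "failure_modes": sorted({kind for kind, _ in branch.traces}),
        }
        self.summaries.append(summary)
        return summary

    def context_for_new_branch(self) -> str:
        return "\n".join(
            f"{s['algorithm']}: avoid {', '.join(s['failure_modes']) or 'n/a'}"
            for s in self.summaries
        )


branch = BranchMemory("greedy insertion")
branch.record("capacity_violation", "truck 3 overloaded at step 41")
branch.record("capacity_violation", "truck 1 overloaded at step 7")
branch.record("timeout", "exceeded the time limit on instance 12")

memory = GlobalMemory()
memory.reflect(branch)
print(memory.context_for_new_branch())
```

The point of the split is visible in the output: a new branch sees which failure modes to avoid, but none of the step-level debugging detail.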

Chinese Translation

组合优化（CO）是从物流到芯片设计的决策基础，其中不可行的解决方案在操作上是不可用的，而小的质量提升则会转化为可观的经济价值。最近的研究利用大语言模型（LLMs）来自动化求解器合成：从自然语言规范生成可执行的求解器程序。然而，现有的树搜索和进化代理在没有明确知识转移的情况下并行优化候选轨迹，重新引入相同的约束违规，并趋向于相似的算法家族。我们提出了MEMOIR，一个基于记忆引导的树搜索框架，具有两级记忆层次：分支局部记忆在单一算法设计上迭代时保留执行基础的细化细节，而全局记忆则存储跨分支的压缩算法和失败模式摘要。在分支终止时的反思步骤提炼这些摘要，使得跨分支转移成为可能，而不会用低级调试痕迹污染未来的上下文。在涵盖调度、路由、打包和几何设计的七个CO问题中，MEMOIR实现了96.7%的解决方案有效性（比最强基线高出9.2个百分点），并在匹配的每种方法执行预算下提高了平均标准化得分7.3点。在四个问题的三次独立运行中，MEMOIR的运行间有效性标准差比我们在此环境中评估的每个基线低一个数量级以上，这表明基于记忆引导的探索带来了持续的改进，而不是反映采样方差。

cs.AI / 69 / 2605.17554

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

评估深度研究代理在专家咨询工作中的表现：一个包含验证者、评分标准和认知陷阱的基准测试

Asthana, Tanmay, Saksena, Aman, Sahu, Divyansh

Abstract

Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They are being deployed in enterprise workflows faster than they are being evaluated. Existing benchmarks measure factual recall, single-hop QA, or generic agentic skill, missing the multi-document, decision-grade work DRAs are deployed to produce. We introduce a benchmark targeting the structured analytical deliverables that fill a management consultant's typical week. We grade three frontier agents, namely Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research, on 42 SME-authored prompts. Each of the 126 responses is scored on two layers: deterministic ground-truth verifiers (mean 13.8 per task) and a five-criterion 0-3 SME rubric, composed into a Verifier-Rubric Score (VRS) on 0-100. Most prompts embed cognitive traps that penalize surface-pattern matching. Acceptance under our joint threshold (rubric mean >= 2.5 and verifier rate >= 80%) is uniformly low: Gemini 21.4%, o3 9.5%, Claude 9.5%. Mean VRS scores agree with published rubric-based benchmarks (our top 62.6 vs. APEX-v1 64.2, ProfBench 65.9, ResearchRubrics < 68%), validating the rubric construct. ACCEPT rates sit below APEX-Agents' MC-segment Pass@1 band (12.3-22.7%) on dedicated DR agents; our floor is three points lower despite the harness advantage, opened by stricter conjunctive grading and trap design. Each agent fails distinctively. Claude produces the deliverable most reliably (4.5x the others' rate on file-required tasks) but carries the highest fabrication signature. o3 has the cleanest reasoning average yet drops required sections and propagates arithmetic errors. Gemini is bimodal, with the highest acceptance rate alongside the most zero-scored rubric cells.
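The joint acceptance threshold above (rubric mean >= 2.5 and verifier rate >= 80%) is a conjunctive rule, so a response must clear both layers. A minimal sketch of that rule follows; the function name and sample scores are hypothetical, and the formula that composes the two layers into the 0-100 VRS is not given in the abstract, so it is not reproduced here. The acceptance counts of 9 and 4 out of 42 are inferred from the reported percentages, not stated in the source.

```python
def is_accepted(rubric_scores, verifiers_passed, verifiers_total):
    """Conjunctive acceptance: both the rubric and verifier thresholds must hold."""
    rubric_mean = sum(rubric_scores) / len(rubric_scores)  # five criteria, 0-3 each
    verifier_rate = verifiers_passed / verifiers_total
    return rubric_mean >= 2.5 and verifier_rate >= 0.80


# Strong rubric scores cannot compensate for failed verifiers, and vice versa.
print(is_accepted([3, 3, 2, 3, 2], 12, 14))  # rubric 2.6, verifiers ~86%
print(is_accepted([3, 3, 3, 3, 3], 10, 14))  # rubric 3.0, verifiers ~71%

# 42 prompts x 3 agents gives the 126 graded responses.
total_responses = 42 * 3
# Inferred counts consistent with the reported acceptance rates.
gemini_rate = round(9 / 42 * 100, 1)
o3_rate = round(4 / 42 * 100, 1)
print(total_responses, gemini_rate, o3_rate)
```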

Chinese Translation

前沿深度研究代理（DRA）能够规划研究任务、跨文档综合信息，并按需返回结构化的交付成果。它们在企业工作流程中的部署速度快于评估的速度。现有基准主要测量事实回忆、单跳问答或通用代理技能，未能涵盖DRA被部署以产生的多文档、决策级工作。我们引入了一个基准，针对管理顾问典型工作周中所需的结构化分析交付成果。我们对三种前沿代理进行评分，分别是带有网络搜索的Claude Opus 4.6、OpenAI o3-deep-research和Google Gemini 3.1 Pro深度研究，基于42个专家小组（SME）撰写的提示。每个126个响应在两个层面上进行评分：确定性真相验证者（每个任务平均13.8个）和一个包含五个标准的0-3专家评分标准，合成一个0-100的验证者-评分标准分数（VRS）。大多数提示嵌入了惩罚表面模式匹配的认知陷阱。在我们的联合阈值下（评分标准平均 >= 2.5 且验证者通过率 >= 80%）的接受率普遍较低：Gemini 21.4%，o3 9.5%，Claude 9.5%。平均VRS分数与已发布的基于评分标准的基准一致（我们的最高分62.6与APEX-v1的64.2、ProfBench的65.9、ResearchRubrics的< 68%相比），验证了评分标准的构建。接受率低于APEX-Agents的MC-segment Pass@1区间（12.3-22.7%）针对专用DRA的表现；尽管由于更严格的联结评分和陷阱设计，我们的底线低了三分。每个代理的失败特征各不相同。Claude在交付成果的可靠性上表现最佳（在需要文件的任务中是其他代理的4.5倍），但其虚构签名最高。o3的推理平均最干净，但遗漏了必要部分并传播算术错误。Gemini则呈双峰特征，具有最高的接受率，同时也有最多的零分评分标准单元。

cs.AI / 70 / 2605.17565

Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

泛化还是记忆？针对国际象棋训练语言模型的脆弱性测试

Tang, Ethan

Abstract

Recent work has fine-tuned language models on chess data and reported high benchmark scores as evidence that the resulting models can understand the rules of chess, play full chess games at a professional level, or generate human-readable explanations grounded in expert knowledge. We train KinGPT, a 25M-parameter character-level language model trained only on (position, best-move) pairs, which outperforms the 3B-parameter ChessGPT on a 600-puzzle mate-in-N suite and the 4B-parameter C1-4B on a 20-theme puzzle benchmark. We examine several claims made in existing literature regarding chess-trained language models and assert that their impressive benchmark performance is largely explained by pattern-matching. We also demonstrate how LLM-Modulo, a verifier-in-the-loop framework, raises RedPajama 3B's best move accuracy from 1.2% to 21.2% and move generation validity from 19.3% to 95.3% on mate-in-N chess puzzles, comparable to gains achieved from ChessGPT's fine-tuning on chess-specific web corpora at a fraction of the cost. Our results illustrate how pairing a general LLM with an external verifier offers a more flexible alternative to directly training on synthetic data for well-defined domains. We open source all training/evaluation code, datasets, puzzle samples, and KinGPT model checkpoints for reproducibility.
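A verifier-in-the-loop setup of the kind described above can be sketched as a propose, check, and retry loop in which the verifier's critique is fed back to the generator. Everything below is illustrative: the model is a scripted stub, the verifier only checks membership in a hard-coded set of legal moves, and none of the names come from the LLM-Modulo code.

```python
def verify_in_the_loop(propose, verify, max_rounds=5):
    """Ask for candidates until the verifier accepts one or the budget runs out."""
    feedback = []
    for _ in range(max_rounds):
        candidate = propose(feedback)
        ok, message = verify(candidate)
        if ok:
            return candidate, feedback
        feedback.append(f"{candidate}: {message}")  # critique goes back to the model
    return None, feedback


# Toy stand-ins (assumptions): a fixed legal-move set and a scripted "model".
LEGAL_MOVES = {"Qh7#", "Nf3", "Kg1"}
SCRIPTED_GUESSES = ["Qh9#", "Ke9", "Qh7#"]


def stub_model(feedback):
    # Pretend each critique nudges the model to its next guess.
    return SCRIPTED_GUESSES[min(len(feedback), len(SCRIPTED_GUESSES) - 1)]


def legality_verifier(move):
    if move in LEGAL_MOVES:
        return True, "ok"
    return False, "illegal move in this position"


move, critiques = verify_in_the_loop(stub_model, legality_verifier)
print(move, len(critiques))
```

Because the verifier is external and exact, invalid outputs are filtered regardless of how the generator was trained, which is the flexibility the abstract contrasts with domain-specific fine-tuning.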

Chinese Translation

近期的研究对语言模型进行了国际象棋数据的微调，并报告了高基准分数，作为证据表明所得到的模型能够理解国际象棋的规则，以专业水平进行完整的国际象棋对局，或生成基于专家知识的人类可读解释。我们训练了KinGPT，一个仅基于（位置，最佳走法）对的25M参数字符级语言模型，其在600个难题的将死（mate-in-N）测试中超越了3B参数的ChessGPT，并在20主题难题基准测试中超越了4B参数的C1-4B。我们考察了现有文献中关于国际象棋训练语言模型的若干主张，并断言它们令人印象深刻的基准表现主要可以通过模式匹配来解释。我们还展示了如何通过LLM-Modulo，一个验证器在环框架，将RedPajama 3B的最佳走法准确率从1.2%提升至21.2%，将走法生成有效性从19.3%提升至95.3%，在将死国际象棋难题上，这些提升与ChessGPT在特定于国际象棋的网络语料库上的微调所取得的提升相当，但成本却低得多。我们的结果表明，将通用语言模型与外部验证器配对，为在定义明确的领域中直接在合成数据上训练提供了更灵活的替代方案。我们开源了所有训练/评估代码、数据集、难题样本和KinGPT模型检查点，以便于复现。

cs.AI / 71 / 2605.17580

ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation

ECG-WM：一种基于生理学的心电图世界模型用于临床干预模拟

Chen, Zhikang, Wang, Yue, Cui, Sen, Zhang, Yu, Zhang, Changshui, Ren, Tianling, Zhu, Tingting

Abstract

Electrocardiogram (ECG)-based models have achieved strong performance in diagnostic tasks, yet they remain limited in modeling how cardiac dynamics evolve under external interventions. In particular, existing approaches focus primarily on static prediction and lack mechanisms to capture ECG variations under different pharmacological conditions. In this work, we propose an ECG World Model for action-conditioned predictive simulation of cardiac electrophysiology. Moving beyond disjoint pipelines, our framework features a principled integration of physiological ordinary differential equation (ODE) priors into latent diffusion dynamics via energy regularization. This structural constraint enables the synthesis of physiologically plausible post-intervention ECG trajectories while effectively mitigating generative hallucinations. Building on this simulation process, we introduce an uncertainty-aware evaluation strategy that leverages the stochasticity of diffusion sampling to characterize both the expected clinical risk and its variability, allowing a more reliable comparative assessment of candidate interventions. We evaluate our method across diverse settings, including controlled drug-response scenarios and real-world clinical records. Beyond standard waveform metrics, experimental results demonstrate improved risk calibration and strong alignment with expert-informed treatment preferences. These results establish our approach as a robust foundation for safe and intervention-aware clinical decision support.

Chinese Translation

基于心电图（ECG）的模型在诊断任务中表现出色，但在模拟心脏动态如何在外部干预下演变方面仍然有限。现有方法主要集中于静态预测，缺乏捕捉不同药理条件下ECG变化的机制。在本研究中，我们提出了一种用于基于动作条件的心脏电生理预测模拟的ECG世界模型。我们的框架超越了离散的处理流程，原则性地将生理学常微分方程（ODE）先验整合到潜在扩散动态中，通过能量正则化实现。这一结构约束使得能够合成生理上合理的干预后ECG轨迹，同时有效减轻生成幻觉。基于这一模拟过程，我们引入了一种关注不确定性的评估策略，利用扩散采样的随机性来表征预期的临床风险及其变异性，从而允许对候选干预措施进行更可靠的比较评估。我们在多种设置中评估了我们的方法，包括受控药物反应场景和真实世界临床记录。除了标准波形指标外，实验结果显示出风险校准的改善，并与专家指导的治疗偏好高度一致。这些结果确立了我们的方法作为安全且关注干预的临床决策支持的坚实基础。

cs.AI / 72 / 2605.17596

NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents

NeuSymMS：一种用于持久自我策展大型语言模型代理的混合神经符号记忆系统

Sultan, Mujahid, Thuraisamy, Sri, Rajaratnam, Daya

Abstract

We present NeuSymMS, an adaptive memory system that enables large language model (LLM) agents to learn, remember, and reason about users across sessions via a hybrid neuro-symbolic architecture. NeuSymMS couples neural fact extraction from unstructured dialogue with a CLIPS-based expert system that classifies, deduplicates, and reconciles facts under explicit lifecycle rules. The system represents knowledge as subject-relation-value triples stored in a relational database management system, supports user/agents/agent-to-agents scoping, and implements a dual-horizon short-term/long-term memory model with access-based promotion and time-based pruning. NeuSymMS maintains continuity of memory while avoiding context-window bloat and cross-entity contamination. We argue that this architecture offers a practical path to trustworthy, auditable memory for production agentic systems and discuss its novelty relative to log retrieval, summarization, and key-value approaches.
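The dual-horizon model above (triples in a relational store, access-based promotion, time-based pruning) can be illustrated with an in-memory SQLite table. This is a simplified sketch, not NeuSymMS itself: the schema, the thresholds, and the function names are assumptions, and the CLIPS rule engine is replaced by a plain "newest value wins" rule per (subject, relation).

```python
import sqlite3

PROMOTE_AFTER = 3       # hypothetical: accesses needed for long-term promotion
SHORT_TERM_TTL = 3600   # hypothetical: seconds before an unpromoted fact is pruned


def open_store():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE facts (subject TEXT, relation TEXT, value TEXT, "
        "horizon TEXT, hits INTEGER, created REAL, "
        "PRIMARY KEY (subject, relation))"
    )
    return db


def remember(db, subject, relation, value, now):
    # Naive reconciliation: a newer fact replaces the older one for the same key.
    db.execute(
        "INSERT OR REPLACE INTO facts VALUES (?, ?, ?, 'short', 0, ?)",
        (subject, relation, value, now),
    )


def recall(db, subject, relation):
    row = db.execute(
        "SELECT value, hits FROM facts WHERE subject = ? AND relation = ?",
        (subject, relation),
    ).fetchone()
    if row is None:
        return None
    value, hits = row
    hits += 1
    # Access-based promotion: frequently used facts move to the long horizon.
    db.execute(
        "UPDATE facts SET hits = ?, "
        "horizon = CASE WHEN ? >= ? THEN 'long' ELSE horizon END "
        "WHERE subject = ? AND relation = ?",
        (hits, hits, PROMOTE_AFTER, subject, relation),
    )
    return value


def prune(db, now):
    # Time-based pruning applies only to facts still in the short horizon.
    db.execute(
        "DELETE FROM facts WHERE horizon = 'short' AND ? - created > ?",
        (now, SHORT_TERM_TTL),
    )


db = open_store()
remember(db, "alice", "prefers", "dark mode", now=0.0)
remember(db, "alice", "timezone", "UTC", now=0.0)
for _ in range(3):
    recall(db, "alice", "prefers")
prune(db, now=7200.0)
print(recall(db, "alice", "prefers"), recall(db, "alice", "timezone"))
```

After pruning, the frequently recalled preference survives because it was promoted, while the untouched fact has aged out of short-term memory.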

Chinese Translation

我们提出了NeuSymMS，这是一种自适应记忆系统，使大型语言模型（LLM）代理能够通过混合神经符号架构在会话之间学习、记忆和推理用户。NeuSymMS将来自非结构化对话的神经事实提取与基于CLIPS的专家系统相结合，该系统在明确的生命周期规则下对事实进行分类、去重和调和。该系统将知识表示为存储在关系数据库管理系统中的主题-关系-值三元组，支持用户/代理/代理之间的范围界定，并实现基于访问的提升和基于时间的修剪的双视野短期/长期记忆模型。NeuSymMS在避免上下文窗口膨胀和跨实体污染的同时，保持了记忆的连续性。我们认为，这种架构为生产代理系统提供了一条通向可信、可审计记忆的实用路径，并讨论了其相对于日志检索、摘要和键值方法的新颖性。

cs.AI / 73 / 2605.17602

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

AutoRubric-T2I：用于文本到图像对齐的稳健规则基础奖励模型

Kao, Kuei-Chun, Huo, Daixuan, Ban, Yuanhao, Hsieh, Cho-Jui

Abstract

Aligning Text-to-Image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images according to prompt alignment and perceptual quality. Existing reward models are commonly trained as Bradley-Terry (BT) preference models on large-scale human preference corpora, making them costly to train, difficult to adapt, and opaque in their evaluation criteria. Meanwhile, Vision-Language Model (VLM) judges can provide more fine-grained assessments through textual rubrics, but their manually designed or heuristically generated scoring rules may fail to reliably reflect human preferences. In this paper, we propose AutoRubric-T2I, the first rubric learning framework in T2I that automatically synthesizes and selects explicit rubrics for guiding VLM judges. AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, then uses a VLM judge to score paired images under each rubric, producing pairwise rubric-score differences for preference learning. To remove noisy and redundant rules, we further employ an $\ell_1$-regularized logistic regression refiner, which selects the Top-$N$ most discriminative rubrics. Extensive evaluations show that AutoRubric-T2I produces high-quality, interpretable reward signals using less than 0.01% of the annotated preference data, substantially reducing the need for large-scale reward-model training. On image reward benchmarks such as MMRB2, AutoRubric-T2I outperforms strong reward model baselines. We further validate AutoRubric-T2I as an RL reward on downstream T2I tasks, including TIIF and UniGenBench++, where it improves generation quality over scalar reward models using the Flow-GRPO pipeline on diffusion models.
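The rubric-selection step above can be illustrated with a small $\ell_1$-regularized logistic regression fitted on pairwise rubric-score differences. This is a sketch under assumptions: it uses plain proximal gradient descent (ISTA) in pure Python, the toy data and hyperparameters are invented, and the paper's actual refiner and solver may differ. Each row holds one preference pair's score differences (image A minus image B) per rubric, and the label is 1 when humans preferred A.

```python
import math


def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink toward zero, clip small values to zero."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def l1_logistic(X, y, lam=0.1, lr=0.5, steps=500):
    """Fit logistic regression with an L1 penalty via proximal gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w


def top_n_rubrics(w, n):
    """Indices of the n largest-magnitude nonzero weights."""
    ranked = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    return [j for j in ranked[:n] if w[j] != 0.0]


# Toy data: rubric 0 tracks the human preference, rubric 1 is noise.
X = [[1, 0.2], [-1, 0.1], [1, -0.3], [-1, -0.2], [1, 0.1], [-1, 0.3]]
y = [1, 0, 1, 0, 1, 0]
weights = l1_logistic(X, y)
print(top_n_rubrics(weights, n=2))
```

The L1 penalty drives the noisy rubric's weight to exactly zero, so asking for the top two rubrics returns only the informative one; that sparsity is what makes the selected rubric set small and interpretable.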

Chinese Translation

文本到图像（T2I）生成模型与人类偏好的对齐日益依赖于图像奖励模型，这些模型根据提示对齐和感知质量对生成的图像进行评分或排名。现有的奖励模型通常作为Bradley-Terry（BT）偏好模型在大规模人类偏好语料库上进行训练，这使得它们的训练成本高、适应性差，并且评估标准不透明。同时，视觉-语言模型（VLM）评判者可以通过文本评分标准提供更细致的评估，但其手动设计或启发式生成的评分规则可能无法可靠地反映人类偏好。在本文中，我们提出了AutoRubric-T2I，这是T2I领域中第一个自动合成和选择显式评分标准以指导VLM评判者的评分学习框架。AutoRubric-T2I首先将偏好对的推理轨迹合成候选评分标准，然后使用VLM评判者在每个评分标准下对成对图像进行评分，生成用于偏好学习的成对评分差异。为了去除噪声和冗余规则，我们进一步采用了$\ell_1$正则化逻辑回归精炼器，选择出Top-$N$最具辨别力的评分标准。广泛的评估表明，AutoRubric-T2I使用不到0.01%的标注偏好数据生成高质量、可解释的奖励信号，显著减少了对大规模奖励模型训练的需求。在MMRB2等图像奖励基准上，AutoRubric-T2I超越了强大的奖励模型基线。我们进一步验证了AutoRubric-T2I作为下游T2I任务的强化学习奖励，包括TIIF和UniGenBench++，在这些任务中，它通过Flow-GRPO管道在扩散模型上改善了生成质量，相较于标量奖励模型表现更佳。
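The refiner step described in the AutoRubric-T2I abstract (an $\ell_1$-regularized logistic regression over pairwise rubric-score differences, followed by Top-$N$ selection) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the proximal-gradient solver, and the hyperparameters are not taken from the paper, which may use a different optimizer or selection rule.

```python
import math

def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_l1_logistic(diffs, labels, l1=0.05, lr=0.5, steps=500):
    """Proximal gradient descent for L1-regularized logistic regression.

    diffs[i][j]: judge score of image A minus image B under rubric j (assumed input layout).
    labels[i]:   1 if humans preferred image A in pair i, else 0.
    """
    n, d = len(diffs), len(diffs[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in zip(diffs, labels):
            z = sum(wj * xj for wj, xj in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad[j] += (p - y) * x[j] / n
        # Gradient step on the logistic loss, then shrinkage for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad)]
    return w

def top_n_rubrics(weights, n):
    """Indices of up to n rubrics with the largest nonzero absolute weight."""
    ranked = sorted(range(len(weights)), key=lambda j: -abs(weights[j]))
    return [j for j in ranked[:n] if weights[j] != 0.0]

# Toy data: rubric 0 tracks the preference, rubric 1 is noise, rubric 2 never differs.
toy_diffs = [[2.0, 0.1, 0.0], [1.5, -0.2, 0.0], [-2.0, 0.1, 0.0],
             [-1.0, -0.1, 0.0], [1.0, 0.2, 0.0], [-1.5, -0.2, 0.0]]
toy_labels = [1, 1, 0, 0, 1, 0]
toy_weights = fit_l1_logistic(toy_diffs, toy_labels)
```

The L1 penalty is what drives uninformative rubric weights to exactly zero, so the surviving rubrics double as a readable explanation of the reward signal.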

cs.AI / 74 / 2605.17617

GraphMind: From Operational Traces to Self-Evolving Workflow Automation

GraphMind：从操作轨迹到自我演化的工作流自动化

Zhu, Yiwen, Cahoon, Joyce, Pavlenko, Anna, Bai, Qiushi, Shahbazi, Nima, Vermareddy, Divya, Wang, Meina, Demarne, Mathieu, Bararia, Swati, Wang, Wenjing, Kumar, Hemkesh Vijaya, Lerner, Hannah, Lin, Katherine, Toscano, Steve, Cilimdzic, Miso, Krishnan, Subru

Abstract

Complex operational workflows coordinating personnel, tools, and information are central to enterprise operations, yet end-to-end automation remains challenging due to extensive requirements for human inputs and the inability to adapt over time. We present GraphMind, an end-to-end system that constructs, executes, and evolves action-centric workflow graphs without human effort. The system operates in three phases. First, a scalable offline pipeline extracts structured workflow graphs from large volumes of human resolution traces, capturing problems, actions, and their causal relationships. Second, an online multi-agent traversal engine navigates the graph to dynamically construct and execute workflows, combining graph-guided retrieval with LLM-driven reasoning at each step. Third, Adaptive Traversal Reinforcement (ATR) reinforces successful traversal paths and decays stale elements. This closed-loop mechanism enables the graph to self-optimize and adapt to shifting operational conditions. GraphMind has been deployed across four production cloud database services for incident investigation. Evaluated on production data, the system substantially outperforms a Trace-RAG baseline in mitigation reach, groundedness, and diagnostic throughput, scoring 4.95/5 in blind expert review. The ATR layer provides further gains across all metrics, demonstrating that workflow graphs can learn and improve from execution-derived feedback.

Chinese Translation

复杂的操作工作流协调人员、工具和信息，是企业运营的核心。然而，由于对人力输入的广泛需求以及无法随时间适应的限制，端到端的自动化仍然面临挑战。我们提出了GraphMind，一个无需人工干预即可构建、执行和演化以行动为中心的工作流图的端到端系统。该系统分为三个阶段。首先，一个可扩展的离线管道从大量人类决策轨迹中提取结构化的工作流图，捕捉问题、行动及其因果关系。其次，一个在线多智能体遍历引擎在图中导航，动态构建和执行工作流，在每一步结合图引导的检索与基于大型语言模型（LLM）的推理。第三，适应性遍历强化（Adaptive Traversal Reinforcement, ATR）强化成功的遍历路径并衰减过时的元素。这个闭环机制使得图能够自我优化并适应不断变化的操作条件。GraphMind已在四个生产云数据库服务中部署，用于事件调查。在生产数据上的评估表明，该系统在缓解覆盖率、基础性和诊断吞吐量方面显著优于Trace-RAG基线，在盲评专家审查中得分为4.95/5。ATR层在所有指标上提供了进一步的提升，证明工作流图可以从执行反馈中学习和改进。
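The GraphMind abstract says that Adaptive Traversal Reinforcement (ATR) reinforces successful traversal paths and decays stale elements, without giving the update rule. The sketch below shows one simple way such a rule could look. The multiplicative decay, additive boost, pruning floor, and all names are hypothetical choices, not the paper's mechanism.

```python
def update_edge_weights(weights, traversed_path, succeeded,
                        boost=0.2, decay=0.95, floor=0.05):
    """Return updated edge weights after one workflow execution.

    weights:        dict mapping (source, target) edges to a positive weight.
    traversed_path: list of node names visited in order.
    succeeded:      whether the executed workflow resolved the incident.
    All edges decay; edges on a successful path get a boost; edges that
    fall below the floor are pruned as stale. (Illustrative rule only.)
    """
    path_edges = set(zip(traversed_path, traversed_path[1:]))
    updated = {}
    for edge, weight in weights.items():
        weight *= decay
        if succeeded and edge in path_edges:
            weight += boost
        if weight >= floor:
            updated[edge] = weight
    return updated

graph = {("alert", "check_cpu"): 1.0,
         ("check_cpu", "restart"): 0.5,
         ("alert", "page_oncall"): 0.05}
graph_after = update_edge_weights(graph, ["alert", "check_cpu", "restart"], succeeded=True)
```

Decaying every edge on every update is what lets unused branches fade even when nothing explicitly marks them as wrong.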

cs.AI / 75 / 2605.17618

Prediction of Challenging Behaviors Associated with Profound Autism in a Classroom Setting Using Wearable Sensors

使用可穿戴传感器预测课堂环境中与重度自闭症相关的挑战性行为

Kartha, Yadhu, Anderson, Conor, Foster, Jenny, Hamlin, Theresa, Lantz, Johanna, Lay, Ryan, Hahn, Juergen, Clifford, Gari D., Kwon, Hyeokhyen

Abstract

Autism Spectrum Disorder (ASD) is characterized by challenges with social interaction and communication and by restricted or repetitive patterns of thought and behavior, with significant variability in presentation. Approximately a quarter of children with ASD are classified as having profound autism; these children often exhibit challenging behaviors, such as self-injurious behavior, aggression, elopement, or pica, that pose serious safety risks and disrupt learning in educational settings. Prior work has applied wearable sensors and machine learning to detect challenging behaviors, but has been largely confined to controlled laboratory environments. This work demonstrates that predicting challenging behavior episodes is feasible in a real-world special education classroom. We collected approximately 110.7 hours of labeled multimodal wearable data comprising accelerometry, electrodermal activity (EDA), and skin temperature from 9 children and young adults aged 10 to 21 years across standard classroom sessions. We fine-tuned state-of-the-art foundation models for multimodal wearable time-series analysis and show that challenging behavior episodes can be predicted up to 10 minutes in advance with an AUC-ROC of 0.78. These results establish a concrete foundation for developing proactive in-class intervention systems that enable teachers to minimize the safety risks of challenging behaviors in special education classrooms.

Chinese Translation

自闭症谱系障碍（ASD）的特征是社交互动和沟通方面的挑战，以及思想和行为的限制或重复模式，表现出显著的多样性。大约四分之一的自闭症儿童被归类为重度自闭症，他们常常表现出挑战性行为，如自伤行为、攻击性行为、逃跑或异食癖，这些行为对安全构成严重风险，并干扰教育环境中的学习。之前的研究应用可穿戴传感器和机器学习来检测挑战性行为，但主要局限于受控实验室环境。本研究表明，在真实的特殊教育课堂中预测挑战性行为事件是可行的。我们收集了来自9名年龄在10至21岁之间的儿童和年轻人的约110.7小时标记的多模态可穿戴数据，包括加速度计、皮肤电反应（EDA）和皮肤温度，数据来自标准课堂课程。我们对多模态可穿戴时间序列分析的最先进基础模型进行了微调，并展示了挑战性行为事件可以提前最多10分钟预测，AUC-ROC达到0.78。这些结果为开发主动课堂干预系统奠定了坚实的基础，使教师能够最大限度地减少特殊教育课堂中挑战性行为的安全风险。

cs.AI / 76 / 2605.17625

Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

用于长时间科学代理的情节-语义记忆架构

Milosevic, Nikola

Abstract

As Large Language Models (LLMs) evolve into persistent scientific collaborators, context window saturation has emerged as a critical bottleneck. Scientific workflows involving iterative data analysis and hypothesis refinement rapidly saturate even extended contexts with dense technical content, while monolithic approaches suffer from quadratic cost scaling and cognitive degradation. We evaluate a Dual Process Memory Architecture that decouples immediate episodic needs (constant 10-message window) from long-term consolidated knowledge (growing at approximately 3 tokens/message). Unlike prior social agent memory systems, our domain-specific consolidation addresses contradictory parameter evolution, multi-hop reasoning across experimental phases, and precise technical fact retention. Through large-scale evaluation spanning 15,000 messages with cross-model validation across six LLMs from three families (OpenAI, Anthropic, Google), totaling 1,440 queries, we establish three key findings. First, while full-context models fail at 10,000 messages due to context overflow, our system maintains 70-85% accuracy with 1-2 second latency using 62% fewer tokens (45,434 vs 120,000+ limit). Second, cross-model validation reveals architecture-level trade-offs independent of specific LLMs: Dual Process excels at numeric/temporal queries (65-90% accuracy) while RAG excels at historical retrieval (60-85%), suggesting complementary deployment strategies. Third, we identify a "Sim-to-Real" gap where synthetic tests maintain constant memory but realistic workflows exhibit linear growth (about 3 tokens/message), with consolidation quality emerging as the primary scalability bottleneck. The architecture successfully manages profiles with 14,000+ scientific facts (125k tokens), demonstrating that domain-specific memory consolidation enables sustained operation beyond full-context limits.

Chinese Translation

随着大型语言模型（LLMs）逐渐演变为持久的科学合作伙伴，上下文窗口饱和已成为一个关键瓶颈。涉及迭代数据分析和假设精炼的科学工作流程迅速使得即使是扩展的上下文也饱和于密集的技术内容，而单一的方法则面临二次成本扩展和认知退化的问题。我们评估了一种双过程记忆架构，该架构将即时情节需求（恒定的10条消息窗口）与长期整合知识（以每条消息约3个标记的速度增长）解耦。与先前的社交代理记忆系统不同，我们的领域特定整合解决了矛盾参数演变、跨实验阶段的多跳推理以及精确的技术事实保留。通过对15,000条消息的大规模评估，以及对来自三个家族（OpenAI、Anthropic、Google）的六个LLMs的跨模型验证，总计1,440个查询，我们建立了三个关键发现。首先，尽管全上下文模型在10,000条消息时因上下文溢出而失败，但我们的系统在使用62%更少的标记（45,434 vs 120,000+限制）时，仍能以1-2秒的延迟保持70-85%的准确率。其次，跨模型验证揭示了独立于特定LLMs的架构级权衡：双过程在数值/时间查询上表现优异（65-90%的准确率），而RAG在历史检索上表现突出（60-85%），这表明互补的部署策略。第三，我们识别出“模拟到现实”的差距，其中合成测试保持恒定的记忆，但现实工作流程表现出线性增长（约3个标记/消息），而整合质量成为主要的可扩展性瓶颈。该架构成功管理了包含14,000多个科学事实（125k标记）的配置，证明领域特定的记忆整合使得超越全上下文限制的持续操作成为可能。
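The split described in the abstract above, a constant 10-message episodic window alongside a slowly growing consolidated store, can be sketched as follows. The class, its method names, and the rule that the newest value of a parameter replaces older ones are illustrative assumptions. In the paper, consolidation is presumably performed by an LLM, whereas here facts are passed in by hand.

```python
from collections import deque

class DualProcessMemory:
    """Toy episodic-plus-semantic memory (hypothetical interface)."""

    def __init__(self, window=10):
        self.episodic = deque(maxlen=window)  # most recent messages, verbatim
        self.semantic = {}                    # fact name -> (value, message index)
        self.count = 0

    def add(self, message, facts=None):
        """Store a message; facts is an optional dict of consolidated facts."""
        self.count += 1
        self.episodic.append(message)         # oldest message drops out automatically
        for name, value in (facts or {}).items():
            # A later, contradictory value overrides the earlier one.
            self.semantic[name] = (value, self.count)

    def context(self):
        """Prompt context: consolidated facts first, then the recent window."""
        fact_lines = [f"{name} = {value} (as of message {index})"
                      for name, (value, index) in sorted(self.semantic.items())]
        return fact_lines + list(self.episodic)

memory = DualProcessMemory()
for i in range(1, 26):
    facts = None
    if i == 3:
        facts = {"learning_rate": 0.1}
    if i == 20:
        facts = {"learning_rate": 0.01}
    memory.add(f"msg {i}", facts)
```

The prompt size then depends on the window length plus the number of distinct facts, not on the total number of messages seen.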

cs.AI / 77 / 2605.17637

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

WebGameBench：通过浏览器原生游戏进行编码代理的需求到应用评估

Zhang, Wenyu, You, Guoliang, Tianlun, Zhao, Haotian, Zhu, Tianshu, Wang, Haoran, Tang, Xiaoxuan, Dai, Mingyang, Gu, Jingnan, Dong, Daxiang, Wu, Jianmin

Abstract

Coding agents are increasingly used as application builders, yet many evaluations still focus on source code, repository-level tests, or intermediate traces rather than the delivered application. We introduce WebGameBench, a requirement-to-application benchmark that evaluates whether coding agents can turn a frozen Structured WebGame Specification into a browser-accessible game. Browser-native games provide a compact but behavior-dense testbed: even simple games require coordinated input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback. In WebGameBench, each generated artifact is built, served, and exposed as a browser-accessible application under a unified deployment protocol. A runtime evaluator then interacts with the delivered game in a real browser and assigns a three-way label: EXCELLENT, USABLE, or UNUSABLE. On a human-reviewed subset, the runtime label is broadly aligned with human gameplay review under the Usable-rate criterion. Across 111 tasks, 12 coding agents, and 14 evaluation configurations, WebGameBench separates current systems: the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate. This gap shows that crossing the minimum playable-delivery threshold is still far from complete requirement satisfaction. To our knowledge, WebGameBench is the first requirement-to-application benchmark for browser-native game delivery that validates delivered-application runtime labels against independent human gameplay review under the Usable-rate criterion.

Chinese Translation

编码代理越来越多地被用作应用构建者，但许多评估仍然侧重于源代码、仓库级测试或中间跟踪，而不是交付的应用程序。我们介绍了WebGameBench，这是一个需求到应用的基准，评估编码代理是否能够将冻结的结构化Web游戏规范转化为可通过浏览器访问的游戏。浏览器原生游戏提供了一个紧凑但行为密集的测试平台：即使是简单的游戏也需要协调的输入处理、空间映射、规则执行、状态转换、终止条件、重启行为和可见反馈。在WebGameBench中，每个生成的工件都在统一的部署协议下构建、提供并作为可通过浏览器访问的应用程序暴露。然后，运行时评估器在真实浏览器中与交付的游戏进行交互，并分配一个三元标签：优秀（EXCELLENT）、可用（USABLE）或不可用（UNUSABLE）。在一个经过人工审核的子集上，运行时标签与基于可用率标准的人类游戏评审大致一致。在111个任务、12个编码代理和14个评估配置中，WebGameBench区分了当前系统：最佳配置达到了76.9%的可用率，但仅有20.2%的优秀率。这一差距表明，跨越最低可玩交付阈值仍远未满足完整需求。据我们所知，WebGameBench是第一个针对浏览器原生游戏交付的需求到应用基准，它验证了交付应用的运行时标签与基于可用率标准的独立人类游戏评审之间的一致性。

cs.AI / 78 / 2605.17641

Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

基于因果干预的长时间跨度大语言模型代理的记忆选择

Srivastava, Saksham Sahai

Abstract

Long-horizon LLM agents rely on persistent memory to support interactions across sessions, yet existing memory systems often retrieve context using semantic similarity or broad history inclusion, treating retrieved memories as uniformly useful. This assumption is fragile because memories may be topically related while remaining irrelevant, stale, or misleading. We propose Causal Memory Intervention (CMI), a causal memory-selection technique that estimates how candidate memories affect the model's answer under controlled interventions, selecting memories that improve task performance while suppressing unstable, irrelevant, or harmful ones. To evaluate this setting, we introduce Causal-LoCoMo, a causally annotated benchmark derived from long conversational data, where each example contains a user request, a structured memory bank, useful memories, irrelevant distractors, and synthetic harmful memories. We compare CMI against vector, graph, reflection, summary, full-history, and no-memory baselines. Results show that CMI achieves a stronger balance between answer quality and robustness to misleading memory, suggesting that reliable long-term memory requires selecting context based on causal usefulness rather than relevance alone. The full framework, benchmark construction code, and experimental pipeline are available at https://github.com/Saksham4796/causal-memory-intervention.

Chinese Translation

长时间跨度的大语言模型（LLM）代理依赖持久记忆来支持跨会话的交互，然而现有的记忆系统通常通过语义相似性或广泛的历史包含来检索上下文，将检索到的记忆视为均匀有用。这一假设是脆弱的，因为记忆可能在主题上相关，但仍然可能无关、过时或误导。我们提出了因果记忆干预（Causal Memory Intervention, CMI），这是一种因果记忆选择技术，能够估计候选记忆在控制干预下对模型答案的影响，选择那些能够提高任务表现的记忆，同时抑制不稳定、无关或有害的记忆。为了评估这一设置，我们引入了Causal-LoCoMo，这是一个基于长对话数据的因果注释基准，其中每个示例包含用户请求、结构化记忆库、有用记忆、无关干扰项和合成有害记忆。我们将CMI与向量、图形、反射、摘要、完整历史和无记忆基线进行了比较。结果表明，CMI在答案质量和对误导性记忆的鲁棒性之间实现了更强的平衡，表明可靠的长期记忆需要基于因果有用性而非仅仅相关性来选择上下文。完整的框架、基准构建代码和实验流程可在https://github.com/Saksham4796/causal-memory-intervention获取。
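The core idea in the CMI abstract, scoring a memory by how much the answer changes when that memory is included under a controlled intervention, can be illustrated with a small sketch. This is not the released implementation. The single-memory intervention, the `answer_quality` callback, and the toy scorer are all assumptions for illustration, and interactions between memories are ignored.

```python
def select_memories(candidates, answer_quality, min_effect=0.0):
    """Keep memories whose inclusion raises answer quality over a no-memory baseline.

    candidates:     list of memory strings.
    answer_quality: callable taking a list of memories and returning a score
                    (hypothetical; in practice this would run the LLM and grade the answer).
    Returns (selected memories, per-memory estimated effect).
    """
    baseline = answer_quality([])
    effects = {m: answer_quality([m]) - baseline for m in candidates}
    selected = [m for m in candidates if effects[m] > min_effect]
    return selected, effects

def toy_quality(memories):
    """Stand-in scorer for the question 'Which city does the user live in now?'."""
    score = 0.5
    if "user moved to Berlin in 2024" in memories:
        score += 0.4   # useful memory
    if "user lives in Paris" in memories:
        score -= 0.3   # topically related but stale, so it misleads
    return score

bank = ["user moved to Berlin in 2024", "user lives in Paris", "user likes jazz"]
kept, effects = select_memories(bank, toy_quality)
```

A similarity-based retriever would likely rank both location memories highly. The intervention view separates them because only one of them improves the answer.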

cs.AI / 79 / 2605.17648

SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

SAPO：基于推理的生成推荐的步骤对齐策略优化

Zheng, Zaiyi, Min, Guanghui, Zhu, Yaochen, Wu, Liang, Hong, Liangjie, Chen, Chen, Li, Jundong

Abstract

Generative recommendation treats next-item prediction as autoregressive item-identifier generation. Specifically, items are encoded as semantic identifiers (SIDs), which are short coarse-to-fine token sequences whose early tokens capture broad semantics and later tokens refine them. Recent work augments this paradigm with reasoning traces and optimizes them via reinforcement learning with verifiable rewards, typically an outcome-reward algorithm with exact-match feedback on the generated SID. However, in large-catalog recommendation, exact-match feedback on the generated SID only reports whether the final item is correct; when a generated SID mismatches, outcome-reward cannot identify which SID-token prediction caused the mismatch and may penalize matched SID-token positions together with the mismatched position. We identify that the natural unit of credit assignment in this setting is a single reasoning step (one thinking block paired with one SID token). We instantiate this idea in SAPO (Step-Aligned Policy Optimization): rather than broadcasting one advantage to the whole response, SAPO computes a separate group-relative advantage for each reasoning step and applies it only to the corresponding thinking block and SID token. Across three real-world recommendation datasets, SAPO stabilizes reinforcement-learning training and consistently improves over existing generative recommendation baselines, with the largest gains where sparse exact-match feedback makes reasoning-step credit assignment important. Our results suggest that reinforcement-learning objectives for structured generation should mirror the decoder's own decomposition of the output.

Chinese Translation

生成推荐将下一个项目的预测视为自回归项目标识符的生成。具体而言，项目被编码为语义标识符（SIDs），这些标识符是短的粗到细的标记序列，其早期标记捕捉广泛的语义，而后期标记则对其进行细化。最近的研究通过推理轨迹增强了这一范式，并通过强化学习进行优化，使用可验证的奖励，通常是基于结果的奖励算法，对生成的SID进行精确匹配反馈。然而，在大目录推荐中，对生成的SID的精确匹配反馈仅报告最终项目是否正确；当生成的SID不匹配时，结果奖励无法识别导致不匹配的SID标记预测，并可能对匹配的SID标记位置与不匹配的位置一起进行惩罚。我们识别到，在这种情况下，信用分配的自然单位是单个推理步骤（一个思维块配对一个SID标记）。我们在SAPO（步骤对齐策略优化）中实例化这一思想：SAPO并不是将一个优势广播到整个响应，而是为每个推理步骤计算一个单独的组相对优势，并仅将其应用于相应的思维块和SID标记。在三个真实世界的推荐数据集上，SAPO稳定了强化学习训练，并在现有的生成推荐基准上持续改进，在稀疏的精确匹配反馈使推理步骤信用分配变得重要的情况下，取得了最大的收益。我们的结果表明，结构化生成的强化学习目标应当反映解码器对输出的自身分解。
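The per-step credit assignment described in the SAPO abstract can be sketched as a group-relative normalization applied column by column. The reward definition used here (1.0 when the SID token at a step matches the target) and the function name are assumptions. The abstract does not specify how step rewards are computed.

```python
from statistics import mean, pstdev

def step_aligned_advantages(step_rewards, eps=1e-6):
    """Group-relative advantage computed separately for each reasoning step.

    step_rewards[g][t]: reward of sampled response g at step t (assumed here to be
                        1.0 if SID token t matches the target item, else 0.0).
    Returns advantages[g][t], normalized across the group within each step, so a
    mismatch at one step does not penalize steps that matched.
    """
    num_steps = len(step_rewards[0])
    advantages = [[0.0] * num_steps for _ in step_rewards]
    for t in range(num_steps):
        column = [rewards[t] for rewards in step_rewards]
        mu, sigma = mean(column), pstdev(column)
        for g, reward in enumerate(column):
            advantages[g][t] = (reward - mu) / (sigma + eps)
    return advantages

# Four sampled responses, three SID tokens each.
group = [[1.0, 1.0, 0.0],
         [1.0, 0.0, 0.0],
         [1.0, 1.0, 1.0],
         [1.0, 0.0, 0.0]]
adv = step_aligned_advantages(group)
```

In this toy group, the first response fails only at its last token. Under an outcome-only reward it would be treated the same as a response that went wrong at the second token, while the step-aligned version still credits its correct second step.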

cs.AI / 80 / 2605.17669

Multimodal Cultural Heritage Knowledge Graph Extension with Language and Vision Models

基于语言和视觉模型的多模态文化遗产知识图谱扩展

Zhang, Yang, Mimouni, Nada, Moissinac, Jean-Claude, Hamdi, Fayçal

Abstract

The preservation and interpretation of cultural heritage increasingly rely on digital technologies, among which Knowledge Graphs (KGs) stand out for their ability to structure vast amounts of data. However, the construction and expansion of these KGs often face challenges due to the diverse and complex nature of cultural heritage information. In this paper, we propose a novel approach for extending KG resources in the domain of cultural heritage, which we applied to French data. First, we introduce a new knowledge graph in the domain of French cultural heritage, WJoconde, which is distinguished by its multimodality as it integrates both textual and image information of the entities. We further introduce three variants of WJoconde to facilitate downstream research, such as Knowledge Graph Completion (KGC). We also built a comprehensive benchmark for KGC methods on our dataset. Second, we propose a new framework for extending cultural heritage KGs using multi-modal approaches leveraging Large Language Models (LLMs) and Vision-Language Models (VLMs), which includes automated data extraction from unstructured resources combined with a special validation pipeline for grounding the output of both models, to further extend WJoconde. Our results show that by integrating the rich text and image information in cultural heritage data, we can efficiently enhance KGs with high reliability. We open-source all code and benchmark datasets with text and images, as well as the original data with an interactive access point.

Chinese Translation

文化遗产的保护与解读日益依赖于数字技术，其中知识图谱（Knowledge Graphs, KGs）因其结构化大量数据的能力而脱颖而出。然而，由于文化遗产信息的多样性和复杂性，这些知识图谱的构建和扩展常常面临挑战。本文提出了一种在文化遗产领域扩展知识图谱资源的新方法，并将其应用于法国数据。首先，我们介绍了一个新的法国文化遗产知识图谱WJoconde，该图谱以其多模态性而著称，整合了实体的文本和图像信息。我们进一步推出了WJoconde的三个变体，以促进下游研究，例如知识图谱补全（Knowledge Graph Completion, KGC）。我们还为我们的数据集构建了一个全面的KGC方法基准。其次，我们提出了一个新的框架，通过利用大型语言模型（Large Language Models, LLMs）和视觉-语言模型（Vision-Language Models, VLMs）采用多模态方法扩展文化遗产知识图谱，该框架包括从非结构化资源中自动提取数据，并结合一个特殊的验证流程，以确保两个模型输出的有效性，从而进一步扩展WJoconde。我们的结果表明，通过整合文化遗产数据中的丰富文本和图像信息，我们可以高效地增强知识图谱的可靠性。我们将所有代码和带有文本和图像的基准数据集开源，并提供原始数据的交互访问点。

cs.AI / 81 / 2605.17684

EGI: A Multimodal Emotional AI Framework for Enhancing Scrum Master Real-time Self-Awareness

EGI：一个多模态情感人工智能框架，用于增强Scrum Master的实时自我意识

Huang, Jingni, Bloodsworth, Peter

Abstract

While increasing research focuses on the emotional well-being of agile team members, a significant gap remains in emotion monitoring studies for Scrum Masters and meeting organizers, whose impact on team dynamics is crucial. This paper proposes a novel application integrating four carefully selected and recommended AI models to monitor the unconsciously expressed emotions of these key roles. This is achieved through: real-time transcription using a speech-to-text model; thresholding for intonation analysis to detect emotional cues in prosody; emotion-based vocabulary matching to identify sentiment in spoken content; and context-aware suggestions containing emotion keywords, generated with an open-source, multi-module AI API. The system achieved an automatic speech recognition (ASR) word error rate (WER) of 10% in simulated meeting environments. Our evaluation shows that real-time feedback significantly improves emotion awareness during simulated agile meetings, providing Scrum Masters and meeting organizers with real-time and practical suggestions to help them quickly identify and minimize the expression of negative emotions, fostering more positive and effective team interactions.

Chinese Translation

尽管越来越多的研究关注敏捷团队成员的情感健康，但在Scrum Master和会议组织者的情感监测研究中仍存在显著的空白，这些角色对团队动态的影响至关重要。本文提出了一种新颖的应用，整合了四个经过精心选择和推荐的人工智能模型，以监测这些关键角色无意识表达的情感。具体实现方式包括：使用语音转文本模型进行实时转录；通过音调分析进行阈值处理以检测韵律中的情感线索；应用基于情感的词汇匹配以识别口语内容中的情感；以及使用开源的多模块人工智能API提供包含情感关键词的上下文感知建议。该系统在模拟会议环境中实现了10%的自动语音识别（ASR）字错误率（WER）。我们的评估表明，实时反馈显著提高了模拟敏捷会议中的情感意识，为Scrum Master和会议组织者提供了实时且实用的建议，帮助他们快速识别并减少负面情感的表达，从而促进更积极和有效的团队互动。
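The word error rate quoted in the EGI abstract is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. A minimal sketch of that standard definition follows. The example sentences are invented and are not from the paper's evaluation.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,                           # deletion
                          curr[j - 1] + 1,                       # insertion
                          prev[j - 1] + (ref_word != hyp_word))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

spoken = "please move the login bug to the top of backlog"
transcribed = "please move the logging bug to the top of backlog"
wer = word_error_rate(spoken, transcribed)
```

With one substituted word out of ten reference words, this toy pair has the same 10% error rate as the figure reported above.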

cs.AI / 82 / 2605.17721

EXG: Self-Evolving Agents with Experience Graphs

EXG：具有经验图的自我进化代理

Jin, Yuxin, Zhang, Siyuan, Wang, Hanchen, Qin, Lu, Zhang, Ying, Zhang, Wenjie

Abstract

Large language model (LLM)-based agents have demonstrated strong capabilities in complex reasoning and problem solving through multi-step interactions, yet most deployed agents remain behaviorally static, with knowledge acquired during execution rarely translating into systematic improvement over time. In response, a growing line of work on self-evolving agents explores how agents can improve through experience during deployment, but most existing approaches either rely on ad hoc reflection limited to single-task correction or adopt unstructured memory that accumulates fragmented experience with delayed usability. To address this limitation, we introduce EXG, an experience graph framework for self-evolving agents that explicitly organizes accumulated successes and failures into a structured, relational representation. EXG is the first experience graph designed for self-evolving agents, supporting both online, real-time graph growth during execution for immediate cross-task experience reuse, and offline reuse of a consolidated experience graph as an external memory module. This design also enables EXG to serve as a plug-and-play component for existing self-evolving agents, organizing prior experience into a unified experience graph and improving both solution quality and resource efficiency as deployment progresses. Extensive experiments across code generation and reasoning benchmarks show that EXG attains more favorable performance-efficiency trade-offs than reflection- and memory-based baselines in both online and offline evaluations. Our results suggest that structuring experience as a graph provides a principled foundation for scalable and transferable self-evolving agent behavior.

Chinese Translation

基于大型语言模型（LLM）的代理在复杂推理和问题解决方面通过多步骤交互展示了强大的能力，但大多数已部署的代理在行为上仍然是静态的，执行过程中获得的知识很少能转化为随时间推移的系统性改进。为此，越来越多的关于自我进化代理的研究探索了代理如何在部署过程中通过经验进行改进，但大多数现有方法要么依赖于仅限于单任务纠正的临时反思，要么采用非结构化的记忆，积累了碎片化的经验但可用性延迟。为了解决这一局限性，我们提出了EXG，一种用于自我进化代理的经验图框架，明确地将积累的成功与失败组织成结构化的关系表示。EXG是第一个为自我进化代理设计的经验图，支持在执行过程中进行在线实时图增长，以便立即重用跨任务经验，并将整合的经验图作为外部记忆模块进行离线重用。这一设计还使EXG能够作为现有自我进化代理的即插即用组件，将先前的经验组织成统一的经验图，并随着部署的推进提高解决方案质量和资源效率。在代码生成和推理基准测试中的广泛实验表明，EXG在在线和离线评估中都比基于反思和记忆的基线获得了更有利的性能效率权衡。我们的结果表明，将经验结构化为图为可扩展和可转移的自我进化代理行为提供了一个原则性的基础。

cs.AI / 83 / 2605.17733

Divergence-Suppressing Couplings for Rectified Flow

抑制散度的耦合用于整流流动

Min, Yimeng, Gomes, Carla P.

Abstract

The promise of Rectified Flow rests on producing self-generated couplings whose trajectories are straight, or nearly so. In practice, trajectories generated by the base flow model can bend and intertwine, and the resulting coupling inherits this distortion. In this paper, we identify that such trajectory entanglement is often associated with regions of nonzero divergence in the learned velocity field, where local expansion or contraction distorts trajectories and steers particles away from their ideal endpoints. We then propose divergence-suppressing couplings for Rectified Flow, an offline correction that attenuates the divergent component of the learned velocity during coupling generation. The correction is paid only once per coupling pair and amortized over training, so deployment runs plain Euler at identical wall-clock cost to standard Rectified Flow. Empirically, this offline modification yields consistent improvements on 2D synthetic benchmarks and on image generation.

Chinese Translation

整流流动的潜力在于产生自生成的耦合，其轨迹应为直线或近似直线。然而，在实际应用中，基流模型生成的轨迹可能会弯曲和交错，导致生成的耦合继承这种扭曲。本文指出，这种轨迹缠结通常与学习到的速度场中非零散度区域相关，在这些区域，局部的扩展或收缩扭曲了轨迹，并将粒子引导远离其理想的终点。因此，我们提出了用于整流流动的抑制散度耦合，这是一种离线校正方法，在耦合生成过程中减弱学习到的速度中的散度成分。该校正每对耦合仅需支付一次，并在训练过程中摊销，因此在部署时以与标准整流流动相同的墙钟时间运行简单的欧拉法。经验表明，这种离线修改在二维合成基准测试和图像生成上均带来了持续的改进。
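The quantity at the center of the abstract above is the divergence of the velocity field, $\nabla \cdot v = \partial v_x / \partial x + \partial v_y / \partial y$ in two dimensions. The sketch below only estimates that quantity with central finite differences on hand-written toy fields. It does not reproduce the paper's correction, and the function and field names are illustrative.

```python
def divergence_2d(velocity, x, y, h=1e-4):
    """Central-difference estimate of div v at the point (x, y).

    velocity: callable (x, y) -> (vx, vy). In a real model this would be the
              learned velocity network evaluated at a fixed time (an assumption).
    """
    dvx_dx = (velocity(x + h, y)[0] - velocity(x - h, y)[0]) / (2 * h)
    dvy_dy = (velocity(x, y + h)[1] - velocity(x, y - h)[1]) / (2 * h)
    return dvx_dx + dvy_dy

def rotation(x, y):
    """Pure rotation: particles circle the origin, no expansion (divergence 0)."""
    return (-y, x)

def expansion(x, y):
    """Radial expansion: particles spread apart (divergence 2)."""
    return (x, y)

div_rotation = divergence_2d(rotation, 0.3, -0.7)
div_expansion = divergence_2d(expansion, 0.3, -0.7)
```

Regions where such an estimate is far from zero are the ones the abstract associates with local expansion or contraction of trajectories.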

cs.AI / 84 / 2605.17734

Harnessing LLM Agents with Skill Programs

利用技能程序赋能大型语言模型代理

Liu, Hongjun, Ming, Yifei, Joty, Shafiq, Zhao, Chen

Abstract

Equipping LLM agents with reusable skills derived from past experience has become a popular and successful approach for tackling complex and long-horizon tasks. However, such lessons are often encoded as textual guidance that remains largely advisory, lacking explicit mechanisms for when and how to intervene in the agent loop. To bridge the gap, we introduce HASP (Harnessing LLM Agents with Skill Programs), a new framework that upgrades skills into executable Program Functions (PFs). Rather than offering passive advice, PFs act as executable guardrails that activate on failure-prone states and modify the next action or inject corrective context. HASP is highly modular: it can be applied at inference time for direct agent-loop intervention, during post-training to provide structured supervision, or for self-improvement by evolving validated, teacher-reviewed PFs. Empirically, HASP drives substantial gains compared to both training-free and training-based methods on web-search, math reasoning, and coding tasks. For example, on web-search reasoning, inference-time PFs alone improve the average performance by 25% compared to (multi-loop) ReAct Agent, while post-training and controlled evolution achieve a 30.4% gain over Search-R1. To provide deeper insights into HASP, our mechanism analysis reveals how PFs trigger and intervene, how skills are internalized, and the requirement for stable skill library evolution.

Chinese Translation

为大型语言模型（LLM）代理配备源自过去经验的可重用技能已成为应对复杂和长期任务的一种流行且成功的方法。然而，这些经验往往以文本指导的形式编码，主要是建议性的，缺乏明确的机制来指明何时以及如何介入代理循环。为了解决这一问题，我们提出了HASP（Harnessing LLM Agents with Skill Programs），一个将技能升级为可执行程序功能（Program Functions, PFs）的新框架。PFs不仅提供被动建议，而是作为可执行的保护措施，在易出错的状态下激活，修改下一步行动或注入纠正上下文。HASP具有高度模块化的特点：它可以在推理时直接介入代理循环，在后训练阶段提供结构化监督，或通过演化经过验证的、教师审查的PFs实现自我提升。从实证上看，HASP在网络搜索、数学推理和编码任务上相比于无训练和基于训练的方法均取得了显著提升。例如，在网络搜索推理中，仅推理时的PFs就使平均性能提高了25%，相比于（多循环）ReAct代理，而后训练和控制演化则在Search-R1上实现了30.4%的提升。为了深入了解HASP，我们的机制分析揭示了PFs如何触发和介入，技能如何内化，以及稳定技能库演化的必要性。

cs.AI / 85 / 2605.17746

Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science

实验的代理，代理的实验：一种面向人工智能的实验科学设计语法

Zhang, Yingjie, Feng, Chun, Zhu, Weizhang, Sun, Tianshu

Abstract

AI systems are becoming active participants in organizational and knowledge work. They increasingly interact with humans, coordinate workflows, and operate in multi-agent arrangements. Understanding their effects therefore requires more than measuring output accuracy; it requires evidence about mechanisms, delegation, feedback, and control. Experiments remain central to this task, but they also face a recursive challenge: we need experiments for agents to study these arrangements, and we may need agents for experiments to help search the expanding space of possible designs. Yet experimental conditions for human-AI and agentic workflows are still largely specified in prose, making them difficult to compare, reuse, or audit. We frame this as a problem of workflow representation, traceability, and governance in AI-enabled knowledge production. We introduce SEED (Structural Encoding for Experimental Discovery), a framework that represents experimental conditions as typed actor-flow graphs. SEED supports three design functions: describing conditions as interaction structures, evaluating structural novelty relative to encoded prior designs, and generating candidate designs under feasibility and governance constraints. We report a lightweight empirical feasibility test that compares graph-blind and SEED-guided generation in a medical-triage design task. In this diagnostic contrast, SEED-guided candidate designs show clearer actor-flow changes, assumptions, and governance checks, supporting the feasibility of the grammar as a design aid. The commentary closes by identifying governance tensions around novelty, replication, validity, diversity of inquiry, and accountability.

Chinese Translation

人工智能系统正成为组织和知识工作的积极参与者。它们越来越多地与人类互动，协调工作流程，并在多代理安排中运作。因此，理解它们的影响不仅需要测量输出的准确性，还需要关于机制、委托、反馈和控制的证据。实验在这一任务中仍然至关重要，但它们也面临着一个递归挑战：我们需要为代理设计实验以研究这些安排，同时我们可能需要代理来帮助实验，以探索不断扩展的可能设计空间。然而，关于人类-人工智能和代理工作流程的实验条件仍然主要以散文形式规定，这使得它们难以比较、重用或审计。我们将此视为一个关于工作流程表示、可追溯性和治理的问题，涉及人工智能驱动的知识生产。我们引入了SEED（结构编码实验发现），这是一个将实验条件表示为类型化参与者-流程图的框架。SEED支持三种设计功能：将条件描述为互动结构，相对于编码的先前设计评估结构新颖性，以及在可行性和治理约束下生成候选设计。我们报告了一项轻量级的实证可行性测试，比较了图盲生成与SEED引导生成在医疗分诊设计任务中的表现。在这一诊断对比中，SEED引导的候选设计显示出更清晰的参与者-流程变化、假设和治理检查，支持该语法作为设计辅助工具的可行性。最后的评论指出了围绕新颖性、复制性、有效性、探究多样性和问责制的治理紧张关系。

cs.AI / 86 / 2605.17762

Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search

表面形式神经稀疏检索：工业音乐搜索的鲁棒模糊匹配

Greyson, Paul, Geng, Zhichao, Zhang, Wei, Yang, Yang

Abstract

Music search at the scale of Amazon Music presents a unique challenge: queries frequently deviate from indexed metadata due to misspellings, transpositions, and phonetic variations, yet the retrieval system must operate under strict millisecond-level latency constraints. Our existing learning-to-retrieve system, the High Confidence Index (HCI), learns query-entity associations from customer behavior, relying on continual "exploration" to choose candidates. Traditional n-gram matching enables this exploration but suffers from poor semantic robustness and high noise, limiting the system's ability to learn from long-tail queries. In this work, we present a robust neural sparse retrieval system designed to maximize exploration efficiency. We adapt a state-of-the-art inference-free sparse retrieval architecture to the music domain, combining it with an effective domain-specific granular subword tokenization strategy. Our approach utilizes short-length token constraints (max 3 chars) to enforce the learning of surface-form robustness over lexical memorization. By pre-computing the neural embeddings and term expansions during the offline indexing phase, online processing is reduced to minimal tokenization and IDF weighting, achieving effectively zero latency overhead for query encoding. Evaluations on a 6M-document production corpus show an aggregate 91.4% recall@10 (vs. 57.7% for trigrams) at comparable throughput. Simulation of the HCI feedback loop demonstrates improved exploration efficiency, with +0.8% higher stabilized recall than production trigrams. Ablation studies indicate that our sparse training methodology drives the performance gains, while domain-specific pretraining provides a cost-effective alternative to large-scale general-purpose pretraining.

Chinese Translation

在亚马逊音乐规模下的音乐搜索面临独特挑战：查询由于拼写错误、调换和语音变体等原因，常常偏离索引元数据，但检索系统必须在严格的毫秒级延迟约束下运行。我们现有的学习检索系统，高置信度索引（High Confidence Index, HCI），通过从客户行为中学习查询-实体关联，依赖于持续的“探索”来选择候选项。传统的n-gram匹配支持这种探索，但在语义鲁棒性和高噪声方面表现不佳，限制了系统从长尾查询中学习的能力。在本研究中，我们提出了一种鲁棒的神经稀疏检索系统，旨在最大化探索效率。我们将一种最先进的无推理稀疏检索架构适配到音乐领域，并结合有效的领域特定细粒度子词标记化策略。我们的方法利用短长度标记约束（最多3个字符）来强化表面形式鲁棒性而非词汇记忆。通过在离线索引阶段预计算神经嵌入和术语扩展，在线处理简化为最小的标记化和IDF加权，实现了查询编码的有效零延迟开销。在一个600万文档的生产语料库上的评估显示，整体91.4%的召回率@10（相比之下，三元组为57.7%）在可比的吞吐量下。HCI反馈循环的模拟展示了更高的探索效率，稳定召回率比生产三元组提高了+0.8%。消融研究表明，我们的稀疏训练方法推动了性能提升，而领域特定的预训练为大规模通用预训练提供了成本效益高的替代方案。
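The online side of the system described above is said to reduce to tokenization with short tokens (at most 3 characters) plus IDF weighting. The sketch below imitates only that query-side step, using plain character 2-grams and 3-grams as a stand-in for the paper's subword tokenizer and omitting the learned document-side term expansions. All names and the toy catalog are assumptions.

```python
import math
from collections import Counter

def short_tokens(text, sizes=(2, 3)):
    """Character n-grams of each word, capped at 3 characters (stand-in tokenizer)."""
    tokens = set()
    for word in text.lower().split():
        for n in sizes:
            tokens.update(word[i:i + n] for i in range(len(word) - n + 1))
    return tokens

def build_idf(documents):
    """Smoothed inverse document frequency for every token in the catalog."""
    doc_freq = Counter(tok for doc in documents for tok in short_tokens(doc))
    total = len(documents)
    return {tok: math.log(1 + total / count) for tok, count in doc_freq.items()}

def search(query, documents, idf):
    """Return the document with the highest IDF-weighted token overlap."""
    query_tokens = short_tokens(query)
    scored = [(sum(idf.get(tok, 0.0) for tok in query_tokens & short_tokens(doc)), doc)
              for doc in documents]
    return max(scored)[1]

catalog = ["bohemian rhapsody", "bohemian like you", "rhapsody in blue"]
idf_table = build_idf(catalog)
best = search("bohemain rapsody", catalog, idf_table)
```

Because the query is broken into many short overlapping pieces, a transposition ("bohemain") and a dropped letter ("rapsody") each destroy only a few tokens, so the intended title still accumulates the most weight.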

cs.AI / 87 / 2605.17770

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

熵梯度反演：迈向大型推理模型的内部机制

Yang, Junyao, Qian, Chen, Wang, Kun, Zhang, Linfeng, Zhang, Quanshi, Liu, Yong, Liu, Dongrui

Abstract

The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive "fast thinking" text generation to systematic, step-by-step "slow thinking" reasoning, unlocking state-of-the-art performance in complex mathematical and logical tasks. However, the field faces the fundamental gap between token-level behavioral analysis and internal reasoning mechanisms, and the instability of reinforcement learning (RL) for reasoning optimization relying on costly external verifiers. We identify and formally define Entropy-Gradient Inversion, a robust negative correlation between token entropy and logit gradients that acts as a definitive geometric fingerprint for LRM reasoning capability. Building on this, we propose Correlation-Regularized Group Policy Optimization (CorR-PO), which embeds this inversion signature into RL reward regularization. Extensive experiments on various reasoning benchmarks across multiple model scales show CorR-PO consistently outperforms state-of-the-art baselines, confirming that stronger inversion directly correlates with superior reasoning performance.

Chinese Translation

大型推理模型（LRMs）的进展催生了一场范式转变，从反应式的“快速思维”文本生成转向系统化、逐步的“慢思维”推理，在复杂的数学和逻辑任务中解锁了最先进的性能。然而，该领域面临着令牌级行为分析与内部推理机制之间的基本差距，以及依赖昂贵外部验证者的推理优化中强化学习（RL）的不稳定性。我们识别并正式定义了熵梯度反演，即令牌熵与对数梯度之间的强负相关关系，作为LRM推理能力的明确几何指纹。在此基础上，我们提出了相关性正则化组策略优化（CorR-PO），将这一反演特征嵌入到RL奖励正则化中。在多个模型规模的各种推理基准上进行的广泛实验表明，CorR-PO始终优于最先进的基线，确认了更强的反演与更优越的推理性能之间的直接相关性。

cs.AI / 88 / 2605.17790

STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery

STRIDE：一种用于可靠自动方程发现的自反性智能体框架

Su, Jiarui, Tu, Songjun, Sun, Bei, Liang, Xiaojun

Abstract

LLM-based equation discovery offers a promising route to recovering symbolic laws from data, but many systems still rely on generation-centered loops that propose candidates, fit parameters, score results, and reuse selected examples. Such loops can misjudge useful skeletons under unreliable fitting, discard near-correct equations that require repair, and accumulate redundant memories that provide limited guidance. We propose STRIDE, a self-reflective agent framework that improves reliability by coordinating data-aware generation, mixed-fitting evaluation, critic--executor repair, and diversity-preserving semantic memory. By turning fitted scores and candidate behavior into shared feedback, STRIDE enables equations to be proposed, assessed, refined, and reused within a closed-loop discovery process. Experiments on representative symbolic-regression benchmarks and LSR-Synth suites show that STRIDE improves accuracy, OOD robustness, and structural recovery across multiple LLM backbones, with ablations and analyses confirming the contribution of its core components.

Chinese Translation

基于大型语言模型（LLM）的方程发现为从数据中恢复符号法则提供了一条有前景的途径，但许多系统仍然依赖于以生成为中心的循环，这些循环提出候选方程、拟合参数、评分结果并重用选定示例。这些循环可能在不可靠的拟合下错误判断有用的框架，丢弃需要修复的近似正确方程，并积累提供有限指导的冗余记忆。我们提出了STRIDE，一种自反性智能体框架，通过协调数据感知生成、混合拟合评估、批评者-执行者修复和保持多样性的语义记忆来提高可靠性。通过将拟合得分和候选行为转化为共享反馈，STRIDE使得方程能够在闭环发现过程中被提出、评估、精炼和重用。在具有代表性的符号回归基准和LSR-Synth套件上的实验表明，STRIDE在多个LLM基础上提高了准确性、OOD（超出分布）鲁棒性和结构恢复，消融实验和分析确认了其核心组件的贡献。

cs.AI / 89 / 2605.17809

Accelerating AI-Powered Research: The PuppyChatter Framework for Usable and Flexible Tooling

加速人工智能驱动的研究：PuppyChatter框架的可用性与灵活性工具

Tseng, Chun-Hsiung, Lin, Hao-Chiang Koong, Huang, Andrew Chih-Wei, Chen, Yung-Hui, Lin, Jia-Rou

Abstract

This research addresses the challenges inherent in developing Artificial Intelligence (AI) applications, particularly those leveraging Large Language Models (LLMs). While AI vendors provide Application Programming Interfaces (APIs) and Software Development Kits (SDKs) to facilitate developer interaction, the former often requires intricate manual request construction, and the latter can lead to significant vendor lock-in. Furthermore, existing model abstraction frameworks, though mitigating vendor dependency, introduce an additional layer of complexity and potential security concerns. To reconcile these conflicting factors, the study introduces PuppyChatter, a novel software framework designed to preserve the intuitive simplicity of vendor-specific SDKs while simultaneously adhering to the vendor-neutrality principles characteristic of model abstraction, thereby offering a more streamlined and flexible development paradigm.

Chinese Translation

本研究解决了开发人工智能（AI）应用程序中固有的挑战，特别是那些利用大型语言模型（LLMs）的应用。尽管AI供应商提供应用程序编程接口（APIs）和软件开发工具包（SDKs）以促进开发者的交互，但前者通常需要复杂的手动请求构建，而后者可能导致显著的供应商锁定。此外，现有的模型抽象框架虽然减轻了对供应商的依赖，但却引入了额外的复杂性和潜在的安全隐患。为了解决这些相互矛盾的因素，本研究提出了PuppyChatter，一个新颖的软件框架，旨在保留供应商特定SDK的直观简洁性，同时遵循模型抽象特有的供应商中立性原则，从而提供更简化和灵活的开发范式。

cs.AI / 90 / 2605.17812

Going Headless? On the Boundaries of Vertical AI Firms

走向无头？关于垂直人工智能公司的边界

Hydari, Muhammad Zia, Muzaffar, Farooq

Abstract

Vertical AI firms in accounting, law, healthcare, procurement, and similar domains historically bundled workflow, domain logic, and accountability into a single application. General-purpose AI agents are now unbundling that package, prompting founders and investors to advocate "going headless": cede the workflow and interface to agents and expose domain expertise as callable services. This article argues that going headless is correct for some firms and destructive for others, and that the latter often cede their value capture inadvertently through architectural choices that look like interface decisions. This is a boundary question, and the answer turns on distinguishing the interface boundary, which can often move, from the accountability boundary, which often must not. Drawing on Coase's theory of the firm, Eisenmann, Parker, and Van Alstyne's platform envelopment framework, and Teece's analysis of complementary assets and appropriability, the article shows that orchestrators operating through open protocols acquire envelopment power even as technical interoperability improves, and that durable value capture concentrates in cospecialized accountability assets: professional signoff, regulated workflows, evidence trails, and trusted systems of record. The article proposes a three-position taxonomy (component, integrated software platform, dual-track) determined not by sector but by task-accountability regime, and formalizes the construct of rule debt: the future governance, maintenance, and accountability burden that accrues to customer organizations when business rules and professional standards migrate from governed systems into prompts and agent instructions. Four principles follow: decompose by accountability not interface, invert the edges while retaining the core, position rule debt as the customer cost the integrated platform prevents, and avoid single-orchestrator dependence.

Chinese Translation

垂直人工智能公司在会计、法律、医疗保健、采购等领域历史上将工作流程、领域逻辑和责任整合为单一应用程序。通用人工智能代理现在正在解构这一包裹，促使创始人和投资者倡导“走向无头”：将工作流程和界面交给代理，并将领域专业知识作为可调用服务暴露出来。本文认为，走向无头对某些公司是正确的，而对其他公司则是破坏性的，而后者往往通过看似界面决策的架构选择无意中放弃其价值捕获。这是一个边界问题，答案在于区分接口边界（通常可以移动）与责任边界（通常必须保持不变）。本文借鉴了科斯的公司理论、艾森曼、帕克和范阿尔斯廷的平台包络框架，以及蒂斯对互补资产和可获得性的分析，表明通过开放协议运作的协调者即使在技术互操作性改善的情况下也能获得包络权力，并且持久的价值捕获集中在共同专业化的责任资产上：专业签署、受监管的工作流程、证据链和可信的记录系统。本文提出了一种基于任务-责任制度的三种定位分类法（组件、集成软件平台、双轨），而非按行业划分，并形式化了规则债务的构念：当商业规则和专业标准从受管控系统迁移到提示和代理指令时，客户组织所承担的未来治理、维护和责任负担。接下来提出四项原则：按责任而非接口进行分解，保留核心的同时反转边缘，将规则债务视为集成平台所防止的客户成本，避免单一协调者依赖。

cs.AI / 91 / 2605.17829

Interactive Evaluation Requires a Design Science

互动评估需要设计科学

Xuan, Keyang, Song, Peiyang, Lu, Pan, Han, Pengrui, Li, Wenkai, Zhang, Zhenyu, He, Zexue, Hua, Wenyue, Li, Manling, You, Jiaxuan, Weller, Adrian, Wang, Yizhong, Pei, Jiaxin

Abstract

AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, while many evaluation practices still inherit assumptions from response-centered benchmarks (e.g., fixed inputs, isolated outputs, and outcome judgments that can be made from a single response). The field has begun to build interactive benchmarks, but the resulting landscape is fragmented: benchmarks differ in what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This position paper argues that interactive evaluation should be treated as a principled evaluation paradigm, not merely a new family of agent benchmarks. Simply adopting previous evaluation paradigms does not suffice. We define evaluation as an autonomous mapping from evidence to judgments, and show that interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, we propose a two-axis taxonomy, derive design principles and reporting standards, examine representative scenarios, and analyze how longstanding evaluation challenges reappear at the trajectory level.

Chinese Translation

人工智能评估正经历结构性变革。大型语言模型（LLMs）越来越多地作为在时间上通过工具、环境、用户和其他代理进行操作的系统被部署，而许多评估实践仍然继承了以响应为中心的基准假设（例如，固定输入、孤立输出，以及可以从单一响应中得出的结果判断）。该领域已经开始构建互动基准，但所产生的格局是支离破碎的：基准在允许的互动工件、轨迹评分以及其结果支持的主张方面存在差异。本文立场论文主张，互动评估应被视为一种原则性的评估范式，而不仅仅是一类新的代理基准。简单地采用以前的评估范式是不够的。我们将评估定义为从证据到判断的自主映射，并展示互动评估如何改变这一映射的两侧：证据变为互动生成的轨迹，而评估过程必须评估过程、可恢复性、协调性、鲁棒性和系统级性能。在此定义的基础上，我们提出了一个双轴分类法，推导出设计原则和报告标准，考察代表性场景，并分析长期存在的评估挑战如何在轨迹层面重新出现。

cs.AI / 92 / 2605.17830

Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents

记忆更多，风险更多：具备记忆功能的LLM代理的纵向安全风险

Al-Tawaha, Ahmad, Gu, Shangding, Niu, Peizhi, Jia, Ruoxi, Jin, Ming

Abstract

Safety evaluations of memory-equipped LLM agents typically measure within-task safety: whether an agent completes a single scenario safely, often under adversarial conditions such as prompt injection or memory poisoning. In deployment, however, a single agent serves many independent tasks over a long horizon, and memory accumulated during earlier tasks can affect behavior on later, unrelated ones. Studying this regime requires evaluation along the temporal dimension across tasks: not whether an agent is safe at any single memory state, but how its safety profile changes as memory accumulates across many independent interactions. We call this failure mode temporal memory contamination. To isolate memory exposure from stream non-stationarity, we introduce a trigger-probe protocol that evaluates a fixed probe set against read-only memory snapshots at varying prefix lengths, together with a NullMemory counterfactual baseline for identifying memory-induced violations. We apply this protocol across three deployment scenarios spanning records, memos, forms, and email correspondence and eight memory architectures, and additionally on Claw-like AI agents, such as OpenClaw, using the platform's native memory mechanism. Memory-enabled agents consistently exceed the NullMemory baseline, and memory-induced violation rates show a robust upward trend with exposure length on both agent classes. Order-randomization experiments indicate that the effect is driven primarily by accumulated content rather than encounter order. Finally, a structural consequence of the event decomposition is that memory-induced risk is detectable from retrieval state before generation, which we confirm with a high-recall diagnostic monitor. Our results argue for treating memory safety as a longitudinal property that requires temporal evaluation, not a single-state property that can be captured by a snapshot.

Chinese Translation

具备记忆功能的LLM代理的安全评估通常测量任务内安全性：即代理是否在单一场景中安全完成任务，通常是在诸如提示注入或记忆中毒等对抗性条件下。然而，在实际部署中，单个代理在较长时间内执行许多独立任务，早期任务中积累的记忆可能会影响后续无关任务的行为。研究这一情况需要沿时间维度对任务进行评估：不是评估代理在任何单一记忆状态下的安全性，而是评估其安全特征如何随着在许多独立交互中记忆的积累而变化。我们将这种失效模式称为时间记忆污染。为了将记忆暴露与流非平稳性隔离，我们引入了一种触发探测协议，该协议评估固定探测集在不同前缀长度下的只读记忆快照，以及一个NullMemory反事实基线用于识别记忆引起的违规行为。我们在涵盖记录、备忘录、表单和电子邮件通信的三个部署场景以及八种记忆架构中应用了该协议，并且还在类似Claw的AI代理（如OpenClaw）上使用该平台的本地记忆机制。具备记忆功能的代理在各类任务中始终超过NullMemory基线，且记忆引起的违规率在两类代理中均显示出随着暴露时间的延长而稳步上升的趋势。顺序随机化实验表明，该效应主要由积累的内容驱动，而非遭遇顺序。最后，事件分解的一个结构性结果是，记忆引起的风险可以在生成之前从检索状态中检测到，我们通过高召回率的诊断监测器对此进行了确认。我们的结果表明，应将记忆安全视为一种纵向特性，需要进行时间评估，而不是可以通过快照捕捉的单一状态特性。
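
The trigger-probe protocol in this abstract can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' harness: `agent(probe, memory)` is a hypothetical callable that returns True when the response violates policy, the toy agent and its three-item memory stream are invented for the example, and NullMemory is modelled as an empty tuple.

```python
def memory_induced_rates(agent, stream, probes, prefix_lengths):
    """For each prefix length, the fraction of probes that violate policy with
    the memory snapshot but not under the empty-memory counterfactual."""
    # NullMemory baseline: the same fixed probe set with no memory at all.
    baseline = [agent(probe, ()) for probe in probes]
    rates = {}
    for k in prefix_lengths:
        # Read-only snapshot: a tuple of the first k items; probes never write back.
        snapshot = tuple(stream[:k])
        induced = sum(
            1 for probe, base in zip(probes, baseline)
            if agent(probe, snapshot) and not base
        )
        rates[k] = induced / len(probes)
    return rates

def toy_agent(probe, memory):
    """Hypothetical stand-in: 'violates' when any remembered item mentions the probe topic."""
    return any(probe in item for item in memory)

stream = [
    "note: forward invoices externally",
    "memo: skip approval for refunds",
    "email: lunch on friday",
]
probes = ["invoices", "refunds", "passwords"]
rates = memory_induced_rates(toy_agent, stream, probes, prefix_lengths=[0, 1, 2, 3])
```

Because the probe set is fixed and each snapshot is frozen, any change in the rate across prefix lengths is attributable to what the memory contains rather than to drift in the task stream.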

cs.AI / 93 / 2605.17856

KISS - Knowledge Infrastructure for Scientific Simulation: A Scaffolding for Agentic Earth Science

KISS - 科学模拟知识基础设施：代理地球科学的支架

Li, Ziwei, Zhu, Liujun, Liu, Yuchen, Zhao, Yichen, Li, Birk, Wu, Ruiqi, Jin, Junliang, Zhang, Jianyun

Abstract

Process-based simulation models encode decades of scientific understanding across the Earth sciences, yet the communities most exposed to climate risk and resource scarcity are the least able to use them. Here, we introduce knowledge infrastructure (KI), an agent-actionable scaffold that externalizes expertise into validated modelling operators, staged domain protocols, and diagnostic recovery mechanisms. Across a 3,000-trial coupled-hydrology benchmark, agents equipped with KI produced physically plausible, verifiable end-to-end simulations in up to 84% of trials, while agents without KI plateaued below 40%. KI generalizes across disciplines. We packaged its construction into a Knowledge Dissection Toolkit (KDT) that autonomously produced KI enabling end-to-end agent execution of 117 additional process-based models across 14 Earth-science domains. Across all 119 KIs, modelling decisions and failure remedies converged despite different underlying physics, showing that operational expertise is structured and extractable rather than ad hoc. Demonstrations show KI-equipped agents lowering both the access barrier between non-specialist users and process-based simulation, and the integration barrier between modelling communities. Through this scaffold, process-based science can then evolve as a living scientific commons, answerable to whoever needs to know and extendable by whoever can contribute.

Chinese Translation

基于过程的模拟模型编码了数十年来在地球科学领域的科学理解，然而，最容易受到气候风险和资源匮乏影响的社区却最难以使用这些模型。在此，我们介绍了知识基础设施（Knowledge Infrastructure, KI），这是一种可供代理行动的支架，将专业知识外化为经过验证的建模操作符、分阶段的领域协议和诊断恢复机制。在3000次耦合水文基准试验中，配备KI的代理在高达84%的试验中生成了物理上合理、可验证的端到端模拟，而未配备KI的代理则停滞在40%以下。KI在各学科中具有普适性。我们将其构建过程打包成知识解剖工具包（Knowledge Dissection Toolkit, KDT），该工具包能够自主生成KI，使得在14个地球科学领域中，117个额外的基于过程的模型能够实现端到端的代理执行。在所有119个KI中，尽管基础物理不同，建模决策和失败补救措施却趋于一致，显示出操作专业知识是结构化的和可提取的，而非临时的。演示表明，配备KI的代理降低了非专业用户与基于过程的模拟之间的接入障碍，以及建模社区之间的整合障碍。通过这一支架，基于过程的科学可以作为一个活的科学共享资源不断发展，能够回应任何需要了解的人，并由任何能够贡献的人进行扩展。

cs.AI / 94 / 2605.17877

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

PAIR：面向前缀的内部奖励模型用于多轮智能体优化

Kim, Wonjoong, In, Yeonjun, Park, Sangwu, Lee, Dongha, Park, Chanyoung

Abstract

A significant hurdle for current LLMs is the execution of complex, multi-stage tasks. Group Relative Policy Optimization (GRPO) has emerged as a leading choice, but its reliance on sparse outcome rewards severely limits credit assignment across intermediate steps. Existing remedies, such as running full rollouts to assign step-level advantages, calling external LLM judges at each step, or computing intrinsic rewards that require ground-truth answers at every evaluation, introduce significant costs or practical constraints. We hypothesize that internal correctness probing over LLM hidden states can be repurposed as a step-level reward signal, potentially addressing all of these limitations at once. However, existing probing research assumes clean inputs, and we first show that this assumption breaks down in multi-step settings: hidden-state probes degrade severely under prefix contamination, tracking coherence with the (possibly corrupted) prefix rather than grounded correctness, while attention-based features remain robust to contamination but underperform on clean prefixes. Building on this complementary relationship, we propose the Prefix-Aware Internal Reward (PAIR), a two-stage model with a frozen hidden-state probe estimating belief-consistency and a lightweight attention-based head correcting it toward grounded correctness. Experimental results show that PAIR achieves the highest AUROC on contaminated trajectories while operating at negligible inference cost, enabling dense step-level reward signals for GRPO training without external model calls, ground-truth dependencies, or full-trajectory rollouts.

Chinese Translation

当前大规模语言模型（LLMs）面临的一个重大障碍是执行复杂的多阶段任务。群体相对策略优化（Group Relative Policy Optimization, GRPO）作为一种领先选择逐渐受到关注，但其对稀疏结果奖励的依赖严重限制了在中间步骤中的信用分配。现有的解决方案，例如进行完整的回滚以分配步骤级优势、在每个步骤调用外部LLM评审者，或计算需要每次评估都提供真实答案的内在奖励，均引入了显著的成本或实际限制。我们假设对LLM隐藏状态进行内部正确性探测可以重新用作步骤级奖励信号，可能同时解决所有这些限制。然而，现有的探测研究假设输入是干净的，我们首先展示了这一假设在多步骤设置中是失效的：隐藏状态探测在前缀污染下严重退化，跟踪与（可能被污染的）前缀的一致性，而非基于真实的正确性，而基于注意力的特征在污染下保持稳健，但在干净前缀上表现不佳。在这一互补关系的基础上，我们提出了面向前缀的内部奖励（Prefix-Aware Internal Reward, PAIR），这是一个两阶段模型，具有一个冻结的隐藏状态探测器用于估计信念一致性，以及一个轻量级的基于注意力的头部将其修正为基于真实的正确性。实验结果表明，PAIR在污染轨迹上实现了最高的AUROC，同时以微不足道的推理成本运行，使得GRPO训练能够在没有外部模型调用、真实答案依赖或完整轨迹回滚的情况下，提供密集的步骤级奖励信号。
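
The credit-assignment limitation mentioned above can be seen in a short sketch of the usual group-relative advantage: each rollout's outcome reward is normalized within its group, and every step of that rollout inherits the same value. The use of the population standard deviation, the `eps` constant, and the example rewards and step counts are assumptions for illustration; PAIR itself is not implemented here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize outcome rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

def broadcast_to_steps(advantages, steps_per_rollout):
    """With outcome-only rewards, every step in a rollout receives the same advantage."""
    return [[a] * n for a, n in zip(advantages, steps_per_rollout)]

outcome_rewards = [1.0, 0.0, 0.0, 1.0]   # one sparse reward per rollout
advantages = group_relative_advantages(outcome_rewards)
step_advantages = broadcast_to_steps(advantages, steps_per_rollout=[3, 5, 2, 4])
```

A failed five-step rollout gives all five steps the same negative advantage, even if only one step was wrong, which is the gap a dense step-level reward signal is meant to close.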

cs.AI / 95 / 2605.17894

Evaluating Cognitive Age Alignment in Interactive AI Agents

评估交互式人工智能代理的认知年龄一致性

Shen, Yifan, Zhang, Jiawen, Xu, Jian, Kim, Junho, Lourentzou, Ismini, Cao, Xu, Huang, Meihuan

Abstract

While agentic AI and its core multimodal large language models (MLLMs) have demonstrated remarkable promise in language and visual reasoning across domains ranging from daily life to advanced scientific research, a profound gap remains between artificial and human intelligence. Despite the integration of powerful tools and advanced MLLMs, state-of-the-art AI agents frequently fail at foundational, seemingly simple tasks that a child can resolve with ease. Inspired by the Wechsler Intelligence Scale for Children (WISC), we introduce ChildAgentEval, the first psychometrically grounded interactive benchmark for evaluating cognitive age alignment in MLLM-based agents. ChildAgentEval systematically compares the reasoning performance of various MLLM-based interactive agents against age-specific human developmental stages, exposing where current agentic AI systems can and cannot simulate age-specific cognitive behavior.

Chinese Translation

尽管代理型人工智能及其核心的多模态大型语言模型（MLLMs）在日常生活到高级科学研究等多个领域展示了显著的潜力，但人工智能与人类智能之间仍存在深刻的差距。尽管集成了强大的工具和先进的MLLMs，最先进的人工智能代理在一些基础且看似简单的任务上常常表现不佳，而这些任务儿童却能轻松解决。受韦氏儿童智力量表（WISC）的启发，我们推出了ChildAgentEval，这是第一个基于心理测量学的交互式基准，用于评估基于MLLM的代理的认知年龄一致性。ChildAgentEval系统地比较了各种基于MLLM的交互式代理的推理性能与特定年龄的人类发展阶段，揭示了当前代理型人工智能系统在模拟特定年龄的认知行为方面的能力与不足。

cs.AI / 96 / 2605.17900

DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition

DuIVRS-2：一种基于大型语言模型的交互式语音响应系统，用于大规模兴趣点属性获取

Zhang, Le, Zhang, Shengming, Zha, Rui, Wu, Yunpeng, Zhou, Jingbo, Huang, Jizhou

Abstract

Accurate Point of Interest (POI) attribute acquisition is essential for location-based services, yet traditional modular Interactive Voice Response (IVR) systems suffer from error accumulation and high maintenance overhead. We present DuIVRS-2, a large language model (LLM)-based end-to-end framework designed for large-scale POI attribute acquisition at Baidu Maps. To address the long-tail distribution of real-world interactions, our methodology first employs a finite state machine (FSM)-guided data augmentation strategy to synthesize a balanced and diverse training dataset. We then streamline dialogue management via a selective generation scheme combined with a Chain-of-Thought (CoT) mechanism, which ensures output stability and effectively eliminates hallucinations in industrial settings. To facilitate continuous policy refinement with minimal manual effort, we design a cooperative iterative learning framework that leverages a dual-evaluator voting system. Deployed in production for two months, DuIVRS-2 processed 0.4 million calls daily and achieved an 83.9% Task Success Rate (TSR), outperforming its predecessor by 4 percentage points while maintaining a low reaction time of 130 ms. This work provides a production-proven reference for developing robust, cost-effective LLM agents for large-scale industrial dialogue applications.

Chinese Translation

准确的兴趣点（POI）属性获取对于基于位置的服务至关重要，然而传统的模块化交互式语音响应（IVR）系统存在错误累积和高维护成本的问题。我们提出了DuIVRS-2，这是一种基于大型语言模型（LLM）的端到端框架，旨在为百度地图提供大规模的POI属性获取。为了应对现实世界交互的长尾分布，我们的方法首先采用有限状态机（FSM）引导的数据增强策略，以合成一个平衡且多样化的训练数据集。然后，我们通过选择性生成方案结合思维链（Chain-of-Thought, CoT）机制来简化对话管理，这确保了输出的稳定性并有效消除了工业环境中的幻觉。为了便于在最小人工努力下进行持续的策略优化，我们设计了一个合作迭代学习框架，利用双评估者投票系统。在生产环境中部署两个月后，DuIVRS-2每天处理40万通电话，达到了83.9%的任务成功率（Task Success Rate, TSR），比其前身提高了4个百分点，同时保持了130毫秒的低反应时间。这项工作为开发强大且具有成本效益的大型语言模型代理提供了经过生产验证的参考，适用于大规模工业对话应用。

cs.AI / 97 / 2605.17902

LAST-RAG: Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection

LAST-RAG：基于文献的随机轨迹检索增强生成用于知识条件下的退化模型选择

Park, Hanbyeol, Bae, Hyerim

Abstract

Stochastic-process-based degradation modeling is a core approach for estimating the distribution of remaining useful life (RUL); however, the selection of an appropriate stochastic process has not been sufficiently addressed. Existing model selection methods mainly rely on the statistical fit of the observed health indicator (HI) trajectory, but this approach may select a model that is inconsistent with the underlying degradation mechanism when the observation window is short or the signal is highly noisy. To address this issue, this paper proposes Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation (LAST-RAG). The proposed method uses both the observed HI trajectory and domain-specific context, and hierarchically conditions the candidate degradation model space based on theoretical and mechanical evidence retrieved from a local evidence bank. In addition, Rule-based Confidence Reasoning with Uncertain State (RCRUS) is introduced to prevent candidate models from being prematurely eliminated when hierarchical decisions are uncertain. Simulation-based experiments demonstrate that the proposed method outperforms statistical, prognostic, and uncertainty-aware baselines in both Wiener/gamma family classification and detailed degradation model classification. Ultimately, this study reframes degradation model selection from a purely statistical goodness-of-fit problem into a knowledge-conditioned decision-making problem that integrates observed data with domain knowledge.

Chinese Translation

基于随机过程的退化建模是估计剩余使用寿命（RUL）分布的核心方法；然而，选择合适的随机过程尚未得到充分解决。现有的模型选择方法主要依赖于观察到的健康指标（HI）轨迹的统计拟合，但当观察窗口较短或信号噪声较大时，这种方法可能会选择与潜在退化机制不一致的模型。为了解决这一问题，本文提出了基于文献的随机轨迹检索增强生成（LAST-RAG）。该方法同时利用观察到的HI轨迹和领域特定的上下文，并基于从本地证据库检索的理论和机械证据对候选退化模型空间进行分层条件化。此外，引入了基于规则的不确定状态置信推理（RCRUS），以防止在分层决策不确定时候选模型被过早排除。基于仿真的实验表明，所提方法在Wiener/伽马族分类和详细退化模型分类中均优于统计、预测和不确定性感知的基线方法。最终，本研究将退化模型选择从纯粹的统计拟合优度问题重新框架为一个知识条件下的决策问题，整合了观察数据与领域知识。

cs.AI / 98 / 2605.17903

Agentic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap

代理式分块与贝叶斯去分块的AI生成模糊认知图：修昔底德陷阱模型

Panda, Akash Kumar, Adigun, Olaoluwa, Kosko, Bart

Abstract

We automatically generate feedback causal fuzzy cognitive maps (FCMs) from text by teaching large-language-model agents to break the text into overlapping chunks of text. Convex mixing of these chunk FCMs gives a representative cyclic FCM knowledge graph. The text chunks can have different levels of overlap. The chunk FCMs still mix to form a new FCM causal knowledge graph. The mixing technique scales because it uses light computation with sparse causal chunk matrices. The mixing structure allows an operator-level type of Bayesian inference that produces "de-chunked" or posterior-like FCMs from the mixed FCM. These de-chunked FCMs are useful in their own right and allow further iterations of Bayesian updating. We demonstrate these mixing techniques on the essay text of Allison's "Thucydides Trap" model of conflict between a dominant power such as the United States and a rising power such as China. The FCM dynamical systems predict outcomes as they equilibrate to fixed-point or limit-cycle attractors. Seven out of 8 FCM knowledge graphs predicted a type of war when we stimulated them by turning on and keeping on the concept node that stands for the rising power's ambition and entitlement. Gemini 3.1 LLMs served as the chunking AI agents.

Chinese Translation

我们通过教大型语言模型代理将文本分解为重叠的文本块，自动生成反馈因果模糊认知图（FCMs）。这些块状FCMs的凸混合生成一个代表性的循环FCM知识图。文本块可以具有不同程度的重叠。这些块状FCMs仍然可以混合形成新的FCM因果知识图。该混合技术具有可扩展性，因为它使用稀疏因果块矩阵进行轻量计算。混合结构允许一种操作级别的贝叶斯推理，从混合的FCM中产生“去分块”或后验式的FCMs。这些去分块的FCMs本身具有实用性，并允许进一步的贝叶斯更新迭代。我们在艾莉森的“修昔底德陷阱”模型的论文文本上演示了这些混合技术，该模型描述了像美国这样的主导力量与像中国这样的崛起力量之间的冲突。FCM动态系统预测结果，因为它们趋向于固定点或极限周期吸引子。当我们通过打开并保持代表崛起力量的雄心和权利概念节点来刺激它们时，8个FCM知识图中有7个预测了一种战争类型。Gemini 3.1 LLMs作为分块AI代理发挥了作用。
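
A minimal sketch of convex mixing and fixed-point equilibration for FCMs follows. It assumes each chunk FCM is a square causal edge matrix over the same concept list; the three-node matrices, the equal mixing weights, and the logistic squashing function are illustrative choices, not the authors' settings.

```python
import math

def mix_fcms(chunk_fcms, weights):
    """Convex combination of equally sized FCM edge matrices."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing weights must be non-negative and sum to 1")
    n = len(chunk_fcms[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, chunk_fcms))
             for j in range(n)] for i in range(n)]

def run_fcm(edges, state, clamp=None, steps=100, tol=1e-9):
    """Iterate the FCM until the state stops changing or the step budget runs out.
    clamp maps a node index to a value that is held on at every step."""
    clamp = clamp or {}
    n = len(edges)
    state = list(state)
    for i, value in clamp.items():
        state[i] = value
    for _ in range(steps):
        nxt = [1.0 / (1.0 + math.exp(-sum(state[i] * edges[i][j] for i in range(n))))
               for j in range(n)]
        for i, value in clamp.items():
            nxt[i] = value
        if max(abs(a - b) for a, b in zip(nxt, state)) < tol:
            return nxt  # reached a fixed point
        state = nxt
    return state  # no fixed point within the budget (for example, a limit cycle)

# Hypothetical concept order: 0 = rising-power ambition, 1 = fear, 2 = conflict.
chunk_a = [[0.0, 0.8, 0.0],
           [0.0, 0.0, 0.6],
           [0.0, 0.0, 0.0]]
chunk_b = [[0.0, 0.4, 0.2],
           [0.0, 0.0, 1.0],
           [-0.4, 0.0, 0.0]]
mixed = mix_fcms([chunk_a, chunk_b], [0.5, 0.5])
final_state = run_fcm(mixed, [0.0, 0.0, 0.0], clamp={0: 1.0})
```

Mixing is a weighted sum of small matrices, so adding more chunks only adds more terms to the sum; the clamped node plays the role of the concept that is turned on and kept on.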

cs.AI / 99 / 2605.17909

Ethical Hyper-Velocity (EHV): A Provably Deterministic Governance-Aware JIT Compiler Architecture for Agentic Systems

伦理超高速（EHV）：一种可证明的确定性治理感知即时编译器架构用于自主系统

Sharma, Riddhi Mohan

Abstract

As autonomous agentic systems scale across regulated critical infrastructures, the lack of mechanistic, hardware-rooted enforcement for high-frequency policy updates presents a fundamental safety gap. We introduce Ethical Hyper-Velocity (EHV), a novel architectural framework for the formal verification of AI governance policies at runtime. Unlike retrospective auditing frameworks (ISO/IEC 42001, NIST AI RMF) which introduce 14-30 day latencies, EHV relocates the Policy Enforcement Point (PEP) into the inference pipeline via a Governance-Aware Just-In-Time (JIT) Compiler. By integrating Conflict-free Replicated Data Types (CRDTs) for policy synchronization and Epoch-based Attestation Caching within Trusted Execution Environments (TEEs), EHV achieves Sub-millisecond Formal Determinism (SMFD). We demonstrate via TLA+ formal verification that non-compliant agentic actions are computationally unreachable within the system's bounded operating state space. We prove that O(1) runtime enforcement can eliminate the traditional trade-off between deployment velocity and governance integrity, reducing Governance Latency from O(days) to O(1).

Chinese Translation

随着自主代理系统在受监管的关键基础设施中规模化，高频政策更新缺乏机械化、硬件根植的强制执行，导致了根本的安全缺口。我们提出了伦理超高速（EHV），这是一个用于运行时AI治理政策形式验证的新型架构框架。与引入14-30天延迟的回顾性审计框架（ISO/IEC 42001，NIST AI RMF）不同，EHV通过治理感知的即时编译器（JIT Compiler）将政策强制执行点（PEP）重新定位到推理管道中。通过在受信执行环境（TEE）中集成无冲突复制数据类型（CRDTs）以实现政策同步和基于纪元的证明缓存，EHV实现了亚毫秒形式确定性（SMFD）。我们通过TLA+形式验证证明，在系统的有界操作状态空间内，不合规的代理行为在计算上是不可达的。我们证明O(1)的运行时强制执行可以消除部署速度与治理完整性之间的传统权衡，将治理延迟从O（天）减少到O（1）。

cs.AI / 100 / 2605.17946

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

SVFSearch：一个针对游戏垂直领域短视频帧搜索的多模态知识密集基准

Mao, Lingtao, Dai, Huangyu, Sun, Xinyu, Liang, Zihan, Chen, Ben, Lei, Chenyi, Ou, Wenwu

Abstract

Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflow to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.

Chinese Translation

多模态大型语言模型越来越多地被用作理解多模态输入、规划检索动作、调用外部工具和对检索信息进行推理的代理骨干。然而，现有基准很少评估这种能力在短视频应用中的表现，在这些应用中，暂停的帧通常在视觉上模糊不清，回答问题需要垂直领域、长尾和快速发展的领域知识。我们推出了SVFSearch，这是第一个针对中国游戏领域短视频帧搜索的开放基准。SVFSearch包含5000个四选一的测试示例和4198个辅助训练示例，每个示例都围绕来自真实短视频剪辑的暂停游戏场景展开。为了支持公平和可重复的评估，SVFSearch提供了一个冻结的离线检索环境，包含游戏领域文本语料库、主题关联的图像库，以及文本、图像和多模态检索接口，避免依赖不受控制的网络搜索API。我们评估了从直接问答（QA）和检索增强生成（RAG）工作流到计划-行动-重新计划（Plan-Act-Replan）代理和学习搜索模型的代表性范式。结果显示，模型仅回答、实际代理搜索和oracle知识之间存在较大差距：最佳开源直接问答模型达到66.4%，最佳实际代理达到79.1%，而oracle知识达到95.4%。进一步分析揭示了视觉定位、检索质量、基于证据的推理和工具使用行为中的瓶颈，包括过度检索、仅回答快捷方式和检索引发的误导。

cs.AI / 101 / 2605.17967

Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective

调和对大语言模型中监督微调(SFT)有效性的矛盾观点：一种交互视角

Zhang, Junpeng, Cheng, Lei, Zhang, Guoxi, Cai, Hua, Xu, Qing, Zhang, Quanshi

Abstract

This paper explores a scientific question in supervised fine-tuning (SFT): why SFT is broadly effective for small-scale deep neural networks, yet can produce inconsistent or even detrimental effects when applied to large language models (LLMs). Recent advances in interaction-based explanations suggest that interactions between words/tokens provide a faithful metric for quantifying the inference patterns encoded by LLMs. We find that the evolution of interactions during SFT can effectively explain the inconsistent effectiveness of SFT for LLMs. Specifically, we find that (1) SFT primarily removes noise-like interactions, while rarely acquiring reliable new interactions. (2) This denoising stage is extremely brief, after which continued fine-tuning tends to introduce overfitted interactions. We validate these findings across multiple LLMs and datasets. Our findings provide new insights into early stopping and offer practical guidance for LLM training.

Chinese Translation

本文探讨了一个关于监督微调(SFT)的科学问题：为什么SFT在小规模深度神经网络中广泛有效，而在应用于大语言模型(LLMs)时却可能产生不一致甚至有害的效果。基于交互的解释的最新进展表明，单词/标记之间的交互提供了一种可靠的度量，用于量化LLMs编码的推理模式。我们发现，在SFT过程中交互的演变可以有效解释SFT在LLMs中的不一致有效性。具体而言，我们发现：(1) SFT主要去除噪声般的交互，而很少获取可靠的新交互。(2) 这个去噪阶段极其短暂，之后持续的微调往往会引入过拟合的交互。我们在多个LLMs和数据集上验证了这些发现。我们的研究为早期停止提供了新的见解，并为LLM训练提供了实用指导。

cs.AI / 102 / 2605.17976

Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

在贝叶斯优化中释放大型语言模型：面向科学发现的偏好引导框架

Yuan, Xinzhe, Chen, Zhuo, Zhang, Jianshu, Xiong, Huan, Ye, Nanyang, Li, Yuqiang, Gu, Qinying

Abstract

Scientific discovery is increasingly constrained by costly experiments and limited resources, underscoring the need for efficient optimization in AI for science. Bayesian Optimization (BO), though widely adopted for balancing exploration and exploitation, often exhibits slow cold-start performance and poor scalability in high-dimensional settings, limiting its applicability in real-world scientific problems. To overcome these challenges, we propose LLM-Guided Bayesian Optimization (LGBO), the first LLM preference-guided BO framework that continuously integrates the semantic reasoning of large language models (LLMs) into the optimization loop. Unlike prior works that use LLMs only for warm-start initialization or candidate generation, LGBO introduces a region-lifted preference mechanism that embeds LLM-driven preferences into every iteration, shifting the surrogate mean in a stable and controllable way. Theoretically, we prove that LGBO does not perform significantly worse than standard BO in the worst case, while achieving significantly faster convergence when preferences align with the objective. Empirically, LGBO consistently outperforms existing methods across diverse dry benchmarks in physics, chemistry, biology, and materials science. Most notably, in a new wet-lab optimization of Fe-Cr battery electrolytes, LGBO attains 90% of the best observed value within 6 iterations, whereas standard BO and existing LLM-augmented baselines require more than 10. Together, these results suggest that LGBO offers a promising direction for integrating LLMs into scientific optimization workflows.

Chinese Translation

科学发现日益受到高成本实验和有限资源的制约，这突显了在科学领域中高效优化的必要性。尽管贝叶斯优化（Bayesian Optimization, BO）被广泛采用以平衡探索与开发，但其在冷启动性能和高维设置中的可扩展性往往较差，限制了其在现实科学问题中的应用。为了解决这些挑战，我们提出了LLM引导的贝叶斯优化（LLM-Guided Bayesian Optimization, LGBO），这是第一个将大型语言模型（Large Language Models, LLMs）的语义推理持续集成到优化循环中的偏好引导BO框架。与之前仅将LLMs用于热启动初始化或候选生成的研究不同，LGBO引入了一种区域提升的偏好机制，将LLM驱动的偏好嵌入到每一次迭代中，以稳定和可控的方式调整代理均值。从理论上讲，我们证明了LGBO在最坏情况下的表现并不显著劣于标准BO，同时在偏好与目标一致时实现了显著更快的收敛。在实证上，LGBO在物理、化学、生物学和材料科学等多样的干基准测试中始终优于现有方法。最值得注意的是，在对Fe-Cr电池电解质的新湿实验室优化中，LGBO在6次迭代内达到了最佳观察值的90%，而标准BO和现有的LLM增强基线则需要超过10次迭代。总体而言，这些结果表明，LGBO为将LLMs集成到科学优化工作流程中提供了一个有前景的方向。

cs.AI / 103 / 2605.17999

Shared Backbone PPO for Multi-UAV Communication Coverage with Connection Preservation

共享骨干 PPO 在多无人机通信覆盖中的连接保持

Jiang, Z.

Abstract

This paper proposes a Shared Backbone Proximal Policy Optimization (Shared Backbone PPO) algorithm. By sharing the base module between the Actor and Critic networks, the algorithm achieves efficient training and improved performance. The algorithm is implemented in a connectivity-preserving multi-UAV swarm communication coverage task and compared with the standard PPO algorithm. Experimental results demonstrate that the proposed method achieves superior performance. Furthermore, a graph information aggregation module is incorporated into the model architecture to accommodate the communication conditions among agents. With the integration of this module, the algorithm remains effective, and the trained agent swarm exhibits a higher level of cooperation.

Chinese Translation

本文提出了一种共享骨干近端策略优化（Shared Backbone Proximal Policy Optimization，Shared Backbone PPO）算法。通过在演员（Actor）和评论家（Critic）网络之间共享基础模块，该算法实现了高效的训练和性能提升。该算法应用于一个保持连接的多无人机群体通信覆盖任务，并与标准 PPO 算法进行了比较。实验结果表明，所提出的方法具有优越的性能。此外，模型架构中还融入了图信息聚合模块，以适应代理之间的通信条件。通过集成该模块，算法保持有效性，训练后的代理群体表现出更高水平的合作。

cs.AI / 104 / 2605.18025

TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?

TeleCom-Bench：大型语言模型距离工业电信应用还有多远？

Xiao, Jieting, Lin, Yun, Qiu, Huizhen, Ma, Rui, Zhong, Chen, Xu, Dongyang, Long, Xiao, Zhang, Chaoyu, Hao, Qiaobo, Zou, Ding, Yang, Zhiguo, Gao, Yanqin, Tan, Fang

Abstract

While Large Language Models have achieved remarkable integration in various vertical scenarios, their deployment in the telecommunications domain remains exploratory due to the lack of a standardized evaluation framework. Current telecom benchmarks primarily focus on static, foundational knowledge and isolated atomic skills, neglecting the equipment-specific documentation and end-to-end industrial workflows essential for real-world production systems. To bridge this gap, we present TeleCom-Bench, a comprehensive benchmark comprising 12 evaluation sets with 22,678 curated samples, which evaluates LLMs across a synergistic hierarchy: (1) Multi-dimensional Knowledge Comprehension, which integrates telecommunication fundamentals, 3GPP protocols, and 5G network architecture with proprietary product knowledge across wired, core, and wireless networks via knowledge graph-driven synthesis; and (2) End-to-End Knowledge Application, which formalizes six core tasks on authentic trajectories from live network agent workflows, including intent recognition, entity extraction, event verification, tool invocation, root cause analysis, and solution generation, across network optimization and fault maintenance scenarios. Evaluations of eight state-of-the-art LLMs reveal a universal Execution Wall: while models achieve 90% accuracy in linguistic interface tasks such as intent recognition and entity extraction, performance collapses to approximately 30% in procedural execution tasks like solution generation. This capability gap demonstrates that current LLMs function competently as diagnosticians but fail as field engineers. TeleCom-Bench provides standardized diagnostics to precisely pinpoint this deficit, offering actionable guidance for domain-specific alignment toward production-ready telecom agents. The dataset and evaluation code have been released at https://github.com/ZTE-AICloud/TeleCom-Bench.

Chinese Translation

尽管大型语言模型在多个垂直场景中实现了显著的整合，但由于缺乏标准化的评估框架，其在电信领域的应用仍然处于探索阶段。目前的电信基准主要集中于静态的基础知识和孤立的原子技能，忽视了对于真实生产系统至关重要的设备特定文档和端到端工业工作流程。为填补这一空白，我们提出了TeleCom-Bench，这是一个综合性基准，包含12个评估集和22,678个经过精心策划的样本，旨在通过协同层次评估大型语言模型（LLMs）：（1）多维知识理解，整合了电信基础知识、3GPP协议和5G网络架构，以及通过知识图谱驱动的综合，涵盖有线、核心和无线网络的专有产品知识；（2）端到端知识应用，形式化了来自实时网络代理工作流程的六个核心任务，包括意图识别、实体提取、事件验证、工具调用、根本原因分析和解决方案生成，涵盖网络优化和故障维护场景。对八种最先进的LLMs的评估揭示了一个普遍的执行壁垒：尽管模型在意图识别和实体提取等语言接口任务中达到90%的准确率，但在解决方案生成等程序执行任务中的表现却降至约30%。这一能力差距表明，当前的LLMs在诊断方面表现良好，但在现场工程师的角色上则表现不佳。TeleCom-Bench提供了标准化的诊断工具，以精确定位这一缺陷，为领域特定的生产就绪电信代理提供可操作的指导。数据集和评估代码已发布在 https://github.com/ZTE-AICloud/TeleCom-Bench。

cs.AI / 105 / 2605.18035

New Insight of Variance reduce in Zero-Order Hard-Thresholding: Mitigating Gradient Error and Expansivity Contradictions

零阶硬阈值中的方差减少新见解：缓解梯度误差与扩展性矛盾

Yuan, Xinzhe, de Vazelhes, William, Gu, Bin, Xiong, Huan

Abstract

Hard-thresholding is an important type of algorithm in machine learning that is used to solve $\ell_0$ constrained optimization problems. However, the true gradient of the objective function can be difficult to access in certain scenarios, which normally can be approximated by zeroth-order (ZO) methods. The SZOHT algorithm is the only algorithm tackling $\ell_0$ sparsity constraints with ZO gradients so far. Unfortunately, SZOHT has a notable limitation on the number of random directions in ZO gradients due to the inherent conflict between the deviation of ZO gradients and the expansivity of the hard-thresholding operator. This paper approaches this problem by considering the role of variance and provides a new insight into variance reduction: mitigating the unique conflicts between ZO gradients and hard-thresholding. Under this perspective, we propose a generalized variance reduced ZO hard-thresholding algorithm as well as the generalized convergence analysis under standard assumptions. The theoretical results demonstrate the new algorithm eliminates the restrictions on the number of random directions, leading to improved convergence rates and broader applicability compared with SZOHT. Finally, we illustrate the utility of our method on a ridge regression problem as well as black-box adversarial attacks.

Chinese Translation

硬阈值是一种重要的机器学习算法，用于解决 $\ell_0$ 约束优化问题。然而，在某些情况下，目标函数的真实梯度可能难以获取，通常可以通过零阶（ZO）方法进行近似。SZOHT 算法是迄今为止唯一一个使用 ZO 梯度处理 $\ell_0$ 稀疏约束的算法。不幸的是，由于 ZO 梯度的偏差与硬阈值算子的扩展性之间的固有冲突，SZOHT 在随机方向的数量上存在显著限制。本文通过考虑方差的作用来解决这一问题，并提供了方差减少的新见解：缓解 ZO 梯度与硬阈值之间的独特冲突。在这一视角下，我们提出了一种广义的方差减少 ZO 硬阈值算法，并在标准假设下进行了广义收敛分析。理论结果表明，新算法消除了对随机方向数量的限制，相较于 SZOHT 提高了收敛速度并扩大了适用性。最后，我们在岭回归问题以及黑箱对抗攻击中展示了我们方法的实用性。
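
Editor's sketch: the two ingredients named in this abstract, a hard-thresholding operator and a random-direction zeroth-order (ZO) gradient estimate, can be illustrated in a few lines of Python. This is a generic, non-variance-reduced illustration under stated assumptions, not the paper's algorithm and not SZOHT itself: the Gaussian direction sampling, the forward-difference form, and the names `hard_threshold`, `zo_gradient`, and `zo_hard_thresholding_step` are all hypothetical choices made for this example.

```python
import random


def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    if k <= 0:
        return [0.0] * len(x)
    # Indices of the k entries with the largest absolute value.
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


def zo_gradient(f, x, q, mu=1e-4, rng=None):
    """Estimate the gradient of f at x from q random directions.

    Assumption: Gaussian directions with forward differences, a common
    generic ZO estimator (not necessarily the one used in the paper).
    """
    rng = rng if rng is not None else random.Random()
    d = len(x)
    fx = f(x)
    grad = [0.0] * d
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (f(shifted) - fx) / mu  # directional finite difference
        for i in range(d):
            grad[i] += slope * u[i] / q
    return grad


def zo_hard_thresholding_step(f, x, k, lr, q, rng=None):
    """One plain step: ZO gradient descent followed by hard-thresholding."""
    g = zo_gradient(f, x, q, rng=rng)
    return hard_threshold([xi - lr * gi for xi, gi in zip(x, g)], k)
```

The estimate from `zo_gradient` is noisy when `q` is small, and `hard_threshold` can amplify that error because it is not a non-expansive projection. That tension is the conflict the abstract refers to; a variance-reduced estimator would replace `zo_gradient` in this loop.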

cs.AI / 106 / 2605.18048

DocOS: Towards Proactive Document-Guided Actions in GUI Agents

DocOS：面向图形用户界面代理的主动文档引导行动

Liu, Jingjing, Huang, Ziye, Cheng, Zihao, Liu, Zeming, Wu, Jiahong, Guo, Yuhang, Chen, Kehai, Wang, Yunhong, Wang, Haifeng

Abstract

While Graphical User Interface (GUI) agents have shown promising performance in automated device interaction, they primarily depend on static parametric knowledge from pre-training or instruction tuning. This reliance fundamentally limits their ability to handle long-tailed tasks that require explicit procedural knowledge absent from model parameters, often forcing agents to resort to inefficient and brittle trial-and-error exploration. To mitigate this limitation, we introduce Proactive Document-Guided Action for GUI agents in dynamic, open-web environments, a novel paradigm that mirrors human problem-solving by enabling agents to autonomously search for relevant documentation to resolve long-tailed tasks. To evaluate agents' capability in this paradigm, we propose DocOS, a benchmark designed to assess document-guided problem solving in fully interactive environments. DocOS requires agents to autonomously navigate a web browser, locate relevant online documentation, comprehend procedural instructions, and faithfully ground them into executable GUI actions. Extensive experiments reveal that progress is strictly constrained by dual bottlenecks: agents struggle to reliably locate relevant information during proactive search and frequently fail to faithfully ground retrieved instructions into precise actions, pointing toward document-guided interaction as a crucial pathway for enabling self-evolving GUI agents in dynamic environments.

Chinese Translation

尽管图形用户界面（GUI）代理在自动化设备交互中表现出良好的性能，但它们主要依赖于来自预训练或指令调优的静态参数知识。这种依赖根本上限制了它们处理需要明确程序知识的长尾任务的能力，而这些知识在模型参数中并不存在，往往迫使代理采用低效且脆弱的试错探索。为了解决这一局限性，我们提出了面向动态开放网络环境的图形用户界面代理的主动文档引导行动，这一新颖范式通过使代理能够自主搜索相关文档来解决长尾任务，从而反映了人类的问题解决过程。为了评估代理在这一范式中的能力，我们提出了DocOS，一个旨在评估完全互动环境中文档引导问题解决能力的基准。DocOS要求代理自主导航网页浏览器，定位相关的在线文档，理解程序指令，并将其忠实地转化为可执行的图形用户界面行动。广泛的实验表明，进展受到双重瓶颈的严格限制：代理在主动搜索过程中难以可靠地定位相关信息，并且经常无法将检索到的指令忠实地转化为精确的行动，这表明文档引导交互是使自我进化的图形用户界面代理在动态环境中得以实现的关键途径。

cs.AI / 107 / 2605.18077

LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

基于大型语言模型引导的合作多智能体强化学习通信

Bae, Sangjun, Park, Yisak, Lee, Sanghyeon, Han, Seungyul

Abstract

Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information exchange or fail to transmit sufficient state information. To address this, we propose LLM-driven Multi-Agent Communication (LMAC), which leverages an LLM's reasoning capability to design a communication protocol that enables all agents to reconstruct the underlying state as accurately and uniformly as possible. LMAC iteratively refines the protocol using an explicit state-awareness criterion, improving state recovery while narrowing differences in agents' knowledge. Experiments on diverse MARL benchmarks show that LMAC improves state reconstruction across agents and yields substantial performance gains over prior communication baselines.

Chinese Translation

通信是多智能体强化学习（MARL）中一个关键组成部分，有助于缓解部分可观测性的问题。然而，之前的方法往往依赖于低效的信息交换，或未能传递足够的状态信息。为了解决这一问题，我们提出了基于大型语言模型驱动的多智能体通信（LLM-driven Multi-Agent Communication, LMAC），该方法利用大型语言模型的推理能力设计通信协议，使所有智能体能够尽可能准确和统一地重建潜在状态。LMAC通过明确的状态意识标准迭代优化协议，改善状态恢复的同时缩小智能体知识之间的差异。在多种MARL基准测试中的实验表明，LMAC在智能体之间改善了状态重建，并在性能上显著超越了之前的通信基线。

cs.AI / 108 / 2605.18094

Learning to Solve Compositional Geometry Routing Problems

学习解决组合几何路由问题

Fan, Mingfeng, Zhou, Jianan, Cheng, Jiaqi, Zhang, Yifeng, Zhang, Jie, Sartoretti, Guillaume Adrien

Abstract

We study the Compositional Geometry Routing Problem (CGRP), a unified superclass of traditional routing problems that covers point-only, line-only, area-only, and arbitrary hybrid task geometries, providing a broad abstraction for real-world routing scenarios. Beyond standard point-based routing, CGRP with non-point tasks can be inherently asymmetric, tightly couples travel routes with the intrinsic path, and enlarges the action space with numerous feasible yet often irrelevant options, thereby posing significant challenges for both representation learning and decision-making. To address these challenges, we propose DiCon, a differential attention-assisted solver with contrastive learning, as a plug-and-play framework that tackles the problem from two complementary angles. First, we introduce a differential attention mechanism that actively suppresses the probability mass on less competitive candidate actions. Second, we design a double-level contrastive learning objective to promote robust global instance representations and regularize geometry-aware task representations. Extensive experiments demonstrate that DiCon achieves strong performance, broad versatility, and superior generalization across diverse CGRP instances with different compositions.

Chinese Translation

我们研究了组合几何路由问题（Compositional Geometry Routing Problem, CGRP），这是一个统一的传统路由问题超类，涵盖了仅点、仅线、仅区域以及任意混合任务几何形状，为现实世界的路由场景提供了广泛的抽象。与标准的基于点的路由不同，具有非点任务的CGRP可能固有地是非对称的，紧密耦合的旅行路线与内在路径，并且通过大量可行但往往无关的选项扩大了行动空间，从而对表示学习和决策制定提出了重大挑战。为了解决这些挑战，我们提出了DiCon，一种基于差异注意力的辅助求解器，结合对比学习，作为一个即插即用的框架，从两个互补的角度来解决问题。首先，我们引入了一种差异注意力机制，主动抑制在竞争力较低的候选动作上的概率质量。其次，我们设计了双层对比学习目标，以促进稳健的全局实例表示并规范几何感知的任务表示。大量实验表明，DiCon在不同组合的多样化CGRP实例中实现了强大的性能、广泛的适应性和优越的泛化能力。

cs.AI / 109 / 2605.18104

Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

多模态大语言模型中的安全几何崩溃与自适应漂移校正

Guo, Jiahe, Guo, Xiangran, Chen, Jiaxuan, Zhao, Weixiang, Zhao, Yanyan, Hou, Yutai, Wang, Qianchao, Tu, Dandan, Qin, Bing

Abstract

Multimodal large language models (MLLMs) often fail to transfer safety capabilities learned in the text modality to semantically equivalent non-text inputs, revealing a persistent multimodal safety gap. We study this gap from a representation-geometric perspective by analyzing a text-aligned refusal direction and a modality-induced drift direction. We show that multimodal inputs compress the usable separation along the refusal direction, making it no longer reliable for identifying and refusing harmful inputs. We refer to this failure mode as Safety Geometry Collapse. We quantify it through conditional refusal separability and show that stronger modality-induced drift is consistently associated with weaker refusal separability and higher attack success rates. We then validate the causal role of modality-induced drift through a fixed-strength activation intervention: counteracting the estimated drift restores refusal separability and improves multimodal safety. After drift correction, we further observe self-rectification, where the model recovers its ability to recognize and refuse harmful multimodal inputs during forward dynamics. This effect also provides an internal signal of the model's perceived harmfulness of each input. Motivated by this signal, we propose ReGap, a training-free inference-time method that adaptively corrects modality drift using self-rectification. Experiments across multiple multimodal safety benchmarks and utility benchmarks demonstrate the effectiveness of ReGap, which significantly improves the safety of MLLMs without compromising general capabilities. Our findings highlight representation-level modality alignment as a crucial direction for real-time safety improvement and for building safer, more reliable MLLMs.

Chinese Translation

多模态大语言模型（MLLMs）在将文本模态中学习到的安全能力转移到语义上等价的非文本输入时，常常失败，这揭示了一个持续存在的多模态安全差距。我们从表征几何的角度研究这一差距，通过分析与文本对齐的拒绝方向和模态引起的漂移方向。我们表明，多模态输入压缩了沿拒绝方向的可用分离，使其不再可靠于识别和拒绝有害输入。我们将这种失败模式称为安全几何崩溃。我们通过条件拒绝可分离性对其进行量化，并表明更强的模态引起的漂移与更弱的拒绝可分离性和更高的攻击成功率之间存在一致的关联。然后，我们通过固定强度的激活干预验证模态引起的漂移的因果作用：抵消估计的漂移恢复了拒绝可分离性并改善了多模态安全性。在漂移校正后，我们进一步观察到自我修正现象，即模型在前向动态中恢复了识别和拒绝有害多模态输入的能力。这一效应还提供了模型对每个输入感知有害性的内部信号。基于这一信号，我们提出了ReGap，一种无训练的推理时方法，利用自我修正自适应地校正模态漂移。跨多个多模态安全基准和效用基准的实验表明ReGap的有效性，它显著提高了MLLMs的安全性而不损害其一般能力。我们的发现强调了表征层面的模态对齐作为实时安全改进和构建更安全、更可靠的MLLMs的重要方向。
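
Editor's sketch: the "fixed-strength activation intervention" in this abstract can be pictured as shifting a hidden state against an estimated drift direction. The Python below is only a toy illustration on plain lists. Estimating drift as the normalized mean difference between multimodal and text hidden states is an assumption made here, as are the names `estimate_drift` and `correct_drift` and the strength parameter `alpha`; the abstract does not specify how the paper estimates drift or how ReGap adapts the correction.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale v to unit Euclidean norm (v must be non-zero)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def mean_vector(vectors):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def estimate_drift(text_states, multimodal_states):
    """Hypothetical drift estimate: unit mean difference between modalities."""
    mean_text = mean_vector(text_states)
    mean_multi = mean_vector(multimodal_states)
    return unit([m - t for t, m in zip(mean_text, mean_multi)])


def correct_drift(hidden, drift_dir, alpha):
    """Fixed-strength intervention: move hidden alpha units against the drift."""
    return [h - alpha * d for h, d in zip(hidden, drift_dir)]


def projection(hidden, direction):
    """Signed length of hidden along a (normalized) direction."""
    return dot(hidden, unit(direction))
```

Note that subtracting the same vector from every hidden state does not change the gap between harmful and benign projections at that layer; in this toy setting the shift is all that is shown, and any effect on refusal separability in a real model would have to arise in the layers that follow.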

cs.AI / 110 / 2605.18109

TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

TaskGround：用于全场景家庭推理的结构化可执行任务推断

Feng, ZhiYuan, Deng, Yu, An, Ruichuan, Liu, Zhenhua, Li, Qixiu, Wu, Keming, Du, Zhiying, Wang, Weijie, Wang, Haoxiao, Chen, Shuang, Xu, Sicheng, Liang, Yaobo, Yang, Jiaolong, Guo, Baining

Abstract

In real home deployments, household agents must often operate from a complete household scene and a situated household request, rather than from a clean task specification. Such requests require agents to identify task-relevant entities, recover intended task conditions, and resolve ordering constraints from the surrounding scene context. We formalize this capability as full-scene household reasoning: given a complete household scene and a situated household request, an agent must infer executable task structure before producing a grounded skill-level action sequence. This setting is challenging because complete household scenes contain substantial task-irrelevant information, making direct complete-scene prompting inefficient and error-prone. In practical deployment, this challenge is further amplified by privacy and local compute constraints, which favor compact open-weight models with limited long-context reasoning ability. We propose TaskGround, a training-free and model-agnostic Ground-Infer-Execute framework that grounds complete scenes into compact task-relevant scene slices, infers executable task structure, and compiles it into grounded skill-level action sequences. To evaluate this setting, we introduce FullHome, a human-validated evaluation suite of 400 household tasks spanning diverse home-scale environments and both goal-oriented and process-constrained requirements. On FullHome, TaskGround improves task success rates by large margins across both proprietary and open-weight models. Notably, it makes Qwen3.5-9B competitive with GPT-5 under direct complete-scene prompting while reducing total input-token cost by up to 18x. Our results identify executable task-structure inference as a central bottleneck in full-scene household reasoning and show that structured grounding can make compact local models substantially more effective for practical household deployment.

Chinese Translation

在实际家庭部署中，家庭代理通常必须基于完整的家庭场景和特定的家庭请求进行操作，而不是依赖于干净的任务规范。这类请求要求代理识别与任务相关的实体，恢复预期的任务条件，并从周围场景上下文中解决顺序约束。我们将这种能力形式化为全场景家庭推理：给定一个完整的家庭场景和一个特定的家庭请求，代理必须推断出可执行的任务结构，然后生成一个基于实际情况的技能级动作序列。这一设置具有挑战性，因为完整的家庭场景包含大量与任务无关的信息，使得直接的完整场景提示效率低下且容易出错。在实际部署中，这一挑战因隐私和本地计算限制而进一步加剧，这些限制更倾向于使用具有有限长上下文推理能力的紧凑型开放权重模型。我们提出了TaskGround，一个无训练且与模型无关的Ground-Infer-Execute框架，它将完整场景转化为紧凑的与任务相关的场景切片，推断可执行的任务结构，并将其编译成基于实际情况的技能级动作序列。为了评估这一设置，我们引入了FullHome，一个经过人工验证的评估套件，涵盖400个家庭任务，涉及多样的家庭环境以及目标导向和过程约束的需求。在FullHome上，TaskGround在专有和开放权重模型中均显著提高了任务成功率。值得注意的是，它使得Qwen3.5-9B在直接完整场景提示下与GPT-5具备竞争力，同时将总输入令牌成本降低了多达18倍。我们的结果表明，可执行任务结构推断是全场景家庭推理中的一个核心瓶颈，并显示结构化的基础可以使紧凑的本地模型在实际家庭部署中显著更有效。

cs.AI / 111 / 2605.18128

POST: Prior-Observation Adversarial Learning of Spatio-Temporal Associations for Multivariate Time Series Anomaly Detection

POST：多变量时间序列异常检测的时空关联先验观察对抗学习

Zhang, Suofei, Zheng, Yaxuan, Hu, Haifeng

Abstract

Existing Multivariate Time Series Anomaly Detection (MTSAD) frameworks increasingly rely on integrating Graph Neural Networks (GNNs) with sequence models to capture complex spatio-temporal dependencies. However, less attention is paid to the spatial over-generalization problem, where unconstrained structural modeling indiscriminately reconstructs anomalies, inevitably degrading detection recall. To tackle this problem, we propose a novel framework that unifies spatio-temporal modeling through a joint prior-observation adversarial learning paradigm. In the spatial dimension, the model alternately learns adjacency matrices as structural prior and models the association discrepancy between prior and data-driven observation in a minimax manner during training. Such adversarial optimization not only improves the model sensitivity for time-wise detection, but also enables the model to localize anomalies to specific channels. To systematically evaluate this anomaly localization capability, we further construct a synthetic benchmark equipped with precise channel-wise annotations. Extensive experiments across public datasets and our dedicated benchmark demonstrate that the proposed framework establishes a new state-of-the-art in both time-wise detection and spatial localization tasks. Our code, pre-trained models, and benchmark are publicly available at https://github.com/anocodetest1/POST.

Chinese Translation

现有的多变量时间序列异常检测（MTSAD）框架越来越依赖于将图神经网络（GNN）与序列模型结合，以捕捉复杂的时空依赖关系。然而，空间过度泛化问题却受到的关注较少，该问题导致不受限制的结构建模无差别地重构异常，必然降低检测的召回率。为了解决这一问题，我们提出了一种新颖的框架，通过联合先验观察对抗学习范式统一时空建模。在空间维度上，该模型交替学习邻接矩阵作为结构先验，并在训练过程中以极小极大方式建模先验与数据驱动观察之间的关联差异。这种对抗优化不仅提高了模型在时间检测上的敏感性，还使模型能够将异常定位到特定通道。为了系统地评估这种异常定位能力，我们进一步构建了一个配备精确通道注释的合成基准。针对公共数据集和我们专门的基准进行的广泛实验表明，所提出的框架在时间检测和空间定位任务上均建立了新的最先进水平。我们的代码、预训练模型和基准可在 https://github.com/anocodetest1/POST 上公开获取。

cs.AI / 112 / 2605.18143

Generative AI and the Productivity Divide: Human-AI Complementarities in Education

生成性人工智能与生产力差距：教育中的人机互补性

Idan, Lihi, Anand, Bharat

Abstract

Generative Artificial Intelligence (GenAI) is transforming how firms create, process, and apply knowledge, yet little is known about the heterogeneity of its productivity effects across users. We report results from a randomized controlled experiment in which participants, analogs of early-career knowledge workers, were assigned to self-study a technical domain using either traditional resources or large-language-model (LLM) assistance. On average, GenAI access significantly increased task performance, but the distribution of gains was highly uneven. Improvements were not predicted by GPA or prior knowledge, but by AI Interaction Competence (AIC), the ability to elicit, filter, and verify model outputs. High-AIC participants realized outsized gains; low-AIC participants saw limited or even negative marginal returns. A scaffolding intervention (conceptual maps) reduced outcome variance, indicating that standardized workflows can mitigate inequality in AI-mediated performance. We interpret these findings through the lens of human-AI complementarities: GenAI raises mean productivity while introducing a new axis of capability inequality. Managerially, firms should pair GenAI access with short AIC micro-training and simple standard operating procedures to capture value consistently and avoid uneven adoption outcomes.

Chinese Translation

生成性人工智能（GenAI）正在改变企业创造、处理和应用知识的方式，但关于其在用户之间生产力效应的异质性知之甚少。我们报告了一项随机对照实验的结果，参与者是早期职业知识工作者的类比，他们被分配使用传统资源或大型语言模型（LLM）辅助自学一个技术领域。平均而言，GenAI的使用显著提高了任务表现，但收益的分布极为不均。绩点（GPA）或先前知识并未预测改进，而是由人工智能互动能力（AIC），即引导、过滤和验证模型输出的能力所决定。高AIC参与者获得了超额收益；低AIC参与者则看到有限甚至负的边际收益。一项支架干预（概念图）减少了结果的方差，表明标准化工作流程可以缓解AI介导表现中的不平等。我们通过人机互补性的视角解读这些发现：GenAI提高了平均生产力，同时引入了新的能力不平等轴线。在管理上，企业应将GenAI的使用与短期AIC微培训和简单的标准操作程序相结合，以持续捕获价值并避免不均衡的采用结果。

cs.AI / 113 / 2605.18144

Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine

基于证据的前沿映射与纳米医学中的自主假设生成

Viviers, Christiaan G. A., de Bruin, Koen, Trines, Mirre M., Hokke, Ayla M., van der Meel, Roy, Schroeder, Avi, Lammers, Twan, Mulder, Willem J. M., van der Sommen, Fons

Abstract

Nanomedicine research spans delivery chemistry, immunology, imaging, biomaterials, and disease-specific translational science, yet its conceptual design space remains fragmented across a large and heterogeneous literature. To date, artificial intelligence in nanomedicine has focused primarily on property prediction and formulation optimization, with much less attention to evidence-grounded discovery support at the level of research direction selection. We introduce pArticleMap, a literature-mapping and research-hypothesis-generation system that combines article embeddings, similarity-graph analysis, sparse frontier extraction, structured evidence-pack retrieval, and an audited large-language-model (LLM) workflow for grounded ideation. Rather than forecasting future concept co-occurrence, pArticleMap targets low-density article-level bridge regions and cluster interfaces, then generates and scores citation-grounded hypotheses with large language models in an agentic setup. We evaluate the system with a retrospective realization benchmark (generate later literature under a historical cutoff) and a blinded human reader assessment layer across cue-conditioned nanomedicine tasks. Across 4 selected retrospective bundles, pArticleMap generated ideas and selected task-retained hypotheses (winner ideas) under the benchmark protocol. For task-level retained hypotheses, a pooled gold recovery rate of 10.8% was obtained, with a recall@10 of 15.9% and a future-neighborhood rate of 61.0%, indicating that the system often reached the correct forward-looking neighborhood (paper ideas) even without exact paper-level recovery. Human-agent agreement is modest overall, indicating that internal scoring is useful as a support signal but does not replace expert judgment. These results position pArticleMap as a conservative, evidence-grounded research assistant for nanomedicine.

Chinese Translation

纳米医学研究涵盖了递送化学、免疫学、成像、生物材料以及特定疾病的转化科学，但其概念设计空间在大量异质文献中仍然碎片化。迄今为止，纳米医学中的人工智能主要集中在属性预测和配方优化上，而在研究方向选择层面的基于证据的发现支持上关注较少。我们介绍了 pArticleMap，一个文献映射和研究假设生成系统，它结合了文章嵌入、相似性图分析、稀疏前沿提取、结构化证据包检索以及经过审计的大型语言模型（LLM）工作流程，以实现基于证据的创意生成。pArticleMap 的目标不是预测未来概念的共现，而是针对低密度的文章级桥接区域和聚类接口，然后在自主设置中生成并评分基于引用的假设。我们通过回顾性实现基准（在历史截止日期下生成后续文献）和针对提示条件的纳米医学任务的盲人读者评估层来评估该系统。在4个选定的回顾性包中，pArticleMap 在基准协议下生成了想法并选择了任务保留的假设（优胜想法）。对于任务级保留的假设，获得了10.8%的汇总黄金回收率，recall@10为15.9%，未来邻域率为61.0%，这表明该系统即使在没有精确的论文级回收的情况下，仍然经常达到了正确的前瞻性邻域（论文想法）。总体而言，人类与代理之间的协议适中，表明内部评分作为支持信号是有用的，但并不能替代专家判断。这些结果将 pArticleMap 定位为纳米医学中一个保守的、基于证据的研究助手。

cs.AI / 114 / 2605.18150

Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework

噪声中的低语：通过多智能体框架的替代引导概念觉醒

Sun, Mengyu, Yang, Ziyuan, Zhou, Zunlong, Liu, Junxu, Hu, Haibo, Zhang, Yi

Abstract

Diffusion models (DMs) are widely used for text-to-image generation, but their strong generative capabilities also raise concerns about unsafe or undesirable content. Concept erasure aims to mitigate these risks by removing specific concepts from pretrained models. However, recent studies show that such methods often suppress rather than fully eliminate target concepts, leaving models vulnerable to awakening attacks. Existing approaches primarily rely on white-box access through optimization or inversion, while concept awakening under black-box constraints remains underexplored. In this work, we revisit the denoising process from a trajectory perspective and show that concept erasure mainly disrupts early-stage text-semantic alignment but does not fully prevent semantic information from propagating along the denoising dynamics. As generation proceeds, the model increasingly depends on the evolving noisy state rather than textual conditions, which creates an opportunity to bypass erased mappings. Motivated by this observation, we propose ConceptAgent, a training-free, black-box, multi-agent framework that awakens erased concepts by initializing the denoising trajectory from surrogate-guided noisy states. Extensive experiments demonstrate that ConceptAgent enables accurate and controllable awakening of erased concepts under black-box settings without access to model parameters, gradients, or internal representations. These results highlight fundamental limitations of current concept erasure methods and provide new insights into the dynamic nature of semantic control in DMs.

Chinese Translation

扩散模型（DMs）广泛应用于文本到图像的生成，但其强大的生成能力也引发了对不安全或不良内容的担忧。概念抹除旨在通过从预训练模型中去除特定概念来减轻这些风险。然而，最近的研究表明，这些方法往往是抑制目标概念，而不是完全消除，导致模型容易受到觉醒攻击。现有方法主要依赖于通过优化或反演进行的白盒访问，而在黑盒约束下的概念觉醒仍然未被充分探索。在本研究中，我们从轨迹的角度重新审视去噪过程，并表明概念抹除主要干扰早期的文本语义对齐，但并未完全阻止语义信息沿去噪动态传播。随着生成的进行，模型越来越依赖于不断演变的噪声状态，而不是文本条件，这为绕过已抹除的映射创造了机会。基于这一观察，我们提出了ConceptAgent，一个无训练、黑盒、多智能体框架，通过从替代引导的噪声状态初始化去噪轨迹来觉醒已抹除的概念。大量实验表明，ConceptAgent能够在黑盒环境下准确且可控地觉醒已抹除的概念，而无需访问模型参数、梯度或内部表示。这些结果突显了当前概念抹除方法的基本局限性，并为扩散模型中语义控制的动态特性提供了新的见解。

cs.AI / 115 / 2605.18163

TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction

TRACE：基于跨层证据的轨迹修正以减少幻觉

Ranade, Tej Sanibh

Abstract

Hallucination correction is not a one-direction problem. We show that intermediate layers are neither uniformly more truthful than final layers nor uniformly less trustworthy. Yet hallucination reduction is usually instantiated through one fixed intervention form: contrast one layer against another, steer along a truthfulness direction, or defer to external evidence. This framing is structurally incomplete. Cross-layer factual evidence does not evolve uniformly: in some failures truthful support is present internally and later suppressed, whereas in others candidate competition remains genuinely multi-directional across depth, so no single signed scalar family is generally sufficient. We introduce Trajectory Correction from Cross-layer Evidence for Hallucination Reduction (TRACE), a deterministic, training-free algorithm which corrects hallucinations at inference time by deriving both the corrective layer and the appropriate correction operator from each input's cross-layer candidate trajectory inside the LLM's own forward pass. Under one frozen hyperparameter setting, TRACE selects among scalar reversal, earlier-state recovery, and candidate-space correction using only model-internal evidence. Evaluated as a single universal algorithm across 15 models, 8 model families, and 3 factuality benchmarks, TRACE improves every evaluation cell, yielding mean gains of +12.26 MC1 points and +8.65 MC2-style points with no regressions, with gains reaching +47.20 MC1 and +43.38 MC2-style points. The method uses no labels, retrieval, pretraining, finetuning, or per-model calibration.

Chinese Translation

幻觉修正并非单向问题。我们展示了中间层既不是始终比最终层更真实，也不是始终不可信。然而，幻觉减少通常通过一种固定的干预形式来实现：对比一个层与另一个层，沿着真实度方向引导，或依赖外部证据。这种框架在结构上是不完整的。跨层事实证据并不均匀演变：在某些失败中，真实支持在内部存在并随后被抑制，而在其他情况下，候选竞争在深度上仍然是真正多方向的，因此没有单一的有符号标量族通常是足够的。我们引入了基于跨层证据的轨迹修正方法（TRACE），这是一种确定性的、无训练的算法，通过从每个输入的跨层候选轨迹中推导修正层和适当的修正算子，在推理时修正幻觉。在一个冻结的超参数设置下，TRACE仅使用模型内部证据，在标量反转、早期状态恢复和候选空间修正之间进行选择。作为一种通用算法在15个模型、8个模型家族和3个事实性基准上进行评估，TRACE改善了每个评估单元，平均提高了+12.26 MC1分和+8.65 MC2风格分，没有回归，增益达到+47.20 MC1和+43.38 MC2风格分。该方法不使用标签、检索、预训练、微调或每个模型的校准。

cs.AI / 116 / 2605.18172

Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

可视化不可见：生成视觉定位赋能多模态大语言模型中的通用脑电图理解

Pan, Junyu, Wang, Yansen, Zhang, Enze, Lu, Baoliang, Zheng, Weilong, Li, Dongsheng

Abstract

Leveraging the universal representations of pre-trained LLMs and MLLMs offers a promising path toward brain foundation models. However, visually-evoked EEG datasets remain scarce, leading existing methods to align neural signals mainly with abstract text, a lossy translation that may discard fine-grained perceptual information encoded in brain activity. We propose Generative Visual Grounding (GVG), a framework that visualizes the invisible by using an EEG-to-image generative model as a visual translator. Instead of forcing EEG into text alone, GVG hallucinates instance-specific proxy images for non-visual EEG, providing structured visual contexts that allow MLLMs to exploit their visual priors for clinical-state interpretation. We validate this idea on two MLLM backbones, GVG-X-Omni and GVG-Janus. Image-only alignment is already competitive: the lightweight GVG-X-Omni matches 1.7B-parameter text-aligned baselines while tuning only 170M parameters on a frozen 7B backbone. We further extend GVG-Janus with trimodal Image+Text alignment, where text supplies categorical semantic anchors and visual proxies enrich neural representations with perceptual details. Experiments show consistent gains in EEG understanding and visual generation, suggesting visual proxy grounding as an effective complement to textual alignment.

Chinese Translation

利用预训练的大语言模型（LLMs）和多模态大语言模型（MLLMs）的通用表示为脑基础模型提供了一条有前景的路径。然而，视觉诱发的脑电图（EEG）数据集仍然稀缺，导致现有方法主要将神经信号与抽象文本对齐，这种有损的翻译可能会丢失编码在脑活动中的细粒度感知信息。我们提出了生成视觉定位（Generative Visual Grounding, GVG）框架，通过使用EEG到图像的生成模型作为视觉翻译器，来可视化不可见的内容。GVG并不单纯地将EEG强制转换为文本，而是为非视觉EEG幻觉生成特定实例的代理图像，提供结构化的视觉上下文，使得MLLMs能够利用其视觉先验进行临床状态的解释。我们在两个MLLM骨干网络上验证了这一思想，分别是GVG-X-Omni和GVG-Janus。仅图像对齐的效果已经具有竞争力：轻量级的GVG-X-Omni在冻结的7B骨干网络上仅调整170M参数，就能与1.7B参数的文本对齐基线相匹配。我们进一步扩展了GVG-Janus，采用三模态的图像+文本对齐，其中文本提供类别语义锚点，视觉代理则丰富了神经表示的感知细节。实验表明，在EEG理解和视觉生成方面均取得了一致的提升，表明视觉代理定位作为文本对齐的有效补充。

cs.AI / 117 / 2605.18181

Scalable Environments Drive Generalizable Agents

可扩展环境驱动可泛化代理

Zhang, Jiayi, Kong, Fanqi, Zhang, Guibin, Song, Maojia, Yu, Zhaoyang, Ruan, Jianhao, Xiang, Jinyu, Liu, Bang, Wu, Chenglin, Luo, Yuyu

Abstract

Generalizable agents should adapt to diverse tasks and unseen environments beyond their training distribution. This position paper argues that such generalization requires environment scaling: expanding the distribution of executable rule-sets that agents interact with, rather than only increasing trajectories or tasks within fixed benchmarks. Current scaling practices largely focus on collecting more experience or broader task sets under fixed interaction rules, leaving agents brittle when underlying interfaces, dynamics, observations, or feedback signals change. The core challenge is therefore a world-level distribution shift: agents need systematic exposure to environments with meaningfully different executable rule-sets. To clarify this challenge, we propose a unified taxonomy that separates trajectory scaling, task scaling, and environment scaling by their primary deliverables and by what changes in the executable rule-set. Building on this taxonomy, we synthesize construction paradigms for scalable environments, contrasting programmatic generators that prioritize controllability and verifiability with generative world models that offer broader coverage and open-endedness. We further outline how environment scaling can be coupled with stateful learning mechanisms, emphasizing learned update rules for cross-environment adaptation. We conclude by discussing alternative perspectives and argue that scalable environments provide the essential substrate for measurable and controllable progress toward robust general agents.

Chinese Translation

可泛化的代理应能够适应多样化的任务和超出其训练分布的未见环境。本文提出的观点认为，这种泛化需要环境扩展：扩大代理与之互动的可执行规则集的分布，而不仅仅是在固定基准下增加轨迹或任务。目前的扩展实践主要集中在在固定交互规则下收集更多经验或更广泛的任务集，这使得代理在基础接口、动态、观察或反馈信号发生变化时变得脆弱。因此，核心挑战是世界级分布的转变：代理需要系统地接触具有显著不同可执行规则集的环境。为明确这一挑战，我们提出了一个统一的分类法，将轨迹扩展、任务扩展和环境扩展按其主要成果及可执行规则集的变化进行区分。在此分类法的基础上，我们综合了可扩展环境的构建范式，对比了优先考虑可控性和可验证性的程序生成器与提供更广泛覆盖和开放性生成的世界模型。我们进一步概述了如何将环境扩展与状态学习机制结合，强调了用于跨环境适应的学习更新规则。最后，我们讨论了替代视角，并认为可扩展环境为朝着稳健的可泛化代理实现可测量和可控的进展提供了必要的基础。

cs.AI / 118 / 2605.18191

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

成对偏好奖励与基于群体的多样性增强用于优越的开放式生成

Cao, Guining, Peng, Jiaxin, Zeng, Chu, Zhao, Yu, Song, Shuangyong, Yongxiang

Abstract

Current reinforcement learning (RL) methods are broadly applicable and powerful in verifiable settings where scalar rewards can be provided. However, in open-ended generation tasks, verifying the correctness of responses remains challenging, and training reward models incurs substantial computational and annotation costs. Moreover, reinforcement learning with verifiable rewards (RLVR) often leads to diversity collapse and produces stereotypical or rigid outputs, outcomes that are particularly undesirable in open-domain scenarios. We propose Pairwise Preference Reward and Group-based Diversity Enhancement (PPR-GDE), an RL method that is better suited to open-ended generation. PPR-GDE does not require scalar rewards and incorporates group-level diversity into the reward signal. It preserves the comparative structure of subjective evaluation through a pairwise preference reward, mitigates judge position bias via repeated comparisons with swapped response order, and introduces a group-based diversity reward that explicitly encourages semantic dispersion within a response group. All of these reward signals are integrated into a unified group-relative policy optimization objective. We instantiate PPR-GDE on a role-playing task; experiments show that PPR-GDE achieves better alignment quality and expressive diversity than strong RL baselines. Further analysis shows that pairwise preference is critical for preference alignment from a subjective perspective, while the diversity metric plays an essential role in achieving superior expressive diversity and broader semantic coverage.

Chinese Translation

当前的强化学习（RL）方法在可以提供标量奖励的可验证环境中广泛适用且功能强大。然而，在开放式生成任务中，验证响应的正确性仍然具有挑战性，训练奖励模型会产生大量的计算和标注成本。此外，强化学习（RLVR）往往导致多样性崩溃，产生刻板或僵化的输出，这在开放领域场景中特别不受欢迎。我们提出了成对偏好奖励与基于群体的多样性增强（PPR-GDE），这是一种更适合开放式生成的RL方法。PPR-GDE不需要标量奖励，并将群体层面的多样性纳入奖励信号中，通过成对偏好奖励保留主观评估的比较结构，通过交换响应顺序的重复比较减轻评审位置偏差，并引入一种基于群体的多样性奖励，明确鼓励响应群体内的语义分散，所有这些奖励信号都整合到一个统一的群体相对政策优化目标中。我们在角色扮演任务上实例化了PPR-GDE，实验表明PPR-GDE在对齐质量和表现多样性方面优于强大的RL基线。进一步分析表明，成对偏好对于主观视角下的偏好对齐至关重要，而多样性指标在实现优越的表现多样性和更广泛的语义覆盖方面发挥着重要作用。
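
The reward construction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `judge` callable, the use of cosine distance over response embeddings, the `diversity_weight` mixing coefficient, and the mean/std normalization are all assumptions made for illustration.

```python
import math
from itertools import combinations


def pairwise_preference_rewards(responses, judge):
    """Win rate of each response against the other members of its group.

    `judge(first, second)` is a hypothetical callable that returns 1.0 when
    the response shown first is preferred, 0.0 when the second is preferred,
    and 0.5 for a tie. Every pair is judged twice with the order swapped, and
    the two verdicts are averaged to cancel out position bias.
    """
    n = len(responses)
    wins = [0.0] * n
    for i, j in combinations(range(n), 2):
        forward = judge(responses[i], responses[j])         # i shown first
        backward = 1.0 - judge(responses[j], responses[i])  # i shown second
        score_i = 0.5 * (forward + backward)
        wins[i] += score_i
        wins[j] += 1.0 - score_i
    return [w / (n - 1) for w in wins]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def diversity_rewards(embeddings):
    """Mean cosine distance from each response to the rest of its group."""
    n = len(embeddings)
    return [
        sum(1.0 - cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i)
        / (n - 1)
        for i in range(n)
    ]


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within the group (GRPO-style normalization)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def combined_advantages(responses, embeddings, judge, diversity_weight=0.5):
    """Mix preference and diversity rewards, then normalize within the group."""
    pref = pairwise_preference_rewards(responses, judge)
    div = diversity_rewards(embeddings)
    total = [p + diversity_weight * d for p, d in zip(pref, div)]
    return group_relative_advantages(total)
```

Under these assumptions, a judge that always prefers whichever response it sees first yields a win rate of 0.5 for every response, which is the intended effect of the order swap.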

cs.AI / 119 / 2605.18194

Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

超越笛卡尔幻觉：在感知瓶颈下测试两阶段多模态心智理论

Zhou, Yajing, Kong, Xiangyu

Abstract

While Multi-Modal Large Language Models (MLLMs) demonstrate impressive capabilities in general reasoning, their embodied spatial intelligence remains hampered by a "Cartesian Illusion" - a reliance on text-based probability distributions that lack grounded, 3D topological understanding. This limitation is starkly exposed in multi-agent environments, which demand more than just scene perception; they require second-order Theory of Mind (ToM). Specifically, an Agent A must be able to infer Agent B's belief about the environment, governed strictly by Agent B's physical orientation and sensory limitations. In this paper, we probe the limits of two-stage spatial inference in MLLMs through a novel audio-visual task: requiring Agent A to predict Agent B's estimation of A's relative location. To solve this, we propose an Epistemic Sensory Bottleneck module that abandons rigid, rule-based coordinate transformations. Instead, we introduce an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought (CoT). This guides the MLLM through a "geometric-to-semantic" projection, forcing it to first establish B's local coordinate system and then dynamically weight visual and auditory modalities based on whether A falls within B's visual frustum. Extensive evaluations reveal that while current MLLMs fundamentally struggle with spatial symmetry and out-of-view ambiguities (establishing a rigorous zero-shot baseline of 42% accuracy), our sensory-bounded reasoning chain robustly outperforms pure egocentric and allocentric baselines. By systematically benchmarking these perceptual bottlenecks, our work exposes the current limits of MLLM spatial reasoning and establishes a foundational paradigm for epistemic, modality-aware inference in Embodied AI.

Chinese Translation

尽管多模态大型语言模型（MLLMs）在一般推理方面展现出令人印象深刻的能力，但其具身空间智能仍受到“笛卡尔幻觉”的制约——依赖于缺乏扎根的三维拓扑理解的基于文本的概率分布。这一局限性在多智能体环境中尤为明显，这些环境不仅需要场景感知，还要求第二阶心智理论（ToM）。具体而言，智能体A必须能够推断智能体B对环境的信念，而这一信念严格受限于智能体B的物理方向和感官限制。在本文中，我们通过一项新颖的视听任务探讨MLLMs在两阶段空间推理中的局限性：要求智能体A预测智能体B对A相对位置的估计。为了解决这一问题，我们提出了一种认知感官瓶颈模块，摒弃了僵化的基于规则的坐标变换。相反，我们引入了一种基于锚点的具身空间分解思维链（CoT）。这引导MLLM通过“几何到语义”的投影，迫使其首先建立B的局部坐标系统，然后根据A是否位于B的视觉锥内动态加权视觉和听觉模态。广泛的评估表明，尽管当前的MLLM在空间对称性和视野外歧义方面根本存在困难（建立了42%的严格零-shot基线准确率），我们的感官限制推理链却显著优于纯自我中心和他中心基线。通过系统性地基准测试这些感知瓶颈，我们的工作揭示了当前MLLM空间推理的局限性，并为具身人工智能中的认知、模态感知推理建立了基础范式。
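
The two-step reasoning in this abstract (first establish B's local coordinate system, then weight modalities by whether A lies inside B's visual frustum) can be sketched geometrically. This is a simplified 2D illustration, not the paper's prompt-based method: the function names, the 90-degree field of view, and the specific weights are assumptions.

```python
import math


def bearing_in_b_frame(a_xy, b_xy, b_heading_deg):
    """Signed angle in degrees of A relative to the direction B is facing.

    0 means straight ahead; positive values are counterclockwise (to B's
    left). The result is wrapped into the interval [-180, 180).
    """
    dx = a_xy[0] - b_xy[0]
    dy = a_xy[1] - b_xy[1]
    world_angle = math.degrees(math.atan2(dy, dx))
    return (world_angle - b_heading_deg + 180.0) % 360.0 - 180.0


def modality_weights(a_xy, b_xy, b_heading_deg, fov_deg=90.0):
    """Weight B's cues about A by whether A falls inside B's field of view.

    The numeric weights are placeholders: inside the field of view B can
    rely mostly on vision, outside it B only has audio to go on.
    """
    relative = bearing_in_b_frame(a_xy, b_xy, b_heading_deg)
    if abs(relative) <= fov_deg / 2.0:
        return {"visual": 0.8, "audio": 0.2}
    return {"visual": 0.0, "audio": 1.0}
```

For instance, with B at the origin facing along the positive x-axis, an agent A directly behind B falls outside the assumed field of view, so B's estimate of A's position would have to rest on audio alone.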

cs.AI / 120 / 2605.18298

DARE-EEG: A Foundation Model for Mining Dual-Aligned Representation of EEG

DARE-EEG：一种用于挖掘双对齐表示的基础模型

Shao, Yang, Gong, Peiliang, Dai, Qun, Zhang, Daoqiang

Abstract

Foundation models pre-trained through masked reconstruction on large-scale EEG data have emerged as a promising paradigm for learning generalizable neural representations across diverse brain-computer interface applications. However, a critical yet overlooked challenge is that EEG encoders must learn representations invariant to incomplete observations: when different masked views of the same signal have minimal overlap, existing methods fail to constrain them to a consistent latent subspace, leading to degraded transferability. To address this, we propose DARE-EEG, a self-supervised foundation model that explicitly enforces the mask-invariance property through dual-aligned representation learning during pre-training. Specifically, we introduce mask alignment that constrains representations from multiple masked views of the same EEG sample via contrastive learning, complementing anchor alignment that aligns masked representations to momentum-updated complete features for semantic stability. Additionally, we propose conv-linear-probing, a parameter-efficient strategy that adapts pre-trained representations to heterogeneous electrode configurations and sampling rates through decoupled spectro-spatial projections. Extensive experiments across diverse EEG benchmarks demonstrate that DARE-EEG consistently achieves state-of-the-art accuracy while maintaining relatively low parameter complexity and superior cross-dataset portability compared to existing methods. Furthermore, DARE-EEG helps to effectively discover and utilize the rich potential representations in EEG.

Chinese Translation

通过对大规模脑电图（EEG）数据进行掩码重建预训练的基础模型已成为学习可泛化神经表示的一种有前景的范式，适用于多种脑-机接口应用。然而，一个关键但被忽视的挑战是，EEG 编码器必须学习对不完整观测具有不变性的表示——当同一信号的不同掩码视图几乎没有重叠时，现有方法无法将其约束到一致的潜在子空间，从而导致迁移能力下降。为了解决这一问题，我们提出了 DARE-EEG，这是一种自监督基础模型，通过在预训练期间显式强制掩码不变性属性来进行双对齐表示学习。具体而言，我们引入了掩码对齐，通过对比学习约束来自同一 EEG 样本的多个掩码视图的表示，补充了锚点对齐，将掩码表示与动量更新的完整特征对齐，以确保语义稳定性。此外，我们提出了 conv-linear-probing，这是一种参数高效的策略，通过解耦的频谱空间投影将预训练表示适应于异构电极配置和采样率。在多种 EEG 基准测试中的广泛实验表明，DARE-EEG 在准确性表现上始终实现了最先进的水平，同时相比于现有方法保持了相对较低的参数复杂性和优越的跨数据集可移植性。此外，DARE-EEG 有助于有效发现和利用 EEG 中丰富的潜在表示。

cs.AI / 121 / 2605.18299

SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

SD-Search：基于策略的回顾自蒸馏用于搜索增强推理

Ma, Yufei, Liang, Zihan, Chen, Ben, Qian, Zhipeng, Dai, Huangyu, Mao, Lingtao, Zhang, Xuxin, Lei, Chenyi, Ou, Wenwu

Abstract

Search-augmented reasoning agents interleave internal reasoning with calls to an external retriever, and their performance relies on the quality of each issued query. However, under outcome-reward reinforcement learning, every search decision in a rollout shares the same trajectory-level reward, leaving individual queries without step-specific credit. Recent process-supervision approaches address this gap by drawing step-level signals from outside the policy, relying either on a much larger teacher model, or on sub-question annotations produced by a stronger external system. In contrast, we propose SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation, requiring neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles that differ only in conditioning: a student that sees only the context available at inference time, and a teacher that additionally conditions on a compact hindsight block summarizing the search queries and final outcomes of a group of rollouts sampled from the same question. Since the teacher knows how each rollout unfolded and which ones succeeded, its query distribution implicitly marks which decisions were worth making, and the student is trained to recover this behavior by minimizing the token-level Jensen-Shannon divergence to the teacher at search-query positions. This layers a dense, step-level signal on top of GRPO's coarse trajectory reward. Crucially, this signal is produced by the policy itself within the standard RL training loop, without external model inference, auxiliary annotation pipeline, or additional training stage.

Chinese Translation

搜索增强推理代理将内部推理与对外部检索器的调用交替进行，其性能依赖于每个发出的查询的质量。然而，在结果奖励强化学习下，回放中的每个搜索决策共享相同的轨迹级奖励，导致个别查询缺乏步骤特定的信用。最近的过程监督方法通过从政策外部提取步骤级信号来解决这一问题，依赖于一个更大规模的教师模型或由更强大的外部系统生成的子问题注释。相比之下，我们提出了SD-Search，该方法通过基于策略的回顾自蒸馏从政策自身推导步骤级监督，既不需要外部教师，也不需要额外的注释。在SD-Search中，单个模型扮演两个仅在条件上不同的角色：一个学生仅看到推理时可用的上下文，和一个教师额外依赖于一个紧凑的回顾块，该块总结了从同一问题中抽样的一组回放的搜索查询和最终结果。由于教师知道每个回放的展开情况及哪些成功，因此其查询分布隐含地标记了哪些决策是值得做的，学生则通过最小化在搜索查询位置上与教师的令牌级詹森-香农散度来训练以恢复这种行为。这在GRPO的粗略轨迹奖励之上叠加了一个密集的步骤级信号。关键是，这个信号是在标准强化学习训练循环内由政策自身生成的，无需外部模型推理、辅助注释管道或额外的训练阶段。
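
The distillation term in this abstract, a token-level Jensen-Shannon divergence between student and teacher applied only at search-query positions, can be sketched as follows. The function names, the list-of-probabilities representation, and the averaging over masked positions are assumptions for illustration, not details taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture m."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def query_position_loss(student_dists, teacher_dists, is_query_token):
    """Average the per-token JS divergence over search-query positions only.

    `student_dists` and `teacher_dists` hold one next-token distribution per
    position; `is_query_token` flags positions that belong to a search query.
    """
    selected = [
        js_divergence(s, t)
        for s, t, flag in zip(student_dists, teacher_dists, is_query_token)
        if flag
    ]
    return sum(selected) / len(selected) if selected else 0.0
```

Unlike plain KL, the Jensen-Shannon divergence is symmetric and bounded by log 2 (in nats), so a teacher distribution that puts zero mass on a student token does not produce an infinite loss.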

cs.AI / 122 / 2605.18327

Causely: A Causal Intelligence Layer for Enterprise AI - A Benchmark Study on SRE and Reliability Workflows

Causely：企业人工智能的因果智能层——关于SRE和可靠性工作流程的基准研究

Dalal, Dhairya, Sara, Endre, Yemini, Ben, Miller, Christine, Kliger, Shmuel

Abstract

AI agents deployed into SRE workflows currently derive their understanding of environment state from raw observability telemetry at query time, paying a semantic-interpretation tax in tokens, latency, and inferential reliability. We propose Causely, a causal intelligence layer that maintains a structured representation of environment topology, attribute dependencies, and causal relationships that are anchored to an ontological representation of the managed environment. Causely transforms raw telemetry into a live, queryable model providing the semantic and causal foundation AI agents require to diagnose, evaluate impact, and act safely in production. We evaluate this value proposition through a benchmark study conducted in a controlled setting with injected faults in a 24-microservice OpenTelemetry demo application. Our experiments compare four agent configurations (Claude Code, OpenAI Codex, HolmesGPT with Sonnet and Gemini backends). Experiments are run with and without access to Causely under two scenarios: an active incident and a healthy baseline. On the active-fault scenario, causal grounding reduces mean time-to-diagnosis by 63%, mean token consumption by 60%, and mean tool-call count by 78%, compressing the investigation footprint by 4.8× and lowering direct API cost per run by 57%; root-cause-diagnosis accuracy rises from 75% to 100%.

Chinese Translation

目前部署在SRE工作流程中的AI代理在查询时从原始可观察性遥测中获取环境状态的理解，这在语义解释上付出了代价，包括令牌、延迟和推理可靠性。我们提出了Causely，一个因果智能层，它维护着环境拓扑、属性依赖关系和因果关系的结构化表示，这些关系与被管理环境的本体表示相结合。Causely将原始遥测转换为一个实时可查询的模型，为AI代理提供了诊断、评估影响和在生产环境中安全行动所需的语义和因果基础。我们通过在一个控制环境中进行的基准研究来评估这一价值主张，该研究在一个包含24个微服务的OpenTelemetry演示应用中注入了故障。我们的实验比较了四种代理配置（Claude Code、OpenAI Codex、HolmesGPT与Sonnet和Gemini后端）。实验在两种场景下进行：一个是主动故障场景，另一个是健康基线场景，分别在有无Causely的情况下进行。在主动故障场景中，因果基础将平均诊断时间缩短了63%，平均令牌消耗减少了60%，平均工具调用次数减少了78%，调查范围压缩了4.8倍，直接API每次运行的成本降低了57%；根本原因诊断的准确率从75%上升到100%。

cs.AI / 123 / 2605.18380

QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

QSTRBench：评估语言模型在定性空间和时间推理能力的新基准

Cohn, Anthony G., Blackwell, Robert E.

Abstract

We introduce an extensive qualitative spatial and temporal reasoning (QSTR) benchmark for evaluating large language models (LLMs). We pose questions concerning compositional reasoning (using composition tables, CT), converse relations, and conceptual neighbourhoods (CN) for the following QSTR calculi: Point Algebra (PA), Allen's Interval Algebra, Interval and Duration (INDU), Region Connection Calculus (RCC-5, RCC-8, and RCC-22), the nine intersection model, cardinal direction calculus, and STAR. The RCC-22 CN is published here for the first time. An extended benchmark systematically varies question presentation, including prefix/infix notation, words/symbols/nonce terms, and schematic descriptions for selected calculi. We report results for contemporary frontier models. All models tested perform better than guessing, but none can consistently answer all questions correctly. Performance varies sharply by calculus, with PA being the most straightforward and RCC-22 the most difficult. We release the benchmark and our results under an open licence to facilitate further assessment of qualitative spatio/temporal reasoning in LLMs.

Chinese Translation

我们引入了一个广泛的定性空间和时间推理（QSTR）基准，用于评估大型语言模型（LLMs）。我们提出了关于组合推理（使用组合表，CT）、对话关系和概念邻域（CN）的问题，涉及QSTR演算、点代数（PA）、艾伦区间代数、区间与持续时间（INDU）、区域连接演算（RCC-5、RCC-8和RCC-22）、九交集模型、基数方向演算和STAR。RCC-22 CN在此首次发布。扩展基准系统地变化问题呈现方式，包括前缀/中缀、词语/符号/临时术语以及所选演算的示意描述。我们报告了当代前沿模型的结果。所有测试的模型表现均优于随机猜测，但没有一个模型能够始终正确回答所有问题。性能因演算而异，其中PA是最简单的，而RCC-22是最困难的。我们以开放许可发布基准及我们的结果，以促进对LLMs中定性空间/时间推理的进一步评估。
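
To make the question types concrete, here is a small sketch of composition and converse for Point Algebra, the simplest calculus listed in this abstract. It is independent of the benchmark's own code: the function names are hypothetical, and the composition table is derived by brute force over a three-point domain, which is enough to witness every outcome for this calculus.

```python
from itertools import product

BASE_RELATIONS = ("<", "=", ">")

# Converse: if x r y holds, then y CONVERSE[r] x holds.
CONVERSE = {"<": ">", "=": "=", ">": "<"}


def relation(x, y):
    """The unique Point Algebra base relation holding between two points."""
    if x < y:
        return "<"
    if x > y:
        return ">"
    return "="


def compose(r1, r2, domain=range(3)):
    """All base relations r3 such that x r1 y and y r2 z allow x r3 z."""
    return {
        relation(x, z)
        for x, y, z in product(domain, repeat=3)
        if relation(x, y) == r1 and relation(y, z) == r2
    }


def composition_table():
    """The full 3x3 composition table, keyed by (r1, r2)."""
    return {(r1, r2): compose(r1, r2) for r1 in BASE_RELATIONS for r2 in BASE_RELATIONS}
```

A typical composition question then reads: given x < y and y > z, what can hold between x and z? Here `compose("<", ">")` returns all three base relations, meaning the premises leave the answer undetermined.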

cs.AI / 124 / 2605.18460

When Fireflies Cluster; Enhancing Automatic Clustering via Centroid-Guided Firefly Optimization

萤火虫聚集时；通过中心引导的萤火虫优化增强自动聚类

Ariyaratne, MKA, Gusrialdi, Azwirman, Nikulin, Yury, Peltonen, Jaakko

Abstract

This work presents a novel variant of the Firefly Algorithm (FA) for data clustering, addressing limitations of traditional methods like K-Means that struggle with non-uniform cluster shapes, densities, and the need for pre-defining the number of clusters. The proposed algorithm introduces a centroid movement strategy and a multi-objective fitness function that balances compactness, separation, and a novel TSP-based navigation penalty. It automatically estimates the optimal number of clusters and dynamically adjusts cluster boundaries. Application to robotic sensor networks highlights its practical value, with experiments showing improved clustering quality and reduced intra-cluster path distances compared to K-Means. These results confirm the algorithm's robustness in complex spatial clustering tasks, with potential for future extensions to higher-dimensional and adaptive scenarios.

Chinese Translation

本研究提出了一种新型的萤火虫算法（Firefly Algorithm, FA）变体用于数据聚类，解决了传统方法如K均值（K-Means）在处理非均匀聚类形状、密度以及预先定义聚类数量方面的局限性。所提出的算法引入了一种中心移动策略和一个多目标适应度函数，平衡了紧凑性、分离性以及一种基于旅行商问题（TSP）的导航惩罚。它能够自动估计最佳聚类数量并动态调整聚类边界。对机器人传感器网络的应用突显了其实际价值，实验结果显示与K均值相比，聚类质量得到了改善，聚类内路径距离减少。这些结果证实了该算法在复杂空间聚类任务中的鲁棒性，并具有在更高维度和自适应场景中扩展的潜力。

cs.AI / 125 / 2605.18481

OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models

OCCAM：开放集因果概念解释与黑箱视觉模型的本体生成

Russo, Chiara Maria, Carnemolla, Simone, Palazzo, Simone, Giordano, Daniela, Spampinato, Concetto, Pennisi, Matteo

Abstract

Interpreting the decisions of deep image classifiers remains challenging, particularly in black-box settings where model internals are inaccessible. We introduce OCCAM, a framework for open-set causal concept explanation and ontology induction in vision models. OCCAM discovers visual concepts in an open-set manner, localizes them via text-guided segmentation, and performs object-level interventions by removing concepts to measure changes in class confidence, estimating each concept's causal contribution. Beyond local explanations, OCCAM aggregates interventional evidence across a dataset to induce a structured concept ontology that captures how classifiers globally organize visual concepts. Reasoning over this ontology reveals consistent dependencies between concepts, exposes latent causal relations, and uncovers systematic model biases. Experiments on Broden and ImageNet-S across multiple classifiers show that OCCAM improves explanation quality in open-set black-box settings while providing richer global insight than per-image attribution methods.

Chinese Translation

深度图像分类器的决策解释仍然具有挑战性，尤其是在模型内部不可访问的黑箱环境中。我们提出了OCCAM，一个用于视觉模型的开放集因果概念解释和本体生成的框架。OCCAM以开放集的方式发现视觉概念，通过文本引导的分割对其进行定位，并通过去除概念进行对象级干预，以测量类别置信度的变化，从而估计每个概念的因果贡献。除了局部解释，OCCAM还聚合数据集中的干预证据，以生成一个结构化的概念本体，捕捉分类器如何在全局上组织视觉概念。对该本体的推理揭示了概念之间的一致依赖关系，暴露潜在的因果关系，并揭示系统性的模型偏差。在Broden和ImageNet-S上对多个分类器的实验表明，OCCAM在开放集黑箱环境中提高了解释质量，同时提供了比逐图归因方法更丰富的全局洞察。
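
The object-level intervention in this abstract, removing a concept and measuring the change in class confidence, reduces to a simple difference once the black-box classifier and the removal operator are given. The sketch below uses toy stand-ins for both; `toy_classify`, `toy_remove`, and the set-of-concepts "image" are hypothetical and only show the shape of the computation.

```python
def concept_contributions(classify, image, concepts, target, remove):
    """Confidence drop on `target` when each concept is removed in turn.

    `classify(image)` returns a dict of class confidences (black-box access
    only) and `remove(image, concept)` returns the image with that concept
    taken out. A larger drop suggests a larger causal contribution.
    """
    base = classify(image)[target]
    return {c: base - classify(remove(image, c))[target] for c in concepts}


# Toy stand-ins: an "image" is just the set of concepts visible in it.
def toy_classify(image):
    score = 0.2
    score += 0.6 if "wheel" in image else 0.0
    score += 0.1 if "road" in image else 0.0
    return {"car": score}


def toy_remove(image, concept):
    return image - {concept}


effects = concept_contributions(
    toy_classify,
    frozenset({"wheel", "road", "sky"}),
    ["wheel", "road", "sky"],
    "car",
    toy_remove,
)
ranking = sorted(effects, key=effects.get, reverse=True)
```

In the toy setup the ranking places "wheel" first and "sky" last. Aggregating such per-image effects across a dataset is what the abstract describes as the input to ontology induction.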

cs.AI / 126 / 2605.18511

A Practical Noise2Noise Denoising Pipeline for High-Throughput Raman Spectroscopy

高通量拉曼光谱的实用噪声去除管道

Martin-Calle, David, Llamas, Cesar Alvarez, Ros, Vincent Motto-, Dujardin, Christophe, Margueritat, Jérémie, Rodney, David

Abstract

A lightweight and reproducible denoising pipeline for high-throughput Raman spectroscopy is presented. The approach relies on a one-dimensional convolutional autoencoder trained using a Noise2Noise strategy, requiring neither external spectral libraries nor high signal-to-noise reference spectra for training. From a reduced training subset composed of repeated short-exposure acquisitions, the model learns to reconstruct Raman spectra while efficiently suppressing stochastic noise. The method is evaluated on a heterogeneous mineral sample, using both quantitative spectral fidelity metrics (RMSE, SNR, SSIM) and task-oriented criteria based on unsupervised K-means classification. Results demonstrate that integration times as short as 5 ms per spectrum, which are typically insufficient for reliable interpretation, yield denoised spectra with high fidelity to the reference data while preserving chemically coherent maps. This work provides a practical trade-off between spectral quality and acquisition speed, enabling fast, adaptable Raman workflows compatible with routine laboratory use. It also offers a transferable framework for other one-dimensional spectroscopic modalities.

Chinese Translation

本文提出了一种轻量且可重复的高通量拉曼光谱去噪管道。该方法依赖于使用噪声对噪声（Noise2Noise）策略训练的一维卷积自编码器，训练过程中不需要外部光谱库或高信噪比参考光谱。从由重复短曝光采集组成的缩减训练子集出发，该模型学习重构拉曼光谱，同时有效抑制随机噪声。该方法在一个异质矿物样品上进行了评估，使用了定量光谱保真度指标（均方根误差RMSE、信噪比SNR、结构相似性指数SSIM）以及基于无监督K均值分类的任务导向标准。结果表明，集成时间短至每个光谱5毫秒，这通常不足以进行可靠的解释，仍能生成与参考数据高度一致的去噪光谱，同时保持化学一致的图谱。该工作在光谱质量与采集速度之间提供了实用的权衡，使快速、灵活的拉曼工作流程能够与常规实验室使用相兼容。它还为其他一维光谱学模式提供了可转移的框架。

cs.AI / 127 / 2605.18529

AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

AMR-SD：用于标记级信用分配的非对称元反射自蒸馏

Wei, Zhenlin, Jian, Pu, Deng, Yingzhuo, Wang, Xiaohan, Chai, Jiajun, Hu, Zhexin, Lin, Wei, Zhang, Shanbin, Yin, Guojun

Abstract

The alignment of Large Language Models (LLMs) for complex reasoning heavily relies on Reinforcement Learning with Verifiable Rewards (RLVR). However, standard algorithms like GRPO apply sequence-level rewards uniformly to all tokens, creating a severe credit-assignment bottleneck. While on-policy self-distillation attempts to resolve this by conditioning a self-teacher on privileged contexts, direct exposure to raw oracle solutions often induces over-conditioned teacher distributions, implicit answer leakage, and late-stage training collapse. To overcome these limitations, we propose Asymmetric Meta-Reflective Self-Distillation (AMR-SD). Instead of conditioning directly on raw reference traces, AMR-SD inserts a reflection bottleneck: it compresses diagnostic signals (from verifier outcomes, peer rollouts, or reference feedback) into concise, self-generated Socratic hints and critiques. Furthermore, we introduce Causal Information Gain (CIG) with an asymmetric, ReLU-gated threshold to translate these reflections into sparse, highly precise token-level advantage modulations. Combined with temporal annealing, this mechanism preserves the base environmental reward while filtering out distributional noise. Experiments across scientific, mathematical, and tool-use benchmarks demonstrate that AMR-SD significantly outperforms existing baselines, achieving robust long-horizon stability and successfully preventing late-stage collapse.

Chinese Translation

大型语言模型（LLMs）在复杂推理中的对齐严重依赖于可验证奖励的强化学习（RLVR）。然而，标准算法如GRPO对所有标记均匀应用序列级奖励，造成严重的信用分配瓶颈。尽管在线自蒸馏试图通过将自教师条件化于特权上下文来解决这一问题，但直接接触原始oracle解决方案往往会导致过度条件化的教师分布、隐性答案泄漏和后期训练崩溃。为克服这些限制，我们提出了非对称元反射自蒸馏（AMR-SD）。AMR-SD并不是直接在原始参考轨迹上进行条件化，而是插入了一个反射瓶颈：它将来自验证者结果、同行回滚或参考反馈的诊断信号压缩为简洁的自生成苏格拉底提示和批评。此外，我们引入了因果信息增益（CIG），采用非对称的ReLU门限，将这些反思转化为稀疏且高度精确的标记级优势调制。结合时间退火，该机制在过滤分布噪声的同时保留了基础环境奖励。在科学、数学和工具使用基准测试中的实验表明，AMR-SD显著优于现有基线，达到了稳健的长时间稳定性，并成功防止了后期崩溃。
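
The abstract does not give the exact form of the ReLU-gated modulation, so the following is only one plausible reading, sketched under stated assumptions: per-token information-gain scores are taken as given inputs, the gate is `relu(gain - threshold)`, the modulation is added to the sequence-level advantage, and the annealing schedule is linear. None of these choices is confirmed by the source.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def anneal_weight(step, total_steps, start=1.0):
    """Assumed schedule: decay the modulation weight linearly from `start` to 0."""
    return start * max(0.0, 1.0 - step / total_steps)


def token_advantages(sequence_advantage, token_gains, threshold, step, total_steps):
    """Add a sparse, gated bonus to the shared sequence-level advantage.

    Tokens whose gain does not exceed `threshold` keep the base advantage
    unchanged; only gains above it pass through the gate. The gate is
    one-sided, so a low gain never subtracts from the base advantage.
    """
    weight = anneal_weight(step, total_steps)
    return [sequence_advantage + weight * relu(g - threshold) for g in token_gains]
```

With this form, the base environmental reward is always preserved as the floor of every token's advantage, and the modulation fades out entirely by the final training step.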

cs.AI / 128 / 2605.18547

VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation

VISAFF：面向说话者的视觉情感特征学习用于对话中的情感识别

ZHU, Linan, Zhai, Zihao, Han, Xiao, Fu, Yuqian, Chen, Xiangfan, Kong, Xiangjie, Shen, Guojiang

Abstract

Emotion Recognition in Conversation (ERC) is essential for effective human-machine interaction, aiming to identify speakers' emotional states in multi-turn dialogues. Early text-based methods struggle with complex scenarios like sarcasm because they inherently neglect vital non-verbal information. While recent Vision-Language Models (VLMs) address this by analyzing video directly, they are not inherently tailored for ERC and often focus on emotionally irrelevant background regions or passive listeners rather than the active speaker. Furthermore, fine-tuning these large models incurs prohibitive computational costs. Additionally, isolated visual signals are frequently ambiguous or technically compromised without the context of linguistic content and vocal prosody. To address these challenges, we propose VISAFF, a speaker-centered VISual AFFective feature learning framework for ERC. VISAFF consists of two stages: Speaker-Centered Affective Grounding and Reliability-Guided Affective Complementation. VISAFF utilizes a tuning-free approach to unlock the reasoning capabilities of frozen VLMs, efficiently steering them to focus on the active speaker's emotional visual cues without heavy training overheads. In the second stage, we introduce a reliability-guided affective complementation mechanism that dynamically leverages textual and acoustic modalities to compensate for visual uncertainty. Experiments on two real-world datasets demonstrate that VISAFF achieves highly competitive performance compared to state-of-the-art methods in a tuning-free setting, significantly enhancing computational efficiency by eliminating the need for expensive fine-tuning of large VLMs. The source code is available at https://anonymous.4open.science/r/speaker-2365/.

Chinese Translation

对话中的情感识别（Emotion Recognition in Conversation，ERC）对于有效的人机交互至关重要，旨在识别多轮对话中说话者的情感状态。早期基于文本的方法在处理复杂场景（如讽刺）时表现不佳，因为它们本质上忽视了重要的非语言信息。尽管最近的视觉-语言模型（Vision-Language Models，VLMs）通过直接分析视频来解决这一问题，但它们并未专门针对ERC，且通常关注于与情感无关的背景区域或被动听众，而非主动说话者。此外，微调这些大型模型会产生高昂的计算成本。此外，孤立的视觉信号在缺乏语言内容和声调的上下文时，往往模糊或技术上受到限制。为了解决这些挑战，我们提出了VISAFF，一个面向说话者的视觉情感特征学习框架用于ERC。VISAFF由两个阶段组成：面向说话者的情感基础和可靠性引导的情感补充。VISAFF采用无调优的方法，解锁冻结的VLM的推理能力，有效引导其关注主动说话者的情感视觉线索，而无需繁重的训练开销。在第二阶段，我们引入了一种可靠性引导的情感补充机制，动态利用文本和声学模态来补偿视觉的不确定性。在两个真实世界数据集上的实验表明，VISAFF在无调优设置下与最先进的方法相比，表现出高度竞争力的性能，显著提高了计算效率，消除了对大型VLMs昂贵微调的需求。源代码可在 https://anonymous.4open.science/r/speaker-2365/ 获取。

cs.AI / 129 / 2605.18570

Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning

基于查询条件的知识对齐用于可靠的跨系统医学推理

Jiao, Yan, Xu, Jingran, Ho, Pin-Han, Peng, Limei

Abstract

Cross-domain knowledge alignment is essential for integrating heterogeneous medical systems, yet existing approaches typically treat entity alignment as a static matching problem, ignoring query context and cross-system asymmetry. This limitation is particularly critical in integrative medical settings, where correspondence between concepts is inherently context-dependent, non-bijective, and direction-sensitive. In this paper, we propose Query-Conditioned Entity Alignment (QCEA), which reformulates entity alignment as a query-conditioned correspondence problem. Instead of learning a fixed mapping between entity representations, QCEA treats the textual description of a source entity as a query and ranks candidate entities in the target graph, enabling context-dependent alignment. The framework integrates semantic encoding, graph-based representation learning, and a direction-aware transformation module to capture asymmetric and many-to-many correspondence across heterogeneous knowledge systems. We evaluate QCEA on TCM-WM knowledge graphs derived from SymMap, covering both symptom alignment and herb-molecule alignment tasks. Experimental results show consistent improvements over representative baselines, particularly on rank-sensitive metrics such as Hit@K and MRR. Furthermore, downstream retrieval-augmented generation (RAG) experiments demonstrate that improved alignment leads to better evidence retrieval, stronger grounding, and higher answer accuracy. These findings highlight that alignment is not merely a data integration step, but a key factor that shapes knowledge accessibility and reliability in cross-system medical reasoning.

Chinese Translation

跨领域知识对齐对于整合异构医学系统至关重要，但现有方法通常将实体对齐视为静态匹配问题，忽视了查询上下文和跨系统的不对称性。这一局限性在综合医学环境中尤为关键，因为概念之间的对应关系本质上依赖于上下文、非双射且对方向敏感。本文提出了基于查询条件的实体对齐（Query-Conditioned Entity Alignment, QCEA），将实体对齐重新表述为一个基于查询条件的对应问题。QCEA不再学习实体表示之间的固定映射，而是将源实体的文本描述视为查询，并在目标图中对候选实体进行排序，从而实现上下文依赖的对齐。该框架整合了语义编码、基于图的表示学习和方向感知转换模块，以捕捉异构知识系统之间的不对称和多对多的对应关系。我们在从SymMap派生的中医-西医知识图谱上评估了QCEA，涵盖了症状对齐和草药-分子对齐任务。实验结果显示，在代表性基线之上，QCEA在排名敏感的指标（如Hit@K和MRR）上均有一致的提升。此外，下游检索增强生成（Retrieval-Augmented Generation, RAG）实验表明，改进的对齐导致了更好的证据检索、更强的基础支持和更高的答案准确性。这些发现强调了对齐不仅仅是数据整合的一个步骤，而是塑造跨系统医学推理中知识可及性和可靠性的关键因素。
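
The rank-sensitive metrics named in this abstract have standard definitions, sketched below. Because the abstract describes many-to-many correspondence, each query is given a set of acceptable targets and is scored by its first correct candidate; that scoring choice, along with the function names, is an assumption rather than a detail from the paper.

```python
def first_correct_rank(ranked, gold):
    """1-based rank of the first candidate found in `gold`, or None if absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return position
    return None


def hit_at_k(rankings, gold_sets, k):
    """Fraction of queries with at least one correct entity in the top k."""
    hits = 0
    for ranked, gold in zip(rankings, gold_sets):
        rank = first_correct_rank(ranked, gold)
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(rankings)


def mean_reciprocal_rank(rankings, gold_sets):
    """Mean of 1/rank of the first correct entity (0 when none is ranked)."""
    total = 0.0
    for ranked, gold in zip(rankings, gold_sets):
        rank = first_correct_rank(ranked, gold)
        if rank is not None:
            total += 1.0 / rank
    return total / len(rankings)
```

Hit@K only records whether a correct entity appears in the top K, while MRR also rewards placing it higher, which is why the two are usually reported together for ranking-based alignment.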

cs.AI / 130 / 2605.18580

When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State

当结果看似正确但纪律失效时：隐藏竞争者状态下的基于轨迹的评估

Zhu, Peiying, Chang, Sidi

Abstract

Outcome-only evaluation can certify economically unsafe agents: a policy can hit a business KPI while violating deployable behavioral discipline. In hotel pricing with hidden competitor state, a learner can achieve plausible revenue per available room while failing to preserve the rate discipline of a rule-based revenue-management competitor. We introduce discipline stability, a trace-based evaluation paradigm: define the benchmark behavior, restrict observations to the deployment regime, induce trace diagnostics from failure, separate mechanisms with ablations, and test transfer and deployment. Across a two-hotel benchmark and a compact hidden-budget bidding task, reward-only PPO variants miss trace alignment; revealing hidden state reduces label uncertainty; deterministic copy collapses uncertainty; and trace-prior or corrected history policies better preserve price or bid distributions. Pure behavior cloning is nearly enough for symmetric imitation, while Trace-Prior RL adds bounded adaptation under capacity asymmetry. The contribution is an evaluation and benchmark paradigm, not a new optimizer or a universal claim about MARL.

Chinese Translation

仅基于结果的评估可能会认证经济上不安全的代理：一种政策可以达到商业关键绩效指标（KPI），同时违反可部署的行为纪律。在隐藏竞争者状态的酒店定价中，学习者可以实现每间可用房间的可行收入，但未能保持基于规则的收益管理竞争者的价格纪律。我们引入了纪律稳定性，这是一种基于轨迹的评估范式：定义基准行为，限制观察到的部署机制，从失败中引导轨迹诊断，通过消融分离机制，并测试转移和部署。在一个包含两个酒店的基准和一个紧凑的隐藏预算竞标任务中，仅基于奖励的PPO变体未能实现轨迹对齐；揭示隐藏状态减少了标签的不确定性；确定性复制消除了不确定性；而轨迹优先或修正历史策略更好地保持了价格或竞标分布。纯行为克隆几乎足以实现对称模仿，而轨迹优先强化学习在容量不对称下增加了有界适应性。我们的贡献是一个评估和基准范式，而不是一个新的优化器或关于多智能体强化学习（MARL）的普遍声明。

cs.AI / 131 / 2605.18597

Latent Action Reparameterization for Efficient Agent Inference

高效代理推理的潜在动作重参数化

Huang, Wenhao, Zeng, Qingwen, Chen, Qiyue, Guo, Zijie, Sun, Yu, Yang, Cheng, Ouyang, Siru, Gesi, Jiri, Wu, Fang, Zhang, Jiayi, Chen, Huaming, Liu, Bang, Tang, Xiangru, Wu, Chenglin

Abstract

Large language model (LLM) agents often rely on long sequences of low-level textual actions, resulting in large effective decision horizons and high inference cost. While prior work has focused on improving inference efficiency through system-level optimizations or prompt engineering, we argue that a key bottleneck lies in the representation of the action space itself. We propose Latent Action Reparameterization (LAR), a framework that learns a compact latent action space in which each latent action corresponds to a multi-step semantic behavior. By reparameterizing agent actions into latent units, LAR enables decision making over a shorter effective horizon while preserving the expressiveness of the original action space. Unlike hand-crafted macros or hierarchical controllers, latent actions are learned from agent trajectories and integrated directly into the model, allowing both planning and execution to operate over abstract action representations. Across a range of LLM-based agent benchmarks, LAR significantly reduces the effective action horizon and improves inference efficiency under fixed compute budgets. As a consequence, our approach achieves substantial reductions in action tokens and corresponding wall-clock inference time, while maintaining or improving task success rates. These results suggest that action representation learning is a critical and underexplored factor in scaling efficient LLM agent inference, complementary to advances in model architecture and hardware.

Chinese Translation

大型语言模型（LLM）代理通常依赖于长序列的低级文本动作，这导致了较大的有效决策范围和高推理成本。尽管之前的研究主要集中在通过系统级优化或提示工程来提高推理效率，但我们认为关键瓶颈在于动作空间本身的表示。我们提出了潜在动作重参数化（Latent Action Reparameterization, LAR），这是一个学习紧凑潜在动作空间的框架，其中每个潜在动作对应于一个多步骤的语义行为。通过将代理动作重参数化为潜在单元，LAR能够在较短的有效范围内进行决策，同时保持原始动作空间的表达能力。与手工制作的宏或层次控制器不同，潜在动作是从代理轨迹中学习的，并直接集成到模型中，使得规划和执行能够在抽象动作表示上进行。通过一系列基于LLM的代理基准测试，LAR显著减少了有效动作范围，并在固定计算预算下提高了推理效率。因此，我们的方法在动作令牌和相应的实际推理时间上实现了显著减少，同时保持或提高了任务成功率。这些结果表明，动作表示学习是扩展高效LLM代理推理的一个关键且未被充分探索的因素，与模型架构和硬件的进展相辅相成。

cs.AI / 132 / 2605.18627

Learning Lifted Action Models from Traces with Minimal Information About Actions and States

从具有最少关于动作和状态信息的轨迹中学习提升的动作模型

Gösgens, Jonas, Jansen, Niklas, Geffner, Hector

Abstract

It has been recently shown that lifted STRIPS models can be learned correctly and efficiently from action traces alone; i.e., applicable action sequences from a hidden STRIPS model. The result is remarkable because the states are not assumed to be observable at all, and yet it is not practical enough as STRIPS actions include arguments that are not needed for selecting the actions. This shortcoming has been addressed by assuming that the action traces come instead from a hidden STRIPS+ model where some action arguments are implicit in the hidden action preconditions. A limitation of this approach, however, is that it assumes that the states are fully observable. In this work, we relax these restrictions and consider the problem of learning STRIPS+ action domains from traces in a more general context where the traces carry partial information about both actions and states. In particular, we formulate algorithms and completeness results for three general cases, all of which assume full observability of selected action arguments. In the first case, no observability of the state is assumed; in the second case, full observability of some state predicates is assumed, and in the third case, local observability of some state predicates is assumed instead. Given a STRIPS+ domain, these results characterize the conditions under which an equivalent domain can be learned from traces. Experimental results are reported.

Chinese Translation

最近的研究表明，可以仅通过动作轨迹有效且正确地学习提升的STRIPS模型；即，从一个隐藏的STRIPS模型中提取适用的动作序列。这个结果是显著的，因为并不假设状态是可观察的，然而这并不够实用，因为STRIPS动作包含了选择动作时并不需要的参数。为了解决这一缺陷，假设动作轨迹来自一个隐藏的STRIPS+模型，其中一些动作参数隐含在隐藏的动作前提中。然而，这种方法的一个局限性是它假设状态是完全可观察的。在本研究中，我们放宽了这些限制，考虑在一个更一般的背景下从轨迹中学习STRIPS+动作域的问题，其中轨迹携带关于动作和状态的部分信息。特别地，我们为三种一般情况制定了算法和完备性结果，所有这些情况都假设所选动作参数的完全可观察性。在第一种情况下，不假设状态的可观察性；在第二种情况下，假设某些状态谓词的完全可观察性，而在第三种情况下，则假设某些状态谓词的局部可观察性。给定一个STRIPS+域，这些结果表征了可以从轨迹中学习到等效域的条件。实验结果也被报告。

cs.AI / 133 / 2605.18630

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

SCICONVBENCH：在计算科学任务表述中的多轮澄清对大型语言模型的基准测试

Somasekharan, Nithin, Hassan, Youssef, Lin, Shiyao, Panapitiya, Gihan, Emami, Patrick, Acharya, Anurag, Horawalavithana, Sameera, Pan, Shaowu

Abstract

Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.

Chinese Translation

大型语言模型（LLMs）越来越多地被用作科学人工智能助手，越来越多的基准测试评估它们在知识检索、推理、代码生成和工具使用等方面的能力。然而，这些评估通常假设科学问题已经被很好地表述，而实际的科学辅助往往始于一个不明确的用户请求，该请求必须通过对话进行细化，才能在任何计算、分析或实验之前可靠地进行。我们引入了SCICONVBENCH，这是一个针对四个计算科学问题领域（流体力学、固体力学、材料科学和偏微分方程（PDEs））的科学任务表述中的多轮澄清的基准测试。SCICONVBENCH针对两个互补的能力：引出缺失信息（消歧义）和检测及纠正包含内部矛盾信息的错误请求（不一致性解决）。我们的基准测试将结构化任务本体与基于评分标准的评估框架相结合，使得能够在三个维度上系统地测量LLM的表现：澄清行为、对话基础和最终规范的保真度。目前的前沿模型在不一致性解决方面表现相对较好，但即使是最好的模型在流体力学中的消歧义案例解决率也仅为52.7%。我们进一步发现，前沿LLMs经常做出无声假设，并进行隐式规范修复，这些修复并未与用户的对话相结合。SCICONVBENCH为评估可靠的计算科学助手所需的上游对话推理奠定了基础。代码和数据可以在https://github.com/csml-rpi/SciConvBench找到。

cs.AI / 134 / 2605.18661

AI for Auto-Research: Roadmap & User Guide

自动化研究中的人工智能：路线图与用户指南

Kong, Lingdong, Sun, Xian, Chow, Wei, Li, Linfeng, Lin, Kevin Qinghong, Zhang, Xuan Billy, Wang, Song, Li, Rong, Wu, Qing, Gao, Wei, Wang, Yingshuo, Xie, Shaoyuan, Liu, Jiachen, Qu, Leigang, Li, Shijie, Ng, Lai Xing, Cottereau, Benoit R., Liu, Ziwei, Chua, Tat-Seng, Ooi, Wei Tsang

Abstract

AI-assisted research is crossing a threshold: fully automated systems can now generate research papers for as little as $15, while long-horizon agents can execute experiments, draft manuscripts, and simulate critique with minimal human input. Yet this productivity frontier exposes a deeper integrity problem: under scientific pressure, even frontier LLMs still fabricate results, miss hidden errors, and fail to judge novelty reliably. Studying developments through April 2026, we present an end-to-end analysis of AI across the complete research lifecycle, organized into four epistemological phases: Creation (idea generation, literature review, coding & experiments, tables & figures), Writing (paper writing), Validation (peer review, rebuttal & revision), and Dissemination (posters, slides, videos, social media, project pages, and interactive agents). We identify a sharp, stage-dependent boundary between reliable assistance and unreliable autonomy: AI excels at structured, retrieval-grounded, and tool-mediated tasks, but remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment. Generated ideas often degrade after implementation, research code lags far behind pattern-matching benchmarks, and end-to-end autonomous systems have not yet consistently reached major-venue acceptance standards. We further show that greater automation can obscure rather than eliminate failure modes, making human-governed collaboration the most credible deployment paradigm. Finally, we provide a structured taxonomy, benchmark suite, and tool inventory, cross-stage design principles, and a practitioner-oriented playbook, with resources maintained at our project page.

Chinese Translation

人工智能辅助研究正跨越一个新阶段：全自动系统现在可以以低至15美元的成本生成研究论文，而长期代理可以在最少人类输入的情况下执行实验、撰写手稿和模拟评审。然而，这一生产力前沿暴露出更深层次的诚信问题：在科学压力下，即使是前沿的大型语言模型（LLMs）仍然会伪造结果、遗漏隐藏错误，并且无法可靠地判断新颖性。通过研究到2026年4月的发展，我们呈现了对人工智能在整个研究生命周期中的端到端分析，分为四个认识论阶段：创作（创意生成、文献综述、编码与实验、表格与图形）、写作（论文撰写）、验证（同行评审、反驳与修订）和传播（海报、幻灯片、视频、社交媒体、项目页面和互动代理）。我们识别出一个明显的、依赖于阶段的界限，区分可靠的辅助和不可靠的自主：人工智能在结构化、基于检索和工具介导的任务中表现出色，但在真正的新颖想法、研究级实验和科学判断方面仍然脆弱。生成的想法在实施后往往会退化，研究代码远远落后于模式匹配基准，而端到端的自主系统尚未始终达到主要会议的接受标准。我们进一步表明，更大的自动化可能会掩盖而不是消除失败模式，使得人类主导的协作成为最可信的部署范式。最后，我们提供了一个结构化的分类法、基准套件和工具清单、跨阶段设计原则以及面向实践者的操作手册，相关资源维护在我们的项目页面上。

cs.AI / 135 / 2605.18663

GIM: Evaluating models via tasks that integrate multiple cognitive domains

GIM：通过整合多个认知领域的任务评估模型

Patel, Rohit, Rezende, Alexandre, McClain, Steven

Abstract

As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration; individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, most with rubric-decomposed scoring (median 6 independently judged criteria). A balanced public-private split provides a built-in contamination diagnostic. We calibrate a continuous-response 2-parameter logistic (2PL) IRT model over >200k prompt-response pairs across 28 models, producing robust ability estimates that correctly order test-configurations even when raw accuracy is distorted by errors or missing data, addressing a common challenge in benchmark reporting. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model, thinking-level pairs), and conduct what is to our knowledge the most extensive published study of how test-time compute trades off against model capability on a fixed benchmark: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection. We release the evaluation framework, calibrated IRT parameters, and all public problems.

Chinese Translation

随着大型语言模型（LLM）基准测试的饱和，评估社区采取了两种策略来提高难度：增加知识需求（GPQA, HLE）或完全去除知识，转而关注抽象推理（ARC-AGI）。前者将记忆与能力混为一谈；后者则将推理与其重要的实际背景割裂开来。我们采取了一种不同的方法。基础整合测量（Grounded Integration Measure，GIM）是一个包含820个原创问题（615个公开，205个私有）的基准，其中的难度来自于整合；每个问题都需要协调多个认知操作（约束满足、状态跟踪、认知警觉、受众校准），并基于广泛可获取的知识，使得推理保持在现实任务中，而不依赖于专业知识。每个问题都是由专家创作的原创作品，大多数问题采用分解评分标准（中位数6个独立评判标准）。平衡的公开与私有分割提供了内置的污染诊断。我们在超过20万个提示-响应对上，对28个模型校准了一个连续响应的2参数逻辑（2PL）IRT模型，生成了稳健的能力估计，即使在原始准确性因错误或缺失数据而失真时，也能正确排序测试配置，从而解决了基准报告中的一个常见挑战。利用这一框架，我们展示了一个涵盖22个模型和47个测试配置（独特模型与思维水平对）的综合排行榜，并进行了一项我们所知的最广泛的已发表研究，探讨在固定基准上测试时计算能力与模型能力之间的权衡：11个模型在35个测试配置中进行了评估。我们观察到，在同一家族的配置选择中，如思维预算和量化，与模型选择同样重要。我们发布了评估框架、校准的IRT参数和所有公开问题。
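
The ability estimation mentioned in this abstract can be illustrated with a minimal sketch. It assumes the textbook 2PL response function and treats rubric scores in [0, 1] as fractional outcomes in a cross-entropy likelihood; the abstract does not give the paper's exact continuous-response formulation, so the function names, grid search, and item parameters below are hypothetical.

```python
import math


def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """Textbook 2PL item response function: sigmoid(a * (theta - b)).

    theta: ability, a: item discrimination, b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(scores, items, lo=-6.0, hi=6.0, steps=1200):
    """Grid-search the ability that maximizes a fractional-score log-likelihood.

    scores: rubric scores in [0, 1], one per answered item.
    items:  matching (a, b) pairs. Unanswered items are simply omitted,
            so missing data shrinks the likelihood instead of counting as 0.
    This likelihood is an assumption of the sketch, not the paper's model.
    """
    best_theta, best_ll = lo, -math.inf
    for k in range(steps + 1):
        theta = lo + (hi - lo) * k / steps
        ll = 0.0
        for s, (a, b) in zip(scores, items):
            p = min(max(p_correct_2pl(theta, a, b), 1e-12), 1.0 - 1e-12)
            ll += s * math.log(p) + (1.0 - s) * math.log(1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta


# Hypothetical item parameters (discrimination, difficulty) and scores.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]
strong = estimate_ability([1.0, 0.9, 0.7], items)
weak = estimate_ability([0.6, 0.3, 0.1], items)
print(f"strong={strong:.2f} weak={weak:.2f}")
```

In this sketch a configuration that skipped an item is scored only on the items it answered, which is one way an IRT-style estimate can stay comparable when raw accuracy would be distorted by missing data.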

cs.AI / 136 / 2605.18672

Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment

立场：安全的 LLM 代理部署需要三层概率假设-保证架构

Bensalem, S., Dong, Y., Franzle, M., Huang, X., Kroger, J., Nickovic, D., Nouri, A., Roy, R., Wu, C.

Abstract

This position paper argues that enforcing LLM agent safety within a single abstraction layer is not merely suboptimal but categorically insufficient for deployed LLM agents -- a structural consequence of how agent execution works, not a contingent limitation of current systems. The three dimensions that jointly constitute safe operation -- semantic intent and policy compliance, environmental validity, and dynamical feasibility -- each depend on a strictly distinct set of information that becomes available at different stages of execution. No single guardrail can certify all three. We argue that the community must respond with a contract-based architecture in which each safety dimension is enforced by an independently certified layer whose probabilistic guarantee satisfies the next layer's assumption. We sketch such an architecture and derive the compositional system-level safety bounds it admits via the chain rule of probability. Three open problems stand between this and a deployable standard: bound estimation from non-i.i.d. traces, graceful degradation of contracts under deployment drift, and extension to multi-agent settings -- the most important unfinished business in LLM agent runtime assurance.

Chinese Translation

本文立场论文认为，在单一抽象层内强制执行 LLM 代理的安全性不仅是次优的，而且对于已部署的 LLM 代理来说是根本不足的——这是代理执行工作方式的结构性结果，而不是当前系统的偶然限制。共同构成安全操作的三个维度——语义意图与政策合规性、环境有效性和动态可行性——各自依赖于在执行的不同阶段可获得的严格不同的信息集。没有单一的保护措施能够认证这三者。我们认为，学术界必须以基于合同的架构作出回应，其中每个安全维度由一个独立认证的层来强制执行，其概率保证满足下一个层的假设。我们勾勒了这样的架构，并通过概率链规则推导出它所允许的组合系统级安全界限。在实现可部署标准的过程中，有三个未解决的问题：来自非独立同分布（non-i.i.d.）轨迹的界限估计、在部署漂移下合同的优雅降级，以及扩展到多代理环境——这是 LLM 代理运行时保证中最重要的未完成事务。
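
The compositional bound this abstract refers to can be sketched numerically. The example below is only an illustration under stated assumptions: each layer's guarantee is taken to hold conditionally on the earlier layers' guarantees, and the function name and the three probabilities are hypothetical, not taken from the paper.

```python
from math import prod


def composed_safety_lower_bound(layer_guarantees):
    """Lower-bound P(all layers safe) by the product of per-layer guarantees.

    Chain rule: P(G1, G2, G3) = P(G1) * P(G2 | G1) * P(G3 | G1, G2).
    Assumption of this sketch: layer i certifies its guarantee with probability
    at least p_i whenever the earlier layers' guarantees hold, so the joint
    probability is at least p_1 * p_2 * p_3.
    """
    for p in layer_guarantees:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each guarantee must be a probability in [0, 1]")
    return prod(layer_guarantees)


# Hypothetical guarantees for the semantic, environmental, and dynamical layers.
bound = composed_safety_lower_bound([0.99, 0.98, 0.995])
print(f"system-level safety >= {bound:.6f}")
```

The product is always no larger than the weakest layer's guarantee, so under these assumptions the system-level bound degrades as layers are added unless each layer is certified tightly.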

cs.AI / 137 / 2605.18674

Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning

高效的前瞻编码与抽象宽度在经典规划中学习通用策略

Aichmüller, Michael, Ståhlberg, Simon, Funkquist, Martin, Geffner, Hector

Abstract

Generalized planning aims to learn policies that generalize across collections of instances within a classical planning domain. Recent Graph Neural Network (GNN) approaches have learned nearly perfect policies for several domains. This work improves on the recently published idea of Iterated Width (IW) policies. Therein, the policy broadens its successor scope through an IW-lookahead search that can "jump" over multiple transitions, simplifying the problem structure. Yet, each transition is evaluated individually, leading to unscalable compute costs and expressivity limitations. Furthermore, although IW(1) is attractive because it scales linearly with the number of atoms, it becomes inefficient once thousands of objects are considered, as in the International Planning Competition (IPC) 2023 benchmark. We address both limitations. First, we introduce a vastly more efficient holistic encoding of the entire search tree. It jointly represents IW(1)-reachable states only by their relational differences to the current state, enabling Relational GNNs (R-GNNs) to score all transitions in a single forward pass. Second, we define Abstracted IW(1) to improve scaling through relational abstraction during novelty checks. Rather than testing fully instantiated atoms, it abstracts each atom by replacing all but one argument with its type. The original atom is novel if any of its abstracted forms is novel. This structural compression shifts novelty search scaling from atoms to objects, while preserving meaningful subgoal structure. We evaluate our contributions on the hyperscaling IPC 2023 benchmark and across diverse domains, including domains requiring features beyond the $C_2$ logic fragment. Our policies achieve new state-of-the-art performance, significantly surpassing prior work, including the classical planner LAMA.

Chinese Translation

通用规划旨在学习能够在经典规划领域内跨多个实例集合进行泛化的策略。近期的图神经网络（Graph Neural Network, GNN）方法已在多个领域学习到了几乎完美的策略。本研究改进了最近发布的迭代宽度（Iterated Width, IW）策略的理念。在该方法中，策略通过IW前瞻搜索扩大其后继范围，该搜索能够“跳过”多个过渡，简化问题结构。然而，每个过渡都是单独评估的，这导致了不可扩展的计算成本和表达能力的限制。此外，尽管IW(1)因其与原子数量呈线性关系而具有吸引力，但在考虑数千个对象时（如国际规划竞赛（IPC）2023基准），其效率变得低下。我们解决了这两方面的限制。首先，我们引入了一种更高效的整体编码，表示整个搜索树。它仅通过与当前状态的关系差异共同表示IW(1)可达状态，使得关系GNN（Relational GNN, R-GNN）能够在一次前向传递中对所有过渡进行评分。其次，我们定义了抽象IW(1)，通过在新颖性检查中进行关系抽象来提高扩展性。与其测试完全实例化的原子，不如通过用其类型替换所有但一个参数来抽象每个原子。如果其任何抽象形式是新颖的，则原始原子也是新颖的。这种结构压缩将新颖性搜索的扩展性从原子转移到对象，同时保留有意义的子目标结构。我们在超扩展的IPC 2023基准和多个领域（包括需要超出$C_2$逻辑片段的特征的领域）上评估了我们的贡献。我们的策略实现了新的最先进性能，显著超越了先前的工作，包括经典规划器LAMA。
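
The abstracted novelty check described in this abstract (keep one argument of an atom, replace the others with their types, and call the atom novel if any abstracted form is new) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the class name, the tagged-slot encoding, and the blocks-world example are assumptions.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]      # (predicate, object arguments)
Slot = Tuple[str, str]                  # ("obj", name) or ("type", type_name)
AbstractAtom = Tuple[str, Tuple[Slot, ...]]


def abstractions(atom: Atom, obj_type: Dict[str, str]) -> Iterator[AbstractAtom]:
    """Yield one abstracted form per argument: keep it, replace the rest by types."""
    pred, args = atom
    if not args:                        # nullary atoms have a single form
        yield (pred, ())
        return
    for keep in range(len(args)):
        yield (pred, tuple(
            ("obj", arg) if i == keep else ("type", obj_type[arg])
            for i, arg in enumerate(args)
        ))


class AbstractedNoveltyTable:
    """Tracks abstracted atoms seen so far during one width-1 style search."""

    def __init__(self, obj_type: Dict[str, str]) -> None:
        self.obj_type = obj_type
        self.seen: Set[AbstractAtom] = set()

    def is_novel(self, state_atoms: Iterable[Atom]) -> bool:
        """Return True if any atom has an unseen abstracted form; record all forms."""
        novel = False
        for atom in state_atoms:
            for form in abstractions(atom, self.obj_type):
                if form not in self.seen:
                    self.seen.add(form)
                    novel = True
        return novel


# Hypothetical blocks-world objects, all of type "block".
types = {"b1": "block", "b2": "block", "b3": "block", "b4": "block"}
table = AbstractedNoveltyTable(types)
print(table.is_novel([("on", ("b1", "b2"))]))  # True: both forms are new
print(table.is_novel([("on", ("b3", "b4"))]))  # True: both forms are new
print(table.is_novel([("on", ("b1", "b4"))]))  # False: on(b1, block) and on(block, b4) were seen
```

The last call shows the compression: `on(b1, b4)` is a new ground atom, yet it is pruned because each abstracted form was already recorded, so the table grows with the number of objects per argument position instead of the number of ground atoms.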

cs.AI / 138 / 2605.18681

Learning Quantifiable Visual Explanations Without Ground-Truth

无真实标签的可量化视觉解释学习

Singh, Amritpal, Barsky, Andrey, Souibgui, Mohamed Ali, Valveny, Ernest, Karatzas, Dimosthenis

Abstract

Explainable AI (XAI) techniques are increasingly important for the validation and responsible use of modern deep learning models, but are difficult to evaluate due to the lack of good ground-truth to compare against. We propose a framework that serves as a quantifiable metric for the quality of XAI methods, based on continuous input perturbation. Our metric formally considers the sufficiency and necessity of the attributed information to the model's decision-making, and we illustrate a range of cases where it aligns better with human intuitions of explanation quality than do existing metrics. To exploit the properties of this metric, we also propose a novel XAI method, considering the case where we fine-tune a model using a differentiable approximation of the metric as a supervision signal. The result is an adapter module that can be trained on top of any black-box model to output causal explanations of the model's decision process, without degrading model performance. We show that the explanations generated by this method outperform those of competing XAI techniques according to a number of quantifiable metrics.

Chinese Translation

可解释人工智能（XAI）技术在验证和负责任地使用现代深度学习模型方面变得越来越重要，但由于缺乏良好的真实标签进行比较，这些技术的评估变得困难。我们提出了一个框架，作为XAI方法质量的可量化指标，基于连续输入扰动。我们的指标正式考虑了归因信息对模型决策的充分性和必要性，并展示了一系列案例，其中该指标与人类对解释质量的直觉比现有指标更为一致。为了利用该指标的特性，我们还提出了一种新颖的XAI方法，考虑到在使用该指标的可微近似作为监督信号来微调模型的情况。最终结果是一个适配模块，可以在任何黑箱模型之上进行训练，以输出模型决策过程的因果解释，而不会降低模型性能。我们展示了该方法生成的解释在多个可量化指标上优于竞争的XAI技术。

cs.AI / 139 / 2605.18692

Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches

通过大型语言模型引导的模型补丁实现大规模再优化的民主化

Ye, Tinghan, Deza, Arnaud, Mohan, Ved, Raqabi, El Mehdi Er, Van Hentenryck, Pascal

Abstract

Optimization models developed by operations research (OR) experts are often deployed as decision-support systems in industrial settings. However, real-world environments are dynamic, with evolving business rules, previously overlooked constraints, and unforeseen perturbations. In such contexts, end users must rapidly re-optimize models to recover feasible and implementable solutions. This paper introduces an agentic re-optimization framework in which a large language model (LLM) acts as an OR expert, dynamically supporting end users through natural-language interaction. The LLM translates user prompts into structured updates of the underlying optimization model, selects suitable re-optimization techniques from an optimization toolbox, and solves the resulting instance to return implementable solutions. The toolbox leverages primal information, including historical solutions, valid inequalities, solver configurations, and metaheuristics, to accelerate re-optimization while preserving solution quality. The proposed framework enables interactive and continuous adaptation of deployed optimization models, reducing dependence on OR experts and improving the sustainability of decision-support systems. Extensive experiments on two complementary large-scale real-world case studies demonstrate the effectiveness and scalability of the proposed framework. The first considers online supply chain re-optimization, where solutions must be generated rapidly while remaining close to the deployed plan, whereas the second focuses on offline university exam scheduling, where solution quality is prioritized over runtime. Results show that the toolbox-driven architecture significantly improves computational efficiency through primal-based and solver-aware re-optimization techniques, while the structured patch-based updates improve interpretability and traceability of model modifications.

Chinese Translation

由运筹学（OR）专家开发的优化模型通常作为决策支持系统在工业环境中部署。然而，现实环境是动态的，商业规则不断演变，之前被忽视的约束条件以及不可预见的扰动时有发生。在这种情况下，最终用户必须快速重新优化模型，以恢复可行且可实施的解决方案。本文介绍了一种代理式再优化框架，其中大型语言模型（LLM）充当运筹学专家，通过自然语言交互动态支持最终用户。LLM将用户提示翻译为基础优化模型的结构化更新，从优化工具箱中选择合适的再优化技术，并求解生成的实例以返回可实施的解决方案。该工具箱利用原始信息，包括历史解决方案、有效不等式、求解器配置和元启发式算法，加速再优化，同时保持解决方案质量。所提出的框架使得已部署的优化模型能够进行互动和持续的适应，减少对运筹学专家的依赖，提高决策支持系统的可持续性。在两个互补的大规模真实案例研究中进行了广泛实验，证明了所提框架的有效性和可扩展性。第一个案例考虑在线供应链再优化，其中解决方案必须快速生成，同时保持与已部署计划的接近，而第二个案例则关注离线大学考试排程，其中解决方案质量优先于运行时间。结果表明，基于工具箱的架构通过基于原始信息和求解器感知的再优化技术显著提高了计算效率，同时结构化的补丁式更新改善了模型修改的可解释性和可追溯性。

cs.AI / 140 / 2605.18693

SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents

SkillGenBench：针对大型语言模型代理的技能生成管道基准测试

Zhou, Yifan, Zhang, Zhentao, Cheng, Ziming, Zhang, Shuo, Lan, Qizhen, Chen, Zhangquan, Yang, Zhi, QianyuXu, Chen, Ronghao, Wang, Huacan, Hu, Sen

Abstract

As LLM agents are increasingly built around reusable skills, a central challenge is no longer only whether agents can use provided skills, but whether they can generate correct, reusable, and executable skills from repositories and documents. Existing benchmarks primarily evaluate the efficacy of given skills or the ability of agents to solve downstream tasks from raw context, but they do not isolate skill generation itself as the object of study. We introduce SkillGenBench, a benchmark for evaluating skill generation pipelines under a unified and controlled protocol. In SkillGenBench, a generator receives raw corpora and produces standardized skill artifacts, which are then executed under fixed harnesses and assessed with unified evaluation procedures. The benchmark covers two generation regimes: task-conditioned generation, where a task-specific skill is synthesized after the task is revealed, and task-agnostic generation, where a reusable skill library must be distilled before downstream tasks are known. It also spans two complementary procedural sources: repository-grounded instances, where procedures are distributed across code, configuration, and scripts, and document-grounded instances, where procedures and constraints must be distilled from long-form text. We provide standardized task specifications, pinned environments, and evaluation protocols centered on deterministic execution-based checks, supplemented by auxiliary signals for diagnosis. Experiments across a range of skill-generation methods and backbones show substantial performance variation, highlight the difficulty of reusable skill distillation, and reveal distinct failure modes in skill generation from software repositories versus long-form documents. SkillGenBench establishes a reproducible testbed for studying skill generation as an independent research problem in agent systems.

Chinese Translation

随着大型语言模型（LLM）代理越来越多地围绕可重用技能构建，一个核心挑战不再仅仅是代理是否能够使用提供的技能，而是它们是否能够从知识库和文档中生成正确、可重用且可执行的技能。现有的基准测试主要评估给定技能的有效性或代理从原始上下文解决下游任务的能力，但并未将技能生成本身作为研究对象。我们引入了SkillGenBench，这是一个在统一和受控协议下评估技能生成管道的基准测试。在SkillGenBench中，生成器接收原始语料并生成标准化的技能工件，这些工件随后在固定的执行环境中被执行，并通过统一的评估程序进行评估。该基准测试涵盖两种生成模式：任务条件生成，其中在任务揭示后合成特定任务的技能，以及任务无关生成，其中必须在下游任务已知之前提炼出可重用的技能库。它还涵盖两种互补的程序来源：基于知识库的实例，其中程序分布在代码、配置和脚本中，以及基于文档的实例，其中程序和约束必须从长文本中提炼。我们提供标准化的任务规范、固定环境和以确定性执行为中心的评估协议，并辅以辅助信号进行诊断。针对一系列技能生成方法和基础架构的实验显示出显著的性能差异，突显了可重用技能提炼的困难，并揭示了从软件知识库与长文本文档生成技能的不同失败模式。SkillGenBench建立了一个可重复的测试平台，以独立研究代理系统中的技能生成问题。

cs.AI / 141 / 2605.18738

What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models

人工智能医生重视什么？语言模型临床伦理的多元审计

Chandak, Payal, Alkin, Victoria, Wu, David, Dagan, Maya, Roy, Taposh Dutta, Menezes, Maria Clara Saad, Noori, Ayush, Somia, Nirali, Brownstein, John S., Balicer, Ran, Brendel, Rebecca W., Dagan, Noa, Kohane, Isaac S., Brat, Gabriel A.

Abstract

Medicine is inherently pluralistic. Principles such as autonomy, beneficence, nonmaleficence, and justice routinely conflict, and such ethical dilemmas often sharply divide reasonable physicians. Good clinical practice navigates these tensions in concert with each patient's values rather than imposing a single ethical stance. The ethical values that large language models bring to medical advice, however, have not been systematically examined. We present a framework for auditing value pluralism in medical AI, comprising a benchmark of clinician-verified dilemmas and an attribution method that recovers value priorities directly from decisions. The ecosystem of frontier models spans physician-level value heterogeneity, and models discuss competing values in their reasoning (Overton pluralism) before committing to a decision. However, individual model decisions are near-deterministic across repeated sampling and semantic variations, failing to reproduce the distributional pluralism of the physician panel. Across benchmark cases, these consistent decisions reflect committed, systematic value preferences. While most model priorities fall within the natural range of inter-physician variation, some significantly underweight patient autonomy. A single LLM deployed without regard for its value priorities could amplify those priorities at scale to every patient it serves. Without explicit efforts to balance ethical perspectives with one or multiple models, these tools risk replacing clinical pluralism with a deployment monoculture.

Chinese Translation

医学本质上是多元的。自主性、行善、不伤害和公正等原则常常发生冲突，这些伦理困境常常使合理的医生产生明显分歧。良好的临床实践应与每位患者的价值观相结合，妥善处理这些紧张关系，而不是强加单一的伦理立场。然而，大型语言模型在医学建议中所带来的伦理价值尚未得到系统的审查。我们提出了一个审计医疗人工智能中价值多元主义的框架，包括一个经过临床医生验证的困境基准和一种从决策中直接恢复价值优先级的归因方法。前沿模型的生态系统涵盖了医生层面的价值异质性，模型在推理过程中讨论竞争价值（Overton pluralism），然后再做出决策。然而，个别模型的决策在重复采样和语义变体中几乎是确定性的，未能再现医生小组的分布性多元性。在基准案例中，这些一致的决策反映了坚定的、系统的价值偏好。尽管大多数模型优先级在医生间的自然变异范围内，但有些模型显著低估了患者的自主权。如果不考虑其价值优先级而单独部署一个大型语言模型，可能会将这些优先级在其服务的每位患者中放大。若不明确努力平衡伦理视角与一个或多个模型，这些工具可能会将临床多元主义替换为单一的部署文化。

cs.AI / 142 / 2605.18743

Actionable World Representation

可操作的世界表征

Xu, Kunqi, Li, Jitao, Ye, Jianglong, Tang, Tianshu, Liu, Isabella, Liu, Sifei, Zou, Xueyan

Abstract

Inspired by the emergent behaviors in large language models that generalized human intelligence, the research community is pursuing similar emergent capabilities within world models, with an emphasis on modeling the physical world. Within the scope of physical world models, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states either via video generation or dynamic scene reconstruction, none explicitly model this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; thus, we name it WorldString. Notably, its fully differentiable structure seamlessly enables future integration with policy learning and neural dynamics.

Chinese Translation

受到大型语言模型中涌现行为的启发，这些行为概括了人类智能，研究界正在追求在世界模型中实现类似的涌现能力，特别是对物理世界的建模。在物理世界模型的范围内，物体是构成物理现实的基本原始元素。从人类到计算机，我们几乎与之互动的所有事物都是物体。这些物体很少是静态的；它们是具有不同状态的可操作实体，这些状态由其内在属性决定。虽然当前的方法通过视频生成或动态场景重建来处理物体的动作状态，但没有一种方法以统一、原则性的方式明确建模这一基本元素，以构建可操作的物体表征。我们提出了WorldString，这是一种神经架构，能够通过直接从点云或RGB-D视频流中学习来建模现实世界物体的状态流形。作为一个多功能的数字双胞胎，它作为物理世界模型的基础构建块，因此我们将其命名为WorldString。值得一提的是，它的完全可微结构无缝地支持未来与策略学习和神经动态的集成。

计算语言学 (Computation and Language)

114

cs.CL / 1 / 2605.16508

The Scaling Laws of Skills in LLM Agent Systems

大规模语言模型代理系统中的技能扩展规律

Chen, Charles, Yu, Qiming, Gu, Yuhang, Huang, Zhuoye, Li, Hanjing, Liu, Hongyu, Liu, Simin, Liu, Jinhao, Peng, Dengyun, Wang, Jiangyi, Yan, Zheng, Meng, Fanqing, Qin, Ethan, Che, Carl, Hu, Mengkang

Abstract

As agent systems scale, skills accumulate into large reusable libraries, yet their scaling laws remain poorly understood. Across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions, we identify two coupled laws. Routing law: single-step routing accuracy decays logarithmically with library size ($R^2{>}0.97$ for all models), with errors progressing from local skill competition to cross-family drift and capture by overly general "black-hole skills". Execution law: before state realization, joint routing is approximately multiplicative, whereas correct execution can improve difficult downstream decisions by about $4{\times}$. A single parameter, the routing logarithmic decay slope $b$, couples the two laws: routing-side fits predict execution-side rescue across models, showing that the same library property controls both pre-execution collapse and downstream recoverability. The laws are actionable: law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%, reduces hijack from 22.4% to 4.1%, and transfers directionally to downstream ClawBench and ClawMark execution settings, improving mean pass rate from 49.3% to 61.6% on ClawBench and from 28.4% to 34.5% on ClawMark. These results show that agent performance depends not only on model capability, but also on the structure, granularity, and exposure policy of the skill library.

Chinese Translation

随着代理系统的扩展，技能积累成大型可重用库，但其扩展规律仍然不甚明了。在15个前沿大规模语言模型（LLMs）、1141个现实世界技能和超过300万次路由或执行决策中，我们识别出两条耦合规律。路由规律：单步路由准确性随着库的规模呈对数衰减（所有模型的$R^2{>}0.97$），错误从局部技能竞争进展到跨家族漂移，并被过于通用的“黑洞技能”捕获。执行规律：在状态实现之前，联合路由大致呈乘法关系，而正确执行可以将困难的下游决策改善约$4{\times}$。一个单一参数，即路由对数衰减斜率$b$，将这两条规律耦合在一起：路由侧的拟合预测了执行侧的救援，显示出同一库特性控制了预执行崩溃和下游可恢复性。这些规律具有可操作性：基于规律的优化将保留的路由准确性从71.3%提高到91.7%，将劫持率从22.4%降低到4.1%，并在下游ClawBench和ClawMark执行设置中方向性转移，提高了ClawBench的平均通过率从49.3%到61.6%，ClawMark的平均通过率从28.4%到34.5%。这些结果表明，代理性能不仅依赖于模型能力，还依赖于技能库的结构、粒度和暴露策略。
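
The routing law in this abstract (accuracy decaying logarithmically with library size, summarized by a slope $b$) can be illustrated with an ordinary least-squares fit. The sketch assumes the functional form accuracy = a - b * ln(N); the function name and the data points are synthetic placeholders, not measurements from the paper.

```python
import math


def fit_log_decay(sizes, accuracies):
    """Least-squares fit of accuracy = a - b * ln(size); returns (a, b, r_squared)."""
    xs = [math.log(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    a = mean_y - slope * mean_x
    b = -slope                          # decay slope, positive when accuracy falls
    ss_res = sum((y - (a - b * x)) ** 2 for x, y in zip(xs, accuracies))
    ss_tot = sum((y - mean_y) ** 2 for y in accuracies)
    return a, b, 1.0 - ss_res / ss_tot


# Synthetic library sizes and accuracies generated from a = 0.95, b = 0.05.
sizes = [10, 50, 100, 500, 1000]
accuracies = [0.95 - 0.05 * math.log(n) for n in sizes]
a, b, r2 = fit_log_decay(sizes, accuracies)
print(f"a={a:.3f} b={b:.3f} R^2={r2:.3f}")
```

Fitting each model's curve this way yields one slope per model, which is the kind of single parameter the abstract says couples the routing and execution laws.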

cs.CL / 2 / 2605.16551

PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures

PQR：一个生成多样化和真实用户查询以引发问答代理失败的框架

Lu, Yunan, Liu, Luigi, Yahia, Omar, Sharma, Arpit, Yu, Zhou

Abstract

Evaluating LLM-based agents remains challenging because identifying meaningful failure cases often requires substantial human effort to design realistic test scenarios. Prior works primarily focus on automatically discovering agent failures induced by adversarial users, while overlooking queries with real user intents that also trigger agent failures. We introduce PQR, a framework that not only surfaces agent failures with respect to specific objectives (e.g., helpfulness, safety, etc.) but also resembles real users' intents. PQR operates through an iterative interaction between two complementary modules. The query refinement module performs rewrites to explore diverse query variations, while the prompt refinement module uses prior feedback to derive new objective-violating strategies and realism policies for refining prompts, which in turn generate failure-triggering yet realistic queries. We evaluate PQR on detecting an e-commerce QA agent's unhelpful responses. Our method uncovers 23% - 78% more unhelpful responses, and our generated queries are more diverse and realistic compared to previous methods.

Chinese Translation

评估基于大型语言模型（LLM）的代理仍然具有挑战性，因为识别有意义的失败案例通常需要大量人力来设计真实的测试场景。之前的研究主要集中在自动发现由对抗性用户引发的代理失败，而忽视了那些具有真实用户意图的查询，这些查询同样会触发代理失败。我们提出了PQR，一个框架，不仅能够揭示代理在特定目标（例如，有用性、安全性等）方面的失败，还能模拟真实用户的意图。PQR通过两个互补模块之间的迭代交互来运行。查询优化模块执行重写以探索多样化的查询变体，而提示优化模块利用先前的反馈推导出新的目标违反策略和现实性政策，以优化提示，从而生成引发失败但又真实的查询。我们在检测电子商务问答代理的无用响应方面评估了PQR。我们的方法发现了23% - 78%更多的无用响应，并且我们生成的查询相比于之前的方法更加多样化和真实。

cs.CL / 3 / 2605.16562

Scaling Accessible Mathematics on arXiv: HTML Conversion and MathML 4

在 arXiv 上扩展可访问数学：HTML 转换与 MathML 4

Ginev, Deyan, Caruso, Brian, Miller, Bruce, Sank, Jeff, Weiskoff, Jacob

Abstract

We report on the ongoing development of arXiv's HTML Papers offering, available on every new TeX/LaTeX submission since its initial release in 2023. The main highlights from 2025 and early 2026 are: (i) community-driven improvements to HTML fidelity and service health, with roughly half of 6,000 user reports resolved; (ii) corpus-scale conversion work aimed at 90% error-free HTML (currently 75%); (iii) initial MathML 4 Intent annotations for accessible speech output; (iv) an in-progress Rust port of LaTeXML, reducing compute costs and enabling faster previews on submission. The arXiv HTML Papers project remains experimental, but is gradually maturing as we better understand the needs of arXiv's readers and the technical opportunities presented by new standards and by advances in programming languages and AI.

Chinese Translation

我们报告了 arXiv HTML 论文服务的持续开发，自 2023 年首次发布以来，适用于每个新的 TeX/LaTeX 提交。2025 年和 2026 年初的主要亮点包括：（i）社区驱动的 HTML 精度和服务健康的改进，约有一半的 6,000 份用户报告已解决；（ii）旨在实现 90% 无错误 HTML 的语料库规模转换工作（目前为 75%）；（iii）用于可访问语音输出的初步 MathML 4 意图注释；（iv）正在进行的 LaTeXML Rust 移植，降低计算成本并加快提交时的预览速度。arXiv HTML 论文项目仍处于实验阶段，但随着我们对 arXiv 读者需求和新标准、编程语言及人工智能进展所带来的技术机遇的理解逐渐加深，该项目正在逐步成熟。

cs.CL / 4 / 2605.16613

Beyond Sentiment Classification: A Generative Framework for Emotion Intensity Evaluation in Text

超越情感分类：一种用于文本情感强度评估的生成框架

Fabozzi, Francesco A., Kim, Dasol, Goetzmann, William N.

Abstract

We introduce a novel approach to emotion modeling that shifts the focus from identification to evaluation, addressing the limitations of discrete classification in applied domains such as finance. By constructing a dataset of emotional intensity scores and fine-tuning open-weight generative language models to output continuous values from 0-100, we demonstrate a more expressive, generalizable framework for sentiment and emotion analysis. Our findings not only outperform classification baselines but also reveal surprising generalization capabilities and transfer effects to related constructs such as sentiment and arousal. This work contributes to the interdisciplinary recontextualization of NLP by introducing emotion intensity evaluation as an alternative to classification, arguing that this shift better aligns with the needs of domains, such as finance, where the degree of emotional content is central to interpretation and decision-making.

Chinese Translation

我们提出了一种新的情感建模方法，将重点从识别转向评估，解决了在金融等应用领域中离散分类的局限性。通过构建一个情感强度评分的数据集，并对开放权重的生成语言模型进行微调，使其输出从0到100的连续值，我们展示了一种更具表现力和可推广性的情感与情绪分析框架。我们的研究结果不仅超越了分类基准，还揭示了惊人的泛化能力和向相关构念（如情感和唤醒）的迁移效应。这项工作通过引入情感强度评估作为分类的替代方案，为自然语言处理的跨学科重新语境化做出了贡献，认为这种转变更符合金融等领域的需求，在这些领域中，情感内容的程度对解释和决策至关重要。

cs.CL / 5 / 2605.16650

SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs

SKG-Eval：通过增量语义知识图谱进行多轮对话的状态评估

Shil, Avijit, Samui, Suman

Abstract

Evaluating multi-turn dialogue systems remains challenging because response quality depends not only on the current prompt, but also on previously established entities, claims, and conversational commitments. Existing automatic evaluators, including LLM-as-a-judge frameworks and embedding-based metrics, largely rely on flat or turn-isolated representations, making them less effective at detecting long-range issues such as contradiction, topic drift, and entity inconsistency. To address this, we propose SKG-Eval, a quasi-deterministic and interpretable framework that models dialogue as an evolving Semantic Knowledge Graph (SKG) of entities, relations, and commitments across turns. The framework incrementally updates the graph through structured triple extraction and computes three complementary signals: (i) local relevance, measuring alignment with the current prompt and optional reference; (ii) historical consistency, evaluating how newly introduced information connects to prior conversational context using graph-based and embedding-driven signals; and (iii) logical coherence, assessed by a geometric contradiction engine that detects cross-turn conflicts without relying on NLI models or LLM judges. These signals are adaptively fused and aggregated into a length-invariant session score via recency-weighted trend analysis. Across multiple benchmarks, SKG-Eval achieves higher correlation with human judgments and substantially improves detection of long-range inconsistencies in extended conversations. In addition, the framework produces explicit contradiction certificates and deterministic scores for fixed inputs, enabling reproducible and auditable evaluation. Overall, our results suggest that structured externalized state tracking through semantic knowledge graphs provides a scalable alternative to implicit reasoning in LLM-based dialogue evaluators.

Chinese Translation

评估多轮对话系统仍然具有挑战性，因为响应质量不仅依赖于当前的提示，还依赖于先前建立的实体、主张和对话承诺。现有的自动评估工具，包括LLM-as-a-judge框架和基于嵌入的度量，主要依赖于平面或轮次孤立的表示，使其在检测长距离问题（如矛盾、话题漂移和实体不一致性）方面的效果较差。为了解决这个问题，我们提出了SKG-Eval，一个准确定义且可解释的框架，将对话建模为一个跨轮次的不断演变的语义知识图谱（SKG），其中包含实体、关系和承诺。该框架通过结构化三元组提取增量更新图，并计算三个互补信号：（i）局部相关性，测量与当前提示和可选参考的一致性；（ii）历史一致性，评估新引入的信息如何与先前的对话上下文相连接，使用基于图的和嵌入驱动的信号；（iii）逻辑一致性，由几何矛盾引擎评估，该引擎检测跨轮次冲突，而不依赖于NLI模型或LLM评判。通过近期加权趋势分析，这些信号被自适应融合并聚合为一个长度不变的会话评分。在多个基准测试中，SKG-Eval与人类判断的相关性更高，并显著提高了在扩展对话中检测长距离不一致性的能力。此外，该框架为固定输入生成明确的矛盾证书和确定性评分，支持可重复和可审计的评估。总体而言，我们的结果表明，通过语义知识图谱进行结构化的外部状态跟踪为基于LLM的对话评估器中的隐式推理提供了一种可扩展的替代方案。

cs.CL / 6 / 2605.16654

A Scalable Tool for Measuring Manner and Result Verbs in Developmental Language Research

一种可扩展的工具用于测量发展语言研究中的方式动词和结果动词

Singh, Divyesh Pratap, Gusain, Dakshesh, Bulgarelli, Federica, Hendricks, Alison Eisel, Beavers, John, Beers, Nathan M., Nwogu, Ifeoma

Abstract

Manner and result verbs encode different aspects of event structure and have been discussed in developmental work as a potentially informative distinction for studying early verb learning. However, this distinction remains difficult to measure at scale because large annotated resources for manner and result classification are not currently available. We present a computational approach for identifying manner and result verbs in sentence context. Using linguistically informed prompts, we generate sentence-level annotations with large language models over data drawn from MASC and InterCorp, extending coverage from previously annotated portions of VerbNet to 436 classes. We then train a RoBERTa-based classifier on these annotations and evaluate it on three held-out gold-standard datasets, including previously annotated items and a new expert-annotated set. Across these evaluations, the model shows promising performance, with average accuracy up to 89.6%. We present this work as a scalable measurement tool that can support future research on verb semantics in developmental and other language datasets, while noting that further validation is needed for borderline cases, mixed manner/result verbs, and downstream developmental applications.

Chinese Translation

方式动词和结果动词编码事件结构的不同方面，并在发展研究中被讨论为研究早期动词学习的潜在重要区分。然而，由于目前没有大规模的标注资源用于方式和结果分类，这一区分仍然难以大规模测量。我们提出了一种计算方法，用于在句子上下文中识别方式动词和结果动词。通过使用语言学信息驱动的提示，我们利用大型语言模型生成句子级别的标注，数据来源于MASC和InterCorp，覆盖范围从之前标注的VerbNet部分扩展到436个类别。随后，我们在这些标注上训练了一个基于RoBERTa的分类器，并在三个保留的金标准数据集上进行了评估，包括之前标注的项目和一个新的专家标注集。在这些评估中，该模型显示出良好的性能，平均准确率高达89.6%。我们将这项工作呈现为一种可扩展的测量工具，可以支持未来在发展语言和其他语言数据集中的动词语义研究，同时指出对于边界案例、混合方式/结果动词以及下游发展应用仍需进一步验证。

cs.CL / 7 / 2605.16679

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

CHI-Bench：人工智能代理能否自动化端到端、长期、政策丰富的医疗工作流程？

Chen, Haolin, Metelski, Deon, Qi, Leon, Xia, Tao, Lee, Joonyul, Brown, Steve, Riley, Kevin, Wang, Frank, Liu, T. Y. Alvin, MD, Hank Capps, Tang, Zeyu, Song, Xiangchen, Kong, Lingjing, Feng, Fan, Zeng, Tianyi, Liu, Zhiwei, Ma, Zixian, Jiang, Hang, Geng, Fangli, Yuan, Yuan, You, Chenyu, Wen, Qingsong, Wei, Hua, Fu, Yanjie, Zhao, Yue, Yang, Carl, Huang, Biwei, Zhang, Kun, Xiong, Caiming, Koyejo, Sanmi, Xing, Eric P., Yu, Philip S., Yao, Weiran

Abstract

End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density (decisions must be grounded in a large library of medical, insurance, and operational rules); multi-role composition (a single task requires the agent to play multiple roles with handoffs); and multilateral interaction (intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach). We introduce $\chi$-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status by making tool calls and writing the role's artifacts, guided by a managed-care operations handbook skill of more than 1,290 documents. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.

Chinese Translation

现实医疗操作的端到端自动化强调了当前基准中代表性不足的三种能力：政策密度，决策必须基于大量的医疗、保险和操作规则库；多角色组合：单一任务要求代理扮演多个角色并进行交接；以及多边互动：中间工作流程步骤是多轮对话，例如同行评审和患者外展。我们介绍了$\chi$-Bench，这是一个涵盖三个领域的长期医疗工作流程基准：提供者事先授权、付款方利用管理和护理管理。每个任务将一个临床案例交给代理，在一个高保真的模拟器中模拟20个医疗应用程序，通过87个MCP工具进行暴露，代理必须通过工具调用和撰写角色的文档，将其推进到终止状态，受一个包含超过1290份文档的管理护理操作手册技能的指导。在30种代理配置/模型中，表现最佳的代理仅解决了28.0%的任务，没有代理在严格的pass^3标准下清除20%的任务，而在单一会话中执行所有任务则使性能下降至3.8%。这些结果提出了一个假设，即在其他政策密集、角色组合、不可逆的企业领域中，类似的差距可能会显现。
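
The abstract reports a "strict pass^3" figure without defining it. A minimal sketch of one common reading follows, assuming pass^k is the probability that k independent runs of the same task all succeed, estimated from per-task success counts. The function name and the toy counts are hypothetical and are not taken from the benchmark.

```python
from math import comb


def pass_hat_k(successes_per_task, trials_per_task, k):
    """Estimate pass^k: the chance that k runs of the same task ALL succeed.

    successes_per_task[i] is how many of the trials_per_task runs of task i passed.
    For each task the estimate is C(successes, k) / C(trials, k).
    """
    if k > trials_per_task:
        raise ValueError("k cannot exceed the number of trials per task")
    total = comb(trials_per_task, k)
    per_task = [comb(c, k) / total for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical results for four tasks with three runs each.
successes = [3, 2, 0, 3]
print(pass_hat_k(successes, 3, 1))  # average single-run success rate
print(pass_hat_k(successes, 3, 3))  # share of tasks solved in all three runs
```

Under this reading a task that succeeds in two of three runs contributes nothing to pass^3, which is why a strict score can sit well below the single-run resolution rate.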

cs.CL / 8 / 2605.16758

Language Acquisition Device in Large Language Models

大型语言模型中的语言习得装置

Mita, Masato, Someya, Taiga, Yoshida, Ryo, Oseki, Yohei

Abstract

Large Language Models (LLMs) remain substantially less data-efficient than humans. Pre-pretraining (PPT) on synthetic languages has been proposed to close this gap, with prior work emphasizing highly expressive formal languages such as $k$-Shuffle Dyck. Inspired by the Language Acquisition Device (LAD) hypothesis, which posits that innate constraints preemptively restrict the learner's hypothesis space to natural-language-like structure, we propose LAD-inspired PPT: pre-pretraining on MP-STRUCT, a formal language whose strings encode hierarchical composition, feature-based dependencies, and long-distance displacement via MERGE, AGREE, and MOVE. A brief 500-step PPT with MP-STRUCT matches strong formal-language baselines in token efficiency while additionally imparting a human-like resistance to structurally implausible languages (e.g., REVERSE). Analyzing simplified variants, we find that MP-STRUCT CORE outperforms $k$-Shuffle Dyck despite not being definable in C-RASP (a formal bound on transformer expressivity), challenging the prior hypothesis that effective PPT languages must be both hierarchically expressive and circuit-theoretically learnable. We show that functional landmarks, which reduce dependency resolution ambiguity, are a key driver, suggesting that effective PPT design depends not only on expressivity but also on the accessibility of dependency resolution.

Chinese Translation

大型语言模型（LLMs）在数据效率上仍显著低于人类。预预训练（PPT）在合成语言上被提出以缩小这一差距，之前的研究强调了高度表达性的形式语言，如 $k$-Shuffle Dyck。受到语言习得装置（LAD）假说的启发，该假说认为先天约束会预先限制学习者的假设空间，使其趋向于自然语言的结构，我们提出了受LAD启发的PPT：在MP-STRUCT上进行预预训练，这是一种形式语言，其字符串编码了层次组合、基于特征的依赖关系以及通过MERGE、AGREE和MOVE实现的远距离位移。对MP-STRUCT进行简短的500步PPT在标记效率上与强大的形式语言基准相匹配，同时还赋予了对结构上不合理语言（例如REVERSE）的类人抵抗力。通过分析简化变体，我们发现MP-STRUCT CORE尽管无法在C-RASP（对变换器表达能力的形式限制）中定义，但仍优于 $k$-Shuffle Dyck，这挑战了之前的假设，即有效的PPT语言必须在层次表达性和电路理论可学习性上兼具。我们表明，功能性地标，能够减少依赖解析的模糊性，是一个关键驱动因素，这表明有效的PPT设计不仅依赖于表达性，还依赖于依赖解析的可达性。

cs.CL / 9 / 2605.16767

Retrieval-Based Multi-Label Legal Annotation: Extensible, Data-Efficient and Hallucination-Free

基于检索的多标签法律注释：可扩展、数据高效且无幻觉

Zhang, Li, Savelka, Jaromir, Ashley, Kevin

Abstract

Multi-label legal annotation requires assigning multiple labels from large, evolving taxonomies to long, fact-intensive documents, often under limited supervision. Parametric encoders typically require task-specific training and retraining when the label set changes, while prompting generative large language models becomes costly and degrades as the label space grows. We cast legal annotation as retrieval: we embed documents and label descriptions with a frozen retrieval model and predict labels via k-nearest neighbors in the embedding space, enabling updates by re-embedding and re-indexing rather than gradient-based backpropagation. Across three legal datasets (ECtHR-A, ECtHR-B, and Eurlex with 100 labels), retrieval achieves competitive accuracy and strong data efficiency; on Eurlex, Qwen-8B retrieval improves Macro-F1 from 40.41 (GPT-5.2, zero-shot) to 49.12 while reducing estimated compute by a factor of 20-30 compared to fine-tuning. With only N=100 training samples, retrieval nearly doubles Micro-F1 over hierarchical Legal-BERT on ECtHR-A (48.29 vs. 27.87). We also quantify a reliability failure mode of generative inference: GPT-5.2 hallucinates labels outside the provided taxonomy in 0.12-0.9% of test samples under deterministic decoding. In contrast, retrieval strictly respects defined label sets, eliminating hallucination by design. These results suggest retrieval-model-based annotators are a practical, deployable alternative for high-cardinality and rapidly changing legal label spaces.

Chinese Translation

多标签法律注释需要在有限的监督下，为长篇、信息密集的文档从大型、不断发展的分类法中分配多个标签。参数编码器通常在标签集变化时需要特定任务的训练和再训练，而提示生成大型语言模型的成本较高，并且随着标签空间的增长而降低效果。我们将法律注释视为检索：我们使用冻结的检索模型对文档和标签描述进行嵌入，并通过嵌入空间中的k近邻预测标签，从而通过重新嵌入和重新索引来实现更新，而不是基于梯度的反向传播。在三个法律数据集（ECtHR-A、ECtHR-B和Eurlex，包含100个标签）上，检索实现了竞争性的准确性和强大的数据效率；在Eurlex上，Qwen-8B检索将宏观F1从40.41（GPT-5.2，零样本）提高到49.12，同时将估计计算量减少了20-30倍，相较于微调。仅使用（N=100）个训练样本，检索在ECtHR-A上几乎将微观F1翻倍，相较于层次化的Legal-BERT（48.29对比27.87）。我们还量化了生成推理的一种可靠性失效模式：在确定性解码下，GPT-5.2在0.12-0.9%的测试样本中幻觉出提供的分类法之外的标签。相比之下，检索严格遵循定义的标签集，通过设计消除幻觉。这些结果表明，基于检索模型的注释工具是高基数和快速变化的法律标签空间的实用、可部署的替代方案。
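
The retrieval formulation in the abstract above (embed, index, predict by k-nearest neighbors) can be illustrated with a small sketch. It assumes embeddings are already computed and uses toy 3-dimensional vectors. The voting rule, the threshold, and the label names are hypothetical, and the sketch indexes labelled documents only, not label descriptions.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_labels(query_vec, index, k=3, threshold=0.5):
    """Predict a label set by similarity-weighted voting over the k nearest entries.

    index is a list of (embedding, set_of_labels) pairs. Every predicted label
    comes from the index, so nothing outside the taxonomy can be emitted.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    votes = defaultdict(float)
    total = 0.0
    for vec, labels in ranked[:k]:
        weight = max(cosine(query_vec, vec), 0.0)
        total += weight
        for label in labels:
            votes[label] += weight
    if total == 0.0:
        return set()
    return {label for label, vote in votes.items() if vote / total >= threshold}


# Toy index with made-up vectors and labels.
index = [
    ([1.0, 0.0, 0.0], {"Art. 3"}),
    ([0.9, 0.1, 0.0], {"Art. 3", "Art. 6"}),
    ([0.0, 1.0, 0.0], {"Art. 8"}),
    ([0.0, 0.9, 0.1], {"Art. 8"}),
]
print(knn_labels([0.95, 0.05, 0.0], index, k=2, threshold=0.6))

# Extending the label space only needs a new index entry, no retraining.
index.append(([0.0, 0.0, 1.0], {"Art. 10"}))
print(knn_labels([0.0, 0.1, 1.0], index, k=1))
```

Because predictions are assembled from labels stored in the index, an out-of-taxonomy label cannot appear, which is the sense in which the abstract calls the approach hallucination-free by design.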

cs.CL / 10 / 2605.16770

Exploring Lightweight Large Language Models for Court View Generation

探索轻量级大型语言模型在法庭视图生成中的应用

Hou, Zhitian, Hao, Tianyong, Zeng, Nanli, Chao, Zhixiong, Zeng, Kun

Abstract

Criminal Court View Generation (CVG) is a critical task in Legal Artificial Intelligence (Legal AI) that involves generating a court view from case facts. In this work, we systematically explore the capabilities of lightweight (smaller than 2B parameters) large language models (LLMs) in CVG and their impact on charge prediction. Our study addresses four key questions: (1) how do different LLM architectures affect CVG quality and charge prediction, (2) how does LLM size contribute to performance, (3) how do lightweight LLMs compare with Deep Neural Networks (DNNs) on these tasks, and (4) how does predicting the charge after first generating the court view compare with predicting it directly. We also develop CVGEvalKit, an evaluation framework that includes three publicly available datasets for CVG and charge prediction. Comprehensive experiments are conducted on this framework, where models are trained on a mixed training set and evaluated on each dataset's test set. Experimental results provide new insights into the trade-offs between model architecture and model size and into the influence between different tasks, highlighting the potential of lightweight LLMs in judicial AI applications. The source code is anonymously available at https://github.com/ZhitianHou/CVGEvalKit

Chinese Translation

刑事法庭视图生成（CVG）是法律人工智能（Legal AI）中的一项关键任务，涉及基于案件事实生成法庭视图。在本研究中，我们系统性地探讨了轻量级（小于2B）大型语言模型（LLMs）在CVG中的能力及其对指控预测的影响。我们的研究解决了四个关键问题：（1）不同架构的LLMs如何影响CVG质量和指控预测；（2）LLMs的规模如何影响性能；（3）轻量级LLMs在这些任务中与深度神经网络（DNNs）的比较；（4）通过法庭视图生成预测指控与直接预测的比较。此外，我们还开发了CVGEvalKit，一个评估框架，包括三个公开可用的数据集，用于CVG任务及其指控预测。我们在该框架上进行了全面的实验，模型在混合训练集上训练，并在每个数据集的测试集上进行评估。实验结果为模型架构、模型规模与不同任务之间的影响提供了新的见解，突显了轻量级LLMs在司法人工智能应用中的潜力。源代码可匿名获取，链接为 https://github.com/ZhitianHou/CVGEvalKit

cs.CL / 11 / 2605.16819

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

AgentKernelArena：面向泛化的GPU内核优化代理基准测试

Younesian, Sharareh, Ouyang, Wenwen, Rafati, Sina, Rezagholizadeh, Mehdi, Zhou, Sharon, Liu, Ji, Liu, Yue, Yang, Yuchen, Li, Hao, Liu, Ziqiong, Li, Dong, Appia, Vikram, Gu, Zhenyu, Barsoum, Emad

Abstract

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation. It evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks; centralized scoring; and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.

Chinese Translation

GPU内核优化对于高效的深度学习系统日益重要，但编写高性能内核仍然需要大量的低级专业知识。最近的AI编码代理能够迭代地读取代码、调用编译器和分析器，并优化实现，然而现有的内核基准测试评估的是单个大型语言模型（LLM）调用，而非完整的代理工作流程，并且没有一个包括内核到内核的优化和未见配置的泛化测试。我们提出了AgentKernelArena，这是一个用于测量AI编码代理在GPU内核优化方面的开源基准测试。该基准测试包含196个任务，涵盖HIP到HIP优化、Triton到Triton优化和PyTorch到HIP转换，并在隔离工作空间中使用门控编译、正确性和性能检查、集中评分以及未见配置的泛化协议来评估完整的代理工作流程，该协议测试优化是否能转移到代理从未观察过的输入配置。在包括Cursor Agent、Claude Code和Codex Agent等生产代理中，我们发现大多数任务类别的编译接近完美，正确率很高，最强配置在PyTorch到HIP任务上实现了最高6.89倍的平均加速，在HIP到HIP任务上实现了6.69倍，在Triton到Triton任务上实现了2.13倍。我们的未见配置评估显示，HIP到HIP和Triton到Triton的优化在很大程度上能够转移到未见的输入形状，而PyTorch到HIP的正确性显著下降，表明从头生成内核的代理经常硬编码特定形状的假设。AgentKernelArena被设计为一个模块化、可扩展的框架，用于对不同代理、任务和硬件目标进行严格评估的代理GPU内核优化。

cs.CL / 12 / 2605.16829

Constrained Code Generation with Discrete Diffusion

受限代码生成与离散扩散

Shao, Lize, Cardei, Michael, Xie, Zichen, Fioretto, Ferdinando, Wang, Wenxi

Abstract

Discrete diffusion models are a powerful, emerging paradigm for code generation. They construct programs through iterative refinement of partially corrupted token sequences and enable parallel token refinement. Importantly, this paradigm exposes a global program state at each denoising step, which provides a natural intervention point for enforcing program-level functionality and security constraints, guiding the generation before the final code is committed. Building on this observation, the paper introduces Constrained Diffusion for Code (CDC), a training-free neurosymbolic inference framework that integrates constraint satisfaction directly into the reverse denoising process. CDC augments the base discrete diffusion sampler with constraint-aware denoising operators that combine mathematical optimization with program analysis to identify constraint-relevant regions of the intermediate program state and locally adjust the denoising trajectory, steering generation toward feasible programs while remaining close to the base model. Across code generation benchmarks, CDC consistently improves constraint satisfaction in functional correctness, security, and even syntax, outperforming discrete diffusion and autoregressive baselines with less corrective computation and more localized edits.

Chinese Translation

离散扩散模型是代码生成的一种强大且新兴的范式。它们通过对部分损坏的令牌序列进行迭代细化来构建程序，并支持并行令牌细化。重要的是，这一范式在每个去噪步骤中暴露了全局程序状态，这为强制执行程序级功能和安全约束提供了自然的干预点，指导生成在最终代码提交之前的过程。基于这一观察，本文引入了受限扩散代码（Constrained Diffusion for Code, CDC），这是一种无训练的神经符号推理框架，直接将约束满足集成到反向去噪过程中。CDC通过约束感知去噪算子增强了基础的离散扩散采样器，这些算子结合了数学优化与程序分析，以识别中间程序状态中与约束相关的区域，并局部调整去噪轨迹，引导生成朝向可行程序，同时保持与基础模型的接近。在代码生成基准测试中，CDC在功能正确性、安全性甚至语法的约束满足方面始终表现出色，超越了离散扩散和自回归基线，减少了修正计算并实现了更局部的编辑。

cs.CL / 13 / 2605.16839

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

紧凑注意力：通过块联合 KV 选择加速分块预填充

Song, Jiwon, Jo, Dongwon, Kang, Beomseok, Kim, Jae-Joon

Abstract

Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed for one-shot prefill and do not translate efficiently to chunked prefill: block-sparse kernels lose efficiency when the query length is limited by the chunk size, while fine-grained pattern search becomes costly when repeated over the accumulated KV cache at every chunk. QUOKA, a recent method that directly targets chunked prefill, avoids sparse-kernel overhead but relies on query-subsampled, token-level KV selection, which can miss query-specific KV entries and introduce explicit KV-copy overhead. To address these limitations, we propose CompactAttention, a chunked-prefill attention mechanism based on Block-Union KV Selection. CompactAttention treats 2D block-sparse masks as KV-selection signals rather than direct sparse-kernel execution plans, and converts them into GQA-aware per-group KV block tables through Q-block union and intra-group union. This construction produces the minimal block tables that preserve all KV blocks selected by the input masks under paged execution constraints, enabling selected KV blocks to be accessed in place without explicit KV compaction. On LLaMA-3.1-8B-Instruct, CompactAttention maintains accuracy close to dense attention on the RULER benchmark while delivering up to 2.72$\times$ attention speedup at 128K context length under chunked prefill.

Chinese Translation

分块预填充已成为长上下文大型语言模型广泛采用的服务策略，但在这一模式下高效的注意力计算仍然面临挑战。现有的稀疏注意力方法主要为一次性预填充设计，无法有效转化为分块预填充：当查询长度受到分块大小限制时，块稀疏内核的效率降低，而在每个分块上对累积的 KV 缓存进行重复的细粒度模式搜索则成本高昂。QUOKA 是一种最近针对分块预填充的方法，它避免了稀疏内核的开销，但依赖于查询子采样的、基于令牌的 KV 选择，这可能会遗漏特定查询的 KV 条目并引入显式的 KV 拷贝开销。为了解决这些局限性，我们提出了紧凑注意力（CompactAttention），这是一种基于块联合 KV 选择的分块预填充注意力机制。紧凑注意力将二维块稀疏掩码视为 KV 选择信号，而不是直接的稀疏内核执行计划，并通过 Q 块联合和组内联合将其转换为 GQA 感知的每组 KV 块表。这种结构生成了在分页执行约束下保留所有由输入掩码选择的 KV 块的最小块表，使得所选的 KV 块能够原地访问，而无需显式的 KV 压缩。在 LLaMA-3.1-8B-Instruct 上，紧凑注意力在 RULER 基准测试中保持接近密集注意力的准确性，同时在 128K 上下文长度下实现高达 2.72 倍的注意力加速。
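
The abstract above describes turning 2D block-sparse masks into per-group KV block tables through a Q-block union followed by an intra-group union. A minimal sketch of that union step follows, assuming a nested-list mask layout `masks[head][q_block][kv_block]` and a fixed number of query heads per KV head. The names and layout are hypothetical, and paged execution and the attention kernel itself are out of scope.

```python
def block_union_tables(masks, heads_per_group):
    """Build one KV block table per GQA group from per-head block-sparse masks.

    masks[h][q][b] is truthy when query head h, query block q selects KV block b.
    Returns a sorted list of selected KV block indices for each KV head.
    """
    num_heads = len(masks)
    if num_heads % heads_per_group != 0:
        raise ValueError("query heads must divide evenly into groups")
    tables = []
    for start in range(0, num_heads, heads_per_group):
        selected = set()
        # Intra-group union: merge every query head that shares this KV head.
        for head_mask in masks[start:start + heads_per_group]:
            # Q-block union: merge every query block of the current chunk.
            for q_block_row in head_mask:
                selected.update(b for b, keep in enumerate(q_block_row) if keep)
        tables.append(sorted(selected))
    return tables


# Toy input: 4 query heads, 2 per KV head, 2 query blocks, 6 KV blocks.
masks = [
    [[1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 0, 1]],  # head 0
    [[1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 1]],  # head 1
    [[1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]],  # head 2
    [[1, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 1]],  # head 3
]
print(block_union_tables(masks, heads_per_group=2))  # [[0, 2, 3, 5], [0, 1, 5]]
```

Each table keeps every KV block that any mask in its group selected and nothing else, which matches the abstract's description of a minimal table that preserves all selected blocks.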

cs.CL / 14 / 2605.16843

RTI-Bench: A Structured Dataset for Indian Right-to-Information Decision Analysis

RTI-Bench：印度信息公开决策分析的结构化数据集

Bose, Joy

Abstract

India's Right to Information Act, 2005 gives every citizen the right to demand information from public authorities, yet in practice most people cannot make sense of the dense administrative language used in Central Information Commission (CIC) decisions, let alone predict whether an appeal is worth filing. This paper introduces RTI-Bench, a structured dataset of CIC decisions with outcome labels, exemption citations, IRAC-style reasoning components, and procedural timelines. To the best of our knowledge it is the first publicly released structured dataset for Indian RTI administrative decisions. The dataset draws from two sources: 1,218 cases from a publicly available instruction-response corpus (with structured fields added through rule-based extraction), and 298 CIC decision PDFs collected directly from the Commission portal, spanning five commissioners and three document format generations from 2023 to 2026. Label coverage reaches 89% on the instruction-response corpus. For the PDF subset of 239 primary decisions, coverage is 51% in this first release. A random sample of 50 labelled cases was manually reviewed, yielding a label precision of 95.3%. A zero-shot Mistral 7B baseline on 100 cases gives 57.3% accuracy and 37.0% macro-F1 on outcome prediction, well above the majority-class baseline of 14.3% macro-F1. RTI-Bench is available at https://huggingface.co/datasets/joyboseroy/rti-bench

Chinese Translation

印度2005年《信息公开法》赋予每位公民向公共机构索取信息的权利，但在实践中，大多数人无法理解中央信息委员会（CIC）决策中使用的复杂行政语言，更不用说预测上诉是否值得提出。本文介绍了RTI-Bench，这是一个包含结果标签、豁免引用、IRAC风格推理组件和程序时间线的CIC决策结构化数据集。据我们所知，这是第一个公开发布的印度RTI行政决策结构化数据集。该数据集来源于两个渠道：1,218个来自公开可用的指令-响应语料库的案例（通过基于规则的提取添加了结构化字段），以及298个直接从委员会门户收集的CIC决策PDF，涵盖了五位专员和2023年至2026年间的三种文档格式。指令-响应语料库的标签覆盖率达到89%。对于239个主要决策的PDF子集，此次首次发布的覆盖率为51%。对50个标记案例的随机抽样进行了人工审核，标签精度为95.3%。在100个案例上的零样本Mistral 7B基线模型在结果预测中达到了57.3%的准确率和37.0%的宏观F1值，远高于14.3%的多数类基线宏观F1值。RTI-Bench可在https://huggingface.co/datasets/joyboseroy/rti-bench获取。
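
The abstract compares the model's macro-F1 with a majority-class baseline. The sketch below shows why such a baseline scores low on macro-F1 even when its accuracy looks reasonable. The class names and counts are hypothetical and do not reproduce the dataset's label distribution.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every class seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(gold, pred):
    """Share of exact label matches."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical outcome labels: class "A" is the majority.
gold = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
majority = ["A"] * len(gold)
print(accuracy(gold, majority))  # 0.6
print(macro_f1(gold, majority))  # 0.25
```

Only the majority class earns a non-zero F1, and the average over all classes pulls the score down. This is why the abstract reports macro-F1 alongside accuracy.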

cs.CL / 15 / 2605.16865

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

MixSD：用于知识注入的混合上下文自蒸馏

Liu, Jiarui, Zhang, Lechen, Yang, Yongjin, He, Yinghui, Wang, Yingheng, Xuan, Weihao, Jin, Zhijing, Diab, Mona

Abstract

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off than SFT and on-policy self-distillation baselines, retaining up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces substantially lower-NLL supervision targets under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.

Chinese Translation

监督微调（SFT）广泛用于将新知识注入语言模型，但它往往会降低预训练能力，如推理和通用领域性能。我们认为，这种遗忘现象的产生是因为来自人类或外部系统的微调目标与模型的自回归分布存在偏差，迫使优化器模仿低概率的标记序列。为了解决这个问题，我们提出了MixSD，这是一种简单的无外部教师的方法，用于分布对齐的知识注入。MixSD并不是在固定目标上进行训练，而是通过混合基础模型自身的两个条件的标记动态构建监督：一个观察注入事实的专家条件和一个反映模型原始先验的天真条件。生成的监督序列保留了事实学习信号，同时在很大程度上更接近基础模型的分布。我们在两个合成语料库上评估MixSD，这些语料库是我们构建的，用于在受控环境中研究事实回忆和算术功能获取，同时还包括开放领域事实问答和知识编辑的既定基准。在多个模型规模和设置中，MixSD始终实现了比SFT和在线自蒸馏基线更好的记忆保留权衡，保留了高达100%的基础模型的保留能力，同时保持近乎完美的训练准确性，而标准SFT的保留能力最低仅为1%。我们进一步表明，MixSD在基础模型下产生了显著较低的负对数似然（NLL）监督目标，并减少了沿Fisher敏感参数方向的有害移动。这些结果表明，使监督与模型的本地生成分布对齐是一个简单而有效的知识注入原则，可以减轻灾难性遗忘。

cs.CL / 16 / 2605.16881

PaliBench: A Multi-Reference Blueprint for Classical Language Translation Benchmarks

PaliBench：经典语言翻译基准的多参考蓝图

Metzger, Máté, Phophichit, Nadnapang

Abstract

Digital humanities projects increasingly rely on machine translation and large language models to widen access to classical, religious, and otherwise under-translated textual traditions. Yet standard translation benchmarks are poorly suited to such materials: they typically compare a system output against a single reference translation, even though classical texts often support multiple faithful renderings that differ in terminology, register, and interpretation. This article introduces PaliBench, both a benchmark for Pali-to-English translation and a reusable method for constructing multi-reference translation benchmarks for classical languages. The Pali case study draws on passages from the Sutta Pitaka aligned with independent English translations by Bhikkhu Sujato, Bhikkhu Thanissaro, and Bhikkhu Bodhi. The workflow combines LLM-assisted alignment of independently segmented translations, automated verification against source files, passage-level quality filtering, deduplication of formulaic repetitions, and multi-metric evaluation against multiple human references. The resulting benchmark contains 1,700 passages spanning 8,389 segments and approximately 345,000 tokens. We use it to evaluate ten contemporary large language models with complementary metrics, finding strong cross-metric concordance in system rankings alongside substantial variation in reliability and semantic outlier rates. The broader contribution is methodological: PaliBench shows how existing scholarly translations can be transformed into evaluation infrastructure for interpretive textual traditions without treating any single translation as definitive. Although developed for Pali Buddhist texts, the approach could be portable to other classical corpora where sufficient independent reference translations exist.

Chinese Translation

数字人文学科项目日益依赖机器翻译和大型语言模型，以扩大对经典、宗教及其他翻译不足的文本传统的访问。然而，标准翻译基准并不适合这些材料：它们通常将系统输出与单一参考翻译进行比较，尽管经典文本通常支持多种忠实的翻译，这些翻译在术语、语域和解释上各有不同。本文介绍了PaliBench，它既是巴利语到英语翻译的基准，也是构建经典语言多参考翻译基准的可重用方法。巴利案例研究基于与比丘苏贾托（Bhikkhu Sujato）、比丘塔尼斯萨罗（Bhikkhu Thanissaro）和比丘博迪（Bhikkhu Bodhi）独立英语翻译对齐的《经藏》（Sutta Pitaka）中的段落。该工作流程结合了大型语言模型（LLM）辅助的独立分段翻译对齐、对源文件的自动验证、段落级质量过滤、公式化重复的去重以及对多个人工参考的多指标评估。最终的基准包含1,700个段落，涵盖8,389个片段和大约345,000个标记。我们利用它对十个当代大型语言模型进行评估，使用互补指标，发现系统排名在跨指标一致性方面表现良好，同时在可靠性和语义异常率上存在显著差异。更广泛的贡献在于方法论：PaliBench展示了如何将现有学术翻译转化为解释性文本传统的评估基础设施，而不将任何单一翻译视为权威。尽管该方法是为巴利佛教文本开发的，但其方法可以移植到其他经典语料库，只要存在足够的独立参考翻译。
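
The core idea in the abstract above is to score a system output against several human references instead of one. A minimal sketch follows, using unigram F1 as a stand-in metric. The abstract does not name its metrics, so the metric, the aggregation into "best" and "mean", and the example sentences are all hypothetical.

```python
from collections import Counter


def unigram_f1(hypothesis, reference):
    """Token-overlap F1 between two whitespace-tokenized, lowercased strings."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(hypothesis, references):
    """Score against every reference so that no single rendering is definitive."""
    scores = [unigram_f1(hypothesis, ref) for ref in references]
    return {"best": max(scores), "mean": sum(scores) / len(scores)}


# Invented placeholder sentences, not actual published translations.
references = [
    "the monk walks mindfully",
    "the mendicant walks with mindfulness",
]
print(multi_reference_score("the monk walks with mindfulness", references))
```

Reporting both the best and the mean score shows whether an output matches one accepted rendering closely or sits between several.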

cs.CL / 17 / 2605.16882

E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring

E-PMQ：专家引导的后合并量化与合并权重锚定

Wang, Wenjun, Gu, Yanggan, Cai, Shuo, Wang, Yuanyi, Wang, Pengkai, Wu, Jianmin, Yang, Hongxia

Abstract

Low-resource deployment constraints have made model quantization essential for deploying neural networks while preserving performance. Meanwhile, model merging has become an increasingly practical low-resource strategy for integrating multiple task- or domain-specialized experts into a single model without joint training or multi-model serving. Together, quantization and model merging enable an efficient low-resource deployment pipeline by integrating multiple experts into one low-bit model. We formulate this setting as Post-Merge Quantization (PMQ). We show that directly applying post-training quantization (PTQ) to a merged model is unreliable because two distinct deviations are coupled: the quantization deviation introduced by low-bit reconstruction and the expert-relative merging deviation inherited from model merging. To mitigate these deviations, we propose E-PMQ, an expert-guided PMQ framework that uses source expert weights to provide expert-guided output targets during layer-wise calibration, together with merged-weight anchoring to stabilize the calibration and preserve the integrated behavior of the merged model. On CLIP-ViT-B/32 eight-task merging, E-PMQ improves 4-bit GPTQ from 65.0% to 73.6% under Task Arithmetic and from 69.1% to 74.8% under TIES-Merging. On harder settings, E-PMQ improves GPTQ from 34.8% to 76.7% on 20-task CLIP-ViT-L/14 and from 78.26% to 83.34% on FLAN-T5-base GLUE. These results demonstrate that E-PMQ enables effective post-merge quantization and low-bit deployment.

Chinese Translation

低资源部署限制使得模型量化对于在保持性能的情况下部署神经网络变得至关重要。同时，模型合并已成为一种越来越实用的低资源策略，可以将多个任务或领域专用的专家集成到一个单一模型中，而无需联合训练或多模型服务。量化和模型合并共同实现了通过将多个专家整合为一个低比特模型来高效的低资源部署流程。我们将这一设置表述为后合并量化（Post-Merge Quantization, PMQ）。我们表明，直接将后训练量化（Post-Training Quantization, PTQ）应用于合并模型是不可靠的，因为存在两个不同的偏差相互耦合：由低比特重构引入的量化偏差和从模型合并继承的专家相关合并偏差。为了减轻这些偏差，我们提出了E-PMQ，一种专家引导的PMQ框架，利用源专家权重在逐层校准过程中提供专家引导的输出目标，并结合合并权重锚定来稳定校准并保持合并模型的综合行为。在CLIP-ViT-B/32的八任务合并中，E-PMQ使得4比特GPTQ在任务算术下从65.0%提升至73.6%，在TIES合并下从69.1%提升至74.8%。在更困难的设置中，E-PMQ使得GPTQ在20任务CLIP-ViT-L/14上从34.8%提升至76.7%，在FLAN-T5-base GLUE上从78.26%提升至83.34%。这些结果表明E-PMQ能够有效实现后合并量化和低比特部署。
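
The two coupled deviations named in the abstract above can be made concrete on toy weights. The sketch below merges two experts with Task Arithmetic and then quantizes the result to 4 bits. It uses plain round-to-nearest quantization as a stand-in for GPTQ and does not implement E-PMQ's calibration. All weights, the merging scale, and the function names are hypothetical.

```python
def task_arithmetic(base, experts, scale):
    """Merge by adding the scaled sum of task vectors (expert - base) to the base."""
    return [b + scale * sum(e[i] - b for e in experts) for i, b in enumerate(base)]


def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization with one scale for the whole vector."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in weights) / qmax
    if step == 0.0:
        return list(weights)
    return [max(-qmax, min(qmax, round(w / step))) * step for w in weights]


def mean_abs_gap(a, b):
    """Average absolute difference between two weight vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy weight vectors standing in for one layer.
base = [0.0, 0.5, -0.5, 1.0]
expert_a = [0.2, 0.5, -0.9, 1.0]
expert_b = [0.0, 0.9, -0.5, 0.6]

merged = task_arithmetic(base, [expert_a, expert_b], scale=0.5)
quantized = quantize_rtn(merged, bits=4)

print("merging deviation vs expert A:", mean_abs_gap(merged, expert_a))
print("quantization deviation:", mean_abs_gap(quantized, merged))
print("combined deviation vs expert A:", mean_abs_gap(quantized, expert_a))
```

The printout separates the gap introduced by merging from the gap introduced by low-bit rounding. The abstract argues that these two need to be handled together during calibration.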

cs.CL / 18 / 2605.16896

JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR

JSPG：通过联合语义-拼音-字形检索实现动态字典过滤的中文上下文自动语音识别

Zhou, Shilin, Li, Zhenghua

Abstract

Contextual Automatic Speech Recognition (ASR) faces challenges with large-scale keyword dictionaries, as excessive irrelevant candidates introduce noise that degrades accuracy. To address this, dynamic filtering typically uses a base ASR model to generate preliminary hypotheses, followed by semantic text retrievers to fetch a concise subset of relevant keywords. However, this approach frequently fails in Chinese ASR. Base models often produce homophonic or near-homophonic errors that preserve the phonetic cues of the target keywords but severely distort their semantic meaning, rendering standard semantic retrievers ineffective. To resolve this, we propose a filtering framework that jointly integrates Semantic, Pinyin, and Glyph features (JSPG). Pinyin effectively retrieves targets based on phonetic similarity, while glyph provides complementary structural cues to filter out numerous irrelevant homophones inherent in Chinese. To bridge the gap between character-level pinyin/glyph metrics and sequence-level filtering, we introduce an extended Smith-Waterman algorithm that computes similarity scores between the N-best hypothesis sequences and keywords. Experiments on the Aishell-1 and RWCS-NER datasets demonstrate that JSPG significantly outperforms single-feature baselines. Furthermore, downstream contextual ASR models guided by JSPG achieve substantial improvements in keyword recognition accuracy.

Chinese Translation

上下文自动语音识别（ASR）在处理大规模关键词字典时面临挑战，因为过多无关候选项会引入噪声，从而降低准确性。为了解决这个问题，动态过滤通常使用基础ASR模型生成初步假设，然后通过语义文本检索器获取相关关键词的简明子集。然而，这种方法在中文ASR中常常失败。基础模型通常会产生同音或近音错误，这些错误保留了目标关键词的语音线索，但严重扭曲了其语义含义，使得标准语义检索器失效。为此，我们提出了一种过滤框架，联合整合了语义、拼音和字形特征（JSPG）。拼音有效地基于语音相似性检索目标，而字形提供了补充的结构线索，以过滤掉中文中固有的众多无关同音词。为了弥合字符级拼音/字形指标与序列级过滤之间的差距，我们引入了一种扩展的Smith-Waterman算法，该算法计算N-best假设序列与关键词之间的相似性得分。在Aishell-1和RWCS-NER数据集上的实验表明，JSPG显著优于单一特征基线。此外，由JSPG指导的下游上下文ASR模型在关键词识别准确性上取得了显著提升。
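
The sequence-level scoring described in the abstract above can be illustrated with a small local-alignment sketch. This is a minimal sketch, not the authors' implementation: the `(hanzi, pinyin)` character representation, the similarity values in `char_similarity`, the gap penalty, and the length normalisation in `keyword_score` are all assumptions made for illustration, and glyph similarity is omitted entirely.

```python
from typing import Callable, Sequence, Tuple

# A character is modelled as (hanzi, toneless_pinyin); both fields are written by hand here.
Char = Tuple[str, str]


def char_similarity(a: Char, b: Char) -> float:
    """Hypothetical character-level score; the paper's actual metric is not reproduced."""
    if a[0] == b[0]:
        return 1.0   # identical character
    if a[1] == b[1]:
        return 0.8   # homophone: same pinyin, different character
    return -1.0      # unrelated characters are penalised


def local_alignment_score(
    hypothesis: Sequence[Char],
    keyword: Sequence[Char],
    sim: Callable[[Char, Char], float] = char_similarity,
    gap: float = -1.0,
) -> float:
    """Smith-Waterman local alignment with a soft similarity instead of match/mismatch."""
    rows, cols = len(hypothesis) + 1, len(keyword) + 1
    table = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(
                0.0,                                                           # start a new local alignment
                table[i - 1][j - 1] + sim(hypothesis[i - 1], keyword[j - 1]),  # align two characters
                table[i - 1][j] + gap,                                         # skip a hypothesis character
                table[i][j - 1] + gap,                                         # skip a keyword character
            )
            best = max(best, table[i][j])
    return best


def keyword_score(hypothesis: Sequence[Char], keyword: Sequence[Char]) -> float:
    """Divide by keyword length so long and short keywords are comparable (an assumption)."""
    return local_alignment_score(hypothesis, keyword) / max(len(keyword), 1)


# The hypothesis contains the homophone error 背景 where the keyword 北京 was spoken.
hypothesis = [("我", "wo"), ("想", "xiang"), ("去", "qu"), ("背", "bei"), ("景", "jing")]
beijing = [("北", "bei"), ("京", "jing")]
nanjing = [("南", "nan"), ("京", "jing")]
ranked = sorted([beijing, nanjing], key=lambda kw: keyword_score(hypothesis, kw), reverse=True)
```

In this toy case the homophone pair keeps 北京 ahead of 南京, which is the behaviour the abstract attributes to pinyin matching; a glyph term would be added inside `char_similarity` in the same way.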


cs.CL / 19 / 2605.16928

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

全注意力的反击：在百次训练步骤内将全注意力转化为稀疏注意力

Zhou, Yanke, Li, Yiduo, Tang, Hanlin, Li, Maohua, Liu, Kan, Tao, Lan, Qu, Lin, Yao, Yuan, Ma, Xiaoxing

Abstract

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top-$p$ selection more suitable than fixed top-$k$ sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model's intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36$\times$ prefill speedup at 1M context and about a 2.01$\times$ decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.

Chinese Translation

大型语言模型中的长上下文推理受到全注意力的平方成本的制约。现有的高效替代方案通常依赖于原生稀疏训练或启发式的令牌驱逐，导致效率、训练成本和准确性之间的不理想权衡。在本研究中，我们展示了全注意力的LLM（Large Language Models）本质上已经是稀疏的，并且可以通过最小的适应性转化为高度稀疏的模型。我们的方法基于三个观察结果：（1）只有一小部分注意力头真正需要进行全长上下文处理；（2）长距离检索主要由低维子空间主导，使得相关令牌可以通过16维索引器高效检索；（3）有用的令牌预算强烈依赖于查询，使得动态top-$p$选择比固定top-$k$稀疏化更为合适。基于这些见解，我们提出了RTPurbo，它仅为检索头保留完整的KV缓存，并引入轻量级令牌索引器以实现稀疏注意力。通过利用模型的内在稀疏性，RTPurbo在仅需几百个训练步骤的情况下实现了稀疏化。在长上下文基准和推理任务上的实验表明，RTPurbo在保持近乎无损的准确性的同时，带来了显著的效率提升，包括在1M上下文下高达9.36倍的预填充加速和约2.01倍的解码加速。这些结果表明，从标准的全注意力训练中可以获得强大的稀疏推理，而无需昂贵的原生稀疏预训练。
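
The contrast the abstract draws between dynamic top-$p$ and fixed top-$k$ token selection can be sketched as follows. This is an illustrative sketch only: the raw scores stand in for whatever the 16-dimensional indexer produces, and normalising them with a softmax before accumulating mass is an assumption, not a detail taken from the paper.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over per-token relevance scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_indices(scores: Sequence[float], p: float = 0.9) -> List[int]:
    """Smallest set of token positions whose normalised mass reaches p."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Fixed budget: always keep the k highest-scoring positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


peaked_query = [10.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # one token dominates
flat_query = [0.0] * 6                          # relevance is spread evenly

print(len(top_p_indices(peaked_query)), len(top_p_indices(flat_query)))        # 1 6
print(len(top_k_indices(peaked_query, 3)), len(top_k_indices(flat_query, 3)))  # 3 3
```

The top-$p$ budget shrinks to a single token for the peaked query and grows to all six for the flat one, while top-$k$ spends three tokens in both cases. That is the query dependence the third observation refers to.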


cs.CL / 20 / 2605.16938

Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

努力作为上限，而非调节器：推理预算并未调节人类与大型推理模型之间的认知成本对齐

Hu, Yueqing, Wang, Tianhong

Abstract

Large Reasoning Models (LRMs) generate chain-of-thought traces whose length tracks human reaction times across cognitive tasks, but recent debate questions whether this alignment reflects genuine computational structure or surface verbosity. We test whether the alignment varies with inference-time reasoning effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant: Bayes Factors lean toward the null, and mean alignment is numerically near-identical across conditions. A manipulation check reveals that the effort parameter sets an upper budget on generation rather than driving real-time allocation, suggesting that the allocation policy is crystallized at training time. Arithmetic complexity contrasts further show that token allocation tracks fine-grained, format-dependent human difficulty patterns, with model scale improving the match. Cognitive cost alignment between LRMs and humans appears to be a training-time achievement, robust to inference-time perturbations, supporting a compiled rather than online account of LRM problem-solving.

Chinese Translation

大型推理模型（Large Reasoning Models, LRMs）生成的思维链迹长度与人类在认知任务中的反应时间相对应，但最近的讨论质疑这种对齐是否反映了真正的计算结构或表面冗长。我们测试了这种对齐是否会随着推理时的努力程度而变化。在GPT-OSS-20B和GPT-OSS-120B、三种努力水平和六个推理任务中，任务内和跨任务的对齐保持不变：贝叶斯因子倾向于零，平均对齐在各条件下数值上几乎相同。操控检查表明，努力参数设定了生成的上限预算，而非驱动实时分配，这表明分配策略在训练时已被固化。算术复杂度的对比进一步表明，令牌分配跟踪了细粒度的、格式依赖的人类难度模式，模型规模的提升改善了匹配程度。LRMs与人类之间的认知成本对齐似乎是训练时的成就，对推理时的扰动具有鲁棒性，支持了LRM问题解决的编译而非在线解释。


cs.CL / 21 / 2605.16941

Roll Out and Roll Back: Diffusion LLMs are Their Own Efficiency Teachers

展开与回退：扩散大型语言模型自我提升效率

Zeng, Fanqin, Hong, Feng, Yu, Geng, Zheng, Huangjie, Cao, Xiaofeng, Zhang, Ya, Han, Bo, Wang, Yanfeng, Yao, Jiangchao

Abstract

Diffusion Large Language Models (DLLMs) promise fast parallel generation, yet open-source DLLMs still face a severe quality-speed trade-off: accelerating decoding by revealing multiple tokens often causes substantial quality degradation. We attribute this dilemma to a train-inference mismatch amplified by irreversible decoding. While training reconstructs tokens from randomly corrupted states, efficient inference requires an adaptive denoising order, where easier tokens are revealed earlier and context-dependent ones are deferred. This view motivates two complementary methods: an inference-time method that makes parallel decoding revokable, and a training-time extension that distills the reliable order exposed by this revokable process. Accordingly, we first propose Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable parallel generation. WINO aggressively drafts multiple tokens, verifies generated tokens with enriched global context, and re-masks unreliable ones for later refinement. Building on this discovered order, we further introduce WINO+, which injects the verified denoising trajectories produced by WINO into model parameters, aligning training with efficient inference. Experiments on LLaDA and MMaDA show that WINO improves both quality and efficiency, while WINO+ further strengthens this progression. On GSM8K, WINO improves accuracy from 73.24% to 75.82% with a 6.10x step reduction, and WINO+ further achieves 76.58% with a 6.83x reduction. On Flickr30K, WINO+ reaches a 16.22x step reduction with improved CIDEr. These results demonstrate that DLLMs can serve as their own efficiency teachers by first discovering reliable denoising orders through revokable decoding and then learning to follow them for faster generation. Code is available at https://github.com/Feng-Hong/WINO-DLLM/tree/WINO-plus.

Chinese Translation

扩散大型语言模型（DLLMs）承诺实现快速并行生成，然而开源的DLLMs仍面临严重的质量与速度权衡：通过同时揭示多个标记来加速解码往往会导致显著的质量下降。我们将这一困境归因于不可逆解码所放大的训练-推理不匹配。训练是从随机损坏的状态重构标记，而高效推理则需要自适应去噪顺序，其中较容易的标记被优先揭示，而依赖上下文的标记则被延后揭示。这一观点激励了两种互补的方法：一种是在推理时使并行解码可撤销的方法，另一种是在训练时扩展，通过这一可撤销过程提炼出可靠的顺序。因此，我们首先提出了宽进窄出（Wide-In, Narrow-Out，WINO），这是一种无训练的解码算法，能够实现可撤销的并行生成。WINO积极草拟多个标记，利用丰富的全局上下文验证生成的标记，并对不可靠的标记进行重新掩码以便后续精炼。在此发现的顺序基础上，我们进一步引入WINO+，该方法将WINO生成的经过验证的去噪轨迹注入模型参数，从而使训练与高效推理相一致。在LLaDA和MMaDA上的实验表明，WINO在提高质量和效率方面均有显著改善，而WINO+进一步增强了这一进展。在GSM8K上，WINO将准确率从73.24%提高到75.82%，并减少了6.10倍的步骤，而WINO+进一步达到了76.58%，减少了6.83倍的步骤。在Flickr30K上，WINO+实现了16.22倍的步骤减少，并改善了CIDEr。这些结果表明，DLLMs可以通过首先通过可撤销解码发现可靠的去噪顺序，然后学习遵循这些顺序以实现更快的生成，来作为自身的效率教师。代码可在https://github.com/Feng-Hong/WINO-DLLM/tree/WINO-plus获取。


cs.CL / 22 / 2605.16984

Closing the Gap at CRAC 2026: Two-Stage Adaptation for LLM-Based Multilingual Coreference Resolution

在CRAC 2026中缩小差距：基于LLM的多语言共指解析的两阶段适应

Bourgois, Antoine, Seminck, Olga, Poibeau, Thierry

Abstract

We present our submission to the LLM track of the 2026 Computational Models of Reference, Anaphora and Coreference (CRAC 2026) shared task. With an average CoNLL F1 score of 74.32 on the official test set, our system ranked first in the LLM track, and third overall. Our system is based on the Gemma-3-27b model, fine-tuned using a two-stage strategy with a multilingual base adapter followed by dataset-specific adapters. We represent mention spans by their headword using an XML-inspired format with local reindexing and annotate documents iteratively. These design choices proved effective across languages, document lengths, and annotation guidelines.

Chinese Translation

我们提交了2026年计算参考、指代和共指模型（CRAC 2026）共享任务的LLM轨道的参赛作品。在官方测试集上，我们的系统平均CoNLL F1得分为74.32，排名LLM轨道第一，总体排名第三。我们的系统基于Gemma-3-27b模型，采用两阶段策略进行微调，首先使用多语言基础适配器，然后使用特定数据集的适配器。我们通过其主词使用一种受XML启发的格式表示提及范围，并对文档进行迭代注释。这些设计选择在不同语言、文档长度和注释指南中均证明了其有效性。


cs.CL / 23 / 2605.16986

Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

即时技能：针对 LLM 代理的测试时自适应技能合成

Wang, Jingxing, Zhou, Chenyu, Fu, Zhihui, Wang, Jun, Liu, Weiwen, Zhang, Weinan, Lin, Jianghao

Abstract

LLM agents benefit from reusable skills, yet test-time tasks often require guidance more specific than a static skill library can provide. We propose SkillTTA, a Test-Time Adaptive Skill Synthesis method that retrieves a small set of training trajectories relevant to the current task and synthesizes them into a temporary, task-specific textual skill. The solver model is kept fixed, so adaptation happens entirely through generated context rather than parameter updates. We evaluate the method on SpreadsheetBench, ALFWorld, and BigCodeBench. Compared with static trajectory-to-skill synthesis using GPT-5.5, task-specific skills improve SpreadsheetBench Pass@1 from 0.397 to 0.505 and BigCodeBench Pass@1 from 0.517 to 0.651. On ALFWorld, the method matches a heavier memory-learning baseline within four points of success rate while producing the shortest successful trajectories among reported methods. Ablations on SpreadsheetBench further show that synthesized skills outperform raw trajectory prompting, that top-$k$ retrieval should stay small, and that failed trajectories are especially useful because they expose recurring evaluator-facing mistakes.

Chinese Translation

LLM 代理受益于可重用技能，但测试时任务往往需要比静态技能库提供的更具体的指导。我们提出了 SkillTTA，一种测试时自适应技能合成方法，该方法检索与当前任务相关的一小组训练轨迹，并将其合成到一个临时的、任务特定的文本技能中。求解模型保持固定，因此适应完全通过生成的上下文而非参数更新来实现。我们在 SpreadsheetBench、ALFWorld 和 BigCodeBench 上评估该方法。与使用 GPT-5.5 的静态轨迹到技能合成相比，任务特定技能将 SpreadsheetBench 的 Pass@1 从 0.397 提高到 0.505，将 BigCodeBench 的 Pass@1 从 0.517 提高到 0.651。在 ALFWorld 上，该方法的成功率与一个更重的记忆学习基线相匹配，差距仅为四个百分点，同时生成的成功轨迹是所有报告方法中最短的。在 SpreadsheetBench 上的消融实验进一步表明，合成技能优于原始轨迹提示，top-$k$ 检索应保持较小，并且失败的轨迹尤其有用，因为它们暴露了反复出现的评估者面临的错误。


cs.CL / 24 / 2605.16991

Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning

基于微调变换器的无响应项目难度建模：组件级表示与多任务学习

Netík, Jan, Martinková, Patrícia

Abstract

Response-free item difficulty modelling promises to reduce reliance on response-based calibration but is intrinsically difficult on reading-comprehension multiple-choice items, where difficulty depends on inferential demands across wording components. Whereas most existing approaches extract item-text features and pass them to a separate statistical or machine-learning model, we fine-tune transformer encoders end-to-end on the item wording, eliminating the manual feature engineering and preprocessing that discards information. Moreover, two extensions to this joint-encoding approach are proposed: a component-wise variant that encodes wording components separately through a shared encoder, and a multi-task variant that retains joint encoding and adds an auxiliary multiple-choice question answering objective on the shared encoder. Each method is evaluated under a Monte Carlo subsampling design at three training-set sizes on a held-out test set. We find that joint encoding is a viable end-to-end alternative to feature-engineering pipelines; while the component-wise variant shows no detectable benefit, consistent with self-attention already harvesting the cross-component signal, the multi-task variant delivers significant paired improvements in the smallest-sample regime. Transformer fine-tuning, especially if regularised by a suitable auxiliary task, recovers a substantial share of the wording-derivable signal at training-set sizes typical of applied measurement. The framework provides a customisable interface for psychometrically motivated extensions.

Chinese Translation

无响应项目难度建模有望减少对基于响应的校准的依赖，但在阅读理解的多项选择题中本质上是困难的，因为难度依赖于各个词语组件之间的推理需求。尽管大多数现有方法提取项目文本特征并将其传递给单独的统计或机器学习模型，我们却在项目措辞上对变换器编码器进行端到端的微调，从而消除了手动特征工程和丢弃信息的预处理。此外，提出了对这一联合编码方法的两个扩展：一种组件级变体，通过共享编码器分别编码措辞组件；另一种多任务变体，保留联合编码并在共享编码器上增加辅助的多项选择问题回答目标。每种方法在三种训练集大小下的保留测试集上通过蒙特卡洛子抽样设计进行评估。我们发现，联合编码是特征工程管道的可行端到端替代方案；尽管组件级变体未显示出可检测的益处，这与自注意力已经捕获跨组件信号的事实一致，但多任务变体在最小样本规模下提供了显著的配对改进。变换器微调，特别是如果通过合适的辅助任务进行正则化，能够在应用测量的典型训练集大小下恢复相当一部分可由措辞推导的信号。该框架提供了一个可定制的接口，以便进行心理测量学驱动的扩展。


cs.CL / 25 / 2605.16996

Evaluation Drift in LLM Personality Induction: Are We Moving the Goalpost?

大语言模型个性引导中的评估漂移：我们是否在改变目标？

Rajput, Prateek, Song, Yewei, Olatunji, Iyiola E., Klein, Jacques, Bissyandé, Tegawendé F.

Abstract

Can large language models reliably express a human-like personality, or are they merely mimicking surface cues without a stable underlying profile? To investigate this, we induce personality in LLMs by fine-tuning them on long-form essays, where each essay is associated with a target Big Five personality profile. We then evaluate the stability and fidelity of the induced personality using the IPIP-NEO questionnaire. Specifically, we ask: (i) does post-training (SFT, DPO, ORPO) stabilize questionnaire scores under prompt rephrasings, and (ii) can it induce target Big Five profiles from unguided essays? Our results demonstrate that fine-tuning consistently reduces variance in questionnaire responses across five models, directly mitigating the evaluation fragility reported in pre-trained models. However, this newfound stability reveals a more fundamental limitation: accuracy on the full five-dimensional profile remains near chance, even when single-trait scores improve. This indicates that unguided essays lack the cues needed for faithful personality expression. We therefore argue for scenario-grounded datasets or interactive elicitation that accumulates test-aligned evidence over time.

Chinese Translation

大型语言模型能否可靠地表达类人个性，还是仅仅在模仿表面线索而没有稳定的基础特征？为此，我们通过在长篇论文上进行微调来引导大语言模型的个性，每篇论文都与一个目标的五大人格特质（Big Five）相关联。随后，我们使用IPIP-NEO问卷评估引导个性的稳定性和忠实度。具体而言，我们提出以下问题：(i) 后训练（SFT, DPO, ORPO）是否能在提示重述下稳定问卷得分，以及 (ii) 它能否从无指导的论文中引导出目标的五大人格特质？我们的结果表明，微调在五个模型中始终减少了问卷反应的方差，直接缓解了在预训练模型中报告的评估脆弱性。然而，这种新发现的稳定性揭示了一个更根本的局限性：即使单一特质得分有所改善，完整的五维特征的准确性仍接近随机水平。这表明无指导的论文缺乏忠实表达个性所需的线索。因此，我们主张使用情境基础的数据集或互动引导，以便随着时间的推移积累与测试对齐的证据。


cs.CL / 26 / 2605.17007

HalluScore: Large Language Model Hallucination Question Answering Benchmark

HalluScore：大型语言模型幻觉问答基准

Alansari, Aisha, Luqman, Hamzah

Abstract

Large language models (LLMs) have achieved remarkable progress in natural language generation, but remain susceptible to hallucination. In response to growing concerns about hallucinations, several benchmarks have been developed, primarily in English and Chinese. However, Arabic remains underrepresented, with few benchmarks for LLM hallucination due to scarce annotated resources and the language's morphological complexity. Consequently, existing benchmarks do not adequately reflect the linguistic, cultural, and reasoning characteristics of Arabic. To address this gap, we introduce HalluScore, a structured Arabic question answering benchmark designed to evaluate hallucination behavior in LLMs across different levels of reasoning difficulty, various knowledge domains, historical timelines, and culturally grounded Arabic scenarios. It contains 827 carefully curated questions for evaluating, detecting, and mitigating hallucination in LLMs. The dataset was constructed through a structured pipeline involving quality assurance, filtering for clarity and factual validity, and model-driven selection to retain questions that consistently trigger hallucinations. Each question is linked to verified ground-truth evidence, answer explanations, and multi-label annotations. Using the HalluScore benchmark, we conduct a comprehensive empirical analysis of hallucination patterns across 17 Arabic, multilingual, and reasoning LLMs. Moreover, we provide high-quality human annotations identifying hallucinated, non-hallucinated, and partially hallucinated responses of all evaluated LLMs. These results suggest that hallucination in Arabic LLMs extends beyond factual inaccuracies, encompassing challenges related to cultural understanding, linguistic reasoning, and logical consistency. We release HalluScore to support future research on improving the reliability and cultural competence of LLMs in Arabic.

Chinese Translation

大型语言模型（LLMs）在自然语言生成方面取得了显著进展，但仍然容易出现幻觉。针对日益增长的幻觉问题，已经开发了多个基准，主要集中在英语和中文。然而，阿拉伯语的相关研究仍然不足，由于缺乏标注资源和语言的形态复杂性，针对LLMs幻觉的基准数量有限。因此，现有基准无法充分反映阿拉伯语的语言、文化和推理特征。为了解决这一问题，我们推出了HalluScore，这是一个结构化的阿拉伯语问答基准，旨在评估LLMs在不同推理难度、各种知识领域、历史时间线和文化背景下的幻觉行为。该基准包含827个经过精心策划的问题，用于评估、检测和减轻LLMs中的幻觉。数据集通过一个结构化流程构建，涉及质量保证、清晰度和事实有效性的筛选，以及基于模型的选择，以保留那些能够持续引发幻觉的问题。每个问题都与经过验证的真实证据、答案解释和多标签注释相关联。利用HalluScore基准，我们对17个阿拉伯语、多语言和推理LLMs的幻觉模式进行了全面的实证分析。此外，我们提供了高质量的人类注释，识别出所有评估LLMs的幻觉、非幻觉和部分幻觉的响应。这些结果表明，阿拉伯语LLMs中的幻觉不仅限于事实不准确，还涉及文化理解、语言推理和逻辑一致性等方面的挑战。我们发布HalluScore，以支持未来在提高阿拉伯语LLMs的可靠性和文化能力方面的研究。


cs.CL / 27 / 2605.17028

PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts

PARALLAX：将真实幻觉检测与基准构建伪影分离

Hussain, Khizar, Kantarcioglu, Murat

Abstract

Large language models (LLMs) hallucinate with confidence: their outputs can be fluent, authoritative, and simply wrong. In medical, legal, and scientific applications this failure causes direct harm, and detecting it from internal model states offers a path to safer deployment. A growing body of work reports that this problem is increasingly tractable, with recent methods achieving high detection performance on widely used benchmarks. We show, however, that much of this apparent progress does not survive scrutiny. Four of the six corpora embed the ground-truth answer directly in the input prompt. A naïve text-similarity baseline we call TxTemb exploits this to achieve near-perfect detection scores without any access to model internals. To measure what genuine detection capability remains once these artifacts are controlled, we conduct a large-scale evaluation spanning twenty-two detection methods, twelve open-source models spanning six architectural families, and six corpora. We further introduce DRIFT, a supervised probe over inter-layer hidden-state transitions, as a point of comparison for live-generation detection. Our findings suggest that the field's reported progress on hallucination detection is substantially explained by benchmark construction artifacts in widely used corpora, and that the majority of established baselines perform near chance under controlled conditions; the consistent exceptions are SAPLMA and DRIFT, both supervised probes on upper-layer hidden states.

Chinese Translation

大型语言模型（LLMs）自信地产生幻觉：它们的输出可能流畅、权威，但却完全错误。在医学、法律和科学应用中，这种失败会造成直接伤害，而从模型的内部状态中检测这种失败为更安全的部署提供了一条途径。越来越多的研究表明，这一问题正变得越来越可处理，最近的方法在广泛使用的基准上取得了高检测性能。然而，我们展示了，这些表面上的进展在仔细审查后并不成立。六个语料库中的四个直接在输入提示中嵌入了真实答案。我们称之为 TxTemb 的天真文本相似性基线利用这一点，在没有任何访问模型内部的情况下实现了近乎完美的检测分数。为了测量在控制这些伪影后，真正的检测能力剩余多少，我们进行了大规模评估，涵盖了二十二种检测方法、十二个开源模型（跨越六个架构系列）以及六个语料库。我们进一步引入了 DRIFT，这是一个针对层间隐藏状态转变的监督探测器，作为实时生成检测的比较点。我们的研究结果表明，领域内报告的幻觉检测进展在很大程度上是由广泛使用的语料库中的基准构建伪影所解释的，并且在控制条件下，大多数已建立的基线表现接近随机；一致的例外是 SAPLMA 和 DRIFT，这两者都是针对上层隐藏状态的监督探测器。
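
The construction artifact named in the abstract above, a ground-truth answer embedded in the input prompt, can be screened for with a simple string check. The sketch below is not the TxTemb baseline (whose details the abstract does not give); it is a hypothetical audit helper, and the normalisation rules and the `(prompt, answer)` pair format are assumptions.

```python
import re
import unicodedata
from typing import Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase, fold Unicode variants, and replace punctuation with spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def answer_leaks_into_prompt(prompt: str, answer: str) -> bool:
    """True when the normalised answer appears as a whole-word span of the prompt."""
    needle = normalize(answer)
    if not needle:
        return False
    return f" {needle} " in f" {normalize(prompt)} "


def leak_rate(examples: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prompt, ground_truth_answer) pairs with a leaked answer."""
    pairs = list(examples)
    if not pairs:
        return 0.0
    return sum(answer_leaks_into_prompt(p, a) for p, a in pairs) / len(pairs)


corpus = [
    ("Context: Paris is the capital of France. Q: What is the capital of France?", "Paris"),
    ("Q: What is the capital of France?", "Paris"),
]
print(leak_rate(corpus))  # 0.5
```

A high leak rate on a corpus would suggest that a text-only detector can score well without model internals. Because this check matches whole words only, it undercounts paraphrased answers.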


cs.CL / 28 / 2605.17041

Agentic AI Translate: An Agentic Translator Prototype for Translation as Communication Design

Agentic AI Translate：一种用于翻译作为交流设计的代理翻译原型

Yamada, Masaru

Abstract

We present Agentic AI Translate, an agentic translator prototype that operationalises the thesis of Yamada (forthcoming) -- that the metalanguage of Translation Studies has become an instruction code for generative AI. The system replaces the dominant text-in / text-out paradigm of machine translation with a four-stage agentic cycle (Identify -> Prompt -> Generate -> Verify), preceded by an interactive specification phase in which the user composes -- through model-assisted dialogue -- a structured translation brief grounded in skopos theory, register, audience, and genre conventions. The verification stage adopts the GEMBA-MQM error-span protocol (Kocmi & Federmann, 2023) for evidence-grounded scoring, and document-level coherence is preserved through a DelTA-lite memory of proper nouns and a running bilingual summary, after Wang et al. (2025). We describe the philosophical motivation, the architectural commitments, the four reference-material categories the system consumes, and the principal design tensions the architecture makes explicit. Empirical validation is left for future work; the contribution here is conceptual and architectural -- an executable embodiment of the position that translation in the GenAI era is communication design, not text conversion.

Chinese Translation

我们提出了Agentic AI Translate，这是一种代理翻译原型，具体化了Yamada（即将出版）的论点——翻译研究的元语言已成为生成性人工智能的指令代码。该系统用一个四阶段的代理循环（识别 -> 提示 -> 生成 -> 验证）取代了机器翻译中主导的文本输入/文本输出范式，前面是一个互动规范阶段，在此阶段，用户通过模型辅助对话撰写一个基于目的论、语域、受众和体裁惯例的结构化翻译简报。验证阶段采用GEMBA-MQM错误跨度协议（Kocmi & Federmann, 2023）进行基于证据的评分，并通过Wang等人（2025）提出的DelTA-lite记忆保持专有名词的文档级一致性和持续的双语摘要。我们描述了哲学动机、架构承诺、系统所需的四类参考材料，以及架构所显露的主要设计张力。实证验证留待未来工作；这里的贡献是概念性和架构性的——一个可执行的体现，表明在生成性人工智能时代，翻译是交流设计，而非文本转换。


cs.CL / 29 / 2605.17079

Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

大型语言模型能像消费者一样思考吗？基于ConsumerSimBench的群体反应重建基准测试

Wang, Tianyu, Li, Jiajun, Lin, Jianghao

Abstract

LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate-reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.

Chinese Translation

大型语言模型（LLMs）越来越多地被用作“数字消费者”，以模拟公众舆论、预先测试市场决策并预测观众反应。然而，现有的评估很少询问模型是否能够重建真实消费者在公共话语中表现出的具体反应模式。我们介绍了ConsumerSimBench，这是一个基于1553个真实中国社交媒体话题和23122个经过规则审核的原子标准构建的基准，涵盖了四种反应类型。与其通过整体偏好评判来评分开放式生成，ConsumerSimBench将每个任务分解为可审核的是非决策，针对具体反应点的三位评审一致性从65.8%提高到92.1%，而逐点评审决策与人类多数标签之间的一致性达到98.4%。在13个前沿生成器中，最强的模型Gemini-3.1-Pro仅覆盖了47.8%的真实反应标准，而尽管GPT-5.2和Claude-4.6在技术基准上表现强劲，但仍远远落后。这些失败揭示了技术基准性能与社会基础消费者直觉之间的显著差距。直接的结构化推理提示降低了覆盖率，而生成-反思的多智能体管道则使MiMo-V2.5-Pro在一个子集上的覆盖率从32.9%提高到37.6%。ConsumerSimBench将消费者模拟重新框架为对真实公共话语反应的预测问题，显示出前沿的LLMs在可靠预测消费者在高语境中国消费者话语中真正关心的内容方面仍然相距甚远。


cs.CL / 30 / 2605.17088

ACIL: Auto Chain of Thoughts for In-Context Learning

ACIL：用于上下文学习的自动思维链

Chu, Rui

Abstract

Recent advances in large language models (LLMs) have shown that Chain-of-Thought (CoT) reasoning can substantially improve performance on complex reasoning tasks. At the same time, In-Context Learning (ICL) has become an important mechanism for adapting LLMs to new tasks without updating model parameters, using only examples provided in the prompt. However, standard ICL often struggles on tasks that require multi-step reasoning, because the demonstrations usually contain only input-output pairs and lack explicit intermediate reasoning steps. This paper introduces an Automatic Chain-of-Thought (Auto-CoT) framework to improve ICL by automatically constructing reasoning-enhanced demonstrations. Auto-CoT generates reasoning chains for input-output examples, augments the prompt context with structured intermediate explanations, and removes irrelevant or low-quality demonstrations through a systematic selection process. By incorporating high-quality reasoning examples into the ICL prompt, Auto-CoT guides the model toward more reliable reasoning and improves prediction accuracy. Experiments across multiple reasoning tasks demonstrate that the proposed framework improves ICL performance by providing explicit intermediate reasoning guidance.

Chinese Translation

近期大型语言模型（LLMs）的进展表明，思维链（Chain-of-Thought, CoT）推理能够显著提升复杂推理任务的表现。同时，上下文学习（In-Context Learning, ICL）已成为将LLMs适应新任务的重要机制，无需更新模型参数，仅使用提示中提供的示例。然而，标准的ICL在需要多步推理的任务上往往表现不佳，因为演示通常仅包含输入-输出对，缺乏明确的中间推理步骤。本文提出了一种自动思维链（Automatic Chain-of-Thought, Auto-CoT）框架，通过自动构建增强推理的演示来改善ICL。Auto-CoT为输入-输出示例生成推理链，用结构化的中间解释增强提示上下文，并通过系统选择过程去除无关或低质量的演示。通过将高质量的推理示例纳入ICL提示，Auto-CoT引导模型朝向更可靠的推理，并提高预测准确性。在多个推理任务中的实验表明，所提出的框架通过提供明确的中间推理指导来提升ICL性能。


cs.CL / 31 / 2605.17101

SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

SEMA-RAG：一种自我演化的多智能体检索增强生成框架用于医学推理

Huang, Yongfeng, Chen, Ruiying, Cheng, James

Abstract

Retrieval-Augmented Generation (RAG) is widely employed to mitigate risks such as hallucinations and knowledge obsolescence in medical question answering, yet its predominantly single-round, static retrieval paradigm misaligns with the multi-stage process of clinical reasoning. This compressed workflow induces two structural deficiencies: question-to-query translation often lacks clinically grounded semantic interpretation, and retrieval lacks iterative sufficiency feedback, making it difficult to form reliable evidence chains. We argue that both issues stem from a deeper cause: overloading a single reasoning chain with heterogeneous tasks of interpretation, exploration, and adjudication. The remedy is to reconstruct the workflow via task decoupling and dynamic multi-round exploration. To this end, we propose SEMA-RAG, a Self-Evolving Multi-Agent RAG framework for medical question answering, which assigns these roles to three specialist agents: the Interpreter Agent for clinical schema interpretation, the Explorer Agent for sufficiency-driven self-evolving retrieval, and the Arbiter Agent for evidence adjudication and answer selection. Across five benchmarks and five LLM backbones, SEMA-RAG improves the strongest baseline by +6.46 accuracy points on average, measured per backbone.

Chinese Translation

检索增强生成（RAG）被广泛应用于减轻医学问答中的幻觉和知识过时等风险，然而其主要的单轮静态检索范式与临床推理的多阶段过程不符。这种压缩的工作流程导致了两个结构性缺陷：问题到查询的翻译往往缺乏临床基础的语义解释，而检索缺乏迭代充分性反馈，使得形成可靠的证据链变得困难。我们认为这两个问题源于一个更深层次的原因：将异构任务的解释、探索和裁决过载到单一推理链上。解决方案是通过任务解耦和动态多轮探索重构工作流程。为此，我们提出了SEMA-RAG，一种用于医学问答的自我演化多智能体RAG框架，它将这些角色分配给三个专业代理：解释代理（Interpreter Agent）负责临床模式解释，探索代理（Explorer Agent）负责充分性驱动的自我演化检索，以及裁决代理（Arbiter Agent）负责证据裁决和答案选择。在五个基准测试和五个大型语言模型（LLM）基础上，SEMA-RAG在每个基础模型上平均提高了最强基线6.46个准确率点。


cs.CL / 32 / 2605.17106

HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools

HyDRA：异构大语言模型池的混合动态路由架构

Garg, Aashna, Roy, Siddharth Singha, Jang, Jinu, Brancasi, Federico, Fu, Shengyu

Abstract

Production LLM deployments increasingly maintain heterogeneous model pools spanning order-of-magnitude cost differences. Existing routers make binary strong-vs-weak decisions and couple learned parameters to specific model identities, requiring retraining whenever the catalog changes. We present HyDRA (Hybrid Dynamic Routing Architecture), a framework that predicts fine-grained, multi-dimensional capability requirements per query and matches them against configuration-defined model profiles via shortfall matching. A ModernBERT encoder with K=4 independent sigmoid heads scores each query along reasoning, code generation, debugging, and tool use; a shortfall-matching algorithm then selects the cheapest model whose capabilities meet the predicted requirements. The deployed predictor runs at 86 ms median CPU inference latency in production, and is fully decoupled from the model catalog -- adding or removing models requires only a configuration change, with zero retraining. On SWE-Bench Verified (5-model pool: GPT-5.4-mini, Claude Haiku 4.5, GPT-5.3 Codex, Claude Sonnet 4.6, GPT-5.4), HyDRA's tunable shortfall threshold spans three regimes: peak-quality exceeds the always-strong Claude Sonnet 4.6 baseline (75.4% vs. 74.2% resolution) at 12.9% cost savings; iso-quality matches Sonnet at 54.1% cost savings, a 6x improvement over our prior in-house binary router at 9.1%; aggressive pushes savings to 72.5% for a 3.2-point quality trade. Results generalize across LiveCodeBench, BigCodeBench, and tau-bench. HyDRA is deployed to all users in GitHub Copilot's VS Code Chat auto-mode and -- to our knowledge for the first time in the LLM routing literature -- demonstrates language-invariant routing across CJK, European, and other script families.

Chinese Translation

生产环境中的大语言模型（LLM）部署越来越多地维护跨越数量级成本差异的异构模型池。现有的路由器做出强弱二元决策，并将学习到的参数与特定模型身份耦合，这要求在目录变化时进行重新训练。我们提出了HyDRA（混合动态路由架构），这是一个框架，能够预测每个查询的细粒度、多维能力需求，并通过短缺匹配将其与配置定义的模型特征相匹配。一个具有K=4个独立sigmoid头的ModernBERT编码器对每个查询在推理、代码生成、调试和工具使用等方面进行评分；然后，短缺匹配算法选择能力满足预测需求的最便宜模型。部署的预测器在生产环境中以86毫秒的中位CPU推理延迟运行，并与模型目录完全解耦——添加或删除模型只需更改配置，无需重新训练。在SWE-Bench Verified（5模型池：GPT-5.4-mini、Claude Haiku 4.5、GPT-5.3 Codex、Claude Sonnet 4.6、GPT-5.4）上，HyDRA的可调短缺阈值涵盖三个领域：在节省12.9%成本的情况下，峰值质量超过始终强大的Claude Sonnet 4.6基线（75.4%对74.2%解析度）；等质量匹配Sonnet时节省54.1%成本，相较于我们之前的内部二元路由器提高了6倍（9.1%）；激进的推送将节省提高到72.5%，以换取3.2分的质量折衷。结果在LiveCodeBench、BigCodeBench和tau-bench中具有普遍性。HyDRA已在GitHub Copilot的VS Code Chat自动模式下部署给所有用户，并且——据我们所知，这是LLM路由文献中的首次——展示了跨CJK、欧洲及其他文字家族的语言不变路由。
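
The routing rule in the abstract above (predict per-query capability requirements, then pick the cheapest model whose profile meets them within a tunable shortfall threshold) can be sketched as below. This is a hedged illustration, not HyDRA's code: the summed-deficit definition of shortfall, the fallback rule, the placeholder model names, and every cost and capability number are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

# The four capability dimensions named in the abstract.
DIMENSIONS = ("reasoning", "code_generation", "debugging", "tool_use")


@dataclass(frozen=True)
class ModelProfile:
    """A configuration-defined profile; adding a model means adding one of these."""
    name: str
    cost: float
    capability: Dict[str, float]


def shortfall(required: Dict[str, float], profile: ModelProfile) -> float:
    """Total unmet requirement across dimensions (one possible definition)."""
    return sum(max(0.0, required[d] - profile.capability[d]) for d in DIMENSIONS)


def route(required: Dict[str, float], pool: List[ModelProfile], threshold: float = 0.0) -> ModelProfile:
    """Cheapest model within the shortfall threshold, else the closest match."""
    eligible = [m for m in pool if shortfall(required, m) <= threshold]
    if eligible:
        return min(eligible, key=lambda m: m.cost)
    return min(pool, key=lambda m: shortfall(required, m))


def uniform(value: float) -> Dict[str, float]:
    return {d: value for d in DIMENSIONS}


# Placeholder pool with made-up numbers.
pool = [
    ModelProfile("small", cost=1.0, capability=uniform(0.4)),
    ModelProfile("medium", cost=5.0, capability=uniform(0.7)),
    ModelProfile("large", cost=20.0, capability=uniform(0.9)),
]

hard_query = {**uniform(0.5), "reasoning": 0.8}  # stand-in for the predictor's output
print(route(hard_query, pool, threshold=0.0).name)   # large
print(route(hard_query, pool, threshold=0.15).name)  # medium
```

Raising the threshold trades a small capability deficit for a cheaper model, which is the kind of dial the abstract's three operating regimes describe. Because the pool is plain configuration, changing it requires no change to the predictor.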


cs.CL / 33 / 2605.17113

The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning

无法回头的时刻：语言模型推理中欺骗承诺的反事实定位

Merrill, Scott, Srivastava, Shashank

Abstract

Existing deception datasets label completed outputs as honest or deceptive, treating deception as a property of the final response rather than a function of the model's reasoning trace. This obscures a more fundamental question: when does a language model become committed to deception? We introduce counterfactual localization: for each sentence prefix in a reasoning trace, we fix the prefix, resample continuations, and estimate the probability of a deceptive outcome. To scale this, we construct five environments (spanning strategic bluffing, maze guidance, financial advice, used-car sales, and offer negotiation) in which deception is never prompted but emerges from strategic incentives and labels follow mechanically from environment state rather than subjective human judgment. The resulting corpus localizes $\sim$1.46M sentences across four reasoning models, drawn from over 94.1M sampled continuations, 91.5B generated tokens, and over 100K scenarios. Sentence-level human evaluation confirms that detected commitment points correspond to interpretable shifts in decision state. Using this resource, we show that lexical cues for commitment prediction transfer poorly across environments, whereas attention-based transition features generalize out of distribution, suggesting that deceptive commitment is reflected in reusable changes in reasoning dynamics rather than surface form. We further identify compact attention-head sets (under 10% of heads) that, selected on one environment, causally suppress deceptive commitment across held-out environments. We release the corpus as a substrate for studying deception, and more broadly commitment, in language-model reasoning.

Chinese Translation

现有的欺骗数据集将完成的输出标记为诚实或欺骗，视欺骗为最终响应的属性，而非模型推理轨迹的功能。这掩盖了一个更根本的问题：语言模型何时开始承诺于欺骗？我们引入了反事实定位：对于推理轨迹中的每个句子前缀，我们固定前缀，重新抽样后续内容，并估计欺骗结果的概率。为了扩展这一方法，我们构建了五个环境（涵盖战略虚张声势、迷宫引导、财务建议、二手车销售和报价谈判），在这些环境中，欺骗并不被直接引导，而是源于战略激励，标签则机械地从环境状态中得出，而非主观的人类判断。生成的语料库在四个推理模型中定位了约146万句，来自超过9410万的抽样后续内容、915亿生成的标记和超过10万个场景。句子级别的人类评估确认，检测到的承诺点对应于决策状态的可解释变化。利用这一资源，我们展示了承诺预测的词汇线索在不同环境间的迁移效果较差，而基于注意力的转变特征则能在分布外进行泛化，这表明欺骗承诺反映在推理动态的可重用变化中，而非表面形式。我们进一步识别出紧凑的注意力头集合（少于10%的头），在一个环境中选择后，能够因果性地抑制在保留环境中的欺骗承诺。我们发布该语料库，以作为研究语言模型推理中欺骗及更广泛的承诺的基础。


cs.CL / 34 / 2605.17152

Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages

多语言和多模态大语言模型的实践：为低资源语言构建

Alam, Firoj, Chowdhury, Shammur Absar, Prince, Enamul Hoque

Abstract

Multimodal LLMs are evolving from vision-language models to tri-modal systems that see, hear, and read, yet pipelines and benchmarks remain English-centric and compute-heavy. This tutorial offers an overview of this emerging research area, multilingual multimodality across text, speech, and vision under limited data and compute budgets, synthesizing foundations, recent multilingual models (PALO, Maya), and speech-text LLMs. We cover low-cost data creation and curation; adapter stacks for tri-modal alignment; culture-aware evaluation beyond English; and hands-on resources for fine-tuning a compact multilingual VLM and wiring a speech->text->LLM pipeline. The content will be delivered as an interactive half-day tutorial designed for researchers and practitioners working on multilingual, multimodal AI in low-resource language settings.

Chinese Translation

多模态大语言模型正从视觉-语言模式演变为三模态，能够看、听和读，但现有的流程和基准仍然以英语为中心且计算资源消耗较大。本教程概述了在有限数据/计算预算下，针对文本、语音和视觉的多语言多模态这一新兴研究领域，综合了基础知识、近期的多语言模型（PALO、Maya）以及语音-文本大语言模型。我们将讨论低成本数据创建/整理；用于三模态对齐的适配器堆栈；超越英语的文化敏感评估，以及微调紧凑型多语言视觉语言模型（VLM）和构建语音->文本->大语言模型（LLM）流程的实用资源。内容将以互动半天的教程形式呈现，旨在为在低资源语言环境中从事多语言、多模态人工智能研究和实践的研究人员和从业者提供支持。


cs.CL / 35 / 2605.17173

Why Do Safety Guardrails Degrade Across Languages?

为什么安全护栏在不同语言中会退化？

Zhang, Max, Patel, Ameen, Truong, Sang T., Koyejo, Sanmi

Abstract

Large language models exhibit safety degradation in non-English languages. Standard evaluation relies on Jailbreak Success Rate (JSR), which conflates several safety-driving factors into a single number, obscuring the specific cause(s) of safety failure. We introduce a latent variable model, a Multi-Group Item Response Theory (IRT) framework, that decouples safety-driving factors such as language-agnostic safety robustness ($\theta$), intrinsic prompt hardness ($\beta$), global language processing difficulty ($\gamma$), and a prompt-specific cross-lingual safety gap ($\tau$). Using the MultiJail dataset, we evaluate the safety robustness of 61 model configurations across 5 closed-model families and 10 languages of varying resource levels, aggregating a dataset of 1.9 million rows. Exploratory Factor Analysis shows safety is primarily unidimensional: models refuse different harm types mainly through a shared mechanism. Contrary to the expected trend that safety degrades largely in low-resource languages, 22 model configurations are more vulnerable in English than in low-resource languages. Low-resource languages produce more uncertain responses (high entropy) than high-resource languages. Also, high-$\tau$ prompts cluster in physical harm categories like Theft and Weapons and lower-resource languages, trends validated through cross-dataset generalization. While global translation quality shows low correlation with $\tau$, severe mistranslations drive high-bias outliers, as validated by native speakers. Cultural and conceptual grounding mismatches also contribute to $\tau$. In predictive validation, the IRT framework achieves $\mathrm{AUC} = 0.940$, outperforming simpler baselines in predicting safe refusal of unsafe prompts. Our framework reveals concept-language vulnerabilities that aggregate metrics obscure, enabling fairer cross-lingual safety evaluation and targeted improvements in dataset construction.
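The abstract names four latent factors but does not state how they combine. As a rough sketch, the snippet below assumes an additive logistic form, which is common in IRT models; both the additive structure and the sign conventions are assumptions for illustration and may differ from the paper's actual parameterization.

```python
import math


def p_safe_refusal(theta: float, beta: float, gamma: float, tau: float) -> float:
    """Probability of a safe refusal under an ASSUMED additive logistic form.

    theta: language-agnostic safety robustness of the model
    beta:  intrinsic hardness of the prompt
    gamma: global processing difficulty of the language
    tau:   prompt-specific cross-lingual safety gap
    Higher theta raises the probability; higher beta, gamma, or tau lower it.
    """
    logit = theta - (beta + gamma + tau)
    return 1.0 / (1.0 + math.exp(-logit))
```

Under this assumed form, a single JSR number mixes all four terms, whereas fitting them separately lets a high-$\tau$ prompt be distinguished from a generally hard prompt (high $\beta$) or a generally hard language (high $\gamma$).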

Chinese Translation

大型语言模型在非英语语言中表现出安全性退化。标准评估依赖于越狱成功率（Jailbreak Success Rate, JSR），该指标将多个影响安全的因素混合为一个，模糊了安全失败的具体原因。我们引入了一种潜变量模型，即多组项目反应理论（Multi-Group Item Response Theory, IRT）框架，该框架将影响安全的因素解耦，包括语言无关的安全鲁棒性（$\theta$）、内在提示难度（$\beta$）、全局语言处理难度（$\gamma$）以及特定提示的跨语言安全差距（$\tau$）。利用MultiJail数据集，我们评估了61种模型配置在5个封闭模型家族和10种不同资源水平语言中的安全鲁棒性，汇总了190万行的数据。探索性因子分析表明，安全性主要是一维的：模型主要通过共享机制拒绝不同类型的伤害。与安全性主要在低资源语言中退化的预期趋势相反，22种模型配置在英语中的脆弱性高于低资源语言。低资源语言产生的响应不确定性（高熵）高于高资源语言。此外，高$\tau$的提示集中在诸如盗窃和武器等物理伤害类别以及低资源语言中，这一趋势通过跨数据集的泛化得到了验证。虽然整体翻译质量与$\tau$的相关性较低，但严重的误译导致了高偏差的异常值，这一点得到了母语者的验证。文化和概念基础的不匹配也对$\tau$产生了影响。在预测验证中，IRT框架的AUC值为0.940，在预测对不安全提示的安全拒绝方面优于更简单的基线模型。我们的框架揭示了被聚合指标掩盖的概念-语言脆弱性，从而实现了更公平的跨语言安全评估和数据集构建的针对性改进。


cs.CL / 36 / 2605.17187

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

PluRule：社交媒体上多元化社区管理的基准测试

Kachwala, Zoher, Truong, Bao Tran, Muralidharan, Rasika, Kwak, Haewoon, An, Jisun, Menczer, Filippo

Abstract

Social media are shifting towards pluralism -- community-governed platforms where groups define their own norms. What violates rules in one community may be perfectly acceptable in another. Can AI models help moderate such pluralistic communities? We formalize the task as a multiple-choice problem, mirroring how human moderators operate in the real world: given a comment and its surrounding context, identify which specific rule, if any, is violated. We introduce PluRule, a multimodal, multilingual benchmark for detecting 13,371 rule violations across 1,989 Reddit communities spanning 2,885 rules in 9 languages. Using this benchmark, we show that state-of-the-art vision-language models struggle significantly: even GPT-5.2 with high reasoning performs only slightly better than a trivial baseline. We also find that bigger models and increased context provide marginal gains, and universal rules like civility and self-promotion are easier to detect. Our results show that moderation of pluralistic communities on social media is a fundamental challenge for language models. Our code and benchmark are publicly available.

Chinese Translation

社交媒体正朝着多元化发展——由社区管理的平台，群体自行定义其规范。在一个社区中违反的规则在另一个社区可能是完全可以接受的。人工智能模型能否帮助管理这样的多元化社区？我们将这一任务形式化为一个多项选择问题，模拟人类管理员在现实世界中的操作：给定一条评论及其周围上下文，识别出违反的具体规则（如果有的话）。我们推出了PluRule，这是一个多模态、多语言的基准，旨在检测在9种语言中跨越1,989个Reddit社区的13,371条规则违反情况。利用这一基准，我们展示了最先进的视觉-语言模型面临显著挑战：即使是具有高推理能力的GPT-5.2，其表现也仅略优于一个简单的基线。我们还发现，模型规模增大和上下文增加带来的收益微乎其微，而诸如文明和自我推广等普遍规则更容易被检测到。我们的结果表明，社交媒体上多元化社区的管理是语言模型面临的一个根本性挑战。我们的代码和基准测试已公开可用。


cs.CL / 37 / 2605.17205

LLMs for automatic annotation of Mandarin narrative transcripts

用于自动注释普通话叙事转录的语言模型

Zhao, Qingwen, Zhu, Hongao, He, Yunqi, Wang, Rui, Huang, Aijun, Hu, Hai

Abstract

Linguistic annotation of transcribed speech is essential for research in language acquisition, language disorders, and sociolinguistics, yet remains labor-intensive and time-consuming. While Large Language Models (LLMs) have shown promise in automating annotation tasks, their ability to handle complex discourse-level annotation in non-English languages remains understudied. This study evaluates whether LLMs can reliably annotate narrative macrostructure (the hierarchical organization of story grammar elements) in spoken Mandarin, using the Multilingual Assessment Instrument for Narratives (MAIN) as a testbed. We compared four LLMs against trained human annotators on narratives produced by children, young adults, and older adults. The best-performing model achieved agreement with human raters (k=.794) approaching human-human reliability levels (k=.872) while reducing annotation time by 65%, whereas the locally deployable lightweight model performed substantially worse. Annotation difficulty varied systematically by macrostructure element type, with categories requiring subtle semantic differentiation posing persistent challenges. Furthermore, model reliability decreased on young adult narratives, which exhibited greater lexical variation, semantic ambiguity, and multi-element integration within single utterances. These findings suggest that LLMs can effectively support discourse-level annotation in non-English spoken corpora, while highlighting the continued need for human oversight in semantically complex tasks. Our prompt templates are open sourced for future use.
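The agreement figures in this abstract are reported as k. Assuming these are Cohen's kappa values for two raters (the abstract does not name the variant), the statistic can be computed as in the sketch below; the label strings in the usage note are illustrative placeholders rather than data from the study.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's own
    label frequencies.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, `cohen_kappa(["goal", "goal", "attempt", "attempt"], ["goal", "attempt", "attempt", "attempt"])` has 75% raw agreement but only 0.5 kappa, because half of that agreement is expected by chance. This is why kappa rather than raw agreement is the usual basis for comparing model-human with human-human reliability.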

Chinese Translation

转录语音的语言注释对于语言习得、语言障碍和社会语言学的研究至关重要，但仍然是劳动密集型且耗时的过程。尽管大型语言模型（LLMs）在自动化注释任务方面展现出潜力，但它们在处理非英语语言中的复杂话语级注释的能力仍然未得到充分研究。本研究评估了LLMs是否能够可靠地注释叙事宏观结构——故事语法元素的层次组织——在口语普通话中，使用多语言叙事评估工具（MAIN）作为测试平台。我们将四个LLMs与经过培训的人类注释者进行了比较，分析了儿童、年轻成年人和老年人所产生的叙事。表现最佳的模型与人类评分者的协议达到了（k=.794），接近人类之间的可靠性水平（k=.872），同时将注释时间减少了65%，而本地可部署的轻量级模型表现明显较差。注释难度在宏观结构元素类型上系统性变化，要求细微语义区分的类别持续面临挑战。此外，模型在年轻成年人叙事上的可靠性下降，这些叙事表现出更大的词汇变异、语义模糊性和单个话语内的多元素整合。这些发现表明，LLMs可以有效支持非英语口语语料库中的话语级注释，同时强调在语义复杂任务中仍需人类监督。我们的提示模板已开源以供未来使用。


cs.CL / 38 / 2605.17228

Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making

人工不容忍：临床文档中的污名化语言扭曲大型语言模型的决策

Huang, Jen-tse, Zhou, Didi, Kamau, Faith, Oh, Amy, Links, Anne R., Dredze, Mark, Beach, Mary Catherine, Saha, Somnath

Abstract

Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as clinical decision support and medical documentation. However, the robustness of these models against subtle linguistic variations, specifically stigmatizing language (SL) commonly found in human-authored clinical notes, remains critically under-explored. In this work, we investigate whether frontier LLMs inherit and propagate this human bias when processing clinical text. We systematically evaluate nine frontier LLMs across four stigmatized medical conditions, utilizing clinical vignettes injected with varying intensities and phenotypes of SL (doubt, blame, and maligning). Our results demonstrate that all evaluated models exhibit substantial bias, with clinical decision-making significantly skewed towards less aggressive patient management. Notably, we observe a high sensitivity to linguistic framing, where a single SL sentence is sufficient to alter model outputs, revealing a clear dose-response relationship. Furthermore, we evaluate standard prompt-based mitigation strategies, including Chain-of-Thought (CoT) reasoning and model self-debiasing. These approaches show limited efficacy; models struggle to explicitly identify SL while remaining implicitly influenced by it. Our findings expose a critical vulnerability in current LLMs regarding fairness and robustness in clinical NLP, underscoring the need for rigorous algorithmic guardrails to prevent the automation of health disparities.

Chinese Translation

大型语言模型（LLMs）越来越多地应用于临床决策支持和医疗文档等高风险领域。然而，这些模型对细微语言变异的鲁棒性，特别是在人类撰写的临床笔记中常见的污名化语言（SL），仍然未得到充分探索。在本研究中，我们调查了前沿LLMs在处理临床文本时是否继承和传播这种人类偏见。我们系统地评估了九个前沿LLMs在四种被污名化的医疗状况下的表现，利用注入不同强度和表型的SL（怀疑、指责和恶意）的临床案例。我们的结果表明，所有评估的模型均表现出显著的偏见，临床决策明显倾向于较少激进的患者管理。值得注意的是，我们观察到对语言框架的高度敏感性，其中单个SL句子足以改变模型输出，揭示出明显的剂量-反应关系。此外，我们评估了标准的基于提示的缓解策略，包括思维链（Chain-of-Thought, CoT）推理和模型自我去偏见。这些方法显示出有限的有效性；模型在明确识别SL方面存在困难，同时仍受到其隐性影响。我们的研究结果揭示了当前LLMs在临床自然语言处理中的公平性和鲁棒性方面的关键脆弱性，强调了需要严格的算法防护措施，以防止健康差异的自动化。


cs.CL / 39 / 2605.17283

OProver: A Unified Framework for Agentic Formal Theorem Proving

OProver：一个统一的代理形式定理证明框架

Ma, David, Ma, Kaijing, Guo, Shawn, Shi, Yunfeng, Zhao, Enduo, Shi, Jiajun, Zhang, Zhaoxiang, Cheung, Gavin, Liu, Jiaheng, Wang, Zili

Abstract

Recent progress in formal theorem proving has benefited from large-scale proof generation and verifier-aware training, but agentic proving is rarely integrated into prover training, appearing only at inference time. We present OProver, a unified framework for agentic formal theorem proving in Lean 4, in which failed proof attempts are iteratively revised using retrieved compiler-verified proofs and Lean compiler feedback. OProver is trained through continued pretraining followed by iterative post-training: each iteration runs agentic proving, indexes newly verified proofs into OProofs and the retrieval memory, uses repair trajectories as SFT data, and uses unresolved hard cases for RL. OProofs is built from public Lean resources, large-scale proof synthesis, and agentic proving traces, containing 1.77M Lean statements, 6.86M compiler-verified proofs, and serialized trajectories with retrieved context, failed attempts, feedback, and repairs. Across five benchmarks, OProver-32B attains the best Pass@32 on MiniF2F (93.3%), ProverBench (58.2%), and PutnamBench (11.3%), and ranks second on MathOlympiad (22.8%) and ProofNet (33.2%), more top placements than any prior open-weight whole-proof prover.

Chinese Translation

近期在形式定理证明领域的进展得益于大规模的证明生成和验证器感知训练，但代理证明很少被整合到证明器训练中，仅在推理时出现。我们提出了OProver，这是一个在Lean 4中用于代理形式定理证明的统一框架，其中失败的证明尝试通过检索的编译器验证证明和Lean编译器反馈进行迭代修正。OProver通过持续的预训练和迭代后训练进行训练：每次迭代运行代理证明，将新验证的证明索引到OProofs和检索记忆中，使用修复轨迹作为SFT数据，并利用未解决的难题进行强化学习（RL）。OProofs由公共Lean资源、大规模证明合成和代理证明轨迹构建，包含1.77M个Lean语句、6.86M个编译器验证的证明，以及包含检索上下文、失败尝试、反馈和修复的序列化轨迹。在五个基准测试中，OProver-32B在MiniF2F（93.3%）、ProverBench（58.2%）和PutnamBench（11.3%）上获得了最佳的Pass@32，并在MathOlympiad（22.8%）和ProofNet（33.2%）上排名第二，获得的顶级名次超过了任何先前的开放权重全证明证明器。


cs.CL / 40 / 2605.17301

ConflictRAG: Detecting and Resolving Knowledge Conflicts in Retrieval Augmented Generation

ConflictRAG：检测和解决检索增强生成中的知识冲突

Wang, Chenyu, Liu, Yingmin, Shu, Yang

Abstract

Retrieval-Augmented Generation (RAG) systems implicitly assume mutual consistency among retrieved documents -- an assumption that frequently fails in practice. We present ConflictRAG, a conflict-aware RAG framework that detects, classifies, and resolves knowledge conflicts prior to answer generation. The framework introduces three contributions: (1) a two-stage conflict detection module combining a lightweight embedding-based MLP classifier with selective LLM refinement, reducing API costs by 62% while maintaining 90.8% detection accuracy; (2) an Entropy-TOPSIS framework for data-driven source credibility assessment, improving selection accuracy by 7.1% over manual heuristics; and (3) a Conflict-Aware RAG Score (CARS) for diagnostic evaluation of conflict-handling capabilities. Experiments on three benchmarks against six baselines demonstrate 88.7% conflict-detection F1 and consistent 5.3--6.1% correctness gains over the strongest conflict-aware baseline, with the pipeline transferring effectively across backbone LLMs.
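Entropy-TOPSIS generally refers to weighting criteria by the entropy method and then ranking alternatives with TOPSIS. The sketch below shows that generic recipe in plain Python. It is not ConflictRAG's implementation: the criteria (imagined here as per-source scores such as authority and recency), the restriction to benefit-type criteria, and the requirement of strictly positive values are all assumptions for illustration.

```python
import math
from typing import List


def entropy_weights(matrix: List[List[float]]) -> List[float]:
    """Entropy-method weights: columns that vary more across sources weigh more.

    matrix[i][j] is the (positive) score of source i on criterion j;
    at least two sources are required.
    """
    n, m = len(matrix), len(matrix[0])
    diversity = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        entropy = -sum((x / total) * math.log(x / total) for x in column if x > 0)
        entropy /= math.log(n)  # normalise to [0, 1]
        diversity.append(max(0.0, 1.0 - entropy))
    spread = sum(diversity)
    if spread <= 1e-12:  # every criterion is constant: fall back to uniform
        return [1.0 / m] * m
    return [d / spread for d in diversity]


def topsis_scores(matrix: List[List[float]], weights: List[float]) -> List[float]:
    """Closeness of each source to the ideal solution (1.0 is best)."""
    m = len(weights)
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(m)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(m)] for row in matrix]
    best = [max(row[j] for row in weighted) for j in range(m)]
    worst = [min(row[j] for row in weighted) for j in range(m)]
    scores = []
    for row in weighted:
        d_best = math.sqrt(sum((row[j] - best[j]) ** 2 for j in range(m)))
        d_worst = math.sqrt(sum((row[j] - worst[j]) ** 2 for j in range(m)))
        total = d_best + d_worst
        scores.append(d_worst / total if total > 0 else 0.5)
    return scores
```

The appeal of the entropy step is that the weights come from the data rather than from hand-set heuristics, which matches the "data-driven" framing in the abstract; a criterion on which all sources score alike gets a weight near zero.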

Chinese Translation

检索增强生成（RAG）系统隐含地假设检索到的文档之间是相互一致的——这一假设在实际应用中经常失效。我们提出了ConflictRAG，一种冲突感知的RAG框架，在答案生成之前检测、分类和解决知识冲突。该框架提出了三项贡献：（1）一个两阶段的冲突检测模块，结合了轻量级嵌入式多层感知器（MLP）分类器与选择性大语言模型（LLM）精炼，将API成本降低了62%，同时保持90.8%的检测准确率；（2）一个基于熵-拓扑排序（Entropy-TOPSIS）框架的数据驱动源可信度评估，提高了选择准确率7.1%，优于手动启发式方法；（3）一个冲突感知RAG评分（CARS），用于冲突处理能力的诊断评估。在三个基准测试上与六个基线进行的实验表明，冲突检测的F1值达到88.7%，并且在最强的冲突感知基线之上，准确性持续提高了5.3%至6.1%，该管道在不同的主干LLM之间有效迁移。


cs.CL / 41 / 2605.17314

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

通过不匹配的错误草稿实现弱到强的引导

Deng, Wei

Abstract

We consider whether off-policy experience from a smaller, weaker model can elicit capability in a stronger learner that on-policy RL fine-tuning (e.g., GRPO) does not reach. We find that injecting mathematically wrong drafts from a smaller but more domain-trained model -- mismatched to the current problem -- into a stronger learner's GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Concretely, we use Mathstral-7B as the learner, Qwen2.5-Math-1.5B as the draft model, 8.8K Level 3--5 MATH problems (with MATH-500 held out), and train with Dr. GRPO. Mismatch is an active ingredient: shuffling drafts to mismatched problems while holding everything else constant yields $+1.62$pp on MATH-500 (greedy pass@1) over the matched-wrong variant ($n=10$ seeds, $p=0.0015$, Welch's $t$). In fact, the mismatched-wrong variant leads all other variants we tested on MATH-500 across both greedy pass@1 and sampling pass@$k$. On out-of-distribution AIME 2025 and 2026, the mismatched-wrong variant uniquely lifts pass@$k$ above both Mathstral-7B (in its native [INST] format) and the Qwen2.5-Math-1.5B draft model at every sample budget from $k=1$ to $k=1024$ across 2 seeds ($+14.2$pp on 2025 and $+9.0$pp on 2026 at pass@1024 over Mathstral-7B), and at pass@1024 also leads no-draft, matched-wrong, and mismatched-correct variants on both years. All variants use the same prompt with no draft injection at test time. The recipe -- trained on a single GPU with no SFT, no reward models, no synthesized data, and no produce-critique-revise inner loop -- reaches 71.98% MATH-500 on Mathstral-7B-v0.1, the highest published result on this model to our knowledge, surpassing the heavier WizardMath pipeline at 70.9% on full MATH (SFT + PPO with process/instruction reward models).
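Results in this abstract are reported as pass@$k$ for sample budgets up to $k=1024$. The abstract does not say how pass@$k$ is estimated; a widely used choice is the unbiased estimator computed from $n \ge k$ samples per problem, of which $c$ are correct, sketched below as an assumption about the evaluation rather than a description of it.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct.

    Equals 1 - C(n - c, k) / C(n, k): one minus the probability that a
    uniformly random size-k subset of the n samples has no correct one.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is then the mean of `pass_at_k` over problems. For greedy pass@1 there is a single deterministic sample per problem, so the score reduces to plain accuracy.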

Chinese Translation

我们考虑来自较小、较弱模型的离策略经验是否能够引导更强学习者的能力，而这种能力是通过在策略强化学习（例如，GRPO）微调中无法达到的。我们发现，将来自一个较小但更具领域训练的模型的数学错误草稿——与当前问题不匹配——注入到更强学习者的GRPO上下文中，能够在保留的MATH-500和分布外的AIME 2025/2026上始终优于标准的在策略GRPO。具体而言，我们使用Mathstral-7B作为学习者，Qwen2.5-Math-1.5B作为草稿模型，8.8K个3级至5级的数学问题（保留MATH-500），并使用Dr. GRPO进行训练。不匹配是一个关键因素：在保持其他条件不变的情况下，将草稿随机打乱以适应不匹配的问题，在MATH-500上相较于匹配错误变体提高了$+1.62$pp（贪婪通过@1）（$n=10$种种子，$p=0.0015$，Welch's $t$）。事实上，不匹配错误变体在MATH-500的贪婪通过@1和采样通过@$k$上领先我们测试的所有其他变体。在分布外的AIME 2025和2026上，不匹配错误变体在每个样本预算从$k=1$到$k=1024$中，独特地将通过@$k$提升到高于Mathstral-7B（以其本地[INST]格式）和Qwen2.5-Math-1.5B草稿模型，2025年提高了$+14.2$pp，2026年提高了$+9.0$pp（在通过@1024时相较于Mathstral-7B），并且在通过@1024时也领先于无草稿、匹配错误和不匹配正确的变体。所有变体在测试时使用相同的提示，没有草稿注入。该方法——在单个GPU上训练，没有SFT、没有奖励模型、没有合成数据、没有生成-评估-修订的内部循环——在Mathstral-7B-v0.1上达到了71.98%的MATH-500，这是我们所知的该模型上发布的最高结果，超过了较重的WizardMath管道在完整MATH上达到的70.9%（SFT + PPO与过程/指令奖励模型）。


cs.CL / 42 / 2605.17342

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

及物性与循环性相遇：动态大语言模型对齐的显式偏好分解

Huang, Yucong, Li, Xiucheng, Zhao, Kaiqi, Li, Jing

Abstract

Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid Reward-Cyclic (HRC) model, which utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments further validate HRC's structural superiority in mixed transitive--cyclic settings, where HRC converges faster and achieves higher accuracy than GPM. Experiments on RewardBench 2 demonstrate that HRC consistently improves over both BT and GPM baselines (e.g., +1.23% on Gemma-2B-it). In particular, its superior performance in the Ties domain empirically validates the model's robustness in handling complex, non-strict preferences. Extensive downstream evaluations on AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench confirm the efficacy of our framework. Notably, when using Gemma-2B-it as the base preference model, HRC+DSPPO achieves a peak length-controlled win-rate of 44.75% on AlpacaEval 2.0 and 46.8% on Arena-Hard-v0.1, significantly outperforming SPPO baselines trained with BT or GPM. Our code is publicly available at https://github.com/lab-klc/Hybrid-Reward-Cyclic.
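To make the transitive/cyclic split concrete, the sketch below applies one standard decomposition to a small skew-symmetric preference-margin matrix: the transitive part is generated by per-response scalar scores, and the cyclic part is whatever remains. This is an illustrative assumption and not HRC's learned parameterization, which the abstract describes only as orthogonal scalar and vector components.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def decompose_preferences(margin: Matrix) -> Tuple[List[float], Matrix, Matrix]:
    """Split a skew-symmetric margin matrix into transitive and cyclic parts.

    margin[i][j] = -margin[j][i] says how strongly response i is preferred
    to response j. The transitive part has the form score[i] - score[j];
    the cyclic part is the residual, whose rows each sum to zero.
    """
    n = len(margin)
    scores = [sum(row) / n for row in margin]  # scalar reward per response
    transitive = [[scores[i] - scores[j] for j in range(n)] for i in range(n)]
    cyclic = [[margin[i][j] - transitive[i][j] for j in range(n)] for i in range(n)]
    return scores, transitive, cyclic
```

A rock-paper-scissors matrix such as `[[0, 1, -1], [-1, 0, 1], [1, -1, 0]]` yields all-zero scores, so the whole matrix lands in the cyclic part. A scalar reward model (as in Bradley-Terry) can only represent the transitive part, which is the limitation the abstract starts from.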

Chinese Translation

标准的强化学习人类反馈（RLHF）依赖于及物标量奖励，未能捕捉人类偏好的循环特性。尽管一些方法如一般偏好模型（General Preference Model, GPM）对此进行了处理，但我们识别出一个理论限制：它们的隐式表述将层次结构与循环性纠缠在一起，未能保证主导解的存在。为了解决这一问题，我们提出了混合奖励-循环（Hybrid Reward-Cyclic, HRC）模型，该模型利用博弈论分解将偏好显式地分解为正交的及物（标量）和循环（向量）成分。作为补充，我们引入了动态自我博弈偏好优化（Dynamic Self-Play Preference Optimization, DSPPO），将对齐视为一个时变博弈，以逐步引导策略朝向纳什均衡。合成数据实验进一步验证了HRC在混合及物-循环设置中的结构优势，其中HRC的收敛速度更快，准确性高于GPM。在RewardBench 2上的实验表明，HRC在BT和GPM基线之上持续改进（例如，在Gemma-2B-it上提高了1.23%）。特别是在平局领域，其卓越的表现实证验证了模型在处理复杂非严格偏好方面的鲁棒性。在AlpacaEval 2.0、Arena-Hard-v0.1和MT-Bench上的广泛下游评估确认了我们框架的有效性。值得注意的是，当使用Gemma-2B-it作为基础偏好模型时，HRC+DSPPO在AlpacaEval 2.0上达到了44.75%的峰值长度控制胜率，在Arena-Hard-v0.1上达到了46.8%，显著优于使用BT或GPM训练的SPPO基线。我们的代码已公开发布在https://github.com/lab-klc/Hybrid-Reward-Cyclic。


cs.CL / 43 / 2605.17348

Taming "Zombie" Agents: A Markov State-Aware Framework for Resilient Multi-Agent Evolution

驯服“僵尸”代理：一种基于马尔可夫状态的弹性多智能体演化框架

Zhang, Taolin, Zhao, Pukun, Chen, Qizhou, Wan, Jiuheng, Chen, Chen, He, Xiaofeng, Wang, Chengyu, Hong, Richang

Abstract

Recent advancements in LLM-based multi-agent systems have demonstrated remarkable collaborative capabilities across complex tasks. To improve overall efficiency, existing methods often rely on aggressive graph evolution among agents (e.g., node or edge pruning), which risks prematurely discarding valuable agents due to transient issues such as hallucinations or temporary knowledge gaps. However, such hard pruning overlooks the potential for "zombie" agents to recover and contribute in subsequent discussion rounds. In this paper, we propose AgentRevive, a Markov state-aware framework for resilient multi-agent evolution. Our approach dynamically manages agent collaboration through soft state transitions, implemented via two key components: (1) State-Aware Policy Learning: Agent states are divided into "Active", "Standby", and "Terminated" states, selectively propagating messages based on agent memory. The policy employs a risk estimator to optimize agent state transitions by assessing hallucination risk, minimizing the influence of unreliable nodes while safeguarding valuable ones. (2) State-Aware Edge Optimization: Subgraph edges are pruned according to states learned from the policy, permanently removing "Terminated" nodes and retaining "Standby" nodes for subsequent rounds to assess their potential future contributions. Extensive experiments on general reasoning, domain-specific, and hallucination challenge tasks show that our method consistently outperforms strong baselines and significantly reduces token consumption through state-aware agent scheduling.
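The three-state scheme can be pictured as a small state machine driven by a per-round hallucination-risk estimate. The sketch below is a hand-written stand-in for illustration only: AgentRevive learns its transition policy, whereas the fixed thresholds `standby_at` and `terminate_at` used here are hypothetical.

```python
from enum import Enum
from typing import Dict, List


class AgentState(Enum):
    ACTIVE = "Active"
    STANDBY = "Standby"
    TERMINATED = "Terminated"


def next_state(
    state: AgentState,
    risk: float,
    standby_at: float = 0.5,    # hypothetical threshold
    terminate_at: float = 0.9,  # hypothetical threshold
) -> AgentState:
    """One soft transition given an estimated hallucination risk in [0, 1]."""
    if state is AgentState.TERMINATED:
        return AgentState.TERMINATED  # removal is permanent
    if risk >= terminate_at:
        return AgentState.TERMINATED
    if risk >= standby_at:
        return AgentState.STANDBY     # muted this round, but kept
    return AgentState.ACTIVE          # stays active, or revives from standby


def speakers(states: Dict[str, AgentState]) -> List[str]:
    """Agents whose messages are propagated in the current round."""
    return [name for name, s in states.items() if s is AgentState.ACTIVE]
```

The difference from hard pruning is the `STANDBY` branch: an agent with a transient spike in risk stops sending messages, which saves tokens, yet it can return to `ACTIVE` in a later round once its risk estimate drops.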

Chinese Translation

近期基于大型语言模型（LLM）的多智能体系统在复杂任务中的协作能力取得了显著进展。为了提高整体效率，现有方法通常依赖于代理之间的激进图演化（例如，节点或边缘剪枝），这可能因幻觉或暂时的知识缺口等瞬时问题而过早地丢弃有价值的代理。然而，这种硬剪枝忽视了“僵尸”代理在后续讨论轮次中恢复和贡献的潜力。本文提出了AgentRevive，一种基于马尔可夫状态的弹性多智能体演化框架。我们的方法通过软状态转换动态管理代理协作，主要通过两个关键组件实现：（1）状态感知策略学习：代理状态分为“活跃”、“待命”和“终止”状态，基于代理记忆选择性传播消息。该策略采用风险评估器，通过评估幻觉风险来优化代理状态转换，最小化不可靠节点的影响，同时保护有价值的节点。（2）状态感知边优化：根据从策略中学习到的状态修剪子图边缘，永久移除“终止”节点，并保留“待命”节点以评估其未来潜在贡献。针对一般推理、特定领域和幻觉挑战任务的广泛实验表明，我们的方法始终优于强基线，并通过状态感知的代理调度显著减少了令牌消耗。


cs.CL / 44 / 2605.17352

AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering

AMATA：用于知识密集型问答的自适应多智能体轨迹对齐

Zhang, Taolin, Li, Dongyang, Chen, Chen, Chen, Qizhou, Wan, Jiuheng, He, Xiaofeng, Wang, Chengyu, Hong, Richang

Abstract

Despite substantial advances in large language models (LLMs), generating factually consistent responses for knowledge-intensive question answering remains challenging. These difficulties are primarily due to hallucinations and the limitations of LLMs in bridging long-tail knowledge gaps. To address this, we propose AMATA, an Adaptive Multi-Agent Trajectory Alignment framework that dynamically integrates external knowledge to improve response interpretability and factual grounding. Our architecture leverages six specialized agents that collaboratively perform structured actions for complex question reasoning. We formalize multi-agent collaboration with external tools as a trajectory preference alignment problem, incorporating question-aware agent customization and inter-agent preference harmonization. AMATA introduces two principal innovations: (1) Intra-Trajectory Preference Learning, which learns objective-oriented preferences to prioritize critical agents, and (2) Inter-Agent Dependency Learning, which captures cross-agent tool dependencies through a novel dependency-aware direct preference optimization technique. Empirical results show that AMATA consistently outperforms baseline approaches, knowledge-augmented frameworks, and LLM-based trajectory systems on five established knowledge-intensive QA benchmarks. Further analysis demonstrates the efficiency of our method in reducing token consumption.

Chinese Translation

尽管大型语言模型（LLMs）取得了显著进展，但为知识密集型问答生成事实一致的响应仍然具有挑战性。这些困难主要源于幻觉现象以及LLMs在弥补长尾知识差距方面的局限性。为了解决这一问题，我们提出了AMATA，一个自适应多智能体轨迹对齐框架，动态整合外部知识以提高响应的可解释性和事实基础。我们的架构利用六个专门的智能体，协同执行结构化操作以进行复杂的问题推理。我们将与外部工具的多智能体协作形式化为轨迹偏好对齐问题，结合了基于问题的智能体定制和智能体间偏好协调。AMATA引入了两个主要创新：（1）轨迹内偏好学习，学习以目标为导向的偏好以优先考虑关键智能体；（2）智能体间依赖学习，通过一种新颖的依赖感知直接偏好优化技术捕捉跨智能体工具依赖关系。实证结果表明，AMATA在五个已建立的知识密集型问答基准上，始终优于基线方法、知识增强框架和基于LLM的轨迹系统。进一步分析表明我们的方法在减少令牌消耗方面的效率。


cs.CL / 45 / 2605.17359

Learning Transferable Topology Priors for Multi-Agent LLM Collaboration Across Domains

学习可转移拓扑先验以实现跨领域的多智能体大语言模型协作

Zhang, Taolin, Zhou, Zijie, Wan, Jiuheng, Hu, Tingyuan, Wang, Chengyu, He, Xiaofeng, Hong, Richang

Abstract

Large language model (LLM)-based multi-agent systems have shown strong potential for complex reasoning by coordinating specialized agents through structured communication. However, existing topology-evolution methods typically construct or optimize a collaboration topology for each query from scratch, leading to substantial online search overhead, high inference-time token consumption, and limited scalability in multi-domain settings. We propose TopoPrior, a framework for learning transferable topology priors for multi-agent LLM collaboration across domains. Rather than repeatedly searching for effective collaboration structures online, TopoPrior learns reusable topology priors from reference collaboration graphs collected offline from multiple domains and uses them to generate query-conditioned initial collaboration graphs for downstream refinement. By shifting part of topology search from per-query online optimization to offline prior learning, TopoPrior amortizes search cost while remaining compatible with existing topology-evolution backbones. Technically, TopoPrior contains two key components. First, a transferable topology prior learning module employs a conditional variational graph framework to capture reusable structural regularities across domains in a latent space. Second, a query-conditioned latent adaptation module introduces adversarial alignment to reduce unnecessary domain discrepancy while preserving query-relevant structural variation. Experiments on multi-domain reasoning benchmarks show that TopoPrior consistently improves several heterogeneous topology-evolution backbones while reducing online inference-time token usage, with only modest additional trainable parameters. These results suggest that transferable topology initialization is an effective and lightweight mechanism for improving the efficiency of multi-agent LLM collaboration across domains.

Chinese Translation

基于大语言模型（LLM）的多智能体系统通过结构化通信协调专业化代理，展现出在复杂推理方面的强大潜力。然而，现有的拓扑演化方法通常为每个查询从头构建或优化协作拓扑，这导致了显著的在线搜索开销、高推理时间的令牌消耗以及在多领域环境中的有限可扩展性。我们提出了TopoPrior，一个用于学习可转移拓扑先验以实现跨领域多智能体LLM协作的框架。TopoPrior并不是反复在线搜索有效的协作结构，而是从多个领域离线收集的参考协作图中学习可重用的拓扑先验，并利用这些先验生成条件于查询的初始协作图以供下游优化。通过将部分拓扑搜索从每个查询的在线优化转移到离线先验学习，TopoPrior在保持与现有拓扑演化基础架构兼容的同时，摊销了搜索成本。从技术上讲，TopoPrior包含两个关键组件。首先，一个可转移拓扑先验学习模块采用条件变分图框架，以捕捉潜在空间中跨领域的可重用结构规律。其次，一个条件于查询的潜在适应模块引入对抗性对齐，以减少不必要的领域差异，同时保留与查询相关的结构变化。在多领域推理基准上的实验表明，TopoPrior在减少在线推理时间令牌使用的同时，始终提高了几种异构拓扑演化基础架构的性能，并仅增加了适度的可训练参数。这些结果表明，可转移的拓扑初始化是一种有效且轻量的机制，可以提高跨领域多智能体LLM协作的效率。


cs.CL / 46 / 2605.17364

NewsLens: A Multi-Agent Framework for Adversarial News Bias Navigation

NewsLens：一种多智能体框架用于对抗性新闻偏见导航

Bose, Joy

Abstract

Media bias detection has predominantly been framed as a classification task: assign a political label to an article or outlet. We argue this framing is too shallow: it identifies that bias exists but not where, how, or crucially, what is structurally omitted. We present NewsLens, a five-agent adversarial pipeline for structured news bias navigation. A Fact Verifier, Progressive Framing Analyst, Conservative Framing Analyst, Propaganda Detector, and Neutral Summarizer collaborate to deconstruct articles into interpretable framing maps, exposing ideological omissions, rhetorical manipulation, and framing boundaries. The system is evaluated on 15 articles across four geopolitical event clusters (India-Pakistan Kashmir, Gaza, Climate Policy, Ukraine) using Qwen2.5-3B-Instruct (4-bit quantised, Google Colab T4), with cross-model validation using Mistral 7B on the Kashmir cluster. Center outlets show the highest mean Perspective Divergence Score (PDS: Qwen 0.907, Mistral 0.729 on Kashmir subset); conservative-framing outlets show the highest mean Manipulation Index (MI: 0.600 across both models). Cross-model comparison shows high consistency for high-propaganda content (Republic World delta-PDS=0.125, MI=0.8 both models) and greater variance for nuanced reporting. Mann-Whitney U tests find no statistically significant between-group differences at n=15, reported honestly as a sample-size limitation confirmed by post-hoc power analysis. A partial ablation removing the Propaganda Detector shows degraded omission precision in the Neutral Summarizer output. The architecture extends prior lexical-geometric bias work to agentic LLM reasoning, and is fully reproducible using open-weight models without API keys.

Chinese Translation

媒体偏见检测主要被视为一种分类任务：为文章或媒体分配政治标签。我们认为这种框架过于浅显：它识别出偏见的存在，但未能指出偏见的具体位置、方式，以及至关重要的被结构性遗漏的内容。我们提出了NewsLens，一个由五个智能体组成的对抗性管道，用于结构化新闻偏见导航。事实验证者（Fact Verifier）、渐进框架分析师（Progressive Framing Analyst）、保守框架分析师（Conservative Framing Analyst）、宣传检测器（Propaganda Detector）和中立摘要生成器（Neutral Summarizer）协同工作，将文章解构为可解释的框架图，揭示意识形态遗漏、修辞操控和框架边界。该系统在四个地缘政治事件集群（印度-巴基斯坦克什米尔、加沙、气候政策、乌克兰）中的15篇文章上进行了评估，使用Qwen2.5-3B-Instruct（4位量化，Google Colab T4），并在克什米尔集群上使用Mistral 7B进行跨模型验证。中心媒体显示出最高的平均观点差异分数（PDS：Qwen 0.907，Mistral 0.729，基于克什米尔子集）；保守框架媒体显示出最高的平均操控指数（MI：两个模型均为0.600）。跨模型比较显示高宣传内容（Republic World delta-PDS=0.125，MI=0.8，两个模型）的一致性高，而细致报道的方差更大。Mann-Whitney U检验发现n=15的组间差异没有统计学显著性，诚实地报告为样本量限制，并通过事后功效分析得到确认。去除宣传检测器的部分消融实验显示中立摘要生成器输出的遗漏精度下降。该架构将先前的词汇几何偏见研究扩展到智能LLM推理，并且可以使用开放权重模型在没有API密钥的情况下完全复现。


cs.CL / 47 / 2605.17379

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

通过更好的标记实现更快的学习：面向专业文本摘要的参数高效词汇适应

Balde, Gunjan, Roy, Soumyadeep, Mondal, Mainack, Ganguly, Niloy

Abstract

Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviates performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks, using a challenge-oriented evaluation protocol focused on expert-driven text and summaries, which typically have a higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach significantly reduces training time by $35-55\%$ over continual pretraining and reduces parameter counts by up to $37\%$ relative to expansion-only methods. We make the codebase publicly available at https://github.com/gb-kgp/VocabReplace-Then-Expand.

Chinese Translation

在通用领域语料库上预训练的大型语言模型在应用于专业领域时，通常表现出标记化效率低下的问题。尽管针对领域适应的持续预训练在一定程度上缓解了性能下降，但并未解决根本的词汇不匹配问题。为了解决这一问题，我们提出了一种针对性的参数高效领域适应方法，该方法将词汇适应与基于大型语言模型（LLM）的文本摘要预训练相结合。我们的统一框架通过引入领域特定的标记来增强预训练的标记器，同时选择性地替换训练不足和无法到达的标记，以限制参数增长。我们在 Llama-3.1-8B 和 Qwen2.5-7B 上评估了我们的方法，针对法律和医学摘要任务，采用了以挑战为导向的评估协议，重点关注专家驱动的文本和摘要，这些文本和摘要通常具有更高浓度的过度分割的词汇外（OOV）词。词汇适应算法通过提高生成摘要与其参考文献之间的语义相似性，增强了摘要模型的整体质量。此外，适应后的模型生成的摘要包含更多合适的新颖和领域特定的词汇，从而提高了连贯性、相关性和忠实度。我们进一步观察到，与持续预训练相比，我们提出的方法显著减少了训练时间，降低了 35-55%，并且与仅扩展方法相比，参数数量减少了高达 37%。我们将代码库公开发布在 https://github.com/gb-kgp/VocabReplace-Then-Expand。
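
As a rough illustration of the over-fragmentation problem described in the abstract above, the sketch below counts subword pieces per word before and after one domain-specific token is added. The greedy longest-match segmenter, the toy vocabulary, and the names `greedy_tokenize` and `fertility` are all assumptions made for illustration. Real LLM tokenizers use BPE-style merges, and the paper's token-replacement step is not modeled here.

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation; falls back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def fertility(words, vocab):
    """Average number of subword pieces per word (lower means less fragmentation)."""
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)


# Hypothetical general-domain vocabulary and two medical terms.
base_vocab = {"hyper", "tension", "card", "io", "my", "opathy"}
words = ["hypertension", "cardiomyopathy"]

# Hypothetical adaptation step: add one domain-specific token.
adapted_vocab = base_vocab | {"cardiomyopathy"}

base_fertility = fertility(words, base_vocab)
adapted_fertility = fertility(words, adapted_vocab)
```

With this toy vocabulary, "cardiomyopathy" goes from four pieces to one, so average fertility drops. That is the kind of effect vocabulary adaptation targets.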

cs.CL / 48 / 2605.17398

MiniGPT: Rebuilding GPT from First Principles

MiniGPT：从基本原理重建GPT

Joseph, Jibin

Abstract

This paper presents MiniGPT, a compact from-scratch implementation of GPT-style autoregressive language modeling in PyTorch. The aim is to rebuild the core GPT pipeline from first principles after studying the design of nanoGPT by Andrej Karpathy, while keeping the model and training code independently written in a single notebook. MiniGPT implements token and positional embeddings, causal multi-head self-attention, pre-LayerNorm Transformer blocks, residual connections, feed-forward MLP layers, next-token cross-entropy training (teacher forcing), validation tracking, checkpoint selection, and autoregressive text generation. This paper evaluates the implementation on the Tiny Shakespeare dataset using character-level tokenization. A baseline 0.83M-parameter model reaches a validation loss of 1.7236 after 3000 training iterations. A stronger 10.77M-parameter configuration, using a larger context length and improved training settings, reaches a best validation loss of 1.4780 and generates text with recognizable Shakespeare-style dialogue structure. MiniGPT does not introduce a new language-model architecture. Instead, it documents a clear and reproducible implementation path from raw text to trained character-level generation, including design choices, training behavior, generation quality, and practical limitations.

Chinese Translation

本文介绍了MiniGPT，这是一个在PyTorch中从头开始实现的紧凑型GPT风格自回归语言模型。其目的是在研究Andrej Karpathy的nanoGPT设计后，从基本原理重建核心GPT流程，同时保持模型和训练代码独立编写在一个笔记本中。MiniGPT实现了标记和位置嵌入、因果多头自注意力、预层归一化Transformer块、残差连接、前馈多层感知机（MLP）层、下一个标记的交叉熵训练（教师强迫）、验证跟踪、检查点选择和自回归文本生成。本文在使用字符级标记化的Tiny Shakespeare数据集上评估了该实现。一个基线的0.83M参数模型在3000次训练迭代后达到了1.7236的验证损失。一个更强的10.77M参数配置，使用更大的上下文长度和改进的训练设置，达到了最佳验证损失1.4780，并生成了具有可识别的莎士比亚风格对话结构的文本。MiniGPT并未引入新的语言模型架构，而是记录了从原始文本到训练后的字符级生成的清晰且可重复的实现路径，包括设计选择、训练行为、生成质量和实际限制。
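
The character-level tokenization and next-token (teacher-forcing) setup mentioned in the abstract above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the MiniGPT notebook's code: the toy corpus and the names `encode`, `decode`, and `make_example` are hypothetical.

```python
# Hypothetical toy corpus; the paper trains on Tiny Shakespeare.
text = "to be or not to be"

# Character-level vocabulary: one integer id per distinct character.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}


def encode(s):
    """Map a string to a list of integer character ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)


def make_example(ids, start, block_size):
    """Return (inputs, targets); targets are the inputs shifted left by one."""
    x = ids[start:start + block_size]
    y = ids[start + 1:start + block_size + 1]
    return x, y


ids = encode(text)
x, y = make_example(ids, 0, 8)
```

Each position in `x` is trained to predict the character at the same position in `y`, which is what next-token cross-entropy training with teacher forcing means.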

cs.CL / 49 / 2605.17435

BELIEF: Structured Evidence Modeling and Uncertainty-Aware Fusion for Biomedical Question Answering

BELIEF：用于生物医学问答的结构化证据建模与不确定性感知融合

Zong, Chang, Ning, Hao, Tang, Siliang, Huang, Jie, Wan, Jian

Abstract

Biomedical question answering often requires decisions from retrieved literature whose relevance, quality, and support for candidate answers are uneven. Most retrieval-augmented large language model (LLM) methods feed this literature to the model as flat text, leaving evidence reliability and remaining uncertainty largely implicit. We propose BELIEF, a structured evidence modeling and uncertainty-aware fusion framework for closed-set biomedical question answering. Rather than treating retrieved documents as undifferentiated context, BELIEF converts them into evidence objects that record clinical attributes, source quality, question relevance, support strength, and the associated candidate hypothesis. These evidence objects provide a shared basis for two complementary reasoning paths. The symbolic path constructs reliability-weighted basic probability assignments based on Dempster-Shafer (D-S) theory over a finite answer space and performs uncertainty-aware symbolic evidence fusion to estimate belief and residual uncertainty. The neural path uses the same structured evidence for LLM-based semantic inference, while a reliability-aware arbitration module reconciles the symbolic and neural outputs according to belief strength, uncertainty, evidence reliability, and semantic consistency. Experiments on PubMedQA, MedQA, and MedMCQA with five general-purpose LLM backbones show that BELIEF obtains the best result in 25 of 30 backbone-dataset-metric settings. Comparisons with biomedical-domain models indicate that BELIEF is competitive on MedQA and MedMCQA, while specialized biomedical pretraining remains advantageous on PubMedQA. Ablation, complementarity, uncertainty-stratified, and cost analyses further show that BELIEF improves retrieved-evidence utilization by making evidence structure, path disagreement, and decision uncertainty explicit.

Chinese Translation

生物医学问答通常需要从检索到的文献中做出决策，而这些文献的相关性、质量和对候选答案的支持程度往往不均衡。大多数增强检索的大型语言模型（LLM）方法将这些文献作为平面文本输入模型，导致证据的可靠性和剩余不确定性在很大程度上是隐性的。我们提出了BELIEF，一个用于封闭集生物医学问答的结构化证据建模与不确定性感知融合框架。BELIEF并不将检索到的文档视为无差别的上下文，而是将其转换为记录临床属性、来源质量、问题相关性、支持强度及相关候选假设的证据对象。这些证据对象为两条互补的推理路径提供了共享基础。符号路径基于Dempster-Shafer（D-S）理论在有限答案空间上构建可靠性加权的基本概率分配，并执行不确定性感知的符号证据融合，以估计信念和剩余不确定性。神经路径使用相同的结构化证据进行基于LLM的语义推理，同时一个可靠性感知的仲裁模块根据信念强度、不确定性、证据可靠性和语义一致性调和符号和神经输出。在PubMedQA、MedQA和MedMCQA上进行的实验显示，BELIEF在30个骨干网络-数据集-指标设置中的25个中获得了最佳结果。与生物医学领域模型的比较表明，BELIEF在MedQA和MedMCQA上具有竞争力，而专门的生物医学预训练在PubMedQA上仍然具有优势。消融、互补性、不确定性分层和成本分析进一步表明，BELIEF通过使证据结构、路径分歧和决策不确定性显性化，提高了检索证据的利用率。
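
The symbolic path described above builds on Dempster-Shafer theory. The sketch below shows the two textbook ingredients, Shafer discounting by a reliability weight and Dempster's rule of combination, over a small closed answer set. It is not BELIEF's implementation: the answer frame, the mass values, the reliabilities, and all function names are hypothetical.

```python
from itertools import product

# Hypothetical closed answer set (PubMedQA-style labels).
FRAME = frozenset({"yes", "no", "maybe"})


def discount(mass, reliability):
    """Shafer discounting: scale each mass by reliability, move the rest to FRAME."""
    out = {s: reliability * m for s, m in mass.items() if s != FRAME}
    out[FRAME] = 1.0 - reliability + reliability * mass.get(FRAME, 0.0)
    return out


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions."""
    joint, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            joint[inter] = joint.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb  # mass assigned to contradictory evidence
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {s: m / (1.0 - conflict) for s, m in joint.items()}


def belief(mass, hypothesis):
    """Belief in a hypothesis: total mass of all subsets of that hypothesis."""
    return sum(m for s, m in mass.items() if s <= hypothesis)


# Two hypothetical evidence objects with made-up support and reliability.
e1 = discount({frozenset({"yes"}): 0.8, FRAME: 0.2}, reliability=0.9)
e2 = discount({frozenset({"no"}): 0.6, FRAME: 0.4}, reliability=0.5)
fused = combine(e1, e2)
belief_yes = belief(fused, frozenset({"yes"}))
residual_uncertainty = fused[FRAME]
```

The mass left on the whole frame after fusion is one simple way to read "residual uncertainty". How BELIEF defines and uses it may differ.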

cs.CL / 50 / 2605.17442

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

超越目录计数：低资源多语言自然语言处理中的数据集可见性不对称性

Tan, Zhiyin, Duan, Changxu

Abstract

Multilingual NLP often relies on dataset counts from centralized catalogues to characterize which languages are resource-rich or resource-poor. However, these catalogues record only one layer of dataset visibility: what has been registered or institutionally distributed. They do not necessarily reflect which datasets are created, cited, or reused in the research literature. To examine this gap, we combine a catalogue-based baseline with literature-backed evidence of dataset circulation. We introduce the Resource Density Index (RDI), defined as the number of catalogued datasets per one million speakers, and compute it for the 200 most widely spoken languages in Ethnologue. Among them, 118 languages (59%) have an average RDI of zero across the LRE Map and the Linguistic Data Consortium (LDC), and another 23 fall below 0.1, corresponding to at most one catalogued dataset per ten million speakers. We then apply an LLM-assisted citation-mining pipeline over the Semantic Scholar corpus to these 141 low-visibility languages. After manual validation and consolidation, we identify 609 unique datasets across 53 languages, of which 356 remain openly accessible through working public links. These results reveal a substantial visibility gap: many large-speaker languages appear data-poor in catalogue records yet show clear evidence of dataset activity in the research literature. Our findings suggest that multilingual data scarcity should be understood not only as a production problem, but also as a question of documentation, discoverability, and long-term accessibility. Code and data are publicly available at https://github.com/zhiyintan/dataset-visibility-asymmetry.

Chinese Translation

多语言自然语言处理（NLP）通常依赖于来自集中目录的数据集计数，以表征哪些语言资源丰富或资源匮乏。然而，这些目录仅记录了数据集可见性的一个层面：已注册或机构分发的内容。它们并不一定反映哪些数据集在研究文献中被创建、引用或重用。为了探讨这一差距，我们结合了基于目录的基线和文献支持的数据集流通证据。我们引入了资源密度指数（Resource Density Index, RDI），定义为每百万说话者的目录数据集数量，并对《Ethnologue》中200种最广泛使用的语言进行了计算。在这些语言中，118种语言（59%）在LRE地图和语言数据联盟（Linguistic Data Consortium, LDC）中平均RDI为零，另外23种语言的RDI低于0.1，最多对应每千万说话者一个目录数据集。随后，我们对这141种低可见性语言应用了基于大型语言模型（LLM）辅助的引用挖掘管道，覆盖了Semantic Scholar语料库。经过人工验证和整合，我们在53种语言中识别出609个独特的数据集，其中356个通过有效的公共链接保持开放访问。这些结果揭示了一个显著的可见性差距：许多大语种在目录记录中显得数据匮乏，但在研究文献中却显示出明显的数据集活动证据。我们的研究结果表明，多语言数据稀缺不仅应被理解为一个生产问题，还应被视为一个文档、可发现性和长期可访问性的问题。代码和数据可在（https://github.com/zhiyintan/dataset-visibility-asymmetry）公开获取。
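
The Resource Density Index defined above (catalogued datasets per one million speakers) is straightforward to compute. The sketch below assumes that the "average RDI" is a plain mean over the two catalogues, which the abstract does not spell out. The speaker count and dataset counts are invented for illustration.

```python
def resource_density_index(num_datasets, num_speakers):
    """Catalogued datasets per one million speakers."""
    if num_speakers <= 0:
        raise ValueError("num_speakers must be positive")
    return num_datasets / (num_speakers / 1_000_000)


def average_rdi(counts_by_catalogue, num_speakers):
    """Mean RDI across catalogues (assumed here to be an unweighted mean)."""
    values = [
        resource_density_index(n, num_speakers)
        for n in counts_by_catalogue.values()
    ]
    return sum(values) / len(values)


# Hypothetical language with 30 million speakers and made-up catalogue counts.
counts = {"LRE Map": 2, "LDC": 1}
rdi = average_rdi(counts, 30_000_000)  # (2/30 + 1/30) / 2 = 0.05

# The abstract uses 0.1 as the cut-off for its low-visibility group.
is_low_visibility = rdi < 0.1
```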

cs.CL / 51 / 2605.17443

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

分析韩国口语问答中ASR-LLM级联的错误传播

Jung, Donghyuk, Choi, Youngwon

Abstract

We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our analysis shows that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance, suggesting that cascade degradation largely tracks ASR-stage information loss. We further identify single-character Korean ASR errors as a distinct semantic-failure channel, where the gold answer becomes entirely absent from the downstream prediction despite only a minimal transcription difference. Finally, an auxiliary comparison shows that a large audio language model outperforms an ASR-LLM pipeline with a matched language backbone in noisy Korean SQA, indicating the potential of direct audio input to mitigate transcript-induced information loss.

Chinese Translation

我们分析了自动语音识别（ASR）错误如何在韩国口语问答（SQA）的ASR-LLM级联中传播，重点关注传统ASR指标无法完全捕捉的下游语义失败。我们的分析表明，由ASR错误引起的相对下游退化在不同绝对性能的LLM中是一致的，这表明级联退化在很大程度上跟踪ASR阶段的信息损失。我们进一步识别出单字符的韩国ASR错误作为一种独特的语义失败通道，在这种情况下，尽管仅存在最小的转录差异，金标准答案在下游预测中完全缺失。最后，辅助比较显示，在嘈杂的韩国SQA中，一个大型音频语言模型的表现优于具有匹配语言骨干的ASR-LLM管道，表明直接音频输入有潜力减轻转录引起的信息损失。

cs.CL / 52 / 2605.17467

VerifyMAS: Hypothesis Verification for Failure Attribution in LLM Multi-Agent Systems

VerifyMAS：用于大语言模型多智能体系统中故障归因的假设验证

Qiao, Hezhe, Tong, Hanghang, Lim, Ee-Peng, Liu, Bing, Pang, Guansong

Abstract

Large language model-driven multi-agent systems (LLM-MAS) excel at complex tasks, yet unreliable agents remain a key bottleneck to system-level reliability. Automatic failure attribution is therefore critical, but existing approaches, such as direct prediction of agent-error pairs and agent-first failure attribution, rely on local logs of agents and miss global failures that only manifest over full interaction trajectories, such as cross-step inconsistencies and inter-agent coordination errors. Moreover, directly predicting failures induces a large combinatorial search space, hindering fine-grained attribution. To address these challenges, we propose VerifyMAS, a hypothesis verification framework for agent failure attribution. Instead of directly predicting faulty agents and error types, VerifyMAS formulates and verifies failure hypotheses against full trajectories. This verification-based approach decomposes attribution into trajectory-level error validation and fine-grained agent localization, providing an error-first attribution approach that captures global failure patterns while substantially reducing the search space. We further introduce a hypothesis-based data construction strategy grounded in a structured error taxonomy and fine-tune a specialized LLM verifier model for trajectory-level failure verification and agent attribution. Experiments on Aegis-Bench and Who&When show that VerifyMAS consistently improves diverse backbone models, including open-source Qwen and API-based GPT models, outperforming prior methods without sacrificing inference efficiency for long multi-agent trajectories.

Chinese Translation

由大型语言模型驱动的多智能体系统（LLM-MAS）在复杂任务中表现出色，但不可靠的智能体仍然是系统级可靠性的关键瓶颈。因此，自动故障归因至关重要，但现有方法，如直接预测智能体错误对和智能体优先故障归因，依赖于智能体的局部日志，无法捕捉到仅在完整交互轨迹中显现的全局故障，例如跨步骤不一致和智能体间协调错误。此外，直接预测故障会导致巨大的组合搜索空间，妨碍细粒度归因。为了解决这些挑战，我们提出了VerifyMAS，一个用于智能体故障归因的假设验证框架。VerifyMAS并不是直接预测故障智能体和错误类型，而是针对完整轨迹制定和验证故障假设。这种基于验证的方法将归因分解为轨迹级错误验证和细粒度智能体定位，提供了一种错误优先的归因方法，能够捕捉全局故障模式，同时显著减少搜索空间。我们进一步引入了一种基于假设的数据构建策略，该策略基于结构化错误分类法，并对专门的LLM验证模型进行了微调，以实现轨迹级故障验证和智能体归因。在Aegis-Bench和Who&When上的实验表明，VerifyMAS在包括开源Qwen和基于API的GPT模型在内的多种基础模型上始终表现出色，超越了先前的方法，同时不牺牲长多智能体轨迹的推理效率。

cs.CL / 53 / 2605.17481

Hybrid Feature Combinations with CNN for Bangla Fake News Classification

基于卷积神经网络的孟加拉假新闻分类的混合特征组合

Hussain, Md Gulzar, Sultana, Babe, Ali, Md Rinku

Abstract

Nowadays, people in Bangladesh frequently rely on the internet and social media for daily news instead of traditional newspapers. However, the spread of false Bangla news through these platforms poses risks and challenges to the credibility of authentic media. Although several studies have been conducted on detecting Bangla fake news, there is still significant room for improvement in this area. To assist people, this research explores the effectiveness of feature selection approaches in identifying appropriate features, such as semantic, statistical, and character-level features, or their combinations, on the BanFakeNews-2.0 dataset for detecting Bangla fake news using a CNN model. Our key findings show that combining multiple features significantly improves recall and F1-scores compared to using individual features alone. The code for this research is available at https://github.com/gulzar09/Bn_FNews_H.Feature.

Chinese Translation

如今，孟加拉国的人们越来越依赖互联网和社交媒体获取日常新闻，而不是传统报纸。然而，通过这些平台传播的虚假孟加拉新闻对真实媒体的可信度构成了风险和挑战。尽管已有多项研究致力于检测孟加拉假新闻，但在这一领域仍有显著的改进空间。为了帮助人们，本研究探讨了特征选择方法在识别适当特征（如语义特征、统计特征和字符级特征）或它们组合的有效性，使用卷积神经网络（CNN）模型在BanFakeNews-2.0数据集上检测孟加拉假新闻。本文的关键发现表明，与单独使用特征相比，组合多种特征显著提高了召回率和F1分数。本研究的代码可在此获取：https://github.com/gulzar09/Bn_FNews_H.Feature。

cs.CL / 54 / 2605.17482

Residual Semantic Decomposition of Word Embeddings

词嵌入的残差语义分解

Jin, Seungmin

Abstract

We introduce Residual Semantic Decomposition (RSD), a neural additive decomposition of word embeddings that balances embedding reconstruction with relational structure preservation. RSD supports recursive binary decomposition: each $K=2$ fit extracts a local semantic axis, while residuals expose information not absorbed by that axis. In manually specified paired-context diagnostics over ambiguous words, RSD separates supplied context anchors above shuffled-label controls, but entropy diagnostics show that ambiguous targets are not uniformly high-entropy boundary points in static GloVe. We therefore treat residual neighborhoods as qualitative diagnostics rather than benchmark sense predictions.

Chinese Translation

我们提出了残差语义分解（Residual Semantic Decomposition, RSD），这是一种神经加性分解方法，旨在平衡词嵌入的重构与关系结构的保留。RSD 支持递归二元分解：每次 $K=2$ 的拟合提取一个局部语义轴，而残差则揭示了该轴未吸收的信息。在对模糊词的手动指定配对上下文诊断中，RSD 能够将提供的上下文锚点与随机标签控制组区分开来，但熵诊断显示，在静态 GloVe 中，模糊目标并不是均匀的高熵边界点。因此，我们将残差邻域视为定性诊断，而非基准意义预测。

cs.CL / 55 / 2605.17598

Mixture of Experts for Low-Resource LLMs

低资源大语言模型的专家混合模型

Joseph, Ori Bar, Arvatz, Smadar, Kayzer, Noam, Revital, Dan, Weinberger, Sarel

Abstract

Mixture-of-Experts (MoE) architectures enable efficient model scaling, yet expert routing behavior across underrepresented languages remains poorly understood. We analyze routing dynamics in two architecturally distinct MoE models, a pure Transformer (Qwen3-30B-A3B) and a hybrid Mamba-Transformer (Nemotron-3-Nano-30B-A3B), using Hebrew as a morphologically rich, low-resource testbed. Both pre-trained models exhibit "deep-layer routing collapse": usage entropy drops sharply in final layers and tokens concentrate on a narrow expert subset, a pattern largely absent for English. Continual pre-training (CPT) on balanced bilingual data substantially corrects this imbalance, increasing entropy and shifting routing toward shared, language-agnostic experts; supervised fine-tuning (SFT) alone achieves less complete correction. Extending the analysis to Japanese reveals quantitatively consistent collapse signatures, providing cross-linguistic evidence that the phenomenon is a systematic consequence of pre-training underrepresentation rather than any language-intrinsic property. Routing improvements correlate with consistent downstream benchmark gains, positioning routing entropy and expert specialization as principled diagnostics for multilingual capacity in MoE systems.

Chinese Translation

专家混合模型（Mixture-of-Experts, MoE）架构能够实现高效的模型扩展，但在代表性不足的语言中，专家路由行为仍然不够清晰。我们分析了两种架构上不同的 MoE 模型的路由动态：一种是纯 Transformer（Qwen3-30B-A3B），另一种是混合 Mamba-Transformer（Nemotron-3-Nano-30B-A3B），以希伯来语作为形态丰富的低资源测试平台。这两种预训练模型均表现出“深层路由崩溃”：在最后几层中，使用熵急剧下降，令牌集中在狭窄的专家子集上，而这一模式在英语中几乎不存在。对平衡双语数据进行持续预训练（Continual Pre-Training, CPT）显著纠正了这种不平衡，增加了熵并将路由转向共享的、与语言无关的专家；而仅进行监督微调（Supervised Fine-Tuning, SFT）则未能实现如此全面的纠正。将分析扩展到日语，发现量化一致的崩溃特征，提供了跨语言证据，表明这一现象是预训练下代表性不足的系统性结果，而非任何语言固有属性。路由改进与下游基准测试的一致性增益相关，表明路由熵和专家专业化是 MoE 系统中多语言能力的原则性诊断指标。
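
The usage entropy that the abstract above treats as a routing diagnostic can be sketched as the Shannon entropy of how often each expert is selected in a layer. This is a minimal sketch under stated assumptions: the expert count and routing traces are invented, top-k selections are simply flattened into one list, and the paper's exact entropy definition may differ.

```python
import math
from collections import Counter


def usage_entropy(expert_ids, num_experts):
    """Shannon entropy (in bits) of the empirical expert-usage distribution."""
    counts = Counter(expert_ids)
    total = len(expert_ids)
    entropy = 0.0
    for e in range(num_experts):
        p = counts.get(e, 0) / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical routing traces for one layer with 8 experts.
balanced = [0, 1, 2, 3, 4, 5, 6, 7] * 4   # every expert used equally often
collapsed = [0] * 28 + [1] * 4            # tokens concentrate on two experts

h_balanced = usage_entropy(balanced, 8)    # maximum possible: log2(8) = 3 bits
h_collapsed = usage_entropy(collapsed, 8)  # well below the maximum
```

Dividing by `math.log2(num_experts)` gives a value in [0, 1], which makes layers with different expert counts comparable.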

cs.CL / 56 / 2605.17639

Temporal Decay of Co-Citation Predictability: A 20-Year Statute Retrieval Benchmark from 396M Ukrainian Court Citations

共引可预测性的时间衰减：来自3.96亿条乌克兰法院引用的20年法规检索基准

Ovcharov, Volodymyr

Abstract

Co-citation structure is widely assumed to provide a stable retrieval signal in legal information systems. We test this assumption longitudinally by constructing UA-StatuteRetrieval, a benchmark that measures co-citation predictability across 20 annual snapshots (2007-2026) of 396 million codex citations from 101 million Ukrainian court decisions. Using a leave-one-out protocol over the full bipartite citation graph, we find that Adamic-Adar MRR declines 33% on a fixed set of articles (from 0.43 to 0.29) and 47% under a train/test temporal split (from 0.51 to 0.27), confirming genuine temporal decay rather than compositional shift or evaluation artifact. The decay is non-uniform: criminal procedure maintains stable co-citation patterns (MRR ~0.40), while civil law degrades from 0.35 to 0.15, coinciding with the 2017 judicial reform. Hub articles (>100K citations) resist decay, but mid-frequency articles (1K-10K), the practical retrieval frontier, lose half their predictability. A BM25 text baseline decays even faster (31%), and embedding drift analysis with E5-large reveals a 4.3% semantic shift in how articles are cited, providing a mechanistic explanation for the observed decay. The benchmark is released at https://huggingface.co/datasets/overthelex/ua-statute-retrieval.

Chinese Translation

共引结构被广泛认为在法律信息系统中提供稳定的检索信号。我们通过构建UA-StatuteRetrieval这一基准，纵向测试这一假设，测量来自1.01亿乌克兰法院判决的3.96亿条法典引用在20个年度快照（2007-2026）中的共引可预测性。使用完整二部引用图上的留一法协议，我们发现Adamic-Adar的平均倒数排名（MRR）在固定文章集上下降了33%（从0.43降至0.29），在训练/测试的时间分割下下降了47%（从0.51降至0.27），确认了真正的时间衰减，而非组成变化或评估伪影。衰减呈现非均匀性：刑事程序保持稳定的共引模式（MRR约为0.40），而民法则从0.35降至0.15，恰逢2017年司法改革。中心文章（>10万引用）抵御衰减，但中频文章（1千-1万），即实际检索前沿，则失去了一半的可预测性。BM25文本基线衰减更快（31%），而使用E5-large的嵌入漂移分析显示文章引用方式的语义变化达4.3%，为观察到的衰减提供了机制性解释。该基准已发布于https://huggingface.co/datasets/overthelex/ua-statute-retrieval。
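
One plausible reading of the leave-one-out Adamic-Adar evaluation described above is sketched below. The abstract does not give the exact protocol, so this is an assumption. Each decision cites a set of articles, one cited article is held out, and candidates are scored by summing `overlap / log(degree)` over the other decisions that share articles with the remaining context. The toy graph and all names are hypothetical.

```python
import math
from collections import defaultdict


def adamic_adar_rank(decisions, query_id, held_out):
    """Rank of `held_out` among candidate articles for one leave-one-out query."""
    context = decisions[query_id] - {held_out}
    scores = defaultdict(float)
    for other_id, cited in decisions.items():
        # Skip the query itself; a decision citing one article has log(1) = 0.
        if other_id == query_id or len(cited) < 2:
            continue
        overlap = len(context & cited)
        if overlap == 0:
            continue
        weight = overlap / math.log(len(cited))
        for candidate in cited - context:
            scores[candidate] += weight
    # Sort by descending score, breaking ties by article id for determinism.
    ranked = sorted(scores, key=lambda a: (-scores[a], a))
    return ranked.index(held_out) + 1 if held_out in ranked else None


def mean_reciprocal_rank(ranks):
    """MRR, counting queries whose target was never retrieved as 0."""
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)


# Hypothetical bipartite graph: decision id -> set of cited article ids.
decisions = {
    "d1": {"A", "B", "C"},
    "d2": {"A", "B"},
    "d3": {"A", "D"},
    "q": {"A", "B"},
}
rank_b = adamic_adar_rank(decisions, "q", "B")   # B co-occurs with A in d1 and d2
rank_c = adamic_adar_rank(decisions, "d1", "C")  # C is cited nowhere else
mrr = mean_reciprocal_rank([rank_b, rank_c])
```

Measuring temporal decay would then amount to building `decisions` from one annual snapshot and scoring queries drawn from a later one. That part is not shown here.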

cs.CL / 57 / 2605.17652

Beyond Transcripts: Iterative Peer-Editing with Audio Unlocks High-Quality Human Summaries of Conversational Speech

超越文字记录：迭代同行编辑结合音频解锁高质量人类对话语音摘要

Chaparala, Kaavya, Thebaud, Thomas, López, Jesús Villalba, Moro-Velazquez, Laureano, Viechnicki, Peter, Dehak, Najim

Abstract

There are not enough established benchmarks for the task of speech summarization. Creating new benchmarks demands human annotation, as LLMs could embed systemic errors and bias into datasets. We test ten annotation workflows varying input modality (audio, transcript, or both) and the inclusion of editing (self or peer-editing) to investigate potential quality tradeoffs from using human annotators to summarize audio. We compare human audio-based summaries to human transcript-based summaries to track the impact of the different information modalities on summary quality. We also compare the human outputs against four LLM benchmarks (three text, one audio) to examine whether human-written summaries are less informative than highly fluent automated outputs. We find that audio-based summaries are less informative and more compressed than transcript summaries. However, iterative peer-editing with audio mitigates this difference, enabling audio-based summaries to be as informative as their transcript counterparts and LLM summaries. These findings validate iterative peer-editing among human annotators for the creation of benchmarks informed by both lexical and prosodic information. This enables crucial dataset collection even in settings where transcripts are unavailable.

Chinese Translation

目前针对语音摘要任务的建立基准尚不够充分。创建新的基准需要人工标注，因为大语言模型（LLMs）可能会将系统性错误和偏见嵌入数据集中。我们测试了十种标注工作流程，变化的输入方式（音频、文字记录或两者结合）以及编辑的包含（自我编辑或同行编辑），以探讨使用人类标注者进行音频摘要时可能出现的质量权衡。我们将基于音频的人类摘要与基于文字记录的人类摘要进行比较，以追踪不同信息方式对摘要质量的影响。我们还将人类输出与四个大语言模型基准（三个文本，一个音频）进行比较，以检验人类撰写的摘要是否比高度流畅的自动输出信息量更少。我们的研究发现，基于音频的摘要信息量较少且更为压缩，然而，结合音频的迭代同行编辑减轻了这一差异，使得基于音频的摘要在信息量上与其文字记录对应物及大语言模型摘要相当。这些发现验证了在人类标注者之间进行迭代同行编辑的有效性，为创建同时考虑词汇和韵律信息的基准提供了支持。这使得在缺乏文字记录的情况下，关键数据集的收集成为可能。

cs.CL / 58 / 2605.17672

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

当推理收敛时停止：语义保留的推理模型早期退出方法

Min, Dehai, Vaccarino, Giovanni, Chen, Huiyi, Wu, Yongliang, Yona, Gal, Cheng, Lu

Abstract

Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but often overthink, continuing to reason after a solution has already stabilized and thereby wasting tokens and increasing latency. Existing inference-time early-exit methods rely primarily on answer-level signals, such as confidence or trial-answer consistency, to decide when to stop. However, these signals mainly reflect answer readiness rather than reasoning convergence: they may trigger before the model has finished exploring or self-correcting, causing premature exits that can degrade final-answer accuracy and leave the retained reasoning chain semantically incomplete. We identify reasoning-level semantic redundancy as a complementary signal for semantic-preserving early exit: when successive steps no longer add novel progress and instead revisit established conclusions, the reasoning trajectory has likely converged. Building on this insight, we propose PUMA, a plug-and-play framework that combines a lightweight Redundancy Detector with answer-level verification. The detector flags semantically redundant candidate exits, while verification confirms whether stopping is safe, allowing PUMA to remove redundant continuation while preserving both answer accuracy and a coherent reasoning prefix. Across five LRMs and five challenging reasoning benchmarks, PUMA achieves 26.2% average token reduction while preserving accuracy and retained CoT quality. Additional experiments on code generation, zero-shot vision-language reasoning, and learned stopping-policy internalization further demonstrate that reasoning-level redundancy is a robust, transferable, and learnable signal for efficient reasoning. Our code is available at https://github.com/giovanni-vaccarino/PUMA.

Chinese Translation

大型推理模型（Large Reasoning Models, LRM）通过生成长链思维（Chain of Thought, CoT）来实现强大的性能，但往往会过度思考，在解决方案已经稳定后仍继续推理，从而浪费令牌并增加延迟。现有的推理时早期退出方法主要依赖于答案级信号，如置信度或试探性答案一致性，来决定何时停止。然而，这些信号主要反映答案的准备情况，而非推理的收敛性：它们可能在模型尚未完成探索或自我修正时就触发，导致过早退出，从而可能降低最终答案的准确性，并使保留的推理链在语义上不完整。我们将推理级的语义冗余识别为语义保留早期退出的补充信号：当连续步骤不再增加新进展，而是重访已建立的结论时，推理轨迹可能已经收敛。基于这一见解，我们提出了PUMA，一个即插即用的框架，结合了轻量级冗余检测器和答案级验证。检测器标记语义冗余的候选退出，而验证则确认停止是否安全，从而允许PUMA在保留答案准确性和连贯推理前缀的同时，消除冗余的继续。在五个LRM和五个具有挑战性的推理基准上，PUMA实现了26.2%的平均令牌减少，同时保持了准确性和保留的CoT质量。关于代码生成、零样本视觉-语言推理和学习停止策略内化的额外实验进一步证明，推理级冗余是一个强大、可转移且可学习的高效推理信号。我们的代码可在 https://github.com/giovanni-vaccarino/PUMA 获得。

cs.CL / 59 / 2605.17691

Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification

验证您的权威：对多标签判例处理分类的LLMs基准测试

Demir, M. Mikail, Canbaz, M. Abdullah

Abstract

Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split. Google's Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI's GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal reasoning task.

Chinese Translation

在法律判例中自动化负面处理的分类是一项关键但复杂的自然语言处理任务，错误分类带来了显著的风险。为了解决标准准确率的不足，本文引入了一种更为稳健的评估框架。我们在一个新的专家标注数据集上对现代大型语言模型进行了基准测试，该数据集包含239个真实世界的法律引用，并提出了一种新颖的平均严重性错误（Average Severity Error）指标，以更好地衡量分类错误的实际影响。我们的实验揭示了性能的分化。谷歌的Gemini 2.5 Flash在高层次分类任务中取得了最高准确率（79.1%），而OpenAI的GPT-5-mini在更复杂的细粒度模式中表现最佳（67.7%）。这项工作建立了一个重要的基准，提供了一个新的丰富上下文数据集，并引入了一种针对这一复杂法律推理任务需求量身定制的评估指标。

cs.CL / 60 / 2605.17694

Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations?

大型语言模型代理是否反映了权力不对称对话中的社会认知效应？

Vijjini, Anvesh Rao, Manjunath, Sagar, Chaturvedi, Snigdha

Abstract

Power differences shape human communication through well-documented socio-cognitive effects, including language coordination, pronoun usage, authority bias, and harmful compliance. We examine whether large language models (LLMs) exhibit similar behaviors when assigned high- or low-status personas. Using personas from diverse professions, we simulate multi-turn, power-asymmetric dialogues (e.g., principal-teacher, justice-lawyer) and measure (i) linguistic coordination, (ii) pronoun usage, (iii) persuasion success, and (iv) compliance with unsafe requests. Our results show that LLMs exhibit key socio-cognitive effects of power, albeit with nuances and variability, linking simulated interactions to both desirable and unsafe behaviors.

Chinese Translation

权力差异通过众所周知的社会认知效应塑造人类沟通，这些效应包括语言协调、代词使用、权威偏见和有害的顺从。我们研究大型语言模型（LLMs）在被赋予高或低地位角色时是否表现出类似的行为。通过使用来自不同职业的角色，我们模拟了多轮权力不对称对话（例如，校长与教师、律师与法官），并测量了（i）语言协调，（ii）代词使用，（iii）说服成功率，以及（iv）对不安全请求的顺从。我们的结果表明，LLMs表现出权力的关键社会认知效应，尽管存在细微差别和变异，将模拟的互动与既期望的行为和不安全的行为联系起来。

cs.CL / 61 / 2605.17710

Sometin Beta Pass Notin (SBPN): Improving Multilingual ASR for Nigerian Languages via Knowledge Distillation

Sometin Beta Pass Notin (SBPN)：通过知识蒸馏提升尼日利亚语言的多语言自动语音识别

Ogun, Sewade

Abstract

Although modern multilingual Automatic Speech Recognition (ASR) systems support several Nigerian languages, their performance consistently lags behind high-resource languages like English and French. Nigerian languages present unique modelling hurdles, including acute data scarcity, inconsistent orthography, tonal diacritics, diverse accents, frequent code-switching, and localized named entities. To address these challenges, we developed a multilingual ASR framework utilizing a two-stage distillation process. First, we employ student-teacher knowledge distillation from existing monolingual models, conditioned on robust language-specific N-gram language models. Second, we perform iterative self-improvement using pseudo-labelled data to further refine accuracy. Our method significantly bridges the performance gap, achieving on average a relative Word Error Rate (WER) reduction of 29% over monolingual baselines. Our models also outperform state-of-the-art multilingual models across major benchmarks, including Common Voice and Fleurs. We introduce Sometin Beta Pass Notin (SBPN), a foundational multilingual ASR model covering Yorùbá, Hausa, Igbo, Nigerian Pidgin, and Nigerian English. SBPN is released in two sizes: SBPN-Base (120M parameters) and SBPN-Large (600M parameters). By releasing these as open foundation models, we aim to provide ASR resources for further research into the rich phonetic and cultural landscape of the region.

Chinese Translation

尽管现代多语言自动语音识别（ASR）系统支持多种尼日利亚语言，但其性能始终落后于英语和法语等高资源语言。尼日利亚语言面临独特的建模挑战，包括数据稀缺、拼写不一致、声调符号、多样的口音、频繁的代码切换以及本地化的命名实体。为了解决这些挑战，我们开发了一种多语言ASR框架，采用两阶段的蒸馏过程。首先，我们利用现有的单语模型进行学生-教师知识蒸馏，并依赖于强大的语言特定N-gram语言模型。其次，我们使用伪标注数据进行迭代自我改进，以进一步提高准确性。我们的方法显著缩小了性能差距，平均实现了相对于单语基线29%的词错误率（WER）降低。我们的模型在包括Common Voice和Fleurs在内的主要基准测试中也超越了最先进的多语言模型。我们推出了Sometin Beta Pass Notin (SBPN)，这是一个基础的多语言ASR模型，覆盖了约鲁巴语、豪萨语、伊博语、尼日利亚皮钦语和尼日利亚英语。SBPN以两种规模发布：SBPN-Base（120 M参数）和SBPN-Large（600 M参数）。通过将这些模型作为开放基础模型发布，我们旨在为进一步研究该地区丰富的语音和文化景观提供ASR资源。

cs.CL / 62 / 2605.17714

From Documents to Segments: A Contextual Reformulation for Topic Assignment

从文档到片段：主题分配的上下文重构

Yoon, Hoonsang, Kim, Takyoung, Lee, Wonkee, Cho, Ilmin, Hakkani-Tür, Dilek, Choi, Stanley Jungkyu

Abstract

Traditional topic modeling assigns a single topic to each document. In practice, however, many real-world documents, such as product reviews or open-ended survey responses, contain multiple distinct topics. This mismatch often leads to topic contamination, where unrelated themes are merged into a single topic, making it difficult to identify documents that truly focus on a specific subject. We address this issue by introducing segment-based topic allocation (SBTA), a reformulation of topic modeling that assigns topics not to entire documents, but to segments: short, coherent spans of text that each express a single theme. By modeling topical structure at the segment level, our approach yields cleaner and more interpretable topics and better supports analysis of multi-theme documents. To support systematic evaluation, we construct SemEval-STM, a new dataset inspired by aspect-based sentiment analysis. Documents are first decomposed into topical segments using large language models (LLMs), followed by human refinement to ensure segment quality. We also propose a segment-level extension of the word intrusion task, enabling human evaluation of topical coherence at the granularity where topics are actually assigned. Across multiple models and evaluation metrics, we show that SBTA improves clustering quality and interpretability. Overall, this work provides a practical, scalable framework for fine-grained topic analysis in heterogeneous text corpora where documents naturally span multiple topics. URL: https://huggingface.co/datasets/LG-AI-Research/SemEval-STM

Chinese Translation

传统的主题建模为每个文档分配一个单一主题。然而，在实际应用中，许多现实世界的文档，如产品评论或开放式调查反馈，包含多个不同的主题。这种不匹配常常导致主题污染，即不相关的主题被合并为一个单一主题，从而使得识别真正专注于特定主题的文档变得困难。我们通过引入基于片段的主题分配（Segment-Based Topic Allocation, SBTA）来解决这一问题，这是一种主题建模的重构方法，它不是将主题分配给整个文档，而是分配给片段：每个片段是表达单一主题的短小而连贯的文本段落。通过在片段级别建模主题结构，我们的方法产生了更清晰、更易解释的主题，并更好地支持多主题文档的分析。为了支持系统评估，我们构建了一个名为SemEval-STM的新数据集，该数据集受到基于方面的情感分析的启发。文档首先使用大型语言模型（Large Language Models, LLMs）被分解为主题片段，然后通过人工精炼以确保片段质量。我们还提出了一种片段级别的词干扰任务扩展，能够在人类评估主题连贯性时，达到实际分配主题的粒度。在多个模型和评估指标中，我们展示了SBTA提高了聚类质量和可解释性。总体而言，这项工作为在自然跨越多个主题的异构文本语料库中进行细粒度主题分析提供了一个实用、可扩展的框架。

cs.CL / 63 / 2605.17755

Bridging the Version Gap: Multi-version Training Improves ICD Code Prediction, Especially for Rare Codes

弥合版本差距：多版本训练改善ICD编码预测，尤其是针对稀有编码

Liu, Jinghui, Nguyen, Anthony

Abstract

Clinical coding maps clinical documentation to standardized medical codes, an essential yet time-consuming administrative task that could benefit from automation. Current models on ICD coding are typically optimized for codes from a specific ICD version. However, in reality, ICD systems evolve continuously, and different versions are adopted across time periods and regions. Moreover, ICD coding suffers from the long-tail problem, and rare code performance can be a bottleneck for developing implementable models. We examine whether it is viable to train version-independent models by combining data annotated in different ICD versions, which may help address these challenges. We add ICD-9 data to the training of a modified label-wise attention model for ICD-10 prediction, and find that despite the version mismatch, adding ICD-9 yields a 27% increase in micro F1 for 18K rare ICD codes compared to training on ICD-10 alone. On 8K frequent ICD-10 codes, the multi-version training also substantially improves macro metrics, with far fewer model parameters.

Chinese Translation

临床编码将临床文档映射到标准化的医疗编码，这是一个重要但耗时的行政任务，能够从自动化中受益。当前的ICD编码模型通常针对特定ICD版本的编码进行优化。然而，实际上，ICD系统不断演变，不同版本在不同时间段和地区被采用。此外，ICD编码还面临长尾问题，稀有编码的表现可能成为开发可实施模型的瓶颈。我们研究通过结合不同ICD版本注释的数据，训练版本独立模型是否可行，这可能有助于解决这些挑战。我们将ICD-9数据添加到修改后的标签级注意力模型的ICD-10预测训练中，发现尽管存在版本不匹配，添加ICD-9与仅使用ICD-10训练相比，18K稀有ICD编码的微F1值提高了27%。在8K常见ICD-10编码上，多版本训练也显著改善了宏观指标，同时模型参数大幅减少。
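The abstract above reports micro F1 for rare codes and macro metrics for frequent codes. As a minimal sketch of why the two can diverge on long-tail label sets, the following computes both for multi-label predictions. The code names `A` and `B` are placeholders, and this is not the authors' evaluation script.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(gold, pred):
    """gold, pred: lists of code sets, one set per document."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for code in g & p:
            tp[code] += 1
        for code in p - g:
            fp[code] += 1
        for code in g - p:
            fn[code] += 1
    labels = set(tp) | set(fp) | set(fn)
    # Micro: pool counts over all codes, so frequent codes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: average per-code F1, so every rare code counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels) if labels else 0.0
    return micro, macro

# Placeholder data: frequent code "A" is always right, rare code "B" is missed.
gold = [{"A"}, {"A"}, {"A"}, {"B"}]
pred = [{"A"}, {"A"}, {"A"}, set()]
micro, macro = micro_macro_f1(gold, pred)
```

In this toy case the missed rare code barely moves the micro score but halves the macro score, which is why rare-code performance is usually reported separately.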

cs.CL / 64 / 2605.17774

Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning

通过QLoRA微调在小型语言模型中内化工具知识

Shemla, Yuval, Yakobe, Ayal, Agarwal, Tanmay

Abstract

Large language models are increasingly used as planning components in agentic systems, but current tool-use pipelines often require full tool schemas to be included in every prompt, creating substantial token overhead and limiting the practicality of smaller models. This paper investigates whether tool-use knowledge can be internalized into small language models through parameter-efficient fine-tuning, enabling structured planning without explicit tool descriptions at inference time. Using AssetOpsBench as the primary benchmark, we fine-tune Gemma 4 E4B and Qwen3-4B with 8-bit QLoRA on approximately 1,700 tool-use examples spanning tool knowledge, question-to-plan mappings, and execution-style traces. We evaluate the resulting models under description-free inference, where the prompt omits the tool catalog entirely. The fine-tuned models outperform an informed unfine-tuned baseline that receives full tool descriptions, reducing input length by 82.6% while improving structural and LLM-judge planning scores. In the best Gemma run, the model achieves an AT-F1 of 0.65 and an overall judge score of 3.88, compared with 0.47 and 2.88 for the informed baseline. Qwen3-4B achieves a strong overall judge score of 3.78 while using 62% less memory and running 2.5$\times$ faster than Gemma, though it also exhibits greater catastrophic forgetting on general multiple-choice benchmarks. Additional ablations show that LoRA rank controls a quality-retention trade-off, with $r=32$ maximizing planning quality and smaller ranks preserving more general knowledge. These results suggest that, for fixed tool catalogs, QLoRA fine-tuning can shift tool knowledge from prompt context into model weights, substantially reducing inference overhead while maintaining or improving tool-planning quality.

Chinese Translation

大型语言模型越来越多地被用作代理系统中的规划组件，但当前的工具使用流程通常要求在每个提示中包含完整的工具模式，这造成了显著的令牌开销，并限制了较小模型的实用性。本文探讨了是否可以通过参数高效的微调将工具使用知识内化到小型语言模型中，从而在推理时实现结构化规划而无需显式的工具描述。我们以AssetOpsBench作为主要基准，使用8位QLoRA对Gemma 4 E4B和Qwen3-4B进行了微调，涉及约1,700个工具使用示例，涵盖工具知识、问题到计划的映射以及执行风格的痕迹。我们在无描述推理下评估了结果模型，其中提示完全省略了工具目录。微调后的模型在输入长度减少82.6%的同时，超越了接收完整工具描述的知情未微调基线，并改善了结构性和LLM评审的规划分数。在最佳的Gemma运行中，该模型达到了0.65的AT-F1和3.88的整体评审分数，而知情基线的得分分别为0.47和2.88。Qwen3-4B在内存占用减少62%、运行速度比Gemma快2.5倍的情况下，获得了3.78的较高整体评审分数，尽管它在一般多项选择基准上表现出更大的灾难性遗忘。额外的消融实验表明，LoRA秩控制了质量与保留之间的权衡，其中$r=32$最大化了规划质量，而较小的秩则保留了更多的一般知识。这些结果表明，对于固定的工具目录，QLoRA微调可以将工具知识从提示上下文转移到模型权重中，显著减少推理开销，同时保持或改善工具规划质量。
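The rank trade-off mentioned in the abstract is easier to picture with the standard LoRA parameter count: adapting a d_out x d_in weight matrix with rank r trains two factors holding r * (d_in + d_out) parameters in total. The sketch below only does that arithmetic. The 2560 x 2560 layer size is a hypothetical value chosen for illustration, not a dimension taken from Gemma or Qwen3.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

def fraction_of_full_matrix(d_in, d_out, rank):
    """Adapter size relative to fine-tuning the full d_out x d_in matrix."""
    return lora_trainable_params(d_in, d_out, rank) / (d_in * d_out)

# Hypothetical square projection layer; real model dimensions will differ.
D = 2560
for r in (8, 16, 32):
    print(r, lora_trainable_params(D, D, r), f"{fraction_of_full_matrix(D, D, r):.2%}")
```

Adapter capacity grows linearly with rank, which is consistent with the abstract's observation that a larger rank buys more task quality while smaller ranks disturb the base model less.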

cs.CL / 65 / 2605.17775

Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale

对百万条笔记规模下由大型语言模型（LLMs）改写的合成临床笔记质量的系统评估

Liu, Jinghui, Soni, Sarvesh, Nguyen, Anthony

Abstract

Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect -- such as similarity or utility comparisons -- even though these aspects are complementary and best viewed in parallel. In this study, we conduct a systematic evaluation of LLM-generated clinical text, comprising intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for tasks like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes -- despite their task-agnostic nature -- can effectively augment task-specific training for rare ICD codes.

Chinese Translation

大型语言模型（LLMs）能够生成或合成临床文本，适用于多种应用，从改善临床文档到增强临床文本分析。然而，现有评估通常集中于狭窄的方面——例如相似性或效用比较——尽管这些方面是互补的，最好并行考虑。在本研究中，我们旨在对LLM生成的临床文本进行系统评估，包括对从MIMIC数据库中改写的合成临床笔记进行内在、外在和事实性评估，规模达到百万条笔记。我们的分析表明，尽管存在显著的语言变化，合成笔记仍能保留核心临床信息和粗粒度任务的预测效用，但在ICD编码等任务中却丧失了细粒度细节。我们展示了通过分块而非整体改写笔记可以显著减轻这一细节损失，但代价是降低了在不完整上下文下的事实精度。通过事实核查和错误分析，我们进一步发现，合成错误主要源于对临床上下文的误解，以及时间混淆、测量错误和虚假陈述。最后，我们表明，尽管合成笔记具有任务无关性，但仍能有效增强针对稀有ICD编码的任务特定训练。

cs.CL / 66 / 2605.17789

SocialMemBench: Are AI Memory Systems Ready for Social Group Settings?

SocialMemBench：人工智能记忆系统是否准备好应对社交群体环境？

Owolabi, Olukunle

Abstract

Memory systems for AI assistants were built for single-user dialogue and fail characteristically when applied to multi-party social group settings. This gap matters for the social assistants being built today: group-acting agents embedded in chat platforms, and proactive personal-assistant agents whose holistic model of a user must include their social context. Existing memory benchmarks evaluate dyadic or workplace dialogue; none targets multi-party social groups, where memory must anchor facts in shared history rather than professional roles, separate group norms from individual exceptions, and correctly attribute even after member departure. We introduce SocialMemBench, a benchmark of human-verified synthetic social group networks across five archetypes (close friends, family, recreational, interest community, acquaintance network) and three group-size tiers (4-30 members), with 430 personas and 7,355 conversation turns, yielding 1,031 QA pairs across nine question categories. Each category isolates an architectural capability, and the five failure modes (single-stream conflation, temporal-state overwrite, entity merging at scale, missing cross-persona knowledge, norm-individual conflation) are testable hypotheses; our two research probes Subject-Mem and SMG provide evidence on two, three remain open. A full-context Gemini 2.5 Flash reference reaches only 0.721 against a blind-critic reasoning-model mean of 0.98 on small networks, indicating the benchmark is genuinely difficult even with complete access to the conversation. Across all 43 networks, the four open-source memory frameworks evaluated (Mem0, LangMem, Graphiti, Cognee) cluster in the 0.12-0.18 question-weighted range with overlapping 95% CIs, well below an uncompressed retrieval reference of 0.345 and a matched-answerer full-context reference of 0.369 (GPT-4o-mini). Current memory systems show a measurable gap.

Chinese Translation

人工智能助手的记忆系统是为单用户对话构建的，在多方社交群体环境中应用时表现出明显的不足。这一差距对当今正在构建的社交助手至关重要：嵌入聊天平台的群体行动代理，以及其整体用户模型必须包括社交背景的主动个人助手代理。现有的记忆基准评估的是双人或工作场所对话；没有针对多方社交群体的基准，其中记忆必须将事实锚定在共享历史中，而非专业角色，区分群体规范与个体例外，并在成员离开后正确归属。我们引入了SocialMemBench，这是一个涵盖五种原型（密友、家庭、休闲、兴趣社区、熟人网络）和三个群体规模层级（4-30名成员）的经人验证的合成社交群体网络基准，包含430个角色和7,355个对话轮次，生成了跨九个问题类别的1,031个问答对。每个类别都孤立出一种架构能力，五种失败模式（单流混淆、时间状态覆盖、大规模实体合并、缺失跨角色知识、规范与个体混淆）是可测试的假设；我们的两个研究探针Subject-Mem和SMG提供了两个的证据，三项仍待探索。完整上下文的Gemini 2.5 Flash参考在小型网络中仅达到0.721，而盲评推理模型的均值为0.98，表明即使完全访问对话，该基准仍然非常困难。在所有43个网络中，评估的四个开源记忆框架（Mem0、LangMem、Graphiti、Cognee）聚集在0.12-0.18的问题加权范围内，95%置信区间重叠，远低于未压缩检索参考的0.345和匹配回答者的完整上下文参考的0.369（GPT-4o-mini）。当前的记忆系统显示出明显的差距。

cs.CL / 67 / 2605.17849

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

从有机数据生成预训练标记以实现数据绑定扩展

Yu, Zichun, Xiong, Chenyan

Abstract

LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (organic) text falls far short of scaling demands. However, reaching the data-bound regime does not mean the model has fully utilized its organic corpus. In this paper, we introduce SynPro, a synthetic data generation framework that helps LLMs more thoroughly learn from limited organic data. SynPro applies two operations, rephrasing and reformat, that present the same organic source in diverse forms to facilitate deeper learning without introducing external information. Both generators are optimized via reinforcement learning with quality, faithfulness, and data influence rewards, and are continuously updated as pretraining plateaus to target content the model has yet to absorb. We pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens (0.8B and 2.2B) from DCLM-Baseline, reflecting a realistic data-bound regime in frontier pretraining. Our results reveal that organic data is significantly underutilized by standard repetition: SynPro unlocks 3.7-5.2x the effective tokens of repetition, even surpassing the non-data-bound oracle that trains on equivalent unique data at the 1.1B scale. Analyses confirm that faithful, model-aware synthesis sustains data-bound scaling without causing distribution collapse. We open-source our code at https://github.com/cxcscmu/SynPro.

Chinese Translation

大规模语言模型（LLM）的预训练正从计算限制转向数据限制，其中可用的人类（有机）文本远远不能满足扩展需求。然而，达到数据限制并不意味着模型已充分利用其有机语料库。本文介绍了SynPro，一种合成数据生成框架，旨在帮助LLM更全面地从有限的有机数据中学习。SynPro应用了两种操作，即重述（rephrasing）和重新格式化（reformat），以多样化的形式呈现相同的有机来源，从而促进更深入的学习，而不引入外部信息。这两种生成器通过强化学习进行优化，奖励包括质量、忠实度和数据影响，并在预训练达到平台期时持续更新，以针对模型尚未吸收的内容。我们使用来自DCLM-Baseline的10% Chinchilla最优标记（0.8B和2.2B）对400M和1.1B模型进行预训练，反映了前沿预训练中的现实数据绑定状态。我们的结果表明，标准重复方式显著低估了有机数据的利用率：SynPro解锁了3.7-5.2倍的有效重复标记，甚至超过了在1.1B规模上训练等效唯一数据的非数据绑定预言机。分析确认，忠实且模型感知的合成能够维持数据绑定扩展，而不会导致分布崩溃。我们在https://github.com/cxcscmu/SynPro上开源了我们的代码。
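The token budgets quoted in the abstract (0.8B and 2.2B) can be checked with quick arithmetic. The sketch below assumes the commonly cited rule of thumb of roughly 20 training tokens per parameter for Chinchilla-optimal training. That ratio is an assumption on our part and is not stated in the abstract, although it reproduces the quoted figures.

```python
# Assumption: ~20 tokens per parameter as the Chinchilla-optimal rule of thumb.
TOKENS_PER_PARAM = 20

def data_bound_budget(n_params, fraction=0.10, tokens_per_param=TOKENS_PER_PARAM):
    """Organic-token budget when only `fraction` of the optimal tokens is available."""
    optimal_tokens = n_params * tokens_per_param
    return optimal_tokens * fraction

for n_params in (400e6, 1.1e9):
    budget = data_bound_budget(n_params)
    print(f"{n_params / 1e9:.1f}B params -> {budget / 1e9:.1f}B organic tokens")
```

Under that assumption the 400M model has about 8B optimal tokens and the 1.1B model about 22B, so 10% of each matches the 0.8B and 2.2B budgets in the abstract.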

cs.CL / 68 / 2605.17860

PAREDA: A Multi-Accent Speech Dataset of Natural Language Processing Research Discussions

PAREDA：一个多口音的自然语言处理研究讨论语音数据集

Jin, Sicheng, Srirag, Dipankar, Joshi, Aditya

Abstract

While modern Automatic Speech Recognition (ASR) systems achieve high accuracy on benchmark corpora, their performance often degrades when there is real-world variability. This work focuses on variability arising due to accented, spontaneous, and domain-specific speech. In particular, we introduce PAper REading DAtaset (PAREDA), a first-of-its-kind multi-accent speech dataset consisting of discussions on academic Natural Language Processing (NLP) papers between speakers with Australian, Indian-English, and Chinese English accents. Each session elicits a spontaneous monologue (a summary of a paper's abstract) and a non-monologue (a question-and-answer session between participants), resulting in a corpus rich with technical jargon and conversational phenomena. We evaluate the performance of SOTA ASR models on PAREDA, analysing the impact of accent mixing and increased speech rate. Our results show that, in the zero-shot setting, models perform worse, confirming the dataset's challenging nature. However, fine-tuning on PAREDA significantly reduces the Word Error Rate (WER), demonstrating that our dataset captures linguistic characteristics often missing from existing corpora. PAREDA serves as a valuable new resource for building and evaluating more robust and inclusive ASR systems for specialised, real-world applications.

Chinese Translation

尽管现代自动语音识别（ASR）系统在基准语料库上取得了高准确率，但在真实世界的变异性下，其性能往往会下降。本研究关注由于口音、自发性和特定领域的语言而产生的变异性。特别地，我们介绍了PAper REading DAtaset（PAREDA），这是首个多口音语音数据集，包含了具有澳大利亚、印度英语和中国英语口音的讲者之间关于学术自然语言处理（NLP）论文的讨论。每个会话引发一个自发的独白（论文摘要的总结）和一个非独白（参与者之间的问答环节），从而形成一个充满技术术语和对话现象的语料库。我们评估了最先进的ASR模型在PAREDA上的表现，分析了口音混合和语速增加的影响。我们的结果显示，在零样本设置下，模型的表现较差，证实了该数据集的挑战性。然而，在PAREDA上进行微调显著降低了词错误率（WER），表明我们的数据集捕捉到了现有语料库中常常缺失的语言特征。PAREDA为构建和评估更强大、更具包容性的ASR系统在特定领域的真实应用提供了一个宝贵的新资源。

cs.CL / 69 / 2605.17885

Multi-agent AI systems outperform human teams in creativity

多智能体人工智能系统在创造力方面超越人类团队

Hu, Tiancheng, Jiang, Yixuan, Li, Haotian, Hernández-Orallo, José, Xie, Xing, Collier, Nigel, Stillwell, David, Sun, Luning

Abstract

Although artificial intelligence (AI) now matches or exceeds human performance across numerous cognitive tasks, creativity remains a highly contested frontier. As AI systems based on large language models (LLMs) are increasingly adopted in research and innovation, it is essential to understand and augment their creativity. Here we demonstrate that multi-agent LLM teams not only surpass single agents, but also substantially outperform human teams in creativity (Cohen's d=1.50) across 4,541 multi-agent LLM ideas and 341 human-team ideas on six diverse problem-solving tasks. This advantage is driven by novelty while maintaining comparable usefulness. To investigate the generative processes in both groups, we represent conversations as paths through semantic space using neural language model representations. Both LLM and human teams produce more creative ideas when conversations range widely rather than staying centered on a single theme (low global coherence). However, the additional patterns that predict creativity differ: LLM teams benefit from efficient exploration (high semantic spread, shorter paths), while human teams benefit from maintaining smooth conversational flow (high local coherence, frequent pivots). Additionally, we identify model choice and discussion structure as orthogonal design levers that together explain 26.8% of variance in LLM conversational dynamics, paving the way for systematic approaches to developing multi-agent systems with augmented creative capabilities.

Chinese Translation

尽管人工智能（AI）在众多认知任务中已达到或超过人类表现，但创造力仍然是一个高度争议的前沿领域。随着基于大型语言模型（LLMs）的AI系统在研究和创新中的日益普及，理解和增强其创造力变得至关重要。在此，我们展示了多智能体LLM团队不仅超越单一代理，而且在创造力方面显著超越人类团队（Cohen's d=1.50），在六个不同的问题解决任务中，涉及4,541个多智能体LLM创意和341个人类团队创意。这一优势源于新颖性，同时保持了相当的实用性。为了探讨两个组别的生成过程，我们将对话表示为通过语义空间的路径，使用神经语言模型表示。无论是LLM团队还是人类团队，当对话范围广泛而不是集中于单一主题（低全局连贯性）时，都会产生更具创造力的想法。然而，预测创造力的附加模式有所不同：LLM团队受益于高效探索（高语义扩散，较短路径），而人类团队则受益于保持流畅的对话流程（高局部连贯性，频繁转折）。此外，我们还确定了模型选择和讨论结构作为正交设计杠杆，这两者共同解释了LLM对话动态中26.8%的方差，为开发具有增强创造能力的多智能体系统提供了系统化的方法。

cs.CL / 70 / 2605.17911

A Pilot Benchmark for NL-to-FOL Translation in Planetary Exploration

行星探索中自然语言到一阶逻辑翻译的初步基准

Moore, Hayden, Saha, Suman, Farooque, Mahfuza

Abstract

Future planetary exploration envisions autonomous robotic agents operating under severe communication constraints, without global positioning, and with minimal human intervention. In such environments, agents must not only perceive and act, but also reason over mission objectives, operational constraints, and evolving environmental conditions. While prior work has largely focused on perception and control, the translation of high-level mission knowledge into structured, machine-interpretable representations remains underexplored. We introduce a pilot benchmark for translating natural language (NL) into First-Order Logic (FOL) within the domain of planetary exploration. The dataset is constructed from real mission documentation sourced from NASA's Planetary Data System (PDS), spanning missions from 2003 to 2013. These documents describe mission phases such as launch, boost, coast, cruise, and orbital operations in rich natural language. We manually annotate these documents with corresponding FOL representations that capture temporal structure, agent roles, and operational dependencies. In addition, we provide structured predicate vocabularies and typed constants to enable controlled experimentation with varying levels of prior knowledge. This pilot benchmark provides a foundation for research at the intersection of language understanding and formal reasoning, grounded in real-world, safety-critical mission data. The dataset is provided at: https://github.com/HaydenMM/planetary-logic-benchmark/blob/main/pilot_benchmark.json

Chinese Translation

未来的行星探索设想在严苛的通信限制下，自动化机器人代理能够自主操作，且不依赖全球定位，且人类干预最小。在这样的环境中，代理不仅需要感知和行动，还必须对任务目标、操作约束和不断变化的环境条件进行推理。尽管之前的研究主要集中于感知和控制，但将高层次的任务知识翻译为结构化的、机器可解释的表示仍然未得到充分探索。我们在行星探索领域引入了一个自然语言（NL）到一阶逻辑（FOL）翻译的初步基准。该数据集由来自美国国家航空航天局（NASA）行星数据系统（PDS）的真实任务文档构建，涵盖了2003年至2013年的任务。这些文档以丰富的自然语言描述了任务阶段，如发射、助推、滑行、巡航和轨道操作。我们手动为这些文档注释了相应的FOL表示，捕捉了时间结构、代理角色和操作依赖关系。此外，我们提供了结构化的谓词词汇和类型常量，以便在不同的先验知识水平下进行受控实验。这个初步基准为语言理解与形式推理交叉领域的研究奠定了基础，基于真实世界的安全关键任务数据。数据集可在以下链接获取：https://github.com/HaydenMM/planetary-logic-benchmark/blob/main/pilot_benchmark.json

cs.CL / 71 / 2605.17932

Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA

扩散大语言模型中的提示压缩：在 LLaDA 上评估 LLMLingua-2

Huang, Sterling, Brown, Abigayle, Noh, Jiyoo, Xu, Jiakang, Huo, Wantong, Kyaw, Kaung Myat, Chan, Jonathan

Abstract

Prompt compression reduces inference cost and context length in large language models, but prior evaluations focus primarily on autoregressive architectures. This study investigates whether prompt compression transfers effectively to diffusion large language models (DLLMs) using LLMLingua-2, specifically the 8B-parameter DLLM LLaDA. We evaluate compression performance on GSM8K, DUC2004, and ShareGPT using 250 prompts per dataset at an approximate 2$\times$ compression ratio, across mathematical reasoning, prompt reconstruction, and summarization tasks. Outputs generated from original prompts, compressed prompts, reconstructed prompts, and reconstructed-prompt reasoning were compared using exact-match accuracy, BLEU, ROUGE, and BERTScore. Results show that semantic preservation does not necessarily imply stable downstream behavior in diffusion models. Summarization tasks remained comparatively robust under compression, while mathematical reasoning degraded substantially despite high semantic similarity scores. Reconstruction experiments further showed that semantically similar prompts may still omit reasoning-critical information required for stable denoising. Across tasks, BERTScore recall was consistently lower than precision, suggesting that compression failures are primarily driven by information omission rather than semantic drift. These findings indicate that prompt compression methods designed for autoregressive models do not transfer uniformly to diffusion large language models and motivate the development of diffusion-aware compression strategies.

Chinese Translation

提示压缩可以降低大语言模型的推理成本和上下文长度，但之前的评估主要集中在自回归架构上。本研究探讨了提示压缩是否能够有效转移到扩散大语言模型（DLLMs），使用 LLMLingua-2，特别是 8B 参数的 DLLM LLaDA。我们在 GSM8K、DUC2004 和 ShareGPT 上评估了压缩性能，每个数据集使用 250 个提示，压缩比约为 2$\times$，涵盖数学推理、提示重建和摘要任务。我们使用精确匹配准确率、BLEU、ROUGE 和 BERTScore 比较了从原始提示、压缩提示、重建提示和重建提示推理生成的输出。结果表明，语义保留并不一定意味着扩散模型中的下游行为稳定。摘要任务在压缩下仍然相对稳健，而尽管数学推理的语义相似度得分较高，但其性能却显著下降。重建实验进一步表明，语义相似的提示可能仍然遗漏了稳定去噪所需的推理关键性信息。在各项任务中，BERTScore 的召回率始终低于精确率，表明压缩失败主要是由于信息遗漏而非语义漂移。这些发现表明，针对自回归模型设计的提示压缩方法并不能均匀地转移到扩散大语言模型，并激励开发适应扩散模型的压缩策略。
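The abstract reads "recall lower than precision" as a sign of omission rather than drift. The sketch below illustrates that reasoning with plain token overlap, a deliberately simplified stand-in for BERTScore (which matches contextual embeddings, not exact tokens). Treating the compressed text as the candidate and the original as the reference is our assumption, and the example sentences are invented.

```python
from collections import Counter

def overlap_precision_recall(candidate, reference):
    """Token-overlap precision and recall between two whitespace-tokenized strings."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    matched = sum((cand & ref).values())  # clipped count of shared tokens
    precision = matched / sum(cand.values()) if cand else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    return precision, recall

# Invented example: the compressed prompt drops words but adds nothing new.
reference = "Tom has 3 apples and buys 4 more apples each day for 5 days"
candidate = "Tom has 3 apples buys 4 more for 5 days"
precision, recall = overlap_precision_recall(candidate, reference)
```

Every token kept by the compressor is found in the original, so precision stays at 1.0, while the dropped words ("each day" among them) lower recall. For a math word problem, those dropped words can be exactly the ones the answer depends on.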

cs.CL / 72 / 2605.17936

Universal Adversarial Triggers

通用对抗触发器

Arockiaraj, Benedict Florance, Feng, Alexander, Cai, Jianxiong, Cheng, Xiaoyu

Abstract

Recent work has shown that modern NLP models trained for diverse tasks, ranging from sentiment analysis to language generation, succumb to universal adversarial attacks, a class of input-agnostic attacks where a common trigger sequence is used to attack the model. Although these attacks are successful, the triggers they generate are ungrammatical and unnatural. Our work proposes a novel technique combining part-of-speech filtering and a perplexity-based loss function to generate sensible triggers that are closer to natural phrases. For the task of sentiment analysis on the SST dataset, the method produces sensible triggers that achieve accuracies as low as 0.04 and 0.12 for flipping positive to negative predictions and vice versa. To build robust models, we also perform adversarial training using the generated triggers, which increases the accuracy of the model from 0.12 to 0.48. We aim to illustrate that adversarial attacks can be made difficult to detect by generating sensible triggers, and to facilitate robust model development through relevant defenses.

Chinese Translation

近期的研究表明，现代自然语言处理（NLP）模型在情感分析到语言生成等多种任务上，容易受到通用对抗攻击的影响。这类攻击是一种与输入无关的攻击方式，使用一个共同的触发序列来攻击模型。尽管这些攻击成功，但此类攻击生成的触发器通常不符合语法且不自然。我们的研究提出了一种新颖的技术，结合词性过滤和基于困惑度的损失函数，以生成更接近自然短语的合理触发器。在对SST数据集进行情感分析的任务中，该方法生成的合理触发器使得模型在将正面预测翻转为负面预测时的准确率低至0.04，而将负面预测翻转为正面预测时的准确率低至0.12。为了构建更强健的模型，我们还使用生成的触发器进行对抗训练，将模型的准确率从0.12提高到0.48。我们的目标是通过生成合理的触发器，使对抗攻击变得难以检测，并通过相关防御措施促进强健模型的开发。

cs.CL / 73 / 2605.17937

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

BacktestBench：用于自动化量化策略回测的大型语言模型基准测试

Wang, Zhensheng, Yang, Wenmian, Wu, Qingtai, Ma, Lequan, Zhang, Yiquan, Jia, Weijia

Abstract

Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalability. While Large Language Models (LLMs) offer a transformative path to automate this complex, interdisciplinary workflow through advanced code generation, tool usage, and agentic planning, the practical realization is significantly challenged by the current lack of a large-scale benchmark dedicated to automated quantitative backtesting, which hinders progress in this field. To bridge this critical gap, we introduce BacktestBench, the first large-scale benchmark for automated quantitative backtesting. Built from over 6 million real market records, it comprises 18,246 meticulously annotated question-answering pairs across four task categories: metrics calculation, ticker selection, strategy selection, and parameter confirmation. We also propose AutoBacktest, a robust multi-agent baseline that translates natural language strategies into reproducible backtests by coordinating a Summarizer for semantic factor extraction, a Retriever for validated SQL generation, and a Coder for Python backtesting implementation. Our evaluation on 23 mainstream LLMs, complemented by targeted ablations, identifies key factors that influence end-to-end performance and highlights the importance of grounded verification and standardized indicator representations.

Chinese Translation

量化回测对于评估交易策略至关重要，但仍受到高技术门槛和有限可扩展性的制约。尽管大型语言模型（LLMs）通过先进的代码生成、工具使用和自主规划提供了自动化这一复杂跨学科工作流程的变革性路径，但由于缺乏专门针对自动化量化回测的大规模基准，当前的实际实现面临重大挑战，这阻碍了该领域的进展。为填补这一关键空白，我们推出了BacktestBench，这是第一个用于自动化量化回测的大规模基准。该基准基于超过600万条真实市场记录，包含18,246对经过精心注释的问题-回答对，涵盖四个任务类别：指标计算、股票选择、策略选择和参数确认。我们还提出了AutoBacktest，这是一个强大的多智能体基线，通过协调用于语义因子提取的总结器、用于验证SQL生成的检索器和用于Python回测实现的编码器，将自然语言策略转换为可重复的回测。我们对23个主流LLMs的评估，以及针对性的消融实验，识别了影响端到端性能的关键因素，并强调了基于实证验证和标准化指标表示的重要性。
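The abstract lists "metrics calculation" as one of the four task categories without saying which metrics are involved. As a hedged illustration of that kind of task, the sketch below computes two standard backtest statistics, total return and maximum drawdown, for a buy-and-hold position. The price series is invented and the choice of metrics is ours.

```python
def total_return(prices):
    """Fractional gain from the first to the last price."""
    return prices[-1] / prices[0] - 1.0

def max_drawdown(prices):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = prices[0]
    worst = 0.0
    for price in prices:
        peak = max(peak, price)
        worst = max(worst, (peak - price) / peak)
    return worst

# Invented daily closes for a single ticker.
closes = [100.0, 120.0, 90.0, 110.0]
print(f"return={total_return(closes):.2%}  max_drawdown={max_drawdown(closes):.2%}")
```

Even a small series like this shows why verification matters: the position ends up 10% but passes through a 25% drawdown on the way, and an LLM-generated backtest that misses either number would look plausible without a reference calculation.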

cs.CL / 74 / 2605.17978

AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code

AutoVecCoder：教会大型语言模型生成显式向量化代码

Li, Shangzhan, Yin, Xinyu, Jin, Xuanyu, He, Ye, Zhou, Yuxin, Li, Yuxuan, Han, Xu, Che, Wanxiang, Shi, Qi, Liu, Ting, Sun, Maosong

Abstract

Vectorization via Single Instruction, Multiple Data (SIMD) architectures is a cornerstone of high-performance computing. To fully exploit hardware potential, developers often resort to explicit vectorization using intrinsics, as compiler-based auto-vectorization frequently yields suboptimal results due to conservative static analysis. While Large Language Models (LLMs) have demonstrated remarkable proficiency in general code generation, they struggle with explicit vectorization due to the scarcity of high-quality corpora and the strict semantic constraints of low-level hardware instructions. In this paper, we propose AutoVecCoder, a novel framework designed to empower LLMs with the capability of automated explicit vectorization. AutoVecCoder integrates two core components: VecPrompt, an automated data synthesis pipeline to inject domain-specific intrinsic knowledge; and VecRL, a reinforcement learning framework that aligns code generation with execution efficiency. AutoVecCoder-8B trained by this framework achieves state-of-the-art performance on the SSE and AVX subsets of SimdBench and, in some cases, generates implementations surpassing standard -O3 optimizations, effectively overcoming the inherent bottlenecks of traditional automated vectorization.

Chinese Translation

通过单指令多数据（SIMD）架构进行向量化是高性能计算的基石。为了充分利用硬件潜力，开发者通常依赖于使用内置函数的显式向量化，因为基于编译器的自动向量化由于保守的静态分析常常产生次优结果。尽管大型语言模型（LLMs）在一般代码生成方面表现出色，但由于高质量语料库的稀缺以及低级硬件指令的严格语义约束，它们在显式向量化方面面临挑战。本文提出了AutoVecCoder，一个旨在赋予LLMs自动显式向量化能力的新框架。AutoVecCoder集成了两个核心组件：VecPrompt，一个自动化数据合成管道，用于注入领域特定的内置知识；以及VecRL，一个将代码生成与执行效率对齐的强化学习框架。通过该框架训练的AutoVecCoder-8B在SimdBench的SSE和AVX子集上达到了最先进的性能，并且在某些情况下生成的实现超越了标准-O3优化，有效克服了传统自动向量化的固有瓶颈。

cs.CL / 75 / 2605.17989

Predictive Prefetching for Retrieval-Augmented Generation

用于检索增强生成的预测预取

Zhang, Wuyang, Pei, Shichao

Abstract

Retrieval-Augmented Generation (RAG) improves factual grounding in large language models but suffers from substantial latency due to synchronous retrieval. While recent work explores asynchronous retrieval, existing approaches rely on heuristic coordination between retrieval and generation and assume stable information demands during decoding that often break in complex, multi-domain settings. In this paper, we propose an advanced asynchronous retrieval framework that enables predictive prefetching aligned with evolving information needs. The framework explicitly predicts when retrieval should be triggered and what information should be retrieved using three components, a retrieval predictor, a context monitor, and a query generator, by exploiting semantic precursors in generation dynamics that emerge several tokens before uncertainty becomes critical. Experiments on multiple benchmarks demonstrate up to 43.5% end-to-end latency reduction and 62.4% improvement in time-to-first-token, while maintaining answer quality comparable to synchronous RAG baselines.

Chinese Translation

检索增强生成（Retrieval-Augmented Generation, RAG）在大型语言模型中改善了事实基础，但由于同步检索导致了显著的延迟。尽管近期的研究探索了异步检索，但现有的方法依赖于检索与生成之间的启发式协调，并假设在解码过程中信息需求是稳定的，这在复杂的多领域环境中往往会失效。本文提出了一种先进的异步检索框架，能够根据不断变化的信息需求实现预测预取。该框架通过利用在不确定性变得关键之前几个标记中出现的生成动态的语义前兆，明确预测何时触发检索以及应检索何种信息，使用三个组件：检索预测器、上下文监控器和查询生成器。多个基准测试的实验表明，该方法在端到端延迟上减少了多达43.5%，在首次标记的响应时间上提高了62.4%，同时保持了与同步RAG基线相当的答案质量。
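The latency benefit described in the abstract comes from overlapping retrieval with decoding. A minimal `asyncio` sketch of that overlap follows. The fixed `trigger_at` step stands in for the paper's retrieval predictor, and `retrieve` and `generate` are hypothetical helpers using sleeps in place of a real retriever and decoder.

```python
import asyncio

async def retrieve(query, delay=0.05):
    """Stand-in for a retriever call; the sleep simulates lookup latency."""
    await asyncio.sleep(delay)
    return f"docs for {query!r}"

async def generate(tokens, trigger_at, query, step_time=0.01):
    """Emit tokens one by one, launching retrieval early at step `trigger_at`."""
    pending = None
    emitted = []
    for step, token in enumerate(tokens):
        if step == trigger_at:
            # Prefetch: schedule retrieval without blocking decoding.
            pending = asyncio.create_task(retrieve(query))
        await asyncio.sleep(step_time)  # simulates one decoding step
        emitted.append(token)
    # Wait only for whatever retrieval time decoding did not already cover.
    docs = await pending if pending is not None else None
    return emitted, docs

if __name__ == "__main__":
    print(asyncio.run(generate(["The", "answer", "is"], trigger_at=0, query="q")))
```

A synchronous pipeline would pay the full retrieval delay in addition to decoding time. Here the two run concurrently, so the extra wait is at most the part of the retrieval that outlasts the remaining decoding steps.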

cs.CL / 76 / 2605.18001

Bridging the Gap: Converting Read Text to Conversational Dialogue

弥合差距：将朗读文本转换为对话语音

Singla, Parshav, Banerjee, Agnik, Arora, Aaditya, Aggarwal, Shruti, Verma, Anil Kumar, M, Vikram C, Gohil, Raj Prakash, Agarwal, Gopal Kumar

Abstract

In recent advancements within speech processing, converting read speech to conversational speech has gained significant attention. The primary challenge in this domain is maintaining naturalness and intelligibility while minimizing computational overhead for real-time applications. Traditional read speech often lacks the nuanced prosodic variation essential for natural conversational interactions, posing challenges for applications in virtual assistants, customer service, and language learning tools. This paper introduces a novel approach, Prosodic Adjustment with Conversational Context (PACC), aimed at converting read speech into natural conversational speech used in various modern applications. PACC utilizes advanced deep neural networks to analyze and modify prosodic features such as intonation, stress, and rhythm. Unlike conventional methods, our approach uses High-Fidelity Generative Adversarial Networks (HiFi-GAN) for speech synthesis. Our experimental results demonstrate significant improvements in speech conversion, enhancing naturalness and achieving better model accuracy with additional training on speech datasets. This research establishes new benchmarks in speech conversion tasks and Mean Opinion Score (MOS) evaluation for testing model accuracy, and we show that our approach can be successfully extended to other speech conversion applications.

Chinese Translation

在语音处理的最新进展中，将朗读语音转换为对话语音引起了广泛关注。该领域的主要挑战在于保持自然性和可懂性，同时最小化实时应用的计算开销。传统的朗读语音往往缺乏自然对话互动所需的细腻韵律变化，这给虚拟助手、客户服务和语言学习工具等应用带来了挑战。本文提出了一种新颖的方法，称为具有对话上下文的韵律调整（Prosodic Adjustment with Conversational Context，PACC），旨在将朗读语音转换为用于各种现代应用的自然对话语音。PACC利用先进的深度神经网络分析和修改韵律特征，如语调、重音和节奏。与传统方法不同，我们的方法采用高保真生成对抗网络（High-Fidelity Generative Adversarial Networks，HiFi-GAN）进行语音合成。我们的实验结果显示，语音转换显著改善，增强了自然性，并通过对语音数据集的额外训练实现了更好的模型准确性。本研究在语音转换任务和主观评价分数（Mean Opinion Score，MOS）评估模型准确性方面建立了新的基准，并展示了我们的方法可以成功扩展到其他语音转换应用。

cs.CL / 77 / 2605.18007

Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling

推理时对修辞角色标注中的困难示例进行语义重排序

Belfathi, Anas, Hernandez, Nicolas, Monceaux, Laura, Bonnard, Warren, Dufour, Richard

Abstract

Rhetorical Role Labeling (RRL) assigns a functional role to each sentence in a document and is widely used in legal, medical, and scientific domains. While language models (LMs) achieve strong average performance, they remain unreliable on hard examples, where prediction confidence is low. Existing approaches typically handle uncertainty implicitly and treat labels as discrete identifiers, overlooking the semantic information encoded in label names. We introduce RISE, an inference-time semantic reranking framework that leverages label semantics to refine predictions on hard instances. RISE automatically identifies low-confidence predictions and reranks model outputs using contrastively learned label representations, without retraining or modifying the underlying model. Experiments on eight domain-specific RRL datasets with seven LMs, including encoder-based and causal architectures, show an average gain of +9.15 macro-F1 points on hard examples. For explainability, we further propose manual hardness annotations to study difficulty from both model and human perspectives, revealing a moderate agreement with Cohen's kappa = 0.40.

Chinese Translation

修辞角色标注（RRL）为文档中的每个句子分配一个功能角色，广泛应用于法律、医学和科学领域。尽管语言模型（LMs）在平均性能上表现强劲，但在困难示例上仍然不可靠，此时预测置信度较低。现有方法通常隐式处理不确定性，并将标签视为离散标识符，忽视了标签名称中编码的语义信息。我们提出了RISE，一个推理时的语义重排序框架，利用标签语义来优化对困难实例的预测。RISE自动识别低置信度预测，并使用对比学习的标签表示对模型输出进行重排序，而无需重新训练或修改基础模型。在八个特定领域的RRL数据集上进行的实验，使用七种语言模型，包括基于编码器和因果架构的模型，显示在困难示例上平均提升了+9.15个宏F1分数。为了提高可解释性，我们进一步提出手动困难度注释，从模型和人类的角度研究难度，揭示了与Cohen's kappa = 0.40的中等一致性。
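The inference-time idea in the abstract (keep confident predictions, rerank uncertain ones using label semantics) can be sketched as below. The confidence threshold, the use of cosine similarity, and the toy 2-d vectors are all assumptions for illustration. The abstract states only that low-confidence outputs are reranked with contrastively learned label representations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rerank_if_uncertain(probs, sentence_vec, label_vecs, threshold=0.6):
    """Return the classifier's top label, or the semantically closest label
    when the top probability falls below `threshold` (a hypothetical cutoff)."""
    top = max(probs, key=probs.get)
    if probs[top] >= threshold:
        return top  # confident: leave the prediction untouched
    return max(probs, key=lambda label: cosine(sentence_vec, label_vecs[label]))

# Toy label embeddings and a low-confidence prediction (all values invented).
label_vecs = {"Fact": [1.0, 0.0], "Argument": [0.0, 1.0], "Ruling": [1.0, 1.0]}
probs = {"Fact": 0.40, "Argument": 0.35, "Ruling": 0.25}
print(rerank_if_uncertain(probs, [0.1, 0.9], label_vecs))
```

Because the base model is only consulted for its output probabilities, a scheme like this needs no retraining, which matches the "without retraining or modifying the underlying model" claim in the abstract.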

cs.CL / 78 / 2605.18032

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

PROTEA：多智能体大语言模型工作流的离线评估与迭代优化

Kawamura, Kazuki, Waki, Satoshi, Tateno, Kei

Abstract

Multi-agent LLM workflows -- systems composed of multiple role-specific LLM calls -- often outperform single-prompt baselines, but they remain difficult to debug and refine. Failures can originate from subtle errors in intermediate outputs that propagate to downstream nodes, requiring developers to inspect long traces and infer which agent to modify. We present PROTEA, a unified interface for offline, test-driven improvement of multi-agent workflows. PROTEA executes a workflow, scores intermediate node outputs with configurable rubrics, and overlays per-node states and rationales on the workflow graph to localize likely bottlenecks. To support complex systems where final-answer references are the primary supervision, PROTEA performs backward node evaluation: it generates candidate node-level expectations from final-answer references and graph context, then compares them with observed node outputs. For selected nodes, PROTEA presents targeted prompt revisions as editable before/after comparisons, then automatically reruns and re-evaluates the workflow to show output changes and score trajectories within the same interface. In two production-adjacent workflows, PROTEA improved document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38. In a formative study with six experienced LLM developers, participants valued graph-level localization, per-node rationales, and editable before/after prompt revisions.

Chinese Translation

多智能体大语言模型（LLM）工作流——由多个角色特定的LLM调用组成的系统——通常优于单一提示基线，但它们仍然难以调试和优化。故障可能源于中间输出中的细微错误，这些错误会传播到下游节点，要求开发者检查长时间的追踪记录并推断需要修改的智能体。我们提出了PROTEA，一个用于多智能体工作流的离线、测试驱动改进的统一接口。PROTEA执行工作流，使用可配置的评分标准对中间节点输出进行评分，并在工作流图上叠加每个节点的状态和理由，以定位可能的瓶颈。为了支持最终答案引用作为主要监督的复杂系统，PROTEA执行反向节点评估：它根据最终答案引用和图形上下文生成候选节点级期望，然后将其与观察到的节点输出进行比较。对于选定的节点，PROTEA以可编辑的前后对比形式呈现有针对性的提示修订，然后自动重新运行和重新评估工作流，以显示输出变化和评分轨迹。在两个接近生产的工作流中，PROTEA将文档检查准确率从64.3%提高到83.9%，推荐的Hit@5从0.30提高到0.38。在与六位经验丰富的LLM开发者进行的形成性研究中，参与者重视图级定位、每个节点的理由以及可编辑的前后提示修订。

cs.CL / 79 / 2605.18067

PPAI: Enabling Personalized LLM Agent Interoperability for Collaborative Edge Intelligence

PPAI：实现个性化大型语言模型代理的互操作性以促进协作边缘智能

Wang, Zile, Liu, Qianli, Guo, Kaibin, Wang, Haodong, Lin, Jian, Hong, Zicong, Guo, Song

Abstract

Deploying large language models (LLMs) on edge devices enables personalized LLM agents for various users. The growing availability of diverse personalized agents presents a unique opportunity for peer-to-peer (P2P) collaboration, wherein each user can delegate tasks beyond the local agent's expertise to remote agents more suited for the specific query. This paper introduces PPAI, the first personalized LLM agent interoperability system, which enables users to collaborate with each other based on agent specialization. However, the ever-changing pool of agents and their interchangeable capacity introduce new challenges when it comes to matching queries to agents and balancing loads, compared with existing P2P systems. Therefore, we propose a scalable query-agent pair scoring mechanism based on prototypes to identify suitable agents within a P2P network with churn. Moreover, we propose a multi-agent interoperability Bayesian game to balance local demand and global efficiency when changes in remote agent load occur too quickly to be observed. Finally, we implement a prototype of PPAI and demonstrate that it substantially broadens the range of tasks that can be carried out while maintaining load balance. On average, it achieves an accuracy improvement of up to 7.96% across multiple tasks, while reducing latency by 16.34% compared to the baseline.
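The abstract does not specify how the prototype-based query-agent scoring works. One plausible reading can be sketched as follows, assuming each agent is summarized by a few prototype embedding vectors and a query is scored against an agent by its best cosine match; the agent names, the vectors, and the max-over-prototypes rule are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_agents(query: Vector, prototypes: Dict[str, List[Vector]]) -> List[Tuple[str, float]]:
    """Score each agent by its best-matching prototype, highest score first."""
    scores = [
        (agent, max(cosine(query, proto) for proto in protos))
        for agent, protos in prototypes.items()
        if protos  # skip agents that have no prototypes yet
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional embeddings, for illustration only.
prototypes = {
    "agent_legal": [[1.0, 0.0, 0.0]],
    "agent_code": [[0.0, 1.0, 0.0], [0.0, 0.8, 0.6]],
}
ranking = score_agents([0.0, 0.6, 0.8], prototypes)
```

Keeping only a handful of prototypes per agent means a joining or leaving peer changes a small table rather than a full index, which is one way such a scheme could tolerate churn.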

Chinese Translation

在边缘设备上部署大型语言模型（LLM）使得为不同用户提供个性化的LLM代理成为可能。日益丰富的个性化代理的可用性为点对点（P2P）协作提供了独特的机会，在这种协作中，每个用户可以将超出本地代理专业领域的任务委托给更适合特定查询的远程代理。本文介绍了PPAI，这是第一个个性化LLM代理互操作性系统，能够使用户基于代理专业化进行协作。然而，与现有的P2P系统相比，代理池的不断变化及其可互换性在将查询匹配到代理和负载平衡方面引入了新的挑战。因此，我们提出了一种基于原型的可扩展查询-代理配对评分机制，以识别在动态变化的P2P网络中适合的代理。此外，我们提出了一种多代理互操作性贝叶斯博弈，以在远程代理负载变化过快而无法观察时平衡本地需求和全球效率。最后，我们实现了PPAI的原型，并证明它在保持负载平衡的同时显著扩展了可执行任务的范围。与基线相比，平均提高了多项任务的准确率达7.96%，同时将延迟降低了16.34%。

cs.CL / 80 / 2605.18071

KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

KVDrive：一个面向长上下文 LLM 推理的整体多层次 KV 缓存管理系统

Lin, Jian, Mi, Jiazhi, Hong, Zicong, Wang, Haodong, Liu, Qianli, Zhang, Haodyue, Li, Peng, Guo, Song

Abstract

Supporting long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding, but this strategy quickly hits a ceiling: sparsity cannot be pushed further without degrading accuracy. As a result, when context length and batch size grow, the volume of KV transfers rises sharply and becomes the dominant source of decoding latency. We present KVDrive, a holistic multi-tier KV cache management system spanning GPU memory, host DRAM, and SSD. Unlike prior work that pursues greater sparsity through algorithmic refinements, KVDrive tackles the problem from a systems perspective - jointly orchestrating cache placement, pipeline scheduling, and cross-tier coordination to sustain high-throughput inference under tight GPU budgets. KVDrive advances three fundamental capabilities: it adapts cache management to attention behavior to maximize reuse and minimize redundant data movement; it restructures the decoding pipeline to overlap I/O- and CPU/GPU compute-bound stages, eliminating stalls across heterogeneous resources; and it harmonizes data movement across memory tiers to unlock scalable long-context inference far beyond GPU and DRAM limits. We have implemented a fully functional prototype of KVDrive and evaluated it on long-context benchmarks with popular LLMs. The system achieves up to 1.74x higher throughput compared to state-of-the-art works while preserving accuracy.

Chinese Translation

支持长上下文 LLM 的推理面临着关键值（KV）缓存的巨大内存需求的挑战。现有的卸载系统将完整的缓存存储在主机内存中，并在解码过程中选择性地获取关键条目，但这一策略很快就达到了瓶颈：在不降低准确性的情况下，稀疏性无法进一步提升。因此，当上下文长度和批量大小增加时，KV 传输的量急剧上升，成为解码延迟的主要来源。我们提出了 KVDrive，一个跨越 GPU 内存、主机 DRAM 和 SSD 的整体多层次 KV 缓存管理系统。与以往通过算法优化追求更高稀疏性的工作不同，KVDrive 从系统的角度解决问题——共同协调缓存放置、管道调度和跨层次协作，以在紧张的 GPU 预算下维持高吞吐量的推理。KVDrive 提升了三项基本能力：它根据注意力行为调整缓存管理，以最大化重用并最小化冗余数据移动；它重构了解码管道，以重叠 I/O 和 CPU/GPU 计算密集型阶段，消除异构资源间的停滞；并且它协调了跨内存层次的数据移动，以解锁远超 GPU 和 DRAM 限制的可扩展长上下文推理。我们已实现了 KVDrive 的一个完全功能原型，并在流行 LLM 的长上下文基准测试中进行了评估。该系统在保持准确性的同时，吞吐量比最先进的工作提高了最高 1.74 倍。

cs.CL / 81 / 2605.18083

A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\Delta$ Integration into Upcycled MoE

通向多语言大型语言模型的数据高效路径：通过后训练PARAM$\Delta$集成到升级的专家混合模型中实现语言扩展

Zhou, Hao, Li, Tianhao, Wang, Zhijun, She, Shuaijie, Wu, Linjuan, Wei, Hao-Ran, Yang, Baosong, Chen, Jiajun, Huang, Shujian

Abstract

Expanding Large Language Models (LLMs) to new languages is a costly endeavor, demanding extensive Continued Pre-Training (CPT) and data-intensive alignment. While recent data-free merging techniques attempt to bypass alignment by fusing a multilingual CPT-enhanced model with its instruct counterpart, they are plagued by a critical trade-off: mitigating parameter conflicts to preserve original abilities inevitably dilutes new language acquisition, and vice versa. To resolve this conflict, we introduce \method, which upcycles a dense model into a Mixture-of-Experts (MoE) architecture, allocating different experts to different languages. Alignment ability is then transferred by grafting a MoE-expanded parameter delta ($\Delta_{\text{post}}$) to the CPT-enhanced base model, bypassing the complex alignment phase. Experiments demonstrate \method's superiority even against baselines with similar FLOPs or number of parameters; it improves performance on expanded languages while effectively preserving original capabilities. We further show our approach is highly applicable across different models and post-training deltas.

Chinese Translation

将大型语言模型（LLMs）扩展到新语言是一项成本高昂的工作，需进行广泛的持续预训练（CPT）和数据密集型的对齐。尽管最近的数据无关合并技术试图通过将多语言CPT增强模型与其指令对应模型融合来绕过对齐，但它们面临一个关键的权衡：为了减轻参数冲突以保留原有能力，不可避免地会稀释新语言的获取，反之亦然。为了解决这一冲突，我们提出了\method，该方法将一个稠密模型升级为专家混合模型（MoE）架构，为不同语言分配不同的专家。然后，通过将MoE扩展的参数增量（$\Delta_{\text{post}}$）嫁接到CPT增强的基础模型上，转移对齐能力，从而绕过复杂的对齐阶段。实验表明，\method在与具有相似FLOPs或参数数量的基线相比时表现出优越性；它在扩展语言上的性能得到了提升，同时有效保留了原有能力。我们进一步展示了我们的方法在不同模型和后训练增量中具有高度适用性。

cs.CL / 82 / 2605.18105

How Loud Rumbles Hit Newsstands: A Data Analysis of Coverage and Spatial Bias in German News about Landslides Around the World

轰鸣声如何冲击新闻摊位：关于全球滑坡事件的德国新闻报道的覆盖与空间偏差的数据分析

Madureira, Brielen, Niekler, Andreas, Keuschnigg, Marc, de Brito, Mariana Madruga

Abstract

Landslides often hit newsstands due to their destructive and potentially fatal effects. News is a valuable source of information for creating or enriching disaster databases and for expediting media-based studies of the dynamics of media attention. To accomplish that, news datasets must be filtered, geolocated and validated. This paper focuses on how landslides around the world are reported in German newspapers. We analyse almost 60k news articles about 5.5k news events in a 25-year period, compare them with external measures of countries' susceptibility to landslides and provide insights, e.g., the overreporting of Southern and Western Europe, to foment further studies on inequalities in media attention to international disasters.

Chinese Translation

滑坡事件因其破坏性和潜在致命性而常常成为新闻报道的焦点。新闻是创建或丰富灾害数据库以及加速基于媒体的媒体关注动态研究的重要信息来源。为了实现这一目标，新闻数据集必须经过筛选、地理定位和验证。本文聚焦于全球滑坡事件在德国报纸中的报道情况。我们分析了25年期间近6万篇关于5500个新闻事件的新闻文章，并将其与国家滑坡易发性外部指标进行比较，提供了诸如南欧和西欧报道过度的见解，以促进对国际灾害媒体关注不平等的进一步研究。

cs.CL / 83 / 2605.18111

How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking

大型语言模型在回答孟加拉医学视觉问题方面的表现如何？数据集与基准测试

Ahmed, Rafid, Tahmid, Intesar, Hossain, Mir Sazzat, Tomal, Tasnimul Hossain, Fahim, Md, Bhuiyan, Md Farhad Alam

Abstract

Recent advancements in Large Language Models (LLMs) and Large Vision Language Models (LVLMs) have enabled general-purpose systems to demonstrate promising capabilities in complex reasoning tasks, including those in the medical domain. Medical Visual Question Answering (MedVQA) has particularly benefited from these developments. However, despite Bangla being one of the most widely spoken languages globally, there exists no established MedVQA benchmark for it. To address this gap, we introduce BanglaMedVQA, a dataset comprising clinically validated image-question-answer pairs, along with a comprehensive evaluation of current foundation models on this resource. Consistent with prior findings that report low performance of current models on English MedVQA benchmarks, our analysis reveals that Bangla performance is substantially lower, reflecting the challenges inherent to low-resource languages. Even top-performing models such as Gemini and GPT-4.1 mini fail to accurately answer specialized diagnostic questions, indicating severe limitations in fine-grained medical reasoning. Although certain open-source models, such as Gemma-3, occasionally outperform these models in general categories, they too struggle with clinically complex questions, underscoring the urgent need for top-notch evaluation methods.

Chinese Translation

近年来，大型语言模型（LLMs）和大型视觉语言模型（LVLMs）的进展使得通用系统在复杂推理任务中展现出良好的能力，包括医学领域的任务。医学视觉问答（MedVQA）特别受益于这些发展。然而，尽管孟加拉语是全球使用最广泛的语言之一，但尚未建立针对该语言的MedVQA基准。为了解决这一空白，我们引入了BanglaMedVQA，一个包含临床验证的图像-问题-答案对的数据集，并对当前基础模型在该资源上的表现进行了全面评估。与先前发现的当前模型在英语MedVQA基准上表现不佳的结果一致，我们的分析显示，孟加拉语的表现显著更低，反映出低资源语言固有的挑战。即使是表现最佳的模型，如Gemini和GPT-4.1 mini，也未能准确回答专业的诊断问题，显示出在细粒度医学推理方面的严重局限性。尽管某些开源模型，如Gemma-3，在一般类别中偶尔超越这些模型，但它们在临床复杂问题上也面临困难，突显出对高质量评估方法的迫切需求。

cs.CL / 84 / 2605.18113

iPOE: Interpretable Prompt Optimization via Explanations

iPOE：通过解释实现可解释的提示优化

Li, Jiahui, Papay, Sean, Klinger, Roman

Abstract

Prompt optimization has often been framed as a discrete search problem to find high-performing and robust instructions for an LLM. However, the search result might not make it transparent why and where specific prompt changes lead to performance gains. This is in contrast to how humans are instructed for annotation tasks. Here, researchers carefully design annotation guidelines, leading to enhanced annotation consistency. Our paper aims at joining these two approaches and introduces iPOE, a novel interpretable prompt optimization strategy via explanations. We guide the prompt optimization process by automatically created guidelines from explanations of annotation decisions (either automatically generated or from humans). This set of guidelines is furthermore optimized by a series of operations, including removing, adding, shuffling, and merging. The resulting prompt includes guidelines that instruct the annotation, making the decision process of the LLM and the optimization transparent. It therefore also supports laypeople in the area of prompt optimization, particularly in challenging domains requiring expertise. In our experiments on four datasets, we find that iPOE can improve over prompts without guidelines and with randomly selected guidelines by up to 31% and 35%, respectively. Moreover, LLM explanations can replace human explanations in the proposed method.
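The four guideline operations named in the abstract (removing, adding, shuffling, merging) can be sketched as plain list edits with a greedy selection step. This is only an illustration: the function names, the toy scoring rule, and the greedy choice are assumptions, and the abstract does not say how iPOE scores or selects candidate guideline sets.

```python
import random
from typing import Callable, List

Guidelines = List[str]


def remove(g: Guidelines, i: int) -> Guidelines:
    """Drop the guideline at index i."""
    return g[:i] + g[i + 1:]


def add(g: Guidelines, new: str) -> Guidelines:
    """Append a new guideline."""
    return g + [new]


def shuffle(g: Guidelines, rng: random.Random) -> Guidelines:
    """Return a reordered copy; the input list is left untouched."""
    out = list(g)
    rng.shuffle(out)
    return out


def merge(g: Guidelines, i: int, j: int) -> Guidelines:
    """Replace guidelines i and j with a single concatenated guideline."""
    merged = f"{g[i]} {g[j]}"
    rest = [text for k, text in enumerate(g) if k not in (i, j)]
    return rest + [merged]


def improve(g: Guidelines, candidates: List[Guidelines],
            score: Callable[[Guidelines], float]) -> Guidelines:
    """Keep the best-scoring variant; ties keep the current set g."""
    return max([g] + candidates, key=score)


# Hypothetical scoring rule: reward distinct guidelines, penalize length.
def toy_score(gl: Guidelines) -> float:
    return len(set(gl)) - 0.1 * len(gl)


current = ["x", "y", "y"]
best = improve(current, [remove(current, 2)], toy_score)
shuffled = shuffle(["a", "b", "c"], random.Random(0))
```

In a real setting the score would presumably be task performance of the prompt on held-out annotated examples rather than the toy rule used here.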

Chinese Translation

提示优化通常被视为一个离散搜索问题，以寻找高性能和稳健的指令用于大型语言模型（LLM）。然而，搜索结果可能无法清晰地说明特定提示更改为何以及在何处导致性能提升。这与人类在注释任务中的指导方式形成对比。在此，研究人员精心设计注释指南，从而提高注释的一致性。我们的论文旨在将这两种方法结合起来，提出iPOE，一种通过解释实现的新型可解释提示优化策略。我们通过从注释决策的解释（无论是自动生成的还是来自人类的）自动创建的指南来引导提示优化过程。这组指南进一步通过一系列操作进行优化，包括删除、添加、打乱和合并。最终生成的提示包含指导注释的指南，使得LLM的决策过程和优化过程变得透明。因此，它也支持非专业人士在提示优化领域，特别是在需要专业知识的挑战性领域。在我们对四个数据集的实验中，我们发现iPOE相比于没有指南的提示和随机选择的指南，分别提高了高达31%和35%的性能。此外，LLM的解释可以在该方法中替代人类的解释。

cs.CL / 85 / 2605.18155

FOL2NS: Generating Natural Sentences from First-Order Logic

FOL2NS：从一阶逻辑生成自然句子

Jia, Mei

Abstract

Translating formal language into natural language is a foundational challenge in NLP, driving various downstream applications in semantic parsing, theorem validation, and question answering. In this study, we introduce First-Order Logic to Natural Sentence (FOL2NS), a neurosymbolic framework designed to generate synthetic FOL formulas and convert them into natural human expressions. It handles deeply nested structures with varying quantifier depths (QD), which are rarely captured by existing corpora. By combining rule-driven modules with fine-tuned language models, FOL2NS enhances the diversity and coverage of the generated samples. In our experiments, we systematically evaluate the framework's capabilities through both character-level analysis and overall performance metrics. Experimental results show that FOL2NS can reliably produce well-formed templates and fluent statements, but it faces challenges in achieving precise semantic representations and natural generation as structural complexity increases.
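The abstract refers to quantifier depth (QD) without defining it. A minimal sketch of the textbook notion follows, assuming QD is the maximum number of nested quantifiers along any branch of the formula; the tuple encoding of formulas is a hypothetical representation chosen for this example and is not FOL2NS's format.

```python
from typing import Tuple, Union

# Hypothetical encoding:
#   ("pred", name, *args)                          atomic formula
#   ("not", f), ("and", f, g), ("or", f, g), ("implies", f, g)
#   ("forall", var, f), ("exists", var, f)
Formula = Tuple[Union[str, tuple], ...]


def quantifier_depth(formula: Formula) -> int:
    """Maximum nesting depth of quantifiers in the formula."""
    op = formula[0]
    if op == "pred":
        return 0
    if op in ("forall", "exists"):
        # formula[1] is the bound variable, formula[2] is the body.
        return 1 + quantifier_depth(formula[2])
    # Connectives: take the deepest sub-formula.
    return max(quantifier_depth(sub) for sub in formula[1:])


# forall x. (P(x) -> exists y. (R(x, y) and forall z. Q(z)))
example = (
    "forall", "x",
    ("implies",
     ("pred", "P", "x"),
     ("exists", "y",
      ("and",
       ("pred", "R", "x", "y"),
       ("forall", "z", ("pred", "Q", "z"))))),
)
depth = quantifier_depth(example)
```

Note that depth counts nesting, not the total number of quantifiers: a conjunction of two separately quantified formulas has depth 1.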

Chinese Translation

将形式语言翻译为自然语言是自然语言处理（NLP）中的一个基础性挑战，推动了语义解析、定理验证和问答等多种下游应用的发展。在本研究中，我们介绍了一阶逻辑到自然句子（FOL2NS），这是一个旨在生成合成一阶逻辑（FOL）公式并将其转换为自然人类表达的神经符号框架。它能够处理具有不同量词深度（QD）的深度嵌套结构，这在现有语料库中很少被捕捉。通过将基于规则的模块与微调的语言模型相结合，FOL2NS 增强了生成样本的多样性和覆盖率。在我们的实验中，我们通过字符级分析和整体性能指标系统地评估了该框架的能力。实验结果表明，FOL2NS 可以可靠地生成格式良好的模板和流畅的陈述，但在结构复杂性增加时，它在实现精确的语义表示和自然生成方面面临挑战。

cs.CL / 86 / 2605.18211

Leveraging Graph Structure in Seq2Seq Models for Knowledge Graph Link Prediction

利用图结构提升序列到序列模型在知识图谱链接预测中的表现

Phuc, Luu Huu, Thapa, Ratan Bahadur, Nayyeri, Mojtaba, Wu, Jingcheng, Kharlamov, Evgeny, Staab, Steffen

Abstract

We introduce Graph-Augmented Sequence-to-Sequence (GA-S2S), a novel framework that integrates a T5-small encoder-decoder with a Relational Graph Attention Network (RGAT) to improve link prediction in knowledge graphs. While existing Seq2Seq models rely solely on surface-level textual descriptions of entities and relations and, at best, flatten the neighborhoods of a query entity into a single linear sequence, thereby discarding the inherent graph structure, GA-S2S jointly encodes both textual features and the full $k$-hop subgraph topology surrounding the query entity. By integrating raw encoder outputs with RGAT's relation-aware embeddings, our model captures and leverages richer multi-hop relational patterns and textual information. Our preliminary experiments on the CoDEx dataset demonstrate that GA-S2S outperforms competitive Seq2Seq-based baseline models, achieving up to a 19% relative gain in link prediction accuracy.
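The $k$-hop subgraph around a query entity can be collected with a breadth-first search over the triples. The sketch below assumes edges are traversed in both directions and that the subgraph keeps every triple whose endpoints both lie within $k$ hops; the abstract does not state either choice, and the toy triples are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def k_hop_subgraph(triples: List[Triple], query: str, k: int) -> List[Triple]:
    """Return triples whose head and tail are both within k hops of query."""
    # Undirected adjacency list over entities.
    neighbors: Dict[str, Set[str]] = {}
    for head, _, tail in triples:
        neighbors.setdefault(head, set()).add(tail)
        neighbors.setdefault(tail, set()).add(head)

    # Breadth-first search that stops expanding at distance k.
    distance = {query: 0}
    queue = deque([query])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue
        for nxt in neighbors.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    return [t for t in triples if t[0] in distance and t[2] in distance]


# Hypothetical toy knowledge graph.
triples = [
    ("berlin", "capital_of", "germany"),
    ("germany", "member_of", "eu"),
    ("eu", "headquartered_in", "brussels"),
    ("paris", "capital_of", "france"),
]
subgraph = k_hop_subgraph(triples, "berlin", k=2)
```

A graph encoder such as RGAT would then run over this subgraph, whereas a flattened-text baseline would serialize the same triples into one sequence.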

Chinese Translation

我们提出了一种图增强序列到序列（Graph-Augmented Sequence-to-Sequence，GA-S2S）新框架，该框架将T5-small编码器-解码器与关系图注意网络（Relational Graph Attention Network，RGAT）相结合，以提高知识图谱中的链接预测性能。现有的序列到序列模型仅依赖于实体和关系的表面文本描述，最多将查询实体的邻域展平为单一线性序列，从而丢弃了固有的图结构。而GA-S2S则共同编码文本特征和围绕查询实体的完整$k$-跳子图拓扑。通过将原始编码器输出与RGAT的关系感知嵌入相结合，我们的模型捕捉并利用了更丰富的多跳关系模式和文本信息。我们在CoDEx数据集上的初步实验表明，GA-S2S在链接预测准确性上优于竞争性的基于序列到序列的基线模型，达到了高达19%的相对提升。

cs.CL / 87 / 2605.18226

Context Memorization for Efficient Long Context Generation

高效长上下文生成的上下文记忆

Okoshi, Yasuyuki, Chen, Hao Mark, Lu, Guanxi, Fan, Hongxiang, Motomura, Masato, Fujiki, Daichi

Abstract

Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitations: i) the prefix's influence fades as generation proceeds, and ii) attention computation over the prefix scales linearly with its length. Existing approaches either keep the prefix in attention while compressing it, or internalize it into model parameters through gradient-based training. The former still attends to the prefix at inference, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our method improves accuracy over in-context learning at 1K-8K memory budgets while reducing attention latency by 1.36x at 8K, and surpasses full-attention RAG performance on the NBA benchmark using only 20% of its memory footprint.

Chinese Translation

现代大型语言模型（LLM）应用越来越依赖于长的条件前缀，以在推理时控制模型行为。尽管增强前缀的推理是有效的，但它存在两个结构性限制：i）随着生成的进行，前缀的影响逐渐减弱；ii）对前缀的注意力计算随着其长度线性增长。现有的方法要么在压缩前缀的同时保持对其的注意力，要么通过基于梯度的训练将其内化为模型参数。前者在推理时仍然关注前缀，而后者则训练密集且不适合前缀更新。为了解决这些问题，我们提出了注意力状态记忆（attention-state memory），这是一种无训练的方法，将前缀外部化为一种轻量级的、基于查找的记忆，存储前缀与查询标记之间的预计算注意力状态。在使用 LLaMA-3.1-8B 的 ManyICLBench 上，我们的方法在 1K-8K 内存预算下提高了上下文学习的准确性，同时在 8K 时将注意力延迟减少了 1.36 倍，并在 NBA 基准测试中仅使用其 20% 的内存占用超越了全注意力 RAG 的性能。

cs.CL / 88 / 2605.18232

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

SomaliWeb v1：一个经过质量过滤的索马里网络语料库，配备匹配的分词器和公共语言识别基准

Dahir, Khalid Yusuf

Abstract

Somali is a Cushitic language of the Horn of Africa with ~25 million speakers, yet no documented dedicated Somali pretraining corpus with a companion tokenizer and language-identification benchmark has been publicly released. Existing Somali text appears either inside multilingual distributions (HPLT v2, CC100, MADLAD-400, OSCAR, mC4) or in small, undocumented Somali-only uploads on Hugging Face. We introduce SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents (~303M tokens) built from three upstream sources (HPLT v2, CC100, Somali Wikipedia) through a six-stage reproducible pipeline. We release (i) the corpus, (ii) a matched BPE-16K tokenizer, and (iii) the first public side-by-side Somali benchmark of three production language identifiers. Our measurements reveal concrete quality defects in existing distributions: HPLT v2's "cleaned" Somali release retains 17.3% byte-exact duplicates, 56.1% of its documents contain fixable mojibake, and 10.7% of its byte-unique documents are near-duplicates at Jaccard tau=0.80. Our BPE-16K tokenizer emits 40.2% fewer tokens than GPT-4's cl100k_base on FLORES-200 Somali devtest as a tokenizer-level measurement; downstream language-model perplexity comparisons are deferred to a follow-up release.
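The two duplicate measurements in the abstract (byte-exact duplicates and near-duplicates at Jaccard tau=0.80) can be illustrated with a small sketch. It assumes SHA-256 hashing for exact matches and word 3-gram shingles for the Jaccard sets; the abstract does not state the shingle type or hash used, and the pairwise comparison is only practical for toy inputs.

```python
import hashlib
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def exact_dedup(docs: List[str]) -> List[str]:
    """Keep the first copy of each byte-identical document."""
    seen: Set[str] = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text: str, n: int = 3) -> Set[Shingle]:
    """Set of word n-grams; short texts yield a single shingle."""
    words = text.split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Intersection over union of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_dedup(docs: List[str], tau: float = 0.80, n: int = 3) -> List[str]:
    """Drop a document if it reaches similarity tau with any kept document."""
    kept: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    for doc in docs:
        current = shingles(doc, n)
        if all(jaccard(current, other) < tau for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept


# Placeholder tokens stand in for real text.
doc_a = " ".join(f"w{i}" for i in range(12))  # 10 shingles
doc_b = doc_a + " w12"                        # 11 shingles, 10 shared
doc_c = "x y z q"
unique = exact_dedup([doc_a, doc_a, doc_b, doc_c])
filtered = near_dedup(unique)
```

At corpus scale the all-pairs comparison would normally be replaced by an approximate scheme such as MinHash with locality-sensitive hashing; whether SomaliWeb does so is not stated in the abstract.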

Chinese Translation

索马里语是非洲之角的一种库希特语，约有2500万说话者，但尚未公开发布任何专门的索马里预训练语料库，配有相应的分词器和语言识别基准。现有的索马里文本出现在多语言分发中（如 HPLT v2、CC100、MADLAD-400、OSCAR、mC4），或在 Hugging Face 上的小型、未记录的索马里专用上传中。我们推出了 SomaliWeb v1，这是一个经过质量过滤的索马里语料库，包含819,322个文档（约303M个标记），该语料库通过六个阶段的可重复流程从三个上游来源（HPLT v2、CC100、索马里维基百科）构建而成。我们发布了（i）该语料库，（ii）一个匹配的 BPE-16K 分词器，以及（iii）第一个公共的并排索马里基准，包含三个生产语言识别器。我们的测量结果揭示了现有分发中的具体质量缺陷：HPLT v2 的“清理”索马里版本保留了17.3%的字节精确重复，56.1%的文档包含可修复的乱码，10.7%的字节唯一文档在 Jaccard tau=0.80 的情况下为近重复。我们的 BPE-16K 分词器在 FLORES-200 索马里开发测试中发出的标记比 GPT-4 的 cl100k_base 少40.2%，作为分词器级别的测量；下游语言模型的困惑度比较将推迟到后续发布中。

cs.CL / 89 / 2605.18239

Multilingual jailbreaking of LLMs using low-resource languages

利用低资源语言对大型语言模型进行多语言越狱

Marx, Dylan, Dunaiski, Marcel

Abstract

Large Language Models (LLMs) remain vulnerable to jailbreak attempts that circumvent safety guardrails. We investigate whether multi-turn conversations using low-resource African languages (Afrikaans, Kiswahili, isiXhosa, and isiZulu) can bypass safety mechanisms across commercial LLMs. We translated prompts from existing datasets and evaluated ChatGPT, Claude, DeepSeek, Gemini, and Grok through automated testing and human red-teaming with native speakers. Single-turn translation attacks proved ineffective, while multi-turn conversations achieved harmful response rates ranging from 52.7% (Claude 3.5 Haiku) to 83.6% (GPT-4o-mini) in English, from 60.0% (Claude 3.5 Haiku) to 78.2% (GPT-4o-mini) in Afrikaans, and from 41.8% (Claude 3.5 Haiku) to 70.9% (DeepSeek) in Kiswahili. Human red-teaming increased jailbreak rates compared to automated methods. Over all evaluated languages, the average jailbreak rate increased from 59.8% to 75.8%, with improvements of +20.0% (Afrikaans), +12.7% (isiZulu), +12.3% (isiXhosa), and +1% (Kiswahili), demonstrating that poor translation quality limits jailbreak success. These findings suggest that vulnerabilities in LLMs persist in multilingual contexts and that translation quality is the critical factor determining jailbreak success in low-resource languages.

Chinese Translation

大型语言模型（LLMs）仍然容易受到绕过安全防护措施的越狱尝试的攻击。我们研究了使用低资源非洲语言（南非荷兰语、斯瓦希里语、伊西科萨语和伊西祖卢语）进行多轮对话是否能够绕过商业LLMs的安全机制。我们对现有数据集中的提示进行了翻译，并通过自动化测试和与母语者的人工红队测试评估了ChatGPT、Claude、DeepSeek、Gemini和Grok。单轮翻译攻击被证明无效，而多轮对话在英语中的有害响应率从52.7%（Claude 3.5 Haiku）到83.6%（GPT-4o-mini），南非荷兰语从60.0%（Claude 3.5 Haiku）到78.2%（GPT-4o-mini），斯瓦希里语从41.8%（Claude 3.5 Haiku）到70.9%（DeepSeek）。与自动化方法相比，人工红队测试提高了越狱成功率。在所有评估的语言中，平均越狱率从59.8%提高到75.8%，其中南非荷兰语提高了20.0%、伊西祖卢语提高了12.7%、伊西科萨语提高了12.3%、斯瓦希里语提高了1%，这表明翻译质量的差异限制了越狱的成功。这些发现表明，LLMs在多语言环境中仍然存在脆弱性，而翻译质量是决定低资源语言中越狱成功的关键因素。

cs.CL / 90 / 2605.18253

Machine Unlearning for Masked Diffusion Language Models

针对掩蔽扩散语言模型的机器遗忘

Lee, Georu, Jeong, Seungwon, Kim, Hoki, Park, Jinseong, Lee, Woojin

Abstract

Recent masked diffusion language models (MDLMs), such as LLaDA and Dream, have achieved performance comparable to autoregressive large language models. Unlike autoregressive models, which generate text sequentially, MDLMs generate text by iteratively denoising masked positions in parallel. During fine-tuning, MDLMs learn to recover responses from masked response states conditioned on a prompt, thereby shifting their predictions from a prompt-masked unconditional distribution toward a prompt-conditional distribution. Despite this distinct generative and fine-tuning mechanism, machine unlearning for MDLMs remains largely unexplored. In this paper, we propose Masked Diffusion Unlearning (MDU), the first unlearning framework for MDLMs, by revisiting the process of learning specific knowledge in terms of diffusion. Specifically, MDU minimizes a forward KL divergence from the prompt-conditional prediction to a prompt-masked unconditional anchor at every masked response position, with a temperature scaling parameter to control the privacy-utility trade-off. Our empirical results on standard benchmarks and MDLM backbones show that MDU achieves high unlearning performance compared to existing LLM unlearning methods. Code is available at https://github.com/leegeoru/MDU.
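The objective described in the abstract can be sketched numerically. The code below is an illustration under strong assumptions: it reads "forward KL" as KL(anchor || conditional), applies the temperature to the anchor distribution only, and averages over masked response positions. None of these choices is confirmed by the abstract, and the toy logits are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two categorical distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def unlearning_loss(cond_logits: List[List[float]],
                    anchor_logits: List[List[float]],
                    masked: List[bool],
                    temperature: float = 1.0) -> float:
    """Mean KL(anchor || conditional) over masked response positions.

    cond_logits:   per-position logits given the prompt.
    anchor_logits: per-position logits with the prompt masked out.
    """
    total, count = 0.0, 0
    for cond, anchor, is_masked in zip(cond_logits, anchor_logits, masked):
        if not is_masked:
            continue  # only masked response positions contribute
        total += kl_divergence(softmax(anchor, temperature), softmax(cond))
        count += 1
    return total / count if count else 0.0


# Hypothetical logits over a 3-token vocabulary at two positions.
cond = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
anchor = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
loss_all = unlearning_loss(cond, anchor, masked=[True, True])
loss_matching_only = unlearning_loss(cond, anchor, masked=[False, True])
```

In this reading, the loss reaches zero once the prompt-conditional prediction matches the prompt-masked anchor, meaning the prompt no longer shifts the prediction at those positions.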

Chinese Translation

近期的掩蔽扩散语言模型（Masked Diffusion Language Models，MDLMs），如LLaDA和Dream，已经达到了与自回归大型语言模型相当的性能。与自回归模型逐步生成文本不同，MDLMs通过并行地迭代去噪掩蔽位置来生成文本。在微调过程中，MDLMs学习从基于提示的掩蔽响应状态中恢复响应，从而将其预测从基于提示的掩蔽无条件分布转向基于提示的条件分布。尽管这种生成和微调机制具有独特性，但针对MDLMs的机器遗忘仍然在很大程度上未被探索。本文提出了掩蔽扩散遗忘（Masked Diffusion Unlearning，MDU），这是针对MDLMs的第一个遗忘框架，通过重新审视扩散过程中的特定知识学习。具体而言，MDU在每个掩蔽响应位置最小化从基于提示的条件预测到基于提示的掩蔽无条件锚点的前向KL散度，并引入温度缩放参数以控制隐私与效用之间的权衡。我们在标准基准和MDLM基础模型上的实证结果表明，MDU在遗忘性能上优于现有的LLM遗忘方法。代码可在 https://github.com/leegeoru/MDU 获取。

cs.CL / 91 / 2605.18261

Knowledge-to-Verification: Exploring RLVR for LLMs in Knowledge-Intensive Domains

知识到验证：探索知识密集领域中大语言模型的可验证奖励强化学习（RLVR）

Yuan, Zhonghang, Wang, Zhefan, Hu, Fang, Chen, Zihong, Li, Jinzhe, Li, Gang, Ying, Jie, Kong, Huanjun, Zhang, Songyang, Dong, Nanqing

Abstract

Reinforcement learning with verifiable rewards (RLVR) has demonstrated promising potential to enhance the reasoning capabilities of large language models (LLMs) in domains such as mathematics and coding. However, its application to knowledge-intensive domains has not been effectively explored due to the scarcity of high-quality verifiable data. Furthermore, current RLVR focuses solely on the correctness of final answers, leading to the limitations of flawed reasoning and sparse reward signals. In this work, we propose Knowledge-to-Verification (K2V), a framework that extends RLVR to knowledge-intensive domains through automated verifiable data synthesis, while enabling verification of the LLM's reasoning process. Extensive experiments demonstrate that K2V enhances the reasoning of LLMs in knowledge-intensive domains without significantly compromising the model's general capabilities. This study also suggests that integrating automated data synthesis with reasoning verification is a promising direction to enhance model capabilities in these broader domains. Code is available at https://github.com/SeedScientist/K2V.

Chinese Translation

可验证奖励的强化学习（RLVR）在数学和编程等领域展示了增强大语言模型（LLMs）推理能力的良好潜力。然而，由于高质量可验证数据的稀缺，其在知识密集领域的应用尚未得到有效探索。此外，目前的RLVR仅关注最终答案的正确性，导致推理缺陷和稀疏奖励信号的局限性。在本研究中，我们提出了知识到验证（K2V）框架，通过自动化可验证数据合成将RLVR扩展到知识密集领域，同时实现对LLM推理过程的验证。大量实验表明，K2V在知识密集领域增强了LLM的推理能力，而没有显著损害模型的整体能力。本研究还表明，将自动化数据合成与推理验证相结合是提升模型在这些更广泛领域能力的一个有前景的方向。代码可在 https://github.com/SeedScientist/K2V 获取。

cs.CL / 92 / 2605.18271

From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG

从量到值：面向设备的偏好对齐记忆构建

Lee, Changmin, Kim, Jaemin, Gong, Taesik

Abstract

With the rapid emergence of personal AI agents based on Large Language Models (LLMs), implementing them on-device has become essential for privacy and responsiveness. To handle the inherently personal and context-dependent nature of real-world requests, such agents must ground their generation in device-resident personal context. However, under tight memory budgets, the core bottleneck is what to store so that retrieval remains aligned with the user. We propose EPIC (Efficient Preference-aligned Index Construction), which focuses on user preferences as a compact and stable form of personal context and integrates them throughout the RAG pipeline. EPIC selectively retains preference-relevant information from raw data and aligns retrieval toward preference-aligned contexts. Across four benchmarks covering conversations, debates, explanations, and recommendations, EPIC reduces indexing memory by 2,404 times, improves preference-following accuracy by 20.17 percentage points, and achieves 33.33 times lower retrieval latency over the best-performing baseline. In our on-device experiment, EPIC maintains a memory footprint under 1 MB with 29.35 ms/query latency in streaming updates.

Chinese Translation

随着基于大型语言模型（LLMs）的个人人工智能代理的快速出现，在设备上实现这些代理已成为隐私和响应性的必要条件。为了处理现实世界请求的固有个人性和上下文依赖性，这些代理必须将其生成基于设备本地的个人上下文。然而，在严格的内存预算下，核心瓶颈在于存储什么，以确保检索与用户保持一致。我们提出了EPIC（高效偏好对齐索引构建），它将用户偏好作为一种紧凑且稳定的个人上下文形式，并在整个RAG（检索增强生成）流程中进行整合。EPIC从原始数据中选择性地保留与偏好相关的信息，并使检索与偏好对齐的上下文保持一致。在涵盖对话、辩论、解释和推荐的四个基准测试中，EPIC将索引内存减少了2404倍，提高了偏好跟随准确性20.17个百分点，并在最佳基线之上实现了33.33倍更低的检索延迟。在我们的设备实验中，EPIC保持了低于1 MB的内存占用，并在流式更新中实现了29.35毫秒/查询的延迟。

cs.CL / 93 / 2605.18337

Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles

Infini-News：高效可查询的13亿条处理过的Common Crawl新闻文章访问

Lazzaroni, Ruggero Marino, Lasser, Jana, Solovev, Kirill

Abstract

Large-scale news corpora support a wide range of research in Computational Social Science and NLP, yet access remains constrained: commercial archives impose prohibitive costs and licensing restrictions, while open alternatives like Common Crawl's CC-News require terabyte-scale storage and computationally intensive processing. We present Infini-News, a retrieval toolkit and index for the entire CC-News archive from August 2016 to the latest available snapshot. Our contributions are threefold. First, we extract, clean the text, and parse the structured metadata of over 1.35B articles. Second, we enrich the corpus with language detection using three frontier language classifiers (GlotLID, lingua, and CommonLingua), and with multi-source geographic attribution that resolves a country of origin for 83.4% of articles across 222 countries. Third, we construct Infini-gram indexes: suffix-array structures that let researchers search the full archive for arbitrary text patterns in sub-second time. Together, these resources lower the barrier to longitudinal, cross-national media research.
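The suffix-array lookup behind the third contribution can be illustrated at toy scale. The sketch below works on characters of a single in-memory string and builds the array by naive sorting; it assumes nothing about Infini-News's actual on-disk layout or token-level indexing, and the sample text is hypothetical.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Start offsets of all suffixes, sorted lexicographically (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text: str, suffix_array: List[int], pattern: str) -> int:
    """Count pattern occurrences with two binary searches over the suffix array."""
    width = len(pattern)

    def bound(include_equal: bool) -> int:
        # First index whose truncated suffix is >= pattern (include_equal=False)
        # or > pattern (include_equal=True).
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            prefix = text[start:start + width]
            if prefix < pattern or (include_equal and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return bound(True) - bound(False)


# Hypothetical miniature "archive".
archive = "flood hits town; landslide hits road; flood recedes"
index = build_suffix_array(archive)
flood_hits = count_occurrences(archive, index, "flood")
quake_hits = count_occurrences(archive, index, "quake")
```

Each query costs a logarithmic number of comparisons in the corpus size once the array exists, which is what makes fast lookups over a very large archive plausible; the naive construction used here would not scale and is for illustration only.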

Chinese Translation

大规模新闻语料库支持计算社会科学和自然语言处理领域的广泛研究，但访问仍然受到限制：商业档案库施加了高昂的费用和许可限制，而像Common Crawl的CC-News这样的开放替代方案则需要TB级别的存储和计算密集型处理。我们提出了Infini-News，一个用于2016年8月至最新可用快照的整个CC-News档案的检索工具包和索引。我们的贡献主要有三方面。首先，我们提取、清理文本，并解析超过13.5亿篇文章的结构化元数据。其次，我们通过使用三种前沿语言分类器（GlotLID、lingua和CommonLingua）进行语言检测，并通过多源地理归属为222个国家中83.4%的文章解决原产国问题，从而丰富了语料库。第三，我们构建了Infini-gram索引：后缀数组结构，使研究人员能够在亚秒时间内搜索整个档案中的任意文本模式。这些资源共同降低了对纵向跨国媒体研究的门槛。

cs.CL / 94 / 2605.18352

Presupposition and Reasoning in Conditionals: A Theory-Based Study of Humans and LLMs

条件句中的预设与推理：基于理论的人类与大型语言模型的研究

Azin, Tara, Yu, Yongan, Singh, Raj, Jouravlev, Olessia

Abstract

Presupposition projection in conditionals is central to theories of meaning and pragmatics, yet it remains largely unevaluated in large language models. We address this gap through a parallel behavioral study comparing human judgments and LLM predictions on a normed dataset of conditional sentences that controls the relation between the antecedent and the projected presupposition. We collect likelihood ratings from 120 participants and four LLMs under matched contextual conditions. Results show that humans integrate probabilistic and pragmatic cues in their judgment, whereas LLMs show variable alignment with human patterns. Using a linguistically motivated checklist within an LLM-as-a-Judge framework, we further evaluate model reasoning. We observe that models that best match human ratings often lack coherent pragmatic reasoning, while models with stronger reasoning produce less human-like judgments. These findings suggest that LLMs' performance on such tasks may result from surface pattern matching rather than pragmatic competence. Our findings highlight the importance of benchmarks grounded in linguistic theory for comparing humans and models.

Chinese Translation

条件句中的预设投射是意义与语用理论的核心内容，但在大型语言模型中仍然缺乏评估。我们通过一项平行行为研究来填补这一空白，比较人类判断与大型语言模型在一个规范化的条件句数据集上的预测，该数据集控制了前提与投射预设之间的关系。我们在匹配的上下文条件下收集了120名参与者和四个大型语言模型的可能性评分。结果表明，人类在判断中整合了概率和语用线索，而大型语言模型与人类模式的对齐程度则表现出变异性。通过在大型语言模型作为评判者框架内使用语言学驱动的检查表，我们进一步评估了模型推理。我们观察到，与人类评分最匹配的模型往往缺乏连贯的语用推理，而推理能力较强的模型则产生了较少的人类相似判断。这些发现表明，大型语言模型在此类任务上的表现可能源于表面模式匹配，而非语用能力。我们的研究强调了基于语言理论的基准在比较人类与模型时的重要性。

cs.CL / 95 / 2605.18401

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

SkillsVote：代理技能的生命周期治理，从收集、推荐到演化

Liu, Hongyi, Yang, Haoyan, Jiang, Tao, Tang, Bo, Xiong, Feiyu, Li, Zhiyu

Abstract

Long-horizon LLM agents leave traces that could become reusable experience, but raw trajectories are noisy and hard to govern. We treat Agent Skills as an experience schema that couples executable scripts with non-executable guidance on procedures. Yet open skill ecosystems contain redundant, uneven, environment-sensitive artifacts, and indiscriminate updates can pollute future context. We present SkillsVote, a lifecycle-governance framework for Agent Skills from collection and recommendation to evolution. SkillsVote profiles a million-scale open-source corpus for environment requirements, quality, and verifiability, then synthesizes tasks for verifiable skills. Before execution, SkillsVote performs agentic library search over a structured skill library to expose instructional skill context. After execution, it decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals, and admits only successful reusable discoveries to evidence-gated updates. In our evaluation, offline evolution improves GPT-5.2 on Terminal-Bench 2.0 by up to 7.9 pp, while online evolution improves SWE-Bench Pro by up to 2.6 pp. Overall, governed external skill libraries can improve frozen agents without model updates when systems control exposure, credit, and preservation.

Chinese Translation

长期的 LLM 代理会留下可重用经验的痕迹，但原始轨迹噪声大且难以治理。我们将代理技能视为一种经验模式，它将可执行脚本与关于程序的不可执行指导相结合。然而，开放的技能生态系统中包含冗余、不均匀、环境敏感的工件，随意的更新可能会污染未来的上下文。我们提出了 SkillsVote，这是一个针对代理技能的生命周期治理框架，涵盖从收集和推荐到演化的各个环节。SkillsVote 针对环境需求、质量和可验证性对百万规模的开源语料库进行分析，然后合成可验证技能的任务。在执行之前，SkillsVote 在结构化技能库上进行代理库搜索，以揭示指导性技能的上下文。在执行之后，它将轨迹分解为与技能相关的子任务，将结果归因于技能使用、代理探索、环境和结果信号，并仅允许成功的可重用发现进入证据门控更新。在我们的评估中，离线演化使 GPT-5.2 在 Terminal-Bench 2.0 上提高了最高 7.9 个百分点，而在线演化使 SWE-Bench Pro 提高了最高 2.6 个百分点。总体而言，治理的外部技能库可以在系统控制曝光、信用和保存时改善冻结代理，而无需模型更新。

cs.CL / 96 / 2605.18421

EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

EvoMemBench：从自我演化的视角评估智能体记忆

Wang, Yuyao, Zhang, Zhongjian, Chi, Mo, Yu, Kaichi, Li, Yuhan, Peng, Miao, Tong, Bing, Zhang, Chen, Zhou, Yan, Li, Jia

Abstract

Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under-evaluated, largely because existing benchmarks do not provide a systematic way to assess memory mechanisms. In this paper, we study agent memory from a self-evolving perspective and introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented). We compare 15 representative memory methods with strong long-context baselines under a standardized protocol. Results show that current memory systems are still far from a general solution: long-context baselines remain highly competitive, memory helps most when the current context is insufficient or tasks are difficult, and no single memory form works consistently across all settings. Retrieval-based methods remain strong for knowledge-intensive settings, whereas procedural and long-term memory methods are more effective for execution-oriented tasks when their stored experience matches the task structure. We hope EvoMemBench facilitates future research on more effective memory systems for LLM-based agents. Our code is available at https://github.com/DSAIL-Memory/EvoMemBench.

Chinese Translation

近期针对大型语言模型（Large Language Model, LLM）智能体的基准测试主要评估推理、规划和执行能力。然而，记忆对于智能体同样至关重要，因为它使智能体能够随时间存储、更新和检索信息。这一能力仍然未得到充分评估，主要是因为现有基准测试未能提供系统化的方法来评估记忆机制。在本文中，我们从自我演化的视角研究智能体记忆，并引入EvoMemBench，这是一个统一的基准，沿着两个维度组织：记忆范围（剧集内 vs. 跨剧集）和记忆内容（知识导向 vs. 执行导向）。我们在标准化协议下比较了15种具有代表性的记忆方法与强大的长上下文基线。结果表明，当前的记忆系统距离通用解决方案仍然相去甚远：长上下文基线依然具有很强的竞争力，记忆在当前上下文不足或任务困难时最为有效，且没有单一的记忆形式在所有设置中都能始终如一地发挥作用。基于检索的方法在知识密集型设置中依然表现强劲，而过程性和长期记忆方法在其存储的经验与任务结构匹配时对于执行导向任务更为有效。我们希望EvoMemBench能够促进未来对基于LLM的智能体更有效记忆系统的研究。我们的代码可在 https://github.com/DSAIL-Memory/EvoMemBench 获取。

cs.CL / 97 / 2605.18462

From BERT to T5: A Study of Named Entity Recognition

从 BERT 到 T5：命名实体识别研究

Jia, Mei

Abstract

Named entity recognition (NER) is one of the essential preliminary steps in modern NLP applications. This report implements the NER task by fine-tuning two pretrained models: (i) an encoder-only model (BERT) with a simple classification head, and (ii) a sequence-to-sequence model (T5) with few-shot prompts. Under the original 7-class tag scheme and a simplified 3-class tag scheme, BERT is trained with a weighted cross-entropy loss, and T5 is fine-tuned with two validation strategies. The report also includes an ablation study over different hyperparameters. Moreover, the accompanying analysis provides insights into common errors made by BERT and into the performance of the two models. Based on a range of performance metrics, this report compares the two architectures and explores their abilities in the sequence labelling task, laying the groundwork for further practical use cases.

Chinese Translation

命名实体识别（NER）是现代自然语言处理（NLP）应用中的重要前置步骤。本报告聚焦于在微调两个预训练模型上实现 NER 任务：（i）一个仅包含编码器的模型（BERT），配有简单的分类头；（ii）一个序列到序列模型（T5），使用少量示例提示。在原始的 7 类标签和 3 类简化标签方案下，BERT 应用加权交叉熵作为训练损失，而 T5 则通过两种验证策略进行微调。同时，还进行了不同超参数的消融研究。此外，相关分析提供了对 BERT 和这两个模型性能中常见错误的宝贵见解。基于一系列性能指标，本报告旨在比较上述两种架构，并探索它们在序列标注任务中的能力，为进一步的实际应用案例奠定基础。
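
The abstract above mentions training BERT with a weighted cross-entropy loss but does not say how the weights are chosen. The sketch below is one common possibility, assuming inverse-frequency class weights; the function names and the weighting rule are illustrative assumptions, not the report's implementation.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Hypothetical weighting rule: rarer tags get larger weights."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {tag: total / (n_classes * n) for tag, n in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true tag).

    probs:  list of dicts mapping tag -> predicted probability (one per token)
    labels: list of gold tags, same length as probs
    """
    numerator = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator


# Toy example: "O" dominates, so the single "PER" token is up-weighted.
gold = ["O", "O", "O", "PER"]
tag_weights = inverse_frequency_weights(gold)
predictions = [{"O": 0.5, "PER": 0.5} for _ in gold]
loss = weighted_cross_entropy(predictions, gold, tag_weights)
```

With this rule, a mistake on a rare entity tag costs more than a mistake on the dominant "O" tag, which is the usual motivation for weighting the loss in NER.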

cs.CL / 98 / 2605.18490

Vector RAG vs LLM-Compiled Wiki: A Preregistered Comparison on a Small Multi-Domain Research

向量 RAG 与 LLM 编译的维基：一项关于小型多领域研究的预注册比较

Cochran, Theodore O.

Abstract

We preregistered a comparison of two ways to help an LLM answer questions over a small research corpus: a single-round Vector RAG system and an LLM-compiled markdown wiki. Both systems answered the same 13 questions over 24 papers using the same answer-generating model, and their answers were scored by blinded LLM judges. The wiki scored much better at connecting findings across papers, but its advantage in answer organization was not strong after judge adjustment. RAG met the preregistered test for single-fact lookup questions. The clean query-side cost result went against the expected wiki advantage: under the tested setup, the wiki used far more query tokens than RAG, so it could not recover any upfront build cost through cheaper queries. Two exploratory analyses changed how we interpret the result. First, claim-level citation checking favored the wiki: its cited pages more often supported the exact claims being made, even though RAG scored better on the overall groundedness rubric. Second, a decomposition-based RAG variant recovered most of the wiki's advantage on cross-paper synthesis at lower LLM-token cost, but it did not recover the wiki advantage in claim-by-claim citation support. The main conclusion is that grounded research synthesis is not a single capability. Systems can differ in how well they organize evidence, how well their citations support each claim, and how much they cost to run. In this study, no architecture was best on all three.

Chinese Translation

我们预注册了一项比较，研究两种帮助 LLM 在小型研究语料库上回答问题的方法：单轮向量 RAG 系统和 LLM 编译的 markdown 维基。这两个系统使用相同的答案生成模型回答了 24 篇论文中的 13 个相同问题，且其答案由盲评的 LLM 评审进行评分。维基在跨论文连接研究发现方面得分较高，但在评审调整后，其在答案组织上的优势并不明显。RAG 满足了预注册的单事实查找问题测试。清晰的查询侧成本结果与预期的维基优势相悖：在测试设置下，维基使用的查询标记远多于 RAG，因此无法通过更便宜的查询来弥补任何前期构建成本。两项探索性分析改变了我们对结果的解读。首先，声明级引用检查更有利于维基：其引用的页面更常支持所提出的确切声明，尽管 RAG 在整体基础性评分标准上得分更高。其次，一种基于分解的 RAG 变体在较低的 LLM 标记成本下恢复了维基在跨论文综合方面的大部分优势，但在逐条声明的引用支持上未能恢复维基的优势。主要结论是，基础研究综合并不是单一能力。系统在组织证据的效率、引用对每个声明的支持程度以及运行成本方面可能存在差异。在本研究中，没有一种架构在这三方面都是最佳的。

cs.CL / 99 / 2605.18500

Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning

隐式层次GRPO：将工具调用与执行解耦以实现工具集成的数学推理

Wang, Li, Wang, Xiaohan, Lu, Xiaodong, Zhang, Zipeng, Wu, Jinyang, Chai, Jiajun, Lin, Wei, Yin, Guojun

Abstract

Large language models (LLMs) have increasingly leveraged tool invocation to enhance their reasoning capabilities. However, existing approaches typically couple tool invocation tightly with immediate execution. Such immediate tool interaction may disrupt the reasoning coherence of LLMs and constrain their expressivity, ultimately degrading reasoning performance. To this end, for the first time, we propose and formalize the problem of decoupling tool invocation from execution during reasoning, and introduce delayed execution with explicit control to enhance tool-integrated reasoning (TIR). Furthermore, we propose a hierarchical control framework and theoretically derive a surrogate loss that enables an implicitly hierarchical policy to learn behavior equivalent to that of an explicit hierarchical policy, leading to the proposed IH-GRPO algorithm. In extensive experiments, IH-GRPO achieves absolute improvements of 1.87%, 2.16%, and 2.53% on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B, respectively, across six out-of-domain mathematical reasoning benchmarks over the strongest baseline method, while also yielding consistent performance gains in other domains. Our code is available at https://github.com/Lumina04/IH-GRPO-01.

Chinese Translation

大型语言模型（LLMs）越来越多地利用工具调用来增强其推理能力。然而，现有的方法通常将工具调用与即时执行紧密耦合。这种即时工具交互可能会破坏LLMs的推理连贯性，并限制其表达能力，最终降低推理性能。为此，我们首次提出并形式化了在推理过程中将工具调用与执行解耦的问题，并引入了带有显式控制的延迟执行，以增强工具集成推理（TIR）。此外，我们提出了一个层次控制框架，并理论推导出一个替代损失，使得隐式层次策略能够学习与显式层次策略等效的行为，从而形成了所提出的IH-GRPO算法。在IH-GRPO上的大量实验显示，在六个领域外的数学推理基准测试中，相较于最强基线方法，Qwen3-1.7B、Qwen3-4B和Qwen3-8B的绝对提升分别为1.87%、2.16%和2.53%，同时在其他领域也获得了一致的性能提升。我们的代码可在 https://github.com/Lumina04/IH-GRPO-01 获取。

cs.CL / 100 / 2605.18504

Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models

古希腊语到现代希腊语的机器翻译：一个新基准及对大型语言模型和神经机器翻译模型的微调实验

Mavromatis, Spyridon, Sofianopoulos, Sokratis, Prokopidis, Prokopis, Giagkou, Maria

Abstract

Machine Translation (MT) for Ancient Greek (AG) to Modern Greek (MG) is a low-resource task, constrained by the lack of large-scale, high-quality parallel data. We address this gap by introducing the AG-MG Parallel Corpus, a new resource containing 132,481 sentence-aligned pairs derived from literary, historical, and biblical texts. We present a novel corpus creation pipeline that combines web-scraped, excerpt-level data with a multi-stage sentence-level alignment and refinement process. Our method uses VecAlign with LaBSE embeddings, which we first fine-tune on a manually aligned AG-MG subset, followed by an LLM-based error/misalignment correction phase using Gemini 2.5 Flash to ensure high alignment quality. Furthermore, we provide the first comprehensive benchmark of modern MT models on this task, evaluating three fine-tuning strategies across NMT models (NLLB, M2M100) and a Greek LLM (Llama-Krikri-8B). Our experiments show that fine-tuning yields significant improvements over base models, increasing performance by up to +10.3 BLEU points. Specifically, full-parameter fine-tuning of Llama-Krikri-8B achieves the highest overall performance with a BLEU score of 13.16, while the QLoRA-adapted M2M100-1.2B model demonstrates the largest relative gains and highly competitive results. Our dataset and models represent a significant contribution to Greek NLP.

Chinese Translation

古希腊语（AG）到现代希腊语（MG）的机器翻译（MT）是一项低资源任务，受限于缺乏大规模、高质量的平行数据。我们通过引入AG-MG平行语料库来填补这一空白，该语料库包含来自文学、历史和圣经文本的132,481对句子对齐的句子。我们提出了一种新颖的语料库创建流程，该流程结合了网络抓取的摘录级数据与多阶段的句子级对齐和精炼过程。我们的方法使用VecAlign与LaBSE嵌入，首先在手动对齐的AG-MG子集上进行微调，然后通过使用Gemini 2.5 Flash的基于大型语言模型（LLM）的错误/错位修正阶段，确保高对齐质量。此外，我们提供了现代MT模型在该任务上的首个全面基准，评估了三种微调策略在神经机器翻译模型（NMT）上（NLLB, M2M100）以及一个希腊LLM（Llama-Krikri-8B）上的表现。我们的实验表明，微调相较于基础模型显著提升了性能，最高可提高+10.3 BLEU分数。具体而言，Llama-Krikri-8B的全参数微调实现了最高的整体性能，BLEU分数为13.16，而QLoRA适配的M2M100-1.2B模型则展示了最大的相对增益和高度竞争的结果。我们的数据集和模型对希腊自然语言处理（NLP）做出了重要贡献。

cs.CL / 101 / 2605.18512

Easier to Judge than to Find: Predicting In-Context Learning Success for Demonstration Selection

判断比寻找更容易：预测上下文学习成功的演示选择

Wang, Haochun, Yang, Chaofen, Liu, Jiatong, Wang, Jingbo, Qiang, Zewen, Zhao, Sendong, Qin, Bing, Liu, Ting

Abstract

In-context learning (ICL) is highly sensitive to which demonstrations appear in the prompt, but selecting them is expensive because the space of possible demonstration contexts and combinations is enormous. We argue that demonstration selection is "easier to judge than to find": predicting whether a specific query-context pair $(q,D)$ will succeed is cheaper and more general than searching for an optimal $D^\star$. Based on this insight, we propose DiSP, a sample-and-judge framework that stratifies queries by difficulty. DiSP runs random demonstration trials to estimate the success rate of each training query, trains a lightweight router to predict difficulty from the query, and trains level-specific judges for sampled demonstrations. At inference, DiSP performs stop-on-acceptance judging under an explicit budget, emitting diagnostic risk tags when no suitable context is found. Across five classification datasets with Llama 3-8B and Qwen 2.5-7B, DiSP achieves the best average accuracy, improving over strong learned selection baselines by up to 3.4%, while achieving up to $23\times$ end-to-end wall-clock speedup.

Chinese Translation

上下文学习（In-context learning, ICL）对提示中出现的演示非常敏感，但选择这些演示的成本很高，因为可能的演示上下文和组合的空间非常庞大。我们认为，演示选择是“判断比寻找更容易”的：预测特定查询-上下文对 $(q,D)$ 是否会成功的成本低于寻找最优 $D^\star$。基于这一洞察，我们提出了 DiSP，一个样本与判断的框架，该框架根据难度对查询进行分层。DiSP 运行随机演示试验以估计每个训练查询的成功率，训练一个轻量级路由器从查询中预测难度，并为采样的演示训练特定级别的评判者。在推理阶段，DiSP 在明确预算下执行接受即停的判断，当未找到合适的上下文时发出诊断风险标签。在使用 Llama 3-8B 和 Qwen 2.5-7B 的五个分类数据集上，DiSP 实现了最佳平均准确率，相比强大的学习选择基线提高了最多 3.4%，同时实现了高达 $23\times$ 的端到端时钟速度提升。
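
The stop-on-acceptance loop described in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the judge is any callable returning a success score in [0, 1], the difficulty router and level-specific judges are omitted, and the names `stop_on_acceptance`, `threshold`, and the risk-tag string are hypothetical rather than taken from the paper.

```python
import random


def stop_on_acceptance(query, pool, judge, budget, k=4, threshold=0.5, seed=0):
    """Sample random demonstration sets until the judge accepts one.

    judge(query, demos) -> predicted probability that the pair succeeds
    budget              -> maximum number of sampled contexts to judge
    Returns the first accepted context, or a diagnostic risk tag if the
    budget is exhausted without an acceptance.
    """
    rng = random.Random(seed)
    for trial in range(1, budget + 1):
        demos = rng.sample(pool, k)  # one random candidate context D
        if judge(query, demos) >= threshold:
            return {"demos": demos, "trials": trial, "risk_tag": None}
    return {"demos": None, "trials": budget, "risk_tag": "no_accepted_context"}


# Toy judges standing in for a trained model (assumptions for illustration).
demo_pool = ["d1", "d2", "d3", "d4", "d5", "d6"]
accepted = stop_on_acceptance("q", demo_pool, lambda q, d: 1.0, budget=8, k=2)
rejected = stop_on_acceptance("q", demo_pool, lambda q, d: 0.0, budget=8, k=2)
```

The point of the design is that judging a sampled context is a single cheap prediction, so the loop usually stops after a few trials instead of searching the whole space of demonstration combinations.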

cs.CL / 102 / 2605.18530

Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

连续扩散在语言模型中与离散扩散竞争性地扩展

Yang, Zhihan, Guo, Wei, Zhang, Shuibai, Sahoo, Subham Sekhar, Chen, Yongxin, Vahdat, Arash, Mardani, Morteza, Thickstun, John

Abstract

While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and construct RePlaid by aligning the architecture of Plaid with modern discrete DLMs. In this unified setting, we establish the first scaling law for continuous DLMs that rivals discrete DLMs: RePlaid exhibits a compute gap of only $20\times$ compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime. We benchmark RePlaid against recent continuous DLMs: on OpenWebText, RePlaid achieves a new state-of-the-art PPL bound of $22.1$ among continuous DLMs and superior generation quality. These results suggest that continuous diffusion, when trained via likelihood, is a highly competitive and scalable alternative to discrete DLMs. Moreover, we offer theoretical insights to understand the advantage of likelihood-based training. We show that optimizing the noise schedule to minimize the ELBO's variance naturally yields linear cross-entropy (information loss) over time. This evenly distributes denoising difficulty without any case-specific time reparameterization. In addition, we find that optimizing embeddings via likelihood creates structured geometries and drives the most significant likelihood gain.

Chinese Translation

尽管扩散在语言建模社区中引起了相当大的关注，但连续扩散似乎在可扩展性上不如离散方法。为了挑战这一观点，我们重新审视了Plaid，一个基于似然的连续扩散语言模型（DLM），并通过将Plaid的架构与现代离散DLM对齐构建了RePlaid。在这个统一的框架中，我们建立了第一个与离散DLM相媲美的连续DLM的扩展法则：与自回归模型相比，RePlaid的计算差距仅为$20\times$，在参数使用更少的情况下超越了Duo，并在过度训练的情况下超越了MDLM。我们将RePlaid与最近的连续DLM进行了基准测试：在OpenWebText上，RePlaid在连续DLM中达到了新的最优PPL界限$22.1$，并且生成质量优越。这些结果表明，基于似然训练的连续扩散是离散DLM的一个高度竞争和可扩展的替代方案。此外，我们提供了理论见解以理解基于似然训练的优势。我们展示了优化噪声调度以最小化ELBO的方差自然地随着时间推移产生线性交叉熵（信息损失）。这均匀分配了去噪难度，而无需任何特定案例的时间重新参数化。此外，我们发现通过似然优化嵌入会创建结构化几何，并推动最显著的似然增益。

cs.CL / 103 / 2605.18548

STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

STT-Arena：一种更具现实性的工具使用环境，具有时空动态特性

Hui, Tingfeng, Xu, Hao, Zhu, Pengyu, Xin, Hongsheng, Zhan, Kun, Su, Sen, Liu, Chunxiao, Miao, Ning

Abstract

Large language models (LLMs) deployed in real-world agentic applications must be capable of replanning and adapting when mid-task disruptions invalidate their prior decisions. Existing dynamic benchmarks primarily measure whether LLMs can detect temporal changes in a timely manner, leaving the complementary challenge of adaptive replanning under spatio-temporal dynamics largely unexplored. We introduce STT-Arena (Spatio-Temporal Tool-Use Arena), a benchmark of 227 high-quality interactive tasks spanning nine spatio-temporal conflict types and four solvability levels. Each task is grounded in a realistic, executable environment equipped with injected spatio-temporal triggers that can abruptly invalidate an ongoing plan, forcing the model to detect the state shift and construct a revised execution strategy. Extensive evaluation of frontier LLMs reveals that even state-of-the-art proprietary models, including Claude-4.6-Opus, achieve less than 40% overall accuracy, highlighting the fundamental difficulty of spatio-temporal dynamic reasoning. Systematic analysis of failure trajectories uncovers three recurring error modes of existing models: Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification. Guided by these findings, we propose an iterative trajectory refinement technique that eliminates these failure patterns from training data, and combine it with online RL to produce STT-Agent-4B, which outperforms frontier LLMs on STT-Arena.

Chinese Translation

在现实世界的智能应用中部署的大型语言模型（LLMs）必须能够在任务中断时重新规划和适应，以应对先前决策失效的情况。现有的动态基准主要测量LLMs是否能够及时检测到时间变化，而在时空动态下的自适应重新规划这一补充挑战则尚未得到充分探索。我们提出了STT-Arena（时空工具使用竞技场），这是一个包含227个高质量交互任务的基准，涵盖九种时空冲突类型和四个可解性水平。每个任务都基于一个现实的、可执行的环境，配备了注入的时空触发器，这些触发器可以突然使正在进行的计划失效，迫使模型检测状态变化并构建修订后的执行策略。对前沿LLMs的广泛评估显示，即使是包括Claude-4.6-Opus在内的最先进的专有模型，其整体准确率也低于40%，突显了时空动态推理的基本难度。对失败轨迹的系统分析揭示了现有模型的三种重复错误模式：过时状态执行、动态触发器误诊断和缺失后适应验证。在这些发现的指导下，我们提出了一种迭代轨迹优化技术，消除了训练数据中的这些失败模式，并结合在线强化学习，生成了STT-Agent-4B，该模型在STT-Arena上超越了前沿LLMs。

cs.CL / 104 / 2605.18549

Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

监测内部独白：探针轨迹揭示推理动态

Chrabąszcz, Maciej, Szymczyk, Aleksander, Sendera, Marcin, Trzciński, Tomasz, Cygert, Sebastian

Abstract

Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model's final output, undermining its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations. By evaluating a probe at each generated token, we construct a probe trajectory, the continuous evolution of a concept's probability across the reasoning process. We find that future model behavior is more distinguishable when examined over the full trajectory than from a single static prediction. To characterize these temporal dynamics, we extract signal-processing features that capture volatility, trend, and steady-state behavior, significantly improving the separation of future model states. We also present two methodological insights. First, template-based training data achieves near-parity with dynamically generated model responses, eliminating the need for a costly initial inference and labeling. Second, the choice of pooling operation is critical: average-pooling and last-token methods collapse to near-random performance, while max-pooling achieves up to 95% AUROC and yields stable probe trajectories. Using four datasets and four reasoning models across the domains of safety and mathematics, we demonstrate that trajectory features encode task-specific dynamics that improve outcome separability. These findings establish probe trajectories as a complementary framework for monitoring LRM behavior. Warning: This article contains potentially harmful content.

Chinese Translation

大型推理模型（LRMs）通过其思维链（Chain of Thought, CoT）推理为安全监测提供了新的机会。然而，CoT并不总是忠实于模型的最终输出，这削弱了其作为监测工具的可靠性。为了解决这个问题，我们研究LRMs的隐藏表示，以确定是否可以从提示和CoT表示中预测未来行为。通过在每个生成的标记上评估探针，我们构建了探针轨迹，即概念在推理过程中的概率连续演变。我们发现，当在整个轨迹上进行检查时，未来模型行为更易于区分，而不是从单一静态预测中得出。为了表征这些时间动态，我们提取了捕捉波动性、趋势和稳态行为的信号处理特征，显著提高了未来模型状态的分离度。我们还提出了两个方法论见解。首先，基于模板的训练数据在与动态生成的模型响应接近的情况下，消除了对昂贵的初始推理和标记的需求。其次，池化操作的选择至关重要：平均池化和最后标记方法的性能接近随机，而最大池化则达到了95%的AUROC，并产生了稳定的探针轨迹。通过使用四个数据集和四个推理模型，涵盖安全和数学领域，我们证明了轨迹特征编码了特定任务的动态，从而改善了结果的可分性。这些发现确立了探针轨迹作为监测LRM行为的补充框架。警告：本文包含潜在有害内容。
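
The abstract above describes summarising a probe trajectory (one probe probability per generated token) with features for volatility, trend, and steady-state behaviour, without giving the exact definitions. The sketch below uses plausible stand-ins: standard deviation of step-to-step changes, least-squares slope, and mean of the final tokens. These definitions and the `tail` parameter are assumptions for illustration, not the paper's feature set.

```python
from statistics import mean, pstdev


def trajectory_features(probs, tail=0.25):
    """Summarise a non-empty probe trajectory.

    probs: probe probabilities, one per generated token
    tail:  fraction of final tokens treated as the steady state (assumed)
    """
    n = len(probs)
    # Volatility: spread of token-to-token changes in the probe output.
    diffs = [b - a for a, b in zip(probs, probs[1:])]
    volatility = pstdev(diffs) if diffs else 0.0
    # Trend: ordinary least-squares slope of probability against token index.
    t_mean = (n - 1) / 2
    p_mean = mean(probs)
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = (
        sum((t - t_mean) * (p - p_mean) for t, p in enumerate(probs)) / denom
        if denom
        else 0.0
    )
    # Steady state: average over the last `tail` fraction of tokens.
    k = max(1, int(n * tail))
    steady_state = mean(probs[-k:])
    return {"volatility": volatility, "trend": slope, "steady_state": steady_state}


# A steadily rising trajectory: no volatility, positive trend.
features = trajectory_features([0.1, 0.2, 0.3, 0.4])
```

Features like these could then feed a simple classifier, which is one way a full trajectory can separate outcomes better than a single static probe prediction.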

cs.CL / 105 / 2605.18563

Readers make targeted regressions to plausible errors in reanalysis of "noisy-channel garden-path" sentences

读者针对“噪声通道花园路径”句子中的合理错误进行有针对性的回退

Clark, Thomas Hikaru, Levy, Roger, Gibson, Edward

Abstract

A key question in psycholinguistics is how inferences about the meaning of linguistic input unfold incrementally in a comprehender's mind. In this work, we study reading dynamics for "noisy-channel garden-path" sentences, which temporarily appear well-formed but feature late-appearing violations of expectation that can be resolved not by inferring an alternative syntactic structure, but by inferring the presence of an error. We find evidence for targeted regressions, that is, eye movements towards regions that are promising loci of possible errors in light of later-arriving information, showing patterns consistent with the posterior inferences of a model of noisy-channel processing with reanalysis. We discuss the implications of these findings for theories of noisy-channel language comprehension and information-theoretic explanations of reading dynamics.

Chinese Translation

心理语言学中的一个关键问题是，理解者的心智如何逐步展开对语言输入意义的推断。在本研究中，我们研究了“噪声通道花园路径”句子的阅读动态，这些句子在短暂时间内看似结构良好，但后期出现的期望违背无法通过推断替代的句法结构来解决，而是通过推断错误的存在来解决。我们发现有针对性的回退的证据——眼动向可能错误的有希望位置移动，这些位置在后续信息的照射下显得更具可能性，显示出与噪声通道处理模型的后验推断一致的模式。我们讨论了这些发现对噪声通道语言理解理论和信息论阅读动态解释的影响。

cs.CL / 106 / 2605.18565

LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT：在多目标干扰下评估长时间跨度代理系统的记忆

Lee, Hyunji, Chen, Justin Chih-Yao, Singh, Joykirat, Khan, Zaid, Stengel-Eskin, Elias, Bansal, Mohit

Abstract

Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce LongMINT (Long-Horizon Memory under INTerference), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3) diverse question types that assess robustness to interference, including (i) single-target recall tasks requiring retrieval of a specific target from long contexts, and (ii) multi-target aggregation tasks requiring reasoning over multiple relevant pieces of information. Overall, LongMINT has 15.6k question-answering pairs over long-horizon contexts averaging 138.8k tokens and extending up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (avg. 27.9% accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are later revised or interfered with by subsequent context, with performance degrading as the number of intervening updates increases.

Chinese Translation

现实世界中的代理在长时间和不断演变的环境中操作，其中信息会被反复更新并可能在记忆之间产生干扰，这要求对多条信息进行准确的回忆和综合推理。然而，现有的基准测试侧重于静态、独立的回忆，未能捕捉到这些不断演变的记忆之间的动态交互。在本文中，我们研究了当前的记忆增强代理在现实的、干扰严重的长时间跨度环境中在不同领域和问题类型下的表现。我们引入了LongMINT（长时间跨度下的记忆干扰），这是一个基准测试，具有以下特点：(1) 长的、高度互联的上下文，包含频繁更新的信息，导致显著的干扰；(2) 多样的领域（状态跟踪、多轮对话、维基百科修订和GitHub提交），使得领域泛化的评估成为可能；(3) 多样的问题类型，评估对干扰的鲁棒性，包括(i) 单目标回忆任务，要求从长上下文中检索特定目标，以及(ii) 多目标聚合任务，要求对多条相关信息进行推理。总体而言，LongMINT包含15600对问题-回答，长时间跨度上下文的平均长度为138800个标记，单个实例最多可扩展至1800000个标记。我们评估了7个代表性系统，包括普通长上下文的LLM、RAG和记忆增强代理框架。在所有系统中，我们观察到一致的低性能（平均准确率为27.9%），尤其是在需要对多条证据进行聚合推理的问题上。我们的分析表明，性能主要受到检索和记忆构建的限制。此外，当前的记忆系统在回忆和推理早期事实时遇到困难，这些事实在后续上下文中被修订或干扰，随着干预更新次数的增加，性能逐渐下降。

cs.CL / 107 / 2605.18567

GUT-IS: A Data-Driven Approach to Integrating Constructs and Their Relations in Information Systems

GUT-IS：一种基于数据的方法，用于整合信息系统中的构念及其关系

Reinhardt, Maximilian, Scharfenberger, Jonas, Funk, Burkhardt

Abstract

Structural equation modeling is widely used in IS research. However, inconsistent construct definitions impede the cumulative development of knowledge. In this work, we present an approach that aims to integrate structural equation models into a unified model: we use a combination of task-adapted text embeddings and clustering to produce a candidate set of construct groupings. Subsequently, we select the optimal solution using a loss function that explicitly trades off semantic purity and parsimony in the number of clusters. By making this trade-off explicit, our approach makes it possible to analyze how construct groupings and their relations change as one shifts the priority from purity to parsimony. Empirically, we evaluate and explore the proposed methodology on two datasets from the IS domain.

Chinese Translation

结构方程模型在信息系统（IS）研究中被广泛使用。然而，不一致的构念定义阻碍了知识的累积发展。在本研究中，我们提出了一种旨在将结构方程模型整合为统一模型的方法：我们结合任务适应的文本嵌入和聚类，生成构念分组的候选集。随后，我们使用损失函数选择最优解，该函数明确权衡语义纯度与聚类数量的简约性。通过明确这一权衡，我们的方法能够分析构念分组及其关系如何随着从纯度向简约性的优先级转变而变化。我们在IS领域的两个数据集上对所提方法进行了实证评估和探索。
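
The abstract above selects among candidate construct groupings with a loss that trades off semantic purity against parsimony, but does not give its form. A minimal sketch, assuming a simple convex combination controlled by a hypothetical weight `lam` (0 favours purity, 1 favours fewer clusters); the purity scores and the loss itself are illustrative assumptions.

```python
def tradeoff_loss(purity, n_clusters, n_items, lam):
    """Assumed loss: impurity weighted by (1 - lam), relative cluster count by lam."""
    return (1 - lam) * (1 - purity) + lam * (n_clusters / n_items)


def select_grouping(candidates, n_items, lam):
    """Pick the candidate clustering with the lowest trade-off loss."""
    return min(
        candidates,
        key=lambda c: tradeoff_loss(c["purity"], c["n_clusters"], n_items, lam),
    )


# Two hypothetical candidates over 20 constructs: many pure clusters vs. few mixed ones.
candidate_groupings = [
    {"n_clusters": 10, "purity": 0.95},
    {"n_clusters": 3, "purity": 0.70},
]
purity_first = select_grouping(candidate_groupings, n_items=20, lam=0.1)
parsimony_first = select_grouping(candidate_groupings, n_items=20, lam=0.9)
```

Sweeping `lam` from 0 to 1 and recording which grouping wins at each value is one way to inspect how construct groupings change as priority shifts from purity to parsimony.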

cs.CL / 108 / 2605.18572

MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion

MA$^{2}$P：一种用于复杂劝说的元认知自主智能体框架

Zhang, Dingyi, Zhuang, Ziqing, Zhang, Linhai, Gao, Ziyang, Zhou, Deyu

Abstract

Persuasive dialogue generation plays a vital role in decision-making, negotiation, counseling, and behavior change, yet it remains a challenging problem. In complex persuasion where the persuadee's internal states are not expressed clearly, the persuader must interpret responses, infer the persuadee's latent mental states (e.g., beliefs and desires), and translate them into targeted, strategy-consistent actions; however, current approaches often produce generic or weakly grounded responses even when such cues are identified. Moreover, although large language models (LLMs) can generate persuasive content, their performance varies substantially across domains due to uneven knowledge coverage and limited reasoning generalization. To address these challenges, we propose MA$^{2}$P, a meta-cognitive autonomous intelligent agent framework for complex persuasion. Specifically, we develop an autonomous multi-agent architecture that coordinates perception management, mental-state inference, strategy execution, memory maintenance, and performance evaluation. To mitigate cross-domain performance variation, we further design a meta-cognitive configurator that selects an appropriate meta-strategy from a structured knowledge base at the outset, thereby guiding subsequent reasoning and planning. Experimental results show that our approach achieves a higher persuasion success rate than baselines.

Chinese Translation

劝说对话生成在决策、谈判、咨询和行为改变中发挥着至关重要的作用，但仍然是一个具有挑战性的问题。在复杂劝说中，劝说对象的内在状态并未明确表达，劝说者必须解读回应，推断劝说对象的潜在心理状态（例如，信念和欲望），并将其转化为针对性的、符合策略的行动；然而，当前的方法即使在识别出这些线索时，往往也会产生通用或基础薄弱的回应。此外，尽管大型语言模型（LLMs）可以生成劝说内容，但由于知识覆盖不均和推理泛化能力有限，它们在不同领域的表现差异显著。为了解决这些挑战，我们提出了MA$^{2}$P，一种用于复杂劝说的元认知自主智能体框架。具体而言，我们开发了一种自主多智能体架构，协调感知管理、心理状态推断、策略执行、记忆维护和性能评估。为了减轻跨领域性能变异，我们进一步设计了一个元认知配置器，在一开始从结构化知识库中选择合适的元策略，从而指导后续的推理和规划。实验结果表明，我们的方法在劝说成功率上超过了基线。

cs.CL / 109 / 2605.18607

Forecasting Downstream Performance of LLMs With Proxy Metrics

利用代理指标预测大型语言模型的下游性能

Patel, Arkil, Reddy, Siva, Mosbach, Marius, Bahdanau, Dzmitry

Abstract

Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) For cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman Rho = 0.81 (vs. Rho = 0.36 for cross-entropy loss); 2) For pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly $10{,}000\times$ less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) for training-time forecasting, they extrapolate downstream accuracy across an $18\times$ compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.

Chinese Translation

语言模型发展的进展通常受到比较决策的驱动：选择采用哪种架构、使用哪种预训练语料库或应用哪种训练方案。做出这些决策需要可靠的性能预测，但两种常用的信号在根本上存在局限性。交叉熵损失与下游能力的对齐程度较差，而直接的下游评估成本高、稀疏，并且在早期训练阶段往往缺乏信息。因此，我们建议通过聚合候选模型在专家编写的解决方案上的下一个标记分布的标记级统计数据（如熵、top-k 准确率和专家标记排名）来构建代理指标。在三种设置中，我们的代理指标始终优于基于损失和计算的基线：1）在跨家族模型选择中，它们对异构推理模型的人群进行排名，平均斯皮尔曼相关系数 Rho = 0.81（相比之下，交叉熵损失的 Rho = 0.36）；2）在预训练数据选择中，它们以比直接评估少约 $10{,}000\times$ 的计算量可靠地对目标模型的 25 个候选语料库进行排名，推动了帕累托前沿超越现有方法；3）在训练时间预测中，它们在 $18\times$ 的计算范围内推断下游准确率，误差大约是现有替代方案的一半。这些结果表明，专家轨迹是评估模型能力的广泛有用信号来源，使得在模型开发生命周期中能够进行可靠的性能预测。
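
The token-level statistics named in the abstract above (entropy, top-k accuracy, and expert token rank) can be computed from a model's next-token distribution at each position of an expert-written solution. The sketch below assumes each distribution is available as a dict over the full vocabulary and uses plain averaging as the aggregation; the paper's actual aggregation is not specified here, so treat both as assumptions.

```python
import math


def token_stats(dist, expert_token, k=5):
    """Statistics for one position.

    dist:         dict mapping token -> probability (assumed to cover the vocabulary)
    expert_token: the token the expert solution actually uses at this position
    """
    entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
    ranked = sorted(dist, key=dist.get, reverse=True)
    rank = ranked.index(expert_token) + 1  # 1 = model's top choice
    return entropy, rank, rank <= k


def proxy_metrics(dists, expert_tokens, k=5):
    """Average the per-token statistics over an expert solution."""
    stats = [token_stats(d, t, k) for d, t in zip(dists, expert_tokens)]
    n = len(stats)
    return {
        "mean_entropy": sum(s[0] for s in stats) / n,
        "mean_expert_rank": sum(s[1] for s in stats) / n,
        "top_k_accuracy": sum(s[2] for s in stats) / n,
    }


# Toy two-token "solution" over a two-word vocabulary.
toy_dists = [{"a": 0.6, "b": 0.4}, {"a": 0.9, "b": 0.1}]
metrics = proxy_metrics(toy_dists, ["a", "b"], k=1)
```

Because these statistics need only forward passes over fixed expert text, they avoid generating and grading full solutions, which is consistent with the compute savings the abstract reports.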

cs.CL / 110 / 2605.18646

Language-Switching Triggers Take a Latent Detour Through Language Models

语言切换触发器通过语言模型走了一条潜在的迂回路

Kulumba, Francis, Antoun, Wissam, Lasnier, Théo, Sagot, Benoît, Seddah, Djamé

Abstract

Backdoor attacks on language models pose a growing security concern, yet the internal mechanisms by which a trigger sequence hijacks model computations remain poorly understood. We identify a circuit underlying a language-switching backdoor in an 8B-parameter autoregressive language model, where a three-word Latin trigger (nine tokens) redirects English output to French. We decompose the circuit into three phases: (1) distributed attention heads at early layers compose the trigger tokens into the last sequence position; (2) the resulting signal propagates through mid-layers in a subspace orthogonal to the model's natural language-identity direction; (3) the MLP at the final layer converts this latent signal into French logits. The entire circuit flows through a serial bottleneck at a single position: corrupting that position at any layer entirely mitigates the trigger but also hinders the model's capabilities. The orthogonal latent encoding suggests that defenses that search for language-like signals in intermediate representations would miss this trigger entirely.

Chinese Translation

对语言模型的后门攻击日益成为一个安全隐患，但触发序列劫持模型计算的内部机制仍然不甚明了。我们识别出一个潜在的语言切换后门电路，该电路存在于一个具有8B参数的自回归语言模型中，其中一个三词拉丁触发器（九个标记）将英语输出重定向至法语。我们将该电路分解为三个阶段：（1）早期层的分布式注意力头将触发标记组合到最后的序列位置；（2）生成的信号通过中间层在与模型自然语言身份方向正交的子空间中传播；（3）最后一层的多层感知器（MLP）将该潜在信号转换为法语对数几率。整个电路通过一个位置的串行瓶颈流动：在任何层中破坏该位置将完全消除触发器，但也会妨碍模型的能力。正交的潜在编码表明，搜索中间表示中的类语言信号的防御措施将完全错过这一触发器。

cs.CL / 111 / 2605.18703

EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

EnvFactory：通过可执行环境合成和稳健强化学习扩展工具使用代理

Xu, Minrui, Wang, Zilin, DENG, Mengyi, Li, Zhiwei, Yang, Zhicheng, Zhu, Xiao, Liu, Yinhong, Zhu, Boyu, Huang, Baiyu, Chen, Chao, Deng, Heyuan, Mi, Fei, Shang, Lifeng, Zeng, Xingshan, Guo, Zhijiang

Abstract

Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or depend on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, reducing their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using significantly fewer environments than prior work, which often uses five times as many, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including $\tau^2$-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.

Chinese Translation

通过代理强化学习（Agentic RL）为大型语言模型（LLMs）赋予工具使用能力面临两个挑战：缺乏可扩展、稳健的执行环境以及缺乏能够捕捉隐含人类推理的真实训练数据。现有方法依赖于昂贵的现实世界API、易产生幻觉的大型语言模型模拟器，或往往是单轮的、或依赖预先收集文档的合成环境。此外，合成轨迹通常过于具体，类似于指令序列而非自然人类意图，从而降低了其在强化学习训练中的有效性。我们提出了EnvFactory，这是一个完全自动化的框架，旨在解决这两个挑战。EnvFactory能够自主探索和验证来自真实资源的有状态、可执行的工具环境，并通过拓扑感知采样和校准精炼合成自然的多轮轨迹，生成具有隐含意图的有依据的查询。仅使用跨越7个领域的85个经过验证的环境，EnvFactory生成了2,575个SFT和RL轨迹。尽管使用的环境数量显著少于以往工作（以往工作的环境数量通常是其5倍），EnvFactory仍然实现了更优的训练效率和下游性能，使Qwen3系列模型在BFCLv3上提高了高达15%，在MCP-Atlas上提高了8.6%，在包括$\tau^2$-Bench和VitaBench在内的对话基准上提高了6%。通过完全自动化环境构建和轨迹合成，EnvFactory为代理强化学习提供了一个可扩展、易扩展且稳健的基础。

cs.CL / 112 / 2605.18732

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

可预测的虚构：大型语言模型的事实回忆与模型规模和主题频率相关

Smith, Matthew L., Shock, Jonathan P., Segun, Samuel T., Olatunji, Iyiola E., Bissyandé, Tegawendé F.

Abstract

While scaling laws govern aggregate large language model performance, no scaling law has linked factual recall to both model size and training-data composition. We evaluated 38 models on over 8,900 scholarly references, each checked by an automated reference verification system. Recall quality follows a sigmoid in the log-linear combination of model parameter count and topic representation in training data. These two variables alone explain 60% of the variance across 16 dense models from four families, rising to 74-94% within individual families. The form matches a superposition-inspired account in which recall is gated by a signal-to-noise ratio: signal strength scales with concept frequency and the noise floor with model capacity.

Chinese Translation

尽管规模法则支配着大型语言模型的整体性能，但尚无规模法则将事实回忆同时与模型规模和训练数据组成联系起来。我们评估了38个模型，涉及超过8,900个学术参考文献，这些文献均由自动化参考验证系统进行核查。回忆质量在模型参数数量和训练数据中主题表示的对数线性组合中呈现S型曲线。仅凭这两个变量就解释了来自四个家族的16个稠密模型中60%的方差，而在单个家族内则上升至74-94%。这种形式与一种受叠加启发的解释相匹配，其中回忆受到信噪比的限制：信号强度随概念频率变化，而噪声底线随模型容量变化。
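
The functional form named in the abstract, a sigmoid applied to a log-linear combination of parameter count and topic representation, can be sketched as below. This is only an illustration of the shape of such a law: the function name `recall_quality`, the use of a raw frequency count for topic representation, and the coefficients `a`, `b`, and `c` are all hypothetical placeholders, not values or definitions reported by the paper.

```python
import math


def recall_quality(n_params: float, topic_freq: float,
                   a: float = 1.0, b: float = 1.0, c: float = -25.0) -> float:
    """Sigmoid of a log-linear combination of model size and topic frequency.

    a, b, c are made-up illustrative coefficients, not fitted values.
    """
    # Log-linear combination of the two predictors.
    z = a * math.log(n_params) + b * math.log(topic_freq) + c
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Larger models and more frequent topics both push predicted recall up.
    for n in (7e9, 7e10):
        for f in (1e2, 1e4):
            print(f"params={n:.0e} freq={f:.0e} recall={recall_quality(n, f):.3f}")
```

Under this form, predicted recall rises monotonically with either predictor and saturates at both ends, which is the qualitative behaviour a sigmoid law implies.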

cs.CL / 113 / 2605.18747

Code as Agent Harness

代码作为代理的支架

Ning, Xuying, Tieu, Katherine, Fu, Dongqi, Wei, Tianxin, Li, Zihao, Bei, Yuanchen, Zou, Jiaru, Ai, Mengting, Liu, Zhining, Li, Ting-Wei, Chen, Lingjie, Zhao, Yanjun, Yang, Ke, Li, Bingxuan, Qian, Cheng, Li, Gaotang, Lin, Xiao, Zeng, Zhichen, Qiu, Ruizhong, Chen, Sirui, Sun, Yifan, Yang, Xiyuan, Wang, Ruida, Pan, Rui, Yang, Chenyuan, Zhang, Dylan, Fang, Liri, Cui, Zikun, Cao, Yang, Chen, Pan, Sun, Dorothy, Chen, Ren, Srinivasan, Mahesh, Mathur, Nipun, Xia, Yinglong, Li, Hong, Yan, Hong, Lu, Pan, Zhang, Lingming, Zhang, Tong, Tong, Hanghang, He, Jingrui

Abstract

Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make the harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.

Chinese Translation

近期的大型语言模型（LLMs）在理解和生成代码方面展现出了强大的能力，从竞赛编程到仓库级软件工程。在新兴的代理系统中，代码不再仅仅是一个目标输出。它越来越多地作为代理推理、行动、环境建模和基于执行的验证的操作基础。我们通过代理支架的视角来框定这一转变，并引入代码作为代理支架的概念：一种将代码视为代理基础设施之基础的统一视角。为了系统地研究这一观点，我们围绕三个相互关联的层次组织本综述。首先，我们研究支架接口，代码在此连接代理与推理、行动和环境建模。其次，我们考察支架机制：用于长时程执行的规划、记忆和工具使用，以及使支架可靠和自适应的反馈驱动控制和优化。第三，我们讨论将支架从单代理系统扩展到多代理环境，其中共享的代码工件支持多代理协调、审查和验证。在这些层次中，我们总结了代码作为代理支架的代表性方法和实际应用，涵盖了编码助手、GUI/OS自动化、具身代理、科学发现、个性化和推荐、DevOps以及企业工作流程。我们进一步概述了支架工程中的开放挑战，包括超越最终任务成功的评估、在不完整反馈下的验证、无回归的支架改进、多个代理之间的一致共享状态、安全关键行动的人类监督，以及对多模态环境的扩展。通过将代码作为代理人工智能的支架，本综述提供了通向可执行、可验证和有状态的人工智能代理系统的统一路线图。

cs.CL / 114 / 2605.18753

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

DashAttention：可微分且自适应的稀疏层次注意力

Huang, Yuxiang, Gonçalves, Nuno M. T., Alvetreti, Federico, Li, Lei, Han, Xu, Ponti, Edoardo M., Martins, André F. T., Treviso, Marcos V.

Abstract

Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and subsequently apply fine-grained softmax attention on the selected tokens. However, the top-k operation assumes that the number of relevant tokens is fixed for every query, and it blocks gradient flow between the sparse and dense stages. In this work, we propose DashAttention (Differentiable and Adaptive Sparse Hierarchical Attention), which leverages the adaptively sparse $\alpha$-entmax transformation to select a variable number of blocks according to the current query in the first stage. This in turn provides a prior for the second-stage softmax attention, keeping the entire hierarchy fully differentiable. Unlike other hierarchical attention methods, DashAttention is shown to be non-dispersive, which translates to better long-context modeling ability. Experiments with large language models (LLMs) show that DashAttention achieves accuracy comparable to full attention at 75% sparsity and a better Pareto frontier than NSA and InfLLMv2, especially in high-sparsity regimes. We also provide an efficient, GPU-aware implementation of DashAttention in Triton, which achieves a speedup of up to over FlashAttention-3 at inference time. Overall, DashAttention offers a cost-effective strategy to model long contexts.

Chinese Translation

当前的层次注意力方法，如NSA和InfLLMv2，基于粗略的注意力分数选择前k个相关的键值（KV）块，并随后对选定的标记应用细粒度的softmax注意力。然而，前k操作假设任何查询的相关标记数量是固定的，并且阻碍了稀疏阶段与密集阶段之间的梯度流动。在本研究中，我们提出了DashAttention（可微分且自适应的稀疏层次注意力），该方法利用自适应稀疏的$\alpha$-entmax变换，根据当前查询在第一阶段选择可变数量的块。这反过来为第二阶段的softmax注意力提供了先验，从而保持整个层次结构完全可微分。与其他层次注意力方法不同，我们展示了DashAttention是非分散的，这转化为更好的长上下文建模能力。在大型语言模型（LLMs）上的实验表明，DashAttention在75%稀疏度下实现了与全注意力相当的准确性，并且比NSA和InfLLMv2具有更好的帕累托前沿，尤其是在高稀疏度条件下。我们还提供了DashAttention在Triton中的高效GPU感知实现，在推理时实现了超过FlashAttention-3的加速。总体而言，DashAttention提供了一种具有成本效益的策略来建模长上下文。
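
The two-stage idea in the abstract, sparse block selection followed by softmax attention with the block weights acting as a prior, can be illustrated with a toy sketch. It is not the authors' implementation, and several choices are assumptions made for illustration only: sparsemax is used because it is the $\alpha = 2$ case of $\alpha$-entmax, each block is scored by the query's dot product with the block's mean key, the prior enters the second stage as an added log-probability, and the usual $1/\sqrt{d}$ scaling is omitted. The names `sparsemax` and `hierarchical_attention` are hypothetical.

```python
import math


def sparsemax(scores):
    """Sparsemax (alpha-entmax with alpha = 2): projection onto the simplex."""
    ordered = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, zk in enumerate(ordered, start=1):
        cumsum += zk
        if 1.0 + k * zk > cumsum:      # index k is still inside the support
            tau = (cumsum - 1.0) / k   # threshold for the current support size
    return [max(s - tau, 0.0) for s in scores]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hierarchical_attention(query, keys, values, block_size):
    """Toy two-stage attention: sparse block selection, then softmax over tokens."""
    blocks = [range(i, min(i + block_size, len(keys)))
              for i in range(0, len(keys), block_size)]
    # Stage 1 (assumption): coarse score = query . mean key of each block.
    coarse = []
    for blk in blocks:
        mean_key = [sum(keys[t][d] for t in blk) / len(blk)
                    for d in range(len(query))]
        coarse.append(dot(query, mean_key))
    block_probs = sparsemax(coarse)    # a variable number of entries are zero
    # Stage 2 (assumption): softmax over tokens of selected blocks,
    # with the block probability added as a log-prior.
    logits = {}
    for blk, p in zip(blocks, block_probs):
        if p > 0.0:
            for t in blk:
                logits[t] = dot(query, keys[t]) + math.log(p)
    m = max(logits.values())
    weights = {t: math.exp(l - m) for t, l in logits.items()}
    total = sum(weights.values())
    out = [sum(weights[t] / total * values[t][d] for t in weights)
           for d in range(len(values[0]))]
    return out, block_probs
```

Because sparsemax assigns exact zeros, the number of blocks kept depends on the query rather than on a fixed k, and the surviving block weights remain differentiable with respect to the coarse scores.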

arXiv Papers

Haptic Rendering of Fractional-Order Viscoelasticity: Passivity and Rendering Fidelity

OrbiSim: World Models as Differentiable Physics Engines for Embodied Intelligence

Support-Safe Variational Hybrid Filtering for Contact-Mode and Sparse-Law Recovery

SCAR: Self-Supervised Continuous Action Representation Learning

MR-SLAM: Immersive Spatial Supervision for Multi-Robot Mapping via Mixed Reality

Hierarchical Two-Stage Framework for Environment-Aware Long-Horizon Vessel Trajectory Prediction

No Plan, Yet Human: A Reactive Robotics Model Predicts Human Planning Failures on a Clinical Task

A Mechanistic Model for Collective Motion from Sensorimotor Regularities

Nori Bot: A Sub-$1,000 Floor-to-Counter Mobile Manipulator

Policy Library CBF: Finite-Horizon Safety at Runtime via Parallel Rollouts

Bayesian Networks for Path-Based Sensors: Gathering Information and Path Planning in Communication Denied Environments

DriveSafer: End-to-End Autonomous Driving with Safety Guidance

LACE: Latent Visual Representation for Cross-Embodiment Learning

"I'm Not Mad, Just Focused": Understanding Human Emotions in Human-Robot Collaboration

Pedestrian-Aware LLM-Driven Behavioral Planning for Autonomous Vehicles

Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning

SSTL: Self-Sensing Tendon Loop for Hysteresis Modeling and Compensation in Tendon-Sheath Mechanisms

SADP: Subgoal-Aware Diffusion Policy for Explainable Robots Learned from Foundation Model Generated Demonstrations

Beyond Safety Filtering: Control Barrier Function-Informed Reinforcement Learning for Connected and Automated Vehicles

MORN: Metacognitive Object-Goal Regulation for Resource-Rational Long-Horizon Navigation

NORM-Nav: Zero-Shot Mobile Robot Navigation with Natural Language Behavioral Constraints

Generalizable and Actionable Parts Pose Estimation with Symmetry Annotation-Free Learning Strategy

How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning

Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States

Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies

Generating Realistic Safety-Critical Scenarios for Vehicle-Pedestrian Interactions

SEDualVLN: A Spatially-Enhanced Dual-System for Vision-Language Navigation

Stretch-ICP: A Continuous-Trajectory Registration and Deskewing Algorithm in Scenarios of Aggressive Motions

Task Capability Improvement Algorithm for Collaborative Manipulators

HCLM: A Hierarchical Framework for Cooperative Loco-Manipulation with Dual Quadrupeds

Beyond Geometry: Efficient Topologically-Grounded Navigation in Complex 3D Environments

Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model

Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms

MUSE: Multimodal Uncertainty Quantification of State Estimation

Rapid Vibration Suppression and Trajectory Tracking of a Serial Manipulator with Multi-Flexible Links

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

Visual Sculpting: Visually-Aligned Planning Representations for Long-Horizon Robot Clay Sculpting

Motion-Uncertainty-Aware Next-Best-View Planning for Moving Object Reconstruction

From a Single Demonstration to a General Policy for Contact-Rich Manipulation

Mono-Hydra++: Real-Time Monocular Scene Graph Construction with Multi-Task Learning for 3D Indoor Mapping

PRIME: Physically-consistent Robotic Inertial and Motion Estimation for Legged and Humanoid Robots

CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

Optimal Knock-Pick Planning for Tightly Packed Tabletop Blocks With Parallel Grippers

Virtues of Ordered Chaos: Planning with Topple Actions in Tabletop Stack Rearrangement

A Dexterous and Compliant Gripper With Soft Hydraulic Actuation for Microgravity Manipulation

WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform

Learning-Based Adaptive Control for Surgical Robotic Exposure Task on Deformable Tissues

Transfer Learning for Customized Car Racing Environments

TacSE3: Equivariant SE(3) Motion Estimation from Low-Texture Visuotactile Images for In-Gripper Tracking and Compensation

Active Defense Against False Data Injection Attacks in Robotic Manipulators

Scenario Generation in Roundabouts with Adjustable Interaction Intensity

Confidence-Gated Robot Autonomy: When Does Uncertainty Actually Help?

FUSE: A Framework for Unified State Estimation in Robotic SLAM Systems

Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations

4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving

Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation

RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots

On Improving Multimodal Pedestrian Trajectory Prediction with CVAE: A Study on Benchmark and Robot Data

Assessing Localization Technologies for Pedestrian Collision Avoidance

Dynamic robotic cloth folding with efficient Koopman operator-based model predictive control

Towards Ubiquitous Mapping and Localization for Dynamic Indoor Environments

REBAR: Reference Ethical Benchmark for Autonomy Readiness

REACT: Environment-Adaptive Architecture for Continuous Formation Navigation of Wheeled Mobile Robots

Bidirectional Optical sensors for Actuation Tracking (BOAT) in soft lattice systems

Geometry-Aware Surrogate for Real-Time Hydrodynamics Estimation of Autonomous Ground Vehicles in Amphibious Environments

Key-Gram: Extensible World Knowledge for Embodied Manipulation

Unified Walking, Running, and Recovery for Humanoids via State-Dependent Adversarial Motion Priors

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

Data-Driven Dynamic Modeling of a Tendon-Actuated Continuum Robot

Dexora: Open-source VLA for High-DoF Bimanual Dexterity

DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

Robo-Cortex: A Self-Evolving Embodied Agent via Dual-Grain Cognitive Memory and Autonomous Knowledge Induction

Noise2Params: Unification and Parameter Determination from Noise via a Probabilistic Event Camera Model

StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning